diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..57e173c3 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2023-10-23T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.15166v1","updated":"2023-10-23T17:59:31Z","published":"2023-10-23T17:59:31Z","title":"Large Language Models are Visual Reasoning Coordinators","summary":" Visual reasoning requires multimodal perception and commonsense cognition of\nthe world. Recently, multiple vision-language models (VLMs) have been proposed\nwith excellent commonsense reasoning ability in various domains. However, how\nto harness the collective power of these complementary VLMs is rarely explored.\nExisting methods like ensemble still struggle to aggregate these models with\nthe desired higher-order communications. In this work, we propose Cola, a novel\nparadigm that coordinates multiple VLMs for visual reasoning. Our key insight\nis that a large language model (LLM) can efficiently coordinate multiple VLMs\nby facilitating natural language communication that leverages their distinct\nand complementary capabilities. Extensive experiments demonstrate that our\ninstruction tuning variant, Cola-FT, achieves state-of-the-art performance on\nvisual question answering (VQA), outside knowledge VQA, visual entailment, and\nvisual spatial reasoning tasks. Moreover, we show that our in-context learning\nvariant, Cola-Zero, exhibits competitive performance in zero and few-shot\nsettings, without finetuning. 
Through systematic ablation studies and\nvisualizations, we validate that a coordinator LLM indeed comprehends the\ninstruction prompts as well as the separate functionalities of VLMs; it then\ncoordinates them to enable impressive visual reasoning capabilities.\n","authors":["Liangyu Chen","Bo Li","Sheng Shen","Jingkang Yang","Chunyuan Li","Kurt Keutzer","Trevor Darrell","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15166v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15164v1","updated":"2023-10-23T17:58:40Z","published":"2023-10-23T17:58:40Z","title":"LINC: A Neurosymbolic Approach for Logical Reasoning by Combining\n Language Models with First-Order Logic Provers","summary":" Logical reasoning, i.e., deductively inferring the truth value of a\nconclusion from a set of premises, is an important task for artificial\nintelligence with wide potential impacts on science, mathematics, and society.\nWhile many prompting-based strategies have been proposed to enable Large\nLanguage Models (LLMs) to do such reasoning more effectively, they still appear\nunsatisfactory, often failing in subtle and unpredictable ways. In this work,\nwe investigate the validity of instead reformulating such tasks as modular\nneurosymbolic programming, which we call LINC: Logical Inference via\nNeurosymbolic Computation. In LINC, the LLM acts as a semantic parser,\ntranslating premises and conclusions from natural language to expressions in\nfirst-order logic. These expressions are then offloaded to an external theorem\nprover, which symbolically performs deductive inference. Leveraging this\napproach, we observe significant performance gains on FOLIO and a balanced\nsubset of ProofWriter for three different models in nearly all experimental\nconditions we evaluate. 
On ProofWriter, augmenting the comparatively small\nopen-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5\nand GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%,\nrespectively. When used with GPT-4, LINC scores 26% higher than CoT on\nProofWriter while performing comparatively on FOLIO. Further analysis reveals\nthat although both methods on average succeed roughly equally often on this\ndataset, they exhibit distinct and complementary failure modes. We thus provide\npromising evidence for how logical reasoning over natural language can be\ntackled through jointly leveraging LLMs alongside symbolic provers. All\ncorresponding code is publicly available at https://github.com/benlipkin/linc\n","authors":["Theo X. Olausson","Alex Gu","Benjamin Lipkin","Cedegao E. Zhang","Armando Solar-Lezama","Joshua B. Tenenbaum","Roger Levy"],"pdf_url":"https://arxiv.org/pdf/2310.15164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15154v1","updated":"2023-10-23T17:55:31Z","published":"2023-10-23T17:55:31Z","title":"Linear Representations of Sentiment in Large Language Models","summary":" Sentiment is a pervasive feature in natural language text, yet it is an open\nquestion how sentiment is represented within Large Language Models (LLMs). In\nthis study, we reveal that across a range of models, sentiment is represented\nlinearly: a single direction in activation space mostly captures the feature\nacross a range of tasks with one extreme for positive and the other for\nnegative. Through causal interventions, we isolate this direction and show it\nis causally relevant in both toy tasks and real world datasets such as Stanford\nSentiment Treebank. Through this case study we model a thorough investigation\nof what a single direction means on a broad data distribution.\n We further uncover the mechanisms that involve this direction, highlighting\nthe roles of a small subset of attention heads and neurons. 
Finally, we\ndiscover a phenomenon which we term the summarization motif: sentiment is not\nsolely represented on emotionally charged words, but is additionally summarized\nat intermediate positions without inherent sentiment, such as punctuation and\nnames. We show that in Stanford Sentiment Treebank zero-shot classification,\n76% of above-chance classification accuracy is lost when ablating the sentiment\ndirection, nearly half of which (36%) is due to ablating the summarized\nsentiment direction exclusively at comma positions.\n","authors":["Curt Tigges","Oskar John Hollinsworth","Atticus Geiger","Neel Nanda"],"pdf_url":"https://arxiv.org/pdf/2310.15154v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15151v1","updated":"2023-10-23T17:53:47Z","published":"2023-10-23T17:53:47Z","title":"Verb Conjugation in Transformers Is Determined by Linear Encodings of\n Subject Number","summary":" Deep architectures such as Transformers are sometimes criticized for having\nuninterpretable \"black-box\" representations. We use causal intervention\nanalysis to show that, in fact, some linguistic features are represented in a\nlinear, interpretable format. Specifically, we show that BERT's ability to\nconjugate verbs relies on a linear encoding of subject number that can be\nmanipulated with predictable effects on conjugation accuracy. 
This encoding is\nfound in the subject position at the first layer and the verb position at the\nlast layer, but distributed across positions at middle layers, particularly\nwhen there are multiple cues to subject number.\n","authors":["Sophie Hao","Tal Linzen"],"pdf_url":"https://arxiv.org/pdf/2310.15151v1.pdf","comment":"To appear in Findings of the Association for Computational\n Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15147v1","updated":"2023-10-23T17:52:06Z","published":"2023-10-23T17:52:06Z","title":"S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large\n Language Models","summary":" The rapid development of Large Language Models (LLMs) has led to great\nstrides in model capabilities like reasoning and long-context understanding.\nHowever, as LLMs are able to process longer contexts, it becomes more\nchallenging to evaluate whether they have acquired certain capabilities, since\nthe length of text (e.g., 100K tokens) they can process far exceeds what humans\ncan reliably assess in a reasonable duration. In this paper, we propose using\ncomplex synthetic tasks as a proxy evaluation method, and present S3Eval, a\nSynthetic, Scalable, Systematic evaluation suite for LLMs evaluation. As a\nsynthetic benchmark, S3Eval enables the creation of any number of evaluation\nexamples that are theoretically invisible to LLMs, mitigating the test set\ncontamination issue. The synthetic nature of S3Eval provides users full control\nover the dataset, allowing them to systematically probe LLM capabilities by\nscaling text length and varying task difficulty across diverse scenarios. The\nstrong correlation between S3Eval performance and scores of real-world\nbenchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval\nfor evaluation of LLMs. 
The in-depth analysis also uncovers additional insights,\nincluding performance drop when the answer is sparsely distributed or located\nin the middle context, as well as some counter-intuitive trends of model\nperformance.\n","authors":["Fangyu Lei","Qian Liu","Yiming Huang","Shizhu He","Jun Zhao","Kang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15147v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2305.02996v2","updated":"2023-10-23T17:48:34Z","published":"2023-05-04T17:01:17Z","title":"Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR\n Decomposition","summary":" Cross-encoder models, which jointly encode and score a query-item pair, are\nprohibitively expensive for direct k-nearest neighbor (k-NN) search.\nConsequently, k-NN search typically employs a fast approximate retrieval (e.g.\nusing BM25 or dual-encoder vectors), followed by reranking with a\ncross-encoder; however, the retrieval approximation often has detrimental\nrecall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent\nwork that employs a cross-encoder only, making search efficient using a\nrelatively small number of anchor items, and a CUR matrix factorization. While\nANNCUR's one-time selection of anchors tends to approximate the cross-encoder\ndistances on average, doing so forfeits the capacity to accurately estimate\ndistances to items near the query, leading to regret in the crucial end-task:\nrecall of top-k items. In this paper, we propose ADACUR, a method that\nadaptively, iteratively, and efficiently minimizes the approximation error for\nthe practically important top-k neighbors. It does so by iteratively performing\nk-NN search using the anchors available so far, then adding these retrieved\nnearest neighbors to the anchor set for the next round. 
Empirically, on\nmultiple datasets, in comparison to previous traditional and state-of-the-art\nmethods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed\napproach ADACUR consistently reduces recall error (by up to 70% on the important\nk = 1 setting) while using no more compute than its competitors.\n","authors":["Nishant Yadav","Nicholas Monath","Manzil Zaheer","Andrew McCallum"],"pdf_url":"https://arxiv.org/pdf/2305.02996v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.13132v2","updated":"2023-10-23T17:47:47Z","published":"2023-10-19T20:02:40Z","title":"Better to Ask in English: Cross-Lingual Evaluation of Large Language\n Models for Healthcare Queries","summary":" Large language models (LLMs) are transforming the ways the general public\naccesses and consumes information. Their influence is particularly pronounced\nin pivotal sectors like healthcare, where lay individuals are increasingly\nappropriating LLMs as conversational agents for everyday queries. While LLMs\ndemonstrate impressive language understanding and generation proficiencies,\nconcerns regarding their safety remain paramount in these high-stakes domains.\nMoreover, the development of LLMs is disproportionately focused on English. It\nremains unclear how these LLMs perform in the context of non-English languages,\na gap that is critical for ensuring equity in the real-world use of these\nsystems. This paper provides a framework to investigate the effectiveness of\nLLMs as multi-lingual dialogue systems for healthcare queries. Our\nempirically-derived framework XlingEval focuses on three fundamental criteria\nfor evaluating LLM responses to naturalistic human-authored health-related\nquestions: correctness, consistency, and verifiability. 
Through extensive\nexperiments on four major global languages, including English, Spanish,\nChinese, and Hindi, spanning three expert-annotated large health Q&A datasets,\nand through an amalgamation of algorithmic and human-evaluation strategies, we\nfound a pronounced disparity in LLM responses across these languages,\nindicating a need for enhanced cross-lingual capabilities. We further propose\nXlingHealth, a cross-lingual benchmark for examining the multilingual\ncapabilities of LLMs in the healthcare context. Our findings underscore the\npressing need to bolster the cross-lingual capacities of these models, and to\nprovide an equitable information ecosystem accessible to all.\n","authors":["Yiqiao Jin","Mohit Chandra","Gaurav Verma","Yibo Hu","Munmun De Choudhury","Srijan Kumar"],"pdf_url":"https://arxiv.org/pdf/2310.13132v2.pdf","comment":"18 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.15141v1","updated":"2023-10-23T17:47:34Z","published":"2023-10-23T17:47:34Z","title":"SpecTr: Fast Speculative Decoding via Optimal Transport","summary":" Autoregressive sampling from large language models has led to\nstate-of-the-art results in several natural language tasks. However,\nautoregressive sampling generates tokens one at a time making it slow, and even\nprohibitive in certain tasks. One way to speed up sampling is\n$\\textit{speculative decoding}$: use a small model to sample a $\\textit{draft}$\n(block or sequence of tokens), and then score all tokens in the draft by the\nlarge language model in parallel. A subset of the tokens in the draft are\naccepted (and the rest rejected) based on a statistical method to guarantee\nthat the final output follows the distribution of the large model. In this\nwork, we provide a principled understanding of speculative decoding through the\nlens of optimal transport (OT) with $\\textit{membership cost}$. This framework\ncan be viewed as an extension of the well-known $\\textit{maximal-coupling}$\nproblem. 
This new formulation enables us to generalize the speculative decoding\nmethod to allow for a set of $k$ candidates at the token level, which leads to\nan improved optimal membership cost. We show that the optimal draft selection\nalgorithm (transport plan) can be computed via linear programming, whose\nbest-known runtime is exponential in $k$. We then propose a valid draft\nselection algorithm whose acceptance probability is $(1-1/e)$-optimal\nmultiplicatively. Moreover, it can be computed in time almost linear in the size\nof the domain of a single token. Using this new draft selection algorithm, we\ndevelop a new autoregressive sampling algorithm called $\\textit{SpecTr}$, which\nprovides speedup in decoding while ensuring that there is no quality\ndegradation in the decoded output. We experimentally demonstrate that for\nstate-of-the-art large language models, the proposed approach achieves a wall\nclock speedup of 2.13X, a further 1.37X speedup over speculative decoding on\nstandard benchmarks.\n","authors":["Ziteng Sun","Ananda Theertha Suresh","Jae Hun Ro","Ahmad Beirami","Himanshu Jain","Felix Yu"],"pdf_url":"https://arxiv.org/pdf/2310.15141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15140v1","updated":"2023-10-23T17:46:07Z","published":"2023-10-23T17:46:07Z","title":"AutoDAN: Automatic and Interpretable Adversarial Attacks on Large\n Language Models","summary":" Safety alignment of Large Language Models (LLMs) can be compromised with\nmanual jailbreak attacks and (automatic) adversarial attacks. Recent work\nsuggests that patching LLMs against these attacks is possible: manual jailbreak\nattacks are human-readable but often limited and public, making them easy to\nblock; adversarial attacks generate gibberish prompts that can be detected\nusing perplexity-based filters. In this paper, we show that these solutions may\nbe too optimistic. 
We propose an interpretable adversarial attack,\n\\texttt{AutoDAN}, that combines the strengths of both types of attacks. It\nautomatically generates attack prompts that bypass perplexity-based filters\nwhile maintaining a high attack success rate like manual jailbreak attacks.\nThese prompts are interpretable and diverse, exhibiting strategies commonly\nused in manual jailbreak attacks, and transfer better than their non-readable\ncounterparts when using limited training data or a single proxy model. We also\ncustomize \\texttt{AutoDAN}'s objective to leak system prompts, another\njailbreak application not addressed in the adversarial attack literature,\ndemonstrating the versatility of the approach. Our work provides a new way to red-team LLMs and to understand the\nmechanism of jailbreak attacks.\n","authors":["Sicheng Zhu","Ruiyi Zhang","Bang An","Gang Wu","Joe Barrow","Zichao Wang","Furong Huang","Ani Nenkova","Tong Sun"],"pdf_url":"https://arxiv.org/pdf/2310.15140v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15135v1","updated":"2023-10-23T17:42:01Z","published":"2023-10-23T17:42:01Z","title":"Quantifying the Dialect Gap and its Correlates Across Languages","summary":" Historically, researchers and consumers have noticed a decrease in quality\nwhen applying NLP tools to minority variants of languages (e.g., Puerto Rican\nSpanish or Swiss German), but studies exploring this have been limited to a\nselect few languages. Additionally, past studies have mainly been conducted in\na monolingual context, so cross-linguistic trends have not been identified and\ntied to external factors. 
In this work, we conduct a comprehensive evaluation\nof the most influential, state-of-the-art large language models (LLMs) across\ntwo high-use applications, machine translation and automatic speech\nrecognition, to assess their functionality on the regional dialects of several\nhigh- and low-resource languages. Additionally, we analyze how the regional\ndialect gap is correlated with economic, social, and linguistic factors. The\nimpact of training data, including related factors like dataset size and its\nconstruction procedure, is shown to be significant but not consistent across\nmodels or languages, meaning a one-size-fits-all approach cannot be taken in\nsolving the dialect gap. This work will lay the foundation for furthering the\nfield of dialectal NLP by laying out evident disparities and identifying\npossible pathways for addressing them through mindful data collection.\n","authors":["Anjali Kantharuban","Ivan Vulić","Anna Korhonen"],"pdf_url":"https://arxiv.org/pdf/2310.15135v1.pdf","comment":"Accepted to EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2310.08740v3","updated":"2023-10-23T17:39:51Z","published":"2023-10-12T21:53:37Z","title":"A Zero-Shot Language Agent for Computer Control with Structured\n Reflection","summary":" Large language models (LLMs) have shown increasing capacity at planning and\nexecuting a high-level goal in a live computer environment (e.g. MiniWoB++). To\nperform a task, recent works often require a model to learn from trace examples\nof the task via either supervised learning or few/many-shot prompting. Without\nthese trace examples, it remains a challenge how an agent can autonomously\nlearn and improve its control on a computer, which limits the ability of an\nagent to perform a new task. We approach this problem with a zero-shot agent\nthat requires no given expert traces. 
Our agent plans for executable actions on\na partially observed environment, and iteratively progresses a task by\nidentifying and learning from its mistakes via self-reflection and structured\nthought management. On the easy tasks of MiniWoB++, we show that our zero-shot\nagent often outperforms recent SoTAs, with more efficient reasoning. For tasks\nwith more complexity, our reflective agent performs on par with prior best\nmodels, even though previous works had the advantages of accessing expert\ntraces or additional screen information.\n","authors":["Tao Li","Gang Li","Zhiwei Deng","Bryan Wang","Yang Li"],"pdf_url":"https://arxiv.org/pdf/2310.08740v3.pdf","comment":"Accepted at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15129v1","updated":"2023-10-23T17:33:31Z","published":"2023-10-23T17:33:31Z","title":"Location-Aware Visual Question Generation with Lightweight Models","summary":" This work introduces a novel task, location-aware visual question generation\n(LocaVQG), which aims to generate engaging questions from data relevant to a\nparticular geographical location. Specifically, we represent such\nlocation-aware information with surrounding images and a GPS coordinate. To\ntackle this task, we present a dataset generation pipeline that leverages GPT-4\nto produce diverse and sophisticated questions. Then, we aim to learn a\nlightweight model that can address the LocaVQG task and fit on an edge device,\nsuch as a mobile phone. To this end, we propose a method which can reliably\ngenerate engaging questions from location-aware information. Our proposed\nmethod outperforms baselines regarding human evaluation (e.g., engagement,\ngrounding, coherence) and automatic evaluation metrics (e.g., BERTScore,\nROUGE-2). 
Moreover, we conduct extensive ablation studies to justify our\nproposed techniques for both generating the dataset and solving the task.\n","authors":["Nicholas Collin Suwono","Justin Chih-Yao Chen","Tun Min Hung","Ting-Hao Kenneth Huang","I-Bin Liao","Yung-Hui Li","Lun-Wei Ku","Shao-Hua Sun"],"pdf_url":"https://arxiv.org/pdf/2310.15129v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15127v1","updated":"2023-10-23T17:31:55Z","published":"2023-10-23T17:31:55Z","title":"Open-Ended Instructable Embodied Agents with Memory-Augmented Large\n Language Models","summary":" Pre-trained and frozen LLMs can effectively map simple scene re-arrangement\ninstructions to programs over a robot's visuomotor functions through\nappropriate few-shot example prompting. To parse open-domain natural language\nand adapt to a user's idiosyncratic procedures, not known during prompt\nengineering time, fixed prompts fall short. In this paper, we introduce HELPER,\nan embodied agent equipped with an external memory of language-program pairs\nthat parses free-form human-robot dialogue into action programs through\nretrieval-augmented LLM prompting: relevant memories are retrieved based on the\ncurrent dialogue, instruction, correction or VLM description, and used as\nin-context prompt examples for LLM querying. The memory is expanded during\ndeployment to include pairs of user's language and action plans, to assist\nfuture inferences and personalize them to the user's language and routines.\nHELPER sets a new state-of-the-art in the TEACh benchmark in both Execution\nfrom Dialog History (EDH) and Trajectory from Dialogue (TfD), with 1.7x\nimprovement over the previous SOTA for TfD. Our models, code and video results\ncan be found in our project's website: https://helper-agent-llm.github.io.\n","authors":["Gabriel Sarch","Yue Wu","Michael J. 
Tarr","Katerina Fragkiadaki"],"pdf_url":"https://arxiv.org/pdf/2310.15127v1.pdf","comment":"https://helper-agent-llm.github.io"},{"id":"http://arxiv.org/abs/2310.10903v2","updated":"2023-10-23T17:31:26Z","published":"2023-10-17T00:22:10Z","title":"Emergent AI-Assisted Discourse: Case Study of a Second Language Writer\n Authoring with ChatGPT","summary":" The rapid proliferation of ChatGPT has incited debates regarding its impact\non human writing. Amid concerns about declining writing standards, this study\ninvestigates the role of ChatGPT in facilitating academic writing, especially\namong language learners. Using a case study approach, this study examines the\nexperiences of Kailing, a doctoral student, who integrates ChatGPT throughout\ntheir academic writing process. The study employs activity theory as a lens for\nunderstanding writing with generative AI tools and data analyzed includes\nsemi-structured interviews, writing samples, and GPT logs. Results indicate\nthat Kailing effectively collaborates with ChatGPT across various writing\nstages while preserving her distinct authorial voice and agency. 
This\nunderscores the potential of AI tools such as ChatGPT to enhance academic\nwriting for language learners without overshadowing individual authenticity.\nThis case study offers a critical exploration of how ChatGPT is utilized in the\nacademic writing process and the preservation of a student's authentic voice\nwhen engaging with the tool.\n","authors":["Sharin Jacob","Tamara Tate","Mark Warschauer"],"pdf_url":"https://arxiv.org/pdf/2310.10903v2.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2310.15123v1","updated":"2023-10-23T17:29:48Z","published":"2023-10-23T17:29:48Z","title":"Branch-Solve-Merge Improves Large Language Model Evaluation and\n Generation","summary":" Large Language Models (LLMs) are frequently used for multi-faceted language\ngeneration and evaluation tasks that involve satisfying intricate user\nconstraints or taking into account multiple aspects and criteria. However,\ntheir performance can fall short, due to the model's lack of coherence and\ninability to plan and decompose the problem. We propose Branch-Solve-Merge\n(BSM), a Large Language Model program (Schlag et al., 2023) for tackling such\nchallenging natural language tasks. It consists of branch, solve, and merge\nmodules that are parameterized with specific prompts to the base LLM. These\nthree modules plan a decomposition of the task into multiple parallel\nsub-tasks, independently solve them, and fuse the solutions to the sub-tasks.\nWe apply our method to the tasks of LLM response evaluation and constrained\ntext generation and evaluate its effectiveness with multiple LLMs, including\nVicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and\nconsistency for each LLM by enhancing human-LLM agreement by up to 26%,\nreducing length and pairwise position biases by up to 50%, and allowing\nLLaMA-2-chat to match or outperform GPT-4 on most domains. 
On the constraint\nstory generation task, BSM improves the coherence of the stories while also\nimproving constraint satisfaction by 12%.\n","authors":["Swarnadeep Saha","Omer Levy","Asli Celikyilmaz","Mohit Bansal","Jason Weston","Xian Li"],"pdf_url":"https://arxiv.org/pdf/2310.15123v1.pdf","comment":"22 pages, 7 figures, 10 tables"},{"id":"http://arxiv.org/abs/2307.09384v2","updated":"2023-10-23T17:24:02Z","published":"2023-07-18T16:05:25Z","title":"Zero-shot Query Reformulation for Conversational Search","summary":" As the popularity of voice assistants continues to surge, conversational\nsearch has gained increased attention in Information Retrieval. However, data\nsparsity issues in conversational search significantly hinder the progress of\nsupervised conversational search methods. Consequently, researchers are\nfocusing more on zero-shot conversational search approaches. Nevertheless,\nexisting zero-shot methods face three primary limitations: they are not\nuniversally applicable to all retrievers, their effectiveness lacks sufficient\nexplainability, and they struggle to resolve common conversational ambiguities\ncaused by omission. To address these limitations, we introduce a novel\nZero-shot Query Reformulation (ZeQR) framework that reformulates queries based\non previous dialogue contexts without requiring supervision from conversational\nsearch data. Specifically, our framework utilizes language models designed for\nmachine reading comprehension tasks to explicitly resolve two common\nambiguities: coreference and omission, in raw queries. In comparison to\nexisting zero-shot methods, our approach is universally applicable to any\nretriever without additional adaptation or indexing. It also provides greater\nexplainability and effectively enhances query intent understanding because\nambiguities are explicitly and proactively resolved. 
Through extensive\nexperiments on four TREC conversational datasets, we demonstrate the\neffectiveness of our method, which consistently outperforms state-of-the-art\nbaselines.\n","authors":["Dayu Yang","Yue Zhang","Hui Fang"],"pdf_url":"https://arxiv.org/pdf/2307.09384v2.pdf","comment":"Accepted by the 9th ACM SIGIR International Conference on the Theory\n of Information Retrieval"},{"id":"http://arxiv.org/abs/2310.15117v1","updated":"2023-10-23T17:23:56Z","published":"2023-10-23T17:23:56Z","title":"Causal Inference Using LLM-Guided Discovery","summary":" At the core of causal inference lies the challenge of determining reliable\ncausal graphs solely based on observational data. Since the well-known backdoor\ncriterion depends on the graph, any errors in the graph can propagate\ndownstream to effect inference. In this work, we initially show that complete\ngraph information is not necessary for causal effect inference; the topological\norder over graph variables (causal order) alone suffices. Further, given a node\npair, causal order is easier to elicit from domain experts compared to graph\nedges since determining the existence of an edge can depend extensively on\nother variables. Interestingly, we find that the same principle holds for Large\nLanguage Models (LLMs) such as GPT-3.5-turbo and GPT-4, motivating an automated\nmethod to obtain causal order (and hence causal effect) with LLMs acting as\nvirtual domain experts. To this end, we employ different prompting strategies\nand contextual cues to propose a robust technique of obtaining causal order\nfrom LLMs. 
Acknowledging LLMs' limitations, we also study possible techniques\nto integrate LLMs with established causal discovery algorithms, including\nconstraint-based and score-based methods, to enhance their performance.\nExtensive experiments demonstrate that our approach significantly improves\ncausal ordering accuracy as compared to discovery algorithms, highlighting the\npotential of LLMs to enhance causal inference across diverse fields.\n","authors":["Aniket Vashishtha","Abbavaram Gowtham Reddy","Abhinav Kumar","Saketh Bachu","Vineeth N Balasubramanian","Amit Sharma"],"pdf_url":"https://arxiv.org/pdf/2310.15117v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15114v1","updated":"2023-10-23T17:21:32Z","published":"2023-10-23T17:21:32Z","title":"How To Build Competitive Multi-gender Speech Translation Models For\n Controlling Speaker Gender Translation","summary":" When translating from notional gender languages (e.g., English) into\ngrammatical gender languages (e.g., Italian), the generated translation\nrequires explicit gender assignments for various words, including those\nreferring to the speaker. When the source sentence does not convey the\nspeaker's gender, speech translation (ST) models either rely on the\npossibly-misleading vocal traits of the speaker or default to the masculine\ngender, the most frequent in existing training corpora. To avoid such biased\nand not inclusive behaviors, the gender assignment of speaker-related\nexpressions should be guided by externally-provided metadata about the\nspeaker's gender. While previous work has shown that the most effective\nsolution is represented by separate, dedicated gender-specific models, the goal\nof this paper is to achieve the same results by integrating the speaker's\ngender metadata into a single \"multi-gender\" neural ST model, easier to\nmaintain. 
Our experiments demonstrate that a single multi-gender model\noutperforms gender-specialized ones when trained from scratch (with gender\naccuracy gains up to 12.9 for feminine forms), while fine-tuning from existing\nST models does not lead to competitive results.\n","authors":["Marco Gaido","Dennis Fucci","Matteo Negri","Luisa Bentivogli"],"pdf_url":"https://arxiv.org/pdf/2310.15114v1.pdf","comment":"To appear in CLiC-it 2023"},{"id":"http://arxiv.org/abs/2310.15113v1","updated":"2023-10-23T17:21:03Z","published":"2023-10-23T17:21:03Z","title":"Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into\n the Morphological Capabilities of a Large Language Model","summary":" Large language models (LLMs) have recently reached an impressive level of\nlinguistic capability, prompting comparisons with human language skills.\nHowever, there have been relatively few systematic inquiries into the\nlinguistic capabilities of the latest generation of LLMs, and those studies\nthat do exist (i) ignore the remarkable ability of humans to generalize, (ii)\nfocus only on English, and (iii) investigate syntax or semantics and overlook\nother capabilities that lie at the heart of human language, like morphology.\nHere, we close these gaps by conducting the first rigorous analysis of the\nmorphological capabilities of ChatGPT in four typologically varied languages\n(specifically, English, German, Tamil, and Turkish). We apply a version of\nBerko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for\nthe four examined languages. We find that ChatGPT massively underperforms\npurpose-built systems, particularly in English. 
Overall, our results -- through\nthe lens of morphology -- cast a new light on the linguistic capabilities of\nChatGPT, suggesting that claims of human-like language skills are premature and\nmisleading.\n","authors":["Leonie Weissweiler","Valentin Hofmann","Anjali Kantharuban","Anna Cai","Ritam Dutt","Amey Hengle","Anubha Kabra","Atharva Kulkarni","Abhishek Vijayakumar","Haofei Yu","Hinrich Schütze","Kemal Oflazer","David R. Mortensen"],"pdf_url":"https://arxiv.org/pdf/2310.15113v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.05280v4","updated":"2023-10-23T17:18:53Z","published":"2023-10-08T21:03:18Z","title":"Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona\n Biases in Dialogue Systems","summary":" Recent advancements in Large Language Models empower them to follow freeform\ninstructions, including imitating generic or specific demographic personas in\nconversations. We define generic personas to represent demographic groups, such\nas \"an Asian person\", whereas specific personas may take the form of specific\npopular Asian names like \"Yumi\". While the adoption of personas enriches user\nexperiences by making dialogue systems more engaging and approachable, it also\ncasts a shadow of potential risk by exacerbating social biases within model\nresponses, thereby causing societal harm through interactions with users. In\nthis paper, we systematically study \"persona biases\", which we define to be the\nsensitivity of dialogue models' harmful behaviors contingent upon the personas\nthey adopt. We categorize persona biases into biases in harmful expression and\nharmful agreement, and establish a comprehensive evaluation framework to\nmeasure persona biases in five aspects: Offensiveness, Toxic Continuation,\nRegard, Stereotype Agreement, and Toxic Agreement. 
Additionally, we propose to\ninvestigate persona biases by experimenting with UNIVERSALPERSONA, a\nsystematically constructed persona dataset encompassing various types of both\ngeneric and specific model personas. Through benchmarking on four different\nmodels -- including Blender, ChatGPT, Alpaca, and Vicuna -- our study uncovers\nsignificant persona biases in dialogue systems. Our findings also underscore\nthe pressing need to revisit the use of personas in dialogue agents to ensure\nsafe application.\n","authors":["Yixin Wan","Jieyu Zhao","Aman Chadha","Nanyun Peng","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2310.05280v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15109v1","updated":"2023-10-23T17:18:35Z","published":"2023-10-23T17:18:35Z","title":"GRENADE: Graph-Centric Language Model for Self-Supervised Representation\n Learning on Text-Attributed Graphs","summary":" Self-supervised representation learning on text-attributed graphs, which aims\nto create expressive and generalizable representations for various downstream\ntasks, has received increasing research attention lately. However, existing\nmethods either struggle to capture the full extent of structural context\ninformation or rely on task-specific training labels, which largely hampers\ntheir effectiveness and generalizability in practice. To solve the problem of\nself-supervised representation learning on text-attributed graphs, we develop a\nnovel Graph-Centric Language model -- GRENADE. Specifically, GRENADE exploits\nthe synergistic effect of both pre-trained language model and graph neural\nnetwork by optimizing with two specialized self-supervised learning algorithms:\ngraph-centric contrastive learning and graph-centric knowledge alignment. The\nproposed graph-centric self-supervised learning algorithms effectively help\nGRENADE to capture informative textual semantics as well as structural context\ninformation on text-attributed graphs. 
Through extensive experiments, GRENADE\nshows its superiority over state-of-the-art methods. Implementation is\navailable at \\url{https://github.com/bigheiniu/GRENADE}.\n","authors":["Yichuan Li","Kaize Ding","Kyumin Lee"],"pdf_url":"https://arxiv.org/pdf/2310.15109v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.01339v2","updated":"2023-10-23T17:15:55Z","published":"2023-10-02T17:02:57Z","title":"Improving Dialogue Management: Quality Datasets vs Models","summary":" Task-oriented dialogue systems (TODS) have become crucial for users to\ninteract with machines and computers using natural language. One of its key\ncomponents is the dialogue manager, which guides the conversation towards a\ngood goal for the user by providing the best possible response. Previous works\nhave proposed rule-based systems (RBS), reinforcement learning (RL), and\nsupervised learning (SL) as solutions for the correct dialogue management; in\nother words, select the best response given input by the user. However, this\nwork argues that the leading cause of DMs not achieving maximum performance\nresides in the quality of the datasets rather than the models employed thus\nfar; this means that dataset errors, like mislabeling, originate a large\npercentage of failures in dialogue management. We studied the main errors in\nthe most widely used datasets, Multiwoz 2.1 and SGD, to demonstrate this\nhypothesis. To do this, we have designed a synthetic dialogue generator to\nfully control the amount and type of errors introduced in the dataset. 
Using\nthis generator, we demonstrated that errors in the datasets contribute\nproportionally to the performance of the models\n","authors":["Miguel Ángel Medina-Ramírez","Cayetano Guerra-Artal","Mario Hernández-Tejera"],"pdf_url":"https://arxiv.org/pdf/2310.01339v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14755v2","updated":"2023-10-23T17:11:30Z","published":"2023-05-24T05:58:17Z","title":"Don't Take This Out of Context! On the Need for Contextual Models and\n Evaluations for Stylistic Rewriting","summary":" Most existing stylistic text rewriting methods and evaluation metrics operate\non a sentence level, but ignoring the broader context of the text can lead to\npreferring generic, ambiguous, and incoherent rewrites. In this paper, we\ninvestigate integrating the preceding textual context into both the\n$\\textit{rewriting}$ and $\\textit{evaluation}$ stages of stylistic text\nrewriting, and introduce a new composite contextual evaluation metric\n$\\texttt{CtxSimFit}$ that combines similarity to the original sentence with\ncontextual cohesiveness. We comparatively evaluate non-contextual and\ncontextual rewrites in formality, toxicity, and sentiment transfer tasks. Our\nexperiments show that humans significantly prefer contextual rewrites as more\nfitting and natural over non-contextual ones, yet existing sentence-level\nautomatic metrics (e.g., ROUGE, SBERT) correlate poorly with human preferences\n($\\rho$=0--0.3). In contrast, human preferences are much better reflected by\nboth our novel $\\texttt{CtxSimFit}$ ($\\rho$=0.7--0.9) as well as proposed\ncontext-infused versions of common metrics ($\\rho$=0.4--0.7). 
Overall, our\nfindings highlight the importance of integrating context into the generation\nand especially the evaluation stages of stylistic text rewriting.\n","authors":["Akhila Yerukola","Xuhui Zhou","Elizabeth Clark","Maarten Sap"],"pdf_url":"https://arxiv.org/pdf/2305.14755v2.pdf","comment":"emnlp 2023 main camera ready"},{"id":"http://arxiv.org/abs/2310.15100v1","updated":"2023-10-23T17:05:59Z","published":"2023-10-23T17:05:59Z","title":"LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis","summary":" Thematic analysis (TA) has been widely used for analyzing qualitative data in\nmany disciplines and fields. To ensure reliable analysis, the same piece of\ndata is typically assigned to at least two human coders. Moreover, to produce\nmeaningful and useful analysis, human coders develop and deepen their data\ninterpretation and coding over multiple iterations, making TA labor-intensive\nand time-consuming. Recently, the emerging field of large language model (LLM)\nresearch has shown that LLMs have the potential to replicate human-like behavior\nin various tasks: in particular, LLMs outperform crowd workers on\ntext-annotation tasks, suggesting an opportunity to leverage LLMs on TA. We\npropose a human-LLM collaboration framework (i.e., LLM-in-the-loop) to conduct\nTA with in-context learning (ICL). 
This framework provides the prompt to frame\ndiscussions with a LLM (e.g., GPT-3.5) to generate the final codebook for TA.\nWe demonstrate the utility of this framework using survey datasets on the\naspects of the music listening experience and the usage of a password manager.\nResults of the two case studies show that the proposed framework yields similar\ncoding quality to that of human coders but reduces TA's labor and time demands.\n","authors":["Shih-Chieh Dai","Aiping Xiong","Lun-Wei Ku"],"pdf_url":"https://arxiv.org/pdf/2310.15100v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2307.10700v2","updated":"2023-10-23T17:04:35Z","published":"2023-07-20T08:45:00Z","title":"Topics, Authors, and Networks in Large Language Model Research: Trends\n from a Survey of 17K arXiv Papers","summary":" Large language model (LLM) research is dramatically impacting society, making\nit essential to understand the topics and values it prioritizes, the authors\nand institutions driving it, and its networks of collaboration. Due to the\nrecent growth of the field, many of these fundamental attributes lack\nsystematic description. We gather, annotate, and analyze a new dataset of\n16,979 LLM-related arXiv papers, focusing on changes in 2023 vs. 2018-2022. We\nshow that LLM research increasingly focuses on societal impacts: the Computers\nand Society sub-arXiv has seen 20x growth in its proportion of LLM-related\npapers in 2023. This change is driven in part by an influx of new authors: a\nmajority of 2023 papers are first-authored by researchers who have not\npreviously written an LLM-related paper, and these papers focus particularly on\napplications and societal considerations. While a handful of companies hold\noutsize influence, academia publishes a much larger fraction of papers than\nindustry overall, and this gap widens in 2023. 
LLM research is also being\nshaped by social dynamics: there are gender and academic/industry differences\nin the topics authors prioritize, and a stark U.S./China schism in the\ncollaboration network. Overall, our analysis documents how LLM research both\nshapes and is shaped by society, attesting to the necessity of sociotechnical\nlenses; we discuss implications for researchers and policymakers.\n","authors":["Rajiv Movva","Sidhika Balachandar","Kenny Peng","Gabriel Agostini","Nikhil Garg","Emma Pierson"],"pdf_url":"https://arxiv.org/pdf/2307.10700v2.pdf","comment":"Working paper. Data/code available at\n https://github.com/rmovva/LLM-publication-patterns-public"},{"id":"http://arxiv.org/abs/2305.12710v2","updated":"2023-10-23T16:44:59Z","published":"2023-05-22T04:38:10Z","title":"Beyond Labels: Empowering Human Annotators with Natural Language\n Explanations through a Novel Active-Learning Architecture","summary":" Real-world domain experts (e.g., doctors) rarely annotate only a decision\nlabel in their day-to-day workflow without providing explanations. Yet,\nexisting low-resource learning techniques, such as Active Learning (AL), that\naim to support human annotators mostly focus on the label while neglecting the\nnatural language explanation of a data point. This work proposes a novel AL\narchitecture to support experts' real-world need for label and explanation\nannotations in low-resource scenarios. Our AL architecture leverages an\nexplanation-generation model to produce explanations guided by human\nexplanations, a prediction model that utilizes generated explanations toward\nprediction faithfully, and a novel data diversity-based AL sampling strategy\nthat benefits from the explanation annotations. Automated and human evaluations\ndemonstrate the effectiveness of incorporating explanations into AL sampling\nand the improved human annotation efficiency and trustworthiness with our AL\narchitecture. 
Additional ablation studies illustrate the potential of our AL\narchitecture for transfer learning, generalizability, and integration with\nlarge language models (LLMs). While LLMs exhibit exceptional\nexplanation-generation capabilities for relatively simple tasks, their\neffectiveness in complex real-world tasks warrants further in-depth study.\n","authors":["Bingsheng Yao","Ishan Jindal","Lucian Popa","Yannis Katsis","Sayan Ghosh","Lihong He","Yuxuan Lu","Shashank Srivastava","Yunyao Li","James Hendler","Dakuo Wang"],"pdf_url":"https://arxiv.org/pdf/2305.12710v2.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.15080v1","updated":"2023-10-23T16:37:59Z","published":"2023-10-23T16:37:59Z","title":"Federated Learning of Large Language Models with Parameter-Efficient\n Prompt Tuning and Adaptive Optimization","summary":" Federated learning (FL) is a promising paradigm to enable collaborative model\ntraining with decentralized data. However, the training process of Large\nLanguage Models (LLMs) generally incurs the update of significant parameters,\nwhich limits the applicability of FL techniques to tackle the LLMs in real\nscenarios. Prompt tuning can significantly reduce the number of parameters to\nupdate, but it either incurs performance degradation or low training\nefficiency. The straightforward utilization of prompt tuning in the FL often\nraises non-trivial communication costs and dramatically degrades performance.\nIn addition, the decentralized data is generally non-Independent and\nIdentically Distributed (non-IID), which brings client drift problems and thus\npoor performance. This paper proposes a Parameter-efficient prompt Tuning\napproach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and\neffective FL of LLMs. First, an efficient partial prompt tuning approach is\nproposed to improve performance and efficiency simultaneously. 
Second, a novel\nadaptive optimization method is developed to address the client drift problems\non both the device and server sides to enhance performance further. Extensive\nexperiments based on 10 datasets demonstrate the superb performance (up to\n60.8\\% in terms of accuracy) and efficiency (up to 97.59\\% in terms of training\ntime) of FedPepTAO compared with 9 baseline approaches. Our code is available\nat https://github.com/llm-eff/FedPepTAO.\n","authors":["Tianshi Che","Ji Liu","Yang Zhou","Jiaxiang Ren","Jiwen Zhou","Victor S. Sheng","Huaiyu Dai","Dejing Dou"],"pdf_url":"https://arxiv.org/pdf/2310.15080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15079v1","updated":"2023-10-23T16:37:14Z","published":"2023-10-23T16:37:14Z","title":"Affective and Dynamic Beam Search for Story Generation","summary":" Storytelling's captivating potential makes it a fascinating research area,\nwith implications for entertainment, education, therapy, and cognitive studies.\nIn this paper, we propose Affective Story Generator (AffGen) for generating\ninteresting narratives. AffGen introduces \"intriguing twists\" in narratives by\nemploying two novel techniques-Dynamic Beam Sizing and Affective Reranking.\nDynamic Beam Sizing encourages less predictable, more captivating word choices\nusing a contextual multi-arm bandit model. Affective Reranking prioritizes\nsentence candidates based on affect intensity. Our empirical evaluations, both\nautomatic and human, demonstrate AffGen's superior performance over existing\nbaselines in generating affectively charged and interesting narratives. 
Our\nablation study and analysis provide insights into the strengths and weaknesses\nof AffGen.\n","authors":["Tenghao Huang","Ehsan Qasemi","Bangzheng Li","He Wang","Faeze Brahman","Muhao Chen","Snigdha Chaturvedi"],"pdf_url":"https://arxiv.org/pdf/2310.15079v1.pdf","comment":"Accepted at EMNLP-findings 2023"},{"id":"http://arxiv.org/abs/2310.15077v1","updated":"2023-10-23T16:35:05Z","published":"2023-10-23T16:35:05Z","title":"'Don't Get Too Technical with Me': A Discourse Structure-Based Framework\n for Science Journalism","summary":" Science journalism refers to the task of reporting technical findings of a\nscientific paper as a less technical news article to the general public\naudience. We aim to design an automated system to support this real-world task\n(i.e., automatic science journalism) by 1) introducing a newly-constructed and\nreal-world dataset (SciTechNews), with tuples of a publicly-available\nscientific paper, its corresponding news article, and an expert-written short\nsummary snippet; 2) proposing a novel technical framework that integrates a\npaper's discourse structure with its metadata to guide generation; and, 3)\ndemonstrating with extensive automatic and human experiments that our framework\noutperforms other baseline methods (e.g. 
Alpaca and ChatGPT) in elaborating a\ncontent plan meaningful for the target audience, simplifying the information\nselected, and producing a coherent final report in a layman's style.\n","authors":["Ronald Cardenas","Bingsheng Yao","Dakuo Wang","Yufang Hou"],"pdf_url":"https://arxiv.org/pdf/2310.15077v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15075v1","updated":"2023-10-23T16:33:23Z","published":"2023-10-23T16:33:23Z","title":"TableQAKit: A Comprehensive and Practical Toolkit for Table-based\n Question Answering","summary":" Table-based question answering (TableQA) is an important task in natural\nlanguage processing, which requires comprehending tables and employing various\nreasoning strategies to answer questions. This paper introduces TableQAKit, the\nfirst comprehensive toolkit designed specifically for TableQA. The toolkit\nprovides a unified platform that includes many TableQA datasets and\nintegrates popular methods for this task as well as large language models\n(LLMs). Users can add their own datasets and methods through a user-friendly\ninterface. Somewhat surprisingly, using the modules in this toolkit\nachieves new SOTA results on some datasets. 
Finally, TableQAKit also provides an\nLLM-based TableQA Benchmark for evaluating the role of LLMs in TableQA.\nTableQAKit is open-source with an interactive interface that includes visual\noperations, and comprehensive data for ease of use.\n","authors":["Fangyu Lei","Tongxu Luo","Pengqi Yang","Weihao Liu","Hanwen Liu","Jiahe Lei","Yiming Huang","Yifan Wei","Shizhu He","Jun Zhao","Kang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15075v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2310.15066v1","updated":"2023-10-23T16:14:05Z","published":"2023-10-23T16:14:05Z","title":"Localizing Active Objects from Egocentric Vision with Symbolic World\n Knowledge","summary":" The ability to actively ground task instructions from an egocentric view is\ncrucial for AI agents to accomplish tasks or assist humans virtually. One\nimportant step towards this goal is to localize and track key active objects\nthat undergo major state change as a consequence of human actions/interactions\nto the environment without being told exactly what/where to ground (e.g.,\nlocalizing and tracking the `sponge` in video from the instruction \"Dip the\n`sponge` into the bucket.\"). While existing works approach this problem from a\npure vision perspective, we investigate to what extent the textual modality\n(i.e., task instructions) and their interaction with visual modality can be\nbeneficial. Specifically, we propose to improve phrase grounding models'\nability to localize the active objects by: (1) learning the role of `objects\nundergoing change` and extracting them accurately from the instructions, (2)\nleveraging pre- and post-conditions of the objects during actions, and (3)\nrecognizing the objects more robustly with descriptional knowledge. 
We leverage\nlarge language models (LLMs) to extract the aforementioned action-object\nknowledge, and design a per-object aggregation masking technique to effectively\nperform joint inference on object phrases and symbolic knowledge. We evaluate\nour framework on Ego4D and Epic-Kitchens datasets. Extensive experiments\ndemonstrate the effectiveness of our proposed framework, which leads to >54%\nimprovements in all standard metrics on the TREK-150-OPE-Det localization +\ntracking task, >7% improvements in all standard metrics on the TREK-150-OPE\ntracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD\ntask.\n","authors":["Te-Lin Wu","Yu Zhou","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.15066v1.pdf","comment":"In Proceedings of the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP)"},{"id":"http://arxiv.org/abs/2310.15061v1","updated":"2023-10-23T16:05:13Z","published":"2023-10-23T16:05:13Z","title":"The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained\n Multimodal Models","summary":" Despite the impressive performance achieved by pre-trained\nlanguage-and-vision models in downstream tasks, it remains an open question\nwhether this reflects a proper understanding of image-text interaction. In this\nwork, we explore to what extent they handle basic linguistic constructions --\nactive-passive voice, coordination, and relative clauses -- that even preschool\nchildren can typically master. We present BLA, a novel, automatically\nconstructed benchmark to evaluate multimodal models on these Basic Language\nAbilities. We show that different types of Transformer-based systems, such as\nCLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting,\nin line with previous findings. Our experiments, in particular, show that most\nof the tested models only marginally benefit when fine-tuned or prompted with\nconstruction-specific samples. 
Yet, the generative BLIP2 shows promising\ntrends, especially in an in-context learning setting. This opens the door to\nusing BLA not only as an evaluation benchmark but also to improve models' basic\nlanguage abilities.\n","authors":["Xinyi Chen","Raquel Fernández","Sandro Pezzelle"],"pdf_url":"https://arxiv.org/pdf/2310.15061v1.pdf","comment":"This is the camera-ready version of the paper that will be published\n in the Proceedings of EMNLP 2023 (Singapore, 6-10 December 2023)"},{"id":"http://arxiv.org/abs/2310.15055v1","updated":"2023-10-23T15:57:41Z","published":"2023-10-23T15:57:41Z","title":"Towards Conceptualization of \"Fair Explanation\": Disparate Impacts of\n anti-Asian Hate Speech Explanations on Content Moderators","summary":" Recent research at the intersection of AI explainability and fairness has\nfocused on how explanations can improve human-plus-AI task performance as\nassessed by fairness measures. We propose to characterize what constitutes an\nexplanation that is itself \"fair\" -- an explanation that does not adversely\nimpact specific populations. We formulate a novel evaluation method of \"fair\nexplanations\" using not just accuracy and label time, but also psychological\nimpact of explanations on different user groups across many metrics (mental\ndiscomfort, stereotype activation, and perceived workload). We apply this\nmethod in the context of content moderation of potential hate speech, and its\ndifferential impact on Asian vs. non-Asian proxy moderators, across explanation\napproaches (saliency map and counterfactual explanation). We find that saliency\nmaps generally perform better and show less evidence of disparate impact\n(group) and individual unfairness than counterfactual explanations.\n Content warning: This paper contains examples of hate speech and racially\ndiscriminatory language. The authors do not support such content. 
Please\nconsider your risk of discomfort carefully before continuing reading!\n","authors":["Tin Nguyen","Jiannan Xu","Aayushi Roy","Hal Daumé III","Marine Carpuat"],"pdf_url":"https://arxiv.org/pdf/2310.15055v1.pdf","comment":"EMNLP 2023 Main Conference (Long Paper)"},{"id":"http://arxiv.org/abs/2310.15040v1","updated":"2023-10-23T15:39:09Z","published":"2023-10-23T15:39:09Z","title":"SLOG: A Structural Generalization Benchmark for Semantic Parsing","summary":" The goal of compositional generalization benchmarks is to evaluate how well\nmodels generalize to new complex linguistic expressions. Existing benchmarks\noften focus on lexical generalization, the interpretation of novel lexical\nitems in syntactic structures familiar from training; structural generalization\ntasks, where a model needs to interpret syntactic structures that are\nthemselves unfamiliar from training, are often underrepresented, resulting in\noverly optimistic perceptions of how well models can generalize. We introduce\nSLOG, a semantic parsing dataset that extends COGS (Kim and Linzen, 2020) with\n17 structural generalization cases. In our experiments, the generalization\naccuracy of Transformer models, including pretrained ones, only reaches 40.6%,\nwhile a structure-aware parser only achieves 70.8%. 
These results are far from\nthe near-perfect accuracy existing models achieve on COGS, demonstrating the\nrole of SLOG in foregrounding the large discrepancy between models' lexical and\nstructural generalization capacities.\n","authors":["Bingzhi Li","Lucia Donatelli","Alexander Koller","Tal Linzen","Yuekun Yao","Najoung Kim"],"pdf_url":"https://arxiv.org/pdf/2310.15040v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.05703v2","updated":"2023-10-23T15:26:40Z","published":"2023-10-09T13:24:44Z","title":"An Attribution Method for Siamese Encoders","summary":" Despite the success of Siamese encoder models such as sentence transformers\n(ST), little is known about the aspects of inputs they pay attention to. A\nbarrier is that their predictions cannot be attributed to individual features,\nas they compare two inputs rather than processing a single one. This paper\nderives a local attribution method for Siamese encoders by generalizing the\nprinciple of integrated gradients to models with multiple inputs. The solution\ntakes the form of feature-pair attributions, and can be reduced to a\ntoken-token matrix for STs. Our method involves the introduction of integrated\nJacobians and inherits the advantageous formal properties of integrated\ngradients: it accounts for the model's full computation graph and is guaranteed\nto converge to the actual prediction. A pilot study shows that in an ST few\ntoken-pairs can often explain large fractions of predictions, and it focuses on\nnouns and verbs. 
For accurate predictions, it however needs to attend to the\nmajority of tokens and parts of speech.\n","authors":["Lucas Möller","Dmitry Nikolaev","Sebastian Padó"],"pdf_url":"https://arxiv.org/pdf/2310.05703v2.pdf","comment":"Accepted to EMNLP'23"},{"id":"http://arxiv.org/abs/2310.13678v2","updated":"2023-10-23T15:25:55Z","published":"2023-10-20T17:31:39Z","title":"Long-Form Speech Translation through Segmentation with Finite-State\n Decoding Constraints on Large Language Models","summary":" One challenge in speech translation is that plenty of spoken content is\nlong-form, but short units are necessary for obtaining high-quality\ntranslations. To address this mismatch, we adapt large language models (LLMs)\nto split long ASR transcripts into segments that can be independently\ntranslated so as to maximize the overall translation quality. We overcome the\ntendency of hallucination in LLMs by incorporating finite-state constraints\nduring decoding; these eliminate invalid outputs without requiring additional\ntraining. We discover that LLMs are adaptable to transcripts containing ASR\nerrors through prompt-tuning or fine-tuning. Relative to a state-of-the-art\nautomatic punctuation baseline, our best LLM improves the average BLEU by 2.9\npoints for English-German, English-Spanish, and English-Arabic TED talk\ntranslation in 9 test sets, just by improving segmentation.\n","authors":["Arya D. McCarthy","Hao Zhang","Shankar Kumar","Felix Stahlberg","Ke Wu"],"pdf_url":"https://arxiv.org/pdf/2310.13678v2.pdf","comment":"accepted to the Findings of EMNLP 2023. 
arXiv admin note: text\n overlap with arXiv:2212.09895"},{"id":"http://arxiv.org/abs/2305.02239v2","updated":"2023-10-23T15:24:57Z","published":"2023-05-03T16:19:31Z","title":"The Benefits of Label-Description Training for Zero-Shot Text\n Classification","summary":" Pretrained language models have improved zero-shot text classification by\nallowing the transfer of semantic knowledge from the training data in order to\nclassify among specific label sets in downstream tasks. We propose a simple way\nto further improve zero-shot accuracies with minimal effort. We curate small\nfinetuning datasets intended to describe the labels for a task. Unlike typical\nfinetuning data, which has texts annotated with labels, our data simply\ndescribes the labels in language, e.g., using a few related terms,\ndictionary/encyclopedia entries, and short templates. Across a range of topic\nand sentiment datasets, our method is more accurate than zero-shot by 17-19%\nabsolute. It is also more robust to choices required for zero-shot\nclassification, such as patterns for prompting the model to classify and\nmappings from labels to tokens in the model's vocabulary. 
Furthermore, since\nour data merely describes the labels but does not use input texts, finetuning\non it yields a model that performs strongly on multiple text domains for a\ngiven label set, even improving over few-shot out-of-domain classification in\nmultiple settings.\n","authors":["Lingyu Gao","Debanjan Ghosh","Kevin Gimpel"],"pdf_url":"https://arxiv.org/pdf/2305.02239v2.pdf","comment":"Accepted at the EMNLP 2023 main conference (long paper)"},{"id":"http://arxiv.org/abs/2305.15017v2","updated":"2023-10-23T15:23:22Z","published":"2023-05-24T10:58:20Z","title":"Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through\n Interaction with Symbolic Systems","summary":" Despite outstanding performance in many tasks, language models are\nnotoriously inclined to make factual errors in tasks requiring arithmetic\ncomputation. We address this deficiency by creating Calc-X, a collection of\ndatasets that demonstrates the appropriate use of a calculator in reasoning\nchains. Calc-X is suitable for teaching language models to offload computations\nto a symbolic system. We survey and unify several existing chain-of-thought\ndatasets into a proposed format, resulting in a standard collection of over\n300,000 samples requiring arithmetic reasoning. Finally, we use the new Calc-X\ncollection to train open-source calculator-using models we call Calcformers and\nshow that these models approximately double the accuracy of generating correct\nresults compared to vanilla language model baselines. 
We make all Calc-X\ndatasets, source code and Calcformers models publicly available.\n","authors":["Marek Kadlčík","Michal Štefánik","Ondřej Sotolář","Vlastimil Martinek"],"pdf_url":"https://arxiv.org/pdf/2305.15017v2.pdf","comment":"Published in EMNLP 2023: Main track"},{"id":"http://arxiv.org/abs/2310.15021v1","updated":"2023-10-23T15:19:24Z","published":"2023-10-23T15:19:24Z","title":"Efficient Data Learning for Open Information Extraction with Pre-trained\n Language Models","summary":" Open Information Extraction (OpenIE) is a fundamental yet challenging task in\nNatural Language Processing, which involves extracting all triples (subject,\npredicate, object) from a given sentence. While labeling-based methods have\ntheir merits, generation-based techniques offer unique advantages, such as the\nability to generate tokens not present in the original sentence. However, these\ngeneration-based methods often require a significant amount of training data to\nlearn the task form of OpenIE and substantial training time to overcome slow\nmodel convergence due to the order penalty. In this paper, we introduce a novel\nframework, OK-IE, that ingeniously transforms the task form of OpenIE into the\npre-training task form of the T5 model, thereby reducing the need for extensive\ntraining data. 
Furthermore, we introduce an innovative concept of Anchor to\ncontrol the sequence of model outputs, effectively eliminating the impact of\norder penalty on model convergence and significantly reducing training time.\nExperimental results indicate that, compared to previous SOTA methods, OK-IE\nrequires only 1/100 of the training data (900 instances) and 1/120 of the\ntraining time (3 minutes) to achieve comparable results.\n","authors":["Zhiyuan Fan","Shizhu He"],"pdf_url":"https://arxiv.org/pdf/2310.15021v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14235v2","updated":"2023-10-23T15:17:59Z","published":"2023-05-23T16:50:48Z","title":"Multilingual Large Language Models Are Not (Yet) Code-Switchers","summary":" Multilingual Large Language Models (LLMs) have recently shown great\ncapabilities in a wide range of tasks, exhibiting state-of-the-art performance\nthrough zero-shot or few-shot prompting methods. While there have been\nextensive studies on their abilities in monolingual tasks, the investigation of\ntheir potential in the context of code-switching (CSW), the practice of\nalternating languages within an utterance, remains relatively uncharted. In\nthis paper, we provide a comprehensive empirical analysis of various\nmultilingual LLMs, benchmarking their performance across four tasks: sentiment\nanalysis, machine translation, summarization and word-level language\nidentification. 
Our results indicate that despite multilingual LLMs exhibiting\npromising outcomes in certain tasks using zero or few-shot prompting, they\nstill underperform in comparison to fine-tuned models of much smaller scales.\nWe argue that current \"multilingualism\" in LLMs does not inherently imply\nproficiency with code-switching texts, calling for future research to bridge\nthis discrepancy.\n","authors":["Ruochen Zhang","Samuel Cahyawijaya","Jan Christian Blaise Cruz","Genta Indra Winata","Alham Fikri Aji"],"pdf_url":"https://arxiv.org/pdf/2305.14235v2.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15019v1","updated":"2023-10-23T15:14:55Z","published":"2023-10-23T15:14:55Z","title":"Meta learning with language models: Challenges and opportunities in the\n classification of imbalanced text","summary":" Detecting out of policy speech (OOPS) content is important but difficult.\nWhile machine learning is a powerful tool to tackle this challenging task, it\nis hard to break the performance ceiling due to factors like quantity and\nquality limitations on training data and inconsistencies in OOPS definition and\ndata labeling. To realize the full potential of available limited resources, we\npropose a meta learning technique (MLT) that combines individual models built\nwith different text representations. We analytically show that the resulting\ntechnique is numerically stable and produces reasonable combining weights. We\ncombine the MLT with a threshold-moving (TM) technique to further improve the\nperformance of the combined predictor on highly-imbalanced in-distribution and\nout-of-distribution datasets. 
We also provide computational results to show the\nstatistically significant advantages of the proposed MLT approach.\n All authors contributed equally to this work.\n","authors":["Apostol Vassilev","Honglan Jin","Munawar Hasan"],"pdf_url":"https://arxiv.org/pdf/2310.15019v1.pdf","comment":"22 pages, including 5 figures, 12 tables, 1 appendix"},{"id":"http://arxiv.org/abs/2305.07355v2","updated":"2023-10-23T15:04:20Z","published":"2023-05-12T10:07:12Z","title":"ZARA: Improving Few-Shot Self-Rationalization for Small Language Models","summary":" Language models (LMs) that jointly generate end-task answers as well as\nfree-text rationales are known as self-rationalization models. Recent works\ndemonstrate great performance gain for self-rationalization by few-shot\nprompting LMs with rationale-augmented exemplars. However, the ability to\nbenefit from explanations only emerges with large-scale LMs, which have poor\naccessibility. In this work, we explore the less-studied setting of leveraging\nexplanations for small LMs to improve few-shot self-rationalization. We first\nrevisit the relationship between rationales and answers. Inspired by the\nimplicit mental process of how human beings assess explanations, we present a\nnovel approach, Zero-shot Augmentation of Rationale-Answer pairs (ZARA), to\nautomatically construct pseudo-parallel data for self-training by reducing the\nproblem of plausibility judgement to natural language inference. Experimental\nresults show ZARA achieves SOTA performance on the FEB benchmark, for both the\ntask accuracy and the explanation metric. 
In addition, we conduct human and\nquantitative evaluation validating ZARA's ability to automatically identify\nplausible and accurate rationale-answer pairs.\n","authors":["Wei-Lin Chen","An-Zi Yen","Cheng-Kuang Wu","Hen-Hsen Huang","Hsin-Hsi Chen"],"pdf_url":"https://arxiv.org/pdf/2305.07355v2.pdf","comment":"Accepted as a long paper at EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2310.15010v1","updated":"2023-10-23T15:02:44Z","published":"2023-10-23T15:02:44Z","title":"Statistical Depth for Ranking and Characterizing Transformer-Based Text\n Embeddings","summary":" The popularity of transformer-based text embeddings calls for better\nstatistical tools for measuring distributions of such embeddings. One such tool\nwould be a method for ranking texts within a corpus by centrality, i.e.\nassigning each text a number signifying how representative that text is of the\ncorpus as a whole. However, an intrinsic center-outward ordering of\nhigh-dimensional text representations is not trivial. A statistical depth is a\nfunction for ranking k-dimensional objects by measuring centrality with respect\nto some observed k-dimensional distribution. We adopt a statistical depth to\nmeasure distributions of transformer-based text embeddings, transformer-based\ntext embedding (TTE) depth, and introduce the practical use of this depth for\nboth modeling and distributional inference in NLP pipelines. We first define\nTTE depth and an associated rank sum test for determining whether two corpora\ndiffer significantly in embedding space. We then use TTE depth for the task of\nin-context learning prompt selection, showing that this approach reliably\nimproves performance over statistical baseline approaches across six text\nclassification tasks. 
Finally, we use TTE depth and the associated rank sum\ntest to characterize the distributions of synthesized and human-generated\ncorpora, showing that five recent synthetic data augmentation processes cause a\nmeasurable distributional shift away from associated human-generated text.\n","authors":["Parker Seegmiller","Sarah Masud Preum"],"pdf_url":"https://arxiv.org/pdf/2310.15010v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15007v1","updated":"2023-10-23T15:00:46Z","published":"2023-10-23T15:00:46Z","title":"Did the Neurons Read your Book? Document-level Membership Inference for\n Large Language Models","summary":" With large language models (LLMs) poised to become embedded in our daily\nlives, questions are starting to be raised about the dataset(s) they learned\nfrom. These questions range from potential bias or misinformation LLMs could\nretain from their training data to questions of copyright and fair use of\nhuman-generated text. However, while these questions emerge, developers of the\nrecent state-of-the-art LLMs become increasingly reluctant to disclose details\non their training corpus. We here introduce the task of document-level\nmembership inference for real-world LLMs, i.e. inferring whether the LLM has\nseen a given document during training or not. First, we propose a procedure for\nthe development and evaluation of document-level membership inference for LLMs\nby leveraging commonly used data sources for training and the model release\ndate. We then propose a practical, black-box method to predict document-level\nmembership and instantiate it on OpenLLaMA-7B with both books and academic\npapers. We show our methodology to perform very well, reaching an impressive\nAUC of 0.856 for books and 0.678 for papers. We then show our approach to\noutperform the sentence-level membership inference attacks used in the privacy\nliterature for the document-level membership task. 
We finally evaluate whether\nsmaller models might be less sensitive to document-level inference and show\nOpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach.\nTaken together, our results show that accurate document-level membership can be\ninferred for LLMs, increasing the transparency of technology poised to change\nour lives.\n","authors":["Matthieu Meeus","Shubham Jain","Marek Rei","Yves-Alexandre de Montjoye"],"pdf_url":"https://arxiv.org/pdf/2310.15007v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15004v1","updated":"2023-10-23T14:57:52Z","published":"2023-10-23T14:57:52Z","title":"When Language Models Fall in Love: Animacy Processing in Transformer\n Language Models","summary":" Animacy - whether an entity is alive and sentient - is fundamental to\ncognitive processing, impacting areas such as memory, vision, and language.\nHowever, animacy is not always expressed directly in language: in English it\noften manifests indirectly, in the form of selectional constraints on verbs and\nadjectives. This poses a potential issue for transformer language models (LMs):\nthey often train only on text, and thus lack access to extralinguistic\ninformation from which humans learn about animacy. We ask: how does this impact\nLMs' animacy processing - do they still behave as humans do? We answer this\nquestion using open-source LMs. Like previous studies, we find that LMs behave\nmuch like humans when presented with entities whose animacy is typical.\nHowever, we also show that even when presented with stories about atypically\nanimate entities, such as a peanut in love, LMs adapt: they treat these\nentities as animate, though they do not adapt as well as humans. Even when the\ncontext indicating atypical animacy is very short, LMs pick up on subtle clues\nand change their behavior. 
We conclude that despite the limited signal through\nwhich LMs can learn about animacy, they are indeed sensitive to the relevant\nlexical semantic nuances available in English.\n","authors":["Michael Hanna","Yonatan Belinkov","Sandro Pezzelle"],"pdf_url":"https://arxiv.org/pdf/2310.15004v1.pdf","comment":"To appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.15035v2","updated":"2023-10-23T14:50:57Z","published":"2023-05-24T11:22:34Z","title":"Self-ICL: Zero-Shot In-Context Learning with Self-Generated\n Demonstrations","summary":" Large language models (LLMs) have exhibited striking in-context learning\n(ICL) ability to adapt to target tasks with a few input-output demonstrations.\nFor better ICL, different methods are proposed to select representative\ndemonstrations from existing training corpora. However, such settings are not\naligned with real-world practices, as end-users usually query LMs without\naccess to demonstration pools. In this work, we introduce Self-ICL -- a simple\nframework which bootstraps LMs' intrinsic capabilities to perform zero-shot\nICL. Given a test input, Self-ICL first prompts the model to generate\npseudo-inputs. Next, the model predicts pseudo-labels for the pseudo-inputs via\nzero-shot prompting. Finally, we perform ICL for the test input with the\npseudo-input-label pairs as demonstrations. Evaluation on 23 BIG-Bench Hard\ntasks shows Self-ICL outperforms zero-shot baselines on both average accuracy\nand head-to-head comparison. 
Moreover, with zero-shot chain-of-thought,\nSelf-ICL achieves results comparable to using real demonstrations.\nAdditionally, we conduct a range of analyses to validate Self-ICL's\neffectiveness and provide insights for its behaviors under different settings.\n","authors":["Wei-Lin Chen","Cheng-Kuang Wu","Yun-Nung Chen","Hsin-Hsi Chen"],"pdf_url":"https://arxiv.org/pdf/2305.15035v2.pdf","comment":"Accepted as a long paper at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14997v1","updated":"2023-10-23T14:48:51Z","published":"2023-10-23T14:48:51Z","title":"Simple Hardware-Efficient PCFGs with Independent Left and Right\n Productions","summary":" Scaling dense PCFGs to thousands of nonterminals via a low-rank\nparameterization of the rule probability tensor has been shown to be beneficial\nfor unsupervised parsing. However, PCFGs scaled this way still perform poorly\nas a language model, and even underperform similarly-sized HMMs. This work\nintroduces \\emph{SimplePCFG}, a simple PCFG formalism with independent left and\nright productions. Despite imposing a stronger independence assumption than the\nlow-rank approach, we find that this formalism scales more effectively both as\na language model and as an unsupervised parser. As an unsupervised parser, our\nsimple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language\nmodel, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank\nPCFGs. 
We further introduce \\emph{FlashInside}, a hardware IO-aware\nimplementation of the inside algorithm for efficiently scaling simple PCFGs.\n","authors":["Wei Liu","Songlin Yang","Yoon Kim","Kewei Tu"],"pdf_url":"https://arxiv.org/pdf/2310.14997v1.pdf","comment":"Accepted to Findings of EMNLP, 2023"},{"id":"http://arxiv.org/abs/2305.14499v2","updated":"2023-10-23T14:46:34Z","published":"2023-05-23T20:09:52Z","title":"NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive\n Decoders","summary":" Neural document rerankers are extremely effective in terms of accuracy.\nHowever, the best models require dedicated hardware for serving, which is\ncostly and often not feasible. To avoid this serving-time requirement, we\npresent a method of capturing up to 86% of the gains of a Transformer\ncross-attention model with a lexicalized scoring function that only requires\n10-6% of the Transformer's FLOPs per document and can be served using commodity\nCPUs. When combined with a BM25 retriever, this approach matches the quality of\na state-of-the art dual encoder retriever, that still requires an accelerator\nfor query encoding. We introduce NAIL (Non-Autoregressive Indexing with\nLanguage models) as a model architecture that is compatible with recent\nencoder-decoder and decoder-only large language models, such as T5, GPT-3 and\nPaLM. This model architecture can leverage existing pre-trained checkpoints and\ncan be fine-tuned for efficiently constructing document representations that do\nnot require neural processing of queries.\n","authors":["Livio Baldini Soares","Daniel Gillick","Jeremy R. 
Cole","Tom Kwiatkowski"],"pdf_url":"https://arxiv.org/pdf/2305.14499v2.pdf","comment":"To appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14993v1","updated":"2023-10-23T14:46:20Z","published":"2023-10-23T14:46:20Z","title":"Understanding the Inner Workings of Language Models Through\n Representation Dissimilarity","summary":" As language models are applied to an increasing number of real-world\napplications, understanding their inner workings has become an important issue\nin model trust, interpretability, and transparency. In this work we show that\nrepresentation dissimilarity measures, which are functions that measure the\nextent to which two model's internal representations differ, can be a valuable\ntool for gaining insight into the mechanics of language models. Among our\ninsights are: (i) an apparent asymmetry in the internal representations of\nmodel using SoLU and GeLU activation functions, (ii) evidence that\ndissimilarity measures can identify and locate generalization properties of\nmodels that are invisible via in-distribution test set performance, and (iii)\nnew evaluations of how language model features vary as width and depth are\nincreased. Our results suggest that dissimilarity measures are a promising set\nof tools for shedding light on the inner workings of language models.\n","authors":["Davis Brown","Charles Godfrey","Nicholas Konz","Jonathan Tu","Henry Kvinge"],"pdf_url":"https://arxiv.org/pdf/2310.14993v1.pdf","comment":"EMNLP 2023 (main)"},{"id":"http://arxiv.org/abs/2305.14333v2","updated":"2023-10-23T14:46:07Z","published":"2023-05-23T17:57:59Z","title":"Automatic Model Selection with Large Language Models for Reasoning","summary":" Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two\ndistinct reasoning methods, each with its own strengths. CoT employs natural\nlanguage, offering flexibility and interpretability, while PAL utilizes\nprogramming language, yielding more structured and rigorous logic. 
We introduce\na model selection method to combine the best of both worlds by employing a\nlarge language model (LLM) to dynamically select between them. Our theoretical\nanalysis underscores the feasibility of this method, which is further\ncorroborated by empirical results. Our proposed method demonstrates significant\nperformance improvements across eight reasoning datasets with Codex, ChatGPT,\nand GPT-4. Additionally, our method is complementary to self-consistency; when\nintegrated, it can further enhance performance while significantly reducing\ncomputation costs. Moreover, we achieve new state-of-the-art results on GSM8K\nand SVAMP, with respective accuracies of 96.8% and 93.7%. Our code, data and\nprompts are available at https://github.com/XuZhao0/Model-Selection-Reasoning\n","authors":["James Xu Zhao","Yuxi Xie","Kenji Kawaguchi","Junxian He","Michael Qizhe Xie"],"pdf_url":"https://arxiv.org/pdf/2305.14333v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.05442v2","updated":"2023-10-23T14:43:40Z","published":"2023-10-09T06:32:10Z","title":"Establishing Trustworthiness: Rethinking Tasks and Model Evaluation","summary":" Language understanding is a multi-faceted cognitive capability, which the\nNatural Language Processing (NLP) community has striven to model\ncomputationally for decades. Traditionally, facets of linguistic intelligence\nhave been compartmentalized into tasks with specialized model architectures and\ncorresponding evaluation protocols. With the advent of large language models\n(LLMs) the community has witnessed a dramatic shift towards general purpose,\ntask-agnostic approaches powered by generative models. As a consequence, the\ntraditional compartmentalized notion of language tasks is breaking down,\nfollowed by an increasing challenge for evaluation and analysis. 
At the same\ntime, LLMs are being deployed in more real-world scenarios, including\npreviously unforeseen zero-shot setups, increasing the need for trustworthy and\nreliable systems. Therefore, we argue that it is time to rethink what\nconstitutes tasks and model evaluation in NLP, and pursue a more holistic view\non language, placing trustworthiness at the center. Towards this goal, we\nreview existing compartmentalized approaches for understanding the origins of a\nmodel's functional capacity, and provide recommendations for more multi-faceted\nevaluation protocols.\n","authors":["Robert Litschko","Max Müller-Eberstein","Rob van der Goot","Leon Weber","Barbara Plank"],"pdf_url":"https://arxiv.org/pdf/2310.05442v2.pdf","comment":"Accepted at EMNLP 2023 (Main Conference), camera-ready"},{"id":"http://arxiv.org/abs/2310.14985v1","updated":"2023-10-23T14:35:26Z","published":"2023-10-23T14:35:26Z","title":"LLM-Based Agent Society Investigation: Collaboration and Confrontation\n in Avalon Gameplay","summary":" This paper aims to investigate the open research problem of uncovering the\nsocial behaviors of LLM-based agents. To achieve this goal, we adopt Avalon, a\nrepresentative communication game, as the environment and use system prompts to\nguide LLM agents to play the game. While previous studies have conducted\npreliminary investigations into gameplay with LLM agents, there lacks research\non their social behaviors. In this paper, we present a novel framework designed\nto seamlessly adapt to Avalon gameplay. The core of our proposed framework is a\nmulti-agent system that enables efficient communication and interaction among\nagents. We evaluate the performance of our framework based on metrics from two\nperspectives: winning the game and analyzing the social behaviors of LLM\nagents. 
Our results demonstrate the effectiveness of our framework in\ngenerating adaptive and intelligent agents and highlight the potential of\nLLM-based agents in addressing the challenges associated with dynamic social\nenvironment interaction. By analyzing the social behaviors of LLM agents from\nthe aspects of both collaboration and confrontation, we provide insights into\nthe research and applications of this domain.\n","authors":["Yihuai Lan","Zhiqiang Hu","Lei Wang","Yang Wang","Deheng Ye","Peilin Zhao","Ee-Peng Lim","Hui Xiong","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14981v1","updated":"2023-10-23T14:27:45Z","published":"2023-10-23T14:27:45Z","title":"Fidelity-Enriched Contrastive Search: Reconciling the\n Faithfulness-Diversity Trade-Off in Text Generation","summary":" In this paper, we address the hallucination problem commonly found in natural\nlanguage generation tasks. Language models often generate fluent and convincing\ncontent but can lack consistency with the provided source, resulting in\npotential inaccuracies. We propose a new decoding method called\nFidelity-Enriched Contrastive Search (FECS), which augments the contrastive\nsearch framework with context-aware regularization terms. FECS promotes tokens\nthat are semantically similar to the provided source while penalizing\nrepetitiveness in the generated text. We demonstrate its effectiveness across\ntwo tasks prone to hallucination: abstractive summarization and dialogue\ngeneration. 
Results show that FECS consistently enhances faithfulness across\nvarious language model sizes while maintaining output diversity comparable to\nwell-performing decoding algorithms.\n","authors":["Wei-Lin Chen","Cheng-Kuang Wu","Hsin-Hsi Chen","Chung-Chi Chen"],"pdf_url":"https://arxiv.org/pdf/2310.14981v1.pdf","comment":"Accepted as a short paper at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14979v1","updated":"2023-10-23T14:26:43Z","published":"2023-10-23T14:26:43Z","title":"ACTOR: Active Learning with Annotator-specific Classification Heads to\n Embrace Human Label Variation","summary":" Label aggregation such as majority voting is commonly used to resolve\nannotator disagreement in dataset creation. However, this may disregard\nminority values and opinions. Recent studies indicate that learning from\nindividual annotations outperforms learning from aggregated labels, though they\nrequire a considerable amount of annotation. Active learning, as an annotation\ncost-saving strategy, has not been fully explored in the context of learning\nfrom disagreement. We show that in the active learning setting, a multi-head\nmodel performs significantly better than a single-head model in terms of\nuncertainty estimation. By designing and evaluating acquisition functions with\nannotator-specific heads on two datasets, we show that group-level entropy\nworks generally well on both datasets. 
Importantly, it achieves performance in\nterms of both prediction and uncertainty estimation comparable to full-scale\ntraining from disagreement, while saving up to 70% of the annotation budget.\n","authors":["Xinpeng Wang","Barbara Plank"],"pdf_url":"https://arxiv.org/pdf/2310.14979v1.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2305.12421v4","updated":"2023-10-23T14:26:26Z","published":"2023-05-21T10:40:55Z","title":"Evaluating Open-QA Evaluation","summary":" This study focuses on the evaluation of the Open Question Answering (Open-QA)\ntask, which can directly estimate the factuality of large language models\n(LLMs). Current automatic evaluation methods have shown limitations, indicating\nthat human evaluation still remains the most reliable approach. We introduce a\nnew task, Evaluating QA Evaluation (QA-Eval) and the corresponding dataset\nEVOUNA, designed to assess the accuracy of AI-generated answers in relation to\nstandard answers within Open-QA. Our evaluation of these methods utilizes\nhuman-annotated results to measure their performance. Specifically, the work\ninvestigates methods that show high correlation with human evaluations, deeming\nthem more reliable. We also discuss the pitfalls of current methods and methods\nto improve LLM-based evaluators. 
We believe this new QA-Eval task and\ncorresponding dataset EVOUNA will facilitate the development of more effective\nautomatic evaluation tools and prove valuable for future research in this area.\nAll resources are available at \\url{https://github.com/wangcunxiang/QA-Eval}\nand it is under the Apache-2.0 License.\n","authors":["Cunxiang Wang","Sirui Cheng","Qipeng Guo","Yuanhao Yue","Bowen Ding","Zhikun Xu","Yidong Wang","Xiangkun Hu","Zheng Zhang","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.12421v4.pdf","comment":"Accepted by Neurips-2023 Datasets and Benchmarks track; 28 pages"},{"id":"http://arxiv.org/abs/2310.14971v1","updated":"2023-10-23T14:20:04Z","published":"2023-10-23T14:20:04Z","title":"Penalty Decoding: Well Suppress the Self-Reinforcement Effect in\n Open-Ended Text Generation","summary":" The decoding algorithm is critical for open-ended text generation,\ntransforming latent representations into coherent and meaningful outputs. This\npaper investigates the self-reinforcement effect in text generation and the\neffectiveness of a repetition penalty to mitigate it. However, determining the\noptimal repetition penalty value is challenging. To tackle this, we propose a\nforgetting mechanism that disregards distant tokens, reducing the burden of\npenalty selection. In addition, we introduce a length penalty to address overly\nshort sentences caused by excessive penalties. Our penalty decoding approach\nincorporating three strategies helps resolve issues with sampling methods\ndeviating from factual information. 
Experimental results demonstrate the\nefficacy of our approach in generating high-quality sentences resembling human\noutput.\n","authors":["Wenhong Zhu","Hongkun Hao","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14971v1.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2310.14970v1","updated":"2023-10-23T14:15:28Z","published":"2023-10-23T14:15:28Z","title":"Towards LLM-driven Dialogue State Tracking","summary":" Dialogue State Tracking (DST) is of paramount importance in ensuring accurate\ntracking of user goals and system actions within task-oriented dialogue\nsystems. The emergence of large language models (LLMs) such as GPT3 and ChatGPT\nhas sparked considerable interest in assessing their efficacy across diverse\napplications. In this study, we conduct an initial examination of ChatGPT's\ncapabilities in DST. Our evaluation uncovers the exceptional performance of\nChatGPT in this task, offering valuable insights to researchers regarding its\ncapabilities and providing useful directions for designing and enhancing\ndialogue systems. Despite its impressive performance, ChatGPT has significant\nlimitations including its closed-source nature, request restrictions, raising\ndata privacy concerns, and lacking local deployment capabilities. To address\nthese concerns, we present LDST, an LLM-driven DST framework based on smaller,\nopen-source foundation models. By utilizing a novel domain-slot instruction\ntuning method, LDST achieves performance on par with ChatGPT. Comprehensive\nevaluations across three distinct experimental settings, we find that LDST\nexhibits remarkable performance improvements in both zero-shot and few-shot\nsetting compared to previous SOTA methods. 
The source code is provided for\nreproducibility.\n","authors":["Yujie Feng","Zexin Lu","Bo Liu","Liming Zhan","Xiao-Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2310.14970v1.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.13264v2","updated":"2023-10-23T14:12:59Z","published":"2023-05-22T17:33:17Z","title":"Prompting is not a substitute for probability measurements in large\n language models","summary":" Prompting is now a dominant method for evaluating the linguistic knowledge of\nlarge language models (LLMs). While other methods directly read out models'\nprobability distributions over strings, prompting requires models to access\nthis internal information by processing linguistic input, thereby implicitly\ntesting a new type of emergent ability: metalinguistic judgment. In this study,\nwe compare metalinguistic prompting and direct probability measurements as ways\nof measuring models' linguistic knowledge. Broadly, we find that LLMs'\nmetalinguistic judgments are inferior to quantities directly derived from\nrepresentations. Furthermore, consistency gets worse as the prompt query\ndiverges from direct measurements of next-word probabilities. Our findings\nsuggest that negative results relying on metalinguistic prompts cannot be taken\nas conclusive evidence that an LLM lacks a particular linguistic\ngeneralization. 
Our results also highlight the value that is lost with the move\nto closed APIs where access to probability distributions is limited.\n","authors":["Jennifer Hu","Roger Levy"],"pdf_url":"https://arxiv.org/pdf/2305.13264v2.pdf","comment":"Camera-ready version for EMNLP 2023"},{"id":"http://arxiv.org/abs/2302.04012v2","updated":"2023-10-23T14:08:09Z","published":"2023-02-08T11:54:07Z","title":"CodeLMSec Benchmark: Systematically Evaluating and Finding Security\n Vulnerabilities in Black-Box Code Language Models","summary":" Large language models (LLMs) for automatic code generation have achieved\nbreakthroughs in several programming tasks. Their advances in competition-level\nprogramming problems have made them an essential pillar of AI-assisted pair\nprogramming, and tools such as GitHub Copilot have emerged as part of the daily\nprogramming workflow used by millions of developers. The training data for\nthese models is usually collected from the Internet (e.g., from open-source\nrepositories) and is likely to contain faults and security vulnerabilities.\nThis unsanitized training data can cause the language models to learn these\nvulnerabilities and propagate them during the code generation procedure. While\nthese models have been extensively assessed for their ability to produce\nfunctionally correct programs, there remains a lack of comprehensive\ninvestigations and benchmarks addressing the security aspects of these models.\n In this work, we propose a method to systematically study the security issues\nof code language models to assess their susceptibility to generating vulnerable\ncode. To this end, we introduce the first approach to automatically find\ngenerated code that contains vulnerabilities in black-box code generation\nmodels. To achieve this, we present an approach to approximate inversion of the\nblack-box code generation models based on few-shot prompting. 
We evaluate the\neffectiveness of our approach by examining code language models in generating\nhigh-risk security weaknesses. Furthermore, we establish a collection of\ndiverse non-secure prompts for various vulnerability scenarios using our\nmethod. This dataset forms a benchmark for evaluating and comparing the\nsecurity weaknesses in code language models.\n","authors":["Hossein Hajipour","Keno Hassler","Thorsten Holz","Lea Schönherr","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2302.04012v2.pdf","comment":"23 pages, 9 figures"},{"id":"http://arxiv.org/abs/2212.11680v2","updated":"2023-10-23T14:05:49Z","published":"2022-12-20T19:37:20Z","title":"Smooth Sailing: Improving Active Learning for Pre-trained Language\n Models with Representation Smoothness Analysis","summary":" Developed to alleviate prohibitive labeling costs, active learning (AL)\nmethods aim to reduce label complexity in supervised learning. While recent\nwork has demonstrated the benefit of using AL in combination with large\npre-trained language models (PLMs), it has often overlooked the practical\nchallenges that hinder the effectiveness of AL. We address these challenges by\nleveraging representation smoothness analysis to ensure AL is feasible, that\nis, both effective and practicable. Firstly, we propose an early stopping\ntechnique that does not require a validation set -- often unavailable in\nrealistic AL conditions -- and observe significant improvements over random\nsampling across multiple datasets and AL methods. Further, we find that task\nadaptation improves AL, whereas standard short fine-tuning in AL does not\nprovide improvements over random sampling. 
Our work demonstrates the usefulness\nof representation smoothness analysis for AL and introduces an AL stopping\ncriterion that reduces label complexity.\n","authors":["Josip Jukić","Jan Šnajder"],"pdf_url":"https://arxiv.org/pdf/2212.11680v2.pdf","comment":"Accepted at Learning with Small Data 2023, Association for\n Computational Linguistics"},{"id":"http://arxiv.org/abs/2310.05592v2","updated":"2023-10-23T14:01:26Z","published":"2023-10-09T10:27:26Z","title":"InterroLang: Exploring NLP Models and Datasets through Dialogue-based\n Explanations","summary":" While recently developed NLP explainability methods let us open the black box\nin various ways (Madsen et al., 2022), a missing ingredient in this endeavor is\nan interactive tool offering a conversational interface. Such a dialogue system\ncan help users explore datasets and models with explanations in a\ncontextualized manner, e.g. via clarification or follow-up questions, and\nthrough a natural language interface. We adapt the conversational explanation\nframework TalkToModel (Slack et al., 2022) to the NLP domain, add new\nNLP-specific operations such as free-text rationalization, and illustrate its\ngeneralizability on three NLP tasks (dialogue act classification, question\nanswering, hate speech detection). To recognize user queries for explanations,\nwe evaluate fine-tuned and few-shot prompting models and implement a novel\nAdapter-based approach. We then conduct two user studies on (1) the perceived\ncorrectness and helpfulness of the dialogues, and (2) the simulatability, i.e.\nhow objectively helpful dialogical explanations are for humans in figuring out\nthe model's predicted label when it's not shown. We found rationalization and\nfeature attribution were helpful in explaining the model behavior. 
Moreover,\nusers could more reliably predict the model outcome based on an explanation\ndialogue rather than one-off explanations.\n","authors":["Nils Feldhus","Qianli Wang","Tatiana Anikina","Sahil Chopra","Cennet Oguz","Sebastian Möller"],"pdf_url":"https://arxiv.org/pdf/2310.05592v2.pdf","comment":"EMNLP 2023 Findings. Camera-ready version"},{"id":"http://arxiv.org/abs/2304.08315v2","updated":"2023-10-23T14:01:06Z","published":"2023-04-17T14:37:43Z","title":"Thorny Roses: Investigating the Dual Use Dilemma in Natural Language\n Processing","summary":" Dual use, the intentional, harmful reuse of technology and scientific\nartefacts, is a problem yet to be well-defined within the context of Natural\nLanguage Processing (NLP). However, as NLP technologies continue to advance and\nbecome increasingly widespread in society, their inner workings have become\nincreasingly opaque. Therefore, understanding dual use concerns and potential\nways of limiting them is critical to minimising the potential harms of research\nand development. In this paper, we conduct a survey of NLP researchers and\npractitioners to understand the depth and their perspective of the problem as\nwell as to assess existing available support. Based on the results of our\nsurvey, we offer a definition of dual use that is tailored to the needs of the\nNLP community. The survey revealed that a majority of researchers are concerned\nabout the potential dual use of their research but only take limited action\ntoward it. 
In light of the survey results, we discuss the current state and\npotential means for mitigating dual use in NLP and propose a checklist that can\nbe integrated into existing conference ethics-frameworks, e.g., the ACL ethics\nchecklist.\n","authors":["Lucie-Aimée Kaffee","Arnav Arora","Zeerak Talat","Isabelle Augenstein"],"pdf_url":"https://arxiv.org/pdf/2304.08315v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14576v2","updated":"2023-10-23T13:58:26Z","published":"2023-05-23T23:27:20Z","title":"Parameter-Efficient Language Model Tuning with Active Learning in\n Low-Resource Settings","summary":" Pre-trained language models (PLMs) have ignited a surge in demand for\neffective fine-tuning techniques, particularly in low-resource domains and\nlanguages. Active learning (AL), a set of algorithms designed to decrease\nlabeling costs by minimizing label complexity, has shown promise in confronting\nthe labeling bottleneck. In parallel, adapter modules designed for\nparameter-efficient fine-tuning (PEFT) have demonstrated notable potential in\nlow-resource settings. However, the interplay between AL and adapter-based PEFT\nremains unexplored. We present an empirical study of PEFT behavior with AL in\nlow-resource settings for text classification tasks. Our findings affirm the\nsuperiority of PEFT over full-fine tuning (FFT) in low-resource settings and\ndemonstrate that this advantage persists in AL setups. We further examine the\nproperties of PEFT and FFT through the lens of forgetting dynamics and\ninstance-level representations, where we find that PEFT yields more stable\nrepresentations of early and middle layers compared to FFT. 
Our research\nunderscores the synergistic potential of AL and PEFT in low-resource settings,\npaving the way for advancements in efficient and effective fine-tuning.\n","authors":["Josip Jukić","Jan Šnajder"],"pdf_url":"https://arxiv.org/pdf/2305.14576v2.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14954v1","updated":"2023-10-23T13:55:49Z","published":"2023-10-23T13:55:49Z","title":"Key Frame Mechanism For Efficient Conformer Based End-to-end Speech\n Recognition","summary":" Recently, Conformer as a backbone network for end-to-end automatic speech\nrecognition achieved state-of-the-art performance. The Conformer block\nleverages a self-attention mechanism to capture global information, along with\na convolutional neural network to capture local information, resulting in\nimproved performance. However, the Conformer-based model encounters an issue\nwith the self-attention mechanism, as computational complexity grows\nquadratically with the length of the input sequence. Inspired by previous\nConnectionist Temporal Classification (CTC) guided blank skipping during\ndecoding, we introduce intermediate CTC outputs as guidance into the\ndownsampling procedure of the Conformer encoder. We define the frame with\nnon-blank output as key frame. Specifically, we introduce the key frame-based\nself-attention (KFSA) mechanism, a novel method to reduce the computation of\nthe self-attention mechanism using key frames. The structure of our proposed\napproach comprises two encoders. Following the initial encoder, we introduce an\nintermediate CTC loss function to compute the label frame, enabling us to\nextract the key frames and blank frames for KFSA. 
Furthermore, we introduce the\nkey frame-based downsampling (KFDS) mechanism to operate on high-dimensional\nacoustic features directly and drop the frames corresponding to blank labels,\nwhich results in new acoustic feature sequences as input to the second encoder.\nBy using the proposed method, which achieves comparable or higher performance\nthan vanilla Conformer and other similar work such as Efficient Conformer.\nMeantime, our proposed method can discard more than 60\\% useless frames during\nmodel training and inference, which will accelerate the inference speed\nsignificantly. This work code is available in\n{https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer}\n","authors":["Peng Fan","Changhao Shan","Jianwei Zhang","Sining Sun","Qing Yang"],"pdf_url":"https://arxiv.org/pdf/2310.14954v1.pdf","comment":"This manuscript has been accepted by IEEE Signal Processing Letters\n for publication"},{"id":"http://arxiv.org/abs/2310.14947v1","updated":"2023-10-23T13:46:49Z","published":"2023-10-23T13:46:49Z","title":"System Combination via Quality Estimation for Grammatical Error\n Correction","summary":" Quality estimation models have been developed to assess the corrections made\nby grammatical error correction (GEC) models when the reference or\ngold-standard corrections are not available. An ideal quality estimator can be\nutilized to combine the outputs of multiple GEC systems by choosing the best\nsubset of edits from the union of all edits proposed by the GEC base systems.\nHowever, we found that existing GEC quality estimation models are not good\nenough in differentiating good corrections from bad ones, resulting in a low\nF0.5 score when used for system combination. In this paper, we propose GRECO, a\nnew state-of-the-art quality estimation model that gives a better estimate of\nthe quality of a corrected sentence, as indicated by having a higher\ncorrelation to the F0.5 score of a corrected sentence. 
It results in a combined\nGEC system with a higher F0.5 score. We also propose three methods for\nutilizing GEC quality estimation models for system combination with varying\ngenerality: model-agnostic, model-agnostic with voting bias, and\nmodel-dependent method. The combined GEC system outperforms the state of the\nart on the CoNLL-2014 test set and the BEA-2019 test set, achieving the highest\nF0.5 scores published to date.\n","authors":["Muhammad Reza Qorib","Hwee Tou Ng"],"pdf_url":"https://arxiv.org/pdf/2310.14947v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14928v1","updated":"2023-10-23T13:31:32Z","published":"2023-10-23T13:31:32Z","title":"Unveiling A Core Linguistic Region in Large Language Models","summary":" Brain localization, which describes the association between specific regions\nof the brain and their corresponding functions, is widely accepted in the field\nof cognitive science as an objective fact. Today's large language models (LLMs)\npossess human-level linguistic competence and can execute complex tasks\nrequiring abstract knowledge and reasoning. To deeply understand the inherent\nmechanisms of intelligence emergence in LLMs, this paper conducts an analogical\nresearch using brain localization as a prototype. We have discovered a core\nregion in LLMs that corresponds to linguistic competence, accounting for\napproximately 1% of the total model parameters. This core region exhibits\nsignificant dimension dependency, and perturbations to even a single parameter\non specific dimensions can lead to a loss of linguistic competence.\nFurthermore, we observe that an improvement in linguistic competence does not\nnecessarily accompany an elevation in the model's knowledge level, which might\nimply the existence of regions of domain knowledge that are dissociated from\nthe linguistic region. Overall, exploring the LLMs' functional regions provides\ninsights into the foundation of their intelligence. 
In the future, we will\ncontinue to investigate knowledge regions within LLMs and the interactions\nbetween them.\n","authors":["Jun Zhao","Zhihao Zhang","Yide Ma","Qi Zhang","Tao Gui","Luhui Gao","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.14928v1.pdf","comment":"Work on progress"},{"id":"http://arxiv.org/abs/2301.09112v2","updated":"2023-10-23T13:28:35Z","published":"2023-01-22T12:29:03Z","title":"Differentially Private Natural Language Models: Recent Advances and\n Future Directions","summary":" Recent developments in deep learning have led to great success in various\nnatural language processing (NLP) tasks. However, these applications may\ninvolve data that contain sensitive information. Therefore, how to achieve good\nperformance while also protecting the privacy of sensitive data is a crucial\nchallenge in NLP. To preserve privacy, Differential Privacy (DP), which can\nprevent reconstruction attacks and protect against potential side knowledge, is\nbecoming a de facto technique for private data analysis. In recent years, NLP\nin DP models (DP-NLP) has been studied from different perspectives, which\ndeserves a comprehensive review. In this paper, we provide the first systematic\nreview of recent advances in DP deep learning models in NLP. In particular, we\nfirst discuss some differences and additional challenges of DP-NLP compared\nwith the standard DP deep learning. Then, we investigate some existing work on\nDP-NLP and present its recent developments from three aspects: gradient\nperturbation based methods, embedding vector perturbation based methods, and\nensemble model based methods. 
We also discuss some challenges and future\ndirections.\n","authors":["Lijie Hu","Ivan Habernal","Lei Shen","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2301.09112v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14921v1","updated":"2023-10-23T13:25:54Z","published":"2023-10-23T13:25:54Z","title":"PartialFormer: Modeling Part Instead of Whole","summary":" The design choices in Transformer feed-forward neural networks have resulted\nin significant computational and parameter overhead. In this work, we emphasize\nthe importance of hidden dimension in designing lightweight FFNs, a factor\noften overlooked in previous architectures. Guided by this principle, we\nintroduce PartialFormer, a parameter-efficient Transformer architecture\nutilizing multiple smaller FFNs to reduce parameters and computation while\nmaintaining essential hidden dimensions. These smaller FFNs are integrated into\na multi-head attention system to enable effective collaboration. We also\npropose a tailored head scaling strategy to enhance PartialFormer's\ncapabilities. Furthermore, we present a residual-like attention calculation to\nimprove depth scaling within PartialFormer. Extensive experiments on 9\ntranslation tasks and 1 abstractive summarization task validate the\neffectiveness of our PartialFormer approach. Our code would be available at:\n\\url{https://github.com/zhengkid/PartialFormer}.\n","authors":["Tong Zheng","Bei Li","Huiwen Bao","Weiqiao Shan","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.14921v1.pdf","comment":"11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.14909v1","updated":"2023-10-23T13:18:49Z","published":"2023-10-23T13:18:49Z","title":"Linking Surface Facts to Large-Scale Knowledge Graphs","summary":" Open Information Extraction (OIE) methods extract facts from natural language\ntext in the form of (\"subject\"; \"relation\"; \"object\") triples. 
These facts are,\nhowever, merely surface forms, the ambiguity of which impedes their downstream\nusage; e.g., the surface phrase \"Michael Jordan\" may refer to either the former\nbasketball player or the university professor. Knowledge Graphs (KGs), on the\nother hand, contain facts in a canonical (i.e., unambiguous) form, but their\ncoverage is limited by a static schema (i.e., a fixed set of entities and\npredicates). To bridge this gap, we need the best of both worlds: (i) high\ncoverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of\nKGs. In order to achieve this goal, we propose a new benchmark with novel\nevaluation protocols that can, for example, measure fact linking performance on\na granular triple slot level, while also measuring if a system has the ability\nto recognize that a surface form has no match in the existing KG. Our extensive\nevaluation of several baselines show that detection of out-of-KG entities and\npredicates is more difficult than accurate linking to existing ones, thus\ncalling for more research efforts on this difficult task. We publicly release\nall resources (data, benchmark and code) on\nhttps://github.com/nec-research/fact-linking.\n","authors":["Gorjan Radevski","Kiril Gashteovski","Chia-Chien Hung","Carolin Lawrence","Goran Glavaš"],"pdf_url":"https://arxiv.org/pdf/2310.14909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14892v1","updated":"2023-10-23T12:59:11Z","published":"2023-10-23T12:59:11Z","title":"Air-Decoding: Attribute Distribution Reconstruction for Decoding-Time\n Controllable Text Generation","summary":" Controllable text generation (CTG) aims to generate text with desired\nattributes, and decoding-time-based methods have shown promising performance on\nthis task. However, in this paper, we identify the phenomenon of Attribute\nCollapse for the first time. 
It causes the fluency of generated text to rapidly\ndecrease when the control strength exceeds a critical value, rendering the text\ncompletely unusable. This limitation hinders the effectiveness of decoding\nmethods in achieving high levels of controllability. To address this problem,\nwe propose a novel lightweight decoding framework named Air-Decoding. Its main\nidea is reconstructing the attribute distributions to balance the weights\nbetween attribute words and non-attribute words to generate more fluent text.\nSpecifically, we train prefixes by prefix-tuning to obtain attribute\ndistributions. Then we design a novel attribute distribution reconstruction\nmethod to balance the obtained distributions and use the reconstructed\ndistributions to guide language models for generation, effectively avoiding the\nissue of Attribute Collapse. Experiments on multiple CTG tasks prove that our\nmethod achieves a new state-of-the-art control performance.\n","authors":["Tianqi Zhong","Quan Wang","Jingxuan Han","Yongdong Zhang","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2310.14892v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14883v1","updated":"2023-10-23T12:52:24Z","published":"2023-10-23T12:52:24Z","title":"Non-autoregressive Streaming Transformer for Simultaneous Translation","summary":" Simultaneous machine translation (SiMT) models are trained to strike a\nbalance between latency and translation quality. However, training these models\nto achieve high quality while maintaining low latency often leads to a tendency\nfor aggressive anticipation. We argue that such issue stems from the\nautoregressive architecture upon which most existing SiMT models are built. To\naddress those issues, we propose non-autoregressive streaming Transformer\n(NAST) which comprises a unidirectional encoder and a non-autoregressive\ndecoder with intra-chunk parallelism. 
We enable NAST to generate the blank\ntoken or repetitive tokens to adjust its READ/WRITE strategy flexibly, and\ntrain it to maximize the non-monotonic latent alignment with an alignment-based\nlatency loss. Experiments on various SiMT benchmarks demonstrate that NAST\noutperforms previous strong autoregressive SiMT baselines.\n","authors":["Zhengrui Ma","Shaolei Zhang","Shoutao Guo","Chenze Shao","Min Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.14883v1.pdf","comment":"EMNLP 2023 main conference; Source code is available at\n https://github.com/ictnlp/NAST"},{"id":"http://arxiv.org/abs/2310.14880v1","updated":"2023-10-23T12:51:49Z","published":"2023-10-23T12:51:49Z","title":"Can ChatGPT Perform Reasoning Using the IRAC Method in Analyzing Legal\n Scenarios Like a Lawyer?","summary":" Large Language Models (LLMs), such as ChatGPT, have drawn a lot of attentions\nrecently in the legal domain due to its emergent ability to tackle a variety of\nlegal tasks. However, it is still unknown if LLMs are able to analyze a legal\ncase and perform reasoning in the same manner as lawyers. Therefore, we\nconstructed a novel corpus consisting of scenarios pertain to Contract Acts\nMalaysia and Australian Social Act for Dependent Child. ChatGPT is applied to\nperform analysis on the corpus using the IRAC method, which is a framework\nwidely used by legal professionals for organizing legal analysis. Each scenario\nin the corpus is annotated with a complete IRAC analysis in a semi-structured\nformat so that both machines and legal professionals are able to interpret and\nunderstand the annotations. In addition, we conducted the first empirical\nassessment of ChatGPT for IRAC analysis in order to understand how well it\naligns with the analysis of legal professionals. 
Our experimental results shed\nlights on possible future research directions to improve alignments between\nLLMs and legal experts in terms of legal reasoning.\n","authors":["Xiaoxi Kang","Lizhen Qu","Lay-Ki Soon","Adnan Trakic","Terry Yue Zhuo","Patrick Charles Emerton","Genevieve Grant"],"pdf_url":"https://arxiv.org/pdf/2310.14880v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.05317v4","updated":"2023-10-23T12:46:50Z","published":"2023-10-09T00:20:59Z","title":"Enhancing Long-form Text Generation Efficacy with Task-adaptive\n Tokenization","summary":" We propose task-adaptive tokenization as a way to adapt the generation\npipeline to the specifics of a downstream task and enhance long-form generation\nin mental health. Inspired by insights from cognitive science, our\ntask-adaptive tokenizer samples variable segmentations from multiple outcomes,\nwith sampling probabilities optimized based on task-specific data. We introduce\na strategy for building a specialized vocabulary and introduce a vocabulary\nmerging protocol that allows for the integration of task-specific tokens into\nthe pre-trained model's tokenization step. Through extensive experiments on\npsychological question-answering tasks in both Chinese and English, we find\nthat our task-adaptive tokenization approach brings a significant improvement\nin generation performance while using up to 60% fewer tokens. 
Preliminary\nexperiments point to promising results when using our tokenization approach\nwith very large language models.\n","authors":["Siyang Liu","Naihao Deng","Sahand Sabour","Yilin Jia","Minlie Huang","Rada Mihalcea"],"pdf_url":"https://arxiv.org/pdf/2310.05317v4.pdf","comment":"Accepted at the main conference of The 2023 Conference on Empirical\n Methods in Natural Language Processing; 8 pages"},{"id":"http://arxiv.org/abs/2303.12314v4","updated":"2023-10-23T12:43:35Z","published":"2023-03-22T05:04:21Z","title":"Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization\n for Few-shot Generalization","summary":" Prompt tuning is a parameter-efficient method, which learns soft prompts and\nconditions frozen language models to perform specific downstream tasks. Though\neffective, prompt tuning under few-shot settings on the one hand heavily relies\non a good initialization of soft prompts. On the other hand, it can easily\noverfit to few-shot training samples, thereby undermining generalizability.\nExisting works leverage pre-training or supervised meta-learning to initialize\nsoft prompts but they fail to data-efficiently generalize to unseen downstream\ntasks. To address the above problems, this paper proposes a novel\nSelf-sUpervised meta-Prompt learning framework with MEta-gradient\nRegularization for few-shot generalization (SUPMER). SUPMER leverages\nself-supervised meta-learning with a diverse set of well-designed meta-training\ntasks to learn a universal prompt initialization for efficient adaptation using\nonly unlabeled data. Additionally, it jointly meta-learns a gradient\nregularization function to transform raw gradients into a domain-generalizable\ndirection, thus alleviating the problem of overfitting. Extensive experiments\nshow that SUPMER achieves better performance for different few-shot downstream\ntasks, and also exhibits a stronger domain generalization ability. 
The code for\nSUPMER will be available at https://github.com/beepkh/SUPMER.\n","authors":["Kaihang Pan","Juncheng Li","Hongye Song","Jun Lin","Xiaozhong Liu","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2303.12314v4.pdf","comment":"Accepted by EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.14870v1","updated":"2023-10-23T12:42:06Z","published":"2023-10-23T12:42:06Z","title":"We are Who We Cite: Bridges of Influence Between Natural Language\n Processing and Other Academic Fields","summary":" Natural Language Processing (NLP) is poised to substantially influence the\nworld. However, significant progress comes hand-in-hand with substantial risks.\nAddressing them requires broad engagement with various fields of study. Yet,\nlittle empirical work examines the state of such engagement (past or current).\nIn this paper, we quantify the degree of influence between 23 fields of study\nand NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP\npapers to other papers, and ~1.8m citations from other papers to NLP papers. We\nshow that, unlike most fields, the cross-field engagement of NLP, measured by\nour proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in\n1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown\nmore insular -- citing increasingly more NLP papers and having fewer papers\nthat act as bridges between fields. NLP citations are dominated by computer\nscience; Less than 8% of NLP citations are to linguistics, and less than 3% are\nto math and psychology. These findings underscore NLP's urgent need to reflect\non its engagement with various fields.\n","authors":["Jan Philip Wahle","Terry Ruas","Mohamed Abdalla","Bela Gipp","Saif M. 
Mohammad"],"pdf_url":"https://arxiv.org/pdf/2310.14870v1.pdf","comment":"Published at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14868v1","updated":"2023-10-23T12:40:41Z","published":"2023-10-23T12:40:41Z","title":"Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study\n on Syllogism","summary":" Large language models (LLMs) take advantage of step-by-step reasoning\ninstructions, e.g., chain-of-thought (CoT) prompting. Building on this, their\nability to perform CoT-style reasoning robustly is of interest from a probing\nperspective. In this study, we inspect the step-by-step reasoning ability of\nLLMs with a focus on negation, which is a core linguistic phenomenon that is\ndifficult to process. In particular, we introduce several controlled settings\n(e.g., reasoning in case of fictional entities) to evaluate the logical\nreasoning abilities of the models. We observed that dozens of modern LLMs were\nnot robust against lexical negation (e.g., plausible ->implausible) when\nperforming CoT-style reasoning, and the results highlight unique limitations in\neach LLM family.\n","authors":["Mengyu Ye","Tatsuki Kuribayashi","Jun Suzuki","Goro Kobayashi","Hiroaki Funayama"],"pdf_url":"https://arxiv.org/pdf/2310.14868v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18287v2","updated":"2023-10-23T12:32:47Z","published":"2023-05-29T17:56:35Z","title":"LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and\n Unlabeled Image Collections","summary":" Recently, large-scale pre-trained Vision and Language (VL) models have set a\nnew state-of-the-art (SOTA) in zero-shot visual classification enabling\nopen-vocabulary recognition of potentially unlimited set of categories defined\nas simple language prompts. However, despite these great advances, the\nperformance of these zeroshot classifiers still falls short of the results of\ndedicated (closed category set) classifiers trained with supervised fine\ntuning. 
In this paper we show, for the first time, how to reduce this gap\nwithout any labels and without any paired VL data, using an unlabeled image\ncollection and a set of texts auto-generated using a Large Language Model (LLM)\ndescribing the categories of interest and effectively substituting labeled\nvisual instances of those categories. Using our label-free approach, we are\nable to attain significant performance improvements over the zero-shot\nperformance of the base VL model and other contemporary methods and baselines\non a wide variety of datasets, demonstrating absolute improvement of up to\n11.7% (3.8% on average) in the label-free setting. Moreover, despite our\napproach being label-free, we observe 1.3% average gains over leading few-shot\nprompting baselines that do use 5-shot supervision.\n","authors":["M. Jehanzeb Mirza","Leonid Karlinsky","Wei Lin","Mateusz Kozinski","Horst Possegger","Rogerio Feris","Horst Bischof"],"pdf_url":"https://arxiv.org/pdf/2305.18287v2.pdf","comment":"NeurIPS 2023 (Camera Ready) - Project Page:\n https://jmiemirza.github.io/LaFTer/"},{"id":"http://arxiv.org/abs/2310.14863v1","updated":"2023-10-23T12:32:41Z","published":"2023-10-23T12:32:41Z","title":"Paraphrase Types for Generation and Detection","summary":" Current approaches in paraphrase generation and detection heavily rely on a\nsingle general similarity score, ignoring the intricate linguistic properties\nof language. This paper introduces two new tasks to address this shortcoming by\nconsidering paraphrase types - specific linguistic perturbations at particular\ntext positions. We name these tasks Paraphrase Type Generation and Paraphrase\nType Detection. Our results suggest that while current techniques perform well\nin a binary classification scenario, i.e., paraphrased or not, the inclusion of\nfine-grained paraphrase types poses a significant challenge. 
While most\napproaches are good at generating and detecting general semantic similar\ncontent, they fail to understand the intrinsic linguistic variables they\nmanipulate. Models trained in generating and identifying paraphrase types also\nshow improvements in tasks without them. In addition, scaling these models\nfurther improves their ability to understand paraphrase types. We believe\nparaphrase types can unlock a new paradigm for developing paraphrase models and\nsolving tasks in the future.\n","authors":["Jan Philip Wahle","Bela Gipp","Terry Ruas"],"pdf_url":"https://arxiv.org/pdf/2310.14863v1.pdf","comment":"Published at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.12020v2","updated":"2023-10-23T12:29:33Z","published":"2023-10-18T14:53:14Z","title":"LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic\n Tabletop Manipulation","summary":" The convergence of embodied agents and large language models (LLMs) has\nbrought significant advancements to embodied instruction following.\nParticularly, the strong reasoning capabilities of LLMs make it possible for\nrobots to perform long-horizon tasks without expensive annotated\ndemonstrations. However, public benchmarks for testing the long-horizon\nreasoning capabilities of language-conditioned robots in various scenarios are\nstill missing. To fill this gap, this work focuses on the tabletop manipulation\ntask and releases a simulation benchmark, \\textit{LoHoRavens}, which covers\nvarious long-horizon reasoning aspects spanning color, size, space, arithmetics\nand reference. Furthermore, there is a key modality bridging problem for\nlong-horizon manipulation tasks with LLMs: how to incorporate the observation\nfeedback during robot execution for the LLM's closed-loop planning, which is\nhowever less studied by prior work. 
We investigate two methods of bridging the\nmodality gap: caption generation and learnable interface for incorporating\nexplicit and implicit observation feedback to the LLM, respectively. These\nmethods serve as the two baselines for our proposed benchmark. Experiments show\nthat both methods struggle to solve some tasks, indicating long-horizon\nmanipulation tasks are still challenging for current popular models. We expect\nthe proposed public benchmark and baselines can help the community develop\nbetter models for long-horizon tabletop manipulation tasks.\n","authors":["Shengqiang Zhang","Philipp Wicke","Lütfi Kerem Şenel","Luis Figueredo","Abdeldjallil Naceri","Sami Haddadin","Barbara Plank","Hinrich Schütze"],"pdf_url":"https://arxiv.org/pdf/2310.12020v2.pdf","comment":"6 pages, 4 figures. The video and code of LoHoRavens are available at\n https://cisnlp.github.io/lohoravens-webpage/"},{"id":"http://arxiv.org/abs/2310.14859v1","updated":"2023-10-23T12:29:10Z","published":"2023-10-23T12:29:10Z","title":"3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for\n Embodied Turn-Taking Prediction","summary":" Predicting turn-taking in multiparty conversations has many practical\napplications in human-computer/robot interaction. However, the complexity of\nhuman communication makes it a challenging task. Recent advances have shown\nthat synchronous multi-perspective egocentric data can significantly improve\nturn-taking prediction compared to asynchronous, single-perspective\ntranscriptions. Building on this research, we propose a new multimodal\ntransformer-based architecture for predicting turn-taking in embodied,\nsynchronized multi-perspective data. Our experimental results on the recently\nintroduced EgoCom dataset show a substantial performance improvement of up to\n14.01% on average compared to existing baselines and alternative\ntransformer-based approaches. 
The source code, and the pre-trained models of\nour 3T-Transformer will be available upon acceptance.\n","authors":["Mehdi Fatan","Emanuele Mincato","Dimitra Pintzou","Mariella Dimiccoli"],"pdf_url":"https://arxiv.org/pdf/2310.14859v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12075v2","updated":"2023-10-23T12:25:30Z","published":"2023-09-21T13:45:32Z","title":"Prompt Tuned Embedding Classification for Multi-Label Industry Sector\n Allocation","summary":" Prompt Tuning is emerging as a scalable and cost-effective method to\nfine-tune Pretrained Language Models (PLMs), which are often referred to as\nLarge Language Models (LLMs). This study benchmarks the performance and\ncomputational efficiency of Prompt Tuning and baselines for multi-label text\nclassification. This is applied to the challenging task of classifying\ncompanies into an investment firm's proprietary industry taxonomy, supporting\ntheir thematic investment strategy. Text-to-text classification is frequently\nreported to outperform task-specific classification heads, but has several\nlimitations when applied to a multi-label classification problem where each\nlabel consists of multiple tokens: (a) Generated labels may not match any label\nin the label taxonomy; (b) The fine-tuning process lacks permutation invariance\nand is sensitive to the order of the provided labels; (c) The model provides\nbinary decisions rather than appropriate confidence scores. Limitation (a) is\naddressed by applying constrained decoding using Trie Search, which slightly\nimproves classification performance. All limitations (a), (b), and (c) are\naddressed by replacing the PLM's language head with a classification head,\nwhich is referred to as Prompt Tuned Embedding Classification (PTEC). This\nimproves performance significantly, while also reducing computational costs\nduring inference. In our industrial application, the training data is skewed\ntowards well-known companies. 
We confirm that the model's performance is\nconsistent across both well-known and less-known companies. Our overall results\nindicate the continuing need to adapt state-of-the-art methods to\ndomain-specific tasks, even in the era of PLMs with strong generalization\nabilities. We release our codebase and a benchmarking dataset at\nhttps://github.com/EQTPartners/PTEC.\n","authors":["Valentin Leonhard Buchner","Lele Cao","Jan-Christoph Kalo","Vilhelm von Ehrenheim"],"pdf_url":"https://arxiv.org/pdf/2309.12075v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14855v1","updated":"2023-10-23T12:22:15Z","published":"2023-10-23T12:22:15Z","title":"Contextual Refinement of Translations: Large Language Models for\n Sentence and Document-Level Post-Editing","summary":" Large Language Models (LLM's) have demonstrated considerable success in\nvarious Natural Language Processing tasks, but they have yet to attain\nstate-of-the-art performance in Neural Machine Translation (NMT). Nevertheless,\ntheir significant performance in tasks demanding a broad understanding and\ncontextual processing shows their potential for translation. To exploit these\nabilities, we investigate using LLM's for MT and explore recent\nparameter-efficient fine-tuning techniques. Surprisingly, our initial\nexperiments find that fine-tuning for translation purposes even led to\nperformance degradation. To overcome this, we propose an alternative approach:\nadapting LLM's as Automatic Post-Editors (APE) rather than direct translators.\nBuilding on the LLM's exceptional ability to process and generate lengthy\nsequences, we also propose extending our approach to document-level\ntranslation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can\nyield significant improvements across both sentence and document-level metrics\nwhile generalizing to out-of-domain data. 
Most notably, we achieve a\nstate-of-the-art accuracy rate of 89\\% on the ContraPro test set, which\nspecifically assesses the model's ability to resolve pronoun ambiguities when\ntranslating from English to German. Lastly, we investigate a practical scenario\ninvolving manual post-editing for document-level translation, where reference\ncontext is made available. Here, we demonstrate that leveraging human\ncorrections can significantly reduce the number of edits required for\nsubsequent translations\\footnote{Interactive Demo for integrating manual\nfeedback can be found\n\\href{https://huggingface.co/spaces/skoneru/contextual_refinement_ende}{here}}\n","authors":["Sai Koneru","Miriam Exel","Matthias Huck","Jan Niehues"],"pdf_url":"https://arxiv.org/pdf/2310.14855v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.00482v2","updated":"2023-10-23T12:16:52Z","published":"2022-12-01T13:17:25Z","title":"IRRGN: An Implicit Relational Reasoning Graph Network for Multi-turn\n Response Selection","summary":" The task of response selection in multi-turn dialogue is to find the best\noption from all candidates. In order to improve the reasoning ability of the\nmodel, previous studies pay more attention to using explicit algorithms to\nmodel the dependencies between utterances, which are deterministic, limited and\ninflexible. In addition, few studies consider differences between the options\nbefore and after reasoning. In this paper, we propose an Implicit Relational\nReasoning Graph Network to address these issues, which consists of the\nUtterance Relational Reasoner (URR) and the Option Dual Comparator (ODC). URR\naims to implicitly extract dependencies between utterances, as well as\nutterances and options, and make reasoning with relational graph convolutional\nnetworks. 
ODC focuses on perceiving the difference between the options through\ndual comparison, which can eliminate the interference of the noise options.\nExperimental results on two multi-turn dialogue reasoning benchmark datasets\nMuTual and MuTual+ show that our method significantly improves the baseline of\nfour pretrained language models and achieves state-of-the-art performance. The\nmodel surpasses human performance for the first time on the MuTual dataset.\n","authors":["Jingcheng Deng","Hengwei Dai","Xuewei Guo","Yuanchen Ju","Wei Peng"],"pdf_url":"https://arxiv.org/pdf/2212.00482v2.pdf","comment":"Accepted by EMNLP 2022"},{"id":"http://arxiv.org/abs/2310.10567v2","updated":"2023-10-23T12:16:44Z","published":"2023-10-16T16:42:01Z","title":"RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder\n for Language Modeling","summary":" Retrieval-augmented language models show promise in addressing issues like\noutdated information and hallucinations in language models (LMs). However,\ncurrent research faces two main problems: 1) determining what information to\nretrieve, and 2) effectively combining retrieved information during generation.\nWe argue that valuable retrieved information should not only be related to the\ncurrent source text but also consider the future target text, given the nature\nof LMs that model future tokens. Moreover, we propose that aggregation using\nlatent variables derived from a compact latent space is more efficient than\nutilizing explicit raw text, which is limited by context length and susceptible\nto noise. Therefore, we introduce RegaVAE, a retrieval-augmented language model\nbuilt upon the variational auto-encoder (VAE). It encodes the text corpus into\na latent space, capturing current and future information from both source and\ntarget text. 
Additionally, we leverage the VAE to initialize the latent space\nand adopt the probabilistic form of the retrieval generation paradigm by\nexpanding the Gaussian prior distribution into a Gaussian mixture distribution.\nTheoretical analysis provides an optimizable upper bound for RegaVAE.\nExperimental results on various datasets demonstrate significant improvements\nin text generation quality and hallucination removal.\n","authors":["Jingcheng Deng","Liang Pang","Huawei Shen","Xueqi Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.10567v2.pdf","comment":"Accepted to the Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14853v1","updated":"2023-10-23T12:16:32Z","published":"2023-10-23T12:16:32Z","title":"Adaptive Policy with Wait-$k$ Model for Simultaneous Translation","summary":" Simultaneous machine translation (SiMT) requires a robust read/write policy\nin conjunction with a high-quality translation model. Traditional methods rely\non either a fixed wait-$k$ policy coupled with a standalone wait-$k$\ntranslation model, or an adaptive policy jointly trained with the translation\nmodel. In this study, we propose a more flexible approach by decoupling the\nadaptive policy model from the translation model. Our motivation stems from the\nobservation that a standalone multi-path wait-$k$ model performs competitively\nwith adaptive policies utilized in state-of-the-art SiMT approaches.\nSpecifically, we introduce DaP, a divergence-based adaptive policy, that makes\nread/write decisions for any translation model based on the potential\ndivergence in translation distributions resulting from future information. DaP\nextends a frozen wait-$k$ model with lightweight parameters, and is both memory\nand computation efficient. 
Experimental results across various benchmarks\ndemonstrate that our approach offers an improved trade-off between translation\naccuracy and latency, outperforming strong baselines.\n","authors":["Libo Zhao","Kai Fan","Wei Luo","Jing Wu","Shushu Wang","Ziqian Zeng","Zhongqiang Huang"],"pdf_url":"https://arxiv.org/pdf/2310.14853v1.pdf","comment":"Accepted to EMNLP 2023 main conference. 17 pages, 12 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.14849v1","updated":"2023-10-23T12:15:25Z","published":"2023-10-23T12:15:25Z","title":"Universal Domain Adaptation for Robust Handling of Distributional Shifts\n in NLP","summary":" When deploying machine learning systems to the wild, it is highly desirable\nfor them to effectively leverage prior knowledge to the unfamiliar domain while\nalso firing alarms to anomalous inputs. In order to address these requirements,\nUniversal Domain Adaptation (UniDA) has emerged as a novel research area in\ncomputer vision, focusing on achieving both adaptation ability and robustness\n(i.e., the ability to detect out-of-distribution samples). While UniDA has led to\nsignificant progress in computer vision, its application to language input\nstill needs to be explored despite its feasibility. In this paper, we propose a\ncomprehensive benchmark for natural language that offers thorough viewpoints of\nthe model's generalizability and robustness. Our benchmark encompasses multiple\ndatasets with varying difficulty levels and characteristics, including temporal\nshifts and diverse domains. 
On top of our testbed, we validate existing UniDA\nmethods from computer vision and state-of-the-art domain adaptation techniques\nfrom NLP literature, yielding valuable findings: We observe that UniDA methods\noriginally designed for image input can be effectively transferred to the\nnatural language domain while also underscoring the effect of adaptation\ndifficulty in determining the model's performance.\n","authors":["Hyuhng Joon Kim","Hyunsoo Cho","Sang-Woo Lee","Junyeob Kim","Choonghyun Park","Sang-goo Lee","Kang Min Yoo","Taeuk Kim"],"pdf_url":"https://arxiv.org/pdf/2310.14849v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2308.01684v2","updated":"2023-10-23T12:05:52Z","published":"2023-08-03T10:52:52Z","title":"Baby's CoThought: Leveraging Large Language Models for Enhanced\n Reasoning in Compact Models","summary":" Large Language Models (LLMs) demonstrate remarkable performance on a variety\nof natural language understanding (NLU) tasks, primarily due to their\nin-context learning ability. This ability could be applied to building babylike\nmodels, i.e. models at small scales, improving training efficiency. In this\npaper, we propose a \"CoThought\" pipeline, which efficiently trains smaller\n\"baby\" language models (BabyLMs) by leveraging the Chain of Thought prompting\nof LLMs. Our pipeline restructures a dataset of less than 100M in size using\nGPT-3.5-turbo, transforming it into task-oriented, human-readable texts that\nare comparable to the school texts for language learners. The BabyLM is then\npretrained on this restructured dataset in a RoBERTa fashion. In evaluations\nacross 4 benchmarks, our BabyLM outperforms the vanilla RoBERTa in 10\nlinguistic, NLU, and question-answering tasks by more than 3 points, showing a\nsuperior ability to extract contextual information. 
These results suggest that\ncompact LMs pretrained on small, LLM-restructured data can better understand\ntasks and achieve improved performance.\n","authors":["Zheyu Zhang","Han Yang","Bolei Ma","David Rügamer","Ercong Nie"],"pdf_url":"https://arxiv.org/pdf/2308.01684v2.pdf","comment":"CoNLL 2023 BabyLM Challenge"},{"id":"http://arxiv.org/abs/2310.14840v1","updated":"2023-10-23T12:03:01Z","published":"2023-10-23T12:03:01Z","title":"Transparency at the Source: Evaluating and Interpreting Language Models\n With Access to the True Distribution","summary":" We present a setup for training, evaluating and interpreting neural language\nmodels, that uses artificial, language-like data. The data is generated using a\nmassive probabilistic grammar (based on state-split PCFGs), that is itself\nderived from a large natural language corpus, but also provides us complete\ncontrol over the generative process. We describe and release both grammar and\ncorpus, and test for the naturalness of our generated data. This approach\nallows us to define closed-form expressions to efficiently compute exact lower\nbounds on obtainable perplexity using both causal and masked language\nmodelling. Our results show striking differences between neural language\nmodelling architectures and training objectives in how closely they allow\napproximating the lower bound on perplexity. Our approach also allows us to\ndirectly compare learned representations to symbolic rules in the underlying\nsource. We experiment with various techniques for interpreting model behaviour\nand learning dynamics. 
With access to the underlying true source, our results\nshow striking differences and outcomes in learning dynamics between different\nclasses of words.\n","authors":["Jaap Jumelet","Willem Zuidema"],"pdf_url":"https://arxiv.org/pdf/2310.14840v1.pdf","comment":"EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2305.14536v2","updated":"2023-10-23T12:00:01Z","published":"2023-05-23T21:44:56Z","title":"MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties\n Grounded in Math Reasoning Problems","summary":" While automatic dialogue tutors hold great potential in making education\npersonalized and more accessible, research on such systems has been hampered by\na lack of sufficiently large and high-quality datasets. Collecting such\ndatasets remains challenging, as recording tutoring sessions raises privacy\nconcerns and crowdsourcing leads to insufficient data quality. To address this,\nwe propose a framework to generate such dialogues by pairing human teachers\nwith a Large Language Model (LLM) prompted to represent common student errors.\nWe describe how we use this framework to collect MathDial, a dataset of 3k\none-to-one teacher-student tutoring dialogues grounded in multi-step math\nreasoning problems. While models like GPT-3 are good problem solvers, they fail\nat tutoring because they generate factually incorrect feedback or are prone to\nrevealing solutions to students too early. To overcome this, we let teachers\nprovide learning opportunities to students by guiding them using various\nscaffolding questions according to a taxonomy of teacher moves. We demonstrate\nMathDial and its extensive annotations can be used to finetune models to be\nmore effective tutors (and not just solvers). We confirm this by automatic and\nhuman evaluation, notably in an interactive setting that measures the trade-off\nbetween student solving success and telling solutions. 
The dataset is released\npublicly.\n","authors":["Jakub Macina","Nico Daheim","Sankalan Pal Chowdhury","Tanmay Sinha","Manu Kapur","Iryna Gurevych","Mrinmaya Sachan"],"pdf_url":"https://arxiv.org/pdf/2305.14536v2.pdf","comment":"Jakub Macina, Nico Daheim, and Sankalan Pal Chowdhury contributed\n equally to this work. Accepted at EMNLP2023 Findings. Code and dataset\n available: https://github.com/eth-nlped/mathdial"},{"id":"http://arxiv.org/abs/2310.12874v2","updated":"2023-10-23T11:58:24Z","published":"2023-10-19T16:29:23Z","title":"StoryAnalogy: Deriving Story-level Analogies from Large Language Models\n to Unlock Analogical Understanding","summary":" Analogy-making between narratives is crucial for human reasoning. In this\npaper, we evaluate the ability to identify and generate analogies by\nconstructing a first-of-its-kind large-scale story-level analogy corpus,\n\\textsc{StoryAnalogy}, which contains 24K story pairs from diverse domains with\nhuman annotations on two similarities from the extended Structure-Mapping\nTheory. We design a set of tests on \\textsc{StoryAnalogy}, presenting the first\nevaluation of story-level analogy identification and generation. Interestingly,\nwe find that the analogy identification tasks are incredibly difficult not only\nfor sentence embedding models but also for the recent large language models\n(LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around\n30% accuracy in multiple-choice questions (compared to over 85% accuracy for\nhumans). 
Furthermore, we observe that the data in \\textsc{StoryAnalogy} can\nimprove the quality of analogy generation in LLMs, where a fine-tuned\nFlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.\n","authors":["Cheng Jiayang","Lin Qiu","Tsz Ho Chan","Tianqing Fang","Weiqi Wang","Chunkit Chan","Dongyu Ru","Qipeng Guo","Hongming Zhang","Yangqiu Song","Yue Zhang","Zheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.12874v2.pdf","comment":"Accepted by EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2305.15074v3","updated":"2023-10-23T11:55:58Z","published":"2023-05-24T11:55:59Z","title":"Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For\n Large Language Models","summary":" The performance of large language models (LLMs) on existing reasoning\nbenchmarks has significantly improved over the past years. In response, we\npresent JEEBench, a considerably more challenging benchmark dataset for\nevaluating the problem solving abilities of LLMs. We curate 515 challenging\npre-engineering mathematics, physics and chemistry problems from the highly\ncompetitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep\nin-domain knowledge is essential for solving problems in this benchmark. Our\nevaluation on various open-source and proprietary models reveals that the\nhighest performance, even after using techniques like self-consistency,\nself-refinement and chain-of-thought prompting, is less than 40%. The typical\nfailure modes of GPT-4, the best model, are errors in algebraic manipulation,\ndifficulty in grounding abstract concepts into mathematical equations\naccurately and failure in retrieving relevant domain-specific concepts. We also\nobserve that by mere prompting, GPT-4 is unable to assess risk introduced by\nnegative marking for incorrect answers. For this, we develop a post-hoc\nconfidence-thresholding method over self-consistency, which enables effective\nresponse selection. 
We hope that our challenging benchmark will guide future\nresearch in problem-solving using LLMs.\n","authors":["Daman Arora","Himanshu Gaurav Singh"," Mausam"],"pdf_url":"https://arxiv.org/pdf/2305.15074v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.03368v2","updated":"2023-10-23T11:52:52Z","published":"2023-10-05T07:57:09Z","title":"Evaluating Hallucinations in Chinese Large Language Models","summary":" In this paper, we establish a benchmark named HalluQA (Chinese Hallucination\nQuestion-Answering) to measure the hallucination phenomenon in Chinese large\nlanguage models. HalluQA contains 450 meticulously designed adversarial\nquestions, spanning multiple domains, and takes into account Chinese historical\nculture, customs, and social phenomena. During the construction of HalluQA, we\nconsider two types of hallucinations: imitative falsehoods and factual errors,\nand we construct adversarial samples based on GLM-130B and ChatGPT. For\nevaluation, we design an automated evaluation method using GPT-4 to judge\nwhether a model output is hallucinated. We conduct extensive experiments on 24\nlarge language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk,\nand others. Out of the 24 models, 18 achieved non-hallucination rates lower than\n50%. This indicates that HalluQA is highly challenging. 
We analyze the primary\ntypes of hallucinations in different types of models and their causes.\nAdditionally, we discuss which types of hallucinations should be prioritized\nfor different types of models.\n","authors":["Qinyuan Cheng","Tianxiang Sun","Wenwei Zhang","Siyin Wang","Xiangyang Liu","Mozhi Zhang","Junliang He","Mianqiu Huang","Zhangyue Yin","Kai Chen","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.03368v2.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2310.14829v1","updated":"2023-10-23T11:48:23Z","published":"2023-10-23T11:48:23Z","title":"Characterizing how 'distributional' NLP corpora distance metrics are","summary":" A corpus of vector-embedded text documents has some empirical distribution.\nGiven two corpora, we want to calculate a single metric of distance (e.g.,\nMauve, Frechet Inception) between them. We describe an abstract quality, called\n`distributionality', of such metrics. A non-distributional metric tends to use\nvery local measurements, or uses global measurements in a way that does not\nfully reflect the distributions' true distance. For example, if individual\npairwise nearest-neighbor distances are low, it may judge the two corpora to\nhave low distance, even if their two distributions are in fact far from each\nother. A more distributional metric will, in contrast, better capture the\ndistributions' overall distance. We quantify this quality by constructing a\nKnown-Similarity Corpora set from two paraphrase corpora and calculating the\ndistance between paired corpora from it. The distances' trend shape as set\nelement separation increases should quantify the distributionality of the\nmetric. 
We propose that Average Hausdorff Distance and energy distance between\ncorpora are representative examples of non-distributional and distributional\ndistance metrics, to which other metrics can be compared, to evaluate how\ndistributional they are.\n","authors":["Samuel Ackerman","George Kour","Eitan Farchi"],"pdf_url":"https://arxiv.org/pdf/2310.14829v1.pdf","comment":"Published in the August 2023 Joint Statistical Meetings proceedings"},{"id":"http://arxiv.org/abs/2310.12648v2","updated":"2023-10-23T11:47:53Z","published":"2023-10-19T11:15:02Z","title":"Towards Real-World Streaming Speech Translation for Code-Switched Speech","summary":" Code-switching (CS), i.e. mixing different languages in a single sentence, is\na common phenomenon in communication and can be challenging in many Natural\nLanguage Processing (NLP) settings. Previous studies on CS speech have shown\npromising results for end-to-end speech translation (ST), but have been limited\nto offline scenarios and to translation to one of the languages present in the\nsource (\\textit{monolingual transcription}).\n In this paper, we focus on two essential yet unexplored areas for real-world\nCS speech translation: streaming settings, and translation to a third language\n(i.e., a language not included in the source). To this end, we extend the\nFisher and Miami test and validation datasets to include new targets in Spanish\nand German. 
Using this data, we train a model for both offline and streaming ST\nand we establish baseline results for the two settings mentioned earlier.\n","authors":["Belen Alastruey","Matthias Sperber","Christian Gollan","Dominic Telaar","Tim Ng","Aashish Agarwal"],"pdf_url":"https://arxiv.org/pdf/2310.12648v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03415v2","updated":"2023-10-23T11:47:04Z","published":"2023-08-07T09:06:20Z","title":"End-to-End Evaluation for Low-Latency Simultaneous Speech Translation","summary":" The challenge of low-latency speech translation has recently drawn significant\ninterest in the research community as shown by several publications and shared\ntasks. Therefore, it is essential to evaluate these different approaches in\nrealistic scenarios. However, currently only specific aspects of the systems\nare evaluated and often it is not possible to compare different approaches.\n In this work, we propose the first framework to perform and evaluate the\nvarious aspects of low-latency speech translation under realistic conditions.\nThe evaluation is carried out in an end-to-end fashion. This includes the\nsegmentation of the audio as well as the run-time of the different components.\n Secondly, we compare different approaches to low-latency speech translation\nusing this framework. We evaluate models with the option to revise the output\nas well as methods with fixed output. Furthermore, we directly compare\nstate-of-the-art cascaded as well as end-to-end systems. 
Finally, the framework\nallows automatic evaluation of translation quality as well as latency and\nalso provides a web interface to show the low-latency model outputs to the\nuser.\n","authors":["Christian Huber","Tu Anh Dinh","Carlos Mullov","Ngoc Quan Pham","Thai Binh Nguyen","Fabian Retkowski","Stefan Constantin","Enes Yavuz Ugan","Danni Liu","Zhaolin Li","Sai Koneru","Jan Niehues","Alexander Waibel"],"pdf_url":"https://arxiv.org/pdf/2308.03415v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14820v1","updated":"2023-10-23T11:40:05Z","published":"2023-10-23T11:40:05Z","title":"ALCUNA: Large Language Models Meet New Knowledge","summary":" With the rapid development of NLP, large-scale language models (LLMs) excel\nin various tasks across multiple domains now. However, existing benchmarks may\nnot adequately measure these models' capabilities, especially when faced with\nnew knowledge. In this paper, we address the lack of benchmarks to evaluate\nLLMs' ability to handle new knowledge, an important and challenging aspect in\nthe rapidly evolving world. We propose an approach called KnowGen that\ngenerates new knowledge by altering existing entity attributes and\nrelationships, resulting in artificial entities that are distinct from\nreal-world entities. With KnowGen, we introduce a benchmark named ALCUNA to\nassess LLMs' abilities in knowledge understanding, differentiation, and\nassociation. Benchmarking several LLMs reveals that their performance in the face\nof new knowledge is not satisfactory, particularly in reasoning between new and\ninternal knowledge. We also explore the impact of entity similarity on the\nmodel's understanding of entity knowledge and the influence of contextual\nentities. 
We appeal to the need for caution when using LLMs in new scenarios or\nwith new knowledge, and hope that our benchmarks can help drive the development\nof LLMs in face of new knowledge.\n","authors":["Xunjian Yin","Baizhou Huang","Xiaojun Wan"],"pdf_url":"https://arxiv.org/pdf/2310.14820v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14819v1","updated":"2023-10-23T11:40:04Z","published":"2023-10-23T11:40:04Z","title":"Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction\n Following: A Case Study of Arabic","summary":" While significant progress has been made in benchmarking Large Language\nModels (LLMs) across various tasks, there is a lack of comprehensive evaluation\nof their abilities in responding to multi-turn instructions in less-commonly\ntested languages like Arabic. Our paper offers a detailed examination of the\nproficiency of open LLMs in such scenarios in Arabic. Utilizing a customized\nArabic translation of the MT-Bench benchmark suite, we employ GPT-4 as a\nuniform evaluator for both English and Arabic queries to assess and compare the\nperformance of the LLMs on various open-ended tasks. Our findings reveal\nvariations in model responses on different task categories, e.g., logic vs.\nliteracy, when instructed in English or Arabic. We find that fine-tuned base\nmodels using multilingual and multi-turn datasets could be competitive to\nmodels trained from scratch on multilingual data. 
Finally, we hypothesize that\nan ensemble of small, open LLMs could perform competitively to proprietary LLMs\non the benchmark.\n","authors":["Sabri Boughorbel","Majd Hawasly"],"pdf_url":"https://arxiv.org/pdf/2310.14819v1.pdf","comment":"Accepted at SIGARAB ArabicNLP 2023"},{"id":"http://arxiv.org/abs/2305.14794v2","updated":"2023-10-23T11:22:58Z","published":"2023-05-24T06:45:33Z","title":"Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak\n Supervision for Text Classification","summary":" Recent advances in weakly supervised text classification mostly focus on\ndesigning sophisticated methods to turn high-level human heuristics into\nquality pseudo-labels. In this paper, we revisit the seed matching-based\nmethod, which is arguably the simplest way to generate pseudo-labels, and show\nthat its power was greatly underestimated. We show that the limited performance\nof seed matching is largely due to the label bias injected by the simple\nseed-match rule, which prevents the classifier from learning reliable\nconfidence for selecting high-quality pseudo-labels. Interestingly, simply\ndeleting the seed words present in the matched input texts can mitigate the\nlabel bias and help learn better confidence. Subsequently, the performance\nachieved by seed matching can be improved significantly, making it on par with\nor even better than the state-of-the-art. Furthermore, to handle the case when\nthe seed words are not made known, we propose to simply delete the word tokens\nin the input text randomly with a high deletion ratio. 
Remarkably, seed\nmatching equipped with this random deletion method can often achieve even\nbetter performance than that with seed deletion.\n","authors":["Chengyu Dong","Zihan Wang","Jingbo Shang"],"pdf_url":"https://arxiv.org/pdf/2305.14794v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14214v2","updated":"2023-10-23T11:17:53Z","published":"2023-05-23T16:32:27Z","title":"CompoundPiece: Evaluating and Improving Decompounding Performance of\n Language Models","summary":" While many languages possess processes of joining two or more words to create\ncompound words, previous studies have been typically limited only to languages\nwith excessively productive compound formation (e.g., German, Dutch) and there\nis no public dataset containing compound and non-compound words across a large\nnumber of languages. In this work, we systematically study decompounding, the\ntask of splitting compound words into their constituents, at a wide scale. We\nfirst address the data gap by introducing a dataset of 255k compound and\nnon-compound words across 56 diverse languages obtained from Wiktionary. We\nthen use this dataset to evaluate an array of Large Language Models (LLMs) on\nthe decompounding task. We find that LLMs perform poorly, especially on words\nwhich are tokenized unfavorably by subword tokenization. We thus introduce a\nnovel methodology to train dedicated models for decompounding. The proposed\ntwo-stage procedure relies on a fully self-supervised objective in the first\nstage, while the second, supervised learning stage optionally fine-tunes the\nmodel on the annotated Wiktionary data. Our self-supervised models outperform\nthe prior best unsupervised decompounding models by 13.9% accuracy on average.\nOur fine-tuned models outperform all prior (language-specific) decompounding\ntools. 
Furthermore, we use our models to leverage decompounding during the\ncreation of a subword tokenizer, which we refer to as CompoundPiece.\nCompoundPiece tokenizes compound words more favorably on average, leading to\nimproved performance on decompounding over an otherwise equivalent model using\nSentencePiece tokenization.\n","authors":["Benjamin Minixhofer","Jonas Pfeiffer","Ivan Vulić"],"pdf_url":"https://arxiv.org/pdf/2305.14214v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2205.12422v3","updated":"2023-10-23T11:12:48Z","published":"2022-05-25T00:35:12Z","title":"Non-Programmers Can Label Programs Indirectly via Active Examples: A\n Case Study with Text-to-SQL","summary":" Can non-programmers annotate natural language utterances with complex\nprograms that represent their meaning? We introduce APEL, a framework in which\nnon-programmers select among candidate programs generated by a seed semantic\nparser (e.g., Codex). Since they cannot understand the candidate programs, we\nask them to select indirectly by examining the programs' input-output examples.\nFor each utterance, APEL actively searches for a simple input on which the\ncandidate programs tend to produce different outputs. It then asks the\nnon-programmers only to choose the appropriate output, thus allowing us to\ninfer which program is correct and could be used to fine-tune the parser. As a\nfirst case study, we recruited human non-programmers to use APEL to re-annotate\nSPIDER, a text-to-SQL dataset. 
Our approach achieved the same annotation\naccuracy as the original expert annotators (75%) and exposed many subtle errors\nin the original annotations.\n","authors":["Ruiqi Zhong","Charlie Snell","Dan Klein","Jason Eisner"],"pdf_url":"https://arxiv.org/pdf/2205.12422v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14806v1","updated":"2023-10-23T11:00:27Z","published":"2023-10-23T11:00:27Z","title":"Leveraging Timestamp Information for Serialized Joint Streaming\n Recognition and Translation","summary":" The growing need for instant spoken language transcription and translation is\ndriven by increased global communication and cross-lingual interactions. This\nhas made offering translations in multiple languages essential for user\napplications. Traditional approaches to automatic speech recognition (ASR) and\nspeech translation (ST) have often relied on separate systems, leading to\ninefficiencies in computational resources, and increased synchronization\ncomplexity in real time. In this paper, we propose a streaming\nTransformer-Transducer (T-T) model able to jointly produce many-to-one and\none-to-many transcription and translation using a single decoder. We introduce\na novel method for joint token-level serialized output training based on\ntimestamp information to effectively produce ASR and ST outputs in the\nstreaming setting. Experiments on {it,es,de}->en prove the effectiveness of our\napproach, enabling the generation of one-to-many joint outputs with a single\ndecoder for the first time.\n","authors":["Sara Papi","Peidong Wang","Junkun Chen","Jian Xue","Naoyuki Kanda","Jinyu Li","Yashesh Gaur"],"pdf_url":"https://arxiv.org/pdf/2310.14806v1.pdf","comment":"© 2024 IEEE. 
Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2310.14805v1","updated":"2023-10-23T11:00:19Z","published":"2023-10-23T11:00:19Z","title":"Cross-Modal Conceptualization in Bottleneck Models","summary":" Concept Bottleneck Models (CBMs) assume that training examples (e.g., x-ray\nimages) are annotated with high-level concepts (e.g., types of abnormalities),\nand perform classification by first predicting the concepts, followed by\npredicting the label relying on these concepts. The main difficulty in using\nCBMs comes from having to choose concepts that are predictive of the label and\nthen having to label training examples with these concepts. In our approach, we\nadopt a more moderate assumption and instead use text descriptions (e.g.,\nradiology reports), accompanying the images in training, to guide the induction\nof concepts. Our cross-modal approach treats concepts as discrete latent\nvariables and promotes concepts that (1) are predictive of the label, and (2)\ncan be predicted reliably from both the image and text. Through experiments\nconducted on datasets ranging from synthetic datasets (e.g., synthetic images\nwith generated descriptions) to realistic medical imaging datasets, we\ndemonstrate that cross-modal learning encourages the induction of interpretable\nconcepts while also facilitating disentanglement. 
Our results also suggest that\nthis guidance leads to increased robustness by suppressing the reliance on\nshortcut features.\n","authors":["Danis Alukaev","Semen Kiselev","Ilya Pershin","Bulat Ibragimov","Vladimir Ivanov","Alexey Kornaev","Ivan Titov"],"pdf_url":"https://arxiv.org/pdf/2310.14805v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14804v1","updated":"2023-10-23T10:59:21Z","published":"2023-10-23T10:59:21Z","title":"Large Language Models can Share Images, Too!","summary":" This paper explores the image-sharing capability of Large Language Models\n(LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting,\nwithout the help of visual foundation models. Inspired by the two-stage process\nof image-sharing in human dialogues, we propose a two-stage framework that\nallows LLMs to predict potential image-sharing turns and generate related image\ndescriptions using our effective restriction-based prompt template. With\nextensive experiments, we unlock the \\textit{image-sharing} capability of LLMs\nin zero-shot prompting, with GPT-4 achieving the best performance.\nAdditionally, we uncover the emergent \\textit{image-sharing} ability in\nzero-shot prompting, demonstrating the effectiveness of restriction-based\nprompts in both stages of our framework. Based on this framework, we augment\nthe PhotoChat dataset with images generated by Stable Diffusion at predicted\nturns, namely PhotoChat++. To our knowledge, this is the first study to assess\nthe image-sharing ability of LLMs in a zero-shot setting without visual\nfoundation models. 
The source code and the dataset will be released after\npublication.\n","authors":["Young-Jun Lee","Jonghwan Hyeon","Ho-Jin Choi"],"pdf_url":"https://arxiv.org/pdf/2310.14804v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14799v1","updated":"2023-10-23T10:56:03Z","published":"2023-10-23T10:56:03Z","title":"Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning\n across Languages","summary":" Chain-of-thought (CoT) is capable of eliciting models to explicitly generate\nreasoning paths, thus promoting reasoning accuracy and attracting increasing\nattention. Specifically, zero-shot CoT achieves remarkable improvements in a\nwide range of reasoning tasks by simply instructing the LLM with the prompt\n\"Let's think step by step!\". Despite the success of zero-shot CoT, the existing\nzero-shot prompting techniques remain limited to a single language, making it\nchallenging to generalize to other languages and hindering global development.\nIn this work, we introduce cross-lingual prompting (CLP), aiming to improve\nzero-shot CoT reasoning across languages. Specifically, CLP consists of two\nmain components: (1) cross-lingual alignment prompting and (2) task-specific\nsolver prompting. The cross-lingual alignment prompting is responsible for\naligning representations across different languages, whereas the task-specific\nsolver prompting is used to generate the final chain of thoughts and results\nfor the reasoning task. In addition, we further introduce cross-lingual\nself-consistent prompting (CLSP) to ensemble different reasoning paths across\nlanguages. Our experimental evaluations on several benchmarks demonstrate that\nCLP and CLSP significantly outperform the existing prompting methods and\nachieve state-of-the-art performance. 
We hope this work will inspire further\nbreakthroughs in cross-lingual CoT.\n","authors":["Libo Qin","Qiguang Chen","Fuxuan Wei","Shijue Huang","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2310.14799v1.pdf","comment":"Accepted at EMNLP2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.14793v1","updated":"2023-10-23T10:53:25Z","published":"2023-10-23T10:53:25Z","title":"What do Deck Chairs and Sun Hats Have in Common? Uncovering Shared\n Properties in Large Concept Vocabularies","summary":" Concepts play a central role in many applications. This includes settings\nwhere concepts have to be modelled in the absence of sentence context. Previous\nwork has therefore focused on distilling decontextualised concept embeddings\nfrom language models. But concepts can be modelled from different perspectives,\nwhereas concept embeddings typically mostly capture taxonomic structure. To\naddress this issue, we propose a strategy for identifying what different\nconcepts, from a potentially large concept vocabulary, have in common with\nothers. We then represent concepts in terms of the properties they share with\nthe other concepts. To demonstrate the practical usefulness of this way of\nmodelling concepts, we consider the task of ultra-fine entity typing, which is\na challenging multi-label classification problem. 
We show that by augmenting\nthe label set with shared properties, we can improve the performance of the\nstate-of-the-art models for this task.\n","authors":["Amit Gajbhiye","Zied Bouraoui","Na Li","Usashi Chatterjee","Luis Espinosa Anke","Steven Schockaert"],"pdf_url":"https://arxiv.org/pdf/2310.14793v1.pdf","comment":"Accepted for EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14740v2","updated":"2023-10-23T10:35:30Z","published":"2023-05-24T05:21:13Z","title":"ECHo: A Visio-Linguistic Dataset for Event Causality Inference via\n Human-Centric Reasoning","summary":" We introduce ECHo (Event Causality Inference via Human-Centric Reasoning), a\ndiagnostic dataset of event causality inference grounded in visio-linguistic\nsocial scenarios. ECHo employs real-world human-centric deductive information\nbuilding on a television crime drama. ECHo requires the Theory-of-Mind (ToM)\nability to understand and reason about social interactions based on multimodal\ninformation. Using ECHo, we propose a unified Chain-of-Thought (CoT) framework\nto assess the reasoning capability of current AI systems. Our ToM-enhanced CoT\npipeline accommodates various large foundation models in both zero-shot and\nfew-shot visio-linguistic reasoning. We use this framework to scrutinize recent\nlarge foundation models such as InstructGPT and MiniGPT-4 on three diagnostic\nhuman-centric tasks. Further analysis demonstrates ECHo as a challenging\ndataset to expose imperfections and inconsistencies in reasoning. Our data and\ncode are publicly available at https://github.com/YuxiXie/ECHo.\n","authors":["Yuxi Xie","Guanzhen Li","Min-Yen Kan"],"pdf_url":"https://arxiv.org/pdf/2305.14740v2.pdf","comment":"Findings of EMNLP 2023. 
10 pages, 6 figures, 5 tables (22 pages, 8\n figures, 15 tables including references and appendices)"},{"id":"http://arxiv.org/abs/2210.07440v2","updated":"2023-10-23T10:35:14Z","published":"2022-10-14T00:54:12Z","title":"InterFair: Debiasing with Natural Language Feedback for Fair\n Interpretable Predictions","summary":" Debiasing methods in NLP models traditionally focus on isolating information\nrelated to a sensitive attribute (e.g., gender or race). We instead argue that\na favorable debiasing method should use sensitive information 'fairly,' with\nexplanations, rather than blindly eliminating it. This fair balance is often\nsubjective and can be challenging to achieve algorithmically. We explore two\ninteractive setups with a frozen predictive model and show that users able to\nprovide feedback can achieve a better and fairer balance between task\nperformance and bias mitigation. In one setup, users, by interacting with test\nexamples, further decreased bias in the explanations (5-8%) while maintaining\nthe same prediction accuracy. In the other setup, human feedback was able to\ndisentangle associated bias and predictive information from the input leading\nto superior bias mitigation and improved task performance (4-5%)\nsimultaneously.\n","authors":["Bodhisattwa Prasad Majumder","Zexue He","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2210.07440v2.pdf","comment":"Accepted in EMNLP 2023 (Main)"},{"id":"http://arxiv.org/abs/2305.13971v3","updated":"2023-10-23T10:30:37Z","published":"2023-05-23T11:54:37Z","title":"Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning","summary":" Despite their impressive performance, large language models (LMs) still\nstruggle with reliably generating complex output structures when not finetuned\nto follow the required output format exactly. To address this issue,\ngrammar-constrained decoding (GCD) can be used to control the generation of\nLMs, guaranteeing that the output follows a given structure. 
Most existing GCD\nmethods are, however, limited to specific tasks, such as parsing or code\ngeneration. In this work, we demonstrate that formal grammars can describe the\noutput space for a much wider range of tasks and argue that GCD can serve as a\nunified framework for structured NLP tasks in general. For increased\nflexibility, we introduce input-dependent grammars, which allow the grammar to\ndepend on the input and thus enable the generation of different output\nstructures for different inputs. We then empirically demonstrate the power and\nflexibility of GCD-enhanced LMs on (1) information extraction, (2) entity\ndisambiguation, and (3) constituency parsing. Our results indicate that\ngrammar-constrained LMs substantially outperform unconstrained LMs or even beat\ntask-specific finetuned models. Grammar constraints thus hold great promise for\nharnessing off-the-shelf LMs for a wide range of structured NLP tasks,\nespecially where training data is scarce or finetuning is expensive. Code and\ndata: https://github.com/epfl-dlab/GCD.\n","authors":["Saibo Geng","Martin Josifosky","Maxime Peyrard","Robert West"],"pdf_url":"https://arxiv.org/pdf/2305.13971v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14777v1","updated":"2023-10-23T10:26:14Z","published":"2023-10-23T10:26:14Z","title":"Geographical Erasure in Language Generation","summary":" Large language models (LLMs) encode vast amounts of world knowledge. However,\nsince these models are trained on large swaths of internet data, they are at\nrisk of inordinately capturing information about dominant groups. This\nimbalance can propagate into generated language. In this work, we study and\noperationalise a form of geographical erasure, wherein language models\nunderpredict certain countries. We demonstrate consistent instances of erasure\nacross a range of LLMs. We discover that erasure strongly correlates with low\nfrequencies of country mentions in the training corpus. 
Lastly, we mitigate\nerasure by finetuning using a custom objective.\n","authors":["Pola Schwöbel","Jacek Golebiowski","Michele Donini","Cédric Archambeau","Danish Pruthi"],"pdf_url":"https://arxiv.org/pdf/2310.14777v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.13012v2","updated":"2023-10-23T10:24:55Z","published":"2023-10-17T09:40:58Z","title":"H2O Open Ecosystem for State-of-the-art Large Language Models","summary":" Large Language Models (LLMs) represent a revolution in AI. However, they also\npose many significant risks, such as the presence of biased, private,\ncopyrighted or harmful text. For this reason we need open, transparent and safe\nsolutions. We introduce a complete open-source ecosystem for developing and\ntesting LLMs. The goal of this project is to boost open alternatives to\nclosed-source approaches. We release h2oGPT, a family of fine-tuned LLMs of\ndiverse sizes. We also introduce H2O LLM Studio, a framework and no-code GUI\ndesigned for efficient fine-tuning, evaluation, and deployment of LLMs using\nthe most recent state-of-the-art techniques. Our code and models are fully\nopen-source. We believe this work helps to boost AI development and make it\nmore accessible, efficient and trustworthy. The demo is available at:\nhttps://gpt.h2o.ai/\n","authors":["Arno Candel","Jon McKinney","Philipp Singer","Pascal Pfeiffer","Maximilian Jeblick","Chun Ming Lee","Marcos V. Conde"],"pdf_url":"https://arxiv.org/pdf/2310.13012v2.pdf","comment":"EMNLP 2023 Demo - ACL Empirical Methods in Natural Language\n Processing"},{"id":"http://arxiv.org/abs/2310.12942v2","updated":"2023-10-23T10:24:26Z","published":"2023-10-19T17:39:47Z","title":"On the Representational Capacity of Recurrent Neural Language Models","summary":" This work investigates the computational expressivity of language models\n(LMs) based on recurrent neural networks (RNNs). 
Siegelmann and Sontag (1992)\nfamously showed that RNNs with rational weights and hidden states and unbounded\ncomputation time are Turing complete. However, LMs define weightings over\nstrings in addition to just (unweighted) language membership and the analysis\nof the computational power of RNN LMs (RLMs) should reflect this. We extend the\nTuring completeness result to the probabilistic case, showing how a rationally\nweighted RLM with unbounded computation time can simulate any probabilistic\nTuring machine (PTM). Since, in practice, RLMs work in real-time, processing a\nsymbol at every time step, we treat the above result as an upper bound on the\nexpressivity of RLMs. We also provide a lower bound by showing that under the\nrestriction to real-time computation, such models can simulate deterministic\nreal-time rational PTMs.\n","authors":["Franz Nowak","Anej Svete","Li Du","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2310.12942v2.pdf","comment":"To be published at EMNLP 2023;"},{"id":"http://arxiv.org/abs/2310.14771v1","updated":"2023-10-23T10:15:13Z","published":"2023-10-23T10:15:13Z","title":"Evaluating the Knowledge Base Completion Potential of GPT","summary":" Structured knowledge bases (KBs) are an asset for search engines and other\napplications, but are inevitably incomplete. Language models (LMs) have been\nproposed for unsupervised knowledge base completion (KBC), yet, their ability\nto do this at scale and with high accuracy remains an open question. Prior\nexperimental studies mostly fall short because they only evaluate on popular\nsubjects, or sample already existing facts from KBs. In this work, we perform a\ncareful evaluation of GPT's potential to complete the largest public KB:\nWikidata. We find that, despite their size and capabilities, models like GPT-3,\nChatGPT and GPT-4 do not achieve fully convincing results on this task.\nNonetheless, they provide solid improvements over earlier approaches with\nsmaller LMs. 
In particular, we show that, with proper thresholding, GPT-3\nenables to extend Wikidata by 27M facts at 90% precision.\n","authors":["Blerta Veseli","Simon Razniewski","Jan-Christoph Kalo","Gerhard Weikum"],"pdf_url":"https://arxiv.org/pdf/2310.14771v1.pdf","comment":"12 pages 4 tables"},{"id":"http://arxiv.org/abs/2305.18396v2","updated":"2023-10-23T10:06:52Z","published":"2023-05-28T13:08:13Z","title":"LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly\n Transformers","summary":" The community explored to build private inference frameworks for\ntransformer-based large language models (LLMs) in a server-client setting,\nwhere the server holds the model parameters and the client inputs its private\ndata (or prompt) for inference. However, these frameworks impose significant\noverhead when the private inputs are forward propagated through the original\nLLMs. In this paper, we show that substituting the computation- and\ncommunication-heavy operators in the transformer architecture with\nprivacy-computing friendly approximations can greatly reduce the private\ninference costs while incurring very minor impact on model performance.\nCompared to state-of-the-art Iron (NeurIPS 2022), our privacy-computing\nfriendly model inference pipeline achieves a $5\\times$ acceleration in\ncomputation and an 80% reduction in communication overhead, while retaining\nnearly identical accuracy.\n","authors":["Xuanqi Liu","Zhuotao Liu"],"pdf_url":"https://arxiv.org/pdf/2305.18396v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15294v2","updated":"2023-10-23T09:58:13Z","published":"2023-05-24T16:17:36Z","title":"Enhancing Retrieval-Augmented Large Language Models with Iterative\n Retrieval-Generation Synergy","summary":" Large language models are powerful text processors and reasoners, but are\nstill subject to limitations including outdated knowledge and hallucinations,\nwhich necessitates connecting them to the world. 
Retrieval-augmented large\nlanguage models have raised extensive attention for grounding model generation\non external knowledge. However, retrievers struggle to capture relevance,\nespecially for queries with complex information needs. Recent work has proposed\nto improve relevance modeling by having large language models actively involved\nin retrieval, i.e., to improve retrieval with generation. In this paper, we\nshow that strong performance can be achieved by a method we call Iter-RetGen,\nwhich synergizes retrieval and generation in an iterative manner. A model\noutput shows what might be needed to finish a task, and thus provides an\ninformative context for retrieving more relevant knowledge which in turn helps\ngenerate a better output in the next iteration. Compared with recent work which\ninterleaves retrieval with generation when producing an output, Iter-RetGen\nprocesses all retrieved knowledge as a whole and largely preserves the\nflexibility in generation without structural constraints. We evaluate\nIter-RetGen on multi-hop question answering, fact verification, and commonsense\nreasoning, and show that it can flexibly leverage parametric knowledge and\nnon-parametric knowledge, and is superior to or competitive with\nstate-of-the-art retrieval-augmented baselines while causing fewer overheads of\nretrieval and generation. We can further improve performance via\ngeneration-augmented retrieval adaptation.\n","authors":["Zhihong Shao","Yeyun Gong","Yelong Shen","Minlie Huang","Nan Duan","Weizhu Chen"],"pdf_url":"https://arxiv.org/pdf/2305.15294v2.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2310.14757v1","updated":"2023-10-23T09:48:25Z","published":"2023-10-23T09:48:25Z","title":"SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for\n Social Media NLP Research","summary":" Despite its relevance, the maturity of NLP for social media pales in\ncomparison with general-purpose models, metrics and benchmarks. 
This fragmented\nlandscape makes it hard for the community to know, for instance, given a task,\nwhich is the best performing model and how it compares with others. To\nalleviate this issue, we introduce a unified benchmark for NLP evaluation in\nsocial media, SuperTweetEval, which includes a heterogeneous set of tasks and\ndatasets combined, adapted and constructed from scratch. We benchmarked the\nperformance of a wide range of models on SuperTweetEval and our results suggest\nthat, despite the recent advances in language modelling, social media remains\nchallenging.\n","authors":["Dimosthenis Antypas","Asahi Ushio","Francesco Barbieri","Leonardo Neves","Kiamehr Rezaee","Luis Espinosa-Anke","Jiaxin Pei","Jose Camacho-Collados"],"pdf_url":"https://arxiv.org/pdf/2310.14757v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.05074v3","updated":"2023-10-23T09:38:01Z","published":"2023-10-08T08:52:13Z","title":"DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller\n Language Models","summary":" Chain-of-Thought (CoT) prompting has proven to be effective in enhancing the\nreasoning capabilities of Large Language Models (LLMs) with at least 100\nbillion parameters. However, it is ineffective or even detrimental when applied\nto reasoning tasks in Smaller Language Models (SLMs) with less than 10 billion\nparameters. To address this limitation, we introduce Dialogue-guided\nChain-of-Thought (DialCoT) which employs a dialogue format to generate\nintermediate reasoning steps, guiding the model toward the final answer.\nAdditionally, we optimize the model's reasoning path selection using the\nProximal Policy Optimization (PPO) algorithm, further enhancing its reasoning\ncapabilities. Our method offers several advantages compared to previous\napproaches. 
Firstly, we transform the process of solving complex reasoning\nquestions by breaking them down into a series of simpler sub-questions,\nsignificantly reducing the task difficulty and making it more suitable for\nSLMs. Secondly, we optimize the model's reasoning path selection through the\nPPO algorithm. We conduct comprehensive experiments on four arithmetic\nreasoning datasets, demonstrating that our method achieves significant\nperformance improvements compared to state-of-the-art competitors.\n","authors":["Chengcheng Han","Xiaowei Du","Che Zhang","Yixin Lian","Xiang Li","Ming Gao","Baoyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.05074v3.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14747v1","updated":"2023-10-23T09:32:53Z","published":"2023-10-23T09:32:53Z","title":"MCC-KD: Multi-CoT Consistent Knowledge Distillation","summary":" Large language models (LLMs) have showcased remarkable capabilities in\ncomplex reasoning through chain of thought (CoT) prompting. Recently, there has\nbeen a growing interest in transferring these reasoning abilities from LLMs to\nsmaller models. However, achieving both the diversity and consistency in\nrationales presents a challenge. In this paper, we focus on enhancing these two\naspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to\nefficiently distill the reasoning capabilities. In MCC-KD, we generate multiple\nrationales for each question and enforce consistency among the corresponding\npredictions by minimizing the bidirectional KL-divergence between the answer\ndistributions. We investigate the effectiveness of MCC-KD with different model\narchitectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both\nmathematical reasoning and commonsense reasoning benchmarks. 
The empirical\nresults not only confirm MCC-KD's superior performance on in-distribution\ndatasets but also highlight its robust generalization ability on\nout-of-distribution datasets.\n","authors":["Hongzhan Chen","Siyue Wu","Xiaojun Quan","Rui Wang","Ming Yan","Ji Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14747v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.09800v2","updated":"2023-10-23T09:26:22Z","published":"2023-05-16T20:50:46Z","title":"Mirages: On Anthropomorphism in Dialogue Systems","summary":" Automated dialogue or conversational systems are anthropomorphised by\ndevelopers and personified by users. While a degree of anthropomorphism may be\ninevitable due to the choice of medium, conscious and unconscious design\nchoices can guide users to personify such systems to varying degrees.\nEncouraging users to relate to automated systems as if they were human can lead\nto high risk scenarios caused by over-reliance on their outputs. As a result,\nnatural language processing researchers have investigated the factors that\ninduce personification and develop resources to mitigate such effects. However,\nthese efforts are fragmented, and many aspects of anthropomorphism have yet to\nbe explored. In this paper, we discuss the linguistic factors that contribute\nto the anthropomorphism of dialogue systems and the harms that can arise,\nincluding reinforcing gender stereotypes and notions of acceptable language. We\nrecommend that future efforts towards developing dialogue systems take\nparticular care in their design, development, release, and description; and\nattend to the many linguistic cues that can elicit personification by users.\n","authors":["Gavin Abercrombie","Amanda Cercas Curry","Tanvi Dinkar","Verena Rieser","Zeerak Talat"],"pdf_url":"https://arxiv.org/pdf/2305.09800v2.pdf","comment":"Accepted for publication at EMNLP. 
See ACL Anthology for published\n version"},{"id":"http://arxiv.org/abs/2309.16583v3","updated":"2023-10-23T09:19:09Z","published":"2023-09-28T16:43:35Z","title":"GPT-Fathom: Benchmarking Large Language Models to Decipher the\n Evolutionary Path towards GPT-4 and Beyond","summary":" With the rapid advancement of large language models (LLMs), there is a\npressing need for a comprehensive evaluation suite to assess their capabilities\nand limitations. Existing LLM leaderboards often reference scores reported in\nother papers without consistent settings and prompts, which may inadvertently\nencourage cherry-picking favored settings and prompts for better results. In\nthis work, we introduce GPT-Fathom, an open-source and reproducible LLM\nevaluation suite built on top of OpenAI Evals. We systematically evaluate 10+\nleading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across\n7 capability categories, all under aligned settings. Our retrospective study on\nOpenAI's earlier models offers valuable insights into the evolutionary path\nfrom GPT-3 to GPT-4. 
Currently, the community is eager to know how GPT-3\nprogressively improves to GPT-4, including technical details like whether\nadding code data improves LLM's reasoning capability, which aspects of LLM\ncapability can be improved by SFT and RLHF, how much is the alignment tax, etc.\nOur analysis sheds light on many of these questions, aiming to improve the\ntransparency of advanced LLMs.\n","authors":["Shen Zheng","Yuyu Zhang","Yijie Zhu","Chenguang Xi","Pengyang Gao","Xun Zhou","Kevin Chen-Chuan Chang"],"pdf_url":"https://arxiv.org/pdf/2309.16583v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14735v1","updated":"2023-10-23T09:15:18Z","published":"2023-10-23T09:15:18Z","title":"Unleashing the potential of prompt engineering in Large Language Models:\n a comprehensive review","summary":" This paper delves into the pivotal role of prompt engineering in unleashing\nthe capabilities of Large Language Models (LLMs). Prompt engineering is the\nprocess of structuring input text for LLMs and is a technique integral to\noptimizing the efficacy of LLMs. This survey elucidates foundational principles\nof prompt engineering, such as role-prompting, one-shot, and few-shot\nprompting, as well as more advanced methodologies such as the chain-of-thought\nand tree-of-thoughts prompting. The paper sheds light on how external\nassistance in the form of plugins can assist in this task, and reduce machine\nhallucination by retrieving external knowledge. We subsequently delineate\nprospective directions in prompt engineering research, emphasizing the need for\na deeper understanding of structures and the role of agents in Artificial\nIntelligence-Generated Content (AIGC) tools. We discuss how to assess the\nefficacy of prompt methods from different perspectives and using different\nmethods. Finally, we gather information about the application of prompt\nengineering in such fields as education and programming, showing its\ntransformative potential. 
This comprehensive survey aims to serve as a friendly\nguide for anyone venturing through the big world of LLMs and prompt\nengineering.\n","authors":["Banghao Chen","Zhaofeng Zhang","Nicolas Langrené","Shengxin Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.14735v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14732v1","updated":"2023-10-23T09:07:27Z","published":"2023-10-23T09:07:27Z","title":"Generating Prototypes for Contradiction Detection Using Large Language\n Models and Linguistic Rules","summary":" We introduce a novel data generation method for contradiction detection,\nwhich leverages the generative power of large language models as well as\nlinguistic rules. Our vision is to provide a condensed corpus of prototypical\ncontradictions, allowing for in-depth linguistic analysis as well as efficient\nlanguage model fine-tuning. To this end, we instruct the generative models to\ncreate contradicting statements with respect to descriptions of specific\ncontradiction types. In addition, the model is also instructed to come up with\ncompletely new contradiction typologies. As an auxiliary approach, we use\nlinguistic rules to construct simple contradictions such as those arising from\nnegation, antonymy and numeric mismatch. We find that our methods yield\npromising results in terms of coherence and variety of the data. 
Further\nstudies, as well as manual refinement, are necessary to make use of this data in\na machine learning setup.\n","authors":["Maren Pielka","Svetlana Schmidt","Rafet Sifa"],"pdf_url":"https://arxiv.org/pdf/2310.14732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14724v1","updated":"2023-10-23T09:01:13Z","published":"2023-10-23T09:01:13Z","title":"A Survey on LLM-gernerated Text Detection: Necessity, Methods, and\n Future Directions","summary":" The powerful ability to understand, follow, and generate complex language\nemerging from large language models (LLMs) makes LLM-generated text flood many\nareas of our daily lives at an incredible speed and is widely accepted by\nhumans. As LLMs continue to expand, there is an imperative need to develop\ndetectors that can detect LLM-generated text. This is crucial to mitigate\npotential misuse of LLMs and safeguard realms like artistic expression and\nsocial networks from harmful influence of LLM-generated content. The\nLLM-generated text detection aims to discern if a piece of text was produced by\nan LLM, which is essentially a binary classification task. The detector\ntechniques have witnessed notable advancements recently, propelled by\ninnovations in watermarking techniques, zero-shot methods, fine-tuning LMs\nmethods, adversarial learning methods, LLMs as detectors, and human-assisted\nmethods. In this survey, we collate recent research breakthroughs in this area\nand underscore the pressing need to bolster detector research. We also delve\ninto prevalent datasets, elucidating their limitations and developmental\nrequirements. Furthermore, we analyze various LLM-generated text detection\nparadigms, shedding light on challenges like out-of-distribution problems,\npotential attacks, and data ambiguity. Conclusively, we highlight interesting\ndirections for future research in LLM-generated text detection to advance the\nimplementation of responsible artificial intelligence (AI). 
Our aim with this\nsurvey is to provide a clear and comprehensive introduction for newcomers while\nalso offering seasoned researchers a valuable update in the field of\nLLM-generated text detection.\n","authors":["Junchao Wu","Shu Yang","Runzhe Zhan","Yulin Yuan","Derek F. Wong","Lidia S. Chao"],"pdf_url":"https://arxiv.org/pdf/2310.14724v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14709v1","updated":"2023-10-23T08:49:00Z","published":"2023-10-23T08:49:00Z","title":"Once Upon a $\\textit{Time}$ in $\\textit{Graph}$: Relative-Time\n Pretraining for Complex Temporal Reasoning","summary":" Our physical world is constantly evolving over time, rendering challenges for\npre-trained language models to understand and reason over the temporal contexts\nof texts. Existing work focuses on strengthening the direct association between\na piece of text and its time-stamp. However, the knowledge-time association is\nusually insufficient for the downstream tasks that require reasoning over\ntemporal dependencies between knowledge. In this work, we make use of the\nunderlying nature of time, all temporally-scoped sentences are strung together\nthrough a one-dimensional time axis, and suggest creating a graph structure\nbased on the relative placements of events along the time axis. Inspired by the\ngraph view, we propose RemeMo ($\\underline{Re}$lative Ti$\\underline{me}$\n$\\underline{Mo}$deling), which explicitly connects all temporally-scoped facts\nby modeling the time relations between any two sentences. Experimental results\nshow that RemeMo outperforms the baseline T5 on multiple temporal question\nanswering datasets under various settings. 
Further analysis suggests that\nRemeMo is especially good at modeling long-range complex temporal dependencies.\nWe release our code and pre-trained checkpoints at\n$\\href{https://github.com/DAMO-NLP-SG/RemeMo}{\\text{this url}}$.\n","authors":["Sen Yang","Xin Li","Lidong Bing","Wai Lam"],"pdf_url":"https://arxiv.org/pdf/2310.14709v1.pdf","comment":"EMNLP 2023 main"},{"id":"http://arxiv.org/abs/2304.09145v3","updated":"2023-10-23T08:48:31Z","published":"2023-04-18T17:34:23Z","title":"Outlier Suppression+: Accurate quantization of large language models by\n equivalent and optimal shifting and scaling","summary":" Post-training quantization~(PTQ) of transformer language models faces\nsignificant challenges due to the existence of detrimental outliers in\nactivations. We observe that these outliers are concentrated in specific\nchannels and are asymmetric across channels. To address this issue, we propose\nthe Outlier Suppression+~(OS+) framework, which contains the channel-wise\nshifting for asymmetry and channel-wise scaling for concentration. We show that\nthese operations can be seamlessly migrated into subsequent modules while\nmaintaining equivalence. Second, we propose a fast and stable scheme to\ncalculate effective shifting and scaling values. The channel-wise shifting\naligns the center of each channel for removal of outlier asymmetry. The\nchannel-wise scaling quantitatively evaluates changes brought by migration and\nquantization for better quantization burden balance. We validate our OS+ under\nboth standard and fine-grained quantization settings with models including\nBERT, OPT, BLOOM, BLOOMZ, and LLaMA. Comprehensive results across various tasks\ndemonstrate the superiority of our approach. Especially, with standard\nquantization, OS+ can achieve near-floating-point performance on both small\nmodels and large language models on 8-bit and 6-bit. Besides, we establish a\nnew state-of-the-art for 4-bit BERT with 15.5\\% improvement. 
Our code is\navailable at \\url{https://github.com/ModelTC/Outlier_Suppression_Plus}.\n","authors":["Xiuying Wei","Yunchen Zhang","Yuhang Li","Xiangguo Zhang","Ruihao Gong","Jinyang Guo","Xianglong Liu"],"pdf_url":"https://arxiv.org/pdf/2304.09145v3.pdf","comment":"Accepted to EMNLP23 (main)"},{"id":"http://arxiv.org/abs/2310.14708v1","updated":"2023-10-23T08:48:14Z","published":"2023-10-23T08:48:14Z","title":"Strong and Efficient Baselines for Open Domain Conversational Question\n Answering","summary":" Unlike the Open Domain Question Answering (ODQA) setting, the conversational\n(ODConvQA) domain has received limited attention when it comes to reevaluating\nbaselines for both efficiency and effectiveness. In this paper, we study the\nState-of-the-Art (SotA) Dense Passage Retrieval (DPR) retriever and\nFusion-in-Decoder (FiD) reader pipeline, and show that it significantly\nunderperforms when applied to ODConvQA tasks due to various limitations. We\nthen propose and evaluate strong yet simple and efficient baselines, by\nintroducing a fast reranking component between the retriever and the reader,\nand by performing targeted finetuning steps. Experiments on two ODConvQA tasks,\nnamely TopiOCQA and OR-QuAC, show that our method improves the SotA results,\nwhile reducing reader's latency by 60%. Finally, we provide new and valuable\ninsights into the development of challenging baselines that serve as a\nreference for future, more intricate approaches, including those that leverage\nLarge Language Models (LLMs).\n","authors":["Andrei C. 
Coman","Gianni Barlacchi","Adrià de Gispert"],"pdf_url":"https://arxiv.org/pdf/2310.14708v1.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.14703v1","updated":"2023-10-23T08:45:12Z","published":"2023-10-23T08:45:12Z","title":"The continued usefulness of vocabulary tests for evaluating large\n language models","summary":" In their seminal article on semantic vectors, Landauer and Dumais (1997)\nproposed testing the quality of AI language models with a challenging\nvocabulary test. We show that their Test of English as a Foreign Language\n(TOEFL) test remains informative for contemporary major language models, since\nnone of the models was perfect and the models made errors on divergent items. The TOEFL\ntest consists of target words with four alternatives to choose from. We further\ntested the models on a Yes/No test that requires distinguishing between\nexisting words and made-up nonwords. The models performed significantly worse\non the nonword items, in line with other observations that current major\nlanguage models provide non-existent information. The situation was worse when\nwe generalized the tests to Spanish. Here, most models gave\nmeanings/translations for the majority of random letter sequences. On the plus\nside, the best models began to perform quite well, and they also pointed to\nnonwords that were unknown to the test participants but can be found in\ndictionaries.\n","authors":["Gonzalo Martínez","Javier Conde","Elena Merino-Gómez","Beatriz Bermúdez-Margaretto","José Alberto Hernández","Pedro Reviriego","Marc Brysbaert"],"pdf_url":"https://arxiv.org/pdf/2310.14703v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14696v1","updated":"2023-10-23T08:42:49Z","published":"2023-10-23T08:42:49Z","title":"Tree of Clarifications: Answering Ambiguous Questions with\n Retrieval-Augmented Large Language Models","summary":" Questions in open-domain question answering are often ambiguous, allowing\nmultiple interpretations. 
One approach to handling them is to identify all\npossible interpretations of the ambiguous question (AQ) and to generate a\nlong-form answer addressing them all, as suggested by Stelmakh et al., (2022).\nWhile it provides a comprehensive response without bothering the user for\nclarification, considering multiple dimensions of ambiguity and gathering\ncorresponding knowledge remains a challenge. To cope with the challenge, we\npropose a novel framework, Tree of Clarifications (ToC): It recursively\nconstructs a tree of disambiguations for the AQ -- via few-shot prompting\nleveraging external knowledge -- and uses it to generate a long-form answer.\nToC outperforms existing baselines on ASQA in a few-shot setup across the\nmetrics, while surpassing fully-supervised baselines trained on the whole\ntraining set in terms of Disambig-F1 and Disambig-ROUGE. Code is available at\nhttps://github.com/gankim/tree-of-clarifications.\n","authors":["Gangwoo Kim","Sungdong Kim","Byeongguk Jeon","Joonsuk Park","Jaewoo Kang"],"pdf_url":"https://arxiv.org/pdf/2310.14696v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.13664v2","updated":"2023-10-23T08:31:50Z","published":"2023-10-20T17:05:27Z","title":"Explainable Depression Symptom Detection in Social Media","summary":" Users of social platforms often perceive these sites as supportive spaces to\npost about their mental health issues. Those conversations contain important\ntraces about individuals' health risks. Recently, researchers have exploited\nthis online information to construct mental health detection models, which aim\nto identify users at risk on platforms like Twitter, Reddit or Facebook. Most\nof these models are centred on achieving good classification results, ignoring\nthe explainability and interpretability of the decisions. 
Recent research has\npointed out the importance of using clinical markers, such as the use of\nsymptoms, to improve trust in the computational models by health professionals.\nIn this paper, we propose using transformer-based architectures to detect and\nexplain the appearance of depressive symptom markers in the users' writings. We\npresent two approaches: i) train a model to classify, and another one to\nexplain the classifier's decision separately and ii) unify the two tasks\nsimultaneously using a single model. Additionally, for this latter manner, we\nalso investigated the performance of recent conversational LLMs when using\nin-context learning. Our natural language explanations enable clinicians to\ninterpret the models' decisions based on validated symptoms, enhancing trust in\nthe automated process. We evaluate our approach using recent symptom-based\ndatasets, employing both offline and expert-in-the-loop metrics to assess the\nquality of the explanations generated by our models. The experimental results\nshow that it is possible to achieve good classification results while\ngenerating interpretable symptom-based explanations.\n","authors":["Eliseo Bao Souto","Anxo Pérez","Javier Parapar"],"pdf_url":"https://arxiv.org/pdf/2310.13664v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14687v1","updated":"2023-10-23T08:26:28Z","published":"2023-10-23T08:26:28Z","title":"API-Assisted Code Generation for Question Answering on Varied Table\n Structures","summary":" A persistent challenge to table question answering (TableQA) by generating\nexecutable programs has been adapting to varied table structures, typically\nrequiring domain-specific logical forms. 
In response, this paper introduces a\nunified TableQA framework that: (1) provides a unified representation for\nstructured tables as multi-index Pandas data frames, (2) uses Python as a\npowerful querying language, and (3) uses few-shot prompting to translate NL\nquestions into Python programs, which are executable on Pandas data frames.\nFurthermore, to answer complex relational questions with extended program\nfunctionality and external knowledge, our framework allows customized APIs that\nPython programs can call. We experiment with four TableQA datasets that involve\ntables of different structures -- relational, multi-table, and hierarchical\nmatrix shapes -- and achieve prominent improvements over past state-of-the-art\nsystems. In ablation studies, we (1) show benefits from our multi-index\nrepresentation and APIs over baselines that use only an LLM, and (2)\ndemonstrate that our approach is modular and can incorporate additional APIs.\n","authors":["Yihan Cao","Shuyi Chen","Ryan Liu","Zhiruo Wang","Daniel Fried"],"pdf_url":"https://arxiv.org/pdf/2310.14687v1.pdf","comment":"EMNLP 2023 camera ready, 13 pages, 11 figures"},{"id":"http://arxiv.org/abs/2310.14684v1","updated":"2023-10-23T08:24:35Z","published":"2023-10-23T08:24:35Z","title":"SpEL: Structured Prediction for Entity Linking","summary":" Entity linking is a prominent thread of research focused on structured data\ncreation by linking spans of text to an ontology or knowledge source. 
We\nrevisit the use of structured prediction for entity linking which classifies\neach individual input token as an entity, and aggregates the token predictions.\nOur system, called SpEL (Structured prediction for Entity Linking) is a\nstate-of-the-art entity linking system that uses some new ideas to apply\nstructured prediction to the task of entity linking including: two refined\nfine-tuning steps; a context sensitive prediction aggregation strategy;\nreduction of the size of the model's output vocabulary, and; we address a\ncommon problem in entity-linking systems where there is a training vs.\ninference tokenization mismatch. Our experiments show that we can outperform\nthe state-of-the-art on the commonly used AIDA benchmark dataset for entity\nlinking to Wikipedia. Our method is also very compute efficient in terms of\nnumber of parameters and speed of inference.\n","authors":["Hassan S. Shavarani","Anoop Sarkar"],"pdf_url":"https://arxiv.org/pdf/2310.14684v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13469v2","updated":"2023-10-23T08:22:12Z","published":"2023-10-20T13:05:32Z","title":"Ask Language Model to Clean Your Noisy Translation Data","summary":" Transformer models have demonstrated remarkable performance in neural machine\ntranslation (NMT). However, their vulnerability to noisy input poses a\nsignificant challenge in practical implementation, where generating clean\noutput from noisy input is crucial. The MTNT dataset is widely used as a\nbenchmark for evaluating the robustness of NMT models against noisy input.\nNevertheless, its utility is limited due to the presence of noise in both the\nsource and target sentences. To address this limitation, we focus on cleaning\nthe noise from the target sentences in MTNT, making it more suitable as a\nbenchmark for noise evaluation. Leveraging the capabilities of large language\nmodels (LLMs), we observe their impressive abilities in noise removal. 
For\nexample, they can remove emojis while considering their semantic meaning.\nAdditionally, we show that LLM can effectively rephrase slang, jargon, and\nprofanities. The resulting datasets, called C-MTNT, exhibit significantly less\nnoise in the target sentences while preserving the semantic integrity of the\noriginal sentences. Our human and GPT-4 evaluations also lead to a consistent\nconclusion that LLM performs well on this task. Lastly, experiments on C-MTNT\nshowcased its effectiveness in evaluating the robustness of NMT models,\nhighlighting the potential of advanced language models for data cleaning and\nemphasizing C-MTNT as a valuable resource.\n","authors":["Quinten Bolding","Baohao Liao","Brandon James Denis","Jun Luo","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2310.13469v2.pdf","comment":"EMNLP 2023, Findings"},{"id":"http://arxiv.org/abs/2211.05895v2","updated":"2023-10-23T08:19:43Z","published":"2022-11-10T21:44:33Z","title":"Understanding ME? Multimodal Evaluation for Fine-grained Visual\n Commonsense","summary":" Visual commonsense understanding requires Vision Language (VL) models to not\nonly understand image and text but also cross-reference in-between to fully\nintegrate and achieve comprehension of the visual scene described. Recently,\nvarious approaches have been developed and have achieved high performance on\nvisual commonsense benchmarks. However, it is unclear whether the models really\nunderstand the visual scene and underlying commonsense knowledge due to limited\nevaluation data resources. To provide an in-depth analysis, we present a\nMultimodal Evaluation (ME) pipeline to automatically generate question-answer\npairs to test models' understanding of the visual scene, text, and related\nknowledge. We then take a step further to show that training with the ME data\nboosts the model's performance in standard VCR evaluation. 
Lastly, our in-depth\nanalysis and comparison reveal interesting findings: (1) semantically low-level\ninformation can assist the learning of high-level information but not the\nopposite; (2) visual information is generally underutilized compared with\ntext.\n","authors":["Zhecan Wang","Haoxuan You","Yicheng He","Wenhao Li","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2211.05895v2.pdf","comment":"Accepted to EMNLP 2022 Long Paper"},{"id":"http://arxiv.org/abs/2310.14676v1","updated":"2023-10-23T08:15:38Z","published":"2023-10-23T08:15:38Z","title":"Pre-Trained Language Models Augmented with Synthetic Scanpaths for\n Natural Language Understanding","summary":" Human gaze data offer cognitive information that reflects natural language\ncomprehension. Indeed, augmenting language models with human scanpaths has\nproven beneficial for a range of NLP tasks, including language understanding.\nHowever, the applicability of this approach is hampered because the abundance\nof text corpora is contrasted by a scarcity of gaze data. Although models for\nthe generation of human-like scanpaths during reading have been developed, the\npotential of synthetic gaze data across NLP tasks remains largely unexplored.\nWe develop a model that integrates synthetic scanpath generation with a\nscanpath-augmented language model, eliminating the need for human gaze data.\nSince the model's error gradient can be propagated throughout all parts of the\nmodel, the scanpath generator can be fine-tuned to downstream tasks. We find\nthat the proposed model not only outperforms the underlying language model, but\nachieves a performance that is comparable to a language model augmented with\nreal human gaze data. Our code is publicly available.\n","authors":["Shuwen Deng","Paul Prasse","David R. Reich","Tobias Scheffer","Lena A. 
Jäger"],"pdf_url":"https://arxiv.org/pdf/2310.14676v1.pdf","comment":"Pre-print for EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14670v1","updated":"2023-10-23T08:09:42Z","published":"2023-10-23T08:09:42Z","title":"Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and\n Beyond","summary":" Vision-language (VL) understanding tasks evaluate models' comprehension of\ncomplex visual scenes through multiple-choice questions. However, we have\nidentified two dataset biases that models can exploit as shortcuts to resolve\nvarious VL tasks correctly without proper understanding. The first type of\ndataset bias is \\emph{Unbalanced Matching} bias, where the correct answer\noverlaps the question and image more than the incorrect answers. The second\ntype of dataset bias is \\emph{Distractor Similarity} bias, where incorrect\nanswers are overly dissimilar to the correct answer but significantly similar\nto other incorrect answers within the same sample. To address these dataset\nbiases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic\ntraining and debiased evaluation data. We then introduce Intra-sample\nCounterfactual Training (ICT) to assist models in utilizing the synthesized\ntraining data, particularly the counterfactual data, via focusing on\nintra-sample differentiation. 
Extensive experiments demonstrate the\neffectiveness of ADS and ICT in consistently improving model performance across\ndifferent benchmarks, even in domain-shifted scenarios.\n","authors":["Zhecan Wang","Long Chen","Haoxuan You","Keyang Xu","Yicheng He","Wenhao Li","Noal Codella","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2310.14670v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.12074v2","updated":"2023-10-23T08:01:46Z","published":"2023-10-18T16:07:13Z","title":"Towards Safer Operations: An Expert-involved Dataset of High-Pressure\n Gas Incidents for Preventing Future Failures","summary":" This paper introduces a new IncidentAI dataset for safety prevention.\nDifferent from prior corpora that usually contain a single task, our dataset\ncomprises three tasks: named entity recognition, cause-effect extraction, and\ninformation retrieval. The dataset is annotated by domain experts who have at\nleast six years of practical experience as high-pressure gas conservation\nmanagers. We validate the contribution of the dataset in the scenario of safety\nprevention. Preliminary results on the three tasks show that NLP techniques are\nbeneficial for analyzing incident reports to prevent future failures. The\ndataset facilitates future research in NLP and incident management communities.\nThe access to the dataset is also provided (the IncidentAI dataset is available\nat: https://github.com/Cinnamon/incident-ai-dataset).\n","authors":["Shumpei Inoue","Minh-Tien Nguyen","Hiroki Mizokuchi","Tuan-Anh D. 
Nguyen","Huu-Hiep Nguyen","Dung Tien Le"],"pdf_url":"https://arxiv.org/pdf/2310.12074v2.pdf","comment":"Accepted by EMNLP 2023 (The Industry Track)"},{"id":"http://arxiv.org/abs/2310.14663v1","updated":"2023-10-23T07:59:46Z","published":"2023-10-23T07:59:46Z","title":"DPP-TTS: Diversifying prosodic features of speech via determinantal\n point processes","summary":" With the rapid advancement in deep generative models, recent neural\nText-To-Speech(TTS) models have succeeded in synthesizing human-like speech.\nThere have been some efforts to generate speech with various prosody beyond\nmonotonous prosody patterns. However, previous works have several limitations.\nFirst, typical TTS models depend on the scaled sampling temperature for\nboosting the diversity of prosody. Speech samples generated at high sampling\ntemperatures often lack perceptual prosodic diversity, which can adversely\naffect the naturalness of the speech. Second, the diversity among samples is\nneglected since the sampling procedure often focuses on a single speech sample\nrather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech\nmodel based on Determinantal Point Processes (DPPs) with a prosody diversifying\nmodule. Our TTS model is capable of generating speech samples that\nsimultaneously consider perceptual diversity in each sample and among multiple\nsamples. 
We demonstrate that DPP-TTS generates speech samples with more\ndiversified prosody than baselines in the side-by-side comparison test\nconsidering the naturalness of speech at the same time.\n","authors":["Seongho Joo","Hyukhun Koh","Kyomin Jung"],"pdf_url":"https://arxiv.org/pdf/2310.14663v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14657v1","updated":"2023-10-23T07:52:38Z","published":"2023-10-23T07:52:38Z","title":"Reasoning about Ambiguous Definite Descriptions","summary":" Natural language reasoning plays an increasingly important role in improving\nlanguage models' ability to solve complex language understanding tasks. An\ninteresting use case for reasoning is the resolution of context-dependent\nambiguity. But no resources exist to evaluate how well Large Language Models\ncan use explicit reasoning to resolve ambiguity in language. We propose to use\nambiguous definite descriptions for this purpose and create and publish the\nfirst benchmark dataset consisting of such phrases. Our method includes all\ninformation required to resolve the ambiguity in the prompt, which means a\nmodel does not require anything but reasoning to do well. We find this to be a\nchallenging task for recent LLMs. Code and data available at:\nhttps://github.com/sfschouten/exploiting-ambiguity\n","authors":["Stefan F. Schouten","Peter Bloem","Ilia Markov","Piek Vossen"],"pdf_url":"https://arxiv.org/pdf/2310.14657v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2305.09645v2","updated":"2023-10-23T07:51:23Z","published":"2023-05-16T17:45:23Z","title":"StructGPT: A General Framework for Large Language Model to Reason over\n Structured Data","summary":" In this paper, we study how to improve the zero-shot reasoning ability of\nlarge language models~(LLMs) over structured data in a unified way. 
Inspired by\nthe study on tool augmentation for LLMs, we develop an \\emph{Iterative\nReading-then-Reasoning~(IRR)} approach for solving question answering tasks\nbased on structured data, called \\textbf{StructGPT}. In our approach, we\nconstruct the specialized function to collect relevant evidence from structured\ndata (i.e., \\emph{reading}), and let LLMs concentrate on the reasoning task based on\nthe collected information (i.e., \\emph{reasoning}). Specifically, we propose an\n\\emph{invoking-linearization-generation} procedure to support LLMs in reasoning\non the structured data with the help of the external interfaces. By iterating\nthis procedure with provided interfaces, our approach can gradually approach\nthe target answer to a given query. Extensive experiments conducted on three\ntypes of structured data demonstrate the effectiveness of our approach, which\ncan significantly boost the performance of ChatGPT and achieve comparable\nperformance against the full-data supervised-tuning baselines. Our codes and\ndata are publicly available at~\\url{https://github.com/RUCAIBox/StructGPT}.\n","authors":["Jinhao Jiang","Kun Zhou","Zican Dong","Keming Ye","Wayne Xin Zhao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2305.09645v2.pdf","comment":"LLM+Structured Data(KG, Table, DB); EMNLP-23 Camera-ready"},{"id":"http://arxiv.org/abs/2310.14654v1","updated":"2023-10-23T07:50:10Z","published":"2023-10-23T07:50:10Z","title":"SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab,\n IIT Madras","summary":" India is home to a multitude of languages of which 22 languages are\nrecognised by the Indian Constitution as official. Building speech based\napplications for the Indian population is a difficult problem owing to limited\ndata and the number of languages and accents to accommodate. 
To encourage the\nlanguage technology community to build speech based applications in Indian\nlanguages, we are open sourcing SPRING-INX data which has about 2000 hours of\nlegally sourced and manually transcribed speech data for ASR system building in\nAssamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi\nand Tamil. This endeavor is by SPRING Lab, Indian Institute of Technology\nMadras and is part of the National Language Translation Mission (NLTM), funded by\nthe Indian Ministry of Electronics and Information Technology (MeitY),\nGovernment of India. We describe the data collection and data cleaning process\nalong with the data statistics in this paper.\n","authors":["Nithya R","Malavika S","Jordan F","Arjun Gangwar","Metilda N J","S Umesh","Rithik Sarab","Akhilesh Kumar Dubey","Govind Divakaran","Samudra Vijaya K","Suryakanth V Gangashetty"],"pdf_url":"https://arxiv.org/pdf/2310.14654v1.pdf","comment":"3 pages, About SPRING-INX Data"},{"id":"http://arxiv.org/abs/2310.13321v2","updated":"2023-10-23T07:41:09Z","published":"2023-10-20T07:31:23Z","title":"Beyond Hard Samples: Robust and Effective Grammatical Error Correction\n with Cycle Self-Augmenting","summary":" Recent studies have revealed that grammatical error correction methods in the\nsequence-to-sequence paradigm are vulnerable to adversarial attack, and simply\nutilizing adversarial examples in the pre-training or post-training process can\nsignificantly enhance the robustness of GEC models to certain types of attack\nwithout suffering too much performance loss on clean data. In this paper, we\nfurther conduct a thorough robustness evaluation of cutting-edge GEC methods\nfor four different types of adversarial attacks and propose a simple yet very\neffective Cycle Self-Augmenting (CSA) method accordingly. 
By leveraging the\naugmenting data from the GEC models themselves in the post-training process and\nintroducing regularization data for cycle training, our proposed method can\neffectively improve the model robustness of well-trained GEC models with only a\nfew more training epochs as an extra cost. More concretely, further training on\nthe regularization data can prevent the GEC models from over-fitting on\neasy-to-learn samples and thus can improve the generalization capability and\nrobustness towards unseen data (adversarial noise/samples). Meanwhile, the\nself-augmented data can provide more high-quality pseudo pairs to improve model\nperformance on the original testing data. Experiments on four benchmark\ndatasets and seven strong models indicate that our proposed training method can\nsignificantly enhance the robustness of four types of attacks without using\npurposely built adversarial examples in training. Evaluation results on clean\ndata further confirm that our proposed CSA method significantly improves the\nperformance of four baselines and yields nearly comparable results with other\nstate-of-the-art models. Our code is available at\nhttps://github.com/ZetangForward/CSA-GEC.\n","authors":["Zecheng Tang","Kaifeng Qi","Juntao Li","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.13321v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14644v1","updated":"2023-10-23T07:35:37Z","published":"2023-10-23T07:35:37Z","title":"Multilingual k-Nearest-Neighbor Machine Translation","summary":" k-nearest-neighbor machine translation has demonstrated remarkable\nimprovements in machine translation quality by creating a datastore of cached\nexamples. However, these improvements have been limited to high-resource\nlanguage pairs, with large datastores, and remain a challenge for low-resource\nlanguages. In this paper, we address this issue by combining representations\nfrom multiple languages into a single datastore. 
Our results consistently\ndemonstrate substantial improvements not only in low-resource translation\nquality (up to +3.6 BLEU), but also for high-resource translation quality (up\nto +0.5 BLEU). Our experiments show that it is possible to create multilingual\ndatastores that are a quarter of the size, achieving a 5.3x speed improvement,\nby using linguistic similarities for datastore creation.\n","authors":["David Stap","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2310.14644v1.pdf","comment":"Accepted to EMNLP"},{"id":"http://arxiv.org/abs/2310.08395v3","updated":"2023-10-23T07:34:28Z","published":"2023-10-12T15:08:14Z","title":"Prompting Large Language Models with Chain-of-Thought for Few-Shot\n Knowledge Base Question Generation","summary":" The task of Question Generation over Knowledge Bases (KBQG) aims to convert a\nlogical form into a natural language question. Because large-scale question\nannotation is expensive, methods of KBQG under low-resource\nscenarios urgently need to be developed. However, current methods heavily rely\non annotated data for fine-tuning, which is not well-suited for few-shot\nquestion generation. The emergence of Large Language Models (LLMs) has shown\ntheir impressive generalization ability in few-shot tasks. Inspired by\nChain-of-Thought (CoT) prompting, which is an in-context learning strategy for\nreasoning, we formulate the KBQG task as a reasoning problem, where the generation\nof a complete question is split into a series of sub-question generation steps.\nOur proposed prompting method KQG-CoT first retrieves supportive logical forms\nfrom the unlabeled data pool taking account of the characteristics of the\nlogical form. Then, we write a prompt to make explicit the reasoning chain of\ngenerating complicated questions based on the selected demonstrations. To\nfurther ensure prompt quality, we extend KQG-CoT into KQG-CoT+ via sorting the\nlogical forms by their complexity. 
We conduct extensive experiments over three\npublic KBQG datasets. The results demonstrate that our prompting method\nconsistently outperforms other prompting baselines on the evaluated datasets.\nRemarkably, our KQG-CoT+ method could surpass existing few-shot SoTA results of\nthe PathQuestions dataset by 18.25, 10.72, and 10.18 absolute points on BLEU-4,\nMETEOR, and ROUGE-L, respectively.\n","authors":["Yuanyuan Liang","Jianing Wang","Hanlun Zhu","Lei Wang","Weining Qian","Yunshi Lan"],"pdf_url":"https://arxiv.org/pdf/2310.08395v3.pdf","comment":"Accepted by EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2305.11550v2","updated":"2023-10-23T07:29:13Z","published":"2023-05-19T09:36:48Z","title":"Viewing Knowledge Transfer in Multilingual Machine Translation Through a\n Representational Lens","summary":" We argue that translation quality alone is not a sufficient metric for\nmeasuring knowledge transfer in multilingual neural machine translation. To\nsupport this claim, we introduce Representational Transfer Potential (RTP),\nwhich measures representational similarities between languages. We show that\nRTP can measure both positive and negative transfer (interference), and find\nthat RTP is strongly correlated with changes in translation quality, indicating\nthat transfer does occur. Furthermore, we investigate data and language\ncharacteristics that are relevant for transfer, and find that multi-parallel\noverlap is an important yet under-explored feature. Based on this, we develop a\nnovel training scheme, which uses an auxiliary similarity loss that encourages\nrepresentations to be more invariant across languages by taking advantage of\nmulti-parallel data. 
We show that our method yields increased translation\nquality for low- and mid-resource languages across multiple data and model\nsetups.\n","authors":["David Stap","Vlad Niculae","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2305.11550v2.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2305.14992v2","updated":"2023-10-23T07:24:28Z","published":"2023-05-24T10:28:28Z","title":"Reasoning with Language Model is Planning with World Model","summary":" Large language models (LLMs) have shown remarkable reasoning capabilities,\nespecially when prompted to generate intermediate reasoning steps (e.g.,\nChain-of-Thought, CoT). However, LLMs can still struggle with problems that are\neasy for humans, such as generating action plans for executing tasks in a given\nenvironment, or performing complex math, logical, and commonsense reasoning.\nThe deficiency stems from the key fact that LLMs lack an internal\n$\\textit{world model}$ to predict the world $\\textit{state}$ (e.g., environment\nstatus, intermediate variable values) and simulate long-term outcomes of\nactions. This prevents LLMs from performing deliberate planning akin to human\nbrains, which involves exploring alternative reasoning paths, anticipating\nfuture states and rewards, and iteratively refining existing reasoning steps.\nTo overcome the limitations, we propose a new LLM reasoning framework,\n$\\underline{R}$easoning vi$\\underline{a}$ $\\underline{P}$lanning\n$\\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning\nagent, and incorporates a principled planning algorithm (based on Monto Carlo\nTree Search) for strategic exploration in the vast reasoning space. During\nreasoning, the LLM (as agent) incrementally builds a reasoning tree under the\nguidance of the LLM (as world model) and task-specific rewards, and obtains a\nhigh-reward reasoning path efficiently with a proper balance between\nexploration $\\textit{vs.}$ exploitation. 
We apply RAP to a variety of\nchallenging reasoning problems including plan generation, math reasoning, and\nlogical inference. Empirical results on these tasks demonstrate the superiority\nof RAP over various strong baselines, including CoT and least-to-most prompting\nwith self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33%\nrelative improvement in a plan generation setting.\n","authors":["Shibo Hao","Yi Gu","Haodi Ma","Joshua Jiahua Hong","Zhen Wang","Daisy Zhe Wang","Zhiting Hu"],"pdf_url":"https://arxiv.org/pdf/2305.14992v2.pdf","comment":"EMNLP 2023. Code is available at\n https://github.com/Ber666/llm-reasoners"},{"id":"http://arxiv.org/abs/2305.13186v3","updated":"2023-10-23T07:19:30Z","published":"2023-05-22T16:13:50Z","title":"SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim\n Verification on Scientific Tables","summary":" Current scientific fact-checking benchmarks exhibit several shortcomings,\nsuch as biases arising from crowd-sourced claims and an over-reliance on\ntext-based evidence. We present SCITAB, a challenging evaluation dataset\nconsisting of 1.2K expert-verified scientific claims that 1) originate from\nauthentic scientific publications and 2) require compositional reasoning for\nverification. The claims are paired with evidence-containing scientific tables\nannotated with labels. Through extensive evaluations, we demonstrate that\nSCITAB poses a significant challenge to state-of-the-art models, including\ntable-based pretraining models and large language models. All models except\nGPT-4 achieved performance barely above random guessing. Popular prompting\ntechniques, such as Chain-of-Thought, do not achieve much performance gains on\nSCITAB. Our analysis uncovers several unique challenges posed by SCITAB,\nincluding table grounding, claim ambiguity, and compositional reasoning. 
Our\ncodes and data are publicly available at https://github.com/XinyuanLu00/SciTab.\n","authors":["Xinyuan Lu","Liangming Pan","Qian Liu","Preslav Nakov","Min-Yen Kan"],"pdf_url":"https://arxiv.org/pdf/2305.13186v3.pdf","comment":"Accepted at EMNLP 2023 (main conference, long paper)"},{"id":"http://arxiv.org/abs/2310.14633v1","updated":"2023-10-23T07:13:31Z","published":"2023-10-23T07:13:31Z","title":"Extending Input Contexts of Language Models through Training on\n Segmented Sequences","summary":" Effectively training language models on long inputs poses many technical\nchallenges. As a cost consideration, languages models are pretrained on a fixed\nsequence length before being adapted to longer sequences. We explore various\nmethods for adapting models to longer inputs by training on segmented sequences\nand an interpolation-based method for extending absolute positional embeddings.\nWe develop a training procedure to extend the input context size of pretrained\nmodels with no architectural changes and no additional memory costs than\ntraining on the original input lengths. By sub-sampling segments from long\ninputs while maintaining their original position the model is able to learn new\npositional interactions. Our method benefits both models trained with absolute\npositional embeddings, by extending their input contexts, as well as popular\nrelative positional embedding methods showing a reduced perplexity on sequences\nlonger than they were trained on. 
We demonstrate our method can extend input\ncontexts by a factor of 4x while improving perplexity.\n","authors":["Petros Karypis","Julian McAuley","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2310.14633v1.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.14628v1","updated":"2023-10-23T07:02:20Z","published":"2023-10-23T07:02:20Z","title":"Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts","summary":" As large language models (LLMs) have shown effectiveness with different\nprompting methods, such as Chain of Thought, Program of Thought, we find that\nthese methods have formed a great complementarity to each other on math\nreasoning tasks. In this work, we propose XoT, an integrated problem solving\nframework by prompting LLMs with diverse reasoning thoughts. For each question,\nXoT always begins with selecting the most suitable method then executes each\nmethod iteratively. Within each iteration, XoT actively checks the validity of\nthe generated answer and incorporates the feedback from external executors,\nallowing it to dynamically switch among different prompting methods. Through\nextensive experiments on 10 popular math reasoning datasets, we demonstrate the\neffectiveness of our proposed approach and thoroughly analyze the strengths of\neach module. Moreover, empirical results suggest that our framework is\northogonal to recent work that makes improvements on single reasoning methods\nand can further generalise to logical reasoning domain. 
By allowing method\nswitching, XoT provides a fresh perspective on the collaborative integration of\ndiverse reasoning thoughts in a unified framework.\n","authors":["Tengxiao Liu","Qipeng Guo","Yuqing Yang","Xiangkun Hu","Yue Zhang","Xipeng Qiu","Zheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14628v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14627v1","updated":"2023-10-23T07:01:09Z","published":"2023-10-23T07:01:09Z","title":"CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster\n Tweet Classification","summary":" The shared real-time information about natural disasters on social media\nplatforms like Twitter and Facebook plays a critical role in informing\nvolunteers, emergency managers, and response organizations. However, supervised\nlearning models for monitoring disaster events require large amounts of\nannotated data, making them unrealistic for real-time use in disaster events.\nTo address this challenge, we present a fine-grained disaster tweet\nclassification model under the semi-supervised, few-shot learning setting where\nonly a small number of annotated data is required. Our model, CrisisMatch,\neffectively classifies tweets into fine-grained classes of interest using few\nlabeled data and large amounts of unlabeled data, mimicking the early stage of\na disaster. Through integrating effective semi-supervised learning ideas and\nincorporating TextMixUp, CrisisMatch achieves performance improvement on two\ndisaster datasets of 11.2\\% on average. 
Further analyses are also provided for\nthe influence of the number of labeled data and out-of-domain results.\n","authors":["Henry Peng Zou","Yue Zhou","Cornelia Caragea","Doina Caragea"],"pdf_url":"https://arxiv.org/pdf/2310.14627v1.pdf","comment":"Accepted by ISCRAM 2023"},{"id":"http://arxiv.org/abs/2310.14626v1","updated":"2023-10-23T07:00:51Z","published":"2023-10-23T07:00:51Z","title":"Conversational Recommender System and Large Language Model Are Made for\n Each Other in E-commerce Pre-sales Dialogue","summary":" E-commerce pre-sales dialogue aims to understand and elicit user needs and\npreferences for the items they are seeking so as to provide appropriate\nrecommendations. Conversational recommender systems (CRSs) learn user\nrepresentation and provide accurate recommendations based on dialogue context,\nbut rely on external knowledge. Large language models (LLMs) generate responses\nthat mimic pre-sales dialogues after fine-tuning, but lack domain-specific\nknowledge for accurate recommendations. Intuitively, the strengths of LLM and\nCRS in E-commerce pre-sales dialogues are complementary, yet no previous work\nhas explored this. This paper investigates the effectiveness of combining LLM\nand CRS in E-commerce pre-sales dialogues, proposing two collaboration methods:\nCRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a\nreal-world dataset of Ecommerce pre-sales dialogues. We analyze the impact of\ntwo collaborative approaches with two CRSs and two LLMs on four tasks of\nEcommerce pre-sales dialogue. 
We find that collaborations between CRS and LLM\ncan be very effective in some cases.\n","authors":["Yuanxing Liu","Wei-Nan Zhang","Yifan Chen","Yuchi Zhang","Haopeng Bai","Fan Feng","Hengbin Cui","Yongbin Li","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2310.14626v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.14623v1","updated":"2023-10-23T06:54:51Z","published":"2023-10-23T06:54:51Z","title":"CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine\n Chain-of-Thought Prompting for Multi-domain NLU Tasks","summary":" While Chain-of-Thought prompting is popular in reasoning tasks, its\napplication to Large Language Models (LLMs) in Natural Language Understanding\n(NLU) is under-explored. Motivated by multi-step reasoning of LLMs, we propose\nCoarse-to-Fine Chain-of-Thought (CoF-CoT) approach that breaks down NLU tasks\ninto multiple reasoning steps where LLMs can learn to acquire and leverage\nessential concepts to solve tasks from different granularities. Moreover, we\npropose leveraging semantic-based Abstract Meaning Representation (AMR)\nstructured knowledge as an intermediate step to capture the nuances and diverse\nstructures of utterances, and to understand connections between their varying\nlevels of granularity. Our proposed approach is demonstrated effective in\nassisting the LLMs adapt to the multi-grained NLU tasks under both zero-shot\nand few-shot multi-domain settings.\n","authors":["Hoang H. Nguyen","Ye Liu","Chenwei Zhang","Tao Zhang","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2310.14623v1.pdf","comment":"Accepted at EMNLP 2023 (Main Conference)"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.15171v1","updated":"2023-10-23T17:59:59Z","published":"2023-10-23T17:59:59Z","title":"RoboDepth: Robust Out-of-Distribution Depth Estimation under Corruptions","summary":" Depth estimation from monocular images is pivotal for real-world visual\nperception systems. 
While current learning-based depth estimation models train\nand test on meticulously curated data, they often overlook out-of-distribution\n(OoD) situations. Yet, in practical settings -- especially safety-critical ones\nlike autonomous driving -- common corruptions can arise. Addressing this\noversight, we introduce a comprehensive robustness test suite, RoboDepth,\nencompassing 18 corruptions spanning three categories: i) weather and lighting\nconditions; ii) sensor failures and movement; and iii) data processing\nanomalies. We subsequently benchmark 42 depth estimation models across indoor\nand outdoor scenes to assess their resilience to these corruptions. Our\nfindings underscore that, in the absence of a dedicated robustness evaluation\nframework, many leading depth estimation models may be susceptible to typical\ncorruptions. We delve into design considerations for crafting more robust depth\nestimation models, touching upon pre-training, augmentation, modality, model\ncapacity, and learning paradigms. We anticipate our benchmark will establish a\nfoundational platform for advancing robust OoD depth estimation.\n","authors":["Lingdong Kong","Shaoyuan Xie","Hanjiang Hu","Lai Xing Ng","Benoit R. Cottereau","Wei Tsang Ooi"],"pdf_url":"https://arxiv.org/pdf/2310.15171v1.pdf","comment":"NeurIPS 2023; 45 pages, 25 figures, 13 tables; Code at\n https://github.com/ldkong1205/RoboDepth"},{"id":"http://arxiv.org/abs/2310.15169v1","updated":"2023-10-23T17:59:58Z","published":"2023-10-23T17:59:58Z","title":"FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling","summary":" With the availability of large-scale video datasets and the advances of\ndiffusion models, text-driven video generation has achieved substantial\nprogress. However, existing video generation models are typically trained on a\nlimited number of frames, resulting in the inability to generate high-fidelity\nlong videos during inference. 
Furthermore, these models only support\nsingle-text conditions, whereas real-life scenarios often require multi-text\nconditions as the video content changes over time. To tackle these challenges,\nthis study explores the potential of extending the text-driven capability to\ngenerate longer videos conditioned on multiple texts. 1) We first analyze the\nimpact of initial noise in video diffusion models. Then building upon the\nobservation of noise, we propose FreeNoise, a tuning-free and time-efficient\nparadigm to enhance the generative capabilities of pretrained video diffusion\nmodels while preserving content consistency. Specifically, instead of\ninitializing noises for all frames, we reschedule a sequence of noises for\nlong-range correlation and perform temporal attention over them by window-based\nfunction. 2) Additionally, we design a novel motion injection method to support\nthe generation of videos conditioned on multiple text prompts. Extensive\nexperiments validate the superiority of our paradigm in extending the\ngenerative capabilities of video diffusion models. It is noteworthy that\ncompared with the previous best-performing method which brought about 255%\nextra time cost, our method incurs only negligible time cost of approximately\n17%. Generated video samples are available at our website:\nhttp://haonanqiu.com/projects/FreeNoise.html.\n","authors":["Haonan Qiu","Menghan Xia","Yong Zhang","Yingqing He","Xintao Wang","Ying Shan","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15169v1.pdf","comment":"Project Page: http://haonanqiu.com/projects/FreeNoise.html Code Repo:\n https://github.com/arthur-qiu/LongerCrafter"},{"id":"http://arxiv.org/abs/2310.15168v1","updated":"2023-10-23T17:59:52Z","published":"2023-10-23T17:59:52Z","title":"Ghost on the Shell: An Expressive Representation of General 3D Shapes","summary":" The creation of photorealistic virtual worlds requires the accurate modeling\nof 3D surface geometry for a wide range of objects. 
For this, meshes are\nappealing since they 1) enable fast physics-based rendering with realistic\nmaterial and lighting, 2) support physical simulation, and 3) are\nmemory-efficient for modern graphics pipelines. Recent work on reconstructing\nand statistically modeling 3D shape, however, has critiqued meshes as being\ntopologically inflexible. To capture a wide range of object shapes, any 3D\nrepresentation must be able to model solid, watertight, shapes as well as thin,\nopen, surfaces. Recent work has focused on the former, and methods for\nreconstructing open surfaces do not support fast reconstruction with material\nand lighting or unconditional generative modelling. Inspired by the observation\nthat open surfaces can be seen as islands floating on watertight surfaces, we\nparameterize open surfaces by defining a manifold signed distance field on\nwatertight templates. With this parameterization, we further develop a\ngrid-based and differentiable representation that parameterizes both watertight\nand non-watertight meshes of arbitrary topology. Our new representation, called\nGhost-on-the-Shell (G-Shell), enables two important applications:\ndifferentiable rasterization-based reconstruction from multiview images and\ngenerative modelling of non-watertight meshes. We empirically demonstrate that\nG-Shell achieves state-of-the-art performance on non-watertight mesh\nreconstruction and generation tasks, while also performing effectively for\nwatertight meshes.\n","authors":["Zhen Liu","Yao Feng","Yuliang Xiu","Weiyang Liu","Liam Paull","Michael J. Black","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2310.15168v1.pdf","comment":"Technical Report (26 pages, 16 figures)"},{"id":"http://arxiv.org/abs/2310.15166v1","updated":"2023-10-23T17:59:31Z","published":"2023-10-23T17:59:31Z","title":"Large Language Models are Visual Reasoning Coordinators","summary":" Visual reasoning requires multimodal perception and commonsense cognition of\nthe world. 
Recently, multiple vision-language models (VLMs) have been proposed\nwith excellent commonsense reasoning ability in various domains. However, how\nto harness the collective power of these complementary VLMs is rarely explored.\nExisting methods like ensemble still struggle to aggregate these models with\nthe desired higher-order communications. In this work, we propose Cola, a novel\nparadigm that coordinates multiple VLMs for visual reasoning. Our key insight\nis that a large language model (LLM) can efficiently coordinate multiple VLMs\nby facilitating natural language communication that leverages their distinct\nand complementary capabilities. Extensive experiments demonstrate that our\ninstruction tuning variant, Cola-FT, achieves state-of-the-art performance on\nvisual question answering (VQA), outside knowledge VQA, visual entailment, and\nvisual spatial reasoning tasks. Moreover, we show that our in-context learning\nvariant, Cola-Zero, exhibits competitive performance in zero and few-shot\nsettings, without finetuning. Through systematic ablation studies and\nvisualizations, we validate that a coordinator LLM indeed comprehends the\ninstruction prompts as well as the separate functionalities of VLMs; it then\ncoordinates them to enable impressive visual reasoning capabilities.\n","authors":["Liangyu Chen","Bo Li","Sheng Shen","Jingkang Yang","Chunyuan Li","Kurt Keutzer","Trevor Darrell","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15166v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15165v1","updated":"2023-10-23T17:59:16Z","published":"2023-10-23T17:59:16Z","title":"Handling Data Heterogeneity via Architectural Design for Federated\n Visual Recognition","summary":" Federated Learning (FL) is a promising research paradigm that enables the\ncollaborative training of machine learning models among various parties without\nthe need for sensitive information exchange. 
Nonetheless, retaining data in\nindividual clients introduces fundamental challenges to achieving performance\non par with centrally trained models. Our study provides an extensive review of\nfederated learning applied to visual recognition. It underscores the critical\nrole of thoughtful architectural design choices in achieving optimal\nperformance, a factor often neglected in the FL literature. Many existing FL\nsolutions are tested on shallow or simple networks, which may not accurately\nreflect real-world applications. This practice restricts the transferability of\nresearch findings to large-scale visual recognition models. Through an in-depth\nanalysis of diverse cutting-edge architectures such as convolutional neural\nnetworks, transformers, and MLP-mixers, we experimentally demonstrate that\narchitectural choices can substantially enhance FL systems' performance,\nparticularly when handling heterogeneous data. We study 19 visual recognition\nmodels from five different architectural families on four challenging FL\ndatasets. We also re-investigate the inferior performance of convolution-based\narchitectures in the FL setting and analyze the influence of normalization\nlayers on the FL performance. Our findings emphasize the importance of\narchitectural design for computer vision tasks in practical scenarios,\neffectively narrowing the performance gap between federated and centralized\nlearning. 
Our source code is available at\nhttps://github.com/sarapieri/fed_het.git.\n","authors":["Sara Pieri","Jose Renato Restom","Samuel Horvath","Hisham Cholakkal"],"pdf_url":"https://arxiv.org/pdf/2310.15165v1.pdf","comment":"to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15161v1","updated":"2023-10-23T17:57:36Z","published":"2023-10-23T17:57:36Z","title":"SAM-Med3D","summary":" Although the Segment Anything Model (SAM) has demonstrated impressive\nperformance in 2D natural image segmentation, its application to 3D volumetric\nmedical images reveals significant shortcomings, namely suboptimal performance\nand unstable prediction, necessitating an excessive number of prompt points to\nattain the desired outcomes. These issues can hardly be addressed by\nfine-tuning SAM on medical data because the original 2D structure of SAM\nneglects 3D spatial information. In this paper, we introduce SAM-Med3D, the\nmost comprehensive study to modify SAM for 3D medical images. Our approach is\ncharacterized by its comprehensiveness in two primary aspects: firstly, by\ncomprehensively reformulating SAM to a thorough 3D architecture trained on a\ncomprehensively processed large-scale volumetric medical dataset; and secondly,\nby providing a comprehensive evaluation of its performance. Specifically, we\ntrain SAM-Med3D with over 131K 3D masks and 247 categories. Our SAM-Med3D\nexcels at capturing 3D spatial information, exhibiting competitive performance\nwith significantly fewer prompt points than the top-performing fine-tuned SAM\nin the medical domain. We then evaluate its capabilities across 15 datasets and\nanalyze it from multiple perspectives, including anatomical structures,\nmodalities, targets, and generalization abilities. Our approach, compared with\nSAM, showcases pronouncedly enhanced efficiency and broad segmentation\ncapabilities for 3D volumetric medical images. 
Our code is released at\nhttps://github.com/uni-medical/SAM-Med3D.\n","authors":["Haoyu Wang","Sizheng Guo","Jin Ye","Zhongying Deng","Junlong Cheng","Tianbin Li","Jianpin Chen","Yanzhou Su","Ziyan Huang","Yiqing Shen","Bin Fu","Shaoting Zhang","Junjun He","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2310.15161v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15160v1","updated":"2023-10-23T17:57:27Z","published":"2023-10-23T17:57:27Z","title":"FreeMask: Synthetic Images with Dense Annotations Make Stronger\n Segmentation Models","summary":" Semantic segmentation has witnessed tremendous progress due to the proposal\nof various advanced network architectures. However, they are extremely hungry\nfor delicate annotations to train, and the acquisition is laborious and\nunaffordable. Therefore, we present FreeMask in this work, which resorts to\nsynthetic images from generative models to ease the burden of both data\ncollection and annotation procedures. Concretely, we first synthesize abundant\ntraining images conditioned on the semantic masks provided by realistic\ndatasets. This yields extra well-aligned image-mask training pairs for semantic\nsegmentation models. We surprisingly observe that, solely trained with\nsynthetic images, we already achieve comparable performance with real ones\n(e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we\ninvestigate the role of synthetic images by joint training with real images, or\npre-training for real images. Meantime, we design a robust filtering principle\nto suppress incorrectly synthesized regions. In addition, we propose to\ninequally treat different semantic masks to prioritize those harder ones and\nsample more corresponding synthetic images for them. As a result, either\njointly trained or pre-trained with our filtered and re-sampled synthesized\nimages, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on\nADE20K. 
Code is available at https://github.com/LiheYoung/FreeMask.\n","authors":["Lihe Yang","Xiaogang Xu","Bingyi Kang","Yinghuan Shi","Hengshuang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.15160v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15150v1","updated":"2023-10-23T17:53:14Z","published":"2023-10-23T17:53:14Z","title":"Online Detection of AI-Generated Images","summary":" With advancements in AI-generated images coming on a continuous basis, it is\nincreasingly difficult to distinguish traditionally-sourced images (e.g.,\nphotos, artwork) from AI-generated ones. Previous detection methods study the\ngeneralization from a single generator to another in isolation. However, in\nreality, new generators are released on a streaming basis. We study\ngeneralization in this setting, training on N models and testing on the next\n(N+k), following the historical release dates of well-known generation methods.\nFurthermore, images increasingly consist of both real and generated components,\nfor example through image inpainting. Thus, we extend this approach to pixel\nprediction, demonstrating strong performance using automatically-generated\ninpainted data. In addition, for settings where commercial models are not\npublicly available for automatic data generation, we evaluate if pixel\ndetectors can be trained solely on whole synthetic images.\n","authors":["David C. Epstein","Ishan Jain","Oliver Wang","Richard Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.15150v1.pdf","comment":"ICCV DeepFake Analysis and Detection Workshop, 2023"},{"id":"http://arxiv.org/abs/2310.15144v1","updated":"2023-10-23T17:48:38Z","published":"2023-10-23T17:48:38Z","title":"DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual\n Design","summary":" We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored\nfor visual design scenarios. 
Recent T2I models like DALL-E 3 and others, have\ndemonstrated remarkable capabilities in generating photorealistic images that\nalign closely with textual inputs. While the allure of creating visually\ncaptivating images is undeniable, our emphasis extends beyond mere aesthetic\npleasure. We aim to investigate the potential of using these powerful models in\nauthentic design contexts. In pursuit of this goal, we develop DEsignBench,\nwhich incorporates test samples designed to assess T2I models on both \"design\ntechnical capability\" and \"design application scenario.\" Each of these two\ndimensions is supported by a diverse set of specific design categories. We\nexplore DALL-E 3 together with other leading T2I models on DEsignBench,\nresulting in a comprehensive visual gallery for side-by-side comparisons. For\nDEsignBench benchmarking, we perform human evaluations on generated images in\nDEsignBench gallery, against the criteria of image-text alignment, visual\naesthetic, and design creativity. Our evaluation also considers other\nspecialized design capabilities, including text rendering, layout composition,\ncolor harmony, 3D design, and medium style. In addition to human evaluations,\nwe introduce the first automatic image generation evaluator powered by GPT-4V.\nThis evaluator provides ratings that align well with human judgments, while\nbeing easily replicable and cost-efficient. 
A high-resolution version is\navailable at\nhttps://github.com/design-bench/design-bench.github.io/raw/main/designbench.pdf?download=\n","authors":["Kevin Lin","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.15144v1.pdf","comment":"Project page at https://design-bench.github.io/"},{"id":"http://arxiv.org/abs/2310.15138v1","updated":"2023-10-23T17:44:59Z","published":"2023-10-23T17:44:59Z","title":"Fusion-Driven Tree Reconstruction and Fruit Localization: Advancing\n Precision in Agriculture","summary":" Fruit distribution is pivotal in shaping the future of both agriculture and\nagricultural robotics, paving the way for a streamlined supply chain. This\nstudy introduces an innovative methodology that harnesses the synergy of RGB\nimagery, LiDAR, and IMU data, to achieve intricate tree reconstructions and the\npinpoint localization of fruits. Such integration not only offers insights into\nthe fruit distribution, which enhances the precision of guidance for\nagricultural robotics and automation systems, but also sets the stage for\nsimulating synthetic fruit patterns across varied tree architectures. To\nvalidate this approach, experiments have been carried out in both a controlled\nenvironment and an actual peach orchard. The results underscore the robustness\nand efficacy of this fusion-driven methodology, highlighting its potential as a\ntransformative tool for future agricultural robotics and precision farming.\n","authors":["Kaiming Fu","Peng Wei","Juan Villacres","Zhaodan Kong","Stavros G. Vougioukas","Brian N. 
Bailey"],"pdf_url":"https://arxiv.org/pdf/2310.15138v1.pdf","comment":"This work was presented at IEEE/RSI International Conference on\n Intelligent Robots and Systems (IROS) Workshop"},{"id":"http://arxiv.org/abs/2303.05639v3","updated":"2023-10-23T17:40:57Z","published":"2023-03-10T01:04:27Z","title":"Self-Supervised One-Shot Learning for Automatic Segmentation of StyleGAN\n Images","summary":" We propose a framework for the automatic one-shot segmentation of synthetic\nimages generated by a StyleGAN. Our framework is based on the observation that\nthe multi-scale hidden features in the GAN generator hold useful semantic\ninformation that can be utilized for automatic on-the-fly segmentation of the\ngenerated images. Using these features, our framework learns to segment\nsynthetic images using a self-supervised contrastive clustering algorithm that\nprojects the hidden features into a compact space for per-pixel classification.\nThis contrastive learner is based on using a novel data augmentation strategy\nand a pixel-wise swapped prediction loss that leads to faster learning of the\nfeature vectors for one-shot segmentation. We have tested our implementation on\nfive standard benchmarks to yield a segmentation performance that not only\noutperforms the semi-supervised baselines by an average wIoU margin of 1.02 %\nbut also improves the inference speeds by a factor of 4.5. Finally, we also\nshow the results of using the proposed one-shot learner in implementing BagGAN,\na framework for producing annotated synthetic baggage X-ray scans for threat\ndetection. This framework was trained and tested on the PIDRay baggage\nbenchmark to yield a performance comparable to its baseline segmenter based on\nmanual annotations.\n","authors":["Ankit Manerikar","Avinash C. 
Kak"],"pdf_url":"https://arxiv.org/pdf/2303.05639v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15130v1","updated":"2023-10-23T17:34:31Z","published":"2023-10-23T17:34:31Z","title":"Novel-View Acoustic Synthesis from 3D Reconstructed Rooms","summary":" We investigate the benefit of combining blind audio recordings with 3D scene\ninformation for novel-view acoustic synthesis. Given audio recordings from 2-4\nmicrophones and the 3D geometry and material of a scene containing multiple\nunknown sound sources, we estimate the sound anywhere in the scene. We identify\nthe main challenges of novel-view acoustic synthesis as sound source\nlocalization, separation, and dereverberation. While naively training an\nend-to-end network fails to produce high-quality results, we show that\nincorporating room impulse responses (RIRs) derived from 3D reconstructed rooms\nenables the same network to jointly tackle these tasks. Our method outperforms\nexisting methods designed for the individual tasks, demonstrating its\neffectiveness at utilizing 3D visual information. In a simulated study on the\nMatterport3D-NVAS dataset, our model achieves near-perfect accuracy on source\nlocalization, a PSNR of 26.44 dB and a SDR of 14.23 dB for source separation\nand dereverberation, resulting in a PSNR of 25.55 dB and a SDR of 14.20 dB on\nnovel-view acoustic synthesis. 
Code, pretrained model, and video results are\navailable on the project webpage (https://github.com/apple/ml-nvas3d).\n","authors":["Byeongjoo Ahn","Karren Yang","Brian Hamilton","Jonathan Sheaffer","Anurag Ranjan","Miguel Sarabia","Oncel Tuzel","Jen-Hao Rick Chang"],"pdf_url":"https://arxiv.org/pdf/2310.15130v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15128v1","updated":"2023-10-23T17:32:38Z","published":"2023-10-23T17:32:38Z","title":"Projected Stochastic Gradient Descent with Quantum Annealed Binary\n Gradients","summary":" We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards\ntraining neural networks with binary weights, known as binary neural networks\n(BNNs), on quantum hardware. BNNs reduce the computational requirements and\nenergy consumption of deep learning models with minimal loss in accuracy.\nHowever, training them in practice remains to be an open challenge. Most known\nBNN-optimisers either rely on projected updates or binarise weights\npost-training. Instead, QP-SBGD approximately maps the gradient onto binary\nvariables, by solving a quadratic constrained binary optimisation. Under\npractically reasonable assumptions, we show that this update rule converges\nwith a rate of $\\mathcal{O}(1 / \\sqrt{T})$. Moreover, we show how the\n$\\mathcal{NP}$-hard projection can be effectively executed on an adiabatic\nquantum annealer, harnessing recent advancements in quantum computation. We\nalso introduce a projected version of this update rule and prove that if a\nfixed point exists in the binary variable space, the modified updates will\nconverge to it. Last but not least, our algorithm is implemented layer-wise,\nmaking it suitable to train larger networks on resource-limited quantum\nhardware. 
Through extensive evaluations, we show that QP-SBGD outperforms or is\non par with competitive and well-established baselines such as BinaryConnect,\nsignSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as\nwell as binary graph neural networks.\n","authors":["Maximilian Krahn","Michelle Sasdelli","Fengyi Yang","Vladislav Golyanik","Juho Kannala","Tat-Jun Chin","Tolga Birdal"],"pdf_url":"https://arxiv.org/pdf/2310.15128v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15115v1","updated":"2023-10-23T17:21:33Z","published":"2023-10-23T17:21:33Z","title":"SpVOS: Efficient Video Object Segmentation with Triple Sparse\n Convolution","summary":" Semi-supervised video object segmentation (Semi-VOS), which requires only\nannotating the first frame of a video to segment future frames, has received\nincreased attention recently. Among existing pipelines, the\nmemory-matching-based one is becoming the main research stream, as it can fully\nutilize the temporal sequence information to obtain high-quality segmentation\nresults. Even though this type of method has achieved promising performance,\nthe overall framework still suffers from heavy computation overhead, mainly\ncaused by the per-frame dense convolution operations between high-resolution\nfeature maps and each kernel filter. Therefore, we propose a sparse baseline of\nVOS named SpVOS in this work, which develops a novel triple sparse convolution\nto reduce the computation costs of the overall VOS framework. The designed\ntriple gate, taking full consideration of both spatial and temporal redundancy\nbetween adjacent video frames, adaptively makes a triple decision to decide how\nto apply the sparse convolution on each pixel to control the computation\noverhead of each layer, while maintaining sufficient discrimination capability\nto distinguish similar objects and avoid error accumulation. 
A mixed sparse\ntraining strategy, coupled with a designed objective considering the sparsity\nconstraint, is also developed to balance the VOS segmentation performance and\ncomputation costs. Experiments are conducted on two mainstream VOS datasets,\nincluding DAVIS and Youtube-VOS. Results show that, the proposed SpVOS achieves\nsuperior performance over other state-of-the-art sparse methods, and even\nmaintains comparable performance, e.g., an 83.04% (79.29%) overall score on the\nDAVIS-2017 (Youtube-VOS) validation set, with the typical non-sparse VOS\nbaseline (82.88% for DAVIS-2017 and 80.36% for Youtube-VOS) while saving up to\n42% FLOPs, showing its application potential for resource-constrained\nscenarios.\n","authors":["Weihao Lin","Tao Chen","Chong Yu"],"pdf_url":"https://arxiv.org/pdf/2310.15115v1.pdf","comment":"15 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.15111v1","updated":"2023-10-23T17:20:01Z","published":"2023-10-23T17:20:01Z","title":"Matryoshka Diffusion Models","summary":" Diffusion models are the de facto approach for generating high-quality images\nand videos, but learning high-dimensional models remains a formidable task due\nto computational and optimization challenges. Existing methods often resort to\ntraining cascaded models in pixel space or using a downsampled latent space of\na separately trained auto-encoder. In this paper, we introduce Matryoshka\nDiffusion Models(MDM), an end-to-end framework for high-resolution image and\nvideo synthesis. We propose a diffusion process that denoises inputs at\nmultiple resolutions jointly and uses a NestedUNet architecture where features\nand parameters for small-scale inputs are nested within those of large scales.\nIn addition, MDM enables a progressive training schedule from lower to higher\nresolutions, which leads to significant improvements in optimization for\nhigh-resolution generation. 
We demonstrate the effectiveness of our approach on\nvarious benchmarks, including class-conditioned image generation,\nhigh-resolution text-to-image, and text-to-video applications. Remarkably, we\ncan train a single pixel-space model at resolutions of up to 1024x1024 pixels,\ndemonstrating strong zero-shot generalization using the CC12M dataset, which\ncontains only 12 million images.\n","authors":["Jiatao Gu","Shuangfei Zhai","Yizhe Zhang","Josh Susskind","Navdeep Jaitly"],"pdf_url":"https://arxiv.org/pdf/2310.15111v1.pdf","comment":"28 pages, 18 figures"},{"id":"http://arxiv.org/abs/2310.15110v1","updated":"2023-10-23T17:18:59Z","published":"2023-10-23T17:18:59Z","title":"Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model","summary":" We report Zero123++, an image-conditioned diffusion model for generating\n3D-consistent multi-view images from a single input view. To take full\nadvantage of pretrained 2D generative priors, we develop various conditioning\nand training schemes to minimize the effort of finetuning from off-the-shelf\nimage diffusion models such as Stable Diffusion. Zero123++ excels in producing\nhigh-quality, consistent multi-view images from a single image, overcoming\ncommon issues like texture degradation and geometric misalignment. Furthermore,\nwe showcase the feasibility of training a ControlNet on Zero123++ for enhanced\ncontrol over the generation process. 
The code is available at\nhttps://github.com/SUDO-AI-3D/zero123plus.\n","authors":["Ruoxi Shi","Hansheng Chen","Zhuoyang Zhang","Minghua Liu","Chao Xu","Xinyue Wei","Linghao Chen","Chong Zeng","Hao Su"],"pdf_url":"https://arxiv.org/pdf/2310.15110v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15105v1","updated":"2023-10-23T17:12:01Z","published":"2023-10-23T17:12:01Z","title":"FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained\n Models in Few-Shot Learning","summary":" Due to the limited availability of data, existing few-shot learning methods\ntrained from scratch fail to achieve satisfactory performance. In contrast,\nlarge-scale pre-trained models such as CLIP demonstrate remarkable few-shot and\nzero-shot capabilities. To enhance the performance of pre-trained models for\ndownstream tasks, fine-tuning the model on downstream data is frequently\nnecessary. However, fine-tuning the pre-trained model leads to a decrease in\nits generalizability in the presence of distribution shift, while the limited\nnumber of samples in few-shot learning makes the model highly susceptible to\noverfitting. Consequently, existing methods for fine-tuning few-shot learning\nprimarily focus on fine-tuning the model's classification head or introducing\nadditional structure. In this paper, we introduce a fine-tuning approach termed\nFeature Discrimination Alignment (FD-Align). Our method aims to bolster the\nmodel's generalizability by preserving the consistency of spurious features\nacross the fine-tuning process. Extensive experimental results validate the\nefficacy of our approach for both ID and OOD tasks. Once fine-tuned, the model\ncan seamlessly integrate with existing methods, leading to performance\nimprovements. 
Our code can be found in https://github.com/skingorz/FD-Align.\n","authors":["Kun Song","Huimin Ma","Bochao Zou","HuiShuai Zhang","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2310.15105v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15099v1","updated":"2023-10-23T17:05:53Z","published":"2023-10-23T17:05:53Z","title":"Dual-path convolutional neural network using micro-FTIR imaging to\n predict breast cancer subtypes and biomarkers levels: estrogen receptor,\n progesterone receptor, HER2 and Ki67","summary":" Breast cancer molecular subtypes classification plays an important role to sort\npatients with divergent prognosis. The biomarkers used are Estrogen Receptor\n(ER), Progesterone Receptor (PR), HER2, and Ki67. Based on these biomarkers\nexpression levels, subtypes are classified as Luminal A (LA), Luminal B (LB),\nHER2 subtype, and Triple-Negative Breast Cancer (TNBC). Immunohistochemistry is\nused to classify subtypes, although interlaboratory and interobserver\nvariations can affect its accuracy, besides being a time-consuming technique.\nThe Fourier transform infrared micro-spectroscopy may be coupled with deep\nlearning for cancer evaluation, where there is still a lack of studies for\nsubtypes and biomarker levels prediction. This study presents a novel 2D deep\nlearning approach to achieve these predictions. Sixty micro-FTIR images of\n320x320 pixels were collected from a human breast biopsies microarray. Data\nwere clustered by K-means, preprocessed and 32x32 patches were generated using\na fully automated approach. CaReNet-V2, a novel convolutional neural network,\nwas developed to classify breast cancer (CA) vs adjacent tissue (AT) and\nmolecular subtypes, and to predict biomarkers level. The clustering method\nenabled to remove non-tissue pixels. Test accuracies for CA vs AT and subtype\nwere above 0.84. 
The model enabled the prediction of ER, PR, and HER2 levels,\nwhere borderline values showed lower performance (minimum accuracy of 0.54).\nKi67 percentage regression demonstrated a mean error of 3.6%. Thus, CaReNet-V2\nis a potential technique for breast cancer biopsies evaluation, standing out as\na screening analysis technique and helping to prioritize patients.\n","authors":["Matheus del-Valle","Emerson Soares Bernardes","Denise Maria Zezell"],"pdf_url":"https://arxiv.org/pdf/2310.15099v1.pdf","comment":"32 pages, 3 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.15098v1","updated":"2023-10-23T17:03:02Z","published":"2023-10-23T17:03:02Z","title":"Acquiring Weak Annotations for Tumor Localization in Temporal and\n Volumetric Data","summary":" Creating large-scale and well-annotated datasets to train AI algorithms is\ncrucial for automated tumor detection and localization. However, with limited\nresources, it is challenging to determine the best type of annotations when\nannotating massive amounts of unlabeled data. To address this issue, we focus\non polyps in colonoscopy videos and pancreatic tumors in abdominal CT scans;\nboth applications require significant effort and time for pixel-wise annotation\ndue to the high dimensional nature of the data, involving either temporary or\nspatial dimensions. In this paper, we develop a new annotation strategy, termed\nDrag&Drop, which simplifies the annotation process to drag and drop. This\nannotation strategy is more efficient, particularly for temporal and volumetric\nimaging, than other types of weak annotations, such as per-pixel, bounding\nboxes, scribbles, ellipses, and points. Furthermore, to exploit our Drag&Drop\nannotations, we develop a novel weakly supervised learning method based on the\nwatershed algorithm. 
Experimental results show that our method achieves better\ndetection and localization performance than alternative weak annotations and,\nmore importantly, achieves similar performance to that trained on detailed\nper-pixel annotations. Interestingly, we find that, with limited resources,\nallocating weak annotations from a diverse patient population can foster models\nmore robust to unseen images than allocating per-pixel annotations for a small\nset of images. In summary, this research proposes an efficient annotation\nstrategy for tumor detection and localization that is less accurate than\nper-pixel annotations but useful for creating large-scale datasets for\nscreening tumors in various medical modalities.\n","authors":["Yu-Cheng Chou","Bowen Li","Deng-Ping Fan","Alan Yuille","Zongwei Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.15098v1.pdf","comment":"Published in Machine Intelligence Research"},{"id":"http://arxiv.org/abs/2310.15085v1","updated":"2023-10-23T16:46:28Z","published":"2023-10-23T16:46:28Z","title":"On the Detection of Image-Scaling Attacks in Machine Learning","summary":" Image scaling is an integral part of machine learning and computer vision\nsystems. Unfortunately, this preprocessing step is vulnerable to so-called\nimage-scaling attacks where an attacker makes unnoticeable changes to an image\nso that it becomes a new image after scaling. This opens up new ways for\nattackers to control the prediction or to improve poisoning and backdoor\nattacks. While effective techniques exist to prevent scaling attacks, their\ndetection has not been rigorously studied yet. Consequently, it is currently\nnot possible to reliably spot these attacks in practice.\n This paper presents the first in-depth systematization and analysis of\ndetection methods for image-scaling attacks. We identify two general detection\nparadigms and derive novel methods from them that are simple in design yet\nsignificantly outperform previous work. 
We demonstrate the efficacy of these\nmethods in a comprehensive evaluation with all major learning platforms and\nscaling algorithms. First, we show that image-scaling attacks modifying the\nentire scaled image can be reliably detected even under an adaptive adversary.\nSecond, we find that our methods provide strong detection performance even if\nonly minor parts of the image are manipulated. As a result, we can introduce a\nnovel protection layer against image-scaling attacks.\n","authors":["Erwin Quiring","Andreas Müller","Konrad Rieck"],"pdf_url":"https://arxiv.org/pdf/2310.15085v1.pdf","comment":"Accepted at ACSAC'23"},{"id":"http://arxiv.org/abs/2310.15081v1","updated":"2023-10-23T16:41:13Z","published":"2023-10-23T16:41:13Z","title":"E4S: Fine-grained Face Swapping via Editing With Regional GAN Inversion","summary":" This paper proposes a novel approach to face swapping from the perspective of\nfine-grained facial editing, dubbed \"editing for swapping\" (E4S). The\ntraditional face swapping methods rely on global feature extraction and often\nfail to preserve the source identity. In contrast, our framework proposes a\nRegional GAN Inversion (RGI) method, which allows the explicit disentanglement\nof shape and texture. Specifically, our E4S performs face swapping in the\nlatent space of a pretrained StyleGAN, where a multi-scale mask-guided encoder\nis applied to project the texture of each facial component into regional style\ncodes and a mask-guided injection module then manipulates feature maps with the\nstyle codes. Based on this disentanglement, face swapping can be simplified as\nstyle and mask swapping. Besides, since reconstructing the source face in the\ntarget image may lead to disharmony lighting, we propose to train a re-coloring\nnetwork to make the swapped face maintain the lighting condition on the target\nface. Further, to deal with the potential mismatch area during mask exchange,\nwe designed a face inpainting network as post-processing. 
The extensive\ncomparisons with state-of-the-art methods demonstrate that our E4S outperforms\nexisting methods in preserving texture, shape, and lighting. Our implementation\nis available at https://github.com/e4s2023/E4S2023.\n","authors":["Maomao Li","Ge Yuan","Cairong Wang","Zhian Liu","Yong Zhang","Yongwei Nie","Jue Wang","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15081v1.pdf","comment":"Project Page: https://e4s2023.github.io/ ;"},{"id":"http://arxiv.org/abs/2310.15072v1","updated":"2023-10-23T16:30:39Z","published":"2023-10-23T16:30:39Z","title":"RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented Reality in\n Dynamic Environments","summary":" It is typically challenging for visual or visual-inertial odometry systems to\nhandle the problems of dynamic scenes and pure rotation. In this work, we\ndesign a novel visual-inertial odometry (VIO) system called RD-VIO to handle\nboth of these two problems. Firstly, we propose an IMU-PARSAC algorithm which\ncan robustly detect and match keypoints in a two-stage process. In the first\nstage, landmarks are matched with new keypoints using visual and IMU\nmeasurements. We collect statistical information from the matching and then\nguide the intra-keypoint matching in the second stage. Secondly, to handle the\nproblem of pure rotation, we detect the motion type and adapt the\ndeferred-triangulation technique during the data-association process. We make\nthe pure-rotational frames into the special subframes. When solving the\nvisual-inertial bundle adjustment, they provide additional constraints to the\npure-rotational motion. 
We evaluate the proposed VIO system on public datasets.\nExperiments show the proposed RD-VIO has obvious advantages over other methods\nin dynamic environments.\n","authors":["Jinyu Li","Xiaokun Pan","Gan Huang","Ziyang Zhang","Nan Wang","Hujun Bao","Guofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.15072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.03116v3","updated":"2023-10-23T16:28:09Z","published":"2022-10-06T17:59:51Z","title":"Content-Based Search for Deep Generative Models","summary":" The growing proliferation of customized and pretrained generative models has\nmade it infeasible for a user to be fully cognizant of every model in\nexistence. To address this need, we introduce the task of content-based model\nsearch: given a query and a large set of generative models, finding the models\nthat best match the query. As each generative model produces a distribution of\nimages, we formulate the search task as an optimization problem to select the\nmodel with the highest probability of generating similar content as the query.\nWe introduce a formulation to approximate this probability given the query from\ndifferent modalities, e.g., image, sketch, and text. Furthermore, we propose a\ncontrastive learning framework for model retrieval, which learns to adapt\nfeatures for various query modalities. 
We demonstrate that our method\noutperforms several baselines on Generative Model Zoo, a new benchmark we\ncreate for the model retrieval task.\n","authors":["Daohan Lu","Sheng-Yu Wang","Nupur Kumari","Rohan Agarwal","David Bau","Jun-Yan Zhu"],"pdf_url":"https://arxiv.org/pdf/2210.03116v3.pdf","comment":"Our project page is hosted at\n https://generative-intelligence-lab.github.io/modelverse/"},{"id":"http://arxiv.org/abs/2308.11909v5","updated":"2023-10-23T16:23:08Z","published":"2023-08-23T04:29:40Z","title":"Edge-aware Hard Clustering Graph Pooling for Brain Imaging","summary":" Graph Convolutional Networks (GCNs) can capture non-Euclidean spatial\ndependence between different brain regions. The graph pooling operator, a\ncrucial element of GCNs, enhances the representation learning capability and\nfacilitates the acquisition of abnormal brain maps. However, most existing\nresearch designs graph pooling operators solely from the perspective of nodes\nwhile disregarding the original edge features, in a way that not only confines\ngraph pooling application scenarios, but also diminishes its ability to capture\ncritical substructures. To design a graph clustering pooling operator that is\ntailored to dominant edge features, we proposed the edge-aware hard clustering\ngraph pool (EHCPool) and redefined the graph clustering process. Specifically,\nthe 'Edge-to-node' criterion was proposed to evaluate the significance of both\nedge and node features. Guided by edge scores, we designed a revolutionary\nIteration n-top strategy, aimed at adaptively learning sparse hard clustering\nassignments for graphs. Subsequently, a novel N-E Aggregation strategy is\nintroduced to aggregate node and edge information in each independent subgraph.\nExtensive experiments on the multi-site public datasets demonstrate the\nsuperiority and robustness of the proposed model. 
More notably, EHCPool has the\npotential to probe different types of dysfunctional brain networks from a\ndata-driven perspective. Core code is at: https://github.com/swfen/EHCPool.\n","authors":["Cheng Zhu","Jiayi Zhu","Lijuan Zhang","Xi Wu","Shuqi Yang","Ping Liang","Honghan Chen","Ying Tan"],"pdf_url":"https://arxiv.org/pdf/2308.11909v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15066v1","updated":"2023-10-23T16:14:05Z","published":"2023-10-23T16:14:05Z","title":"Localizing Active Objects from Egocentric Vision with Symbolic World\n Knowledge","summary":" The ability to actively ground task instructions from an egocentric view is\ncrucial for AI agents to accomplish tasks or assist humans virtually. One\nimportant step towards this goal is to localize and track key active objects\nthat undergo major state change as a consequence of human actions/interactions\nto the environment without being told exactly what/where to ground (e.g.,\nlocalizing and tracking the `sponge` in video from the instruction \"Dip the\n`sponge` into the bucket.\"). While existing works approach this problem from a\npure vision perspective, we investigate to which extent the textual modality\n(i.e., task instructions) and their interaction with visual modality can be\nbeneficial. Specifically, we propose to improve phrase grounding models'\nability on localizing the active objects by: (1) learning the role of `objects\nundergoing change` and extracting them accurately from the instructions, (2)\nleveraging pre- and post-conditions of the objects during actions, and (3)\nrecognizing the objects more robustly with descriptional knowledge. We leverage\nlarge language models (LLMs) to extract the aforementioned action-object\nknowledge, and design a per-object aggregation masking technique to effectively\nperform joint inference on object phrases and symbolic knowledge. We evaluate\nour framework on Ego4D and Epic-Kitchens datasets. 
Extensive experiments\ndemonstrate the effectiveness of our proposed framework, which leads to >54%\nimprovements in all standard metrics on the TREK-150-OPE-Det localization +\ntracking task, >7% improvements in all standard metrics on the TREK-150-OPE\ntracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD\ntask.\n","authors":["Te-Lin Wu","Yu Zhou","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.15066v1.pdf","comment":"In Proceedings of the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP)"},{"id":"http://arxiv.org/abs/2310.15061v1","updated":"2023-10-23T16:05:13Z","published":"2023-10-23T16:05:13Z","title":"The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained\n Multimodal Models","summary":" Despite the impressive performance achieved by pre-trained\nlanguage-and-vision models in downstream tasks, it remains an open question\nwhether this reflects a proper understanding of image-text interaction. In this\nwork, we explore to what extent they handle basic linguistic constructions --\nactive-passive voice, coordination, and relative clauses -- that even preschool\nchildren can typically master. We present BLA, a novel, automatically\nconstructed benchmark to evaluate multimodal models on these Basic Language\nAbilities. We show that different types of Transformer-based systems, such as\nCLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting,\nin line with previous findings. Our experiments, in particular, show that most\nof the tested models only marginally benefit when fine-tuned or prompted with\nconstruction-specific samples. Yet, the generative BLIP2 shows promising\ntrends, especially in an in-context learning setting. 
This opens the door to\nusing BLA not only as an evaluation benchmark but also to improve models' basic\nlanguage abilities.\n","authors":["Xinyi Chen","Raquel Fernández","Sandro Pezzelle"],"pdf_url":"https://arxiv.org/pdf/2310.15061v1.pdf","comment":"This is the camera-ready version of the paper that will be published\n in the Proceedings of EMNLP 2023 (Singapore, 6-10 December 2023)"},{"id":"http://arxiv.org/abs/2310.15059v1","updated":"2023-10-23T16:03:23Z","published":"2023-10-23T16:03:23Z","title":"Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic\n Gaussian Mixture Models","summary":" A long-standing challenge for a robotic manipulation system operating in\nreal-world scenarios is adapting and generalizing its acquired motor skills to\nunseen environments. We tackle this challenge employing hybrid skill models\nthat integrate imitation and reinforcement paradigms, to explore how the\nlearning and adaptation of a skill, along with its core grounding in the scene\nthrough a learned keypoint, can facilitate such generalization. To that end, we\ndevelop Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models (KIS-GMM)\napproach that learns to predict the reference of a dynamical system within the\nscene as a 3D keypoint, leveraging visual observations obtained by the robot's\nphysical interactions during skill learning. Through conducting comprehensive\nevaluations in both simulated and real-world environments, we show that our\nmethod enables a robot to gain a significant zero-shot generalization to novel\nenvironments and to refine skills in the target environments faster than\nlearning from scratch. Importantly, this is achieved without the need for new\nground truth data. 
Moreover, our method effectively copes with scene\ndisplacements.\n","authors":["Iman Nematollahi","Kirill Yankov","Wolfram Burgard","Tim Welschehold"],"pdf_url":"https://arxiv.org/pdf/2310.15059v1.pdf","comment":"Accepted at the International Symposium on Experimental Robotics\n (ISER) 2023. Videos at http://kis-gmm.cs.uni-freiburg.de/"},{"id":"http://arxiv.org/abs/2310.15052v1","updated":"2023-10-23T15:55:30Z","published":"2023-10-23T15:55:30Z","title":"DREAM+: Efficient Dataset Distillation by Bidirectional Representative\n Matching","summary":" Dataset distillation plays a crucial role in creating compact datasets with\nsimilar training performance compared with original large-scale ones. This is\nessential for addressing the challenges of data storage and training costs.\nPrevalent methods facilitate knowledge transfer by matching the gradients,\nembedding distributions, or training trajectories of synthetic images with\nthose of the sampled original images. Although there are various matching\nobjectives, currently the strategy for selecting original images is limited to\nnaive random sampling. We argue that random sampling overlooks the evenness of\nthe selected sample distribution, which may result in noisy or biased matching\ntargets. Besides, the sample diversity is also not constrained by random\nsampling. Additionally, current methods predominantly focus on\nsingle-dimensional matching, where information is not fully utilized. To\naddress these challenges, we propose a novel matching strategy called Dataset\nDistillation by Bidirectional REpresentAtive Matching (DREAM+), which selects\nrepresentative original images for bidirectional matching. DREAM+ is applicable\nto a variety of mainstream dataset distillation frameworks and significantly\nreduces the number of distillation iterations by more than 15 times without\naffecting performance. Given sufficient training time, DREAM+ can further\nimprove the performance and achieve state-of-the-art results. 
We have released\nthe code at github.com/NUS-HPC-AI-Lab/DREAM+.\n","authors":["Yanqing Liu","Jianyang Gu","Kai Wang","Zheng Zhu","Kaipeng Zhang","Wei Jiang","Yang You"],"pdf_url":"https://arxiv.org/pdf/2310.15052v1.pdf","comment":"This is an extension of the ICCV conference version"},{"id":"http://arxiv.org/abs/2310.15044v1","updated":"2023-10-23T15:46:47Z","published":"2023-10-23T15:46:47Z","title":"A Universal Anti-Spoofing Approach for Contactless Fingerprint Biometric\n Systems","summary":" With the increasing integration of smartphones into our daily lives,\nfingerphotos are becoming a potential contactless authentication method. While\nit offers convenience, it is also more vulnerable to spoofing using various\npresentation attack instruments (PAI). The contactless fingerprint is an\nemerging biometric authentication but has not yet been heavily investigated for\nanti-spoofing. While existing anti-spoofing approaches demonstrated fair\nresults, they have encountered challenges in terms of universality and\nscalability to detect any unseen/unknown spoofed samples. To address this\nissue, we propose a universal presentation attack detection method for\ncontactless fingerprints, despite having limited knowledge of presentation\nattack samples. We generated synthetic contactless fingerprints using StyleGAN\nfrom live finger photos and integrating them to train a semi-supervised\nResNet-18 model. A novel joint loss function, combining the Arcface and Center\nloss, is introduced with a regularization to balance between the two loss\nfunctions and minimize the variations within the live samples while enhancing\nthe inter-class variations between the deepfake and live samples. 
We also\nconducted a comprehensive comparison of different regularizations' impact on\nthe joint loss function for presentation attack detection (PAD) and explored\nthe performance of a modified ResNet-18 architecture with different activation\nfunctions (i.e., leaky ReLU and ReLU) in conjunction with Arcface and center\nloss. Finally, we evaluate the performance of the model using unseen types of\nspoof attacks and live data. Our proposed method achieves a Bona Fide\nClassification Error Rate (BPCER) of 0.12\\%, an Attack Presentation\nClassification Error Rate (APCER) of 0.63\\%, and an Average Classification\nError Rate (ACER) of 0.37\\%.\n","authors":["Banafsheh Adami","Sara Tehranipoor","Nasser Nasrabadi","Nima Karimian"],"pdf_url":"https://arxiv.org/pdf/2310.15044v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15043v1","updated":"2023-10-23T15:46:39Z","published":"2023-10-23T15:46:39Z","title":"CalibrationPhys: Self-supervised Video-based Heart and Respiratory Rate\n Measurements by Calibrating Between Multiple Cameras","summary":" Video-based heart and respiratory rate measurements using facial videos are\nmore useful and user-friendly than traditional contact-based sensors. However,\nmost of the current deep learning approaches require ground-truth pulse and\nrespiratory waves for model training, which are expensive to collect. In this\npaper, we propose CalibrationPhys, a self-supervised video-based heart and\nrespiratory rate measurement method that calibrates between multiple cameras.\nCalibrationPhys trains deep learning models without supervised labels by using\nfacial videos captured simultaneously by multiple cameras. Contrastive learning\nis performed so that the pulse and respiratory waves predicted from the\nsynchronized videos using multiple cameras are positive and those from\ndifferent videos are negative. 
CalibrationPhys also improves the robustness of\nthe models by means of a data augmentation technique and successfully leverages\na pre-trained model for a particular camera. Experimental results utilizing two\ndatasets demonstrate that CalibrationPhys outperforms state-of-the-art heart\nand respiratory rate measurement methods. Since we optimize camera-specific\nmodels using only videos from multiple cameras, our approach makes it easy to\nuse arbitrary cameras for heart and respiratory rate measurements.\n","authors":["Yusuke Akamatsu","Terumi Umematsu","Hitoshi Imaoka"],"pdf_url":"https://arxiv.org/pdf/2310.15043v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2306.06599v6","updated":"2023-10-23T15:40:43Z","published":"2023-06-11T06:27:06Z","title":"Variational Imbalanced Regression: Fair Uncertainty Quantification via\n Probabilistic Smoothing","summary":" Existing regression models tend to fall short in both accuracy and\nuncertainty estimation when the label distribution is imbalanced. In this\npaper, we propose a probabilistic deep learning model, dubbed variational\nimbalanced regression (VIR), which not only performs well in imbalanced\nregression but naturally produces reasonable uncertainty estimation as a\nbyproduct. 
Different from typical variational autoencoders assuming I.I.D.\nrepresentations (a data point's representation is not directly affected by\nother data points), our VIR borrows data with similar regression labels to\ncompute the latent representation's variational distribution; furthermore,\ndifferent from deterministic regression models producing point estimates, VIR\npredicts the entire normal-inverse-gamma distributions and modulates the\nassociated conjugate distributions to impose probabilistic reweighting on the\nimbalanced data, thereby providing better uncertainty estimation. Experiments\nin several real-world datasets show that our VIR can outperform\nstate-of-the-art imbalanced regression models in terms of both accuracy and\nuncertainty estimation. Code will soon be available at\nhttps://github.com/Wang-ML-Lab/variational-imbalanced-regression.\n","authors":["Ziyan Wang","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2306.06599v6.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15041v1","updated":"2023-10-23T15:40:00Z","published":"2023-10-23T15:40:00Z","title":"Manipulation Mask Generator: High-Quality Image Manipulation Mask\n Generation Method Based on Modified Total Variation Noise Reduction","summary":" In artificial intelligence, any model that wants to achieve a good result is\ninseparable from a large number of high-quality data. It is especially true in\nthe field of tamper detection. This paper proposes a modified total variation\nnoise reduction method to acquire high-quality tampered images. We\nautomatically crawl original and tampered images from the Baidu PS Bar. Baidu\nPS Bar is a website where net friends post countless tampered images.\nSubtracting the original image with the tampered image can highlight the\ntampered area. However, there is also substantial noise on the final print, so\nthese images can't be directly used in the deep learning model. 
Our modified\ntotal variation noise reduction method is aimed at solving this problem.\nBecause a lot of text is slender, it is easy to lose text information after the\nopening and closing operation. We use MSER (Maximally Stable Extremal Regions)\nand NMS (Non-maximum Suppression) technology to extract text information. And\nthen use the modified total variation noise reduction technology to process the\nsubtracted image. Finally, we can obtain an image with little noise by adding\nthe image and text information. And the idea also largely retains the text\ninformation. Datasets generated in this way can be used in deep learning\nmodels, and they will help the model achieve better results.\n","authors":["Xinyu Yang","Jizhe Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.15041v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09275v2","updated":"2023-10-23T15:37:48Z","published":"2023-10-13T17:38:41Z","title":"Understanding and Modeling the Effects of Task and Context on Drivers'\n Gaze Allocation","summary":" Understanding what drivers look at is important for many applications,\nincluding driver training, monitoring, and assistance, as well as self-driving.\nTraditionally, factors affecting human visual attention have been divided into\nbottom-up (involuntary attraction to salient regions) and top-down (task- and\ncontext-driven). Although both play a role in drivers' gaze allocation, most of\nthe existing modeling approaches apply techniques developed for bottom-up\nsaliency and do not consider task and context influences explicitly. 
Likewise,\ncommon driving attention benchmarks lack relevant task and context annotations.\nTherefore, to enable analysis and modeling of these factors for drivers' gaze\nprediction, we propose the following: 1) address some shortcomings of the\npopular DR(eye)VE dataset and extend it with per-frame annotations for driving\ntask and context; 2) benchmark a number of baseline and SOTA models for\nsaliency and driver gaze prediction and analyze them w.r.t. the new\nannotations; and finally, 3) a novel model that modulates drivers' gaze\nprediction with explicit action and context information, and as a result\nsignificantly improves SOTA performance on DR(eye)VE overall (by 24\\% KLD and\n89\\% NSS) and on a subset of action and safety-critical intersection scenarios\n(by 10--30\\% KLD). Extended annotations, code for model and evaluation will be\nmade publicly available.\n","authors":["Iuliia Kotseruba","John K. Tsotsos"],"pdf_url":"https://arxiv.org/pdf/2310.09275v2.pdf","comment":"12 pages, 8 figures, 8 tables"},{"id":"http://arxiv.org/abs/2310.15036v1","updated":"2023-10-23T15:34:03Z","published":"2023-10-23T15:34:03Z","title":"UWB Based Static Gesture Classification","summary":" Our paper presents a robust framework for UWB-based static gesture\nrecognition, leveraging proprietary UWB radar sensor technology. Extensive data\ncollection efforts were undertaken to compile datasets containing five commonly\nused gestures. Our approach involves a comprehensive data pre-processing\npipeline that encompasses outlier handling, aspect ratio-preserving resizing,\nand false-color image transformation. Both CNN and MobileNet models were\ntrained on the processed images. Remarkably, our best-performing model achieved\nan accuracy of 96.78%. Additionally, we developed a user-friendly GUI framework\nto assess the model's system resource usage and processing times, which\nrevealed low memory utilization and real-time task completion in under one\nsecond. 
This research marks a significant step towards enhancing static gesture\nrecognition using UWB technology, promising practical applications in various\ndomains.\n","authors":["Abhishek Sebastian"],"pdf_url":"https://arxiv.org/pdf/2310.15036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15025v1","updated":"2023-10-23T15:23:31Z","published":"2023-10-23T15:23:31Z","title":"P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic\n Segmentation","summary":" Recently, Transformer-based models have achieved promising results in various\nvision tasks, due to their ability to model long-range dependencies. However,\ntransformers are computationally expensive, which limits their applications in\nreal-time tasks such as autonomous driving. In addition, an efficient local and\nglobal feature selection and fusion are vital for accurate dense prediction,\nespecially driving scene understanding tasks. In this paper, we propose a\nreal-time semantic segmentation architecture named Pyramid Pooling Axial\nTransformer (P2AT). The proposed P2AT takes a coarse feature from the CNN\nencoder to produce scale-aware contextual features, which are then combined\nwith the multi-level feature aggregation scheme to produce enhanced contextual\nfeatures. Specifically, we introduce a pyramid pooling axial transformer to\ncapture intricate spatial and channel dependencies, leading to improved\nperformance on semantic segmentation. Then, we design a Bidirectional Fusion\nmodule (BiF) to combine semantic information at different levels. Meanwhile, a\nGlobal Context Enhancer is introduced to compensate for the inadequacy of\nconcatenating different semantic levels. Finally, a decoder block is proposed\nto help maintain a larger receptive field. We evaluate P2AT variants on three\nchallenging scene-understanding datasets. In particular, our P2AT variants\nachieve state-of-the-art results on the Camvid dataset: 80.5%, 81.0%, and 81.1% for\nP2AT-S, P2AT-M, and P2AT-L, respectively. 
Furthermore, our experiments on\nCityscapes and Pascal VOC 2012 have demonstrated the efficiency of the proposed\narchitecture, with results showing that P2AT-M achieves 78.7% on Cityscapes.\nThe source code will be available at\n","authors":["Mohammed A. M. Elhassan","Changjun Zhou","Amina Benabid","Abuzar B. M. Adam"],"pdf_url":"https://arxiv.org/pdf/2310.15025v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15023v1","updated":"2023-10-23T15:21:46Z","published":"2023-10-23T15:21:46Z","title":"SONIC: Sonar Image Correspondence using Pose Supervised Learning for\n Imaging Sonars","summary":" In this paper, we address the challenging problem of data association for\nunderwater SLAM through a novel method for sonar image correspondence using\nlearned features. We introduce SONIC (SONar Image Correspondence), a\npose-supervised network designed to yield robust feature correspondence capable\nof withstanding viewpoint variations. The inherent complexity of the underwater\nenvironment stems from the dynamic and frequently limited visibility\nconditions, restricting vision to a few meters of often featureless expanses.\nThis makes camera-based systems suboptimal in most open water application\nscenarios. Consequently, multibeam imaging sonars emerge as the preferred\nchoice for perception sensors. However, they too are not without their\nlimitations. While imaging sonars offer superior long-range visibility compared\nto cameras, their measurements can appear different from varying viewpoints.\nThis inherent variability presents formidable challenges in data association,\nparticularly for feature-based methods. Our method demonstrates significantly\nbetter performance in generating correspondences for sonar images which will\npave the way for more accurate loop closure constraints and sonar-based place\nrecognition. 
Code as well as simulated and real-world datasets will be made\npublic to facilitate further development in the field.\n","authors":["Samiran Gode","Akshay Hinduja","Michael Kaess"],"pdf_url":"https://arxiv.org/pdf/2310.15023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15020v1","updated":"2023-10-23T15:15:19Z","published":"2023-10-23T15:15:19Z","title":"Invariance is Key to Generalization: Examining the Role of\n Representation in Sim-to-Real Transfer for Visual Navigation","summary":" The data-driven approach to robot control has been gathering pace rapidly,\nyet generalization to unseen task domains remains a critical challenge. We\nargue that the key to generalization is representations that are (i) rich\nenough to capture all task-relevant information and (ii) invariant to\nsuperfluous variability between the training and the test domains. We\nexperimentally study such a representation -- containing both depth and\nsemantic information -- for visual navigation and show that it enables a\ncontrol policy trained entirely in simulated indoor scenes to generalize to\ndiverse real-world environments, both indoors and outdoors. Further, we show\nthat our representation reduces the A-distance between the training and test\ndomains, improving the generalization error bound as a result. 
Our proposed\napproach is scalable: the learned policy improves continuously, as the\nfoundation models that it exploits absorb more diverse data during\npre-training.\n","authors":["Bo Ai","Zhanxin Wu","David Hsu"],"pdf_url":"https://arxiv.org/pdf/2310.15020v1.pdf","comment":"11 pages, accepted by the 18th International Symposium on\n Experimental Robotics (ISER 2023)"},{"id":"http://arxiv.org/abs/2310.15008v1","updated":"2023-10-23T15:02:23Z","published":"2023-10-23T15:02:23Z","title":"Wonder3D: Single Image to 3D using Cross-Domain Diffusion","summary":" In this work, we introduce Wonder3D, a novel method for efficiently\ngenerating high-fidelity textured meshes from single-view images. Recent methods\nbased on Score Distillation Sampling (SDS) have shown the potential to recover\n3D geometry from 2D diffusion priors, but they typically suffer from\ntime-consuming per-shape optimization and inconsistent geometry. In contrast,\ncertain works directly produce 3D information via fast network inferences, but\ntheir results are often of low quality and lack geometric details. To\nholistically improve the quality, consistency, and efficiency of image-to-3D\ntasks, we propose a cross-domain diffusion model that generates multi-view\nnormal maps and the corresponding color images. To ensure consistency, we\nemploy a multi-view cross-domain attention mechanism that facilitates\ninformation exchange across views and modalities. Lastly, we introduce a\ngeometry-aware normal fusion algorithm that extracts high-quality surfaces from\nthe multi-view 2D representations. 
Our extensive evaluations demonstrate that\nour method achieves high-quality reconstruction results, robust generalization,\nand reasonably good efficiency compared to prior works.\n","authors":["Xiaoxiao Long","Yuan-Chen Guo","Cheng Lin","Yuan Liu","Zhiyang Dou","Lingjie Liu","Yuexin Ma","Song-Hai Zhang","Marc Habermann","Christian Theobalt","Wenping Wang"],"pdf_url":"https://arxiv.org/pdf/2310.15008v1.pdf","comment":"Project page: https://www.xxlong.site/Wonder3D/"},{"id":"http://arxiv.org/abs/2302.07241v3","updated":"2023-10-23T14:56:15Z","published":"2023-02-14T18:40:26Z","title":"ConceptFusion: Open-set Multimodal 3D Mapping","summary":" Building 3D maps of the environment is central to robot navigation, planning,\nand interaction with objects in a scene. Most existing approaches that\nintegrate semantic concepts with 3D maps largely remain confined to the\nclosed-set setting: they can only reason about a finite set of concepts,\npre-defined at training time. Further, these maps can only be queried using\nclass labels, or in recent work, using text prompts.\n We address both these issues with ConceptFusion, a scene representation that\nis (i) fundamentally open-set, enabling reasoning beyond a closed set of\nconcepts and (ii) inherently multimodal, enabling a diverse range of possible\nqueries to the 3D map, from language, to images, to audio, to 3D geometry, all\nworking in concert. ConceptFusion leverages the open-set capabilities of\ntoday's foundation models pre-trained on internet-scale data to reason about\nconcepts across modalities such as natural language, images, and audio. We\ndemonstrate that pixel-aligned open-set features can be fused into 3D maps via\ntraditional SLAM and multi-view fusion approaches. This enables effective\nzero-shot spatial reasoning, not needing any additional training or finetuning,\nand retains long-tailed concepts better than supervised approaches,\noutperforming them by more than 40% margin on 3D IoU. 
We extensively evaluate\nConceptFusion on a number of real-world datasets, simulated home environments,\na real-world tabletop manipulation task, and an autonomous driving platform. We\nshowcase new avenues for blending foundation models with 3D open-set multimodal\nmapping.\n For more information, visit our project page https://concept-fusion.github.io\nor watch our 5-minute explainer video\nhttps://www.youtube.com/watch?v=rkXgws8fiDs\n","authors":["Krishna Murthy Jatavallabhula","Alihusein Kuwajerwala","Qiao Gu","Mohd Omama","Tao Chen","Alaa Maalouf","Shuang Li","Ganesh Iyer","Soroush Saryazdi","Nikhil Keetha","Ayush Tewari","Joshua B. Tenenbaum","Celso Miguel de Melo","Madhava Krishna","Liam Paull","Florian Shkurti","Antonio Torralba"],"pdf_url":"https://arxiv.org/pdf/2302.07241v3.pdf","comment":"RSS 2023. Project page: https://concept-fusion.github.io Explainer\n video: https://www.youtube.com/watch?v=rkXgws8fiDs Code:\n https://github.com/concept-fusion/concept-fusion"},{"id":"http://arxiv.org/abs/2310.10352v2","updated":"2023-10-23T14:45:07Z","published":"2023-10-16T12:42:43Z","title":"Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating\n Holistic Understanding of Crowd Scenes","summary":" To alleviate the heavy annotation burden for training a reliable crowd\ncounting model and thus make the model more practicable and accurate by being\nable to benefit from more data, this paper presents a new semi-supervised\nmethod based on the mean teacher framework. When there is a scarcity of labeled\ndata available, the model is prone to overfit local patches. Within such\ncontexts, the conventional approach of solely improving the accuracy of local\npatch predictions through unlabeled data proves inadequate. Consequently, we\npropose a more nuanced approach: fostering the model's intrinsic 'subitizing'\ncapability. 
This ability allows the model to accurately estimate the count in\nregions by leveraging its understanding of the crowd scenes, mirroring the\nhuman cognitive process. To achieve this goal, we apply masking on unlabeled\ndata, guiding the model to make predictions for these masked patches based on\nthe holistic cues. Furthermore, to help with feature learning, herein we\nincorporate a fine-grained density classification task. Our method is general\nand applicable to most existing crowd counting methods as it doesn't have\nstrict structural or loss constraints. In addition, we observe that the model\ntrained with our framework exhibits a 'subitizing'-like behavior. It accurately\npredicts low-density regions with only a 'glance', while incorporating local\ndetails to predict high-density regions. Our method achieves the\nstate-of-the-art performance, surpassing previous approaches by a large margin\non challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is\navailable at: https://github.com/cha15yq/MRC-Crowd.\n","authors":["Yifei Qian","Xiaopeng Hong","Ognjen Arandjelović","Zhongliang Guo","Carl R. Donovan"],"pdf_url":"https://arxiv.org/pdf/2310.10352v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14961v1","updated":"2023-10-23T14:04:18Z","published":"2023-10-23T14:04:18Z","title":"StenUNet: Automatic Stenosis Detection from X-ray Coronary Angiography","summary":" Coronary angiography continues to serve as the primary method for diagnosing\ncoronary artery disease (CAD), which is the leading global cause of mortality.\nThe severity of CAD is quantified by the location, degree of narrowing\n(stenosis), and number of arteries involved. In current practice, this\nquantification is performed manually using visual inspection and thus suffers\nfrom poor inter- and intra-rater reliability. 
The MICCAI grand challenge:\nAutomatic Region-based Coronary Artery Disease diagnostics using the X-ray\nangiography imagEs (ARCADE) curated a dataset with stenosis annotations, with\nthe goal of creating an automated stenosis detection algorithm. Using a\ncombination of machine learning and other computer vision techniques, we\npropose the architecture and algorithm StenUNet to accurately detect stenosis\nfrom X-ray Coronary Angiography. Our submission to the ARCADE challenge placed\n3rd among all teams. We achieved an F1 score of 0.5348 on the test set, 0.0005\nlower than the 2nd place.\n","authors":["Hui Lin","Tom Liu","Aggelos Katsaggelos","Adrienne Kline"],"pdf_url":"https://arxiv.org/pdf/2310.14961v1.pdf","comment":"12 pages, 5 figures, 1 table"},{"id":"http://arxiv.org/abs/2310.14958v1","updated":"2023-10-23T14:02:57Z","published":"2023-10-23T14:02:57Z","title":"Learning Real-World Image De-Weathering with Imperfect Supervision","summary":" Real-world image de-weathering aims at removing various undesirable\nweather-related artifacts. Owing to the impossibility of capturing image pairs\nconcurrently, existing real-world de-weathering datasets often exhibit\ninconsistent illumination, position, and textures between the ground-truth\nimages and the input degraded images, resulting in imperfect supervision. Such\nnon-ideal supervision negatively affects the training process of learning-based\nde-weathering methods. In this work, we attempt to address the problem with a\nunified solution for various inconsistencies. Specifically, inspired by\ninformation bottleneck theory, we first develop a Consistent Label Constructor\n(CLC) to generate a pseudo-label as consistent as possible with the input\ndegraded image while removing most weather-related degradations. In particular,\nmultiple adjacent frames of the current input are also fed into CLC to enhance\nthe pseudo-label. 
Then we combine the original imperfect labels and\npseudo-labels to jointly supervise the de-weathering model by the proposed\nInformation Allocation Strategy (IAS). During testing, only the de-weathering\nmodel is used for inference. Experiments on two real-world de-weathering\ndatasets show that our method helps existing de-weathering models achieve\nbetter performance. Codes are available at\nhttps://github.com/1180300419/imperfect-deweathering.\n","authors":["Xiaohui Liu","Zhilu Zhang","Xiaohe Wu","Chaoyu Feng","Xiaotao Wang","LEI LEI","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2310.14958v1.pdf","comment":"16 pages, 13 figures"},{"id":"http://arxiv.org/abs/2307.13899v2","updated":"2023-10-23T13:58:25Z","published":"2023-07-26T01:47:49Z","title":"Regularizing Neural Networks with Meta-Learning Generative Models","summary":" This paper investigates methods for improving generative data augmentation\nfor deep learning. Generative data augmentation leverages the synthetic samples\nproduced by generative models as an additional dataset for classification with\nsmall dataset settings. A key challenge of generative data augmentation is that\nthe synthetic data contain uninformative samples that degrade accuracy. This is\nbecause the synthetic samples do not perfectly represent class categories in\nreal data and uniform sampling does not necessarily provide useful samples for\ntasks. In this paper, we present a novel strategy for generative data\naugmentation called meta generative regularization (MGR). To avoid the\ndegradation of generative data augmentation, MGR utilizes synthetic samples in\nthe regularization term for feature extractors instead of in the loss function,\ne.g., cross-entropy. These synthetic samples are dynamically determined to\nminimize the validation losses through meta-learning. We observed that MGR can\navoid the performance degradation of na\\\"ive generative data augmentation and\nboost the baselines. 
Experiments on six datasets showed that MGR is effective\nparticularly when datasets are smaller and stably outperforms baselines.\n","authors":["Shin'ya Yamaguchi","Daiki Chijiwa","Sekitoshi Kanai","Atsutoshi Kumagai","Hisashi Kashima"],"pdf_url":"https://arxiv.org/pdf/2307.13899v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.14934v1","updated":"2023-10-23T13:34:59Z","published":"2023-10-23T13:34:59Z","title":"Robust Depth Linear Error Decomposition with Double Total Variation and\n Nuclear Norm for Dynamic MRI Reconstruction","summary":" Compressed Sensing (CS) significantly speeds up Magnetic Resonance Image\n(MRI) processing and achieves accurate MRI reconstruction from under-sampled\nk-space data. According to the current research, there are still several\nproblems with dynamic MRI k-space reconstruction based on CS. 1) There are\ndifferences between the Fourier domain and the Image domain, and the\ndifferences between MRI processing of different domains need to be considered.\n2) As three-dimensional data, dynamic MRI has its spatial-temporal\ncharacteristics, which need to calculate the difference and consistency of\nsurface textures while preserving structural integrity and uniqueness. 3)\nDynamic MRI reconstruction is time-consuming and computationally\nresource-dependent. In this paper, we propose a novel robust low-rank dynamic\nMRI reconstruction optimization model via highly under-sampled and Discrete\nFourier Transform (DFT) called the Robust Depth Linear Error Decomposition\nModel (RDLEDM). Our method mainly includes linear decomposition, double Total\nVariation (TV), and double Nuclear Norm (NN) regularizations. 
By adding linear\nimage domain error analysis, the noise is reduced after under-sampled and DFT\nprocessing, and the anti-interference ability of the algorithm is enhanced.\nDouble TV and NN regularizations can utilize both spatial-temporal\ncharacteristics and explore the complementary relationship between different\ndimensions in dynamic MRI sequences. In addition, due to the non-smoothness and\nnon-convexity of TV and NN terms, it is difficult to optimize the unified\nobjective model. To address this issue, we utilize a fast algorithm by solving\na primal-dual form of the original problem. Compared with five state-of-the-art\nmethods, extensive experiments on dynamic MRI data demonstrate the superior\nperformance of the proposed method in terms of both reconstruction accuracy and\ntime complexity.\n","authors":["Junpeng Tan","Chunmei Qing","Xiangmin Xu"],"pdf_url":"https://arxiv.org/pdf/2310.14934v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14924v1","updated":"2023-10-23T13:29:42Z","published":"2023-10-23T13:29:42Z","title":"Converting Depth Images and Point Clouds for Feature-based Pose\n Estimation","summary":" In recent years, depth sensors have become more and more affordable and have\nfound their way into a growing number of robotic systems. However, mono- or\nmulti-modal sensor registration, often a necessary step for further processing,\nfaces many challenges on raw depth images or point clouds. This paper presents\na method of converting depth data into images capable of visualizing spatial\ndetails that are basically hidden in traditional depth images. After noise\nremoval, a neighborhood of points forms two normal vectors whose difference is\nencoded into this new conversion. Compared to Bearing Angle images, our method\nyields brighter, higher-contrast images with more visible contours and more\ndetails. We tested feature-based pose estimation of both conversions in a\nvisual odometry task and RGB-D SLAM. 
For all tested features, AKAZE, ORB, SIFT,\nand SURF, our new Flexion images yield better results than Bearing Angle images\nand show great potential to bridge the gap between depth data and classical\ncomputer vision. Source code is available here:\nhttps://rlsch.github.io/depth-flexion-conversion.\n","authors":["Robert Lösch","Mark Sastuba","Jonas Toth","Bernhard Jung"],"pdf_url":"https://arxiv.org/pdf/2310.14924v1.pdf","comment":"to be published in IROS 2023 conference proceedings"},{"id":"http://arxiv.org/abs/2310.14919v1","updated":"2023-10-23T13:24:21Z","published":"2023-10-23T13:24:21Z","title":"GRLib: An Open-Source Hand Gesture Detection and Recognition Python\n Library","summary":" Hand gesture recognition systems provide a natural way for humans to interact\nwith computer systems. Although various algorithms have been designed for this\ntask, a host of external conditions, such as poor lighting or distance from the\ncamera, make it difficult to create an algorithm that performs well across a\nrange of environments. In this work, we present GRLib: an open-source Python\nlibrary able to detect and classify static and dynamic hand gestures. Moreover,\nthe library can be trained on existing data for improved classification\nrobustness. The proposed solution utilizes a feed from an RGB camera. The\nretrieved frames are then subjected to data augmentation and passed on to\nMediaPipe Hands to perform hand landmark detection. The landmarks are then\nclassified into their respective gesture class. The library supports dynamic\nhand gestures through trajectories and keyframe extraction. It was found that\nthe library outperforms another publicly available HGR system - MediaPipe\nSolutions, on three diverse, real-world datasets. 
The library is available at\nhttps://github.com/mikhail-vlasenko/grlib and can be installed with pip.\n","authors":["Jan Warchocki","Mikhail Vlasenko","Yke Bauke Eisma"],"pdf_url":"https://arxiv.org/pdf/2310.14919v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14914v1","updated":"2023-10-23T13:21:44Z","published":"2023-10-23T13:21:44Z","title":"Object Pose Estimation Annotation Pipeline for Multi-view Monocular\n Camera Systems in Industrial Settings","summary":" Object localization, and more specifically object pose estimation, in large\nindustrial spaces such as warehouses and production facilities, is essential\nfor material flow operations. Traditional approaches rely on artificial\nartifacts installed in the environment or excessively expensive equipment, that\nis not suitable at scale. A more practical approach is to utilize existing\ncameras in such spaces in order to address the underlying pose estimation\nproblem and to localize objects of interest. In order to leverage\nstate-of-the-art methods in deep learning for object pose estimation, large\namounts of data need to be collected and annotated. In this work, we provide an\napproach to the annotation of large datasets of monocular images without the\nneed for manual labor. Our approach localizes cameras in space, unifies their\nlocation with a motion capture system, and uses a set of linear mappings to\nproject 3D models of objects of interest at their ground truth 6D pose\nlocations. We test our pipeline on a custom dataset collected from a system of\neight cameras in an industrial setting that mimics the intended area of\noperation. 
Our approach was able to provide consistent quality annotations for\nour dataset with 26,482 object instances at a fraction of the time required by\nhuman annotators.\n","authors":["Hazem Youssef","Frederik Polachowski","Jérôme Rutinowski","Moritz Roidl","Christopher Reining"],"pdf_url":"https://arxiv.org/pdf/2310.14914v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14907v1","updated":"2023-10-23T13:16:51Z","published":"2023-10-23T13:16:51Z","title":"Orientation-Aware Leg Movement Learning for Action-Driven Human Motion\n Prediction","summary":" The task of action-driven human motion prediction aims to forecast future\nhuman motion from the observed sequence while respecting the given action\nlabel. It requires modeling not only the stochasticity within human motion but also\nthe smooth yet realistic transition between multiple action labels. However,\nthe fact that most of the datasets do not contain such transition data\ncomplicates this task. Existing work tackles this issue by learning a\nsmoothness prior to simply promote smooth transitions, yet doing so can result\nin unnatural transitions especially when the history and predicted motions\ndiffer significantly in orientations. In this paper, we argue that valid human\nmotion transitions should incorporate realistic leg movements to handle\norientation changes, and cast it as an action-conditioned in-betweening (ACB)\nlearning task to encourage transition naturalness. Because modeling all\npossible transitions is virtually unreasonable, our ACB is only performed on\nvery few selected action classes with active gait motions, such as Walk or Run.\nSpecifically, we follow a two-stage forecasting strategy by first employing the\nmotion diffusion model to generate the target motion with a specified future\naction, and then producing the in-betweening to smoothly connect the\nobservation and prediction to eventually address motion prediction. 
Our method\nis completely free from the labeled motion transition data during training. To\nshow the robustness of our approach, we generalize our trained in-betweening\nlearning model on one dataset to two unseen large-scale motion datasets to\nproduce natural transitions. Extensive methods on three benchmark datasets\ndemonstrate that our method yields the state-of-the-art performance in terms of\nvisual quality, prediction accuracy, and action faithfulness.\n","authors":["Chunzhi Gu","Chao Zhang","Shigeru Kuriyama"],"pdf_url":"https://arxiv.org/pdf/2310.14907v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04086v2","updated":"2023-10-23T12:49:08Z","published":"2023-06-07T01:14:16Z","title":"TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for\n Medical Image Segmentation","summary":" The hybrid architecture of convolution neural networks (CNN) and Transformer\nhas been the most popular method for medical image segmentation. However, the\nexisting networks based on the hybrid architecture suffer from two problems.\nFirst, although the CNN branch can capture image local features by using\nconvolution operation, the vanilla convolution is unable to achieve adaptive\nextraction of image features. Second, although the Transformer branch can model\nthe global information of images, the conventional self-attention only focuses\non the spatial self-attention of images and ignores the channel and\ncross-dimensional self-attention leading to low segmentation accuracy for\nmedical images with complex backgrounds. To solve these problems, we propose\nvision Transformer embrace convolutional neural networks for medical image\nsegmentation (TEC-Net). Our network has two advantages. 
First, dynamic\ndeformable convolution (DDConv) is designed in the CNN branch, which not only\novercomes the difficulty of adaptive feature extraction using fixed-size\nconvolution kernels, but also solves the defect that different inputs share the\nsame convolution kernel parameters, effectively improving the feature\nexpression ability of CNN branch. Second, in the Transformer branch, a\n(shifted)-window adaptive complementary attention module ((S)W-ACAM) and\ncompact convolutional projection are designed to enable the network to fully\nlearn the cross-dimensional long-range dependency of medical images with few\nparameters and calculations. Experimental results show that the proposed\nTEC-Net provides better medical image segmentation results than SOTA methods\nincluding CNN and Transformer networks. In addition, our TEC-Net requires fewer\nparameters and computational costs and does not rely on pre-training. The code\nis publicly available at https://github.com/SR0920/TEC-Net.\n","authors":["Tao Lei","Rui Sun","Weichuan Zhang","Yong Wan","Yong Xia","Asoke K. Nandi"],"pdf_url":"https://arxiv.org/pdf/2306.04086v2.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2306.03373"},{"id":"http://arxiv.org/abs/2306.01567v2","updated":"2023-10-23T12:40:57Z","published":"2023-06-02T14:23:59Z","title":"Segment Anything in High Quality","summary":" The recent Segment Anything Model (SAM) represents a big leap in scaling up\nsegmentation models, allowing for powerful zero-shot capabilities and flexible\nprompting. Despite being trained with 1.1 billion masks, SAM's mask prediction\nquality falls short in many cases, particularly when dealing with objects that\nhave intricate structures. We propose HQ-SAM, equipping SAM with the ability to\naccurately segment any object, while maintaining SAM's original promptable\ndesign, efficiency, and zero-shot generalizability. 
Our careful design reuses\nand preserves the pre-trained model weights of SAM, while only introducing\nminimal additional parameters and computation. We design a learnable\nHigh-Quality Output Token, which is injected into SAM's mask decoder and is\nresponsible for predicting the high-quality mask. Instead of only applying it\non mask-decoder features, we first fuse them with early and final ViT features\nfor improved mask details. To train our introduced learnable parameters, we\ncompose a dataset of 44K fine-grained masks from several sources. HQ-SAM is\nonly trained on the introduced dataset of 44k masks, which takes only 4 hours\non 8 GPUs. We show the efficacy of HQ-SAM in a suite of 10 diverse segmentation\ndatasets across different downstream tasks, where 8 out of them are evaluated\nin a zero-shot transfer protocol. Our code and pretrained models are at\nhttps://github.com/SysCV/SAM-HQ.\n","authors":["Lei Ke","Mingqiao Ye","Martin Danelljan","Yifan Liu","Yu-Wing Tai","Chi-Keung Tang","Fisher Yu"],"pdf_url":"https://arxiv.org/pdf/2306.01567v2.pdf","comment":"NeurIPS 2023. We propose HQ-SAM to upgrade SAM for high-quality\n zero-shot segmentation. Github: https://github.com/SysCV/SAM-HQ"},{"id":"http://arxiv.org/abs/2305.18287v2","updated":"2023-10-23T12:32:47Z","published":"2023-05-29T17:56:35Z","title":"LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and\n Unlabeled Image Collections","summary":" Recently, large-scale pre-trained Vision and Language (VL) models have set a\nnew state-of-the-art (SOTA) in zero-shot visual classification enabling\nopen-vocabulary recognition of potentially unlimited set of categories defined\nas simple language prompts. However, despite these great advances, the\nperformance of these zero-shot classifiers still falls short of the results of\ndedicated (closed category set) classifiers trained with supervised fine\ntuning. 
In this paper we show, for the first time, how to reduce this gap\nwithout any labels and without any paired VL data, using an unlabeled image\ncollection and a set of texts auto-generated using a Large Language Model (LLM)\ndescribing the categories of interest and effectively substituting labeled\nvisual instances of those categories. Using our label-free approach, we are\nable to attain significant performance improvements over the zero-shot\nperformance of the base VL model and other contemporary methods and baselines\non a wide variety of datasets, demonstrating absolute improvement of up to\n11.7% (3.8% on average) in the label-free setting. Moreover, despite our\napproach being label-free, we observe 1.3% average gains over leading few-shot\nprompting baselines that do use 5-shot supervision.\n","authors":["M. Jehanzeb Mirza","Leonid Karlinsky","Wei Lin","Mateusz Kozinski","Horst Possegger","Rogerio Feris","Horst Bischof"],"pdf_url":"https://arxiv.org/pdf/2305.18287v2.pdf","comment":"NeurIPS 2023 (Camera Ready) - Project Page:\n https://jmiemirza.github.io/LaFTer/"},{"id":"http://arxiv.org/abs/2310.12020v2","updated":"2023-10-23T12:29:33Z","published":"2023-10-18T14:53:14Z","title":"LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic\n Tabletop Manipulation","summary":" The convergence of embodied agents and large language models (LLMs) has\nbrought significant advancements to embodied instruction following.\nParticularly, the strong reasoning capabilities of LLMs make it possible for\nrobots to perform long-horizon tasks without expensive annotated\ndemonstrations. However, public benchmarks for testing the long-horizon\nreasoning capabilities of language-conditioned robots in various scenarios are\nstill missing. 
To fill this gap, this work focuses on the tabletop manipulation\ntask and releases a simulation benchmark, \\textit{LoHoRavens}, which covers\nvarious long-horizon reasoning aspects spanning color, size, space, arithmetics\nand reference. Furthermore, there is a key modality bridging problem for\nlong-horizon manipulation tasks with LLMs: how to incorporate the observation\nfeedback during robot execution for the LLM's closed-loop planning, which is\nhowever less studied by prior work. We investigate two methods of bridging the\nmodality gap: caption generation and learnable interface for incorporating\nexplicit and implicit observation feedback to the LLM, respectively. These\nmethods serve as the two baselines for our proposed benchmark. Experiments show\nthat both methods struggle to solve some tasks, indicating long-horizon\nmanipulation tasks are still challenging for current popular models. We expect\nthe proposed public benchmark and baselines can help the community develop\nbetter models for long-horizon tabletop manipulation tasks.\n","authors":["Shengqiang Zhang","Philipp Wicke","Lütfi Kerem Şenel","Luis Figueredo","Abdeldjallil Naceri","Sami Haddadin","Barbara Plank","Hinrich Schütze"],"pdf_url":"https://arxiv.org/pdf/2310.12020v2.pdf","comment":"6 pages, 4 figures. The video and code of LoHoRavens are available at\n https://cisnlp.github.io/lohoravens-webpage/"},{"id":"http://arxiv.org/abs/2310.14859v1","updated":"2023-10-23T12:29:10Z","published":"2023-10-23T12:29:10Z","title":"3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for\n Embodied Turn-Taking Prediction","summary":" Predicting turn-taking in multiparty conversations has many practical\napplications in human-computer/robot interaction. However, the complexity of\nhuman communication makes it a challenging task. 
Recent advances have shown\nthat synchronous multi-perspective egocentric data can significantly improve\nturn-taking prediction compared to asynchronous, single-perspective\ntranscriptions. Building on this research, we propose a new multimodal\ntransformer-based architecture for predicting turn-taking in embodied,\nsynchronized multi-perspective data. Our experimental results on the recently\nintroduced EgoCom dataset show a substantial performance improvement of up to\n14.01% on average compared to existing baselines and alternative\ntransformer-based approaches. The source code, and the pre-trained models of\nour 3T-Transformer will be available upon acceptance.\n","authors":["Mehdi Fatan","Emanuele Mincato","Dimitra Pintzou","Mariella Dimiccoli"],"pdf_url":"https://arxiv.org/pdf/2310.14859v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12795v2","updated":"2023-10-23T12:21:05Z","published":"2023-06-22T10:53:10Z","title":"Learning Unseen Modality Interaction","summary":" Multimodal learning assumes all modality combinations of interest are\navailable during training to learn cross-modal correspondences. In this paper,\nwe challenge this modality-complete assumption for multimodal learning and\ninstead strive for generalization to unseen modality combinations during\ninference. We pose the problem of unseen modality interaction and introduce a\nfirst solution. It exploits a module that projects the multidimensional\nfeatures of different modalities into a common space with rich information\npreserved. This allows the information to be accumulated with a simple\nsummation operation across available modalities. To reduce overfitting to less\ndiscriminative modality combinations during training, we further improve the\nmodel learning with pseudo-supervision indicating the reliability of a\nmodality's prediction. 
We demonstrate that our approach is effective for\ndiverse tasks and modalities by evaluating it for multimodal video\nclassification, robot state regression, and multimedia retrieval. Project\nwebsite: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.\n","authors":["Yunhua Zhang","Hazel Doughty","Cees G. M. Snoek"],"pdf_url":"https://arxiv.org/pdf/2306.12795v2.pdf","comment":"Published at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.13856v3","updated":"2023-10-23T12:20:56Z","published":"2023-06-24T04:11:31Z","title":"Learning-to-Rank Meets Language: Boosting Language-Driven Ordering\n Alignment for Ordinal Classification","summary":" We present a novel language-driven ordering alignment method for ordinal\nclassification. The labels in ordinal classification contain additional\nordering relations, making them prone to overfitting when relying solely on\ntraining data. Recent developments in pre-trained vision-language models\ninspire us to leverage the rich ordinal priors in human language by converting\nthe original task into a vision-language alignment task. Consequently, we\npropose L2RCLIP, which fully utilizes the language priors from two\nperspectives. First, we introduce a complementary prompt tuning technique\ncalled RankFormer, designed to enhance the ordering relation of original rank\nprompts. It employs token-level attention with residual-style prompt blending\nin the word embedding space. Second, to further incorporate language priors, we\nrevisit the approximate bound optimization of vanilla cross-entropy loss and\nrestructure it within the cross-modal embedding space. 
Consequently, we propose\na cross-modal ordinal pairwise loss to refine the CLIP feature space, where\ntexts and images maintain both semantic alignment and ordering alignment.\nExtensive experiments on three ordinal classification tasks, including facial\nage estimation, historical color image (HCI) classification, and aesthetic\nassessment demonstrate its promising performance. The code is available at\nhttps://github.com/raywang335/L2RCLIP.\n","authors":["Rui Wang","Peipei Li","Huaibo Huang","Chunshui Cao","Ran He","Zhaofeng He"],"pdf_url":"https://arxiv.org/pdf/2306.13856v3.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.14839v1","updated":"2023-10-23T12:01:10Z","published":"2023-10-23T12:01:10Z","title":"ESVAE: An Efficient Spiking Variational Autoencoder with\n Reparameterizable Poisson Spiking Sampling","summary":" In recent years, studies on image generation models of spiking neural\nnetworks (SNNs) have gained the attention of many researchers. Variational\nautoencoders (VAEs), as one of the most popular image generation models, have\nattracted a lot of work exploring their SNN implementation. Due to the\nconstrained binary representation in SNNs, existing SNN VAE methods implicitly\nconstruct the latent space by an elaborated autoregressive network and use the\nnetwork outputs as the sampling variables. However, this unspecified implicit\nrepresentation of the latent space will increase the difficulty of generating\nhigh-quality images and introduces additional network parameters. In this\npaper, we propose an efficient spiking variational autoencoder (ESVAE) that\nconstructs an interpretable latent space distribution and design a\nreparameterizable spiking sampling method. Specifically, we construct the prior\nand posterior of the latent space as a Poisson distribution using the firing\nrate of the spiking neurons. 
Subsequently, we propose a reparameterizable\nPoisson spiking sampling method, which is free from the additional network.\nComprehensive experiments have been conducted, and the experimental results\nshow that the proposed ESVAE outperforms previous SNN VAE methods in\nreconstructed & generated images quality. In addition, experiments demonstrate\nthat ESVAE's encoder is able to retain the original image information more\nefficiently, and the decoder is more robust. The source code is available at\nhttps://github.com/QgZhan/ESVAE.\n","authors":["Qiugang Zhan","Xiurui Xie","Guisong Liu","Malu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14839v1.pdf","comment":"11 pages, 13 figures"},{"id":"http://arxiv.org/abs/2309.13524v3","updated":"2023-10-23T11:42:33Z","published":"2023-09-24T02:10:25Z","title":"Global-correlated 3D-decoupling Transformer for Clothed Avatar\n Reconstruction","summary":" Reconstructing 3D clothed human avatars from single images is a challenging\ntask, especially when encountering complex poses and loose clothing. Current\nmethods exhibit limitations in performance, largely attributable to their\ndependence on insufficient 2D image features and inconsistent query methods.\nOwing to this, we present the Global-correlated 3D-decoupling Transformer for\nclothed Avatar reconstruction (GTA), a novel transformer-based architecture\nthat reconstructs clothed human avatars from monocular images. Our approach\nleverages transformer architectures by utilizing a Vision Transformer model as\nan encoder for capturing global-correlated image features. Subsequently, our\ninnovative 3D-decoupling decoder employs cross-attention to decouple tri-plane\nfeatures, using learnable embeddings as queries for cross-plane generation. 
To\neffectively enhance feature fusion with the tri-plane 3D feature and human body\nprior, we propose a hybrid prior fusion strategy combining spatial and\nprior-enhanced queries, leveraging the benefits of spatial localization and\nhuman body prior knowledge. Comprehensive experiments on CAPE and THuman2.0\ndatasets illustrate that our method outperforms state-of-the-art approaches in\nboth geometry and texture reconstruction, exhibiting high robustness to\nchallenging poses and loose clothing, and producing higher-resolution textures.\nCodes will be available at https://github.com/River-Zhang/GTA.\n","authors":["Zechuan Zhang","Li Sun","Zongxin Yang","Ling Chen","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2309.13524v3.pdf","comment":"Accepted by NeurIPS 2023. Update appendix. Project page:\n https://river-zhang.github.io/GTA-projectpage/"},{"id":"http://arxiv.org/abs/2310.14815v1","updated":"2023-10-23T11:30:54Z","published":"2023-10-23T11:30:54Z","title":"Deep learning denoiser assisted roughness measurements extraction from\n thin resists with low Signal-to-Noise Ratio(SNR) SEM images: analysis with\n SMILE","summary":" The technological advance of High Numerical Aperture Extreme Ultraviolet\nLithography (High NA EUVL) has opened the gates to extensive researches on\nthinner photoresists (below 30nm), necessary for the industrial implementation\nof High NA EUVL. Consequently, images from Scanning Electron Microscopy (SEM)\nsuffer from reduced imaging contrast and low Signal-to-Noise Ratio (SNR),\nimpacting the measurement of unbiased Line Edge Roughness (uLER) and Line Width\nRoughness (uLWR). Thus, the aim of this work is to enhance the SNR of SEM\nimages by using a Deep Learning denoiser and enable robust roughness extraction\nof the thin resist. 
For this study, we acquired SEM images of Line-Space (L/S)\npatterns with a Chemically Amplified Resist (CAR) with different thicknesses\n(15nm, 20nm, 25nm, 30nm), underlayers (Spin-On-Glass-SOG, Organic\nUnderlayer-OUL) and frames of averaging (4, 8, 16, 32, and 64 Fr). After\ndenoising, a systematic analysis has been carried out on both noisy and\ndenoised images using an open-source metrology software, SMILE 2.3.2, for\ninvestigating mean CD, SNR improvement factor, biased and unbiased LWR/LER\nPower Spectral Density (PSD). Denoised images with lower number of frames\npresent unaltered Critical Dimensions (CDs), enhanced SNR (especially for low\nnumber of integration frames), and accurate measurements of uLER and uLWR, with\nthe same accuracy as for noisy images with a consistent higher number of\nframes. Therefore, images with a small number of integration frames and with\nSNR < 2 can be successfully denoised, and advantageously used in improving\nmetrology throughput while maintaining reliable roughness measurements for the\nthin resist.\n","authors":["Sara Sacchi","Bappaditya Dey","Iacopo Mochi","Sandip Halder","Philippe Leray"],"pdf_url":"https://arxiv.org/pdf/2310.14815v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14804v1","updated":"2023-10-23T10:59:21Z","published":"2023-10-23T10:59:21Z","title":"Large Language Models can Share Images, Too!","summary":" This paper explores the image-sharing capability of Large Language Models\n(LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting,\nwithout the help of visual foundation models. Inspired by the two-stage process\nof image-sharing in human dialogues, we propose a two-stage framework that\nallows LLMs to predict potential image-sharing turns and generate related image\ndescriptions using our effective restriction-based prompt template. 
With\nextensive experiments, we unlock the \\textit{image-sharing} capability of LLMs\nin zero-shot prompting, with GPT-4 achieving the best performance.\nAdditionally, we uncover the emergent \\textit{image-sharing} ability in\nzero-shot prompting, demonstrating the effectiveness of restriction-based\nprompts in both stages of our framework. Based on this framework, we augment\nthe PhotoChat dataset with images generated by Stable Diffusion at predicted\nturns, namely PhotoChat++. To our knowledge, this is the first study to assess\nthe image-sharing ability of LLMs in a zero-shot setting without visual\nfoundation models. The source code and the dataset will be released after\npublication.\n","authors":["Young-Jun Lee","Jonghwan Hyeon","Ho-Jin Choi"],"pdf_url":"https://arxiv.org/pdf/2310.14804v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14802v1","updated":"2023-10-23T10:58:09Z","published":"2023-10-23T10:58:09Z","title":"DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye\n Movement for Machine Reading","summary":" The use of visually-rich documents (VRDs) in various fields has created a\ndemand for Document AI models that can read and comprehend documents like\nhumans, which requires the overcoming of technical, linguistic, and cognitive\nbarriers. Unfortunately, the lack of appropriate datasets has significantly\nhindered advancements in the field. To address this issue, we introduce\n\\textsc{DocTrack}, a VRD dataset really aligned with human eye-movement\ninformation using eye-tracking technology. This dataset can be used to\ninvestigate the challenges mentioned above. Additionally, we explore the impact\nof human reading order on document understanding tasks and examine what would\nhappen if a machine reads in the same order as a human. 
Our results suggest\nthat although Document AI models have made significant progress, they still\nhave a long way to go before they can read VRDs as accurately, continuously,\nand flexibly as humans do. These findings have potential implications for\nfuture research and development of Document AI models. The data is available at\n\\url{https://github.com/hint-lab/doctrack}.\n","authors":["Hao Wang","Qingxuan Wang","Yue Li","Changqing Wang","Chenhui Chu","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14802v1.pdf","comment":"14 pages, 8 figures, Accepted by Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.14785v1","updated":"2023-10-23T10:37:22Z","published":"2023-10-23T10:37:22Z","title":"Vision-Enhanced Semantic Entity Recognition in Document Images via\n Visually-Asymmetric Consistency Learning","summary":" Extracting meaningful entities belonging to predefined categories from\nVisually-rich Form-like Documents (VFDs) is a challenging task. Visual and\nlayout features such as font, background, color, and bounding box location and\nsize provide important cues for identifying entities of the same type. However,\nexisting models commonly train a visual encoder with weak cross-modal\nsupervision signals, resulting in a limited capacity to capture these\nnon-textual features and suboptimal performance. In this paper, we propose a\nnovel \\textbf{V}isually-\\textbf{A}symmetric co\\textbf{N}sisten\\textbf{C}y\n\\textbf{L}earning (\\textsc{Vancl}) approach that addresses the above limitation\nby enhancing the model's ability to capture fine-grained visual and layout\nfeatures through the incorporation of color priors. Experimental results on\nbenchmark datasets show that our approach substantially outperforms the strong\nLayoutLM series baseline, demonstrating the effectiveness of our approach.\nAdditionally, we investigate the effects of different color schemes on our\napproach, providing insights for optimizing model performance. 
We believe our\nwork will inspire future research on multimodal information extraction.\n","authors":["Hao Wang","Xiahua Chen","Rui Wang","Chenhui Chu"],"pdf_url":"https://arxiv.org/pdf/2310.14785v1.pdf","comment":"14 pages, 6 figures, Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2305.14740v2","updated":"2023-10-23T10:35:30Z","published":"2023-05-24T05:21:13Z","title":"ECHo: A Visio-Linguistic Dataset for Event Causality Inference via\n Human-Centric Reasoning","summary":" We introduce ECHo (Event Causality Inference via Human-Centric Reasoning), a\ndiagnostic dataset of event causality inference grounded in visio-linguistic\nsocial scenarios. ECHo employs real-world human-centric deductive information\nbuilding on a television crime drama. ECHo requires the Theory-of-Mind (ToM)\nability to understand and reason about social interactions based on multimodal\ninformation. Using ECHo, we propose a unified Chain-of-Thought (CoT) framework\nto assess the reasoning capability of current AI systems. Our ToM-enhanced CoT\npipeline accommodates various large foundation models in both zero-shot and\nfew-shot visio-linguistic reasoning. We use this framework to scrutinize recent\nlarge foundation models such as InstructGPT and MiniGPT-4 on three diagnostic\nhuman-centric tasks. Further analysis demonstrates ECHo as a challenging\ndataset to expose imperfections and inconsistencies in reasoning. Our data and\ncode are publicly available at https://github.com/YuxiXie/ECHo.\n","authors":["Yuxi Xie","Guanzhen Li","Min-Yen Kan"],"pdf_url":"https://arxiv.org/pdf/2305.14740v2.pdf","comment":"Findings of EMNLP 2023. 
10 pages, 6 figures, 5 tables (22 pages, 8\n figures, 15 tables including references and appendices)"},{"id":"http://arxiv.org/abs/2310.10975v2","updated":"2023-10-23T10:34:34Z","published":"2023-10-17T03:42:12Z","title":"NICE: Improving Panoptic Narrative Detection and Segmentation with\n Cascading Collaborative Learning","summary":" Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging\ntasks that involve identifying and locating multiple targets in an image\naccording to a long narrative description. In this paper, we propose a unified\nand effective framework called NICE that can jointly learn these two panoptic\nnarrative recognition tasks. Existing visual grounding tasks use a two-branch\nparadigm, but applying this directly to PND and PNS can result in prediction\nconflict due to their intrinsic many-to-many alignment property. To address\nthis, we introduce two cascading modules based on the barycenter of the mask,\nwhich are Coordinate Guided Aggregation (CGA) and Barycenter Driven\nLocalization (BDL), responsible for segmentation and detection, respectively.\nBy linking PNS and PND in series with the barycenter of segmentation as the\nanchor, our approach naturally aligns the two tasks and allows them to\ncomplement each other for improved performance. Specifically, CGA provides the\nbarycenter as a reference for detection, reducing BDL's reliance on a large\nnumber of candidate boxes. BDL leverages its excellent properties to\ndistinguish different instances, which improves the performance of CGA for\nsegmentation. Extensive experiments demonstrate that NICE surpasses all\nexisting methods by a large margin, achieving 4.1% for PND and 2.9% for PNS\nover the state-of-the-art. These results validate the effectiveness of our\nproposed collaborative learning strategy. 
The project of this work is made\npublicly available at https://github.com/Mr-Neko/NICE.\n","authors":["Haowei Wang","Jiayi Ji","Tianyu Guo","Yilong Yang","Yiyi Zhou","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2310.10975v2.pdf","comment":"18 pages. 9 figures, 9 tables"},{"id":"http://arxiv.org/abs/2306.12045v4","updated":"2023-10-23T10:30:04Z","published":"2023-06-21T06:30:18Z","title":"Temporal Conditioning Spiking Latent Variable Models of the Neural\n Response to Natural Visual Scenes","summary":" Developing computational models of neural response is crucial for\nunderstanding sensory processing and neural computations. Current\nstate-of-the-art neural network methods use temporal filters to handle temporal\ndependencies, resulting in an unrealistic and inflexible processing paradigm.\nMeanwhile, these methods target trial-averaged firing rates and fail to capture\nimportant features in spike trains. This work presents the temporal\nconditioning spiking latent variable models (TeCoS-LVM) to simulate the neural\nresponse to natural visual stimuli. We use spiking neurons to produce spike\noutputs that directly match the recorded trains. This approach helps to avoid\nlosing information embedded in the original spike trains. We exclude the\ntemporal dimension from the model parameter space and introduce a temporal\nconditioning operation to allow the model to adaptively explore and exploit\ntemporal dependencies in stimuli sequences in a {\\it natural paradigm}. We show\nthat TeCoS-LVM models can produce more realistic spike activities and\naccurately fit spike statistics than powerful alternatives. Additionally,\nlearned TeCoS-LVM models can generalize well to longer time scales. Overall,\nwhile remaining computationally tractable, our model effectively captures key\nfeatures of neural coding systems. 
It thus provides a useful tool for building\naccurate predictive computational accounts for various sensory perception\ncircuits.\n","authors":["Gehua Ma","Runhao Jiang","Rui Yan","Huajin Tang"],"pdf_url":"https://arxiv.org/pdf/2306.12045v4.pdf","comment":"Accepted at NeurIPS 2023. 22 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.13479v2","updated":"2023-10-23T09:42:29Z","published":"2023-10-20T13:20:17Z","title":"Segment, Select, Correct: A Framework for Weakly-Supervised Referring\n Segmentation","summary":" Referring Image Segmentation (RIS) - the problem of identifying objects in\nimages through natural language sentences - is a challenging task currently\nmostly solved through supervised learning. However, while collecting referred\nannotation masks is a time-consuming process, the few existing\nweakly-supervised and zero-shot approaches fall significantly short in\nperformance compared to fully-supervised learning ones. To bridge the\nperformance gap without mask annotations, we propose a novel weakly-supervised\nframework that tackles RIS by decomposing it into three steps: obtaining\ninstance masks for the object mentioned in the referencing instruction\n(segment), using zero-shot learning to select a potentially correct mask for\nthe given instruction (select), and bootstrapping a model which allows for\nfixing the mistakes of zero-shot selection (correct). In our experiments, using\nonly the first two steps (zero-shot segment and select) outperforms other\nzero-shot baselines by as much as 19%, while our full method improves upon this\nmuch stronger baseline and sets the new state-of-the-art for weakly-supervised\nRIS, reducing the gap between the weakly-supervised and fully-supervised\nmethods in some cases from around 33% to as little as 14%. Code is available at\nhttps://github.com/fgirbal/segment-select-correct.\n","authors":["Francisco Eiras","Kemal Oksuz","Adel Bibi","Philip H. S. Torr","Puneet K. 
Dokania"],"pdf_url":"https://arxiv.org/pdf/2310.13479v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14736v1","updated":"2023-10-23T09:16:04Z","published":"2023-10-23T09:16:04Z","title":"SAMCLR: Contrastive pre-training on complex scenes using SAM for view\n sampling","summary":" In Computer Vision, self-supervised contrastive learning enforces similar\nrepresentations between different views of the same image. The pre-training is\nmost often performed on image classification datasets, like ImageNet, where\nimages mainly contain a single class of objects. However, when dealing with\ncomplex scenes with multiple items, it becomes very unlikely for several views\nof the same image to represent the same object category. In this setting, we\npropose SAMCLR, an add-on to SimCLR which uses SAM to segment the image into\nsemantic regions, then sample the two views from the same region. Preliminary\nresults show empirically that when pre-training on Cityscapes and ADE20K, then\nevaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs\nat least on par with, and most often significantly outperforms not only SimCLR,\nbut also DINO and MoCo.\n","authors":["Benjamin Missaoui","Chongbin Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.14736v1.pdf","comment":"Preprint, under review"},{"id":"http://arxiv.org/abs/2310.14729v1","updated":"2023-10-23T09:05:18Z","published":"2023-10-23T09:05:18Z","title":"MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D\n diffusion","summary":" We introduce Multi-view Ancestral Sampling (MAS), a method for generating\nconsistent multi-view 2D samples of a motion sequence, enabling the creation of\nits 3D counterpart. MAS leverages a diffusion model trained solely on 2D data,\nopening opportunities to exciting and diverse fields of motion previously\nunder-explored as 3D data is scarce and hard to collect. 
MAS works by\nsimultaneously denoising multiple 2D motion sequences representing the same\nmotion from different angles. Our consistency block ensures consistency across\nall views at each diffusion step by combining the individual generations into a\nunified 3D sequence, and projecting it back to the original views for the next\niteration. We demonstrate MAS on 2D pose data acquired from videos depicting\nprofessional basketball maneuvers, rhythmic gymnastic performances featuring a\nball apparatus, and horse obstacle course races. In each of these domains, 3D\nmotion capture is arduous, and yet, MAS generates diverse and realistic 3D\nsequences without textual conditioning. As we demonstrate, our ancestral\nsampling-based approach offers a more natural integration with the diffusion\nframework compared to popular denoising optimization-based approaches, and\navoids common issues such as out-of-domain sampling, lack of details and\nmode-collapse. https://guytevet.github.io/mas-page/\n","authors":["Roy Kapon","Guy Tevet","Daniel Cohen-Or","Amit H. Bermano"],"pdf_url":"https://arxiv.org/pdf/2310.14729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01830v3","updated":"2023-10-23T08:57:58Z","published":"2023-10-03T06:55:19Z","title":"AI-Generated Images as Data Source: The Dawn of Synthetic Era","summary":" The advancement of visual intelligence is intrinsically tethered to the\navailability of large-scale data. In parallel, generative Artificial\nIntelligence (AI) has unlocked the potential to create synthetic images that\nclosely resemble real-world photographs. This prompts a compelling inquiry: how\nmuch visual intelligence could benefit from the advance of generative AI? This\npaper explores the innovative concept of harnessing these AI-generated images\nas new data sources, reshaping traditional modeling paradigms in visual\nintelligence. 
In contrast to real data, AI-generated data exhibit remarkable\nadvantages, including unmatched abundance and scalability, the rapid generation\nof vast datasets, and the effortless simulation of edge cases. Built on the\nsuccess of generative AI models, we examine the potential of their generated\ndata in a range of applications, from training machine learning models to\nsimulating scenarios for computational modeling, testing, and validation. We\nprobe the technological foundations that support this groundbreaking use of\ngenerative AI, engaging in an in-depth discussion on the ethical, legal, and\npractical considerations that accompany this transformative paradigm shift.\nThrough an exhaustive survey of current technologies and applications, this\npaper presents a comprehensive view of the synthetic era in visual\nintelligence. A project associated with this paper can be found at\nhttps://github.com/mwxely/AIGS .\n","authors":["Zuhao Yang","Fangneng Zhan","Kunhao Liu","Muyu Xu","Shijian Lu"],"pdf_url":"https://arxiv.org/pdf/2310.01830v3.pdf","comment":"20 pages, 11 figures"},{"id":"http://arxiv.org/abs/2310.14718v1","updated":"2023-10-23T08:55:10Z","published":"2023-10-23T08:55:10Z","title":"Rethinking Scale Imbalance in Semi-supervised Object Detection for\n Aerial Images","summary":" This paper focuses on the scale imbalance problem of semi-supervised object\ndetection(SSOD) in aerial images. Compared to natural images, objects in aerial\nimages show smaller sizes and larger quantities per image, increasing the\ndifficulty of manual annotation. Meanwhile, the advanced SSOD technique can\ntrain superior detectors by leveraging limited labeled data and massive\nunlabeled data, saving annotation costs. However, as an understudied task in\naerial images, SSOD suffers from a drastic performance drop when facing a large\nproportion of small objects. 
By analyzing the predictions between small and\nlarge objects, we identify three imbalance issues caused by the scale bias,\ni.e., pseudo-label imbalance, label assignment imbalance, and negative learning\nimbalance. To tackle these issues, we propose a novel Scale-discriminative\nSemi-Supervised Object Detection (S^3OD) learning pipeline for aerial images.\nIn our S^3OD, three key components, Size-aware Adaptive Thresholding (SAT),\nSize-rebalanced Label Assignment (SLA), and Teacher-guided Negative Learning\n(TNL), are proposed to warrant scale unbiased learning. Specifically, SAT\nadaptively selects appropriate thresholds to filter pseudo-labels for objects\nat different scales. SLA balances positive samples of objects at different\nscales through resampling and reweighting. TNL alleviates the imbalance in\nnegative samples by leveraging information generated by a teacher model.\nExtensive experiments conducted on the DOTA-v1.5 benchmark demonstrate the\nsuperiority of our proposed methods over state-of-the-art competitors. Codes\nwill be released soon.\n","authors":["Ruixiang Zhang","Chang Xu","Fang Xu","Wen Yang","Guangjun He","Huai Yu","Gui-Song Xia"],"pdf_url":"https://arxiv.org/pdf/2310.14718v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01210v2","updated":"2023-10-23T08:52:36Z","published":"2023-10-02T13:55:06Z","title":"Towards Robust Cardiac Segmentation using Graph Convolutional Networks","summary":" Fully automatic cardiac segmentation can be a fast and reproducible method to\nextract clinical measurements from an echocardiography examination. The U-Net\narchitecture is the current state-of-the-art deep learning architecture for\nmedical segmentation and can segment cardiac structures in real-time with\naverage errors comparable to inter-observer variability. However, this\narchitecture still generates large outliers that are often anatomically\nincorrect. 
This work uses the concept of graph convolutional neural networks\nthat predict the contour points of the structures of interest instead of\nlabeling each pixel. We propose a graph architecture that uses two\nconvolutional rings based on cardiac anatomy and show that this eliminates\nanatomical incorrect multi-structure segmentations on the publicly available\nCAMUS dataset. Additionally, this work contributes with an ablation study on\nthe graph convolutional architecture and an evaluation of clinical measurements\non the clinical HUNT4 dataset. Finally, we propose to use the inter-model\nagreement of the U-Net and the graph network as a predictor of both the input\nand segmentation quality. We show this predictor can detect out-of-distribution\nand unsuitable input images in real-time. Source code is available online:\nhttps://github.com/gillesvntnu/GCN_multistructure\n","authors":["Gilles Van De Vyver","Sarina Thomas","Guy Ben-Yosef","Sindre Hellum Olaisen","Håvard Dalen","Lasse Løvstakken","Erik Smistad"],"pdf_url":"https://arxiv.org/pdf/2310.01210v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12877v2","updated":"2023-10-23T08:45:53Z","published":"2023-10-19T16:32:18Z","title":"Perceptual Assessment and Optimization of High Dynamic Range Image\n Rendering","summary":" High dynamic range (HDR) imaging has gained increasing popularity for its\nability to faithfully reproduce the luminance levels in natural scenes.\nAccordingly, HDR image quality assessment (IQA) is crucial but has been\nsuperficially treated. The majority of existing IQA models are developed for\nand calibrated against low dynamic range (LDR) images, which have been shown to\nbe poorly correlated with human perception of HDR image quality. In this work,\nwe propose a family of HDR IQA models by transferring the recent advances in\nLDR IQA. 
The key step in our approach is to specify a simple inverse display\nmodel that decomposes an HDR image to a set of LDR images with different\nexposures, which will be assessed by existing LDR quality models. The local\nquality scores of each exposure are then aggregated with the help of a simple\nwell-exposedness measure into a global quality score for each exposure, which\nwill be further weighted across exposures to obtain the overall quality score.\nWhen assessing LDR images, the proposed HDR quality models reduce gracefully to\nthe original LDR ones with the same performance. Experiments on four\nhuman-rated HDR image datasets demonstrate that our HDR quality models are\nconsistently better than existing IQA methods, including the HDR-VDP family.\nMoreover, we demonstrate their strengths in perceptual optimization of HDR\nnovel view synthesis.\n","authors":["Peibei Cao","Rafal K. Mantiuk","Kede Ma"],"pdf_url":"https://arxiv.org/pdf/2310.12877v2.pdf","comment":"need more changes"},{"id":"http://arxiv.org/abs/2310.14702v1","updated":"2023-10-23T08:45:12Z","published":"2023-10-23T08:45:12Z","title":"BM2CP: Efficient Collaborative Perception with LiDAR-Camera Modalities","summary":" Collaborative perception enables agents to share complementary perceptual\ninformation with nearby agents. This would improve the perception performance\nand alleviate the issues of single-view perception, such as occlusion and\nsparsity. Most existing approaches mainly focus on single modality (especially\nLiDAR), and not fully exploit the superiority of multi-modal perception. We\npropose a collaborative perception paradigm, BM2CP, which employs LiDAR and\ncamera to achieve efficient multi-modal perception. 
It utilizes LiDAR-guided\nmodal fusion, cooperative depth generation and modality-guided intermediate\nfusion to acquire deep interactions among modalities of different agents,\nMoreover, it is capable to cope with the special case where one of the sensors,\nsame or different type, of any agent is missing. Extensive experiments validate\nthat our approach outperforms the state-of-the-art methods with 50X lower\ncommunication volumes in both simulated and real-world autonomous driving\nscenarios. Our code is available at https://github.com/byzhaoAI/BM2CP.\n","authors":["Binyu Zhao","Wei Zhang","Zhaonian Zou"],"pdf_url":"https://arxiv.org/pdf/2310.14702v1.pdf","comment":"14 pages, 8 figures. Accepted by CoRL 2023"},{"id":"http://arxiv.org/abs/2310.14700v1","updated":"2023-10-23T08:44:38Z","published":"2023-10-23T08:44:38Z","title":"Interaction-Driven Active 3D Reconstruction with Object Interiors","summary":" We introduce an active 3D reconstruction method which integrates visual\nperception, robot-object interaction, and 3D scanning to recover both the\nexterior and interior, i.e., unexposed, geometries of a target 3D object.\nUnlike other works in active vision which focus on optimizing camera viewpoints\nto better investigate the environment, the primary feature of our\nreconstruction is an analysis of the interactability of various parts of the\ntarget object and the ensuing part manipulation by a robot to enable scanning\nof occluded regions. As a result, an understanding of part articulations of the\ntarget object is obtained on top of complete geometry acquisition. Our method\noperates fully automatically by a Fetch robot with built-in RGBD sensors. It\niterates between interaction analysis and interaction-driven reconstruction,\nscanning and reconstructing detected moveable parts one at a time, where both\nthe articulated part detection and mesh reconstruction are carried out by\nneural networks. 
In the final step, all the remaining, non-articulated parts,\nincluding all the interior structures that had been exposed by prior part\nmanipulations and subsequently scanned, are reconstructed to complete the\nacquisition. We demonstrate the performance of our method via qualitative and\nquantitative evaluation, ablation studies, comparisons to alternatives, as well\nas experiments in a real environment.\n","authors":["Zihao Yan","Fubao Su","Mingyang Wang","Ruizhen Hu","Hao Zhang","Hui Huang"],"pdf_url":"https://arxiv.org/pdf/2310.14700v1.pdf","comment":"Accepted to SIGGRAPH Asia 2023, project page at\n https://vcc.tech/research/2023/InterRecon"},{"id":"http://arxiv.org/abs/2310.14695v1","updated":"2023-10-23T08:40:44Z","published":"2023-10-23T08:40:44Z","title":"CAwa-NeRF: Instant Learning of Compression-Aware NeRF Features","summary":" Modeling 3D scenes by volumetric feature grids is one of the promising\ndirections of neural approximations to improve Neural Radiance Fields (NeRF).\nInstant-NGP (INGP) introduced multi-resolution hash encoding from a lookup\ntable of trainable feature grids which enabled learning high-quality neural\ngraphics primitives in a matter of seconds. However, this improvement came at\nthe cost of higher storage size. In this paper, we address this challenge by\nintroducing instant learning of compression-aware NeRF features (CAwa-NeRF),\nthat allows exporting the zip compressed feature grids at the end of the model\ntraining with a negligible extra time overhead without changing neither the\nstorage architecture nor the parameters used in the original INGP paper.\nNonetheless, the proposed method is not limited to INGP but could also be\nadapted to any model. By means of extensive simulations, our proposed instant\nlearning pipeline can achieve impressive results on different kinds of static\nscenes such as single object masked background scenes and real-life scenes\ncaptured in our studio. 
In particular, for single object masked background\nscenes CAwa-NeRF compresses the feature grids down to 6% (1.2 MB) of the\noriginal size without any loss in the PSNR (33 dB) or down to 2.4% (0.53 MB)\nwith a slight virtual loss (32.31 dB).\n","authors":["Omnia Mahmoud","Théo Ladune","Matthieu Gendrin"],"pdf_url":"https://arxiv.org/pdf/2310.14695v1.pdf","comment":"10 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.14692v1","updated":"2023-10-23T08:32:50Z","published":"2023-10-23T08:32:50Z","title":"On Partial Shape Correspondence and Functional Maps","summary":" While dealing with matching shapes to their parts, we often utilize an\ninstrument known as functional maps. The idea is to translate the shape\nmatching problem into ``convenient'' spaces by which matching is performed\nalgebraically by solving a least squares problem. Here, we argue that such\nformulations, though popular in this field, introduce errors in the estimated\nmatch when partiality is invoked. Such errors are unavoidable even when\nconsidering advanced feature extraction networks, and they can be shown to\nescalate with increasing degrees of shape partiality, adversely affecting the\nlearning capability of such systems. To circumvent these limitations, we\npropose a novel approach for partial shape matching.\n Our study of functional maps led us to a novel method that establishes direct\ncorrespondence between partial and full shapes through feature matching\nbypassing the need for functional map intermediate spaces. The Gromov distance\nbetween metric spaces leads to the construction of the first part of our loss\nfunctions. For regularization we use two options: a term based on the area\npreserving property of the mapping, and a relaxed version of it without the\nneed to compute a functional map.\n The proposed approach shows superior performance on the SHREC'16 dataset,\noutperforming existing unsupervised methods for partial shape matching. 
In\nparticular, it achieves state-of-the-art result on the SHREC'16 HOLES\nbenchmark, superior also compared to supervised methods.\n","authors":["Amit Bracha","Thomas Dagès","Ron Kimmel"],"pdf_url":"https://arxiv.org/pdf/2310.14692v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.05895v2","updated":"2023-10-23T08:19:43Z","published":"2022-11-10T21:44:33Z","title":"Understanding ME? Multimodal Evaluation for Fine-grained Visual\n Commonsense","summary":" Visual commonsense understanding requires Vision Language (VL) models to not\nonly understand image and text but also cross-reference in-between to fully\nintegrate and achieve comprehension of the visual scene described. Recently,\nvarious approaches have been developed and have achieved high performance on\nvisual commonsense benchmarks. However, it is unclear whether the models really\nunderstand the visual scene and underlying commonsense knowledge due to limited\nevaluation data resources. To provide an in-depth analysis, we present a\nMultimodal Evaluation (ME) pipeline to automatically generate question-answer\npairs to test models' understanding of the visual scene, text, and related\nknowledge. We then take a step further to show that training with the ME data\nboosts the model's performance in standard VCR evaluation. 
Lastly, our in-depth\nanalysis and comparison reveal interesting findings: (1) semantically low-level\ninformation can assist the learning of high-level information but not the\nopposite; (2) visual information is generally under utilization compared with\ntext.\n","authors":["Zhecan Wang","Haoxuan You","Yicheng He","Wenhao Li","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2211.05895v2.pdf","comment":"Accepted to EMNLP 2022 Long Paper"},{"id":"http://arxiv.org/abs/2310.14675v1","updated":"2023-10-23T08:15:29Z","published":"2023-10-23T08:15:29Z","title":"Online Out-of-Domain Detection for Automated Driving","summary":" Ensuring safety in automated driving is a major challenge for the automotive\nindustry. Special attention is paid to artificial intelligence, in particular\nto Deep Neural Networks (DNNs), which is considered a key technology in the\nrealization of highly automated driving. DNNs learn from training data, which\nmeans that they only achieve good accuracy within the underlying data\ndistribution of the training data. When leaving the training domain, a\ndistributional shift is caused, which can lead to a drastic reduction of\naccuracy. In this work, we present a proof of concept for a safety mechanism\nthat can detect the leaving of the domain online, i.e. at runtime. In our\nexperiments with the Synthia data set we can show that a 100 % correct\ndetection of whether the input data is inside or outside the domain is\nachieved. 
The ability to detect when the vehicle leaves the domain can be an\nimportant requirement for certification.\n","authors":["Timo Sämann","Horst-Michael Groß"],"pdf_url":"https://arxiv.org/pdf/2310.14675v1.pdf","comment":"Machine Learning in Certified Systems (MLCS) Workshop, 14.-15.01.2021"},{"id":"http://arxiv.org/abs/2310.13198v2","updated":"2023-10-23T08:14:18Z","published":"2023-10-19T23:36:17Z","title":"A Car Model Identification System for Streamlining the Automobile Sales\n Process","summary":" This project presents an automated solution for the efficient identification\nof car models and makes from images, aimed at streamlining the vehicle listing\nprocess on online car-selling platforms. Through a thorough exploration\nencompassing various efficient network architectures including Convolutional\nNeural Networks (CNNs), Vision Transformers (ViTs), and hybrid models, we\nachieved a notable accuracy of 81.97% employing the EfficientNet (V2 b2)\narchitecture. To refine performance, a combination of strategies, including\ndata augmentation, fine-tuning pretrained models, and extensive hyperparameter\ntuning, were applied. The trained model offers the potential for automating\ninformation extraction, promising enhanced user experiences across car-selling\nwebsites.\n","authors":["Said Togru","Marco Moldovan"],"pdf_url":"https://arxiv.org/pdf/2310.13198v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14670v1","updated":"2023-10-23T08:09:42Z","published":"2023-10-23T08:09:42Z","title":"Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and\n Beyond","summary":" Vision-language (VL) understanding tasks evaluate models' comprehension of\ncomplex visual scenes through multiple-choice questions. However, we have\nidentified two dataset biases that models can exploit as shortcuts to resolve\nvarious VL tasks correctly without proper understanding. 
The first type of\ndataset bias is \\emph{Unbalanced Matching} bias, where the correct answer\noverlaps the question and image more than the incorrect answers. The second\ntype of dataset bias is \\emph{Distractor Similarity} bias, where incorrect\nanswers are overly dissimilar to the correct answer but significantly similar\nto other incorrect answers within the same sample. To address these dataset\nbiases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic\ntraining and debiased evaluation data. We then introduce Intra-sample\nCounterfactual Training (ICT) to assist models in utilizing the synthesized\ntraining data, particularly the counterfactual data, via focusing on\nintra-sample differentiation. Extensive experiments demonstrate the\neffectiveness of ADS and ICT in consistently improving model performance across\ndifferent benchmarks, even in domain-shifted scenarios.\n","authors":["Zhecan Wang","Long Chen","Haoxuan You","Keyang Xu","Yicheng He","Wenhao Li","Noal Codella","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2310.14670v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14664v1","updated":"2023-10-23T08:00:03Z","published":"2023-10-23T08:00:03Z","title":"Data Pruning via Moving-one-Sample-out","summary":" In this paper, we propose a novel data-pruning approach called\nmoving-one-sample-out (MoSo), which aims to identify and remove the least\ninformative samples from the training set. The core insight behind MoSo is to\ndetermine the importance of each sample by assessing its impact on the optimal\nempirical risk. This is achieved by measuring the extent to which the empirical\nrisk changes when a particular sample is excluded from the training set.\nInstead of using the computationally expensive leaving-one-out-retraining\nprocedure, we propose an efficient first-order approximator that only requires\ngradient information from different training stages. 
The key idea behind our\napproximation is that samples with gradients that are consistently aligned with\nthe average gradient of the training set are more informative and should\nreceive higher scores, which could be intuitively understood as follows: if the\ngradient from a specific sample is consistent with the average gradient vector,\nit implies that optimizing the network using the sample will yield a similar\neffect on all remaining samples. Experimental results demonstrate that MoSo\neffectively mitigates severe performance degradation at high pruning ratios and\nachieves satisfactory performance across various settings.\n","authors":["Haoru Tan","Sitong Wu","Fei Du","Yukang Chen","Zhibin Wang","Fan Wang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2310.14664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14652v1","updated":"2023-10-23T07:44:12Z","published":"2023-10-23T07:44:12Z","title":"Invariant Feature Regularization for Fair Face Recognition","summary":" Fair face recognition is all about learning invariant feature that\ngeneralizes to unseen faces in any demographic group. Unfortunately, face\ndatasets inevitably capture the imbalanced demographic attributes that are\nubiquitous in real-world observations, and the model learns biased feature that\ngeneralizes poorly in the minority group. We point out that the bias arises due\nto the confounding demographic attributes, which mislead the model to capture\nthe spurious demographic-specific feature. The confounding effect can only be\nremoved by causal intervention, which requires the confounder annotations.\nHowever, such annotations can be prohibitively expensive due to the diversity\nof the demographic attributes. To tackle this, we propose to generate diverse\ndata partitions iteratively in an unsupervised fashion. Each data partition\nacts as a self-annotated confounder, enabling our Invariant Feature\nRegularization (INV-REG) to deconfound. 
INV-REG is orthogonal to existing\nmethods, and combining INV-REG with two strong baselines (Arcface and CIFP)\nleads to new state-of-the-art that improves face recognition on a variety of\ndemographic groups. Code is available at\nhttps://github.com/PanasonicConnect/InvReg.\n","authors":["Jiali Ma","Zhongqi Yue","Kagaya Tomoyuki","Suzuki Tomoki","Karlekar Jayashree","Sugiri Pranata","Hanwang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14652v1.pdf","comment":"Accepted by International Conference on Computer Vision (ICCV) 2023"},{"id":"http://arxiv.org/abs/2310.14642v1","updated":"2023-10-23T07:29:51Z","published":"2023-10-23T07:29:51Z","title":"Relit-NeuLF: Efficient Relighting and Novel View Synthesis via Neural 4D\n Light Field","summary":" In this paper, we address the problem of simultaneous relighting and novel\nview synthesis of a complex scene from multi-view images with a limited number\nof light sources. We propose an analysis-synthesis approach called Relit-NeuLF.\nFollowing the recent neural 4D light field network (NeuLF), Relit-NeuLF first\nleverages a two-plane light field representation to parameterize each ray in a\n4D coordinate system, enabling efficient learning and inference. Then, we\nrecover the spatially-varying bidirectional reflectance distribution function\n(SVBRDF) of a 3D scene in a self-supervised manner. A DecomposeNet learns to\nmap each ray to its SVBRDF components: albedo, normal, and roughness. Based on\nthe decomposed BRDF components and conditioning light directions, a RenderNet\nlearns to synthesize the color of the ray. To self-supervise the SVBRDF\ndecomposition, we encourage the predicted ray color to be close to the\nphysically-based rendering result using the microfacet model. Comprehensive\nexperiments demonstrate that the proposed method is efficient and effective on\nboth synthetic data and real-world human face data, and outperforms the\nstate-of-the-art results. We publicly released our code on GitHub. 
You can find\nit here: https://github.com/oppo-us-research/RelitNeuLF\n","authors":["Zhong Li","Liangchen Song","Zhang Chen","Xiangyu Du","Lele Chen","Junsong Yuan","Yi Xu"],"pdf_url":"https://arxiv.org/pdf/2310.14642v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2304.01999v2","updated":"2023-10-23T07:23:10Z","published":"2023-04-04T17:54:32Z","title":"Revisiting the Evaluation of Image Synthesis with GANs","summary":" A good metric, which promises a reliable comparison between solutions, is\nessential for any well-defined task. Unlike most vision tasks that have\nper-sample ground-truth, image synthesis tasks target generating unseen data\nand hence are usually evaluated through a distributional distance between one\nset of real samples and another set of generated samples. This study presents\nan empirical investigation into the evaluation of synthesis performance, with\ngenerative adversarial networks (GANs) as a representative of generative\nmodels. In particular, we make in-depth analyses of various factors, including\nhow to represent a data point in the representation space, how to calculate a\nfair distance using selected samples, and how many instances to use from each\nset. Extensive experiments conducted on multiple datasets and settings reveal\nseveral important findings. Firstly, a group of models that include both\nCNN-based and ViT-based architectures serve as reliable and robust feature\nextractors for measurement evaluation. Secondly, Centered Kernel Alignment\n(CKA) provides a better comparison across various extractors and hierarchical\nlayers in one model. Finally, CKA is more sample-efficient and enjoys better\nagreement with human judgment in characterizing the similarity between two\ninternal data correlations. 
These findings contribute to the development of a\nnew measurement system, which enables a consistent and reliable re-evaluation\nof current state-of-the-art generative models.\n","authors":["Mengping Yang","Ceyuan Yang","Yichi Zhang","Qingyan Bai","Yujun Shen","Bo Dai"],"pdf_url":"https://arxiv.org/pdf/2304.01999v2.pdf","comment":"NeurIPS 2023 datasets and benchmarks track"},{"id":"http://arxiv.org/abs/2310.14637v1","updated":"2023-10-23T07:21:40Z","published":"2023-10-23T07:21:40Z","title":"Semantic-Aware Adversarial Training for Reliable Deep Hashing Retrieval","summary":" Deep hashing has been intensively studied and successfully applied in\nlarge-scale image retrieval systems due to its efficiency and effectiveness.\nRecent studies have recognized that the existence of adversarial examples poses\na security threat to deep hashing models, that is, adversarial vulnerability.\nNotably, it is challenging to efficiently distill reliable semantic\nrepresentatives for deep hashing to guide adversarial learning, and thereby it\nhinders the enhancement of adversarial robustness of deep hashing-based\nretrieval models. Moreover, current researches on adversarial training for deep\nhashing are hard to be formalized into a unified minimax structure. In this\npaper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the\nadversarial robustness of deep hashing models. Specifically, we conceive a\ndiscriminative mainstay features learning (DMFL) scheme to construct semantic\nrepresentatives for guiding adversarial learning in deep hashing. Particularly,\nour DMFL with the strict theoretical guarantee is adaptively optimized in a\ndiscriminative learning manner, where both discriminative and semantic\nproperties are jointly considered. Moreover, adversarial examples are\nfabricated by maximizing the Hamming distance between the hash codes of\nadversarial samples and mainstay features, the efficacy of which is validated\nin the adversarial attack trials. 
Further, we, for the first time, formulate\nthe formalized adversarial training of deep hashing into a unified minimax\noptimization under the guidance of the generated mainstay codes. Extensive\nexperiments on benchmark datasets show superb attack performance against the\nstate-of-the-art algorithms, meanwhile, the proposed adversarial training can\neffectively eliminate adversarial perturbations for trustworthy deep\nhashing-based retrieval. Our code is available at\nhttps://github.com/xandery-geek/SAAT.\n","authors":["Xu Yuan","Zheng Zhang","Xunguang Wang","Lin Wu"],"pdf_url":"https://arxiv.org/pdf/2310.14637v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14636v1","updated":"2023-10-23T07:21:02Z","published":"2023-10-23T07:21:02Z","title":"Multilevel Perception Boundary-guided Network for Breast Lesion\n Segmentation in Ultrasound Images","summary":" Automatic segmentation of breast tumors from the ultrasound images is\nessential for the subsequent clinical diagnosis and treatment plan. Although\nthe existing deep learning-based methods have achieved significant progress in\nautomatic segmentation of breast tumor, their performance on tumors with\nsimilar intensity to the normal tissues is still not pleasant, especially for\nthe tumor boundaries. To address this issue, we propose a PBNet composed by a\nmultilevel global perception module (MGPM) and a boundary guided module (BGM)\nto segment breast tumors from ultrasound images. Specifically, in MGPM, the\nlong-range spatial dependence between the voxels in a single level feature maps\nare modeled, and then the multilevel semantic information is fused to promote\nthe recognition ability of the model for non-enhanced tumors. In BGM, the tumor\nboundaries are extracted from the high-level semantic maps using the dilation\nand erosion effects of max pooling, such boundaries are then used to guide the\nfusion of low and high-level features. 
Moreover, to improve the segmentation\nperformance for tumor boundaries, a multi-level boundary-enhanced segmentation\n(BS) loss is proposed. The extensive comparison experiments on both publicly\navailable dataset and in-house dataset demonstrate that the proposed PBNet\noutperforms the state-of-the-art methods in terms of both qualitative\nvisualization results and quantitative evaluation metrics, with the Dice score,\nJaccard coefficient, Specificity and HD95 improved by 0.70%, 1.1%, 0.1% and\n2.5% respectively. In addition, the ablation experiments validate that the\nproposed MGPM is indeed beneficial for distinguishing the non-enhanced tumors\nand the BGM as well as the BS loss are also helpful for refining the\nsegmentation contours of the tumor.\n","authors":["Xing Yang","Jian Zhang","Qijian Chen","Li Wang","Lihui Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14636v1.pdf","comment":"12pages,5 figures"},{"id":"http://arxiv.org/abs/2310.12790v2","updated":"2023-10-23T06:39:27Z","published":"2023-10-19T14:47:11Z","title":"Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection","summary":" Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly\ndetection area - aims at utilizing a few samples of anomaly classes seen during\ntraining to detect unseen anomalies (i.e., samples from open-set anomaly\nclasses), while effectively identifying the seen anomalies. Benefiting from the\nprior knowledge illustrated by the seen anomalies, current OSAD methods can\noften largely reduce false positive errors. However, these methods treat the\nanomaly examples as from a homogeneous distribution, rendering them less\neffective in generalizing to unseen anomalies that can be drawn from any\ndistribution. In this paper, we propose to learn heterogeneous anomaly\ndistributions using the limited anomaly examples to address this issue. 
To this\nend, we introduce a novel approach, namely Anomaly Heterogeneity Learning\n(AHL), that simulates a diverse set of heterogeneous (seen and unseen) anomaly\ndistributions and then utilizes them to learn a unified heterogeneous\nabnormality model. Further, AHL is a generic framework that existing OSAD\nmodels can plug and play for enhancing their abnormality modeling. Extensive\nexperiments on nine real-world anomaly detection datasets show that AHL can 1)\nsubstantially enhance different state-of-the-art (SOTA) OSAD models in\ndetecting both seen and unseen anomalies, achieving new SOTA performance on a\nlarge set of datasets, and 2) effectively generalize to unseen anomalies in new\ntarget domains.\n","authors":["Jiawen Zhu","Choubo Ding","Yu Tian","Guansong Pang"],"pdf_url":"https://arxiv.org/pdf/2310.12790v2.pdf","comment":"18 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.07602v2","updated":"2023-10-23T06:34:33Z","published":"2023-10-11T15:41:52Z","title":"Dual Radar: A Multi-modal Dataset with Dual 4D Radar for Autonomous\n Driving","summary":" Radar has stronger adaptability in adverse scenarios for autonomous driving\nenvironmental perception compared to widely adopted cameras and LiDARs.\nCompared with commonly used 3D radars, the latest 4D radars have precise\nvertical resolution and higher point cloud density, making it a highly\npromising sensor for autonomous driving in complex environmental perception.\nHowever, due to the much higher noise than LiDAR, manufacturers choose\ndifferent filtering strategies, resulting in an inverse ratio between noise\nlevel and point cloud density. There is still a lack of comparative analysis on\nwhich method is beneficial for deep learning-based perception algorithms in\nautonomous driving. One of the main reasons is that current datasets only adopt\none type of 4D radar, making it difficult to compare different 4D radars in the\nsame scene. 
Therefore, in this paper, we introduce a novel large-scale\nmulti-modal dataset featuring, for the first time, two types of 4D radars\ncaptured simultaneously. This dataset enables further research into effective\n4D radar perception algorithms. Our dataset consists of 151 consecutive series,\nmost of which last 20 seconds and contain 10,007 meticulously synchronized and\nannotated frames. Moreover, our dataset captures a variety of challenging\ndriving scenarios, including many road conditions, weather conditions,\nnighttime and daytime with different lighting intensities and periods. Our\ndataset annotates consecutive frames, which can be applied to 3D object\ndetection and tracking, and also supports the study of multi-modal tasks. We\nexperimentally validate our dataset, providing valuable results for studying\ndifferent types of 4D radars. This dataset is released on\nhttps://github.com/adept-thu/Dual-Radar.\n","authors":["Xinyu Zhang","Li Wang","Jian Chen","Cheng Fang","Lei Yang","Ziying Song","Guangqi Yang","Yichen Wang","Xiaofei Zhang","Qingshan Yang","Jun Li"],"pdf_url":"https://arxiv.org/pdf/2310.07602v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01852v4","updated":"2023-10-23T06:13:24Z","published":"2023-10-03T07:33:27Z","title":"LanguageBind: Extending Video-Language Pretraining to N-modality by\n Language-based Semantic Alignment","summary":" The video-language (VL) pretraining has achieved remarkable improvement in\nmultiple downstream tasks. However, the current VL pretraining framework is\nhard to extend to multiple modalities (N modalities, N>=3) beyond vision and\nlanguage. We thus propose LanguageBind, taking the language as the bind across\ndifferent modalities because the language modality is well-explored and\ncontains rich semantics. Specifically, we freeze the language encoder acquired\nby VL pretraining, then train encoders for other modalities with contrastive\nlearning. 
As a result, all modalities are mapped to a shared feature space,\nimplementing multi-modal semantic alignment. While LanguageBind ensures that we\ncan extend VL modalities to N modalities, we also need a high-quality dataset\nwith alignment data pairs centered on language. We thus propose VIDAL-10M, a\ndataset of Video, Infrared, Depth, Audio and their corresponding Language.\nIn VIDAL-10M, all videos are from short video platforms with\ncomplete semantics rather than truncated segments from long videos, and all the\nvideo, depth, infrared, and audio modalities are aligned to their textual\ndescriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 5.8%\nR@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot\nvideo-text retrieval task. Beyond this, our LanguageBind has greatly improved\nin the zero-shot video, audio, depth, and infrared understanding tasks. For\ninstance, LanguageBind surpasses InterVideo by 1.9% on MSR-VTT, 8.8% on MSVD,\n6.3% on DiDeMo, and 4.4% on ActivityNet. On the LLVIP and NYU-D datasets,\nLanguageBind outperforms ImageBind with 23.8% and 11.1% top-1 accuracy. Code\naddress: https://github.com/PKU-YuanGroup/LanguageBind.\n","authors":["Bin Zhu","Bin Lin","Munan Ning","Yang Yan","Jiaxi Cui","HongFa Wang","Yatian Pang","Wenhao Jiang","Junwu Zhang","Zongwei Li","Wancai Zhang","Zhifeng Li","Wei Liu","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.01852v4.pdf","comment":"Under review as a conference paper at ICLR 2024"},{"id":"http://arxiv.org/abs/2309.11331v5","updated":"2023-10-23T06:07:39Z","published":"2023-09-20T14:03:47Z","title":"Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism","summary":" In the past years, YOLO-series models have emerged as the leading approaches\nin the area of real-time object detection. Many studies pushed up the baseline\nto a higher level by modifying the architecture, augmenting data and designing\nnew losses. 
However, we find previous models still suffer from an information\nfusion problem, although Feature Pyramid Network (FPN) and Path Aggregation\nNetwork (PANet) have alleviated this. Therefore, this study provides an\nadvanced Gather-and-Distribute (GD) mechanism, which is realized with\nconvolution and self-attention operations. This newly designed model, named\nGold-YOLO, boosts the multi-scale feature fusion capabilities and\nachieves an ideal balance between latency and accuracy across all model scales.\nAdditionally, we implement MAE-style pretraining in the YOLO-series for the\nfirst time, allowing YOLO-series models to benefit from unsupervised\npretraining. Gold-YOLO-N attains an outstanding 39.9% AP on the COCO val2017\ndataset and 1030 FPS on a T4 GPU, which outperforms the previous SOTA model\nYOLOv6-3.0-N with similar FPS by +2.4%. The PyTorch code is available at\nhttps://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO,\nand the MindSpore code is available at\nhttps://gitee.com/mindspore/models/tree/master/research/cv/Gold_YOLO.\n","authors":["Chengcheng Wang","Wei He","Ying Nie","Jianyuan Guo","Chuanjian Liu","Kai Han","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2309.11331v5.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.14592v1","updated":"2023-10-23T06:00:24Z","published":"2023-10-23T06:00:24Z","title":"Pre-Training LiDAR-Based 3D Object Detectors Through Colorization","summary":" Accurate 3D object detection and understanding for self-driving cars heavily\nrelies on LiDAR point clouds, necessitating large amounts of labeled data to\ntrain. In this work, we introduce an innovative pre-training approach, Grounded\nPoint Colorization (GPC), to bridge the gap between data and labels by teaching\nthe model to colorize LiDAR point clouds, equipping it with valuable semantic\ncues. 
To tackle challenges arising from color variations and selection bias, we\nincorporate color as \"context\" by providing ground-truth colors as hints during\ncolorization. Experimental results on the KITTI and Waymo datasets demonstrate\nGPC's remarkable effectiveness. Even with limited labeled data, GPC\nsignificantly improves fine-tuning performance; notably, on just 20% of the\nKITTI dataset, GPC outperforms training from scratch with the entire dataset.\nIn sum, we introduce a fresh perspective on pre-training for 3D object\ndetection, aligning the objective with the model's intended role and ultimately\nadvancing the accuracy and efficiency of 3D object detection for autonomous\nvehicles.\n","authors":["Tai-Yu Pan","Chenyang Ma","Tianle Chen","Cheng Perng Phoo","Katie Z Luo","Yurong You","Mark Campbell","Kilian Q. Weinberger","Bharath Hariharan","Wei-Lun Chao"],"pdf_url":"https://arxiv.org/pdf/2310.14592v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.09210v3","updated":"2023-10-23T05:58:45Z","published":"2022-07-19T11:49:21Z","title":"KinD-LCE Curve Estimation And Retinex Fusion On Low-Light Image","summary":" Low-light images often suffer from noise and color distortion. Object\ndetection, semantic segmentation, instance segmentation, and other tasks are\nchallenging when working with low-light images because of image noise and\nchromatic aberration. We also found that the conventional Retinex theory loses\ninformation in adjusting the image for low-light tasks. In response to the\naforementioned problem, this paper proposes an algorithm for low illumination\nenhancement. The proposed method, KinD-LCE, uses a light curve estimation\nmodule to enhance the illumination map in the Retinex decomposed image,\nimproving the overall image brightness. An illumination map and reflection map\nfusion module were also proposed to restore the image details and reduce detail\nloss. Additionally, a TV(total variation) loss function was applied to\neliminate noise. 
Our method was trained on the GladNet dataset, known for its\ndiverse collection of low-light images, tested against the Low-Light dataset,\nand evaluated using the ExDark dataset for downstream tasks, demonstrating\ncompetitive performance with a PSNR of 19.7216 and SSIM of 0.8213.\n","authors":["Xiaochun Lei","Weiliang Mai","Junlin Xie","He Liu","Zetao Jiang","Zhaoting Gong","Chang Lu","Linjun Lu"],"pdf_url":"https://arxiv.org/pdf/2207.09210v3.pdf","comment":"Accepted by Signal, Image and Video Processing"},{"id":"http://arxiv.org/abs/2309.13226v3","updated":"2023-10-23T05:46:10Z","published":"2023-09-23T00:43:38Z","title":"Real3D-AD: A Dataset of Point Cloud Anomaly Detection","summary":" High-precision point cloud anomaly detection is the gold standard for\nidentifying the defects of advancing machining and precision manufacturing.\nDespite some methodological advances in this area, the scarcity of datasets and\nthe lack of a systematic benchmark hinder its development. We introduce\nReal3D-AD, a challenging high-precision point cloud anomaly detection dataset,\naddressing the limitations in the field. With 1,254 high-resolution 3D items\nfrom forty thousand to millions of points for each item, Real3D-AD is the\nlargest dataset for high-precision 3D industrial anomaly detection to date.\nReal3D-AD surpasses existing 3D anomaly detection datasets available regarding\npoint cloud resolution (0.0010mm-0.0015mm), 360 degree coverage and perfect\nprototype. Additionally, we present a comprehensive benchmark for Real3D-AD,\nrevealing the absence of baseline methods for high-precision point cloud\nanomaly detection. To address this, we propose Reg3D-AD, a registration-based\n3D anomaly detection method incorporating a novel feature memory bank that\npreserves local and global representations. Extensive experiments on the\nReal3D-AD dataset highlight the effectiveness of Reg3D-AD. 
For reproducibility\nand accessibility, we provide the Real3D-AD dataset, benchmark source code, and\nReg3D-AD on our website: https://github.com/M-3LAB/Real3D-AD.\n","authors":["Jiaqi Liu","Guoyang Xie","Ruitao Chen","Xinpeng Li","Jinbao Wang","Yong Liu","Chengjie Wang","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2309.13226v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13631v2","updated":"2023-10-23T05:42:51Z","published":"2023-05-23T02:59:19Z","title":"EDIS: Entity-Driven Image Search over Multimodal Web Content","summary":" Making image retrieval methods practical for real-world search applications\nrequires significant progress in dataset scales, entity comprehension, and\nmultimodal information fusion. In this work, we introduce\n\\textbf{E}ntity-\\textbf{D}riven \\textbf{I}mage \\textbf{S}earch (EDIS), a\nchallenging dataset for cross-modal image search in the news domain. EDIS\nconsists of 1 million web images from actual search engine results and curated\ndatasets, with each image paired with a textual description. Unlike datasets\nthat assume a small set of single-modality candidates, EDIS reflects real-world\nweb image search scenarios by including a million multimodal image-text pairs\nas candidates. EDIS encourages the development of retrieval models that\nsimultaneously address cross-modal information fusion and matching. To achieve\naccurate ranking results, a model must: 1) understand named entities and events\nfrom text queries, 2) ground entities onto images or text descriptions, and 3)\neffectively fuse textual and visual representations. Our experimental results\nshow that EDIS challenges state-of-the-art methods with dense entities and a\nlarge-scale candidate set. 
The ablation study also proves that fusing textual\nfeatures with visual features is critical in improving retrieval results.\n","authors":["Siqi Liu","Weixi Feng","Tsu-jui Fu","Wenhu Chen","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.13631v2.pdf","comment":"EMNLP 2023 camera ready version"},{"id":"http://arxiv.org/abs/2310.14581v1","updated":"2023-10-23T05:40:43Z","published":"2023-10-23T05:40:43Z","title":"Leveraging Image-Text Similarity and Caption Modification for the\n DataComp Challenge: Filtering Track and BYOD Track","summary":" Large web crawl datasets have already played an important role in learning\nmultimodal features with high generalization capabilities. However, there are\nstill very limited studies investigating the details or improvements of data\ndesign. Recently, a DataComp challenge has been designed to propose the best\ntraining data with the fixed models. This paper presents our solution to both\nfiltering track and BYOD track of the DataComp challenge. Our solution adopts\nlarge multimodal models CLIP and BLIP-2 to filter and modify web crawl data,\nand utilize external datasets along with a bag of tricks to improve the data\nquality. Experiments show our solution significantly outperforms DataComp\nbaselines (filtering track: 6.6% improvement, BYOD track: 48.5% improvement).\n","authors":["Shuhei Yokoo","Peifei Zhu","Yuchi Ishikawa","Mikihiro Tanaka","Masayoshi Kondo","Hirokatsu Kataoka"],"pdf_url":"https://arxiv.org/pdf/2310.14581v1.pdf","comment":"Accepted at the ICCV 2023 Workshop on Towards the Next Generation of\n Computer Vision Datasets: DataComp Track"},{"id":"http://arxiv.org/abs/2310.14576v1","updated":"2023-10-23T05:25:49Z","published":"2023-10-23T05:25:49Z","title":"Tensor Decomposition Based Attention Module for Spiking Neural Networks","summary":" The attention mechanism has been proven to be an effective way to improve\nspiking neural network (SNN). 
However, based on the fact that the current SNN\ninput data flow is split into tensors to process on GPUs, none of the previous\nworks consider the properties of tensors to implement an attention module. This\ninspires us to rethink current SNN from the perspective of tensor-relevant\ntheories. Using tensor decomposition, we design the \\textit{projected full\nattention} (PFA) module, which demonstrates excellent results with linearly\ngrowing parameters. Specifically, PFA is composed by the \\textit{linear\nprojection of spike tensor} (LPST) module and \\textit{attention map composing}\n(AMC) module. In LPST, we start by compressing the original spike tensor into\nthree projected tensors using a single property-preserving strategy with\nlearnable parameters for each dimension. Then, in AMC, we exploit the inverse\nprocedure of the tensor decomposition process to combine the three tensors into\nthe attention map using a so-called connecting factor. To validate the\neffectiveness of the proposed PFA module, we integrate it into the widely used\nVGG and ResNet architectures for classification tasks. Our method achieves\nstate-of-the-art performance on both static and dynamic benchmark datasets,\nsurpassing the existing SNN models with Transformer-based and CNN-based\nbackbones.\n","authors":["Haoyu Deng","Ruijie Zhu","Xuerui Qiu","Yule Duan","Malu Zhang","Liangjian Deng"],"pdf_url":"https://arxiv.org/pdf/2310.14576v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2310.14570v1","updated":"2023-10-23T05:04:23Z","published":"2023-10-23T05:04:23Z","title":"DICE: Diverse Diffusion Model with Scoring for Trajectory Prediction","summary":" Road user trajectory prediction in dynamic environments is a challenging but\ncrucial task for various applications, such as autonomous driving. One of the\nmain challenges in this domain is the multimodal nature of future trajectories\nstemming from the unknown yet diverse intentions of the agents. 
Diffusion\nmodels have shown to be very effective in capturing such stochasticity in\nprediction tasks. However, these models involve many computationally expensive\ndenoising steps and sampling operations that make them a less desirable option\nfor real-time safety-critical applications. To this end, we present a novel\nframework that leverages diffusion models for predicting future trajectories in\na computationally efficient manner. To minimize the computational bottlenecks\nin iterative sampling, we employ an efficient sampling mechanism that allows us\nto maximize the number of sampled trajectories for improved accuracy while\nmaintaining inference time in real time. Moreover, we propose a scoring\nmechanism to select the most plausible trajectories by assigning relative\nranks. We show the effectiveness of our approach by conducting empirical\nevaluations on common pedestrian (UCY/ETH) and autonomous driving (nuScenes)\nbenchmark datasets on which our model achieves state-of-the-art performance on\nseveral subsets and metrics.\n","authors":["Younwoo Choi","Ray Coden Mercurius","Soheil Mohamad Alizadeh Shabestary","Amir Rasouli"],"pdf_url":"https://arxiv.org/pdf/2310.14570v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15217v3","updated":"2023-10-23T04:56:09Z","published":"2023-05-24T14:57:42Z","title":"L-CAD: Language-based Colorization with Any-level Descriptions using\n Diffusion Priors","summary":" Language-based colorization produces plausible and visually pleasing colors\nunder the guidance of user-friendly natural language descriptions. Previous\nmethods implicitly assume that users provide comprehensive color descriptions\nfor most of the objects in the image, which leads to suboptimal performance. In\nthis paper, we propose a unified model to perform language-based colorization\nwith any-level descriptions. 
We leverage the pretrained cross-modality\ngenerative model for its robust language understanding and rich color priors to\nhandle the inherent ambiguity of any-level descriptions. We further design\nmodules to align with input conditions to preserve local spatial structures and\nprevent the ghosting effect. With the proposed novel sampling strategy, our\nmodel achieves instance-aware colorization in diverse and complex scenarios.\nExtensive experimental results demonstrate our advantages of effectively\nhandling any-level descriptions and outperforming both language-based and\nautomatic colorization methods. The code and pretrained models are available\nat: https://github.com/changzheng123/L-CAD.\n","authors":["Zheng Chang","Shuchen Weng","Peixuan Zhang","Yu Li","Si Li","Boxin Shi"],"pdf_url":"https://arxiv.org/pdf/2305.15217v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14566v1","updated":"2023-10-23T04:49:09Z","published":"2023-10-23T04:49:09Z","title":"HallusionBench: You See What You Think? Or You Think What You See? An\n Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5,\n and Other Multi-modality Models","summary":" Large language models (LLMs), after being aligned with vision models and\nintegrated into vision-language models (VLMs), can bring impressive improvement\nin image reasoning tasks. This was shown by the recently released GPT-4V(ision),\nLLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a\ndouble-edged sword: they may ignore the image context and solely rely on the\n(even contradictory) language prior for reasoning. In contrast, the vision\nmodules in VLMs are weaker than LLMs and may result in misleading visual\nrepresentations, which are then translated to confident mistakes by LLMs. 
To\nstudy these two types of VLM mistakes, i.e., language hallucination and visual\nillusion, we curated HallusionBench, an image-context reasoning benchmark that\nis still challenging to even GPT-4V and LLaVA-1.5. We provide a detailed\nanalysis of examples in HallusionBench, which sheds novel insights on the\nillusion or hallucination of VLMs and how to improve them in the future. The\nbenchmark and codebase will be released at\nhttps://github.com/tianyi-lab/HallusionBench.\n","authors":["Fuxiao Liu","Tianrui Guan","Zongxia Li","Lichang Chen","Yaser Yacoob","Dinesh Manocha","Tianyi Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.14566v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06540v2","updated":"2023-10-23T04:31:46Z","published":"2023-04-13T13:51:26Z","title":"Temporal Knowledge Sharing enable Spiking Neural Network Learning from\n Past and Future","summary":" Spiking Neural Networks (SNNs) have attracted significant attention from\nresearchers across various domains due to their brain-like information\nprocessing mechanism. However, SNNs typically grapple with challenges such as\nextended time steps, low temporal information utilization, and the requirement\nfor consistent time step between testing and training. These challenges render\nSNNs with high latency. Moreover, the constraint on time steps necessitates the\nretraining of the model for new deployments, reducing adaptability. To address\nthese issues, this paper proposes a novel perspective, viewing the SNN as a\ntemporal aggregation model. We introduce the Temporal Knowledge Sharing (TKS)\nmethod, facilitating information interact between different time points. TKS\ncan be perceived as a form of temporal self-distillation. To validate the\nefficacy of TKS in information processing, we tested it on static datasets like\nCIFAR10, CIFAR100, ImageNet-1k, and neuromorphic datasets such as DVS-CIFAR10\nand NCALTECH101. 
Experimental results demonstrate that our method achieves\nstate-of-the-art performance compared to other algorithms. Furthermore, TKS\naddresses the temporal consistency challenge, endowing the model with superior\ntemporal generalization capabilities. This allows the network to train with\nlonger time steps and maintain high performance during testing with shorter\ntime steps. Such an approach considerably accelerates the deployment of SNNs on\nedge devices. Finally, we conducted ablation experiments and tested TKS on\nfine-grained tasks, with results showcasing TKS's enhanced capability to\nprocess information efficiently.\n","authors":["Yiting Dong","Dongcheng Zhao","Yi Zeng"],"pdf_url":"https://arxiv.org/pdf/2304.06540v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14561v1","updated":"2023-10-23T04:31:42Z","published":"2023-10-23T04:31:42Z","title":"F$^2$AT: Feature-Focusing Adversarial Training via Disentanglement of\n Natural and Perturbed Patterns","summary":" Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by\nwell-designed perturbations. This could lead to disastrous results on critical\napplications such as self-driving cars, surveillance security, and medical\ndiagnosis. At present, adversarial training is one of the most effective\ndefenses against adversarial examples. However, traditional adversarial\ntraining makes it difficult to achieve a good trade-off between clean accuracy\nand robustness since spurious features are still learned by DNNs. The intrinsic\nreason is that traditional adversarial training makes it difficult to fully\nlearn core features from adversarial examples when adversarial noise and clean\nexamples cannot be disentangled. In this paper, we disentangle the adversarial\nexamples into natural and perturbed patterns by bit-plane slicing. We assume\nthe higher bit-planes represent natural patterns and the lower bit-planes\nrepresent perturbed patterns, respectively. 
We propose a Feature-Focusing\nAdversarial Training (F$^2$AT), which differs from previous work in that it\nenforces the model to focus on the core features from natural patterns and\nreduce the impact of spurious features from perturbed patterns. The\nexperimental results demonstrated that F$^2$AT outperforms state-of-the-art\nmethods in clean accuracy and adversarial robustness.\n","authors":["Yaguan Qian","Chenyu Zhao","Zhaoquan Gu","Bin Wang","Shouling Ji","Wei Wang","Boyang Zhou","Pan Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.14561v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11546v2","updated":"2023-10-23T04:26:44Z","published":"2023-06-20T13:59:20Z","title":"Bullying10K: A Large-Scale Neuromorphic Dataset towards\n Privacy-Preserving Bullying Recognition","summary":" The prevalence of violence in daily life poses significant threats to\nindividuals' physical and mental well-being. Using surveillance cameras in\npublic spaces has proven effective in proactively deterring and preventing such\nincidents. However, concerns regarding privacy invasion have emerged due to\ntheir widespread deployment. To address the problem, we leverage Dynamic Vision\nSensors (DVS) cameras to detect violent incidents and preserve privacy since they\ncapture pixel brightness variations instead of static imagery. We introduce\nthe Bullying10K dataset, encompassing various actions, complex movements, and\nocclusions from real-life scenarios. It provides three benchmarks for\nevaluating different tasks: action recognition, temporal action localization,\nand pose estimation. With 10,000 event segments, totaling 12 billion events and\n255 GB of data, Bullying10K contributes significantly by balancing violence\ndetection and personal privacy preservation. It also poses a challenge to\nthe neuromorphic dataset. It will serve as a valuable resource for training and\ndeveloping privacy-protecting video systems. 
The Bullying10K opens new\npossibilities for innovative approaches in these domains.\n","authors":["Yiting Dong","Yang Li","Dongcheng Zhao","Guobin Shen","Yi Zeng"],"pdf_url":"https://arxiv.org/pdf/2306.11546v2.pdf","comment":"Accepted at the 37th Conference on Neural Information Processing\n Systems (NeurIPS 2023) Track on Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2310.13573v2","updated":"2023-10-23T04:26:07Z","published":"2023-10-20T15:10:46Z","title":"Boosting Generalization with Adaptive Style Techniques for Fingerprint\n Liveness Detection","summary":" We introduce a high-performance fingerprint liveness feature extraction\ntechnique that secured first place in LivDet 2023 Fingerprint Representation\nChallenge. Additionally, we developed a practical fingerprint recognition\nsystem with 94.68% accuracy, earning second place in LivDet 2023 Liveness\nDetection in Action. By investigating various methods, particularly style\ntransfer, we demonstrate improvements in accuracy and generalization when faced\nwith limited training data. As a result, our approach achieved state-of-the-art\nperformance in LivDet 2023 Challenges.\n","authors":["Kexin Zhu","Bo Lin","Yang Qiu","Adam Yule","Yao Tang","Jiajun Liang"],"pdf_url":"https://arxiv.org/pdf/2310.13573v2.pdf","comment":"1st Place in LivDet2023 Fingerprint Representation Challenge"},{"id":"http://arxiv.org/abs/2310.14560v1","updated":"2023-10-23T04:24:31Z","published":"2023-10-23T04:24:31Z","title":"Polyhedral Surface: Self-supervised Point Cloud Reconstruction Based on\n Polyhedral Surface","summary":" Point cloud reconstruction from raw point cloud has been an important topic\nin computer graphics for decades, especially due to its high demand in modeling\nand rendering applications. An important way to solve this problem is\nestablishing a local geometry to fit the local curve. However, previous methods\nbuild either a local plane or polynomial curve. 
Local plane brings the loss of\nsharp feature and the boundary artefacts on open surface. Polynomial curve is\nhard to combine with neural network due to the local coordinate consistent\nproblem. To address this, we propose a novel polyhedral surface to represent\nlocal surface. This method provides more flexibility to represent sharp feature\nand surface boundary on open surface. It does not require any local coordinate\nsystem, which is important when introducing neural networks. Specifically, we\nuse normals to construct the polyhedral surface, including both dihedral and\ntrihedral surfaces using 2 and 3 normals, respectively. Our method achieves\nstate-of-the-art results on three commonly used datasets (ShapeNetCore, ABC,\nand ScanNet). Code will be released upon acceptance.\n","authors":["Hui Tian","Kai Xu"],"pdf_url":"https://arxiv.org/pdf/2310.14560v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14556v1","updated":"2023-10-23T04:22:03Z","published":"2023-10-23T04:22:03Z","title":"S3Aug: Segmentation, Sampling, and Shift for Action Recognition","summary":" Action recognition is a well-established area of research in computer vision.\nIn this paper, we propose S3Aug, a video data augmentation for action\nrecognition. Unlike conventional video data augmentation methods that involve\ncutting and pasting regions from two videos, the proposed method generates new\nvideos from a single training video through segmentation and label-to-image\ntransformation. Furthermore, the proposed method modifies certain categories of\nlabel images by sampling to generate a variety of videos, and shifts\nintermediate features to enhance the temporal coherency between frames of the\ngenerated videos. 
Experimental results on the UCF101, HMDB51, and Mimetics\ndatasets demonstrate the effectiveness of the proposed method, particularly for\nout-of-context videos of the Mimetics dataset.\n","authors":["Taiki Sugiura","Toru Tamaki"],"pdf_url":"https://arxiv.org/pdf/2310.14556v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14532v1","updated":"2023-10-23T03:34:05Z","published":"2023-10-23T03:34:05Z","title":"Practical Deep Dispersed Watermarking with Synchronization and Fusion","summary":" Deep learning based blind watermarking works have gradually emerged and\nachieved impressive performance. However, previous deep watermarking studies\nmainly focus on fixed low-resolution images while paying less attention to\narbitrary resolution images, especially widespread high-resolution images\nnowadays. Moreover, most works usually demonstrate robustness against typical\nnon-geometric attacks (\\textit{e.g.}, JPEG compression) but ignore common\ngeometric attacks (\\textit{e.g.}, Rotate) and more challenging combined\nattacks. To overcome the above limitations, we propose a practical deep\n\\textbf{D}ispersed \\textbf{W}atermarking with \\textbf{S}ynchronization and\n\\textbf{F}usion, called \\textbf{\\proposed}. Specifically, given an\narbitrary-resolution cover image, we adopt a dispersed embedding scheme which\nsparsely and randomly selects several fixed small-size cover blocks to embed a\nconsistent watermark message by a well-trained encoder. In the extraction\nstage, we first design a watermark synchronization module to locate and rectify\nthe encoded blocks in the noised watermarked image. We then utilize a decoder\nto obtain messages embedded in these blocks, and propose a message fusion\nstrategy based on similarity to make full use of the consistency among\nmessages, thus determining a reliable message. Extensive experiments conducted\non different datasets convincingly demonstrate the effectiveness of our\nproposed {\\proposed}. 
Compared with state-of-the-art approaches, our blind\nwatermarking achieves better performance: it improves the bit accuracy\nby 5.28\\% and 5.93\\% on average against single and combined attacks, respectively, and\nshows a smaller file size increment and better visual quality. Our code is available\nat https://github.com/bytedance/DWSF.\n","authors":["Hengchang Guo","Qilong Zhang","Junwei Luo","Feng Guo","Wenbin Zhang","Xiaodong Su","Minglei Li"],"pdf_url":"https://arxiv.org/pdf/2310.14532v1.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2310.14511v1","updated":"2023-10-23T02:47:25Z","published":"2023-10-23T02:47:25Z","title":"Poster: Real-Time Object Substitution for Mobile Diminished Reality with\n Edge Computing","summary":" Diminished Reality (DR) is considered the conceptual counterpart to\nAugmented Reality (AR), and has recently gained increasing attention from both\nindustry and academia. Unlike AR, which adds virtual objects to the real world,\nDR allows users to remove physical content from the real world. When combined\nwith object replacement technology, it presents a further exciting avenue for\nexploration within the metaverse. Although a few studies have been conducted\non the intersection of object substitution and DR, there is no high-quality architecture for real-time object\nsubstitution in mobile diminished reality.
In\nthis paper, we propose an end-to-end architecture to facilitate immersive and\nreal-time scene construction for mobile devices with edge computing.\n","authors":["Hongyu Ke","Haoxin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14511v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10070v2","updated":"2023-10-23T02:37:20Z","published":"2023-10-16T05:04:21Z","title":"GreatSplicing: A Semantically Rich Splicing Dataset","summary":" In existing splicing forgery datasets, the insufficient semantic variety of\nspliced regions causes trained detection models to overfit semantic\nfeatures rather than splicing traces. Meanwhile, because of the absence of a\nreasonable dataset, the different detection methods proposed cannot reach a\nconsensus on experimental settings. To address these urgent issues,\nGreatSplicing, a manually created splicing dataset of considerable size\nand high quality, is proposed in this paper. GreatSplicing comprises 5,000\nspliced images and covers spliced regions with 335 distinct semantic\ncategories, allowing neural networks to grasp splicing traces better. Extensive\nexperiments demonstrate that models trained on GreatSplicing exhibit minimal\nmisidentification rates and superior cross-dataset detection capabilities\ncompared to models trained on existing datasets. Furthermore, GreatSplicing is available for all\nresearch purposes and can be downloaded from www.greatsplicing.net.\n","authors":["Xiuli Bi","Jiaming Liang"],"pdf_url":"https://arxiv.org/pdf/2310.10070v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14504v1","updated":"2023-10-23T02:31:31Z","published":"2023-10-23T02:31:31Z","title":"ADoPT: LiDAR Spoofing Attack Detection Based on Point-Level Temporal\n Consistency","summary":" Deep neural networks (DNNs) are increasingly integrated into LiDAR (Light\nDetection and Ranging)-based perception systems for autonomous vehicles (AVs),\nrequiring robust performance under adversarial conditions.
We aim to address\nthe challenge of LiDAR spoofing attacks, where attackers inject fake objects\ninto LiDAR data and fool AVs to misinterpret their environment and make\nerroneous decisions. However, current defense algorithms predominantly depend\non perception outputs (i.e., bounding boxes) thus face limitations in detecting\nattackers given the bounding boxes are generated by imperfect perception models\nprocessing limited points, acquired based on the ego vehicle's viewpoint. To\novercome these limitations, we propose a novel framework, named ADoPT (Anomaly\nDetection based on Point-level Temporal consistency), which quantitatively\nmeasures temporal consistency across consecutive frames and identifies abnormal\nobjects based on the coherency of point clusters. In our evaluation using the\nnuScenes dataset, our algorithm effectively counters various LiDAR spoofing\nattacks, achieving a low (< 10%) false positive ratio (FPR) and high (> 85%)\ntrue positive ratio (TPR), outperforming existing state-of-the-art defense\nmethods, CARLO and 3D-TC2. Furthermore, our evaluation demonstrates the\npromising potential for accurate attack detection across various road\nenvironments.\n","authors":["Minkyoung Cho","Yulong Cao","Zixiang Zhou","Z. Morley Mao"],"pdf_url":"https://arxiv.org/pdf/2310.14504v1.pdf","comment":"BMVC 2023 (17 pages, 13 figures, and 1 table)"},{"id":"http://arxiv.org/abs/2310.14489v1","updated":"2023-10-23T01:46:22Z","published":"2023-10-23T01:46:22Z","title":"MSFormer: A Skeleton-multiview Fusion Method For Tooth Instance\n Segmentation","summary":" Recently, deep learning-based tooth segmentation methods have been limited by\nthe expensive and time-consuming processes of data collection and labeling.\nAchieving high-precision segmentation with limited datasets is critical. A\nviable solution to this entails fine-tuning pre-trained multiview-based models,\nthereby enhancing performance with limited data. 
However, relying solely on\ntwo-dimensional (2D) images for three-dimensional (3D) tooth segmentation can\nproduce suboptimal outcomes because of occlusion and deformation, i.e.,\nincomplete and distorted shape perception. To improve this fine-tuning-based\nsolution, this paper advocates 2D-3D joint perception. The fundamental\nchallenge in employing 2D-3D joint perception with limited data is that the\n3D-related inputs and modules must follow a lightweight policy instead of using\nhuge 3D data and parameter-rich modules that require extensive training data.\nFollowing this lightweight policy, this paper selects skeletons as the 3D\ninputs and introduces MSFormer, a novel method for tooth segmentation. MSFormer\nincorporates two lightweight modules into existing multiview-based models: a\n3D-skeleton perception module to extract 3D perception from skeletons and a\nskeleton-image contrastive learning module to obtain the 2D-3D joint perception\nby fusing both multiview and skeleton perceptions. The experimental results\nreveal that MSFormer paired with large pre-trained multiview models achieves\nstate-of-the-art performance, requiring only 100 training meshes. 
Furthermore,\nthe segmentation accuracy is improved by 2.4%-5.5% with the increasing volume\nof training data.\n","authors":["Yuan Li","Huan Liu","Yubo Tao","Xiangyang He","Haifeng Li","Xiaohu Guo","Hai Lin"],"pdf_url":"https://arxiv.org/pdf/2310.14489v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.14487v1","updated":"2023-10-23T01:41:38Z","published":"2023-10-23T01:41:38Z","title":"VQ-NeRF: Vector Quantization Enhances Implicit Neural Representations","summary":" Recent advancements in implicit neural representations have contributed to\nhigh-fidelity surface reconstruction and photorealistic novel view synthesis.\nHowever, the computational complexity inherent in these methodologies presents\na substantial impediment, constraining the attainable frame rates and\nresolutions in practical applications. In response to this predicament, we\npropose VQ-NeRF, an effective and efficient pipeline for enhancing implicit\nneural representations via vector quantization. The essence of our method\ninvolves reducing the sampling space of NeRF to a lower resolution and\nsubsequently reinstating it to the original size utilizing a pre-trained VAE\ndecoder, thereby effectively mitigating the sampling time bottleneck\nencountered during rendering. Although the codebook furnishes representative\nfeatures, reconstructing fine texture details of the scene remains challenging\ndue to high compression rates. To overcome this constraint, we design an\ninnovative multi-scale NeRF sampling scheme that concurrently optimizes the\nNeRF model at both compressed and original scales to enhance the network's\nability to preserve fine details. Furthermore, we incorporate a semantic loss\nfunction to improve the geometric fidelity and semantic coherence of our 3D\nreconstructions. Extensive experiments demonstrate the effectiveness of our\nmodel in achieving the optimal trade-off between rendering quality and\nefficiency. 
Evaluation on the DTU, BlendMVS, and H3DS datasets confirms the\nsuperior performance of our approach.\n","authors":["Yiying Yang","Wen Liu","Fukun Yin","Xin Chen","Gang Yu","Jiayuan Fan","Tao Chen"],"pdf_url":"https://arxiv.org/pdf/2310.14487v1.pdf","comment":"Submitted to the 38th Annual AAAI Conference on Artificial\n Intelligence"},{"id":"http://arxiv.org/abs/2310.14469v1","updated":"2023-10-23T00:52:23Z","published":"2023-10-23T00:52:23Z","title":"Player Re-Identification Using Body Part Appearences","summary":" We propose a neural network architecture that learns body part appearances\nfor soccer player re-identification. Our model consists of a two-stream network\n(one stream for appearance map extraction and the other for body part map\nextraction) and a bilinear-pooling layer that generates and spatially pools the\nbody part map. Each local feature of the body part map is obtained by a\nbilinear mapping of the corresponding local appearance and body part\ndescriptors. Our novel representation yields a robust image-matching feature\nmap, which results from combining the local similarities of the relevant body\nparts with the weighted appearance similarity. Our model does not require any\npart annotation on the SoccerNet-V3 re-identification dataset to train the\nnetwork. Instead, we use a sub-network of an existing pose estimation network\n(OpenPose) to initialize the part substream and then train the entire network\nto minimize the triplet loss. The appearance stream is pre-trained on the\nImageNet dataset, and the part stream is trained from scratch for the\nSoccerNet-V3 dataset. 
We demonstrate the validity of our model by showing that\nit outperforms state-of-the-art models such as OsNet and InceptionNet.\n","authors":["Mahesh Bhosale","Abhishek Kumar","David Doermann"],"pdf_url":"https://arxiv.org/pdf/2310.14469v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07847v4","updated":"2023-10-23T00:51:19Z","published":"2023-07-15T16:45:01Z","title":"Enabling Real-time Neural Recovery for Cloud Gaming on Mobile Devices","summary":" Cloud gaming is a multi-billion dollar industry. A client in cloud gaming\nsends its movement to the game server on the Internet, which renders and\ntransmits the resulting video back. In order to provide a good gaming\nexperience, a latency below 80 ms is required. This means that video rendering,\nencoding, transmission, decoding, and display have to finish within that time\nframe, which is especially challenging to achieve due to server overload,\nnetwork congestion, and losses. In this paper, we propose a new method for\nrecovering lost or corrupted video frames in cloud gaming. Unlike traditional\nvideo frame recovery, our approach uses game states to significantly enhance\nrecovery accuracy and utilizes partially decoded frames to recover lost\nportions. We develop a holistic system that consists of (i) efficiently\nextracting game states, (ii) modifying H.264 video decoder to generate a mask\nto indicate which portions of video frames need recovery, and (iii) designing a\nnovel neural network to recover either complete or partial video frames. 
Our\napproach is extensively evaluated using iPhone 12 and laptop implementations,\nand we demonstrate the utility of game states in the game video recovery and\nthe effectiveness of our overall design.\n","authors":["Zhaoyuan He","Yifan Yang","Shuozhe Li","Diyuan Dai","Lili Qiu","Yuqing Yang"],"pdf_url":"https://arxiv.org/pdf/2307.07847v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17262v3","updated":"2023-10-23T00:45:49Z","published":"2023-05-26T21:10:11Z","title":"Im-Promptu: In-Context Composition from Image Prompts","summary":" Large language models are few-shot learners that can solve diverse tasks from\na handful of demonstrations. This implicit understanding of tasks suggests that\nthe attention mechanisms over word tokens may play a role in analogical\nreasoning. In this work, we investigate whether analogical reasoning can enable\nin-context composition over composable elements of visual stimuli. First, we\nintroduce a suite of three benchmarks to test the generalization properties of\na visual in-context learner. We formalize the notion of an analogy-based\nin-context learner and use it to design a meta-learning framework called\nIm-Promptu. Whereas the requisite token granularity for language is well\nestablished, the appropriate compositional granularity for enabling in-context\ngeneralization in visual stimuli is usually unspecified. To this end, we use\nIm-Promptu to train multiple agents with different levels of compositionality,\nincluding vector representations, patch representations, and object slots. Our\nexperiments reveal tradeoffs between extrapolation abilities and the degree of\ncompositionality, with non-compositional representations extending learned\ncomposition rules to unseen domains but performing poorly on combinatorial\ntasks. Patch-based representations require patches to contain entire objects\nfor robust extrapolation. 
At the same time, object-centric tokenizers coupled\nwith a cross-attention module generate consistent and high-fidelity solutions,\nwith these inductive biases being particularly crucial for compositional\ngeneralization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive\nprogramming interface for image generation.\n","authors":["Bhishma Dedhia","Michael Chang","Jake C. Snell","Thomas L. Griffiths","Niraj K. Jha"],"pdf_url":"https://arxiv.org/pdf/2305.17262v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09064v4","updated":"2023-10-23T00:30:20Z","published":"2023-03-16T03:44:33Z","title":"Dual skip connections in U-Net, ResUnet and U-Net3+ for remote\n extraction of buildings","summary":" Urban buildings are extracted from high-resolution Earth observation (EO)\nimages using semantic segmentation networks like U-Net and its successors. Each\nre-iteration aims to improve performance by employing a denser skip connection\nmechanism that harnesses multi-scale features for accurate object mapping.\nHowever, denser connections increase network parameters and do not necessarily\ncontribute to precise segmentation. In this paper, we develop three dual skip\nconnection mechanisms for three networks (U-Net, ResUnet, and U-Net3+) to\nselectively deepen the essential feature maps for improved performance. The\nthree mechanisms are evaluated on feature maps of different scales, producing\nnine new network configurations. They are evaluated against their original\nvanilla configurations on four building footprint datasets of different spatial\nresolutions, including a multi-resolution (0.3+0.6+1.2m) dataset that we\ndevelop for complex urban environments. The evaluation revealed that densifying\nthe large- and small-scale features in U-Net and U-Net3+ produce up to 0.905\nF1, more than TransUnet (0.903) and Swin-Unet (0.882) in our new dataset with\nup to 19x fewer parameters. 
The results conclude that selectively densifying\nfeature maps and skip connections enhances network performance without a\nsubstantial increase in parameters. The findings and the new dataset will\ncontribute to the computer vision domain and urban planning decision processes.\n","authors":["Bipul Neupane","Jagannath Aryal","Abbas Rajabifard"],"pdf_url":"https://arxiv.org/pdf/2303.09064v4.pdf","comment":"This work has been submitted to Springer for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2310.15402v1","updated":"2023-10-23T23:15:28Z","published":"2023-10-23T23:15:28Z","title":"Towards contrast-agnostic soft segmentation of the spinal cord","summary":" Spinal cord segmentation is clinically relevant and is notably used to\ncompute spinal cord cross-sectional area (CSA) for the diagnosis and monitoring\nof cord compression or neurodegenerative diseases such as multiple sclerosis.\nWhile several semi and automatic methods exist, one key limitation remains: the\nsegmentation depends on the MRI contrast, resulting in different CSA across\ncontrasts. This is partly due to the varying appearance of the boundary between\nthe spinal cord and the cerebrospinal fluid that depends on the sequence and\nacquisition parameters. This contrast-sensitive CSA adds variability in\nmulti-center studies where protocols can vary, reducing the sensitivity to\ndetect subtle atrophies. Moreover, existing methods enhance the CSA variability\nby training one model per contrast, while also producing binary masks that do\nnot account for partial volume effects. 
In this work, we present a deep\nlearning-based method that produces soft segmentations of the spinal cord.\nUsing the Spine Generic Public Database of healthy participants\n($\\text{n}=267$; $\\text{contrasts}=6$), we first generated participant-wise\nsoft ground truth (GT) by averaging the binary segmentations across all 6\ncontrasts. These soft GT, along with a regression-based loss function, were\nthen used to train a UNet model for spinal cord segmentation. We evaluated our\nmodel against state-of-the-art methods and performed ablation studies involving\ndifferent GT mask types, loss functions, and contrast-specific models. Our\nresults show that using the soft average segmentations along with a regression\nloss function reduces CSA variability ($p < 0.05$, Wilcoxon signed-rank test).\nThe proposed spinal cord segmentation model generalizes better than the\nstate-of-the-art contrast-specific methods amongst unseen datasets, vendors,\ncontrasts, and pathologies (compression, lesions), while accounting for partial\nvolume effects.\n","authors":["Sandrine Bédard","Naga Karthik Enamundram","Charidimos Tsagkas","Emanuele Pravatà","Cristina Granziera","Andrew Smith","Kenneth Arnold Weber II","Julien Cohen-Adad"],"pdf_url":"https://arxiv.org/pdf/2310.15402v1.pdf","comment":"Submitted to Medical Image Analysis"},{"id":"http://arxiv.org/abs/2310.15388v1","updated":"2023-10-23T22:41:04Z","published":"2023-10-23T22:41:04Z","title":"Remote Heart Rate Monitoring in Smart Environments from Videos with\n Self-supervised Pre-training","summary":" Recent advances in deep learning have made it increasingly feasible to\nestimate heart rate remotely in smart environments by analyzing videos.\nHowever, a notable limitation of deep learning methods is their heavy reliance\non extensive sets of labeled data for effective training. To address this\nissue, self-supervised learning has emerged as a promising avenue. 
Building on\nthis, we introduce a solution that utilizes self-supervised contrastive\nlearning for the estimation of remote photoplethysmography (PPG) and heart rate\nmonitoring, thereby reducing the dependence on labeled data and enhancing\nperformance. We propose the use of 3 spatial and 3 temporal augmentations for\ntraining an encoder through a contrastive framework, followed by utilizing the\nlate-intermediate embeddings of the encoder for remote PPG and heart rate\nestimation. Our experiments on two publicly available datasets showcase the\nimprovement of our proposed approach over several related works as well as\nsupervised learning baselines, as our results approach the state-of-the-art. We\nalso perform thorough experiments to showcase the effects of using different\ndesign choices such as the video representation learning method, the\naugmentations used in the pre-training stage, and others. We also demonstrate\nthe robustness of our proposed method over the supervised learning approaches\non reduced amounts of labeled data.\n","authors":["Divij Gupta","Ali Etemad"],"pdf_url":"https://arxiv.org/pdf/2310.15388v1.pdf","comment":"Accepted in IEEE Internet of Things Journal 2023"},{"id":"http://arxiv.org/abs/2310.03388v3","updated":"2023-10-23T22:11:02Z","published":"2023-10-05T08:49:51Z","title":"OpenPatch: a 3D patchwork for Out-Of-Distribution detection","summary":" Moving deep learning models from the laboratory setting to the open world\nentails preparing them to handle unforeseen conditions. In several applications\nthe occurrence of novel classes during deployment poses a significant threat,\nthus it is crucial to effectively detect them. Ideally, this skill should be\nused when needed without requiring any further computational training effort at\nevery new task. 
Out-of-distribution detection has attracted significant\nattention in the last years, however the majority of the studies deal with 2D\nimages ignoring the inherent 3D nature of the real-world and often confusing\nbetween domain and semantic novelty. In this work, we focus on the latter,\nconsidering the objects geometric structure captured by 3D point clouds\nregardless of the specific domain. We advance the field by introducing\nOpenPatch that builds on a large pre-trained model and simply extracts from its\nintermediate features a set of patch representations that describe each known\nclass. For any new sample, we obtain a novelty score by evaluating whether it\ncan be recomposed mainly by patches of a single known class or rather via the\ncontribution of multiple classes. We present an extensive experimental\nevaluation of our approach for the task of semantic novelty detection on\nreal-world point cloud samples when the reference known data are synthetic. We\ndemonstrate that OpenPatch excels in both the full and few-shot known sample\nscenarios, showcasing its robustness across varying pre-training objectives and\nnetwork backbones. 
The inherent training-free nature of our method allows for\nits immediate application to a wide array of real-world tasks, offering a\ncompelling advantage over approaches that need expensive retraining efforts.\n","authors":["Paolo Rabino","Antonio Alliegro","Francesco Cappio Borlino","Tatiana Tommasi"],"pdf_url":"https://arxiv.org/pdf/2310.03388v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15371v1","updated":"2023-10-23T21:14:52Z","published":"2023-10-23T21:14:52Z","title":"Vicinal Feature Statistics Augmentation for Federated 3D Medical Volume\n Segmentation","summary":" Federated learning (FL) enables multiple client medical institutes\ncollaboratively train a deep learning (DL) model with privacy protection.\nHowever, the performance of FL can be constrained by the limited availability\nof labeled data in small institutes and the heterogeneous (i.e., non-i.i.d.)\ndata distribution across institutes. Though data augmentation has been a proven\ntechnique to boost the generalization capabilities of conventional centralized\nDL as a \"free lunch\", its application in FL is largely underexplored. Notably,\nconstrained by costly labeling, 3D medical segmentation generally relies on\ndata augmentation. In this work, we aim to develop a vicinal feature-level data\naugmentation (VFDA) scheme to efficiently alleviate the local feature shift and\nfacilitate collaborative training for privacy-aware FL segmentation. We take\nboth the inner- and inter-institute divergence into consideration, without the\nneed for cross-institute transfer of raw data or their mixup. Specifically, we\nexploit the batch-wise feature statistics (e.g., mean and standard deviation)\nin each institute to abstractly represent the discrepancy of data, and model\neach feature statistic probabilistically via a Gaussian prototype, with the\nmean corresponding to the original statistic and the variance quantifying the\naugmentation scope. 
From the vicinal risk minimization perspective, novel\nfeature statistics can be drawn from the Gaussian distribution to fulfill\naugmentation. The variance is explicitly derived by the data bias in each\nindividual institute and the underlying feature statistics characterized by all\nparticipating institutes. The added-on VFDA consistently yielded marked\nimprovements over six advanced FL methods on both 3D brain tumor and cardiac\nsegmentation.\n","authors":["Yongsong Huang","Wanqing Xie","Mingzhen Li","Mingmei Cheng","Jinzhou Wu","Weixiao Wang","Jane You","Xiaofeng Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15371v1.pdf","comment":"28th biennial international conference on Information Processing in\n Medical Imaging (IPMI 2023): Oral Paper"},{"id":"http://arxiv.org/abs/2310.15368v1","updated":"2023-10-23T21:03:30Z","published":"2023-10-23T21:03:30Z","title":"Deep Integrated Explanations","summary":" This paper presents Deep Integrated Explanations (DIX) - a universal method\nfor explaining vision models. DIX generates explanation maps by integrating\ninformation from the intermediate representations of the model, coupled with\ntheir corresponding gradients. 
Through an extensive array of both objective and\nsubjective evaluations spanning diverse tasks, datasets, and model\nconfigurations, we showcase the efficacy of DIX in generating faithful and\naccurate explanation maps, while surpassing current state-of-the-art methods.\n","authors":["Oren Barkan","Yehonathan Elisha","Jonathan Weill","Yuval Asher","Amit Eshel","Noam Koenigstein"],"pdf_url":"https://arxiv.org/pdf/2310.15368v1.pdf","comment":"CIKM 2023"},{"id":"http://arxiv.org/abs/2310.15328v1","updated":"2023-10-23T19:48:58Z","published":"2023-10-23T19:48:58Z","title":"DeepVox and SAVE-CT: a contrast- and dose-independent 3D deep learning\n approach for thoracic aorta segmentation and aneurysm prediction using\n computed tomography scans","summary":" Thoracic aortic aneurysm (TAA) is a fatal disease which potentially leads to\ndissection or rupture through progressive enlargement of the aorta. It is\nusually asymptomatic and screening recommendation are limited. The\ngold-standard evaluation is performed by computed tomography angiography (CTA)\nand radiologists time-consuming assessment. Scans for other indications could\nhelp on this screening, however if acquired without contrast enhancement or\nwith low dose protocol, it can make the clinical evaluation difficult, besides\nincreasing the scans quantity for the radiologists. In this study, it was\nselected 587 unique CT scans including control and TAA patients, acquired with\nlow and standard dose protocols, with or without contrast enhancement. A novel\nsegmentation model, DeepVox, exhibited dice score coefficients of 0.932 and\n0.897 for development and test sets, respectively, with faster training speed\nin comparison to models reported in the literature. The novel TAA\nclassification model, SAVE-CT, presented accuracies of 0.930 and 0.922 for\ndevelopment and test sets, respectively, using only the binary segmentation\nmask from DeepVox as input, without hand-engineered features. 
These two models\ntogether are a potential approach for TAA screening, as they can handle\nvariable number of slices as input, handling thoracic and thoracoabdominal\nsequences, in a fully automated contrast- and dose-independent evaluation. This\nmay assist to decrease TAA mortality and prioritize the evaluation queue of\npatients for radiologists.\n","authors":["Matheus del-Valle","Lariza Laura de Oliveira","Henrique Cursino Vieira","Henrique Min Ho Lee","Lucas Lembrança Pinheiro","Maria Fernanda Portugal","Newton Shydeo Brandão Miyoshi","Nelson Wolosker"],"pdf_url":"https://arxiv.org/pdf/2310.15328v1.pdf","comment":"23 pages, 4 figures, 7 tables"},{"id":"http://arxiv.org/abs/2310.15325v1","updated":"2023-10-23T19:46:41Z","published":"2023-10-23T19:46:41Z","title":"LXMERT Model Compression for Visual Question Answering","summary":" Large-scale pretrained models such as LXMERT are becoming popular for\nlearning cross-modal representations on text-image pairs for vision-language\ntasks. According to the lottery ticket hypothesis, NLP and computer vision\nmodels contain smaller subnetworks capable of being trained in isolation to\nfull performance. In this paper, we combine these observations to evaluate\nwhether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA\ntask. In addition, we perform a model size cost-benefit analysis by\ninvestigating how much pruning can be done without significant loss in\naccuracy. 
Our experimental results demonstrate that LXMERT can be effectively\npruned by 40%-60% in size with 3% loss in accuracy.\n","authors":["Maryam Hashemi","Ghazaleh Mahmoudi","Sara Kodeiri","Hadi Sheikhi","Sauleh Eetemadi"],"pdf_url":"https://arxiv.org/pdf/2310.15325v1.pdf","comment":"To appear in The Fourth Annual West Coast NLP (WeCNLP) Summit"},{"id":"http://arxiv.org/abs/2310.15324v1","updated":"2023-10-23T19:45:46Z","published":"2023-10-23T19:45:46Z","title":"Videoprompter: an ensemble of foundational models for zero-shot video\n understanding","summary":" Vision-language models (VLMs) classify the query video by calculating a\nsimilarity score between the visual features and text-based class label\nrepresentations. Recently, large language models (LLMs) have been used to\nenrich the text-based class labels by enhancing the descriptiveness of the\nclass names. However, these improvements are restricted to the text-based\nclassifier only, and the query visual features are not considered. In this\npaper, we propose a framework which combines pre-trained discriminative VLMs\nwith pre-trained generative video-to-text and text-to-text models. We introduce\ntwo key modifications to the standard zero-shot setting. First, we propose\nlanguage-guided visual feature enhancement and employ a video-to-text model to\nconvert the query video to its descriptive form. The resulting descriptions\ncontain vital visual cues of the query video, such as what objects are present\nand their spatio-temporal interactions. These descriptive cues provide\nadditional semantic knowledge to VLMs to enhance their zero-shot performance.\nSecond, we propose video-specific prompts to LLMs to generate more meaningful\ndescriptions to enrich class label representations.
Specifically, we introduce\nprompt techniques to create a Tree Hierarchy of Categories for class names,\noffering a higher-level action context for additional visual cues. We\ndemonstrate the effectiveness of our approach in video understanding across\nthree different zero-shot settings: 1) video action recognition, 2)\nvideo-to-text and text-to-video retrieval, and 3) time-sensitive video tasks.\nConsistent improvements across multiple benchmarks and with various VLMs\ndemonstrate the effectiveness of our proposed framework. Our code will be made\npublicly available.\n","authors":["Adeel Yousaf","Muzammal Naseer","Salman Khan","Fahad Shahbaz Khan","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2310.15324v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15308v1","updated":"2023-10-23T19:21:57Z","published":"2023-10-23T19:21:57Z","title":"SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial\n Understanding","summary":" The landscape of publicly available vision foundation models (VFMs), such as\nCLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed\nwith distinct capabilities stemming from their pre-training objectives. For\ninstance, CLIP excels in semantic understanding, while SAM specializes in\nspatial understanding for segmentation. In this work, we introduce a simple\nrecipe to efficiently merge VFMs into a unified model that assimilates their\nexpertise. Our proposed method integrates multi-task learning, continual\nlearning techniques, and teacher-student distillation. This strategy entails\nsignificantly less computational cost compared to traditional multi-task\ntraining from scratch. Additionally, it only demands a small fraction of the\npre-training datasets that were initially used to train individual models. By\napplying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that\namalgamates the strengths of SAM and CLIP into a single backbone, making it apt\nfor edge device applications.
We show that SAM-CLIP learns richer visual\nrepresentations, equipped with both localization and semantic features,\nsuitable for a broad range of vision tasks. SAM-CLIP obtains improved\nperformance on several head probing tasks when compared with SAM and CLIP. We\nfurther show that SAM-CLIP not only retains the foundational strengths of its\nprecursor models but also introduces synergistic functionalities, most notably\nin zero-shot semantic segmentation, where SAM-CLIP establishes new\nstate-of-the-art results on 5 benchmarks. It outperforms previous models that\nare specifically designed for this task by a large margin, including +6.8% and\n+5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.\n","authors":["Haoxiang Wang","Pavan Kumar Anasosalu Vasu","Fartash Faghri","Raviteja Vemulapalli","Mehrdad Farajtabar","Sachin Mehta","Mohammad Rastegari","Oncel Tuzel","Hadi Pouransari"],"pdf_url":"https://arxiv.org/pdf/2310.15308v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15247v1","updated":"2023-10-23T18:01:36Z","published":"2023-10-23T18:01:36Z","title":"SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis","summary":" Sound design involves creatively selecting, recording, and editing sound\neffects for various media like cinema, video games, and virtual/augmented\nreality. One of the most time-consuming steps when designing sound is\nsynchronizing audio with video. In some cases, environmental recordings from\nvideo shoots are available, which can aid in the process. However, in video\ngames and animations, no reference audio exists, requiring manual annotation of\nevent timings from the video. We propose a system to extract repetitive actions\nonsets from a video, which are then used - in conjunction with audio or textual\nembeddings - to condition a diffusion model trained to generate a new\nsynchronized sound effects audio track. 
In this way, we leave complete creative\ncontrol to the sound designer while removing the burden of synchronization with\nvideo. Furthermore, editing the onset track or changing the conditioning\nembedding requires much less effort than editing the audio track itself,\nsimplifying the sonification process. We provide sound examples, source code,\nand pretrained models to facilitate reproducibility.\n","authors":["Marco Comunità","Riccardo F. Gramaccioni","Emilian Postolache","Emanuele Rodolà","Danilo Comminiello","Joshua D. Reiss"],"pdf_url":"https://arxiv.org/pdf/2310.15247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.10761v3","updated":"2023-10-23T16:56:59Z","published":"2022-03-21T07:12:18Z","title":"Harnessing Hard Mixed Samples with Decoupled Regularizer","summary":" Mixup is an efficient data augmentation approach that improves the\ngeneralization of neural networks by smoothing the decision boundary with mixed\ndata. Recently, dynamic mixup methods have improved previous static policies\neffectively (e.g., linear interpolation) by maximizing target-related salient\nregions in mixed samples, but excessive additional time costs are not\nacceptable. These additional computational overheads mainly come from\noptimizing the mixed samples according to the mixed labels. However, we found\nthat the extra optimizing step may be redundant because label-mismatched mixed\nsamples are informative hard mixed samples for deep models to localize\ndiscriminative features. In this paper, we thus are not trying to propose a\nmore complicated dynamic mixup policy but rather an efficient mixup objective\nfunction with a decoupled regularizer named Decoupled Mixup (DM). The primary\neffect is that DM can adaptively utilize those hard mixed samples to mine\ndiscriminative features without losing the original smoothness of mixup. 
As a\nresult, DM enables static mixup methods to achieve comparable or even exceed\nthe performance of dynamic methods without any extra computation. This also\nleads to an interesting objective design problem for mixup training that we\nneed to focus on both smoothing the decision boundaries and identifying\ndiscriminative features. Extensive experiments on supervised and\nsemi-supervised learning benchmarks across seven datasets validate the\neffectiveness of DM as a plug-and-play module. Source code and models are\navailable at https://github.com/Westlake-AI/openmixup\n","authors":["Zicheng Liu","Siyuan Li","Ge Wang","Cheng Tan","Lirong Wu","Stan Z. Li"],"pdf_url":"https://arxiv.org/pdf/2203.10761v3.pdf","comment":"NeurIPS'2023 Camera Ready. The source code is available at\n https://github.com/Westlake-AI/openmixup"},{"id":"http://arxiv.org/abs/2204.05278v4","updated":"2023-10-23T15:51:12Z","published":"2022-04-11T17:29:36Z","title":"Negligible effect of brain MRI data preprocessing for tumor segmentation","summary":" Magnetic resonance imaging (MRI) data is heterogeneous due to differences in\ndevice manufacturers, scanning protocols, and inter-subject variability. A\nconventional way to mitigate MR image heterogeneity is to apply preprocessing\ntransformations such as anatomy alignment, voxel resampling, signal intensity\nequalization, image denoising, and localization of regions of interest.\nAlthough a preprocessing pipeline standardizes image appearance, its influence\non the quality of image segmentation and on other downstream tasks in deep\nneural networks has never been rigorously studied.\n We conduct experiments on three publicly available datasets and evaluate the\neffect of different preprocessing steps in intra- and inter-dataset training\nscenarios. Our results demonstrate that most popular standardization steps add\nno value to the network performance; moreover, preprocessing can hamper model\nperformance. 
We suggest that image intensity normalization approaches do not\ncontribute to model accuracy because of the reduction of signal variance with\nimage standardization. Finally, we show that the contribution of\nskull-stripping in data preprocessing is almost negligible if measured in terms\nof estimated tumor volume.\n We show that the only essential transformation for accurate deep learning\nanalysis is the unification of voxel spacing across the dataset. In contrast,\ninter-subjects anatomy alignment in the form of non-rigid atlas registration is\nnot necessary and intensity equalization steps (denoising, bias-field\ncorrection and histogram matching) do not improve models' performance. The\nstudy code is accessible online\nhttps://github.com/MedImAIR/brain-mri-processing-pipeline\n","authors":["Ekaterina Kondrateva","Polina Druzhinina","Alexandra Dalechina","Svetlana Zolotova","Andrey Golanov","Boris Shirokikh","Mikhail Belyaev","Anvar Kurmukov"],"pdf_url":"https://arxiv.org/pdf/2204.05278v4.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2305.02996v2","updated":"2023-10-23T17:48:34Z","published":"2023-05-04T17:01:17Z","title":"Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR\n Decomposition","summary":" Cross-encoder models, which jointly encode and score a query-item pair, are\nprohibitively expensive for direct k-nearest neighbor (k-NN) search.\nConsequently, k-NN search typically employs a fast approximate retrieval (e.g.\nusing BM25 or dual-encoder vectors), followed by reranking with a\ncross-encoder; however, the retrieval approximation often has detrimental\nrecall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent\nwork that employs a cross-encoder only, making search efficient using a\nrelatively small number of anchor items, and a CUR matrix factorization. 
While\nANNCUR's one-time selection of anchors tends to approximate the cross-encoder\ndistances on average, doing so forfeits the capacity to accurately estimate\ndistances to items near the query, leading to regret in the crucial end-task:\nrecall of top-k items. In this paper, we propose ADACUR, a method that\nadaptively, iteratively, and efficiently minimizes the approximation error for\nthe practically important top-k neighbors. It does so by iteratively performing\nk-NN search using the anchors available so far, then adding these retrieved\nnearest neighbors to the anchor set for the next round. Empirically, on\nmultiple datasets, in comparison to previous traditional and state-of-the-art\nmethods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed\napproach ADACUR consistently reduces recall error-by up to 70% on the important\nk = 1 setting-while using no more compute than its competitors.\n","authors":["Nishant Yadav","Nicholas Monath","Manzil Zaheer","Andrew McCallum"],"pdf_url":"https://arxiv.org/pdf/2305.02996v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2307.09384v2","updated":"2023-10-23T17:24:02Z","published":"2023-07-18T16:05:25Z","title":"Zero-shot Query Reformulation for Conversational Search","summary":" As the popularity of voice assistants continues to surge, conversational\nsearch has gained increased attention in Information Retrieval. However, data\nsparsity issues in conversational search significantly hinder the progress of\nsupervised conversational search methods. Consequently, researchers are\nfocusing more on zero-shot conversational search approaches. Nevertheless,\nexisting zero-shot methods face three primary limitations: they are not\nuniversally applicable to all retrievers, their effectiveness lacks sufficient\nexplainability, and they struggle to resolve common conversational ambiguities\ncaused by omission. 
To address these limitations, we introduce a novel\nZero-shot Query Reformulation (ZeQR) framework that reformulates queries based\non previous dialogue contexts without requiring supervision from conversational\nsearch data. Specifically, our framework utilizes language models designed for\nmachine reading comprehension tasks to explicitly resolve two common\nambiguities: coreference and omission, in raw queries. In comparison to\nexisting zero-shot methods, our approach is universally applicable to any\nretriever without additional adaptation or indexing. It also provides greater\nexplainability and effectively enhances query intent understanding because\nambiguities are explicitly and proactively resolved. Through extensive\nexperiments on four TREC conversational datasets, we demonstrate the\neffectiveness of our method, which consistently outperforms state-of-the-art\nbaselines.\n","authors":["Dayu Yang","Yue Zhang","Hui Fang"],"pdf_url":"https://arxiv.org/pdf/2307.09384v2.pdf","comment":"Accepted by the 9th ACM SIGIR International Conference on the Theory\n of Information Retrieval"},{"id":"http://arxiv.org/abs/2210.03116v3","updated":"2023-10-23T16:28:09Z","published":"2022-10-06T17:59:51Z","title":"Content-Based Search for Deep Generative Models","summary":" The growing proliferation of customized and pretrained generative models has\nmade it infeasible for a user to be fully cognizant of every model in\nexistence. To address this need, we introduce the task of content-based model\nsearch: given a query and a large set of generative models, finding the models\nthat best match the query. As each generative model produces a distribution of\nimages, we formulate the search task as an optimization problem to select the\nmodel with the highest probability of generating similar content as the query.\nWe introduce a formulation to approximate this probability given the query from\ndifferent modalities, e.g., image, sketch, and text. 
Furthermore, we propose a\ncontrastive learning framework for model retrieval, which learns to adapt\nfeatures for various query modalities. We demonstrate that our method\noutperforms several baselines on Generative Model Zoo, a new benchmark we\ncreate for the model retrieval task.\n","authors":["Daohan Lu","Sheng-Yu Wang","Nupur Kumari","Rohan Agarwal","David Bau","Jun-Yan Zhu"],"pdf_url":"https://arxiv.org/pdf/2210.03116v3.pdf","comment":"Our project page is hosted at\n https://generative-intelligence-lab.github.io/modelverse/"},{"id":"http://arxiv.org/abs/2305.14499v2","updated":"2023-10-23T14:46:34Z","published":"2023-05-23T20:09:52Z","title":"NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive\n Decoders","summary":" Neural document rerankers are extremely effective in terms of accuracy.\nHowever, the best models require dedicated hardware for serving, which is\ncostly and often not feasible. To avoid this serving-time requirement, we\npresent a method of capturing up to 86% of the gains of a Transformer\ncross-attention model with a lexicalized scoring function that only requires\n10^-6% of the Transformer's FLOPs per document and can be served using commodity\nCPUs. When combined with a BM25 retriever, this approach matches the quality of\na state-of-the-art dual encoder retriever, that still requires an accelerator\nfor query encoding. We introduce NAIL (Non-Autoregressive Indexing with\nLanguage models) as a model architecture that is compatible with recent\nencoder-decoder and decoder-only large language models, such as T5, GPT-3 and\nPaLM. This model architecture can leverage existing pre-trained checkpoints and\ncan be fine-tuned for efficiently constructing document representations that do\nnot require neural processing of queries.\n","authors":["Livio Baldini Soares","Daniel Gillick","Jeremy R. 
Cole","Tom Kwiatkowski"],"pdf_url":"https://arxiv.org/pdf/2305.14499v2.pdf","comment":"To appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14884v1","updated":"2023-10-23T12:53:22Z","published":"2023-10-23T12:53:22Z","title":"Budgeted Embedding Table For Recommender Systems","summary":" At the heart of contemporary recommender systems (RSs) are latent factor\nmodels that provide quality recommendation experience to users. These models\nuse embedding vectors, which are typically of a uniform and fixed size, to\nrepresent users and items. As the number of users and items continues to grow,\nthis design becomes inefficient and hard to scale. Recent lightweight embedding\nmethods have enabled different users and items to have diverse embedding sizes,\nbut are commonly subject to two major drawbacks. Firstly, they limit the\nembedding size search to optimizing a heuristic balancing the recommendation\nquality and the memory complexity, where the trade-off coefficient needs to be\nmanually tuned for every memory budget requested. The implicitly enforced\nmemory complexity term can even fail to cap the parameter usage, making the\nresultant embedding table fail to meet the memory budget strictly. Secondly,\nmost solutions, especially reinforcement learning based ones, derive and\noptimize the embedding size for each user/item on an instance-by-instance\nbasis, which impedes the search efficiency. In this paper, we propose Budgeted\nEmbedding Table (BET), a novel method that generates table-level actions (i.e.,\nembedding sizes for all users and items) that is guaranteed to meet\npre-specified memory budgets. Furthermore, by leveraging a set-based action\nformulation and engaging set representation learning, we present an innovative\naction search strategy powered by an action fitness predictor that efficiently\nevaluates each table-level action. 
Experiments have shown state-of-the-art\nperformance on two real-world datasets when BET is paired with three popular\nrecommender models under different memory budgets.\n","authors":["Yunke Qu","Tong Chen","Quoc Viet Hung Nguyen","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2310.14884v1.pdf","comment":"Accepted by WSDM 2024"},{"id":"http://arxiv.org/abs/2310.14802v1","updated":"2023-10-23T10:58:09Z","published":"2023-10-23T10:58:09Z","title":"DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye\n Movement for Machine Reading","summary":" The use of visually-rich documents (VRDs) in various fields has created a\ndemand for Document AI models that can read and comprehend documents like\nhumans, which requires the overcoming of technical, linguistic, and cognitive\nbarriers. Unfortunately, the lack of appropriate datasets has significantly\nhindered advancements in the field. To address this issue, we introduce\n\\textsc{DocTrack}, a VRD dataset really aligned with human eye-movement\ninformation using eye-tracking technology. This dataset can be used to\ninvestigate the challenges mentioned above. Additionally, we explore the impact\nof human reading order on document understanding tasks and examine what would\nhappen if a machine reads in the same order as a human. Our results suggest\nthat although Document AI models have made significant progress, they still\nhave a long way to go before they can read VRDs as accurately, continuously,\nand flexibly as humans do. These findings have potential implications for\nfuture research and development of Document AI models. 
The data is available at\n\\url{https://github.com/hint-lab/doctrack}.\n","authors":["Hao Wang","Qingxuan Wang","Yue Li","Changqing Wang","Chenhui Chu","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14802v1.pdf","comment":"14 pages, 8 figures, Accepted by Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2308.11175v2","updated":"2023-10-23T10:46:07Z","published":"2023-08-22T04:06:56Z","title":"MISSRec: Pre-training and Transferring Multi-modal Interest-aware\n Sequence Representation for Recommendation","summary":" The goal of sequential recommendation (SR) is to predict a user's potential\ninterested items based on her/his historical interaction sequences. Most\nexisting sequential recommenders are developed based on ID features, which,\ndespite their widespread use, often underperform with sparse IDs and struggle\nwith the cold-start problem. Besides, inconsistent ID mappings hinder the\nmodel's transferability, isolating similar recommendation domains that could\nhave been co-optimized. This paper aims to address these issues by exploring\nthe potential of multi-modal information in learning robust and generalizable\nsequence representations. We propose MISSRec, a multi-modal pre-training and\ntransfer learning framework for SR. On the user side, we design a\nTransformer-based encoder-decoder model, where the contextual encoder learns to\ncapture the sequence-level multi-modal user interests while a novel\ninterest-aware decoder is developed to grasp item-modality-interest relations\nfor better sequence representation. On the candidate item side, we adopt a\ndynamic fusion module to produce user-adaptive item representation, providing\nmore precise matching between users and items. We pre-train the model with\ncontrastive learning objectives and fine-tune it in an efficient manner.\nExtensive experiments demonstrate the effectiveness and flexibility of MISSRec,\npromising a practical solution for real-world recommendation scenarios. 
Data\nand code are available on \\url{https://github.com/gimpong/MM23-MISSRec}.\n","authors":["Jinpeng Wang","Ziyun Zeng","Yunxiao Wang","Yuting Wang","Xingyu Lu","Tianxiang Li","Jun Yuan","Rui Zhang","Hai-Tao Zheng","Shu-Tao Xia"],"pdf_url":"https://arxiv.org/pdf/2308.11175v2.pdf","comment":"Accepted to ACM MM 2023. Data and code are available"},{"id":"http://arxiv.org/abs/2306.10532v2","updated":"2023-10-23T07:29:30Z","published":"2023-06-18T11:51:39Z","title":"Personalized Elastic Embedding Learning for On-Device Recommendation","summary":" To address privacy concerns and reduce network latency, there has been a\nrecent trend of compressing cumbersome recommendation models trained on the\ncloud and deploying compact recommender models to resource-limited devices for\nreal-time recommendation. Existing solutions generally overlook device\nheterogeneity and user heterogeneity. They either require all devices to share\nthe same compressed model or the devices with the same resource budget to share\nthe same model. However, even users with the same devices may have different\npreferences. In addition, they assume the available resources (e.g., memory)\nfor the recommender on a device are constant, which is not reflective of\nreality. In light of device and user heterogeneities as well as dynamic\nresource constraints, this paper proposes a Personalized Elastic Embedding\nLearning framework (PEEL) for on-device recommendation, which generates\npersonalized embeddings for devices with various memory budgets in once-for-all\nmanner, efficiently adapting to new or dynamic budgets, and effectively\naddressing user preference diversity by assigning personalized embeddings for\ndifferent groups of users. Specifically, it pretrains using user-item\ninteraction instances to generate the global embedding table and cluster users\ninto groups. Then, it refines the embedding tables with local interaction\ninstances within each group. 
Personalized elastic embedding is generated from\nthe group-wise embedding blocks and their weights that indicate the\ncontribution of each embedding block to the local recommendation performance.\nPEEL efficiently generates personalized elastic embeddings by selecting\nembedding blocks with the largest weights, making it adaptable to dynamic\nmemory budgets. Extensive experiments are conducted on two public datasets, and\nthe results show that PEEL yields superior performance on devices with\nheterogeneous and dynamic memory budgets.\n","authors":["Ruiqi Zheng","Liang Qu","Tong Chen","Kai Zheng","Yuhui Shi","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2306.10532v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14626v1","updated":"2023-10-23T07:00:51Z","published":"2023-10-23T07:00:51Z","title":"Conversational Recommender System and Large Language Model Are Made for\n Each Other in E-commerce Pre-sales Dialogue","summary":" E-commerce pre-sales dialogue aims to understand and elicit user needs and\npreferences for the items they are seeking so as to provide appropriate\nrecommendations. Conversational recommender systems (CRSs) learn user\nrepresentation and provide accurate recommendations based on dialogue context,\nbut rely on external knowledge. Large language models (LLMs) generate responses\nthat mimic pre-sales dialogues after fine-tuning, but lack domain-specific\nknowledge for accurate recommendations. Intuitively, the strengths of LLM and\nCRS in E-commerce pre-sales dialogues are complementary, yet no previous work\nhas explored this. This paper investigates the effectiveness of combining LLM\nand CRS in E-commerce pre-sales dialogues, proposing two collaboration methods:\nCRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a\nreal-world dataset of E-commerce pre-sales dialogues. We analyze the impact of\ntwo collaborative approaches with two CRSs and two LLMs on four tasks of\nE-commerce pre-sales dialogue. 
We find that collaborations between CRS and LLM\ncan be very effective in some cases.\n","authors":["Yuanxing Liu","Wei-Nan Zhang","Yifan Chen","Yuchi Zhang","Haopeng Bai","Fan Feng","Hengbin Cui","Yongbin Li","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2310.14626v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.14587v1","updated":"2023-10-23T05:52:09Z","published":"2023-10-23T05:52:09Z","title":"Large Search Model: Redefining Search Stack in the Era of LLMs","summary":" Modern search engines are built on a stack of different components, including\nquery understanding, retrieval, multi-stage ranking, and question answering,\namong others. These components are often optimized and deployed independently.\nIn this paper, we introduce a novel conceptual framework called large search\nmodel, which redefines the conventional search stack by unifying search tasks\nwith one large language model (LLM). All tasks are formulated as autoregressive\ntext generation problems, allowing for the customization of tasks through the\nuse of natural language prompts. This proposed framework capitalizes on the\nstrong language understanding and reasoning capabilities of LLMs, offering the\npotential to enhance search result quality while simultaneously simplifying the\nexisting cumbersome search stack. 
To substantiate the feasibility of this\nframework, we present a series of proof-of-concept experiments and discuss the\npotential challenges associated with implementing this approach within\nreal-world search systems.\n","authors":["Liang Wang","Nan Yang","Xiaolong Huang","Linjun Yang","Rangan Majumder","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2310.14587v1.pdf","comment":"16 pages"},{"id":"http://arxiv.org/abs/2305.13631v2","updated":"2023-10-23T05:42:51Z","published":"2023-05-23T02:59:19Z","title":"EDIS: Entity-Driven Image Search over Multimodal Web Content","summary":" Making image retrieval methods practical for real-world search applications\nrequires significant progress in dataset scales, entity comprehension, and\nmultimodal information fusion. In this work, we introduce\n\\textbf{E}ntity-\\textbf{D}riven \\textbf{I}mage \\textbf{S}earch (EDIS), a\nchallenging dataset for cross-modal image search in the news domain. EDIS\nconsists of 1 million web images from actual search engine results and curated\ndatasets, with each image paired with a textual description. Unlike datasets\nthat assume a small set of single-modality candidates, EDIS reflects real-world\nweb image search scenarios by including a million multimodal image-text pairs\nas candidates. EDIS encourages the development of retrieval models that\nsimultaneously address cross-modal information fusion and matching. To achieve\naccurate ranking results, a model must: 1) understand named entities and events\nfrom text queries, 2) ground entities onto images or text descriptions, and 3)\neffectively fuse textual and visual representations. Our experimental results\nshow that EDIS challenges state-of-the-art methods with dense entities and a\nlarge-scale candidate set. 
The ablation study also proves that fusing textual\nfeatures with visual features is critical in improving retrieval results.\n","authors":["Siqi Liu","Weixi Feng","Tsu-jui Fu","Wenhu Chen","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.13631v2.pdf","comment":"EMNLP 2023 camera ready version"},{"id":"http://arxiv.org/abs/2306.02371v2","updated":"2023-10-23T04:30:43Z","published":"2023-06-04T15:02:11Z","title":"I^3 Retriever: Incorporating Implicit Interaction in Pre-trained\n Language Models for Passage Retrieval","summary":" Passage retrieval is a fundamental task in many information systems, such as\nweb search and question answering, where both efficiency and effectiveness are\ncritical concerns. In recent years, neural retrievers based on pre-trained\nlanguage models (PLM), such as dual-encoders, have achieved huge success. Yet,\nstudies have found that the performance of dual-encoders is often limited due\nto the neglecting of the interaction information between queries and candidate\npassages. Therefore, various interaction paradigms have been proposed to\nimprove the performance of vanilla dual-encoders. Particularly, recent\nstate-of-the-art methods often introduce late-interaction during the model\ninference process. However, such late-interaction based methods usually bring\nextensive computation and storage cost on large corpus. Despite their\neffectiveness, the concern of efficiency and space footprint is still an\nimportant factor that limits the application of interaction-based neural\nretrieval models. To tackle this issue, we incorporate implicit interaction\ninto dual-encoders, and propose I^3 retriever. In particular, our implicit\ninteraction paradigm leverages generated pseudo-queries to simulate\nquery-passage interaction, which jointly optimizes with query and passage\nencoders in an end-to-end manner. 
It can be fully pre-computed and cached, and\nits inference process only involves simple dot product operation of the query\nvector and passage vector, which makes it as efficient as the vanilla dual\nencoders. We conduct comprehensive experiments on MSMARCO and TREC2019 Deep\nLearning Datasets, demonstrating the I^3 retriever's superiority in terms of\nboth effectiveness and efficiency. Moreover, the proposed implicit interaction\nis compatible with special pre-training and knowledge distillation for passage\nretrieval, which brings a new state-of-the-art performance.\n","authors":["Qian Dong","Yiding Liu","Qingyao Ai","Haitao Li","Shuaiqiang Wang","Yiqun Liu","Dawei Yin","Shaoping Ma"],"pdf_url":"https://arxiv.org/pdf/2306.02371v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2310.14512v1","updated":"2023-10-23T02:47:27Z","published":"2023-10-23T02:47:27Z","title":"CorefPrompt: Prompt-based Event Coreference Resolution by Measuring\n Event Type and Argument Compatibilities","summary":" Event coreference resolution (ECR) aims to group event mentions referring to\nthe same real-world event into clusters. Most previous studies adopt the\n\"encoding first, then scoring\" framework, making the coreference judgment rely\non event encoding. Furthermore, current methods struggle to leverage\nhuman-summarized ECR rules, e.g., coreferential events should have the same\nevent type, to guide the model. To address these two issues, we propose a\nprompt-based approach, CorefPrompt, to transform ECR into a cloze-style MLM\n(masked language model) task. This allows for simultaneous event modeling and\ncoreference discrimination within a single template, with a fully shared\ncontext. 
In addition, we introduce two auxiliary prompt tasks, event-type\ncompatibility and argument compatibility, to explicitly demonstrate the\nreasoning process of ECR, which helps the model make final predictions.\nExperimental results show that our method CorefPrompt performs well in a\nstate-of-the-art (SOTA) benchmark.\n","authors":["Sheng Xu","Peifeng Li","Qiaoming Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.14512v1.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2310.14483v1","updated":"2023-10-23T01:29:18Z","published":"2023-10-23T01:29:18Z","title":"\"Why Should I Review This Paper?\" Unifying Semantic, Topic, and Citation\n Factors for Paper-Reviewer Matching","summary":" As many academic conferences are overwhelmed by a rapidly increasing number\nof paper submissions, automatically finding appropriate reviewers for each\nsubmission becomes a more urgent need than ever. Various factors have been\nconsidered by previous attempts on this task to measure the expertise relevance\nbetween a paper and a reviewer, including whether the paper is semantically\nclose to, shares topics with, and cites previous papers of the reviewer.\nHowever, the majority of previous studies take only one of these factors into\naccount, leading to an incomprehensive evaluation of paper-reviewer relevance.\nTo bridge this gap, in this paper, we propose a unified model for\npaper-reviewer matching that jointly captures semantic, topic, and citation\nfactors. In the unified model, a contextualized language model backbone is\nshared by all factors to learn common knowledge, while instruction tuning is\nintroduced to characterize the uniqueness of each factor by producing\nfactor-aware paper embeddings. 
Experiments on four datasets (one of which is\nnewly contributed by us) across different fields, including machine learning,\ncomputer vision, information retrieval, and data mining, consistently validate\nthe effectiveness of our proposed UniPR model in comparison with\nstate-of-the-art paper-reviewer matching methods and scientific pre-trained\nlanguage models.\n","authors":["Yu Zhang","Yanzhen Shen","Xiusi Chen","Bowen Jin","Jiawei Han"],"pdf_url":"https://arxiv.org/pdf/2310.14483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15342v1","updated":"2023-10-23T20:15:30Z","published":"2023-10-23T20:15:30Z","title":"Towards Hybrid-grained Feature Interaction Selection for Deep Sparse\n Network","summary":" Deep sparse networks are widely investigated as a neural network architecture\nfor prediction tasks with high-dimensional sparse features, with which feature\ninteraction selection is a critical component. While previous methods primarily\nfocus on how to search feature interaction in a coarse-grained space, less\nattention has been given to a finer granularity. In this work, we introduce a\nhybrid-grained feature interaction selection approach that targets both feature\nfield and feature value for deep sparse networks. To explore such expansive\nspace, we propose a decomposed space which is calculated on the fly. We then\ndevelop a selection algorithm called OptFeature, which efficiently selects the\nfeature interaction from both the feature field and the feature value\nsimultaneously. Results from experiments on three large real-world benchmark\ndatasets demonstrate that OptFeature performs well in terms of accuracy and\nefficiency. 
Additional studies support the feasibility of our method.\n","authors":["Fuyuan Lyu","Xing Tang","Dugang Liu","Chen Ma","Weihong Luo","Liang Chen","Xiuqiang He","Xue Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15342v1.pdf","comment":"NeurIPS 2023 poster"},{"id":"http://arxiv.org/abs/2304.09848v2","updated":"2023-10-23T19:11:38Z","published":"2023-04-19T17:56:12Z","title":"Evaluating Verifiability in Generative Search Engines","summary":" Generative search engines directly generate responses to user queries, along\nwith in-line citations. A prerequisite trait of a trustworthy generative search\nengine is verifiability, i.e., systems should cite comprehensively (high\ncitation recall; all statements are fully supported by citations) and\naccurately (high citation precision; every cite supports its associated\nstatement). We conduct human evaluation to audit four popular generative search\nengines -- Bing Chat, NeevaAI, perplexity.ai, and YouChat -- across a diverse\nset of queries from a variety of sources (e.g., historical Google user queries,\ndynamically-collected open-ended questions on Reddit, etc.). We find that\nresponses from existing generative search engines are fluent and appear\ninformative, but frequently contain unsupported statements and inaccurate\ncitations: on average, a mere 51.5% of generated sentences are fully supported\nby citations and only 74.5% of citations support their associated sentence. We\nbelieve that these results are concerningly low for systems that may serve as a\nprimary tool for information-seeking users, especially given their facade of\ntrustworthiness. We hope that our results further motivate the development of\ntrustworthy generative search engines and help researchers and users better\nunderstand the shortcomings of existing commercial systems.\n","authors":["Nelson F. 
Liu","Tianyi Zhang","Percy Liang"],"pdf_url":"https://arxiv.org/pdf/2304.09848v2.pdf","comment":"25 pages, 12 figures; to appear in Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15275v1","updated":"2023-10-23T18:25:33Z","published":"2023-10-23T18:25:33Z","title":"Triple Simplex Matrix Completion for Expense Forecasting","summary":" Forecasting project expenses is a crucial step for businesses to avoid budget\noverruns and project failures. Traditionally, this has been done by financial\nanalysts or data science techniques such as time-series analysis. However,\nthese approaches can be uncertain and produce results that differ from the\nplanned budget, especially at the start of a project with limited data points.\nThis paper proposes a constrained non-negative matrix completion model that\npredicts expenses by learning the likelihood of the project correlating with\ncertain expense patterns in the latent space. The model is constrained on three\nprobability simplexes, two of which are on the factor matrices and the third on\nthe missing entries. Additionally, the predicted expense values are guaranteed\nto meet the budget constraint without the need of post-processing. An inexact\nalternating optimization algorithm is developed to solve the associated\noptimization problem and is proven to converge to a stationary point. Results\nfrom two real datasets demonstrate the effectiveness of the proposed method in\ncomparison to state-of-the-art algorithms.\n","authors":["Cheng Qian","Lucas Glass","Nikos Sidiropoulos"],"pdf_url":"https://arxiv.org/pdf/2310.15275v1.pdf","comment":"5 pages 2 figures"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.15168v1","updated":"2023-10-23T17:59:52Z","published":"2023-10-23T17:59:52Z","title":"Ghost on the Shell: An Expressive Representation of General 3D Shapes","summary":" The creation of photorealistic virtual worlds requires the accurate modeling\nof 3D surface geometry for a wide range of objects. 
For this, meshes are\nappealing since they 1) enable fast physics-based rendering with realistic\nmaterial and lighting, 2) support physical simulation, and 3) are\nmemory-efficient for modern graphics pipelines. Recent work on reconstructing\nand statistically modeling 3D shape, however, has critiqued meshes as being\ntopologically inflexible. To capture a wide range of object shapes, any 3D\nrepresentation must be able to model solid, watertight, shapes as well as thin,\nopen, surfaces. Recent work has focused on the former, and methods for\nreconstructing open surfaces do not support fast reconstruction with material\nand lighting or unconditional generative modelling. Inspired by the observation\nthat open surfaces can be seen as islands floating on watertight surfaces, we\nparameterize open surfaces by defining a manifold signed distance field on\nwatertight templates. With this parameterization, we further develop a\ngrid-based and differentiable representation that parameterizes both watertight\nand non-watertight meshes of arbitrary topology. Our new representation, called\nGhost-on-the-Shell (G-Shell), enables two important applications:\ndifferentiable rasterization-based reconstruction from multiview images and\ngenerative modelling of non-watertight meshes. We empirically demonstrate that\nG-Shell achieves state-of-the-art performance on non-watertight mesh\nreconstruction and generation tasks, while also performing effectively for\nwatertight meshes.\n","authors":["Zhen Liu","Yao Feng","Yuliang Xiu","Weiyang Liu","Liam Paull","Michael J. 
Black","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2310.15168v1.pdf","comment":"Technical Report (26 pages, 16 figures)"},{"id":"http://arxiv.org/abs/2310.15165v1","updated":"2023-10-23T17:59:16Z","published":"2023-10-23T17:59:16Z","title":"Handling Data Heterogeneity via Architectural Design for Federated\n Visual Recognition","summary":" Federated Learning (FL) is a promising research paradigm that enables the\ncollaborative training of machine learning models among various parties without\nthe need for sensitive information exchange. Nonetheless, retaining data in\nindividual clients introduces fundamental challenges to achieving performance\non par with centrally trained models. Our study provides an extensive review of\nfederated learning applied to visual recognition. It underscores the critical\nrole of thoughtful architectural design choices in achieving optimal\nperformance, a factor often neglected in the FL literature. Many existing FL\nsolutions are tested on shallow or simple networks, which may not accurately\nreflect real-world applications. This practice restricts the transferability of\nresearch findings to large-scale visual recognition models. Through an in-depth\nanalysis of diverse cutting-edge architectures such as convolutional neural\nnetworks, transformers, and MLP-mixers, we experimentally demonstrate that\narchitectural choices can substantially enhance FL systems' performance,\nparticularly when handling heterogeneous data. We study 19 visual recognition\nmodels from five different architectural families on four challenging FL\ndatasets. We also re-investigate the inferior performance of convolution-based\narchitectures in the FL setting and analyze the influence of normalization\nlayers on the FL performance. Our findings emphasize the importance of\narchitectural design for computer vision tasks in practical scenarios,\neffectively narrowing the performance gap between federated and centralized\nlearning. 
Our source code is available at\nhttps://github.com/sarapieri/fed_het.git.\n","authors":["Sara Pieri","Jose Renato Restom","Samuel Horvath","Hisham Cholakkal"],"pdf_url":"https://arxiv.org/pdf/2310.15165v1.pdf","comment":"to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15154v1","updated":"2023-10-23T17:55:31Z","published":"2023-10-23T17:55:31Z","title":"Linear Representations of Sentiment in Large Language Models","summary":" Sentiment is a pervasive feature in natural language text, yet it is an open\nquestion how sentiment is represented within Large Language Models (LLMs). In\nthis study, we reveal that across a range of models, sentiment is represented\nlinearly: a single direction in activation space mostly captures the feature\nacross a range of tasks with one extreme for positive and the other for\nnegative. Through causal interventions, we isolate this direction and show it\nis causally relevant in both toy tasks and real world datasets such as Stanford\nSentiment Treebank. Through this case study we model a thorough investigation\nof what a single direction means on a broad data distribution.\n We further uncover the mechanisms that involve this direction, highlighting\nthe roles of a small subset of attention heads and neurons. Finally, we\ndiscover a phenomenon which we term the summarization motif: sentiment is not\nsolely represented on emotionally charged words, but is additionally summarized\nat intermediate positions without inherent sentiment, such as punctuation and\nnames. 
We show that in Stanford Sentiment Treebank zero-shot classification,\n76% of above-chance classification accuracy is lost when ablating the sentiment\ndirection, nearly half of which (36%) is due to ablating the summarized\nsentiment direction exclusively at comma positions.\n","authors":["Curt Tigges","Oskar John Hollinsworth","Atticus Geiger","Neel Nanda"],"pdf_url":"https://arxiv.org/pdf/2310.15154v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15151v1","updated":"2023-10-23T17:53:47Z","published":"2023-10-23T17:53:47Z","title":"Verb Conjugation in Transformers Is Determined by Linear Encodings of\n Subject Number","summary":" Deep architectures such as Transformers are sometimes criticized for having\nuninterpretable \"black-box\" representations. We use causal intervention\nanalysis to show that, in fact, some linguistic features are represented in a\nlinear, interpretable format. Specifically, we show that BERT's ability to\nconjugate verbs relies on a linear encoding of subject number that can be\nmanipulated with predictable effects on conjugation accuracy. This encoding is\nfound in the subject position at the first layer and the verb position at the\nlast layer, but distributed across positions at middle layers, particularly\nwhen there are multiple cues to subject number.\n","authors":["Sophie Hao","Tal Linzen"],"pdf_url":"https://arxiv.org/pdf/2310.15151v1.pdf","comment":"To appear in Findings of the Association for Computational\n Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15150v1","updated":"2023-10-23T17:53:14Z","published":"2023-10-23T17:53:14Z","title":"Online Detection of AI-Generated Images","summary":" With advancements in AI-generated images coming on a continuous basis, it is\nincreasingly difficult to distinguish traditionally-sourced images (e.g.,\nphotos, artwork) from AI-generated ones. Previous detection methods study the\ngeneralization from a single generator to another in isolation. 
However, in\nreality, new generators are released on a streaming basis. We study\ngeneralization in this setting, training on N models and testing on the next\n(N+k), following the historical release dates of well-known generation methods.\nFurthermore, images increasingly consist of both real and generated components,\nfor example through image inpainting. Thus, we extend this approach to pixel\nprediction, demonstrating strong performance using automatically-generated\ninpainted data. In addition, for settings where commercial models are not\npublicly available for automatic data generation, we evaluate if pixel\ndetectors can be trained solely on whole synthetic images.\n","authors":["David C. Epstein","Ishan Jain","Oliver Wang","Richard Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.15150v1.pdf","comment":"ICCV DeepFake Analysis and Detection Workshop, 2023"},{"id":"http://arxiv.org/abs/2310.15149v1","updated":"2023-10-23T17:53:09Z","published":"2023-10-23T17:53:09Z","title":"Unlocking the Transferability of Tokens in Deep Models for Tabular Data","summary":" Fine-tuning a pre-trained deep neural network has become a successful\nparadigm in various machine learning tasks. However, such a paradigm becomes\nparticularly challenging with tabular data when there are discrepancies between\nthe feature sets of pre-trained models and the target tasks. In this paper, we\npropose TabToken, a method that aims to enhance the quality of feature tokens\n(i.e., embeddings of tabular features). TabToken allows for the utilization of\npre-trained models when the upstream and downstream tasks share overlapping\nfeatures, facilitating model fine-tuning even with limited training examples.\nSpecifically, we introduce a contrastive objective that regularizes the tokens,\ncapturing the semantics within and across features. During the pre-training\nstage, the tokens are learned jointly with top-layer deep models such as\ntransformers. 
In the downstream task, tokens of the shared features are kept\nfixed while TabToken efficiently fine-tunes the remaining parts of the model.\nTabToken not only enables knowledge transfer from a pre-trained model to tasks\nwith heterogeneous features, but also enhances the discriminative ability of\ndeep tabular models in standard classification and regression tasks.\n","authors":["Qi-Le Zhou","Han-Jia Ye","Le-Ye Wang","De-Chuan Zhan"],"pdf_url":"https://arxiv.org/pdf/2310.15149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07980v2","updated":"2023-10-23T17:50:33Z","published":"2023-10-12T02:03:10Z","title":"GRASP: Accelerating Shortest Path Attacks via Graph Attention","summary":" Recent advances in machine learning (ML) have shown promise in aiding and\naccelerating classical combinatorial optimization algorithms. ML-based speed\nups that aim to learn in an end to end manner (i.e., directly output the\nsolution) tend to trade off run time with solution quality. Therefore,\nsolutions that are able to accelerate existing solvers while maintaining their\nperformance guarantees, are of great interest. We consider an APX-hard problem,\nwhere an adversary aims to attack shortest paths in a graph by removing the\nminimum number of edges. We propose the GRASP algorithm: Graph Attention\nAccelerated Shortest Path Attack, an ML aided optimization algorithm that\nachieves run times up to 10x faster, while maintaining the quality of solution\ngenerated. GRASP uses a graph attention network to identify a smaller subgraph\ncontaining the combinatorial solution, thus effectively reducing the input\nproblem size. Additionally, we demonstrate how careful representation of the\ninput graph, including node features that correlate well with the optimization\ntask, can highlight important structure in the optimization solution.\n","authors":["Zohair Shafi","Benjamin A. Miller","Ayan Chatterjee","Tina Eliassi-Rad","Rajmonda S. 
Caceres"],"pdf_url":"https://arxiv.org/pdf/2310.07980v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15145v1","updated":"2023-10-23T17:50:08Z","published":"2023-10-23T17:50:08Z","title":"Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for\n Autonomous Real-World Reinforcement Learning","summary":" The pre-train and fine-tune paradigm in machine learning has had dramatic\nsuccess in a wide range of domains because the use of existing data or\npre-trained models on the internet enables quick and easy learning of new\ntasks. We aim to enable this paradigm in robotic reinforcement learning,\nallowing a robot to learn a new task with little human effort by leveraging\ndata and models from the Internet. However, reinforcement learning often\nrequires significant human effort in the form of manual reward specification or\nenvironment resets, even if the policy is pre-trained. We introduce RoboFuME, a\nreset-free fine-tuning system that pre-trains a multi-task manipulation policy\nfrom diverse datasets of prior experiences and self-improves online to learn a\ntarget task with minimal human intervention. Our insights are to utilize\ncalibrated offline reinforcement learning techniques to ensure efficient online\nfine-tuning of a pre-trained policy in the presence of distribution shifts and\nleverage pre-trained vision language models (VLMs) to build a robust reward\nclassifier for autonomously providing reward signals during the online\nfine-tuning process. In a diverse set of five real robot manipulation tasks, we\nshow that our method can incorporate data from an existing robot dataset\ncollected at a different institution and improve on a target task within as\nlittle as 3 hours of autonomous real-world experience. We also demonstrate in\nsimulation experiments that our method outperforms prior works that use\ndifferent RL algorithms or different approaches for predicting rewards. 
Project\nwebsite: https://robofume.github.io\n","authors":["Jingyun Yang","Max Sobol Mark","Brandon Vu","Archit Sharma","Jeannette Bohg","Chelsea Finn"],"pdf_url":"https://arxiv.org/pdf/2310.15145v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.02996v2","updated":"2023-10-23T17:48:34Z","published":"2023-05-04T17:01:17Z","title":"Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR\n Decomposition","summary":" Cross-encoder models, which jointly encode and score a query-item pair, are\nprohibitively expensive for direct k-nearest neighbor (k-NN) search.\nConsequently, k-NN search typically employs a fast approximate retrieval (e.g.\nusing BM25 or dual-encoder vectors), followed by reranking with a\ncross-encoder; however, the retrieval approximation often has detrimental\nrecall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent\nwork that employs a cross-encoder only, making search efficient using a\nrelatively small number of anchor items, and a CUR matrix factorization. While\nANNCUR's one-time selection of anchors tends to approximate the cross-encoder\ndistances on average, doing so forfeits the capacity to accurately estimate\ndistances to items near the query, leading to regret in the crucial end-task:\nrecall of top-k items. In this paper, we propose ADACUR, a method that\nadaptively, iteratively, and efficiently minimizes the approximation error for\nthe practically important top-k neighbors. It does so by iteratively performing\nk-NN search using the anchors available so far, then adding these retrieved\nnearest neighbors to the anchor set for the next round. 
Empirically, on\nmultiple datasets, in comparison to previous traditional and state-of-the-art\nmethods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed\napproach ADACUR consistently reduces recall error (by up to 70% on the important\nk = 1 setting) while using no more compute than its competitors.\n","authors":["Nishant Yadav","Nicholas Monath","Manzil Zaheer","Andrew McCallum"],"pdf_url":"https://arxiv.org/pdf/2305.02996v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15143v1","updated":"2023-10-23T17:48:11Z","published":"2023-10-23T17:48:11Z","title":"Hyperparameter optimization of hp-greedy reduced basis for gravitational\n wave surrogates","summary":" In a previous work we introduced, in the context of gravitational wave\nscience, an initial study on an automated domain-decomposition approach for\nreduced basis through hp-greedy refinement. The approach constructs local\nreduced bases of lower dimensionality than global ones, with the same or higher\naccuracy. These ``light'' local bases should imply both faster evaluations when\npredicting new waveforms and faster data analysis, in particular faster\nstatistical inference (the forward and inverse problems, respectively). In this\napproach, however, we have previously found important dependence on several\nhyperparameters, which do not appear in global reduced basis. This naturally\nleads to the problem of hyperparameter optimization (HPO), which is the subject\nof this paper. We tackle the problem through a Bayesian optimization, and show\nits superiority when compared to grid or random searches. We find that for\ngravitational waves from the collision of two spinning but non-precessing black\nholes, for the same accuracy, local hp-greedy reduced bases with HPO have a\nlower dimensionality of up to $4 \\times$ for the cases here studied, depending\non the desired accuracy. This factor should directly translate into a parameter\nestimation speedup, for instance. 
Such acceleration might help in the near\nreal-time requirements for electromagnetic counterparts of gravitational waves\nfrom compact binary coalescences. In addition, we find that the Bayesian\napproach used in this paper for HPO is two orders of magnitude faster than, for\nexample, a grid search, with about a $100 \\times$ acceleration. The code\ndeveloped for this project is available as open source from public\nrepositories.\n","authors":["Franco Cerino","Andrés Diaz-Pace","Emmanuel Tassone","Manuel Tiglio","Atuel Villegas"],"pdf_url":"https://arxiv.org/pdf/2310.15143v1.pdf","comment":"This paper is an invited contribution to the Special Issue \"Recent\n Advances in Gravity: A Themed Issue in Honor of Prof. Jorge Pullin on his\n 60th Anniversary''"},{"id":"http://arxiv.org/abs/2310.15141v1","updated":"2023-10-23T17:47:34Z","published":"2023-10-23T17:47:34Z","title":"SpecTr: Fast Speculative Decoding via Optimal Transport","summary":" Autoregressive sampling from large language models has led to\nstate-of-the-art results in several natural language tasks. However,\nautoregressive sampling generates tokens one at a time making it slow, and even\nprohibitive in certain tasks. One way to speed up sampling is\n$\\textit{speculative decoding}$: use a small model to sample a $\\textit{draft}$\n(block or sequence of tokens), and then score all tokens in the draft by the\nlarge language model in parallel. A subset of the tokens in the draft are\naccepted (and the rest rejected) based on a statistical method to guarantee\nthat the final output follows the distribution of the large model. In this\nwork, we provide a principled understanding of speculative decoding through the\nlens of optimal transport (OT) with $\\textit{membership cost}$. This framework\ncan be viewed as an extension of the well-known $\\textit{maximal-coupling}$\nproblem. 
This new formulation enables us to generalize the speculative decoding\nmethod to allow for a set of $k$ candidates at the token-level, which leads to\nan improved optimal membership cost. We show that the optimal draft selection\nalgorithm (transport plan) can be computed via linear programming, whose\nbest-known runtime is exponential in $k$. We then propose a valid draft\nselection algorithm whose acceptance probability is $(1-1/e)$-optimal\nmultiplicatively. Moreover, it can be computed in time almost linear in the size\nof the domain of a single token. Using this new draft selection algorithm, we\ndevelop a new autoregressive sampling algorithm called $\\textit{SpecTr}$, which\nprovides speedup in decoding while ensuring that there is no quality\ndegradation in the decoded output. We experimentally demonstrate that for\nstate-of-the-art large language models, the proposed approach achieves a wall\nclock speedup of 2.13X, a further 1.37X speedup over speculative decoding on\nstandard benchmarks.\n","authors":["Ziteng Sun","Ananda Theertha Suresh","Jae Hun Ro","Ahmad Beirami","Himanshu Jain","Felix Yu"],"pdf_url":"https://arxiv.org/pdf/2310.15141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15140v1","updated":"2023-10-23T17:46:07Z","published":"2023-10-23T17:46:07Z","title":"AutoDAN: Automatic and Interpretable Adversarial Attacks on Large\n Language Models","summary":" Safety alignment of Large Language Models (LLMs) can be compromised with\nmanual jailbreak attacks and (automatic) adversarial attacks. Recent work\nsuggests that patching LLMs against these attacks is possible: manual jailbreak\nattacks are human-readable but often limited and public, making them easy to\nblock; adversarial attacks generate gibberish prompts that can be detected\nusing perplexity-based filters. In this paper, we show that these solutions may\nbe too optimistic. 
We propose an interpretable adversarial attack,\n\\texttt{AutoDAN}, that combines the strengths of both types of attacks. It\nautomatically generates attack prompts that bypass perplexity-based filters\nwhile maintaining a high attack success rate like manual jailbreak attacks.\nThese prompts are interpretable and diverse, exhibiting strategies commonly\nused in manual jailbreak attacks, and transfer better than their non-readable\ncounterparts when using limited training data or a single proxy model. We also\ncustomize \\texttt{AutoDAN}'s objective to leak system prompts, another\njailbreak application not addressed in the adversarial attack literature,\ndemonstrating the versatility of the approach beyond eliciting harmful content\nfrom the model. Our work provides a new way to red-team LLMs and to understand the\nmechanism of jailbreak attacks.\n","authors":["Sicheng Zhu","Ruiyi Zhang","Bang An","Gang Wu","Joe Barrow","Zichao Wang","Furong Huang","Ani Nenkova","Tong Sun"],"pdf_url":"https://arxiv.org/pdf/2310.15140v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.11373v2","updated":"2023-10-23T17:44:37Z","published":"2023-07-21T06:12:39Z","title":"Diverse Offline Imitation Learning","summary":" There has been significant recent progress in the area of unsupervised skill\ndiscovery, utilizing various information-theoretic objectives as measures of\ndiversity. 
Despite these advances, challenges remain: current methods require\nsignificant online interaction, fail to leverage vast amounts of available\ntask-agnostic data and typically lack a quantitative measure of skill utility.\nWe address these challenges by proposing a principled offline algorithm for\nunsupervised skill discovery that, in addition to maximizing diversity, ensures\nthat each learned skill imitates state-only expert demonstrations to a certain\ndegree. Our main analytical contribution is to connect Fenchel duality,\nreinforcement learning, and unsupervised skill discovery to maximize a mutual\ninformation objective subject to KL-divergence state occupancy constraints.\nFurthermore, we demonstrate the effectiveness of our method on the standard\noffline benchmark D4RL and on a custom offline dataset collected from a 12-DoF\nquadruped robot for which the policies trained in simulation transfer well to\nthe real robotic system.\n","authors":["Marin Vlastelica","Jin Cheng","Georg Martius","Pavel Kolev"],"pdf_url":"https://arxiv.org/pdf/2307.11373v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15135v1","updated":"2023-10-23T17:42:01Z","published":"2023-10-23T17:42:01Z","title":"Quantifying the Dialect Gap and its Correlates Across Languages","summary":" Historically, researchers and consumers have noticed a decrease in quality\nwhen applying NLP tools to minority variants of languages (i.e. Puerto Rican\nSpanish or Swiss German), but studies exploring this have been limited to a\nselect few languages. Additionally, past studies have mainly been conducted in\na monolingual context, so cross-linguistic trends have not been identified and\ntied to external factors. 
In this work, we conduct a comprehensive evaluation\nof the most influential, state-of-the-art large language models (LLMs) across\ntwo high-use applications, machine translation and automatic speech\nrecognition, to assess their functionality on the regional dialects of several\nhigh- and low-resource languages. Additionally, we analyze how the regional\ndialect gap is correlated with economic, social, and linguistic factors. The\nimpact of training data, including related factors like dataset size and its\nconstruction procedure, is shown to be significant but not consistent across\nmodels or languages, meaning a one-size-fits-all approach cannot be taken in\nsolving the dialect gap. This work will lay the foundation for furthering the\nfield of dialectal NLP by laying out evident disparities and identifying\npossible pathways for addressing them through mindful data collection.\n","authors":["Anjali Kantharuban","Ivan Vulić","Anna Korhonen"],"pdf_url":"https://arxiv.org/pdf/2310.15135v1.pdf","comment":"Accepted to EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2303.05639v3","updated":"2023-10-23T17:40:57Z","published":"2023-03-10T01:04:27Z","title":"Self-Supervised One-Shot Learning for Automatic Segmentation of StyleGAN\n Images","summary":" We propose a framework for the automatic one-shot segmentation of synthetic\nimages generated by a StyleGAN. Our framework is based on the observation that\nthe multi-scale hidden features in the GAN generator hold useful semantic\ninformation that can be utilized for automatic on-the-fly segmentation of the\ngenerated images. 
Using these features, our framework learns to segment\nsynthetic images using a self-supervised contrastive clustering algorithm that\nprojects the hidden features into a compact space for per-pixel classification.\nThis contrastive learner is based on using a novel data augmentation strategy\nand a pixel-wise swapped prediction loss that leads to faster learning of the\nfeature vectors for one-shot segmentation. We have tested our implementation on\nfive standard benchmarks to yield a segmentation performance that not only\noutperforms the semi-supervised baselines by an average wIoU margin of 1.02 %\nbut also improves the inference speeds by a factor of 4.5. Finally, we also\nshow the results of using the proposed one-shot learner in implementing BagGAN,\na framework for producing annotated synthetic baggage X-ray scans for threat\ndetection. This framework was trained and tested on the PIDRay baggage\nbenchmark to yield a performance comparable to its baseline segmenter based on\nmanual annotations.\n","authors":["Ankit Manerikar","Avinash C. Kak"],"pdf_url":"https://arxiv.org/pdf/2303.05639v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15129v1","updated":"2023-10-23T17:33:31Z","published":"2023-10-23T17:33:31Z","title":"Location-Aware Visual Question Generation with Lightweight Models","summary":" This work introduces a novel task, location-aware visual question generation\n(LocaVQG), which aims to generate engaging questions from data relevant to a\nparticular geographical location. Specifically, we represent such\nlocation-aware information with surrounding images and a GPS coordinate. To\ntackle this task, we present a dataset generation pipeline that leverages GPT-4\nto produce diverse and sophisticated questions. Then, we aim to learn a\nlightweight model that can address the LocaVQG task and fit on an edge device,\nsuch as a mobile phone. 
To this end, we propose a method which can reliably\ngenerate engaging questions from location-aware information. Our proposed\nmethod outperforms baselines regarding human evaluation (e.g., engagement,\ngrounding, coherence) and automatic evaluation metrics (e.g., BERTScore,\nROUGE-2). Moreover, we conduct extensive ablation studies to justify our\nproposed techniques for both generating the dataset and solving the task.\n","authors":["Nicholas Collin Suwono","Justin Chih-Yao Chen","Tun Min Hung","Ting-Hao Kenneth Huang","I-Bin Liao","Yung-Hui Li","Lun-Wei Ku","Shao-Hua Sun"],"pdf_url":"https://arxiv.org/pdf/2310.15129v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15128v1","updated":"2023-10-23T17:32:38Z","published":"2023-10-23T17:32:38Z","title":"Projected Stochastic Gradient Descent with Quantum Annealed Binary\n Gradients","summary":" We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards\ntraining neural networks with binary weights, known as binary neural networks\n(BNNs), on quantum hardware. BNNs reduce the computational requirements and\nenergy consumption of deep learning models with minimal loss in accuracy.\nHowever, training them in practice remains to be an open challenge. Most known\nBNN-optimisers either rely on projected updates or binarise weights\npost-training. Instead, QP-SBGD approximately maps the gradient onto binary\nvariables, by solving a quadratic constrained binary optimisation. Under\npractically reasonable assumptions, we show that this update rule converges\nwith a rate of $\\mathcal{O}(1 / \\sqrt{T})$. Moreover, we show how the\n$\\mathcal{NP}$-hard projection can be effectively executed on an adiabatic\nquantum annealer, harnessing recent advancements in quantum computation. We\nalso introduce a projected version of this update rule and prove that if a\nfixed point exists in the binary variable space, the modified updates will\nconverge to it. 
Last but not least, our algorithm is implemented layer-wise,\nmaking it suitable to train larger networks on resource-limited quantum\nhardware. Through extensive evaluations, we show that QP-SBGD outperforms or is\non par with competitive and well-established baselines such as BinaryConnect,\nsignSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as\nwell as binary graph neural networks.\n","authors":["Maximilian Krahn","Michelle Sasdelli","Fengyi Yang","Vladislav Golyanik","Juho Kannala","Tat-Jun Chin","Tolga Birdal"],"pdf_url":"https://arxiv.org/pdf/2310.15128v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15127v1","updated":"2023-10-23T17:31:55Z","published":"2023-10-23T17:31:55Z","title":"Open-Ended Instructable Embodied Agents with Memory-Augmented Large\n Language Models","summary":" Pre-trained and frozen LLMs can effectively map simple scene re-arrangement\ninstructions to programs over a robot's visuomotor functions through\nappropriate few-shot example prompting. To parse open-domain natural language\nand adapt to a user's idiosyncratic procedures, not known during prompt\nengineering time, fixed prompts fall short. In this paper, we introduce HELPER,\nan embodied agent equipped with an external memory of language-program pairs\nthat parses free-form human-robot dialogue into action programs through\nretrieval-augmented LLM prompting: relevant memories are retrieved based on the\ncurrent dialogue, instruction, correction or VLM description, and used as\nin-context prompt examples for LLM querying. The memory is expanded during\ndeployment to include pairs of user's language and action plans, to assist\nfuture inferences and personalize them to the user's language and routines.\nHELPER sets a new state-of-the-art in the TEACh benchmark in both Execution\nfrom Dialog History (EDH) and Trajectory from Dialogue (TfD), with 1.7x\nimprovement over the previous SOTA for TfD. 
Our models, code and video results\ncan be found in our project's website: https://helper-agent-llm.github.io.\n","authors":["Gabriel Sarch","Yue Wu","Michael J. Tarr","Katerina Fragkiadaki"],"pdf_url":"https://arxiv.org/pdf/2310.15127v1.pdf","comment":"https://helper-agent-llm.github.io"},{"id":"http://arxiv.org/abs/2310.15124v1","updated":"2023-10-23T17:29:53Z","published":"2023-10-23T17:29:53Z","title":"Mixed-Variable Global Sensitivity Analysis For Knowledge Discovery And\n Efficient Combinatorial Materials Design","summary":" Global Sensitivity Analysis (GSA) is the study of the influence of any given\ninputs on the outputs of a model. In the context of engineering design, GSA has\nbeen widely used to understand both individual and collective contributions of\ndesign variables on the design objectives. So far, global sensitivity studies\nhave often been limited to design spaces with only quantitative (numerical)\ndesign variables. However, many engineering systems also contain, if not only,\nqualitative (categorical) design variables in addition to quantitative design\nvariables. In this paper, we integrate Latent Variable Gaussian Process (LVGP)\nwith Sobol' analysis to develop the first metamodel-based mixed-variable GSA\nmethod. Through numerical case studies, we validate and demonstrate the\neffectiveness of our proposed method for mixed-variable problems. Furthermore,\nwhile the proposed GSA method is general enough to benefit various engineering\ndesign applications, we integrate it with multi-objective Bayesian optimization\n(BO) to create a sensitivity-aware design framework in accelerating the Pareto\nfront design exploration for metal-organic framework (MOF) materials with\nmany-level combinatorial design spaces. 
Although MOFs are constructed only from\nqualitative variables that are notoriously difficult to design, our method can\nutilize sensitivity analysis to navigate the optimization in the many-level\nlarge combinatorial design space, greatly expediting the exploration of novel\nMOF candidates.\n","authors":["Yigitcan Comlek","Liwei Wang","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2310.15124v1.pdf","comment":"35 Pages, 10 Figures, 2 Tables"},{"id":"http://arxiv.org/abs/2310.15123v1","updated":"2023-10-23T17:29:48Z","published":"2023-10-23T17:29:48Z","title":"Branch-Solve-Merge Improves Large Language Model Evaluation and\n Generation","summary":" Large Language Models (LLMs) are frequently used for multi-faceted language\ngeneration and evaluation tasks that involve satisfying intricate user\nconstraints or taking into account multiple aspects and criteria. However,\ntheir performance can fall short, due to the model's lack of coherence and\ninability to plan and decompose the problem. We propose Branch-Solve-Merge\n(BSM), a Large Language Model program (Schlag et al., 2023) for tackling such\nchallenging natural language tasks. It consists of branch, solve, and merge\nmodules that are parameterized with specific prompts to the base LLM. These\nthree modules plan a decomposition of the task into multiple parallel\nsub-tasks, independently solve them, and fuse the solutions to the sub-tasks.\nWe apply our method to the tasks of LLM response evaluation and constrained\ntext generation and evaluate its effectiveness with multiple LLMs, including\nVicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and\nconsistency for each LLM by enhancing human-LLM agreement by up to 26%,\nreducing length and pairwise position biases by up to 50%, and allowing\nLLaMA-2-chat to match or outperform GPT-4 on most domains. 
On the constraint\nstory generation task, BSM improves the coherence of the stories while also\nimproving constraint satisfaction by 12%.\n","authors":["Swarnadeep Saha","Omer Levy","Asli Celikyilmaz","Mohit Bansal","Jason Weston","Xian Li"],"pdf_url":"https://arxiv.org/pdf/2310.15123v1.pdf","comment":"22 pages, 7 figures, 10 tables"},{"id":"http://arxiv.org/abs/2310.15111v1","updated":"2023-10-23T17:20:01Z","published":"2023-10-23T17:20:01Z","title":"Matryoshka Diffusion Models","summary":" Diffusion models are the de facto approach for generating high-quality images\nand videos, but learning high-dimensional models remains a formidable task due\nto computational and optimization challenges. Existing methods often resort to\ntraining cascaded models in pixel space or using a downsampled latent space of\na separately trained auto-encoder. In this paper, we introduce Matryoshka\nDiffusion Models(MDM), an end-to-end framework for high-resolution image and\nvideo synthesis. We propose a diffusion process that denoises inputs at\nmultiple resolutions jointly and uses a NestedUNet architecture where features\nand parameters for small-scale inputs are nested within those of large scales.\nIn addition, MDM enables a progressive training schedule from lower to higher\nresolutions, which leads to significant improvements in optimization for\nhigh-resolution generation. We demonstrate the effectiveness of our approach on\nvarious benchmarks, including class-conditioned image generation,\nhigh-resolution text-to-image, and text-to-video applications. 
Remarkably, we\ncan train a single pixel-space model at resolutions of up to 1024x1024 pixels,\ndemonstrating strong zero-shot generalization using the CC12M dataset, which\ncontains only 12 million images.\n","authors":["Jiatao Gu","Shuangfei Zhai","Yizhe Zhang","Josh Susskind","Navdeep Jaitly"],"pdf_url":"https://arxiv.org/pdf/2310.15111v1.pdf","comment":"28 pages, 18 figures"},{"id":"http://arxiv.org/abs/2302.09738v8","updated":"2023-10-23T17:16:10Z","published":"2023-02-20T03:31:11Z","title":"Simplifying Momentum-based Positive-definite Submanifold Optimization\n with Applications to Deep Learning","summary":" Riemannian submanifold optimization with momentum is computationally\nchallenging because, to ensure that the iterates remain on the submanifold, we\noften need to solve difficult differential equations. Here, we simplify such\ndifficulties for a class of sparse or structured symmetric positive-definite\nmatrices with the affine-invariant metric. We do so by proposing a generalized\nversion of the Riemannian normal coordinates that dynamically orthonormalizes\nthe metric and locally converts the problem into an unconstrained problem in\nthe Euclidean space. We use our approach to simplify existing approaches for\nstructured covariances and develop matrix-inverse-free $2^\\text{nd}$-order\noptimizers for deep learning with low precision by using only matrix\nmultiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL\n","authors":["Wu Lin","Valentin Duruisseaux","Melvin Leok","Frank Nielsen","Mohammad Emtiyaz Khan","Mark Schmidt"],"pdf_url":"https://arxiv.org/pdf/2302.09738v8.pdf","comment":"An updated version of the ICML 2023 paper. 
Updated the main text to\n emphasize challenges of using existing Riemannian methods to estimate sparse\n and structured SPD matrices"},{"id":"http://arxiv.org/abs/2310.15108v1","updated":"2023-10-23T17:15:11Z","published":"2023-10-23T17:15:11Z","title":"Evaluating machine learning models in non-standard settings: An overview\n and new findings","summary":" Estimating the generalization error (GE) of machine learning models is\nfundamental, with resampling methods being the most common approach. However,\nin non-standard settings, particularly those where observations are not\nindependently and identically distributed, resampling using simple random data\ndivisions may lead to biased GE estimates. This paper strives to present\nwell-grounded guidelines for GE estimation in various such non-standard\nsettings: clustered data, spatial data, unequal sampling probabilities, concept\ndrift, and hierarchically structured outcomes. Our overview combines\nwell-established methodologies with other existing methods that, to our\nknowledge, have not been frequently considered in these particular settings. A\nunifying principle among these techniques is that the test data used in each\niteration of the resampling procedure should reflect the new observations to\nwhich the model will be applied, while the training data should be\nrepresentative of the entire data set used to obtain the final model. Beyond\nproviding an overview, we address literature gaps by conducting simulation\nstudies. These studies assess the necessity of using GE-estimation methods\ntailored to the respective setting. 
Our findings corroborate the concern that\nstandard resampling methods often yield biased GE estimates in non-standard\nsettings, underscoring the importance of tailored GE estimation.\n","authors":["Roman Hornung","Malte Nalenz","Lennart Schneider","Andreas Bender","Ludwig Bothmann","Bernd Bischl","Thomas Augustin","Anne-Laure Boulesteix"],"pdf_url":"https://arxiv.org/pdf/2310.15108v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.01112v2","updated":"2023-10-23T17:14:16Z","published":"2023-06-01T19:54:39Z","title":"Improving day-ahead Solar Irradiance Time Series Forecasting by\n Leveraging Spatio-Temporal Context","summary":" Solar power harbors immense potential in mitigating climate change by\nsubstantially reducing CO$_{2}$ emissions. Nonetheless, the inherent\nvariability of solar irradiance poses a significant challenge for seamlessly\nintegrating solar power into the electrical grid. While the majority of prior\nresearch has centered on employing purely time series-based methodologies for\nsolar forecasting, only a limited number of studies have taken into account\nfactors such as cloud cover or the surrounding physical context. In this paper,\nwe put forth a deep learning architecture designed to harness spatio-temporal\ncontext using satellite data, to attain highly accurate \\textit{day-ahead}\ntime-series forecasting for any given station, with a particular emphasis on\nforecasting Global Horizontal Irradiance (GHI). We also suggest a methodology\nto extract a distribution for each time step prediction, which can serve as a\nvery valuable measure of uncertainty attached to the forecast. When evaluating\nmodels, we propose a testing scheme in which we separate particularly difficult\nexamples from easy ones, in order to capture the model performances in crucial\nsituations, which in the case of this study are the days suffering from varying\ncloudy conditions. 
Furthermore, we present a new multi-modal dataset gathering\nsatellite imagery over a large zone and time series for solar irradiance and\nother related physical variables from multiple geographically diverse solar\nstations. Our approach exhibits robust performance in solar irradiance\nforecasting, including zero-shot generalization tests at unobserved solar\nstations, and holds great promise in promoting the effective integration of\nsolar power into the grid.\n","authors":["Oussama Boussif","Ghait Boukachab","Dan Assouline","Stefano Massaroli","Tianle Yuan","Loubna Benabbou","Yoshua Bengio"],"pdf_url":"https://arxiv.org/pdf/2306.01112v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13683v2","updated":"2023-10-23T17:06:07Z","published":"2023-10-20T17:44:25Z","title":"CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP\n Performance on Low-Resource Languages","summary":" This work introduces CAPIVARA, a cost-efficient framework designed to enhance\nthe performance of multilingual CLIP models in low-resource languages. While\nCLIP has excelled in zero-shot vision-language tasks, the resource-intensive\nnature of model training remains challenging. Many datasets lack linguistic\ndiversity, featuring solely English descriptions for images. CAPIVARA addresses\nthis by augmenting text data using image captioning and machine translation to\ngenerate multiple synthetic captions in low-resource languages. We optimize the\ntraining pipeline with LiT, LoRA, and gradient checkpointing to alleviate the\ncomputational cost. Through extensive experiments, CAPIVARA emerges as state of\nthe art in zero-shot tasks involving images and Portuguese texts. We show the\npotential for significant improvements in other low-resource languages,\nachieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a\nsingle GPU for 2 hours. 
Our model and code are available at\nhttps://github.com/hiaac-nlp/CAPIVARA.\n","authors":["Gabriel Oliveira dos Santos","Diego A. B. Moreira","Alef Iury Ferreira","Jhessica Silva","Luiz Pereira","Pedro Bueno","Thiago Sousa","Helena Maia","Nádia Da Silva","Esther Colombini","Helio Pedrini","Sandra Avila"],"pdf_url":"https://arxiv.org/pdf/2310.13683v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15099v1","updated":"2023-10-23T17:05:53Z","published":"2023-10-23T17:05:53Z","title":"Dual-path convolutional neural network using micro-FTIR imaging to\n predict breast cancer subtypes and biomarkers levels: estrogen receptor,\n progesterone receptor, HER2 and Ki67","summary":" Breast cancer molecular subtypes classification plays an important role to sort\npatients with divergent prognosis. The biomarkers used are Estrogen Receptor\n(ER), Progesterone Receptor (PR), HER2, and Ki67. Based on these biomarkers\nexpression levels, subtypes are classified as Luminal A (LA), Luminal B (LB),\nHER2 subtype, and Triple-Negative Breast Cancer (TNBC). Immunohistochemistry is\nused to classify subtypes, although interlaboratory and interobserver\nvariations can affect its accuracy, besides being a time-consuming technique.\nThe Fourier transform infrared micro-spectroscopy may be coupled with deep\nlearning for cancer evaluation, where there is still a lack of studies for\nsubtypes and biomarker levels prediction. This study presents a novel 2D deep\nlearning approach to achieve these predictions. Sixty micro-FTIR images of\n320x320 pixels were collected from a human breast biopsies microarray. Data\nwere clustered by K-means, preprocessed and 32x32 patches were generated using\na fully automated approach. CaReNet-V2, a novel convolutional neural network,\nwas developed to classify breast cancer (CA) vs adjacent tissue (AT) and\nmolecular subtypes, and to predict biomarker levels. The clustering method\nenabled the removal of non-tissue pixels. 
Test accuracies for CA vs AT and subtype\nwere above 0.84. The model enabled the prediction of ER, PR, and HER2 levels,\nwhere borderline values showed lower performance (minimum accuracy of 0.54).\nKi67 percentage regression demonstrated a mean error of 3.6%. Thus, CaReNet-V2\nis a potential technique for breast cancer biopsies evaluation, standing out as\na screening analysis technique and helping to prioritize patients.\n","authors":["Matheus del-Valle","Emerson Soares Bernardes","Denise Maria Zezell"],"pdf_url":"https://arxiv.org/pdf/2310.15099v1.pdf","comment":"32 pages, 3 figures, 6 tables"},{"id":"http://arxiv.org/abs/2302.07384v3","updated":"2023-10-23T17:04:14Z","published":"2023-02-14T22:48:24Z","title":"The Geometry of Neural Nets' Parameter Spaces Under Reparametrization","summary":" Model reparametrization, which follows the change-of-variable rule of\ncalculus, is a popular way to improve the training of neural nets. But it can\nalso be problematic since it can induce inconsistencies in, e.g., Hessian-based\nflatness measures, optimization trajectories, and modes of probability\ndensities. This complicates downstream analyses: e.g. one cannot definitively\nrelate flatness with generalization since arbitrary reparametrization changes\ntheir relationship. In this work, we study the invariance of neural nets under\nreparametrization from the perspective of Riemannian geometry. From this point\nof view, invariance is an inherent property of any neural net if one explicitly\nrepresents the metric and uses the correct associated transformation rules.\nThis is important since although the metric is always present, it is often\nimplicitly assumed as identity, and thus dropped from the notation, then lost\nunder reparametrization. We discuss implications for measuring the flatness of\nminima, optimization, and for probability-density maximization. 
Finally, we\nexplore some interesting directions where invariance is useful.\n","authors":["Agustinus Kristiadi","Felix Dangel","Philipp Hennig"],"pdf_url":"https://arxiv.org/pdf/2302.07384v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15097v1","updated":"2023-10-23T17:00:20Z","published":"2023-10-23T17:00:20Z","title":"A Canonical Data Transformation for Achieving Inter- and Within-group\n Fairness","summary":" Increases in the deployment of machine learning algorithms for applications\nthat deal with sensitive data have brought attention to the issue of fairness\nin machine learning. Many works have been devoted to applications that require\ndifferent demographic groups to be treated fairly. However, algorithms that aim\nto satisfy inter-group fairness (also called group fairness) may inadvertently\ntreat individuals within the same demographic group unfairly. To address this\nissue, we introduce a formal definition of within-group fairness that maintains\nfairness among individuals from within the same group. We propose a\npre-processing framework to meet both inter- and within-group fairness criteria\nwith little compromise in accuracy. The framework maps the feature vectors of\nmembers from different groups to an inter-group-fair canonical domain before\nfeeding them into a scoring function. The mapping is constructed to preserve\nthe relative relationship between the scores obtained from the unprocessed\nfeature vectors of individuals from the same demographic group, guaranteeing\nwithin-group fairness. 
We apply this framework to the COMPAS risk assessment\nand Law School datasets and compare its performance in achieving inter-group\nand within-group fairness to two regularization-based methods.\n","authors":["Zachary McBride Lazri","Ivan Brugere","Xin Tian","Dana Dachman-Soled","Antigoni Polychroniadou","Danial Dervovic","Min Wu"],"pdf_url":"https://arxiv.org/pdf/2310.15097v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15094v1","updated":"2023-10-23T16:58:34Z","published":"2023-10-23T16:58:34Z","title":"One-dimensional convolutional neural network model for breast cancer\n subtypes classification and biochemical content evaluation using micro-FTIR\n hyperspectral images","summary":" Breast cancer treatment still remains a challenge, where molecular subtypes\nclassification plays a crucial role in selecting appropriate and specific\ntherapy. The four subtypes are Luminal A (LA), Luminal B (LB), HER2 subtype,\nand Triple-Negative Breast Cancer (TNBC). Immunohistochemistry is the\ngold-standard evaluation, although interobserver variations are reported and\nmolecular signatures identification is time-consuming. Fourier transform\ninfrared micro-spectroscopy with machine learning approaches have been used to\nevaluate cancer samples, presenting biochemical-related explainability.\nHowever, this explainability is harder when using deep learning. This study\ncreated a 1D deep learning tool for breast cancer subtype evaluation and\nbiochemical contribution. Sixty hyperspectral images were acquired from a human\nbreast cancer microarray. K-Means clustering was applied to select tissue and\nparaffin spectra. CaReNet-V1, a novel 1D convolutional neural network, was\ndeveloped to classify breast cancer (CA) and adjacent tissue (AT), and\nmolecular subtypes. A 1D adaptation of Grad-CAM was applied to assess the\nbiochemical impact to the classifications. 
CaReNet-V1 effectively classified CA\nand AT (test accuracy of 0.89), as well as HER2 and TNBC subtypes (0.83 and\n0.86), with greater difficulty for LA and LB (0.74 and 0.68). The model enabled\nthe evaluation of the most contributing wavenumbers to the predictions,\nproviding a direct relationship with the biochemical content. Therefore,\nCaReNet-V1 combined with hyperspectral images is a potential approach for breast cancer\nbiopsies assessment, providing additional information to the pathology report.\nThe biochemical content impact feature may be used for other studies, such as\ntreatment efficacy evaluation and development of new diagnostic and therapeutic\nmethods.\n","authors":["Matheus del-Valle","Emerson Soares Bernardes","Denise Maria Zezell"],"pdf_url":"https://arxiv.org/pdf/2310.15094v1.pdf","comment":"23 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2310.15085v1","updated":"2023-10-23T16:46:28Z","published":"2023-10-23T16:46:28Z","title":"On the Detection of Image-Scaling Attacks in Machine Learning","summary":" Image scaling is an integral part of machine learning and computer vision\nsystems. Unfortunately, this preprocessing step is vulnerable to so-called\nimage-scaling attacks where an attacker makes unnoticeable changes to an image\nso that it becomes a new image after scaling. This opens up new ways for\nattackers to control the prediction or to improve poisoning and backdoor\nattacks. While effective techniques exist to prevent scaling attacks, their\ndetection has not been rigorously studied yet. Consequently, it is currently\nnot possible to reliably spot these attacks in practice.\n This paper presents the first in-depth systematization and analysis of\ndetection methods for image-scaling attacks. We identify two general detection\nparadigms and derive novel methods from them that are simple in design yet\nsignificantly outperform previous work. 
We demonstrate the efficacy of these\nmethods in a comprehensive evaluation with all major learning platforms and\nscaling algorithms. First, we show that image-scaling attacks modifying the\nentire scaled image can be reliably detected even under an adaptive adversary.\nSecond, we find that our methods provide strong detection performance even if\nonly minor parts of the image are manipulated. As a result, we can introduce a\nnovel protection layer against image-scaling attacks.\n","authors":["Erwin Quiring","Andreas Müller","Konrad Rieck"],"pdf_url":"https://arxiv.org/pdf/2310.15085v1.pdf","comment":"Accepted at ACSAC'23"},{"id":"http://arxiv.org/abs/2310.15084v1","updated":"2023-10-23T16:45:29Z","published":"2023-10-23T16:45:29Z","title":"Quantum Federated Learning With Quantum Networks","summary":" A major concern of deep learning models is the large amount of data that is\nrequired to build and train them, much of which is reliant on sensitive and\npersonally identifiable information that is vulnerable to access by third\nparties. Ideas of using the quantum internet to address this issue have been\npreviously proposed, which would enable fast and completely secure online\ncommunications. Previous work has yielded a hybrid quantum-classical transfer\nlearning scheme for classical data and communication with a hub-spoke topology.\nWhile quantum communication is secure from eavesdrop attacks and no\nmeasurements from quantum to classical translation, due to no cloning theorem,\nhub-spoke topology is not ideal for quantum communication without quantum\nmemory. Here we seek to improve this model by implementing a decentralized ring\ntopology for the federated learning scheme, where each client is given a\nportion of the entire dataset and only performs training on that set. 
We also\ndemonstrate the first successful use of quantum weights for quantum federated\nlearning, which allows us to perform our training entirely in quantum.\n","authors":["Tyler Wang","Huan-Hsin Tseng","Shinjae Yoo"],"pdf_url":"https://arxiv.org/pdf/2310.15084v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15080v1","updated":"2023-10-23T16:37:59Z","published":"2023-10-23T16:37:59Z","title":"Federated Learning of Large Language Models with Parameter-Efficient\n Prompt Tuning and Adaptive Optimization","summary":" Federated learning (FL) is a promising paradigm to enable collaborative model\ntraining with decentralized data. However, the training process of Large\nLanguage Models (LLMs) generally incurs the update of significant parameters,\nwhich limits the applicability of FL techniques to tackle the LLMs in real\nscenarios. Prompt tuning can significantly reduce the number of parameters to\nupdate, but it either incurs performance degradation or low training\nefficiency. The straightforward utilization of prompt tuning in the FL often\nraises non-trivial communication costs and dramatically degrades performance.\nIn addition, the decentralized data is generally non-Independent and\nIdentically Distributed (non-IID), which brings client drift problems and thus\npoor performance. This paper proposes a Parameter-efficient prompt Tuning\napproach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and\neffective FL of LLMs. First, an efficient partial prompt tuning approach is\nproposed to improve performance and efficiency simultaneously. Second, a novel\nadaptive optimization method is developed to address the client drift problems\non both the device and server sides to enhance performance further. Extensive\nexperiments based on 10 datasets demonstrate the superb performance (up to\n60.8\\% in terms of accuracy) and efficiency (up to 97.59\\% in terms of training\ntime) of FedPepTAO compared with 9 baseline approaches. 
Our code is available\nat https://github.com/llm-eff/FedPepTAO.\n","authors":["Tianshi Che","Ji Liu","Yang Zhou","Jiaxiang Ren","Jiwen Zhou","Victor S. Sheng","Huaiyu Dai","Dejing Dou"],"pdf_url":"https://arxiv.org/pdf/2310.15080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15074v1","updated":"2023-10-23T16:32:18Z","published":"2023-10-23T16:32:18Z","title":"MGAS: Multi-Granularity Architecture Search for Effective and Efficient\n Neural Networks","summary":" Differentiable architecture search (DAS) has become the prominent approach in\nthe field of neural architecture search (NAS) due to its time-efficient\nautomation of neural network design. It shifts the traditional paradigm of\ndiscrete architecture sampling and evaluation to differentiable super-net\noptimization and discretization. However, existing DAS methods either only\nconduct coarse-grained operation-level search, or restrictively explore\nfine-grained filter-level and weight-level units using manually-defined\nremaining ratios, which fail to simultaneously achieve small model size and\nsatisfactory model performance. Additionally, they address the high memory\nconsumption of the search process at the expense of search quality. To tackle\nthese issues, we introduce multi-granularity architecture search (MGAS), a\nunified framework which aims to comprehensively and memory-efficiently explore\nthe multi-granularity search space to discover both effective and efficient\nneural networks. Specifically, we learn discretization functions specific to\neach granularity level to adaptively determine the remaining ratios according\nto the evolving architecture. This ensures an optimal balance among units of\ndifferent granularity levels for different target model sizes. Considering the\nmemory demands, we break down the super-net optimization and discretization\ninto multiple sub-net stages. 
By allowing re-pruning and regrowing of units in\nprevious sub-nets during subsequent stages, we compensate for potential bias in\nearlier stages. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet\ndemonstrate that MGAS outperforms other state-of-the-art methods in achieving a\nbetter trade-off between model performance and model size.\n","authors":["Xiaoyun Liu","Divya Saxena","Jiannong Cao","Yuqing Zhao","Penghui Ruan"],"pdf_url":"https://arxiv.org/pdf/2310.15074v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.03116v3","updated":"2023-10-23T16:28:09Z","published":"2022-10-06T17:59:51Z","title":"Content-Based Search for Deep Generative Models","summary":" The growing proliferation of customized and pretrained generative models has\nmade it infeasible for a user to be fully cognizant of every model in\nexistence. To address this need, we introduce the task of content-based model\nsearch: given a query and a large set of generative models, finding the models\nthat best match the query. As each generative model produces a distribution of\nimages, we formulate the search task as an optimization problem to select the\nmodel with the highest probability of generating similar content as the query.\nWe introduce a formulation to approximate this probability given the query from\ndifferent modalities, e.g., image, sketch, and text. Furthermore, we propose a\ncontrastive learning framework for model retrieval, which learns to adapt\nfeatures for various query modalities. 
We demonstrate that our method\noutperforms several baselines on Generative Model Zoo, a new benchmark we\ncreate for the model retrieval task.\n","authors":["Daohan Lu","Sheng-Yu Wang","Nupur Kumari","Rohan Agarwal","David Bau","Jun-Yan Zhu"],"pdf_url":"https://arxiv.org/pdf/2210.03116v3.pdf","comment":"Our project page is hosted at\n https://generative-intelligence-lab.github.io/modelverse/"},{"id":"http://arxiv.org/abs/2305.15287v2","updated":"2023-10-23T16:12:38Z","published":"2023-05-24T16:09:41Z","title":"The Crucial Role of Normalization in Sharpness-Aware Minimization","summary":" Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based\noptimizer (Foret et al., ICLR 2021) that greatly improves the prediction\nperformance of deep neural networks. Consequently, there has been a surge of\ninterest in explaining its empirical success. We focus, in particular, on\nunderstanding the role played by normalization, a key component of the SAM\nupdates. We theoretically and empirically study the effect of normalization in\nSAM for both convex and non-convex functions, revealing two key roles played by\nnormalization: i) it helps in stabilizing the algorithm; and ii) it enables the\nalgorithm to drift along a continuum (manifold) of minima -- a property\nidentified by recent theoretical works that is the key to better performance.\nWe further argue that these two properties of normalization make SAM robust\nagainst the choice of hyper-parameters, supporting the practicality of SAM. 
Our\nconclusions are backed by various experiments.\n","authors":["Yan Dai","Kwangjun Ahn","Suvrit Sra"],"pdf_url":"https://arxiv.org/pdf/2305.15287v2.pdf","comment":"30 pages, Published in 37th Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.12934v2","updated":"2023-10-23T16:12:02Z","published":"2023-10-19T17:31:40Z","title":"Generative Flow Networks as Entropy-Regularized RL","summary":" The recently proposed generative flow networks (GFlowNets) are a method of\ntraining a policy to sample compositional discrete objects with probabilities\nproportional to a given reward via a sequence of actions. GFlowNets exploit the\nsequential nature of the problem, drawing parallels with reinforcement learning\n(RL). Our work extends the connection between RL and GFlowNets to a general\ncase. We demonstrate how the task of learning a generative flow network can be\nefficiently redefined as an entropy-regularized RL problem with a specific\nreward and regularizer structure. Furthermore, we illustrate the practical\nefficiency of this reformulation by applying standard soft RL algorithms to\nGFlowNet training across several probabilistic modeling tasks. Contrary to\npreviously reported results, we show that entropic RL approaches can be\ncompetitive against established GFlowNet training methods. 
This perspective\nopens a direct path for integrating reinforcement learning principles into the\nrealm of generative flow networks.\n","authors":["Daniil Tiapkin","Nikita Morozov","Alexey Naumov","Dmitry Vetrov"],"pdf_url":"https://arxiv.org/pdf/2310.12934v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15059v1","updated":"2023-10-23T16:03:23Z","published":"2023-10-23T16:03:23Z","title":"Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic\n Gaussian Mixture Models","summary":" A long-standing challenge for a robotic manipulation system operating in\nreal-world scenarios is adapting and generalizing its acquired motor skills to\nunseen environments. We tackle this challenge employing hybrid skill models\nthat integrate imitation and reinforcement paradigms, to explore how the\nlearning and adaptation of a skill, along with its core grounding in the scene\nthrough a learned keypoint, can facilitate such generalization. To that end, we\ndevelop Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models (KIS-GMM)\napproach that learns to predict the reference of a dynamical system within the\nscene as a 3D keypoint, leveraging visual observations obtained by the robot's\nphysical interactions during skill learning. Through conducting comprehensive\nevaluations in both simulated and real-world environments, we show that our\nmethod enables a robot to gain a significant zero-shot generalization to novel\nenvironments and to refine skills in the target environments faster than\nlearning from scratch. Importantly, this is achieved without the need for new\nground truth data. Moreover, our method effectively copes with scene\ndisplacements.\n","authors":["Iman Nematollahi","Kirill Yankov","Wolfram Burgard","Tim Welschehold"],"pdf_url":"https://arxiv.org/pdf/2310.15059v1.pdf","comment":"Accepted at the International Symposium on Experimental Robotics\n (ISER) 2023. 
Videos at http://kis-gmm.cs.uni-freiburg.de/"},{"id":"http://arxiv.org/abs/2310.15054v1","updated":"2023-10-23T15:56:39Z","published":"2023-10-23T15:56:39Z","title":"Coordinated Replay Sample Selection for Continual Federated Learning","summary":" Continual Federated Learning (CFL) combines Federated Learning (FL), the\ndecentralized learning of a central model on a number of client devices that\nmay not communicate their data, and Continual Learning (CL), the learning of a\nmodel from a continual stream of data without keeping the entire history. In\nCL, the main challenge is \\textit{forgetting} what was learned from past data.\nWhile replay-based algorithms that keep a small pool of past training data are\neffective to reduce forgetting, only simple replay sample selection strategies\nhave been applied to CFL in prior work, and no previous work has explored\ncoordination among clients for better sample selection. To bridge this gap, we\nadapt a replay sample selection objective based on loss gradient diversity to\nCFL and propose a new relaxation-based selection of samples to optimize the\nobjective. Next, we propose a practical algorithm to coordinate gradient-based\nreplay sample selection across clients without communicating private data. We\nbenchmark our coordinated and uncoordinated replay sample selection algorithms\nagainst random sampling-based baselines with language models trained on a large\nscale de-identified real-world text dataset. 
We show that gradient-based sample\nselection methods both boost performance and reduce forgetting compared to\nrandom sampling methods, with our coordination method showing gains early in\nthe low replay size regime (when the budget for storing past data is small).\n","authors":["Jack Good","Jimit Majmudar","Christophe Dupuy","Jixuan Wang","Charith Peris","Clement Chung","Richard Zemel","Rahul Gupta"],"pdf_url":"https://arxiv.org/pdf/2310.15054v1.pdf","comment":"7 pages, 6 figures, accepted to EMNLP (industry track)"},{"id":"http://arxiv.org/abs/2310.15051v1","updated":"2023-10-23T15:55:15Z","published":"2023-10-23T15:55:15Z","title":"TeleQnA: A Benchmark Dataset to Assess Large Language Models\n Telecommunications Knowledge","summary":" We introduce TeleQnA, the first benchmark dataset designed to evaluate the\nknowledge of Large Language Models (LLMs) in telecommunications. Comprising\n10,000 questions and answers, this dataset draws from diverse sources,\nincluding standards and research articles. This paper outlines the automated\nquestion generation framework responsible for creating this dataset, along with\nhow human input was integrated at various stages to ensure the quality of the\nquestions. Afterwards, using the provided dataset, an evaluation is conducted\nto assess the capabilities of LLMs, including GPT-3.5 and GPT-4. The results\nhighlight that these models struggle with complex standards related questions\nbut exhibit proficiency in addressing general telecom-related inquiries.\nAdditionally, our results showcase how incorporating telecom knowledge context\nsignificantly enhances their performance, thus shedding light on the need for a\nspecialized telecom foundation model. Finally, the dataset is shared with\nactive telecom professionals, whose performance is subsequently benchmarked\nagainst that of the LLMs. 
The findings illustrate that LLMs can rival the\nperformance of active professionals in telecom knowledge, thanks to their\ncapacity to process vast amounts of information, underscoring the potential of\nLLMs within this domain. The dataset has been made publicly accessible on\nGitHub.\n","authors":["Ali Maatouk","Fadhel Ayed","Nicola Piovesan","Antonio De Domenico","Merouane Debbah","Zhi-Quan Luo"],"pdf_url":"https://arxiv.org/pdf/2310.15051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.15284v2","updated":"2023-10-23T15:52:03Z","published":"2022-06-30T13:43:16Z","title":"Quantum Advantage Seeker with Kernels (QuASK): a software framework to\n speed up the research in quantum machine learning","summary":" Exploiting the properties of quantum information to the benefit of machine\nlearning models is perhaps the most active field of research in quantum\ncomputation. This interest has supported the development of a multitude of\nsoftware frameworks (e.g. Qiskit, Pennylane, Braket) to implement, simulate,\nand execute quantum algorithms. Most of them allow us to define quantum\ncircuits, run basic quantum algorithms, and access low-level primitives\ndepending on the hardware such software is supposed to run. For most\nexperiments, these frameworks have to be manually integrated within a larger\nmachine learning software pipeline. The researcher is in charge of knowing\ndifferent software packages, integrating them through the development of long\ncode scripts, analyzing the results, and generating the plots. Long code often\nleads to erroneous applications, due to the average number of bugs growing\nproportional with respect to the program length. Moreover, other researchers\nwill struggle to understand and reproduce the experiment, due to the need to be\nfamiliar with all the different software frameworks involved in the code\nscript. 
We propose QuASK, an open-source quantum machine learning framework\nwritten in Python that aids the researcher in performing their experiments,\nwith particular attention to quantum kernel techniques. QuASK can be used as a\ncommand-line tool to download datasets, pre-process them, quantum machine\nlearning routines, analyze and visualize the results. QuASK implements most\nstate-of-the-art algorithms to analyze the data through quantum kernels, with\nthe possibility to use projected kernels, (gradient-descent) trainable quantum\nkernels, and structure-optimized quantum kernels. Our framework can also be\nused as a library and integrated into pre-existing software, maximizing code\nreuse.\n","authors":["Francesco Di Marcantonio","Massimiliano Incudini","Davide Tezza","Michele Grossi"],"pdf_url":"https://arxiv.org/pdf/2206.15284v2.pdf","comment":"Close to the published version"},{"id":"http://arxiv.org/abs/2310.15047v1","updated":"2023-10-23T15:50:08Z","published":"2023-10-23T15:50:08Z","title":"Meta- (out-of-context) learning in neural networks","summary":" Brown et al. (2020) famously introduced the phenomenon of in-context learning\nin large language models (LLMs). We establish the existence of a phenomenon we\ncall $\\textbf{meta-out-of-context learning (meta-OCL)}$ via carefully designed\nsynthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs\nto more readily \"internalize\" the semantic content of text that is, or appears\nto be, broadly useful (such as true statements, or text from authoritative\nsources) and use it in appropriate circumstances. We further demonstrate\nmeta-OCL in a synthetic computer vision setting, and propose two hypotheses for\nthe emergence of meta-OCL: one relying on the way models store knowledge in\ntheir parameters, and another suggesting that the implicit gradient alignment\nbias of gradient-descent-based optimizers may be responsible. 
Finally, we\nreflect on what our results might imply about capabilities of future AI\nsystems, and discuss potential risks. Our code can be found at\nhttps://github.com/krasheninnikov/internalization .\n","authors":["Dmitrii Krasheninnikov","Egor Krasheninnikov","Bruno Mlodozeniec","David Krueger"],"pdf_url":"https://arxiv.org/pdf/2310.15047v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06599v6","updated":"2023-10-23T15:40:43Z","published":"2023-06-11T06:27:06Z","title":"Variational Imbalanced Regression: Fair Uncertainty Quantification via\n Probabilistic Smoothing","summary":" Existing regression models tend to fall short in both accuracy and\nuncertainty estimation when the label distribution is imbalanced. In this\npaper, we propose a probabilistic deep learning model, dubbed variational\nimbalanced regression (VIR), which not only performs well in imbalanced\nregression but naturally produces reasonable uncertainty estimation as a\nbyproduct. Different from typical variational autoencoders assuming I.I.D.\nrepresentations (a data point's representation is not directly affected by\nother data points), our VIR borrows data with similar regression labels to\ncompute the latent representation's variational distribution; furthermore,\ndifferent from deterministic regression models producing point estimates, VIR\npredicts the entire normal-inverse-gamma distributions and modulates the\nassociated conjugate distributions to impose probabilistic reweighting on the\nimbalanced data, thereby providing better uncertainty estimation. Experiments\nin several real-world datasets show that our VIR can outperform\nstate-of-the-art imbalanced regression models in terms of both accuracy and\nuncertainty estimation. 
Code will soon be available at\nhttps://github.com/Wang-ML-Lab/variational-imbalanced-regression.\n","authors":["Ziyan Wang","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2306.06599v6.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.05703v2","updated":"2023-10-23T15:26:40Z","published":"2023-10-09T13:24:44Z","title":"An Attribution Method for Siamese Encoders","summary":" Despite the success of Siamese encoder models such as sentence transformers\n(ST), little is known about the aspects of inputs they pay attention to. A\nbarrier is that their predictions cannot be attributed to individual features,\nas they compare two inputs rather than processing a single one. This paper\nderives a local attribution method for Siamese encoders by generalizing the\nprinciple of integrated gradients to models with multiple inputs. The solution\ntakes the form of feature-pair attributions, and can be reduced to a\ntoken-token matrix for STs. Our method involves the introduction of integrated\nJacobians and inherits the advantageous formal properties of integrated\ngradients: it accounts for the model's full computation graph and is guaranteed\nto converge to the actual prediction. A pilot study shows that in an ST few\ntoken-pairs can often explain large fractions of predictions, and it focuses on\nnouns and verbs. 
For accurate predictions, it however needs to attend to the\nmajority of tokens and parts of speech.\n","authors":["Lucas Möller","Dmitry Nikolaev","Sebastian Padó"],"pdf_url":"https://arxiv.org/pdf/2310.05703v2.pdf","comment":"Accepted to EMNLP'23"},{"id":"http://arxiv.org/abs/2310.13678v2","updated":"2023-10-23T15:25:55Z","published":"2023-10-20T17:31:39Z","title":"Long-Form Speech Translation through Segmentation with Finite-State\n Decoding Constraints on Large Language Models","summary":" One challenge in speech translation is that plenty of spoken content is\nlong-form, but short units are necessary for obtaining high-quality\ntranslations. To address this mismatch, we adapt large language models (LLMs)\nto split long ASR transcripts into segments that can be independently\ntranslated so as to maximize the overall translation quality. We overcome the\ntendency of hallucination in LLMs by incorporating finite-state constraints\nduring decoding; these eliminate invalid outputs without requiring additional\ntraining. We discover that LLMs are adaptable to transcripts containing ASR\nerrors through prompt-tuning or fine-tuning. Relative to a state-of-the-art\nautomatic punctuation baseline, our best LLM improves the average BLEU by 2.9\npoints for English-German, English-Spanish, and English-Arabic TED talk\ntranslation in 9 test sets, just by improving segmentation.\n","authors":["Arya D. McCarthy","Hao Zhang","Shankar Kumar","Felix Stahlberg","Ke Wu"],"pdf_url":"https://arxiv.org/pdf/2310.13678v2.pdf","comment":"accepted to the Findings of EMNLP 2023. 
arXiv admin note: text\n overlap with arXiv:2212.09895"},{"id":"http://arxiv.org/abs/2310.15027v1","updated":"2023-10-23T15:23:42Z","published":"2023-10-23T15:23:42Z","title":"Deep Autoencoder-based Z-Interference Channels with Perfect and\n Imperfect CSI","summary":" A deep autoencoder (DAE)-based structure for endto-end communication over the\ntwo-user Z-interference channel (ZIC) with finite-alphabet inputs is designed\nin this paper. The proposed structure jointly optimizes the two encoder/decoder\npairs and generates interference-aware constellations that dynamically adapt\ntheir shape based on interference intensity to minimize the bit error rate\n(BER). An in-phase/quadrature-phase (I/Q) power allocation layer is introduced\nin the DAE to guarantee an average power constraint and enable the architecture\nto generate constellations with nonuniform shapes. This brings further gain\ncompared to standard uniform constellations such as quadrature amplitude\nmodulation. The proposed structure is then extended to work with imperfect\nchannel state information (CSI). The CSI imperfection due to both the\nestimation and quantization errors are examined. The performance of the DAEZIC\nis compared with two baseline methods, i.e., standard and rotated\nconstellations. The proposed structure significantly enhances the performance\nof the ZIC both for the perfect and imperfect CSI. Simulation results show that\nthe improvement is achieved in all interference regimes (weak, moderate, and\nstrong) and consistently increases with the signal-to-noise ratio (SNR). For\nexample, more than an order of magnitude BER reduction is obtained with respect\nto the most competitive conventional method at weak interference when SNR>15dB\nand two bits per symbol are transmitted. 
The improvements reach about two\norders of magnitude when quantization error exists, indicating that the DAE-ZIC\nis more robust to the interference compared to the conventional methods.\n","authors":["Xinliang Zhang","Mojtaba Vaezi"],"pdf_url":"https://arxiv.org/pdf/2310.15027v1.pdf","comment":"13 pages, 13 figures, 2 tables. Accepted for publication in the IEEE\n Transactions on Communications. arXiv admin note: text overlap with\n arXiv:2303.08312"},{"id":"http://arxiv.org/abs/2310.15026v1","updated":"2023-10-23T15:23:32Z","published":"2023-10-23T15:23:32Z","title":"Fast 2D Bicephalous Convolutional Autoencoder for Compressing 3D Time\n Projection Chamber Data","summary":" High-energy large-scale particle colliders produce data at high speed in the\norder of 1 terabytes per second in nuclear physics and petabytes per second in\nhigh-energy physics. Developing real-time data compression algorithms to reduce\nsuch data at high throughput to fit permanent storage has drawn increasing\nattention. Specifically, at the newly constructed sPHENIX experiment at the\nRelativistic Heavy Ion Collider (RHIC), a time projection chamber is used as\nthe main tracking detector, which records particle trajectories in a volume of\na three-dimensional (3D) cylinder. The resulting data are usually very sparse\nwith occupancy around 10.8%. Such sparsity presents a challenge to conventional\nlearning-free lossy compression algorithms, such as SZ, ZFP, and MGARD. The 3D\nconvolutional neural network (CNN)-based approach, Bicephalous Convolutional\nAutoencoder (BCAE), outperforms traditional methods both in compression rate\nand reconstruction accuracy. BCAE can also utilize the computation power of\ngraphical processing units suitable for deployment in a modern heterogeneous\nhigh-performance computing environment. This work introduces two BCAE variants:\nBCAE++ and BCAE-2D. 
BCAE++ achieves a 15% better compression ratio and a 77%\nbetter reconstruction accuracy measured in mean absolute error compared with\nBCAE. BCAE-2D treats the radial direction as the channel dimension of an image,\nresulting in a 3x speedup in compression throughput. In addition, we\ndemonstrate an unbalanced autoencoder with a larger decoder can improve\nreconstruction accuracy without significantly sacrificing throughput. Lastly,\nwe observe both the BCAE++ and BCAE-2D can benefit more from using\nhalf-precision mode in throughput (76-79% increase) without loss in\nreconstruction accuracy. The source code and links to data and pretrained\nmodels can be found at https://github.com/BNL-DAQ-LDRD/NeuralCompression_v2.\n","authors":["Yi Huang","Yihui Ren","Shinjae Yoo","Jin Huang"],"pdf_url":"https://arxiv.org/pdf/2310.15026v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15017v2","updated":"2023-10-23T15:23:22Z","published":"2023-05-24T10:58:20Z","title":"Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through\n Interaction with Symbolic Systems","summary":" Despite outstanding performance in many tasks, language models are\nnotoriously inclined to make factual errors in tasks requiring arithmetic\ncomputation. We address this deficiency by creating Calc-X, a collection of\ndatasets that demonstrates the appropriate use of a calculator in reasoning\nchains. Calc-X is suitable for teaching language models to offload computations\nto a symbolic system. We survey and unify several existing chain-of-thought\ndatasets into a proposed format, resulting in a standard collection of over\n300,000 samples requiring arithmetic reasoning. Finally, we use the new Calc-X\ncollection to train open-source calculator-using models we call Calcformers and\nshow that these models approximately double the accuracy of generating correct\nresults compared to vanilla language model baselines. 
We make all Calc-X\ndatasets, source code and Calcformers models publicly available.\n","authors":["Marek Kadlčík","Michal Štefánik","Ondřej Sotolář","Vlastimil Martinek"],"pdf_url":"https://arxiv.org/pdf/2305.15017v2.pdf","comment":"Published in EMNLP 2023: Main track"},{"id":"http://arxiv.org/abs/2310.11122v2","updated":"2023-10-23T15:18:46Z","published":"2023-10-17T10:14:10Z","title":"Sensitivity-Aware Amortized Bayesian Inference","summary":" Bayesian inference is a powerful framework for making probabilistic\ninferences and decisions under uncertainty. Fundamental choices in modern\nBayesian workflows concern the specification of the likelihood function and\nprior distributions, the posterior approximator, and the data. Each choice can\nsignificantly influence model-based inference and subsequent decisions, thereby\nnecessitating sensitivity analysis. In this work, we propose a multifaceted\napproach to integrate sensitivity analyses into amortized Bayesian inference\n(ABI, i.e., simulation-based inference with neural networks). First, we utilize\nweight sharing to encode the structural similarities between alternative\nlikelihood and prior specifications in the training process with minimal\ncomputational overhead. Second, we leverage the rapid inference of neural\nnetworks to assess sensitivity to various data perturbations or pre-processing\nprocedures. In contrast to most other Bayesian approaches, both steps\ncircumvent the costly bottleneck of refitting the model(s) for each choice of\nlikelihood, prior, or dataset. Finally, we propose to use neural network\nensembles to evaluate variation in results induced by unreliable approximation\non unseen data. 
We demonstrate the effectiveness of our method in applied\nmodeling problems, ranging from the estimation of disease outbreak dynamics and\nglobal warming thresholds to the comparison of human decision-making models.\nOur experiments showcase how our approach enables practitioners to effectively\nunveil hidden relationships between modeling choices and inferential\nconclusions.\n","authors":["Lasse Elsemüller","Hans Olischläger","Marvin Schmitt","Paul-Christian Bürkner","Ullrich Köthe","Stefan T. Radev"],"pdf_url":"https://arxiv.org/pdf/2310.11122v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15020v1","updated":"2023-10-23T15:15:19Z","published":"2023-10-23T15:15:19Z","title":"Invariance is Key to Generalization: Examining the Role of\n Representation in Sim-to-Real Transfer for Visual Navigation","summary":" The data-driven approach to robot control has been gathering pace rapidly,\nyet generalization to unseen task domains remains a critical challenge. We\nargue that the key to generalization is representations that are (i) rich\nenough to capture all task-relevant information and (ii) invariant to\nsuperfluous variability between the training and the test domains. We\nexperimentally study such a representation -- containing both depth and\nsemantic information -- for visual navigation and show that it enables a\ncontrol policy trained entirely in simulated indoor scenes to generalize to\ndiverse real-world environments, both indoors and outdoors. Further, we show\nthat our representation reduces the A-distance between the training and test\ndomains, improving the generalization error bound as a result. 
Our proposed\napproach is scalable: the learned policy improves continuously, as the\nfoundation models that it exploits absorb more diverse data during\npre-training.\n","authors":["Bo Ai","Zhanxin Wu","David Hsu"],"pdf_url":"https://arxiv.org/pdf/2310.15020v1.pdf","comment":"11 pages, accepted by the 18th International Symposium on\n Experimental Robotics (ISER 2023)"},{"id":"http://arxiv.org/abs/2310.15019v1","updated":"2023-10-23T15:14:55Z","published":"2023-10-23T15:14:55Z","title":"Meta learning with language models: Challenges and opportunities in the\n classification of imbalanced text","summary":" Detecting out of policy speech (OOPS) content is important but difficult.\nWhile machine learning is a powerful tool to tackle this challenging task, it\nis hard to break the performance ceiling due to factors like quantity and\nquality limitations on training data and inconsistencies in OOPS definition and\ndata labeling. To realize the full potential of available limited resources, we\npropose a meta learning technique (MLT) that combines individual models built\nwith different text representations. We analytically show that the resulting\ntechnique is numerically stable and produces reasonable combining weights. We\ncombine the MLT with a threshold-moving (TM) technique to further improve the\nperformance of the combined predictor on highly-imbalanced in-distribution and\nout-of-distribution datasets. 
We also provide computational results to show the\nstatistically significant advantages of the proposed MLT approach.\n All authors contributed equally to this work.\n","authors":["Apostol Vassilev","Honglan Jin","Munawar Hasan"],"pdf_url":"https://arxiv.org/pdf/2310.15019v1.pdf","comment":"22 pages, including 5 figures, 12 tables, 1 appendix"},{"id":"http://arxiv.org/abs/2310.15017v1","updated":"2023-10-23T15:12:20Z","published":"2023-10-23T15:12:20Z","title":"The primacy bias in Model-based RL","summary":" The primacy bias in deep reinforcement learning (DRL), which refers to the\nagent's tendency to overfit early data and lose the ability to learn from new\ndata, can significantly decrease the performance of DRL algorithms. Previous\nstudies have shown that employing simple techniques, such as resetting the\nagent's parameters, can substantially alleviate the primacy bias. However, we\nobserve that resetting the agent's parameters harms its performance in the\ncontext of model-based reinforcement learning (MBRL). In fact, on further\ninvestigation, we find that the primacy bias in MBRL differs from that in\nmodel-free RL. In this work, we focus on investigating the primacy bias in MBRL\nand propose world model resetting, which works in MBRL. We apply our method to\ntwo different MBRL algorithms, MBPO and DreamerV2. We validate the\neffectiveness of our method on multiple continuous control tasks on MuJoCo and\nDeepMind Control Suite, as well as discrete control tasks on Atari 100k\nbenchmark. The results show that world model resetting can significantly\nalleviate the primacy bias in model-based setting and improve algorithm's\nperformance. 
We also give a guide on how to perform world model resetting\neffectively.\n","authors":["Zhongjian Qiao","Jiafei Lyu","Xiu Li"],"pdf_url":"https://arxiv.org/pdf/2310.15017v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15015v1","updated":"2023-10-23T15:10:37Z","published":"2023-10-23T15:10:37Z","title":"Leveraging Deep Learning for Abstractive Code Summarization of\n Unofficial Documentation","summary":" Usually, programming languages have official documentation to guide\ndevelopers with APIs, methods, and classes. However, researchers identified\ninsufficient or inadequate documentation examples and flaws with the API's\ncomplex structure as barriers to learning an API. As a result, developers may\nconsult other sources (StackOverflow, GitHub, etc.) to learn more about an API.\nRecent research studies have shown that unofficial documentation is a valuable\nsource of information for generating code summaries. We, therefore, have been\nmotivated to leverage such a type of documentation along with deep learning\ntechniques towards generating high-quality summaries for APIs discussed in\ninformal documentation.\n This paper proposes an automatic approach using the BART algorithm, a\nstate-of-the-art transformer model, to generate summaries for APIs discussed in\nStackOverflow. We built an oracle of human-generated summaries to evaluate our\napproach against it using ROUGE and BLEU metrics which are the most widely used\nevaluation metrics in text summarization. Furthermore, we evaluated our\nsummaries empirically against a previous work in terms of quality. 
Our findings\ndemonstrate that using deep learning algorithms can improve summaries' quality\nand outperform the previous work by an average of %57 for Precision, %66 for\nRecall, and %61 for F-measure, and it runs 4.4 times faster.\n","authors":["AmirHossein Naghshzan","Latifa Guerrouj","Olga Baysal"],"pdf_url":"https://arxiv.org/pdf/2310.15015v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13098v2","updated":"2023-10-23T15:03:50Z","published":"2023-10-19T18:56:04Z","title":"SRAI: Towards Standardization of Geospatial AI","summary":" Spatial Representations for Artificial Intelligence (srai) is a Python\nlibrary for working with geospatial data. The library can download geospatial\ndata, split a given area into micro-regions using multiple algorithms and train\nan embedding model using various architectures. It includes baseline models as\nwell as more complex methods from published works. Those capabilities make it\npossible to use srai in a complete pipeline for geospatial task solving. The\nproposed library is the first step to standardize the geospatial AI domain\ntoolset. It is fully open-source and published under Apache 2.0 licence.\n","authors":["Piotr Gramacki","Kacper Leśniara","Kamil Raczycki","Szymon Woźniak","Marcin Przymus","Piotr Szymański"],"pdf_url":"https://arxiv.org/pdf/2310.13098v2.pdf","comment":"Accepted for the 6th ACM SIGSPATIAL International Workshop on AI for\n Geographic Knowledge Discovery (GeoAI 2023)"},{"id":"http://arxiv.org/abs/2310.15007v1","updated":"2023-10-23T15:00:46Z","published":"2023-10-23T15:00:46Z","title":"Did the Neurons Read your Book? Document-level Membership Inference for\n Large Language Models","summary":" With large language models (LLMs) poised to become embedded in our daily\nlives, questions are starting to be raised about the dataset(s) they learned\nfrom. 
These questions range from potential bias or misinformation LLMs could\nretain from their training data to questions of copyright and fair use of\nhuman-generated text. However, while these questions emerge, developers of the\nrecent state-of-the-art LLMs become increasingly reluctant to disclose details\non their training corpus. We here introduce the task of document-level\nmembership inference for real-world LLMs, i.e. inferring whether the LLM has\nseen a given document during training or not. First, we propose a procedure for\nthe development and evaluation of document-level membership inference for LLMs\nby leveraging commonly used data sources for training and the model release\ndate. We then propose a practical, black-box method to predict document-level\nmembership and instantiate it on OpenLLaMA-7B with both books and academic\npapers. We show our methodology to perform very well, reaching an impressive\nAUC of 0.856 for books and 0.678 for papers. We then show our approach to\noutperform the sentence-level membership inference attacks used in the privacy\nliterature for the document-level membership task. We finally evaluate whether\nsmaller models might be less sensitive to document-level inference and show\nOpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach.\nTaken together, our results show that accurate document-level membership can be\ninferred for LLMs, increasing the transparency of technology poised to change\nour lives.\n","authors":["Matthieu Meeus","Shubham Jain","Marek Rei","Yves-Alexandre de Montjoye"],"pdf_url":"https://arxiv.org/pdf/2310.15007v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15003v1","updated":"2023-10-23T14:57:26Z","published":"2023-10-23T14:57:26Z","title":"Neural Snowflakes: Universal Latent Graph Inference via Trainable Latent\n Geometries","summary":" The inductive bias of a graph neural network (GNN) is largely encoded in its\nspecified graph. 
Latent graph inference relies on latent geometric\nrepresentations to dynamically rewire or infer a GNN's graph to maximize the\nGNN's predictive downstream performance, but it lacks solid theoretical\nfoundations in terms of embedding-based representation guarantees. This paper\naddresses this issue by introducing a trainable deep learning architecture,\ncoined neural snowflake, that can adaptively implement fractal-like metrics on\n$\\mathbb{R}^d$. We prove that any given finite weights graph can be\nisometrically embedded by a standard MLP encoder. Furthermore, when the latent\ngraph can be represented in the feature space of a sufficiently regular kernel,\nwe show that the combined neural snowflake and MLP encoder do not succumb to\nthe curse of dimensionality by using only a low-degree polynomial number of\nparameters in the number of nodes. This implementation enables a\nlow-dimensional isometric embedding of the latent graph. We conduct synthetic\nexperiments to demonstrate the superior metric learning capabilities of neural\nsnowflakes when compared to more familiar spaces like Euclidean space.\nAdditionally, we carry out latent graph inference experiments on graph\nbenchmarks. Consistently, the neural snowflake model achieves predictive\nperformance that either matches or surpasses that of the state-of-the-art\nlatent graph inference models. Importantly, this performance improvement is\nachieved without requiring random search for optimal latent geometry. 
Instead,\nthe neural snowflake model achieves this enhancement in a differentiable\nmanner.\n","authors":["Haitz Sáez de Ocáriz Borde","Anastasis Kratsios"],"pdf_url":"https://arxiv.org/pdf/2310.15003v1.pdf","comment":"9 Pages + Appendix, 2 Figures, 9 Tables"},{"id":"http://arxiv.org/abs/2303.00564v3","updated":"2023-10-23T14:54:52Z","published":"2023-03-01T15:11:23Z","title":"Learning curves for deep structured Gaussian feature models","summary":" In recent years, significant attention in deep learning theory has been\ndevoted to analyzing when models that interpolate their training data can still\ngeneralize well to unseen examples. Many insights have been gained from\nstudying models with multiple layers of Gaussian random features, for which one\ncan compute precise generalization asymptotics. However, few works have\nconsidered the effect of weight anisotropy; most assume that the random\nfeatures are generated using independent and identically distributed Gaussian\nweights, and allow only for structure in the input data. Here, we use the\nreplica trick from statistical physics to derive learning curves for models\nwith many layers of structured Gaussian features. We show that allowing\ncorrelations between the rows of the first layer of features can aid\ngeneralization, while structure in later layers is generally detrimental. Our\nresults shed light on how weight structure affects generalization in a simple\nclass of solvable models.\n","authors":["Jacob A. Zavatone-Veth","Cengiz Pehlevan"],"pdf_url":"https://arxiv.org/pdf/2303.00564v3.pdf","comment":"14+18 pages, 2+1 figures. 
NeurIPS 2023 Camera Ready"},{"id":"http://arxiv.org/abs/2310.14997v1","updated":"2023-10-23T14:48:51Z","published":"2023-10-23T14:48:51Z","title":"Simple Hardware-Efficient PCFGs with Independent Left and Right\n Productions","summary":" Scaling dense PCFGs to thousands of nonterminals via a low-rank\nparameterization of the rule probability tensor has been shown to be beneficial\nfor unsupervised parsing. However, PCFGs scaled this way still perform poorly\nas a language model, and even underperform similarly-sized HMMs. This work\nintroduces \\emph{SimplePCFG}, a simple PCFG formalism with independent left and\nright productions. Despite imposing a stronger independence assumption than the\nlow-rank approach, we find that this formalism scales more effectively both as\na language model and as an unsupervised parser. As an unsupervised parser, our\nsimple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language\nmodel, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank\nPCFGs. We further introduce \\emph{FlashInside}, a hardware IO-aware\nimplementation of the inside algorithm for efficiently scaling simple PCFGs.\n","authors":["Wei Liu","Songlin Yang","Yoon Kim","Kewei Tu"],"pdf_url":"https://arxiv.org/pdf/2310.14997v1.pdf","comment":"Accepted to Findings of EMNLP, 2023"},{"id":"http://arxiv.org/abs/2310.14993v1","updated":"2023-10-23T14:46:20Z","published":"2023-10-23T14:46:20Z","title":"Understanding the Inner Workings of Language Models Through\n Representation Dissimilarity","summary":" As language models are applied to an increasing number of real-world\napplications, understanding their inner workings has become an important issue\nin model trust, interpretability, and transparency. 
In this work we show that\nrepresentation dissimilarity measures, which are functions that measure the\nextent to which two model's internal representations differ, can be a valuable\ntool for gaining insight into the mechanics of language models. Among our\ninsights are: (i) an apparent asymmetry in the internal representations of\nmodel using SoLU and GeLU activation functions, (ii) evidence that\ndissimilarity measures can identify and locate generalization properties of\nmodels that are invisible via in-distribution test set performance, and (iii)\nnew evaluations of how language model features vary as width and depth are\nincreased. Our results suggest that dissimilarity measures are a promising set\nof tools for shedding light on the inner workings of language models.\n","authors":["Davis Brown","Charles Godfrey","Nicholas Konz","Jonathan Tu","Henry Kvinge"],"pdf_url":"https://arxiv.org/pdf/2310.14993v1.pdf","comment":"EMNLP 2023 (main)"},{"id":"http://arxiv.org/abs/2310.14992v1","updated":"2023-10-23T14:45:51Z","published":"2023-10-23T14:45:51Z","title":"Bayesian Regression Markets","summary":" Machine learning tasks are vulnerable to the quality of data used as input.\nYet, it is often challenging for firms to obtain adequate datasets, with them\nbeing naturally distributed amongst owners, that in practice, may be\ncompetitors in a downstream market and reluctant to share information. Focusing\non supervised learning for regression tasks, we develop a \\textit{regression\nmarket} to provide a monetary incentive for data sharing. Our proposed\nmechanism adopts a Bayesian framework, allowing us to consider a more general\nclass of regression tasks. 
We present a thorough exploration of the market\nproperties, and show that similar proposals in current literature expose the\nmarket agents to sizeable financial risks, which can be mitigated in our\nprobabilistic setting.\n","authors":["Thomas Falconer","Jalal Kazempour","Pierre Pinson"],"pdf_url":"https://arxiv.org/pdf/2310.14992v1.pdf","comment":"46 pages, 11 figures, 2 tables"},{"id":"http://arxiv.org/abs/2310.14982v1","updated":"2023-10-23T14:29:48Z","published":"2023-10-23T14:29:48Z","title":"Delayed Memory Unit: Modelling Temporal Dependency Through Delay Gate","summary":" Recurrent Neural Networks (RNNs) are renowned for their adeptness in modeling\ntemporal dependencies, a trait that has driven their widespread adoption for\nsequential data processing. Nevertheless, vanilla RNNs are confronted with the\nwell-known issue of gradient vanishing and exploding, posing a significant\nchallenge for learning and establishing long-range dependencies. Additionally,\ngated RNNs tend to be over-parameterized, resulting in poor network\ngeneralization. To address these challenges, we propose a novel Delayed Memory\nUnit (DMU) in this paper, wherein a delay line structure, coupled with delay\ngates, is introduced to facilitate temporal interaction and temporal credit\nassignment, so as to enhance the temporal modeling capabilities of vanilla\nRNNs. Particularly, the DMU is designed to directly distribute the input\ninformation to the optimal time instant in the future, rather than aggregating\nand redistributing it over time through intricate network dynamics. 
Our\nproposed DMU demonstrates superior temporal modeling capabilities across a\nbroad range of sequential modeling tasks, utilizing considerably fewer\nparameters than other state-of-the-art gated RNN models in applications such as\nspeech recognition, radar gesture recognition, ECG waveform segmentation, and\npermuted sequential image classification.\n","authors":["Pengfei Sun","Jibin Wu","Malu Zhang","Paul Devos","Dick Botteldooren"],"pdf_url":"https://arxiv.org/pdf/2310.14982v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.02247v2","updated":"2023-10-23T14:28:11Z","published":"2023-05-03T16:32:30Z","title":"Select without Fear: Almost All Mini-Batch Schedules Generalize\n Optimally","summary":" We establish matching upper and lower generalization error bounds for\nmini-batch Gradient Descent (GD) training with either deterministic or\nstochastic, data-independent, but otherwise arbitrary batch selection rules. We\nconsider smooth Lipschitz-convex/nonconvex/strongly-convex loss functions, and\nshow that classical upper bounds for Stochastic GD (SGD) also hold verbatim for\nsuch arbitrary nonadaptive batch schedules, including all deterministic ones.\nFurther, for convex and strongly-convex losses we prove matching lower bounds\ndirectly on the generalization error uniform over the aforementioned class of\nbatch schedules, showing that all such batch schedules generalize optimally.\nLastly, for smooth (non-Lipschitz) nonconvex losses, we show that full-batch\n(deterministic) GD is essentially optimal, among all possible batch schedules\nwithin the considered class, including all stochastic ones.\n","authors":["Konstantinos E. 
Nikolakakis","Amin Karbasi","Dionysis Kalogerias"],"pdf_url":"https://arxiv.org/pdf/2305.02247v2.pdf","comment":"37 pages, 2 tables"},{"id":"http://arxiv.org/abs/2310.14979v1","updated":"2023-10-23T14:26:43Z","published":"2023-10-23T14:26:43Z","title":"ACTOR: Active Learning with Annotator-specific Classification Heads to\n Embrace Human Label Variation","summary":" Label aggregation such as majority voting is commonly used to resolve\nannotator disagreement in dataset creation. However, this may disregard\nminority values and opinions. Recent studies indicate that learning from\nindividual annotations outperforms learning from aggregated labels, though they\nrequire a considerable amount of annotation. Active learning, as an annotation\ncost-saving strategy, has not been fully explored in the context of learning\nfrom disagreement. We show that in the active learning setting, a multi-head\nmodel performs significantly better than a single-head model in terms of\nuncertainty estimation. By designing and evaluating acquisition functions with\nannotator-specific heads on two datasets, we show that group-level entropy\nworks generally well on both datasets. Importantly, it achieves performance in\nterms of both prediction and uncertainty estimation comparable to full-scale\ntraining from disagreement, while saving up to 70% of the annotation budget.\n","authors":["Xinpeng Wang","Barbara Plank"],"pdf_url":"https://arxiv.org/pdf/2310.14979v1.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2310.14976v1","updated":"2023-10-23T14:25:55Z","published":"2023-10-23T14:25:55Z","title":"Reinforcement learning in large, structured action spaces: A simulation\n study of decision support for spinal cord injury rehabilitation","summary":" Reinforcement learning (RL) has helped improve decision-making in several\napplications. 
However, applying traditional RL is challenging in some\napplications, such as rehabilitation of people with a spinal cord injury (SCI).\nAmong other factors, using RL in this domain is difficult because there are\nmany possible treatments (i.e., large action space) and few patients (i.e.,\nlimited training data). Treatments for SCIs have natural groupings, so we\npropose two approaches to grouping treatments so that an RL agent can learn\neffectively from limited data. One relies on domain knowledge of SCI\nrehabilitation and the other learns similarities among treatments using an\nembedding technique. We then use Fitted Q Iteration to train an agent that\nlearns optimal treatments. Through a simulation study designed to reflect the\nproperties of SCI rehabilitation, we find that both methods can help improve\nthe treatment decisions of physiotherapists, but the approach based on domain\nknowledge offers better performance. Our findings provide a \"proof of concept\"\nthat RL can be used to help improve the treatment of those with an SCI and\nindicates that continued efforts to gather data and apply RL to this domain are\nworthwhile.\n","authors":["Nathan Phelps","Stephanie Marrocco","Stephanie Cornell","Dalton L. Wolfe","Daniel J. Lizotte"],"pdf_url":"https://arxiv.org/pdf/2310.14976v1.pdf","comment":"31 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.05674v2","updated":"2023-10-23T14:16:49Z","published":"2023-10-09T12:45:13Z","title":"Making Scalable Meta Learning Practical","summary":" Despite its flexibility to learn diverse inductive biases in machine learning\nprograms, meta learning (i.e., learning to learn) has long been recognized to\nsuffer from poor scalability due to its tremendous compute/memory costs,\ntraining instability, and a lack of efficient distributed training support. In\nthis work, we focus on making scalable meta learning practical by introducing\nSAMA, which combines advances in both implicit differentiation algorithms and\nsystems. 
Specifically, SAMA is designed to flexibly support a broad range of\nadaptive optimizers in the base level of meta learning programs, while reducing\ncomputational burden by avoiding explicit computation of second-order gradient\ninformation, and exploiting efficient distributed training techniques\nimplemented for first-order gradients. Evaluated on multiple large-scale meta\nlearning benchmarks, SAMA showcases up to 1.7/4.8x increase in throughput and\n2.0/3.8x decrease in memory consumption respectively on single-/multi-GPU\nsetups compared to other baseline meta learning algorithms. Furthermore, we\nshow that SAMA-based data optimization leads to consistent improvements in text\nclassification accuracy with BERT and RoBERTa large language models, and\nachieves state-of-the-art results in both small- and large-scale data pruning\non image classification tasks, demonstrating the practical applicability of\nscalable meta learning across language and vision domains.\n","authors":["Sang Keun Choe","Sanket Vaibhav Mehta","Hwijeen Ahn","Willie Neiswanger","Pengtao Xie","Emma Strubell","Eric Xing"],"pdf_url":"https://arxiv.org/pdf/2310.05674v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14968v1","updated":"2023-10-23T14:13:27Z","published":"2023-10-23T14:13:27Z","title":"The Fundamental Dilemma of Bayesian Active Meta-learning","summary":" Many applications involve estimation of parameters that generalize across\nmultiple diverse, but related, data-scarce task environments. Bayesian active\nmeta-learning, a form of sequential optimal experimental design, provides a\nframework for solving such problems. The active meta-learner's goal is to gain\ntransferable knowledge (estimate the transferable parameters) in the presence\nof idiosyncratic characteristics of the current task (task-specific\nparameters). We show that in such a setting, greedy pursuit of this goal can\nactually hurt estimation of the transferable parameters (induce so-called\nnegative transfer). 
The learner faces a dilemma akin to but distinct from the\nexploration--exploitation dilemma: should they spend their acquisition budget\npursuing transferable knowledge, or identifying the current task-specific\nparameters? We show theoretically that some tasks pose an inevitable and\narbitrarily large threat of negative transfer, and that task identification is\ncritical to reducing this threat. Our results generalize to analysis of prior\nmisspecification over nuisance parameters. Finally, we empirically illustrate\ncircumstances that lead to negative transfer.\n","authors":["Sabina J. Sloman","Ayush Bharti","Samuel Kaski"],"pdf_url":"https://arxiv.org/pdf/2310.14968v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04012v2","updated":"2023-10-23T14:08:09Z","published":"2023-02-08T11:54:07Z","title":"CodeLMSec Benchmark: Systematically Evaluating and Finding Security\n Vulnerabilities in Black-Box Code Language Models","summary":" Large language models (LLMs) for automatic code generation have achieved\nbreakthroughs in several programming tasks. Their advances in competition-level\nprogramming problems have made them an essential pillar of AI-assisted pair\nprogramming, and tools such as GitHub Copilot have emerged as part of the daily\nprogramming workflow used by millions of developers. The training data for\nthese models is usually collected from the Internet (e.g., from open-source\nrepositories) and is likely to contain faults and security vulnerabilities.\nThis unsanitized training data can cause the language models to learn these\nvulnerabilities and propagate them during the code generation procedure. 
While\nthese models have been extensively assessed for their ability to produce\nfunctionally correct programs, there remains a lack of comprehensive\ninvestigations and benchmarks addressing the security aspects of these models.\n In this work, we propose a method to systematically study the security issues\nof code language models to assess their susceptibility to generating vulnerable\ncode. To this end, we introduce the first approach to automatically find\ngenerated code that contains vulnerabilities in black-box code generation\nmodels. To achieve this, we present an approach to approximate inversion of the\nblack-box code generation models based on few-shot prompting. We evaluate the\neffectiveness of our approach by examining code language models in generating\nhigh-risk security weaknesses. Furthermore, we establish a collection of\ndiverse non-secure prompts for various vulnerability scenarios using our\nmethod. This dataset forms a benchmark for evaluating and comparing the\nsecurity weaknesses in code language models.\n","authors":["Hossein Hajipour","Keno Hassler","Thorsten Holz","Lea Schönherr","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2302.04012v2.pdf","comment":"23 pages, 9 figures"},{"id":"http://arxiv.org/abs/2202.13852v2","updated":"2023-10-23T14:07:30Z","published":"2022-02-28T15:08:48Z","title":"Hyperbolic Graph Neural Networks: A Review of Methods and Applications","summary":" Graph neural networks generalize conventional neural networks to\ngraph-structured data and have received widespread attention due to their\nimpressive representation ability. In spite of the remarkable achievements, the\nperformance of Euclidean models in graph-related learning is still bounded and\nlimited by the representation ability of Euclidean geometry, especially for\ndatasets with highly non-Euclidean latent anatomy. 
Recently, hyperbolic space\nhas gained increasing popularity in processing graph data with tree-like\nstructure and power-law distribution, owing to its exponential growth property.\nIn this survey, we comprehensively revisit the technical details of the current\nhyperbolic graph neural networks, unifying them into a general framework and\nsummarizing the variants of each component. More importantly, we present\nvarious HGNN-related applications. Last, we also identify several challenges,\nwhich potentially serve as guidelines for further flourishing the achievements\nof graph learning in hyperbolic spaces.\n","authors":["Menglin Yang","Min Zhou","Zhihao Li","Jiahong Liu","Lujia Pan","Hui Xiong","Irwin King"],"pdf_url":"https://arxiv.org/pdf/2202.13852v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14963v1","updated":"2023-10-23T14:06:46Z","published":"2023-10-23T14:06:46Z","title":"Adam through a Second-Order Lens","summary":" Research into optimisation for deep learning is characterised by a tension\nbetween the computational efficiency of first-order, gradient-based methods\n(such as SGD and Adam) and the theoretical efficiency of second-order,\ncurvature-based methods (such as quasi-Newton methods and K-FAC). We seek to\ncombine the benefits of both approaches into a single computationally-efficient\nalgorithm. Noting that second-order methods often depend on stabilising\nheuristics (such as Levenberg-Marquardt damping), we propose AdamQLR: an\noptimiser combining damping and learning rate selection techniques from K-FAC\n(Martens and Grosse, 2015) with the update directions proposed by Adam,\ninspired by considering Adam through a second-order lens. We evaluate AdamQLR\non a range of regression and classification tasks at various scales, achieving\ncompetitive generalisation performance vs runtime.\n","authors":["Ross M. 
Clarke","Baiyu Su","José Miguel Hernández-Lobato"],"pdf_url":"https://arxiv.org/pdf/2310.14963v1.pdf","comment":"28 pages, 15 figures, 4 tables. Submitted to ICLR 2024"},{"id":"http://arxiv.org/abs/2212.11680v2","updated":"2023-10-23T14:05:49Z","published":"2022-12-20T19:37:20Z","title":"Smooth Sailing: Improving Active Learning for Pre-trained Language\n Models with Representation Smoothness Analysis","summary":" Developed to alleviate prohibitive labeling costs, active learning (AL)\nmethods aim to reduce label complexity in supervised learning. While recent\nwork has demonstrated the benefit of using AL in combination with large\npre-trained language models (PLMs), it has often overlooked the practical\nchallenges that hinder the effectiveness of AL. We address these challenges by\nleveraging representation smoothness analysis to ensure AL is feasible, that\nis, both effective and practicable. Firstly, we propose an early stopping\ntechnique that does not require a validation set -- often unavailable in\nrealistic AL conditions -- and observe significant improvements over random\nsampling across multiple datasets and AL methods. Further, we find that task\nadaptation improves AL, whereas standard short fine-tuning in AL does not\nprovide improvements over random sampling. 
Our work demonstrates the usefulness\nof representation smoothness analysis for AL and introduces an AL stopping\ncriterion that reduces label complexity.\n","authors":["Josip Jukić","Jan Šnajder"],"pdf_url":"https://arxiv.org/pdf/2212.11680v2.pdf","comment":"Accepted at Learning with Small Data 2023, Association for\n Computational Linguistics"},{"id":"http://arxiv.org/abs/2310.14961v1","updated":"2023-10-23T14:04:18Z","published":"2023-10-23T14:04:18Z","title":"StenUNet: Automatic Stenosis Detection from X-ray Coronary Angiography","summary":" Coronary angiography continues to serve as the primary method for diagnosing\ncoronary artery disease (CAD), which is the leading global cause of mortality.\nThe severity of CAD is quantified by the location, degree of narrowing\n(stenosis), and number of arteries involved. In current practice, this\nquantification is performed manually using visual inspection and thus suffers\nfrom poor inter- and intra-rater reliability. The MICCAI grand challenge:\nAutomatic Region-based Coronary Artery Disease diagnostics using the X-ray\nangiography imagEs (ARCADE) curated a dataset with stenosis annotations, with\nthe goal of creating an automated stenosis detection algorithm. Using a\ncombination of machine learning and other computer vision techniques, we\npropose the architecture and algorithm StenUNet to accurately detect stenosis\nfrom X-ray Coronary Angiography. Our submission to the ARCADE challenge placed\n3rd among all teams. 
We achieved an F1 score of 0.5348 on the test set, 0.0005\nlower than the 2nd place.\n","authors":["Hui Lin","Tom Liu","Aggelos Katsaggelos","Adrienne Kline"],"pdf_url":"https://arxiv.org/pdf/2310.14961v1.pdf","comment":"12 pages, 5 figures, 1 table"},{"id":"http://arxiv.org/abs/2310.14957v1","updated":"2023-10-23T14:00:02Z","published":"2023-10-23T14:00:02Z","title":"XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series\n Classification","summary":" Despite the growing body of work on explainable machine learning in time\nseries classification (TSC), it remains unclear how to evaluate different\nexplainability methods. Resorting to qualitative assessment and user studies to\nevaluate explainers for TSC is difficult since humans have difficulties\nunderstanding the underlying information contained in time series data.\nTherefore, a systematic review and quantitative comparison of explanation\nmethods to confirm their correctness becomes crucial. While steps to\nstandardized evaluations were taken for tabular, image, and textual data,\nbenchmarking explainability methods on time series is challenging due to a)\ntraditional metrics not being directly applicable, b) implementation and\nadaption of traditional metrics for time series in the literature vary, and c)\nvarying baseline implementations. This paper proposes XTSC-Bench, a\nbenchmarking tool providing standardized datasets, models, and metrics for\nevaluating explanation methods on TSC. 
We analyze 3 perturbation-, 6 gradient-\nand 2 example-based explanation methods to TSC showing that improvements in the\nexplainers' robustness and reliability are necessary, especially for\nmultivariate data.\n","authors":["Jacqueline Höllig","Steffen Thoma","Florian Grimm"],"pdf_url":"https://arxiv.org/pdf/2310.14957v1.pdf","comment":"Accepted at ICMLA 2023"},{"id":"http://arxiv.org/abs/2307.13899v2","updated":"2023-10-23T13:58:25Z","published":"2023-07-26T01:47:49Z","title":"Regularizing Neural Networks with Meta-Learning Generative Models","summary":" This paper investigates methods for improving generative data augmentation\nfor deep learning. Generative data augmentation leverages the synthetic samples\nproduced by generative models as an additional dataset for classification with\nsmall dataset settings. A key challenge of generative data augmentation is that\nthe synthetic data contain uninformative samples that degrade accuracy. This is\nbecause the synthetic samples do not perfectly represent class categories in\nreal data and uniform sampling does not necessarily provide useful samples for\ntasks. In this paper, we present a novel strategy for generative data\naugmentation called meta generative regularization (MGR). To avoid the\ndegradation of generative data augmentation, MGR utilizes synthetic samples in\nthe regularization term for feature extractors instead of in the loss function,\ne.g., cross-entropy. These synthetic samples are dynamically determined to\nminimize the validation losses through meta-learning. We observed that MGR can\navoid the performance degradation of na\\\"ive generative data augmentation and\nboost the baselines. 
Experiments on six datasets showed that MGR is effective\nparticularly when datasets are smaller and stably outperforms baselines.\n","authors":["Shin'ya Yamaguchi","Daiki Chijiwa","Sekitoshi Kanai","Atsutoshi Kumagai","Hisashi Kashima"],"pdf_url":"https://arxiv.org/pdf/2307.13899v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.13769v2","updated":"2023-10-23T13:45:06Z","published":"2023-05-30T06:41:20Z","title":"Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation\n and Elaboration","summary":" In recent years, AI-assisted drug design methods have been proposed to\ngenerate molecules given the pockets' structures of target proteins. Most of\nthem are atom-level-based methods, which consider atoms as basic components and\ngenerate atom positions and types. In this way, however, it is hard to generate\nrealistic fragments with complicated structures. To solve this, we propose\nD3FG, a functional-group-based diffusion model for pocket-specific molecule\ngeneration and elaboration. D3FG decomposes molecules into two categories of\ncomponents: functional groups defined as rigid bodies and linkers as mass\npoints. And the two kinds of components can together form complicated fragments\nthat enhance ligand-protein interactions.\n To be specific, in the diffusion process, D3FG diffuses the data distribution\nof the positions, orientations, and types of the components into a prior\ndistribution; In the generative process, the noise is gradually removed from\nthe three variables by denoisers parameterized with designed equivariant graph\nneural networks. In the experiments, our method can generate molecules with\nmore realistic 3D structures, competitive affinities toward the protein\ntargets, and better drug properties. 
Besides, D3FG as a solution to a new task\nof molecule elaboration, could generate molecules with high affinities based on\nexisting ligands and the hotspots of target proteins.\n","authors":["Haitao Lin","Yufei Huang","Odin Zhang","Lirong Wu","Siyuan Li","Zhiyuan Chen","Stan Z. Li"],"pdf_url":"https://arxiv.org/pdf/2306.13769v2.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2310.14935v1","updated":"2023-10-23T13:35:24Z","published":"2023-10-23T13:35:24Z","title":"Causal machine learning for single-cell genomics","summary":" Advances in single-cell omics allow for unprecedented insights into the\ntranscription profiles of individual cells. When combined with large-scale\nperturbation screens, through which specific biological mechanisms can be\ntargeted, these technologies allow for measuring the effect of targeted\nperturbations on the whole transcriptome. These advances provide an opportunity\nto better understand the causative role of genes in complex biological\nprocesses such as gene regulation, disease progression or cellular development.\nHowever, the high-dimensional nature of the data, coupled with the intricate\ncomplexity of biological systems renders this task nontrivial. Within the\nmachine learning community, there has been a recent increase of interest in\ncausality, with a focus on adapting established causal techniques and\nalgorithms to handle high-dimensional data. In this perspective, we delineate\nthe application of these methodologies within the realm of single-cell genomics\nand their challenges. We first present the model that underlies most of current\ncausal approaches to single-cell biology and discuss and challenge the\nassumptions it entails from the biological point of view. We then identify open\nproblems in the application of causal approaches to single-cell data:\ngeneralising to unseen environments, learning interpretable models, and\nlearning causal models of dynamics. 
For each problem, we discuss how various\nresearch directions - including the development of computational approaches and\nthe adaptation of experimental protocols - may offer ways forward, or on the\ncontrary pose some difficulties. With the advent of single cell atlases and\nincreasing perturbation data, we expect causal models to become a crucial tool\nfor informed experimental design.\n","authors":["Alejandro Tejada-Lapuerta","Paul Bertin","Stefan Bauer","Hananeh Aliee","Yoshua Bengio","Fabian J. Theis"],"pdf_url":"https://arxiv.org/pdf/2310.14935v1.pdf","comment":"35 pages, 7 figures, 3 tables, 1 box"},{"id":"http://arxiv.org/abs/2310.14934v1","updated":"2023-10-23T13:34:59Z","published":"2023-10-23T13:34:59Z","title":"Robust Depth Linear Error Decomposition with Double Total Variation and\n Nuclear Norm for Dynamic MRI Reconstruction","summary":" Compressed Sensing (CS) significantly speeds up Magnetic Resonance Image\n(MRI) processing and achieves accurate MRI reconstruction from under-sampled\nk-space data. According to the current research, there are still several\nproblems with dynamic MRI k-space reconstruction based on CS. 1) There are\ndifferences between the Fourier domain and the Image domain, and the\ndifferences between MRI processing of different domains need to be considered.\n2) As three-dimensional data, dynamic MRI has its spatial-temporal\ncharacteristics, which need to calculate the difference and consistency of\nsurface textures while preserving structural integrity and uniqueness. 3)\nDynamic MRI reconstruction is time-consuming and computationally\nresource-dependent. In this paper, we propose a novel robust low-rank dynamic\nMRI reconstruction optimization model via highly under-sampled and Discrete\nFourier Transform (DFT) called the Robust Depth Linear Error Decomposition\nModel (RDLEDM). Our method mainly includes linear decomposition, double Total\nVariation (TV), and double Nuclear Norm (NN) regularizations. 
By adding linear\nimage domain error analysis, the noise is reduced after under-sampled and DFT\nprocessing, and the anti-interference ability of the algorithm is enhanced.\nDouble TV and NN regularizations can utilize both spatial-temporal\ncharacteristics and explore the complementary relationship between different\ndimensions in dynamic MRI sequences. In addition, due to the non-smoothness and\nnon-convexity of TV and NN terms, it is difficult to optimize the unified\nobjective model. To address this issue, we utilize a fast algorithm by solving\na primal-dual form of the original problem. Compared with five state-of-the-art\nmethods, extensive experiments on dynamic MRI data demonstrate the superior\nperformance of the proposed method in terms of both reconstruction accuracy and\ntime complexity.\n","authors":["Junpeng Tan","Chunmei Qing","Xiangmin Xu"],"pdf_url":"https://arxiv.org/pdf/2310.14934v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.09112v2","updated":"2023-10-23T13:28:35Z","published":"2023-01-22T12:29:03Z","title":"Differentially Private Natural Language Models: Recent Advances and\n Future Directions","summary":" Recent developments in deep learning have led to great success in various\nnatural language processing (NLP) tasks. However, these applications may\ninvolve data that contain sensitive information. Therefore, how to achieve good\nperformance while also protecting the privacy of sensitive data is a crucial\nchallenge in NLP. To preserve privacy, Differential Privacy (DP), which can\nprevent reconstruction attacks and protect against potential side knowledge, is\nbecoming a de facto technique for private data analysis. In recent years, NLP\nin DP models (DP-NLP) has been studied from different perspectives, which\ndeserves a comprehensive review. In this paper, we provide the first systematic\nreview of recent advances in DP deep learning models in NLP. 
In particular, we\nfirst discuss some differences and additional challenges of DP-NLP compared\nwith the standard DP deep learning. Then, we investigate some existing work on\nDP-NLP and present its recent developments from three aspects: gradient\nperturbation based methods, embedding vector perturbation based methods, and\nensemble model based methods. We also discuss some challenges and future\ndirections.\n","authors":["Lijie Hu","Ivan Habernal","Lei Shen","Di Wang"],"pdf_url":"https://arxiv.org/pdf/2301.09112v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.12731v2","updated":"2023-10-23T13:27:16Z","published":"2022-08-26T15:38:05Z","title":"Comparing Apples to Oranges: Learning Similarity Functions for Data\n Produced by Different Distributions","summary":" Similarity functions measure how comparable pairs of elements are, and play a\nkey role in a wide variety of applications, e.g., notions of Individual\nFairness abiding by the seminal paradigm of Dwork et al., as well as Clustering\nproblems. However, access to an accurate similarity function should not always\nbe considered guaranteed, and this point was even raised by Dwork et al. For\ninstance, it is reasonable to assume that when the elements to be compared are\nproduced by different distributions, or in other words belong to different\n``demographic'' groups, knowledge of their true similarity might be very\ndifficult to obtain. In this work, we present an efficient sampling framework\nthat learns these across-groups similarity functions, using only a limited\namount of experts' feedback. 
We show analytical results with rigorous\ntheoretical bounds, and empirically validate our algorithms via a large suite\nof experiments.\n","authors":["Leonidas Tsepenekas","Ivan Brugere","Freddy Lecue","Daniele Magazzeni"],"pdf_url":"https://arxiv.org/pdf/2208.12731v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.05779v2","updated":"2023-10-23T13:18:50Z","published":"2023-10-09T15:11:02Z","title":"Why Should This Article Be Deleted? Transparent Stance Detection in\n Multilingual Wikipedia Editor Discussions","summary":" The moderation of content on online platforms is usually non-transparent. On\nWikipedia, however, this discussion is carried out publicly and the editors are\nencouraged to use the content moderation policies as explanations for making\nmoderation decisions. Currently, only a few comments explicitly mention those\npolicies -- 20% of the English ones, but as few as 2% of the German and Turkish\ncomments. To aid in this process of understanding how content is moderated, we\nconstruct a novel multilingual dataset of Wikipedia editor discussions along\nwith their reasoning in three languages. The dataset contains the stances of\nthe editors (keep, delete, merge, comment), along with the stated reason, and a\ncontent moderation policy, for each edit decision. We demonstrate that stance\nand corresponding reason (policy) can be predicted jointly with a high degree\nof accuracy, adding transparency to the decision-making process. 
We release\nboth our joint prediction models and the multilingual content moderation\ndataset for further research on automated transparent content moderation.\n","authors":["Lucie-Aimée Kaffee","Arnav Arora","Isabelle Augenstein"],"pdf_url":"https://arxiv.org/pdf/2310.05779v2.pdf","comment":"This submission has been accepted to 2023 Conference on Empirical\n Methods in Natural Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.14909v1","updated":"2023-10-23T13:18:49Z","published":"2023-10-23T13:18:49Z","title":"Linking Surface Facts to Large-Scale Knowledge Graphs","summary":" Open Information Extraction (OIE) methods extract facts from natural language\ntext in the form of (\"subject\"; \"relation\"; \"object\") triples. These facts are,\nhowever, merely surface forms, the ambiguity of which impedes their downstream\nusage; e.g., the surface phrase \"Michael Jordan\" may refer to either the former\nbasketball player or the university professor. Knowledge Graphs (KGs), on the\nother hand, contain facts in a canonical (i.e., unambiguous) form, but their\ncoverage is limited by a static schema (i.e., a fixed set of entities and\npredicates). To bridge this gap, we need the best of both worlds: (i) high\ncoverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of\nKGs. In order to achieve this goal, we propose a new benchmark with novel\nevaluation protocols that can, for example, measure fact linking performance on\na granular triple slot level, while also measuring if a system has the ability\nto recognize that a surface form has no match in the existing KG. Our extensive\nevaluation of several baselines show that detection of out-of-KG entities and\npredicates is more difficult than accurate linking to existing ones, thus\ncalling for more research efforts on this difficult task. 
We publicly release\nall resources (data, benchmark and code) on\nhttps://github.com/nec-research/fact-linking.\n","authors":["Gorjan Radevski","Kiril Gashteovski","Chia-Chien Hung","Carolin Lawrence","Goran Glavaš"],"pdf_url":"https://arxiv.org/pdf/2310.14909v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14901v1","updated":"2023-10-23T13:11:30Z","published":"2023-10-23T13:11:30Z","title":"Series of Hessian-Vector Products for Tractable Saddle-Free Newton\n Optimisation of Neural Networks","summary":" Despite their popularity in the field of continuous optimisation,\nsecond-order quasi-Newton methods are challenging to apply in machine learning,\nas the Hessian matrix is intractably large. This computational burden is\nexacerbated by the need to address non-convexity, for instance by modifying the\nHessian's eigenvalues as in Saddle-Free Newton methods. We propose an\noptimisation algorithm which addresses both of these concerns - to our\nknowledge, the first efficiently-scalable optimisation algorithm to\nasymptotically use the exact (eigenvalue-modified) inverse Hessian. Our method\nframes the problem as a series which principally square-roots and inverts the\nsquared Hessian, then uses it to precondition a gradient vector, all without\nexplicitly computing or eigendecomposing the Hessian. A truncation of this\ninfinite series provides a new optimisation algorithm which is scalable and\ncomparable to other first- and second-order optimisation methods in both\nruntime and optimisation performance. We demonstrate this in a variety of\nsettings, including a ResNet-18 trained on CIFAR-10.\n","authors":["Elre T. Oldewage","Ross M. Clarke","José Miguel Hernández-Lobato"],"pdf_url":"https://arxiv.org/pdf/2310.14901v1.pdf","comment":"36 pages, 10 figures, 5 tables. Submitted to TMLR. 
First two authors'\n order randomised"},{"id":"http://arxiv.org/abs/2207.06272v3","updated":"2023-10-23T13:06:58Z","published":"2022-07-13T15:18:00Z","title":"Hindsight Learning for MDPs with Exogenous Inputs","summary":" Many resource management problems require sequential decision-making under\nuncertainty, where the only uncertainty affecting the decision outcomes are\nexogenous variables outside the control of the decision-maker. We model these\nproblems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and\ndesign a class of data-efficient algorithms for them termed Hindsight Learning\n(HL). Our HL algorithms achieve data efficiency by leveraging a key insight:\nhaving samples of the exogenous variables, past decisions can be revisited in\nhindsight to infer counterfactual consequences that can accelerate policy\nimprovements. We compare HL against classic baselines in the multi-secretary\nand airline revenue management problems. We also scale our algorithms to a\nbusiness-critical cloud resource management problem -- allocating Virtual\nMachines (VMs) to physical machines, and simulate their performance with real\ndatasets from a large public cloud provider. We find that HL algorithms\noutperform domain-specific heuristics, as well as state-of-the-art\nreinforcement learning methods.\n","authors":["Sean R. Sinclair","Felipe Frujeri","Ching-An Cheng","Luke Marshall","Hugo Barbalho","Jingling Li","Jennifer Neville","Ishai Menache","Adith Swaminathan"],"pdf_url":"https://arxiv.org/pdf/2207.06272v3.pdf","comment":"52 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.00797v2","updated":"2023-10-23T13:06:30Z","published":"2023-10-01T21:24:05Z","title":"Going Beyond Familiar Features for Deep Anomaly Detection","summary":" Anomaly Detection (AD) is a critical task that involves identifying\nobservations that do not conform to a learned model of normality. 
Prior work in\ndeep AD is predominantly based on a familiarity hypothesis, where familiar\nfeatures serve as the reference in a pre-trained embedding space. While this\nstrategy has proven highly successful, it turns out that it causes consistent\nfalse negatives when anomalies consist of truly novel features that are not\nwell captured by the pre-trained encoding. We propose a novel approach to AD\nusing explainability to capture novel features as unexplained observations in\nthe input space. We achieve strong performance across a wide range of anomaly\nbenchmarks by combining similarity and novelty in a hybrid approach. Our\napproach establishes a new state-of-the-art across multiple benchmarks,\nhandling diverse anomaly types while eliminating the need for expensive\nbackground models and dense matching. In particular, we show that by taking\naccount of novel features, we reduce false negative anomalies by up to 40% on\nchallenging benchmarks compared to the state-of-the-art. Our method gives\nvisually inspectable explanations for pixel-level anomalies.\n","authors":["Sarath Sivaprasad","Mario Fritz"],"pdf_url":"https://arxiv.org/pdf/2310.00797v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14894v1","updated":"2023-10-23T13:04:15Z","published":"2023-10-23T13:04:15Z","title":"Local Universal Rule-based Explanations","summary":" Explainable artificial intelligence (XAI) is one of the most intensively\ndeveloped areas of AI in recent years. It is also one of the most fragmented ones,\nwith multiple methods that focus on different aspects of explanations. This\nmakes it difficult to obtain the full spectrum of explanation at once in a compact\nand consistent way. To address this issue, we present Local Universal Explainer\n(LUX), a rule-based explainer that can generate factual, counterfactual\nand visual explanations. 
It is based on a modified version of decision tree\nalgorithms that allows for oblique splits and integration with feature\nimportance XAI methods such as SHAP or LIME. It does not use data generation, in\ncontrast to other algorithms, but is focused on selecting local concepts in the\nform of high-density clusters of real data that have the highest impact on\nforming the decision boundary of the explained model. We tested our method on\nreal and synthetic datasets and compared it with state-of-the-art rule-based\nexplainers such as LORE, EXPLAN and Anchor. Our method outperforms currently\nexisting approaches in terms of simplicity, global fidelity and\nrepresentativeness.\n","authors":["Szymon Bobek","Grzegorz J. Nalepa"],"pdf_url":"https://arxiv.org/pdf/2310.14894v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.00861v4","updated":"2023-10-23T13:02:38Z","published":"2023-02-02T04:12:29Z","title":"SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling","summary":" Time series analysis is widely used in extensive areas. Recently, to reduce\nlabeling expenses and benefit various tasks, self-supervised pre-training has\nattracted immense interest. One mainstream paradigm is masked modeling, which\nsuccessfully pre-trains deep models by learning to reconstruct the masked\ncontent based on the unmasked part. However, since the semantic information of\ntime series is mainly contained in temporal variations, the standard way of\nrandomly masking a portion of time points will seriously ruin vital temporal\nvariations of time series, making the reconstruction task too difficult to\nguide representation learning. We thus present SimMTM, a Simple pre-training\nframework for Masked Time-series Modeling. 
By relating masked modeling to\nmanifold learning, SimMTM proposes to recover masked time points by the\nweighted aggregation of multiple neighbors outside the manifold, which eases\nthe reconstruction task by assembling ruined but complementary temporal\nvariations from multiple masked series. SimMTM further learns to uncover the\nlocal structure of the manifold, which is helpful for masked modeling.\nExperimentally, SimMTM achieves state-of-the-art fine-tuning performance\ncompared to the most advanced time series pre-training methods in two canonical\ntime series analysis tasks: forecasting and classification, covering both in-\nand cross-domain settings.\n","authors":["Jiaxiang Dong","Haixu Wu","Haoran Zhang","Li Zhang","Jianmin Wang","Mingsheng Long"],"pdf_url":"https://arxiv.org/pdf/2302.00861v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14888v1","updated":"2023-10-23T12:57:03Z","published":"2023-10-23T12:57:03Z","title":"Beyond Bayesian Model Averaging over Paths in Probabilistic Programs\n with Stochastic Support","summary":" The posterior in probabilistic programs with stochastic support decomposes as\na weighted sum of the local posterior distributions associated with each\npossible program path. We show that making predictions with this full posterior\nimplicitly performs a Bayesian model averaging (BMA) over paths. This is\npotentially problematic, as model misspecification can cause the BMA weights to\nprematurely collapse onto a single path, leading to sub-optimal predictions in\nturn. To remedy this issue, we propose alternative mechanisms for path\nweighting: one based on stacking and one based on ideas from PAC-Bayes. We show\nhow both can be implemented as a cheap post-processing step on top of existing\ninference engines. 
In our experiments, we find them to be more robust and lead\nto better predictions compared to the default BMA weights.\n","authors":["Tim Reichelt","Luke Ong","Tom Rainforth"],"pdf_url":"https://arxiv.org/pdf/2310.14888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.02177v2","updated":"2023-10-23T12:48:03Z","published":"2022-10-05T12:07:13Z","title":"Multi-objective optimization via equivariant deep hypervolume\n approximation","summary":" Optimizing multiple competing objectives is a common problem across science\nand industry. The inherent inextricable trade-off between those objectives\nleads one to the task of exploring their Pareto front. A meaningful quantity\nfor the purpose of the latter is the hypervolume indicator, which is used in\nBayesian Optimization (BO) and Evolutionary Algorithms (EAs). However, the\ncomputational complexity for the calculation of the hypervolume scales\nunfavorably with increasing number of objectives and data points, which\nrestricts its use in those common multi-objective optimization frameworks. To\novercome these restrictions we propose to approximate the hypervolume function\nwith a deep neural network, which we call DeepHV. For better sample efficiency\nand generalization, we exploit the fact that the hypervolume is\nscale-equivariant in each of the objectives as well as permutation invariant\nw.r.t. both the objectives and the samples, by using a deep neural network that\nis equivariant w.r.t. the combined group of scalings and permutations. We\nevaluate our method against exact, and approximate hypervolume methods in terms\nof accuracy, computation time, and generalization. We also apply and compare\nour methods to state-of-the-art multi-objective BO methods and EAs on a range\nof synthetic benchmark test cases. 
The results show that our methods are\npromising for such multi-objective optimization tasks.\n","authors":["Jim Boelrijk","Bernd Ensing","Patrick Forré"],"pdf_url":"https://arxiv.org/pdf/2210.02177v2.pdf","comment":"Updated with camera-ready version. Accepted at ICLR 2023"},{"id":"http://arxiv.org/abs/2305.00374v2","updated":"2023-10-23T12:46:38Z","published":"2023-04-30T03:12:21Z","title":"Enhancing Adversarial Contrastive Learning via Adversarial Invariant\n Regularization","summary":" Adversarial contrastive learning (ACL) is a technique that enhances standard\ncontrastive learning (SCL) by incorporating adversarial data to learn a robust\nrepresentation that can withstand adversarial attacks and common corruptions\nwithout requiring costly annotations. To improve transferability, the existing\nwork introduced the standard invariant regularization (SIR) to impose\nstyle-independence property to SCL, which can exempt the impact of nuisance\nstyle factors in the standard representation. However, it is unclear how the\nstyle-independence property benefits ACL-learned robust representations. In\nthis paper, we leverage the technique of causal reasoning to interpret the ACL\nand propose adversarial invariant regularization (AIR) to enforce independence\nfrom style factors. We regulate the ACL using both SIR and AIR to output the\nrobust representation. Theoretically, we show that AIR implicitly encourages\nthe representational distance between different views of natural data and their\nadversarial variants to be independent of style factors. Empirically, our\nexperimental results show that invariant regularization significantly improves\nthe performance of state-of-the-art ACL methods in terms of both standard\ngeneralization and robustness on downstream tasks. To the best of our\nknowledge, we are the first to apply causal reasoning to interpret ACL and\ndevelop AIR for enhancing ACL-learned robust representations. 
Our source code\nis at https://github.com/GodXuxilie/Enhancing_ACL_via_AIR.\n","authors":["Xilie Xu","Jingfeng Zhang","Feng Liu","Masashi Sugiyama","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2305.00374v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2303.12314v4","updated":"2023-10-23T12:43:35Z","published":"2023-03-22T05:04:21Z","title":"Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization\n for Few-shot Generalization","summary":" Prompt tuning is a parameter-efficient method, which learns soft prompts and\nconditions frozen language models to perform specific downstream tasks. Though\neffective, prompt tuning under few-shot settings on the one hand heavily relies\non a good initialization of soft prompts. On the other hand, it can easily\noverfit to few-shot training samples, thereby undermining generalizability.\nExisting works leverage pre-training or supervised meta-learning to initialize\nsoft prompts but they fail to data-efficiently generalize to unseen downstream\ntasks. To address the above problems, this paper proposes a novel\nSelf-sUpervised meta-Prompt learning framework with MEta-gradient\nRegularization for few-shot generalization (SUPMER). SUPMER leverages\nself-supervised meta-learning with a diverse set of well-designed meta-training\ntasks to learn a universal prompt initialization for efficient adaptation using\nonly unlabeled data. Additionally, it jointly meta-learns a gradient\nregularization function to transform raw gradients into a domain-generalizable\ndirection, thus alleviating the problem of overfitting. Extensive experiments\nshow that SUPMER achieves better performance for different few-shot downstream\ntasks, and also exhibits a stronger domain generalization ability. 
The code for\nSUPMER will be available at https://github.com/beepkh/SUPMER.\n","authors":["Kaihang Pan","Juncheng Li","Hongye Song","Jun Lin","Xiaozhong Liu","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2303.12314v4.pdf","comment":"Accepted by EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.14866v1","updated":"2023-10-23T12:36:33Z","published":"2023-10-23T12:36:33Z","title":"A Study on Knowledge Graph Embeddings and Graph Neural Networks for Web\n Of Things","summary":" Graph data structures are widely used to store relational information between\nseveral entities. With data being generated worldwide on a large scale, we see\na significant growth in the generation of knowledge graphs. Thing in the future\nis Orange's take on a knowledge graph in the domain of the Web Of Things (WoT),\nwhere the main objective of the platform is to provide a digital representation\nof the physical world and enable cross-domain applications to be built upon\nthis massive and highly connected graph of things. In this context, as the\nknowledge graph grows in size, it is prone to have noisy and messy data. In\nthis paper, we explore state-of-the-art knowledge graph embedding (KGE) methods\nto learn numerical representations of the graph entities and, subsequently,\nexplore downstream tasks like link prediction, node classification, and triple\nclassification. We also investigate Graph neural networks (GNN) alongside KGEs\nand compare their performance on the same downstream tasks. Our evaluation\nhighlights the encouraging performance of both KGE and GNN-based methods on\nnode classification, and the superiority of GNN approaches in the link\nprediction task. 
Overall, we show that state-of-the-art approaches are relevant\nin a WoT context, and this preliminary work provides insights to implement and\nevaluate them in this context.\n","authors":["Rohith Teja Mittakola","Thomas Hassan"],"pdf_url":"https://arxiv.org/pdf/2310.14866v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.03857v4","updated":"2023-10-23T12:35:27Z","published":"2023-02-08T03:20:14Z","title":"Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset\n Selection","summary":" Adversarial contrastive learning (ACL) does not require expensive data\nannotations but outputs a robust representation that withstands adversarial\nattacks and also generalizes to a wide range of downstream tasks. However, ACL\nneeds tremendous running time to generate the adversarial variants of all\ntraining data, which limits its scalability to large datasets. To speed up ACL,\nthis paper proposes a robustness-aware coreset selection (RCS) method. RCS does\nnot require label information and searches for an informative subset that\nminimizes a representational divergence, which is the distance of the\nrepresentation between natural data and their virtual adversarial variants. The\nvanilla solution of RCS via traversing all possible subsets is computationally\nprohibitive. Therefore, we theoretically transform RCS into a surrogate problem\nof submodular maximization, of which the greedy search is an efficient solution\nwith an optimality guarantee for the original problem. Empirically, our\ncomprehensive results corroborate that RCS can speed up ACL by a large margin\nwithout significantly hurting the robustness transferability. Notably, to the\nbest of our knowledge, we are the first to conduct ACL efficiently on the\nlarge-scale ImageNet-1K dataset to obtain an effective robust representation\nvia RCS. 
Our source code is at\nhttps://github.com/GodXuxilie/Efficient_ACL_via_RCS.\n","authors":["Xilie Xu","Jingfeng Zhang","Feng Liu","Masashi Sugiyama","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2302.03857v4.pdf","comment":"NeurIPS 2023 Spotlight"},{"id":"http://arxiv.org/abs/2310.14864v1","updated":"2023-10-23T12:33:59Z","published":"2023-10-23T12:33:59Z","title":"Diverse Priors for Deep Reinforcement Learning","summary":" In Reinforcement Learning (RL), agents aim at maximizing cumulative rewards\nin a given environment. During the learning process, RL agents face the dilemma\nof exploitation and exploration: leveraging existing knowledge to acquire\nrewards or seeking potentially higher ones. Using uncertainty as a guiding\nprinciple provides an active and effective approach to solving this dilemma and\nensemble-based methods are one of the prominent avenues for quantifying\nuncertainty. Nevertheless, conventional ensemble-based uncertainty estimation\nlacks an explicit prior, deviating from Bayesian principles. Besides, this\nmethod requires diversity among members to generate less biased uncertainty\nestimation results. To address the above problems, previous research has\nincorporated random functions as priors. Building upon these foundational\nefforts, our work introduces an innovative approach with delicately designed\nprior NNs, which can incorporate maximal diversity in the initial value\nfunctions of RL. 
Our method has demonstrated superior performance compared with\nthe random prior approaches in solving classic control problems and general\nexploration tasks, significantly improving sample efficiency.\n","authors":["Chenfan Weng","Zhongguo Li"],"pdf_url":"https://arxiv.org/pdf/2310.14864v1.pdf","comment":"8 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.14858v1","updated":"2023-10-23T12:28:21Z","published":"2023-10-23T12:28:21Z","title":"Dynamically Weighted Federated k-Means","summary":" Federated clustering is an important part of the field of federated machine\nlearning, that allows multiple data sources to collaboratively cluster their\ndata while keeping it decentralized and preserving privacy. In this paper, we\nintroduce a novel federated clustering algorithm, named Dynamically Weighted\nFederated k-means (DWF k-means), to address the challenges posed by distributed\ndata sources and heterogeneous data. Our proposed algorithm combines the\nbenefits of traditional clustering techniques with the privacy and scalability\nadvantages of federated learning. It enables multiple data owners to\ncollaboratively cluster their local data while exchanging minimal information\nwith a central coordinator. The algorithm optimizes the clustering process by\nadaptively aggregating cluster assignments and centroids from each data source,\nthereby learning a global clustering solution that reflects the collective\nknowledge of the entire federated network. We conduct experiments on multiple\ndatasets and data distribution settings to evaluate the performance of our\nalgorithm in terms of clustering score, accuracy, and v-measure. 
The results\ndemonstrate that our approach can match the performance of the centralized\nclassical k-means baseline, and outperform existing federated clustering\nmethods in realistic scenarios.\n","authors":["Patrick Holzer","Tania Jacob","Shubham Kavane"],"pdf_url":"https://arxiv.org/pdf/2310.14858v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14848v1","updated":"2023-10-23T12:15:23Z","published":"2023-10-23T12:15:23Z","title":"Zero-knowledge Proof Meets Machine Learning in Verifiability: A Survey","summary":" With the rapid advancement of artificial intelligence technology, the usage\nof machine learning models is gradually becoming part of our daily lives.\nHigh-quality models rely not only on efficient optimization algorithms but also\non the training and learning processes built upon vast amounts of data and\ncomputational power. However, in practice, due to various challenges such as\nlimited computational resources and data privacy concerns, users in need of\nmodels often cannot train machine learning models locally. This has led them to\nexplore alternative approaches such as outsourced learning and federated\nlearning. While these methods address the feasibility of model training\neffectively, they introduce concerns about the trustworthiness of the training\nprocess since computations are not performed locally. Similarly, there are\ntrustworthiness issues associated with outsourced model inference. These two\nproblems can be summarized as the trustworthiness problem of model\ncomputations: How can one verify that the results computed by other\nparticipants are derived according to the specified algorithm, model, and input\ndata? To address this challenge, verifiable machine learning (VML) has emerged.\nThis paper presents a comprehensive survey of zero-knowledge proof-based\nverifiable machine learning (ZKP-VML) technology. We first analyze the\npotential verifiability issues that may exist in different machine learning\nscenarios. 
Subsequently, we provide a formal definition of ZKP-VML. We then\nconduct a detailed analysis and classification of existing works based on their\ntechnical approaches. Finally, we discuss the key challenges and future\ndirections in the field of ZKP-based VML.\n","authors":["Zhibo Xing","Zijian Zhang","Jiamou Liu","Ziang Zhang","Meng Li","Liehuang Zhu","Giovanni Russello"],"pdf_url":"https://arxiv.org/pdf/2310.14848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.11535v2","updated":"2023-10-23T12:11:24Z","published":"2023-01-27T04:54:12Z","title":"Learning Informative Representation for Fairness-aware Multivariate\n Time-series Forecasting: A Group-based Perspective","summary":" Performance unfairness among variables widely exists in multivariate time\nseries (MTS) forecasting models since such models may attend/bias to certain\n(advantaged) variables. Addressing this unfairness problem is important for\nequally attending to all variables and avoiding vulnerable model biases/risks.\nHowever, fair MTS forecasting is challenging and has been less studied in the\nliterature. To bridge such significant gap, we formulate the fairness modeling\nproblem as learning informative representations attending to both advantaged\nand disadvantaged variables. Accordingly, we propose a novel framework, named\nFairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial\nlearning to generate both group-independent and group-relevant representations\nfor the downstream forecasting. The framework first leverages a spectral\nrelaxation of the K-means objective to infer variable correlations and thus to\ngroup variables. Then, it utilizes a filtering&fusion component to filter the\ngroup-relevant information and generate group-independent representations via\northogonality regularization. 
The group-independent and group-relevant\nrepresentations form highly informative representations, facilitating the\nsharing of knowledge from advantaged variables to disadvantaged variables to\nguarantee fairness. Extensive experiments on four public datasets demonstrate\nthe effectiveness of our proposed FairFor for fair forecasting and significant\nperformance improvement.\n","authors":["Hui He","Qi Zhang","Shoujin Wang","Kun Yi","Zhendong Niu","Longbing Cao"],"pdf_url":"https://arxiv.org/pdf/2301.11535v2.pdf","comment":"13 pages, 5 figures, accepted by IEEE Transactions on Knowledge and\n Data Engineering (TKDE)"},{"id":"http://arxiv.org/abs/2310.14845v1","updated":"2023-10-23T12:11:13Z","published":"2023-10-23T12:11:13Z","title":"ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt","summary":" Recent research has demonstrated the efficacy of pre-training graph neural\nnetworks (GNNs) to capture the transferable graph semantics and enhance the\nperformance of various downstream tasks. However, the semantic knowledge\nlearned from pretext tasks might be unrelated to the downstream task, leading\nto a semantic gap that limits the application of graph pre-training. To reduce\nthis gap, traditional approaches propose hybrid pre-training to combine various\npretext tasks together in a multi-task learning fashion and learn multi-grained\nknowledge, which, however, cannot distinguish tasks and results in some\ntransferable task-specific knowledge distortion by each other. Moreover, most\nGNNs cannot distinguish nodes located in different parts of the graph, making\nthem fail to learn position-specific knowledge and lead to suboptimal\nperformance. In this work, inspired by the prompt-based tuning in natural\nlanguage processing, we propose a unified framework for graph hybrid\npre-training which injects the task identification and position identification\ninto GNNs through a prompt mechanism, namely multi-task graph dual prompt\n(ULTRA-DP). 
Based on this framework, we propose a prompt-based transferability\ntest to find the most relevant pretext task in order to reduce the semantic\ngap. To implement the hybrid pre-training tasks, beyond the classical edge\nprediction task (node-node level), we further propose a novel pre-training\nparadigm based on a group of $k$-nearest neighbors (node-group level). The\ncombination of them across different scales is able to comprehensively express\nmore structural semantics and derive richer multi-grained knowledge. Extensive\nexperiments show that our proposed ULTRA-DP can significantly enhance the\nperformance of hybrid pre-training methods and show the generalizability to\nother pre-training tasks and backbone architectures.\n","authors":["Mouxiang Chen","Zemin Liu","Chenghao Liu","Jundong Li","Qiheng Mao","Jianling Sun"],"pdf_url":"https://arxiv.org/pdf/2310.14845v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.16494v5","updated":"2023-10-23T12:02:34Z","published":"2022-11-29T18:58:07Z","title":"On the Ability of Graph Neural Networks to Model Interactions Between\n Vertices","summary":" Graph neural networks (GNNs) are widely used for modeling complex\ninteractions between entities represented as vertices of a graph. Despite\nrecent efforts to theoretically analyze the expressive power of GNNs, a formal\ncharacterization of their ability to model interactions is lacking. The current\npaper aims to address this gap. Formalizing strength of interactions through an\nestablished measure known as separation rank, we quantify the ability of\ncertain GNNs to model interaction between a given subset of vertices and its\ncomplement, i.e. between the sides of a given partition of input vertices. Our\nresults reveal that the ability to model interaction is primarily determined by\nthe partition's walk index -- a graph-theoretical characteristic defined by the\nnumber of walks originating from the boundary of the partition. 
Experiments\nwith common GNN architectures corroborate this finding. As a practical\napplication of our theory, we design an edge sparsification algorithm named\nWalk Index Sparsification (WIS), which preserves the ability of a GNN to model\ninteractions when input edges are removed. WIS is simple, computationally\nefficient, and in our experiments has markedly outperformed alternative methods\nin terms of induced prediction accuracy. More broadly, it showcases the\npotential of improving GNNs by theoretically analyzing the interactions they\ncan model.\n","authors":["Noam Razin","Tom Verbin","Nadav Cohen"],"pdf_url":"https://arxiv.org/pdf/2211.16494v5.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.14838v1","updated":"2023-10-23T11:58:01Z","published":"2023-10-23T11:58:01Z","title":"Calibration of Time-Series Forecasting Transformers: Detecting and\n Adapting Context-Driven Distribution Shift","summary":" Recent years have witnessed the success of introducing Transformers to time\nseries forecasting. From a data generation perspective, we illustrate that\nexisting Transformers are susceptible to distribution shifts driven by temporal\ncontexts, whether observed or unobserved. Such context-driven distribution\nshift (CDS) introduces biases in predictions within specific contexts and poses\nchallenges for conventional training paradigm. In this paper, we introduce a\nuniversal calibration methodology for the detection and adaptation of CDS with\na trained Transformer model. To this end, we propose a novel CDS detector,\ntermed the \"residual-based CDS detector\" or \"Reconditionor\", which quantifies\nthe model's vulnerability to CDS by evaluating the mutual information between\nprediction residuals and their corresponding contexts. A high Reconditionor\nscore indicates a severe susceptibility, thereby necessitating model\nadaptation. 
In this circumstance, we put forth a straightforward yet potent\nadapter framework for model calibration, termed the \"sample-level\ncontextualized adapter\" or \"SOLID\". This framework involves the curation of a\ncontextually similar dataset to the provided test sample and the subsequent\nfine-tuning of the model's prediction layer with a limited number of steps. Our\ntheoretical analysis demonstrates that this adaptation strategy is able to\nachieve an optimal equilibrium between bias and variance. Notably, our proposed\nReconditionor and SOLID are model-agnostic and readily adaptable to a wide\nrange of Transformers. Extensive experiments show that SOLID consistently\nenhances the performance of current SOTA Transformers on real-world datasets,\nespecially on cases with substantial CDS detected by the proposed\nReconditionor, thus validate the effectiveness of the calibration approach.\n","authors":["Mouxiang Chen","Lefei Shen","Han Fu","Zhuo Li","Jianling Sun","Chenghao Liu"],"pdf_url":"https://arxiv.org/pdf/2310.14838v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14837v1","updated":"2023-10-23T11:57:44Z","published":"2023-10-23T11:57:44Z","title":"Harnessing Attention Mechanisms: Efficient Sequence Reduction using\n Attention-based Autoencoders","summary":" Many machine learning models use the manipulation of dimensions as a driving\nforce to enable models to identify and learn important features in data. In the\ncase of sequential data this manipulation usually happens on the token\ndimension level. Despite the fact that many tasks require a change in sequence\nlength itself, the step of sequence length reduction usually happens out of\nnecessity and in a single step. As far as we are aware, no model uses the\nsequence length reduction step as an additional opportunity to tune the models\nperformance. In fact, sequence length manipulation as a whole seems to be an\noverlooked direction. 
In this study we introduce a novel attention-based method\nthat allows for the direct manipulation of sequence lengths. To explore the\nmethod's capabilities, we employ it in an autoencoder model. The autoencoder\nreduces the input sequence to a smaller sequence in latent space. It then aims\nto reproduce the original sequence from this reduced form. In this setting, we\nexplore the methods reduction performance for different input and latent\nsequence lengths. We are able to show that the autoencoder retains all the\nsignificant information when reducing the original sequence to half its\noriginal size. When reducing down to as low as a quarter of its original size,\nthe autoencoder is still able to reproduce the original sequence with an\naccuracy of around 90%.\n","authors":["Daniel Biermann","Fabrizio Palumbo","Morten Goodwin","Ole-Christoffer Granmo"],"pdf_url":"https://arxiv.org/pdf/2310.14837v1.pdf","comment":"8 pages, 5 images, 1 table"},{"id":"http://arxiv.org/abs/2310.14826v1","updated":"2023-10-23T11:45:34Z","published":"2023-10-23T11:45:34Z","title":"Sharp error bounds for imbalanced classification: how many examples in\n the minority class?","summary":" When dealing with imbalanced classification data, reweighting the loss\nfunction is a standard procedure allowing to equilibrate between the true\npositive and true negative rates within the risk measure. Despite significant\ntheoretical work in this area, existing results do not adequately address a\nmain challenge within the imbalanced classification framework, which is the\nnegligible size of one class in relation to the full sample size and the need\nto rescale the risk function by a probability tending to zero. 
To address this\ngap, we present two novel contributions in the setting where the rare class\nprobability approaches zero: (1) a non asymptotic fast rate probability bound\nfor constrained balanced empirical risk minimization, and (2) a consistent\nupper bound for balanced nearest neighbors estimates. Our findings provide a\nclearer understanding of the benefits of class-weighting in realistic settings,\nopening new avenues for further research in this field.\n","authors":["Anass Aghbalou","François Portier","Anne Sabourin"],"pdf_url":"https://arxiv.org/pdf/2310.14826v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14817v1","updated":"2023-10-23T11:33:24Z","published":"2023-10-23T11:33:24Z","title":"Text2Topic: Multi-Label Text Classification System for Efficient Topic\n Detection in User Generated Content with Zero-Shot Capabilities","summary":" Multi-label text classification is a critical task in the industry. It helps\nto extract structured information from large amount of textual data. We propose\nText to Topic (Text2Topic), which achieves high multi-label classification\nperformance by employing a Bi-Encoder Transformer architecture that utilizes\nconcatenation, subtraction, and multiplication of embeddings on both text and\ntopic. Text2Topic also supports zero-shot predictions, produces domain-specific\ntext embeddings, and enables production-scale batch-inference with high\nthroughput. The final model achieves accurate and comprehensive results\ncompared to state-of-the-art baselines, including large language models (LLMs).\n In this study, a total of 239 topics are defined, and around 1.6 million\ntext-topic pairs annotations (in which 200K are positive) are collected on\napproximately 120K texts from 3 main data sources on Booking.com. The data is\ncollected with optimized smart sampling and partial labeling. 
The final\nText2Topic model is deployed on a real-world stream processing platform, and it\noutperforms other models with 92.9% micro mAP, as well as a 75.8% macro mAP\nscore. We summarize the modeling choices which are extensively tested through\nablation studies, and share detailed in-production decision-making steps.\n","authors":["Fengjun Wang","Moran Beladev","Ofri Kleinfeld","Elina Frayerman","Tal Shachar","Eran Fainman","Karen Lastmann Assaraf","Sarai Mizrachi","Benjamin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14814v1","updated":"2023-10-23T11:30:06Z","published":"2023-10-23T11:30:06Z","title":"Leveraging Ensemble Diversity for Robust Self-Training in the Presence\n of Sample Selection Bias","summary":" Self-training is a well-known approach for semi-supervised learning. It\nconsists of iteratively assigning pseudo-labels to unlabeled data for which the\nmodel is confident and treating them as labeled examples. For neural networks,\nsoftmax prediction probabilities are often used as a confidence measure,\ndespite the fact that they are known to be overconfident, even for wrong\npredictions. This phenomenon is particularly intensified in the presence of\nsample selection bias, i.e., when data labeling is subject to some constraint.\nTo address this issue, we propose a novel confidence measure, called\n$\\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of\nlinear classifiers. We provide the theoretical analysis of our approach by\nstudying stationary points and describing the relationship between the\ndiversity of the individual members and their performance. 
We empirically\ndemonstrate the benefit of our confidence measure for three different\npseudo-labeling policies on classification datasets of various data modalities.\n","authors":["Ambroise Odonnat","Vasilii Feofanov","Ievgen Redko"],"pdf_url":"https://arxiv.org/pdf/2310.14814v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14794v2","updated":"2023-10-23T11:22:58Z","published":"2023-05-24T06:45:33Z","title":"Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak\n Supervision for Text Classification","summary":" Recent advances in weakly supervised text classification mostly focus on\ndesigning sophisticated methods to turn high-level human heuristics into\nquality pseudo-labels. In this paper, we revisit the seed matching-based\nmethod, which is arguably the simplest way to generate pseudo-labels, and show\nthat its power was greatly underestimated. We show that the limited performance\nof seed matching is largely due to the label bias injected by the simple\nseed-match rule, which prevents the classifier from learning reliable\nconfidence for selecting high-quality pseudo-labels. Interestingly, simply\ndeleting the seed words present in the matched input texts can mitigate the\nlabel bias and help learn better confidence. Subsequently, the performance\nachieved by seed matching can be improved significantly, making it on par with\nor even better than the state-of-the-art. Furthermore, to handle the case when\nthe seed words are not made known, we propose to simply delete the word tokens\nin the input text randomly with a high deletion ratio. 
Remarkably, seed\nmatching equipped with this random deletion method can often achieve even\nbetter performance than that with seed deletion.\n","authors":["Chengyu Dong","Zihan Wang","Jingbo Shang"],"pdf_url":"https://arxiv.org/pdf/2305.14794v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2211.16965v2","updated":"2023-10-23T11:21:21Z","published":"2022-11-30T13:20:11Z","title":"Privacy-Preserving Federated Deep Clustering based on GAN","summary":" Federated clustering (FC) is an essential extension of centralized clustering\ndesigned for the federated setting, wherein the challenge lies in constructing\na global similarity measure without the need to share private data.\nConventional approaches to FC typically adopt extensions of centralized\nmethods, like K-means and fuzzy c-means. However, these methods are susceptible\nto non-independent-and-identically-distributed (non-IID) data among clients,\nleading to suboptimal performance, particularly with high-dimensional data. In\nthis paper, we present a novel approach to address these limitations by\nproposing a Privacy-Preserving Federated Deep Clustering based on Generative\nAdversarial Networks (GANs). Each client trains a local generative adversarial\nnetwork (GAN) locally and uploads the synthetic data to the server. The server\napplies a deep clustering network on the synthetic data to establish $k$\ncluster centroids, which are then downloaded to the clients for cluster\nassignment. Theoretical analysis demonstrates that the GAN-generated samples,\nshared among clients, inherently uphold certain privacy guarantees,\nsafeguarding the confidentiality of individual data. 
Furthermore, extensive\nexperimental evaluations showcase the effectiveness and utility of our proposed\nmethod in achieving accurate and privacy-preserving federated clustering.\n","authors":["Jie Yan","Jing Liu","Ji Qi","Zhong-Yuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2211.16965v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14809v1","updated":"2023-10-23T11:16:32Z","published":"2023-10-23T11:16:32Z","title":"Learning spatio-temporal patterns with Neural Cellular Automata","summary":" Neural Cellular Automata (NCA) are a powerful combination of machine learning\nand mechanistic modelling. We train NCA to learn complex dynamics from time\nseries of images and PDE trajectories. Our method is designed to identify\nunderlying local rules that govern large scale dynamic emergent behaviours.\nPrevious work on NCA focuses on learning rules that give stationary emergent\nstructures. We extend NCA to capture both transient and stable structures\nwithin the same system, as well as learning rules that capture the dynamics of\nTuring pattern formation in nonlinear Partial Differential Equations (PDEs). We\ndemonstrate that NCA can generalise very well beyond their PDE training data,\nwe show how to constrain NCA to respect given symmetries, and we explore the\neffects of associated hyperparameters on model performance and stability. Being\nable to learn arbitrary dynamics gives NCA great potential as a data driven\nmodelling framework, especially for modelling biological pattern formation.\n","authors":["Alex D. Richardson","Tibor Antal","Richard A. Blythe","Linus J. 
Schumacher"],"pdf_url":"https://arxiv.org/pdf/2310.14809v1.pdf","comment":"For videos referenced in appendix, see:\n https://github.com/AlexDR1998/NCA/tree/main/Videos"},{"id":"http://arxiv.org/abs/2210.16524v2","updated":"2023-10-23T11:01:10Z","published":"2022-10-29T07:42:11Z","title":"Federated clustering with GAN-based data synthesis","summary":" Federated clustering (FC) is an extension of centralized clustering in\nfederated settings. The key here is how to construct a global similarity\nmeasure without sharing private data, since the local similarity may be\ninsufficient to group local data correctly and the similarity of samples across\nclients cannot be directly measured due to privacy constraints. Obviously, the\nmost straightforward way to analyze FC is to employ the methods extended from\ncentralized ones, such as K-means (KM) and fuzzy c-means (FCM). However, they\nare vulnerable to non independent-and-identically-distributed (non-IID) data\namong clients. To handle this, we propose a new federated clustering framework,\nnamed synthetic data aided federated clustering (SDA-FC). It trains generative\nadversarial network locally in each client and uploads the generated synthetic\ndata to the server, where KM or FCM is performed on the synthetic data. The\nsynthetic data can make the model immune to the non-IID problem and enable us\nto capture the global similarity characteristics more effectively without\nsharing private data. Comprehensive experiments reveals the advantages of\nSDA-FC, including superior performance in addressing the non-IID problem and\nthe device failures.\n","authors":["Jie Yan","Jing Liu","Ji Qi","Zhong-Yuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2210.16524v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14793v1","updated":"2023-10-23T10:53:25Z","published":"2023-10-23T10:53:25Z","title":"What do Deck Chairs and Sun Hats Have in Common? 
Uncovering Shared\n Properties in Large Concept Vocabularies","summary":" Concepts play a central role in many applications. This includes settings\nwhere concepts have to be modelled in the absence of sentence context. Previous\nwork has therefore focused on distilling decontextualised concept embeddings\nfrom language models. But concepts can be modelled from different perspectives,\nwhereas concept embeddings typically mostly capture taxonomic structure. To\naddress this issue, we propose a strategy for identifying what different\nconcepts, from a potentially large concept vocabulary, have in common with\nothers. We then represent concepts in terms of the properties they share with\nthe other concepts. To demonstrate the practical usefulness of this way of\nmodelling concepts, we consider the task of ultra-fine entity typing, which is\na challenging multi-label classification problem. We show that by augmenting\nthe label set with shared properties, we can improve the performance of the\nstate-of-the-art models for this task.\n","authors":["Amit Gajbhiye","Zied Bouraoui","Na Li","Usashi Chatterjee","Luis Espinosa Anke","Steven Schockaert"],"pdf_url":"https://arxiv.org/pdf/2310.14793v1.pdf","comment":"Accepted for EMNLP 2023"},{"id":"http://arxiv.org/abs/2302.02601v4","updated":"2023-10-23T10:51:24Z","published":"2023-02-06T07:45:57Z","title":"Learning Representations of Bi-level Knowledge Graphs for Reasoning\n beyond Link Prediction","summary":" Knowledge graphs represent known facts using triplets. While existing\nknowledge graph embedding methods only consider the connections between\nentities, we propose considering the relationships between triplets. For\nexample, let us consider two triplets $T_1$ and $T_2$ where $T_1$ is\n(Academy_Awards, Nominates, Avatar) and $T_2$ is (Avatar, Wins,\nAcademy_Awards). Given these two base-level triplets, we see that $T_1$ is a\nprerequisite for $T_2$. 
In this paper, we define a higher-level triplet to\nrepresent a relationship between triplets, e.g., $\\langle T_1$,\nPrerequisiteFor, $T_2\\rangle$ where PrerequisiteFor is a higher-level relation.\nWe define a bi-level knowledge graph that consists of the base-level and the\nhigher-level triplets. We also propose a data augmentation strategy based on\nthe random walks on the bi-level knowledge graph to augment plausible triplets.\nOur model called BiVE learns embeddings by taking into account the structures\nof the base-level and the higher-level triplets, with additional consideration\nof the augmented triplets. We propose two new tasks: triplet prediction and\nconditional link prediction. Given a triplet $T_1$ and a higher-level relation,\nthe triplet prediction predicts a triplet that is likely to be connected to\n$T_1$ by the higher-level relation, e.g., $\\langle T_1$, PrerequisiteFor,\n?$\\rangle$. The conditional link prediction predicts a missing entity in a\ntriplet conditioned on another triplet, e.g., $\\langle T_1$, PrerequisiteFor,\n(Avatar, Wins, ?)$\\rangle$. Experimental results show that BiVE significantly\noutperforms all other methods in the two new tasks and the typical base-level\nlink prediction in real-world bi-level knowledge graphs.\n","authors":["Chanyoung Chung","Joyce Jiyoung Whang"],"pdf_url":"https://arxiv.org/pdf/2302.02601v4.pdf","comment":"14 pages, 3 figures, 15 tables. 37th AAAI Conference on Artificial\n Intelligence (AAAI 2023)"},{"id":"http://arxiv.org/abs/2306.03543v2","updated":"2023-10-23T10:37:55Z","published":"2023-06-06T09:44:56Z","title":"How to Select Which Active Learning Strategy is Best Suited for Your\n Specific Problem and Budget","summary":" In the domain of Active Learning (AL), a learner actively selects which\nunlabeled examples to seek labels from an oracle, while operating within\npredefined budget constraints. 
Importantly, it has been recently shown that\ndistinct query strategies are better suited for different conditions and\nbudgetary constraints. In practice, the determination of the most appropriate\nAL strategy for a given situation remains an open problem. To tackle this\nchallenge, we propose a practical derivative-based method that dynamically\nidentifies the best strategy for a given budget. Intuitive motivation for our\napproach is provided by the theoretical analysis of a simplified scenario. We\nthen introduce a method to dynamically select an AL strategy, which takes into\naccount the unique characteristics of the problem and the available budget.\nEmpirical results showcase the effectiveness of our approach across diverse\nbudgets and computer vision tasks.\n","authors":["Guy Hacohen","Daphna Weinshall"],"pdf_url":"https://arxiv.org/pdf/2306.03543v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07037v3","updated":"2023-10-23T10:36:58Z","published":"2023-08-14T09:56:35Z","title":"Bayesian Flow Networks","summary":" This paper introduces Bayesian Flow Networks (BFNs), a new class of\ngenerative model in which the parameters of a set of independent distributions\nare modified with Bayesian inference in the light of noisy data samples, then\npassed as input to a neural network that outputs a second, interdependent\ndistribution. Starting from a simple prior and iteratively updating the two\ndistributions yields a generative procedure similar to the reverse process of\ndiffusion models; however it is conceptually simpler in that no forward process\nis required. Discrete and continuous-time loss functions are derived for\ncontinuous, discretised and discrete data, along with sample generation\nprocedures. Notably, the network inputs for discrete data lie on the\nprobability simplex, and are therefore natively differentiable, paving the way\nfor gradient-based sample guidance and few-step generation in discrete domains\nsuch as language modelling. 
The loss function directly optimises data\ncompression and places no restrictions on the network architecture. In our\nexperiments BFNs achieve competitive log-likelihoods for image modelling on\ndynamically binarized MNIST and CIFAR-10, and outperform all known discrete\ndiffusion models on the text8 character-level language modelling task.\n","authors":["Alex Graves","Rupesh Kumar Srivastava","Timothy Atkinson","Faustino Gomez"],"pdf_url":"https://arxiv.org/pdf/2308.07037v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14784v1","updated":"2023-10-23T10:36:52Z","published":"2023-10-23T10:36:52Z","title":"An Efficient Imbalance-Aware Federated Learning Approach for Wearable\n Healthcare with Autoregressive Ratio Observation","summary":" Widely available healthcare services are now getting popular because of\nadvancements in wearable sensing techniques and mobile edge computing. People's\nhealth information is collected by edge devices such as smartphones and\nwearable bands for further analysis on servers, then send back suggestions and\nalerts for abnormal conditions. The recent emergence of federated learning\nallows users to train private data on local devices while updating models\ncollaboratively. However, the heterogeneous distribution of the health\ncondition data may lead to significant risks to model performance due to class\nimbalance. Meanwhile, as FL training is powered by sharing gradients only with\nthe server, training data is almost inaccessible. The conventional solutions to\nclass imbalance do not work for federated learning. In this work, we propose a\nnew federated learning framework FedImT, dedicated to addressing the challenges\nof class imbalance in federated learning scenarios. 
FedImT contains an online\nscheme that can estimate the data composition during each round of aggregation,\nthen introduces a self-attenuating iterative equivalent to track variations of\nmultiple estimations and promptly tweak the balance of the loss computing for\nminority classes. Experiments demonstrate the effectiveness of FedImT in\nsolving the imbalance problem without extra energy consumption and avoiding\nprivacy risks.\n","authors":["Wenhao Yan","He Li","Kaoru Ota","Mianxiong Dong"],"pdf_url":"https://arxiv.org/pdf/2310.14784v1.pdf","comment":"submitted to IEEE OJCS in Oct. 2023 (under review)"},{"id":"http://arxiv.org/abs/2305.13971v3","updated":"2023-10-23T10:30:37Z","published":"2023-05-23T11:54:37Z","title":"Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning","summary":" Despite their impressive performance, large language models (LMs) still\nstruggle with reliably generating complex output structures when not finetuned\nto follow the required output format exactly. To address this issue,\ngrammar-constrained decoding (GCD) can be used to control the generation of\nLMs, guaranteeing that the output follows a given structure. Most existing GCD\nmethods are, however, limited to specific tasks, such as parsing or code\ngeneration. In this work, we demonstrate that formal grammars can describe the\noutput space for a much wider range of tasks and argue that GCD can serve as a\nunified framework for structured NLP tasks in general. For increased\nflexibility, we introduce input-dependent grammars, which allow the grammar to\ndepend on the input and thus enable the generation of different output\nstructures for different inputs. We then empirically demonstrate the power and\nflexibility of GCD-enhanced LMs on (1) information extraction, (2) entity\ndisambiguation, and (3) constituency parsing. Our results indicate that\ngrammar-constrained LMs substantially outperform unconstrained LMs or even beat\ntask-specific finetuned models. 
Grammar constraints thus hold great promise for\nharnessing off-the-shelf LMs for a wide range of structured NLP tasks,\nespecially where training data is scarce or finetuning is expensive. Code and\ndata: https://github.com/epfl-dlab/GCD.\n","authors":["Saibo Geng","Martin Josifosky","Maxime Peyrard","Robert West"],"pdf_url":"https://arxiv.org/pdf/2305.13971v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12045v4","updated":"2023-10-23T10:30:04Z","published":"2023-06-21T06:30:18Z","title":"Temporal Conditioning Spiking Latent Variable Models of the Neural\n Response to Natural Visual Scenes","summary":" Developing computational models of neural response is crucial for\nunderstanding sensory processing and neural computations. Current\nstate-of-the-art neural network methods use temporal filters to handle temporal\ndependencies, resulting in an unrealistic and inflexible processing paradigm.\nMeanwhile, these methods target trial-averaged firing rates and fail to capture\nimportant features in spike trains. This work presents the temporal\nconditioning spiking latent variable models (TeCoS-LVM) to simulate the neural\nresponse to natural visual stimuli. We use spiking neurons to produce spike\noutputs that directly match the recorded trains. This approach helps to avoid\nlosing information embedded in the original spike trains. We exclude the\ntemporal dimension from the model parameter space and introduce a temporal\nconditioning operation to allow the model to adaptively explore and exploit\ntemporal dependencies in stimuli sequences in a {\\it natural paradigm}. We show\nthat TeCoS-LVM models can produce more realistic spike activities and\naccurately fit spike statistics than powerful alternatives. Additionally,\nlearned TeCoS-LVM models can generalize well to longer time scales. Overall,\nwhile remaining computationally tractable, our model effectively captures key\nfeatures of neural coding systems. 
It thus provides a useful tool for building\naccurate predictive computational accounts for various sensory perception\ncircuits.\n","authors":["Gehua Ma","Runhao Jiang","Rui Yan","Huajin Tang"],"pdf_url":"https://arxiv.org/pdf/2306.12045v4.pdf","comment":"Accepted at NeurIPS 2023. 22 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.14777v1","updated":"2023-10-23T10:26:14Z","published":"2023-10-23T10:26:14Z","title":"Geographical Erasure in Language Generation","summary":" Large language models (LLMs) encode vast amounts of world knowledge. However,\nsince these models are trained on large swaths of internet data, they are at\nrisk of inordinately capturing information about dominant groups. This\nimbalance can propagate into generated language. In this work, we study and\noperationalise a form of geographical erasure, wherein language models\nunderpredict certain countries. We demonstrate consistent instances of erasure\nacross a range of LLMs. We discover that erasure strongly correlates with low\nfrequencies of country mentions in the training corpus. Lastly, we mitigate\nerasure by finetuning using a custom objective.\n","authors":["Pola Schwöbel","Jacek Golebiowski","Michele Donini","Cédric Archambeau","Danish Pruthi"],"pdf_url":"https://arxiv.org/pdf/2310.14777v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.12942v2","updated":"2023-10-23T10:24:26Z","published":"2023-10-19T17:39:47Z","title":"On the Representational Capacity of Recurrent Neural Language Models","summary":" This work investigates the computational expressivity of language models\n(LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992)\nfamously showed that RNNs with rational weights and hidden states and unbounded\ncomputation time are Turing complete. However, LMs define weightings over\nstrings in addition to just (unweighted) language membership and the analysis\nof the computational power of RNN LMs (RLMs) should reflect this. 
We extend the\nTuring completeness result to the probabilistic case, showing how a rationally\nweighted RLM with unbounded computation time can simulate any probabilistic\nTuring machine (PTM). Since, in practice, RLMs work in real-time, processing a\nsymbol at every time step, we treat the above result as an upper bound on the\nexpressivity of RLMs. We also provide a lower bound by showing that under the\nrestriction to real-time computation, such models can simulate deterministic\nreal-time rational PTMs.\n","authors":["Franz Nowak","Anej Svete","Li Du","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2310.12942v2.pdf","comment":"To be published at EMNLP 2023;"},{"id":"http://arxiv.org/abs/2310.14774v1","updated":"2023-10-23T10:19:09Z","published":"2023-10-23T10:19:09Z","title":"Principled Approaches for Learning to Defer with Multiple Experts","summary":" We present a study of surrogate losses and algorithms for the general problem\nof learning to defer with multiple experts. We first introduce a new family of\nsurrogate losses specifically tailored for the multiple-expert setting, where\nthe prediction and deferral functions are learned simultaneously. We then prove\nthat these surrogate losses benefit from strong $H$-consistency bounds. We\nillustrate the application of our analysis through several examples of\npractical surrogate losses, for which we give explicit guarantees. These loss\nfunctions readily lead to the design of new learning to defer algorithms based\non their minimization. 
While the main focus of this work is a theoretical\nanalysis, we also report the results of several experiments on SVHN and\nCIFAR-10 datasets.\n","authors":["Anqi Mao","Mehryar Mohri","Yutao Zhong"],"pdf_url":"https://arxiv.org/pdf/2310.14774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.01597v2","updated":"2023-10-23T10:18:09Z","published":"2022-12-29T10:46:40Z","title":"Demystify Problem-Dependent Power of Quantum Neural Networks on\n Multi-Class Classification","summary":" Quantum neural networks (QNNs) have become an important tool for\nunderstanding the physical world, but their advantages and limitations are not\nfully understood. Some QNNs with specific encoding methods can be efficiently\nsimulated by classical surrogates, while others with quantum memory may perform\nbetter than classical classifiers. Here we systematically investigate the\nproblem-dependent power of quantum neural classifiers (QCs) on multi-class\nclassification tasks. Through the analysis of expected risk, a measure that\nweighs the training loss and the generalization error of a classifier jointly,\nwe identify two key findings: first, the training loss dominates the power\nrather than the generalization ability; second, QCs undergo a U-shaped risk\ncurve, in contrast to the double-descent risk curve of deep neural classifiers.\nWe also reveal the intrinsic connection between optimal QCs and the Helstrom\nbound and the equiangular tight frame. Using these findings, we propose a\nmethod that uses loss dynamics to probe whether a QC may be more effective than\na classical classifier on a particular learning task. Numerical results\ndemonstrate the effectiveness of our approach to explain the superiority of QCs\nover multilayer Perceptron on parity datasets and their limitations over\nconvolutional neural networks on image datasets. 
Our work sheds light on the\nproblem-dependent power of QNNs and offers a practical tool for evaluating\ntheir potential merit.\n","authors":["Yuxuan Du","Yibo Yang","Dacheng Tao","Min-Hsiu Hsieh"],"pdf_url":"https://arxiv.org/pdf/2301.01597v2.pdf","comment":"Updated version. Published on PRL"},{"id":"http://arxiv.org/abs/2310.14772v1","updated":"2023-10-23T10:16:27Z","published":"2023-10-23T10:16:27Z","title":"Predictor-Rejector Multi-Class Abstention: Theoretical Analysis and\n Algorithms","summary":" We study the key framework of learning with abstention in the multi-class\nclassification setting. In this setting, the learner can choose to abstain from\nmaking a prediction with some pre-defined cost. We present a series of new\ntheoretical and algorithmic results for this learning problem in the\npredictor-rejector framework. We introduce several new families of surrogate\nlosses for which we prove strong non-asymptotic and hypothesis set-specific\nconsistency guarantees, thereby resolving positively two existing open\nquestions. These guarantees provide upper bounds on the estimation error of the\nabstention loss function in terms of that of the surrogate loss. We analyze\nboth a single-stage setting where the predictor and rejector are learned\nsimultaneously and a two-stage setting crucial in applications, where the\npredictor is learned in a first stage using a standard surrogate loss such as\ncross-entropy. These guarantees suggest new multi-class abstention algorithms\nbased on minimizing these surrogate losses. We also report the results of\nextensive experiments comparing these algorithms to the current\nstate-of-the-art algorithms on CIFAR-10, CIFAR-100 and SVHN datasets. 
Our\nresults demonstrate empirically the benefit of our new surrogate losses and\nshow the remarkable performance of our broadly applicable two-stage abstention\nalgorithm.\n","authors":["Anqi Mao","Mehryar Mohri","Yutao Zhong"],"pdf_url":"https://arxiv.org/pdf/2310.14772v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14770v1","updated":"2023-10-23T10:13:35Z","published":"2023-10-23T10:13:35Z","title":"Theoretically Grounded Loss Functions and Algorithms for Score-Based\n Multi-Class Abstention","summary":" Learning with abstention is a key scenario where the learner can abstain from\nmaking a prediction at some cost. In this paper, we analyze the score-based\nformulation of learning with abstention in the multi-class classification\nsetting. We introduce new families of surrogate losses for the abstention loss\nfunction, which include the state-of-the-art surrogate losses in the\nsingle-stage setting and a novel family of loss functions in the two-stage\nsetting. We prove strong non-asymptotic and hypothesis set-specific consistency\nguarantees for these surrogate losses, which upper-bound the estimation error\nof the abstention loss function in terms of the estimation error of the\nsurrogate loss. Our bounds can help compare different score-based surrogates\nand guide the design of novel abstention algorithms by minimizing the proposed\nsurrogate losses. We experimentally evaluate our new algorithms on CIFAR-10,\nCIFAR-100, and SVHN datasets and the practical significance of our new\nsurrogate losses and two-stage abstention algorithms. 
Our results also show\nthat the relative performance of the state-of-the-art score-based surrogate\nlosses can vary across datasets.\n","authors":["Anqi Mao","Mehryar Mohri","Yutao Zhong"],"pdf_url":"https://arxiv.org/pdf/2310.14770v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14768v1","updated":"2023-10-23T10:12:23Z","published":"2023-10-23T10:12:23Z","title":"Policy Gradient with Kernel Quadrature","summary":" Reward evaluation of episodes becomes a bottleneck in a broad range of\nreinforcement learning tasks. Our aim in this paper is to select a small but\nrepresentative subset of a large batch of episodes, only on which we actually\ncompute rewards for more efficient policy gradient iterations. We build a\nGaussian process modeling of discounted returns or rewards to derive a positive\ndefinite kernel on the space of episodes, run an \"episodic\" kernel quadrature\nmethod to compress the information of sample episodes, and pass the reduced\nepisodes to the policy network for gradient updates. We present the theoretical\nbackground of this procedure as well as its numerical illustrations in MuJoCo\nand causal discovery tasks.\n","authors":["Satoshi Hayakawa","Tetsuro Morimura"],"pdf_url":"https://arxiv.org/pdf/2310.14768v1.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2305.18396v2","updated":"2023-10-23T10:06:52Z","published":"2023-05-28T13:08:13Z","title":"LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly\n Transformers","summary":" The community explored to build private inference frameworks for\ntransformer-based large language models (LLMs) in a server-client setting,\nwhere the server holds the model parameters and the client inputs its private\ndata (or prompt) for inference. However, these frameworks impose significant\noverhead when the private inputs are forward propagated through the original\nLLMs. 
In this paper, we show that substituting the computation- and\ncommunication-heavy operators in the transformer architecture with\nprivacy-computing friendly approximations can greatly reduce the private\ninference costs while incurring very minor impact on model performance.\nCompared to state-of-the-art Iron (NeurIPS 2022), our privacy-computing\nfriendly model inference pipeline achieves a $5\\times$ acceleration in\ncomputation and an 80% reduction in communication overhead, while retaining\nnearly identical accuracy.\n","authors":["Xuanqi Liu","Zhuotao Liu"],"pdf_url":"https://arxiv.org/pdf/2305.18396v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14764v1","updated":"2023-10-23T10:02:23Z","published":"2023-10-23T10:02:23Z","title":"Improved K-mer Based Prediction of Protein-Protein Interactions With\n Chaos Game Representation, Deep Learning and Reduced Representation Bias","summary":" Protein-protein interactions drive many biological processes, including the\ndetection of phytopathogens by plants' R-Proteins and cell surface receptors.\nMany machine learning studies have attempted to predict protein-protein\ninteractions but performance is highly dependent on training data; models have\nbeen shown to accurately predict interactions when the proteins involved are\nincluded in the training data, but achieve consistently poorer results when\napplied to previously unseen proteins. In addition, models that are trained\nusing proteins that take part in multiple interactions can suffer from\nrepresentation bias, where predictions are driven not by learned biological\nfeatures but by learning of the structure of the interaction dataset.\n We present a method for extracting unique pairs from an interaction dataset,\ngenerating non-redundant paired data for unbiased machine learning. 
After\napplying the method to datasets containing _Arabidopsis thaliana_ and pathogen\neffector interactions, we developed a convolutional neural network model capable\nof learning and predicting interactions from Chaos Game Representations of\nproteins' coding genes.\n","authors":["Ruth Veevers","Dan MacLean"],"pdf_url":"https://arxiv.org/pdf/2310.14764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.12062v5","updated":"2023-10-23T10:02:22Z","published":"2022-07-25T11:07:29Z","title":"Adaptive Asynchronous Control Using Meta-learned Neural Ordinary\n Differential Equations","summary":" Model-based Reinforcement Learning and Control have demonstrated great\npotential in various sequential decision making problem domains, including in\nrobotics settings. However, real-world robotics systems often present\nchallenges that limit the applicability of those methods. In particular, we\nnote two problems that jointly happen in many industrial systems: 1)\nIrregular/asynchronous observations and actions and 2) Dramatic changes in\nenvironment dynamics from an episode to another (e.g. varying payload inertial\nproperties). We propose a general framework that overcomes those difficulties\nby meta-learning adaptive dynamics models for continuous-time prediction and\ncontrol. The proposed approach is task-agnostic and can be adapted to new tasks\nin a straight-forward manner. We present evaluations in two different robot\nsimulations and on a real industrial robot.\n","authors":["Achkan Salehi","Steffen Rühl","Stephane Doncieux"],"pdf_url":"https://arxiv.org/pdf/2207.12062v5.pdf","comment":"16 double column pages, 14 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.14763v1","updated":"2023-10-23T10:01:50Z","published":"2023-10-23T10:01:50Z","title":"Externally Valid Policy Evaluation Combining Trial and Observational\n Data","summary":" Randomized trials are widely considered as the gold standard for evaluating\nthe effects of decision policies.
Trial data is, however, drawn from a\npopulation which may differ from the intended target population and this raises\na problem of external validity (aka. generalizability). In this paper we seek\nto use trial data to draw valid inferences about the outcome of a policy on the\ntarget population. Additional covariate data from the target population is used\nto model the sampling of individuals in the trial study. We develop a method\nthat yields certifiably valid trial-based policy evaluations under any\nspecified range of model miscalibrations. The method is nonparametric and the\nvalidity is assured even with finite samples. The certified policy evaluations\nare illustrated using both simulated and real data.\n","authors":["Sofia Ek","Dave Zachariah"],"pdf_url":"https://arxiv.org/pdf/2310.14763v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.05845v6","updated":"2023-10-23T09:52:59Z","published":"2022-10-12T00:35:45Z","title":"Contrastive Retrospection: honing in on critical steps for rapid\n learning and generalization in RL","summary":" In real life, success is often contingent upon multiple critical steps that\nare distant in time from each other and from the final reward. These critical\nsteps are challenging to identify with traditional reinforcement learning (RL)\nmethods that rely on the Bellman equation for credit assignment. Here, we\npresent a new RL algorithm that uses offline contrastive learning to hone in on\nthese critical steps. This algorithm, which we call Contrastive Retrospection\n(ConSpec), can be added to any existing RL algorithm. ConSpec learns a set of\nprototypes for the critical steps in a task by a novel contrastive loss and\ndelivers an intrinsic reward when the current state matches one of the\nprototypes. 
The prototypes in ConSpec provide two key benefits for credit\nassignment: (i) They enable rapid identification of all the critical steps.\n(ii) They do so in a readily interpretable manner, enabling out-of-distribution\ngeneralization when sensory features are altered. Distinct from other\ncontemporary RL approaches to credit assignment, ConSpec takes advantage of the\nfact that it is easier to retrospectively identify the small set of steps that\nsuccess is contingent upon (and ignoring other states) than it is to\nprospectively predict reward at every taken step. ConSpec greatly improves\nlearning in a diverse set of RL tasks.\n","authors":["Chen Sun","Wannan Yang","Thomas Jiralerspong","Dane Malenfant","Benjamin Alsbury-Nealy","Yoshua Bengio","Blake Richards"],"pdf_url":"https://arxiv.org/pdf/2210.05845v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13479v2","updated":"2023-10-23T09:42:29Z","published":"2023-10-20T13:20:17Z","title":"Segment, Select, Correct: A Framework for Weakly-Supervised Referring\n Segmentation","summary":" Referring Image Segmentation (RIS) - the problem of identifying objects in\nimages through natural language sentences - is a challenging task currently\nmostly solved through supervised learning. However, while collecting referred\nannotation masks is a time-consuming process, the few existing\nweakly-supervised and zero-shot approaches fall significantly short in\nperformance compared to fully-supervised learning ones. To bridge the\nperformance gap without mask annotations, we propose a novel weakly-supervised\nframework that tackles RIS by decomposing it into three steps: obtaining\ninstance masks for the object mentioned in the referencing instruction\n(segment), using zero-shot learning to select a potentially correct mask for\nthe given instruction (select), and bootstrapping a model which allows for\nfixing the mistakes of zero-shot selection (correct). 
In our experiments, using\nonly the first two steps (zero-shot segment and select) outperforms other\nzero-shot baselines by as much as 19%, while our full method improves upon this\nmuch stronger baseline and sets the new state-of-the-art for weakly-supervised\nRIS, reducing the gap between the weakly-supervised and fully-supervised\nmethods in some cases from around 33% to as little as 14%. Code is available at\nhttps://github.com/fgirbal/segment-select-correct.\n","authors":["Francisco Eiras","Kemal Oksuz","Adel Bibi","Philip H. S. Torr","Puneet K. Dokania"],"pdf_url":"https://arxiv.org/pdf/2310.13479v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14753v1","updated":"2023-10-23T09:40:30Z","published":"2023-10-23T09:40:30Z","title":"Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules","summary":" Masked graph modeling excels in the self-supervised representation learning\nof molecular graphs. Scrutinizing previous studies, we can reveal a common\nscheme consisting of three key components: (1) graph tokenizer, which breaks a\nmolecular graph into smaller fragments (i.e., subgraphs) and converts them into\ntokens; (2) graph masking, which corrupts the graph with masks; (3) graph\nautoencoder, which first applies an encoder on the masked graph to generate the\nrepresentations, and then employs a decoder on the representations to recover\nthe tokens of the original graph. However, the previous MGM studies focus\nextensively on graph masking and encoder, while there is limited understanding\nof tokenizer and decoder. To bridge the gap, we first summarize popular\nmolecule tokenizers at the granularity of node, edge, motif, and Graph Neural\nNetworks (GNNs), and then examine their roles as the MGM's reconstruction\ntargets. Further, we explore the potential of adopting an expressive decoder in\nMGM. 
Our results show that a subgraph-level tokenizer and a sufficiently\nexpressive decoder with remask decoding have a large impact on the encoder's\nrepresentation learning. Finally, we propose a novel MGM method SimSGT,\nfeaturing a Simple GNN-based Tokenizer (SGT) and an effective decoding\nstrategy. We empirically validate that our method outperforms the existing\nmolecule self-supervised learning methods. Our codes and checkpoints are\navailable at https://github.com/syr-cn/SimSGT.\n","authors":["Zhiyuan Liu","Yaorui Shi","An Zhang","Enzhi Zhang","Kenji Kawaguchi","Xiang Wang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2310.14753v1.pdf","comment":"NeurIPS 2023. 10 pages"},{"id":"http://arxiv.org/abs/2310.14751v1","updated":"2023-10-23T09:36:13Z","published":"2023-10-23T09:36:13Z","title":"Efficient and Interpretable Bandit Algorithms","summary":" Motivated by the importance of explainability in modern machine learning, we\ndesign bandit algorithms that are \\emph{efficient} and \\emph{interpretable}. A\nbandit algorithm is interpretable if it explores with the objective of reducing\nuncertainty in the unknown model parameter. To quantify the interpretability,\nwe introduce a novel metric of \\textit{uncertainty loss}, which compares the\nrate of the uncertainty reduction to the theoretical optimum. We propose CODE,\na bandit algorithm based on a \\textbf{C}onstrained \\textbf{O}ptimal\n\\textbf{DE}sign, that is interpretable and maximally reduces the uncertainty.\nThe key idea in \\code is to explore among all plausible actions, determined by\na statistical constraint, to achieve interpretability. We implement CODE\nefficiently in both multi-armed and linear bandits and derive near-optimal\nregret bounds by leveraging the optimality criteria of the approximate optimal\ndesign. CODE can be also viewed as removing phases in conventional phased\nelimination, which makes it more practical and general. 
We demonstrate the\nadvantage of \\code by numerical experiments on both synthetic and real-world\nproblems. CODE outperforms other state-of-the-art interpretable designs while\nmatching the performance of popular but uninterpretable designs, such as upper\nconfidence bound algorithms.\n","authors":["Subhojyoti Mukherjee","Ruihao Zhu","Branislav Kveton"],"pdf_url":"https://arxiv.org/pdf/2310.14751v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04033v2","updated":"2023-10-23T09:32:08Z","published":"2023-07-08T18:58:08Z","title":"Learning Variational Neighbor Labels for Test-Time Domain Generalization","summary":" This paper strives for domain generalization, where models are trained\nexclusively on source domains before being deployed at unseen target domains.\nWe follow the strict separation of source training and target testing but\nexploit the value of the unlabeled target data itself during inference. We make\nthree contributions. First, we propose probabilistic pseudo-labeling of target\nsamples to generalize the source-trained model to the target domain at test\ntime. We formulate the generalization at test time as a variational inference\nproblem by modeling pseudo labels as distributions to consider the uncertainty\nduring generalization and alleviate the misleading signal of inaccurate pseudo\nlabels. Second, we learn variational neighbor labels that incorporate the\ninformation of neighboring target samples to generate more robust pseudo\nlabels. Third, to learn the ability to incorporate more representative target\ninformation and generate more precise and robust variational neighbor labels,\nwe introduce a meta-generalization stage during training to simulate the\ngeneralization procedure. Experiments on six widely-used datasets demonstrate\nthe benefits, abilities, and effectiveness of our proposal.\n","authors":["Sameer Ambekar","Zehao Xiao","Jiayi Shen","Xiantong Zhen","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2307.04033v2.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2303.17925v2","updated":"2023-10-23T09:27:40Z","published":"2023-03-31T09:48:16Z","title":"Beyond Multilayer Perceptrons: Investigating Complex Topologies in\n Neural Networks","summary":" In this study, we explore the impact of network topology on the approximation\ncapabilities of artificial neural networks (ANNs), with a particular focus on\ncomplex topologies. We propose a novel methodology for constructing complex\nANNs based on various topologies, including Barab\\'asi-Albert,\nErd\\H{o}s-R\\'enyi, Watts-Strogatz, and multilayer perceptrons (MLPs). The\nconstructed networks are evaluated on synthetic datasets generated from\nmanifold learning generators, with varying levels of task difficulty and noise,\nand on real-world datasets from the UCI suite. Our findings reveal that complex\ntopologies lead to superior performance in high-difficulty regimes compared to\ntraditional MLPs. This performance advantage is attributed to the ability of\ncomplex networks to exploit the compositionality of the underlying target\nfunction. However, this benefit comes at the cost of increased forward-pass\ncomputation time and reduced robustness to graph damage. Additionally, we\ninvestigate the relationship between various topological attributes and model\nperformance. Our analysis shows that no single attribute can account for the\nobserved performance differences, suggesting that the influence of network\ntopology on approximation capabilities may be more intricate than a simple\ncorrelation with individual topological attributes. 
Our study sheds light on\nthe potential of complex topologies for enhancing the performance of ANNs and\nprovides a foundation for future research exploring the interplay between\nmultiple topological attributes and their impact on model performance.\n","authors":["Tommaso Boccato","Matteo Ferrante","Andrea Duggento","Nicola Toschi"],"pdf_url":"https://arxiv.org/pdf/2303.17925v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.15845v2","updated":"2023-10-23T09:26:28Z","published":"2023-03-28T09:36:14Z","title":"Conditional Generative Models are Provably Robust: Pointwise Guarantees\n for Bayesian Inverse Problems","summary":" Conditional generative models became a very powerful tool to sample from\nBayesian inverse problem posteriors. It is well-known in classical Bayesian\nliterature that posterior measures are quite robust with respect to\nperturbations of both the prior measure and the negative log-likelihood, which\nincludes perturbations of the observations. However, to the best of our\nknowledge, the robustness of conditional generative models with respect to\nperturbations of the observations has not been investigated yet. In this paper,\nwe prove for the first time that appropriately learned conditional generative\nmodels provide robust results for single observations.\n","authors":["Fabian Altekrüger","Paul Hagemann","Gabriele Steidl"],"pdf_url":"https://arxiv.org/pdf/2303.15845v2.pdf","comment":"Accepted and published in Transactions on Machine Learning Research\n (07/2023)"},{"id":"http://arxiv.org/abs/2310.14743v1","updated":"2023-10-23T09:25:50Z","published":"2023-10-23T09:25:50Z","title":"The Safety Challenges of Deep Learning in Real-World Type 1 Diabetes\n Management","summary":" Blood glucose simulation allows the effectiveness of type 1 diabetes (T1D)\nmanagement strategies to be evaluated without patient harm. 
Deep learning\nalgorithms provide a promising avenue for extending simulator capabilities;\nhowever, these algorithms are limited in that they do not necessarily learn\nphysiologically correct glucose dynamics and can learn incorrect and\npotentially dangerous relationships from confounders in training data. This is\nlikely to be more important in real-world scenarios, as data is not collected\nunder strict research protocol. This work explores the implications of using\ndeep learning algorithms trained on real-world data to model glucose dynamics.\nFree-living data was processed from the OpenAPS Data Commons and supplemented\nwith patient-reported tags of challenging diabetes events, constituting one of\nthe most detailed real-world T1D datasets. This dataset was used to train and\nevaluate state-of-the-art glucose simulators, comparing their prediction error\nacross safety critical scenarios and assessing the physiological\nappropriateness of the learned dynamics using Shapley Additive Explanations\n(SHAP). While deep learning prediction accuracy surpassed the widely-used\nmathematical simulator approach, the model deteriorated in safety critical\nscenarios and struggled to leverage self-reported meal and exercise\ninformation. SHAP value analysis also indicated the model had fundamentally\nconfused the roles of insulin and carbohydrates, which is one of the most basic\nT1D management principles. 
This work highlights the importance of considering\nphysiological appropriateness when using deep learning to model real-world\nsystems in T1D and healthcare more broadly, and provides recommendations for\nbuilding models that are robust to real-world data constraints.\n","authors":["Harry Emerson","Ryan McConville","Matthew Guy"],"pdf_url":"https://arxiv.org/pdf/2310.14743v1.pdf","comment":"15 pages, 3 figures"},{"id":"http://arxiv.org/abs/2202.05069v3","updated":"2023-10-23T09:05:24Z","published":"2022-02-10T14:57:15Z","title":"Transfer-Learning Across Datasets with Different Input Dimensions: An\n Algorithm and Analysis for the Linear Regression Case","summary":" With the development of new sensors and monitoring devices, more sources of\ndata become available to be used as inputs for machine learning models. These\ncan on the one hand help to improve the accuracy of a model. On the other hand,\ncombining these new inputs with historical data remains a challenge that has\nnot yet been studied in enough detail. In this work, we propose a transfer\nlearning algorithm that combines new and historical data with different input\ndimensions. This approach is easy to implement, efficient, with computational\ncomplexity equivalent to the ordinary least-squares method, and requires no\nhyperparameter tuning, making it straightforward to apply when the new data is\nlimited. Different from other approaches, we provide a rigorous theoretical\nstudy of its robustness, showing that it cannot be outperformed by a baseline\nthat utilizes only the new data. 
Our approach achieves state-of-the-art\nperformance on 9 real-life datasets, outperforming the linear DSFT, another\nlinear transfer learning algorithm, and performing comparably to non-linear\nDSFT.\n","authors":["Luis Pedro Silvestrin","Harry van Zanten","Mark Hoogendoorn","Ger Koole"],"pdf_url":"https://arxiv.org/pdf/2202.05069v3.pdf","comment":"Manuscript accepted for publication at the Journal of Computational\n Mathematics and Data Science. Code available at\n https://github.com/lpsilvestrin/incremental_input_tl"},{"id":"http://arxiv.org/abs/2310.14720v1","updated":"2023-10-23T08:56:01Z","published":"2023-10-23T08:56:01Z","title":"Extended Deep Adaptive Input Normalization for Preprocessing Time Series\n Data for Neural Networks","summary":" Data preprocessing is a crucial part of any machine learning pipeline, and it\ncan have a significant impact on both performance and training efficiency. This\nis especially evident when using deep neural networks for time series\nprediction and classification: real-world time series data often exhibit\nirregularities such as multi-modality, skewness and outliers, and the model\nperformance can degrade rapidly if these characteristics are not adequately\naddressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input\nNormalization) layer, a novel adaptive neural layer that learns how to\nappropriately normalize irregular time series data for a given task in an\nend-to-end fashion, instead of using a fixed normalization scheme. This is\nachieved by optimizing its unknown parameters simultaneously with the deep\nneural network using back-propagation. Our experiments, conducted using\nsynthetic data, a credit default prediction dataset, and a large-scale limit\norder book benchmark dataset, demonstrate the superior performance of the EDAIN\nlayer when compared to conventional normalization methods and existing adaptive\ntime series preprocessing layers.\n","authors":["Marcus A. K. 
September","Francesco Sanna Passino","Leonie Goldmann","Anton Hinel"],"pdf_url":"https://arxiv.org/pdf/2310.14720v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01210v2","updated":"2023-10-23T08:52:36Z","published":"2023-10-02T13:55:06Z","title":"Towards Robust Cardiac Segmentation using Graph Convolutional Networks","summary":" Fully automatic cardiac segmentation can be a fast and reproducible method to\nextract clinical measurements from an echocardiography examination. The U-Net\narchitecture is the current state-of-the-art deep learning architecture for\nmedical segmentation and can segment cardiac structures in real-time with\naverage errors comparable to inter-observer variability. However, this\narchitecture still generates large outliers that are often anatomically\nincorrect. This work uses the concept of graph convolutional neural networks\nthat predict the contour points of the structures of interest instead of\nlabeling each pixel. We propose a graph architecture that uses two\nconvolutional rings based on cardiac anatomy and show that this eliminates\nanatomical incorrect multi-structure segmentations on the publicly available\nCAMUS dataset. Additionally, this work contributes with an ablation study on\nthe graph convolutional architecture and an evaluation of clinical measurements\non the clinical HUNT4 dataset. Finally, we propose to use the inter-model\nagreement of the U-Net and the graph network as a predictor of both the input\nand segmentation quality. We show this predictor can detect out-of-distribution\nand unsuitable input images in real-time. 
Source code is available online:\nhttps://github.com/gillesvntnu/GCN_multistructure\n","authors":["Gilles Van De Vyver","Sarina Thomas","Guy Ben-Yosef","Sindre Hellum Olaisen","Håvard Dalen","Lasse Løvstakken","Erik Smistad"],"pdf_url":"https://arxiv.org/pdf/2310.01210v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14714v1","updated":"2023-10-23T08:51:05Z","published":"2023-10-23T08:51:05Z","title":"BatteryML:An Open-source platform for Machine Learning on Battery\n Degradation","summary":" Battery degradation remains a pivotal concern in the energy storage domain,\nwith machine learning emerging as a potent tool to drive forward insights and\nsolutions. However, this intersection of electrochemical science and machine\nlearning poses complex challenges. Machine learning experts often grapple with\nthe intricacies of battery science, while battery researchers face hurdles in\nadapting intricate models tailored to specific datasets. Beyond this, a\ncohesive standard for battery degradation modeling, inclusive of data formats\nand evaluative benchmarks, is conspicuously absent. Recognizing these\nimpediments, we present BatteryML - a one-step, all-encompass, and open-source\nplatform designed to unify data preprocessing, feature extraction, and the\nimplementation of both traditional and state-of-the-art models. This\nstreamlined approach promises to enhance the practicality and efficiency of\nresearch applications. 
BatteryML seeks to fill this void, fostering an\nenvironment where experts from diverse specializations can collaboratively\ncontribute, thus elevating the collective understanding and advancement of\nbattery research. The code for our project is publicly available on GitHub at\nhttps://github.com/microsoft/BatteryML.\n","authors":["Han Zhang","Xiaofan Gui","Shun Zheng","Ziheng Lu","Yuqi Li","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2310.14714v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.10256v2","updated":"2023-10-23T08:50:44Z","published":"2023-03-17T21:42:58Z","title":"PINNSim: A Simulator for Power System Dynamics based on Physics-Informed\n Neural Networks","summary":" The dynamic behaviour of a power system can be described by a system of\ndifferential-algebraic equations. Time-domain simulations are used to simulate\nthe evolution of these dynamics. They often require the use of small time step\nsizes and therefore become computationally expensive. To accelerate these\nsimulations, we propose a simulator -- PINNSim -- that allows taking\nsignificantly larger time steps. It is based on Physics-Informed Neural\nNetworks (PINNs) for the solution of the dynamics of single components in the\npower system. To resolve their interaction we employ a scalable root-finding\nalgorithm. We demonstrate PINNSim on a 9-bus system and show the increased time\nstep size compared to a trapezoidal integration rule. We discuss key\ncharacteristics of PINNSim and important steps for developing PINNSim into a\nfully fledged simulator.
As such, it could offer the opportunity for\nsignificantly increasing time step sizes and thereby accelerating time-domain\nsimulations.\n","authors":["Jochen Stiasny","Baosen Zhang","Spyros Chatzivasileiadis"],"pdf_url":"https://arxiv.org/pdf/2303.10256v2.pdf","comment":"submitted to the 23rd Power Systems Computation Conference (PSCC\n 2024)"},{"id":"http://arxiv.org/abs/2310.14710v1","updated":"2023-10-23T08:49:39Z","published":"2023-10-23T08:49:39Z","title":"Random Forest Dissimilarity for High-Dimension Low Sample Size\n Classification","summary":" High dimension, low sample size (HDLSS) problems are numerous among\nreal-world applications of machine learning. From medical images to text\nprocessing, traditional machine learning algorithms are usually unsuccessful in\nlearning the best possible concept from such data. In a previous work, we\nproposed a dissimilarity-based approach for multi-view classification, the\nRandom Forest Dissimilarity (RFD), that achieves state-of-the-art results for\nsuch problems. In this work, we transpose the core principle of this approach\nto solving HDLSS classification problems, by using the RF similarity measure as\na learned precomputed SVM kernel (RFSVM). We show that such a learned\nsimilarity measure is particularly suited and accurate for this classification\ncontext. Experiments conducted on 40 public HDLSS classification datasets,\nsupported by rigorous statistical analyses, show that the RFSVM method\noutperforms existing methods for the majority of HDLSS problems and remains at\nthe same time very competitive for low or non-HDLSS problems.\n","authors":["Lucca Portes Cavalheiro","Simon Bernard","Jean Paul Barddal","Laurent Heutte"],"pdf_url":"https://arxiv.org/pdf/2310.14710v1.pdf","comment":"23 pages. 
To be published in statistics and computing (accepted\n September 26, 2023)"},{"id":"http://arxiv.org/abs/2310.14707v1","updated":"2023-10-23T08:47:27Z","published":"2023-10-23T08:47:27Z","title":"A Hybrid GNN approach for predicting node data for 3D meshes","summary":" Metal forging is used to manufacture dies. We require the best set of input\nparameters for the process to be efficient. Currently, we predict the best\nparameters using the finite element method by generating simulations for the\ndifferent initial conditions, which is a time-consuming process. In this paper,\nwe introduce a hybrid approach that helps in processing and generating new data\nsimulations using a surrogate graph neural network model based on graph\nconvolutions, having a cheaper time cost. We also introduce a hybrid approach\nthat helps in processing and generating new data simulations using the model.\nGiven a dataset representing meshes, our focus is on the conversion of the\navailable information into a graph or point cloud structure. This new\nrepresentation enables deep learning. The predicted result is similar, with a\nlow error when compared to that produced using the finite element method. The\nnew models have outperformed existing PointNet and simple graph neural network\nmodels when applied to produce the simulations.\n","authors":["Shwetha Salimath","Francesca Bugiotti","Frederic Magoules"],"pdf_url":"https://arxiv.org/pdf/2310.14707v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14693v1","updated":"2023-10-23T08:36:21Z","published":"2023-10-23T08:36:21Z","title":"Federated learning compression designed for lightweight communications","summary":" Federated Learning (FL) is a promising distributed method for edge-level\nmachine learning, particularly for privacy-sensitive applications such as those\nin military and medical domains, where client data cannot be shared or\ntransferred to a cloud computing server. 
In many use-cases, communication cost\nis a major challenge in FL due to its natural intensive network usage. Client\ndevices, such as smartphones or Internet of Things (IoT) nodes, have limited\nresources in terms of energy, computation, and memory. To address these\nhardware constraints, lightweight models and compression techniques such as\npruning and quantization are commonly adopted in centralised paradigms. In this\npaper, we investigate the impact of compression techniques on FL for a typical\nimage classification task. Going further, we demonstrate that a straightforward\nmethod can compress messages up to 50% while having less than 1% of accuracy\nloss, competing with state-of-the-art techniques.\n","authors":["Lucas Grativol Ribeiro","Mathieu Leonardon","Guillaume Muller","Virginie Fresse","Matthieu Arzel"],"pdf_url":"https://arxiv.org/pdf/2310.14693v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.13288v2","updated":"2023-10-23T08:12:24Z","published":"2022-08-28T21:04:42Z","title":"Learning Informative Health Indicators Through Unsupervised Contrastive\n Learning","summary":" Condition monitoring is essential to operate industrial assets safely and\nefficiently. To achieve this goal, the development of robust health indicators\nhas recently attracted significant attention. These indicators, which provide\nquantitative real-time insights into the health status of industrial assets\nover time, serve as valuable tools for fault detection and prognostics. In this\nstudy, we propose a novel and universal approach to learn health indicators\nbased on unsupervised contrastive learning. Operational time acts as a proxy\nfor the asset's degradation state, enabling the learning of a contrastive\nfeature space that facilitates the construction of a health indicator by\nmeasuring the distance to the healthy condition. 
To highlight the universality\nof the proposed approach, we assess the proposed contrastive learning framework\nin two distinct tasks - wear assessment and fault detection - across two\ndifferent case studies: a milling machines case study and a real condition\nmonitoring case study of railway wheels from operating trains. First, we\nevaluate if the health indicator is able to learn the real health condition on\na milling machine case study where the ground truth wear condition is\ncontinuously measured. Second, we apply the proposed method on a real case\nstudy of railway wheels where the ground truth health condition is not known.\nHere, we evaluate the suitability of the learned health indicator for fault\ndetection of railway wheel defects. Our results demonstrate that the proposed\napproach is able to learn the ground truth health evolution of milling machines\nand the learned health indicator is suited for fault detection of railway\nwheels operated under various operating conditions by outperforming\nstate-of-the-art methods. Further, we demonstrate that our proposed approach is\nuniversally applicable to different systems and different health conditions.\n","authors":["Katharina Rombach","Gabriel Michau","Wilfried Bürzle","Stefan Koller","Olga Fink"],"pdf_url":"https://arxiv.org/pdf/2208.13288v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14671v1","updated":"2023-10-23T08:11:17Z","published":"2023-10-23T08:11:17Z","title":"Population Descent: A Natural-Selection Based Hyper-Parameter Tuning\n Framework","summary":" First-order gradient descent has been the base of the most successful\noptimization algorithms ever implemented. On supervised learning problems with\nvery high dimensionality, such as neural network optimization, it is almost\nalways the algorithm of choice, mainly due to its memory and computational\nefficiency. However, it is a classical result in optimization that gradient\ndescent converges to local minima on non-convex functions. 
Even more\nimportantly, in certain high-dimensional cases, escaping the plateaus of large\nsaddle points becomes intractable. On the other hand, black-box optimization\nmethods are not sensitive to the local structure of a loss function's landscape\nbut suffer the curse of dimensionality. Instead, memetic algorithms aim to\ncombine the benefits of both. Inspired by this, we present Population Descent,\na memetic algorithm focused on hyperparameter optimization. We show that an\nadaptive m-elitist selection approach combined with a normalized-fitness-based\nrandomization scheme outperforms more complex state-of-the-art algorithms by up\nto 13% on common benchmark tasks.\n","authors":["Abhinav Pomalapally","Bassel El Mabsout","Renato Mansuco"],"pdf_url":"https://arxiv.org/pdf/2310.14671v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14670v1","updated":"2023-10-23T08:09:42Z","published":"2023-10-23T08:09:42Z","title":"Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and\n Beyond","summary":" Vision-language (VL) understanding tasks evaluate models' comprehension of\ncomplex visual scenes through multiple-choice questions. However, we have\nidentified two dataset biases that models can exploit as shortcuts to resolve\nvarious VL tasks correctly without proper understanding. The first type of\ndataset bias is \\emph{Unbalanced Matching} bias, where the correct answer\noverlaps the question and image more than the incorrect answers. The second\ntype of dataset bias is \\emph{Distractor Similarity} bias, where incorrect\nanswers are overly dissimilar to the correct answer but significantly similar\nto other incorrect answers within the same sample. To address these dataset\nbiases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic\ntraining and debiased evaluation data. 
We then introduce Intra-sample\nCounterfactual Training (ICT) to assist models in utilizing the synthesized\ntraining data, particularly the counterfactual data, via focusing on\nintra-sample differentiation. Extensive experiments demonstrate the\neffectiveness of ADS and ICT in consistently improving model performance across\ndifferent benchmarks, even in domain-shifted scenarios.\n","authors":["Zhecan Wang","Long Chen","Haoxuan You","Keyang Xu","Yicheng He","Wenhao Li","Noal Codella","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2310.14670v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14664v1","updated":"2023-10-23T08:00:03Z","published":"2023-10-23T08:00:03Z","title":"Data Pruning via Moving-one-Sample-out","summary":" In this paper, we propose a novel data-pruning approach called\nmoving-one-sample-out (MoSo), which aims to identify and remove the least\ninformative samples from the training set. The core insight behind MoSo is to\ndetermine the importance of each sample by assessing its impact on the optimal\nempirical risk. This is achieved by measuring the extent to which the empirical\nrisk changes when a particular sample is excluded from the training set.\nInstead of using the computationally expensive leaving-one-out-retraining\nprocedure, we propose an efficient first-order approximator that only requires\ngradient information from different training stages. The key idea behind our\napproximation is that samples with gradients that are consistently aligned with\nthe average gradient of the training set are more informative and should\nreceive higher scores, which could be intuitively understood as follows: if the\ngradient from a specific sample is consistent with the average gradient vector,\nit implies that optimizing the network using the sample will yield a similar\neffect on all remaining samples. 
Experimental results demonstrate that MoSo\neffectively mitigates severe performance degradation at high pruning ratios and\nachieves satisfactory performance across various settings.\n","authors":["Haoru Tan","Sitong Wu","Fei Du","Yukang Chen","Zhibin Wang","Fan Wang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2310.14664v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14932v2","updated":"2023-10-23T07:57:02Z","published":"2023-06-26T09:42:59Z","title":"GloptiNets: Scalable Non-Convex Optimization with Certificates","summary":" We present a novel approach to non-convex optimization with certificates,\nwhich handles smooth functions on the hypercube or on the torus. Unlike\ntraditional methods that rely on algebraic properties, our algorithm exploits\nthe regularity of the target function intrinsic in the decay of its Fourier\nspectrum. By defining a tractable family of models, we allow at the same time\nto obtain precise certificates and to leverage the advanced and powerful\ncomputational techniques developed to optimize neural networks. In this way the\nscalability of our approach is naturally enhanced by parallel computing with\nGPUs. 
Our approach, when applied to the case of polynomials of moderate\ndimensions but with thousands of coefficients, outperforms the state-of-the-art\noptimization methods with certificates, as the ones based on Lasserre's\nhierarchy, addressing problems intractable for the competitors.\n","authors":["Gaspard Beugnot","Julien Mairal","Alessandro Rudi"],"pdf_url":"https://arxiv.org/pdf/2306.14932v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14661v1","updated":"2023-10-23T07:54:39Z","published":"2023-10-23T07:54:39Z","title":"Tractable MCMC for Private Learning with Pure and Gaussian Differential\n Privacy","summary":" Posterior sampling, i.e., exponential mechanism to sample from the posterior\ndistribution, provides $\\varepsilon$-pure differential privacy (DP) guarantees\nand does not suffer from potentially unbounded privacy breach introduced by\n$(\\varepsilon,\\delta)$-approximate DP. In practice, however, one needs to apply\napproximate sampling methods such as Markov chain Monte Carlo (MCMC), thus\nre-introducing the unappealing $\\delta$-approximation error into the privacy\nguarantees. To bridge this gap, we propose the Approximate SAample Perturbation\n(abbr. ASAP) algorithm which perturbs an MCMC sample with noise proportional to\nits Wasserstein-infinity ($W_\\infty$) distance from a reference distribution\nthat satisfies pure DP or pure Gaussian DP (i.e., $\\delta=0$). We then leverage\na Metropolis-Hastings algorithm to generate the sample and prove that the\nalgorithm converges in W$_\\infty$ distance. 
We show that by combining our new\ntechniques with a careful localization step, we obtain the first nearly\nlinear-time algorithm that achieves the optimal rates in the DP-ERM problem\nwith strongly convex and smooth losses.\n","authors":["Yingyu Lin","Yian Ma","Yu-Xiang Wang","Rachel Redberg"],"pdf_url":"https://arxiv.org/pdf/2310.14661v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14659v1","updated":"2023-10-23T07:53:47Z","published":"2023-10-23T07:53:47Z","title":"Predicting Accurate Lagrangian Multipliers for Mixed Integer Linear\n Programs","summary":" Lagrangian relaxation stands among the most efficient approaches for solving\na Mixed Integer Linear Programs (MILP) with difficult constraints. Given any\nduals for these constraints, called Lagrangian Multipliers (LMs), it returns a\nbound on the optimal value of the MILP, and Lagrangian methods seek the LMs\ngiving the best such bound. But these methods generally rely on iterative\nalgorithms resembling gradient descent to maximize the concave piecewise linear\ndual function: the computational burden grows quickly with the number of\nrelaxed constraints. We introduce a deep learning approach that bypasses the\ndescent, effectively amortizing the local, per instance, optimization. A\nprobabilistic encoder based on a graph convolutional network computes\nhigh-dimensional representations of relaxed constraints in MILP instances. A\ndecoder then turns these representations into LMs. We train the encoder and\ndecoder jointly by directly optimizing the bound obtained from the predicted\nmultipliers. 
Numerical experiments show that our approach closes up to 85~\\% of\nthe gap between the continuous relaxation and the best Lagrangian bound, and\nprovides a high quality warm-start for descent based Lagrangian methods.\n","authors":["Francesco Demelas","Joseph Le Roux","Mathieu Lacroix","Axel Parmentier"],"pdf_url":"https://arxiv.org/pdf/2310.14659v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.06360v5","updated":"2023-10-23T07:47:27Z","published":"2023-05-10T12:02:18Z","title":"Exploring the Landscape of Machine Unlearning: A Comprehensive Survey\n and Taxonomy","summary":" Machine unlearning (MU) is gaining increasing attention due to the need to\nremove or modify predictions made by machine learning (ML) models. While\ntraining models have become more efficient and accurate, the importance of\nunlearning previously learned information has become increasingly significant\nin fields such as privacy, security, and fairness. This paper presents a\ncomprehensive survey of MU, covering current state-of-the-art techniques and\napproaches, including data deletion, perturbation, and model updates. In\naddition, commonly used metrics and datasets are also presented. The paper also\nhighlights the challenges that need to be addressed, including attack\nsophistication, standardization, transferability, interpretability, training\ndata, and resource constraints. The contributions of this paper include\ndiscussions about the potential benefits of MU and its future directions.\nAdditionally, the paper emphasizes the need for researchers and practitioners\nto continue exploring and refining unlearning techniques to ensure that ML\nmodels can adapt to changing circumstances while maintaining user trust. 
The\nimportance of unlearning is further highlighted in making Artificial\nIntelligence (AI) more trustworthy and transparent, especially with the\nincreasing importance of AI in various domains that involve large amounts of\npersonal user data.\n","authors":["Thanveer Shaik","Xiaohui Tao","Haoran Xie","Lin Li","Xiaofeng Zhu","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2305.06360v5.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2310.14651v1","updated":"2023-10-23T07:44:04Z","published":"2023-10-23T07:44:04Z","title":"$Λ$-Split: A Privacy-Preserving Split Computing Framework for\n Cloud-Powered Generative AI","summary":" In the wake of the burgeoning expansion of generative artificial intelligence\n(AI) services, the computational demands inherent to these technologies\nfrequently necessitate cloud-powered computational offloading, particularly for\nresource-constrained mobile devices. These services commonly employ prompts to\nsteer the generative process, and both the prompts and the resultant content,\nsuch as text and images, may harbor privacy-sensitive or confidential\ninformation, thereby elevating security and privacy risks. To mitigate these\nconcerns, we introduce $\\Lambda$-Split, a split computing framework to\nfacilitate computational offloading while simultaneously fortifying data\nprivacy against risks such as eavesdropping and unauthorized access. In\n$\\Lambda$-Split, a generative model, usually a deep neural network (DNN), is\npartitioned into three sub-models and distributed across the user's local\ndevice and a cloud server: the input-side and output-side sub-models are\nallocated to the local, while the intermediate, computationally-intensive\nsub-model resides on the cloud server. 
This architecture ensures that only the\nhidden layer outputs are transmitted, thereby preventing the external\ntransmission of privacy-sensitive raw input and output data. Given the\nblack-box nature of DNNs, estimating the original input or output from\nintercepted hidden layer outputs poses a significant challenge for malicious\neavesdroppers. Moreover, $\\Lambda$-Split is orthogonal to traditional\nencryption-based security mechanisms, offering enhanced security when deployed\nin conjunction. We empirically validate the efficacy of the $\\Lambda$-Split\nframework using Llama 2 and Stable Diffusion XL, representative large language\nand diffusion models developed by Meta and Stability AI, respectively. Our\n$\\Lambda$-Split implementation is publicly accessible at\nhttps://github.com/nishio-laboratory/lambda_split.\n","authors":["Shoki Ohta","Takayuki Nishio"],"pdf_url":"https://arxiv.org/pdf/2310.14651v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2310.12487v2","updated":"2023-10-23T07:41:22Z","published":"2023-10-19T05:47:28Z","title":"Improved Operator Learning by Orthogonal Attention","summary":" Neural operators, as an efficient surrogate model for learning the solutions\nof PDEs, have received extensive attention in the field of scientific machine\nlearning. Among them, attention-based neural operators have become one of the\nmainstreams in related research. However, existing approaches overfit the\nlimited training data due to the considerable number of parameters in the\nattention mechanism. To address this, we develop an orthogonal attention based\non the eigendecomposition of the kernel integral operator and the neural\napproximation of eigenfunctions. 
The orthogonalization naturally poses a proper\nregularization effect on the resulting neural operator, which aids in resisting\noverfitting and boosting generalization. Experiments on six standard neural\noperator benchmark datasets comprising both regular and irregular geometries\nshow that our method can outperform competing baselines with decent margins.\n","authors":["Zipeng Xiao","Zhongkai Hao","Bokai Lin","Zhijie Deng","Hang Su"],"pdf_url":"https://arxiv.org/pdf/2310.12487v2.pdf","comment":"14 pages, 5 figures"},{"id":"http://arxiv.org/abs/2007.05014v3","updated":"2023-10-23T07:33:55Z","published":"2020-07-09T18:15:01Z","title":"Fast Adaptive Non-Monotone Submodular Maximization Subject to a Knapsack\n Constraint","summary":" Constrained submodular maximization problems encompass a wide variety of\napplications, including personalized recommendation, team formation, and\nrevenue maximization via viral marketing. The massive instances occurring in\nmodern day applications can render existing algorithms prohibitively slow,\nwhile frequently, those instances are also inherently stochastic. Focusing on\nthese challenges, we revisit the classic problem of maximizing a (possibly\nnon-monotone) submodular function subject to a knapsack constraint. We present\na simple randomized greedy algorithm that achieves a $5.83$ approximation and\nruns in $O(n \\log n)$ time, i.e., at least a factor $n$ faster than other\nstate-of-the-art algorithms. The robustness of our approach allows us to\nfurther transfer it to a stochastic version of the problem. There, we obtain a\n9-approximation to the best adaptive policy, which is the first constant\napproximation for non-monotone objectives. Experimental evaluation of our\nalgorithms showcases their improved performance on real and synthetic data.\n","authors":["Georgios Amanatidis","Federico Fusco","Philip Lazos","Stefano Leonardi","Rebecca Reiffenhäuser"],"pdf_url":"https://arxiv.org/pdf/2007.05014v3.pdf","comment":"Same as v1. 
Version 2 was a replacement intended for arXiv:2102.08327\n and erroneously updated here"}],"Multimedia":[{"id":"http://arxiv.org/abs/2304.03652v3","updated":"2023-10-23T17:42:19Z","published":"2023-04-07T13:59:57Z","title":"An Accessible Toolkit for 360 VR Studies","summary":" Virtual reality is expected to play a significant role in the transformation\nof education and psychological studies. The possibilities for its application\nas a visual research method can be enhanced as established frameworks and\ntoolkits are made more available to users, not just developers, advocates, and\ntechnical academics, enhancing its controlled study impact. With an accessible\nfirst design approach, we can overcome accessibility constraints and tap into\nnew research potential. The open-sourced toolkit demonstrates how game engine\ntechnologies can be utilized to immerse participants in a 360-video environment\nwith curated text displayed at pre-set intervals. Allowing for researchers to\nguide participants through virtual experiences intuitively through a desktop\napplication while the study unfolds in the user's VR headset.\n","authors":["Corrie Green","Chloë Farr","Yang Jiang"],"pdf_url":"https://arxiv.org/pdf/2304.03652v3.pdf","comment":"for associated github repo,\n https://github.com/corriedotdev/vr-360-player"},{"id":"http://arxiv.org/abs/2310.14946v1","updated":"2023-10-23T13:45:21Z","published":"2023-10-23T13:45:21Z","title":"Intuitive Multilingual Audio-Visual Speech Recognition with a\n Single-Trained Model","summary":" We present a novel approach to multilingual audio-visual speech recognition\ntasks by introducing a single model on a multilingual dataset. Motivated by a\nhuman cognitive system where humans can intuitively distinguish different\nlanguages without any conscious effort or guidance, we propose a model that can\ncapture which language is given as an input speech by distinguishing the\ninherent similarities and differences between languages. 
To do so, we design a\nprompt fine-tuning technique into the largely pre-trained audio-visual\nrepresentation model so that the network can recognize the language class as\nwell as the speech with the corresponding language. Our work contributes to\ndeveloping robust and efficient multilingual audio-visual speech recognition\nsystems, reducing the need for language-specific models.\n","authors":["Joanna Hong","Se Jin Park","Yong Man Ro"],"pdf_url":"https://arxiv.org/pdf/2310.14946v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2308.11175v2","updated":"2023-10-23T10:46:07Z","published":"2023-08-22T04:06:56Z","title":"MISSRec: Pre-training and Transferring Multi-modal Interest-aware\n Sequence Representation for Recommendation","summary":" The goal of sequential recommendation (SR) is to predict a user's potential\ninterested items based on her/his historical interaction sequences. Most\nexisting sequential recommenders are developed based on ID features, which,\ndespite their widespread use, often underperform with sparse IDs and struggle\nwith the cold-start problem. Besides, inconsistent ID mappings hinder the\nmodel's transferability, isolating similar recommendation domains that could\nhave been co-optimized. This paper aims to address these issues by exploring\nthe potential of multi-modal information in learning robust and generalizable\nsequence representations. We propose MISSRec, a multi-modal pre-training and\ntransfer learning framework for SR. On the user side, we design a\nTransformer-based encoder-decoder model, where the contextual encoder learns to\ncapture the sequence-level multi-modal user interests while a novel\ninterest-aware decoder is developed to grasp item-modality-interest relations\nfor better sequence representation. On the candidate item side, we adopt a\ndynamic fusion module to produce user-adaptive item representation, providing\nmore precise matching between users and items. 
We pre-train the model with\ncontrastive learning objectives and fine-tune it in an efficient manner.\nExtensive experiments demonstrate the effectiveness and flexibility of MISSRec,\npromising a practical solution for real-world recommendation scenarios. Data\nand code are available on \\url{https://github.com/gimpong/MM23-MISSRec}.\n","authors":["Jinpeng Wang","Ziyun Zeng","Yunxiao Wang","Yuting Wang","Xingyu Lu","Tianxiang Li","Jun Yuan","Rui Zhang","Hai-Tao Zheng","Shu-Tao Xia"],"pdf_url":"https://arxiv.org/pdf/2308.11175v2.pdf","comment":"Accepted to ACM MM 2023. Data and code are available"},{"id":"http://arxiv.org/abs/2310.14778v1","updated":"2023-10-23T10:29:33Z","published":"2023-10-23T10:29:33Z","title":"Audio-Visual Speaker Tracking: Progress, Challenges, and Future\n Directions","summary":" Audio-visual speaker tracking has drawn increasing attention over the past\nfew years due to its academic values and wide application. Audio and visual\nmodalities can provide complementary information for localization and tracking.\nWith audio and visual information, the Bayesian-based filter can solve the\nproblem of data association, audio-visual fusion and track management. In this\npaper, we conduct a comprehensive overview of audio-visual speaker tracking. To\nour knowledge, this is the first extensive survey over the past five years. We\nintroduce the family of Bayesian filters and summarize the methods for\nobtaining audio-visual measurements. In addition, the existing trackers and\ntheir performance on AV16.3 dataset are summarized. In the past few years, deep\nlearning techniques have thrived, which also boosts the development of audio\nvisual speaker tracking. The influence of deep learning techniques in terms of\nmeasurement extraction and state estimation is also discussed. 
At last, we\ndiscuss the connections between audio-visual speaker tracking and other areas\nsuch as speech separation and distributed speaker tracking.\n","authors":["Jinzheng Zhao","Yong Xu","Xinyuan Qian","Davide Berghi","Peipei Wu","Meng Cui","Jianyuan Sun","Philip J. B. Jackson","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14778v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.05895v2","updated":"2023-10-23T08:19:43Z","published":"2022-11-10T21:44:33Z","title":"Understanding ME? Multimodal Evaluation for Fine-grained Visual\n Commonsense","summary":" Visual commonsense understanding requires Vision Language (VL) models to not\nonly understand image and text but also cross-reference in-between to fully\nintegrate and achieve comprehension of the visual scene described. Recently,\nvarious approaches have been developed and have achieved high performance on\nvisual commonsense benchmarks. However, it is unclear whether the models really\nunderstand the visual scene and underlying commonsense knowledge due to limited\nevaluation data resources. To provide an in-depth analysis, we present a\nMultimodal Evaluation (ME) pipeline to automatically generate question-answer\npairs to test models' understanding of the visual scene, text, and related\nknowledge. We then take a step further to show that training with the ME data\nboosts the model's performance in standard VCR evaluation. 
Lastly, our in-depth\nanalysis and comparison reveal interesting findings: (1) semantically low-level\ninformation can assist the learning of high-level information but not the\nopposite; (2) visual information is generally underutilized compared with\ntext.\n","authors":["Zhecan Wang","Haoxuan You","Yicheng He","Wenhao Li","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2211.05895v2.pdf","comment":"Accepted to EMNLP 2022 Long Paper"},{"id":"http://arxiv.org/abs/2310.14670v1","updated":"2023-10-23T08:09:42Z","published":"2023-10-23T08:09:42Z","title":"Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and\n Beyond","summary":" Vision-language (VL) understanding tasks evaluate models' comprehension of\ncomplex visual scenes through multiple-choice questions. However, we have\nidentified two dataset biases that models can exploit as shortcuts to resolve\nvarious VL tasks correctly without proper understanding. The first type of\ndataset bias is \\emph{Unbalanced Matching} bias, where the correct answer\noverlaps the question and image more than the incorrect answers. The second\ntype of dataset bias is \\emph{Distractor Similarity} bias, where incorrect\nanswers are overly dissimilar to the correct answer but significantly similar\nto other incorrect answers within the same sample. To address these dataset\nbiases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic\ntraining and debiased evaluation data. We then introduce Intra-sample\nCounterfactual Training (ICT) to assist models in utilizing the synthesized\ntraining data, particularly the counterfactual data, via focusing on\nintra-sample differentiation. 
Extensive experiments demonstrate the\neffectiveness of ADS and ICT in consistently improving model performance across\ndifferent benchmarks, even in domain-shifted scenarios.\n","authors":["Zhecan Wang","Long Chen","Haoxuan You","Keyang Xu","Yicheng He","Wenhao Li","Noal Codella","Kai-Wei Chang","Shih-Fu Chang"],"pdf_url":"https://arxiv.org/pdf/2310.14670v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14637v1","updated":"2023-10-23T07:21:40Z","published":"2023-10-23T07:21:40Z","title":"Semantic-Aware Adversarial Training for Reliable Deep Hashing Retrieval","summary":" Deep hashing has been intensively studied and successfully applied in\nlarge-scale image retrieval systems due to its efficiency and effectiveness.\nRecent studies have recognized that the existence of adversarial examples poses\na security threat to deep hashing models, that is, adversarial vulnerability.\nNotably, it is challenging to efficiently distill reliable semantic\nrepresentatives for deep hashing to guide adversarial learning, and thereby it\nhinders the enhancement of adversarial robustness of deep hashing-based\nretrieval models. Moreover, current researches on adversarial training for deep\nhashing are hard to be formalized into a unified minimax structure. In this\npaper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the\nadversarial robustness of deep hashing models. Specifically, we conceive a\ndiscriminative mainstay features learning (DMFL) scheme to construct semantic\nrepresentatives for guiding adversarial learning in deep hashing. Particularly,\nour DMFL with the strict theoretical guarantee is adaptively optimized in a\ndiscriminative learning manner, where both discriminative and semantic\nproperties are jointly considered. 
Moreover, adversarial examples are\nfabricated by maximizing the Hamming distance between the hash codes of\nadversarial samples and mainstay features, the efficacy of which is validated\nin the adversarial attack trials. Further, we, for the first time, formulate\nthe formalized adversarial training of deep hashing into a unified minimax\noptimization under the guidance of the generated mainstay codes. Extensive\nexperiments on benchmark datasets show superb attack performance against the\nstate-of-the-art algorithms, meanwhile, the proposed adversarial training can\neffectively eliminate adversarial perturbations for trustworthy deep\nhashing-based retrieval. Our code is available at\nhttps://github.com/xandery-geek/SAAT.\n","authors":["Xu Yuan","Zheng Zhang","Xunguang Wang","Lin Wu"],"pdf_url":"https://arxiv.org/pdf/2310.14637v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14605v1","updated":"2023-10-23T06:22:39Z","published":"2023-10-23T06:22:39Z","title":"M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal\n Aspect-based Sentiment Analysis","summary":" Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained\nSentiment Analysis task, which has attracted growing research interests\nrecently. Existing work mainly utilizes image information to improve the\nperformance of MABSA task. However, most of the studies overestimate the\nimportance of images since there are many noise images unrelated to the text in\nthe dataset, which will have a negative impact on model learning. Although some\nwork attempts to filter low-quality noise images by setting thresholds, relying\non thresholds will inevitably filter out a lot of useful image information.\nTherefore, in this work, we focus on whether the negative impact of noisy\nimages can be reduced without modifying the data. 
To achieve this goal, we\nborrow the idea of Curriculum Learning and propose a Multi-grained\nMulti-curriculum Denoising Framework (M2DF), which can achieve denoising by\nadjusting the order of training data. Extensive experimental results show that\nour framework consistently outperforms state-of-the-art work on three sub-tasks\nof MABSA.\n","authors":["Fei Zhao","Chunhui Li","Zhen Wu","Yawen Ouyang","Jianbing Zhang","Xinyu Dai"],"pdf_url":"https://arxiv.org/pdf/2310.14605v1.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14496v1","updated":"2023-10-23T02:05:35Z","published":"2023-10-23T02:05:35Z","title":"Redundancy-Adaptive Multimodal Learning for Imperfect Data","summary":" Multimodal models trained on complete modality data often exhibit a\nsubstantial decrease in performance when faced with imperfect data containing\ncorruptions or missing modalities. To address this robustness challenge, prior\nmethods have explored various approaches from aspects of augmentation,\nconsistency or uncertainty, but these approaches come with associated drawbacks\nrelated to data complexity, representation, and learning, potentially\ndiminishing their overall effectiveness. In response to these challenges, this\nstudy introduces a novel approach known as the Redundancy-Adaptive Multimodal\nLearning (RAML). RAML efficiently harnesses information redundancy across\nmultiple modalities to combat the issues posed by imperfect data while\nremaining compatible with the complete modality. Specifically, RAML achieves\nredundancy-lossless information extraction through separate unimodal\ndiscriminative tasks and enforces a proper norm constraint on each unimodal\nfeature representation. Furthermore, RAML explicitly enhances multimodal fusion\nby leveraging fine-grained redundancy among unimodal features to learn\ncorrespondences between corrupted and untainted information. 
Extensive\nexperiments on various benchmark datasets under diverse conditions have\nconsistently demonstrated that RAML outperforms state-of-the-art methods by a\nsignificant margin.\n","authors":["Mengxi Chen","Jiangchao Yao","Linyu Xing","Yu Wang","Ya Zhang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14496v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07847v4","updated":"2023-10-23T00:51:19Z","published":"2023-07-15T16:45:01Z","title":"Enabling Real-time Neural Recovery for Cloud Gaming on Mobile Devices","summary":" Cloud gaming is a multi-billion dollar industry. A client in cloud gaming\nsends its movement to the game server on the Internet, which renders and\ntransmits the resulting video back. In order to provide a good gaming\nexperience, a latency below 80 ms is required. This means that video rendering,\nencoding, transmission, decoding, and display have to finish within that time\nframe, which is especially challenging to achieve due to server overload,\nnetwork congestion, and losses. In this paper, we propose a new method for\nrecovering lost or corrupted video frames in cloud gaming. Unlike traditional\nvideo frame recovery, our approach uses game states to significantly enhance\nrecovery accuracy and utilizes partially decoded frames to recover lost\nportions. We develop a holistic system that consists of (i) efficiently\nextracting game states, (ii) modifying H.264 video decoder to generate a mask\nto indicate which portions of video frames need recovery, and (iii) designing a\nnovel neural network to recover either complete or partial video frames. 
Our\napproach is extensively evaluated using iPhone 12 and laptop implementations,\nand we demonstrate the utility of game states in the game video recovery and\nthe effectiveness of our overall design.\n","authors":["Zhaoyuan He","Yifan Yang","Shuozhe Li","Diyuan Dai","Lili Qiu","Yuqing Yang"],"pdf_url":"https://arxiv.org/pdf/2307.07847v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15406v1","updated":"2023-10-23T23:39:35Z","published":"2023-10-23T23:39:35Z","title":"Visual Elements and Cognitive Biases Influence Interpretations of Trends\n in Scatter Plots","summary":" Visualizations are common methods to convey information but also increasingly\nused to spread misinformation. It is therefore important to understand the\nfactors people use to interpret visualizations. In this paper, we focus on\nfactors that influence interpretations of scatter plots, investigating the\nextent to which common visual aspects of scatter plots (outliers and trend\nlines) and cognitive biases (people's beliefs) influence perception of\ncorrelation trends. We highlight three main findings: outliers skew trend\nperception but exert less influence than other points; trend lines make trends\nseem stronger but also mitigate the influence of some outliers; and people's\nbeliefs have a small influence on perceptions of weak, but not strong\ncorrelations. From these results we derive guidelines for adjusting visual\nelements to mitigate the influence of factors that distort interpretations of\nscatter plots. 
We explore how these guidelines may generalize to other\nvisualization types and make recommendations for future studies.\n","authors":["Alexandre Filipowicz","Scott Carter","Nayeli Bravo","Rumen Iliev","Shabnam Hakimi","David Ayman Shamma","Kent Lyons","Candice Hogan","Charlene Wu"],"pdf_url":"https://arxiv.org/pdf/2310.15406v1.pdf","comment":"18 pages, 6 figures, 2 tables"},{"id":"http://arxiv.org/abs/2310.15247v1","updated":"2023-10-23T18:01:36Z","published":"2023-10-23T18:01:36Z","title":"SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis","summary":" Sound design involves creatively selecting, recording, and editing sound\neffects for various media like cinema, video games, and virtual/augmented\nreality. One of the most time-consuming steps when designing sound is\nsynchronizing audio with video. In some cases, environmental recordings from\nvideo shoots are available, which can aid in the process. However, in video\ngames and animations, no reference audio exists, requiring manual annotation of\nevent timings from the video. We propose a system to extract repetitive actions\nonsets from a video, which are then used - in conjunction with audio or textual\nembeddings - to condition a diffusion model trained to generate a new\nsynchronized sound effects audio track. In this way, we leave complete creative\ncontrol to the sound designer while removing the burden of synchronization with\nvideo. Furthermore, editing the onset track or changing the conditioning\nembedding requires much less effort than editing the audio track itself,\nsimplifying the sonification process. We provide sound examples, source code,\nand pretrained models to facilitate reproducibility\n","authors":["Marco Comunità","Riccardo F. Gramaccioni","Emilian Postolache","Emanuele Rodolà","Danilo Comminiello","Joshua D. 
Reiss"],"pdf_url":"https://arxiv.org/pdf/2310.15247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12795v2","updated":"2023-10-23T12:21:05Z","published":"2023-06-22T10:53:10Z","title":"Learning Unseen Modality Interaction","summary":" Multimodal learning assumes all modality combinations of interest are\navailable during training to learn cross-modal correspondences. In this paper,\nwe challenge this modality-complete assumption for multimodal learning and\ninstead strive for generalization to unseen modality combinations during\ninference. We pose the problem of unseen modality interaction and introduce a\nfirst solution. It exploits a module that projects the multidimensional\nfeatures of different modalities into a common space with rich information\npreserved. This allows the information to be accumulated with a simple\nsummation operation across available modalities. To reduce overfitting to less\ndiscriminative modality combinations during training, we further improve the\nmodel learning with pseudo-supervision indicating the reliability of a\nmodality's prediction. We demonstrate that our approach is effective for\ndiverse tasks and modalities by evaluating it for multimodal video\nclassification, robot state regression, and multimedia retrieval. Project\nwebsite: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.\n","authors":["Yunhua Zhang","Hazel Doughty","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2306.12795v2.pdf","comment":"Published at NeurIPS 2023"}]},"2023-10-24T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.16049v1","updated":"2023-10-24T17:59:20Z","published":"2023-10-24T17:59:20Z","title":"MuSR: Testing the Limits of Chain-of-thought with Multistep Soft\n Reasoning","summary":" While large language models (LLMs) equipped with techniques like\nchain-of-thought prompting have demonstrated impressive capabilities, they\nstill fall short in their ability to reason robustly in complex settings.\nHowever, evaluating LLM reasoning is challenging because system capabilities\ncontinue to grow while benchmark datasets for tasks like logical deduction have\nremained static. We introduce MuSR, a dataset for evaluating language models on\nmultistep soft reasoning tasks specified in a natural language narrative. This\ndataset has two crucial features. First, it is created through a novel\nneurosymbolic synthetic-to-natural generation algorithm, enabling the\nconstruction of complex reasoning instances that challenge GPT-4 (e.g., murder\nmysteries roughly 1000 words in length) and which can be scaled further as more\ncapable LLMs are released. Second, our dataset instances are free text\nnarratives corresponding to real-world domains of reasoning; this makes it\nsimultaneously much more challenging than other synthetically-crafted\nbenchmarks while remaining realistic and tractable for human annotators to\nsolve with high accuracy. 
We evaluate a range of LLMs and prompting techniques\non this dataset and characterize the gaps that remain for techniques like\nchain-of-thought to perform robust reasoning.\n","authors":["Zayne Sprague","Xi Ye","Kaj Bostrom","Swarat Chaudhuri","Greg Durrett"],"pdf_url":"https://arxiv.org/pdf/2310.16049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16048v1","updated":"2023-10-24T17:59:04Z","published":"2023-10-24T17:59:04Z","title":"AI Alignment and Social Choice: Fundamental Limitations and Policy\n Implications","summary":" Aligning AI agents to human intentions and values is a key bottleneck in\nbuilding safe and deployable AI applications. But whose values should AI agents\nbe aligned with? Reinforcement learning with human feedback (RLHF) has emerged\nas the key framework for AI alignment. RLHF uses feedback from human\nreinforcers to fine-tune outputs; all widely deployed large language models\n(LLMs) use RLHF to align their outputs to human values. It is critical to\nunderstand the limitations of RLHF and consider policy challenges arising from\nthese limitations. In this paper, we investigate a specific challenge in\nbuilding RLHF systems that respect democratic norms. Building on impossibility\nresults in social choice theory, we show that, under fairly broad assumptions,\nthere is no unique voting protocol to universally align AI systems using RLHF\nthrough democratic processes. Further, we show that aligning AI agents with the\nvalues of all individuals will always violate certain private ethical\npreferences of an individual user i.e., universal AI alignment using RLHF is\nimpossible. We discuss policy implications for the governance of AI systems\nbuilt using RLHF: first, the need for mandating transparent voting rules to\nhold model builders accountable. 
Second, the need for model builders to focus\non developing AI agents that are narrowly aligned to specific user groups.\n","authors":["Abhilash Mishra"],"pdf_url":"https://arxiv.org/pdf/2310.16048v1.pdf","comment":"10 pages, no figures"},{"id":"http://arxiv.org/abs/2308.02019v2","updated":"2023-10-24T17:58:42Z","published":"2023-08-03T20:20:01Z","title":"Baby Llama: knowledge distillation from an ensemble of teachers trained\n on a small dataset with no performance penalty","summary":" We present our submission to the BabyLM challenge, whose goal was to improve\nthe sample efficiency of language models. We trained an ensemble consisting of\na GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word\nBabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model,\nwhich exceeds in performance both of its teachers as well as a similar model\ntrained without distillation. This suggests that distillation can not only\nretain the full performance of the teacher model when the latter is trained on\na sufficiently small dataset; it can exceed it, and lead to significantly\nbetter performance than direct training.\n","authors":["Inar Timiryasov","Jean-Loup Tastet"],"pdf_url":"https://arxiv.org/pdf/2308.02019v2.pdf","comment":"11 pages, 4 figures, 4 tables, submitted to the BabyLM Challenge and\n accepted as archival full paper (CoNLL--CMCL 2023 Shared Task), checkpoint\n available at https://huggingface.co/timinar/baby-llama-58m, training code\n available at https://github.com/timinar/BabyLlama"},{"id":"http://arxiv.org/abs/2310.16045v1","updated":"2023-10-24T17:58:07Z","published":"2023-10-24T17:58:07Z","title":"Woodpecker: Hallucination Correction for Multimodal Large Language\n Models","summary":" Hallucination is a big shadow hanging over the rapidly evolving Multimodal\nLarge Language Models (MLLMs), referring to the phenomenon that the generated\ntext is inconsistent with the image content. 
In order to mitigate\nhallucinations, existing studies mainly resort to an instruction-tuning manner\nthat requires retraining the models with specific data. In this paper, we pave\na different way, introducing a training-free method named Woodpecker. Like a\nwoodpecker heals trees, it picks out and corrects hallucinations from the\ngenerated text. Concretely, Woodpecker consists of five stages: key concept\nextraction, question formulation, visual knowledge validation, visual claim\ngeneration, and hallucination correction. Implemented in a post-remedy manner,\nWoodpecker can easily serve different MLLMs, while being interpretable by\naccessing intermediate outputs of the five stages. We evaluate Woodpecker both\nquantitatively and qualitatively and show the huge potential of this new\nparadigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement\nin accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released\nat https://github.com/BradyFU/Woodpecker.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Tong Xu","Hao Wang","Dianbo Sui","Yunhang Shen","Ke Li","Xing Sun","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16045v1.pdf","comment":"16 pages, 7 figures. Code Website:\n https://github.com/BradyFU/Woodpecker"},{"id":"http://arxiv.org/abs/2310.16042v1","updated":"2023-10-24T17:57:03Z","published":"2023-10-24T17:57:03Z","title":"WebWISE: Web Interface Control and Sequential Exploration with Large\n Language Models","summary":" The paper investigates using a Large Language Model (LLM) to automatically\nperform web software tasks using click, scroll, and text input operations.\nPrevious approaches, such as reinforcement learning (RL) or imitation learning,\nare inefficient to train and task-specific. Our method uses filtered Document\nObject Model (DOM) elements as observations and performs tasks step-by-step,\nsequentially generating small programs based on the current observations. 
We\nuse in-context learning, either benefiting from a single manually provided\nexample, or an automatically generated example based on a successful zero-shot\ntrial. We evaluate the proposed method on the MiniWob++ benchmark. With only\none in-context example, our WebWISE method achieves similar or better\nperformance than other methods that require many demonstrations or trials.\n","authors":["Heyi Tao","Sethuraman T V","Michal Shlapentokh-Rothman","Derek Hoiem","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2310.16042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16040v1","updated":"2023-10-24T17:54:25Z","published":"2023-10-24T17:54:25Z","title":"Instruct and Extract: Instruction Tuning for On-Demand Information\n Extraction","summary":" Large language models with instruction-following capabilities open the door\nto a wider group of users. However, when it comes to information extraction - a\nclassic task in natural language processing - most task-specific systems cannot\nalign well with long-tail ad hoc extraction use cases for non-expert users. To\naddress this, we propose a novel paradigm, termed On-Demand Information\nExtraction, to fulfill the personalized demands of real-world users. Our task\naims to follow the instructions to extract the desired content from the\nassociated text and present it in a structured tabular format. The table\nheaders can either be user-specified or inferred contextually by the model. To\nfacilitate research in this emerging area, we present a benchmark named\nInstructIE, inclusive of both automatically generated training data, as well as\nthe human-annotated test set. Building on InstructIE, we further develop an\nOn-Demand Information Extractor, ODIE. Comprehensive evaluations on our\nbenchmark reveal that ODIE substantially outperforms the existing open-source\nmodels of similar size. 
Our code and dataset are released on\nhttps://github.com/yzjiao/On-Demand-IE.\n","authors":["Yizhu Jiao","Ming Zhong","Sha Li","Ruining Zhao","Siru Ouyang","Heng Ji","Jiawei Han"],"pdf_url":"https://arxiv.org/pdf/2310.16040v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16035v1","updated":"2023-10-24T17:50:20Z","published":"2023-10-24T17:50:20Z","title":"What's Left? Concept Grounding with Logic-Enhanced Foundation Models","summary":" Recent works such as VisProg and ViperGPT have smartly composed foundation\nmodels for visual reasoning-using large language models (LLMs) to produce\nprograms that can be executed by pre-trained vision-language models. However,\nthey operate in limited domains, such as 2D images, not fully exploiting the\ngeneralization of language: abstract concepts like \"left\" can also be grounded\nin 3D, temporal, and action data, as in moving to your left. This limited\ngeneralization stems from these inference-only methods' inability to learn or\nadapt pre-trained models to a new domain. We propose the Logic-Enhanced\nFoundation Model (LEFT), a unified framework that learns to ground and reason\nwith concepts across domains with a differentiable, domain-independent,\nfirst-order logic-based program executor. LEFT has an LLM interpreter that\noutputs a program represented in a general, logic-based reasoning language,\nwhich is shared across all domains and tasks. LEFT's executor then executes the\nprogram with trainable domain-specific grounding modules. We show that LEFT\nflexibly learns concepts in four domains: 2D images, 3D scenes, human motions,\nand robotic manipulation. It exhibits strong reasoning ability in a wide\nvariety of tasks, including those that are complex and not seen during\ntraining, and can be easily applied to new domains.\n","authors":["Joy Hsu","Jiayuan Mao","Joshua B. Tenenbaum","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2310.16035v1.pdf","comment":"NeurIPS 2023. 
First two authors contributed equally. Project page:\n https://web.stanford.edu/~joycj/projects/left_neurips_2023"},{"id":"http://arxiv.org/abs/2305.14989v2","updated":"2023-10-24T17:48:43Z","published":"2023-05-24T10:24:10Z","title":"Dolphin: A Challenging and Diverse Benchmark for Arabic NLG","summary":" We present Dolphin, a novel benchmark that addresses the need for a natural\nlanguage generation (NLG) evaluation framework dedicated to the wide collection\nof Arabic languages and varieties. The proposed benchmark encompasses a broad\nrange of 13 different NLG tasks, including dialogue generation, question\nanswering, machine translation, summarization, among others. Dolphin comprises\na substantial corpus of 40 diverse and representative public datasets across 50\ntest splits, carefully curated to reflect real-world scenarios and the\nlinguistic richness of Arabic. It sets a new standard for evaluating the\nperformance and generalization capabilities of Arabic and multilingual models,\npromising to enable researchers to push the boundaries of current\nmethodologies. We provide an extensive analysis of Dolphin, highlighting its\ndiversity and identifying gaps in current Arabic NLG research. 
We also offer a\npublic leaderboard that is both interactive and modular and evaluate several\nmodels on our benchmark, allowing us to set strong baselines against which\nresearchers can compare.\n","authors":["El Moatez Billah Nagoudi","AbdelRahim Elmadany","Ahmed El-Shangiti","Muhammad Abdul-Mageed"],"pdf_url":"https://arxiv.org/pdf/2305.14989v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16033v1","updated":"2023-10-24T17:48:04Z","published":"2023-10-24T17:48:04Z","title":"Visual Cropping Improves Zero-Shot Question Answering of Multimodal\n Large Language Models","summary":" Multimodal Large Language Models (LLMs) have recently achieved promising\nzero-shot accuracy on visual question answering (VQA) -- a fundamental task\naffecting various downstream applications and domains. Given the great\npotential for the broad use of these models, it is important to investigate\ntheir limitations in dealing with different image and question properties. In\nthis work, we investigate whether multimodal LLMs can perceive small details as\nwell as large details in images. In particular, we show that their zero-shot\naccuracy in answering visual questions is very sensitive to the size of the\nvisual subject of the question, declining up to $46\\%$ with size. Furthermore,\nwe show that this effect is causal by observing that human visual cropping can\nsignificantly mitigate their sensitivity to size. Inspired by the usefulness of\nhuman cropping, we then propose three automatic visual cropping methods as\ninference time mechanisms to improve the zero-shot performance of multimodal\nLLMs. We study their effectiveness on four popular VQA datasets, and a subset\nof the VQAv2 dataset tailored towards fine visual details. Our findings suggest\nthat multimodal LLMs should be used with caution in detail-sensitive VQA\napplications, and that visual cropping is a promising direction to improve\ntheir zero-shot performance. 
Our code and data are publicly available.\n","authors":["Jiarui Zhang","Mahyar Khayatkhoei","Prateek Chhikara","Filip Ilievski"],"pdf_url":"https://arxiv.org/pdf/2310.16033v1.pdf","comment":"11 pages, 4 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.16028v1","updated":"2023-10-24T17:43:29Z","published":"2023-10-24T17:43:29Z","title":"What Algorithms can Transformers Learn? A Study in Length Generalization","summary":" Large language models exhibit surprising emergent generalization properties,\nyet also struggle on many simple reasoning tasks such as arithmetic and parity.\nThis raises the question of if and when Transformer models can learn the true\nalgorithm for solving a task. We study the scope of Transformers' abilities in\nthe specific setting of length generalization on algorithmic tasks. Here, we\npropose a unifying framework to understand when and how Transformers can\nexhibit strong length generalization on a given task. Specifically, we leverage\nRASP (Weiss et al., 2021) -- a programming language designed for the\ncomputational model of a Transformer -- and introduce the RASP-Generalization\nConjecture: Transformers tend to length generalize on a task if the task can be\nsolved by a short RASP program which works for all input lengths. This simple\nconjecture remarkably captures most known instances of length generalization on\nalgorithmic tasks. Moreover, we leverage our insights to drastically improve\ngeneralization performance on traditionally hard tasks (such as parity and\naddition). On the theoretical side, we give a simple example where the\n\"min-degree-interpolator\" model of learning from Abbe et al. (2023) does not\ncorrectly predict Transformers' out-of-distribution behavior, but our\nconjecture does. 
Overall, our work provides a novel perspective on the\nmechanisms of compositional generalization and the algorithmic capabilities of\nTransformers.\n","authors":["Hattie Zhou","Arwen Bradley","Etai Littwin","Noam Razin","Omid Saremi","Josh Susskind","Samy Bengio","Preetum Nakkiran"],"pdf_url":"https://arxiv.org/pdf/2310.16028v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2302.03169v2","updated":"2023-10-24T17:39:05Z","published":"2023-02-06T23:57:56Z","title":"Data Selection for Language Models via Importance Resampling","summary":" Selecting a suitable pretraining dataset is crucial for both general-domain\n(e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We\nformalize this problem as selecting a subset of a large raw unlabeled dataset\nto match a desired target distribution given some unlabeled target samples. Due\nto the large scale and dimensionality of the raw text data, existing methods\nuse simple heuristics or use experts to manually curate data. Instead, we\nextend the classic importance resampling approach used in low-dimensions for LM\ndata selection. We propose Data Selection with Importance Resampling (DSIR), an\nefficient and scalable framework that estimates importance weights in a reduced\nfeature space for tractability and selects data with importance resampling\naccording to these weights. To determine an appropriate feature space, we show\nthat KL reduction, a data metric that measures the proximity between selected\npretraining data and the target in a feature space, has high correlation with\naverage downstream accuracy (r=0.89) when computed with simple n-gram features.\nThis motivates our instantiation of DSIR using n-gram features. When performing\ncontinued pretraining towards a specific domain, DSIR performs comparably to\nexpert curation across 8 target distributions. 
When pretraining general-domain\nmodels (target is Wikipedia + books), DSIR improves over random selection and\nheuristic filtering baselines by 2-2.5% on the GLUE benchmark.\n","authors":["Sang Michael Xie","Shibani Santurkar","Tengyu Ma","Percy Liang"],"pdf_url":"https://arxiv.org/pdf/2302.03169v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.11398v2","updated":"2023-10-24T17:12:49Z","published":"2023-10-17T17:06:26Z","title":"Neural Attention: Enhancing QKV Calculation in Self-Attention Mechanism\n with Neural Networks","summary":" In the realm of deep learning, the self-attention mechanism has substantiated\nits pivotal role across a myriad of tasks, encompassing natural language\nprocessing and computer vision. Despite achieving success across diverse\napplications, the traditional self-attention mechanism primarily leverages\nlinear transformations for the computation of query, key, and value (QKV),\nwhich may not invariably be the optimal choice under specific circumstances.\nThis paper probes into a novel methodology for QKV computation-implementing a\nspecially-designed neural network structure for the calculation. Utilizing a\nmodified Marian model, we conducted experiments on the IWSLT 2017\nGerman-English translation task dataset and juxtaposed our method with the\nconventional approach. The experimental results unveil a significant\nenhancement in BLEU scores with our method. Furthermore, our approach also\nmanifested superiority when training the Roberta model with the Wikitext-103\ndataset, reflecting a notable reduction in model perplexity compared to its\noriginal counterpart. These experimental outcomes not only validate the\nefficacy of our method but also reveal the immense potential in optimizing the\nself-attention mechanism through neural network-based QKV computation, paving\nthe way for future research and practical applications. 
The source code and\nimplementation details for our proposed method can be accessed at\nhttps://github.com/ocislyjrti/NeuralAttention.\n","authors":["Muhan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.11398v2.pdf","comment":"Updated the formulas in Section 3.2 \"Detailed Methodology\" and\n revised Section 2 \"Background\" for clarity and accuracy"},{"id":"http://arxiv.org/abs/2310.13548v2","updated":"2023-10-24T17:12:03Z","published":"2023-10-20T14:46:48Z","title":"Towards Understanding Sycophancy in Language Models","summary":" Reinforcement learning from human feedback (RLHF) is a popular technique for\ntraining high-quality AI assistants. However, RLHF may also encourage model\nresponses that match user beliefs over truthful responses, a behavior known as\nsycophancy. We investigate the prevalence of sycophancy in RLHF-trained models\nand whether human preference judgements are responsible. We first demonstrate\nthat five state-of-the-art AI assistants consistently exhibit sycophantic\nbehavior across four varied free-form text-generation tasks. To understand if\nhuman preferences drive this broadly observed behavior of RLHF models, we\nanalyze existing human preference data. We find that when a response matches a\nuser's views, it is more likely to be preferred. Moreover, both humans and\npreference models (PMs) prefer convincingly-written sycophantic responses over\ncorrect ones a non-negligible fraction of the time. Optimizing model outputs\nagainst PMs also sometimes sacrifices truthfulness in favor of sycophancy.\nOverall, our results indicate that sycophancy is a general behavior of RLHF\nmodels, likely driven in part by human preference judgements favoring\nsycophantic responses.\n","authors":["Mrinank Sharma","Meg Tong","Tomasz Korbak","David Duvenaud","Amanda Askell","Samuel R. Bowman","Newton Cheng","Esin Durmus","Zac Hatfield-Dodds","Scott R. 
Johnston","Shauna Kravec","Timothy Maxwell","Sam McCandlish","Kamal Ndousse","Oliver Rausch","Nicholas Schiefer","Da Yan","Miranda Zhang","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2310.13548v2.pdf","comment":"32 pages, 20 figures"},{"id":"http://arxiv.org/abs/2305.13829v3","updated":"2023-10-24T16:55:19Z","published":"2023-05-23T08:51:08Z","title":"Learning from Mistakes via Cooperative Study Assistant for Large\n Language Models","summary":" Large language models (LLMs) have demonstrated their potential to refine\ntheir generation based on their own feedback. However, the feedback from LLM\nitself is often inaccurate, thereby limiting its benefits. In this paper, we\npropose Study Assistant for Large LAnguage Model (SALAM), a novel framework\nwith an auxiliary agent to assist the main LLM in learning from mistakes\nthrough interactive cooperation. In the gathering phase, the student assistant\nagent probes the main LLM, analyzes its errors, and collects the interaction in\na mistake memory. During the examination phase, the study assistant provides\nguidelines by retrieving relevant cases to help the main LLM anticipate and\navoid similar errors. We first investigate the effectiveness of a general study\nassistant and then customize it to provide LLM-specific guidance through\nimitation learning from successful guidance experiences. 
Our experiments on\nthree LLMs using two challenging frameworks demonstrate that SALAM can\nsignificantly boost LLMs by an accuracy margin of up to 6.6 on BBH and 12.6 on\nBBQ.\n","authors":["Danqing Wang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2305.13829v3.pdf","comment":"Accepted by EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.15987v1","updated":"2023-10-24T16:37:18Z","published":"2023-10-24T16:37:18Z","title":"Dissecting In-Context Learning of Translations in GPTs","summary":" Most of the recent work in leveraging Large Language Models (LLMs) such as\nGPT-3 for Machine Translation (MT) has focused on selecting the few-shot\nsamples for prompting. In this work, we try to better understand the role of\ndemonstration attributes for the in-context learning of translations through\nperturbations of high-quality, in-domain demonstrations. We find that\nasymmetric perturbation of the source-target mappings yield vastly different\nresults. We show that the perturbation of the source side has surprisingly\nlittle impact, while target perturbation can drastically reduce translation\nquality, suggesting that it is the output text distribution that provides the\nmost important learning signal during in-context learning of translations. We\npropose a method named Zero-Shot-Context to add this signal automatically in\nZero-Shot prompting. 
We demonstrate that it improves upon the zero-shot\ntranslation performance of GPT-3, even making it competitive with few-shot\nprompted translations.\n","authors":["Vikas Raunak","Hany Hassan Awadalla","Arul Menezes"],"pdf_url":"https://arxiv.org/pdf/2310.15987v1.pdf","comment":"EMNLP Findings (+ Minor Updates over Camera-Ready)"},{"id":"http://arxiv.org/abs/2307.03838v2","updated":"2023-10-24T16:31:49Z","published":"2023-07-07T21:13:27Z","title":"RADAR: Robust AI-Text Detection via Adversarial Learning","summary":" Recent advances in large language models (LLMs) and the intensifying\npopularity of ChatGPT-like applications have blurred the boundary of\nhigh-quality text generation between humans and machines. However, in addition\nto the anticipated revolutionary changes to our technology and society, the\ndifficulty of distinguishing LLM-generated texts (AI-text) from human-generated\ntexts poses new challenges of misuse and fairness, such as fake content\ngeneration, plagiarism, and false accusations of innocent writers. While\nexisting works show that current AI-text detectors are not robust to LLM-based\nparaphrasing, this paper aims to bridge this gap by proposing a new framework\ncalled RADAR, which jointly trains a robust AI-text detector via adversarial\nlearning. RADAR is based on adversarial training of a paraphraser and a\ndetector. The paraphraser's goal is to generate realistic content to evade\nAI-text detection. RADAR uses the feedback from the detector to update the\nparaphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly\n2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets,\nexperimental results show that RADAR significantly outperforms existing AI-text\ndetection methods, especially when paraphrasing is in place. 
We also identify\nthe strong transferability of RADAR from instruction-tuned LLMs to other LLMs,\nand evaluate the improved capability of RADAR via GPT-3.5-Turbo.\n","authors":["Xiaomeng Hu","Pin-Yu Chen","Tsung-Yi Ho"],"pdf_url":"https://arxiv.org/pdf/2307.03838v2.pdf","comment":"Accepted by NeurIPS 2023. Project page and demos:\n https://radar.vizhub.ai"},{"id":"http://arxiv.org/abs/2310.13469v3","updated":"2023-10-24T16:14:55Z","published":"2023-10-20T13:05:32Z","title":"Ask Language Model to Clean Your Noisy Translation Data","summary":" Transformer models have demonstrated remarkable performance in neural machine\ntranslation (NMT). However, their vulnerability to noisy input poses a\nsignificant challenge in practical implementation, where generating clean\noutput from noisy input is crucial. The MTNT dataset is widely used as a\nbenchmark for evaluating the robustness of NMT models against noisy input.\nNevertheless, its utility is limited due to the presence of noise in both the\nsource and target sentences. To address this limitation, we focus on cleaning\nthe noise from the target sentences in MTNT, making it more suitable as a\nbenchmark for noise evaluation. Leveraging the capabilities of large language\nmodels (LLMs), we observe their impressive abilities in noise removal. For\nexample, they can remove emojis while considering their semantic meaning.\nAdditionally, we show that LLM can effectively rephrase slang, jargon, and\nprofanities. The resulting datasets, called C-MTNT, exhibit significantly less\nnoise in the target sentences while preserving the semantic integrity of the\noriginal sentences. Our human and GPT-4 evaluations also lead to a consistent\nconclusion that LLM performs well on this task. 
Lastly, experiments on C-MTNT\nshowcased its effectiveness in evaluating the robustness of NMT models,\nhighlighting the potential of advanced language models for data cleaning and\nemphasizing C-MTNT as a valuable resource.\n","authors":["Quinten Bolding","Baohao Liao","Brandon James Denis","Jun Luo","Christof Monz"],"pdf_url":"https://arxiv.org/pdf/2310.13469v3.pdf","comment":"EMNLP 2023, Findings"},{"id":"http://arxiv.org/abs/2310.15970v1","updated":"2023-10-24T16:10:58Z","published":"2023-10-24T16:10:58Z","title":"Accented Speech Recognition With Accent-specific Codebooks","summary":" Speech accents pose a significant challenge to state-of-the-art automatic\nspeech recognition (ASR) systems. Degradation in performance across\nunderrepresented accents is a severe deterrent to the inclusive adoption of\nASR. In this work, we propose a novel accent adaptation approach for end-to-end\nASR systems using cross-attention with a trainable set of codebooks. These\nlearnable codebooks capture accent-specific information and are integrated\nwithin the ASR encoder layers. The model is trained on accented English speech,\nwhile the test data also contained accents which were not seen during training.\nOn the Mozilla Common Voice multi-accented dataset, we show that our proposed\napproach yields significant performance gains not only on the seen English\naccents (up to $37\\%$ relative improvement in word error rate) but also on the\nunseen accents (up to $5\\%$ relative improvement in WER). Further, we\nillustrate benefits for a zero-shot transfer setup on the L2Artic dataset. 
We\nalso compare the performance with other approaches based on accent adversarial\ntraining.\n","authors":["Darshan Prabhu","Preethi Jyothi","Sriram Ganapathy","Vinit Unni"],"pdf_url":"https://arxiv.org/pdf/2310.15970v1.pdf","comment":"Accepted to EMNLP 2023 Main Conference (Long Paper)"},{"id":"http://arxiv.org/abs/2310.15961v1","updated":"2023-10-24T16:03:57Z","published":"2023-10-24T16:03:57Z","title":"Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation","summary":" Despite the promise of Mixture of Experts (MoE) models in increasing\nparameter counts of Transformer models while maintaining training and inference\ncosts, their application carries notable drawbacks. The key strategy of these\nmodels is to, for each processed token, activate at most a few experts -\nsubsets of an extensive feed-forward layer. But this approach is not without\nits challenges. The operation of matching experts and tokens is discrete, which\nmakes MoE models prone to issues like training instability and uneven expert\nutilization. Existing techniques designed to address these concerns, such as\nauxiliary losses or balance-aware matching, result either in lower model\nperformance or are more difficult to train. In response to these issues, we\npropose Mixture of Tokens, a fully-differentiable model that retains the\nbenefits of MoE architectures while avoiding the aforementioned difficulties.\nRather than routing tokens to experts, this approach mixes tokens from\ndifferent examples prior to feeding them to experts, enabling the model to\nlearn from all token-expert combinations. Importantly, this mixing can be\ndisabled to avoid mixing of different sequences during inference. 
Crucially,\nthis method is fully compatible with both masked and causal Large Language\nModel training and inference.\n","authors":["Szymon Antoniak","Sebastian Jaszczur","Michał Krutul","Maciej Pióro","Jakub Krajewski","Jan Ludziejewski","Tomasz Odrzygóźdź","Marek Cygan"],"pdf_url":"https://arxiv.org/pdf/2310.15961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09141v2","updated":"2023-10-24T16:01:46Z","published":"2023-10-13T14:33:02Z","title":"PuoBERTa: Training and evaluation of a curated language model for\n Setswana","summary":" Natural language processing (NLP) has made significant progress for\nwell-resourced languages such as English but lagged behind for low-resource\nlanguages like Setswana. This paper addresses this gap by presenting PuoBERTa,\na customised masked language model trained specifically for Setswana. We cover\nhow we collected, curated, and prepared diverse monolingual texts to generate a\nhigh-quality corpus for PuoBERTa's training. Building upon previous efforts in\ncreating monolingual resources for Setswana, we evaluated PuoBERTa across\nseveral NLP tasks, including part-of-speech (POS) tagging, named entity\nrecognition (NER), and news categorisation. Additionally, we introduced a new\nSetswana news categorisation dataset and provided the initial benchmarks using\nPuoBERTa. 
Our work demonstrates the efficacy of PuoBERTa in fostering NLP\ncapabilities for understudied languages like Setswana and paves the way for\nfuture research directions.\n","authors":["Vukosi Marivate","Moseli Mots'Oehli","Valencia Wagner","Richard Lastrucci","Isheanesu Dzingirai"],"pdf_url":"https://arxiv.org/pdf/2310.09141v2.pdf","comment":"Accepted for SACAIR 2023"},{"id":"http://arxiv.org/abs/2310.15959v1","updated":"2023-10-24T15:59:43Z","published":"2023-10-24T15:59:43Z","title":"NoteChat: A Dataset of Synthetic Doctor-Patient Conversations\n Conditioned on Clinical Notes","summary":" The detailed clinical records drafted by doctors after each patient's visit\nare crucial for medical practitioners and researchers. Automating the creation\nof these notes with language models can reduce the workload of doctors.\nHowever, training such models can be difficult due to the limited public\navailability of conversations between patients and doctors. In this paper, we\nintroduce NoteChat, a cooperative multi-agent framework leveraging Large\nLanguage Models (LLMs) for generating synthetic doctor-patient conversations\nconditioned on clinical notes. NoteChat consists of Planning, Roleplay, and\nPolish modules. We provide a comprehensive automatic and human evaluation of\nNoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT\nand GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic\ndoctor-patient conversations, underscoring the untapped potential of LLMs in\nhealthcare. 
This work represents the first instance of multiple LLMs\ncooperating to complete a doctor-patient conversation conditioned on clinical\nnotes, offering promising avenues for the intersection of AI and healthcare\n","authors":["Junda Wang","Zonghai Yao","Zhichao Yang","Huixue Zhou","Rumeng Li","Xun Wang","Yucheng Xu","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2310.15959v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14892v2","updated":"2023-10-24T15:54:24Z","published":"2023-10-23T12:59:11Z","title":"Air-Decoding: Attribute Distribution Reconstruction for Decoding-Time\n Controllable Text Generation","summary":" Controllable text generation (CTG) aims to generate text with desired\nattributes, and decoding-time-based methods have shown promising performance on\nthis task. However, in this paper, we identify the phenomenon of Attribute\nCollapse for the first time. It causes the fluency of generated text to rapidly\ndecrease when the control strength exceeds a critical value, rendering the text\ncompletely unusable. This limitation hinders the effectiveness of decoding\nmethods in achieving high levels of controllability. To address this problem,\nwe propose a novel lightweight decoding framework named Air-Decoding. Its main\nidea is reconstructing the attribute distributions to balance the weights\nbetween attribute words and non-attribute words to generate more fluent text.\nSpecifically, we train prefixes by prefix-tuning to obtain attribute\ndistributions. Then we design a novel attribute distribution reconstruction\nmethod to balance the obtained distributions and use the reconstructed\ndistributions to guide language models for generation, effectively avoiding the\nissue of Attribute Collapse. 
Experiments on multiple CTG tasks prove that our\nmethod achieves a new state-of-the-art control performance.\n","authors":["Tianqi Zhong","Quan Wang","Jingxuan Han","Yongdong Zhang","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2310.14892v2.pdf","comment":"Accepted as an EMNLP 2023 main paper"},{"id":"http://arxiv.org/abs/2309.08030v2","updated":"2023-10-24T15:43:11Z","published":"2023-09-14T21:07:53Z","title":"AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised\n Features for Audio-Visual Speech Enhancement","summary":" Speech enhancement systems are typically trained using pairs of clean and\nnoisy speech. In audio-visual speech enhancement (AVSE), there is not as much\nground-truth clean data available; most audio-visual datasets are collected in\nreal-world environments with background noise and reverberation, hampering the\ndevelopment of AVSE. In this work, we introduce AV2Wav, a resynthesis-based\naudio-visual speech enhancement approach that can generate clean speech despite\nthe challenges of real-world training data. We obtain a subset of nearly clean\nspeech from an audio-visual corpus using a neural quality estimator, and then\ntrain a diffusion model on this subset to generate waveforms conditioned on\ncontinuous speech representations from AV-HuBERT with noise-robust training. We\nuse continuous rather than discrete representations to retain prosody and\nspeaker information. With this vocoding task alone, the model can perform\nspeech enhancement better than a masking-based baseline. We further fine-tune\nthe diffusion model on clean/noisy utterance pairs to improve the performance.\nOur approach outperforms a masking-based baseline in terms of both automatic\nmetrics and a human listening test and is close in quality to the target speech\nin the listening test. 
Audio samples can be found at\nhttps://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.\n","authors":["Ju-Chieh Chou","Chung-Ming Chien","Karen Livescu"],"pdf_url":"https://arxiv.org/pdf/2309.08030v2.pdf","comment":"Submitted to ICASSP 2024"},{"id":"http://arxiv.org/abs/2310.15941v1","updated":"2023-10-24T15:38:21Z","published":"2023-10-24T15:38:21Z","title":"This is not a Dataset: A Large Negation Benchmark to Challenge Large\n Language Models","summary":" Although large language models (LLMs) have apparently acquired a certain\nlevel of grammatical knowledge and the ability to make generalizations, they\nfail to interpret negation, a crucial step in Natural Language Processing. We\ntry to clarify the reasons for the sub-optimal performance of LLMs\nunderstanding negation. We introduce a large semi-automatically generated\ndataset of circa 400,000 descriptive sentences about commonsense knowledge that\ncan be true or false in which negation is present in about 2/3 of the corpus in\ndifferent forms. We have used our dataset with the largest available open LLMs\nin a zero-shot approach to grasp their generalization and inference capability\nand we have also fine-tuned some of the models to assess whether the\nunderstanding of negation can be trained. Our findings show that, while LLMs\nare proficient at classifying affirmative sentences, they struggle with\nnegative sentences and lack a deep understanding of negation, often relying on\nsuperficial cues. Although fine-tuning the models on negative sentences\nimproves their performance, the lack of generalization in handling negation is\npersistent, highlighting the ongoing challenges of LLMs regarding negation\nunderstanding and generalization. 
The dataset and code are publicly available.\n","authors":["Iker García-Ferrero","Begoña Altuna","Javier Álvez","Itziar Gonzalez-Dios","German Rigau"],"pdf_url":"https://arxiv.org/pdf/2310.15941v1.pdf","comment":"Accepted at the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2302.14229v4","updated":"2023-10-24T15:34:47Z","published":"2023-02-28T01:27:37Z","title":"Zero-Shot Cross-Lingual Summarization via Large Language Models","summary":" Given a document in a source language, cross-lingual summarization (CLS) aims\nto generate a summary in a different target language. Recently, the emergence\nof Large Language Models (LLMs), such as GPT-3.5, ChatGPT and GPT-4, has\nattracted wide attention from the computational linguistics community. However,\nthe performance of LLMs on CLS is not yet known. In this report, we\nempirically use various prompts to guide LLMs to perform zero-shot CLS from\ndifferent paradigms (i.e., end-to-end and pipeline), and provide a preliminary\nevaluation on the generated summaries. We find that ChatGPT and GPT-4\noriginally prefer to produce lengthy summaries with detailed information. These\ntwo LLMs can further balance informativeness and conciseness with the help of\nan interactive prompt, significantly improving their CLS performance.\nExperimental results on three widely-used CLS datasets show that GPT-4 achieves\nstate-of-the-art zero-shot CLS performance, and performs competitively compared\nwith the fine-tuned mBART-50. Moreover, we also find some multi-lingual and\nbilingual LLMs (i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan) have limited\nzero-shot CLS ability. Due to the composite nature of CLS, which requires\nmodels to perform summarization and translation simultaneously, accomplishing\nthis task in a zero-shot manner is even a challenge for LLMs. 
Therefore, we\nsincerely hope and recommend future LLM research could use CLS as a testbed.\n","authors":["Jiaan Wang","Yunlong Liang","Fandong Meng","Beiqi Zou","Zhixu Li","Jianfeng Qu","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2302.14229v4.pdf","comment":"Both first authors contributed equally. Technical Report, 12 pages.\n Accepted to the 4th New Frontiers in Summarization Workshop (NewSumm@EMNLP\n 2023)"},{"id":"http://arxiv.org/abs/2310.15929v1","updated":"2023-10-24T15:27:15Z","published":"2023-10-24T15:27:15Z","title":"E-Sparse: Boosting the Large Language Model Inference through\n Entropy-based N:M Sparsity","summary":" Traditional pruning methods are known to be challenging to work in Large\nLanguage Models (LLMs) for Generative AI because of their unaffordable training\nprocess and large computational demands. For the first time, we introduce the\ninformation entropy of hidden state features into a pruning metric design,\nnamely E-Sparse, to improve the accuracy of N:M sparsity on LLM. E-Sparse\nemploys the information richness to leverage the channel importance, and\nfurther incorporates several novel techniques to put it into effect: (1) it\nintroduces information entropy to enhance the significance of parameter weights\nand input feature norms as a novel pruning metric, and performs N:M sparsity\nwithout modifying the remaining weights. (2) it designs global naive shuffle\nand local block shuffle to quickly optimize the information distribution and\nadequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is\nimplemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere\nGPUs. 
Extensive experiments on the LLaMA family and OPT models show that\nE-Sparse can significantly speed up the model inference over the dense model\n(up to 1.53X) and obtain significant memory saving (up to 43.52%), with\nacceptable accuracy loss.\n","authors":["Yun Li","Lin Niu","Xipeng Zhang","Kai Liu","Jianchen Zhu","Zhanhui Kang"],"pdf_url":"https://arxiv.org/pdf/2310.15929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15921v1","updated":"2023-10-24T15:22:04Z","published":"2023-10-24T15:22:04Z","title":"Contrastive Learning-based Sentence Encoders Implicitly Weight\n Informative Words","summary":" The performance of sentence encoders can be significantly improved through\nthe simple practice of fine-tuning using contrastive loss. A natural question\narises: what characteristics do models acquire during contrastive learning?\nThis paper theoretically and experimentally shows that contrastive-based\nsentence encoders implicitly weight words based on information-theoretic\nquantities; that is, more informative words receive greater weight, while\nothers receive less. The theory states that, in the lower bound of the optimal\nvalue of the contrastive learning objective, the norm of word embedding\nreflects the information gain associated with the distribution of surrounding\nwords. We also conduct comprehensive experiments using various models, multiple\ndatasets, two methods to measure the implicit weighting of models (Integrated\nGradients and SHAP), and two information-theoretic quantities (information gain\nand self-information). 
The results provide empirical evidence that contrastive\nfine-tuning emphasizes informative words.\n","authors":["Hiroto Kurita","Goro Kobayashi","Sho Yokoi","Kentaro Inui"],"pdf_url":"https://arxiv.org/pdf/2310.15921v1.pdf","comment":"16 pages, 6 figures, accepted to EMNLP 2023 Findings (short paper)"},{"id":"http://arxiv.org/abs/2305.13040v4","updated":"2023-10-24T15:19:39Z","published":"2023-05-22T13:47:51Z","title":"SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented\n Dialogue Agents","summary":" Task-oriented dialogue (TOD) models have made significant progress in recent\nyears. However, previous studies primarily focus on datasets written by\nannotators, which has resulted in a gap between academic research and\nreal-world spoken conversation scenarios. While several small-scale spoken TOD\ndatasets are proposed to address robustness issues such as ASR errors, they\nignore the unique challenges in spoken conversation. To tackle the limitations,\nwe introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD,\ncontaining 8 domains, 203k turns, 5.7k dialogues and 249 hours of audios from\nhuman-to-human spoken conversations. SpokenWOZ further incorporates common\nspoken characteristics such as word-by-word processing and reasoning in spoken\nlanguage. Based on these characteristics, we present cross-turn slot and\nreasoning slot detection as new challenges. We conduct experiments on various\nbaselines, including text-modal models, newly proposed dual-modal models, and\nLLMs, e.g., ChatGPT. The results show that the current models still have\nsubstantial room for improvement in spoken conversation, where the most\nadvanced dialogue state tracker only achieves 25.65% in joint goal accuracy and\nthe SOTA end-to-end model only correctly completes the user request in 52.1% of\ndialogues. 
The dataset, code, and leaderboard are available:\nhttps://spokenwoz.github.io/SpokenWOZ-github.io/.\n","authors":["Shuzheng Si","Wentao Ma","Haoyu Gao","Yuchuan Wu","Ting-En Lin","Yinpei Dai","Hangyu Li","Rui Yan","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2305.13040v4.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.10180v2","updated":"2023-10-24T15:17:56Z","published":"2023-10-16T08:42:39Z","title":"TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative\n Language Models","summary":" Automated theorem proving (ATP) has become an appealing domain for exploring\nthe reasoning ability of the recent successful generative language models.\nHowever, current ATP benchmarks mainly focus on symbolic inference, but rarely\ninvolve the understanding of complex number combination reasoning. In this\nwork, we propose TRIGO, an ATP benchmark that not only requires a model to\nreduce a trigonometric expression with step-by-step proofs but also evaluates a\ngenerative LM's reasoning ability on formulas and its capability to manipulate,\ngroup, and factor number terms. We gather trigonometric expressions and their\nreduced forms from the web, annotate the simplification process manually, and\ntranslate it into the Lean formal language system. We then automatically\ngenerate additional examples from the annotated samples to expand the dataset.\nFurthermore, we develop an automatic generator based on Lean-Gym to create\ndataset splits of varying difficulties and distributions in order to thoroughly\nanalyze the model's generalization ability. 
Our extensive experiments show that our\nproposed TRIGO poses a new challenge for advanced generative LMs, including\nGPT-4, which is pre-trained on a considerable amount of open-source formal\ntheorem-proving language data, and provides a new tool to study the generative\nLM's ability on both formal and mathematical reasoning.\n","authors":["Jing Xiong","Jianhao Shen","Ye Yuan","Haiming Wang","Yichun Yin","Zhengying Liu","Lin Li","Zhijiang Guo","Qingxing Cao","Yinya Huang","Chuanyang Zheng","Xiaodan Liang","Ming Zhang","Qun Liu"],"pdf_url":"https://arxiv.org/pdf/2310.10180v2.pdf","comment":"Accepted by EMNLP 2023. Code is available at\n https://github.com/menik1126/TRIGO"},{"id":"http://arxiv.org/abs/2310.15916v1","updated":"2023-10-24T15:17:14Z","published":"2023-10-24T15:17:14Z","title":"In-Context Learning Creates Task Vectors","summary":" In-context learning (ICL) in Large Language Models (LLMs) has emerged as a\npowerful new learning paradigm. However, its underlying mechanism is still not\nwell understood. In particular, it is challenging to map it to the \"standard\"\nmachine learning framework, where one uses a training set $S$ to find a\nbest-fitting function $f(x)$ in some hypothesis class. Here we make progress on\nthis problem by showing that the functions learned by ICL often have a very\nsimple structure: they correspond to the transformer LLM whose only inputs are\nthe query $x$ and a single \"task vector\" calculated from the training set.\nThus, ICL can be seen as compressing $S$ into a single task vector\n$\\boldsymbol{\\theta}(S)$ and then using this task vector to modulate the\ntransformer to produce the output. 
We support the above claim via comprehensive\nexperiments across a range of models and tasks.\n","authors":["Roee Hendel","Mor Geva","Amir Globerson"],"pdf_url":"https://arxiv.org/pdf/2310.15916v1.pdf","comment":"Accepted at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15019v2","updated":"2023-10-24T15:15:38Z","published":"2023-10-23T15:14:55Z","title":"Meta learning with language models: Challenges and opportunities in the\n classification of imbalanced text","summary":" Detecting out of policy speech (OOPS) content is important but difficult.\nWhile machine learning is a powerful tool to tackle this challenging task, it\nis hard to break the performance ceiling due to factors like quantity and\nquality limitations on training data and inconsistencies in OOPS definition and\ndata labeling. To realize the full potential of available limited resources, we\npropose a meta learning technique (MLT) that combines individual models built\nwith different text representations. We analytically show that the resulting\ntechnique is numerically stable and produces reasonable combining weights. We\ncombine the MLT with a threshold-moving (TM) technique to further improve the\nperformance of the combined predictor on highly-imbalanced in-distribution and\nout-of-distribution datasets. We also provide computational results to show the\nstatistically significant advantages of the proposed MLT approach.\n All authors contributed equally to this work.\n","authors":["Apostol Vassilev","Honglan Jin","Munawar Hasan"],"pdf_url":"https://arxiv.org/pdf/2310.15019v2.pdf","comment":"22 pages, including 5 figures, 12 tables, 1 appendix"},{"id":"http://arxiv.org/abs/2310.15910v1","updated":"2023-10-24T15:15:18Z","published":"2023-10-24T15:15:18Z","title":"Characterizing Mechanisms for Factual Recall in Language Models","summary":" Language Models (LMs) often must integrate facts they memorized in\npretraining with new information that appears in a given context. 
These two\nsources can disagree, causing competition within the model, and it is unclear\nhow an LM will resolve the conflict. On a dataset that queries for knowledge of\nworld capitals, we investigate both distributional and mechanistic determinants\nof LM behavior in such situations. Specifically, we measure the proportion of\nthe time an LM will use a counterfactual prefix (e.g., \"The capital of Poland\nis London\") to overwrite what it learned in pretraining (\"Warsaw\"). On Pythia\nand GPT2, the training frequency of both the query country (\"Poland\") and the\nin-context city (\"London\") highly affect the models' likelihood of using the\ncounterfactual. We then use head attribution to identify individual attention\nheads that either promote the memorized answer or the in-context answer in the\nlogits. By scaling up or down the value vector of these heads, we can control\nthe likelihood of using the in-context answer on new data. This method can\nincrease the rate of generating the in-context answer to 88\\% of the time\nsimply by scaling a single head at runtime. Our work contributes to a body of\nevidence showing that we can often localize model behaviors to specific\ncomponents and provides a proof of concept for how future methods might control\nmodel behavior dynamically at runtime.\n","authors":["Qinan Yu","Jack Merullo","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2310.15910v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15905v1","updated":"2023-10-24T15:08:12Z","published":"2023-10-24T15:08:12Z","title":"Is Probing All You Need? Indicator Tasks as an Alternative to Probing\n Embedding Spaces","summary":" The ability to identify and control different kinds of linguistic information\nencoded in vector representations of words has many use cases, especially for\nexplainability and bias removal. This is usually done via a set of simple\nclassification tasks, termed probes, to evaluate the information encoded in the\nembedding space. 
However, the involvement of a trainable classifier leads to\nentanglement between the probe's results and the classifier's nature. As a\nresult, contemporary works on probing include tasks that do not involve\ntraining of auxiliary models. In this work we introduce the term indicator\ntasks for non-trainable tasks which are used to query embedding spaces for the\nexistence of certain properties, and claim that this kind of tasks may point to\na direction opposite to probes, and that this contradiction complicates the\ndecision on whether a property exists in an embedding space. We demonstrate our\nclaims with two test cases, one dealing with gender debiasing and another with\nthe erasure of morphological information from embedding spaces. We show that\nthe application of a suitable indicator provides a more accurate picture of the\ninformation captured and removed compared to probes. We thus conclude that\nindicator tasks should be implemented and taken into consideration when\neliciting information from embedded representations.\n","authors":["Tal Levy","Omer Goldman","Reut Tsarfaty"],"pdf_url":"https://arxiv.org/pdf/2310.15905v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2209.01646v3","updated":"2023-10-24T15:07:44Z","published":"2022-09-04T15:40:10Z","title":"SCL-RAI: Span-based Contrastive Learning with Retrieval Augmented\n Inference for Unlabeled Entity Problem in NER","summary":" Named Entity Recognition is the task to locate and classify the entities in\nthe text. However, Unlabeled Entity Problem in NER datasets seriously hinders\nthe improvement of NER performance. This paper proposes SCL-RAI to cope with\nthis problem. Firstly, we decrease the distance of span representations with\nthe same label while increasing it for different ones via span-based\ncontrastive learning, which relieves the ambiguity among entities and improves\nthe robustness of the model over unlabeled entities. 
Then we propose retrieval\naugmented inference to mitigate the decision boundary shifting problem. Our\nmethod significantly outperforms the previous SOTA method by 4.21% and 8.64%\nF1-score on two real-world datasets.\n","authors":["Shuzheng Si","Shuang Zeng","Jiaxing Lin","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2209.01646v3.pdf","comment":"COLING 2022"},{"id":"http://arxiv.org/abs/2310.15904v1","updated":"2023-10-24T15:07:35Z","published":"2023-10-24T15:07:35Z","title":"Do Stochastic Parrots have Feelings Too? Improving Neural Detection of\n Synthetic Text via Emotion Recognition","summary":" Recent developments in generative AI have shone a spotlight on\nhigh-performance synthetic text generation technologies. The now wide\navailability and ease of use of such models highlights the urgent need to\nprovide equally powerful technologies capable of identifying synthetic text.\nWith this in mind, we draw inspiration from psychological studies which suggest\nthat people can be driven by emotion and encode emotion in the text they\ncompose. We hypothesize that pretrained language models (PLMs) have an\naffective deficit because they lack such an emotional driver when generating\ntext and consequently may generate synthetic text which has affective\nincoherence i.e. lacking the kind of emotional coherence present in\nhuman-authored text. We subsequently develop an emotionally aware detector by\nfine-tuning a PLM on emotion. Experiment results indicate that our\nemotionally-aware detector achieves improvements across a range of synthetic\ntext generators, various sized models, datasets, and domains. Finally, we\ncompare our emotionally-aware synthetic text detector to ChatGPT in the task of\nidentification of its own output and show substantial gains, reinforcing the\npotential of emotion as a signal to identify synthetic text. 
Code, models, and\ndatasets are available at https://github.com/alanagiasi/emoPLMsynth\n","authors":["Alan Cowap","Yvette Graham","Jennifer Foster"],"pdf_url":"https://arxiv.org/pdf/2310.15904v1.pdf","comment":"Accepted to Findings of EMNLP 2023 (long paper). Camera ready version"},{"id":"http://arxiv.org/abs/2309.12871v5","updated":"2023-10-24T14:59:02Z","published":"2023-09-22T13:52:42Z","title":"AnglE-optimized Text Embeddings","summary":" High-quality text embedding is pivotal in improving semantic textual\nsimilarity (STS) tasks, which are crucial components in Large Language Model\n(LLM) applications. However, a common challenge existing text embedding models\nface is the problem of vanishing gradients, primarily due to their reliance on\nthe cosine function in the optimization objective, which has saturation zones.\nTo address this issue, this paper proposes a novel angle-optimized text\nembedding model called AnglE. The core idea of AnglE is to introduce angle\noptimization in a complex space. This novel approach effectively mitigates the\nadverse effects of the saturation zone in the cosine function, which can impede\ngradient and hinder optimization processes. To set up a comprehensive STS\nevaluation, we experimented on existing short-text STS datasets and a newly\ncollected long-text STS dataset from GitHub Issues. Furthermore, we examine\ndomain-specific STS scenarios with limited labeled data and explore how AnglE\nworks with LLM-annotated data. Extensive experiments were conducted on various\ntasks including short-text STS, long-text STS, and domain-specific STS tasks.\nThe results show that AnglE outperforms the state-of-the-art (SOTA) STS models\nthat ignore the cosine saturation zone. 
These findings demonstrate the ability\nof AnglE to generate high-quality text embeddings and the usefulness of angle\noptimization in STS.\n","authors":["Xianming Li","Jing Li"],"pdf_url":"https://arxiv.org/pdf/2309.12871v5.pdf","comment":"update results and add non-STS transfer tasks"},{"id":"http://arxiv.org/abs/2310.12172v2","updated":"2023-10-24T14:58:13Z","published":"2023-10-15T04:52:04Z","title":"Overview of ImageArg-2023: The First Shared Task in Multimodal Argument\n Mining","summary":" This paper presents an overview of the ImageArg shared task, the first\nmultimodal Argument Mining shared task co-located with the 10th Workshop on\nArgument Mining at EMNLP 2023. The shared task comprises two classification\nsubtasks - (1) Subtask-A: Argument Stance Classification; (2) Subtask-B: Image\nPersuasiveness Classification. The former determines the stance of a tweet\ncontaining an image and a piece of text toward a controversial topic (e.g., gun\ncontrol and abortion). The latter determines whether the image makes the tweet\ntext more persuasive. The shared task received 31 submissions for Subtask-A and\n21 submissions for Subtask-B from 9 different teams across 6 countries. 
The top\nsubmission in Subtask-A achieved an F1-score of 0.8647 while the best\nsubmission in Subtask-B achieved an F1-score of 0.5561.\n","authors":["Zhexiong Liu","Mohamed Elaraby","Yang Zhong","Diane Litman"],"pdf_url":"https://arxiv.org/pdf/2310.12172v2.pdf","comment":"In The 10th Argument Mining Workshop, held in conjunction with The\n Conference on Empirical Methods in Natural Language Processing (EMNLP),\n December 2023"},{"id":"http://arxiv.org/abs/2310.15896v1","updated":"2023-10-24T14:57:34Z","published":"2023-10-24T14:57:34Z","title":"BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs\n with Multi-turn Health Conversations Polished by ChatGPT","summary":" Large language models (LLMs) have performed well in providing general and\nextensive health suggestions in single-turn conversations, exemplified by\nsystems such as ChatGPT, ChatGLM, ChatDoctor, DoctorGLM, etc. However, the\nlimited information provided by users during a single turn results in inadequate\npersonalization and targeting of the generated suggestions, which requires\nusers to independently select the useful part. It is mainly caused by the\nmissing ability to engage in multi-turn questioning. In real-world medical\nconsultations, doctors usually employ a series of iterative inquiries to\ncomprehend the patient's condition thoroughly, enabling them to provide\neffective and personalized suggestions subsequently, which can be defined as\nchain of questioning (CoQ) for LLMs. To improve the CoQ of LLMs, we propose\nBianQue, a ChatGLM-based LLM finetuned with the self-constructed health\nconversation dataset BianQueCorpus that consists of multiple turns of\nquestioning and health suggestions polished by ChatGPT. 
Experimental results\ndemonstrate that the proposed BianQue can simultaneously balance the\ncapabilities of both questioning and health suggestions, which will help\npromote the research and application of LLMs in the field of proactive health.\n","authors":["Yirong Chen","Zhenyu Wang","Xiaofen Xing","huimin zheng","Zhipei Xu","Kai Fang","Junhong Wang","Sihang Li","Jieling Wu","Qi Liu","Xiangmin Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15896v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.04048v3","updated":"2023-10-24T14:56:51Z","published":"2023-03-07T16:57:20Z","title":"Is ChatGPT a Good NLG Evaluator? A Preliminary Study","summary":" Recently, the emergence of ChatGPT has attracted wide attention from the\ncomputational linguistics community. Many prior studies have shown that ChatGPT\nachieves remarkable performance on various NLP tasks in terms of automatic\nevaluation metrics. However, the ability of ChatGPT to serve as an evaluation\nmetric is still underexplored. Considering assessing the quality of natural\nlanguage generation (NLG) models is an arduous task and NLG metrics notoriously\nshow their poor correlation with human judgments, we wonder whether ChatGPT is\na good NLG evaluation metric. In this report, we provide a preliminary\nmeta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail,\nwe regard ChatGPT as a human evaluator and give task-specific (e.g.,\nsummarization) and aspect-specific (e.g., relevance) instruction to prompt\nChatGPT to evaluate the generated results of NLG models. We conduct experiments\non five NLG meta-evaluation datasets (including summarization, story generation\nand data-to-text tasks). Experimental results show that compared with previous\nautomatic metrics, ChatGPT achieves state-of-the-art or competitive correlation\nwith human judgments in most cases. 
In addition, we find that the effectiveness\nof the ChatGPT evaluator might be influenced by the creation method of the\nmeta-evaluation datasets. For the meta-evaluation datasets which are created\ngreatly depending on the reference and thus are biased, the ChatGPT evaluator\nmight lose its effectiveness. We hope our preliminary study could prompt the\nemergence of a general-purposed reliable NLG metric.\n","authors":["Jiaan Wang","Yunlong Liang","Fandong Meng","Zengkui Sun","Haoxiang Shi","Zhixu Li","Jinan Xu","Jianfeng Qu","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2303.04048v3.pdf","comment":"Both first authors contributed equally. Technical Report, 11 pages.\n Accepted to the 4th New Frontiers in Summarization Workshop (NewSumm@EMNLP\n 2023)"},{"id":"http://arxiv.org/abs/2309.02373v2","updated":"2023-10-24T14:53:50Z","published":"2023-09-05T16:35:41Z","title":"nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style\n Models with Limited Resources","summary":" State-of-the-art language models like T5 have revolutionized the NLP\nlandscape, but their computational demands hinder a large portion of the\nresearch community. To address this challenge, we present nanoT5, a\nspecially-optimized PyTorch framework for efficient pre-training and\nfine-tuning of T5 models. Drawing on insights from optimizer differences and\nprioritizing efficiency, nanoT5 allows a T5-Base model to be pre-trained on a\nsingle GPU in just 16 hours, without any loss in performance. With the\nintroduction of this open-source framework, we hope to widen the accessibility\nto language modelling research and cater to the community's demand for more\nuser-friendly T5 (Encoder-Decoder) implementations. 
We make our contributions,\nincluding configurations, codebase, pre-training insights, and pre-trained\nmodels, available to the public.\n","authors":["Piotr Nawrot"],"pdf_url":"https://arxiv.org/pdf/2309.02373v2.pdf","comment":"To appear at 3rd Workshop for Natural Language Processing Open Source\n Software"},{"id":"http://arxiv.org/abs/2310.15852v1","updated":"2023-10-24T14:08:37Z","published":"2023-10-24T14:08:37Z","title":"Using Artificial French Data to Understand the Emergence of Gender Bias\n in Transformer Language Models","summary":" Numerous studies have demonstrated the ability of neural language models to\nlearn various linguistic properties without direct supervision. This work takes\nan initial step towards exploring the less researched topic of how neural\nmodels discover linguistic properties of words, such as gender, as well as the\nrules governing their usage. We propose to use an artificial corpus generated\nby a PCFG based on French to precisely control the gender distribution in the\ntraining data and determine under which conditions a model correctly captures\ngender information or, on the contrary, appears gender-biased.\n","authors":["Lina Conti","Guillaume Wisniewski"],"pdf_url":"https://arxiv.org/pdf/2310.15852v1.pdf","comment":"Accepted at EMNLP'23"},{"id":"http://arxiv.org/abs/2310.15851v1","updated":"2023-10-24T14:08:26Z","published":"2023-10-24T14:08:26Z","title":"Self-Guard: Empower the LLM to Safeguard Itself","summary":" The jailbreak attack can bypass the safety measures of a Large Language Model\n(LLM), generating harmful content. This misuse of LLM has led to negative\nsocietal consequences. Currently, there are two main approaches to address\njailbreak attacks: safety training and safeguards. Safety training focuses on\nfurther training LLM to enhance its safety. 
On the other hand, safeguards\ninvolve implementing external models or filters to prevent harmful outputs.\nHowever, safety training has constraints in its ability to adapt to new attack\ntypes and often leads to a drop in model performance. Safeguards have proven to\nbe of limited help. To tackle these issues, we propose a novel approach called\nSelf-Guard, which combines the strengths of both safety methods. Self-Guard\nincludes two stages. In the first stage, we enhance the model's ability to\nassess harmful content, and in the second stage, we instruct the model to\nconsistently perform harmful content detection on its own responses. The\nexperiment has demonstrated that Self-Guard is robust against jailbreak\nattacks. In the bad case analysis, we find that LLM occasionally provides\nharmless responses to harmful queries. Additionally, we evaluated the general\ncapabilities of the LLM before and after safety training, providing evidence\nthat Self-Guard does not result in the LLM's performance degradation. In\nsensitivity tests, Self-Guard not only avoids inducing over-sensitivity in LLM\nbut also can even mitigate this issue.\n","authors":["Zezhong Wang","Fangkai Yang","Lu Wang","Pu Zhao","Hongru Wang","Liang Chen","Qingwei Lin","Kam-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2310.15851v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10387v3","updated":"2023-10-24T14:00:22Z","published":"2023-05-17T17:26:16Z","title":"Elaborative Simplification as Implicit Questions Under Discussion","summary":" Automated text simplification, a technique useful for making text more\naccessible to people such as children and emergent bilinguals, is often thought\nof as a monolingual translation task from complex sentences to simplified\nsentences using encoder-decoder models. This view fails to account for\nelaborative simplification, where new information is added into the simplified\ntext. 
This paper proposes to view elaborative simplification through the lens\nof the Question Under Discussion (QUD) framework, providing a robust way to\ninvestigate what writers elaborate upon, how they elaborate, and how\nelaborations fit into the discourse context by viewing elaborations as explicit\nanswers to implicit questions. We introduce ElabQUD, consisting of 1.3K\nelaborations accompanied with implicit QUDs, to study these phenomena. We show\nthat explicitly modeling QUD (via question generation) not only provides\nessential understanding of elaborative simplification and how the elaborations\nconnect with the rest of the discourse, but also substantially improves the\nquality of elaboration generation.\n","authors":["Yating Wu","William Sheffield","Kyle Mahowald","Junyi Jessy Li"],"pdf_url":"https://arxiv.org/pdf/2305.10387v3.pdf","comment":"Equal contribution by Yating Wu and William Sheffield. This the EMNLP\n 2023 Main camera-ready version"},{"id":"http://arxiv.org/abs/2304.02210v2","updated":"2023-10-24T14:00:21Z","published":"2023-04-05T03:49:06Z","title":"Document-Level Machine Translation with Large Language Models","summary":" Large language models (LLMs) such as ChatGPT can produce coherent, cohesive,\nrelevant, and fluent answers for various natural language processing (NLP)\ntasks. Taking document-level machine translation (MT) as a testbed, this paper\nprovides an in-depth evaluation of LLMs' ability on discourse modeling. The\nstudy focuses on three aspects: 1) Effects of Context-Aware Prompts, where we\ninvestigate the impact of different prompts on document-level translation\nquality and discourse phenomena; 2) Comparison of Translation Models, where we\ncompare the translation performance of ChatGPT with commercial MT systems and\nadvanced document-level MT methods; 3) Analysis of Discourse Modelling\nAbilities, where we further probe discourse knowledge encoded in LLMs and shed\nlight on impacts of training techniques on discourse modeling. 
By evaluating on\na number of benchmarks, we surprisingly find that LLMs have demonstrated\nsuperior performance and show potential to become a new paradigm for\ndocument-level translation: 1) leveraging their powerful long-text modeling\ncapabilities, GPT-3.5 and GPT-4 outperform commercial MT systems in terms of\nhuman evaluation; 2) GPT-4 demonstrates a stronger ability for probing\nlinguistic knowledge than GPT-3.5. This work highlights the challenges and\nopportunities of LLMs for MT, which we hope can inspire the future design and\nevaluation of LLMs. We release our data and annotations at\nhttps://github.com/longyuewangdcu/Document-MT-LLM.\n","authors":["Longyue Wang","Chenyang Lyu","Tianbo Ji","Zhirui Zhang","Dian Yu","Shuming Shi","Zhaopeng Tu"],"pdf_url":"https://arxiv.org/pdf/2304.02210v2.pdf","comment":"Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang are equal\n contributors"},{"id":"http://arxiv.org/abs/2310.15836v1","updated":"2023-10-24T13:43:01Z","published":"2023-10-24T13:43:01Z","title":"A Diffusion Weighted Graph Framework for New Intent Discovery","summary":" New Intent Discovery (NID) aims to recognize both new and known intents from\nunlabeled data with the aid of limited labeled data containing only known\nintents. Without considering structure relationships between samples, previous\nmethods generate noisy supervisory signals which cannot strike a balance\nbetween quantity and quality, hindering the formation of new intent clusters\nand effective transfer of the pre-training knowledge. To mitigate this\nlimitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to\ncapture both semantic similarities and structure relationships inherent in\ndata, enabling more sufficient and reliable supervisory signals. Specifically,\nfor each sample, we diffuse neighborhood relationships along semantic paths\nguided by the nearest neighbors for multiple hops to characterize its local\nstructure discriminately. 
Then, we sample its positive keys and weigh them\nbased on semantic similarities and local structures for contrastive learning.\nDuring inference, we further propose Graph Smoothing Filter (GSF) to explicitly\nutilize the structure relationships to filter high-frequency noise embodied in\nsemantically ambiguous samples on the cluster boundary. Extensive experiments\nshow that our method outperforms state-of-the-art models on all evaluation\nmetrics across multiple benchmark datasets. Code and data are available at\nhttps://github.com/yibai-shi/DWGF.\n","authors":["Wenkai Shi","Wenbin An","Feng Tian","Qinghua Zheng","QianYing Wang","Ping Chen"],"pdf_url":"https://arxiv.org/pdf/2310.15836v1.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2305.12001v2","updated":"2023-10-24T13:38:19Z","published":"2023-05-19T20:58:22Z","title":"OPT-R: Exploring the Role of Explanations in Finetuning and Prompting\n for Reasoning Skills of Large Language Models","summary":" In this paper, we conduct a thorough investigation into the reasoning\ncapabilities of Large Language Models (LLMs), focusing specifically on the Open\nPretrained Transformers (OPT) models as a representative of such models. Our\nstudy entails finetuning three different sizes of OPT on a carefully curated\nreasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned\nwithout explanations, and OPT-RE, finetuned with explanations. We then evaluate\nall models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS\nbenchmark, covering 26 distinct reasoning skills, utilizing three prompting\ntechniques. Through a comprehensive grid of 27 configurations and 6,156 test\nevaluations, we investigate the dimensions of finetuning, prompting, and scale\nto understand the role of explanations on different reasoning skills. 
Our\nfindings reveal that having explanations in the fewshot exemplar has no\nsignificant impact on the model's performance when the model is finetuned,\nwhile positively affecting the non-finetuned counterpart. Moreover, we observe\na slight yet consistent increase in classification accuracy as we incorporate\nexplanations during prompting and finetuning, respectively. Finally, we offer\ninsights on which skills benefit the most from incorporating explanations\nduring finetuning and prompting, such as Numerical (+20.4%) and Analogical\n(+13.9%) reasoning, as well as skills that exhibit negligible or negative\neffects.\n","authors":["Badr AlKhamissi","Siddharth Verma","Ping Yu","Zhijing Jin","Asli Celikyilmaz","Mona Diab"],"pdf_url":"https://arxiv.org/pdf/2305.12001v2.pdf","comment":"Proceedings of the 1st Workshop on Natural Language Reasoning and\n Structured Explanations (NLRSE) at ACL 2023"},{"id":"http://arxiv.org/abs/2305.14280v2","updated":"2023-10-24T13:36:49Z","published":"2023-05-23T17:26:50Z","title":"Multilingual Pixel Representations for Translation and Effective\n Cross-lingual Transfer","summary":" We introduce and demonstrate how to effectively train multilingual machine\ntranslation models with pixel representations. We experiment with two different\ndata settings with a variety of language and script coverage, demonstrating\nimproved performance compared to subword embeddings. We explore various\nproperties of pixel representations such as parameter sharing within and across\nscripts to better understand where they lead to positive transfer. We observe\nthat these properties not only enable seamless cross-lingual transfer to unseen\nscripts, but make pixel representations more data-efficient than alternatives\nsuch as vocabulary expansion. 
We hope this work contributes to more extensible\nmultilingual models for all languages and scripts.\n","authors":["Elizabeth Salesky","Neha Verma","Philipp Koehn","Matt Post"],"pdf_url":"https://arxiv.org/pdf/2305.14280v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15829v1","updated":"2023-10-24T13:32:20Z","published":"2023-10-24T13:32:20Z","title":"Unnatural language processing: How do language models handle\n machine-generated prompts?","summary":" Language model prompt optimization research has shown that semantically and\ngrammatically well-formed manually crafted prompts are routinely outperformed\nby automatically generated token sequences with no apparent meaning or\nsyntactic structure, including sequences of vectors from a model's embedding\nspace. We use machine-generated prompts to probe how models respond to input\nthat is not composed of natural language expressions. We study the behavior of\nmodels of different sizes in multiple semantic tasks in response to both\ncontinuous and discrete machine-generated prompts, and compare it to the\nbehavior in response to human-generated natural-language prompts. Even when\nproducing a similar output, machine-generated and human prompts trigger\ndifferent response patterns through the network processing pathways, including\ndifferent perplexities, different attention and output entropy distributions,\nand different unit activation profiles. 
We provide preliminary insight into the\nnature of the units activated by different prompt types, suggesting that only\nnatural language prompts recruit a genuinely linguistic circuit.\n","authors":["Corentin Kervadec","Francesca Franzon","Marco Baroni"],"pdf_url":"https://arxiv.org/pdf/2310.15829v1.pdf","comment":"Findings of EMNLP 2023 Camera-Ready"},{"id":"http://arxiv.org/abs/2310.15823v1","updated":"2023-10-24T13:23:57Z","published":"2023-10-24T13:23:57Z","title":"Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To\n Word--Definition Alignment","summary":" A Reverse Dictionary is a tool enabling users to discover a word based on its\nprovided definition, meaning, or description. Such a technique proves valuable\nin various scenarios, aiding language learners who possess a description of a\nword without its identity, and benefiting writers seeking precise terminology.\nThese scenarios often encapsulate what is referred to as the\n\"Tip-of-the-Tongue\" (TOT) phenomena. In this work, we present our winning\nsolution for the Arabic Reverse Dictionary shared task. This task focuses on\nderiving a vector representation of an Arabic word from its accompanying\ndescription. The shared task encompasses two distinct subtasks: the first\ninvolves an Arabic definition as input, while the second employs an English\ndefinition. For the first subtask, our approach relies on an ensemble of\nfinetuned Arabic BERT-based models, predicting the word embedding for a given\ndefinition. The final representation is obtained through averaging the output\nembeddings from each model within the ensemble. In contrast, the most effective\nsolution for the second subtask involves translating the English test\ndefinitions into Arabic and applying them to the finetuned models originally\ntrained for the first subtask. 
This straightforward method achieves the highest\nscore across both subtasks.\n","authors":["Ahmed ElBakry","Mohamed Gabr","Muhammad ElNokrashy","Badr AlKhamissi"],"pdf_url":"https://arxiv.org/pdf/2310.15823v1.pdf","comment":"ArabicNLP 2023"},{"id":"http://arxiv.org/abs/2310.15819v1","updated":"2023-10-24T13:17:40Z","published":"2023-10-24T13:17:40Z","title":"Generative Language Models Exhibit Social Identity Biases","summary":" The surge in popularity of large language models has given rise to concerns\nabout biases that these models could learn from humans. In this study, we\ninvestigate whether ingroup solidarity and outgroup hostility, fundamental\nsocial biases known from social science, are present in 51 large language\nmodels. We find that almost all foundational language models and some\ninstruction fine-tuned models exhibit clear ingroup-positive and\noutgroup-negative biases when prompted to complete sentences (e.g., \"We\nare...\"). A comparison of LLM-generated sentences with human-written sentences\non the internet reveals that these models exhibit similar, if not\ngreater, levels of bias than human text. To investigate where these biases stem\nfrom, we experimentally varied the amount of ingroup-positive or\noutgroup-negative sentences the model was exposed to during fine-tuning in the\ncontext of the United States Democrat-Republican divide. Doing so resulted in\nthe models exhibiting a marked increase in ingroup solidarity and an even\ngreater increase in outgroup hostility. Furthermore, removing either\ningroup-positive or outgroup-negative sentences (or both) from the fine-tuning\ndata leads to a significant reduction in both ingroup solidarity and outgroup\nhostility, suggesting that biases can be reduced by removing biased training\ndata. Our findings suggest that modern language models exhibit fundamental\nsocial identity biases and that such biases can be mitigated by curating\ntraining data. 
Our results have practical implications for creating less biased\nlarge-language models and further underscore the need for more research into\nuser interactions with LLMs to prevent potential bias reinforcement in humans.\n","authors":["Tiancheng Hu","Yara Kyrychenko","Steve Rathje","Nigel Collier","Sander van der Linden","Jon Roozenbeek"],"pdf_url":"https://arxiv.org/pdf/2310.15819v1.pdf","comment":"supplementary material, data, and code see\n https://osf.io/9ht32/?view_only=f0ab4b23325f4c31ad3e12a7353b55f5"},{"id":"http://arxiv.org/abs/2310.03368v3","updated":"2023-10-24T13:16:29Z","published":"2023-10-05T07:57:09Z","title":"Evaluating Hallucinations in Chinese Large Language Models","summary":" In this paper, we establish a benchmark named HalluQA (Chinese Hallucination\nQuestion-Answering) to measure the hallucination phenomenon in Chinese large\nlanguage models. HalluQA contains 450 meticulously designed adversarial\nquestions, spanning multiple domains, and takes into account Chinese historical\nculture, customs, and social phenomena. During the construction of HalluQA, we\nconsider two types of hallucinations: imitative falsehoods and factual errors,\nand we construct adversarial samples based on GLM-130B and ChatGPT. For\nevaluation, we design an automated evaluation method using GPT-4 to judge\nwhether a model output is hallucinated. We conduct extensive experiments on 24\nlarge language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk\nand etc. Out of the 24 models, 18 achieved non-hallucination rates lower than\n50%. This indicates that HalluQA is highly challenging. 
We analyze the primary\ntypes of hallucinations in different types of models and their causes.\nAdditionally, we discuss which types of hallucinations should be prioritized\nfor different types of models.\n","authors":["Qinyuan Cheng","Tianxiang Sun","Wenwei Zhang","Siyin Wang","Xiangyang Liu","Mozhi Zhang","Junliang He","Mianqiu Huang","Zhangyue Yin","Kai Chen","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.03368v3.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2305.14843v2","updated":"2023-10-24T13:08:27Z","published":"2023-05-24T07:51:42Z","title":"Meta-learning For Vision-and-language Cross-lingual Transfer","summary":" Current pre-trained vision-language models (PVLMs) achieve excellent\nperformance on a range of multi-modal datasets. Recent work has aimed at\nbuilding multilingual models, and a range of novel multilingual multi-modal\ndatasets have been proposed. Current PVLMs typically perform poorly on these\ndatasets when used for multi-modal zero-shot or few-shot cross-lingual\ntransfer, especially for low-resource languages. To alleviate this problem, we\npropose a novel meta-learning fine-tuning framework. Our framework makes\ncurrent PVLMs rapidly adaptive to new languages in vision-language scenarios by\ndesigning MAML in a cross-lingual multi-modal manner. 
Experiments show that our\nmethod boosts the performance of current state-of-the-art PVLMs in both\nzero-shot and few-shot cross-lingual transfer on a range of vision-language\nunderstanding tasks and datasets (XVNLI, xGQA, MaRVL, xFlicker&Co)\n","authors":["Hanxu Hu","Frank Keller"],"pdf_url":"https://arxiv.org/pdf/2305.14843v2.pdf","comment":"MRL2023 (co-located with EMNLP2023)"},{"id":"http://arxiv.org/abs/2303.03387v3","updated":"2023-10-24T12:57:08Z","published":"2023-03-02T17:30:43Z","title":"CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a\n Context Synergized Hyperbolic Network","summary":" The tremendous growth of social media users interacting in online\nconversations has led to significant growth in hate speech, affecting people\nfrom various demographics. Most of the prior works focus on detecting explicit\nhate speech, which is overt and leverages hateful phrases, with very little\nwork focusing on detecting hate speech that is implicit or denotes hatred\nthrough indirect or coded language. In this paper, we present CoSyn, a\ncontext-synergized neural network that explicitly incorporates user- and\nconversational context for detecting implicit hate speech in online\nconversations. CoSyn introduces novel ways to encode these external contexts\nand employs a novel context interaction mechanism that clearly captures the\ninterplay between them, making independent assessments of the amounts of\ninformation to be retrieved from these noisy contexts. Additionally, it carries\nout all these operations in the hyperbolic space to account for the scale-free\ndynamics of social media. 
We demonstrate the effectiveness of CoSyn on 6 hate\nspeech datasets and show that CoSyn outperforms all our baselines in detecting\nimplicit hate speech with absolute improvements in the range of 1.24% - 57.8%.\n","authors":["Sreyan Ghosh","Manan Suri","Purva Chiniya","Utkarsh Tyagi","Sonal Kumar","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2303.03387v3.pdf","comment":"Accepted to EMNLP 2023 Main Conference. Code:\n https://github.com/Sreyan88/CoSyn"},{"id":"http://arxiv.org/abs/2310.01320v3","updated":"2023-10-24T12:51:28Z","published":"2023-10-02T16:27:36Z","title":"Avalon's Game of Thoughts: Battle Against Deception through Recursive\n Contemplation","summary":" Recent breakthroughs in large language models (LLMs) have brought remarkable\nsuccess in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is\nthat the information processed by LLMs is consistently honest, neglecting the\npervasive deceptive or misleading information in human society and AI-generated\ncontent. This oversight makes LLMs susceptible to malicious manipulations,\npotentially resulting in detrimental outcomes. This study utilizes the\nintricate Avalon game as a testbed to explore LLMs' potential in deceptive\nenvironments. Avalon, full of misinformation and requiring sophisticated logic,\nmanifests as a \"Game-of-Thoughts\". Inspired by the efficacy of humans'\nrecursive thinking and perspective-taking in the Avalon game, we introduce a\nnovel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to\nidentify and counteract deceptive information. ReCon combines formulation and\nrefinement contemplation processes; formulation contemplation produces initial\nthoughts and speech, while refinement contemplation further polishes them.\nAdditionally, we incorporate first-order and second-order perspective\ntransitions into these processes respectively. 
Specifically, the first-order\nallows an LLM agent to infer others' mental states, and the second-order\ninvolves understanding how others perceive the agent's mental state. After\nintegrating ReCon with different LLMs, extensive experiment results from the\nAvalon game indicate its efficacy in aiding LLMs to discern and maneuver around\ndeceptive information without extra fine-tuning and data. Finally, we offer a\npossible explanation for the efficacy of ReCon and explore the current\nlimitations of LLMs in terms of safety, reasoning, speaking style, and format,\npotentially furnishing insights for subsequent research.\n","authors":["Shenzhi Wang","Chang Liu","Zilong Zheng","Siyuan Qi","Shuo Chen","Qisen Yang","Andrew Zhao","Chaofei Wang","Shiji Song","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.01320v3.pdf","comment":"40 pages"},{"id":"http://arxiv.org/abs/2310.15799v1","updated":"2023-10-24T12:50:28Z","published":"2023-10-24T12:50:28Z","title":"DALE: Generative Data Augmentation for Low-Resource Legal NLP","summary":" We present DALE, a novel and effective generative Data Augmentation framework\nfor low-resource LEgal NLP. DALE addresses the challenges existing frameworks\npose in generating effective data augmentations of legal documents - legal\nlanguage, with its specialized vocabulary and complex semantics, morphology,\nand syntax, does not benefit from data augmentations that merely rephrase the\nsource sentence. To address this, DALE, built on an Encoder-Decoder Language\nModel, is pre-trained on a novel unsupervised text denoising objective based on\nselective masking - our masking strategy exploits the domain-specific language\ncharacteristics of templatized legal documents to mask collocated spans of\ntext. Denoising these spans helps DALE acquire knowledge about legal concepts,\nprinciples, and language usage. Consequently, it develops the ability to\ngenerate coherent and diverse augmentations with novel contexts. 
Finally, DALE\nperforms conditional generation to generate synthetic augmentations for\nlow-resource Legal NLP tasks. We demonstrate the effectiveness of DALE on 13\ndatasets spanning 6 tasks and 4 low-resource settings. DALE outperforms all our\nbaselines, including LLMs, qualitatively and quantitatively, with improvements\nof 1%-50%.\n","authors":["Sreyan Ghosh","Chandra Kiran Evuru","Sonal Kumar","S Ramaneswaran","S Sakshi","Utkarsh Tyagi","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2310.15799v1.pdf","comment":"Accepted to EMNLP 2023 Main Conference. Code:\n https://github.com/Sreyan88/DALE"},{"id":"http://arxiv.org/abs/2310.15797v1","updated":"2023-10-24T12:48:52Z","published":"2023-10-24T12:48:52Z","title":"Random Entity Quantization for Parameter-Efficient Compositional\n Knowledge Graph Representation","summary":" Representation Learning on Knowledge Graphs (KGs) is essential for downstream\ntasks. The dominant approach, KG Embedding (KGE), represents entities with\nindependent vectors and faces the scalability challenge. Recent studies propose\nan alternative way for parameter efficiency, which represents entities by\ncomposing entity-corresponding codewords matched from predefined small-scale\ncodebooks. We refer to the process of obtaining corresponding codewords of each\nentity as entity quantization, for which previous works have designed\ncomplicated strategies. Surprisingly, this paper shows that simple random\nentity quantization can achieve similar results to current strategies. We\nanalyze this phenomenon and reveal that entity codes, the quantization outcomes\nfor expressing entities, have higher entropy at the code level and Jaccard\ndistance at the codeword level under random entity quantization. Therefore,\ndifferent entities become more easily distinguished, facilitating effective KG\nrepresentation. 
The above results show that current quantization strategies are\nnot critical for KG representation, and there is still room for improvement in\nentity distinguishability beyond current strategies. The code to reproduce our\nresults is available at https://github.com/JiaangL/RandomQuantization.\n","authors":["Jiaang Li","Quan Wang","Yi Liu","Licheng Zhang","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2310.15797v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15793v1","updated":"2023-10-24T12:44:09Z","published":"2023-10-24T12:44:09Z","title":"Improving generalization in large language models by learning prefix\n subspaces","summary":" This article focuses on large language models (LLMs) fine-tuning in the\nscarce data regime (also known as the \"few-shot\" learning setting). We propose\na method to increase the generalization capabilities of LLMs based on neural\nnetwork subspaces. This optimization method, recently introduced in computer\nvision, aims to improve model generalization by identifying wider local optima\nthrough the joint optimization of an entire simplex of models in parameter\nspace. Its adaptation to massive, pretrained transformers, however, poses some\nchallenges. First, their considerable number of parameters makes it difficult\nto train several models jointly, and second, their deterministic parameter\ninitialization schemes make them unfit for the subspace method as originally\nproposed. We show in this paper that \"Parameter Efficient Fine-Tuning\" (PEFT)\nmethods, however, are perfectly compatible with this original approach, and\npropose to learn entire simplex of continuous prefixes. We test our method on a\nvariant of the GLUE benchmark adapted to the few-shot learning setting, and\nshow that both our contributions jointly lead to a gain in average performances\ncompared to sota methods. 
The implementation can be found at the following\nlink: https://github.com/Liloulou/prefix_subspace\n","authors":["Louis Falissard","Vincent Guigue","Laure Soulier"],"pdf_url":"https://arxiv.org/pdf/2310.15793v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11368v4","updated":"2023-10-24T12:30:25Z","published":"2023-10-17T16:05:52Z","title":"VECHR: A Dataset for Explainable and Robust Classification of\n Vulnerability Type in the European Court of Human Rights","summary":" Recognizing vulnerability is crucial for understanding and implementing\ntargeted support to empower individuals in need. This is especially important\nat the European Court of Human Rights (ECtHR), where the court adapts\nConvention standards to meet actual individual needs and thus ensures effective\nhuman rights protection. However, the concept of vulnerability remains elusive\nat the ECtHR and no prior NLP research has dealt with it. To enable future\nresearch in this area, we present VECHR, a novel expert-annotated multi-label\ndataset comprising of vulnerability type classification and explanation\nrationale. We benchmark the performance of state-of-the-art models on VECHR\nfrom both prediction and explainability perspectives. Our results demonstrate\nthe challenging nature of the task with lower prediction performance and\nlimited agreement between models and experts. Further, we analyze the\nrobustness of these models in dealing with out-of-domain (OOD) data and observe\noverall limited performance. Our dataset poses unique challenges offering\nsignificant room for improvement regarding performance, explainability, and\nrobustness.\n","authors":["Shanshan Xu","Leon Staufer","T. Y. S. 
S Santosh","Oana Ichim","Corina Heri","Matthias Grabmair"],"pdf_url":"https://arxiv.org/pdf/2310.11368v4.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11878v4","updated":"2023-10-24T12:28:13Z","published":"2023-10-18T11:04:31Z","title":"From Dissonance to Insights: Dissecting Disagreements in Rationale\n Construction for Case Outcome Classification","summary":" In legal NLP, Case Outcome Classification (COC) must not only be accurate but\nalso trustworthy and explainable. Existing work in explainable COC has been\nlimited to annotations by a single expert. However, it is well-known that\nlawyers may disagree in their assessment of case facts. We hence collect a\nnovel dataset RAVE: Rationale Variation in ECHR, which is obtained from two\nexperts in the domain of international human rights law, for whom we observe\nweak agreement. We study their disagreements and build a two-level\ntask-independent taxonomy, supplemented with COC-specific subcategories. To our\nknowledge, this is the first work in legal NLP that focuses on human label\nvariation. We quantitatively assess different taxonomy categories and find that\ndisagreements mainly stem from underspecification of the legal context, which\nposes challenges given the typically limited granularity and noise in COC\nmetadata. We further assess the explainability of SOTA COC models on RAVE and\nobserve limited agreement between models and experts. Overall, our case study\nreveals hitherto underappreciated complexities in creating benchmark datasets\nin legal NLP that revolve around identifying aspects of a case's facts\nsupposedly relevant to its outcome.\n","authors":["Shanshan Xu","T. Y. S. 
S Santosh","Oana Ichim","Isabella Risini","Barbara Plank","Matthias Grabmair"],"pdf_url":"https://arxiv.org/pdf/2310.11878v4.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15777v1","updated":"2023-10-24T12:22:34Z","published":"2023-10-24T12:22:34Z","title":"MindLLM: Pre-training Lightweight Large Language Model from Scratch,\n Evaluations and Domain Applications","summary":" Large Language Models (LLMs) have demonstrated remarkable performance across\nvarious natural language tasks, marking significant strides towards general\nartificial intelligence. While general artificial intelligence is leveraged by\ndeveloping increasingly large-scale models, there could be another branch to\ndevelop lightweight custom models that better serve certain domains, taking\ninto account the high cost of training and deploying LLMs and the scarcity of\nresources. In this paper, we present MindLLM, a novel series of bilingual\nlightweight large language models, trained from scratch, alleviating such\nburdens by offering models with 1.3 billion and 3 billion parameters. A\nthorough account of experiences accrued during large model development is\ngiven, covering every step of the process, including data construction, model\narchitecture, evaluation, and applications. Such insights are hopefully\nvaluable for fellow academics and developers. MindLLM consistently matches or\nsurpasses the performance of other open-source larger models on some public\nbenchmarks. 
We also introduce an innovative instruction tuning framework\ntailored for smaller models to enhance their capabilities efficiently.\nMoreover, we explore the application of MindLLM in specific vertical domains\nsuch as law and finance, underscoring the agility and adaptability of our\nlightweight models.\n","authors":["Yizhe Yang","Huashan Sun","Jiawei Li","Runheng Liu","Yinghao Li","Yuhang Liu","Heyan Huang","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2310.15777v1.pdf","comment":"Working in progress"},{"id":"http://arxiv.org/abs/2310.15773v1","updated":"2023-10-24T12:18:17Z","published":"2023-10-24T12:18:17Z","title":"BLESS: Benchmarking Large Language Models on Sentence Simplification","summary":" We present BLESS, a comprehensive performance benchmark of the most recent\nstate-of-the-art large language models (LLMs) on the task of text\nsimplification (TS). We examine how well off-the-shelf LLMs can solve this\nchallenging task, assessing a total of 44 models, differing in size,\narchitecture, pre-training methods, and accessibility, on three test sets from\ndifferent domains (Wikipedia, news, and medical) under a few-shot setting. Our\nanalysis considers a suite of automatic metrics as well as a large-scale\nquantitative investigation into the types of common edit operations performed\nby the different models. Furthermore, we perform a manual qualitative analysis\non a subset of model outputs to better gauge the quality of the generated\nsimplifications. Our evaluation indicates that the best LLMs, despite not being\ntrained on TS, perform comparably with state-of-the-art TS baselines.\nAdditionally, we find that certain LLMs demonstrate a greater range and\ndiversity of edit operations. 
Our performance benchmark will be available as a\nresource for the development of future TS methods and evaluation metrics.\n","authors":["Tannon Kew","Alison Chi","Laura Vásquez-Rodríguez","Sweta Agrawal","Dennis Aumiller","Fernando Alva-Manchego","Matthew Shardlow"],"pdf_url":"https://arxiv.org/pdf/2310.15773v1.pdf","comment":"This paper has been accepted to EMNLP 2023 as a main long paper. 9\n pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.11258v2","updated":"2023-10-24T12:01:14Z","published":"2023-10-17T13:23:18Z","title":"Utilizing Weak Supervision To Generate Indonesian Conservation Dataset","summary":" Weak supervision has emerged as a promising approach for rapid and\nlarge-scale dataset creation in response to the increasing demand for\naccelerated NLP development. By leveraging labeling functions, weak supervision\nallows practitioners to generate datasets quickly by creating learned label\nmodels that produce soft-labeled datasets. This paper aims to show how such an\napproach can be utilized to build an Indonesian NLP dataset from conservation\nnews text. We construct two types of datasets: multi-class classification and\nsentiment classification. We then provide baseline experiments using various\npretrained language models. These baseline results demonstrate test\nperformances of 59.79% accuracy and 55.72% F1-score for sentiment\nclassification, 66.87% F1-score-macro, 71.5% F1-score-micro, and 83.67% ROC-AUC\nfor multi-class classification. 
Additionally, we release the datasets and\nlabeling functions used in this work for further research and exploration.\n","authors":["Mega Fransiska","Diah Pitaloka"," Saripudin","Satrio Putra","Lintang Sutawika"],"pdf_url":"https://arxiv.org/pdf/2310.11258v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15758v1","updated":"2023-10-24T12:01:11Z","published":"2023-10-24T12:01:11Z","title":"Learning From Free-Text Human Feedback -- Collect New Datasets Or Extend\n Existing Ones?","summary":" Learning from free-text human feedback is essential for dialog systems, but\nannotated data is scarce and usually covers only a small fraction of error\ntypes known in conversational AI. Instead of collecting and annotating new\ndatasets from scratch, recent advances in synthetic dialog generation could be\nused to augment existing dialog datasets with the necessary annotations.\nHowever, to assess the feasibility of such an effort, it is important to know\nthe types and frequency of free-text human feedback included in these datasets.\nIn this work, we investigate this question for a variety of commonly used\ndialog datasets, including MultiWoZ, SGD, BABI, PersonaChat,\nWizards-of-Wikipedia, and the human-bot split of the Self-Feeding Chatbot.\nUsing our observations, we derive new taxonomies for the annotation of\nfree-text human feedback in dialogs and investigate the impact of including\nsuch data in response generation for three SOTA language generation models,\nincluding GPT-2, LLAMA, and Flan-T5. 
Our findings provide new insights into the\ncomposition of the datasets examined, including error types, user response\ntypes, and the relations between them.\n","authors":["Dominic Petrak","Nafise Sadat Moosavi","Ye Tian","Nikolai Rozanov","Iryna Gurevych"],"pdf_url":"https://arxiv.org/pdf/2310.15758v1.pdf","comment":"Accepted to be presented at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15757v1","updated":"2023-10-24T12:00:59Z","published":"2023-10-24T12:00:59Z","title":"Do Differences in Values Influence Disagreements in Online Discussions?","summary":" Disagreements are common in online discussions. Disagreement may foster\ncollaboration and improve the quality of a discussion under some conditions.\nAlthough there exist methods for recognizing disagreement, a deeper\nunderstanding of factors that influence disagreement is lacking in the\nliterature. We investigate a hypothesis that differences in personal values are\nindicative of disagreement in online discussions. We show how state-of-the-art\nmodels can be used for estimating values in online discussions and how the\nestimated values can be aggregated into value profiles. We evaluate the\nestimated value profiles based on human-annotated agreement labels. We find\nthat the dissimilarity of value profiles correlates with disagreement in\nspecific cases. We also find that including value information in agreement\nprediction improves performance.\n","authors":["Michiel van der Meer","Piek Vossen","Catholijn M. Jonker","Pradeep K. 
Murukannaiah"],"pdf_url":"https://arxiv.org/pdf/2310.15757v1.pdf","comment":"Accepted as main paper at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15752v1","updated":"2023-10-24T11:55:16Z","published":"2023-10-24T11:55:16Z","title":"Integrating Language Models into Direct Speech Translation: An\n Inference-Time Solution to Control Gender Inflection","summary":" When translating words referring to the speaker, speech translation (ST)\nsystems should not resort to default masculine generics nor rely on potentially\nmisleading vocal traits. Rather, they should assign gender according to the\nspeakers' preference. The existing solutions to do so, though effective, are\nhardly feasible in practice as they involve dedicated model re-training on\ngender-labeled ST data. To overcome these limitations, we propose the first\ninference-time solution to control speaker-related gender inflections in ST.\nOur approach partially replaces the (biased) internal language model (LM)\nimplicitly learned by the ST decoder with gender-specific external LMs.\nExperiments on en->es/fr/it show that our solution outperforms the base models\nand the best training-time mitigation strategy by up to 31.0 and 1.6 points in\ngender accuracy, respectively, for feminine forms. The gains are even larger\n(up to 32.0 and 3.4) in the challenging condition where speakers' vocal traits\nconflict with their gender.\n","authors":["Dennis Fucci","Marco Gaido","Sara Papi","Mauro Cettolo","Matteo Negri","Luisa Bentivogli"],"pdf_url":"https://arxiv.org/pdf/2310.15752v1.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15746v1","updated":"2023-10-24T11:40:34Z","published":"2023-10-24T11:40:34Z","title":"Failures Pave the Way: Enhancing Large Language Models through\n Tuning-free Rule Accumulation","summary":" Large Language Models (LLMs) have showcased impressive performance. 
However,\ndue to their inability to capture relationships among samples, these frozen\nLLMs inevitably keep repeating similar mistakes. In this work, we propose our\nTuning-free Rule Accumulation (TRAN) framework, which guides LLMs in improving\ntheir performance by learning from previous mistakes. Considering data arrives\nsequentially, LLMs gradually accumulate rules from incorrect cases, forming a\nrule collection. These rules are then utilized by the LLMs to avoid making\nsimilar mistakes when processing subsequent inputs. Moreover, the rules remain\nindependent of the primary prompts, seamlessly complementing prompt design\nstrategies. Experimentally, we show that TRAN improves over recent baselines by\na large margin.\n","authors":["Zeyuan Yang","Peng Li","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15746v1.pdf","comment":"This paper is accepted by the EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.15743v1","updated":"2023-10-24T11:35:23Z","published":"2023-10-24T11:35:23Z","title":"RAPL: A Relation-Aware Prototype Learning Approach for Few-Shot\n Document-Level Relation Extraction","summary":" How to identify semantic relations among entities in a document when only a\nfew labeled documents are available? Few-shot document-level relation\nextraction (FSDLRE) is crucial for addressing the pervasive data scarcity\nproblem in real-world scenarios. Metric-based meta-learning is an effective\nframework widely adopted for FSDLRE, which constructs class prototypes for\nclassification. However, existing works often struggle to obtain class\nprototypes with accurate relational semantics: 1) To build prototype for a\ntarget relation type, they aggregate the representations of all entity pairs\nholding that relation, while these entity pairs may also hold other relations,\nthus disturbing the prototype. 
2) They use a set of generic NOTA\n(none-of-the-above) prototypes across all tasks, neglecting that the NOTA\nsemantics differs in tasks with different target relation types. In this paper,\nwe propose a relation-aware prototype learning method for FSDLRE to strengthen\nthe relational semantics of prototype representations. By judiciously\nleveraging the relation descriptions and realistic NOTA instances as guidance,\nour method effectively refines the relation prototypes and generates\ntask-specific NOTA prototypes. Extensive experiments demonstrate that our\nmethod outperforms state-of-the-art approaches by average 2.61% $F_1$ across\nvarious settings of two FSDLRE benchmarks.\n","authors":["Shiao Meng","Xuming Hu","Aiwei Liu","Shu'ang Li","Fukun Ma","Yawen Yang","Lijie Wen"],"pdf_url":"https://arxiv.org/pdf/2310.15743v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14724v2","updated":"2023-10-24T11:31:38Z","published":"2023-10-23T09:01:13Z","title":"A Survey on LLM-generated Text Detection: Necessity, Methods, and Future\n Directions","summary":" The powerful ability to understand, follow, and generate complex language\nemerging from large language models (LLMs) makes LLM-generated text flood many\nareas of our daily lives at an incredible speed and is widely accepted by\nhumans. As LLMs continue to expand, there is an imperative need to develop\ndetectors that can detect LLM-generated text. This is crucial to mitigate\npotential misuse of LLMs and safeguard realms like artistic expression and\nsocial networks from harmful influence of LLM-generated content. The\nLLM-generated text detection aims to discern if a piece of text was produced by\nan LLM, which is essentially a binary classification task. 
The detector\ntechniques have witnessed notable advancements recently, propelled by\ninnovations in watermarking techniques, zero-shot methods, fine-tuning LMs\nmethods, adversarial learning methods, LLMs as detectors, and human-assisted\nmethods. In this survey, we collate recent research breakthroughs in this area\nand underscore the pressing need to bolster detector research. We also delve\ninto prevalent datasets, elucidating their limitations and developmental\nrequirements. Furthermore, we analyze various LLM-generated text detection\nparadigms, shedding light on challenges like out-of-distribution problems,\npotential attacks, and data ambiguity. Conclusively, we highlight interesting\ndirections for future research in LLM-generated text detection to advance the\nimplementation of responsible artificial intelligence (AI). Our aim with this\nsurvey is to provide a clear and comprehensive introduction for newcomers while\nalso offering seasoned researchers a valuable update in the field of\nLLM-generated text detection. The useful resources are publicly available at:\nhttps://github.com/NLP2CT/LLM-generated-Text-Detection.\n","authors":["Junchao Wu","Shu Yang","Runzhe Zhan","Yulin Yuan","Derek F. Wong","Lidia S. Chao"],"pdf_url":"https://arxiv.org/pdf/2310.14724v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14616v2","updated":"2023-10-24T11:30:07Z","published":"2023-05-24T01:30:50Z","title":"Exploring Affordance and Situated Meaning in Image Captions: A\n Multimodal Analysis","summary":" This paper explores the grounding issue regarding multimodal semantic\nrepresentation from a computational cognitive-linguistic view. We annotate\nimages from the Flickr30k dataset with five perceptual properties: Affordance,\nPerceptual Salience, Object Number, Gaze Cueing, and Ecological Niche\nAssociation (ENA), and examine their association with textual elements in the\nimage captions. 
Our findings reveal that images with Gibsonian affordance show\na higher frequency of captions containing 'holding-verbs' and 'container-nouns'\ncompared to images displaying telic affordance. Perceptual Salience, Object\nNumber, and ENA are also associated with the choice of linguistic expressions.\nOur study demonstrates that comprehensive understanding of objects or events\nrequires cognitive attention, semantic nuances in language, and integration\nacross multiple modalities. We highlight the vital importance of situated\nmeaning and affordance grounding in natural language understanding, with the\npotential to advance human-like interpretation in various scenarios.\n","authors":["Pin-Er Chen","Po-Ya Angela Wang","Hsin-Yu Chou","Yu-Hsiang Tseng","Shu-Kai Hsieh"],"pdf_url":"https://arxiv.org/pdf/2305.14616v2.pdf","comment":"10 pages, 9 figures"},{"id":"http://arxiv.org/abs/2305.00586v4","updated":"2023-10-24T11:09:21Z","published":"2023-04-30T21:44:21Z","title":"How does GPT-2 compute greater-than?: Interpreting mathematical\n abilities in a pre-trained language model","summary":" Pre-trained language models can be surprisingly adept at tasks they were not\nexplicitly trained on, but how they implement these capabilities is poorly\nunderstood. In this paper, we investigate the basic mathematical abilities\noften acquired by pre-trained language models. Concretely, we use mechanistic\ninterpretability techniques to explain the (limited) mathematical abilities of\nGPT-2 small. As a case study, we examine its ability to take in sentences such\nas \"The war lasted from the year 1732 to the year 17\", and predict valid\ntwo-digit end years (years > 32). We first identify a circuit, a small subset\nof GPT-2 small's computational graph that computes this task's output. Then, we\nexplain the role of each circuit component, showing that GPT-2 small's final\nmulti-layer perceptrons boost the probability of end years greater than the\nstart year. 
Finally, we find related tasks that activate our circuit. Our\nresults suggest that GPT-2 small computes greater-than using a complex but\ngeneral mechanism that activates across diverse contexts.\n","authors":["Michael Hanna","Ollie Liu","Alexandre Variengien"],"pdf_url":"https://arxiv.org/pdf/2305.00586v4.pdf","comment":"NeurIPS 2023 Camera Ready Version"},{"id":"http://arxiv.org/abs/2310.15724v1","updated":"2023-10-24T11:00:07Z","published":"2023-10-24T11:00:07Z","title":"Variator: Accelerating Pre-trained Models with Plug-and-Play Compression\n Modules","summary":" Pre-trained language models (PLMs) have achieved remarkable results on NLP\ntasks but at the expense of huge parameter sizes and the consequent\ncomputational costs. In this paper, we propose Variator, a parameter-efficient\nacceleration method that enhances computational efficiency through\nplug-and-play compression plugins. Compression plugins are designed to reduce\nthe sequence length via compressing multiple hidden vectors into one and\ntrained with original PLMs frozen. Different from traditional model\nacceleration methods, which compress PLMs to smaller sizes, Variator offers two\ndistinct advantages: (1) In real-world applications, the plug-and-play nature\nof our compression plugins enables dynamic selection of different compression\nplugins with varying acceleration ratios based on the current workload. (2) The\ncompression plugin comprises a few compact neural network layers with minimal\nparameters, significantly saving storage and memory overhead, particularly in\nscenarios with a growing number of tasks. We validate the effectiveness of\nVariator on seven datasets. Experimental results show that Variator can save\n53% computational costs using only 0.9% additional parameters with a\nperformance drop of less than 2%. 
Moreover, when the model scales to billions\nof parameters, Variator matches the strong performance of uncompressed PLMs.\n","authors":["Chaojun Xiao","Yuqi Luo","Wenbin Zhang","Pengle Zhang","Xu Han","Yankai Lin","Zhengyan Zhang","Ruobing Xie","Zhiyuan Liu","Maosong Sun","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.15724v1.pdf","comment":"Accepted by Findings of EMNLP"},{"id":"http://arxiv.org/abs/2305.12962v2","updated":"2023-10-24T10:59:50Z","published":"2023-05-22T12:11:39Z","title":"Distilling ChatGPT for Explainable Automated Student Answer Assessment","summary":" Providing explainable and faithful feedback is crucial for automated student\nanswer assessment. In this paper, we introduce a novel framework that explores\nusing ChatGPT, a cutting-edge large language model, for the concurrent tasks of\nstudent answer scoring and rationale generation. We identify the appropriate\ninstructions by prompting ChatGPT with different templates to collect the\nrationales, where inconsistent rationales are refined to align with marking\nstandards. The refined ChatGPT outputs enable us to fine-tune a smaller\nlanguage model that simultaneously assesses student answers and provides\nrationales. Extensive experiments on the benchmark dataset show that the\nproposed method improves the overall QWK score by 11% compared to ChatGPT.\nFurthermore, our thorough analysis and human evaluation demonstrate that the\nrationales generated by our proposed method are comparable to those of ChatGPT.\nOur approach provides a viable solution to achieve explainable automated\nassessment in education. 
Code available at\nhttps://github.com/lijiazheng99/aera.\n","authors":["Jiazheng Li","Lin Gui","Yuxiang Zhou","David West","Cesare Aloisi","Yulan He"],"pdf_url":"https://arxiv.org/pdf/2305.12962v2.pdf","comment":"Accepted EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15722v1","updated":"2023-10-24T10:58:33Z","published":"2023-10-24T10:58:33Z","title":"Re-Temp: Relation-Aware Temporal Representation Learning for Temporal\n Knowledge Graph Completion","summary":" Temporal Knowledge Graph Completion (TKGC) under the extrapolation setting\naims to predict the missing entity from a fact in the future, posing a\nchallenge that aligns more closely with real-world prediction problems.\nExisting research mostly encodes entities and relations using sequential graph\nneural networks applied to recent snapshots. However, these approaches tend to\noverlook the ability to skip irrelevant snapshots according to entity-related\nrelations in the query and disregard the importance of explicit temporal\ninformation. To address this, we propose our model, Re-Temp (Relation-Aware\nTemporal Representation Learning), which leverages explicit temporal embedding\nas input and incorporates skip information flow after each timestamp to skip\nunnecessary information for prediction. Additionally, we introduce a two-phase\nforward propagation method to prevent information leakage. 
Through the\nevaluation on six TKGC (extrapolation) datasets, we demonstrate that our model\noutperforms all eight recent state-of-the-art models by a significant margin.\n","authors":["Kunze Wang","Soyeon Caren Han","Josiah Poon"],"pdf_url":"https://arxiv.org/pdf/2310.15722v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15720v1","updated":"2023-10-24T10:52:41Z","published":"2023-10-24T10:52:41Z","title":"Ensemble of Task-Specific Language Models for Brain Encoding","summary":" Language models have been shown to be rich enough to encode fMRI activations\nof certain Regions of Interest in our Brains. Previous works have explored\ntransfer learning from representations learned for popular natural language\nprocessing tasks for predicting brain responses. In our work, we improve the\nperformance of such encoders by creating an ensemble model out of 10 popular\nLanguage Models (2 syntactic and 8 semantic). We beat the current baselines by\n10% on average across all ROIs through our ensembling methods.\n","authors":["Sanjai Kumaran","Arvindh Arun","Jerrin John"],"pdf_url":"https://arxiv.org/pdf/2310.15720v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.10548v2","updated":"2023-10-24T10:49:58Z","published":"2022-12-20T18:51:48Z","title":"T-Projection: High Quality Annotation Projection for Sequence Labeling\n Tasks","summary":" In the absence of readily available labeled data for a given sequence\nlabeling task and language, annotation projection has been proposed as one of\nthe possible strategies to automatically generate annotated data. Annotation\nprojection has often been formulated as the task of transporting, on parallel\ncorpora, the labels pertaining to a given span in the source language into its\ncorresponding span in the target language. 
In this paper we present\nT-Projection, a novel approach for annotation projection that leverages large\npretrained text-to-text language models and state-of-the-art machine\ntranslation technology. T-Projection decomposes the label projection task into\ntwo subtasks: (i) A candidate generation step, in which a set of projection\ncandidates using a multilingual T5 model is generated and, (ii) a candidate\nselection step, in which the generated candidates are ranked based on\ntranslation probabilities. We conducted experiments on intrinsic and extrinsic\ntasks in 5 Indo-European and 8 low-resource African languages. We demonstrate\nthat T-Projection outperforms previous annotation projection methods by a wide\nmargin. We believe that T-Projection can help to automatically alleviate the\nlack of high-quality training data for sequence labeling tasks. Code and data\nare publicly available.\n","authors":["Iker García-Ferrero","Rodrigo Agerri","German Rigau"],"pdf_url":"https://arxiv.org/pdf/2212.10548v2.pdf","comment":"Findings of the EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15702v1","updated":"2023-10-24T10:25:21Z","published":"2023-10-24T10:25:21Z","title":"Enhancing Biomedical Lay Summarisation with External Knowledge Graphs","summary":" Previous approaches for automatic lay summarisation are exclusively reliant\non the source article that, given it is written for a technical audience (e.g.,\nresearchers), is unlikely to explicitly define all technical concepts or state\nall of the background information that is relevant for a lay audience. We\naddress this issue by augmenting eLife, an existing biomedical lay\nsummarisation dataset, with article-specific knowledge graphs, each containing\ndetailed information on relevant biomedical concepts. 
Using both automatic and\nhuman evaluations, we systematically investigate the effectiveness of three\ndifferent approaches for incorporating knowledge graphs within lay\nsummarisation models, with each method targeting a distinct area of the\nencoder-decoder model architecture. Our results confirm that integrating\ngraph-based domain knowledge can significantly benefit lay summarisation by\nsubstantially increasing the readability of generated text and improving the\nexplanation of technical concepts.\n","authors":["Tomas Goldsack","Zhihao Zhang","Chen Tang","Carolina Scarton","Chenghua Lin"],"pdf_url":"https://arxiv.org/pdf/2310.15702v1.pdf","comment":"Accepted to the EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2305.13034v2","updated":"2023-10-24T10:22:05Z","published":"2023-05-22T13:38:53Z","title":"Nearest Neighbor Machine Translation is Meta-Optimizer on Output\n Projection Layer","summary":" Nearest Neighbor Machine Translation ($k$NN-MT) has achieved great success in\ndomain adaptation tasks by integrating pre-trained Neural Machine Translation\n(NMT) models with domain-specific token-level retrieval. However, the reasons\nunderlying its success have not been thoroughly investigated. In this paper, we\ncomprehensively analyze $k$NN-MT through theoretical and empirical studies.\nInitially, we provide new insights into the working mechanism of $k$NN-MT as an\nefficient technique to implicitly execute gradient descent on the output\nprojection layer of NMT, indicating that it is a specific case of model\nfine-tuning. Subsequently, we conduct multi-domain experiments and word-level\nanalysis to examine the differences in performance between $k$NN-MT and\nentire-model fine-tuning. 
Our findings suggest that: (1) Incorporating $k$NN-MT\nwith adapters yields comparable translation performance to fine-tuning on\nin-domain test sets, while achieving better performance on out-of-domain test\nsets; (2) Fine-tuning significantly outperforms $k$NN-MT on the recall of\nin-domain low-frequency words, but this gap could be bridged by optimizing the\ncontext representations with additional adapter layers.\n","authors":["Ruize Gao","Zhirui Zhang","Yichao Du","Lemao Liu","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2305.13034v2.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2310.15694v1","updated":"2023-10-24T10:05:32Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. 
Our experimental results show that COPF\noutperforms strong Continuous learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15693v1","updated":"2023-10-24T10:03:27Z","published":"2023-10-24T10:03:27Z","title":"Towards Automated Recipe Genre Classification using Semi-Supervised\n Learning","summary":" Sharing cooking recipes is a great way to exchange culinary ideas and provide\ninstructions for food preparation. However, categorizing raw recipes found\nonline into appropriate food genres can be challenging due to a lack of\nadequate labeled data. In this study, we present a dataset named the\n``Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking\nRecipe Dataset\" that contains two million culinary recipes labeled in\nrespective categories with extended named entities extracted from recipe\ndescriptions. This collection of data includes various features such as title,\nNER, directions, and extended NER, as well as nine different labels\nrepresenting genres including bakery, drinks, non-veg, vegetables, fast food,\ncereals, meals, sides, and fusions. The proposed pipeline named 3A2M+ extends\nthe size of the Named Entity Recognition (NER) list to address missing named\nentities like heat, time or process from the recipe directions using two NER\nextraction tools. 3A2M+ dataset provides a comprehensive solution to the\nvarious challenging recipe-related tasks, including classification, named\nentity recognition, and recipe generation. Furthermore, we have demonstrated\ntraditional machine learning, deep learning and pre-trained language models to\nclassify the recipes into their corresponding genre and achieved an overall\naccuracy of 98.6\\%. 
Our investigation indicates that the title feature played a\nmore significant role in classifying the genre.\n","authors":["Nazmus Sakib","G. M. Shahariar","Md. Mohsinul Kabir","Md. Kamrul Hasan","Hasan Mahmud"],"pdf_url":"https://arxiv.org/pdf/2310.15693v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15689v1","updated":"2023-10-24T10:00:56Z","published":"2023-10-24T10:00:56Z","title":"Creating a silver standard for patent simplification","summary":" Patents are legal documents that aim at protecting inventions on the one hand\nand at making technical knowledge circulate on the other. Their complex style\n-- a mix of legal, technical, and extremely vague language -- makes their\ncontent hard to access for humans and machines and poses substantial challenges\nto the information retrieval community. This paper proposes an approach to\nautomatically simplify patent text through rephrasing. Since no in-domain\nparallel simplification data exist, we propose a method to automatically\ngenerate a large-scale silver standard for patent sentences. To obtain\ncandidates, we use a general-domain paraphrasing system; however, the process\nis error-prone and difficult to control. Thus, we pair it with proper filters\nand construct a cleaner corpus that can successfully be used to train a\nsimplification system. Human evaluation of the synthetic silver corpus shows\nthat it is considered grammatical, adequate, and contains simple sentences.\n","authors":["Silvia Casola","Alberto Lavelli","Horacio Saggion"],"pdf_url":"https://arxiv.org/pdf/2310.15689v1.pdf","comment":"This paper has been published at SIGIR 2023"},{"id":"http://arxiv.org/abs/2305.12920v2","updated":"2023-10-24T09:58:10Z","published":"2023-05-22T11:08:00Z","title":"A Diachronic Analysis of Paradigm Shifts in NLP Research: When, How, and\n Why?","summary":" Understanding the fundamental concepts and trends in a scientific field is\ncrucial for keeping abreast of its continuous advancement. 
In this study, we\npropose a systematic framework for analyzing the evolution of research topics\nin a scientific field using causal discovery and inference techniques. We\ndefine three variables to encompass diverse facets of the evolution of research\ntopics within NLP and utilize a causal discovery algorithm to unveil the causal\nconnections among these variables using observational data. Subsequently, we\nleverage this structure to measure the intensity of these relationships. By\nconducting extensive experiments on the ACL Anthology corpus, we demonstrate\nthat our framework effectively uncovers evolutionary trends and the underlying\ncauses for a wide range of NLP research topics. Specifically, we show that\ntasks and methods are primary drivers of research in NLP, with datasets\nfollowing, while metrics have minimal impact.\n","authors":["Aniket Pramanick","Yufang Hou","Saif M. Mohammad","Iryna Gurevych"],"pdf_url":"https://arxiv.org/pdf/2305.12920v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15077v2","updated":"2023-10-24T09:56:46Z","published":"2023-05-24T11:56:21Z","title":"Contrastive Learning of Sentence Embeddings from Scratch","summary":" Contrastive learning has been the dominant approach to train state-of-the-art\nsentence embeddings. Previous studies have typically learned sentence\nembeddings either through the use of human-annotated natural language inference\n(NLI) data or via large-scale unlabeled sentences in an unsupervised manner.\nHowever, even in the case of unlabeled data, their acquisition presents\nchallenges in certain domains due to various reasons. To address these issues,\nwe present SynCSE, a contrastive learning framework that trains sentence\nembeddings with synthesized data. 
Specifically, we explore utilizing large\nlanguage models to synthesize the required data samples for contrastive\nlearning, including (1) producing positive and negative annotations given\nunlabeled sentences (SynCSE-partial), and (2) generating sentences along with\ntheir corresponding annotations from scratch (SynCSE-scratch). Experimental\nresults on sentence similarity and reranking tasks indicate that both\nSynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines,\nand SynCSE-partial even achieves comparable performance to the supervised\nmodels in most settings.\n","authors":["Junlei Zhang","Zhenzhong Lan","Junxian He"],"pdf_url":"https://arxiv.org/pdf/2305.15077v2.pdf","comment":"Emnlp 2023"},{"id":"http://arxiv.org/abs/2310.15684v1","updated":"2023-10-24T09:56:46Z","published":"2023-10-24T09:56:46Z","title":"Improving Biomedical Abstractive Summarisation with Knowledge\n Aggregation from Citation Papers","summary":" Abstracts derived from biomedical literature possess distinct domain-specific\ncharacteristics, including specialised writing styles and biomedical\nterminologies, which necessitate a deep understanding of the related\nliterature. As a result, existing language models struggle to generate\ntechnical summaries that are on par with those produced by biomedical experts,\ngiven the absence of domain-specific background knowledge. This paper aims to\nenhance the performance of language models in biomedical abstractive\nsummarisation by aggregating knowledge from external papers cited within the\nsource article. We propose a novel attention-based citation aggregation model\nthat integrates domain-specific knowledge from citation papers, allowing neural\nnetworks to generate summaries by leveraging both the paper content and\nrelevant knowledge from citation papers. Furthermore, we construct and release\na large-scale biomedical summarisation dataset that serves as a foundation for\nour research. 
Extensive experiments demonstrate that our model outperforms\nstate-of-the-art approaches and achieves substantial improvements in\nabstractive biomedical text summarisation.\n","authors":["Chen Tang","Shun Wang","Tomas Goldsack","Chenghua Lin"],"pdf_url":"https://arxiv.org/pdf/2310.15684v1.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15683v1","updated":"2023-10-24T09:52:09Z","published":"2023-10-24T09:52:09Z","title":"Prevalence and prevention of large language model use in crowd work","summary":" We show that the use of large language models (LLMs) is prevalent among crowd\nworkers, and that targeted mitigation strategies can significantly reduce, but\nnot eliminate, LLM use. On a text summarization task where workers were not\ndirected in any way regarding their LLM use, the estimated prevalence of LLM\nuse was around 30%, but was reduced by about half by asking workers to not use\nLLMs and by raising the cost of using them, e.g., by disabling copy-pasting.\nSecondary analyses give further insight into LLM use and its prevention: LLM\nuse yields high-quality but homogeneous responses, which may harm research\nconcerned with human (rather than model) behavior and degrade future models\ntrained with crowdsourced data. At the same time, preventing LLM use may be at\nodds with obtaining high-quality responses; e.g., when requesting workers not\nto use LLMs, summaries contained fewer keywords carrying essential information.\nOur estimates will likely change as LLMs increase in popularity or\ncapabilities, and as norms around their usage change. 
Yet, understanding the\nco-evolution of LLM-based tools and users is key to maintaining the validity of\nresearch done using crowdsourcing, and we provide a critical baseline before\nwidespread adoption ensues.\n","authors":["Veniamin Veselovsky","Manoel Horta Ribeiro","Philip Cozzolino","Andrew Gordon","David Rothschild","Robert West"],"pdf_url":"https://arxiv.org/pdf/2310.15683v1.pdf","comment":"VV and MHR equal contribution. 14 pages, 1 figure, 1 table"},{"id":"http://arxiv.org/abs/2310.15672v1","updated":"2023-10-24T09:31:03Z","published":"2023-10-24T09:31:03Z","title":"How Much Context Does My Attention-Based ASR System Need?","summary":" For the task of speech recognition, the use of more than 30 seconds of\nacoustic context during training is uncommon, and under-investigated in\nliterature. In this work, we examine the effect of scaling the sequence length\nused to train/evaluate (dense-attention based) acoustic and language models on\nspeech recognition performance. For these experiments a dataset of roughly\n100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5\nseconds to 1 hour being explored. Zero-shot evaluations on long-format datasets\nEarnings-22 and Tedlium demonstrate a benefit from training with around 80\nseconds of acoustic context, showing up to a 14.9% relative improvement from a\nlimited context baseline. Furthermore, we perform a system combination with\nlong-context transformer language models via beam search for a fully\nlong-context ASR system, with results that are competitive with the current\nstate-of-the-art.\n","authors":["Robert Flynn","Anton Ragni"],"pdf_url":"https://arxiv.org/pdf/2310.15672v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15664v1","updated":"2023-10-24T09:23:57Z","published":"2023-10-24T09:23:57Z","title":"Expression Syntax Information Bottleneck for Math Word Problems","summary":" Math Word Problems (MWP) aims to automatically solve mathematical questions\ngiven in texts. 
Previous studies tend to design complex models to capture\nadditional information in the original text so as to enable the model to gain\nmore comprehensive features. In this paper, we turn our attention in the\nopposite direction, and work on how to discard redundant features containing\nspurious correlations for MWP. To this end, we design an Expression Syntax\nInformation Bottleneck method for MWP (called ESIB) based on variational\ninformation bottleneck, which extracts essential features of expression syntax\ntree while filtering latent-specific redundancy containing syntax-irrelevant\nfeatures. The key idea of ESIB is to encourage multiple models to predict the\nsame expression syntax tree for different problem representations of the same\nproblem by mutual learning so as to capture consistent information of\nexpression syntax tree and discard latent-specific redundancy. To improve the\ngeneralization ability of the model and generate more diverse expressions, we\ndesign a self-distillation loss to encourage the model to rely more on the\nexpression syntax information in the latent space. Experimental results on two\nlarge-scale benchmarks show that our model not only achieves state-of-the-art\nresults but also generates more diverse solutions. The code is available.\n","authors":["Jing Xiong","Chengming Li","Min Yang","Xiping Hu","Bin Hu"],"pdf_url":"https://arxiv.org/pdf/2310.15664v1.pdf","comment":"This paper has been accepted by SIGIR 2022. The code can be found at\n https://github.com/menik1126/math_ESIB"},{"id":"http://arxiv.org/abs/2310.15654v1","updated":"2023-10-24T09:10:26Z","published":"2023-10-24T09:10:26Z","title":"A Survey on Detection of LLMs-Generated Content","summary":" The burgeoning capabilities of advanced large language models (LLMs) such as\nChatGPT have led to an increase in synthetic content generation with\nimplications across a variety of sectors, including media, cybersecurity,\npublic discourse, and education. 
As such, the ability to detect LLMs-generated\ncontent has become of paramount importance. We aim to provide a detailed\noverview of existing detection strategies and benchmarks, scrutinizing their\ndifferences and identifying key challenges and prospects in the field,\nadvocating for more adaptable and robust models to enhance detection accuracy.\nWe also posit the necessity for a multi-faceted approach to defend against\nvarious attacks to counter the rapidly advancing capabilities of LLMs. To the\nbest of our knowledge, this work is the first comprehensive survey on the\ndetection in the era of LLMs. We hope it will provide a broad understanding of\nthe current landscape of LLMs-generated content detection, offering a guiding\nreference for researchers and practitioners striving to uphold the integrity of\ndigital information in an era increasingly dominated by synthetic content. The\nrelevant papers are summarized and will be consistently updated at\nhttps://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git.\n","authors":["Xianjun Yang","Liangming Pan","Xuandong Zhao","Haifeng Chen","Linda Petzold","William Yang Wang","Wei Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.15654v1.pdf","comment":"We will keep updating at\n https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git"},{"id":"http://arxiv.org/abs/2310.15638v1","updated":"2023-10-24T08:56:49Z","published":"2023-10-24T08:56:49Z","title":"CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large\n Language Models for Data Annotation","summary":" Annotated data plays a critical role in Natural Language Processing (NLP) in\ntraining models and evaluating their performance. Given recent developments in\nLarge Language Models (LLMs), models such as ChatGPT demonstrate zero-shot\ncapability on many text-annotation tasks, comparable with or even exceeding\nhuman annotators. Such LLMs can serve as alternatives for manual annotation,\ndue to lower costs and higher scalability. 
However, limited work has leveraged\nLLMs as complementary annotators, nor explored how annotation work is best\nallocated among humans and LLMs to achieve both quality and cost objectives. We\npropose CoAnnotating, a novel paradigm for Human-LLM co-annotation of\nunstructured texts at scale. Under this framework, we utilize uncertainty to\nestimate LLMs' annotation capability. Our empirical study shows CoAnnotating to\nbe an effective means to allocate work from results on different datasets, with\nup to 21% performance improvement over random baseline. For code\nimplementation, see https://github.com/SALT-NLP/CoAnnotating.\n","authors":["Minzhi Li","Taiwei Shi","Caleb Ziems","Min-Yen Kan","Nancy F. Chen","Zhengyuan Liu","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.15638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15636v1","updated":"2023-10-24T08:56:06Z","published":"2023-10-24T08:56:06Z","title":"Career Path Prediction using Resume Representation Learning and\n Skill-based Matching","summary":" The impact of person-job fit on job satisfaction and performance is widely\nacknowledged, which highlights the importance of providing workers with next\nsteps at the right time in their career. This task of predicting the next step\nin a career is known as career path prediction, and has diverse applications\nsuch as turnover prevention and internal job mobility. Existing methods to\ncareer path prediction rely on large amounts of private career history data to\nmodel the interactions between job titles and companies. We propose leveraging\nthe unexplored textual descriptions that are part of work experience sections\nin resumes. We introduce a structured dataset of 2,164 anonymized career\nhistories, annotated with ESCO occupation labels. Based on this dataset, we\npresent a novel representation learning approach, CareerBERT, specifically\ndesigned for work history data. 
We develop a skill-based model and a text-based\nmodel for career path prediction, which achieve 35.24% and 39.61% recall@10\nrespectively on our dataset. Finally, we show that both approaches are\ncomplementary as a hybrid approach achieves the strongest result with 43.01%\nrecall@10.\n","authors":["Jens-Joris Decorte","Jeroen Van Hautte","Johannes Deleu","Chris Develder","Thomas Demeester"],"pdf_url":"https://arxiv.org/pdf/2310.15636v1.pdf","comment":"Accepted to the 3rd Workshop on Recommender Systems for Human\n Resources (RecSys in HR 2023) as part of RecSys 2023"},{"id":"http://arxiv.org/abs/2310.15632v1","updated":"2023-10-24T08:54:23Z","published":"2023-10-24T08:54:23Z","title":"Tips for making the most of 64-bit architectures in langage design,\n libraries or garbage collection","summary":" The 64-bit architectures that have become standard today offer unprecedented\nlow-level programming possibilities. For the first time in the history of\ncomputing, the size of address registers far exceeded the physical capacity of\ntheir bus. After a brief reminder of the possibilities offered by the small size\nof addresses compared to the available 64 bits, we develop three concrete\nexamples of how the vacant bits of these registers can be used. Among these\nexamples, two of them concern the implementation of a library for a new\nstatically typed programming language. Firstly, the implementation of\nmulti-precision integers, with the aim of improving performance in terms of\nboth calculation speed and RAM savings. The second example focuses on the\nlibrary's handling of UTF-8 character strings. Here, the idea is to make\nindexing easier by ignoring the physical size of each UTF-8 character. Finally,\nthe third example is a possible enhancement of garbage collectors, in\nparticular the mark \\& sweep for the object marking phase.\n","authors":["Benoît Sonntag","Dominique 
Colnet"],"pdf_url":"https://arxiv.org/pdf/2310.15632v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15612v1","updated":"2023-10-24T08:27:56Z","published":"2023-10-24T08:27:56Z","title":"Machine Translation for Nko: Tools, Corpora and Baseline Results","summary":" Currently, there is no usable machine translation system for Nko, a language\nspoken by tens of millions of people across multiple West African countries,\nwhich holds significant cultural and educational value. To address this issue,\nwe present a set of tools, resources, and baseline results aimed towards the\ndevelopment of usable machine translation systems for Nko and other languages\nthat do not currently have sufficiently large parallel text corpora available.\n(1) Friallel: A novel collaborative parallel text curation software that\nincorporates quality control through copyedit-based workflows. (2) Expansion of\nthe FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko\ntranslations in parallel with 204 and 40 other languages. (3) nicolingua-0005:\nA collection of trilingual and bilingual corpora with 130,850 parallel segments\nand monolingual corpora containing over 3 million Nko words. (4) Baseline\nbilingual and multilingual neural machine translation results with the best\nmodel scoring 30.83 English-Nko chrF++ on FLoRes-devtest.\n","authors":["Moussa Koulako Bala Doumbouya","Baba Mamadi Diané","Solo Farabado Cissé","Djibrila Diané","Abdoulaye Sow","Séré Moussa Doumbouya","Daouda Bangoura","Fodé Moriba Bayo","Ibrahima Sory 2. Condé","Kalo Mory Diané","Chris Piech","Christopher Manning"],"pdf_url":"https://arxiv.org/pdf/2310.15612v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15602v1","updated":"2023-10-24T08:17:11Z","published":"2023-10-24T08:17:11Z","title":"MUSER: A Multi-View Similar Case Retrieval Dataset","summary":" Similar case retrieval (SCR) is a representative legal AI application that\nplays a pivotal role in promoting judicial fairness. 
However, existing SCR\ndatasets only focus on the fact description section when judging the similarity\nbetween cases, ignoring other valuable sections (e.g., the court's opinion)\nthat can provide insightful reasoning process behind. Furthermore, the case\nsimilarities are typically measured solely by the textual semantics of the fact\ndescriptions, which may fail to capture the full complexity of legal cases from\nthe perspective of legal knowledge. In this work, we present MUSER, a similar\ncase retrieval dataset based on multi-view similarity measurement and\ncomprehensive legal element with sentence-level legal element annotations.\nSpecifically, we select three perspectives (legal fact, dispute focus, and law\nstatutory) and build a comprehensive and structured label schema of legal\nelements for each of them, to enable accurate and knowledgeable evaluation of\ncase similarities. The constructed dataset originates from Chinese civil cases\nand contains 100 query cases and 4,024 candidate cases. We implement several\ntext classification algorithms for legal element prediction and various\nretrieval methods for retrieving similar cases on MUSER. The experimental\nresults indicate that incorporating legal elements can benefit the performance\nof SCR models, but further efforts are still required to address the remaining\nchallenges posed by MUSER. 
The source code and dataset are released at\nhttps://github.com/THUlawtech/MUSER.\n","authors":["Qingquan Li","Yiran Hu","Feng Yao","Chaojun Xiao","Zhiyuan Liu","Maosong Sun","Weixing Shen"],"pdf_url":"https://arxiv.org/pdf/2310.15602v1.pdf","comment":"Accepted by CIKM 2023 Resource Track"},{"id":"http://arxiv.org/abs/2305.13627v2","updated":"2023-10-24T08:08:33Z","published":"2023-05-23T02:51:34Z","title":"InstructAlign: High-and-Low Resource Language Alignment via Continual\n Crosslingual Instruction Tuning","summary":" Large language models (LLMs) that are tuned with instructions have\ndemonstrated remarkable capabilities in various tasks and languages. However,\ntheir ability to generalize to underrepresented languages is limited due to the\nscarcity of available data. Additionally, directly adapting new languages to\ninstruction-tuned LLMs can result in catastrophic forgetting, which leads to\nthe loss of multitasking ability. To address this issue, we propose\nInstructAlign which uses continual crosslingual instruction tuning to enable\nLLMs to align new unseen languages with previously learned high-resource\nlanguages. Our results demonstrate the effectiveness of InstructAlign in\nenabling the model to understand low-resource languages with limited parallel\ndata while preventing catastrophic forgetting. Our work contributes to the\nadvancement of language adaptation methods, particularly for adapting\ninstruction-tuned LLMs to underrepresented languages. 
Our code is released on\nhttps://github.com/HLTCHKUST/InstructAlign\n","authors":["Samuel Cahyawijaya","Holy Lovenia","Tiezheng Yu","Willy Chung","Pascale Fung"],"pdf_url":"https://arxiv.org/pdf/2305.13627v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.02490v3","updated":"2023-10-24T07:59:31Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models. 
Code and data are\navailable at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v3.pdf","comment":"Add results of GPT-4V. Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2301.08721v2","updated":"2023-10-24T07:58:35Z","published":"2023-01-19T02:29:23Z","title":"Batch Prompting: Efficient Inference with Large Language Model APIs","summary":" Performing inference on large volumes of samples with large language models\n(LLMs) can be computationally and financially costly in industry and real-world\nuse. We propose batch prompting, a simple yet effective prompting approach that\nenables the LLM to run inference in batches, instead of one sample at a time.\nOur method reduces both token and time costs while retaining downstream\nperformance. We theoretically demonstrate that under a few-shot in-context\nlearning setting, the inference costs decrease almost inverse linearly with the\nnumber of samples in each batch. We extensively validate the effectiveness of\nbatch prompting on ten datasets across commonsense QA, arithmetic reasoning,\nand NLI/NLU: batch prompting significantly~(up to 5x with six samples in batch)\nreduces the LLM (Codex) inference token and time costs while achieving better\nor comparable performance. For state-of-the-art Chat-based LLMs, e.g., GPT-3.5\nand GPT-4, we show the benefits of batch prompting also hold. Further analysis\nshows that the number of samples in each batch and the complexity of tasks\naffect its performance. Moreover, batch prompting can be applied across\ndifferent reasoning methods using LLMs. 
Our code can be found at the site\nhttps://github.com/xlang-ai/batch-prompting.\n","authors":["Zhoujun Cheng","Jungo Kasai","Tao Yu"],"pdf_url":"https://arxiv.org/pdf/2301.08721v2.pdf","comment":"EMNLP 2023 Industry Track"},{"id":"http://arxiv.org/abs/2310.15594v1","updated":"2023-10-24T07:58:20Z","published":"2023-10-24T07:58:20Z","title":"Retrieval-based Knowledge Transfer: An Effective Approach for Extreme\n Large Language Model Compression","summary":" Large-scale pre-trained language models (LLMs) have demonstrated exceptional\nperformance in various natural language processing (NLP) tasks. However, the\nmassive size of these models poses huge challenges for their deployment in\nreal-world applications. While numerous model compression techniques have been\nproposed, most of them are not well-suited for achieving extreme model\ncompression when there is a significant gap in model scale. In this paper, we\nintroduce a novel compression paradigm called Retrieval-based Knowledge\nTransfer (RetriKT), which effectively transfers the knowledge of LLMs to\nextremely small-scale models (e.g., 1%). In particular, our approach extracts\nknowledge from LLMs to construct a knowledge store, from which the small-scale\nmodel can retrieve relevant information and leverage it for effective\ninference. To improve the quality of the model, soft prompt tuning and Proximal\nPolicy Optimization (PPO) reinforcement learning techniques are employed.\nExtensive experiments are conducted on low-resource tasks from SuperGLUE and\nGLUE benchmarks. 
The results demonstrate that the proposed approach\nsignificantly enhances the performance of small-scale models by leveraging the\nknowledge from LLMs.\n","authors":["Jiduan Liu","Jiahao Liu","Qifan Wang","Jingang Wang","Xunliang Cai","Dongyan Zhao","Ran Lucien Wang","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.15594v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.15587v1","updated":"2023-10-24T07:52:19Z","published":"2023-10-24T07:52:19Z","title":"ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts","summary":" Eye movements in reading play a crucial role in psycholinguistic research\nstudying the cognitive mechanisms underlying human language processing. More\nrecently, the tight coupling between eye movements and cognition has also been\nleveraged for language-related machine learning tasks such as the\ninterpretability, enhancement, and pre-training of language models, as well as\nthe inference of reader- and text-specific properties. However, scarcity of eye\nmovement data and its unavailability at application time poses a major\nchallenge for this line of research. Initially, this problem was tackled by\nresorting to cognitive models for synthesizing eye movement data. However, for\nthe sole purpose of generating human-like scanpaths, purely data-driven\nmachine-learning-based methods have proven to be more suitable. Following\nrecent advances in adapting diffusion processes to discrete data, we propose\nScanDL, a novel discrete sequence-to-sequence diffusion model that generates\nsynthetic scanpaths on texts. By leveraging pre-trained word representations\nand jointly embedding both the stimulus text and the fixation sequence, our\nmodel captures multi-modal interactions between the two inputs. We evaluate\nScanDL within- and across-dataset and demonstrate that it significantly\noutperforms state-of-the-art scanpath generation methods. 
Finally, we provide\nan extensive psycholinguistic analysis that underlines the model's ability to\nexhibit human-like reading behavior. Our implementation is made available at\nhttps://github.com/DiLi-Lab/ScanDL.\n","authors":["Lena S. Bolliger","David R. Reich","Patrick Haller","Deborah N. Jakobi","Paul Prasse","Lena A. Jäger"],"pdf_url":"https://arxiv.org/pdf/2310.15587v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15585v1","updated":"2023-10-24T07:51:08Z","published":"2023-10-24T07:51:08Z","title":"Multimodal Representations for Teacher-Guided Compositional Visual\n Reasoning","summary":" Neural Module Networks (NMN) are a compelling method for visual question\nanswering, enabling the translation of a question into a program consisting of\na series of reasoning sub-tasks that are sequentially executed on the image to\nproduce an answer. NMNs provide enhanced explainability compared to integrated\nmodels, allowing for a better understanding of the underlying reasoning\nprocess. To improve the effectiveness of NMNs we propose to exploit features\nobtained by a large-scale cross-modal encoder. Also, the current training\napproach of NMNs relies on the propagation of module outputs to subsequent\nmodules, leading to the accumulation of prediction errors and the generation of\nfalse answers. To mitigate this, we introduce an NMN learning strategy\ninvolving scheduled teacher guidance. Initially, the model is fully guided by\nthe ground-truth intermediate outputs, but gradually transitions to an\nautonomous behavior as training progresses. 
This reduces error accumulation,\nthus improving training efficiency and final performance. We demonstrate that by\nincorporating cross-modal features and employing more effective training\ntechniques for NMN, we achieve a favorable balance between performance and\ntransparency in the reasoning process.\n","authors":["Wafa Aissa","Marin Ferecatu","Michel Crucianu"],"pdf_url":"https://arxiv.org/pdf/2310.15585v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15577v1","updated":"2023-10-24T07:40:09Z","published":"2023-10-24T07:40:09Z","title":"CONTRASTE: Supervised Contrastive Pre-training With Aspect-based Prompts\n For Aspect Sentiment Triplet Extraction","summary":" Existing works on Aspect Sentiment Triplet Extraction (ASTE) explicitly focus\non developing more efficient fine-tuning techniques for the task. Instead, our\nmotivation is to come up with a generic approach that can improve the\ndownstream performances of multiple ABSA tasks simultaneously. Towards this, we\npresent CONTRASTE, a novel pre-training strategy using CONTRastive learning to\nenhance the ASTE performance. While we primarily focus on ASTE, we also\ndemonstrate the advantage of our proposed technique on other ABSA tasks such as\nACOS, TASD, and AESC. Given a sentence and its associated (aspect, opinion,\nsentiment) triplets, first, we design aspect-based prompts with corresponding\nsentiments masked. We then (pre)train an encoder-decoder model by applying\ncontrastive learning on the decoder-generated aspect-aware sentiment\nrepresentations of the masked terms. For fine-tuning the model weights thus\nobtained, we then propose a novel multi-task approach where the base\nencoder-decoder model is combined with two complementary modules, a\ntagging-based Opinion Term Detector, and a regression-based Triplet Count\nEstimator. 
Exhaustive experiments on four benchmark datasets and a detailed\nablation study establish the importance of each of our proposed components as\nwe achieve new state-of-the-art ASTE results.\n","authors":["Rajdeep Mukherjee","Nithish Kannen","Saurabh Kumar Pandey","Pawan Goyal"],"pdf_url":"https://arxiv.org/pdf/2310.15577v1.pdf","comment":"Accepted as a Long Paper at EMNLP 2023 (Findings); 16 pages; Codes:\n https://github.com/nitkannen/CONTRASTE/"},{"id":"http://arxiv.org/abs/2310.15575v1","updated":"2023-10-24T07:38:43Z","published":"2023-10-24T07:38:43Z","title":"POE: Process of Elimination for Multiple Choice Reasoning","summary":" Language models (LMs) are capable of conducting in-context learning for\nmultiple choice reasoning tasks, but the options in these tasks are treated\nequally. As humans often first eliminate wrong options before picking the final\ncorrect answer, we argue a similar two-step strategy can make LMs better at\nthese tasks. To this end, we present the Process of Elimination (POE), a\ntwo-step scoring method. In the first step, POE scores each option, and\neliminates seemingly wrong options. In the second step, POE masks these wrong\noptions, and makes the final prediction from the remaining options. Zero-shot\nexperiments on 8 reasoning tasks illustrate the effectiveness of POE, and a\nfollowing analysis finds our method to be especially performant on logical\nreasoning tasks. 
We further analyze the effect of masks, and show that POE\napplies to few-shot settings and large language models (LLMs) like ChatGPT.\n","authors":["Chenkai Ma","Xinya Du"],"pdf_url":"https://arxiv.org/pdf/2310.15575v1.pdf","comment":"Accepted as a short paper at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15572v1","updated":"2023-10-24T07:35:24Z","published":"2023-10-24T07:35:24Z","title":"Natural Language Processing for Drug Discovery Knowledge Graphs:\n promises and pitfalls","summary":" Building and analysing knowledge graphs (KGs) to aid drug discovery is a\ntopical area of research. A salient feature of KGs is their ability to combine\nmany heterogeneous data sources in a format that facilitates discovering\nconnections. The utility of KGs has been exemplified in areas such as drug\nrepurposing, with insights made through manual exploration and modelling of the\ndata. In this article, we discuss promises and pitfalls of using natural\nlanguage processing (NLP) to mine unstructured text typically from scientific\nliterature as a data source for KGs. This draws on our experience of initially\nparsing structured data sources such as ChEMBL as the basis for data within a\nKG, and then enriching or expanding upon them using NLP. The fundamental\npromise of NLP for KGs is the automated extraction of data from millions of\ndocuments a task practically impossible to do via human curation alone.\nHowever, there are many potential pitfalls in NLP-KG pipelines such as\nincorrect named entity recognition and ontology linking all of which could\nultimately lead to erroneous inferences and conclusions.\n","authors":["J. Charles G. 
Jeynes","Tim James","Matthew Corney"],"pdf_url":"https://arxiv.org/pdf/2310.15572v1.pdf","comment":"17 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.15571v1","updated":"2023-10-24T07:35:23Z","published":"2023-10-24T07:35:23Z","title":"Visually Grounded Continual Language Learning with Selective\n Specialization","summary":" A desirable trait of an artificial agent acting in the visual world is to\ncontinually learn a sequence of language-informed tasks while striking a\nbalance between sufficiently specializing in each task and building a\ngeneralized knowledge for transfer. Selective specialization, i.e., a careful\nselection of model components to specialize in each task, is a strategy to\nprovide control over this trade-off. However, the design of selection\nstrategies requires insights on the role of each model component in learning\nrather specialized or generalizable representations, which poses a gap in\ncurrent research. Thus, our aim with this work is to provide an extensive\nanalysis of selection strategies for visually grounded continual language\nlearning. Due to the lack of suitable benchmarks for this purpose, we introduce\ntwo novel diagnostic datasets that provide enough control and flexibility for a\nthorough model analysis. We assess various heuristics for module specialization\nstrategies as well as quantifiable measures for two different types of model\narchitectures. Finally, we design conceptually simple approaches based on our\nanalysis that outperform common continual learning baselines. 
Our results\ndemonstrate the need for further efforts towards better aligning continual\nlearning algorithms with the learning behaviors of individual model parts.\n","authors":["Kyra Ahrens","Lennart Bengtson","Jae Hee Lee","Stefan Wermter"],"pdf_url":"https://arxiv.org/pdf/2310.15571v1.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.15569v1","updated":"2023-10-24T07:23:46Z","published":"2023-10-24T07:23:46Z","title":"MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in\n the Materials Science Domain","summary":" Keeping track of all relevant recent publications and experimental results\nfor a research area is a challenging task. Prior work has demonstrated the\nefficacy of information extraction models in various scientific areas.\nRecently, several datasets have been released for the yet understudied\nmaterials science domain. However, these datasets focus on sub-problems such as\nparsing synthesis procedures or on sub-domains, e.g., solid oxide fuel cells.\nIn this resource paper, we present MuLMS, a new dataset of 50 open-access\narticles, spanning seven sub-domains of materials science. The corpus has been\nannotated by domain experts with several layers ranging from named entities\nover relations to frame structures. 
We present competitive neural models for\nall tasks and demonstrate that multi-task training with existing related\nresources leads to benefits.\n","authors":["Timo Pierre Schrader","Matteo Finco","Stefan Grünewald","Felix Hildebrand","Annemarie Friedrich"],"pdf_url":"https://arxiv.org/pdf/2310.15569v1.pdf","comment":"17 pages, 2 figures, 28 tables, to be published in \"Proceedings of\n the second Workshop on Information Extraction from Scientific Publications\""},{"id":"http://arxiv.org/abs/2310.15556v1","updated":"2023-10-24T06:56:38Z","published":"2023-10-24T06:56:38Z","title":"TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for\n Inference Cost Reduction","summary":" Since ChatGPT released its API for public use, the number of applications\nbuilt on top of commercial large language models (LLMs) increase exponentially.\nOne popular usage of such models is leveraging its in-context learning ability\nand generating responses given user queries leveraging knowledge obtained by\nretrieval augmentation. One problem of deploying commercial retrieval-augmented\nLLMs is the cost due to the additionally retrieved context that largely\nincreases the input token size of the LLMs. To mitigate this, we propose a\ntoken compression scheme that includes two methods: summarization compression\nand semantic compression. The first method applies a T5-based model that is\nfine-tuned by datasets generated using self-instruct containing samples with\nvarying lengths and reduce token size by doing summarization. The second method\nfurther compresses the token size by removing words with lower impact on the\nsemantic. 
In order to adequately evaluate the effectiveness of the proposed\nmethods, we propose and utilize a dataset called Food-Recommendation DB (FRDB)\nfocusing on food recommendation for women around pregnancy period or infants.\nOur summarization compression can reduce 65% of the retrieval token size with\nfurther 0.3% improvement on the accuracy; semantic compression provides a more\nflexible way to trade-off the token size with performance, for which we can\nreduce the token size by 20% with only 1.6% of accuracy drop.\n","authors":["Junyi Liu","Liangzhi Li","Tong Xiang","Bowen Wang","Yiming Qian"],"pdf_url":"https://arxiv.org/pdf/2310.15556v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.09168v3","updated":"2023-10-24T06:55:17Z","published":"2023-10-13T15:03:15Z","title":"Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through\n Active Exploration","summary":" Instruction-tuning can be substantially optimized through enhanced diversity,\nresulting in models capable of handling a broader spectrum of tasks. However,\nexisting data employed for such tuning often exhibit an inadequate coverage of\nindividual domains, limiting the scope for nuanced comprehension and\ninteractions within these areas. To address this deficiency, we propose\nExplore-Instruct, a novel approach to enhance the data coverage to be used in\ndomain-specific instruction-tuning through active exploration via Large\nLanguage Models (LLMs). Built upon representative domain use cases,\nExplore-Instruct explores a multitude of variations or possibilities by\nimplementing a search algorithm to obtain diversified and domain-focused\ninstruction-tuning data. 
Our data-centric analysis validates the effectiveness\nof this proposed approach in improving domain-specific instruction coverage.\nMoreover, our model's performance demonstrates considerable advancements over\nmultiple baselines, including those utilizing domain-specific data enhancement.\nOur findings offer a promising opportunity to improve instruction coverage,\nespecially in domain-specific contexts, thereby advancing the development of\nadaptable language models. Our code, model weights, and data are public at\n\\url{https://github.com/fanqiwan/Explore-Instruct}.\n","authors":["Fanqi Wan","Xinting Huang","Tao Yang","Xiaojun Quan","Wei Bi","Shuming Shi"],"pdf_url":"https://arxiv.org/pdf/2310.09168v3.pdf","comment":"Accepted to EMNLP 2023 (Main Conference)"},{"id":"http://arxiv.org/abs/2310.05374v3","updated":"2023-10-24T06:48:55Z","published":"2023-10-09T03:10:49Z","title":"Improving End-to-End Speech Processing by Efficient Text Data\n Utilization with Latent Synthesis","summary":" Training a high performance end-to-end speech (E2E) processing model requires\nan enormous amount of labeled speech data, especially in the era of\ndata-centric artificial intelligence. However, labeled speech data are usually\nscarcer and more expensive for collection, compared to textual data. We propose\nLatent Synthesis (LaSyn), an efficient textual data utilization framework for\nE2E speech processing models. We train a latent synthesizer to convert textual\ndata into an intermediate latent representation of a pre-trained speech model.\nThese pseudo acoustic representations of textual data augment acoustic data for\nmodel training. We evaluate LaSyn on low-resource automatic speech recognition\n(ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an\nE2E baseline trained on LibriSpeech train-clean-100, with relative word error\nrate reductions over 22.3% on different test sets. 
For SLU, LaSyn improves our\nE2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for\nslot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM)\nand EM-Tree accuracies on STOP respectively. With fewer parameters, the results\nof LaSyn are competitive to published state-of-the-art works. The results\ndemonstrate the quality of the augmented training data.\n","authors":["Jianqiao Lu","Wenyong Huang","Nianzu Zheng","Xingshan Zeng","Yu Ting Yeung","Xiao Chen"],"pdf_url":"https://arxiv.org/pdf/2310.05374v3.pdf","comment":"15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.14523v2","updated":"2023-10-24T06:48:07Z","published":"2023-10-23T03:11:46Z","title":"Rethinking Word-Level Auto-Completion in Computer-Aided Translation","summary":" Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted\nTranslation. It aims at providing word-level auto-completion suggestions for\nhuman translators. While previous studies have primarily focused on designing\ncomplex model architectures, this paper takes a different perspective by\nrethinking the fundamental question: what kind of words are good\nauto-completions? We introduce a measurable criterion to answer this question\nand discover that existing WLAC models often fail to meet this criterion.\nBuilding upon this observation, we propose an effective approach to enhance\nWLAC performance by promoting adherence to the criterion. 
Notably, the proposed\napproach is general and can be applied to various encoder-based architectures.\nThrough extensive experiments, we demonstrate that our approach outperforms the\ntop-performing system submitted to the WLAC shared tasks in WMT2022, while\nutilizing significantly smaller model sizes.\n","authors":["Xingyu Chen","Lemao Liu","Guoping Huang","Zhirui Zhang","Mingming Yang","Shuming Shi","Rui Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14523v2.pdf","comment":"EMNLP2023"},{"id":"http://arxiv.org/abs/2310.15552v1","updated":"2023-10-24T06:45:00Z","published":"2023-10-24T06:45:00Z","title":"Unveiling Multilinguality in Transformer Models: Exploring Language\n Specificity in Feed-Forward Networks","summary":" Recent research suggests that the feed-forward module within Transformers can\nbe viewed as a collection of key-value memories, where the keys learn to\ncapture specific patterns from the input based on the training examples. The\nvalues then combine the output from the 'memories' of the keys to generate\npredictions about the next token. This leads to an incremental process of\nprediction that gradually converges towards the final token choice near the\noutput layers. This interesting perspective raises questions about how\nmultilingual models might leverage this mechanism. Specifically, for\nautoregressive models trained on two or more languages, do all neurons (across\nlayers) respond equally to all languages? No! Our hypothesis centers around the\nnotion that during pretraining, certain model parameters learn strong\nlanguage-specific features, while others learn more language-agnostic (shared\nacross languages) features. 
To validate this, we conduct experiments utilizing\nparallel corpora of two languages that the model was initially pretrained on.\nOur findings reveal that the layers closest to the network's input or output\ntend to exhibit more language-specific behaviour compared to the layers in the\nmiddle.\n","authors":["Sunit Bhattacharya","Ondrej Bojar"],"pdf_url":"https://arxiv.org/pdf/2310.15552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15541v1","updated":"2023-10-24T06:15:15Z","published":"2023-10-24T06:15:15Z","title":"Improving Language Models Meaning Understanding and Consistency by\n Learning Conceptual Roles from Dictionary","summary":" The non-humanlike behaviour of contemporary pre-trained language models\n(PLMs) is a leading cause undermining their trustworthiness. A striking\nphenomenon of such faulty behaviours is the generation of inconsistent\npredictions, which produces logically contradictory results, such as generating\ndifferent predictions for texts delivering the same meaning or violating\nlogical properties. Previous studies exploited data augmentation or implemented\nspecialised loss functions to alleviate the issue. However, their usage is\nlimited, because they consume expensive training resources for large-sized PLMs\nand can only handle a certain consistency type. To this end, we propose a\npractical approach that alleviates the inconsistent behaviour issue by\nfundamentally improving PLMs' meaning awareness. Based on the conceptual role\ntheory, our method allows PLMs to capture accurate meaning by learning precise\ninterrelationships between concepts from word-definition pairs in a dictionary.\nNext, we propose an efficient parameter integration technique that updates only\na few additional parameters to combine the learned interrelationship with PLMs'\npre-trained knowledge. 
Our experimental results reveal that the approach can\nconcurrently improve multiple types of consistency, enables efficient knowledge\nintegration, and easily applies to other languages.\n","authors":["Myeongjun Erik Jang","Thomas Lukasiewicz"],"pdf_url":"https://arxiv.org/pdf/2310.15541v1.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2310.15539v1","updated":"2023-10-24T06:04:28Z","published":"2023-10-24T06:04:28Z","title":"SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code\n Translation","summary":" With the recent focus on Large Language Models (LLMs), both StarCoder (Li et\nal., 2023) and Code Llama (Rozi\\`ere et al., 2023) have demonstrated remarkable\nperformance in code generation. However, there is still a need for improvement\nin code translation functionality with efficient training techniques. In\nresponse to this, we introduce SteloCoder, a decoder-only StarCoder-based LLM\ndesigned specifically for multi-programming language-to-Python code\ntranslation. In particular, SteloCoder achieves C++, C#, JavaScript, Java, or\nPHP-to-Python code translation without specifying the input programming\nlanguage. We modified StarCoder model architecture by incorporating a\nMixture-of-Experts (MoE) technique featuring five experts and a gating network\nfor multi-task handling. Experts are obtained by StarCoder fine-tuning.\nSpecifically, we use a Low-Rank Adaptive Method (LoRA) technique, limiting each\nexpert size as only 0.06% of number of StarCoder's parameters. At the same\ntime, to enhance training efficiency in terms of time, we adopt curriculum\nlearning strategy and use self-instruct data for efficient fine-tuning. As a\nresult, each expert takes only 6 hours to train on one single 80Gb A100 HBM.\nWith experiments on XLCoST datasets, SteloCoder achieves an average of 73.76\nCodeBLEU score in multi-programming language-to-Python translation, surpassing\nthe top performance from the leaderboard by at least 3.5. 
This accomplishment\nis attributed to only 45M extra parameters with StarCoder as the backbone and\n32 hours of valid training on one 80GB A100 HBM. The source code is released\nhere: https://github.com/sade-adrien/SteloCoder.\n","authors":["Jialing Pan","Adrien Sadé","Jin Kim","Eric Soriano","Guillem Sole","Sylvain Flamant"],"pdf_url":"https://arxiv.org/pdf/2310.15539v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13981v2","updated":"2023-10-24T06:03:23Z","published":"2023-05-23T12:05:09Z","title":"Preserving Knowledge Invariance: Rethinking Robustness Evaluation of\n Open Information Extraction","summary":" The robustness to distribution changes ensures that NLP models can be\nsuccessfully applied in the realistic world, especially for information\nextraction tasks. However, most prior evaluation benchmarks have been devoted\nto validating pairwise matching correctness, ignoring the crucial measurement\nof robustness. In this paper, we present the first benchmark that simulates the\nevaluation of open information extraction models in the real world, where the\nsyntactic and expressive distributions under the same knowledge meaning may\ndrift variously. We design and annotate a large-scale testbed in which each\nexample is a knowledge-invariant clique that consists of sentences with\nstructured knowledge of the same meaning but with different syntactic and\nexpressive forms. By further elaborating the robustness metric, a model is\njudged to be robust if its performance is consistently accurate on the overall\ncliques. We perform experiments on typical models published in the last decade\nas well as a popular large language model, the results show that the existing\nsuccessful models exhibit a frustrating degradation, with a maximum drop of\n23.43 F1 score. 
Our resources and code are available at\nhttps://github.com/qijimrc/ROBUST.\n","authors":["Ji Qi","Chuchun Zhang","Xiaozhi Wang","Kaisheng Zeng","Jifan Yu","Jinxin Liu","Jiuding Sun","Yuxiang Chen","Lei Hou","Juanzi Li","Bin Xu"],"pdf_url":"https://arxiv.org/pdf/2305.13981v2.pdf","comment":"Accepted by EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.14654v2","updated":"2023-10-24T06:03:14Z","published":"2023-10-23T07:50:10Z","title":"SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab,\n IIT Madras","summary":" India is home to a multitude of languages of which 22 languages are\nrecognised by the Indian Constitution as official. Building speech based\napplications for the Indian population is a difficult problem owing to limited\ndata and the number of languages and accents to accommodate. To encourage the\nlanguage technology community to build speech based applications in Indian\nlanguages, we are open sourcing SPRING-INX data which has about 2000 hours of\nlegally sourced and manually transcribed speech data for ASR system building in\nAssamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi\nand Tamil. This endeavor is by SPRING Lab , Indian Institute of Technology\nMadras and is a part of National Language Translation Mission (NLTM), funded by\nthe Indian Ministry of Electronics and Information Technology (MeitY),\nGovernment of India. 
We describe the data collection and data cleaning process\nalong with the data statistics in this paper.\n","authors":["Nithya R","Malavika S","Jordan F","Arjun Gangwar","Metilda N J","S Umesh","Rithik Sarab","Akhilesh Kumar Dubey","Govind Divakaran","Samudra Vijaya K","Suryakanth V Gangashetty"],"pdf_url":"https://arxiv.org/pdf/2310.14654v2.pdf","comment":"3 pages, About SPRING-INX Data"},{"id":"http://arxiv.org/abs/2209.05401v3","updated":"2023-10-24T05:59:52Z","published":"2022-09-12T16:53:37Z","title":"MaXM: Towards Multilingual Visual Question Answering","summary":" Visual Question Answering (VQA) has been primarily studied through the lens\nof the English language. Yet, tackling VQA in other languages in the same\nmanner would require a considerable amount of resources. In this paper, we\npropose scalable solutions to multilingual visual question answering (mVQA), on\nboth data and modeling fronts. We first propose a translation-based framework\nto mVQA data generation that requires much less human annotation efforts than\nthe conventional approach of directly collection questions and answers. Then,\nwe apply our framework to the multilingual captions in the Crossmodal-3600\ndataset and develop an efficient annotation protocol to create MaXM, a\ntest-only VQA benchmark in 7 diverse languages. Finally, we develop a simple,\nlightweight, and effective approach as well as benchmark state-of-the-art\nEnglish and multilingual VQA models. We hope that our benchmark encourages\nfurther research on mVQA.\n","authors":["Soravit Changpinyo","Linting Xue","Michal Yarom","Ashish V. 
Thapliyal","Idan Szpektor","Julien Amelot","Xi Chen","Radu Soricut"],"pdf_url":"https://arxiv.org/pdf/2209.05401v3.pdf","comment":"EMNLP 2023 (Findings).\n https://github.com/google-research-datasets/maxm"},{"id":"http://arxiv.org/abs/2305.14996v2","updated":"2023-10-24T05:18:32Z","published":"2023-05-24T10:35:56Z","title":"The ACL OCL Corpus: Advancing Open Science in Computational Linguistics","summary":" We present ACL OCL, a scholarly corpus derived from the ACL Anthology to\nassist Open scientific research in the Computational Linguistics domain.\nIntegrating and enhancing the previous versions of the ACL Anthology, the ACL\nOCL contributes metadata, PDF files, citation graphs and additional structured\nfull texts with sections, figures, and links to a large knowledge resource\n(Semantic Scholar). The ACL OCL spans seven decades, containing 73K papers,\nalongside 210K figures.\n We spotlight how ACL OCL applies to observe trends in computational\nlinguistics. By detecting paper topics with a supervised neural model, we note\nthat interest in \"Syntax: Tagging, Chunking and Parsing\" is waning and \"Natural\nLanguage Generation\" is resurging. Our dataset is available from HuggingFace\n(https://huggingface.co/datasets/WINGNUS/ACL-OCL).\n","authors":["Shaurya Rohatgi","Yanxia Qin","Benjamin Aw","Niranjana Unnithan","Min-Yen Kan"],"pdf_url":"https://arxiv.org/pdf/2305.14996v2.pdf","comment":"To appear in EMNLP2023"},{"id":"http://arxiv.org/abs/2310.15517v1","updated":"2023-10-24T04:50:59Z","published":"2023-10-24T04:50:59Z","title":"MarkQA: A large scale KBQA dataset with numerical reasoning","summary":" While question answering over knowledge bases (KBQA) has shown progress in\naddressing factoid questions, KBQA with numerical reasoning remains relatively\nunexplored. In this paper, we focus on the complex numerical reasoning in KBQA\nand propose a new task, NR-KBQA, which necessitates the ability to perform both\nmulti-hop reasoning and numerical reasoning. 
We design a logic form in Python\nformat called PyQL to represent the reasoning process of numerical reasoning\nquestions. To facilitate the development of NR-KBQA, we present a large dataset\ncalled MarkQA, which is automatically constructed from a small set of seeds.\nEach question in MarkQA is equipped with its corresponding SPARQL query,\nalongside the step-by-step reasoning process in the QDMR format and PyQL\nprogram. Experimental results of some state-of-the-art QA methods on the MarkQA\nshow that complex numerical reasoning in KBQA faces great challenges.\n","authors":["Xiang Huang","Sitao Cheng","Yuheng Bao","Shanshan Huang","Yuzhong Qu"],"pdf_url":"https://arxiv.org/pdf/2310.15517v1.pdf","comment":"camera ready for EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15515v1","updated":"2023-10-24T04:50:29Z","published":"2023-10-24T04:50:29Z","title":"Fighting Fire with Fire: The Dual Role of LLMs in Crafting and Detecting\n Elusive Disinformation","summary":" Recent ubiquity and disruptive impacts of large language models (LLMs) have\nraised concerns about their potential to be misused (i.e., generating\nlarge-scale harmful and misleading content). To combat this emerging risk of\nLLMs, we propose a novel \"Fighting Fire with Fire\" (F3) strategy that harnesses\nmodern LLMs' generative and emergent reasoning capabilities to counter\nhuman-written and LLM-generated disinformation. First, we leverage\nGPT-3.5-turbo to synthesize authentic and deceptive LLM-generated content\nthrough paraphrase-based and perturbation-based prefix-style prompts,\nrespectively. Second, we apply zero-shot in-context semantic reasoning\ntechniques with cloze-style prompts to discern genuine from deceptive posts and\nnews articles. 
In our extensive experiments, we observe GPT-3.5-turbo's\nzero-shot superiority for both in-distribution and out-of-distribution\ndatasets, where GPT-3.5-turbo consistently achieved accuracy at 68-72%, unlike\nthe decline observed in previous customized and fine-tuned disinformation\ndetectors. Our codebase and dataset are available at\nhttps://github.com/mickeymst/F3.\n","authors":["Jason Lucas","Adaku Uchendu","Michiharu Yamashita","Jooyoung Lee","Shaurya Rohatgi","Dongwon Lee"],"pdf_url":"https://arxiv.org/pdf/2310.15515v1.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.02971v2","updated":"2023-10-24T04:47:19Z","published":"2023-10-04T17:07:32Z","title":"Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech\n Model","summary":" Prompting and adapter tuning have emerged as efficient alternatives to\nfine-tuning (FT) methods. However, existing studies on speech prompting focused\non classification tasks and failed on more complex sequence generation tasks.\nBesides, adapter tuning is primarily applied with a focus on encoder-only\nself-supervised models. Our experiments show that prompting on Wav2Seq, a\nself-supervised encoder-decoder model, surpasses previous works in sequence\ngeneration tasks. It achieves a remarkable 53% relative improvement in word\nerror rate for ASR and a 27% in F1 score for slot filling. Additionally,\nprompting competes with the FT method in the low-resource scenario. Moreover,\nwe show the transferability of prompting and adapter tuning on Wav2Seq in\ncross-lingual ASR. 
When limited trainable parameters are involved, prompting\nand adapter tuning consistently outperform conventional FT across 7 languages.\nNotably, in the low-resource scenario, prompting consistently outperforms\nadapter tuning.\n","authors":["Kai-Wei Chang","Ming-Hsin Chen","Yun-Ping Lin","Jing Neng Hsu","Paul Kuo-Ming Huang","Chien-yu Huang","Shang-Wen Li","Hung-yi Lee"],"pdf_url":"https://arxiv.org/pdf/2310.02971v2.pdf","comment":"Accepted to IEEE ASRU 2023"},{"id":"http://arxiv.org/abs/2310.15513v1","updated":"2023-10-24T04:43:45Z","published":"2023-10-24T04:43:45Z","title":"A Joint Matrix Factorization Analysis of Multilingual Representations","summary":" We present an analysis tool based on joint matrix factorization for comparing\nlatent representations of multilingual and monolingual models. An alternative\nto probing, this tool allows us to analyze multiple sets of representations in\na joint manner. Using this tool, we study to what extent and how\nmorphosyntactic features are reflected in the representations learned by\nmultilingual pre-trained models. We conduct a large-scale empirical study of\nover 33 languages and 17 morphosyntactic categories. Our findings demonstrate\nvariations in the encoding of morphosyntactic information across upper and\nlower layers, with category-specific differences influenced by language\nproperties. Hierarchical clustering of the factorization outputs yields a tree\nstructure that is related to phylogenetic trees manually crafted by linguists.\nMoreover, we find the factorization outputs exhibit strong associations with\nperformance observed across different cross-lingual tasks. We release our code\nto facilitate future research.\n","authors":["Zheng Zhao","Yftah Ziser","Bonnie Webber","Shay B. 
Cohen"],"pdf_url":"https://arxiv.org/pdf/2310.15513v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15511v1","updated":"2023-10-24T04:40:38Z","published":"2023-10-24T04:40:38Z","title":"KITAB: Evaluating LLMs on Constraint Satisfaction for Information\n Retrieval","summary":" We study the ability of state-of-the-art models to answer constraint\nsatisfaction queries for information retrieval (e.g., 'a list of ice cream\nshops in San Diego'). In the past, such queries were considered to be tasks\nthat could only be solved via web-search or knowledge bases. More recently,\nlarge language models (LLMs) have demonstrated initial emergent abilities in\nthis task. However, many current retrieval benchmarks are either saturated or\ndo not measure constraint satisfaction. Motivated by rising concerns around\nfactual incorrectness and hallucinations of LLMs, we present KITAB, a new\ndataset for measuring constraint satisfaction abilities of language models.\nKITAB consists of book-related data across more than 600 authors and 13,000\nqueries, and also offers an associated dynamic data collection and constraint\nverification approach for acquiring similar test data for other authors. Our\nextended experiments on GPT4 and GPT3.5 characterize and decouple common\nfailure modes across dimensions such as information popularity, constraint\ntypes, and context availability. Results show that in the absence of context,\nmodels exhibit severe limitations as measured by irrelevant information,\nfactual errors, and incompleteness, many of which exacerbate as information\npopularity decreases. While context availability mitigates irrelevant\ninformation, it is not helpful for satisfying constraints, identifying\nfundamental barriers to constraint satisfaction. 
We open source our\ncontributions to foster further research on improving constraint satisfaction\nabilities of future models.\n","authors":["Marah I Abdin","Suriya Gunasekar","Varun Chandrasekaran","Jerry Li","Mert Yuksekgonul","Rahee Ghosh Peshawaria","Ranjita Naik","Besmira Nushi"],"pdf_url":"https://arxiv.org/pdf/2310.15511v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2305.16339v2","updated":"2023-10-24T04:38:52Z","published":"2023-05-24T02:05:03Z","title":"Don't Trust ChatGPT when Your Question is not in English: A Study of\n Multilingual Abilities and Types of LLMs","summary":" Large Language Models (LLMs) have demonstrated exceptional natural language\nunderstanding abilities and have excelled in a variety of natural language\nprocessing (NLP) tasks in recent years. Despite the fact that most LLMs are\ntrained predominantly in English, multiple studies have demonstrated their\ncomparative performance in many other languages. However, fundamental questions\npersist regarding how LLMs acquire their multi-lingual abilities and how\nperformance varies across different languages. These inquiries are crucial for\nthe study of LLMs since users and researchers often come from diverse language\nbackgrounds, potentially influencing their utilization and interpretation of\nLLMs' results. In this work, we propose a systematic way of qualifying the\nperformance disparities of LLMs under multilingual settings. We investigate the\nphenomenon of across-language generalizations in LLMs, wherein insufficient\nmulti-lingual training data leads to advanced multi-lingual capabilities. To\naccomplish this, we employ a novel back-translation-based prompting method. 
The\nresults show that GPT exhibits highly translating-like behaviour in\nmultilingual settings.\n","authors":["Xiang Zhang","Senyu Li","Bradley Hauer","Ning Shi","Grzegorz Kondrak"],"pdf_url":"https://arxiv.org/pdf/2305.16339v2.pdf","comment":"Paper accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14975v2","updated":"2023-10-24T04:27:42Z","published":"2023-05-24T10:12:33Z","title":"Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence\n Scores from Language Models Fine-Tuned with Human Feedback","summary":" A trustworthy real-world prediction system should produce well-calibrated\nconfidence scores; that is, its confidence in an answer should be indicative of\nthe likelihood that the answer is correct, enabling deferral to an expert in\ncases of low-confidence predictions. Recent studies have shown that\nunsupervised pre-training produces large language models (LMs) whose\nconditional probabilities are remarkably well-calibrated. However, the most\nwidely-used LMs are fine-tuned with reinforcement learning from human feedback\n(RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional\nprobabilities that are very poorly calibrated. In light of this perceived\nweakness, we conduct a broad evaluation of methods for extracting confidence\nscores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find\nthat verbalized confidences emitted as output tokens are typically\nbetter-calibrated than the model's conditional probabilities on the TriviaQA,\nSciQ, and TruthfulQA benchmarks, often reducing the expected calibration error\nby a relative 50%.\n","authors":["Katherine Tian","Eric Mitchell","Allan Zhou","Archit Sharma","Rafael Rafailov","Huaxiu Yao","Chelsea Finn","Christopher D. 
Manning"],"pdf_url":"https://arxiv.org/pdf/2305.14975v2.pdf","comment":"EMNLP 2023 Camera Ready"},{"id":"http://arxiv.org/abs/2310.15494v1","updated":"2023-10-24T03:42:49Z","published":"2023-10-24T03:42:49Z","title":"TRAMS: Training-free Memory Selection for Long-range Language Modeling","summary":" The Transformer architecture is crucial for numerous AI models, but it still\nfaces challenges in long-range language modeling. Though several specific\ntransformer architectures have been designed to tackle issues of long-range\ndependencies, existing methods like Transformer-XL are plagued by a high\npercentage of ineffective memories. In this study, we present a plug-and-play\nstrategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens\nparticipating in attention calculation based on one simple metric. This\nstrategy allows us to keep tokens that are likely to have a high attention\nscore with the current queries and ignore the other ones. We have tested our\napproach on the word-level benchmark (WikiText-103) and the character-level\nbenchmark (enwik8), and the results indicate an improvement without having\nadditional training or adding additional parameters.\n","authors":["Haofei Yu","Cunxiang wang","Yue Zhang","Wei Bi"],"pdf_url":"https://arxiv.org/pdf/2310.15494v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13999v3","updated":"2023-10-24T03:41:37Z","published":"2023-05-23T12:28:37Z","title":"Towards A Unified View of Sparse Feed-Forward Network in Pretraining\n Large Language Model","summary":" Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE)\nhave proven effective in scaling up Transformers model size for\n\\textit{pretraining} large language models. By only activating part of the FFN\nparameters conditioning on input, S-FFN improves generalization performance\nwhile keeping training and inference costs (in FLOPs) fixed. 
In this work, we\nanalyzed two major design choices of S-FFN: the memory block (a.k.a. expert)\nsize and the memory block selection method under a general conceptual framework\nof sparse neural memory. Using this unified framework, we compare several S-FFN\narchitectures for language modeling and provide insights into their relative\nefficacy and efficiency. We found a simpler selection method --\n\\textbf{\\texttt{Avg-K}} that selects blocks through their mean aggregated\nhidden states, achieving lower perplexity in language model pretraining\ncompared to existing MoE architectures including Switch Transformer (Fedus et\nal., 2021) and HashLayer (Roller et al., 2021).\n","authors":["Zeyu Leo Liu","Tim Dettmers","Xi Victoria Lin","Veselin Stoyanov","Xian Li"],"pdf_url":"https://arxiv.org/pdf/2305.13999v3.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2308.07308v3","updated":"2023-10-24T03:38:06Z","published":"2023-08-14T17:54:10Z","title":"LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked","summary":" Large language models (LLMs) are popular for high-quality text generation but\ncan produce harmful content, even when aligned with human values through\nreinforcement learning. Adversarial prompts can bypass their safety measures.\nWe propose LLM Self Defense, a simple approach to defend against these attacks\nby having an LLM screen the induced responses. Our method does not require any\nfine-tuning, input preprocessing, or iterative output generation. Instead, we\nincorporate the generated content into a pre-defined prompt and employ another\ninstance of an LLM to analyze the text and predict whether it is harmful. We\ntest LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent\nLLMs against various types of attacks, such as forcefully inducing affirmative\nresponses to prompts and prompt engineering attacks. 
Notably, LLM Self Defense\nsucceeds in reducing the attack success rate to virtually 0 using both GPT 3.5\nand Llama 2.\n","authors":["Mansi Phute","Alec Helbling","Matthew Hull","ShengYun Peng","Sebastian Szyller","Cory Cornelius","Duen Horng Chau"],"pdf_url":"https://arxiv.org/pdf/2308.07308v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15484v1","updated":"2023-10-24T03:24:15Z","published":"2023-10-24T03:24:15Z","title":"NuTrea: Neural Tree Search for Context-guided Multi-hop KGQA","summary":" Multi-hop Knowledge Graph Question Answering (KGQA) is a task that involves\nretrieving nodes from a knowledge graph (KG) to answer natural language\nquestions. Recent GNN-based approaches formulate this task as a KG path\nsearching problem, where messages are sequentially propagated from the seed\nnode towards the answer nodes. However, these messages are past-oriented, and\nthey do not consider the full KG context. To make matters worse, KG nodes often\nrepresent proper noun entities and are sometimes encrypted, being uninformative\nin selecting between paths. To address these problems, we propose Neural Tree\nSearch (NuTrea), a tree search-based GNN model that incorporates the broader KG\ncontext. Our model adopts a message-passing scheme that probes the unreached\nsubtree regions to boost the past-oriented embeddings. In addition, we\nintroduce the Relation Frequency-Inverse Entity Frequency (RF-IEF) node\nembedding that considers the global KG context to better characterize ambiguous\nKG nodes. The general effectiveness of our approach is demonstrated through\nexperiments on three major multi-hop KGQA benchmark datasets, and our extensive\nanalyses further validate its expressiveness and robustness. Overall, NuTrea\nprovides a powerful means to query the KG with complex natural language\nquestions. Code is available at https://github.com/mlvlab/NuTrea.\n","authors":["Hyeong Kyu Choi","Seunghun Lee","Jaewon Chu","Hyunwoo J. 
Kim"],"pdf_url":"https://arxiv.org/pdf/2310.15484v1.pdf","comment":"Neural Information Processing Systems (NeurIPS) 2023"},{"id":"http://arxiv.org/abs/2307.13854v2","updated":"2023-10-24T03:19:22Z","published":"2023-07-25T22:59:32Z","title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","summary":" With advances in generative AI, there is now potential for autonomous agents\nto manage daily tasks via natural language commands. However, current agents\nare primarily created and tested in simplified synthetic environments, leading\nto a disconnect with real-world scenarios. In this paper, we build an\nenvironment for language-guided agents that is highly realistic and\nreproducible. Specifically, we focus on agents that perform tasks on the web,\nand create an environment with fully functional websites from four common\ndomains: e-commerce, social forum discussions, collaborative software\ndevelopment, and content management. Our environment is enriched with tools\n(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage\nhuman-like task-solving. Building upon our environment, we release a set of\nbenchmark tasks focusing on evaluating the functional correctness of task\ncompletions. The tasks in our benchmark are diverse, long-horizon, and designed\nto emulate tasks that humans routinely perform on the internet. We experiment\nwith several baseline agents, integrating recent techniques such as reasoning\nbefore acting. The results demonstrate that solving complex tasks is\nchallenging: our best GPT-4-based agent only achieves an end-to-end task\nsuccess rate of 14.41%, significantly lower than the human performance of\n78.24%. These results highlight the need for further development of robust\nagents, that current state-of-the-art large language models are far from\nperfect performance in these real-life tasks, and that WebArena can be used to\nmeasure such progress.\n","authors":["Shuyan Zhou","Frank F. 
Xu","Hao Zhu","Xuhui Zhou","Robert Lo","Abishek Sridhar","Xianyi Cheng","Tianyue Ou","Yonatan Bisk","Daniel Fried","Uri Alon","Graham Neubig"],"pdf_url":"https://arxiv.org/pdf/2307.13854v2.pdf","comment":"Our code, data, environment reproduction resources, and video\n demonstrations are publicly available at https://webarena.dev/"},{"id":"http://arxiv.org/abs/2310.15477v1","updated":"2023-10-24T03:08:58Z","published":"2023-10-24T03:08:58Z","title":"CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without\n Full Large Language Model","summary":" Instruction tuning has recently been recognized as an effective way of\naligning Large Language Models (LLMs) to enhance their generalization ability\nacross various tasks. However, when tuning publicly accessible, centralized\nLLMs with private instruction data, privacy concerns are inevitable. While\ndirect transfer of parameterized modules between models is a plausible approach\nto address this, its implications and effectiveness need further exploration.\nThis paper focuses on Offsite-Tuning (OFT), a representative technique that\ntransfers transformer blocks between centralized LLMs and downstream emulators.\nGiven the limited understanding of the underlying mechanism of OFT, we perform\nan empirical analysis on LLMs from the perspectives of representation and\nfunctional similarity. Interestingly, our findings reveal a unique modular\nstructure within the layers of LLMs that appears to emerge as the model size\nexpands. Simultaneously, we note subtle but potentially significant changes in\nrepresentation and intermediate predictions across the layers. Inspired by\nthese observations, we propose CRaSh, involving Clustering, Removing, and\nSharing, a training-free strategy to derive improved emulators from LLMs. 
CRaSh\nsignificantly boosts performance of OFT with billions of parameters.\nFurthermore, we investigate the optimal solutions yielded by fine-tuning with\nand without full model through the lens of loss landscape. Our findings\ndemonstrate a linear connectivity among these optima falling over the same\nbasin, thereby highlighting the effectiveness of CRaSh and OFT. The source code\nis publicly available at https://github.com/TsinghuaC3I/CRaSh.\n","authors":["Kaiyan Zhang","Ning Ding","Biqing Qi","Xuekai Zhu","Xinwei Long","Bowen Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.15477v1.pdf","comment":"Accepted to EMNLP 2023 (Main Conference)"},{"id":"http://arxiv.org/abs/2305.01278v2","updated":"2023-10-24T02:57:42Z","published":"2023-05-02T09:28:39Z","title":"VPGTrans: Transfer Visual Prompt Generator across LLMs","summary":" While developing a new multimodal LLM (MLLM) by pre-training on tremendous\nimage-text pairs from scratch can be exceedingly resource-consuming, connecting\nan existing LLM with a comparatively lightweight visual prompt generator (VPG)\nbecomes a feasible paradigm. However, further tuning the VPG part of the MLLM\nstill suffers from indispensable computational costs, i.e., requiring thousands\nof GPU hours and millions of training data. One alternative solution is to\ntransfer an existing VPG from any existing MLLMs for the target MLLM.\n In this work, we for the first time investigate the VPG transferability\nacross LLMs, and explore a solution to reduce the cost of VPG transfer. We\nfirst study the VPG transfer across different LLM sizes (e.g., small-to-large),\nand across different LLM types, through which we diagnose the key factors to\nmaximize the transfer efficiency. Based on our observation, we design a\ntwo-stage transfer framework named VPGTrans, which is simple yet highly\neffective. Through extensive experiments, we demonstrate that VPGTrans helps\nsignificantly speed up the transfer learning process without compromising\nperformance. 
Remarkably, it helps achieve the VPG transfer from BLIP-2\nOPT$_\\text{2.7B}$ to BLIP-2 OPT$_\\text{6.7B}$ with over 10 times speed-up and\n10.7% training data compared with connecting a VPG to OPT$_\\text{6.7B}$ from\nscratch. Further, a series of intriguing findings and potential rationales\nbehind them are provided and discussed. Finally, we showcase the practical\nvalue of our VPGTrans approach, by customizing two novel MLLMs, including\nVL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs.\n","authors":["Ao Zhang","Hao Fei","Yuan Yao","Wei Ji","Li Li","Zhiyuan Liu","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2305.01278v2.pdf","comment":"Project Website: https://vpgtrans.github.io Code:\n https://github.com/VPGTrans/VPGTrans NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15470v1","updated":"2023-10-24T02:48:50Z","published":"2023-10-24T02:48:50Z","title":"Continual Event Extraction with Semantic Confusion Rectification","summary":" We study continual event extraction, which aims to extract incessantly\nemerging event information while avoiding forgetting. We observe that the\nsemantic confusion on event types stems from the annotations of the same text\nbeing updated over time. The imbalance between event types even aggravates this\nissue. This paper proposes a novel continual event extraction model with\nsemantic confusion rectification. We mark pseudo labels for each sentence to\nalleviate semantic confusion. We transfer pivotal knowledge between current and\nprevious models to enhance the understanding of event types. Moreover, we\nencourage the model to focus on the semantics of long-tailed event types by\nleveraging other associated types. 
Experimental results show that our model\noutperforms state-of-the-art baselines and is proficient in imbalanced\ndatasets.\n","authors":["Zitao Wang","Xinyi Wang","Wei Hu"],"pdf_url":"https://arxiv.org/pdf/2310.15470v1.pdf","comment":"Accepted in the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.15469v1","updated":"2023-10-24T02:48:19Z","published":"2023-10-24T02:48:19Z","title":"The Janus Interface: How Fine-Tuning in Large Language Models Amplifies\n the Privacy Risks","summary":" The era post-2018 marked the advent of Large Language Models (LLMs), with\ninnovations such as OpenAI's ChatGPT showcasing prodigious linguistic prowess.\nAs the industry galloped toward augmenting model parameters and capitalizing on\nvast swaths of human language data, security and privacy challenges also\nemerged. Foremost among these is the potential inadvertent accrual of Personal\nIdentifiable Information (PII) during web-based data acquisition, posing risks\nof unintended PII disclosure. While strategies like RLHF during training and\nCatastrophic Forgetting have been marshaled to control the risk of privacy\ninfringements, recent advancements in LLMs, epitomized by OpenAI's fine-tuning\ninterface for GPT-3.5, have reignited concerns. One may ask: can the\nfine-tuning of LLMs precipitate the leakage of personal information embedded\nwithin training datasets? This paper reports the first endeavor to seek the\nanswer to the question, particularly our discovery of a new LLM exploitation\navenue, called the Janus attack. In the attack, one can construct a PII\nassociation task, whereby an LLM is fine-tuned using a minuscule PII dataset,\nto potentially reinstate and reveal concealed PIIs. Our findings indicate that,\nwith a trivial fine-tuning outlay, LLMs such as GPT-3.5 can transition from\nbeing impermeable to PII extraction to a state where they divulge a substantial\nproportion of concealed PII. 
This research, through its deep dive into the\nJanus attack vector, underscores the imperative of navigating the intricate\ninterplay between LLM utility and privacy preservation.\n","authors":["Xiaoyi Chen","Siyuan Tang","Rui Zhu","Shijun Yan","Lei Jin","Zihao Wang","Liya Su","XiaoFeng Wang","Haixu Tang"],"pdf_url":"https://arxiv.org/pdf/2310.15469v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14512v2","updated":"2023-10-24T02:45:55Z","published":"2023-10-23T02:47:27Z","title":"CorefPrompt: Prompt-based Event Coreference Resolution by Measuring\n Event Type and Argument Compatibilities","summary":" Event coreference resolution (ECR) aims to group event mentions referring to\nthe same real-world event into clusters. Most previous studies adopt the\n\"encoding first, then scoring\" framework, making the coreference judgment rely\non event encoding. Furthermore, current methods struggle to leverage\nhuman-summarized ECR rules, e.g., coreferential events should have the same\nevent type, to guide the model. To address these two issues, we propose a\nprompt-based approach, CorefPrompt, to transform ECR into a cloze-style MLM\n(masked language model) task. This allows for simultaneous event modeling and\ncoreference discrimination within a single template, with a fully shared\ncontext. 
In addition, we introduce two auxiliary prompt tasks, event-type\ncompatibility and argument compatibility, to explicitly demonstrate the\nreasoning process of ECR, which helps the model make final predictions.\nExperimental results show that our method CorefPrompt performs well in a\nstate-of-the-art (SOTA) benchmark.\n","authors":["Sheng Xu","Peifeng Li","Qiaoming Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.14512v2.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2310.15464v1","updated":"2023-10-24T02:27:06Z","published":"2023-10-24T02:27:06Z","title":"Interpreting Answers to Yes-No Questions in User-Generated Content","summary":" Interpreting answers to yes-no questions in social media is difficult. Yes\nand no keywords are uncommon, and the few answers that include them are rarely\nto be interpreted what the keywords suggest. In this paper, we present a new\ncorpus of 4,442 yes-no question-answer pairs from Twitter. We discuss\nlinguistic characteristics of answers whose interpretation is yes or no, as\nwell as answers whose interpretation is unknown. We show that large language\nmodels are far from solving this problem, even after fine-tuning and blending\nother corpora for the same problem but outside social media.\n","authors":["Shivam Mathur","Keun Hee Park","Dhivya Chinnappa","Saketh Kotamraju","Eduardo Blanco"],"pdf_url":"https://arxiv.org/pdf/2310.15464v1.pdf","comment":"Accepted at the Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15461v1","updated":"2023-10-24T02:23:34Z","published":"2023-10-24T02:23:34Z","title":"Facilitating Self-Guided Mental Health Interventions Through\n Human-Language Model Interaction: A Case Study of Cognitive Restructuring","summary":" Self-guided mental health interventions, such as \"do-it-yourself\" tools to\nlearn and practice coping strategies, show great promise to improve access to\nmental health care. 
However, these interventions are often cognitively\ndemanding and emotionally triggering, creating accessibility barriers that\nlimit their wide-scale implementation and adoption. In this paper, we study how\nhuman-language model interaction can support self-guided mental health\ninterventions. We take cognitive restructuring, an evidence-based therapeutic\ntechnique to overcome negative thinking, as a case study. In an IRB-approved\nrandomized field study on a large mental health website with 15,531\nparticipants, we design and evaluate a system that uses language models to\nsupport people through various steps of cognitive restructuring. Our findings\nreveal that our system positively impacts emotional intensity for 67% of\nparticipants and helps 65% overcome negative thoughts. Although adolescents\nreport relatively worse outcomes, we find that tailored interventions that\nsimplify language model generations improve overall effectiveness and equity.\n","authors":["Ashish Sharma","Kevin Rushton","Inna Wanyin Lin","Theresa Nguyen","Tim Althoff"],"pdf_url":"https://arxiv.org/pdf/2310.15461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13230v2","updated":"2023-10-24T02:21:38Z","published":"2023-09-23T01:52:14Z","title":"Unify word-level and span-level tasks: NJUNLP's Participation for the\n WMT2023 Quality Estimation Shared Task","summary":" We introduce the submissions of the NJUNLP team to the WMT 2023 Quality\nEstimation (QE) shared task. Our team submitted predictions for the\nEnglish-German language pair on all two sub-tasks: (i) sentence- and word-level\nquality prediction; and (ii) fine-grained error span detection. This year, we\nfurther explore pseudo data methods for QE based on NJUQE framework\n(https://github.com/NJUNLP/njuqe). We generate pseudo MQM data using parallel\ndata from the WMT translation task. We pre-train the XLMR large model on pseudo\nQE data, then fine-tune it on real QE data. 
At both stages, we jointly learn\nsentence-level scores and word-level tags. Empirically, we conduct experiments\nto find the key hyper-parameters that improve the performance. Technically, we\npropose a simple method that converts the word-level outputs to fine-grained\nerror span results. Overall, our models achieved the best results in\nEnglish-German for both word-level and fine-grained error span detection\nsub-tasks by a considerable margin.\n","authors":["Xiang Geng","Zhejian Lai","Yu Zhang","Shimin Tao","Hao Yang","Jiajun Chen","Shujian Huang"],"pdf_url":"https://arxiv.org/pdf/2309.13230v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.07430v3","updated":"2023-10-24T02:04:59Z","published":"2023-09-14T05:15:01Z","title":"Clinical Text Summarization: Adapting Large Language Models Can\n Outperform Human Experts","summary":" Sifting through vast textual data and summarizing key information from\nelectronic health records (EHR) imposes a substantial burden on how clinicians\nallocate their time. Although large language models (LLMs) have shown immense\npromise in natural language processing (NLP) tasks, their efficacy on a diverse\nrange of clinical summarization tasks has not yet been rigorously demonstrated.\nIn this work, we apply domain adaptation methods to eight LLMs, spanning six\ndatasets and four distinct clinical summarization tasks: radiology reports,\npatient questions, progress notes, and doctor-patient dialogue. Our thorough\nquantitative assessment reveals trade-offs between models and adaptation\nmethods in addition to instances where recent advances in LLMs may not improve\nresults. Further, in a clinical reader study with ten physicians, we show that\nsummaries from our best-adapted LLMs are preferable to human summaries in terms\nof completeness and correctness. Our ensuing qualitative analysis highlights\nchallenges faced by both LLMs and human experts. 
Lastly, we correlate\ntraditional quantitative NLP metrics with reader study scores to enhance our\nunderstanding of how these metrics align with physician preferences. Our\nresearch marks the first evidence of LLMs outperforming human experts in\nclinical text summarization across multiple tasks. This implies that\nintegrating LLMs into clinical workflows could alleviate documentation burden,\nempowering clinicians to focus more on personalized patient care and the\ninherently human aspects of medicine.\n","authors":["Dave Van Veen","Cara Van Uden","Louis Blankemeier","Jean-Benoit Delbrouck","Asad Aali","Christian Bluethgen","Anuj Pareek","Malgorzata Polacin","Eduardo Pontes Reis","Anna Seehofnerova","Nidhi Rohatgi","Poonam Hosamani","William Collins","Neera Ahuja","Curtis P. Langlotz","Jason Hom","Sergios Gatidis","John Pauly","Akshay S. Chaudhari"],"pdf_url":"https://arxiv.org/pdf/2309.07430v3.pdf","comment":"24 pages, 24 figures. Compared to the original, newer versions\n include minor edits and supplementary additional experiments that reinforce\n the initial findings"},{"id":"http://arxiv.org/abs/2310.14545v2","updated":"2023-10-24T01:56:05Z","published":"2023-10-23T03:55:13Z","title":"Harnessing ChatGPT for thematic analysis: Are we ready?","summary":" ChatGPT is an advanced natural language processing tool with growing\napplications across various disciplines in medical research. Thematic analysis,\na qualitative research method to identify and interpret patterns in data, is\none application that stands to benefit from this technology. This viewpoint\nexplores the utilization of ChatGPT in three core phases of thematic analysis\nwithin a medical context: 1) direct coding of transcripts, 2) generating themes\nfrom a predefined list of codes, and 3) preprocessing quotes for manuscript\ninclusion. Additionally, we explore the potential of ChatGPT to generate\ninterview transcripts, which may be used for training purposes. 
We assess the\nstrengths and limitations of using ChatGPT in these roles, highlighting areas\nwhere human intervention remains necessary. Overall, we argue that ChatGPT can\nfunction as a valuable tool during analysis, enhancing the efficiency of the\nthematic analysis and offering additional insights into the qualitative data.\n","authors":["V Vien Lee","Stephanie C. C. van der Lubbe","Lay Hoon Goh","Jose M. Valderas"],"pdf_url":"https://arxiv.org/pdf/2310.14545v2.pdf","comment":"23 pages, 7 figures, 3 tables, 1 textbox"},{"id":"http://arxiv.org/abs/2310.14747v2","updated":"2023-10-24T01:43:33Z","published":"2023-10-23T09:32:53Z","title":"MCC-KD: Multi-CoT Consistent Knowledge Distillation","summary":" Large language models (LLMs) have showcased remarkable capabilities in\ncomplex reasoning through chain of thought (CoT) prompting. Recently, there has\nbeen a growing interest in transferring these reasoning abilities from LLMs to\nsmaller models. However, achieving both the diversity and consistency in\nrationales presents a challenge. In this paper, we focus on enhancing these two\naspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to\nefficiently distill the reasoning capabilities. In MCC-KD, we generate multiple\nrationales for each question and enforce consistency among the corresponding\npredictions by minimizing the bidirectional KL-divergence between the answer\ndistributions. We investigate the effectiveness of MCC-KD with different model\narchitectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both\nmathematical reasoning and commonsense reasoning benchmarks. 
The empirical\nresults not only confirm MCC-KD's superior performance on in-distribution\ndatasets but also highlight its robust generalization ability on\nout-of-distribution datasets.\n","authors":["Hongzhan Chen","Siyue Wu","Xiaojun Quan","Rui Wang","Ming Yan","Ji Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14747v2.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.05388v2","updated":"2023-10-24T01:37:46Z","published":"2023-10-09T03:55:55Z","title":"GROVE: A Retrieval-augmented Complex Story Generation Framework with A\n Forest of Evidence","summary":" Conditional story generation is significant in human-machine interaction,\nparticularly in producing stories with complex plots. While large language\nmodels (LLMs) perform well on multiple NLP tasks, including story generation,\nit is challenging to generate stories with both complex and creative plots.\nExisting methods often rely on detailed prompts to guide LLMs to meet target\nconditions, which inadvertently restrict the creative potential of the\ngenerated stories. We argue that leveraging information from exemplary\nhuman-written stories facilitates generating more diverse plotlines. Delving\ndeeper into story details helps build complex and credible plots. In this\npaper, we propose a retrieval-au\\textbf{G}mented sto\\textbf{R}y generation\nframework with a f\\textbf{O}rest of e\\textbf{V}id\\textbf{E}nce (GROVE) to\nenhance stories' complexity. We build a retrieval repository for target\nconditions to produce few-shot examples to prompt LLMs. Additionally, we design\nan ``asking-why'' prompting scheme that extracts a forest of evidence,\nproviding compensation for the ambiguities that may occur in the generated\nstory. This iterative process uncovers underlying story backgrounds. Finally,\nwe select the most fitting chains of evidence from the evidence forest and\nintegrate them into the generated story, thereby enhancing the narrative's\ncomplexity and credibility. 
Experimental results and numerous examples verify\nthe effectiveness of our method.\n","authors":["Zhihua Wen","Zhiliang Tian","Wei Wu","Yuxin Yang","Yanqi Shi","Zhen Huang","Dongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.05388v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.06498v2","updated":"2023-10-24T01:37:10Z","published":"2023-10-10T10:14:59Z","title":"A New Benchmark and Reverse Validation Method for Passage-level\n Hallucination Detection","summary":" Large Language Models (LLMs) have shown their ability to collaborate\neffectively with humans in real-world scenarios. However, LLMs are apt to\ngenerate hallucinations, i.e., make up incorrect text and unverified\ninformation, which can cause significant damage when deployed for\nmission-critical tasks. In this paper, we propose a self-check approach based\non reverse validation to detect factual errors automatically in a zero-resource\nfashion. To facilitate future studies and assess different methods, we\nconstruct a hallucination detection benchmark named PHD, which is generated by\nChatGPT and annotated by human annotators. Contrasting previous studies of\nzero-resource hallucination detection, our method and benchmark concentrate on\npassage-level detection instead of sentence-level. We empirically evaluate our\nmethod and existing zero-resource detection methods on two datasets. 
The\nexperimental results demonstrate that the proposed method considerably\noutperforms the baselines while costing fewer tokens and less time.\nFurthermore, we manually analyze some hallucination cases that LLM failed to\ncapture, revealing the shared limitation of zero-resource methods.\n","authors":["Shiping Yang","Renliang Sun","Xiaojun Wan"],"pdf_url":"https://arxiv.org/pdf/2310.06498v2.pdf","comment":"EMNLP2023 Findings"},{"id":"http://arxiv.org/abs/2007.01777v4","updated":"2023-10-24T01:26:23Z","published":"2020-07-03T16:00:26Z","title":"Interpretable Text Classification Via Prototype Trajectories","summary":" We propose a novel interpretable deep neural network for text classification,\ncalled ProtoryNet, based on a new concept of prototype trajectories. Motivated\nby the prototype theory in modern linguistics, ProtoryNet makes a prediction by\nfinding the most similar prototype for each sentence in a text sequence and\nfeeding an RNN backbone with the proximity of each sentence to the\ncorresponding active prototype. The RNN backbone then captures the temporal\npattern of the prototypes, which we refer to as prototype trajectories.\nPrototype trajectories enable intuitive and fine-grained interpretation of the\nreasoning process of the RNN model, in resemblance to how humans analyze texts.\nWe also design a prototype pruning procedure to reduce the total number of\nprototypes used by the model for better interpretability. Experiments on\nmultiple public data sets show that ProtoryNet is more accurate than the\nbaseline prototype-based deep neural net and reduces the performance gap\ncompared to state-of-the-art black-box models. 
In addition, after prototype\npruning, the resulting ProtoryNet models only need less than or around 20\nprototypes for all datasets, which significantly benefits interpretability.\nFurthermore, we report a survey result indicating that human users find\nProtoryNet more intuitive and easier to understand than other prototype-based\nmethods.\n","authors":["Dat Hong","Stephen S. Baek","Tong Wang"],"pdf_url":"https://arxiv.org/pdf/2007.01777v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14934v2","updated":"2023-10-24T01:21:05Z","published":"2023-05-24T09:16:51Z","title":"GRACE: Discriminator-Guided Chain-of-Thought Reasoning","summary":" In the context of multi-step reasoning, e.g., with chain-of-thought, language\nmodels (LMs) can easily assign a high likelihood to incorrect steps. As a\nresult, decoding strategies that optimize for solution likelihood often yield\nincorrect solutions. To address this issue, we propose Guiding chain-of-thought\nReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding\napproach that steers the decoding process towards producing correct reasoning\nsteps. GRACE employs a discriminator trained with a contrastive loss over\ncorrect and incorrect steps, which is used during decoding to score next-step\ncandidates based on their correctness. Importantly, GRACE only requires\nsampling from the LM, without the need for LM training or fine-tuning. Using\nmodels from FLAN-T5 and LLaMA families, we evaluate GRACE over four math and\ntwo symbolic reasoning tasks, where it exhibits substantial performance gains\ncompared to greedy decoding, verifiers, and self-consistency in most settings.\nWhen further combined with self-consistency, GRACE outperforms all the\nbaselines by sizeable margins. Human and LLM evaluations over GSM8K show that\nGRACE not only improves the final answer accuracy but also the correctness of\nthe intermediate reasoning. 
Our implementation can be accessed at\n\\url{https://github.com/mukhal/grace}.\n","authors":["Muhammad Khalifa","Lajanugen Logeswaran","Moontae Lee","Honglak Lee","Lu Wang"],"pdf_url":"https://arxiv.org/pdf/2305.14934v2.pdf","comment":"To appear at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15439v1","updated":"2023-10-24T01:20:05Z","published":"2023-10-24T01:20:05Z","title":"K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific\n Ratings","summary":" Numerous datasets have been proposed to combat the spread of online hate.\nDespite these efforts, a majority of these resources are English-centric,\nprimarily focusing on overt forms of hate. This research gap calls for\ndeveloping high-quality corpora in diverse languages that also encapsulate more\nsubtle hate expressions. This study introduces K-HATERS, a new corpus for hate\nspeech detection in Korean, comprising approximately 192K news comments with\ntarget-specific offensiveness ratings. This resource is the largest offensive\nlanguage corpus in Korean and is the first to offer target-specific ratings on\na three-point Likert scale, enabling the detection of hate expressions in\nKorean across varying degrees of offensiveness. We conduct experiments showing\nthe effectiveness of the proposed corpus, including a comparison with existing\ndatasets. Additionally, to address potential noise and bias in human\nannotations, we explore a novel idea of adopting the Cognitive Reflection Test,\nwhich is widely used in social science for assessing an individual's cognitive\nability, as a proxy of labeling quality. Findings indicate that annotations\nfrom individuals with the lowest test scores tend to yield detection models\nthat make biased predictions toward specific target groups and are less\naccurate. This study contributes to the NLP research on hate speech detection\nand resource construction. 
The code and dataset can be accessed at\nhttps://github.com/ssu-humane/K-HATERS.\n","authors":["Chaewon Park","Soohwan Kim","Kyubyong Park","Kunwoo Park"],"pdf_url":"https://arxiv.org/pdf/2310.15439v1.pdf","comment":"15 pages, EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2305.13245v2","updated":"2023-10-24T00:57:50Z","published":"2023-05-22T17:16:38Z","title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head\n Checkpoints","summary":" Multi-query attention (MQA), which only uses a single key-value head,\ndrastically speeds up decoder inference. However, MQA can lead to quality\ndegradation, and moreover it may not be desirable to train a separate model\njust for faster inference. We (1) propose a recipe for uptraining existing\nmulti-head language model checkpoints into models with MQA using 5% of original\npre-training compute, and (2) introduce grouped-query attention (GQA), a\ngeneralization of multi-query attention which uses an intermediate (more than\none, less than number of query heads) number of key-value heads. We show that\nuptrained GQA achieves quality close to multi-head attention with comparable\nspeed to MQA.\n","authors":["Joshua Ainslie","James Lee-Thorp","Michiel de Jong","Yury Zemlyanskiy","Federico Lebrón","Sumit Sanghai"],"pdf_url":"https://arxiv.org/pdf/2305.13245v2.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2303.09752v3","updated":"2023-10-24T00:51:49Z","published":"2023-03-17T03:28:17Z","title":"CoLT5: Faster Long-Range Transformers with Conditional Computation","summary":" Many natural language processing tasks benefit from long inputs, but\nprocessing long documents with Transformers is expensive -- not only due to\nquadratic attention complexity but also from applying feedforward and\nprojection layers to every token. However, not all tokens are equally\nimportant, especially for longer documents. 
We propose CoLT5, a long-input\nTransformer model that builds on this intuition by employing conditional\ncomputation, devoting more resources to important tokens in both feedforward\nand attention layers. We show that CoLT5 achieves stronger performance than\nLongT5 with much faster training and inference, achieving SOTA on the\nlong-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably\nmake use of extremely long inputs, showing strong gains up to 64k input length.\n","authors":["Joshua Ainslie","Tao Lei","Michiel de Jong","Santiago Ontañón","Siddhartha Brahma","Yury Zemlyanskiy","David Uthus","Mandy Guo","James Lee-Thorp","Yi Tay","Yun-Hsuan Sung","Sumit Sanghai"],"pdf_url":"https://arxiv.org/pdf/2303.09752v3.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15431v1","updated":"2023-10-24T00:51:29Z","published":"2023-10-24T00:51:29Z","title":"What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts\n and Rationales for Disambiguating Defeasible Social and Moral Situations","summary":" Moral or ethical judgments rely heavily on the specific contexts in which\nthey occur. Understanding varying shades of defeasible contextualizations\n(i.e., additional information that strengthens or attenuates the moral\nacceptability of an action) is critical to accurately represent the subtlety\nand intricacy of grounded human moral judgment in real-life scenarios.\n We introduce defeasible moral reasoning: a task to provide grounded contexts\nthat make an action more or less morally acceptable, along with commonsense\nrationales that justify the reasoning. 
To elicit high-quality task data, we\ntake an iterative self-distillation approach that starts from a small amount of\nunstructured seed knowledge from GPT-3 and then alternates between (1)\nself-distillation from student models; (2) targeted filtering with a critic\nmodel trained by human judgment (to boost validity) and NLI (to boost\ndiversity); (3) self-imitation learning (to amplify the desired data quality).\nThis process yields a student model that produces defeasible contexts with\nimproved validity, diversity, and defeasibility. From this model we distill a\nhigh-quality dataset, \\delta-Rules-of-Thumb, of 1.2M entries of\ncontextualizations and rationales for 115K defeasible moral actions rated\nhighly by human annotators 85.9% to 99.8% of the time. Using \\delta-RoT we\nobtain a final student model that wins over all intermediate student models by\na notable margin.\n","authors":["Kavel Rao","Liwei Jiang","Valentina Pyatkin","Yuling Gu","Niket Tandon","Nouha Dziri","Faeze Brahman","Yejin Choi"],"pdf_url":"https://arxiv.org/pdf/2310.15431v1.pdf","comment":"Camera Ready EMNLP Findings 2023. First two authors contributed\n equally"},{"id":"http://arxiv.org/abs/2310.15429v1","updated":"2023-10-24T00:50:33Z","published":"2023-10-24T00:50:33Z","title":"Beyond Sentiment: Leveraging Topic Metrics for Political Stance\n Classification","summary":" Sentiment analysis, widely critiqued for capturing merely the overall tone of\na corpus, falls short in accurately reflecting the latent structures and\npolitical stances within texts. This study introduces topic metrics, dummy\nvariables converted from extracted topics, as both an alternative and\ncomplement to sentiment metrics in stance classification. By employing three\ndatasets identified by Bestvater and Monroe (2023), this study demonstrates\nBERTopic's proficiency in extracting coherent topics and the effectiveness of\ntopic metrics in stance classification. 
The experiment results show that\nBERTopic improves coherence scores by 17.07% to 54.20% when compared to\ntraditional approaches such as Latent Dirichlet Allocation (LDA) and Non-negative\nMatrix Factorization (NMF), prevalent in earlier political science research.\nAdditionally, our results indicate topic metrics outperform sentiment metrics\nin stance classification, increasing performance by as much as 18.95%. Our\nfindings suggest topic metrics are especially effective for context-rich texts\nand corpora where stance and sentiment correlations are weak. The combination of\nsentiment and topic metrics achieves an optimal performance in most of the\nscenarios and can further address the limitations of relying solely on\nsentiment as well as the low coherence score of topic metrics.\n","authors":["Weihong Qi"],"pdf_url":"https://arxiv.org/pdf/2310.15429v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15425v1","updated":"2023-10-24T00:43:54Z","published":"2023-10-24T00:43:54Z","title":"The Mason-Alberta Phonetic Segmenter: A forced alignment system based on\n deep neural networks and interpolation","summary":" Forced alignment systems automatically determine boundaries between segments\nin speech data, given an orthographic transcription. These tools are\ncommonplace in phonetics to facilitate the use of speech data that would be\ninfeasible to manually transcribe and segment. In the present paper, we\ndescribe a new neural network-based forced alignment system, the Mason-Alberta\nPhonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two\npossible improvements we pursue for forced alignment systems. The first is\ntreating the acoustic model in a forced aligner as a tagging task, rather than\na classification task, motivated by the common understanding that segments in\nspeech are not truly discrete and commonly overlap. 
The second is an\ninterpolation technique to allow boundaries more precise than the common 10 ms\nlimit in modern forced alignment systems. We compare configurations of our\nsystem to a state-of-the-art system, the Montreal Forced Aligner. The tagging\napproach did not generally yield improved results over the Montreal Forced\nAligner. However, a system with the interpolation technique had a 27.92%\nincrease relative to the Montreal Forced Aligner in the amount of boundaries\nwithin 10 ms of the target on the test set. We also reflect on the task and\ntraining process for acoustic modeling in forced alignment, highlighting how\nthe output targets for these models do not match phoneticians' conception of\nsimilarity between phones and that reconciliation of this tension may require\nrethinking the task and output targets or how speech itself should be\nsegmented.\n","authors":["Matthew C. Kelley","Scott James Perry","Benjamin V. Tucker"],"pdf_url":"https://arxiv.org/pdf/2310.15425v1.pdf","comment":"submitted for publication"},{"id":"http://arxiv.org/abs/2310.05030v2","updated":"2023-10-24T00:36:29Z","published":"2023-10-08T06:20:36Z","title":"Counter Turing Test CT^2: AI-Generated Text Detection is Not as Easy as\n You May Think -- Introducing AI Detectability Index","summary":" With the rise of prolific ChatGPT, the risk and consequences of AI-generated\ntext has increased alarmingly. To address the inevitable question of ownership\nattribution for AI-generated artifacts, the US Copyright Office released a\nstatement stating that 'If a work's traditional elements of authorship were\nproduced by a machine, the work lacks human authorship and the Office will not\nregister it'. 
Furthermore, both the US and the EU governments have recently\ndrafted their initial proposals regarding the regulatory framework for AI.\nGiven this cynosural spotlight on generative AI, AI-generated text detection\n(AGTD) has emerged as a topic that has already received immediate attention in\nresearch, with some initial methods having been proposed, soon followed by\nemergence of techniques to bypass detection. This paper introduces the Counter\nTuring Test (CT^2), a benchmark consisting of techniques aiming to offer a\ncomprehensive evaluation of the robustness of existing AGTD techniques. Our\nempirical findings unequivocally highlight the fragility of the proposed AGTD\nmethods under scrutiny. Amidst the extensive deliberations on policy-making for\nregulating AI development, it is of utmost importance to assess the\ndetectability of content generated by LLMs. Thus, to establish a quantifiable\nspectrum facilitating the evaluation and ranking of LLMs according to their\ndetectability levels, we propose the AI Detectability Index (ADI). We conduct a\nthorough examination of 15 contemporary LLMs, empirically demonstrating that\nlarger LLMs tend to have a higher ADI, indicating they are less detectable\ncompared to smaller LLMs. We firmly believe that ADI holds significant value as\na tool for the wider NLP community, with the potential to serve as a rubric in\nAI-related policy-making.\n","authors":["Megha Chakraborty","S. M Towhidul Islam Tonmoy","S M Mehedi Zaman","Krish Sharma","Niyar R Barman","Chandan Gupta","Shreya Gautam","Tanay Kumar","Vinija Jain","Aman Chadha","Amit P. 
Sheth","Amitava Das"],"pdf_url":"https://arxiv.org/pdf/2310.05030v2.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2310.15421v1","updated":"2023-10-24T00:24:11Z","published":"2023-10-24T00:24:11Z","title":"FANToM: A Benchmark for Stress-testing Machine Theory of Mind in\n Interactions","summary":" Theory of mind (ToM) evaluations currently focus on testing models using\npassive narratives that inherently lack interactivity. We introduce FANToM, a\nnew benchmark designed to stress-test ToM within information-asymmetric\nconversational contexts via question answering. Our benchmark draws upon\nimportant theoretical requisites from psychology and necessary empirical\nconsiderations when evaluating large language models (LLMs). In particular, we\nformulate multiple types of questions that demand the same underlying reasoning\nto identify illusory or false sense of ToM capabilities in LLMs. We show that\nFANToM is challenging for state-of-the-art LLMs, which perform significantly\nworse than humans even with chain-of-thought reasoning or fine-tuning.\n","authors":["Hyunwoo Kim","Melanie Sclar","Xuhui Zhou","Ronan Le Bras","Gunhee Kim","Yejin Choi","Maarten Sap"],"pdf_url":"https://arxiv.org/pdf/2310.15421v1.pdf","comment":"EMNLP 2023. Code and dataset can be found here:\n https://hyunw.kim/fantom"},{"id":"http://arxiv.org/abs/2310.15420v1","updated":"2023-10-24T00:23:30Z","published":"2023-10-24T00:23:30Z","title":"Let the Pretrained Language Models \"Imagine\" for Short Texts Topic\n Modeling","summary":" Topic models are one of the compelling methods for discovering latent\nsemantics in a document collection. However, it assumes that a document has\nsufficient co-occurrence information to be effective. However, in short texts,\nco-occurrence information is minimal, which results in feature sparsity in\ndocument representation. Therefore, existing topic models (probabilistic or\nneural) mostly fail to mine patterns from them to generate coherent topics. 
In\nthis paper, we take a new approach to short-text topic modeling to address the\ndata-sparsity issue by extending short text into longer sequences using\nexisting pre-trained language models (PLMs). Besides, we provide a simple\nsolution extending a neural topic model to reduce the effect of noisy\nout-of-topics text generation from PLMs. We observe that our model can\nsubstantially improve the performance of short-text topic modeling. Extensive\nexperiments on multiple real-world datasets under extreme data sparsity\nscenarios show that our models can generate high-quality topics outperforming\nstate-of-the-art models.\n","authors":["Pritom Saha Akash","Jie Huang","Kevin Chen-Chuan Chang"],"pdf_url":"https://arxiv.org/pdf/2310.15420v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16251v1","updated":"2023-10-24T23:53:15Z","published":"2023-10-24T23:53:15Z","title":"Speakerly: A Voice-based Writing Assistant for Text Composition","summary":" We present Speakerly, a new real-time voice-based writing assistance system\nthat helps users with text composition across various use cases such as emails,\ninstant messages, and notes. The user can interact with the system through\ninstructions or dictation, and the system generates a well-formatted and\ncoherent document. We describe the system architecture and detail how we\naddress the various challenges while building and deploying such a system at\nscale. 
More specifically, our system uses a combination of small, task-specific\nmodels as well as pre-trained language models for fast and effective text\ncomposition while supporting a variety of input modes for better usability.\n","authors":["Dhruv Kumar","Vipul Raheja","Alice Kaiser-Schatzlein","Robyn Perry","Apurva Joshi","Justin Hugues-Nuger","Samuel Lou","Navid Chowdhury"],"pdf_url":"https://arxiv.org/pdf/2310.16251v1.pdf","comment":"Accepted at EMNLP 2023 Industry Track"},{"id":"http://arxiv.org/abs/2310.16248v1","updated":"2023-10-24T23:45:57Z","published":"2023-10-24T23:45:57Z","title":"GlotLID: Language Identification for Low-Resource Languages","summary":" Several recent papers have published good solutions for language\nidentification (LID) for about 300 high-resource and medium-resource languages.\nHowever, there is no LID available that (i) covers a wide range of low-resource\nlanguages, (ii) is rigorously evaluated and reliable and (iii) efficient and\neasy to use. Here, we publish GlotLID-M, an LID model that satisfies the\ndesiderata of wide coverage, reliability and efficiency. It identifies 1665\nlanguages, a large increase in coverage compared to prior work. In our\nexperiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and\nNLLB) when balancing F1 and false positive rate (FPR). We analyze the unique\nchallenges that low-resource LID poses: incorrect corpus metadata, leakage from\nhigh-resource languages, difficulty separating closely related languages,\nhandling of macrolanguage vs varieties and in general noisy data. We hope that\nintegrating GlotLID-M into dataset creation pipelines will improve quality and\nenhance accessibility of NLP technology for low-resource languages and\ncultures. 
GlotLID-M model, code, and list of data sources are available:\nhttps://github.com/cisnlp/GlotLID.\n","authors":["Amir Hossein Kargaran","Ayyoob Imani","François Yvon","Hinrich Schütze"],"pdf_url":"https://arxiv.org/pdf/2310.16248v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16242v1","updated":"2023-10-24T23:30:17Z","published":"2023-10-24T23:30:17Z","title":"ZzzGPT: An Interactive GPT Approach to Enhance Sleep Quality","summary":" In today's world, sleep quality is pivotal for overall well-being. While\nwearable sensors offer real-time monitoring, they often lack actionable\ninsights, leading to user abandonment. This paper delves into the role of\ntechnology in understanding sleep patterns. We introduce a two-stage framework,\nutilizing Large Language Models (LLMs), aiming to provide accurate sleep\npredictions with actionable feedback. Leveraging the GLOBEM dataset and\nsynthetic data from LLMs, we highlight enhanced results with models like\nXGBoost. Our approach merges advanced machine learning with user-centric\ndesign, blending scientific accuracy with practicality.\n","authors":["Yonchanok Khaokaew","Thuc Hanh Nguyen","Kaixin Ji","Hiruni Kegalle","Marwah Alaofi"],"pdf_url":"https://arxiv.org/pdf/2310.16242v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16240v1","updated":"2023-10-24T23:29:06Z","published":"2023-10-24T23:29:06Z","title":"Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting\n Pre-trained Language Models","summary":" In this work, we propose a method that combines two popular research areas by\ninjecting linguistic structures into pre-trained language models in the\nparameter-efficient fine-tuning (PEFT) setting. In our approach, parallel\nadapter modules encoding different linguistic structures are combined using a\nnovel Mixture-of-Linguistic-Experts architecture, where Gumbel-Softmax gates\nare used to determine the importance of these modules at each layer of the\nmodel. 
To reduce the number of parameters, we first train the model for a fixed\nsmall number of steps before pruning the experts based on their importance\nscores. Our experiment results with three different pre-trained models show\nthat our approach can outperform state-of-the-art PEFT methods with a\ncomparable number of parameters. In addition, we provide additional analysis to\nexamine the experts selected by each model at each layer to provide insights\nfor future studies.\n","authors":["Raymond Li","Gabriel Murray","Giuseppe Carenini"],"pdf_url":"https://arxiv.org/pdf/2310.16240v1.pdf","comment":"14 pages, 3 figures, Camera-Ready for EMNLP 2023 Findings (Long\n Paper)"},{"id":"http://arxiv.org/abs/2305.10429v3","updated":"2023-10-24T23:16:15Z","published":"2023-05-17T17:58:13Z","title":"DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining","summary":" The mixture proportions of pretraining data domains (e.g., Wikipedia, books,\nweb text) greatly affect language model (LM) performance. In this paper, we\npropose Domain Reweighting with Minimax Optimization (DoReMi), which first\ntrains a small proxy model using group distributionally robust optimization\n(Group DRO) over domains to produce domain weights (mixture proportions)\nwithout knowledge of downstream tasks. We then resample a dataset with these\ndomain weights and train a larger, full-sized model. In our experiments, we use\nDoReMi on a 280M-parameter proxy model to find domain weights for training an\n8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves\nperplexity across all domains, even when it downweights a domain. DoReMi\nimproves average few-shot downstream accuracy by 6.5% points over a baseline\nmodel trained using The Pile's default domain weights and reaches the baseline\naccuracy with 2.6x fewer training steps. 
On the GLaM dataset, DoReMi, which has\nno knowledge of downstream tasks, even matches the performance of using domain\nweights tuned on downstream tasks.\n","authors":["Sang Michael Xie","Hieu Pham","Xuanyi Dong","Nan Du","Hanxiao Liu","Yifeng Lu","Percy Liang","Quoc V. Le","Tengyu Ma","Adams Wei Yu"],"pdf_url":"https://arxiv.org/pdf/2305.10429v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.15772v3","updated":"2023-10-24T22:59:26Z","published":"2023-08-30T05:41:29Z","title":"Task-Based MoE for Multitask Multilingual Machine Translation","summary":" Mixture-of-experts (MoE) architecture has been proven a powerful method for\ndiverse tasks in training deep models in many applications. However, current\nMoE implementations are task agnostic, treating all tokens from different tasks\nin the same manner. In this work, we instead design a novel method that\nincorporates task information into MoE models at different granular levels with\nshared dynamic task-based adapters. Our experiments and analysis show the\nadvantages of our approaches over the dense and canonical MoE models on\nmulti-task multilingual machine translations. With task-specific adapters, our\nmodels can additionally generalize to new tasks efficiently.\n","authors":["Hai Pham","Young Jin Kim","Subhabrata Mukherjee","David P. Woodruff","Barnabas Poczos","Hany Hassan Awadalla"],"pdf_url":"https://arxiv.org/pdf/2308.15772v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11430v2","updated":"2023-10-24T22:50:02Z","published":"2023-05-19T04:59:34Z","title":"TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks","summary":" While LLMs have shown great success in understanding and generating text in\ntraditional conversational settings, their potential for performing ill-defined\ncomplex tasks is largely under-studied. Indeed, we are yet to conduct\ncomprehensive benchmarking studies with multiple LLMs that are exclusively\nfocused on a complex task. 
However, conducting such benchmarking studies is\nchallenging because of the large variations in LLMs' performance when different\nprompt types/styles are used and different degrees of detail are provided in\nthe prompts. To address this issue, the paper proposes a general taxonomy that\ncan be used to design prompts with specific properties in order to perform a\nwide range of complex tasks. This taxonomy will allow future benchmarking\nstudies to report the specific categories of prompts used as part of the study,\nenabling meaningful comparisons across different studies. Also, by establishing\na common standard through this taxonomy, researchers will be able to draw more\naccurate conclusions about LLMs' performance on a specific complex task.\n","authors":["Shubhra Kanti Karmaker Santu","Dongji Feng"],"pdf_url":"https://arxiv.org/pdf/2305.11430v2.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16226v1","updated":"2023-10-24T22:41:14Z","published":"2023-10-24T22:41:14Z","title":"TiC-CLIP: Continual Training of CLIP Models","summary":" Keeping large foundation models up to date on latest data is inherently\nexpensive. To avoid the prohibitive costs of constantly retraining, it is\nimperative to continually train these models. This problem is exacerbated by\nthe lack of any large scale continual learning benchmarks or baselines. We\nintroduce the first set of web-scale Time-Continual (TiC) benchmarks for\ntraining vision-language models: TiC-DataCompt, TiC-YFCC, and TiC-RedCaps with\nover 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first\nuse our benchmarks to curate various dynamic evaluations to measure temporal\nrobustness of existing models. 
We show OpenAI's CLIP (trained on data up to\n2020) loses $\\approx 8\\%$ zero-shot accuracy on our curated retrieval task from\n2021--2022 compared with more recently trained models in OpenCLIP repository.\nWe then study how to efficiently train models on time-continuous data. We\ndemonstrate that a simple rehearsal-based approach that continues training from\nthe last checkpoint and replays old data reduces compute by $2.5\\times$ when\ncompared to the standard practice of retraining from scratch.\n","authors":["Saurabh Garg","Mehrdad Farajtabar","Hadi Pouransari","Raviteja Vemulapalli","Sachin Mehta","Oncel Tuzel","Vaishaal Shankar","Fartash Faghri"],"pdf_url":"https://arxiv.org/pdf/2310.16226v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16225v1","updated":"2023-10-24T22:34:43Z","published":"2023-10-24T22:34:43Z","title":"CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset","summary":" The CoNLL-03 corpus is arguably the most well-known and utilized benchmark\ndataset for named entity recognition (NER). However, prior works found\nsignificant numbers of annotation errors, incompleteness, and inconsistencies\nin the data. This poses challenges to objectively comparing NER approaches and\nanalyzing their errors, as current state-of-the-art models achieve F1-scores\nthat are comparable to or even exceed the estimated noise level in CoNLL-03. To\naddress this issue, we present a comprehensive relabeling effort assisted by\nautomatic consistency checking that corrects 7.0% of all labels in the English\nCoNLL-03. Our effort adds a layer of entity linking annotation both for better\nexplainability of NER labels and as additional safeguard of annotation quality.\nOur experimental evaluation finds not only that state-of-the-art approaches\nreach significantly higher F1-scores (97.1%) on our data, but crucially that\nthe share of correct predictions falsely counted as errors due to annotation\nnoise drops from 47% to 6%. 
This indicates that our resource is well suited to\nanalyze the remaining errors made by state-of-the-art models, and that the\ntheoretical upper bound even on high resource, coarse-grained NER is not yet\nreached. To facilitate such analysis, we make CleanCoNLL publicly available to\nthe research community.\n","authors":["Susanna Rücker","Alan Akbik"],"pdf_url":"https://arxiv.org/pdf/2310.16225v1.pdf","comment":"EMNLP 2023 camera-ready version"},{"id":"http://arxiv.org/abs/2310.16218v1","updated":"2023-10-24T22:18:13Z","published":"2023-10-24T22:18:13Z","title":"Knowledge Editing for Large Language Models: A Survey","summary":" Large language models (LLMs) have recently transformed both the academic and\nindustrial landscapes due to their remarkable capacity to understand, analyze,\nand generate texts based on their vast knowledge and reasoning ability.\nNevertheless, one major drawback of LLMs is their substantial computational\ncost for pre-training due to their unprecedented amounts of parameters. The\ndisadvantage is exacerbated when new knowledge frequently needs to be\nintroduced into the pre-trained model. Therefore, it is imperative to develop\neffective and efficient techniques to update pre-trained LLMs. Traditional\nmethods encode new knowledge in pre-trained LLMs through direct fine-tuning.\nHowever, naively re-training LLMs can be computationally intensive and risks\ndegenerating valuable pre-trained knowledge irrelevant to the update in the\nmodel. Recently, Knowledge-based Model Editing (KME) has attracted increasing\nattention, which aims to precisely modify the LLMs to incorporate specific\nknowledge, without negatively influencing other irrelevant knowledge. In this\nsurvey, we aim to provide a comprehensive and in-depth overview of recent\nadvances in the field of KME. We first introduce a general formulation of KME\nto encompass different KME strategies. 
Afterward, we provide an innovative\ntaxonomy of KME techniques based on how the new knowledge is introduced into\npre-trained LLMs, and investigate existing KME strategies while analyzing key\ninsights, advantages, and limitations of methods from each category. Moreover,\nrepresentative metrics, datasets, and applications of KME are introduced\naccordingly. Finally, we provide an in-depth analysis regarding the\npracticality and remaining challenges of KME and suggest promising research\ndirections for further advancement in this field.\n","authors":["Song Wang","Yaochen Zhu","Haochen Liu","Zaiyi Zheng","Chen Chen","Jundong L"],"pdf_url":"https://arxiv.org/pdf/2310.16218v1.pdf","comment":"31 pages"},{"id":"http://arxiv.org/abs/2310.16197v1","updated":"2023-10-24T21:30:15Z","published":"2023-10-24T21:30:15Z","title":"Background Summarization of Event Timelines","summary":" Generating concise summaries of news events is a challenging natural language\nprocessing task. While journalists often curate timelines to highlight key\nsub-events, newcomers to a news event face challenges in catching up on its\nhistorical context. In this paper, we address this need by introducing the task\nof background news summarization, which complements each timeline update with a\nbackground summary of relevant preceding events. We construct a dataset by\nmerging existing timeline datasets and asking human annotators to write a\nbackground summary for each timestep of each news event. We establish strong\nbaseline performance using state-of-the-art summarization systems and propose a\nquery-focused variant to generate background summaries. To evaluate background\nsummary quality, we present a question-answering-based evaluation metric,\nBackground Utility Score (BUS), which measures the percentage of questions\nabout a current event timestep that a background summary answers. 
Our\nexperiments show the effectiveness of instruction fine-tuned systems such as\nFlan-T5, in addition to strong zero-shot performance using GPT-3.5.\n","authors":["Adithya Pratapa","Kevin Small","Markus Dreyer"],"pdf_url":"https://arxiv.org/pdf/2310.16197v1.pdf","comment":"EMNLP 2023 camera-ready"},{"id":"http://arxiv.org/abs/2310.16193v1","updated":"2023-10-24T21:23:53Z","published":"2023-10-24T21:23:53Z","title":"Length is a Curse and a Blessing for Document-level Semantics","summary":" In recent years, contrastive learning (CL) has been extensively utilized to\nrecover sentence and document-level encoding capability from pre-trained\nlanguage models. In this work, we question the length generalizability of\nCL-based models, i.e., their vulnerability towards length-induced semantic\nshift. We verify not only that length vulnerability is a significant yet\noverlooked research gap, but we can devise unsupervised CL methods solely\ndepending on the semantic signal provided by document length. We first derive\nthe theoretical foundations underlying length attacks, showing that elongating\na document would intensify the high intra-document similarity that is already\nbrought by CL. Moreover, we found that isotropy promised by CL is highly\ndependent on the length range of text exposed in training. Inspired by these\nfindings, we introduce a simple yet universal document representation learning\nframework, LA(SER)$^{3}$: length-agnostic self-reference for semantically\nrobust sentence representation learning, achieving state-of-the-art\nunsupervised performance on the standard information retrieval benchmark.\n","authors":["Chenghao Xiao","Yizhi Li","G Thomas Hudson","Chenghua Lin","Noura Al Moubayed"],"pdf_url":"https://arxiv.org/pdf/2310.16193v1.pdf","comment":"Accepted at EMNLP 2023. 
Our code is publicly available at\n https://github.com/gowitheflow-1998/LA-SER-cubed"},{"id":"http://arxiv.org/abs/2305.13812v3","updated":"2023-10-24T21:21:00Z","published":"2023-05-23T08:28:38Z","title":"Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for\n Improved Vision-Language Compositionality","summary":" Contrastively trained vision-language models have achieved remarkable\nprogress in vision and language representation learning, leading to\nstate-of-the-art models for various downstream multimodal tasks. However,\nrecent research has highlighted severe limitations of these models in their\nability to perform compositional reasoning over objects, attributes, and\nrelations. Scene graphs have emerged as an effective way to understand images\ncompositionally. These are graph-structured semantic representations of images\nthat contain objects, their attributes, and relations with other objects in a\nscene. In this work, we consider the scene graph parsed from text as a proxy\nfor the image scene graph and propose a graph decomposition and augmentation\nframework along with a coarse-to-fine contrastive learning objective between\nimages and text that aligns sentences of various complexities to the same\nimage. 
Along with this, we propose novel negative mining techniques in the\nscene graph space for improving attribute binding and relation understanding.\nThrough extensive experiments, we demonstrate the effectiveness of our approach\nthat significantly improves attribute binding, relation understanding,\nsystematic generalization, and productivity on multiple recently proposed\nbenchmarks (For example, improvements upto $18\\%$ for systematic\ngeneralization, $16.5\\%$ for relation understanding over a strong baseline),\nwhile achieving similar or better performance than CLIP on various general\nmultimodal tasks.\n","authors":["Harman Singh","Pengchuan Zhang","Qifan Wang","Mengjiao Wang","Wenhan Xiong","Jingfei Du","Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2305.13812v3.pdf","comment":"EMNLP 2023 (long paper, main conference)"},{"id":"http://arxiv.org/abs/2212.10755v2","updated":"2023-10-24T21:03:30Z","published":"2022-12-21T04:21:46Z","title":"JASMINE: Arabic GPT Models for Few-Shot Learning","summary":" Scholarship on generative pretraining (GPT) remains acutely Anglocentric,\nleaving serious gaps in our understanding of the whole class of autoregressive\nmodels. For example, we have little knowledge about the potential of these\nmodels and their societal impacts in diverse linguistic and cultural settings.\nWe alleviate this issue for Arabic, a wide collection of languages and\ndialectal varieties with more than 400 million population, by introducing\nJASMINE. JASMINE is a suite of powerful Arabic autoregressive Transformer\nlanguage models ranging in size between 300 million-6.7 billion parameters\npretrained on a large and diverse dataset (~ 235 GB of text). We also carefully\ndesign and release a comprehensive benchmark for both automated and human\nevaluation of Arabic autoregressive models, with coverage of potential social\nbiases, harms, and toxicity. 
Using our novel benchmark, we evaluate JASMINE\nextensively showing powerful performance intrinsically as well as in few-shot\nlearning on a wide range of NLP tasks. We aim to responsibly release our models\nand evaluation benchmark with interested researchers, along with code for\nexperimenting with them.\n","authors":["El Moatez Billah Nagoudi","Muhammad Abdul-Mageed","AbdelRahim Elmadany","Alcides Alcoba Inciarte","Md Tawkat Islam Khondaker"],"pdf_url":"https://arxiv.org/pdf/2212.10755v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16183v1","updated":"2023-10-24T21:00:41Z","published":"2023-10-24T21:00:41Z","title":"BLP 2023 Task 2: Sentiment Analysis","summary":" We present an overview of the BLP Sentiment Shared Task, organized as part of\nthe inaugural BLP 2023 workshop, co-located with EMNLP 2023. The task is\ndefined as the detection of sentiment in a given piece of social media text.\nThis task attracted interest from 71 participants, among whom 29 and 30 teams\nsubmitted systems during the development and evaluation phases, respectively.\nIn total, participants submitted 597 runs. However, a total of 15 teams\nsubmitted system description papers. The range of approaches in the submitted\nsystems spans from classical machine learning models, fine-tuning pre-trained\nmodels, to leveraging Large Language Model (LLMs) in zero- and few-shot\nsettings. In this paper, we provide a detailed account of the task setup,\nincluding dataset development and evaluation setup. Additionally, we provide a\nbrief overview of the systems submitted by the participants. All datasets and\nevaluation scripts from the shared task have been made publicly available for\nthe research community, to foster further research in this domain\n","authors":["Md. 
Arid Hasan","Firoj Alam","Anika Anjum","Shudipta Das","Afiyat Anjum"],"pdf_url":"https://arxiv.org/pdf/2310.16183v1.pdf","comment":"Accepted in BLP Workshop at EMNLP-23"},{"id":"http://arxiv.org/abs/2212.09912v2","updated":"2023-10-24T20:59:33Z","published":"2022-12-19T23:33:21Z","title":"Tokenization Consistency Matters for Generative Models on Extractive NLP\n Tasks","summary":" Generative models have been widely applied to solve extractive tasks, where\nparts of the input is extracted to form the desired output, and achieved\nsignificant success. For example, in extractive question answering (QA),\ngenerative models have constantly yielded state-of-the-art results. In this\nwork, we identify the issue of tokenization inconsistency that is commonly\nneglected in training these models. This issue damages the extractive nature of\nthese tasks after the input and output are tokenized inconsistently by the\ntokenizer, and thus leads to performance drop as well as hallucination. We\npropose a simple yet effective fix to this issue and conduct a case study on\nextractive QA. We show that, with consistent tokenization, the model performs\nbetter in both in-domain and out-of-domain datasets, with a notable average of\n+1.7 F2 gain when a BART model is trained on SQuAD and evaluated on 8 QA\ndatasets. Further, the model converges faster, and becomes less likely to\ngenerate out-of-context answers. 
With these findings, we would like to call for\nmore attention on how tokenization should be done when solving extractive tasks\nand recommend applying consistent tokenization during training.\n","authors":["Kaiser Sun","Peng Qi","Yuhao Zhang","Lan Liu","William Yang Wang","Zhiheng Huang"],"pdf_url":"https://arxiv.org/pdf/2212.09912v2.pdf","comment":"Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.16181v1","updated":"2023-10-24T20:58:07Z","published":"2023-10-24T20:58:07Z","title":"Hidden Citations Obscure True Impact in Science","summary":" References, the mechanism scientists rely on to signal previous knowledge,\nlately have turned into widely used and misused measures of scientific impact.\nYet, when a discovery becomes common knowledge, citations suffer from\nobliteration by incorporation. This leads to the concept of hidden citation,\nrepresenting a clear textual credit to a discovery without a reference to the\npublication embodying it. Here, we rely on unsupervised interpretable machine\nlearning applied to the full text of each paper to systematically identify\nhidden citations. We find that for influential discoveries hidden citations\noutnumber citation counts, emerging regardless of publishing venue and\ndiscipline. We show that the prevalence of hidden citations is not driven by\ncitation counts, but rather by the degree of the discourse on the topic within\nthe text of the manuscripts, indicating that the more discussed is a discovery,\nthe less visible it is to standard bibliometric analysis. 
Hidden citations\nindicate that bibliometric measures offer a limited perspective on quantifying\nthe true impact of a discovery, raising the need to extract knowledge from the\nfull text of the scientific corpus.\n","authors":["Xiangyi Meng","Onur Varol","Albert-László Barabási"],"pdf_url":"https://arxiv.org/pdf/2310.16181v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16176v1","updated":"2023-10-24T20:48:11Z","published":"2023-10-24T20:48:11Z","title":"Correction with Backtracking Reduces Hallucination in Summarization","summary":" Abstractive summarization aims at generating natural language summaries of a\nsource document that are succinct while preserving the important elements.\nDespite recent advances, neural text summarization models are known to be\nsusceptible to hallucinating (or more correctly confabulating), that is to\nproduce summaries with details that are not grounded in the source document. In\nthis paper, we introduce a simple yet efficient technique, CoBa, to reduce\nhallucination in abstractive summarization. The approach is based on two steps:\nhallucination detection and mitigation. We show that the former can be achieved\nthrough measuring simple statistics about conditional word probabilities and\ndistance to context words. Further, we demonstrate that straight-forward\nbacktracking is surprisingly effective at mitigation. We thoroughly evaluate\nthe proposed method with prior art on three benchmark datasets for text\nsummarization. The results show that CoBa is effective and efficient in\nreducing hallucination, and offers great adaptability and flexibility.\n","authors":["Zhenzhen Liu","Chao Wan","Varsha Kishore","Jin Peng Zhou","Minmin Chen","Kilian Q. 
Weinberger"],"pdf_url":"https://arxiv.org/pdf/2310.16176v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14410v2","updated":"2023-10-24T20:44:09Z","published":"2023-05-23T17:59:10Z","title":"Image Manipulation via Multi-Hop Instructions -- A New Dataset and\n Weakly-Supervised Neuro-Symbolic Approach","summary":" We are interested in image manipulation via natural language text -- a task\nthat is useful for multiple AI applications but requires complex reasoning over\nmulti-modal spaces. We extend recently proposed Neuro Symbolic Concept Learning\n(NSCL), which has been quite effective for the task of Visual Question\nAnswering (VQA), for the task of image manipulation. Our system referred to as\nNeuroSIM can perform complex multi-hop reasoning over multi-object scenes and\nonly requires weak supervision in the form of annotated data for VQA. NeuroSIM\nparses an instruction into a symbolic program, based on a Domain Specific\nLanguage (DSL) comprising of object attributes and manipulation operations,\nthat guides its execution. We create a new dataset for the task, and extensive\nexperiments demonstrate that NeuroSIM is highly competitive with or beats SOTA\nbaselines that make use of supervised data for manipulation.\n","authors":["Harman Singh","Poorva Garg","Mohit Gupta","Kevin Shah","Ashish Goswami","Satyam Modi","Arnab Kumar Mondal","Dinesh Khandelwal","Dinesh Garg","Parag Singla"],"pdf_url":"https://arxiv.org/pdf/2305.14410v2.pdf","comment":"EMNLP 2023 (long paper, main conference)"},{"id":"http://arxiv.org/abs/2212.09825v2","updated":"2023-10-24T20:43:40Z","published":"2022-12-19T19:53:14Z","title":"What to Read in a Contract? Party-Specific Summarization of Legal\n Obligations, Entitlements, and Prohibitions","summary":" Reviewing and comprehending key obligations, entitlements, and prohibitions\nin legal contracts can be a tedious task due to their length and\ndomain-specificity. 
Furthermore, the key rights and duties requiring review\nvary for each contracting party. In this work, we propose a new task of\nparty-specific extractive summarization for legal contracts to facilitate\nfaster reviewing and improved comprehension of rights and duties. To facilitate\nthis, we curate a dataset comprising of party-specific pairwise importance\ncomparisons annotated by legal experts, covering ~293K sentence pairs that\ninclude obligations, entitlements, and prohibitions extracted from lease\nagreements. Using this dataset, we train a pairwise importance ranker and\npropose a pipeline-based extractive summarization system that generates a\nparty-specific contract summary. We establish the need for incorporating\ndomain-specific notion of importance during summarization by comparing our\nsystem against various baselines using both automatic and human evaluation\nmethods\n","authors":["Abhilasha Sancheti","Aparna Garimella","Balaji Vasan Srinivasan","Rachel Rudinger"],"pdf_url":"https://arxiv.org/pdf/2212.09825v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.12786v2","updated":"2023-10-24T20:10:04Z","published":"2023-05-22T07:31:08Z","title":"Mitigating Data Imbalance and Representation Degeneration in\n Multilingual Machine Translation","summary":" Despite advances in multilingual neural machine translation (MNMT), we argue\nthat there are still two major challenges in this area: data imbalance and\nrepresentation degeneration. The data imbalance problem refers to the imbalance\nin the amount of parallel corpora for all language pairs, especially for\nlong-tail languages (i.e., very low-resource languages). The representation\ndegeneration problem refers to the problem of encoded tokens tending to appear\nonly in a small subspace of the full space available to the MNMT model. 
To\nsolve these two issues, we propose Bi-ACL, a framework that uses only\ntarget-side monolingual data and a bilingual dictionary to improve the\nperformance of the MNMT model. We define two modules, named bidirectional\nautoencoder and bidirectional contrastive learning, which we combine with an\nonline constrained beam search and a curriculum learning sampling strategy.\nExtensive experiments show that our proposed method is more effective both in\nlong-tail languages and in high-resource languages. We also demonstrate that\nour approach is capable of transferring knowledge between domains and languages\nin zero-shot scenarios.\n","authors":["Wen Lai","Alexandra Chronopoulou","Alexander Fraser"],"pdf_url":"https://arxiv.org/pdf/2305.12786v2.pdf","comment":"Accepted to Findings of EMNLP 2023, add statistical significance\n tests. code available at https://github.com/lavine-lmu/Bi-ACL"},{"id":"http://arxiv.org/abs/2310.05857v2","updated":"2023-10-24T20:00:21Z","published":"2023-10-09T16:52:07Z","title":"Improving Summarization with Human Edits","summary":" Recent work has shown the promise of learning with human feedback paradigms\nto produce human-determined high-quality text. Existing works use human\nfeedback to train large language models (LLMs) in general domain abstractive\nsummarization and have obtained summary quality exceeding traditional\nlikelihood training. In this paper, we focus on a less explored form of human\nfeedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training\n(SALT), a novel technique to use both the human-edited and model-generated data\ntogether in the training loop. In addition, we demonstrate simulating Human\nEdits with ground truth summaries coming from existing training data --\nImitation edits, along with the model-generated summaries obtained after the\ntraining, to reduce the need for expensive human-edit data. 
In our experiments,\nwe extend human feedback exploration from general domain summarization to\nmedical domain summarization. Our results demonstrate the effectiveness of SALT\nin improving the summary quality with Human and Imitation Edits. Through\nadditional experiments, we show that SALT outperforms the conventional RLHF\nmethod (designed for human preferences) -- DPO, when applied to human-edit\ndata. We hope the evidence in our paper prompts researchers to explore,\ncollect, and better use different human feedback approaches scalably.\n","authors":["Zonghai Yao","Benjamin J Schloss","Sai P. Selvaraj"],"pdf_url":"https://arxiv.org/pdf/2310.05857v2.pdf","comment":"To appear in proceedings of the Main Conference on Empirical Methods\n in Natural Language Processing (EMNLP) 2023"},{"id":"http://arxiv.org/abs/2310.16153v1","updated":"2023-10-24T19:50:07Z","published":"2023-10-24T19:50:07Z","title":"WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task","summary":" We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER)\nShared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering\nnovel NER datasets (i.e., Wojood) and the definition of subtasks designed to\nfacilitate meaningful comparisons between different NER approaches.\nWojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45\nunique teams registered for this shared task, with 11 of them actively\nparticipating in the test phase. Specifically, 11 teams participated in\nFlatNER, while $8$ teams tackled NestedNER. 
The winning teams achieved F1\nscores of 91.96 and 93.73 in FlatNER and NestedNER, respectively.\n","authors":["Mustafa Jarrar","Muhammad Abdul-Mageed","Mohammed Khalilia","Bashar Talafha","AbdelRahim Elmadany","Nagham Hamad","Alaa' Omar"],"pdf_url":"https://arxiv.org/pdf/2310.16153v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16147v1","updated":"2023-10-24T19:47:26Z","published":"2023-10-24T19:47:26Z","title":"PreWoMe: Exploiting Presuppositions as Working Memory for Long Form\n Question Answering","summary":" Information-seeking questions in long-form question answering (LFQA) often\nprove misleading due to ambiguity or false presupposition in the question.\nWhile many existing approaches handle misleading questions, they are tailored\nto limited questions, which are insufficient in a real-world setting with\nunpredictable input characteristics. In this work, we propose PreWoMe, a\nunified approach capable of handling any type of information-seeking question.\nThe key idea of PreWoMe involves extracting presuppositions in the question and\nexploiting them as working memory to generate feedback and action about the\nquestion. 
Our experiment shows that PreWoMe is effective not only in tackling\nmisleading questions but also in handling normal ones, thereby demonstrating\nthe effectiveness of leveraging presuppositions, feedback, and action for\nreal-world QA settings.\n","authors":["Wookje Han","Jinsol Park","Kyungjae Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16147v1.pdf","comment":"11 pages 3 figures, Accepted to EMNLP 2023 (short)"},{"id":"http://arxiv.org/abs/2310.16146v1","updated":"2023-10-24T19:43:39Z","published":"2023-10-24T19:43:39Z","title":"Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model\n System for Answering Medical Questions using Scientific Literature","summary":" The quickly-expanding nature of published medical literature makes it\nchallenging for clinicians and researchers to keep up with and summarize\nrecent, relevant findings in a timely manner. While several closed-source\nsummarization tools based on large language models (LLMs) now exist, rigorous\nand systematic evaluations of their outputs are lacking. Furthermore, there is\na paucity of high-quality datasets and appropriate benchmark tasks with which\nto evaluate these tools. 
We address these issues with four contributions: we\nrelease Clinfo.ai, an open-source WebApp that answers clinical questions based\non dynamically retrieved scientific literature; we specify an information\nretrieval and abstractive summarization task to evaluate the performance of\nsuch retrieval-augmented LLM systems; we release a dataset of 200 questions and\ncorresponding answers derived from published systematic reviews, which we name\nPubMed Retrieval and Synthesis (PubMedRS-200); and report benchmark results for\nClinfo.ai and other publicly available OpenQA systems on PubMedRS-200.\n","authors":["Alejandro Lozano","Scott L Fleming","Chia-Chun Chiang","Nigam Shah"],"pdf_url":"https://arxiv.org/pdf/2310.16146v1.pdf","comment":"Preprint of an article published in Pacific Symposium on Biocomputing\n copyright 2024 World Scientific Publishing Co., Singapore,\n http://psb.stanford.edu/"},{"id":"http://arxiv.org/abs/2310.16142v1","updated":"2023-10-24T19:33:27Z","published":"2023-10-24T19:33:27Z","title":"A Language Model with Limited Memory Capacity Captures Interference in\n Human Sentence Processing","summary":" Two of the central factors believed to underpin human sentence processing\ndifficulty are expectations and retrieval from working memory. A recent attempt\nto create a unified cognitive model integrating these two factors relied on the\nparallels between the self-attention mechanism of transformer language models\nand cue-based retrieval theories of working memory in human sentence processing\n(Ryu and Lewis 2021). While Ryu and Lewis show that attention patterns in\nspecialized attention heads of GPT-2 are consistent with similarity-based\ninterference, a key prediction of cue-based retrieval models, their method\nrequires identifying syntactically specialized attention heads, and makes the\ncognitively implausible assumption that hundreds of memory retrieval operations\ntake place in parallel. 
In the present work, we develop a recurrent neural\nlanguage model with a single self-attention head, which more closely parallels\nthe memory system assumed by cognitive theories. We show that our model's\nsingle attention head captures semantic and syntactic interference effects\nobserved in human experiments.\n","authors":["William Timkey","Tal Linzen"],"pdf_url":"https://arxiv.org/pdf/2310.16142v1.pdf","comment":"To appear in Findings of the Association for Computational\n Linguistics: EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16135v1","updated":"2023-10-24T19:22:01Z","published":"2023-10-24T19:22:01Z","title":"Can You Follow Me? Testing Situational Understanding in ChatGPT","summary":" Understanding sentence meanings and updating information states appropriately\nacross time -- what we call \"situational understanding\" (SU) -- is a critical\nability for human-like AI agents. SU is essential in particular for chat\nmodels, such as ChatGPT, to enable consistent, coherent, and effective dialogue\nbetween humans and AI. Previous works have identified certain SU limitations in\nnon-chatbot Large Language models (LLMs), but the extent and causes of these\nlimitations are not well understood, and capabilities of current chat-based\nmodels in this domain have not been explored. In this work we tackle these\nquestions, proposing a novel synthetic environment for SU testing which allows\nus to do controlled and systematic testing of SU in chat-oriented models,\nthrough assessment of models' ability to track and enumerate environment\nstates. Our environment also allows for close analysis of dynamics of model\nperformance, to better understand underlying causes for performance patterns.\nWe apply our test to ChatGPT, the state-of-the-art chatbot, and find that\ndespite the fundamental simplicity of the task, the model's performance\nreflects an inability to retain correct environment states across time. 
Our\nfollow-up analyses suggest that performance degradation is largely because\nChatGPT has non-persistent in-context memory (although it can access the full\ndialogue history) and it is susceptible to hallucinated updates -- including\nupdates that artificially inflate accuracies. Our findings suggest overall that\nChatGPT is not currently equipped for robust tracking of situation states, and\nthat trust in the impressive dialogue performance of ChatGPT comes with risks.\nWe release the codebase for reproducing our test environment, as well as all\nprompts and API responses from ChatGPT, at\nhttps://github.com/yangalan123/SituationalTesting.\n","authors":["Chenghao Yang","Allyson Ettinger"],"pdf_url":"https://arxiv.org/pdf/2310.16135v1.pdf","comment":"EMNLP 2023 Main Paper (Camera Ready)"},{"id":"http://arxiv.org/abs/2310.09725v2","updated":"2023-10-24T19:14:23Z","published":"2023-10-15T04:00:36Z","title":"KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large\n Language Models","summary":" Large language models (LLMs) demonstrate remarkable performance on\nknowledge-intensive tasks, suggesting that real-world knowledge is encoded in\ntheir model parameters. However, besides explorations on a few probing tasks in\nlimited knowledge domains, it is not well understood how to evaluate LLMs'\nknowledge systematically and how well their knowledge abilities generalize,\nacross a spectrum of knowledge domains and progressively complex task formats.\nTo this end, we propose KGQuiz, a knowledge-intensive benchmark to\ncomprehensively investigate the knowledge generalization abilities of LLMs.\nKGQuiz is a scalable framework constructed from triplet-based knowledge, which\ncovers three knowledge domains and consists of five tasks with increasing\ncomplexity: true-or-false, multiple-choice QA, blank filling, factual editing,\nand open-ended knowledge generation. 
To gain a better understanding of LLMs'\nknowledge abilities and their generalization, we evaluate 10 open-source and\nblack-box LLMs on the KGQuiz benchmark across the five knowledge-intensive\ntasks and knowledge domains. Extensive experiments demonstrate that LLMs\nachieve impressive performance in straightforward knowledge QA tasks, while\nsettings and contexts requiring more complex reasoning or employing\ndomain-specific facts still present significant challenges. We envision KGQuiz\nas a testbed to analyze such nuanced variations in performance across domains\nand task formats, and ultimately to understand, evaluate, and improve LLMs'\nknowledge abilities across a wide spectrum of knowledge domains and tasks.\n","authors":["Yuyang Bai","Shangbin Feng","Vidhisha Balachandran","Zhaoxuan Tan","Shiqi Lou","Tianxing He","Yulia Tsvetkov"],"pdf_url":"https://arxiv.org/pdf/2310.09725v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16131v1","updated":"2023-10-24T19:12:56Z","published":"2023-10-24T19:12:56Z","title":"GenKIE: Robust Generative Multimodal Document Key Information Extraction","summary":" Key information extraction (KIE) from scanned documents has gained increasing\nattention because of its applications in various domains. Although promising\nresults have been achieved by some recent KIE approaches, they are usually\nbuilt based on discriminative models, which lack the ability to handle optical\ncharacter recognition (OCR) errors and require laborious token-level labelling.\nIn this paper, we propose a novel generative end-to-end model, named GenKIE, to\naddress the KIE task. GenKIE is a sequence-to-sequence multimodal generative\nmodel that utilizes multimodal encoders to embed visual, layout and textual\nfeatures and a decoder to generate the desired output. Well-designed prompts\nare leveraged to incorporate the label semantics as the weakly supervised\nsignals and entice the generation of the key information. 
One notable advantage\nof the generative model is that it enables automatic correction of OCR errors.\nBesides, token-level granular annotation is not required. Extensive experiments\non multiple public real-world datasets show that GenKIE effectively generalizes\nover different types of documents and achieves state-of-the-art results. Our\nexperiments also validate the model's robustness against OCR errors, making\nGenKIE highly applicable in real-world scenarios.\n","authors":["Panfeng Cao","Ye Wang","Qiang Zhang","Zaiqiao Meng"],"pdf_url":"https://arxiv.org/pdf/2310.16131v1.pdf","comment":"Accepted by EMNLP 2023, Findings paper"},{"id":"http://arxiv.org/abs/2310.16127v1","updated":"2023-10-24T19:06:55Z","published":"2023-10-24T19:06:55Z","title":"Octopus: A Multitask Model and Toolkit for Arabic Natural Language\n Generation","summary":" Understanding Arabic text and generating human-like responses is a\nchallenging endeavor. While many researchers have proposed models and solutions\nfor individual problems, there is an acute shortage of a comprehensive Arabic\nnatural language generation toolkit that is capable of handling a wide range of\ntasks. In this work, we present a novel Arabic text-to-text Transformer model,\nnamely AraT5v2. Our new model is methodically trained on extensive and diverse\ndata, utilizing an extended sequence length of 2,048 tokens. We explore various\npretraining strategies including unsupervised, supervised, and joint\npretraining, under both single and multitask settings. Our models outperform\ncompetitive baselines with large margins. We take our work one step further by\ndeveloping and publicly releasing Octopus, a Python-based package and\ncommand-line toolkit tailored for eight Arabic generation tasks all exploiting\na single model. 
We release the models and the toolkit on our public repository.\n","authors":["AbdelRahim Elmadany","El Moatez Billah Nagoudi","Muhammad Abdul-Mageed"],"pdf_url":"https://arxiv.org/pdf/2310.16127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16117v1","updated":"2023-10-24T18:41:24Z","published":"2023-10-24T18:41:24Z","title":"NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task","summary":" We describe the findings of the fourth Nuanced Arabic Dialect Identification\nShared Task (NADI 2023). The objective of NADI is to help advance\nstate-of-the-art Arabic NLP by creating opportunities for teams of researchers\nto collaboratively compete under standardized conditions. It does so with a\nfocus on Arabic dialects, offering novel datasets and defining subtasks that\nallow for meaningful comparisons between different approaches. NADI 2023\ntargeted both dialect identification (Subtask 1) and dialect-to-MSA machine\ntranslation (Subtask 2 and Subtask 3). A total of 58 unique teams registered\nfor the shared task, of whom 18 teams have participated (with 76 valid\nsubmissions during test phase). Among these, 16 teams participated in Subtask\n1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning\nteams achieved 87.27\n F1 on Subtask 1, 14.76 Bleu in Subtask 2, and 21.10 Bleu in Subtask 3,\nrespectively. Results show that all three subtasks remain challenging, thereby\nmotivating future work in this area. 
We describe the methods employed by the\nparticipating teams and briefly offer an outlook for NADI.\n","authors":["Muhammad Abdul-Mageed","AbdelRahim Elmadany","Chiyu Zhang","El Moatez Billah Nagoudi","Houda Bouamor","Nizar Habash"],"pdf_url":"https://arxiv.org/pdf/2310.16117v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2210.09582"},{"id":"http://arxiv.org/abs/2309.13876v3","updated":"2023-10-24T18:28:22Z","published":"2023-09-25T05:01:34Z","title":"Reproducing Whisper-Style Training Using an Open-Source Toolkit and\n Publicly Available Data","summary":" Pre-training speech models on large volumes of data has achieved remarkable\nsuccess. OpenAI Whisper is a multilingual multitask model trained on 680k hours\nof supervised speech data. It generalizes well to various speech recognition\nand translation benchmarks even in a zero-shot setup. However, the full\npipeline for developing such models (from data collection to training) is not\npublicly accessible, which makes it difficult for researchers to further\nimprove its performance and address training-related issues such as efficiency,\nrobustness, fairness, and bias. This work presents an Open Whisper-style Speech\nModel (OWSM), which reproduces Whisper-style training using an open-source\ntoolkit and publicly available data. OWSM even supports more translation\ndirections and can be more efficient to train. 
We will publicly release all\nscripts used for data preparation, training, inference, and scoring as well as\npre-trained models and training logs to promote open science.\n","authors":["Yifan Peng","Jinchuan Tian","Brian Yan","Dan Berrebbi","Xuankai Chang","Xinjian Li","Jiatong Shi","Siddhant Arora","William Chen","Roshan Sharma","Wangyou Zhang","Yui Sudo","Muhammad Shakeel","Jee-weon Jung","Soumi Maiti","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2309.13876v3.pdf","comment":"Accepted at ASRU 2023"},{"id":"http://arxiv.org/abs/2310.16111v1","updated":"2023-10-24T18:25:13Z","published":"2023-10-24T18:25:13Z","title":"Locally Differentially Private Document Generation Using Zero Shot\n Prompting","summary":" Numerous studies have highlighted the privacy risks associated with\npretrained large language models. In contrast, our research offers a unique\nperspective by demonstrating that pretrained large language models can\neffectively contribute to privacy preservation. We propose a locally\ndifferentially private mechanism called DP-Prompt, which leverages the power of\npretrained large language models and zero-shot prompting to counter author\nde-anonymization attacks while minimizing the impact on downstream utility.\nWhen DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5),\nwe observe a notable reduction in the success rate of de-anonymization attacks,\nshowing that it surpasses existing approaches by a considerable margin despite\nits simpler design. For instance, in the case of the IMDB dataset, DP-Prompt\n(with ChatGPT) perfectly recovers the clean sentiment F1 score while achieving\na 46\\% reduction in author identification F1 score against static attackers and\na 26\\% reduction against adaptive attackers. 
We conduct extensive experiments\nacross six open-source large language models, ranging up to 7 billion\nparameters, to analyze various effects of the privacy-utility tradeoff.\n","authors":["Saiteja Utpala","Sara Hooker","Pin Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16111v1.pdf","comment":"Accepted at EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.16095v1","updated":"2023-10-24T18:00:40Z","published":"2023-10-24T18:00:40Z","title":"CR-COPEC: Causal Rationale of Corporate Performance Changes to Learn\n from Financial Reports","summary":" In this paper, we introduce CR-COPEC called Causal Rationale of Corporate\nPerformance Changes from financial reports. This is a comprehensive large-scale\ndomain-adaptation causal sentence dataset to detect financial performance\nchanges of corporate. CR-COPEC contributes to two major achievements. First, it\ndetects causal rationale from 10-K annual reports of the U.S. companies, which\ncontain experts' causal analysis following accounting standards in a formal\nmanner. This dataset can be widely used by both individual investors and\nanalysts as material information resources for investing and decision making\nwithout tremendous effort to read through all the documents. Second, it\ncarefully considers different characteristics which affect the financial\nperformance of companies in twelve industries. As a result, CR-COPEC can\ndistinguish causal sentences in various industries by taking unique narratives\nin each industry into consideration. We also provide an extensive analysis of\nhow well CR-COPEC dataset is constructed and suited for classifying target\nsentences as causal ones with respect to industry characteristics. 
Our dataset\nand experimental codes are publicly available.\n","authors":["Ye Eun Chun","Sunjae Kwon","Kyunghwan Sohn","Nakwon Sung","Junyoup Lee","Byungki Seo","Kevin Compher","Seung-won Hwang","Jaesik Choi"],"pdf_url":"https://arxiv.org/pdf/2310.16095v1.pdf","comment":"Accepted in Findings of EMNLP 2023"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.16052v1","updated":"2023-10-24T17:59:55Z","published":"2023-10-24T17:59:55Z","title":"Synthetic Data as Validation","summary":" This study leverages synthetic data as a validation set to reduce overfitting\nand ease the selection of the best model in AI development. While synthetic\ndata have been used for augmenting the training set, we find that synthetic\ndata can also significantly diversify the validation set, offering marked\nadvantages in domains like healthcare, where data are typically limited,\nsensitive, and from out-domain sources (i.e., hospitals). In this study, we\nillustrate the effectiveness of synthetic data for early cancer detection in\ncomputed tomography (CT) volumes, where synthetic tumors are generated and\nsuperimposed onto healthy organs, thereby creating an extensive dataset for\nrigorous validation. Using synthetic data as validation can improve AI\nrobustness in both in-domain and out-domain test sets. Furthermore, we\nestablish a new continual learning framework that continuously trains AI models\non a stream of out-domain data with synthetic tumors. The AI model trained and\nvalidated in dynamically expanding synthetic data can consistently outperform\nmodels trained and validated exclusively on real-world data. 
Specifically, the\nDSC score for liver tumor segmentation improves from 26.7% (95% CI:\n22.6%-30.9%) to 34.5% (30.8%-38.2%) when evaluated on an in-domain dataset and\nfrom 31.1% (26.0%-36.2%) to 35.4% (32.1%-38.7%) on an out-domain dataset.\nImportantly, the performance gain is particularly significant in identifying\nvery tiny liver tumors (radius < 5mm) in CT volumes, with Sensitivity improving\nfrom 33.1% to 55.4% on an in-domain dataset and 33.9% to 52.3% on an out-domain\ndataset, justifying the efficacy in early detection of cancer. The application\nof synthetic data, from both training and validation perspectives, underlines a\npromising avenue to enhance AI robustness when dealing with data from varying\ndomains.\n","authors":["Qixin Hu","Alan Yuille","Zongwei Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.16052v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16047v1","updated":"2023-10-24T17:58:54Z","published":"2023-10-24T17:58:54Z","title":"From Posterior Sampling to Meaningful Diversity in Image Restoration","summary":" Image restoration problems are typically ill-posed in the sense that each\ndegraded image can be restored in infinitely many valid ways. To accommodate\nthis, many works generate a diverse set of outputs by attempting to randomly\nsample from the posterior distribution of natural images given the degraded\ninput. Here we argue that this strategy is commonly of limited practical value\nbecause of the heavy tail of the posterior distribution. Consider for example\ninpainting a missing region of the sky in an image. Since there is a high\nprobability that the missing region contains no object but clouds, any set of\nsamples from the posterior would be entirely dominated by (practically\nidentical) completions of sky. 
However, arguably, presenting users with only\none clear sky completion, along with several alternative solutions such as\nairships, birds, and balloons, would better outline the set of possibilities.\nIn this paper, we initiate the study of meaningfully diverse image restoration.\nWe explore several post-processing approaches that can be combined with any\ndiverse image restoration method to yield semantically meaningful diversity.\nMoreover, we propose a practical approach for allowing diffusion based image\nrestoration methods to generate meaningfully diverse outputs, while incurring\nonly negligible computational overhead. We conduct extensive user studies to\nanalyze the proposed techniques, and find the strategy of reducing similarity\nbetween outputs to be significantly favorable over posterior sampling. Code and\nexamples are available in https://noa-cohen.github.io/MeaningfulDiversityInIR\n","authors":["Noa Cohen","Hila Manor","Yuval Bahat","Tomer Michaeli"],"pdf_url":"https://arxiv.org/pdf/2310.16047v1.pdf","comment":"Code and examples are available in\n https://noa-cohen.github.io/MeaningfulDiversityInIR"},{"id":"http://arxiv.org/abs/2310.16045v1","updated":"2023-10-24T17:58:07Z","published":"2023-10-24T17:58:07Z","title":"Woodpecker: Hallucination Correction for Multimodal Large Language\n Models","summary":" Hallucination is a big shadow hanging over the rapidly evolving Multimodal\nLarge Language Models (MLLMs), referring to the phenomenon that the generated\ntext is inconsistent with the image content. In order to mitigate\nhallucinations, existing studies mainly resort to an instruction-tuning manner\nthat requires retraining the models with specific data. In this paper, we pave\na different way, introducing a training-free method named Woodpecker. Like a\nwoodpecker heals trees, it picks out and corrects hallucinations from the\ngenerated text. 
Concretely, Woodpecker consists of five stages: key concept\nextraction, question formulation, visual knowledge validation, visual claim\ngeneration, and hallucination correction. Implemented in a post-remedy manner,\nWoodpecker can easily serve different MLLMs, while being interpretable by\naccessing intermediate outputs of the five stages. We evaluate Woodpecker both\nquantitatively and qualitatively and show the huge potential of this new\nparadigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement\nin accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released\nat https://github.com/BradyFU/Woodpecker.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Tong Xu","Hao Wang","Dianbo Sui","Yunhang Shen","Ke Li","Xing Sun","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16045v1.pdf","comment":"16 pages, 7 figures. Code Website:\n https://github.com/BradyFU/Woodpecker"},{"id":"http://arxiv.org/abs/2310.16044v1","updated":"2023-10-24T17:57:58Z","published":"2023-10-24T17:57:58Z","title":"Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark","summary":" We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering\nBenchmark. Recent advances in inverse rendering have enabled a wide range of\nreal-world applications in 3D content generation, moving rapidly from research\nand commercial use cases to consumer devices. While the results continue to\nimprove, there is no real-world benchmark that can quantitatively assess and\ncompare the performance of various inverse rendering methods. Existing\nreal-world datasets typically only consist of the shape and multi-view images\nof objects, which are not sufficient for evaluating the quality of material\nrecovery and object relighting. Methods capable of recovering material and\nlighting often resort to synthetic data for quantitative evaluation, which on\nthe other hand does not guarantee generalization to complex real-world\nenvironments. 
We introduce a new dataset of real-world objects captured under a\nvariety of natural scenes with ground-truth 3D scans, multi-view images, and\nenvironment lighting. Using this dataset, we establish the first comprehensive\nreal-world evaluation benchmark for object inverse rendering tasks from\nin-the-wild scenes, and compare the performance of various existing methods.\nAll data, code, and models can be accessed at https://stanfordorb.github.io/.\n","authors":["Zhengfei Kuang","Yunzhi Zhang","Hong-Xing Yu","Samir Agarwala","Shangzhe Wu","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2310.16044v1.pdf","comment":"The first two authors contributed equally to this work. NeurIPS 2023\n Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2310.16035v1","updated":"2023-10-24T17:50:20Z","published":"2023-10-24T17:50:20Z","title":"What's Left? Concept Grounding with Logic-Enhanced Foundation Models","summary":" Recent works such as VisProg and ViperGPT have smartly composed foundation\nmodels for visual reasoning-using large language models (LLMs) to produce\nprograms that can be executed by pre-trained vision-language models. However,\nthey operate in limited domains, such as 2D images, not fully exploiting the\ngeneralization of language: abstract concepts like \"left\" can also be grounded\nin 3D, temporal, and action data, as in moving to your left. This limited\ngeneralization stems from these inference-only methods' inability to learn or\nadapt pre-trained models to a new domain. We propose the Logic-Enhanced\nFoundation Model (LEFT), a unified framework that learns to ground and reason\nwith concepts across domains with a differentiable, domain-independent,\nfirst-order logic-based program executor. LEFT has an LLM interpreter that\noutputs a program represented in a general, logic-based reasoning language,\nwhich is shared across all domains and tasks. LEFT's executor then executes the\nprogram with trainable domain-specific grounding modules. 
We show that LEFT\nflexibly learns concepts in four domains: 2D images, 3D scenes, human motions,\nand robotic manipulation. It exhibits strong reasoning ability in a wide\nvariety of tasks, including those that are complex and not seen during\ntraining, and can be easily applied to new domains.\n","authors":["Joy Hsu","Jiayuan Mao","Joshua B. Tenenbaum","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2310.16035v1.pdf","comment":"NeurIPS 2023. First two authors contributed equally. Project page:\n https://web.stanford.edu/~joycj/projects/left_neurips_2023"},{"id":"http://arxiv.org/abs/2310.16033v1","updated":"2023-10-24T17:48:04Z","published":"2023-10-24T17:48:04Z","title":"Visual Cropping Improves Zero-Shot Question Answering of Multimodal\n Large Language Models","summary":" Multimodal Large Language Models (LLMs) have recently achieved promising\nzero-shot accuracy on visual question answering (VQA) -- a fundamental task\naffecting various downstream applications and domains. Given the great\npotential for the broad use of these models, it is important to investigate\ntheir limitations in dealing with different image and question properties. In\nthis work, we investigate whether multimodal LLMs can perceive small details as\nwell as large details in images. In particular, we show that their zero-shot\naccuracy in answering visual questions is very sensitive to the size of the\nvisual subject of the question, declining up to $46\\%$ with size. Furthermore,\nwe show that this effect is causal by observing that human visual cropping can\nsignificantly mitigate their sensitivity to size. Inspired by the usefulness of\nhuman cropping, we then propose three automatic visual cropping methods as\ninference time mechanisms to improve the zero-shot performance of multimodal\nLLMs. We study their effectiveness on four popular VQA datasets, and a subset\nof the VQAv2 dataset tailored towards fine visual details. 
Our findings suggest\nthat multimodal LLMs should be used with caution in detail-sensitive VQA\napplications, and that visual cropping is a promising direction to improve\ntheir zero-shot performance. Our code and data are publicly available.\n","authors":["Jiarui Zhang","Mahyar Khayatkhoei","Prateek Chhikara","Filip Ilievski"],"pdf_url":"https://arxiv.org/pdf/2310.16033v1.pdf","comment":"11 pages, 4 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.16029v1","updated":"2023-10-24T17:46:12Z","published":"2023-10-24T17:46:12Z","title":"Finetuning Offline World Models in the Real World","summary":" Reinforcement Learning (RL) is notoriously data-inefficient, which makes\ntraining on a real robot difficult. While model-based RL algorithms (world\nmodels) improve data-efficiency to some extent, they still require hours or\ndays of interaction to learn skills. Recently, offline RL has been proposed as\na framework for training RL policies on pre-existing datasets without any\nonline interaction. However, constraining an algorithm to a fixed dataset\ninduces a state-action distribution shift between training and inference, and\nlimits its applicability to new tasks. In this work, we seek to get the best of\nboth worlds: we consider the problem of pretraining a world model with offline\ndata collected on a real robot, and then finetuning the model on online data\ncollected by planning with the learned model. To mitigate extrapolation errors\nduring online interaction, we propose to regularize the planner at test-time by\nbalancing estimated returns and (epistemic) model uncertainty. We evaluate our\nmethod on a variety of visuo-motor control tasks in simulation and on a real\nrobot, and find that our method enables few-shot finetuning to seen and unseen\ntasks even when offline data is limited. 
Videos, code, and data are available\nat https://yunhaifeng.com/FOWM .\n","authors":["Yunhai Feng","Nicklas Hansen","Ziyan Xiong","Chandramouli Rajagopalan","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16029v1.pdf","comment":"CoRL 2023 Oral; Project website: https://yunhaifeng.com/FOWM"},{"id":"http://arxiv.org/abs/2310.16020v1","updated":"2023-10-24T17:30:26Z","published":"2023-10-24T17:30:26Z","title":"ConvBKI: Real-Time Probabilistic Semantic Mapping Network with\n Quantifiable Uncertainty","summary":" In this paper, we develop a modular neural network for real-time semantic\nmapping in uncertain environments, which explicitly updates per-voxel\nprobabilistic distributions within a neural network layer. Our approach\ncombines the reliability of classical probabilistic algorithms with the\nperformance and efficiency of modern neural networks. Although robotic\nperception is often divided between modern differentiable methods and classical\nexplicit methods, a union of both is necessary for real-time and trustworthy\nperformance. We introduce a novel Convolutional Bayesian Kernel Inference\n(ConvBKI) layer which incorporates semantic segmentation predictions online\ninto a 3D map through a depthwise convolution layer by leveraging conjugate\npriors. We compare ConvBKI against state-of-the-art deep learning approaches\nand probabilistic algorithms for mapping to evaluate reliability and\nperformance. 
We also create a Robot Operating System (ROS) package of ConvBKI\nand test it on real-world perceptually challenging off-road driving data.\n","authors":["Joey Wilson","Yuewei Fu","Joshua Friesen","Parker Ewen","Andrew Capodieci","Paramsothy Jayakumar","Kira Barton","Maani Ghaffari"],"pdf_url":"https://arxiv.org/pdf/2310.16020v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2209.10663"},{"id":"http://arxiv.org/abs/2309.10617v2","updated":"2023-10-24T17:23:26Z","published":"2023-09-19T13:47:31Z","title":"Intelligent Debris Mass Estimation Model for Autonomous Underwater\n Vehicle","summary":" Marine debris poses a significant threat to the survival of marine wildlife,\noften leading to entanglement and starvation, ultimately resulting in death.\nTherefore, removing debris from the ocean is crucial to restore the natural\nbalance and allow marine life to thrive. Instance segmentation is an advanced\nform of object detection that identifies objects and precisely locates and\nseparates them, making it an essential tool for autonomous underwater vehicles\n(AUVs) to navigate and interact with their underwater environment effectively.\nAUVs use image segmentation to analyze images captured by their cameras to\nnavigate underwater environments. In this paper, we use instance segmentation\nto calculate the area of individual objects within an image, we use YOLOV7 in\nRoboflow to generate a set of bounding boxes for each object in the image with\na class label and a confidence score for every detection. A segmentation mask\nis then created for each object by applying a binary mask to the object's\nbounding box. The masks are generated by applying a binary threshold to the\noutput of a convolutional neural network trained to segment objects from the\nbackground. 
Finally, refining the segmentation mask for each object is done by\napplying post-processing techniques such as morphological operations and\ncontour detection, to improve the accuracy and quality of the mask. The process\nof estimating the area of instance segmentation involves calculating the area\nof each segmented instance separately and then summing up the areas of all\ninstances to obtain the total area. The calculation is carried out using\nstandard formulas based on the shape of the object, such as rectangles and\ncircles. In cases where the object is complex, the Monte Carlo method is used\nto estimate the area. This method provides a higher degree of accuracy than\ntraditional methods, especially when using a large number of samples.\n","authors":["Mohana Sri S","Swethaa S","Aouthithiye Barathwaj SR Y","Sai Ganesh CS"],"pdf_url":"https://arxiv.org/pdf/2309.10617v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16014v1","updated":"2023-10-24T17:15:16Z","published":"2023-10-24T17:15:16Z","title":"Human-in-the-Loop Task and Motion Planning for Imitation Learning","summary":" Imitation learning from human demonstrations can teach robots complex\nmanipulation skills, but is time-consuming and labor intensive. In contrast,\nTask and Motion Planning (TAMP) systems are automated and excel at solving\nlong-horizon tasks, but they are difficult to apply to contact-rich tasks. In\nthis paper, we present Human-in-the-Loop Task and Motion Planning (HITL-TAMP),\na novel system that leverages the benefits of both approaches. The system\nemploys a TAMP-gated control mechanism, which selectively gives and takes\ncontrol to and from a human teleoperator. This enables the human teleoperator\nto manage a fleet of robots, maximizing data collection efficiency. The\ncollected human data is then combined with an imitation learning framework to\ntrain a TAMP-gated policy, leading to superior performance compared to training\non full task demonstrations. 
We compared HITL-TAMP to a conventional\nteleoperation system -- users gathered more than 3x the number of demos given\nthe same time budget. Furthermore, proficient agents (75\\%+ success) could be\ntrained from just 10 minutes of non-expert teleoperation data. Finally, we\ncollected 2.1K demos with HITL-TAMP across 12 contact-rich, long-horizon tasks\nand show that the system often produces near-perfect agents. Videos and\nadditional results at https://hitltamp.github.io .\n","authors":["Ajay Mandlekar","Caelan Garrett","Danfei Xu","Dieter Fox"],"pdf_url":"https://arxiv.org/pdf/2310.16014v1.pdf","comment":"Conference on Robot Learning (CoRL) 2023"},{"id":"http://arxiv.org/abs/2310.16003v1","updated":"2023-10-24T16:56:58Z","published":"2023-10-24T16:56:58Z","title":"CVPR 2023 Text Guided Video Editing Competition","summary":" Humans watch more than a billion hours of video per day. Most of this video\nwas edited manually, which is a tedious process. However, AI-enabled\nvideo-generation and video-editing is on the rise. Building on text-to-image\nmodels like Stable Diffusion and Imagen, generative AI has improved\ndramatically on video tasks. But it's hard to evaluate progress in these video\ntasks because there is no standard benchmark. So, we propose a new dataset for\ntext-guided video editing (TGVE), and we run a competition at CVPR to evaluate\nmodels on our TGVE dataset. In this paper we present a retrospective on the\ncompetition and describe the winning method. 
The competition dataset is\navailable at https://sites.google.com/view/loveucvpr23/track4.\n","authors":["Jay Zhangjie Wu","Xiuyu Li","Difei Gao","Zhen Dong","Jinbin Bai","Aishani Singh","Xiaoyu Xiang","Youzeng Li","Zuwei Huang","Yuanxi Sun","Rui He","Feng Hu","Junhua Hu","Hai Huang","Hanyu Zhu","Xu Cheng","Jie Tang","Mike Zheng Shou","Kurt Keutzer","Forrest Iandola"],"pdf_url":"https://arxiv.org/pdf/2310.16003v1.pdf","comment":"Project page: https://sites.google.com/view/loveucvpr23/track4"},{"id":"http://arxiv.org/abs/2310.16002v1","updated":"2023-10-24T16:55:07Z","published":"2023-10-24T16:55:07Z","title":"Integrating View Conditions for Image Synthesis","summary":" In the field of image processing, applying intricate semantic modifications\nwithin existing images remains an enduring challenge. This paper introduces a\npioneering framework that integrates viewpoint information to enhance the\ncontrol of image editing tasks. By surveying existing object editing\nmethodologies, we distill three essential criteria, consistency,\ncontrollability, and harmony, that should be met for an image editing method.\nIn contrast to previous approaches, our method takes the lead in satisfying all\nthree requirements for addressing the challenge of image synthesis. Through\ncomprehensive experiments, encompassing both quantitative assessments and\nqualitative comparisons with contemporary state-of-the-art methods, we present\ncompelling evidence of our framework's superior performance across multiple\ndimensions. 
This work establishes a promising avenue for advancing image\nsynthesis techniques and empowering precise object modifications while\npreserving the visual coherence of the entire composition.\n","authors":["Jinbin Bai","Zhen Dong","Aosong Feng","Xiao Zhang","Tian Ye","Kaicheng Zhou","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2310.16002v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15999v1","updated":"2023-10-24T16:48:56Z","published":"2023-10-24T16:48:56Z","title":"Transitivity Recovering Decompositions: Interpretable and Robust\n Fine-Grained Relationships","summary":" Recent advances in fine-grained representation learning leverage\nlocal-to-global (emergent) relationships for achieving state-of-the-art\nresults. The relational representations relied upon by such methods, however,\nare abstract. We aim to deconstruct this abstraction by expressing them as\ninterpretable graphs over image views. We begin by theoretically showing that\nabstract relational representations are nothing but a way of recovering\ntransitive relationships among local views. Based on this, we design\nTransitivity Recovering Decompositions (TRD), a graph-space search algorithm\nthat identifies interpretable equivalents of abstract emergent relationships at\nboth instance and class levels, and with no post-hoc computations. We\nadditionally show that TRD is provably robust to noisy views, with empirical\nevidence also supporting this finding. 
The latter allows TRD to perform at par\nor even better than the state-of-the-art, while being fully interpretable.\nImplementation is available at https://github.com/abhrac/trd.\n","authors":["Abhra Chaudhuri","Massimiliano Mancini","Zeynep Akata","Anjan Dutta"],"pdf_url":"https://arxiv.org/pdf/2310.15999v1.pdf","comment":"Neural Information Processing Systems (NeurIPS) 2023"},{"id":"http://arxiv.org/abs/2310.15985v1","updated":"2023-10-24T16:36:51Z","published":"2023-10-24T16:36:51Z","title":"Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning","summary":" This paper presents a novel approach to Single-Positive Multi-label Learning.\nIn general multi-label learning, a model learns to predict multiple labels or\ncategories for a single input image. This is in contrast with standard\nmulti-class image classification, where the task is predicting a single label\nfrom many possible labels for an image. Single-Positive Multi-label Learning\n(SPML) specifically considers learning to predict multiple labels when there is\nonly a single annotation per image in the training data. Multi-label learning\nis in many ways a more realistic task than single-label learning as real-world\ndata often involves instances belonging to multiple categories simultaneously;\nhowever, most common computer vision datasets predominantly contain single\nlabels due to the inherent complexity and cost of collecting multiple high\nquality annotations for each instance. We propose a novel approach called\nVision-Language Pseudo-Labeling (VLPL), which uses a vision-language model to\nsuggest strong positive and negative pseudo-labels, and outperforms the current\nSOTA methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on NUS-WIDE, and\n8.4% on CUB-Birds. 
Our code and data are available at\nhttps://github.com/mvrl/VLPL.\n","authors":["Xin Xing","Zhexiao Xiong","Abby Stylianou","Srikumar Sastry","Liyu Gong","Nathan Jacobs"],"pdf_url":"https://arxiv.org/pdf/2310.15985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15984v1","updated":"2023-10-24T16:34:03Z","published":"2023-10-24T16:34:03Z","title":"Geometry-Aware Video Quality Assessment for Dynamic Digital Human","summary":" Dynamic Digital Humans (DDHs) are 3D digital models that are animated using\npredefined motions and are inevitably bothered by noise/shift during the\ngeneration process and compression distortion during the transmission process,\nwhich needs to be perceptually evaluated. Usually, DDHs are displayed as 2D\nrendered animation videos and it is natural to adapt video quality assessment\n(VQA) methods to DDH quality assessment (DDH-QA) tasks. However, the VQA\nmethods are highly dependent on viewpoints and less sensitive to geometry-based\ndistortions. Therefore, in this paper, we propose a novel no-reference (NR)\ngeometry-aware video quality assessment method for DDH-QA challenge. Geometry\ncharacteristics are described by the statistical parameters estimated from the\nDDHs' geometry attribute distributions. Spatial and temporal features are\nacquired from the rendered videos. Finally, all kinds of features are\nintegrated and regressed into quality values. 
Experimental results show that\nthe proposed method achieves state-of-the-art performance on the DDH-QA\ndatabase.\n","authors":["Zicheng Zhang","Yingjie Zhou","Wei Sun","Xiongkuo Min","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2310.15984v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.09708v3","updated":"2023-10-24T16:22:40Z","published":"2022-08-20T15:17:40Z","title":"DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two\n Quantization","summary":" Efficiently deploying deep neural networks on low-resource edge devices is\nchallenging due to their ever-increasing resource requirements. To address this\nissue, researchers have proposed multiplication-free neural networks, such as\nPower-of-Two quantization, or also known as Shift networks, which aim to reduce\nmemory usage and simplify computation. However, existing low-bit Shift networks\nare not as accurate as their full-precision counterparts, typically suffering\nfrom limited weight range encoding schemes and quantization loss. In this\npaper, we propose the DenseShift network, which significantly improves the\naccuracy of Shift networks, achieving competitive performance to full-precision\nnetworks for vision and speech applications. In addition, we introduce a method\nto deploy an efficient DenseShift network using non-quantized floating-point\nactivations, while obtaining 1.6X speed-up over existing methods. To achieve\nthis, we demonstrate that zero-weight values in low-bit Shift networks do not\ncontribute to model capacity and negatively impact inference computation. To\naddress this issue, we propose a zero-free shifting mechanism that simplifies\ninference and increases model capacity. 
We further propose a sign-scale\ndecomposition design to enhance training efficiency and a low-variance random\ninitialization strategy to improve the model's transfer learning performance.\nOur extensive experiments on various computer vision and speech tasks\ndemonstrate that DenseShift outperforms existing low-bit multiplication-free\nnetworks and achieves competitive performance compared to full-precision\nnetworks. Furthermore, our proposed approach exhibits strong transfer learning\nperformance without a drop in accuracy. Our code was released on GitHub.\n","authors":["Xinlin Li","Bang Liu","Rui Heng Yang","Vanessa Courville","Chao Xing","Vahid Partovi Nia"],"pdf_url":"https://arxiv.org/pdf/2208.09708v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.08546v4","updated":"2023-10-24T16:02:42Z","published":"2023-05-15T11:17:17Z","title":"Towards Visual Saliency Explanations of Face Verification","summary":" In the past years, deep convolutional neural networks have been pushing the\nfrontier of face recognition (FR) techniques in both verification and\nidentification scenarios. Despite the high accuracy, they are often criticized\nfor lacking explainability. There has been an increasing demand for\nunderstanding the decision-making process of deep face recognition systems.\nRecent studies have investigated the usage of visual saliency maps as an\nexplanation, but they often lack a discussion and analysis in the context of\nface recognition. This paper concentrates on explainable face verification\ntasks and conceives a new explanation framework. Firstly, a definition of the\nsaliency-based explanation method is provided, which focuses on the decisions\nmade by the deep FR model. Secondly, a new model-agnostic explanation method\nnamed CorrRISE is proposed to produce saliency maps, which reveal both the\nsimilar and dissimilar regions of any given pair of face images. 
Then, an\nevaluation methodology is designed to measure the performance of general visual\nsaliency explanation methods in face verification. Finally, substantial visual\nand quantitative results have shown that the proposed CorrRISE method\ndemonstrates promising results in comparison with other state-of-the-art\nexplainable face verification approaches.\n","authors":["Yuhang Lu","Zewei Xu","Touradj Ebrahimi"],"pdf_url":"https://arxiv.org/pdf/2305.08546v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11862v2","updated":"2023-10-24T15:56:14Z","published":"2023-10-18T10:26:18Z","title":"Learning to Generate Parameters of ConvNets for Unseen Image Data","summary":" Typical Convolutional Neural Networks (ConvNets) depend heavily on large\namounts of image data and resort to an iterative optimization algorithm (e.g.,\nSGD or Adam) to learn network parameters, which makes training very time- and\nresource-intensive. In this paper, we propose a new training paradigm and\nformulate the parameter learning of ConvNets into a prediction task: given a\nConvNet architecture, we observe there exists correlations between image\ndatasets and their corresponding optimal network parameters, and explore if we\ncan learn a hyper-mapping between them to capture the relations, such that we\ncan directly predict the parameters of the network for an image dataset never\nseen during the training phase. To do this, we put forward a new hypernetwork\nbased model, called PudNet, which intends to learn a mapping between datasets\nand their corresponding network parameters, and then predicts parameters for\nunseen data with only a single forward propagation. 
Moreover, our model\nbenefits from a series of adaptive hyper recurrent units sharing weights to\ncapture the dependencies of parameters among different network layers.\nExtensive experiments demonstrate that our proposed method achieves good\nefficacy for unseen image datasets on two kinds of settings: Intra-dataset\nprediction and Inter-dataset prediction. Our PudNet can also well scale up to\nlarge-scale datasets, e.g., ImageNet-1K. It takes 8967 GPU seconds to train\nResNet-18 on the ImageNet-1K using GC from scratch and obtain a top-5 accuracy\nof 44.65 %. However, our PudNet costs only 3.89 GPU seconds to predict the\nnetwork parameters of ResNet-18 achieving comparable performance (44.92 %),\nmore than 2,300 times faster than the traditional training paradigm.\n","authors":["Shiye Wang","Kaituo Feng","Changsheng Li","Ye Yuan","Guoren Wang"],"pdf_url":"https://arxiv.org/pdf/2310.11862v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15955v1","updated":"2023-10-24T15:54:11Z","published":"2023-10-24T15:54:11Z","title":"Decoupled DETR: Spatially Disentangling Localization and Classification\n for Improved End-to-End Object Detection","summary":" The introduction of DETR represents a new paradigm for object detection.\nHowever, its decoder conducts classification and box localization using shared\nqueries and cross-attention layers, leading to suboptimal results. We observe\nthat different regions of interest in the visual feature map are suitable for\nperforming query classification and box localization tasks, even for the same\nobject. Salient regions provide vital information for classification, while the\nboundaries around them are more favorable for box regression. Unfortunately,\nsuch spatial misalignment between these two tasks greatly hinders DETR's\ntraining. Therefore, in this work, we focus on decoupling localization and\nclassification tasks in DETR. 
To achieve this, we introduce a new design scheme\ncalled spatially decoupled DETR (SD-DETR), which includes a task-aware query\ngeneration module and a disentangled feature learning process. We elaborately\ndesign the task-aware query initialization process and divide the\ncross-attention block in the decoder to allow the task-aware queries to match\ndifferent visual regions. Meanwhile, we also observe that the prediction\nmisalignment problem for high classification confidence and precise\nlocalization exists, so we propose an alignment loss to further guide the\nspatially decoupled DETR training. Through extensive experiments, we\ndemonstrate that our approach achieves a significant improvement in MSCOCO\ndatasets compared to previous work. For instance, we improve the performance of\nConditional DETR by 4.5 AP. By spatially disentangling the two tasks, our\nmethod overcomes the misalignment problem and greatly improves the performance\nof DETR for object detection.\n","authors":["Manyuan Zhang","Guanglu Song","Yu Liu","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.15955v1.pdf","comment":"accepted by ICCV2023"},{"id":"http://arxiv.org/abs/2310.15952v1","updated":"2023-10-24T15:53:07Z","published":"2023-10-24T15:53:07Z","title":"Improving Robustness and Reliability in Medical Image Classification\n with Latent-Guided Diffusion and Nested-Ensembles","summary":" While deep learning models have achieved remarkable success across a range of\nmedical image analysis tasks, deployment of these models in real clinical\ncontexts requires that they be robust to variability in the acquired images.\nWhile many methods apply predefined transformations to augment the training\ndata to enhance test-time robustness, these transformations may not ensure the\nmodel's robustness to the diverse variability seen in patient images. 
In this\npaper, we introduce a novel three-stage approach based on transformers coupled\nwith conditional diffusion models, with the goal of improving model robustness\nto the kinds of imaging variability commonly encountered in practice without\nthe need for pre-determined data augmentation strategies. To this end, multiple\nimage encoders first learn hierarchical feature representations to build\ndiscriminative latent spaces. Next, a reverse diffusion process, guided by the\nlatent code, acts on an informative prior and proposes prediction candidates in\na generative manner. Finally, several prediction candidates are aggregated in a\nbi-level aggregation protocol to produce the final output. Through extensive\nexperiments on medical imaging benchmark datasets, we show that our method\nimproves upon state-of-the-art methods in terms of robustness and confidence\ncalibration. Additionally, we introduce a strategy to quantify the prediction\nuncertainty at the instance level, increasing their trustworthiness to\nclinicians using them in clinical practice.\n","authors":["Xing Shen","Hengguan Huang","Brennan Nichyporuk","Tal Arbel"],"pdf_url":"https://arxiv.org/pdf/2310.15952v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.15948v1","updated":"2023-10-24T15:50:35Z","published":"2023-10-24T15:50:35Z","title":"Language-driven Scene Synthesis using Multi-conditional Diffusion Model","summary":" Scene synthesis is a challenging problem with several industrial\napplications. Recently, substantial efforts have been directed to synthesize\nthe scene using human motions, room layouts, or spatial graphs as the input.\nHowever, few studies have addressed this problem from multiple modalities,\nespecially combining text prompts. In this paper, we propose a language-driven\nscene synthesis task, which is a new task that integrates text prompts, human\nmotion, and existing objects for scene synthesis. 
Unlike other single-condition\nsynthesis tasks, our problem involves multiple conditions and requires a\nstrategy for processing and encoding them into a unified space. To address the\nchallenge, we present a multi-conditional diffusion model, which differs from\nthe implicit unification approach of other diffusion literature by explicitly\npredicting the guiding points for the original data distribution. We\ndemonstrate that our approach is theoretically supportive. The intensive\nexperiment results illustrate that our method outperforms state-of-the-art\nbenchmarks and enables natural scene editing applications. The source code and\ndataset can be accessed at https://lang-scene-synth.github.io/.\n","authors":["An Vuong","Minh Nhat Vu","Toan Tien Nguyen","Baoru Huang","Dzung Nguyen","Thieu Vo","Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.15948v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15946v1","updated":"2023-10-24T15:47:52Z","published":"2023-10-24T15:47:52Z","title":"ShARc: Shape and Appearance Recognition for Person Identification\n In-the-wild","summary":" Identifying individuals in unconstrained video settings is a valuable yet\nchallenging task in biometric analysis due to variations in appearances,\nenvironments, degradations, and occlusions. In this paper, we present ShARc, a\nmultimodal approach for video-based person identification in uncontrolled\nenvironments that emphasizes 3-D body shape, pose, and appearance. We introduce\ntwo encoders: a Pose and Shape Encoder (PSE) and an Aggregated Appearance\nEncoder (AAE). PSE encodes the body shape via binarized silhouettes, skeleton\nmotions, and 3-D body shape, while AAE provides two levels of temporal\nappearance feature aggregation: attention-based feature aggregation and\naveraging aggregation. 
For attention-based feature aggregation, we employ\nspatial and temporal attention to focus on key areas for person distinction.\nFor averaging aggregation, we introduce a novel flattening layer after\naveraging to extract more distinguishable information and reduce overfitting of\nattention. We utilize centroid feature averaging for gallery registration. We\ndemonstrate significant improvements over existing state-of-the-art methods on\npublic datasets, including CCVID, MEVID, and BRIAR.\n","authors":["Haidong Zhu","Wanrong Zheng","Zhaoheng Zheng","Ram Nevatia"],"pdf_url":"https://arxiv.org/pdf/2310.15946v1.pdf","comment":"WACV 2024"},{"id":"http://arxiv.org/abs/2306.16635v2","updated":"2023-10-24T15:33:38Z","published":"2023-06-29T02:19:49Z","title":"Improving Fairness in Deepfake Detection","summary":" Despite the development of effective deepfake detection models in recent\nyears, several recent studies have demonstrated that biases in the training\ndata utilized to develop deepfake detection models can lead to unfair\nperformance for demographic groups of different races and/or genders. Such can\nresult in these groups being unfairly targeted or excluded from detection,\nallowing misclassified deepfakes to manipulate public opinion and erode trust\nin the model. While these studies have focused on identifying and evaluating\nthe unfairness in deepfake detection, no methods have been developed to address\nthe fairness issue of deepfake detection at the algorithm level. In this work,\nwe make the first attempt to improve deepfake detection fairness by proposing\nnovel loss functions to train fair deepfake detection models in ways that are\nagnostic or aware of demographic factors. Extensive experiments on four\ndeepfake datasets and five deepfake detectors demonstrate the effectiveness and\nflexibility of our approach in improving the deepfake detection fairness.\n","authors":["Yan Ju","Shu Hu","Shan Jia","George H. 
Chen","Siwei Lyu"],"pdf_url":"https://arxiv.org/pdf/2306.16635v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15913v1","updated":"2023-10-24T15:15:57Z","published":"2023-10-24T15:15:57Z","title":"Mitigate Domain Shift by Primary-Auxiliary Objectives Association for\n Generalizing Person ReID","summary":" While deep learning has significantly improved ReID model accuracy under the\nindependent and identical distribution (IID) assumption, it has also become\nclear that such models degrade notably when applied to an unseen novel domain\ndue to unpredictable/unknown domain shift. Contemporary domain generalization\n(DG) ReID models struggle in learning domain-invariant representation solely\nthrough training on an instance classification objective. We consider that a\ndeep learning model is heavily influenced and therefore biased towards\ndomain-specific characteristics, e.g., background clutter, scale and viewpoint\nvariations, limiting the generalizability of the learned model, and hypothesize\nthat the pedestrians are domain invariant owning they share the same structural\ncharacteristics. To enable the ReID model to be less domain-specific from these\npure pedestrians, we introduce a method that guides model learning of the\nprimary ReID instance classification objective by a concurrent auxiliary\nlearning objective on weakly labeled pedestrian saliency detection. To solve\nthe problem of conflicting optimization criteria in the model parameter space\nbetween the two learning objectives, we introduce a Primary-Auxiliary\nObjectives Association (PAOA) mechanism to calibrate the loss gradients of the\nauxiliary task towards the primary learning task gradients. Benefiting from the\nharmonious multitask learning design, our model can be extended with the recent\ntest-time diagram to form the PAOA+, which performs on-the-fly optimization\nagainst the auxiliary objective in order to maximize the model's generative\ncapacity in the test target domain. 
Experiments demonstrate the superiority of\nthe proposed PAOA model.\n","authors":["Qilei Li","Shaogang Gong"],"pdf_url":"https://arxiv.org/pdf/2310.15913v1.pdf","comment":"Accepted to WACV2024"},{"id":"http://arxiv.org/abs/2310.15898v1","updated":"2023-10-24T15:02:02Z","published":"2023-10-24T15:02:02Z","title":"YOLO-Angio: An Algorithm for Coronary Anatomy Segmentation","summary":" Coronary angiography remains the gold standard for diagnosis of coronary\nartery disease, the most common cause of death worldwide. While this procedure\nis performed more than 2 million times annually, there remain few methods for\nfast and accurate automated measurement of disease and localization of coronary\nanatomy. Here, we present our solution to the Automatic Region-based Coronary\nArtery Disease diagnostics using X-ray angiography images (ARCADE) challenge\nheld at MICCAI 2023. For the artery segmentation task, our three-stage approach\ncombines preprocessing and feature selection by classical computer vision to\nenhance vessel contrast, followed by an ensemble model based on YOLOv8 to\npropose possible vessel candidates by generating a vessel map. A final\nsegmentation is based on a logic-based approach to reconstruct the coronary\ntree in a graph-based sorting method. Our entry to the ARCADE challenge placed\n3rd overall. Using the official metric for evaluation, we achieved an F1 score\nof 0.422 and 0.4289 on the validation and hold-out sets respectively.\n","authors":["Tom Liu","Hui Lin","Aggelos K. 
Katsaggelos","Adrienne Kline"],"pdf_url":"https://arxiv.org/pdf/2310.15898v1.pdf","comment":"MICCAI Conference ARCADE Grand Challenge, YOLO, Computer Vision,"},{"id":"http://arxiv.org/abs/2303.12484v2","updated":"2023-10-24T15:01:12Z","published":"2023-03-22T11:51:49Z","title":"Label-Efficient Deep Learning in Medical Image Analysis: Challenges and\n Future Directions","summary":" Deep learning has seen rapid growth in recent years and achieved\nstate-of-the-art performance in a wide range of applications. However, training\nmodels typically requires expensive and time-consuming collection of large\nquantities of labeled data. This is particularly true within the scope of\nmedical imaging analysis (MIA), where data are limited and labels are expensive\nto be acquired. Thus, label-efficient deep learning methods are developed to\nmake comprehensive use of the labeled data as well as the abundance of\nunlabeled and weak-labeled data. In this survey, we extensively investigated\nover 300 recent papers to provide a comprehensive overview of recent progress\non label-efficient learning strategies in MIA. We first present the background\nof label-efficient learning and categorize the approaches into different\nschemes. Next, we examine the current state-of-the-art methods in detail\nthrough each scheme. Specifically, we provide an in-depth investigation,\ncovering not only canonical semi-supervised, self-supervised, and\nmulti-instance learning schemes, but also recently emerged active and\nannotation-efficient learning strategies. 
Moreover, as a comprehensive\ncontribution to the field, this survey not only elucidates the commonalities\nand unique features of the surveyed methods but also presents a detailed\nanalysis of the current challenges in the field and suggests potential avenues\nfor future research.\n","authors":["Cheng Jin","Zhengrui Guo","Yi Lin","Luyang Luo","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2303.12484v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.15206v3","updated":"2023-10-24T14:30:03Z","published":"2023-03-24T11:53:48Z","title":"Perceptual Quality Assessment of NeRF and Neural View Synthesis Methods\n for Front-Facing Views","summary":" Neural view synthesis (NVS) is one of the most successful techniques for\nsynthesizing free viewpoint videos, capable of achieving high fidelity from\nonly a sparse set of captured images. This success has led to many variants of\nthe techniques, each evaluated on a set of test views typically using image\nquality metrics such as PSNR, SSIM, or LPIPS. There has been a lack of research\non how NVS methods perform with respect to perceived video quality. We present\nthe first study on perceptual evaluation of NVS and NeRF variants. For this\nstudy, we collected two datasets of scenes captured in a controlled lab\nenvironment as well as in-the-wild. In contrast to existing datasets, these\nscenes come with reference video sequences, allowing us to test for temporal\nartifacts and subtle distortions that are easily overlooked when viewing only\nstatic images. We measured the quality of videos synthesized by several NVS\nmethods in a well-controlled perceptual quality assessment experiment as well\nas with many existing state-of-the-art image/video quality metrics. 
We present\na detailed analysis of the results and recommendations for dataset and metric\nselection for NVS evaluation.\n","authors":["Hanxue Liang","Tianhao Wu","Param Hanji","Francesco Banterle","Hongyun Gao","Rafal Mantiuk","Cengiz Oztireli"],"pdf_url":"https://arxiv.org/pdf/2303.15206v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.03412v3","updated":"2023-10-24T14:22:39Z","published":"2021-06-07T08:23:02Z","title":"Resolution learning in deep convolutional networks using scale-space\n theory","summary":" Resolution in deep convolutional neural networks (CNNs) is typically bounded\nby the receptive field size through filter sizes, and subsampling layers or\nstrided convolutions on feature maps. The optimal resolution may vary\nsignificantly depending on the dataset. Modern CNNs hard-code their resolution\nhyper-parameters in the network architecture which makes tuning such\nhyper-parameters cumbersome. We propose to do away with hard-coded resolution\nhyper-parameters and aim to learn the appropriate resolution from data. We use\nscale-space theory to obtain a self-similar parametrization of filters and make\nuse of the N-Jet: a truncated Taylor series to approximate a filter by a\nlearned combination of Gaussian derivative filters. The parameter sigma of the\nGaussian basis controls both the amount of detail the filter encodes and the\nspatial extent of the filter. Since sigma is a continuous parameter, we can\noptimize it with respect to the loss. The proposed N-Jet layer achieves\ncomparable performance when used in state-of-the art architectures, while\nlearning the correct resolution in each layer automatically. We evaluate our\nN-Jet layer on both classification and segmentation, and we show that learning\nsigma is especially beneficial for inputs at multiple sizes.\n","authors":["Silvia L. Pintea","Nergis Tomen","Stanley F. Goes","Marco Loog","Jan C. 
van Gemert"],"pdf_url":"https://arxiv.org/pdf/2106.03412v3.pdf","comment":"Preprint accepted by IEEE Transactions on Image Processing, 2021\n (TIP). Link to final published article:\n https://ieeexplore.ieee.org/abstract/document/9552550"},{"id":"http://arxiv.org/abs/2310.15848v1","updated":"2023-10-24T14:01:53Z","published":"2023-10-24T14:01:53Z","title":"On Responsible Machine Learning Datasets with Fairness, Privacy, and\n Regulatory Norms","summary":" Artificial Intelligence (AI) has made its way into various scientific fields,\nproviding astonishing improvements over existing algorithms for a wide variety\nof tasks. In recent years, there have been severe concerns over the\ntrustworthiness of AI technologies. The scientific community has focused on the\ndevelopment of trustworthy AI algorithms. However, machine and deep learning\nalgorithms, popular in the AI community today, depend heavily on the data used\nduring their development. These learning algorithms identify patterns in the\ndata, learning the behavioral objective. Any flaws in the data have the\npotential to translate directly into algorithms. In this study, we discuss the\nimportance of Responsible Machine Learning Datasets and propose a framework to\nevaluate the datasets through a responsible rubric. While existing work focuses\non the post-hoc evaluation of algorithms for their trustworthiness, we provide\na framework that considers the data component separately to understand its role\nin the algorithm. We discuss responsible datasets through the lens of fairness,\nprivacy, and regulatory compliance and provide recommendations for constructing\nfuture datasets. After surveying over 100 datasets, we use 60 datasets for\nanalysis and demonstrate that none of these datasets is immune to issues of\nfairness, privacy preservation, and regulatory compliance. We provide\nmodifications to the ``datasheets for datasets\" with important additions for\nimproved dataset documentation. 
With governments around the world regularizing\ndata protection laws, the method for the creation of datasets in the scientific\ncommunity requires revision. We believe this study is timely and relevant in\ntoday's era of AI.\n","authors":["Surbhi Mittal","Kartik Thakral","Richa Singh","Mayank Vatsa","Tamar Glaser","Cristian Canton Ferrer","Tal Hassner"],"pdf_url":"https://arxiv.org/pdf/2310.15848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.15512v2","updated":"2023-10-24T13:57:40Z","published":"2023-08-29T15:39:15Z","title":"Shatter and Gather: Learning Referring Image Segmentation with Text\n Supervision","summary":" Referring image segmentation, the task of segmenting any arbitrary entities\ndescribed in free-form texts, opens up a variety of vision applications.\nHowever, manual labeling of training data for this task is prohibitively\ncostly, leading to lack of labeled data for training. We address this issue by\na weakly supervised learning approach using text descriptions of training\nimages as the only source of supervision. To this end, we first present a new\nmodel that discovers semantic entities in input image and then combines such\nentities relevant to text query to predict the mask of the referent. We also\npresent a new loss function that allows the model to be trained without any\nfurther supervision. 
Our method was evaluated on four public benchmarks for\nreferring image segmentation, where it clearly outperformed the existing method\nfor the same task and recent open-vocabulary segmentation models on all the\nbenchmarks.\n","authors":["Dongwon Kim","Namyup Kim","Cuiling Lan","Suha Kwak"],"pdf_url":"https://arxiv.org/pdf/2308.15512v2.pdf","comment":"Accepted to ICCV 2023, Project page:\n https://southflame.github.io/sag/"},{"id":"http://arxiv.org/abs/2310.15827v1","updated":"2023-10-24T13:28:46Z","published":"2023-10-24T13:28:46Z","title":"Automatic Aorta Segmentation with Heavily Augmented, High-Resolution 3-D\n ResUNet: Contribution to the SEG.A Challenge","summary":" Automatic aorta segmentation from 3-D medical volumes is an important yet\ndifficult task. Several factors make the problem challenging, e.g. the\npossibility of aortic dissection or the difficulty with segmenting and\nannotating the small branches. This work presents a contribution by the MedGIFT\nteam to the SEG.A challenge organized during the MICCAI 2023 conference. We\npropose a fully automated algorithm based on deep encoder-decoder architecture.\nThe main assumption behind our work is that data preprocessing and augmentation\nare much more important than the deep architecture, especially in low data\nregimes. Therefore, the solution is based on a variant of traditional\nconvolutional U-Net. The proposed solution achieved a Dice score above 0.9 for\nall testing cases with the highest stability among all participants. The method\nscored 1st, 4th, and 3rd in terms of the clinical evaluation, quantitative\nresults, and volumetric meshing quality, respectively. 
We freely release the\nsource code, pretrained model, and provide access to the algorithm on the\nGrand-Challenge platform.\n","authors":["Marek Wodzinski","Henning Müller"],"pdf_url":"https://arxiv.org/pdf/2310.15827v1.pdf","comment":"MICCAI 2023 - SEG.A Challenge Contribution"},{"id":"http://arxiv.org/abs/2310.13570v2","updated":"2023-10-24T13:24:25Z","published":"2023-10-20T15:08:17Z","title":"A Simple Baseline for Knowledge-Based Visual Question Answering","summary":" This paper is on the problem of Knowledge-Based Visual Question Answering\n(KB-VQA). Recent works have emphasized the significance of incorporating both\nexplicit (through external databases) and implicit (through LLMs) knowledge to\nanswer questions requiring external knowledge effectively. A common limitation\nof such approaches is that they consist of relatively complicated pipelines and\noften heavily rely on accessing GPT-3 API. Our main contribution in this paper\nis to propose a much simpler and readily reproducible pipeline which, in a\nnutshell, is based on efficient in-context learning by prompting LLaMA (1 and\n2) using question-informative captions as contextual information. Contrary to\nrecent approaches, our method is training-free, does not require access to\nexternal databases or APIs, and yet achieves state-of-the-art accuracy on the\nOK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to\nunderstand important aspects of our method. 
Our code is publicly available at\nhttps://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA\n","authors":["Alexandros Xenos","Themos Stafylakis","Ioannis Patras","Georgios Tzimiropoulos"],"pdf_url":"https://arxiv.org/pdf/2310.13570v2.pdf","comment":"Accepted at EMNLP 2023 (camera-ready version)"},{"id":"http://arxiv.org/abs/2310.15787v1","updated":"2023-10-24T12:34:58Z","published":"2023-10-24T12:34:58Z","title":"SequenceMatch: Revisiting the design of weak-strong augmentations for\n Semi-supervised learning","summary":" Semi-supervised learning (SSL) has become popular in recent years because it\nallows the training of a model using a large amount of unlabeled data. However,\none issue that many SSL methods face is the confirmation bias, which occurs\nwhen the model is overfitted to the small labeled training dataset and produces\noverconfident, incorrect predictions. To address this issue, we propose\nSequenceMatch, an efficient SSL method that utilizes multiple data\naugmentations. The key element of SequenceMatch is the inclusion of a medium\naugmentation for unlabeled data. By taking advantage of different augmentations\nand the consistency constraints between each pair of augmented examples,\nSequenceMatch helps reduce the divergence between the prediction distribution\nof the model for weakly and strongly augmented examples. In addition,\nSequenceMatch defines two different consistency constraints for high and\nlow-confidence predictions. As a result, SequenceMatch is more data-efficient\nthan ReMixMatch, and more time-efficient than both ReMixMatch ($\\times4$) and\nCoMatch ($\\times2$) while having higher accuracy. Despite its simplicity,\nSequenceMatch consistently outperforms prior methods on standard benchmarks,\nsuch as CIFAR-10/100, SVHN, and STL-10. It also surpasses prior\nstate-of-the-art methods by a large margin on large-scale datasets such as\nImageNet, with a 38.46\\% error rate. 
Code is available at\nhttps://github.com/beandkay/SequenceMatch.\n","authors":["Khanh-Binh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.15787v1.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.15778v1","updated":"2023-10-24T12:25:37Z","published":"2023-10-24T12:25:37Z","title":"3D Masked Autoencoders for Enhanced Privacy in MRI Scans","summary":" MRI scans provide valuable medical information, however they also contain\nsensitive and personally identifiable information (PII) that needs to be\nprotected. Whereas MRI metadata is easily sanitized, MRI image data is a\nprivacy risk because it contains information to render highly-realistic 3D\nvisualizations of a patient's head, enabling malicious actors to possibly\nidentify the subject by cross-referencing a database. Data anonymization and\nde-identification is concerned with ensuring the privacy and confidentiality of\nindividuals' personal information. Traditional MRI de-identification methods\nremove privacy-sensitive parts (e.g. eyes, nose etc.) from a given scan. This\ncomes at the expense of introducing a domain shift that can throw off\ndownstream analyses. Recently, a GAN-based approach was proposed to de-identify\na patient's scan by remodeling it (e.g. changing the face) rather than by\nremoving parts. In this work, we propose CP-MAE, a model that de-identifies the\nface using masked autoencoders and that outperforms all previous approaches in\nterms of downstream task performance as well as de-identification. 
With our\nmethod we are able to synthesize scans of resolution up to $256^3$ (previously\n128 cubic) which constitutes an eight-fold increase in the number of voxels.\nUsing our construction we were able to design a system that exhibits a highly\nrobust training stage, making it easy to fit the network on novel data.\n","authors":["Lennart Alexander Van der Goten","Kevin Smith"],"pdf_url":"https://arxiv.org/pdf/2310.15778v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.11740v2","updated":"2023-10-24T12:17:47Z","published":"2022-09-19T08:15:30Z","title":"On the Shift Invariance of Max Pooling Feature Maps in Convolutional\n Neural Networks","summary":" This paper focuses on improving the mathematical interpretability of\nconvolutional neural networks (CNNs) in the context of image classification.\nSpecifically, we tackle the instability issue arising in their first layer,\nwhich tends to learn parameters that closely resemble oriented band-pass\nfilters when trained on datasets like ImageNet. Subsampled convolutions with\nsuch Gabor-like filters are prone to aliasing, causing sensitivity to small\ninput shifts. In this context, we establish conditions under which the max\npooling operator approximates a complex modulus, which is nearly shift\ninvariant. We then derive a measure of shift invariance for subsampled\nconvolutions followed by max pooling. 
In particular, we highlight the crucial\nrole played by the filter's frequency and orientation in achieving stability.\nWe experimentally validate our theory by considering a deterministic feature\nextractor based on the dual-tree complex wavelet packet transform, a particular\ncase of discrete Gabor-like decomposition.\n","authors":["Hubert Leterme","Kévin Polisano","Valérie Perrier","Karteek Alahari"],"pdf_url":"https://arxiv.org/pdf/2209.11740v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15767v1","updated":"2023-10-24T12:13:51Z","published":"2023-10-24T12:13:51Z","title":"Unpaired MRI Super Resolution with Self-Supervised Contrastive Learning","summary":" High-resolution (HR) magnetic resonance imaging (MRI) is crucial for\nenhancing diagnostic accuracy in clinical settings. Nonetheless, the inherent\nlimitation of MRI resolution restricts its widespread applicability. Deep\nlearning-based image super-resolution (SR) methods exhibit promise in improving\nMRI resolution without additional cost. However, these methods frequently\nrequire a substantial number of HR MRI images for training, which can be\nchallenging to acquire. In this paper, we propose an unpaired MRI SR approach\nthat employs self-supervised contrastive learning to enhance SR performance\nwith limited training data. Our approach leverages both authentic HR images and\nsynthetically generated SR images to construct positive and negative sample\npairs, thus facilitating the learning of discriminative features. Empirical\nresults presented in this study underscore significant enhancements in the peak\nsignal-to-noise ratio and structural similarity index, even when a paucity of\nHR images is available. 
These findings accentuate the potential of our approach\nin addressing the challenge of limited training data, thereby contributing to\nthe advancement of high-resolution MRI in clinical applications.\n","authors":["Hao Li","Quanwei Liu","Jianan Liu","Xiling Liu","Yanni Dong","Tao Huang","Zhihan Lv"],"pdf_url":"https://arxiv.org/pdf/2310.15767v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15764v1","updated":"2023-10-24T12:11:19Z","published":"2023-10-24T12:11:19Z","title":"Debiasing, calibrating, and improving Semi-supervised Learning\n performance via simple Ensemble Projector","summary":" Recent studies on semi-supervised learning (SSL) have achieved great success.\nDespite their promising performance, current state-of-the-art methods tend\ntoward increasingly complex designs at the cost of introducing more network\ncomponents and additional training procedures. In this paper, we propose a\nsimple method named Ensemble Projectors Aided for Semi-supervised Learning\n(EPASS), which focuses mainly on improving the learned embeddings to boost the\nperformance of the existing contrastive joint-training semi-supervised learning\nframeworks. Unlike standard methods, where the learned embeddings from one\nprojector are stored in memory banks to be used with contrastive learning,\nEPASS stores the ensemble embeddings from multiple projectors in memory banks.\nAs a result, EPASS improves generalization, strengthens feature representation,\nand boosts performance. For instance, EPASS improves strong baselines for\nsemi-supervised learning by 39.47\\%/31.39\\%/24.70\\% top-1 error rate, while\nusing only 100k/1\\%/10\\% of labeled data for SimMatch, and achieves\n40.24\\%/32.64\\%/25.90\\% top-1 error rate for CoMatch on the ImageNet dataset.\nThese improvements are consistent across methods, network architectures, and\ndatasets, proving the general effectiveness of the proposed methods. 
Code is\navailable at https://github.com/beandkay/EPASS.\n","authors":["Khanh-Binh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.15764v1.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.15747v1","updated":"2023-10-24T11:44:39Z","published":"2023-10-24T11:44:39Z","title":"Large Language Models are Temporal and Causal Reasoners for Video\n Question Answering","summary":" Large Language Models (LLMs) have shown remarkable performances on a wide\nrange of natural language understanding and generation tasks. We observe that\nthe LLMs provide effective priors in exploiting $\\textit{linguistic shortcuts}$\nfor temporal and causal reasoning in Video Question Answering (VideoQA).\nHowever, such priors often cause suboptimal results on VideoQA by leading the\nmodel to over-rely on questions, $\\textit{i.e.}$, $\\textit{linguistic bias}$,\nwhile ignoring visual content. This is also known as `ungrounded guesses' or\n`hallucinations'. To address this problem while leveraging LLMs' prior on\nVideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to\npredict all the combinations of $\\langle$V, Q, A$\\rangle$ triplet by flipping\nthe source pair and the target label to understand their complex relationships,\n$\\textit{i.e.}$, predict A, Q, and V given a VQ, VA, and QA pairs,\nrespectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to\nLLaMA, and it outperforms both LLMs-based and non-LLMs-based models on five\nchallenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general\nframework that is applicable to various LLMs (OPT and GPT-J) and consistently\nimproves their performances. 
We empirically demonstrate that Flipped-VQA not\nonly enhances the exploitation of linguistic shortcuts but also mitigates the\nlinguistic bias, which causes incorrect answers over-relying on the question.\nCode is available at https://github.com/mlvlab/Flipped-VQA.\n","authors":["Dohwan Ko","Ji Soo Lee","Wooyoung Kang","Byungseok Roh","Hyunwoo J. Kim"],"pdf_url":"https://arxiv.org/pdf/2310.15747v1.pdf","comment":"Accepted paper at EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2305.14616v2","updated":"2023-10-24T11:30:07Z","published":"2023-05-24T01:30:50Z","title":"Exploring Affordance and Situated Meaning in Image Captions: A\n Multimodal Analysis","summary":" This paper explores the grounding issue regarding multimodal semantic\nrepresentation from a computational cognitive-linguistic view. We annotate\nimages from the Flickr30k dataset with five perceptual properties: Affordance,\nPerceptual Salience, Object Number, Gaze Cueing, and Ecological Niche\nAssociation (ENA), and examine their association with textual elements in the\nimage captions. Our findings reveal that images with Gibsonian affordance show\na higher frequency of captions containing 'holding-verbs' and 'container-nouns'\ncompared to images displaying telic affordance. Perceptual Salience, Object\nNumber, and ENA are also associated with the choice of linguistic expressions.\nOur study demonstrates that comprehensive understanding of objects or events\nrequires cognitive attention, semantic nuances in language, and integration\nacross multiple modalities. 
We highlight the vital importance of situated\nmeaning and affordance grounding in natural language understanding, with the\npotential to advance human-like interpretation in various scenarios.\n","authors":["Pin-Er Chen","Po-Ya Angela Wang","Hsin-Yu Chou","Yu-Hsiang Tseng","Shu-Kai Hsieh"],"pdf_url":"https://arxiv.org/pdf/2305.14616v2.pdf","comment":"10 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.15741v1","updated":"2023-10-24T11:28:59Z","published":"2023-10-24T11:28:59Z","title":"Interpretable Medical Image Classification using Prototype Learning and\n Privileged Information","summary":" Interpretability is often an essential requirement in medical imaging.\nAdvanced deep learning methods are required to address this need for\nexplainability and high performance. In this work, we investigate whether\nadditional information available during the training process can be used to\ncreate an understandable and powerful model. We propose an innovative solution\ncalled Proto-Caps that leverages the benefits of capsule networks, prototype\nlearning and the use of privileged information. Evaluating the proposed\nsolution on the LIDC-IDRI dataset shows that it combines increased\ninterpretability with above state-of-the-art prediction performance. Compared\nto the explainable baseline model, our method achieves more than 6 % higher\naccuracy in predicting both malignancy (93.0 %) and mean characteristic\nfeatures of lung nodules. 
Simultaneously, the model provides case-based\nreasoning with prototype representations that allow visual validation of\nradiologist-defined attributes.\n","authors":["Luisa Gallee","Meinrad Beer","Michael Goetz"],"pdf_url":"https://arxiv.org/pdf/2310.15741v1.pdf","comment":"MICCAI 2023 Medical Image Computing and Computer Assisted\n Intervention"},{"id":"http://arxiv.org/abs/2209.00232v2","updated":"2023-10-24T11:13:37Z","published":"2022-09-01T05:26:32Z","title":"Hybrid Gromov-Wasserstein Embedding for Capsule Learning","summary":" Capsule networks (CapsNets) aim to parse images into a hierarchy of objects,\nparts, and their relations using a two-step process involving part-whole\ntransformation and hierarchical component routing. However, this hierarchical\nrelationship modeling is computationally expensive, which has limited the wider\nuse of CapsNet despite its potential advantages. The current state of CapsNet\nmodels primarily focuses on comparing their performance with capsule baselines,\nfalling short of achieving the same level of proficiency as deep CNN variants\nin intricate tasks. To address this limitation, we present an efficient\napproach for learning capsules that surpasses canonical baseline models and\neven demonstrates superior performance compared to high-performing convolution\nmodels. Our contribution can be outlined in two aspects: firstly, we introduce\na group of subcapsules onto which an input vector is projected. Subsequently,\nwe present the Hybrid Gromov-Wasserstein framework, which initially quantifies\nthe dissimilarity between the input and the components modeled by the\nsubcapsules, followed by determining their alignment degree through optimal\ntransport. This innovative mechanism capitalizes on new insights into defining\nalignment between the input and subcapsules, based on the similarity of their\nrespective component distributions. 
This approach enhances CapsNets' capacity\nto learn from intricate, high-dimensional data while retaining their\ninterpretability and hierarchical structure. Our proposed model offers two\ndistinct advantages: (i) its lightweight nature facilitates the application of\ncapsules to more intricate vision tasks, including object detection; (ii) it\noutperforms baseline approaches in these demanding tasks.\n","authors":["Pourya Shamsolmoali","Masoumeh Zareapoor","Swagatam Das","Eric Granger","Salvador Garcia"],"pdf_url":"https://arxiv.org/pdf/2209.00232v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15842v2","updated":"2023-10-24T11:09:55Z","published":"2023-09-27T17:59:11Z","title":"Exploiting the Signal-Leak Bias in Diffusion Models","summary":" There is a bias in the inference pipeline of most diffusion models. This bias\narises from a signal leak whose distribution deviates from the noise\ndistribution, creating a discrepancy between training and inference processes.\nWe demonstrate that this signal-leak bias is particularly significant when\nmodels are tuned to a specific style, causing sub-optimal style matching.\nRecent research tries to avoid the signal leakage during training. We instead\nshow how we can exploit this signal-leak bias in existing diffusion models to\nallow more control over the generated images. This enables us to generate\nimages with more varied brightness, and images that better match a desired\nstyle or color. 
By modeling the distribution of the signal leak in the spatial\nfrequency and pixel domains, and including a signal leak in the initial latent,\nwe generate images that better match expected results without any additional\ntraining.\n","authors":["Martin Nicolas Everaert","Athanasios Fitsios","Marco Bocchio","Sami Arpa","Sabine Süsstrunk","Radhakrishna Achanta"],"pdf_url":"https://arxiv.org/pdf/2309.15842v2.pdf","comment":"corrected the author names in reference [24]"},{"id":"http://arxiv.org/abs/2310.09647v2","updated":"2023-10-24T11:03:31Z","published":"2023-10-14T19:27:46Z","title":"Point-DynRF: Point-based Dynamic Radiance Fields from a Monocular Video","summary":" Dynamic radiance fields have emerged as a promising approach for generating\nnovel views from a monocular video. However, previous methods enforce the\ngeometric consistency to dynamic radiance fields only between adjacent input\nframes, making it difficult to represent the global scene geometry and\ndegenerates at the viewpoint that is spatio-temporally distant from the input\ncamera trajectory. To solve this problem, we introduce point-based dynamic\nradiance fields (\\textbf{Point-DynRF}), a novel framework where the global\ngeometric information and the volume rendering process are trained by neural\npoint clouds and dynamic radiance fields, respectively. Specifically, we\nreconstruct neural point clouds directly from geometric proxies and optimize\nboth radiance fields and the geometric proxies using our proposed losses,\nallowing them to complement each other. 
We validate the effectiveness of our\nmethod with experiments on the NVIDIA Dynamic Scenes Dataset and several\ncasually captured monocular video clips.\n","authors":["Byeongjun Park","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2310.09647v2.pdf","comment":"WACV2024"},{"id":"http://arxiv.org/abs/2309.13505v3","updated":"2023-10-24T11:01:24Z","published":"2023-09-24T00:05:39Z","title":"Rewrite Caption Semantics: Bridging Semantic Gaps for\n Language-Supervised Semantic Segmentation","summary":" Vision-Language Pre-training has demonstrated its remarkable zero-shot\nrecognition ability and potential to learn generalizable visual representations\nfrom language supervision. Taking a step ahead, language-supervised semantic\nsegmentation enables spatial localization of textual inputs by learning pixel\ngrouping solely from image-text pairs. Nevertheless, the state-of-the-art\nsuffers from clear semantic gaps between visual and textual modality: plenty of\nvisual concepts appeared in images are missing in their paired captions. Such\nsemantic misalignment circulates in pre-training, leading to inferior zero-shot\nperformance in dense predictions due to insufficient visual concepts captured\nin textual representations. To close such semantic gap, we propose Concept\nCuration (CoCu), a pipeline that leverages CLIP to compensate for the missing\nsemantics. For each image-text pair, we establish a concept archive that\nmaintains potential visually-matched concepts with our proposed vision-driven\nexpansion and text-to-vision-guided ranking. Relevant concepts can thus be\nidentified via cluster-guided sampling and fed into pre-training, thereby\nbridging the gap between visual and textual semantics. 
Extensive experiments\nover a broad suite of 8 segmentation benchmarks show that CoCu achieves superb\nzero-shot transfer performance and greatly boosts language-supervised\nsegmentation baseline by a large margin, suggesting the value of bridging\nsemantic gap in pre-training data.\n","authors":["Yun Xing","Jian Kang","Aoran Xiao","Jiahao Nie","Shao Ling","Shijian Lu"],"pdf_url":"https://arxiv.org/pdf/2309.13505v3.pdf","comment":"NeurIPS 2023. Code is available at\n https://github.com/xing0047/rewrite"},{"id":"http://arxiv.org/abs/2310.15725v1","updated":"2023-10-24T11:00:56Z","published":"2023-10-24T11:00:56Z","title":"Query-adaptive DETR for Crowded Pedestrian Detection","summary":" DEtection TRansformer (DETR) and its variants (DETRs) have been successfully\napplied to crowded pedestrian detection, which achieved promising performance.\nHowever, we find that, in different degrees of crowded scenes, the number of\nDETRs' queries must be adjusted manually, otherwise, the performance would\ndegrade to varying degrees. In this paper, we first analyze the two current\nquery generation methods and summarize four guidelines for designing the\nadaptive query generation method. Then, we propose Rank-based Adaptive Query\nGeneration (RAQG) to alleviate the problem. Specifically, we design a rank\nprediction head that can predict the rank of the lowest confidence positive\ntraining sample produced by the encoder. Based on the predicted rank, we design\nan adaptive selection method that can adaptively select coarse detection\nresults produced by the encoder to generate queries. Moreover, to train the\nrank prediction head better, we propose Soft Gradient L1 Loss. The gradient of\nSoft Gradient L1 Loss is continuous, which can describe the relationship\nbetween the loss value and the updated value of model parameters granularly.\nOur method is simple and effective, which can be plugged into any DETRs to make\nit query-adaptive in theory. 
The experimental results on Crowdhuman dataset and\nCitypersons dataset show that our method can adaptively generate queries for\nDETRs and achieve competitive results. Especially, our method achieves\nstate-of-the-art 39.4% MR on Crowdhuman dataset.\n","authors":["Feng Gao","Jiaxu Leng","Ji Gan","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2310.15725v1.pdf","comment":"10 pages, 6 figures"},{"id":"http://arxiv.org/abs/2305.07644v2","updated":"2023-10-24T10:52:16Z","published":"2023-05-12T17:55:40Z","title":"Beware of diffusion models for synthesizing medical images -- A\n comparison with GANs in terms of memorizing brain MRI and chest x-ray images","summary":" Diffusion models were initially developed for text-to-image generation and\nare now being utilized to generate high-quality synthetic images. Preceded by\nGANs, diffusion models have shown impressive results using various evaluation\nmetrics. However, commonly used metrics such as FID and IS are not suitable for\ndetermining whether diffusion models are simply reproducing the training\nimages. Here we train StyleGAN and diffusion models, using BRATS20, BRATS21 and\na chest x-ray pneumonia dataset, to synthesize brain MRI and chest x-ray\nimages, and measure the correlation between the synthetic images and all\ntraining images. Our results show that diffusion models are more likely to\nmemorize the training images, compared to StyleGAN, especially for small\ndatasets and when using 2D slices from 3D volumes. 
Researchers should be\ncareful when using diffusion models for medical imaging, if the final goal is\nto share the synthetic images.\n","authors":["Muhammad Usman Akbar","Wuhao Wang","Anders Eklund"],"pdf_url":"https://arxiv.org/pdf/2305.07644v2.pdf","comment":"12 Pages, 6 Figures"},{"id":"http://arxiv.org/abs/2310.15712v1","updated":"2023-10-24T10:40:51Z","published":"2023-10-24T10:40:51Z","title":"GNeSF: Generalizable Neural Semantic Fields","summary":" 3D scene segmentation based on neural implicit representation has emerged\nrecently with the advantage of training only on 2D supervision. However,\nexisting approaches still require expensive per-scene optimization that\nprohibits generalization to novel scenes during inference. To circumvent this\nproblem, we introduce a generalizable 3D segmentation framework based on\nimplicit representation. Specifically, our framework takes in multi-view image\nfeatures and semantic maps as the inputs instead of only spatial information to\navoid overfitting to scene-specific geometric and semantic information. We\npropose a novel soft voting mechanism to aggregate the 2D semantic information\nfrom different views for each 3D point. In addition to the image features, view\ndifference information is also encoded in our framework to predict the voting\nscores. Intuitively, this allows the semantic information from nearby views to\ncontribute more compared to distant ones. Furthermore, a visibility module is\nalso designed to detect and filter out detrimental information from occluded\nviews. Due to the generalizability of our proposed method, we can synthesize\nsemantic maps or conduct 3D semantic segmentation for novel scenes with solely\n2D semantic supervision. Experimental results show that our approach achieves\ncomparable performance with scene-specific approaches. More importantly, our\napproach can even outperform existing strong supervision-based approaches with\nonly 2D annotations. 
Our source code is available at:\nhttps://github.com/HLinChen/GNeSF.\n","authors":["Hanlin Chen","Chen Li","Mengqi Guo","Zhiwen Yan","Gim Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2310.15712v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2208.07365v3","updated":"2023-10-24T10:34:22Z","published":"2022-08-15T17:59:31Z","title":"Unsupervised Video Domain Adaptation for Action Recognition: A\n Disentanglement Perspective","summary":" Unsupervised video domain adaptation is a practical yet challenging task. In\nthis work, for the first time, we tackle it from a disentanglement view. Our\nkey idea is to handle the spatial and temporal domain divergence separately\nthrough disentanglement. Specifically, we consider the generation of\ncross-domain videos from two sets of latent factors, one encoding the static\ninformation and another encoding the dynamic information. A Transfer Sequential\nVAE (TranSVAE) framework is then developed to model such generation. To better\nserve for adaptation, we propose several objectives to constrain the latent\nfactors. With these constraints, the spatial divergence can be readily removed\nby disentangling the static domain-specific information out, and the temporal\ndivergence is further reduced from both frame- and video-levels through\nadversarial learning. Extensive experiments on the UCF-HMDB, Jester, and\nEpic-Kitchens datasets verify the effectiveness and superiority of TranSVAE\ncompared with several state-of-the-art approaches. 
Code is publicly available.\n","authors":["Pengfei Wei","Lingdong Kong","Xinghua Qu","Yi Ren","Zhiqiang Xu","Jing Jiang","Xiang Yin"],"pdf_url":"https://arxiv.org/pdf/2208.07365v3.pdf","comment":"NeurIPS 2023; 20 pages, 9 figures, 10 tables; Code at\n https://github.com/ldkong1205/TranSVAE"},{"id":"http://arxiv.org/abs/2307.09302v2","updated":"2023-10-24T10:34:22Z","published":"2023-07-18T14:40:48Z","title":"Conformal prediction under ambiguous ground truth","summary":" Conformal Prediction (CP) allows to perform rigorous uncertainty\nquantification by constructing a prediction set $C(X)$ satisfying $\\mathbb{P}(Y\n\\in C(X))\\geq 1-\\alpha$ for a user-chosen $\\alpha \\in [0,1]$ by relying on\ncalibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\\mathbb{P}=\\mathbb{P}^{X}\n\\otimes \\mathbb{P}^{Y|X}$. It is typically implicitly assumed that\n$\\mathbb{P}^{Y|X}$ is the \"true\" posterior label distribution. However, in many\nreal-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating\nexpert opinions using a voting procedure, resulting in a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$. For such ``voted'' labels, CP guarantees are thus\nw.r.t. $\\mathbb{P}_{vote}=\\mathbb{P}^X \\otimes \\mathbb{P}_{vote}^{Y|X}$ rather\nthan the true distribution $\\mathbb{P}$. In cases with unambiguous ground truth\nlabels, the distinction between $\\mathbb{P}_{vote}$ and $\\mathbb{P}$ is\nirrelevant. However, when experts do not agree because of ambiguous labels,\napproximating $\\mathbb{P}^{Y|X}$ with a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose\nto leverage expert opinions to approximate $\\mathbb{P}^{Y|X}$ using a\nnon-degenerate distribution $\\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP\nprocedures which provide guarantees w.r.t. 
$\\mathbb{P}_{agg}=\\mathbb{P}^X\n\\otimes \\mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels\nfrom $\\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a\ncase study of skin condition classification with significant disagreement among\nexpert annotators, we show that applying CP w.r.t. $\\mathbb{P}_{vote}$\nunder-covers expert annotations: calibrated for $72\\%$ coverage, it falls short\nby on average $10\\%$; our Monte Carlo CP closes this gap both empirically and\ntheoretically.\n","authors":["David Stutz","Abhijit Guha Roy","Tatiana Matejovicova","Patricia Strachan","Ali Taylan Cemgil","Arnaud Doucet"],"pdf_url":"https://arxiv.org/pdf/2307.09302v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.13572v3","updated":"2023-10-24T10:32:22Z","published":"2022-11-24T12:44:33Z","title":"Physics-Based Object 6D-Pose Estimation during Non-Prehensile\n Manipulation","summary":" We propose a method to track the 6D pose of an object over time, while the\nobject is under non-prehensile manipulation by a robot. At any given time\nduring the manipulation of the object, we assume access to the robot joint\ncontrols and an image from a camera. We use the robot joint controls to perform\na physics-based prediction of how the object might be moving. We then combine\nthis prediction with the observation coming from the camera, to estimate the\nobject pose as accurately as possible. We use a particle filtering approach to\ncombine the control information with the visual information. We compare the\nproposed method with two baselines: (i) using only an image-based pose\nestimation system at each time-step, and (ii) a particle filter which does not\nperform the computationally expensive physics predictions, but assumes the\nobject moves with constant velocity. 
Our results show that making physics-based\npredictions is worth the computational cost, resulting in more accurate\ntracking, and estimating object pose even when the object is not clearly\nvisible to the camera.\n","authors":["Zisong Xu","Rafael Papallas","Mehmet Dogar"],"pdf_url":"https://arxiv.org/pdf/2211.13572v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14538v2","updated":"2023-10-24T10:24:00Z","published":"2023-09-25T21:28:14Z","title":"Dynamic Scene Graph Representation for Surgical Video","summary":" Surgical videos captured from microscopic or endoscopic imaging devices are\nrich but complex sources of information, depicting different tools and\nanatomical structures utilized during an extended amount of time. Despite\ncontaining crucial workflow information and being commonly recorded in many\nprocedures, usage of surgical videos for automated surgical workflow\nunderstanding is still limited.\n In this work, we exploit scene graphs as a more holistic, semantically\nmeaningful and human-readable way to represent surgical videos while encoding\nall anatomical structures, tools, and their interactions. To properly evaluate\nthe impact of our solutions, we create a scene graph dataset from semantic\nsegmentations from the CaDIS and CATARACTS datasets. We demonstrate that scene\ngraphs can be leveraged through the use of graph convolutional networks (GCNs)\nto tackle surgical downstream tasks such as surgical workflow recognition with\ncompetitive performance. 
Moreover, we demonstrate the benefits of surgical\nscene graphs regarding the explainability and robustness of model decisions,\nwhich are crucial in the clinical setting.\n","authors":["Felix Holm","Ghazal Ghazaei","Tobias Czempiel","Ege Özsoy","Stefan Saur","Nassir Navab"],"pdf_url":"https://arxiv.org/pdf/2309.14538v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.05295v4","updated":"2023-10-24T10:08:10Z","published":"2022-11-10T02:05:17Z","title":"Harmonizing output imbalance for defect segmentation on\n extremely-imbalanced photovoltaic module cells images","summary":" The continuous development of the photovoltaic (PV) industry has raised high\nrequirements for the quality of monocrystalline of PV module cells. When\nlearning to segment defect regions in PV module cell images, Tiny Hidden Cracks\n(THC) lead to extremely-imbalanced samples. The ratio of defect pixels to\nnormal pixels can be as low as 1:2000. This extreme imbalance makes it\ndifficult to segment the THC of PV module cells, which is also a challenge for\nsemantic segmentation. To address the problem of segmenting defects on\nextremely-imbalanced THC data, the paper makes contributions from three\naspects: (1) it proposes an explicit measure for output imbalance; (2) it\ngeneralizes a distribution-based loss that can handle different types of output\nimbalances; and (3) it introduces a compound loss with our adaptive\nhyperparameter selection algorithm that can keep the consistency of training\nand inference for harmonizing the output imbalance on extremelyimbalanced input\ndata. The proposed method is evaluated on four widely-used deep learning\narchitectures and four datasets with varying degrees of input imbalance. 
The\nexperimental results show that the proposed method outperforms existing\nmethods.\n","authors":["Jianye Yi","Xiaopin Zhong","Weixiang Liu","Zongze Wu","Yuanlong Deng","Zhengguang Wu"],"pdf_url":"https://arxiv.org/pdf/2211.05295v4.pdf","comment":"19 pages, 16 figures, 3 appendixes"},{"id":"http://arxiv.org/abs/2310.15690v1","updated":"2023-10-24T10:01:15Z","published":"2023-10-24T10:01:15Z","title":"Physics-Informed with Power-Enhanced Residual Network for Interpolation\n and Inverse Problems","summary":" This paper introduces a novel neural network structure called the\nPower-Enhancing residual network, designed to improve interpolation\ncapabilities for both smooth and non-smooth functions in 2D and 3D settings. By\nadding power terms to residual elements, the architecture boosts the network's\nexpressive power. The study explores network depth, width, and optimization\nmethods, showing the architecture's adaptability and performance advantages.\nConsistently, the results emphasize the exceptional accuracy of the proposed\nPower-Enhancing residual network, particularly for non-smooth functions.\nReal-world examples also confirm its superiority over plain neural network in\nterms of accuracy, convergence, and efficiency. The study also looks at the\nimpact of deeper network. Moreover, the proposed architecture is also applied\nto solving the inverse Burgers' equation, demonstrating superior performance.\nIn conclusion, the Power-Enhancing residual network offers a versatile solution\nthat significantly enhances neural network capabilities. The codes implemented\nare available at: \\url{https://github.com/CMMAi/ResNet_for_PINN}.\n","authors":["Amir Noorizadegan","D. L. Young","Y. C. Hon","C. S. 
Chen"],"pdf_url":"https://arxiv.org/pdf/2310.15690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15688v1","updated":"2023-10-24T09:59:55Z","published":"2023-10-24T09:59:55Z","title":"Nighttime Thermal Infrared Image Colorization with Feedback-based Object\n Appearance Learning","summary":" Stable imaging in adverse environments (e.g., total darkness) makes thermal\ninfrared (TIR) cameras a prevalent option for night scene perception. However,\nthe low contrast and lack of chromaticity of TIR images are detrimental to\nhuman interpretation and subsequent deployment of RGB-based vision algorithms.\nTherefore, it makes sense to colorize the nighttime TIR images by translating\nthem into the corresponding daytime color images (NTIR2DC). Despite the\nimpressive progress made in the NTIR2DC task, how to improve the translation\nperformance of small object classes is under-explored. To address this problem,\nwe propose a generative adversarial network incorporating feedback-based object\nappearance learning (FoalGAN). Specifically, an occlusion-aware mixup module\nand corresponding appearance consistency loss are proposed to reduce the\ncontext dependence of object translation. As a representative example of small\nobjects in nighttime street scenes, we illustrate how to enhance the realism of\ntraffic light by designing a traffic light appearance loss. To further improve\nthe appearance learning of small objects, we devise a dual feedback learning\nstrategy to selectively adjust the learning frequency of different samples. In\naddition, we provide pixel-level annotation for a subset of the Brno dataset,\nwhich can facilitate the research of NTIR image understanding under multiple\nweather conditions. 
Extensive experiments illustrate that the proposed FoalGAN\nis not only effective for appearance learning of small objects, but also\noutperforms other image translation methods in terms of semantic preservation\nand edge consistency for the NTIR2DC task.\n","authors":["Fu-Ya Luo","Shu-Lin Liu","Yi-Jun Cao","Kai-Fu Yang","Chang-Yong Xie","Yong Liu","Yong-Jie Li"],"pdf_url":"https://arxiv.org/pdf/2310.15688v1.pdf","comment":"14 pages, 14 figures. arXiv admin note: text overlap with\n arXiv:2208.02960"},{"id":"http://arxiv.org/abs/2306.09347v2","updated":"2023-10-24T09:51:00Z","published":"2023-06-15T17:59:54Z","title":"Segment Any Point Cloud Sequences by Distilling Vision Foundation Models","summary":" Recent advancements in vision foundation models (VFMs) have opened up new\npossibilities for versatile and efficient visual perception. In this work, we\nintroduce Seal, a novel framework that harnesses VFMs for segmenting diverse\nautomotive point cloud sequences. Seal exhibits three appealing properties: i)\nScalability: VFMs are directly distilled into point clouds, obviating the need\nfor annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial\nand temporal relationships are enforced at both the camera-to-LiDAR and\npoint-to-segment regularization stages, facilitating cross-modal representation\nlearning. iii) Generalizability: Seal enables knowledge transfer in an\noff-the-shelf manner to downstream tasks involving diverse point clouds,\nincluding those from real/synthetic, low/high-resolution, large/small-scale,\nand clean/corrupted datasets. Extensive experiments conducted on eleven\ndifferent point cloud datasets showcase the effectiveness and superiority of\nSeal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear\nprobing, surpassing random initialization by 36.9% mIoU and outperforming prior\narts by 6.1% mIoU. 
Moreover, Seal demonstrates significant performance gains\nover existing methods across 20 different few-shot fine-tuning tasks on all\neleven tested point cloud datasets.\n","authors":["Youquan Liu","Lingdong Kong","Jun Cen","Runnan Chen","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2306.09347v2.pdf","comment":"NeurIPS 2023 (Spotlight); 37 pages, 16 figures, 15 tables; Code at\n https://github.com/youquanl/Segment-Any-Point-Cloud"},{"id":"http://arxiv.org/abs/2201.11808v5","updated":"2023-10-24T09:42:53Z","published":"2022-01-27T21:10:20Z","title":"LAP: An Attention-Based Module for Concept Based Self-Interpretation and\n Knowledge Injection in Convolutional Neural Networks","summary":" Despite the state-of-the-art performance of deep convolutional neural\nnetworks, they are susceptible to bias and malfunction in unseen situations.\nMoreover, the complex computation behind their reasoning is not\nhuman-understandable to develop trust. External explainer methods have tried to\ninterpret network decisions in a human-understandable way, but they are accused\nof fallacies due to their assumptions and simplifications. On the other side,\nthe inherent self-interpretability of models, while being more robust to the\nmentioned fallacies, cannot be applied to the already trained models. In this\nwork, we propose a new attention-based pooling layer, called Local Attention\nPooling (LAP), that accomplishes self-interpretability and the possibility for\nknowledge injection without performance loss. The module is easily pluggable\ninto any convolutional neural network, even the already trained ones. We have\ndefined a weakly supervised training scheme to learn the distinguishing\nfeatures in decision-making without depending on experts' annotations. We\nverified our claims by evaluating several LAP-extended models on two datasets,\nincluding ImageNet. 
The proposed framework offers more valid\nhuman-understandable and faithful-to-the-model interpretations than the\ncommonly used white-box explainer methods.\n","authors":["Rassa Ghavami Modegh","Ahmad Salimi","Alireza Dizaji","Hamid R. Rabiee"],"pdf_url":"https://arxiv.org/pdf/2201.11808v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15676v1","updated":"2023-10-24T09:39:05Z","published":"2023-10-24T09:39:05Z","title":"Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive\n Survey and Evaluation","summary":" Multi-modal 3D scene understanding has gained considerable attention due to\nits wide applications in many areas, such as autonomous driving and\nhuman-computer interaction. Compared to conventional single-modal 3D\nunderstanding, introducing an additional modality not only elevates the\nrichness and precision of scene interpretation but also ensures a more robust\nand resilient understanding. This becomes especially crucial in varied and\nchallenging environments where solely relying on 3D data might be inadequate.\nWhile there has been a surge in the development of multi-modal 3D methods over\npast three years, especially those integrating multi-camera images (3D+2D) and\ntextual descriptions (3D+language), a comprehensive and in-depth review is\nnotably absent. In this article, we present a systematic survey of recent\nprogress to bridge this gap. We begin by briefly introducing a background that\nformally defines various 3D multi-modal tasks and summarizes their inherent\nchallenges. After that, we present a novel taxonomy that delivers a thorough\ncategorization of existing methods according to modalities and tasks, exploring\ntheir respective strengths and limitations. Furthermore, comparative results of\nrecent approaches on several benchmark datasets, together with insightful\nanalysis, are offered. 
Finally, we discuss the unresolved issues and provide\nseveral potential avenues for future research.\n","authors":["Yinjie Lei","Zixuan Wang","Feng Chen","Guoqing Wang","Peng Wang","Yang Yang"],"pdf_url":"https://arxiv.org/pdf/2310.15676v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15023v3","updated":"2023-10-24T09:34:02Z","published":"2023-05-24T11:06:15Z","title":"Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large\n Language Models","summary":" Recently, growing interest has been aroused in extending the multimodal\ncapability of large language models (LLMs), e.g., vision-language (VL)\nlearning, which is regarded as the next milestone of artificial general\nintelligence. However, existing solutions are prohibitively expensive, which\nnot only need to optimize excessive parameters, but also require another\nlarge-scale pre-training before VL instruction tuning. In this paper, we\npropose a novel and affordable solution for the effective VL adaption of LLMs,\ncalled Mixture-of-Modality Adaptation (MMA). Instead of using large neural\nnetworks to connect the image encoder and LLM, MMA adopts lightweight modules,\ni.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables\nthe joint optimization of the image and language models. Meanwhile, MMA is also\nequipped with a routing algorithm to help LLMs achieve an automatic shift\nbetween single- and multi-modal instructions without compromising their ability\nof natural language understanding. To validate MMA, we apply it to a recent LLM\ncalled LLaMA and term this formed large vision-language instructed model as\nLaVIN. 
To validate MMA and LaVIN, we conduct extensive experiments under two\nsetups, namely multimodal science question answering and multimodal dialogue.\nThe experimental results not only demonstrate the competitive performance and\nthe superior training efficiency of LaVIN than existing multimodal LLMs, but\nalso confirm its great potential as a general-purpose chatbot. More\nimportantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4\ntraining hours with 3.8M trainable parameters, greatly confirming the\neffectiveness of MMA. Our project is released at\nhttps://luogen1996.github.io/lavin.\n","authors":["Gen Luo","Yiyi Zhou","Tianhe Ren","Shengxin Chen","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2305.15023v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15670v1","updated":"2023-10-24T09:29:26Z","published":"2023-10-24T09:29:26Z","title":"Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection","summary":" Current research is primarily dedicated to advancing the accuracy of\ncamera-only 3D object detectors (apprentice) through the knowledge transferred\nfrom LiDAR- or multi-modal-based counterparts (expert). However, the presence\nof the domain gap between LiDAR and camera features, coupled with the inherent\nincompatibility in temporal fusion, significantly hinders the effectiveness of\ndistillation-based enhancements for apprentices. Motivated by the success of\nuni-modal distillation, an apprentice-friendly expert model would predominantly\nrely on camera features, while still achieving comparable performance to\nmulti-modal models. To this end, we introduce VCD, a framework to improve the\ncamera-only apprentice model, including an apprentice-friendly multi-modal\nexpert and temporal-fusion-friendly distillation supervision. 
The multi-modal\nexpert VCD-E adopts an identical structure as that of the camera-only\napprentice in order to alleviate the feature disparity, and leverages LiDAR\ninput as a depth prior to reconstruct the 3D scene, achieving the performance\non par with other heterogeneous multi-modal experts. Additionally, a\nfine-grained trajectory-based distillation module is introduced with the\npurpose of individually rectifying the motion misalignment for each object in\nthe scene. With those improvements, our camera-only apprentice VCD-A sets new\nstate-of-the-art on nuScenes with a score of 63.1% NDS.\n","authors":["Linyan Huang","Zhiqi Li","Chonghao Sima","Wenhai Wang","Jingdong Wang","Yu Qiao","Hongyang Li"],"pdf_url":"https://arxiv.org/pdf/2310.15670v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15658v1","updated":"2023-10-24T09:11:34Z","published":"2023-10-24T09:11:34Z","title":"Region-controlled Style Transfer","summary":" Image style transfer is a challenging task in computational vision. Existing\nalgorithms transfer the color and texture of style images by controlling the\nneural network's feature layers. However, they fail to control the strength of\ntextures in different regions of the content image. To address this issue, we\npropose a training method that uses a loss function to constrain the style\nintensity in different regions. This method guides the transfer strength of\nstyle features in different regions based on the gradient relationship between\nstyle and content images. Additionally, we introduce a novel feature fusion\nmethod that linearly transforms content features to resemble style features\nwhile preserving their semantic relationships. 
Extensive experiments have\ndemonstrated the effectiveness of our proposed approach.\n","authors":["Junjie Kang","Jinsong Wu","Shiqi Jiang"],"pdf_url":"https://arxiv.org/pdf/2310.15658v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15655v1","updated":"2023-10-24T09:10:43Z","published":"2023-10-24T09:10:43Z","title":"Breaking of brightness consistency in optical flow with a lightweight\n CNN network","summary":" Sparse optical flow is widely used in various computer vision tasks, however\nassuming brightness consistency limits its performance in High Dynamic Range\n(HDR) environments. In this work, a lightweight network is used to extract\nillumination robust convolutional features and corners with strong invariance.\nModifying the typical brightness consistency of the optical flow method to the\nconvolutional feature consistency yields the light-robust hybrid optical flow\nmethod. The proposed network runs at 190 FPS on a commercial CPU because it\nuses only four convolutional layers to extract feature maps and score maps\nsimultaneously. Since the shallow network is difficult to train directly, a\ndeep network is designed to compute the reliability map that helps it. An\nend-to-end unsupervised training mode is used for both networks. To validate\nthe proposed method, we compare corner repeatability and matching performance\nwith origin optical flow under dynamic illumination. In addition, a more\naccurate visual inertial system is constructed by replacing the optical flow\nmethod in VINS-Mono. In a public HDR dataset, it reduces translation errors by\n93\\%. 
The code is publicly available at https://github.com/linyicheng1/LET-NET.\n","authors":["Yicheng Lin","Shuo Wang","Yunlong Jiang","Bin Han"],"pdf_url":"https://arxiv.org/pdf/2310.15655v1.pdf","comment":"7 pages,7 figures"},{"id":"http://arxiv.org/abs/2210.01125v2","updated":"2023-10-24T09:08:04Z","published":"2022-10-03T03:07:33Z","title":"Spectral2Spectral: Image-spectral Similarity Assisted Spectral CT Deep\n Reconstruction without Reference","summary":" Spectral computed tomography based on a photon-counting detector (PCD)\nattracts more and more attentions since it has the capability to provide more\naccurate identification and quantitative analysis for biomedical materials. The\nlimited number of photons within narrow energy bins leads to imaging results of\nlow signal-noise ratio. The existing supervised deep reconstruction networks\nfor CT reconstruction are difficult to address these challenges because it is\nusually impossible to acquire noise-free clinical images with clear structures\nas references. In this paper, we propose an iterative deep reconstruction\nnetwork to synergize unsupervised method and data priors into a unified\nframework, named as Spectral2Spectral. Our Spectral2Spectral employs an\nunsupervised deep training strategy to obtain high-quality images from noisy\ndata in an end-to-end fashion. The structural similarity prior within\nimage-spectral domain is refined as a regularization term to further constrain\nthe network training. The weights of neural network are automatically updated\nto capture image features and structures within the iterative process. 
Three\nlarge-scale preclinical datasets experiments demonstrate that the\nSpectral2spectral reconstructs better image quality than other the\nstate-of-the-art methods.\n","authors":["Xiaodong Guo","Longhui Li","Peng He","Peng Feng","Dingyue Chang","Hengyong Yu","Weiwen Wu"],"pdf_url":"https://arxiv.org/pdf/2210.01125v2.pdf","comment":"Accepted by IEEE TCI"},{"id":"http://arxiv.org/abs/2310.15646v1","updated":"2023-10-24T09:07:47Z","published":"2023-10-24T09:07:47Z","title":"Mean Teacher DETR with Masked Feature Alignment: A Robust Domain\n Adaptive Detection Transformer Framework","summary":" Unsupervised domain adaptation object detection(UDAOD) research on Detection\nTransformer(DETR) mainly focuses on feature alignment and existing methods can\nbe divided into two kinds, each of which has its unresolved issues. One-stage\nfeature alignment methods can easily lead to performance fluctuation and\ntraining stagnation. Two-stage feature alignment method based on mean teacher\ncomprises a pretraining stage followed by a self-training stage, each facing\nproblems in obtaining reliable pretrained model and achieving consistent\nperformance gains. Methods mentioned above have not yet explore how to utilize\nthe third related domain such as target-like domain to assist adaptation. To\naddress these issues, we propose a two-stage framework named MTM, i.e. Mean\nTeacher-DETR with Masked Feature Alignment. In the pretraining stage, we\nutilize labeled target-like images produced by image style transfer to avoid\nperformance fluctuation. In the self-training stage, we leverage unlabeled\ntarget images by pseudo labels based on mean teacher and propose a module\ncalled Object Queries Knowledge Transfer(OQKT) to ensure consistent performance\ngains of the student model. 
Most importantly, we propose masked feature\nalignment methods including Masked Domain Query-based Feature Alignment(MDQFA)\nand Masked Token-wise Feature Alignment(MTWFA) to alleviate domain shift in a\nmore robust way, which not only prevent training stagnation and lead to a\nrobust pretrained model in the pretraining stage, but also enhance the model's\ntarget performance in the self-training stage. Experiments on three challenging\nscenarios and a theoretical analysis verify the effectiveness of MTM.\n","authors":["Weixi Weng","Chun Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.15646v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13035v4","updated":"2023-10-24T09:00:20Z","published":"2023-05-22T13:39:28Z","title":"Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design","summary":" Scaling laws have been recently employed to derive compute-optimal model size\n(number of parameters) for a given compute duration. We advance and refine such\nmethods to infer compute-optimal model shapes, such as width and depth, and\nsuccessfully implement this in vision transformers. Our shape-optimized vision\ntransformer, SoViT, achieves results competitive with models that exceed twice\nits size, despite being pre-trained with an equivalent amount of compute. For\nexample, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSRCV2012,\nsurpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical\nsettings, with also less than half the inference cost. We conduct a thorough\nevaluation across multiple tasks, such as image classification, captioning, VQA\nand zero-shot transfer, demonstrating the effectiveness of our model across a\nbroad range of domains and identifying limitations. 
Overall, our findings\nchallenge the prevailing approach of blindly scaling up vision models and pave\na path for a more informed scaling.\n","authors":["Ibrahim Alabdulmohsin","Xiaohua Zhai","Alexander Kolesnikov","Lucas Beyer"],"pdf_url":"https://arxiv.org/pdf/2305.13035v4.pdf","comment":"10 pages, 7 figures, 9 tables. Version 2: Layout fixes"},{"id":"http://arxiv.org/abs/2310.15624v1","updated":"2023-10-24T08:45:15Z","published":"2023-10-24T08:45:15Z","title":"GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D\n Object Detection","summary":" Geometry plays a significant role in monocular 3D object detection. It can be\nused to estimate object depth by using the perspective projection between\nobject's physical size and 2D projection in the image plane, which can\nintroduce mathematical priors into deep models. However, this projection\nprocess also introduces error amplification, where the error of the estimated\nheight is amplified and reflected into the projected depth. It leads to\nunreliable depth inferences and also impairs training stability. To tackle this\nproblem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++)\nby modeling geometry projection in a probabilistic manner. This ensures depth\npredictions are well-bounded and associated with a reasonable uncertainty. The\nsignificance of introducing such geometric uncertainty is two-fold: (1). It\nmodels the uncertainty propagation relationship of the geometry projection\nduring training, improving the stability and efficiency of the end-to-end model\nlearning. (2). It can be derived to a highly reliable confidence to indicate\nthe quality of the 3D detection result, enabling more reliable detection\ninference. 
Experiments show that the proposed approach not only obtains\n(state-of-the-art) SOTA performance in image-based monocular 3D detection but\nalso demonstrates superiority in efficacy with a simplified framework.\n","authors":["Yan Lu","Xinzhu Ma","Lei Yang","Tianzhu Zhang","Yating Liu","Qi Chu","Tong He","Yonghui Li","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.15624v1.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2206.15316v3","updated":"2023-10-24T08:26:50Z","published":"2022-06-30T14:42:18Z","title":"Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory\n Models","summary":" We propose a novel anomaly detection method for echocardiogram videos. The\nintroduced method takes advantage of the periodic nature of the heart cycle to\nlearn three variants of a variational latent trajectory model (TVAE). While the\nfirst two variants (TVAE-C and TVAE-R) model strict periodic movements of the\nheart, the third (TVAE-S) is more general and allows shifts in the spatial\nrepresentation throughout the video. All models are trained on the healthy\nsamples of a novel in-house dataset of infant echocardiogram videos consisting\nof multiple chamber views to learn a normative prior of the healthy population.\nDuring inference, maximum a posteriori (MAP) based anomaly detection is\nperformed to detect out-of-distribution samples in our dataset. The proposed\nmethod reliably identifies severe congenital heart defects, such as Ebstein's\nAnomaly or Shone-complex. Moreover, it achieves superior performance over\nMAP-based anomaly detection with standard variational autoencoders when\ndetecting pulmonary hypertension and right ventricular dilation. Finally, we\ndemonstrate that the proposed method enables interpretable explanations of its\noutput through heatmaps highlighting the regions corresponding to anomalous\nheart structures.\n","authors":["Alain Ryser","Laura Manduchi","Fabian Laumer","Holger Michel","Sven Wellmann","Julia E. 
Vogt"],"pdf_url":"https://arxiv.org/pdf/2206.15316v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.10076v4","updated":"2023-10-24T08:12:03Z","published":"2023-03-17T15:57:14Z","title":"A Simple Framework for 3D Occupancy Estimation in Autonomous Driving","summary":" The task of estimating 3D occupancy from surrounding-view images is an\nexciting development in the field of autonomous driving, following the success\nof Bird's Eye View (BEV) perception. This task provides crucial 3D attributes\nof the driving environment, enhancing the overall understanding and perception\nof the surrounding space. In this work, we present a simple framework for 3D\noccupancy estimation, which is a CNN-based framework designed to reveal several\nkey factors for 3D occupancy estimation, such as network design, optimization,\nand evaluation. In addition, we explore the relationship between 3D occupancy\nestimation and other related tasks, such as monocular depth estimation and 3D\nreconstruction, which could advance the study of 3D perception in autonomous\ndriving. For evaluation, we propose a simple sampling strategy to define the\nmetric for occupancy evaluation, which is flexible for current public datasets.\nMoreover, we establish the benchmark in terms of the depth estimation metric,\nwhere we compare our proposed method with monocular depth estimation methods on\nthe DDAD and Nuscenes datasets and achieve competitive performance. 
The\nrelevant code will be updated in https://github.com/GANWANSHUI/SimpleOccupancy.\n","authors":["Wanshui Gan","Ningkai Mo","Hongbin Xu","Naoto Yokoya"],"pdf_url":"https://arxiv.org/pdf/2303.10076v4.pdf","comment":"15 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.15599v1","updated":"2023-10-24T08:01:12Z","published":"2023-10-24T08:01:12Z","title":"Grasp Multiple Objects with One Hand","summary":" The human hand's complex kinematics allow for simultaneous grasping and\nmanipulation of multiple objects, essential for tasks like object transfer and\nin-hand manipulation. Despite its importance, robotic multi-object grasping\nremains underexplored and presents challenges in kinematics, dynamics, and\nobject configurations. This paper introduces MultiGrasp, a two-stage method for\nmulti-object grasping on a tabletop with a multi-finger dexterous hand. It\ninvolves (i) generating pre-grasp proposals and (ii) executing the grasp and\nlifting the objects. Experimental results primarily focus on dual-object\ngrasping and report a 44.13% success rate, showcasing adaptability to unseen\nobject configurations and imprecise grasps. The framework also demonstrates the\ncapability to grasp more than two objects, albeit at a reduced inference speed.\n","authors":["Yuyang Li","Bo Liu","Yiran Geng","Puhao Li","Yaodong Yang","Yixin Zhu","Tengyu Liu","Siyuan Huang"],"pdf_url":"https://arxiv.org/pdf/2310.15599v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15597v1","updated":"2023-10-24T08:00:20Z","published":"2023-10-24T08:00:20Z","title":"Emergent Communication in Interactive Sketch Question Answering","summary":" Vision-based emergent communication (EC) aims to learn to communicate through\nsketches and demystify the evolution of human communication. Ironically,\nprevious works neglect multi-round interaction, which is indispensable in human\ncommunication. 
To fill this gap, we first introduce a novel Interactive Sketch\nQuestion Answering (ISQA) task, where two collaborative players are interacting\nthrough sketches to answer a question about an image in a multi-round manner.\nTo accomplish this task, we design a new and efficient interactive EC system,\nwhich can achieve an effective balance among three evaluation factors,\nincluding the question answering accuracy, drawing complexity and human\ninterpretability. Our experimental results including human evaluation\ndemonstrate that multi-round interactive mechanism facilitates targeted and\nefficient communication between intelligent agents with decent human\ninterpretability.\n","authors":["Zixing Lei","Yiming Zhang","Yuxin Xiong","Siheng Chen"],"pdf_url":"https://arxiv.org/pdf/2310.15597v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.02490v3","updated":"2023-10-24T07:59:31Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. 
MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models. Code and data are\navailable at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v3.pdf","comment":"Add results of GPT-4V. Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2212.07275v3","updated":"2023-10-24T07:58:28Z","published":"2022-12-14T15:17:46Z","title":"PhoMoH: Implicit Photorealistic 3D Models of Human Heads","summary":" We present PhoMoH, a neural network methodology to construct generative\nmodels of photo-realistic 3D geometry and appearance of human heads including\nhair, beards, an oral cavity, and clothing. In contrast to prior work, PhoMoH\nmodels the human head using neural fields, thus supporting complex topology.\nInstead of learning a head model from scratch, we propose to augment an\nexisting expressive head model with new features. Concretely, we learn a highly\ndetailed geometry network layered on top of a mid-resolution head model\ntogether with a detailed, local geometry-aware, and disentangled color field.\nOur proposed architecture allows us to learn photo-realistic human head models\nfrom relatively little data. The learned generative geometry and appearance\nnetworks can be sampled individually and enable the creation of diverse and\nrealistic human heads. 
Extensive experiments validate our method qualitatively\nand across different metrics.\n","authors":["Mihai Zanfir","Thiemo Alldieck","Cristian Sminchisescu"],"pdf_url":"https://arxiv.org/pdf/2212.07275v3.pdf","comment":"To be published at the International Conference on 3D Vision 2024"},{"id":"http://arxiv.org/abs/2310.15590v1","updated":"2023-10-24T07:54:39Z","published":"2023-10-24T07:54:39Z","title":"Facial Data Minimization: Shallow Model as Your Privacy Filter","summary":" Face recognition service has been used in many fields and brings much\nconvenience to people. However, once the user's facial data is transmitted to a\nservice provider, the user will lose control of his/her private data. In recent\nyears, there exist various security and privacy issues due to the leakage of\nfacial data. Although many privacy-preserving methods have been proposed, they\nusually fail when they are not accessible to adversaries' strategies or\nauxiliary data. Hence, in this paper, by fully considering two cases of\nuploading facial images and facial features, which are very typical in face\nrecognition service systems, we proposed a data privacy minimization\ntransformation (PMT) method. This method can process the original facial data\nbased on the shallow model of authorized services to obtain the obfuscated\ndata. The obfuscated data can not only maintain satisfactory performance on\nauthorized models and restrict the performance on other unauthorized models but\nalso prevent original privacy data from leaking by AI methods and human visual\ntheft. Additionally, since a service provider may execute preprocessing\noperations on the received data, we also propose an enhanced perturbation\nmethod to improve the robustness of PMT. Besides, to authorize one facial image\nto multiple service models simultaneously, a multiple restriction mechanism is\nproposed to improve the scalability of PMT. 
Finally, we conduct extensive\nexperiments and evaluate the effectiveness of the proposed PMT in defending\nagainst face reconstruction, data abuse, and face attribute estimation attacks.\nThese experimental results demonstrate that PMT performs well in preventing\nfacial data abuse and privacy leakage while maintaining face recognition\naccuracy.\n","authors":["Yuwen Pu","Jiahao Chen","Jiayu Pan","Hao li","Diqun Yan","Xuhong Zhang","Shouling Ji"],"pdf_url":"https://arxiv.org/pdf/2310.15590v1.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2310.15585v1","updated":"2023-10-24T07:51:08Z","published":"2023-10-24T07:51:08Z","title":"Multimodal Representations for Teacher-Guided Compositional Visual\n Reasoning","summary":" Neural Module Networks (NMN) are a compelling method for visual question\nanswering, enabling the translation of a question into a program consisting of\na series of reasoning sub-tasks that are sequentially executed on the image to\nproduce an answer. NMNs provide enhanced explainability compared to integrated\nmodels, allowing for a better understanding of the underlying reasoning\nprocess. To improve the effectiveness of NMNs we propose to exploit features\nobtained by a large-scale cross-modal encoder. Also, the current training\napproach of NMNs relies on the propagation of module outputs to subsequent\nmodules, leading to the accumulation of prediction errors and the generation of\nfalse answers. To mitigate this, we introduce an NMN learning strategy\ninvolving scheduled teacher guidance. Initially, the model is fully guided by\nthe ground-truth intermediate outputs, but gradually transitions to an\nautonomous behavior as training progresses. 
This reduces error accumulation,\nthus improving training efficiency and final performance.We demonstrate that by\nincorporating cross-modal features and employing more effective training\ntechniques for NMN, we achieve a favorable balance between performance and\ntransparency in the reasoning process.\n","authors":["Wafa Aissa","Marin Ferecatu","Michel Crucianu"],"pdf_url":"https://arxiv.org/pdf/2310.15585v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15578v1","updated":"2023-10-24T07:42:04Z","published":"2023-10-24T07:42:04Z","title":"VMAF Re-implementation on PyTorch: Some Experimental Results","summary":" Based on the standard VMAF implementation we propose an implementation of\nVMAF using PyTorch framework. For this implementation comparisons with the\nstandard (libvmaf) show the discrepancy $\\lesssim 10^{-2}$ in VMAF units. We\ninvestigate gradients computation when using VMAF as an objective function and\ndemonstrate that training using this function does not result in ill-behaving\ngradients.\n","authors":["Kirill Aistov","Maxim Koroteev"],"pdf_url":"https://arxiv.org/pdf/2310.15578v1.pdf","comment":"4 pages"},{"id":"http://arxiv.org/abs/2302.04032v2","updated":"2023-10-24T07:35:41Z","published":"2023-02-08T13:08:51Z","title":"A Systematic Performance Analysis of Deep Perceptual Loss Networks:\n Breaking Transfer Learning Conventions","summary":" Deep perceptual loss is a type of loss function in computer vision that aims\nto mimic human perception by using the deep features extracted from neural\nnetworks. In recent years, the method has been applied to great effect on a\nhost of interesting computer vision tasks, especially for tasks with image or\nimage-like outputs, such as image synthesis, segmentation, depth prediction,\nand more. Many applications of the method use pretrained networks, often\nconvolutional networks, for loss calculation. 
Despite the increased interest\nand broader use, more effort is needed toward exploring which networks to use\nfor calculating deep perceptual loss and from which layers to extract the\nfeatures.\n This work aims to rectify this by systematically evaluating a host of\ncommonly used and readily available, pretrained networks for a number of\ndifferent feature extraction points on four existing use cases of deep\nperceptual loss. The use cases of perceptual similarity, super-resolution,\nimage segmentation, and dimensionality reduction, are evaluated through\nbenchmarks. The benchmarks are implementations of previous works where the\nselected networks and extraction points are evaluated. The performance on the\nbenchmarks, and attributes of the networks and extraction points are then used\nas a basis for an in-depth analysis. This analysis uncovers insight regarding\nwhich architectures provide superior performance for deep perceptual loss and\nhow to choose an appropriate extraction point for a particular task and\ndataset. Furthermore, the work discusses the implications of the results for\ndeep perceptual loss and the broader field of transfer learning. The results\nshow that deep perceptual loss deviates from two commonly held conventions in\ntransfer learning, which suggests that those conventions are in need of deeper\nanalysis.\n","authors":["Gustav Grund Pihlgren","Konstantina Nikolaidou","Prakash Chandra Chhipa","Nosheen Abid","Rajkumar Saini","Fredrik Sandin","Marcus Liwicki"],"pdf_url":"https://arxiv.org/pdf/2302.04032v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15568v1","updated":"2023-10-24T07:22:17Z","published":"2023-10-24T07:22:17Z","title":"I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal\n Mutual Distillation","summary":" Recent progresses on self-supervised 3D human action representation learning\nare largely attributed to contrastive learning. 
However, in conventional\ncontrastive frameworks, the rich complementarity between different skeleton\nmodalities remains under-explored. Moreover, optimized with distinguishing\nself-augmented samples, models struggle with numerous similar positive\ninstances in the case of limited action categories. In this work, we tackle the\naforementioned problems by introducing a general Inter- and Intra-modal Mutual\nDistillation (I$^2$MD) framework. In I$^2$MD, we first re-formulate the\ncross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.\nDifferent from existing distillation solutions that transfer the knowledge of a\npre-trained and fixed teacher to the student, in CMD, the knowledge is\ncontinuously updated and bidirectionally distilled between modalities during\npre-training. To alleviate the interference of similar samples and exploit\ntheir underlying contexts, we further design the Intra-modal Mutual\nDistillation (IMD) strategy, In IMD, the Dynamic Neighbors Aggregation (DNA)\nmechanism is first introduced, where an additional cluster-level discrimination\nbranch is instantiated in each modality. It adaptively aggregates\nhighly-correlated neighboring features, forming local cluster-level\ncontrasting. Mutual distillation is then performed between the two branches for\ncross-level knowledge exchange. Extensive experiments on three datasets show\nthat our approach sets a series of new records.\n","authors":["Yunyao Mao","Jiajun Deng","Wengang Zhou","Zhenbo Lu","Wanli Ouyang","Houqiang Li"],"pdf_url":"https://arxiv.org/pdf/2310.15568v1.pdf","comment":"submitted to IJCV. 
arXiv admin note: substantial text overlap with\n arXiv:2208.12448"},{"id":"http://arxiv.org/abs/2307.15317v2","updated":"2023-10-24T06:53:55Z","published":"2023-07-28T05:32:56Z","title":"DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable\n Kendall's Rank Correlation","summary":" Few-shot learning aims to adapt models trained on the base dataset to novel\ntasks where the categories were not seen by the model before. This often leads\nto a relatively uniform distribution of feature values across channels on novel\nclasses, posing challenges in determining channel importance for novel tasks.\nStandard few-shot learning methods employ geometric similarity metrics such as\ncosine similarity and negative Euclidean distance to gauge the semantic\nrelatedness between two features. However, features with high geometric\nsimilarities may carry distinct semantics, especially in the context of\nfew-shot learning. In this paper, we demonstrate that the importance ranking of\nfeature channels is a more reliable indicator for few-shot learning than\ngeometric similarity metrics. We observe that replacing the geometric\nsimilarity metric with Kendall's rank correlation only during inference is able\nto improve the performance of few-shot learning across a wide range of methods\nand datasets with different domains. Furthermore, we propose a carefully\ndesigned differentiable loss for meta-training to address the\nnon-differentiability issue of Kendall's rank correlation. By replacing\ngeometric similarity with differentiable Kendall's rank correlation, our method\ncan integrate with numerous existing few-shot approaches and is ready for\nintegrating with future state-of-the-art methods that rely on geometric\nsimilarity metrics. 
Extensive experiments validate the efficacy of the\nrank-correlation-based approach, showcasing a significant improvement in\nfew-shot learning.\n","authors":["Kaipeng Zheng","Huishuai Zhang","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2307.15317v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15550v1","updated":"2023-10-24T06:43:56Z","published":"2023-10-24T06:43:56Z","title":"PET Synthesis via Self-supervised Adaptive Residual Estimation\n Generative Adversarial Network","summary":" Positron emission tomography (PET) is a widely used, highly sensitive\nmolecular imaging in clinical diagnosis. There is interest in reducing the\nradiation exposure from PET but also maintaining adequate image quality. Recent\nmethods using convolutional neural networks (CNNs) to generate synthesized\nhigh-quality PET images from low-dose counterparts have been reported to be\nstate-of-the-art for low-to-high image recovery methods. However, these methods\nare prone to exhibiting discrepancies in texture and structure between\nsynthesized and real images. Furthermore, the distribution shift between\nlow-dose PET and standard PET has not been fully investigated. To address these\nissues, we developed a self-supervised adaptive residual estimation generative\nadversarial network (SS-AEGAN). We introduce (1) An adaptive residual\nestimation mapping mechanism, AE-Net, designed to dynamically rectify the\npreliminary synthesized PET images by taking the residual map between the\nlow-dose PET and synthesized output as the input, and (2) A self-supervised\npre-training strategy to enhance the feature representation of the coarse\ngenerator. 
Our experiments with a public benchmark dataset of total-body PET\nimages show that SS-AEGAN consistently outperformed the state-of-the-art\nsynthesis methods with various dose reduction factors.\n","authors":["Yuxin Xue","Lei Bi","Yige Peng","Michael Fulham","David Dagan Feng","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2310.15550v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2306.14846v2","updated":"2023-10-24T06:16:43Z","published":"2023-06-26T16:57:03Z","title":"ViNT: A Foundation Model for Visual Navigation","summary":" General-purpose pre-trained models (\"foundation models\") have enabled\npractitioners to produce generalizable solutions for individual machine\nlearning problems with datasets that are significantly smaller than those\nrequired for learning from scratch. Such models are typically trained on large\nand diverse datasets with weak supervision, consuming much more training data\nthan is available for any individual downstream application. In this paper, we\ndescribe the Visual Navigation Transformer (ViNT), a foundation model that aims\nto bring the success of general-purpose pre-trained models to vision-based\nrobotic navigation. ViNT is trained with a general goal-reaching objective that\ncan be used with any navigation dataset, and employs a flexible\nTransformer-based architecture to learn navigational affordances and enable\nefficient adaptation to a variety of downstream navigational tasks. ViNT is\ntrained on a number of existing navigation datasets, comprising hundreds of\nhours of robotic navigation from a variety of different robotic platforms, and\nexhibits positive transfer, outperforming specialist models trained on singular\ndatasets. 
ViNT can be augmented with diffusion-based subgoal proposals to\nexplore novel environments, and can solve kilometer-scale navigation problems\nwhen equipped with long-range heuristics. ViNT can also be adapted to novel\ntask specifications with a technique inspired by prompt-tuning, where the goal\nencoder is replaced by an encoding of another task modality (e.g., GPS\nwaypoints or routing commands) embedded into the same space of goal tokens.\nThis flexibility and ability to accommodate a variety of downstream problem\ndomains establishes ViNT as an effective foundation model for mobile robotics.\nFor videos, code, and model checkpoints, see our project page at\nhttps://visualnav-transformer.github.io.\n","authors":["Dhruv Shah","Ajay Sridhar","Nitish Dashora","Kyle Stachowicz","Kevin Black","Noriaki Hirose","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2306.14846v2.pdf","comment":"Accepted for oral presentation at CoRL 2023"},{"id":"http://arxiv.org/abs/2209.05401v3","updated":"2023-10-24T05:59:52Z","published":"2022-09-12T16:53:37Z","title":"MaXM: Towards Multilingual Visual Question Answering","summary":" Visual Question Answering (VQA) has been primarily studied through the lens\nof the English language. Yet, tackling VQA in other languages in the same\nmanner would require a considerable amount of resources. In this paper, we\npropose scalable solutions to multilingual visual question answering (mVQA), on\nboth data and modeling fronts. We first propose a translation-based framework\nto mVQA data generation that requires much less human annotation efforts than\nthe conventional approach of directly collection questions and answers. Then,\nwe apply our framework to the multilingual captions in the Crossmodal-3600\ndataset and develop an efficient annotation protocol to create MaXM, a\ntest-only VQA benchmark in 7 diverse languages. 
Finally, we develop a simple,\nlightweight, and effective approach as well as benchmark state-of-the-art\nEnglish and multilingual VQA models. We hope that our benchmark encourages\nfurther research on mVQA.\n","authors":["Soravit Changpinyo","Linting Xue","Michal Yarom","Ashish V. Thapliyal","Idan Szpektor","Julien Amelot","Xi Chen","Radu Soricut"],"pdf_url":"https://arxiv.org/pdf/2209.05401v3.pdf","comment":"EMNLP 2023 (Findings).\n https://github.com/google-research-datasets/maxm"},{"id":"http://arxiv.org/abs/2306.02602v3","updated":"2023-10-24T05:39:45Z","published":"2023-06-05T05:21:15Z","title":"ReContrast: Domain-Specific Anomaly Detection via Contrastive\n Reconstruction","summary":" Most advanced unsupervised anomaly detection (UAD) methods rely on modeling\nfeature representations of frozen encoder networks pre-trained on large-scale\ndatasets, e.g. ImageNet. However, the features extracted from the encoders that\nare borrowed from natural image domains coincide little with the features\nrequired in the target UAD domain, such as industrial inspection and medical\nimaging. In this paper, we propose a novel epistemic UAD method, namely\nReContrast, which optimizes the entire network to reduce biases towards the\npre-trained image domain and orients the network in the target domain. We start\nwith a feature reconstruction approach that detects anomalies from errors.\nEssentially, the elements of contrastive learning are elegantly embedded in\nfeature reconstruction to prevent the network from training instability,\npattern collapse, and identical shortcut, while simultaneously optimizing both\nthe encoder and decoder on the target domain. 
To demonstrate our transfer\nability on various image domains, we conduct extensive experiments across two\npopular industrial defect detection benchmarks and three medical image UAD\ntasks, which shows our superiority over current state-of-the-art methods.\n","authors":["Jia Guo","Shuai Lu","Lize Jia","Weihang Zhang","Huiqi Li"],"pdf_url":"https://arxiv.org/pdf/2306.02602v3.pdf","comment":"NeurIPS 2023 Poster"},{"id":"http://arxiv.org/abs/2310.15533v1","updated":"2023-10-24T05:37:20Z","published":"2023-10-24T05:37:20Z","title":"Learning with Noisy Labels Using Collaborative Sample Selection and\n Contrastive Semi-Supervised Learning","summary":" Learning with noisy labels (LNL) has been extensively studied, with existing\napproaches typically following a framework that alternates between clean sample\nselection and semi-supervised learning (SSL). However, this approach has a\nlimitation: the clean set selected by the Deep Neural Network (DNN) classifier,\ntrained through self-training, inevitably contains noisy samples. This mixture\nof clean and noisy samples leads to misguidance in DNN training during SSL,\nresulting in impaired generalization performance due to confirmation bias\ncaused by error accumulation in sample selection. To address this issue, we\npropose a method called Collaborative Sample Selection (CSS), which leverages\nthe large-scale pre-trained model CLIP. CSS aims to remove the mixed noisy\nsamples from the identified clean set. We achieve this by training a\n2-Dimensional Gaussian Mixture Model (2D-GMM) that combines the probabilities\nfrom CLIP with the predictions from the DNN classifier. To further enhance the\nadaptation of CLIP to LNL, we introduce a co-training mechanism with a\ncontrastive loss in semi-supervised learning. 
This allows us to jointly train\nthe prompt of CLIP and the DNN classifier, resulting in improved feature\nrepresentation, boosted classification performance of DNNs, and reciprocal\nbenefits to our Collaborative Sample Selection. By incorporating auxiliary\ninformation from CLIP and utilizing prompt fine-tuning, we effectively\neliminate noisy samples from the clean set and mitigate confirmation bias\nduring training. Experimental results on multiple benchmark datasets\ndemonstrate the effectiveness of our proposed method in comparison with the\nstate-of-the-art approaches.\n","authors":["Qing Miao","Xiaohe Wu","Chao Xu","Yanli Ji","Wangmeng Zuo","Yiwen Guo","Zhaopeng Meng"],"pdf_url":"https://arxiv.org/pdf/2310.15533v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15008v2","updated":"2023-10-24T05:13:04Z","published":"2023-10-23T15:02:23Z","title":"Wonder3D: Single Image to 3D using Cross-Domain Diffusion","summary":" In this work, we introduce Wonder3D, a novel method for efficiently\ngenerating high-fidelity textured meshes from single-view images.Recent methods\nbased on Score Distillation Sampling (SDS) have shown the potential to recover\n3D geometry from 2D diffusion priors, but they typically suffer from\ntime-consuming per-shape optimization and inconsistent geometry. In contrast,\ncertain works directly produce 3D information via fast network inferences, but\ntheir results are often of low quality and lack geometric details. To\nholistically improve the quality, consistency, and efficiency of image-to-3D\ntasks, we propose a cross-domain diffusion model that generates multi-view\nnormal maps and the corresponding color images. To ensure consistency, we\nemploy a multi-view cross-domain attention mechanism that facilitates\ninformation exchange across views and modalities. Lastly, we introduce a\ngeometry-aware normal fusion algorithm that extracts high-quality surfaces from\nthe multi-view 2D representations. 
Our extensive evaluations demonstrate that\nour method achieves high-quality reconstruction results, robust generalization,\nand reasonably good efficiency compared to prior works.\n","authors":["Xiaoxiao Long","Yuan-Chen Guo","Cheng Lin","Yuan Liu","Zhiyang Dou","Lingjie Liu","Yuexin Ma","Song-Hai Zhang","Marc Habermann","Christian Theobalt","Wenping Wang"],"pdf_url":"https://arxiv.org/pdf/2310.15008v2.pdf","comment":"Project page: https://www.xxlong.site/Wonder3D/"},{"id":"http://arxiv.org/abs/2310.13573v3","updated":"2023-10-24T04:58:59Z","published":"2023-10-20T15:10:46Z","title":"Boosting Generalization with Adaptive Style Techniques for Fingerprint\n Liveness Detection","summary":" We introduce a high-performance fingerprint liveness feature extraction\ntechnique that secured first place in LivDet 2023 Fingerprint Representation\nChallenge. Additionally, we developed a practical fingerprint recognition\nsystem with 94.68% accuracy, earning second place in LivDet 2023 Liveness\nDetection in Action. By investigating various methods, particularly style\ntransfer, we demonstrate improvements in accuracy and generalization when faced\nwith limited training data. As a result, our approach achieved state-of-the-art\nperformance in LivDet 2023 Challenges.\n","authors":["Kexin Zhu","Bo Lin","Yang Qiu","Adam Yule","Yao Tang","Jiajun Liang"],"pdf_url":"https://arxiv.org/pdf/2310.13573v3.pdf","comment":"1st Place in LivDet2023 Fingerprint Representation Challenge"},{"id":"http://arxiv.org/abs/2210.03116v4","updated":"2023-10-24T04:26:10Z","published":"2022-10-06T17:59:51Z","title":"Content-Based Search for Deep Generative Models","summary":" The growing proliferation of customized and pretrained generative models has\nmade it infeasible for a user to be fully cognizant of every model in\nexistence. To address this need, we introduce the task of content-based model\nsearch: given a query and a large set of generative models, finding the models\nthat best match the query. 
As each generative model produces a distribution of\nimages, we formulate the search task as an optimization problem to select the\nmodel with the highest probability of generating similar content as the query.\nWe introduce a formulation to approximate this probability given the query from\ndifferent modalities, e.g., image, sketch, and text. Furthermore, we propose a\ncontrastive learning framework for model retrieval, which learns to adapt\nfeatures for various query modalities. We demonstrate that our method\noutperforms several baselines on Generative Model Zoo, a new benchmark we\ncreate for the model retrieval task.\n","authors":["Daohan Lu","Sheng-Yu Wang","Nupur Kumari","Rohan Agarwal","Mia Tang","David Bau","Jun-Yan Zhu"],"pdf_url":"https://arxiv.org/pdf/2210.03116v4.pdf","comment":"Our project page is hosted at\n https://generative-intelligence-lab.github.io/modelverse/"},{"id":"http://arxiv.org/abs/2310.15504v1","updated":"2023-10-24T04:16:27Z","published":"2023-10-24T04:16:27Z","title":"Cross-view Self-localization from Synthesized Scene-graphs","summary":" Cross-view self-localization is a challenging scenario of visual place\nrecognition in which database images are provided from sparse viewpoints.\nRecently, an approach for synthesizing database images from unseen viewpoints\nusing NeRF (Neural Radiance Fields) technology has emerged with impressive\nperformance. However, synthesized images provided by these techniques are often\nof lower quality than the original images, and furthermore they significantly\nincrease the storage cost of the database. In this study, we explore a new\nhybrid scene model that combines the advantages of view-invariant appearance\nfeatures computed from raw images and view-dependent spatial-semantic features\ncomputed from synthesized images. These two types of features are then fused\ninto scene graphs, and compressively learned and recognized by a graph neural\nnetwork. 
The effectiveness of the proposed method was verified using a novel\ncross-view self-localization dataset with many unseen views generated using a\nphotorealistic Habitat simulator.\n","authors":["Ryogo Yamamoto","Kanji Tanaka"],"pdf_url":"https://arxiv.org/pdf/2310.15504v1.pdf","comment":"5 pages, 5 figures, technical report"},{"id":"http://arxiv.org/abs/2309.14057v2","updated":"2023-10-24T03:23:05Z","published":"2023-09-25T11:50:19Z","title":"Weakly Supervised Semantic Segmentation by Knowledge Graph Inference","summary":" Currently, existing efforts in Weakly Supervised Semantic Segmentation (WSSS)\nbased on Convolutional Neural Networks (CNNs) have predominantly focused on\nenhancing the multi-label classification network stage, with limited attention\ngiven to the equally important downstream segmentation network. Furthermore,\nCNN-based local convolutions lack the ability to model the extensive\ninter-category dependencies. Therefore, this paper introduces a graph\nreasoning-based approach to enhance WSSS. The aim is to improve WSSS\nholistically by simultaneously enhancing both the multi-label classification\nand segmentation network stages. In the multi-label classification network\nsegment, external knowledge is integrated, coupled with GCNs, to globally\nreason about inter-class dependencies. This encourages the network to uncover\nfeatures in non-salient regions of images, thereby refining the completeness of\ngenerated pseudo-labels. In the segmentation network segment, the proposed\nGraph Reasoning Mapping (GRM) module is employed to leverage knowledge obtained\nfrom textual databases, facilitating contextual reasoning for class\nrepresentation within image regions. This GRM module enhances feature\nrepresentation in high-level semantics of the segmentation network's local\nconvolutions, while dynamically learning semantic coherence for individual\nsamples. 
Using solely image-level supervision, we have achieved\nstate-of-the-art performance in WSSS on the PASCAL VOC 2012 and MS-COCO\ndatasets. Extensive experimentation on both the multi-label classification and\nsegmentation network stages underscores the effectiveness of the proposed graph\nreasoning approach for advancing WSSS.\n","authors":["Jia Zhang","Bo Peng","Xi Wu"],"pdf_url":"https://arxiv.org/pdf/2309.14057v2.pdf","comment":"Our description in Chapter 3, Section 3.2 of the paper is too\n repetitive with the paper \"Object detection meets knowledge graphs\". There is\n an error in the description of formula (5) in Section 3.3. And a detailed\n reasoning process is required for formula (5). Therefore, we wish to request\n a retraction of the paper"},{"id":"http://arxiv.org/abs/2310.15168v2","updated":"2023-10-24T03:21:22Z","published":"2023-10-23T17:59:52Z","title":"Ghost on the Shell: An Expressive Representation of General 3D Shapes","summary":" The creation of photorealistic virtual worlds requires the accurate modeling\nof 3D surface geometry for a wide range of objects. For this, meshes are\nappealing since they 1) enable fast physics-based rendering with realistic\nmaterial and lighting, 2) support physical simulation, and 3) are\nmemory-efficient for modern graphics pipelines. Recent work on reconstructing\nand statistically modeling 3D shape, however, has critiqued meshes as being\ntopologically inflexible. To capture a wide range of object shapes, any 3D\nrepresentation must be able to model solid, watertight, shapes as well as thin,\nopen, surfaces. Recent work has focused on the former, and methods for\nreconstructing open surfaces do not support fast reconstruction with material\nand lighting or unconditional generative modelling. Inspired by the observation\nthat open surfaces can be seen as islands floating on watertight surfaces, we\nparameterize open surfaces by defining a manifold signed distance field on\nwatertight templates. 
With this parameterization, we further develop a\ngrid-based and differentiable representation that parameterizes both watertight\nand non-watertight meshes of arbitrary topology. Our new representation, called\nGhost-on-the-Shell (G-Shell), enables two important applications:\ndifferentiable rasterization-based reconstruction from multiview images and\ngenerative modelling of non-watertight meshes. We empirically demonstrate that\nG-Shell achieves state-of-the-art performance on non-watertight mesh\nreconstruction and generation tasks, while also performing effectively for\nwatertight meshes.\n","authors":["Zhen Liu","Yao Feng","Yuliang Xiu","Weiyang Liu","Liam Paull","Michael J. Black","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2310.15168v2.pdf","comment":"Technical Report (26 pages, 16 figures, Project Page:\n https://gshell3d.github.io/)"},{"id":"http://arxiv.org/abs/2310.15482v1","updated":"2023-10-24T03:18:07Z","published":"2023-10-24T03:18:07Z","title":"Salient Object Detection in RGB-D Videos","summary":" Given the widespread adoption of depth-sensing acquisition devices, RGB-D\nvideos and related data/media have gained considerable traction in various\naspects of daily life. Consequently, conducting salient object detection (SOD)\nin RGB-D videos presents a highly promising and evolving avenue. Despite the\npotential of this area, SOD in RGB-D videos remains somewhat under-explored,\nwith RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To\nexplore this emerging field, this paper makes two primary contributions: the\ndataset and the model. On one front, we construct the RDVS dataset, a new RGB-D\nVSOD dataset with realistic depth and characterized by its diversity of scenes\nand rigorous frame-by-frame annotations. We validate the dataset through\ncomprehensive attribute and object-oriented analyses, and provide training and\ntesting splits. 
Moreover, we introduce DCTNet+, a three-stream network tailored\nfor RGB-D VSOD, with an emphasis on RGB modality and treats depth and optical\nflow as auxiliary modalities. In pursuit of effective feature enhancement,\nrefinement, and fusion for precise final prediction, we propose two modules:\nthe multi-modal attention module (MAM) and the refinement fusion module (RFM).\nTo enhance interaction and fusion within RFM, we design a universal interaction\nmodule (UIM) and then integrate holistic multi-modal attentive paths (HMAPs)\nfor refining multi-modal low-level features before reaching RFMs. Comprehensive\nexperiments, conducted on pseudo RGB-D video datasets alongside our RDVS,\nhighlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD\nmodels. Ablation experiments were performed on both pseudo and realistic RGB-D\nvideo datasets to demonstrate the advantages of individual modules as well as\nthe necessity of introducing realistic depth. Our code together with RDVS\ndataset will be available at https://github.com/kerenfu/RDVS/.\n","authors":["Ao Mou","Yukang Lu","Jiahao He","Dingyao Min","Keren Fu","Qijun Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.15482v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.08130v2","updated":"2023-10-24T03:08:47Z","published":"2023-04-17T10:15:08Z","title":"A Survey on Few-Shot Class-Incremental Learning","summary":" Large deep learning models are impressive, but they struggle when real-time\ndata is not available. Few-shot class-incremental learning (FSCIL) poses a\nsignificant challenge for deep neural networks to learn new tasks from just a\nfew labeled samples without forgetting the previously learned ones. This setup\neasily leads to catastrophic forgetting and overfitting problems, severely\naffecting model performance. Studying FSCIL helps overcome deep learning model\nlimitations on data volume and acquisition time, while improving practicality\nand adaptability of machine learning models. 
This paper provides a\ncomprehensive survey on FSCIL. Unlike previous surveys, we aim to synthesize\nfew-shot learning and incremental learning, focusing on introducing FSCIL from\ntwo perspectives, while reviewing over 30 theoretical research studies and more\nthan 20 applied research studies. From the theoretical perspective, we provide\na novel categorization approach that divides the field into five subcategories,\nincluding traditional machine learning methods, meta-learning based methods,\nfeature and feature space-based methods, replay-based methods, and dynamic\nnetwork structure-based methods. We also evaluate the performance of recent\ntheoretical research on benchmark datasets of FSCIL. From the application\nperspective, FSCIL has achieved impressive achievements in various fields of\ncomputer vision such as image classification, object detection, and image\nsegmentation, as well as in natural language processing and graph. We summarize\nthe important applications. Finally, we point out potential future research\ndirections, including applications, problem setups, and theory development.\nOverall, this paper offers a comprehensive analysis of the latest advances in\nFSCIL from a methodological, performance, and application perspective.\n","authors":["Songsong Tian","Lusi Li","Weijun Li","Hang Ran","Xin Ning","Prayag Tiwari"],"pdf_url":"https://arxiv.org/pdf/2304.08130v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.01278v2","updated":"2023-10-24T02:57:42Z","published":"2023-05-02T09:28:39Z","title":"VPGTrans: Transfer Visual Prompt Generator across LLMs","summary":" While developing a new multimodal LLM (MLLM) by pre-training on tremendous\nimage-text pairs from scratch can be exceedingly resource-consuming, connecting\nan existing LLM with a comparatively lightweight visual prompt generator (VPG)\nbecomes a feasible paradigm. 
However, further tuning the VPG part of the MLLM\nstill suffers from indispensable computational costs, i.e., requiring thousands\nof GPU hours and millions of training data. One alternative solution is to\ntransfer an existing VPG from any existing MLLMs for the target MLLM.\n In this work, we for the first time investigate the VPG transferability\nacross LLMs, and explore a solution to reduce the cost of VPG transfer. We\nfirst study the VPG transfer across different LLM sizes (e.g., small-to-large),\nand across different LLM types, through which we diagnose the key factors to\nmaximize the transfer efficiency. Based on our observation, we design a\ntwo-stage transfer framework named VPGTrans, which is simple yet highly\neffective. Through extensive experiments, we demonstrate that VPGTrans helps\nsignificantly speed up the transfer learning process without compromising\nperformance. Remarkably, it helps achieve the VPG transfer from BLIP-2\nOPT$_\\text{2.7B}$ to BLIP-2 OPT$_\\text{6.7B}$ with over 10 times speed-up and\n10.7% training data compared with connecting a VPG to OPT$_\\text{6.7B}$ from\nscratch. Further, a series of intriguing findings and potential rationales\nbehind them are provided and discussed. 
Finally, we showcase the practical\nvalue of our VPGTrans approach, by customizing two novel MLLMs, including\nVL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs.\n","authors":["Ao Zhang","Hao Fei","Yuan Yao","Wei Ji","Li Li","Zhiyuan Liu","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2305.01278v2.pdf","comment":"Project Website: https://vpgtrans.github.io Code:\n https://github.com/VPGTrans/VPGTrans NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.08078v2","updated":"2023-10-24T02:12:02Z","published":"2023-05-14T05:57:11Z","title":"Supervised Domain Adaptation for Recognizing Retinal Diseases from\n Wide-Field Fundus Images","summary":" This paper addresses the emerging task of recognizing multiple retinal\ndiseases from wide-field (WF) and ultra-wide-field (UWF) fundus images. For an\neffective use of existing large amount of labeled color fundus photo (CFP) data\nand the relatively small amount of WF and UWF data, we propose a supervised\ndomain adaptation method named Cross-domain Collaborative Learning (CdCL).\nInspired by the success of fixed-ratio based mixup in unsupervised domain\nadaptation, we re-purpose this strategy for the current task. Due to the\nintrinsic disparity between the field-of-view of CFP and WF/UWF images, a scale\nbias naturally exists in a mixup sample that the anatomic structure from a CFP\nimage will be considerably larger than its WF/UWF counterpart. The CdCL method\nresolves the issue by Scale-bias Correction, which employs Transformers for\nproducing scale-invariant features. 
As demonstrated by extensive experiments on\nmultiple datasets covering both WF and UWF images, the proposed method compares\nfavorably against a number of competitive baselines.\n","authors":["Qijie Wei","Jingyuan Yang","Bo Wang","Jinrui Wang","Jianchun Zhao","Xinyu Zhao","Sheng Yang","Niranchana Manivannan","Youxin Chen","Dayong Ding","Jing Zhou","Xirong Li"],"pdf_url":"https://arxiv.org/pdf/2305.08078v2.pdf","comment":"Accepted by BIBM2023"},{"id":"http://arxiv.org/abs/2309.06067v3","updated":"2023-10-24T01:50:38Z","published":"2023-09-12T09:07:03Z","title":"Batch Implicit Neural Representation for MRI Parallel Reconstruction","summary":" Magnetic resonance imaging (MRI) has always suffered from the problem of long\nacquisition time. MRI reconstruction is one solution to reduce scan time by\nskipping certain phase-encoding lines and then restoring high-quality images\nfrom undersampled measurements. Recently, implicit neural representation (INR)\nhas emerged as a new deep learning method that represents an object as a\ncontinuous function of spatial coordinates, and this function is normally\nparameterized by a multilayer perceptron (MLP). In this paper, we propose a\nnovel MRI parallel reconstruction method based on INR, which represents the\nfully-sampled images as a function of voxel coordinates and prior feature\nvectors of undersampled images for overcoming the generalization problem of\nINR. Specifically, we introduce a scale-embedded encoder to produce\nscale-independent voxel-specific features from MR images with different\nundersampled scales and then concatenate them with coordinate vectors to recover\nfully-sampled MR images via an MLP, thus achieving arbitrary scale\nreconstruction. The performance of the proposed method was assessed by\nexperimenting on publicly available MRI datasets and compared with other\nreconstruction methods. 
Our quantitative evaluation demonstrates the\nsuperiority of the proposed method over alternative reconstruction methods.\n","authors":["Hao Li","Yusheng Zhou","Jianan Liu","Xiling Liu","Tao Huang","Zhihan Lv"],"pdf_url":"https://arxiv.org/pdf/2309.06067v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15447v1","updated":"2023-10-24T01:44:11Z","published":"2023-10-24T01:44:11Z","title":"DeepIron: Predicting Unwarped Garment Texture from a Single Image","summary":" Realistic reconstruction of 3D clothing from an image has wide applications,\nsuch as avatar creation and virtual try-on. This paper presents a novel\nframework that reconstructs the texture map for 3D garments from a single image\nwith pose. Assuming that 3D garments are modeled by stitching 2D garment sewing\npatterns, our specific goal is to generate a texture image for the sewing\npatterns. A key component of our framework, the Texture Unwarper, infers the\noriginal texture image from the input clothing image, which exhibits warping\nand occlusion of texture due to the user's body shape and pose. The Texture\nUnwarper effectively transforms between the input and output images by mapping\nthe latent spaces of the two images. By inferring the unwarped original texture\nof the input garment, our method helps reconstruct 3D garment models that can\nshow high-quality texture images realistically deformed for new poses. We\nvalidate the effectiveness of our approach through a comparison with other\nmethods and ablation studies. 
Additionally, we release a large dataset of\ngarment sewing patterns with textures and images of avatars wearing the\ngarments, which will be useful for future research on garment texture\nreconstruction and synthesis.\n","authors":["Hyun-Song Kwon","Sung-Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2310.15447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15444v1","updated":"2023-10-24T01:36:20Z","published":"2023-10-24T01:36:20Z","title":"Fast Propagation is Better: Accelerating Single-Step Adversarial\n Training via Sampling Subnetworks","summary":" Adversarial training has shown promise in building robust models against\nadversarial examples. A major drawback of adversarial training is the\ncomputational overhead introduced by the generation of adversarial examples. To\novercome this limitation, adversarial training based on single-step attacks has\nbeen explored. Previous work improves the single-step adversarial training from\ndifferent perspectives, e.g., sample initialization, loss regularization, and\ntraining strategy. Almost all of them treat the underlying model as a black\nbox. In this work, we propose to exploit the interior building blocks of the\nmodel to improve efficiency. Specifically, we propose to dynamically sample\nlightweight subnetworks as a surrogate model during training. By doing this,\nboth the forward and backward passes can be accelerated for efficient\nadversarial training. Besides, we provide theoretical analysis to show the\nmodel robustness can be improved by the single-step adversarial training with\nsampled subnetworks. Furthermore, we propose a novel sampling strategy where\nthe sampling varies from layer to layer and from iteration to iteration.\nCompared with previous methods, our method not only reduces the training cost\nbut also achieves better model robustness. Evaluations on a series of popular\ndatasets demonstrate the effectiveness of the proposed FP-Better. 
Our code has\nbeen released at https://github.com/jiaxiaojunQAQ/FP-Better.\n","authors":["Xiaojun Jia","Jianshu Li","Jindong Gu","Yang Bai","Xiaochun Cao"],"pdf_url":"https://arxiv.org/pdf/2310.15444v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14230v2","updated":"2023-10-24T01:36:19Z","published":"2023-10-22T08:46:40Z","title":"A comprehensive survey on deep active learning and its applications in\n medical image analysis","summary":" Deep learning has achieved widespread success in medical image analysis,\nleading to an increasing demand for large-scale expert-annotated medical image\ndatasets. Yet, the high cost of annotating medical images severely hampers the\ndevelopment of deep learning in this field. To reduce annotation costs, active\nlearning aims to select the most informative samples for annotation and train\nhigh-performance models with as few labeled samples as possible. In this\nsurvey, we review the core methods of active learning, including the evaluation\nof informativeness and sampling strategy. For the first time, we provide a\ndetailed summary of the integration of active learning with other\nlabel-efficient techniques, such as semi-supervised, self-supervised learning,\nand so on. Additionally, we also highlight active learning works that are\nspecifically tailored to medical image analysis. 
In the end, we offer our\nperspectives on the future trends and challenges of active learning and its\napplications in medical image analysis.\n","authors":["Haoran Wang","Qiuye Jin","Shiman Li","Siyu Liu","Manning Wang","Zhijian Song"],"pdf_url":"https://arxiv.org/pdf/2310.14230v2.pdf","comment":"Paper List on Github:\n https://github.com/LightersWang/Awesome-Active-Learning-for-Medical-Image-Analysis"},{"id":"http://arxiv.org/abs/2310.05462v2","updated":"2023-10-24T01:02:12Z","published":"2023-10-09T07:10:30Z","title":"AdaFuse: Adaptive Medical Image Fusion Based on Spatial-Frequential\n Cross Attention","summary":" Multi-modal medical image fusion is essential for precise clinical\ndiagnosis and surgical navigation since it can merge the complementary\ninformation in multi-modalities into a single image. The quality of the fused\nimage depends on the extracted single modality features as well as the fusion\nrules for multi-modal information. Although existing deep learning-based fusion methods\ncan fully exploit the semantic features of each modality, they cannot\ndistinguish the effective low and high frequency information of each modality\nand fuse them adaptively. To address this issue, we propose AdaFuse, in which\nmultimodal image information is fused adaptively through a frequency-guided\nattention mechanism based on Fourier transform. Specifically, we propose the\ncross-attention fusion (CAF) block, which adaptively fuses features of two\nmodalities in the spatial and frequency domains by exchanging key and query\nvalues, and then calculates the cross-attention scores between the spatial and\nfrequency features to further guide the spatial-frequential information fusion.\nThe CAF block enhances the high-frequency features of the different modalities\nso that the details in the fused images can be retained. Moreover, we design a\nnovel loss function composed of structure loss and content loss to preserve\nboth low and high frequency information. 
Extensive comparison experiments on\nseveral datasets demonstrate that the proposed method outperforms\nstate-of-the-art methods in terms of both visual quality and quantitative\nmetrics. The ablation experiments also validate the effectiveness of the\nproposed loss and fusion strategy.\n","authors":["Xianming Gu","Lihui Wang","Zeyu Deng","Ying Cao","Xingyu Huang","Yue-min Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.05462v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.01735v5","updated":"2023-10-24T00:40:26Z","published":"2023-02-03T13:50:25Z","title":"Rethinking Semi-Supervised Medical Image Segmentation: A\n Variance-Reduction Perspective","summary":" For medical image segmentation, contrastive learning is the dominant practice\nto improve the quality of visual representations by contrasting semantically\nsimilar and dissimilar pairs of samples. This is enabled by the observation\nthat without accessing ground truth labels, negative examples with truly\ndissimilar anatomical features, if sampled, can significantly improve the\nperformance. In reality, however, these samples may come from similar\nanatomical regions and the models may struggle to distinguish the minority\ntail-class samples, making the tail classes more prone to misclassification,\nboth of which typically lead to model collapse. In this paper, we propose ARCO,\na semi-supervised contrastive learning (CL) framework with stratified group\ntheory for medical image segmentation. In particular, we first propose building\nARCO through the concept of variance-reduced estimation and show that certain\nvariance-reduction techniques are particularly beneficial in pixel/voxel-level\nsegmentation tasks with extremely limited labels. Furthermore, we theoretically\nprove these sampling techniques are universal in variance reduction. 
Finally,\nwe experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D\nmedical and three semantic segmentation datasets, with different label\nsettings, and our methods consistently outperform state-of-the-art\nsemi-supervised methods. Additionally, we augment the CL frameworks with these\nsampling techniques and demonstrate significant gains over previous methods. We\nbelieve our work is an important step towards semi-supervised medical image\nsegmentation by quantifying the limitation of current self-supervision\nobjectives for accomplishing such challenging safety-critical tasks.\n","authors":["Chenyu You","Weicheng Dai","Yifei Min","Fenglin Liu","David A. Clifton","S Kevin Zhou","Lawrence Hamilton Staib","James S Duncan"],"pdf_url":"https://arxiv.org/pdf/2302.01735v5.pdf","comment":"Accepted by Advances in Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.15422v1","updated":"2023-10-24T00:28:24Z","published":"2023-10-24T00:28:24Z","title":"G2-MonoDepth: A General Framework of Generalized Depth Inference from\n Monocular RGB+X Data","summary":" Monocular depth inference is a fundamental problem for scene perception of\nrobots. Specific robots may be equipped with a camera plus an optional depth\nsensor of any type and located in various scenes of different scales, whereas\nrecent advances derived multiple individual sub-tasks. It leads to additional\nburdens to fine-tune models for specific robots and thereby high-cost\ncustomization in large-scale industrialization. This paper investigates a\nunified task of monocular depth inference, which infers high-quality depth maps\nfrom all kinds of input raw data from various robots in unseen scenes. 
A basic\nbenchmark G2-MonoDepth is developed for this task, which comprises four\ncomponents: (a) a unified data representation RGB+X to accommodate RGB plus raw\ndepth with diverse scene scale/semantics, depth sparsity ([0%, 100%]) and\nerrors (holes/noises/blurs), (b) a novel unified loss to adapt to diverse depth\nsparsity/errors of input raw data and diverse scales of output scenes, (c) an\nimproved network to well propagate diverse scene scales from input to output,\nand (d) a data augmentation pipeline to simulate all types of real artifacts in\nraw depth maps for training. G2-MonoDepth is applied in three sub-tasks\nincluding depth estimation, depth completion with different sparsity, and depth\nenhancement in unseen scenes, and it always outperforms SOTA baselines on both\nreal-world data and synthetic data.\n","authors":["Haotian Wang","Meng Yang","Nanning Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.15422v1.pdf","comment":"18 pages, 16 figures"},{"id":"http://arxiv.org/abs/2310.16234v1","updated":"2023-10-24T23:06:29Z","published":"2023-10-24T23:06:29Z","title":"Pixel-Level Clustering Network for Unsupervised Image Segmentation","summary":" While image segmentation is crucial in various computer vision applications,\nsuch as autonomous driving, grasping, and robot navigation, annotating all\nobjects at the pixel-level for training is nearly impossible. Therefore, the\nstudy of unsupervised image segmentation methods is essential. In this paper,\nwe present a pixel-level clustering framework for segmenting images into\nregions without using ground truth annotations. The proposed framework includes\nfeature embedding modules with an attention mechanism, a feature statistics\ncomputing module, image reconstruction, and superpixel segmentation to achieve\naccurate unsupervised segmentation. 
Additionally, we propose a training\nstrategy that utilizes intra-consistency within each superpixel,\ninter-similarity/dissimilarity between neighboring superpixels, and structural\nsimilarity between images. To avoid potential over-segmentation caused by\nsuperpixel-based losses, we also propose a post-processing method. Furthermore,\nwe present an extension of the proposed method for unsupervised semantic\nsegmentation. We conducted experiments on three publicly available datasets\n(Berkeley segmentation dataset, PASCAL VOC 2012 dataset, and COCO-Stuff\ndataset) to demonstrate the effectiveness of the proposed framework. The\nexperimental results show that the proposed framework outperforms previous\nstate-of-the-art methods.\n","authors":["Cuong Manh Hoang","Byeongkeun Kang"],"pdf_url":"https://arxiv.org/pdf/2310.16234v1.pdf","comment":"13 pages"},{"id":"http://arxiv.org/abs/2310.16228v1","updated":"2023-10-24T22:54:05Z","published":"2023-10-24T22:54:05Z","title":"On the Foundations of Shortcut Learning","summary":" Deep-learning models can extract a rich assortment of features from data.\nWhich features a model uses depends not only on predictivity-how reliably a\nfeature indicates train-set labels-but also on availability-how easily the\nfeature can be extracted, or leveraged, from inputs. The literature on shortcut\nlearning has noted examples in which models privilege one feature over another,\nfor example texture over shape and image backgrounds over foreground objects.\nHere, we test hypotheses about which input properties are more available to a\nmodel, and systematically study how predictivity and availability interact to\nshape models' feature use. 
We construct a minimal, explicit generative\nframework for synthesizing classification datasets with two latent features\nthat vary in predictivity and in factors we hypothesize to relate to\navailability, and quantify a model's shortcut bias-its over-reliance on the\nshortcut (more available, less predictive) feature at the expense of the core\n(less available, more predictive) feature. We find that linear models are\nrelatively unbiased, but introducing a single hidden layer with ReLU or Tanh\nunits yields a bias. Our empirical findings are consistent with a theoretical\naccount based on Neural Tangent Kernels. Finally, we study how models used in\npractice trade off predictivity and availability in naturalistic datasets,\ndiscovering availability manipulations which increase models' degree of\nshortcut bias. Taken together, these findings suggest that the propensity to\nlearn shortcut features is a fundamental characteristic of deep nonlinear\narchitectures warranting systematic study given its role in shaping how models\nsolve tasks.\n","authors":["Katherine L. Hermann","Hossein Mobahi","Thomas Fel","Michael C. Mozer"],"pdf_url":"https://arxiv.org/pdf/2310.16228v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16226v1","updated":"2023-10-24T22:41:14Z","published":"2023-10-24T22:41:14Z","title":"TiC-CLIP: Continual Training of CLIP Models","summary":" Keeping large foundation models up to date on latest data is inherently\nexpensive. To avoid the prohibitive costs of constantly retraining, it is\nimperative to continually train these models. This problem is exacerbated by\nthe lack of any large scale continual learning benchmarks or baselines. We\nintroduce the first set of web-scale Time-Continual (TiC) benchmarks for\ntraining vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps with\nover 12.7B timestamped image-text pairs spanning 9 years (2014--2022). 
We first\nuse our benchmarks to curate various dynamic evaluations to measure temporal\nrobustness of existing models. We show OpenAI's CLIP (trained on data up to\n2020) loses $\\approx 8\\%$ zero-shot accuracy on our curated retrieval task from\n2021--2022 compared with more recently trained models in OpenCLIP repository.\nWe then study how to efficiently train models on time-continuous data. We\ndemonstrate that a simple rehearsal-based approach that continues training from\nthe last checkpoint and replays old data reduces compute by $2.5\\times$ when\ncompared to the standard practice of retraining from scratch.\n","authors":["Saurabh Garg","Mehrdad Farajtabar","Hadi Pouransari","Raviteja Vemulapalli","Sachin Mehta","Oncel Tuzel","Vaishaal Shankar","Fartash Faghri"],"pdf_url":"https://arxiv.org/pdf/2310.16226v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19809v2","updated":"2023-10-24T22:29:00Z","published":"2023-05-31T12:51:10Z","title":"Direct Diffusion Bridge using Data Consistency for Inverse Problems","summary":" Diffusion model-based inverse problem solvers have shown impressive\nperformance, but are limited in speed, mostly as they require reverse diffusion\nsampling starting from noise. Several recent works have tried to alleviate this\nproblem by building a diffusion process, directly bridging the clean and the\ncorrupted for specific inverse problems. In this paper, we first unify these\nexisting works under the name Direct Diffusion Bridges (DDB), showing that\nwhile motivated by different theories, the resulting algorithms only differ in\nthe choice of parameters. Then, we highlight a critical limitation of the\ncurrent DDB framework, namely that it does not ensure data consistency. To\naddress this problem, we propose a modified inference procedure that imposes\ndata consistency without the need for fine-tuning. 
We term the resulting method\ndata Consistent DDB (CDDB), which outperforms its inconsistent counterpart in\nterms of both perception and distortion metrics, thereby effectively pushing\nthe Pareto-frontier toward the optimum. Our proposed method achieves\nstate-of-the-art results on both evaluation criteria, showcasing its\nsuperiority over existing methods. Code is available at\nhttps://github.com/HJ-harry/CDDB\n","authors":["Hyungjin Chung","Jeongsol Kim","Jong Chul Ye"],"pdf_url":"https://arxiv.org/pdf/2305.19809v2.pdf","comment":"NeurIPS 2023 camera-ready. 16 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.16221v1","updated":"2023-10-24T22:24:44Z","published":"2023-10-24T22:24:44Z","title":"Hierarchical Randomized Smoothing","summary":" Real-world data is complex and often consists of objects that can be\ndecomposed into multiple entities (e.g. images into pixels, graphs into\ninterconnected nodes). Randomized smoothing is a powerful framework for making\nmodels provably robust against small changes to their inputs - by guaranteeing\nrobustness of the majority vote when randomly adding noise before\nclassification. Yet, certifying robustness on such complex data via randomized\nsmoothing is challenging when adversaries do not arbitrarily perturb entire\nobjects (e.g. images) but only a subset of their entities (e.g. pixels). As a\nsolution, we introduce hierarchical randomized smoothing: We partially smooth\nobjects by adding random noise only on a randomly selected subset of their\nentities. By adding noise in a more targeted manner than existing methods we\nobtain stronger robustness guarantees while maintaining high accuracy. We\ninitialize hierarchical smoothing using different noising distributions,\nyielding novel robustness certificates for discrete and continuous domains. We\nexperimentally demonstrate the importance of hierarchical smoothing in image\nand node classification, where it yields superior robustness-accuracy\ntrade-offs. 
Overall, hierarchical smoothing is an important contribution\ntowards models that are both - certifiably robust to perturbations and\naccurate.\n","authors":["Yan Scholten","Jan Schuchardt","Aleksandar Bojchevski","Stephan Günnemann"],"pdf_url":"https://arxiv.org/pdf/2310.16221v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.05831v4","updated":"2023-10-24T22:11:31Z","published":"2022-05-12T01:54:22Z","title":"Feature Extractor Stacking for Cross-domain Few-shot Learning","summary":" Cross-domain few-shot learning (CDFSL) addresses learning problems where\nknowledge needs to be transferred from one or more source domains into an\ninstance-scarce target domain with an explicitly different distribution.\nRecently published CDFSL methods generally construct a universal model that\ncombines knowledge of multiple source domains into one feature extractor. This\nenables efficient inference but necessitates re-computation of the extractor\nwhenever a new source domain is added. Some of these methods are also\nincompatible with heterogeneous source domain extractor architectures. We\npropose feature extractor stacking (FES), a new CDFSL method for combining\ninformation from a collection of extractors, that can utilise heterogeneous\npretrained extractors out of the box and does not maintain a universal model\nthat needs to be re-computed when its extractor collection is updated. We\npresent the basic FES algorithm, which is inspired by the classic stacked\ngeneralisation approach, and also introduce two variants: convolutional FES\n(ConFES) and regularised FES (ReFES). Given a target-domain task, these\nalgorithms fine-tune each extractor independently, use cross-validation to\nextract training data for stacked generalisation from the support set, and\nlearn a simple linear stacking classifier from this data. 
We evaluate our FES\nmethods on the well-known Meta-Dataset benchmark, targeting image\nclassification with convolutional neural networks, and show that they can\nachieve state-of-the-art performance.\n","authors":["Hongyu Wang","Eibe Frank","Bernhard Pfahringer","Michael Mayo","Geoffrey Holmes"],"pdf_url":"https://arxiv.org/pdf/2205.05831v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16212v1","updated":"2023-10-24T22:01:14Z","published":"2023-10-24T22:01:14Z","title":"ShadowSense: Unsupervised Domain Adaptation and Feature Fusion for\n Shadow-Agnostic Tree Crown Detection from RGB-Thermal Drone Imagery","summary":" Accurate detection of individual tree crowns from remote sensing data poses a\nsignificant challenge due to the dense nature of forest canopy and the presence\nof diverse environmental variations, e.g., overlapping canopies, occlusions,\nand varying lighting conditions. Additionally, the lack of data for training\nrobust models adds another limitation in effectively studying complex forest\nconditions. This paper presents a novel method for detecting shadowed tree\ncrowns and provides a challenging dataset comprising roughly 50k paired\nRGB-thermal images to facilitate future research for illumination-invariant\ndetection. The proposed method (ShadowSense) is entirely self-supervised,\nleveraging domain adversarial training without source domain annotations for\nfeature extraction and foreground feature alignment for feature pyramid\nnetworks to adapt domain-invariant representations by focusing on visible\nforeground regions, respectively. It then fuses complementary information of\nboth modalities to effectively improve upon the predictions of an RGB-trained\ndetector and boost the overall accuracy. Extensive experiments demonstrate the\nsuperiority of the proposed method over both the baseline RGB-trained detector\nand state-of-the-art techniques that rely on unsupervised domain adaptation or\nearly image fusion. 
Our code and data are available:\nhttps://github.com/rudrakshkapil/ShadowSense\n","authors":["Rudraksh Kapil","Seyed Mojtaba Marvasti-Zadeh","Nadir Erbilgin","Nilanjan Ray"],"pdf_url":"https://arxiv.org/pdf/2310.16212v1.pdf","comment":"Accepted in IEEE/CVF Winter Applications of Computer Vision (WACV)\n 2024 main conference! 8 pages (11 with bibliography), 5 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.16210v1","updated":"2023-10-24T21:57:59Z","published":"2023-10-24T21:57:59Z","title":"Sea-Land-Cloud Segmentation in Satellite Hyperspectral Imagery by Deep\n Learning","summary":" Satellites are increasingly adopting on-board Artificial Intelligence (AI)\ntechniques to enhance platforms' autonomy through edge inference. In this\ncontext, the utilization of deep learning (DL) techniques for segmentation in\nHS satellite imagery offers advantages for remote sensing applications, and\ntherefore, we train 16 different models, whose codes are made available through\nour study, which we consider to be relevant for on-board multi-class\nsegmentation of HS imagery, focusing on classifying oceanic (sea), terrestrial\n(land), and cloud formations. We employ the HYPSO-1 mission as an illustrative\ncase for sea-land-cloud segmentation, and to demonstrate the utility of the\nsegments, we introduce a novel sea-land-cloud ranking application scenario. Our\nsystem prioritizes HS image downlink based on sea, land, and cloud coverage\nlevels from the segmented images. We comparatively evaluate the models for\nin-orbit deployment, considering performance, parameter count, and inference\ntime. The models include both shallow and deep models, and after we propose\nfour new DL models, we demonstrate that segmenting single spectral signatures\n(1D) outperforms 3D data processing comprising both spectral (1D) and spatial\n(2D) contexts. 
We conclude that our lightweight DL model, called\n1D-Justo-LiuNet, consistently surpasses state-of-the-art models for\nsea-land-cloud segmentation, such as U-Net and its variations, in terms of\nperformance (0.93 accuracy) and parameter count (4,563). However, the 1D models\npresent longer inference time (15s) in the tested processing architecture,\nwhich is clearly suboptimal. Finally, after demonstrating that in-orbit image\nsegmentation should occur post L1b radiance calibration rather than on raw\ndata, we additionally show that reducing spectral channels down to 3 lowers\nmodels' parameters and inference time, at the cost of weaker segmentation\nperformance.\n","authors":["Jon Alvarez Justo","Joseph Landon Garrett","Mariana-Iuliana Georgescu","Jesus Gonzalez-Llorente","Radu Tudor Ionescu","Tor Arne Johansen"],"pdf_url":"https://arxiv.org/pdf/2310.16210v1.pdf","comment":"Remote Sensing, Satellite Imagery, Hyperspectral Imaging, Deep\n Learning, Segmentation"},{"id":"http://arxiv.org/abs/2305.06343v2","updated":"2023-10-24T21:40:00Z","published":"2023-05-10T17:52:26Z","title":"Incorporating Structured Representations into Pretrained Vision &\n Language Models Using Scene Graphs","summary":" Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS)\nperformance in a variety of tasks. However, recent works have shown that even\nthe best VLMs struggle to capture aspects of compositional scene understanding,\nsuch as object attributes, relations, and action states. In contrast, obtaining\nstructured annotations, such as scene graphs (SGs), that could improve these\nmodels is time-consuming and costly, and thus cannot be used on a large scale.\nHere we ask whether small SG datasets can provide sufficient information for\nenhancing structured understanding of pretrained VLMs. 
We show that it is\nindeed possible to improve VLMs when learning from SGs by integrating\ncomponents that incorporate structured information into both visual and textual\nrepresentations. For the visual side, we incorporate a special \"SG Component\"\nin the image transformer trained to predict SG information, while for the\ntextual side, we utilize SGs to generate fine-grained captions that highlight\ndifferent compositional aspects of the scene. Our method improves the\nperformance of several popular VLMs on multiple VL datasets with only a mild\ndegradation in ZS capabilities.\n","authors":["Roei Herzig","Alon Mendelson","Leonid Karlinsky","Assaf Arbelle","Rogerio Feris","Trevor Darrell","Amir Globerson"],"pdf_url":"https://arxiv.org/pdf/2305.06343v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.04966v2","updated":"2023-10-24T21:30:16Z","published":"2023-05-08T18:02:11Z","title":"NerfAcc: Efficient Sampling Accelerates NeRFs","summary":" Optimizing and rendering Neural Radiance Fields is computationally expensive\ndue to the vast number of samples required by volume rendering. Recent works\nhave included alternative sampling approaches to help accelerate their methods,\nhowever, they are often not the focus of the work. In this paper, we\ninvestigate and compare multiple sampling approaches and demonstrate that\nimproved sampling is generally applicable across NeRF variants under an unified\nconcept of transmittance estimator. To facilitate future experiments, we\ndevelop NerfAcc, a Python toolbox that provides flexible APIs for incorporating\nadvanced sampling methods into NeRF related methods. We demonstrate its\nflexibility by showing that it can reduce the training time of several recent\nNeRF methods by 1.5x to 20x with minimal modifications to the existing\ncodebase. 
Additionally, highly customized NeRFs, such as Instant-NGP, can be\nimplemented in native PyTorch using NerfAcc.\n","authors":["Ruilong Li","Hang Gao","Matthew Tancik","Angjoo Kanazawa"],"pdf_url":"https://arxiv.org/pdf/2305.04966v2.pdf","comment":"Website: https://www.nerfacc.com"},{"id":"http://arxiv.org/abs/2310.16194v1","updated":"2023-10-24T21:24:27Z","published":"2023-10-24T21:24:27Z","title":"Learning Low-Rank Latent Spaces with Simple Deterministic Autoencoder:\n Theoretical and Empirical Insights","summary":" The autoencoder is an unsupervised learning paradigm that aims to create a\ncompact latent representation of data by minimizing the reconstruction loss.\nHowever, it tends to overlook the fact that most data (images) are embedded in\na lower-dimensional space, which is crucial for effective data representation.\nTo address this limitation, we propose a novel approach called Low-Rank\nAutoencoder (LoRAE). In LoRAE, we incorporated a low-rank regularizer to\nadaptively reconstruct a low-dimensional latent space while preserving the\nbasic objective of an autoencoder. This helps embed the data in a\nlower-dimensional space while preserving important information. It is a simple\nautoencoder extension that learns low-rank latent space. Theoretically, we\nestablish a tighter error bound for our model. Empirically, our model's\nsuperiority shines through various tasks such as image generation and\ndownstream classification. 
Both theoretical and practical outcomes highlight\nthe importance of acquiring low-dimensional embeddings.\n","authors":["Alokendu Mazumder","Tirthajit Baruah","Bhartendu Kumar","Rishab Sharma","Vishwajeet Pattanaik","Punit Rathore"],"pdf_url":"https://arxiv.org/pdf/2310.16194v1.pdf","comment":"Accepted @ IEEE/CVF WACV 2024"},{"id":"http://arxiv.org/abs/2305.13812v3","updated":"2023-10-24T21:21:00Z","published":"2023-05-23T08:28:38Z","title":"Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for\n Improved Vision-Language Compositionality","summary":" Contrastively trained vision-language models have achieved remarkable\nprogress in vision and language representation learning, leading to\nstate-of-the-art models for various downstream multimodal tasks. However,\nrecent research has highlighted severe limitations of these models in their\nability to perform compositional reasoning over objects, attributes, and\nrelations. Scene graphs have emerged as an effective way to understand images\ncompositionally. These are graph-structured semantic representations of images\nthat contain objects, their attributes, and relations with other objects in a\nscene. In this work, we consider the scene graph parsed from text as a proxy\nfor the image scene graph and propose a graph decomposition and augmentation\nframework along with a coarse-to-fine contrastive learning objective between\nimages and text that aligns sentences of various complexities to the same\nimage. 
Along with this, we propose novel negative mining techniques in the\nscene graph space for improving attribute binding and relation understanding.\nThrough extensive experiments, we demonstrate the effectiveness of our approach\nthat significantly improves attribute binding, relation understanding,\nsystematic generalization, and productivity on multiple recently proposed\nbenchmarks (For example, improvements upto $18\\%$ for systematic\ngeneralization, $16.5\\%$ for relation understanding over a strong baseline),\nwhile achieving similar or better performance than CLIP on various general\nmultimodal tasks.\n","authors":["Harman Singh","Pengchuan Zhang","Qifan Wang","Mengjiao Wang","Wenhan Xiong","Jingfei Du","Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2305.13812v3.pdf","comment":"EMNLP 2023 (long paper, main conference)"},{"id":"http://arxiv.org/abs/2307.03833v3","updated":"2023-10-24T20:46:24Z","published":"2023-07-07T21:03:18Z","title":"Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation","summary":" Learning-based methods have dominated the 3D human pose estimation (HPE)\ntasks with significantly better performance in most benchmarks than traditional\noptimization-based methods. Nonetheless, 3D HPE in the wild is still the\nbiggest challenge for learning-based models, whether with 2D-3D lifting,\nimage-to-3D, or diffusion-based methods, since the trained networks implicitly\nlearn camera intrinsic parameters and domain-based 3D human pose distributions\nand estimate poses by statistical average. On the other hand, the\noptimization-based methods estimate results case-by-case, which can predict\nmore diverse and sophisticated human poses in the wild. By combining the\nadvantages of optimization-based and learning-based methods, we propose the\n\\textbf{Ze}ro-shot \\textbf{D}iffusion-based \\textbf{O}ptimization\n(\\textbf{ZeDO}) pipeline for 3D HPE to solve the problem of cross-domain and\nin-the-wild 3D HPE. 
Our multi-hypothesis \\textit{\\textbf{ZeDO}} achieves\nstate-of-the-art (SOTA) performance on Human3.6M, with minMPJPE $51.4$mm,\nwithout training with any 2D-3D or image-3D pairs. Moreover, our\nsingle-hypothesis \\textit{\\textbf{ZeDO}} achieves SOTA performance on 3DPW\ndataset with PA-MPJPE $40.3$mm on cross-dataset evaluation, which even\noutperforms learning-based methods trained on 3DPW.\n","authors":["Zhongyu Jiang","Zhuoran Zhou","Lei Li","Wenhao Chai","Cheng-Yen Yang","Jenq-Neng Hwang"],"pdf_url":"https://arxiv.org/pdf/2307.03833v3.pdf","comment":"WACV 2024"},{"id":"http://arxiv.org/abs/2305.14410v2","updated":"2023-10-24T20:44:09Z","published":"2023-05-23T17:59:10Z","title":"Image Manipulation via Multi-Hop Instructions -- A New Dataset and\n Weakly-Supervised Neuro-Symbolic Approach","summary":" We are interested in image manipulation via natural language text -- a task\nthat is useful for multiple AI applications but requires complex reasoning over\nmulti-modal spaces. We extend recently proposed Neuro Symbolic Concept Learning\n(NSCL), which has been quite effective for the task of Visual Question\nAnswering (VQA), for the task of image manipulation. Our system referred to as\nNeuroSIM can perform complex multi-hop reasoning over multi-object scenes and\nonly requires weak supervision in the form of annotated data for VQA. NeuroSIM\nparses an instruction into a symbolic program, based on a Domain Specific\nLanguage (DSL) comprising of object attributes and manipulation operations,\nthat guides its execution. 
We create a new dataset for the task, and extensive\nexperiments demonstrate that NeuroSIM is highly competitive with or beats SOTA\nbaselines that make use of supervised data for manipulation.\n","authors":["Harman Singh","Poorva Garg","Mohit Gupta","Kevin Shah","Ashish Goswami","Satyam Modi","Arnab Kumar Mondal","Dinesh Khandelwal","Dinesh Garg","Parag Singla"],"pdf_url":"https://arxiv.org/pdf/2305.14410v2.pdf","comment":"EMNLP 2023 (long paper, main conference)"},{"id":"http://arxiv.org/abs/2310.16175v1","updated":"2023-10-24T20:41:04Z","published":"2023-10-24T20:41:04Z","title":"G-CASCADE: Efficient Cascaded Graph Convolutional Decoding for 2D\n Medical Image Segmentation","summary":" In recent years, medical image segmentation has become an important\napplication in the field of computer-aided diagnosis. In this paper, we are the\nfirst to propose a new graph convolution-based decoder namely, Cascaded Graph\nConvolutional Attention Decoder (G-CASCADE), for 2D medical image segmentation.\nG-CASCADE progressively refines multi-stage feature maps generated by\nhierarchical transformer encoders with an efficient graph convolution block.\nThe encoder utilizes the self-attention mechanism to capture long-range\ndependencies, while the decoder refines the feature maps preserving long-range\ninformation due to the global receptive fields of the graph convolution block.\nRigorous evaluations of our decoder with multiple transformer encoders on five\nmedical image segmentation tasks (i.e., Abdomen organs, Cardiac organs, Polyp\nlesions, Skin lesions, and Retinal vessels) show that our model outperforms\nother state-of-the-art (SOTA) methods. We also demonstrate that our decoder\nachieves better DICE scores than the SOTA CASCADE decoder with 80.8% fewer\nparameters and 82.3% fewer FLOPs. 
Our decoder can easily be used with other\nhierarchical encoders for general-purpose semantic and medical image\nsegmentation tasks.\n","authors":["Md Mostafijur Rahman","Radu Marculescu"],"pdf_url":"https://arxiv.org/pdf/2310.16175v1.pdf","comment":"13 pages, IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV 2024)"},{"id":"http://arxiv.org/abs/2310.16167v1","updated":"2023-10-24T20:33:19Z","published":"2023-10-24T20:33:19Z","title":"iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis","summary":" We present a method for generating consistent novel views from a single\nsource image. Our approach focuses on maximizing the reuse of visible pixels\nfrom the source image. To achieve this, we use a monocular depth estimator that\ntransfers visible pixels from the source view to the target view. Starting from\na pre-trained 2D inpainting diffusion model, we train our method on the\nlarge-scale Objaverse dataset to learn 3D object priors. While training we use\na novel masking mechanism based on epipolar lines to further improve the\nquality of our approach. This allows our framework to perform zero-shot novel\nview synthesis on a variety of objects. We evaluate the zero-shot abilities of\nour framework on three challenging datasets: Google Scanned Objects, Ray Traced\nMultiview, and Common Objects in 3D. 
See our webpage for more details:\nhttps://yashkant.github.io/invs/\n","authors":["Yash Kant","Aliaksandr Siarohin","Michael Vasilkovsky","Riza Alp Guler","Jian Ren","Sergey Tulyakov","Igor Gilitschenski"],"pdf_url":"https://arxiv.org/pdf/2310.16167v1.pdf","comment":"Accepted to SIGGRAPH Asia, 2023 (Conference Papers)"},{"id":"http://arxiv.org/abs/2301.08730v3","updated":"2023-10-24T20:19:51Z","published":"2023-01-20T18:49:58Z","title":"Novel-View Acoustic Synthesis","summary":" We introduce the novel-view acoustic synthesis (NVAS) task: given the sight\nand sound observed at a source viewpoint, can we synthesize the sound of that\nscene from an unseen target viewpoint? We propose a neural rendering approach:\nVisually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize\nthe sound of an arbitrary point in space by analyzing the input audio-visual\ncues. To benchmark this task, we collect two first-of-their-kind large-scale\nmulti-view audio-visual datasets, one synthetic and one real. We show that our\nmodel successfully reasons about the spatial cues and synthesizes faithful\naudio on both datasets. To our knowledge, this work represents the very first\nformulation, dataset, and approach to solve the novel-view acoustic synthesis\ntask, which has exciting potential applications ranging from AR/VR to art and\ndesign. Unlocked by this work, we believe that the future of novel-view\nsynthesis is in multi-modal learning from videos.\n","authors":["Changan Chen","Alexander Richard","Roman Shapovalov","Vamsi Krishna Ithapu","Natalia Neverova","Kristen Grauman","Andrea Vedaldi"],"pdf_url":"https://arxiv.org/pdf/2301.08730v3.pdf","comment":"Accepted at CVPR 2023. 
Project page:\n https://vision.cs.utexas.edu/projects/nvas"},{"id":"http://arxiv.org/abs/2310.16161v1","updated":"2023-10-24T20:08:15Z","published":"2023-10-24T20:08:15Z","title":"MyriadAL: Active Few Shot Learning for Histopathology","summary":" Active Learning (AL) and Few Shot Learning (FSL) are two label-efficient\nmethods which have achieved excellent results recently. However, most prior\narts in both learning paradigms fail to explore the wealth of the vast\nunlabelled data. In this study, we address this issue in the scenario where the\nannotation budget is very limited, yet a large amount of unlabelled data for\nthe target task is available. We frame this work in the context of\nhistopathology where labelling is prohibitively expensive. To this end, we\nintroduce an active few shot learning framework, Myriad Active Learning (MAL),\nincluding a contrastive-learning encoder, pseudo-label generation, and novel\nquery sample selection in the loop. Specifically, we propose to massage\nunlabelled data in a self-supervised manner, where the obtained data\nrepresentations and clustering knowledge form the basis to activate the AL\nloop. With feedback from the oracle in each AL cycle, the pseudo-labels of the\nunlabelled data are refined by optimizing a shallow task-specific net on top of\nthe encoder. These updated pseudo-labels serve to inform and improve the active\nlearning query selection process. Furthermore, we introduce a novel recipe to\ncombine existing uncertainty measures and utilize the entire uncertainty list\nto reduce sample redundancy in AL. 
Extensive experiments on two public\nhistopathology datasets show that MAL has superior test accuracy, macro\nF1-score, and label efficiency compared to prior works, and can achieve a\ncomparable test accuracy to a fully supervised algorithm while labelling only\n5% of the dataset.\n","authors":["Nico Schiavone","Jingyi Wang","Shuangzhi Li","Roger Zemp","Xingyu Li"],"pdf_url":"https://arxiv.org/pdf/2310.16161v1.pdf","comment":"9 pages, 2 figures, 6 tables"},{"id":"http://arxiv.org/abs/2310.16148v1","updated":"2023-10-24T19:48:07Z","published":"2023-10-24T19:48:07Z","title":"Yin Yang Convolutional Nets: Image Manifold Extraction by the Analysis\n of Opposites","summary":" Computer vision in general presented several advances such as training\noptimizations, new architectures (pure attention, efficient block, vision\nlanguage models, generative models, among others). This have improved\nperformance in several tasks such as classification, and others. However, the\nmajority of these models focus on modifications that are taking distance from\nrealistic neuroscientific approaches related to the brain. In this work, we\nadopt a more bio-inspired approach and present the Yin Yang Convolutional\nNetwork, an architecture that extracts visual manifold, its blocks are intended\nto separate analysis of colors and forms at its initial layers, simulating\noccipital lobe's operations. Our results shows that our architecture provides\nState-of-the-Art efficiency among low parameter architectures in the dataset\nCIFAR-10. Our first model reached 93.32\\% test accuracy, 0.8\\% more than the\nolder SOTA in this category, while having 150k less parameters (726k in total).\nOur second model uses 52k parameters, losing only 3.86\\% test accuracy. We also\nperformed an analysis on ImageNet, where we reached 66.49\\% validation accuracy\nwith 1.6M parameters. 
We make the code publicly available at:\nhttps://github.com/NoSavedDATA/YinYang_CNN.\n","authors":["Augusto Seben da Rosa","Frederico Santos de Oliveira","Anderson da Silva Soares","Arnaldo Candido Junior"],"pdf_url":"https://arxiv.org/pdf/2310.16148v1.pdf","comment":"12 pages, 5 tables and 6 figures"},{"id":"http://arxiv.org/abs/2310.16139v1","updated":"2023-10-24T19:27:35Z","published":"2023-10-24T19:27:35Z","title":"Pix2HDR -- A pixel-wise acquisition and deep learning-based synthesis\n approach for high-speed HDR videos","summary":" Accurately capturing dynamic scenes with wide-ranging motion and light\nintensity is crucial for many vision applications. However, acquiring\nhigh-speed high dynamic range (HDR) video is challenging because the camera's\nframe rate restricts its dynamic range. Existing methods sacrifice speed to\nacquire multi-exposure frames. Yet, misaligned motion in these frames can still\npose complications for HDR fusion algorithms, resulting in artifacts. Instead\nof frame-based exposures, we sample the videos using individual pixels at\nvarying exposures and phase offsets. Implemented on a pixel-wise programmable\nimage sensor, our sampling pattern simultaneously captures fast motion at a\nhigh dynamic range. We then transform pixel-wise outputs into an HDR video\nusing end-to-end learned weights from deep neural networks, achieving high\nspatiotemporal resolution with minimized motion blurring. We demonstrate\naliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under\nlow-light conditions and against bright backgrounds - both challenging\nconditions for conventional cameras. By combining the versatility of pixel-wise\nsampling patterns with the strength of deep neural networks at decoding complex\nscenes, our method greatly enhances the vision system's adaptability and\nperformance in dynamic conditions.\n","authors":["Caixin Wang","Jie Zhang","Matthew A. 
Wilson","Ralph Etienne-Cummings"],"pdf_url":"https://arxiv.org/pdf/2310.16139v1.pdf","comment":"14 pages, 14 figures"},{"id":"http://arxiv.org/abs/2310.16138v1","updated":"2023-10-24T19:26:07Z","published":"2023-10-24T19:26:07Z","title":"Subtle Signals: Video-based Detection of Infant Non-nutritive Sucking as\n a Neurodevelopmental Cue","summary":" Non-nutritive sucking (NNS), which refers to the act of sucking on a\npacifier, finger, or similar object without nutrient intake, plays a crucial\nrole in assessing healthy early development. In the case of preterm infants,\nNNS behavior is a key component in determining their readiness for feeding. In\nolder infants, the characteristics of NNS behavior offer valuable insights into\nneural and motor development. Additionally, NNS activity has been proposed as a\npotential safeguard against sudden infant death syndrome (SIDS). However, the\nclinical application of NNS assessment is currently hindered by labor-intensive\nand subjective finger-in-mouth evaluations. Consequently, researchers often\nresort to expensive pressure transducers for objective NNS signal measurement.\nTo enhance the accessibility and reliability of NNS signal monitoring for both\nclinicians and researchers, we introduce a vision-based algorithm designed for\nnon-contact detection of NNS activity using baby monitor footage in natural\nsettings. Our approach involves a comprehensive exploration of optical flow and\ntemporal convolutional networks, enabling the detection and amplification of\nsubtle infant-sucking signals. We successfully classify short video clips of\nuniform length into NNS and non-NNS periods. Furthermore, we investigate manual\nand learning-based techniques to piece together local classification results,\nfacilitating the segmentation of longer mixed-activity videos into NNS and\nnon-NNS segments of varying duration. 
Our research introduces two novel\ndatasets of annotated infant videos, including one sourced from our clinical\nstudy featuring 19 infant subjects and 183 hours of overnight baby monitor\nfootage.\n","authors":["Shaotong Zhu","Michael Wan","Sai Kumar Reddy Manne","Emily Zimmerman","Sarah Ostadabbas"],"pdf_url":"https://arxiv.org/pdf/2310.16138v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.04979v2","updated":"2023-10-24T19:16:45Z","published":"2022-10-10T19:27:37Z","title":"Label-free segmentation from cardiac ultrasound using self-supervised\n learning","summary":" Segmentation and measurement of cardiac chambers is critical in cardiac\nultrasound but is laborious and poorly reproducible. Neural networks can\nassist, but supervised approaches require the same laborious manual\nannotations. We built a pipeline for self-supervised (no manual labels)\nsegmentation combining computer vision, clinical domain knowledge, and deep\nlearning. We trained on 450 echocardiograms (93,000 images) and tested on 8,393\nechocardiograms (4,476,266 images; mean 61 years, 51% female), using the\nresulting segmentations to calculate biometrics. We also tested against\nexternal images from an additional 10,030 patients with available manual\ntracings of the left ventricle. r2 between clinically measured and\npipeline-predicted measurements were similar to reported inter-clinician\nvariation and comparable to supervised learning across several different\nmeasurements (r2 0.56-0.84). Average accuracy for detecting abnormal chamber\nsize and function was 0.85 (range 0.71-0.97) compared to clinical measurements.\nA subset of test echocardiograms (n=553) had corresponding cardiac MRIs, where\nMRI is the gold standard. Correlation between pipeline and MRI measurements was\nsimilar to that between clinical echocardiogram and MRI. 
Finally, the pipeline\naccurately segments the left ventricle with an average Dice score of 0.89 (95%\nCI [0.89]) in the external, manually labeled dataset. Our results demonstrate a\nmanual-label free, clinically valid, and highly scalable method for\nsegmentation from ultrasound, a noisy but globally important imaging modality.\n","authors":["Danielle L. Ferreira","Zaynaf Salaymang","Rima Arnaout"],"pdf_url":"https://arxiv.org/pdf/2210.04979v2.pdf","comment":"37 pages, 3 Tables, 7 Figures"},{"id":"http://arxiv.org/abs/2309.04579v2","updated":"2023-10-24T19:13:31Z","published":"2023-09-08T20:14:25Z","title":"EGOFALLS: A visual-audio dataset and benchmark for fall detection using\n egocentric cameras","summary":" Falls are significant and often fatal for vulnerable populations such as the\nelderly. Previous works have addressed the detection of falls by relying on\ndata capture by a single sensor, images or accelerometers. In this work, we\nrely on multimodal descriptors extracted from videos captured by egocentric\ncameras. Our proposed method includes a late decision fusion layer that builds\non top of the extracted descriptors. Furthermore, we collect a new dataset on\nwhich we assess our proposed approach. We believe this is the first public\ndataset of its kind. The dataset comprises 10,948 video samples by 14 subjects.\nWe conducted ablation experiments to assess the performance of individual\nfeature extractors, fusion of visual information, and fusion of both visual and\naudio information. Moreover, we experimented with internal and external\ncross-validation. 
Our results demonstrate that the fusion of audio and visual\ninformation through late decision fusion improves detection performance, making\nit a promising tool for fall prevention and mitigation.\n","authors":["Xueyi Wang"],"pdf_url":"https://arxiv.org/pdf/2309.04579v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16120v1","updated":"2023-10-24T18:48:20Z","published":"2023-10-24T18:48:20Z","title":"Stereoscopic Depth Perception Through Foliage","summary":" Both humans and computational methods struggle to discriminate the depths of\nobjects hidden beneath foliage. However, such discrimination becomes feasible\nwhen we combine computational optical synthetic aperture sensing with the human\nability to fuse stereoscopic images. For object identification tasks, as\nrequired in search and rescue, wildlife observation, surveillance, and early\nwildfire detection, depth assists in differentiating true from false findings,\nsuch as people, animals, or vehicles vs. sun-heated patches at the ground level\nor in the tree crowns, or ground fires vs. tree trunks. We used video captured\nby a drone above dense woodland to test users' ability to discriminate depth.\nWe found that this is impossible when viewing monoscopic video and relying on\nmotion parallax. The same was true with stereoscopic video because of the\nocclusions caused by foliage. However, when synthetic aperture sensing was used\nto reduce occlusions and disparity-scaled stereoscopic video was presented,\nwhereas computational (stereoscopic matching) methods were unsuccessful, human\nobservers successfully discriminated depth. 
This shows the potential of systems\nwhich exploit the synergy between computational methods and human vision to\nperform tasks that neither can perform alone.\n","authors":["Robert Kerschner","Rakesh John Amala Arokia Nathan","Rafal Mantiuk","Oliver Bimber"],"pdf_url":"https://arxiv.org/pdf/2310.16120v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.07287v2","updated":"2023-10-24T18:47:41Z","published":"2022-10-13T18:30:11Z","title":"Improving Deep Learning Models for Pediatric Low-Grade Glioma Tumors\n Molecular Subtype Identification Using 3D Probability Distributions of Tumor\n Location","summary":" Background and Purpose: Pediatric low-grade glioma (pLGG) is the most common\ntype of brain tumor in children, and identification of molecular markers for\npLGG is crucial for successful treatment planning. Convolutional Neural Network\n(CNN) models for pLGG subtype identification rely on tumor segmentation. We\nhypothesize tumor segmentations are suboptimal and thus, we propose to augment\nthe CNN models using tumor location probability in MRI data.\n Materials and Methods: Our REB-approved retrospective study included MRI\nFluid-Attenuated Inversion Recovery (FLAIR) sequences of 143 BRAF fused and 71\nBRAF V600E mutated tumors. Tumor segmentations (regions of interest (ROIs))\nwere provided by a pediatric neuroradiology fellow and verified by a senior\npediatric neuroradiologist. In each experiment, we randomly split the data into\ndevelopment and test with an 80/20 ratio. We combined the 3D binary ROI masks\nfor each class in the development dataset to derive the probability density\nfunctions (PDF) of tumor location, and developed three pipelines:\nlocation-based, CNN-based, and hybrid.\n Results: We repeated the experiment with different model initializations and\ndata splits 100 times and calculated the Area Under Receiver Operating\nCharacteristic Curve (AUC). 
The location-based classifier achieved an AUC of\n77.90, 95% confidence interval (CI) (76.76, 79.03). CNN-based classifiers\nachieved AUC of 86.11, CI (84.96, 87.25), while the tumor-location-guided CNNs\noutperformed the formers with an average AUC of 88.64 CI (87.57, 89.72), which\nwas statistically significant (Student's t-test p-value 0.0018).\n Conclusion: We achieved statistically significant improvements by\nincorporating tumor location into the CNN models. Our results suggest that\nmanually segmented ROIs may not be optimal.\n","authors":["Khashayar Namdar","Matthias W. Wagner","Kareem Kudus","Cynthia Hawkins","Uri Tabori","Brigit Ertl-Wagner","Farzad Khalvati"],"pdf_url":"https://arxiv.org/pdf/2210.07287v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2207.14776"},{"id":"http://arxiv.org/abs/2207.14776v2","updated":"2023-10-24T18:41:44Z","published":"2022-07-29T16:37:46Z","title":"Open-radiomics: A Collection of Standardized Datasets and a Technical\n Protocol for Reproducible Radiomics Machine Learning Pipelines","summary":" Purpose: As an important branch of machine learning pipelines in medical\nimaging, radiomics faces two major challenges namely reproducibility and\naccessibility. In this work, we introduce open-radiomics, a set of radiomics\ndatasets along with a comprehensive radiomics pipeline based on our proposed\ntechnical protocol to investigate the effects of radiomics feature extraction\non the reproducibility of the results.\n Materials and Methods: Experiments are conducted on BraTS 2020 open-source\nMagnetic Resonance Imaging (MRI) dataset that includes 369 adult patients with\nbrain tumors (76 low-grade glioma (LGG), and 293 high-grade glioma (HGG)).\nUsing PyRadiomics library for LGG vs. 
HGG classification, 288 radiomics\ndatasets are formed; the combinations of 4 MRI sequences, 3 binWidths, 6 image\nnormalization methods, and 4 tumor subregions.\n Random Forest classifiers were used, and for each radiomics dataset the\ntraining-validation-test (60%/20%/20%) experiment with different data splits\nand model random states was repeated 100 times (28,800 test results) and Area\nUnder Receiver Operating Characteristic Curve (AUC) was calculated.\n Results: Unlike binWidth and image normalization, tumor subregion and imaging\nsequence significantly affected performance of the models. T1 contrast-enhanced\nsequence and the union of necrotic and the non-enhancing tumor core subregions\nresulted in the highest AUCs (average test AUC 0.951, 95% confidence interval\nof (0.949, 0.952)). Although 28 settings and data splits yielded test AUC of 1,\nthey were irreproducible.\n Conclusion: Our experiments demonstrate the sources of variability in\nradiomics pipelines (e.g., tumor subregion) can have a significant impact on\nthe results, which may lead to superficial perfect performances that are\nirreproducible.\n","authors":["Khashayar Namdar","Matthias W. Wagner","Birgit B. Ertl-Wagner","Farzad Khalvati"],"pdf_url":"https://arxiv.org/pdf/2207.14776v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16115v1","updated":"2023-10-24T18:32:46Z","published":"2023-10-24T18:32:46Z","title":"Wakening Past Concepts without Past Data: Class-Incremental Learning\n from Online Placebos","summary":" Not forgetting old class knowledge is a key challenge for class-incremental\nlearning (CIL) when the model continuously adapts to new classes. A common\ntechnique to address this is knowledge distillation (KD), which penalizes\nprediction inconsistencies between old and new models. Such prediction is made\nwith almost new class data, as old class data is extremely scarce due to the\nstrict memory limitation in CIL. 
In this paper, we take a deep dive into KD\nlosses and find that \"using new class data for KD\" not only hinders the model\nadaption (for learning new classes) but also results in low efficiency for\npreserving old class knowledge. We address this by \"using the placebos of old\nclasses for KD\", where the placebos are chosen from a free image stream, such\nas Google Images, in an automatical and economical fashion. To this end, we\ntrain an online placebo selection policy to quickly evaluate the quality of\nstreaming images (good or bad placebos) and use only good ones for one-time\nfeed-forward computation of KD. We formulate the policy training process as an\nonline Markov Decision Process (MDP), and introduce an online learning\nalgorithm to solve this MDP problem without causing much computation costs. In\nexperiments, we show that our method 1) is surprisingly effective even when\nthere is no class overlap between placebos and original old class data, 2) does\nnot require any additional supervision or memory budget, and 3) significantly\noutperforms a number of top-performing CIL methods, in particular when using\nlower memory budgets for old class exemplars, e.g., five exemplars per class.\n","authors":["Yaoyao Liu","Yingying Li","Bernt Schiele","Qianru Sun"],"pdf_url":"https://arxiv.org/pdf/2310.16115v1.pdf","comment":"Accepted to WACV 2024. Code:\n https://github.com/yaoyao-liu/online-placebos"},{"id":"http://arxiv.org/abs/2310.16112v1","updated":"2023-10-24T18:26:22Z","published":"2023-10-24T18:26:22Z","title":"Towards long-tailed, multi-label disease classification from chest\n X-ray: Overview of the CXR-LT challenge","summary":" Many real-world image recognition problems, such as diagnostic medical\nimaging exams, are \"long-tailed\" $\\unicode{x2013}$ there are a few common\nfindings followed by many more relatively rare conditions. 
In chest\nradiography, diagnosis is both a long-tailed and multi-label problem, as\npatients often present with multiple findings simultaneously. While researchers\nhave begun to study the problem of long-tailed learning in medical image\nrecognition, few have studied the interaction of label imbalance and label\nco-occurrence posed by long-tailed, multi-label disease classification. To\nengage with the research community on this emerging topic, we conducted an open\nchallenge, CXR-LT, on long-tailed, multi-label thorax disease classification\nfrom chest X-rays (CXRs). We publicly release a large-scale benchmark dataset\nof over 350,000 CXRs, each labeled with at least one of 26 clinical findings\nfollowing a long-tailed distribution. We synthesize common themes of\ntop-performing solutions, providing practical recommendations for long-tailed,\nmulti-label medical image classification. Finally, we use these insights to\npropose a path forward involving vision-language foundation models for few- and\nzero-shot disease classification.\n","authors":["Gregory Holste","Yiliang Zhou","Song Wang","Ajay Jaiswal","Mingquan Lin","Sherry Zhuge","Yuzhe Yang","Dongkyun Kim","Trong-Hieu Nguyen-Mau","Minh-Triet Tran","Jaehyup Jeong","Wongi Park","Jongbin Ryu","Feng Hong","Arsh Verma","Yosuke Yamagishi","Changhyun Kim","Hyeryeong Seo","Myungjoo Kang","Leo Anthony Celi","Zhiyong Lu","Ronald M. Summers","George Shih","Zhangyang Wang","Yifan Peng"],"pdf_url":"https://arxiv.org/pdf/2310.16112v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16109v1","updated":"2023-10-24T18:21:03Z","published":"2023-10-24T18:21:03Z","title":"Complex Image Generation SwinTransformer Network for Audio Denoising","summary":" Achieving high-performance audio denoising is still a challenging task in\nreal-world applications. Existing time-frequency methods often ignore the\nquality of generated frequency domain images. This paper converts the audio\ndenoising problem into an image generation task. 
We first develop a complex\nimage generation SwinTransformer network to capture more information from the\ncomplex Fourier domain. We then impose structure similarity and detailed loss\nfunctions to generate high-quality images and develop an SDR loss to minimize\nthe difference between denoised and clean audios. Extensive experiments on two\nbenchmark datasets demonstrate that our proposed model is better than\nstate-of-the-art methods.\n","authors":["Youshan Zhang","Jialu Li"],"pdf_url":"https://arxiv.org/pdf/2310.16109v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16103v1","updated":"2023-10-24T18:11:25Z","published":"2023-10-24T18:11:25Z","title":"LaksNet: an end-to-end deep learning model for self-driving cars in\n Udacity simulator","summary":" The majority of road accidents occur because of human errors, including\ndistraction, recklessness, and drunken driving. One of the effective ways to\novercome this dangerous situation is by implementing self-driving technologies\nin vehicles. In this paper, we focus on building an efficient deep-learning\nmodel for self-driving cars. We propose a new and effective convolutional\nneural network model called `LaksNet' consisting of four convolutional layers\nand two fully connected layers. We conduct extensive experiments using our\nLaksNet model with the training data generated from the Udacity simulator. Our\nmodel outperforms many existing pre-trained ImageNet and NVIDIA models in terms\nof the duration of the car for which it drives without going off the track on\nthe simulator.\n","authors":["Lakshmikar R. 
Polamreddy","Youshan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.16103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16102v1","updated":"2023-10-24T18:06:03Z","published":"2023-10-24T18:06:03Z","title":"Learned, Uncertainty-driven Adaptive Acquisition for Photon-Efficient\n Multiphoton Microscopy","summary":" Multiphoton microscopy (MPM) is a powerful imaging tool that has been a\ncritical enabler for live tissue imaging. However, since most multiphoton\nmicroscopy platforms rely on point scanning, there is an inherent trade-off\nbetween acquisition time, field of view (FOV), phototoxicity, and image\nquality, often resulting in noisy measurements when fast, large FOV, and/or\ngentle imaging is needed. Deep learning could be used to denoise multiphoton\nmicroscopy measurements, but these algorithms can be prone to hallucination,\nwhich can be disastrous for medical and scientific applications. We propose a\nmethod to simultaneously denoise and predict pixel-wise uncertainty for\nmultiphoton imaging measurements, improving algorithm trustworthiness and\nproviding statistical guarantees for the deep learning predictions.\nFurthermore, we propose to leverage this learned, pixel-wise uncertainty to\ndrive an adaptive acquisition technique that rescans only the most uncertain\nregions of a sample. We demonstrate our method on experimental noisy MPM\nmeasurements of human endometrium tissues, showing that we can maintain fine\nfeatures and outperform other denoising methods while predicting uncertainty at\neach pixel. Finally, with our adaptive acquisition technique, we demonstrate a\n120X reduction in acquisition time and total light dose while successfully\nrecovering fine features in the sample. 
We are the first to demonstrate\ndistribution-free uncertainty quantification for a denoising task with real\nexperimental data and the first to propose adaptive acquisition based on\nreconstruction uncertainty.\n","authors":["Cassandra Tong Ye","Jiashu Han","Kunzan Liu","Anastasios Angelopoulos","Linda Griffith","Kristina Monakhova","Sixian You"],"pdf_url":"https://arxiv.org/pdf/2310.16102v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16100v1","updated":"2023-10-24T18:04:53Z","published":"2023-10-24T18:04:53Z","title":"Deep Feature Registration for Unsupervised Domain Adaptation","summary":" While unsupervised domain adaptation has been explored to leverage the\nknowledge from a labeled source domain to an unlabeled target domain, existing\nmethods focus on the distribution alignment between two domains. However, how\nto better align source and target features is not well addressed. In this\npaper, we propose a deep feature registration (DFR) model to generate\nregistered features that maintain domain invariant features and simultaneously\nminimize the domain-dissimilarity of registered features and target features\nvia histogram matching. We further employ a pseudo label refinement process,\nwhich considers both probabilistic soft selection and center-based hard\nselection to improve the quality of pseudo labels in the target domain.\nExtensive experiments on multiple UDA benchmarks demonstrate the effectiveness\nof our DFR model, resulting in new state-of-the-art performance.\n","authors":["Youshan Zhang","Brian D. Davison"],"pdf_url":"https://arxiv.org/pdf/2310.16100v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16099v1","updated":"2023-10-24T18:03:07Z","published":"2023-10-24T18:03:07Z","title":"Anatomically-aware Uncertainty for Semi-supervised Image Segmentation","summary":" Semi-supervised learning relaxes the need of large pixel-wise labeled\ndatasets for image segmentation by leveraging unlabeled data. 
A prominent way\nto exploit unlabeled data is to regularize model predictions. Since the\npredictions of unlabeled data can be unreliable, uncertainty-aware schemes are\ntypically employed to gradually learn from meaningful and reliable predictions.\nUncertainty estimation methods, however, rely on multiple inferences from the\nmodel predictions that must be computed for each training step, which is\ncomputationally expensive. Moreover, these uncertainty maps capture pixel-wise\ndisparities and do not consider global information. This work proposes a novel\nmethod to estimate segmentation uncertainty by leveraging global information\nfrom the segmentation masks. More precisely, an anatomically-aware\nrepresentation is first learnt to model the available segmentation masks. The\nlearnt representation thereupon maps the prediction of a new segmentation into\nan anatomically-plausible segmentation. The deviation from the plausible\nsegmentation aids in estimating the underlying pixel-level uncertainty in order\nto further guide the segmentation network. The proposed method consequently\nestimates the uncertainty using a single inference from our representation,\nthereby reducing the total computation. We evaluate our method on two publicly\navailable segmentation datasets of left atria in cardiac MRIs and of multiple\norgans in abdominal CTs. Our anatomically-aware method improves the\nsegmentation accuracy over the state-of-the-art semi-supervised methods in\nterms of two commonly used evaluation metrics.\n","authors":["Sukesh Adiga V","Jose Dolz","Herve Lombaert"],"pdf_url":"https://arxiv.org/pdf/2310.16099v1.pdf","comment":"Accepted at Medical Image Analysis. 
Code is available at:\n $\\href{https://github.com/adigasu/Anatomically-aware_Uncertainty_for_Semi-supervised_Segmentation}{Github}$"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.15950v1","updated":"2023-10-24T15:51:13Z","published":"2023-10-24T15:51:13Z","title":"Representation Learning with Large Language Models for Recommendation","summary":" Recommender systems have seen significant advancements with the influence of\ndeep learning and graph neural networks, particularly in capturing complex\nuser-item relationships. However, these graph-based recommenders heavily depend\non ID-based data, potentially disregarding valuable textual information\nassociated with users and items, resulting in less informative learned\nrepresentations. Moreover, the utilization of implicit feedback data introduces\npotential noise and bias, posing challenges for the effectiveness of user\npreference learning. While the integration of large language models (LLMs) into\ntraditional ID-based recommenders has gained attention, challenges such as\nscalability issues, limitations in text-only reliance, and prompt input\nconstraints need to be addressed for effective implementation in practical\nrecommender systems. To address these challenges, we propose a model-agnostic\nframework RLMRec that aims to enhance existing recommenders with LLM-empowered\nrepresentation learning. It proposes a recommendation paradigm that integrates\nrepresentation learning with LLMs to capture intricate semantic aspects of user\nbehaviors and preferences. RLMRec incorporates auxiliary textual signals,\ndevelops a user/item profiling paradigm empowered by LLMs, and aligns the\nsemantic space of LLMs with the representation space of collaborative\nrelational signals through a cross-view alignment framework. This work further\nestablishes a theoretical foundation demonstrating that incorporating textual\nsignals through mutual information maximization enhances the quality of\nrepresentations. 
In our evaluation, we integrate RLMRec with state-of-the-art\nrecommender models, while also analyzing its efficiency and robustness to noise\ndata. Our implementation codes are available at\nhttps://github.com/HKUDS/RLMRec.\n","authors":["Xubin Ren","Wei Wei","Lianghao Xia","Lixin Su","Suqi Cheng","Junfeng Wang","Dawei Yin","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.15950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15858v1","updated":"2023-10-24T14:16:19Z","published":"2023-10-24T14:16:19Z","title":"Topology-aware Debiased Self-supervised Graph Learning for\n Recommendation","summary":" In recommendation, graph-based Collaborative Filtering (CF) methods mitigate\nthe data sparsity by introducing Graph Contrastive Learning (GCL). However, the\nrandom negative sampling strategy in these GCL-based CF models neglects the\nsemantic structure of users (items), which not only introduces false negatives\n(negatives that are similar to anchor user (item)) but also ignores the\npotential positive samples. To tackle the above issues, we propose\nTopology-aware Debiased Self-supervised Graph Learning (TDSGL) for\nrecommendation, which constructs contrastive pairs according to the semantic\nsimilarity between users (items). Specifically, since the original user-item\ninteraction data commendably reflects the purchasing intent of users and\ncertain characteristics of items, we calculate the semantic similarity between\nusers (items) on interaction data. Then, given a user (item), we construct its\nnegative pairs by selecting users (items) which embed different semantic\nstructures to ensure the semantic difference between the given user (item) and\nits negatives. Moreover, for a user (item), we design a feature extraction\nmodule that converts other semantically similar users (items) into an auxiliary\npositive sample to acquire a more informative representation. 
Experimental\nresults show that the proposed model outperforms the state-of-the-art models\nsignificantly on three public datasets. Our model implementation codes are\navailable at https://github.com/malajikuai/TDSGL.\n","authors":["Lei Han","Hui Yan","Zhicheng Qiao"],"pdf_url":"https://arxiv.org/pdf/2310.15858v1.pdf","comment":"6 pages,8 figures"},{"id":"http://arxiv.org/abs/2310.15790v1","updated":"2023-10-24T12:39:30Z","published":"2023-10-24T12:39:30Z","title":"A statistical significance testing approach for measuring term\n burstiness with applications to domain-specific terminology extraction","summary":" Domain-specific terminology extraction is an important task in text analysis.\nA term in a corpus is said to be \"bursty\" when its occurrences are concentrated\nin few out of many documents. Being content rich, bursty terms are highly\nsuited for subject matter characterization, and serve as natural candidates for\nidentifying with technical terminology. Multiple measures of term burstiness\nhave been proposed in the literature. However, the statistical significance\ntesting paradigm has remained underexplored in text analysis, including in\nrelation to term burstiness. To test these waters, we propose as our main\ncontribution a multinomial language model-based exact test of statistical\nsignificance for term burstiness. Due to its prohibitive computational cost, we\nadvance a heuristic formula designed to serve as a proxy for test P-values. As\na complementary theoretical contribution, we derive a previously unreported\nrelationship connecting the inverse document frequency and inverse collection\nfrequency (two foundational quantities in text analysis) under the multinomial\nlanguage model. The relation is used in the evaluation of our heuristic. Using\nthe GENIA Term corpus benchmark, we compare our approach against established\nmethods, demonstrating our heuristic's potential in identifying domain-specific\ntechnical terms. 
We hope this demonstration of statistical significance testing\nin text analysis serves as a springboard for future research.\n","authors":["Samuel Sarria Hurtado","Todd Mullen","Taku Onodera","Paul Sheridan"],"pdf_url":"https://arxiv.org/pdf/2310.15790v1.pdf","comment":"23 pages, 1 figure, 6 tables"},{"id":"http://arxiv.org/abs/2308.05379v4","updated":"2023-10-24T08:49:01Z","published":"2023-08-10T06:52:53Z","title":"Beyond Semantics: Learning a Behavior Augmented Relevance Model with\n Self-supervised Learning","summary":" Relevance modeling aims to locate desirable items for corresponding queries,\nwhich is crucial for search engines to ensure user experience. Although most\nconventional approaches address this problem by assessing the semantic\nsimilarity between the query and item, pure semantic matching is not\neverything. In reality, auxiliary query-item interactions extracted from user\nhistorical behavior data of the search log could provide hints to reveal users'\nsearch intents further. Drawing inspiration from this, we devise a novel\nBehavior Augmented Relevance Learning model for Alipay Search (BARL-ASe) that\nleverages neighbor queries of target item and neighbor items of target query to\ncomplement target query-item semantic matching. Specifically, our model builds\nmulti-level co-attention for distilling coarse-grained and fine-grained\nsemantic representations from both neighbor and target views. The model\nsubsequently employs neighbor-target self-supervised learning to improve the\naccuracy and robustness of BARL-ASe by strengthening representation and logit\nlearning. Furthermore, we discuss how to deal with the long-tail query-item\nmatching of the mini apps search scenario of Alipay practically. 
Experiments on\nreal-world industry data and online A/B testing demonstrate our proposal\nachieves promising performance with low latency.\n","authors":["Zeyuan Chen","Wei Chen","Jia Xu","Zhongyi Liu","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.05379v4.pdf","comment":"Accepted by CIKM2023"},{"id":"http://arxiv.org/abs/2310.09706v2","updated":"2023-10-24T08:48:06Z","published":"2023-10-15T02:19:28Z","title":"AdaptSSR: Pre-training User Model with Augmentation-Adaptive\n Self-Supervised Ranking","summary":" User modeling, which aims to capture users' characteristics or interests,\nheavily relies on task-specific labeled data and suffers from the data sparsity\nissue. Several recent studies tackled this problem by pre-training the user\nmodel on massive user behavior sequences with a contrastive learning task.\nGenerally, these methods assume different views of the same behavior sequence\nconstructed via data augmentation are semantically consistent, i.e., reflecting\nsimilar characteristics or interests of the user, and thus maximizing their\nagreement in the feature space. However, due to the diverse interests and heavy\nnoise in user behaviors, existing augmentation methods tend to lose certain\ncharacteristics of the user or introduce noisy behaviors. Thus, forcing the\nuser model to directly maximize the similarity between the augmented views may\nresult in a negative transfer. To this end, we propose to replace the\ncontrastive learning task with a new pretext task: Augmentation-Adaptive\nSelfSupervised Ranking (AdaptSSR), which alleviates the requirement of semantic\nconsistency between the augmented views while pre-training a discriminative\nuser model. Specifically, we adopt a multiple pairwise ranking loss which\ntrains the user model to capture the similarity orders between the implicitly\naugmented view, the explicitly augmented view, and views from other users. 
We\nfurther employ an in-batch hard negative sampling strategy to facilitate model\ntraining. Moreover, considering the distinct impacts of data augmentation on\ndifferent behavior sequences, we design an augmentation-adaptive fusion\nmechanism to automatically adjust the similarity order constraint applied to\neach sample based on the estimated similarity between the augmented views.\nExtensive experiments on both public and industrial datasets with six\ndownstream tasks verify the effectiveness of AdaptSSR.\n","authors":["Yang Yu","Qi Liu","Kai Zhang","Yuren Zhang","Chao Song","Min Hou","Yuqing Yuan","Zhihao Ye","Zaixi Zhang","Sanshi Lei Yu"],"pdf_url":"https://arxiv.org/pdf/2310.09706v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.10933v3","updated":"2023-10-24T07:42:09Z","published":"2023-06-19T13:44:48Z","title":"Towards Open-World Recommendation with Knowledge Augmentation from Large\n Language Models","summary":" Recommender systems play a vital role in various online services. However,\nthe insulated nature of training and deploying separately within a specific\ndomain limits their access to open-world knowledge. Recently, the emergence of\nlarge language models (LLMs) has shown promise in bridging this gap by encoding\nextensive world knowledge and demonstrating reasoning capability. Nevertheless,\nprevious attempts to directly use LLMs as recommenders have not achieved\nsatisfactory results. In this work, we propose an Open-World Knowledge\nAugmented Recommendation Framework with Large Language Models, dubbed KAR, to\nacquire two types of external knowledge from LLMs -- the reasoning knowledge on\nuser preferences and the factual knowledge on items. We introduce factorization\nprompting to elicit accurate reasoning on user preferences. 
The generated\nreasoning and factual knowledge are effectively transformed and condensed into\naugmented vectors by a hybrid-expert adaptor in order to be compatible with the\nrecommendation task. The obtained vectors can then be directly used to enhance\nthe performance of any recommendation model. We also ensure efficient inference\nby preprocessing and prestoring the knowledge from the LLM. Extensive\nexperiments show that KAR significantly outperforms the state-of-the-art\nbaselines and is compatible with a wide range of recommendation algorithms.\n","authors":["Yunjia Xi","Weiwen Liu","Jianghao Lin","Jieming Zhu","Bo Chen","Ruiming Tang","Weinan Zhang","Rui Zhang","Yong Yu"],"pdf_url":"https://arxiv.org/pdf/2306.10933v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15556v1","updated":"2023-10-24T06:56:38Z","published":"2023-10-24T06:56:38Z","title":"TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for\n Inference Cost Reduction","summary":" Since ChatGPT released its API for public use, the number of applications\nbuilt on top of commercial large language models (LLMs) has increased exponentially.\nOne popular usage of such models is leveraging their in-context learning ability\nand generating responses given user queries leveraging knowledge obtained by\nretrieval augmentation. One problem of deploying commercial retrieval-augmented\nLLMs is the cost due to the additionally retrieved context that largely\nincreases the input token size of the LLMs. To mitigate this, we propose a\ntoken compression scheme that includes two methods: summarization compression\nand semantic compression. The first method applies a T5-based model that is\nfine-tuned on datasets generated using self-instruct containing samples with\nvarying lengths and reduces token size through summarization. The second method\nfurther compresses the token size by removing words with lower impact on the\nsemantics. 
In order to adequately evaluate the effectiveness of the proposed\nmethods, we propose and utilize a dataset called Food-Recommendation DB (FRDB)\nfocusing on food recommendation for women around the pregnancy period or infants.\nOur summarization compression can reduce the retrieval token size by 65% with\na further 0.3% improvement in accuracy; semantic compression provides a more\nflexible way to trade off the token size with performance, for which we can\nreduce the token size by 20% with only a 1.6% accuracy drop.\n","authors":["Junyi Liu","Liangzhi Li","Tong Xiang","Bowen Wang","Yiming Qian"],"pdf_url":"https://arxiv.org/pdf/2310.15556v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.15511v1","updated":"2023-10-24T04:40:38Z","published":"2023-10-24T04:40:38Z","title":"KITAB: Evaluating LLMs on Constraint Satisfaction for Information\n Retrieval","summary":" We study the ability of state-of-the-art models to answer constraint\nsatisfaction queries for information retrieval (e.g., 'a list of ice cream\nshops in San Diego'). In the past, such queries were considered to be tasks\nthat could only be solved via web-search or knowledge bases. More recently,\nlarge language models (LLMs) have demonstrated initial emergent abilities in\nthis task. However, many current retrieval benchmarks are either saturated or\ndo not measure constraint satisfaction. Motivated by rising concerns around\nfactual incorrectness and hallucinations of LLMs, we present KITAB, a new\ndataset for measuring constraint satisfaction abilities of language models.\nKITAB consists of book-related data across more than 600 authors and 13,000\nqueries, and also offers an associated dynamic data collection and constraint\nverification approach for acquiring similar test data for other authors. 
Our\nextended experiments on GPT4 and GPT3.5 characterize and decouple common\nfailure modes across dimensions such as information popularity, constraint\ntypes, and context availability. Results show that in the absence of context,\nmodels exhibit severe limitations as measured by irrelevant information,\nfactual errors, and incompleteness, many of which exacerbate as information\npopularity decreases. While context availability mitigates irrelevant\ninformation, it is not helpful for satisfying constraints, identifying\nfundamental barriers to constraint satisfaction. We open source our\ncontributions to foster further research on improving constraint satisfaction\nabilities of future models.\n","authors":["Marah I Abdin","Suriya Gunasekar","Varun Chandrasekaran","Jerry Li","Mert Yuksekgonul","Rahee Ghosh Peshawaria","Ranjita Naik","Besmira Nushi"],"pdf_url":"https://arxiv.org/pdf/2310.15511v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2210.03116v4","updated":"2023-10-24T04:26:10Z","published":"2022-10-06T17:59:51Z","title":"Content-Based Search for Deep Generative Models","summary":" The growing proliferation of customized and pretrained generative models has\nmade it infeasible for a user to be fully cognizant of every model in\nexistence. To address this need, we introduce the task of content-based model\nsearch: given a query and a large set of generative models, finding the models\nthat best match the query. As each generative model produces a distribution of\nimages, we formulate the search task as an optimization problem to select the\nmodel with the highest probability of generating similar content as the query.\nWe introduce a formulation to approximate this probability given the query from\ndifferent modalities, e.g., image, sketch, and text. Furthermore, we propose a\ncontrastive learning framework for model retrieval, which learns to adapt\nfeatures for various query modalities. 
We demonstrate that our method\noutperforms several baselines on Generative Model Zoo, a new benchmark we\ncreate for the model retrieval task.\n","authors":["Daohan Lu","Sheng-Yu Wang","Nupur Kumari","Rohan Agarwal","Mia Tang","David Bau","Jun-Yan Zhu"],"pdf_url":"https://arxiv.org/pdf/2210.03116v4.pdf","comment":"Our project page is hosted at\n https://generative-intelligence-lab.github.io/modelverse/"},{"id":"http://arxiv.org/abs/2310.15492v1","updated":"2023-10-24T03:42:20Z","published":"2023-10-24T03:42:20Z","title":"Robust Representation Learning for Unified Online Top-K Recommendation","summary":" In large-scale industrial e-commerce, the efficiency of an online\nrecommendation system is crucial in delivering highly relevant item/content\nadvertising that caters to diverse business scenarios. However, most existing\nstudies focus solely on item advertising, neglecting the significance of\ncontent advertising. This oversight results in inconsistencies within the\nmulti-entity structure and unfair retrieval. Furthermore, the challenge of\nretrieving top-k advertisements from multi-entity advertisements across\ndifferent domains adds to the complexity. Recent research proves that\nuser-entity behaviors within different domains exhibit characteristics of\ndifferentiation and homogeneity. Therefore, the multi-domain matching models\ntypically rely on the hybrid-experts framework with domain-invariant and\ndomain-specific representations. Unfortunately, most approaches primarily focus\non optimizing the combination mode of different experts, failing to address the\ninherent difficulty in optimizing the expert modules themselves. The existence\nof redundant information across different domains introduces interference and\ncompetition among experts, while the distinct learning objectives of each\ndomain lead to varying optimization challenges among experts. 
To tackle these\nissues, we propose robust representation learning for the unified online top-k\nrecommendation. Our approach constructs unified modeling in entity space to\nensure data fairness. The robust representation learning employs domain\nadversarial learning and multi-view Wasserstein distribution learning to learn\nrobust representations. Moreover, the proposed method balances conflicting\nobjectives through the homoscedastic uncertainty weights and orthogonality\nconstraints. Various experiments validate the effectiveness and rationality of\nour proposed method, which has been successfully deployed online to serve real\nbusiness scenarios.\n","authors":["Minfang Lu","Yuchen Jiang","Huihui Dong","Qi Li","Ziru Xu","Yuanlin Liu","Lixia Wu","Haoyuan Hu","Han Zhu","Yuning Jiang","Jian Xu","Bo Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.15492v1.pdf","comment":"14 pages, 6 figures, submitted to ICDE"},{"id":"http://arxiv.org/abs/2310.14512v2","updated":"2023-10-24T02:45:55Z","published":"2023-10-23T02:47:27Z","title":"CorefPrompt: Prompt-based Event Coreference Resolution by Measuring\n Event Type and Argument Compatibilities","summary":" Event coreference resolution (ECR) aims to group event mentions referring to\nthe same real-world event into clusters. Most previous studies adopt the\n\"encoding first, then scoring\" framework, making the coreference judgment rely\non event encoding. Furthermore, current methods struggle to leverage\nhuman-summarized ECR rules, e.g., coreferential events should have the same\nevent type, to guide the model. To address these two issues, we propose a\nprompt-based approach, CorefPrompt, to transform ECR into a cloze-style MLM\n(masked language model) task. This allows for simultaneous event modeling and\ncoreference discrimination within a single template, with a fully shared\ncontext. 
In addition, we introduce two auxiliary prompt tasks, event-type\ncompatibility and argument compatibility, to explicitly demonstrate the\nreasoning process of ECR, which helps the model make final predictions.\nExperimental results show that our method CorefPrompt performs well in a\nstate-of-the-art (SOTA) benchmark.\n","authors":["Sheng Xu","Peifeng Li","Qiaoming Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.14512v2.pdf","comment":"Accepted by EMNLP2023"},{"id":"http://arxiv.org/abs/2310.15433v1","updated":"2023-10-24T01:00:01Z","published":"2023-10-24T01:00:01Z","title":"Off-Policy Evaluation for Large Action Spaces via Policy Convolution","summary":" Developing accurate off-policy estimators is crucial for both evaluating and\noptimizing for new policies. The main challenge in off-policy estimation is the\ndistribution shift between the logging policy that generates data and the\ntarget policy that we aim to evaluate. Typically, techniques for correcting\ndistribution shift involve some form of importance sampling. This approach\nresults in unbiased value estimation but often comes with the trade-off of high\nvariance, even in the simpler case of one-step contextual bandits. Furthermore,\nimportance sampling relies on the common support assumption, which becomes\nimpractical when the action space is large. To address these challenges, we\nintroduce the Policy Convolution (PC) family of estimators. These methods\nleverage latent structure within actions -- made available through action\nembeddings -- to strategically convolve the logging and target policies. This\nconvolution introduces a unique bias-variance trade-off, which can be\ncontrolled by adjusting the amount of convolution. 
Our experiments on synthetic\nand benchmark datasets demonstrate remarkable mean squared error (MSE)\nimprovements when using PC, especially when either the action space or policy\nmismatch becomes large, with gains of up to 5 - 6 orders of magnitude over\nexisting estimators.\n","authors":["Noveen Sachdeva","Lequn Wang","Dawen Liang","Nathan Kallus","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2310.15433v1.pdf","comment":"Under review. 36 pages, 31 figures"},{"id":"http://arxiv.org/abs/2305.11430v2","updated":"2023-10-24T22:50:02Z","published":"2023-05-19T04:59:34Z","title":"TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks","summary":" While LLMs have shown great success in understanding and generating text in\ntraditional conversational settings, their potential for performing ill-defined\ncomplex tasks is largely under-studied. Indeed, we are yet to conduct\ncomprehensive benchmarking studies with multiple LLMs that are exclusively\nfocused on a complex task. However, conducting such benchmarking studies is\nchallenging because of the large variations in LLMs' performance when different\nprompt types/styles are used and different degrees of detail are provided in\nthe prompts. To address this issue, the paper proposes a general taxonomy that\ncan be used to design prompts with specific properties in order to perform a\nwide range of complex tasks. This taxonomy will allow future benchmarking\nstudies to report the specific categories of prompts used as part of the study,\nenabling meaningful comparisons across different studies. 
Also, by establishing\na common standard through this taxonomy, researchers will be able to draw more\naccurate conclusions about LLMs' performance on a specific complex task.\n","authors":["Shubhra Kanti Karmaker Santu","Dongji Feng"],"pdf_url":"https://arxiv.org/pdf/2305.11430v2.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16157v1","updated":"2023-10-24T20:02:02Z","published":"2023-10-24T20:02:02Z","title":"Context-aware feature attribution through argumentation","summary":" Feature attribution is a fundamental task in both machine learning and data\nanalysis, which involves determining the contribution of individual features or\nvariables to a model's output. This process helps identify the most important\nfeatures for predicting an outcome. The history of feature attribution methods\ncan be traced back to General Additive Models (GAMs), which extend linear\nregression models by incorporating non-linear relationships between dependent\nand independent variables. In recent years, gradient-based methods and\nsurrogate models have been applied to unravel complex Artificial Intelligence\n(AI) systems, but these methods have limitations. GAMs tend to achieve lower\naccuracy, gradient-based methods can be difficult to interpret, and surrogate\nmodels often suffer from stability and fidelity issues. Furthermore, most\nexisting methods do not consider users' contexts, which can significantly\ninfluence their preferences. To address these limitations and advance the\ncurrent state-of-the-art, we define a novel feature attribution framework\ncalled Context-Aware Feature Attribution Through Argumentation (CA-FATA). Our\nframework harnesses the power of argumentation by treating each feature as an\nargument that can either support, attack or neutralize a prediction.\nAdditionally, CA-FATA formulates feature attribution as an argumentation\nprocedure, and each computation has explicit semantics, which makes it\ninherently interpretable. 
CA-FATA also easily integrates side information, such\nas users' contexts, resulting in more accurate predictions.\n","authors":["Jinfeng Zhong","Elsa Negre"],"pdf_url":"https://arxiv.org/pdf/2310.16157v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16146v1","updated":"2023-10-24T19:43:39Z","published":"2023-10-24T19:43:39Z","title":"Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model\n System for Answering Medical Questions using Scientific Literature","summary":" The quickly-expanding nature of published medical literature makes it\nchallenging for clinicians and researchers to keep up with and summarize\nrecent, relevant findings in a timely manner. While several closed-source\nsummarization tools based on large language models (LLMs) now exist, rigorous\nand systematic evaluations of their outputs are lacking. Furthermore, there is\na paucity of high-quality datasets and appropriate benchmark tasks with which\nto evaluate these tools. We address these issues with four contributions: we\nrelease Clinfo.ai, an open-source WebApp that answers clinical questions based\non dynamically retrieved scientific literature; we specify an information\nretrieval and abstractive summarization task to evaluate the performance of\nsuch retrieval-augmented LLM systems; we release a dataset of 200 questions and\ncorresponding answers derived from published systematic reviews, which we name\nPubMed Retrieval and Synthesis (PubMedRS-200); and report benchmark results for\nClinfo.ai and other publicly available OpenQA systems on PubMedRS-200.\n","authors":["Alejandro Lozano","Scott L Fleming","Chia-Chun Chiang","Nigam Shah"],"pdf_url":"https://arxiv.org/pdf/2310.16146v1.pdf","comment":"Preprint of an article published in Pacific Symposium on Biocomputing\n copyright 2024 World Scientific Publishing Co., Singapore,\n 
http://psb.stanford.edu/"},{"id":"http://arxiv.org/abs/2310.16141v1","updated":"2023-10-24T19:30:31Z","published":"2023-10-24T19:30:31Z","title":"Context-aware explainable recommendations over knowledge graphs","summary":" Knowledge graphs contain rich semantic relationships related to items and\nincorporating such semantic relationships into recommender systems helps to\nexplore the latent connections of items, thus improving the accuracy of\nprediction and enhancing the explainability of recommendations. However, such\nexplainability is not adapted to users' contexts, which can significantly\ninfluence their preferences. In this work, we propose CA-KGCN (Context-Aware\nKnowledge Graph Convolutional Network), an end-to-end framework that can model\nusers' preferences adapted to their contexts and can incorporate rich semantic\nrelationships in the knowledge graph related to items. This framework captures\nusers' attention to different factors: contexts and features of items. More\nspecifically, the framework can model users' preferences adapted to their\ncontexts and provide explanations adapted to the given context. Experiments on\nthree real-world datasets show the effectiveness of our framework: modeling\nusers' preferences adapted to their contexts and explaining the recommendations\ngenerated.\n","authors":["Jinfeng Zhong","Elsa Negre"],"pdf_url":"https://arxiv.org/pdf/2310.16141v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2302.12250v2","updated":"2023-10-24T17:59:46Z","published":"2023-02-23T18:59:30Z","title":"Phase diagram of early training dynamics in deep neural networks: effect\n of the learning rate, depth, and width","summary":" We systematically analyze optimization dynamics in deep neural networks\n(DNNs) trained with stochastic gradient descent (SGD) and study the effect of\nlearning rate $\\eta$, depth $d$, and width $w$ of the neural network. 
By\nanalyzing the maximum eigenvalue $\\lambda^H_t$ of the Hessian of the loss,\nwhich is a measure of sharpness of the loss landscape, we find that the\ndynamics can show four distinct regimes: (i) an early time transient regime,\n(ii) an intermediate saturation regime, (iii) a progressive sharpening regime,\nand (iv) a late time ``edge of stability\" regime. The early and intermediate\nregimes (i) and (ii) exhibit a rich phase diagram depending on $\\eta \\equiv c /\n\\lambda_0^H $, $d$, and $w$. We identify several critical values of $c$, which\nseparate qualitatively distinct phenomena in the early time dynamics of\ntraining loss and sharpness. Notably, we discover the opening up of a\n``sharpness reduction\" phase, where sharpness decreases at early times, as $d$\nand $1/w$ are increased.\n","authors":["Dayal Singh Kalra","Maissam Barkeshli"],"pdf_url":"https://arxiv.org/pdf/2302.12250v2.pdf","comment":"Accepted at NeurIPS 2023 (camera-ready version): Additional results\n added for cross-entropy loss and effect on network output at initialization;\n 10+32 pages, 8+35 figures"},{"id":"http://arxiv.org/abs/2310.16048v1","updated":"2023-10-24T17:59:04Z","published":"2023-10-24T17:59:04Z","title":"AI Alignment and Social Choice: Fundamental Limitations and Policy\n Implications","summary":" Aligning AI agents to human intentions and values is a key bottleneck in\nbuilding safe and deployable AI applications. But whose values should AI agents\nbe aligned with? Reinforcement learning with human feedback (RLHF) has emerged\nas the key framework for AI alignment. RLHF uses feedback from human\nreinforcers to fine-tune outputs; all widely deployed large language models\n(LLMs) use RLHF to align their outputs to human values. It is critical to\nunderstand the limitations of RLHF and consider policy challenges arising from\nthese limitations. In this paper, we investigate a specific challenge in\nbuilding RLHF systems that respect democratic norms. 
Building on impossibility\nresults in social choice theory, we show that, under fairly broad assumptions,\nthere is no unique voting protocol to universally align AI systems using RLHF\nthrough democratic processes. Further, we show that aligning AI agents with the\nvalues of all individuals will always violate certain private ethical\npreferences of an individual user i.e., universal AI alignment using RLHF is\nimpossible. We discuss policy implications for the governance of AI systems\nbuilt using RLHF: first, the need for mandating transparent voting rules to\nhold model builders accountable. Second, the need for model builders to focus\non developing AI agents that are narrowly aligned to specific user groups.\n","authors":["Abhilash Mishra"],"pdf_url":"https://arxiv.org/pdf/2310.16048v1.pdf","comment":"10 pages, no figures"},{"id":"http://arxiv.org/abs/2310.16047v1","updated":"2023-10-24T17:58:54Z","published":"2023-10-24T17:58:54Z","title":"From Posterior Sampling to Meaningful Diversity in Image Restoration","summary":" Image restoration problems are typically ill-posed in the sense that each\ndegraded image can be restored in infinitely many valid ways. To accommodate\nthis, many works generate a diverse set of outputs by attempting to randomly\nsample from the posterior distribution of natural images given the degraded\ninput. Here we argue that this strategy is commonly of limited practical value\nbecause of the heavy tail of the posterior distribution. Consider for example\ninpainting a missing region of the sky in an image. Since there is a high\nprobability that the missing region contains no object but clouds, any set of\nsamples from the posterior would be entirely dominated by (practically\nidentical) completions of sky. 
However, arguably, presenting users with only\none clear sky completion, along with several alternative solutions such as\nairships, birds, and balloons, would better outline the set of possibilities.\nIn this paper, we initiate the study of meaningfully diverse image restoration.\nWe explore several post-processing approaches that can be combined with any\ndiverse image restoration method to yield semantically meaningful diversity.\nMoreover, we propose a practical approach for allowing diffusion based image\nrestoration methods to generate meaningfully diverse outputs, while incurring\nonly negligent computational overhead. We conduct extensive user studies to\nanalyze the proposed techniques, and find the strategy of reducing similarity\nbetween outputs to be significantly favorable over posterior sampling. Code and\nexamples are available in https://noa-cohen.github.io/MeaningfulDiversityInIR\n","authors":["Noa Cohen","Hila Manor","Yuval Bahat","Tomer Michaeli"],"pdf_url":"https://arxiv.org/pdf/2310.16047v1.pdf","comment":"Code and examples are available in\n https://noa-cohen.github.io/MeaningfulDiversityInIR"},{"id":"http://arxiv.org/abs/2310.16046v1","updated":"2023-10-24T17:58:26Z","published":"2023-10-24T17:58:26Z","title":"A Unified, Scalable Framework for Neural Population Decoding","summary":" Our ability to use deep learning approaches to decipher neural activity would\nlikely benefit from greater scale, in terms of both model size and datasets.\nHowever, the integration of many neural recordings into one unified model is\nchallenging, as each recording contains the activity of different neurons from\ndifferent individual animals. In this paper, we introduce a training framework\nand architecture designed to model the population dynamics of neural activity\nacross diverse, large-scale neural recordings. 
Our method first tokenizes\nindividual spikes within the dataset to build an efficient representation of\nneural events that captures the fine temporal structure of neural activity. We\nthen employ cross-attention and a PerceiverIO backbone to further construct a\nlatent tokenization of neural population activities. Utilizing this\narchitecture and training framework, we construct a large-scale multi-session\nmodel trained on large datasets from seven nonhuman primates, spanning over 158\ndifferent sessions of recording from over 27,373 neural units and over 100\nhours of recordings. In a number of different tasks, we demonstrate that our\npretrained model can be rapidly adapted to new, unseen sessions with\nunspecified neuron correspondence, enabling few-shot performance with minimal\nlabels. This work presents a powerful new approach for building deep learning\ntools to analyze neural data and stakes out a clear path to training at scale.\n","authors":["Mehdi Azabou","Vinam Arora","Venkataramana Ganesh","Ximeng Mao","Santosh Nachimuthu","Michael J. Mendelson","Blake Richards","Matthew G. Perich","Guillaume Lajoie","Eva L. Dyer"],"pdf_url":"https://arxiv.org/pdf/2310.16046v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16045v1","updated":"2023-10-24T17:58:07Z","published":"2023-10-24T17:58:07Z","title":"Woodpecker: Hallucination Correction for Multimodal Large Language\n Models","summary":" Hallucination is a big shadow hanging over the rapidly evolving Multimodal\nLarge Language Models (MLLMs), referring to the phenomenon that the generated\ntext is inconsistent with the image content. In order to mitigate\nhallucinations, existing studies mainly resort to an instruction-tuning manner\nthat requires retraining the models with specific data. In this paper, we pave\na different way, introducing a training-free method named Woodpecker. Like a\nwoodpecker heals trees, it picks out and corrects hallucinations from the\ngenerated text. 
Concretely, Woodpecker consists of five stages: key concept\nextraction, question formulation, visual knowledge validation, visual claim\ngeneration, and hallucination correction. Implemented in a post-remedy manner,\nWoodpecker can easily serve different MLLMs, while being interpretable by\naccessing intermediate outputs of the five stages. We evaluate Woodpecker both\nquantitatively and qualitatively and show the huge potential of this new\nparadigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement\nin accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released\nat https://github.com/BradyFU/Woodpecker.\n","authors":["Shukang Yin","Chaoyou Fu","Sirui Zhao","Tong Xu","Hao Wang","Dianbo Sui","Yunhang Shen","Ke Li","Xing Sun","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16045v1.pdf","comment":"16 pages, 7 figures. Code Website:\n https://github.com/BradyFU/Woodpecker"},{"id":"http://arxiv.org/abs/2305.15901v3","updated":"2023-10-24T17:53:26Z","published":"2023-05-25T10:01:57Z","title":"Consistent Optimal Transport with Empirical Conditional Measures","summary":" Given samples from two joint distributions, we consider the problem of\nOptimal Transportation (OT) between them when conditioned on a common variable.\nWe focus on the general setting where the conditioned variable may be\ncontinuous, and the marginals of this variable in the two joint distributions\nmay not be the same. In such settings, standard OT variants cannot be employed,\nand novel estimation techniques are necessary. Since the main challenge is that\nthe conditional distributions are not explicitly available, the key idea in our\nOT formulation is to employ kernelized-least-squares terms computed over the\njoint samples, which implicitly match the transport plan's marginals with the\nempirical conditionals. Under mild conditions, we prove that our estimated\ntransport plans, as a function of the conditioned variable, are asymptotically\noptimal. 
For finite samples, we show that the deviation in terms of our\nregularized objective is bounded by $O(1/m^{1/4})$, where $m$ is the number of\nsamples. We also discuss how the conditional transport plan could be modelled\nusing explicit probabilistic models as well as using implicit generative ones.\nWe empirically verify the consistency of our estimator on synthetic datasets,\nwhere the optimal plan is analytically known. When employed in applications\nlike prompt learning for few-shot classification and conditional-generation in\nthe context of predicting cell responses to treatment, our methodology improves\nupon state-of-the-art methods.\n","authors":["Piyushi Manupriya","Rachit Keerti Das","Sayantan Biswas","Saketha Nath Jagarlapudi"],"pdf_url":"https://arxiv.org/pdf/2305.15901v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16035v1","updated":"2023-10-24T17:50:20Z","published":"2023-10-24T17:50:20Z","title":"What's Left? Concept Grounding with Logic-Enhanced Foundation Models","summary":" Recent works such as VisProg and ViperGPT have smartly composed foundation\nmodels for visual reasoning-using large language models (LLMs) to produce\nprograms that can be executed by pre-trained vision-language models. However,\nthey operate in limited domains, such as 2D images, not fully exploiting the\ngeneralization of language: abstract concepts like \"left\" can also be grounded\nin 3D, temporal, and action data, as in moving to your left. This limited\ngeneralization stems from these inference-only methods' inability to learn or\nadapt pre-trained models to a new domain. We propose the Logic-Enhanced\nFoundation Model (LEFT), a unified framework that learns to ground and reason\nwith concepts across domains with a differentiable, domain-independent,\nfirst-order logic-based program executor. LEFT has an LLM interpreter that\noutputs a program represented in a general, logic-based reasoning language,\nwhich is shared across all domains and tasks. 
LEFT's executor then executes the\nprogram with trainable domain-specific grounding modules. We show that LEFT\nflexibly learns concepts in four domains: 2D images, 3D scenes, human motions,\nand robotic manipulation. It exhibits strong reasoning ability in a wide\nvariety of tasks, including those that are complex and not seen during\ntraining, and can be easily applied to new domains.\n","authors":["Joy Hsu","Jiayuan Mao","Joshua B. Tenenbaum","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2310.16035v1.pdf","comment":"NeurIPS 2023. First two authors contributed equally. Project page:\n https://web.stanford.edu/~joycj/projects/left_neurips_2023"},{"id":"http://arxiv.org/abs/2310.16029v1","updated":"2023-10-24T17:46:12Z","published":"2023-10-24T17:46:12Z","title":"Finetuning Offline World Models in the Real World","summary":" Reinforcement Learning (RL) is notoriously data-inefficient, which makes\ntraining on a real robot difficult. While model-based RL algorithms (world\nmodels) improve data-efficiency to some extent, they still require hours or\ndays of interaction to learn skills. Recently, offline RL has been proposed as\na framework for training RL policies on pre-existing datasets without any\nonline interaction. However, constraining an algorithm to a fixed dataset\ninduces a state-action distribution shift between training and inference, and\nlimits its applicability to new tasks. In this work, we seek to get the best of\nboth worlds: we consider the problem of pretraining a world model with offline\ndata collected on a real robot, and then finetuning the model on online data\ncollected by planning with the learned model. To mitigate extrapolation errors\nduring online interaction, we propose to regularize the planner at test-time by\nbalancing estimated returns and (epistemic) model uncertainty. 
We evaluate our\nmethod on a variety of visuo-motor control tasks in simulation and on a real\nrobot, and find that our method enables few-shot finetuning to seen and unseen\ntasks even when offline data is limited. Videos, code, and data are available\nat https://yunhaifeng.com/FOWM .\n","authors":["Yunhai Feng","Nicklas Hansen","Ziyan Xiong","Chandramouli Rajagopalan","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16029v1.pdf","comment":"CoRL 2023 Oral; Project website: https://yunhaifeng.com/FOWM"},{"id":"http://arxiv.org/abs/2310.16028v1","updated":"2023-10-24T17:43:29Z","published":"2023-10-24T17:43:29Z","title":"What Algorithms can Transformers Learn? A Study in Length Generalization","summary":" Large language models exhibit surprising emergent generalization properties,\nyet also struggle on many simple reasoning tasks such as arithmetic and parity.\nThis raises the question of if and when Transformer models can learn the true\nalgorithm for solving a task. We study the scope of Transformers' abilities in\nthe specific setting of length generalization on algorithmic tasks. Here, we\npropose a unifying framework to understand when and how Transformers can\nexhibit strong length generalization on a given task. Specifically, we leverage\nRASP (Weiss et al., 2021) -- a programming language designed for the\ncomputational model of a Transformer -- and introduce the RASP-Generalization\nConjecture: Transformers tend to length generalize on a task if the task can be\nsolved by a short RASP program which works for all input lengths. This simple\nconjecture remarkably captures most known instances of length generalization on\nalgorithmic tasks. Moreover, we leverage our insights to drastically improve\ngeneralization performance on traditionally hard tasks (such as parity and\naddition). On the theoretical side, we give a simple example where the\n\"min-degree-interpolator\" model of learning from Abbe et al. 
(2023) does not\ncorrectly predict Transformers' out-of-distribution behavior, but our\nconjecture does. Overall, our work provides a novel perspective on the\nmechanisms of compositional generalization and the algorithmic capabilities of\nTransformers.\n","authors":["Hattie Zhou","Arwen Bradley","Etai Littwin","Noam Razin","Omid Saremi","Josh Susskind","Samy Bengio","Preetum Nakkiran"],"pdf_url":"https://arxiv.org/pdf/2310.16028v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2310.16027v1","updated":"2023-10-24T17:43:16Z","published":"2023-10-24T17:43:16Z","title":"TimewarpVAE: Simultaneous Time-Warping and Representation Learning of\n Trajectories","summary":" Human demonstrations of trajectories are an important source of training data\nfor many machine learning problems. However, the difficulty of collecting human\ndemonstration data for complex tasks makes learning efficient representations\nof those trajectories challenging. For many problems, such as for handwriting\nor for quasistatic dexterous manipulation, the exact timings of the\ntrajectories should be factored from their spatial path characteristics. In\nthis work, we propose TimewarpVAE, a fully differentiable manifold-learning\nalgorithm that incorporates Dynamic Time Warping (DTW) to simultaneously learn\nboth timing variations and latent factors of spatial variation. We show how the\nTimewarpVAE algorithm learns appropriate time alignments and meaningful\nrepresentations of spatial variations in small handwriting and fork\nmanipulation datasets. Our results have lower spatial reconstruction test error\nthan baseline approaches and the learned low-dimensional representations can be\nused to efficiently generate semantically meaningful novel trajectories.\n","authors":["Travers Rhodes","Daniel D. 
Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16027v1.pdf","comment":"17 pages, 12 figures"},{"id":"http://arxiv.org/abs/2302.03169v2","updated":"2023-10-24T17:39:05Z","published":"2023-02-06T23:57:56Z","title":"Data Selection for Language Models via Importance Resampling","summary":" Selecting a suitable pretraining dataset is crucial for both general-domain\n(e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We\nformalize this problem as selecting a subset of a large raw unlabeled dataset\nto match a desired target distribution given some unlabeled target samples. Due\nto the large scale and dimensionality of the raw text data, existing methods\nuse simple heuristics or use experts to manually curate data. Instead, we\nextend the classic importance resampling approach used in low-dimensions for LM\ndata selection. We propose Data Selection with Importance Resampling (DSIR), an\nefficient and scalable framework that estimates importance weights in a reduced\nfeature space for tractability and selects data with importance resampling\naccording to these weights. To determine an appropriate feature space, we show\nthat KL reduction, a data metric that measures the proximity between selected\npretraining data and the target in a feature space, has high correlation with\naverage downstream accuracy (r=0.89) when computed with simple n-gram features.\nThis motivates our instantiation of DSIR using n-gram features. When performing\ncontinued pretraining towards a specific domain, DSIR performs comparably to\nexpert curation across 8 target distributions. 
When pretraining general-domain\nmodels (target is Wikipedia + books), DSIR improves over random selection and\nheuristic filtering baselines by 2-2.5% on the GLUE benchmark.\n","authors":["Sang Michael Xie","Shibani Santurkar","Tengyu Ma","Percy Liang"],"pdf_url":"https://arxiv.org/pdf/2302.03169v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.04712v2","updated":"2023-10-24T17:28:53Z","published":"2023-06-07T18:23:38Z","title":"Differentiable Earth Mover's Distance for Data Compression at the\n High-Luminosity LHC","summary":" The Earth mover's distance (EMD) is a useful metric for image recognition and\nclassification, but its usual implementations are not differentiable or too\nslow to be used as a loss function for training other algorithms via gradient\ndescent. In this paper, we train a convolutional neural network (CNN) to learn\na differentiable, fast approximation of the EMD and demonstrate that it can be\nused as a substitute for computing-intensive EMD implementations. We apply this\ndifferentiable approximation in the training of an autoencoder-inspired neural\nnetwork (encoder NN) for data compression at the high-luminosity LHC at CERN.\nThe goal of this encoder NN is to compress the data while preserving the\ninformation related to the distribution of energy deposits in particle\ndetectors. 
We demonstrate that the performance of our encoder NN trained using\nthe differentiable EMD CNN surpasses that of training with loss functions based\non mean squared error.\n","authors":["Rohan Shenoy","Javier Duarte","Christian Herwig","James Hirschauer","Daniel Noonan","Maurizio Pierini","Nhan Tran","Cristina Mantilla Suarez"],"pdf_url":"https://arxiv.org/pdf/2306.04712v2.pdf","comment":"16 pages, 7 figures, submitted to Machine Learning: Science and\n Technology"},{"id":"http://arxiv.org/abs/2112.01956v2","updated":"2023-10-24T17:28:15Z","published":"2021-12-03T15:02:22Z","title":"Provably Valid and Diverse Mutations of Real-World Media Data for DNN\n Testing","summary":" Deep neural networks (DNNs) often accept high-dimensional media data (e.g.,\nphotos, text, and audio) and understand their perceptual content (e.g., a cat).\nTo test DNNs, diverse inputs are needed to trigger mis-predictions. Some\npreliminary works use byte-level mutations or domain-specific filters (e.g.,\nfoggy), whose enabled mutations may be limited and likely error-prone. SOTA\nworks employ deep generative models to generate (infinite) inputs. Also, to\nkeep the mutated inputs perceptually valid (e.g., a cat remains a \"cat\" after\nmutation), existing efforts rely on imprecise and less generalizable\nheuristics.\n This study revisits two key objectives in media input mutation - perception\ndiversity (DIV) and validity (VAL) - in a rigorous manner based on manifold, a\nwell-developed theory capturing perceptions of high-dimensional media data in a\nlow-dimensional space. We show important results that DIV and VAL inextricably\nbound each other, and prove that SOTA generative model-based methods\nfundamentally fail to mutate real-world media data (either sacrificing DIV or\nVAL). 
In contrast, we discuss the feasibility of mutating real-world media data\nwith provably high DIV and VAL based on manifold.\n We concretize the technical solution of mutating media data of various\nformats (images, audios, text) via a unified manner based on manifold.\nSpecifically, when media data are projected into a low-dimensional manifold,\nthe data can be mutated by walking on the manifold with certain directions and\nstep sizes. When contrasted with the input data, the mutated data exhibit\nencouraging DIV in the perceptual traits (e.g., lying vs. standing dog) while\nretaining reasonably high VAL (i.e., a dog remains a dog). We implement our\ntechniques in DEEPWALK for testing DNNs. DEEPWALK outperforms prior methods in\ntesting comprehensiveness and can find more error-triggering inputs with higher\nquality.\n","authors":["Yuanyuan Yuan","Qi Pang","Shuai Wang"],"pdf_url":"https://arxiv.org/pdf/2112.01956v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13534v5","updated":"2023-10-24T17:21:36Z","published":"2023-04-26T13:08:50Z","title":"A mean-field games laboratory for generative modeling","summary":" We demonstrate the versatility of mean-field games (MFGs) as a mathematical\nframework for explaining, enhancing, and designing generative models. In\ngenerative flows, a Lagrangian formulation is used where each particle\n(generated sample) aims to minimize a loss function over its simulated path.\nThe loss, however, is dependent on the paths of other particles, which leads to\na competition among the population of particles. The asymptotic behavior of\nthis competition yields a mean-field game. We establish connections between\nMFGs and major classes of generative flows and diffusions including\ncontinuous-time normalizing flows, score-based generative models (SGM), and\nWasserstein gradient flows. 
Furthermore, we study the mathematical properties\nof each generative model by studying their associated MFG's optimality\ncondition, which is a set of coupled forward-backward nonlinear partial\ndifferential equations. The mathematical structure described by the MFG\noptimality conditions identifies the inductive biases of generative flows. We\ninvestigate the well-posedness and structure of normalizing flows, unravel the\nmathematical structure of SGMs, and derive a MFG formulation of Wasserstein\ngradient flows. From an algorithmic perspective, the optimality conditions\nyields Hamilton-Jacobi-Bellman (HJB) regularizers for enhanced training of\ngenerative models. In particular, we propose and demonstrate an HJB-regularized\nSGM with improved performance over standard SGMs. We present this framework as\nan MFG laboratory which serves as a platform for revealing new avenues of\nexperimentation and invention of generative models.\n","authors":["Benjamin J. Zhang","Markos A. Katsoulakis"],"pdf_url":"https://arxiv.org/pdf/2304.13534v5.pdf","comment":"56 pages, 10 figures. Version 5 has a slightly modified version of\n the normalizing flow and improved introduction and conclusions"},{"id":"http://arxiv.org/abs/2303.17971v2","updated":"2023-10-24T17:16:25Z","published":"2023-03-31T11:14:59Z","title":"Rule Enforcing Through Ordering","summary":" In many real world situations, like minor traffic offenses in big cities, a\ncentral authority is tasked with periodic administering punishments to a large\nnumber of individuals. Common practice is to give each individual a chance to\nsuffer a smaller fine and be guaranteed to avoid the legal process with\nprobable considerably larger punishment. 
However, thanks to the large number of\noffenders and a limited capacity of the central authority, the individual risk\nis typically small and a rational individual will not choose to pay the fine.\nHere we show that if the central authority processes the offenders in a\npublicly known order, it properly incentives the offenders to pay the fine. We\nshow analytically and on realistic experiments that our mechanism promotes\nnon-cooperation and incentives individuals to pay. Moreover, the same holds for\nan arbitrary coalition. We quantify the expected total payment the central\nauthority receives, and show it increases considerably.\n","authors":["David Sychrovský","Sameer Desai","Martin Loebl"],"pdf_url":"https://arxiv.org/pdf/2303.17971v2.pdf","comment":"Accepted at the 14th Conference on Decision and Game Theory for\n Security (GameSec-23)"},{"id":"http://arxiv.org/abs/2307.14619v5","updated":"2023-10-24T17:16:16Z","published":"2023-07-27T04:27:26Z","title":"Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level\n Stability and High-Level Behavior","summary":" We propose a theoretical framework for studying behavior cloning of complex\nexpert demonstrations using generative modeling. Our framework invokes\nlow-level controllers - either learned or implicit in position-command control\n- to stabilize imitation around expert demonstrations. We show that with (a) a\nsuitable low-level stability guarantee and (b) a powerful enough generative\nmodel as our imitation learner, pure supervised behavior cloning can generate\ntrajectories matching the per-time step distribution of essentially arbitrary\nexpert trajectories in an optimal transport cost. Our analysis relies on a\nstochastic continuity property of the learned policy we call \"total variation\ncontinuity\" (TVC). 
We then show that TVC can be ensured with minimal\ndegradation of accuracy by combining a popular data-augmentation regimen with a\nnovel algorithmic trick: adding augmentation noise at execution time. We\ninstantiate our guarantees for policies parameterized by diffusion models and\nprove that if the learner accurately estimates the score of the\n(noise-augmented) expert policy, then the distribution of imitator trajectories\nis close to the demonstrator distribution in a natural optimal transport\ndistance. Our analysis constructs intricate couplings between noise-augmented\ntrajectories, a technique that may be of independent interest. We conclude by\nempirically validating our algorithmic recommendations, and discussing\nimplications for future research directions for better behavior cloning with\ngenerative modeling.\n","authors":["Adam Block","Ali Jadbabaie","Daniel Pfrommer","Max Simchowitz","Russ Tedrake"],"pdf_url":"https://arxiv.org/pdf/2307.14619v5.pdf","comment":"updated figures, minor notational change for readability"},{"id":"http://arxiv.org/abs/2310.16014v1","updated":"2023-10-24T17:15:16Z","published":"2023-10-24T17:15:16Z","title":"Human-in-the-Loop Task and Motion Planning for Imitation Learning","summary":" Imitation learning from human demonstrations can teach robots complex\nmanipulation skills, but is time-consuming and labor intensive. In contrast,\nTask and Motion Planning (TAMP) systems are automated and excel at solving\nlong-horizon tasks, but they are difficult to apply to contact-rich tasks. In\nthis paper, we present Human-in-the-Loop Task and Motion Planning (HITL-TAMP),\na novel system that leverages the benefits of both approaches. The system\nemploys a TAMP-gated control mechanism, which selectively gives and takes\ncontrol to and from a human teleoperator. This enables the human teleoperator\nto manage a fleet of robots, maximizing data collection efficiency. 
The\ncollected human data is then combined with an imitation learning framework to\ntrain a TAMP-gated policy, leading to superior performance compared to training\non full task demonstrations. We compared HITL-TAMP to a conventional\nteleoperation system -- users gathered more than 3x the number of demos given\nthe same time budget. Furthermore, proficient agents (75\\%+ success) could be\ntrained from just 10 minutes of non-expert teleoperation data. Finally, we\ncollected 2.1K demos with HITL-TAMP across 12 contact-rich, long-horizon tasks\nand show that the system often produces near-perfect agents. Videos and\nadditional results at https://hitltamp.github.io .\n","authors":["Ajay Mandlekar","Caelan Garrett","Danfei Xu","Dieter Fox"],"pdf_url":"https://arxiv.org/pdf/2310.16014v1.pdf","comment":"Conference on Robot Learning (CoRL) 2023"},{"id":"http://arxiv.org/abs/2310.14017v2","updated":"2023-10-24T17:13:26Z","published":"2023-10-21T13:59:31Z","title":"Contrast Everything: A Hierarchical Contrastive Framework for Medical\n Time-Series","summary":" Contrastive representation learning is crucial in medical time series\nanalysis as it alleviates dependency on labor-intensive, domain-specific, and\nscarce expert annotations. However, existing contrastive learning methods\nprimarily focus on one single data level, which fails to fully exploit the\nintricate nature of medical time series. To address this issue, we present\nCOMET, an innovative hierarchical framework that leverages data consistencies\nat all inherent levels in medical time series. Our meticulously designed model\nsystematically captures data consistency from four potential levels:\nobservation, sample, trial, and patient levels. By developing contrastive loss\nat multiple levels, we can learn effective representations that preserve\ncomprehensive data consistency, maximizing information utilization in a\nself-supervised manner. We conduct experiments in the challenging\npatient-independent setting. 
We compare COMET against six baselines using three\ndiverse datasets, which include ECG signals for myocardial infarction and EEG\nsignals for Alzheimer's and Parkinson's diseases. The results demonstrate that\nCOMET consistently outperforms all baselines, particularly in setup with 10%\nand 1% labeled data fractions across all datasets. These results underscore the\nsignificant impact of our framework in advancing contrastive representation\nlearning techniques for medical time series. The source code is available at\nhttps://github.com/DL4mHealth/COMET.\n","authors":["Yihe Wang","Yu Han","Haishuai Wang","Xiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14017v2.pdf","comment":"Accepted by NeurIPS 2023; 24 pages (13 pages main paper + 11 pages\n supplementary materials)"},{"id":"http://arxiv.org/abs/2310.13548v2","updated":"2023-10-24T17:12:03Z","published":"2023-10-20T14:46:48Z","title":"Towards Understanding Sycophancy in Language Models","summary":" Reinforcement learning from human feedback (RLHF) is a popular technique for\ntraining high-quality AI assistants. However, RLHF may also encourage model\nresponses that match user beliefs over truthful responses, a behavior known as\nsycophancy. We investigate the prevalence of sycophancy in RLHF-trained models\nand whether human preference judgements are responsible. We first demonstrate\nthat five state-of-the-art AI assistants consistently exhibit sycophantic\nbehavior across four varied free-form text-generation tasks. To understand if\nhuman preferences drive this broadly observed behavior of RLHF models, we\nanalyze existing human preference data. We find that when a response matches a\nuser's views, it is more likely to be preferred. Moreover, both humans and\npreference models (PMs) prefer convincingly-written sycophantic responses over\ncorrect ones a non-negligible fraction of the time. 
Optimizing model outputs\nagainst PMs also sometimes sacrifices truthfulness in favor of sycophancy.\nOverall, our results indicate that sycophancy is a general behavior of RLHF\nmodels, likely driven in part by human preference judgements favoring\nsycophantic responses.\n","authors":["Mrinank Sharma","Meg Tong","Tomasz Korbak","David Duvenaud","Amanda Askell","Samuel R. Bowman","Newton Cheng","Esin Durmus","Zac Hatfield-Dodds","Scott R. Johnston","Shauna Kravec","Timothy Maxwell","Sam McCandlish","Kamal Ndousse","Oliver Rausch","Nicholas Schiefer","Da Yan","Miranda Zhang","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2310.13548v2.pdf","comment":"32 pages, 20 figures"},{"id":"http://arxiv.org/abs/2310.16005v1","updated":"2023-10-24T17:00:00Z","published":"2023-10-24T17:00:00Z","title":"MLFMF: Data Sets for Machine Learning for Mathematical Formalization","summary":" We introduce MLFMF, a collection of data sets for benchmarking recommendation\nsystems used to support formalization of mathematics with proof assistants.\nThese systems help humans identify which previous entries (theorems,\nconstructions, datatypes, and postulates) are relevant in proving a new theorem\nor carrying out a new construction. Each data set is derived from a library of\nformalized mathematics written in proof assistants Agda or Lean. The collection\nincludes the largest Lean~4 library Mathlib, and some of the largest Agda\nlibraries: the standard library, the library of univalent mathematics\nAgda-unimath, and the TypeTopology library. Each data set represents the\ncorresponding library in two ways: as a heterogeneous network, and as a list of\ns-expressions representing the syntax trees of all the entries in the library.\nThe network contains the (modular) structure of the library and the references\nbetween entries, while the s-expressions give complete and easily parsed\ninformation about every entry. 
We report baseline results using standard graph\nand word embeddings, tree ensembles, and instance-based learning algorithms.\nThe MLFMF data sets provide solid benchmarking support for further\ninvestigation of the numerous machine learning approaches to formalized\nmathematics. The methodology used to extract the networks and the s-expressions\nreadily applies to other libraries, and is applicable to other proof\nassistants. With more than $250\\,000$ entries in total, this is currently the\nlargest collection of formalized mathematical knowledge in machine learnable\nformat.\n","authors":["Andrej Bauer","Matej Petković","Ljupčo Todorovski"],"pdf_url":"https://arxiv.org/pdf/2310.16005v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15999v1","updated":"2023-10-24T16:48:56Z","published":"2023-10-24T16:48:56Z","title":"Transitivity Recovering Decompositions: Interpretable and Robust\n Fine-Grained Relationships","summary":" Recent advances in fine-grained representation learning leverage\nlocal-to-global (emergent) relationships for achieving state-of-the-art\nresults. The relational representations relied upon by such methods, however,\nare abstract. We aim to deconstruct this abstraction by expressing them as\ninterpretable graphs over image views. We begin by theoretically showing that\nabstract relational representations are nothing but a way of recovering\ntransitive relationships among local views. Based on this, we design\nTransitivity Recovering Decompositions (TRD), a graph-space search algorithm\nthat identifies interpretable equivalents of abstract emergent relationships at\nboth instance and class levels, and with no post-hoc computations. We\nadditionally show that TRD is provably robust to noisy views, with empirical\nevidence also supporting this finding. 
The latter allows TRD to perform at par\nor even better than the state-of-the-art, while being fully interpretable.\nImplementation is available at https://github.com/abhrac/trd.\n","authors":["Abhra Chaudhuri","Massimiliano Mancini","Zeynep Akata","Anjan Dutta"],"pdf_url":"https://arxiv.org/pdf/2310.15999v1.pdf","comment":"Neural Information Processing Systems (NeurIPS) 2023"},{"id":"http://arxiv.org/abs/2310.15991v1","updated":"2023-10-24T16:39:06Z","published":"2023-10-24T16:39:06Z","title":"White-box Compiler Fuzzing Empowered by Large Language Models","summary":" Compiler correctness is crucial, as miscompilation falsifying the program\nbehaviors can lead to serious consequences. In the literature, fuzzing has been\nextensively studied to uncover compiler defects. However, compiler fuzzing\nremains challenging: Existing arts focus on black- and grey-box fuzzing, which\ngenerates tests without sufficient understanding of internal compiler\nbehaviors. As such, they often fail to construct programs to exercise\nconditions of intricate optimizations. Meanwhile, traditional white-box\ntechniques are computationally inapplicable to the giant codebase of compilers.\nRecent advances demonstrate that Large Language Models (LLMs) excel in code\ngeneration/understanding tasks and have achieved state-of-the-art performance\nin black-box fuzzing. Nonetheless, prompting LLMs with compiler source-code\ninformation remains a missing piece of research in compiler testing.\n To this end, we propose WhiteFox, the first white-box compiler fuzzer using\nLLMs with source-code information to test compiler optimization. WhiteFox\nadopts a dual-model framework: (i) an analysis LLM examines the low-level\noptimization source code and produces requirements on the high-level test\nprograms that can trigger the optimization; (ii) a generation LLM produces test\nprograms based on the summarized requirements. 
Additionally,\noptimization-triggering tests are used as feedback to further enhance the test\ngeneration on the fly. Our evaluation on four popular compilers shows that\nWhiteFox can generate high-quality tests to exercise deep optimizations\nrequiring intricate conditions, practicing up to 80 more optimizations than\nstate-of-the-art fuzzers. To date, WhiteFox has found in total 96 bugs, with 80\nconfirmed as previously unknown and 51 already fixed. Beyond compiler testing,\nWhiteFox can also be adapted for white-box fuzzing of other complex, real-world\nsoftware systems in general.\n","authors":["Chenyuan Yang","Yinlin Deng","Runyu Lu","Jiayi Yao","Jiawei Liu","Reyhaneh Jabbarvand","Lingming Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.15991v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03217v2","updated":"2023-10-24T16:37:43Z","published":"2023-07-06T17:56:10Z","title":"Quantification of Uncertainty with Adversarial Models","summary":" Quantifying uncertainty is important for actionable predictions in real-world\napplications. A crucial part of predictive uncertainty quantification is the\nestimation of epistemic uncertainty, which is defined as an integral of the\nproduct between a divergence function and the posterior. Current methods such\nas Deep Ensembles or MC dropout underperform at estimating the epistemic\nuncertainty, since they primarily consider the posterior when sampling models.\nWe suggest Quantification of Uncertainty with Adversarial Models (QUAM) to\nbetter estimate the epistemic uncertainty. QUAM identifies regions where the\nwhole product under the integral is large, not just the posterior.\nConsequently, QUAM has lower approximation error of the epistemic uncertainty\ncompared to previous methods. Models for which the product is large correspond\nto adversarial models (not adversarial examples!). Adversarial models have both\na high posterior as well as a high divergence between their predictions and\nthat of a reference model. 
Our experiments show that QUAM excels in capturing\nepistemic uncertainty for deep learning models and outperforms previous methods\non challenging tasks in the vision domain.\n","authors":["Kajetan Schweighofer","Lukas Aichberger","Mykyta Ielanskyi","Günter Klambauer","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2307.03217v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.02854v2","updated":"2023-10-24T16:32:45Z","published":"2023-02-06T15:22:52Z","title":"NA-SODINN: a deep learning algorithm for exoplanet image detection based\n on residual noise regimes","summary":" Supervised deep learning was recently introduced in high-contrast imaging\n(HCI) through the SODINN algorithm, a convolutional neural network designed for\nexoplanet detection in angular differential imaging (ADI) datasets. The\nbenchmarking of HCI algorithms within the Exoplanet Imaging Data Challenge\n(EIDC) showed that (i) SODINN can produce a high number of false positives in\nthe final detection maps, and (ii) algorithms processing images in a more local\nmanner perform better. This work aims to improve the SODINN detection\nperformance by introducing new local processing approaches and adapting its\nlearning process accordingly. We propose NA-SODINN, a new deep learning binary\nclassifier based on a convolutional neural network (CNN) that better captures\nimage noise correlations in ADI-processed frames by identifying noise regimes.\nOur new approach was tested against its predecessor, as well as two\nSODINN-based hybrid models and a more standard annular-PCA approach, through\nlocal receiving operating characteristics (ROC) analysis of ADI sequences from\nthe VLT/SPHERE and Keck/NIRC-2 instruments. Results show that NA-SODINN\nenhances SODINN in both sensitivity and specificity, especially in the\nspeckle-dominated noise regime. 
NA-SODINN is also benchmarked against the\ncomplete set of submitted detection algorithms in EIDC, in which we show that\nits final detection score matches or outperforms the most powerful detection\nalgorithms. Throughout the supervised machine learning case, this study\nillustrates and reinforces the importance of adapting the task of detection to\nthe local content of processed images.\n","authors":["Carles Cantero","Olivier Absil","Carl-Henrik Dahlqvist","Marc Van Droogenbroeck"],"pdf_url":"https://arxiv.org/pdf/2302.02854v2.pdf","comment":"A&A in press"},{"id":"http://arxiv.org/abs/2307.03838v2","updated":"2023-10-24T16:31:49Z","published":"2023-07-07T21:13:27Z","title":"RADAR: Robust AI-Text Detection via Adversarial Learning","summary":" Recent advances in large language models (LLMs) and the intensifying\npopularity of ChatGPT-like applications have blurred the boundary of\nhigh-quality text generation between humans and machines. However, in addition\nto the anticipated revolutionary changes to our technology and society, the\ndifficulty of distinguishing LLM-generated texts (AI-text) from human-generated\ntexts poses new challenges of misuse and fairness, such as fake content\ngeneration, plagiarism, and false accusations of innocent writers. While\nexisting works show that current AI-text detectors are not robust to LLM-based\nparaphrasing, this paper aims to bridge this gap by proposing a new framework\ncalled RADAR, which jointly trains a robust AI-text detector via adversarial\nlearning. RADAR is based on adversarial training of a paraphraser and a\ndetector. The paraphraser's goal is to generate realistic content to evade\nAI-text detection. RADAR uses the feedback from the detector to update the\nparaphraser, and vice versa. 
Evaluated with 8 different LLMs (Pythia, Dolly\n2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets,\nexperimental results show that RADAR significantly outperforms existing AI-text\ndetection methods, especially when paraphrasing is in place. We also identify\nthe strong transferability of RADAR from instruction-tuned LLMs to other LLMs,\nand evaluate the improved capability of RADAR via GPT-3.5-Turbo.\n","authors":["Xiaomeng Hu","Pin-Yu Chen","Tsung-Yi Ho"],"pdf_url":"https://arxiv.org/pdf/2307.03838v2.pdf","comment":"Accepted by NeurIPS 2023. Project page and demos:\n https://radar.vizhub.ai"},{"id":"http://arxiv.org/abs/2310.15978v1","updated":"2023-10-24T16:26:38Z","published":"2023-10-24T16:26:38Z","title":"Graph Deep Learning for Time Series Forecasting","summary":" Graph-based deep learning methods have become popular tools to process\ncollections of correlated time series. Differently from traditional\nmultivariate forecasting methods, neural graph-based predictors take advantage\nof pairwise relationships by conditioning forecasts on a (possibly dynamic)\ngraph spanning the time series collection. The conditioning can take the form\nof an architectural inductive bias on the neural forecasting architecture,\nresulting in a family of deep learning models called spatiotemporal graph\nneural networks. Such relational inductive biases enable the training of global\nforecasting models on large time-series collections, while at the same time\nlocalizing predictions w.r.t. each element in the set (i.e., graph nodes) by\naccounting for local correlations among them (i.e., graph edges). Indeed,\nrecent theoretical and practical advances in graph neural networks and deep\nlearning for time series forecasting make the adoption of such processing\nframeworks appealing and timely. 
However, most of the studies in the literature\nfocus on proposing variations of existing neural architectures by taking\nadvantage of modern deep learning practices, while foundational and\nmethodological aspects have not been subject to systematic investigation. To\nfill the gap, this paper aims to introduce a comprehensive methodological\nframework that formalizes the forecasting problem and provides design\nprinciples for graph-based predictive models and methods to assess their\nperformance. At the same time, together with an overview of the field, we\nprovide design guidelines, recommendations, and best practices, as well as an\nin-depth discussion of open challenges and future research directions.\n","authors":["Andrea Cini","Ivan Marisca","Daniele Zambon","Cesare Alippi"],"pdf_url":"https://arxiv.org/pdf/2310.15978v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15975v1","updated":"2023-10-24T16:25:13Z","published":"2023-10-24T16:25:13Z","title":"Data-driven Traffic Simulation: A Comprehensive Review","summary":" Autonomous vehicles (AVs) have the potential to significantly revolutionize\nsociety by providing a secure and efficient mode of transportation. Recent\nyears have witnessed notable advancements in autonomous driving perception and\nprediction, but the challenge of validating the performance of AVs remains\nlargely unresolved. Data-driven microscopic traffic simulation has become an\nimportant tool for autonomous driving testing due to 1) availability of\nhigh-fidelity traffic data; 2) its advantages of enabling large-scale testing\nand scenario reproducibility; and 3) its potential in reactive and realistic\ntraffic simulation. However, a comprehensive review of this topic is currently\nlacking. This paper aims to fill this gap by summarizing relevant studies. The\nprimary objective of this paper is to review current research efforts and\nprovide a futuristic perspective that will benefit future developments in the\nfield. 
It introduces the general issues of data-driven traffic simulation and\noutlines key concepts and terms. After overviewing traffic simulation, various\ndatasets and evaluation metrics commonly used are reviewed. The paper then\noffers a comprehensive evaluation of imitation learning, reinforcement\nlearning, generative and deep learning methods, summarizing each and analyzing\ntheir advantages and disadvantages in detail. Moreover, it evaluates the\nstate-of-the-art, existing challenges, and future research directions.\n","authors":["Di Chen","Meixin Zhu","Hao Yang","Xuesong Wang","Yinhai Wang"],"pdf_url":"https://arxiv.org/pdf/2310.15975v1.pdf","comment":"18 pages, 4 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.15976v1","updated":"2023-10-24T16:25:13Z","published":"2023-10-24T16:25:13Z","title":"Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex\n Optimization","summary":" signSGD is popular in nonconvex optimization due to its communication\nefficiency. Yet, existing analyses of signSGD rely on assuming that data are\nsampled with replacement in each iteration, contradicting the practical\nimplementation where data are randomly reshuffled and sequentially fed into the\nalgorithm. We bridge this gap by proving the first convergence result of\nsignSGD with random reshuffling (SignRR) for nonconvex optimization. Given the\ndataset size $n$, the number of epochs of data passes $T$, and the variance\nbound of a stochastic gradient $\\sigma^2$, we show that SignRR has the same\nconvergence rate $O(\\log(nT)/\\sqrt{nT} + \\|\\sigma\\|_1)$ as signSGD\n\\citep{bernstein2018signsgd}. We then present SignRVR and SignRVM, which\nleverage variance-reduced gradients and momentum updates respectively, both\nconverging at $O(\\log(nT)/\\sqrt{nT})$. 
In contrast with the analysis of\nsignSGD, our results do not require an extremely large batch size in each\niteration to be of the same order as the total number of iterations\n\\citep{bernstein2018signsgd} or the signs of stochastic and true gradients\nmatch element-wise with a minimum probability of 1/2\n\\citep{safaryan2021stochastic}. We also extend our algorithms to cases where\ndata are distributed across different machines, yielding dist-SignRVR and\ndist-SignRVM, both converging at $O(\\log(n_0T)/\\sqrt{n_0T})$, where $n_0$ is\nthe dataset size of a single machine. We back up our theoretical findings\nthrough experiments on simulated and real-world problems, verifying that\nrandomly reshuffled sign methods match or surpass existing baselines.\n","authors":["Zhen Qin","Zhishuai Liu","Pan Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15976v1.pdf","comment":"45 pages, 4 figures"},{"id":"http://arxiv.org/abs/2104.02120v2","updated":"2023-10-24T16:23:04Z","published":"2021-04-05T19:29:46Z","title":"Nonlinear model reduction for slow-fast stochastic systems near unknown\n invariant manifolds","summary":" We introduce a nonlinear stochastic model reduction technique for\nhigh-dimensional stochastic dynamical systems that have a low-dimensional\ninvariant effective manifold with slow dynamics, and high-dimensional, large\nfast modes. Given only access to a black box simulator from which short bursts\nof simulation can be obtained, we design an algorithm that outputs an estimate\nof the invariant manifold, a process of the effective stochastic dynamics on\nit, which has averaged out the fast modes, and a simulator thereof. This\nsimulator is efficient in that it exploits of the low dimension of the\ninvariant manifold, and takes time steps of size dependent on the regularity of\nthe effective process, and therefore typically much larger than that of the\noriginal simulator, which had to resolve the fast modes. 
The algorithm and the\nestimation can be performed on-the-fly, leading to efficient exploration of the\neffective state space, without losing consistency with the underlying dynamics.\nThis construction enables fast and efficient simulation of paths of the\neffective dynamics, together with estimation of crucial features and\nobservables of such dynamics, including the stationary distribution,\nidentification of metastable states, and residence times and transition rates\nbetween them.\n","authors":["Felix X. -F. Ye","Sichen Yang","Mauro Maggioni"],"pdf_url":"https://arxiv.org/pdf/2104.02120v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.09708v3","updated":"2023-10-24T16:22:40Z","published":"2022-08-20T15:17:40Z","title":"DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two\n Quantization","summary":" Efficiently deploying deep neural networks on low-resource edge devices is\nchallenging due to their ever-increasing resource requirements. To address this\nissue, researchers have proposed multiplication-free neural networks, such as\nPower-of-Two quantization, or also known as Shift networks, which aim to reduce\nmemory usage and simplify computation. However, existing low-bit Shift networks\nare not as accurate as their full-precision counterparts, typically suffering\nfrom limited weight range encoding schemes and quantization loss. In this\npaper, we propose the DenseShift network, which significantly improves the\naccuracy of Shift networks, achieving competitive performance to full-precision\nnetworks for vision and speech applications. In addition, we introduce a method\nto deploy an efficient DenseShift network using non-quantized floating-point\nactivations, while obtaining 1.6X speed-up over existing methods. To achieve\nthis, we demonstrate that zero-weight values in low-bit Shift networks do not\ncontribute to model capacity and negatively impact inference computation. 
To\naddress this issue, we propose a zero-free shifting mechanism that simplifies\ninference and increases model capacity. We further propose a sign-scale\ndecomposition design to enhance training efficiency and a low-variance random\ninitialization strategy to improve the model's transfer learning performance.\nOur extensive experiments on various computer vision and speech tasks\ndemonstrate that DenseShift outperforms existing low-bit multiplication-free\nnetworks and achieves competitive performance compared to full-precision\nnetworks. Furthermore, our proposed approach exhibits strong transfer learning\nperformance without a drop in accuracy. Our code was released on GitHub.\n","authors":["Xinlin Li","Bang Liu","Rui Heng Yang","Vanessa Courville","Chao Xing","Vahid Partovi Nia"],"pdf_url":"https://arxiv.org/pdf/2208.09708v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15974v1","updated":"2023-10-24T16:21:41Z","published":"2023-10-24T16:21:41Z","title":"Minimax Forward and Backward Learning of Evolving Tasks with Performance\n Guarantees","summary":" For a sequence of classification tasks that arrive over time, it is common\nthat tasks are evolving in the sense that consecutive tasks often have a higher\nsimilarity. The incremental learning of a growing sequence of tasks holds\npromise to enable accurate classification even with few samples per task by\nleveraging information from all the tasks in the sequence (forward and backward\nlearning). However, existing techniques developed for continual learning and\nconcept drift adaptation are either designed for tasks with time-independent\nsimilarities or only aim to learn the last task in the sequence. This paper\npresents incremental minimax risk classifiers (IMRCs) that effectively exploit\nforward and backward learning and account for evolving tasks. 
In addition, we\nanalytically characterize the performance improvement provided by forward and\nbackward learning in terms of the tasks' expected quadratic change and the\nnumber of tasks. The experimental evaluation shows that IMRCs can result in a\nsignificant performance improvement, especially for reduced sample sizes.\n","authors":["Verónica Álvarez","Santiago Mazuelas","Jose A. Lozano"],"pdf_url":"https://arxiv.org/pdf/2310.15974v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15970v1","updated":"2023-10-24T16:10:58Z","published":"2023-10-24T16:10:58Z","title":"Accented Speech Recognition With Accent-specific Codebooks","summary":" Speech accents pose a significant challenge to state-of-the-art automatic\nspeech recognition (ASR) systems. Degradation in performance across\nunderrepresented accents is a severe deterrent to the inclusive adoption of\nASR. In this work, we propose a novel accent adaptation approach for end-to-end\nASR systems using cross-attention with a trainable set of codebooks. These\nlearnable codebooks capture accent-specific information and are integrated\nwithin the ASR encoder layers. The model is trained on accented English speech,\nwhile the test data also contained accents which were not seen during training.\nOn the Mozilla Common Voice multi-accented dataset, we show that our proposed\napproach yields significant performance gains not only on the seen English\naccents (up to $37\\%$ relative improvement in word error rate) but also on the\nunseen accents (up to $5\\%$ relative improvement in WER). Further, we\nillustrate benefits for a zero-shot transfer setup on the L2Artic dataset. 
We\nalso compare the performance with other approaches based on accent adversarial\ntraining.\n","authors":["Darshan Prabhu","Preethi Jyothi","Sriram Ganapathy","Vinit Unni"],"pdf_url":"https://arxiv.org/pdf/2310.15970v1.pdf","comment":"Accepted to EMNLP 2023 Main Conference (Long Paper)"},{"id":"http://arxiv.org/abs/2310.15966v1","updated":"2023-10-24T16:07:08Z","published":"2023-10-24T16:07:08Z","title":"Constructing and Machine Learning Calabi-Yau Five-folds","summary":" We construct all possible complete intersection Calabi-Yau five-folds in a\nproduct of four or less complex projective spaces, with up to four constraints.\nWe obtain $27068$ spaces, which are not related by permutations of rows and\ncolumns of the configuration matrix, and determine the Euler number for all of\nthem. Excluding the $3909$ product manifolds among those, we calculate the\ncohomological data for $12433$ cases, i.e. $53.7 \\%$ of the non-product spaces,\nobtaining $2375$ different Hodge diamonds. The dataset containing all the above\ninformation is available at\nhttps://www.dropbox.com/scl/fo/z7ii5idt6qxu36e0b8azq/h?rlkey=0qfhx3tykytduobpld510gsfy&dl=0\n. The distributions of the invariants are presented, and a comparison with the\nlower-dimensional analogues is discussed. Supervised machine learning is\nperformed on the cohomological data, via classifier and regressor (both fully\nconnected and convolutional) neural networks. We find that $h^{1,1}$ can be\nlearnt very efficiently, with very high $R^2$ score and an accuracy of $96\\%$,\ni.e. $96 \\%$ of the predictions exactly match the correct values. For\n$h^{1,4},h^{2,3}, \\eta$, we also find very high $R^2$ scores, but the accuracy\nis lower, due to the large ranges of possible values.\n","authors":["R. Alawadhi","D. Angella","A. Leonardo","T. 
Schettini Gherardini"],"pdf_url":"https://arxiv.org/pdf/2310.15966v1.pdf","comment":"40 pages, 8 tables, 2 figures"},{"id":"http://arxiv.org/abs/2310.15961v1","updated":"2023-10-24T16:03:57Z","published":"2023-10-24T16:03:57Z","title":"Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation","summary":" Despite the promise of Mixture of Experts (MoE) models in increasing\nparameter counts of Transformer models while maintaining training and inference\ncosts, their application carries notable drawbacks. The key strategy of these\nmodels is to, for each processed token, activate at most a few experts -\nsubsets of an extensive feed-forward layer. But this approach is not without\nits challenges. The operation of matching experts and tokens is discrete, which\nmakes MoE models prone to issues like training instability and uneven expert\nutilization. Existing techniques designed to address these concerns, such as\nauxiliary losses or balance-aware matching, result either in lower model\nperformance or are more difficult to train. In response to these issues, we\npropose Mixture of Tokens, a fully-differentiable model that retains the\nbenefits of MoE architectures while avoiding the aforementioned difficulties.\nRather than routing tokens to experts, this approach mixes tokens from\ndifferent examples prior to feeding them to experts, enabling the model to\nlearn from all token-expert combinations. Importantly, this mixing can be\ndisabled to avoid mixing of different sequences during inference. 
Crucially,\nthis method is fully compatible with both masked and causal Large Language\nModel training and inference.\n","authors":["Szymon Antoniak","Sebastian Jaszczur","Michał Krutul","Maciej Pióro","Jakub Krajewski","Jan Ludziejewski","Tomasz Odrzygóźdź","Marek Cygan"],"pdf_url":"https://arxiv.org/pdf/2310.15961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15952v1","updated":"2023-10-24T15:53:07Z","published":"2023-10-24T15:53:07Z","title":"Improving Robustness and Reliability in Medical Image Classification\n with Latent-Guided Diffusion and Nested-Ensembles","summary":" While deep learning models have achieved remarkable success across a range of\nmedical image analysis tasks, deployment of these models in real clinical\ncontexts requires that they be robust to variability in the acquired images.\nWhile many methods apply predefined transformations to augment the training\ndata to enhance test-time robustness, these transformations may not ensure the\nmodel's robustness to the diverse variability seen in patient images. In this\npaper, we introduce a novel three-stage approach based on transformers coupled\nwith conditional diffusion models, with the goal of improving model robustness\nto the kinds of imaging variability commonly encountered in practice without\nthe need for pre-determined data augmentation strategies. To this end, multiple\nimage encoders first learn hierarchical feature representations to build\ndiscriminative latent spaces. Next, a reverse diffusion process, guided by the\nlatent code, acts on an informative prior and proposes prediction candidates in\na generative manner. Finally, several prediction candidates are aggregated in a\nbi-level aggregation protocol to produce the final output. Through extensive\nexperiments on medical imaging benchmark datasets, we show that our method\nimproves upon state-of-the-art methods in terms of robustness and confidence\ncalibration. 
Additionally, we introduce a strategy to quantify the prediction\nuncertainty at the instance level, increasing their trustworthiness to\nclinicians using them in clinical practice.\n","authors":["Xing Shen","Hengguan Huang","Brennan Nichyporuk","Tal Arbel"],"pdf_url":"https://arxiv.org/pdf/2310.15952v1.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.15951v1","updated":"2023-10-24T15:51:20Z","published":"2023-10-24T15:51:20Z","title":"Weighted Distance Nearest Neighbor Condensing","summary":" The problem of nearest neighbor condensing has enjoyed a long history of\nstudy, both in its theoretical and practical aspects. In this paper, we\nintroduce the problem of weighted distance nearest neighbor condensing, where\none assigns weights to each point of the condensed set, and then new points are\nlabeled based on their weighted distance nearest neighbor in the condensed set.\n We study the theoretical properties of this new model, and show that it can\nproduce dramatically better condensing than the standard nearest neighbor rule,\nyet is characterized by generalization bounds almost identical to the latter.\nWe then suggest a condensing heuristic for our new problem. We demonstrate\nBayes consistency for this heuristic, and also show promising empirical\nresults.\n","authors":["Lee-Ad Gottlieb","Timor Sharabi","Roi Weiss"],"pdf_url":"https://arxiv.org/pdf/2310.15951v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13837v3","updated":"2023-10-24T15:49:03Z","published":"2023-09-25T02:50:20Z","title":"Backorder Prediction in Inventory Management: Classification Techniques\n and Cost Considerations","summary":" This article introduces an advanced analytical approach for predicting\nbackorders in inventory management. Backorder refers to an order that cannot be\nimmediately fulfilled due to stock depletion. 
Multiple classification\ntechniques, including Balanced Bagging Classifiers, Fuzzy Logic, Variational\nAutoencoder - Generative Adversarial Networks, and Multi-layer Perceptron\nclassifiers, are assessed in this work using performance evaluation metrics\nsuch as ROC-AUC and PR-AUC. Moreover, this work incorporates a profit function\nand misclassification costs, considering the financial implications and costs\nassociated with inventory management and backorder handling. The study suggests\nthat a combination of modeling approaches, including ensemble techniques and\nVAE, can effectively address imbalanced datasets in inventory management,\nemphasizing interpretability and reducing false positives and false negatives.\nThis research contributes to the advancement of predictive analytics and offers\nvaluable insights for future investigations in backorder forecasting and\ninventory control optimization for decision-making.\n","authors":["Sarit Maitra","Sukanya Kundu"],"pdf_url":"https://arxiv.org/pdf/2309.13837v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14321v2","updated":"2023-10-24T15:42:14Z","published":"2023-09-25T17:45:55Z","title":"Lifelong Robot Learning with Human Assisted Language Planners","summary":" Large Language Models (LLMs) have been shown to act like planners that can\ndecompose high-level instructions into a sequence of executable instructions.\nHowever, current LLM-based planners are only able to operate with a fixed set\nof skills. We overcome this critical limitation and present a method for using\nLLM-based planners to query new skills and teach robots these skills in a data\nand time-efficient manner for rigid object manipulation. Our system can re-use\nnewly acquired skills for future tasks, demonstrating the potential of open\nworld and lifelong learning. We evaluate the proposed framework on multiple\ntasks in simulation and the real world. 
Videos are available at:\nhttps://sites.google.com/mit.edu/halp-robot-learning.\n","authors":["Meenal Parakh","Alisha Fong","Anthony Simeonov","Tao Chen","Abhishek Gupta","Pulkit Agrawal"],"pdf_url":"https://arxiv.org/pdf/2309.14321v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15940v1","updated":"2023-10-24T15:35:54Z","published":"2023-10-24T15:35:54Z","title":"Combining Behaviors with the Successor Features Keyboard","summary":" The Option Keyboard (OK) was recently proposed as a method for transferring\nbehavioral knowledge across tasks. OK transfers knowledge by adaptively\ncombining subsets of known behaviors using Successor Features (SFs) and\nGeneralized Policy Improvement (GPI). However, it relies on hand-designed\nstate-features and task encodings which are cumbersome to design for every new\nenvironment. In this work, we propose the \"Successor Features Keyboard\" (SFK),\nwhich enables transfer with discovered state-features and task encodings. To\nenable discovery, we propose the \"Categorical Successor Feature Approximator\"\n(CSFA), a novel learning algorithm for estimating SFs while jointly discovering\nstate-features and task encodings. With SFK and CSFA, we achieve the first\ndemonstration of transfer with SFs in a challenging 3D environment where all\nthe necessary representations are discovered. We first compare CSFA against\nother methods for approximating SFs and show that only CSFA discovers\nrepresentations compatible with SF&GPI at this scale. We then compare SFK\nagainst transfer learning baselines and show that it transfers most quickly to\nlong-horizon tasks.\n","authors":["Wilka Carvalho","Andre Saraiva","Angelos Filos","Andrew Kyle Lampinen","Loic Matthey","Richard L. Lewis","Honglak Lee","Satinder Singh","Danilo J. 
Rezende","Daniel Zoran"],"pdf_url":"https://arxiv.org/pdf/2310.15940v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15938v1","updated":"2023-10-24T15:34:30Z","published":"2023-10-24T15:34:30Z","title":"ABKD: Graph Neural Network Compression with Attention-Based Knowledge\n Distillation","summary":" Graph Neural Networks (GNNs) have proven to be quite versatile for a variety\nof applications, including recommendation systems, fake news detection, drug\ndiscovery, and even computer vision. Due to the expanding size of\ngraph-structured data, GNN models have also increased in complexity, leading to\nsubstantial latency issues. This is primarily attributed to the irregular\nstructure of graph data and its access pattern into memory. The natural\nsolution to reduce latency is to compress large GNNs into small GNNs. One way\nto do this is via knowledge distillation (KD). However, most KD approaches for\nGNNs only consider the outputs of the last layers and do not consider the\noutputs of the intermediate layers of the GNNs; these layers may contain\nimportant inductive biases indicated by the graph structure. To address this\nshortcoming, we propose a novel KD approach to GNN compression that we call\nAttention-Based Knowledge Distillation (ABKD). ABKD is a KD approach that uses\nattention to identify important intermediate teacher-student layer pairs and\nfocuses on aligning their outputs. ABKD enables higher compression of GNNs with\na smaller accuracy dropoff compared to existing KD approaches. 
On average, we\nachieve a 1.79% increase in accuracy with a 32.3x compression ratio on\nOGBN-Mag, a large graph dataset, compared to state-of-the-art approaches.\n","authors":["Anshul Ahluwalia","Rohit Das","Payman Behnam","Alind Khare","Pan Li","Alexey Tumanov"],"pdf_url":"https://arxiv.org/pdf/2310.15938v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16635v2","updated":"2023-10-24T15:33:38Z","published":"2023-06-29T02:19:49Z","title":"Improving Fairness in Deepfake Detection","summary":" Despite the development of effective deepfake detection models in recent\nyears, several recent studies have demonstrated that biases in the training\ndata utilized to develop deepfake detection models can lead to unfair\nperformance for demographic groups of different races and/or genders. Such can\nresult in these groups being unfairly targeted or excluded from detection,\nallowing misclassified deepfakes to manipulate public opinion and erode trust\nin the model. While these studies have focused on identifying and evaluating\nthe unfairness in deepfake detection, no methods have been developed to address\nthe fairness issue of deepfake detection at the algorithm level. In this work,\nwe make the first attempt to improve deepfake detection fairness by proposing\nnovel loss functions to train fair deepfake detection models in ways that are\nagnostic or aware of demographic factors. Extensive experiments on four\ndeepfake datasets and five deepfake detectors demonstrate the effectiveness and\nflexibility of our approach in improving the deepfake detection fairness.\n","authors":["Yan Ju","Shu Hu","Shan Jia","George H. Chen","Siwei Lyu"],"pdf_url":"https://arxiv.org/pdf/2306.16635v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15932v1","updated":"2023-10-24T15:28:43Z","published":"2023-10-24T15:28:43Z","title":"Online Robust Mean Estimation","summary":" We study the problem of high-dimensional robust mean estimation in an online\nsetting. 
Specifically, we consider a scenario where $n$ sensors are measuring\nsome common, ongoing phenomenon. At each time step $t=1,2,\\ldots,T$, the\n$i^{th}$ sensor reports its readings $x^{(i)}_t$ for that time step. The\nalgorithm must then commit to its estimate $\\mu_t$ for the true mean value of\nthe process at time $t$. We assume that most of the sensors observe independent\nsamples from some common distribution $X$, but an $\\epsilon$-fraction of them\nmay instead behave maliciously. The algorithm wishes to compute a good\napproximation $\\mu$ to the true mean $\\mu^\\ast := \\mathbf{E}[X]$. We note that\nif the algorithm is allowed to wait until time $T$ to report its estimate, this\nreduces to the well-studied problem of robust mean estimation. However, the\nrequirement that our algorithm produces partial estimates as the data is coming\nin substantially complicates the situation.\n We prove two main results about online robust mean estimation in this model.\nFirst, if the uncorrupted samples satisfy the standard condition of\n$(\\epsilon,\\delta)$-stability, we give an efficient online algorithm that\noutputs estimates $\\mu_t$, $t \\in [T],$ such that with high probability it\nholds that $\\|\\mu-\\mu^\\ast\\|_2 = O(\\delta \\log(T))$, where $\\mu = (\\mu_t)_{t\n\\in [T]}$. We note that this error bound is nearly competitive with the best\noffline algorithms, which would achieve $\\ell_2$-error of $O(\\delta)$. Our\nsecond main result shows that with additional assumptions on the input (most\nnotably that $X$ is a product distribution) there are inefficient algorithms\nwhose error does not depend on $T$ at all.\n","authors":["Daniel M. 
Kane","Ilias Diakonikolas","Hanshen Xiao","Sihan Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15932v1.pdf","comment":"To appear in SODA2024"},{"id":"http://arxiv.org/abs/2310.15929v1","updated":"2023-10-24T15:27:15Z","published":"2023-10-24T15:27:15Z","title":"E-Sparse: Boosting the Large Language Model Inference through\n Entropy-based N:M Sparsity","summary":" Traditional pruning methods are known to be challenging to work in Large\nLanguage Models (LLMs) for Generative AI because of their unaffordable training\nprocess and large computational demands. For the first time, we introduce the\ninformation entropy of hidden state features into a pruning metric design,\nnamely E-Sparse, to improve the accuracy of N:M sparsity on LLM. E-Sparse\nemploys the information richness to leverage the channel importance, and\nfurther incorporates several novel techniques to put it into effect: (1) it\nintroduces information entropy to enhance the significance of parameter weights\nand input feature norms as a novel pruning metric, and performs N:M sparsity\nwithout modifying the remaining weights. (2) it designs global naive shuffle\nand local block shuffle to quickly optimize the information distribution and\nadequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is\nimplemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere\nGPUs. 
Extensive experiments on the LLaMA family and OPT models show that\nE-Sparse can significantly speed up the model inference over the dense model\n(up to 1.53X) and obtain significant memory saving (up to 43.52%), with\nacceptable accuracy loss.\n","authors":["Yun Li","Lin Niu","Xipeng Zhang","Kai Liu","Jianchen Zhu","Zhanhui Kang"],"pdf_url":"https://arxiv.org/pdf/2310.15929v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15019v2","updated":"2023-10-24T15:15:38Z","published":"2023-10-23T15:14:55Z","title":"Meta learning with language models: Challenges and opportunities in the\n classification of imbalanced text","summary":" Detecting out of policy speech (OOPS) content is important but difficult.\nWhile machine learning is a powerful tool to tackle this challenging task, it\nis hard to break the performance ceiling due to factors like quantity and\nquality limitations on training data and inconsistencies in OOPS definition and\ndata labeling. To realize the full potential of available limited resources, we\npropose a meta learning technique (MLT) that combines individual models built\nwith different text representations. We analytically show that the resulting\ntechnique is numerically stable and produces reasonable combining weights. We\ncombine the MLT with a threshold-moving (TM) technique to further improve the\nperformance of the combined predictor on highly-imbalanced in-distribution and\nout-of-distribution datasets. 
We also provide computational results to show the\nstatistically significant advantages of the proposed MLT approach.\n All authors contributed equally to this work.\n","authors":["Apostol Vassilev","Honglan Jin","Munawar Hasan"],"pdf_url":"https://arxiv.org/pdf/2310.15019v2.pdf","comment":"22 pages, including 5 figures, 12 tables, 1 appendix"},{"id":"http://arxiv.org/abs/2310.15912v1","updated":"2023-10-24T15:15:28Z","published":"2023-10-24T15:15:28Z","title":"Climate Change Impact on Agricultural Land Suitability: An Interpretable\n Machine Learning-Based Eurasia Case Study","summary":" The United Nations has identified improving food security and reducing hunger\nas essential components of its sustainable development goals. As of 2021,\napproximately 828 million people worldwide are experiencing hunger and\nmalnutrition, with numerous fatalities reported. Climate change significantly\nimpacts agricultural land suitability, potentially leading to severe food\nshortages and subsequent social and political conflicts. To address this\npressing issue, we have developed a machine learning-based approach to predict\nthe risk of substantial land suitability degradation and changes in irrigation\npatterns. Our study focuses on Central Eurasia, a region burdened with economic\nand social challenges.\n This study represents a pioneering effort in utilizing machine learning\nmethods to assess the impact of climate change on agricultural land suitability\nunder various carbon emissions scenarios. Through comprehensive feature\nimportance analysis, we unveil specific climate and terrain characteristics\nthat exert influence on land suitability. Our approach achieves remarkable\naccuracy, offering policymakers invaluable insights to facilitate informed\ndecisions aimed at averting a humanitarian crisis, including strategies such as\nthe provision of additional water and fertilizers. 
This research underscores\nthe tremendous potential of machine learning in addressing global challenges,\nwith a particular emphasis on mitigating hunger and malnutrition.\n","authors":["Valeriy Shevchenko","Daria Taniushkina","Aleksander Lukashevich","Aleksandr Bulkin","Roland Grinis","Kirill Kovalev","Veronika Narozhnaia","Nazar Sotiriadi","Alexander Krenke","Yury Maximov"],"pdf_url":"https://arxiv.org/pdf/2310.15912v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15905v1","updated":"2023-10-24T15:08:12Z","published":"2023-10-24T15:08:12Z","title":"Is Probing All You Need? Indicator Tasks as an Alternative to Probing\n Embedding Spaces","summary":" The ability to identify and control different kinds of linguistic information\nencoded in vector representations of words has many use cases, especially for\nexplainability and bias removal. This is usually done via a set of simple\nclassification tasks, termed probes, to evaluate the information encoded in the\nembedding space. However, the involvement of a trainable classifier leads to\nentanglement between the probe's results and the classifier's nature. As a\nresult, contemporary works on probing include tasks that do not involve\ntraining of auxiliary models. In this work we introduce the term indicator\ntasks for non-trainable tasks which are used to query embedding spaces for the\nexistence of certain properties, and claim that this kind of tasks may point to\na direction opposite to probes, and that this contradiction complicates the\ndecision on whether a property exists in an embedding space. We demonstrate our\nclaims with two test cases, one dealing with gender debiasing and another with\nthe erasure of morphological information from embedding spaces. We show that\nthe application of a suitable indicator provides a more accurate picture of the\ninformation captured and removed compared to probes. 
We thus conclude that\nindicator tasks should be implemented and taken into consideration when\neliciting information from embedded representations.\n","authors":["Tal Levy","Omer Goldman","Reut Tsarfaty"],"pdf_url":"https://arxiv.org/pdf/2310.15905v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15904v1","updated":"2023-10-24T15:07:35Z","published":"2023-10-24T15:07:35Z","title":"Do Stochastic Parrots have Feelings Too? Improving Neural Detection of\n Synthetic Text via Emotion Recognition","summary":" Recent developments in generative AI have shone a spotlight on\nhigh-performance synthetic text generation technologies. The now wide\navailability and ease of use of such models highlights the urgent need to\nprovide equally powerful technologies capable of identifying synthetic text.\nWith this in mind, we draw inspiration from psychological studies which suggest\nthat people can be driven by emotion and encode emotion in the text they\ncompose. We hypothesize that pretrained language models (PLMs) have an\naffective deficit because they lack such an emotional driver when generating\ntext and consequently may generate synthetic text which has affective\nincoherence i.e. lacking the kind of emotional coherence present in\nhuman-authored text. We subsequently develop an emotionally aware detector by\nfine-tuning a PLM on emotion. Experiment results indicate that our\nemotionally-aware detector achieves improvements across a range of synthetic\ntext generators, various sized models, datasets, and domains. Finally, we\ncompare our emotionally-aware synthetic text detector to ChatGPT in the task of\nidentification of its own output and show substantial gains, reinforcing the\npotential of emotion as a signal to identify synthetic text. 
Code, models, and\ndatasets are available at https: //github.com/alanagiasi/emoPLMsynth\n","authors":["Alan Cowap","Yvette Graham","Jennifer Foster"],"pdf_url":"https://arxiv.org/pdf/2310.15904v1.pdf","comment":"Accepted to Findings of EMNLP 2023 (long paper). Camera ready version"},{"id":"http://arxiv.org/abs/2310.15903v1","updated":"2023-10-24T15:07:16Z","published":"2023-10-24T15:07:16Z","title":"Neural Collapse in Multi-label Learning with Pick-all-label Loss","summary":" We study deep neural networks for the multi-label classification (MLab) task\nthrough the lens of neural collapse (NC). Previous works have been restricted\nto the multi-class classification setting and discovered a prevalent NC\nphenomenon comprising of the following properties for the last-layer features:\n(i) the variability of features within every class collapses to zero, (ii) the\nset of feature means form an equi-angular tight frame (ETF), and (iii) the last\nlayer classifiers collapse to the feature mean upon some scaling. We generalize\nthe study to multi-label learning, and prove for the first time that a\ngeneralized NC phenomenon holds with the \"pick-all-label'' formulation. Under\nthe natural analog of the unconstrained feature model (UFM), we establish that\nthe only global classifier of the pick-all-label cross entropy loss display the\nsame ETF geometry which further collapse to multiplicity-1 feature class means.\nBesides, we discover a combinatorial property in generalized NC which is unique\nfor multi-label learning that we call ``tag-wise average'' property, where the\nfeature class-means of samples with multiple labels are scaled average of the\nfeature class-means of single label tags. 
Theoretically, we establish global\noptimality result for the pick-all-label cross-entropy risk for the UFM.\nAdditionally, We also provide empirical evidence to support our investigation\ninto training deep neural networks on multi-label datasets, resulting in\nimproved training efficiency.\n","authors":["Pengyu Li","Yutong Wang","Xiao Li","Qing Qu"],"pdf_url":"https://arxiv.org/pdf/2310.15903v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.12871v5","updated":"2023-10-24T14:59:02Z","published":"2023-09-22T13:52:42Z","title":"AnglE-optimized Text Embeddings","summary":" High-quality text embedding is pivotal in improving semantic textual\nsimilarity (STS) tasks, which are crucial components in Large Language Model\n(LLM) applications. However, a common challenge existing text embedding models\nface is the problem of vanishing gradients, primarily due to their reliance on\nthe cosine function in the optimization objective, which has saturation zones.\nTo address this issue, this paper proposes a novel angle-optimized text\nembedding model called AnglE. The core idea of AnglE is to introduce angle\noptimization in a complex space. This novel approach effectively mitigates the\nadverse effects of the saturation zone in the cosine function, which can impede\ngradient and hinder optimization processes. To set up a comprehensive STS\nevaluation, we experimented on existing short-text STS datasets and a newly\ncollected long-text STS dataset from GitHub Issues. Furthermore, we examine\ndomain-specific STS scenarios with limited labeled data and explore how AnglE\nworks with LLM-annotated data. Extensive experiments were conducted on various\ntasks including short-text STS, long-text STS, and domain-specific STS tasks.\nThe results show that AnglE outperforms the state-of-the-art (SOTA) STS models\nthat ignore the cosine saturation zone. 
These findings demonstrate the ability\nof AnglE to generate high-quality text embeddings and the usefulness of angle\noptimization in STS.\n","authors":["Xianming Li","Jing Li"],"pdf_url":"https://arxiv.org/pdf/2309.12871v5.pdf","comment":"update results and add non-STS transfer tasks"},{"id":"http://arxiv.org/abs/2310.15890v1","updated":"2023-10-24T14:48:23Z","published":"2023-10-24T14:48:23Z","title":"Cross-feature Contrastive Loss for Decentralized Deep Learning on\n Heterogeneous Data","summary":" The current state-of-the-art decentralized learning algorithms mostly assume\nthe data distribution to be Independent and Identically Distributed (IID).\nHowever, in practical scenarios, the distributed datasets can have\nsignificantly heterogeneous data distributions across the agents. In this work,\nwe present a novel approach for decentralized learning on heterogeneous data,\nwhere data-free knowledge distillation through contrastive loss on\ncross-features is utilized to improve performance. Cross-features for a pair of\nneighboring agents are the features (i.e., last hidden layer activations)\nobtained from the data of an agent with respect to the model parameters of the\nother agent. We demonstrate the effectiveness of the proposed technique through\nan exhaustive set of experiments on various Computer Vision datasets (CIFAR-10,\nCIFAR-100, Fashion MNIST, and ImageNet), model architectures, and network\ntopologies. Our experiments show that the proposed method achieves superior\nperformance (0.2-4% improvement in test accuracy) compared to other existing\ntechniques for decentralized learning on heterogeneous data.\n","authors":["Sai Aparna Aketi","Kaushik Roy"],"pdf_url":"https://arxiv.org/pdf/2310.15890v1.pdf","comment":"12 pages, 7 figures, 11 tables. 
arXiv admin note: text overlap with\n arXiv:2305.04792"},{"id":"http://arxiv.org/abs/2309.16730v2","updated":"2023-10-24T14:47:41Z","published":"2023-09-27T08:46:57Z","title":"Explainable machine learning-based prediction model for diabetic\n nephropathy","summary":" The aim of this study is to analyze the effect of serum metabolites on\ndiabetic nephropathy (DN) and predict the prevalence of DN through a machine\nlearning approach. The dataset consists of 548 patients from April 2018 to\nApril 2019 in Second Affiliated Hospital of Dalian Medical University (SAHDMU).\nWe select the optimal 38 features through a Least absolute shrinkage and\nselection operator (LASSO) regression model and a 10-fold cross-validation. We\ncompare four machine learning algorithms, including eXtreme Gradient Boosting\n(XGB), random forest, decision tree and logistic regression, by AUC-ROC curves,\ndecision curves, calibration curves. We quantify feature importance and\ninteraction effects in the optimal predictive model by Shapley Additive\nexPlanations (SHAP) method. The XGB model has the best performance to screen\nfor DN with the highest AUC value of 0.966. The XGB model also gains more\nclinical net benefits than others and the fitting degree is better. In\naddition, there are significant interactions between serum metabolites and\nduration of diabetes. We develop a predictive model by XGB algorithm to screen\nfor DN. 
C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys have great contribution in\nthe model, and can possibly be biomarkers for DN.\n","authors":["Jing-Mei Yin","Yang Li","Jun-Tang Xue","Guo-Wei Zong","Zhong-Ze Fang","Lang Zou"],"pdf_url":"https://arxiv.org/pdf/2309.16730v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15888v1","updated":"2023-10-24T14:47:02Z","published":"2023-10-24T14:47:02Z","title":"State Sequences Prediction via Fourier Transform for Representation\n Learning","summary":" While deep reinforcement learning (RL) has been demonstrated effective in\nsolving complex control tasks, sample efficiency remains a key challenge due to\nthe large amounts of data required for remarkable performance. Existing\nresearch explores the application of representation learning for data-efficient\nRL, e.g., learning predictive representations by predicting long-term future\nstates. However, many existing methods do not fully exploit the structural\ninformation inherent in sequential state signals, which can potentially improve\nthe quality of long-term decision-making but is difficult to discern in the\ntime domain. To tackle this problem, we propose State Sequences Prediction via\nFourier Transform (SPF), a novel method that exploits the frequency domain of\nstate sequences to extract the underlying patterns in time series data for\nlearning expressive representations efficiently. Specifically, we theoretically\nanalyze the existence of structural information in state sequences, which is\nclosely related to policy performance and signal regularity, and then propose\nto predict the Fourier transform of infinite-step future state sequences to\nextract such information. One of the appealing features of SPF is that it is\nsimple to implement while not requiring storage of infinite-step future states\nas prediction targets. 
Experiments demonstrate that the proposed method\noutperforms several state-of-the-art algorithms in terms of both sample\nefficiency and performance.\n","authors":["Mingxuan Ye","Yufei Kuang","Jie Wang","Rui Yang","Wengang Zhou","Houqiang Li","Feng Wu"],"pdf_url":"https://arxiv.org/pdf/2310.15888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.05481v3","updated":"2023-10-24T14:37:58Z","published":"2022-03-10T17:03:12Z","title":"Fully Adaptive Composition in Differential Privacy","summary":" Composition is a key feature of differential privacy. Well-known advanced\ncomposition theorems allow one to query a private database quadratically more\ntimes than basic privacy composition would permit. However, these results\nrequire that the privacy parameters of all algorithms be fixed before\ninteracting with the data. To address this, Rogers et al. introduced fully\nadaptive composition, wherein both algorithms and their privacy parameters can\nbe selected adaptively. They defined two probabilistic objects to measure\nprivacy in adaptive composition: privacy filters, which provide differential\nprivacy guarantees for composed interactions, and privacy odometers,\ntime-uniform bounds on privacy loss. There are substantial gaps between\nadvanced composition and existing filters and odometers. First, existing\nfilters place stronger assumptions on the algorithms being composed. Second,\nthese odometers and filters suffer from large constants, making them\nimpractical. We construct filters that match the rates of advanced composition,\nincluding constants, despite allowing for adaptively chosen privacy parameters.\nEn route we also derive a privacy filter for approximate zCDP. We also\nconstruct several general families of odometers. These odometers match the\ntightness of advanced composition at an arbitrary, preselected point in time,\nor at all points in time simultaneously, up to a doubly-logarithmic factor. 
We\nobtain our results by leveraging advances in martingale concentration. In sum,\nwe show that fully adaptive privacy is obtainable at almost no loss.\n","authors":["Justin Whitehouse","Aaditya Ramdas","Ryan Rogers","Zhiwei Steven Wu"],"pdf_url":"https://arxiv.org/pdf/2203.05481v3.pdf","comment":"23 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.15872v1","updated":"2023-10-24T14:28:00Z","published":"2023-10-24T14:28:00Z","title":"KirchhoffNet: A Circuit Bridging Message Passing and Continuous-Depth\n Models","summary":" In this paper, we exploit a fundamental principle of analog electronic\ncircuitry, Kirchhoff's current law, to introduce a unique class of neural\nnetwork models that we refer to as KirchhoffNet. KirchhoffNet establishes close\nconnections with message passing neural networks and continuous-depth networks.\nWe demonstrate that even in the absence of any traditional layers (such as\nconvolution, pooling, or linear layers), KirchhoffNet attains 98.86% test\naccuracy on the MNIST dataset, comparable with state of the art (SOTA) results.\nWhat makes KirchhoffNet more intriguing is its potential in the realm of\nhardware. Contemporary deep neural networks are conventionally deployed on\nGPUs. In contrast, KirchhoffNet can be physically realized by an analog\nelectronic circuit. Moreover, we justify that irrespective of the number of\nparameters within a KirchhoffNet, its forward calculation can always be\ncompleted within 1/f seconds, with f representing the hardware's clock\nfrequency. This characteristic introduces a promising technology for\nimplementing ultra-large-scale neural networks.\n","authors":["Zhengqi Gao","Fan-Keng Sun","Duane S. 
Boning"],"pdf_url":"https://arxiv.org/pdf/2310.15872v1.pdf","comment":"4 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.15865v1","updated":"2023-10-24T14:23:10Z","published":"2023-10-24T14:23:10Z","title":"Using Causality-Aware Graph Neural Networks to Predict Temporal\n Centralities in Dynamic Graphs","summary":" Node centralities play a pivotal role in network science, social network\nanalysis, and recommender systems. In temporal data, static path-based\ncentralities like closeness or betweenness can give misleading results about\nthe true importance of nodes in a temporal graph. To address this issue,\ntemporal generalizations of betweenness and closeness have been defined that\nare based on the shortest time-respecting paths between pairs of nodes.\nHowever, a major issue of those generalizations is that the calculation of such\npaths is computationally expensive. Addressing this issue, we study the\napplication of De Bruijn Graph Neural Networks (DBGNN), a causality-aware graph\nneural network architecture, to predict temporal path-based centralities in\ntime series data. We experimentally evaluate our approach in 13 temporal graphs\nfrom biological and social systems and show that it considerably improves the\nprediction of both betweenness and closeness centrality compared to a static\nGraph Convolutional Neural Network.\n","authors":["Franziska Heeg","Ingo Scholtes"],"pdf_url":"https://arxiv.org/pdf/2310.15865v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11768v5","updated":"2023-10-24T14:22:59Z","published":"2023-06-20T14:21:58Z","title":"A Systematic Survey in Geometric Deep Learning for Structure-based Drug\n Design","summary":" Structure-based drug design (SBDD) utilizes the three-dimensional geometry of\nproteins to identify potential drug candidates. Traditional methods, grounded\nin physicochemical modeling and informed by domain expertise, are\nresource-intensive. 
Recent developments in geometric deep learning, focusing on\nthe integration and processing of 3D geometric data, coupled with the\navailability of accurate protein 3D structure predictions from tools like\nAlphaFold, have greatly advanced the field of structure-based drug design. This\npaper systematically reviews the current state of geometric deep learning in\nSBDD. We first outline foundational tasks in SBDD, detail prevalent 3D protein\nrepresentations, and highlight representative predictive and generative models.\nWe then offer in-depth reviews of each key task, including binding site\nprediction, binding pose generation, \\emph{de novo} molecule generation, linker\ndesign, and binding affinity prediction. We provide formal problem definitions\nand outline each task's representative methods, datasets, evaluation metrics,\nand performance benchmarks. Finally, we summarize the current challenges and\nfuture opportunities: current challenges in SBDD include oversimplified problem\nformulations, inadequate out-of-distribution generalization, a lack of reliable\nevaluation metrics and large-scale benchmarks, and the need for experimental\nverification and enhanced model understanding; opportunities include leveraging\nmultimodal datasets, integrating domain knowledge, building comprehensive\nbenchmarks, designing criteria based on clinical endpoints, and developing\nfoundation models that broaden the range of design tasks. We also curate\n\\url{https://github.com/zaixizhang/Awesome-SBDD}, reflecting ongoing\ncontributions and new datasets in SBDD.\n","authors":["Zaixi Zhang","Jiaxian Yan","Qi Liu","Enhong Chen","Marinka Zitnik"],"pdf_url":"https://arxiv.org/pdf/2306.11768v5.pdf","comment":"20 pages, under review"},{"id":"http://arxiv.org/abs/2310.15047v2","updated":"2023-10-24T14:22:28Z","published":"2023-10-23T15:50:08Z","title":"Meta- (out-of-context) learning in neural networks","summary":" Brown et al. 
(2020) famously introduced the phenomenon of in-context learning\nin large language models (LLMs). We establish the existence of a phenomenon we\ncall meta-out-of-context learning (meta-OCL) via carefully designed synthetic\nexperiments with LLMs. Our results suggest that meta-OCL leads LLMs to more\nreadily \"internalize\" the semantic content of text that is, or appears to be,\nbroadly useful (such as true statements, or text from authoritative sources)\nand use it in appropriate circumstances. We further demonstrate meta-OCL in a\nsynthetic computer vision setting, and propose two hypotheses for the emergence\nof meta-OCL: one relying on the way models store knowledge in their parameters,\nand another suggesting that the implicit gradient alignment bias of\ngradient-descent-based optimizers may be responsible. Finally, we reflect on\nwhat our results might imply about capabilities of future AI systems, and\ndiscuss potential risks. Our code can be found at\nhttps://github.com/krasheninnikov/internalization.\n","authors":["Dmitrii Krasheninnikov","Egor Krasheninnikov","Bruno Mlodozeniec","David Krueger"],"pdf_url":"https://arxiv.org/pdf/2310.15047v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06721v2","updated":"2023-10-24T14:15:38Z","published":"2023-06-11T16:46:00Z","title":"Differentially Private Conditional Independence Testing","summary":" Conditional independence (CI) tests are widely used in statistical data\nanalysis, e.g., they are the building block of many algorithms for causal graph\ndiscovery. The goal of a CI test is to accept or reject the null hypothesis\nthat $X \\perp \\!\\!\\! \\perp Y \\mid Z$, where $X \\in \\mathbb{R}, Y \\in\n\\mathbb{R}, Z \\in \\mathbb{R}^d$. In this work, we investigate conditional\nindependence testing under the constraint of differential privacy. 
We design\ntwo private CI testing procedures: one based on the generalized covariance\nmeasure of Shah and Peters (2020) and another based on the conditional\nrandomization test of Cand\\`es et al. (2016) (under the model-X assumption). We\nprovide theoretical guarantees on the performance of our tests and validate\nthem empirically. These are the first private CI tests with rigorous\ntheoretical guarantees that work for the general case when $Z$ is continuous.\n","authors":["Iden Kalemaj","Shiva Prasad Kasiviswanathan","Aaditya Ramdas"],"pdf_url":"https://arxiv.org/pdf/2306.06721v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15853v1","updated":"2023-10-24T14:11:40Z","published":"2023-10-24T14:11:40Z","title":"Improving Event Time Prediction by Learning to Partition the Event Time\n Space","summary":" Recently developed survival analysis methods improve upon existing approaches\nby predicting the probability of event occurrence in each of a number\npre-specified (discrete) time intervals. By avoiding placing strong parametric\nassumptions on the event density, this approach tends to improve prediction\nperformance, particularly when data are plentiful. However, in clinical\nsettings with limited available data, it is often preferable to judiciously\npartition the event time space into a limited number of intervals well suited\nto the prediction task at hand. In this work, we develop a method to learn from\ndata a set of cut points defining such a partition. We show that in two\nsimulated datasets, we are able to recover intervals that match the underlying\ngenerative model. We then demonstrate improved prediction performance on three\nreal-world observational datasets, including a large, newly harmonized stroke\nrisk prediction dataset. 
Finally, we argue that our approach facilitates\nclinical decision-making by suggesting time intervals that are most appropriate\nfor each task, in the sense that they facilitate more accurate risk prediction.\n","authors":["Jimmy Hickey","Ricardo Henao","Daniel Wojdyla","Michael Pencina","Matthew M. Engelhard"],"pdf_url":"https://arxiv.org/pdf/2310.15853v1.pdf","comment":"16 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2310.15848v1","updated":"2023-10-24T14:01:53Z","published":"2023-10-24T14:01:53Z","title":"On Responsible Machine Learning Datasets with Fairness, Privacy, and\n Regulatory Norms","summary":" Artificial Intelligence (AI) has made its way into various scientific fields,\nproviding astonishing improvements over existing algorithms for a wide variety\nof tasks. In recent years, there have been severe concerns over the\ntrustworthiness of AI technologies. The scientific community has focused on the\ndevelopment of trustworthy AI algorithms. However, machine and deep learning\nalgorithms, popular in the AI community today, depend heavily on the data used\nduring their development. These learning algorithms identify patterns in the\ndata, learning the behavioral objective. Any flaws in the data have the\npotential to translate directly into algorithms. In this study, we discuss the\nimportance of Responsible Machine Learning Datasets and propose a framework to\nevaluate the datasets through a responsible rubric. While existing work focuses\non the post-hoc evaluation of algorithms for their trustworthiness, we provide\na framework that considers the data component separately to understand its role\nin the algorithm. We discuss responsible datasets through the lens of fairness,\nprivacy, and regulatory compliance and provide recommendations for constructing\nfuture datasets. 
After surveying over 100 datasets, we use 60 datasets for\nanalysis and demonstrate that none of these datasets is immune to issues of\nfairness, privacy preservation, and regulatory compliance. We provide\nmodifications to the ``datasheets for datasets\" with important additions for\nimproved dataset documentation. With governments around the world regularizing\ndata protection laws, the method for the creation of datasets in the scientific\ncommunity requires revision. We believe this study is timely and relevant in\ntoday's era of AI.\n","authors":["Surbhi Mittal","Kartik Thakral","Richa Singh","Mayank Vatsa","Tamar Glaser","Cristian Canton Ferrer","Tal Hassner"],"pdf_url":"https://arxiv.org/pdf/2310.15848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13316v2","updated":"2023-10-24T13:51:36Z","published":"2023-05-08T17:43:31Z","title":"KineticNet: Deep learning a transferable kinetic energy functional for\n orbital-free density functional theory","summary":" Orbital-free density functional theory (OF-DFT) holds the promise to compute\nground state molecular properties at minimal cost. However, it has been held\nback by our inability to compute the kinetic energy as a functional of the\nelectron density only. We here set out to learn the kinetic energy functional\nfrom ground truth provided by the more expensive Kohn-Sham density functional\ntheory. Such learning is confronted with two key challenges: Giving the model\nsufficient expressivity and spatial context while limiting the memory footprint\nto afford computations on a GPU; and creating a sufficiently broad distribution\nof training data to enable iterative density optimization even when starting\nfrom a poor initial guess. In response, we introduce KineticNet, an equivariant\ndeep neural network architecture based on point convolutions adapted to the\nprediction of quantities on molecular quadrature grids. 
Important contributions\ninclude convolution filters with sufficient spatial resolution in the vicinity\nof the nuclear cusp, an atom-centric sparse but expressive architecture that\nrelays information across multiple bond lengths; and a new strategy to generate\nvaried training data by finding ground state densities in the face of\nperturbations by a random external potential. KineticNet achieves, for the\nfirst time, chemical accuracy of the learned functionals across input densities\nand geometries of tiny molecules. For two electron systems, we additionally\ndemonstrate OF-DFT density optimization with chemical accuracy.\n","authors":["Roman Remme","Tobias Kaczun","Maximilian Scheurer","Andreas Dreuw","Fred A. Hamprecht"],"pdf_url":"https://arxiv.org/pdf/2305.13316v2.pdf","comment":"This article may be downloaded for personal use only. Any other use\n requires prior permission of the author and AIP Publishing. This article\n appeared in The Journal of Chemical Physics 159, 144113 (2023) and may be\n found at https://doi.org/10.1063/5.0158275"},{"id":"http://arxiv.org/abs/2310.15836v1","updated":"2023-10-24T13:43:01Z","published":"2023-10-24T13:43:01Z","title":"A Diffusion Weighted Graph Framework for New Intent Discovery","summary":" New Intent Discovery (NID) aims to recognize both new and known intents from\nunlabeled data with the aid of limited labeled data containing only known\nintents. Without considering structure relationships between samples, previous\nmethods generate noisy supervisory signals which cannot strike a balance\nbetween quantity and quality, hindering the formation of new intent clusters\nand effective transfer of the pre-training knowledge. To mitigate this\nlimitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to\ncapture both semantic similarities and structure relationships inherent in\ndata, enabling more sufficient and reliable supervisory signals. 
Specifically,\nfor each sample, we diffuse neighborhood relationships along semantic paths\nguided by the nearest neighbors for multiple hops to characterize its local\nstructure discriminately. Then, we sample its positive keys and weigh them\nbased on semantic similarities and local structures for contrastive learning.\nDuring inference, we further propose Graph Smoothing Filter (GSF) to explicitly\nutilize the structure relationships to filter high-frequency noise embodied in\nsemantically ambiguous samples on the cluster boundary. Extensive experiments\nshow that our method outperforms state-of-the-art models on all evaluation\nmetrics across multiple benchmark datasets. Code and data are available at\nhttps://github.com/yibai-shi/DWGF.\n","authors":["Wenkai Shi","Wenbin An","Feng Tian","Qinghua Zheng","QianYing Wang","Ping Chen"],"pdf_url":"https://arxiv.org/pdf/2310.15836v1.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2110.08483v6","updated":"2023-10-24T13:35:47Z","published":"2021-10-16T06:06:36Z","title":"Simplest Streaming Trees","summary":" Decision forests, including random forests and gradient boosting trees,\nremain the leading machine learning methods for many real-world data problems,\nespecially on tabular data. However, most of the current implementations only\noperate in batch mode, and therefore cannot incrementally update when more data\narrive. Several previous works developed streaming trees and ensembles to\novercome this limitation. Nonetheless, we found that those state-of-the-art\nalgorithms suffer from a number of drawbacks, including low accuracy on some\nproblems and high memory usage on others. We therefore developed the simplest\npossible extension of decision trees: given new data, simply update existing\ntrees by continuing to grow them, and replace some old trees with new ones to\ncontrol the total number of trees. 
In a benchmark suite containing 72\nclassification problems (the OpenML-CC18 data suite), we illustrate that our\napproach, Stream Decision Forest (SDF), does not suffer from either of the\naforementioned limitations. On those datasets, we also demonstrate that our\napproach often performs as well, and sometimes even better, than conventional\nbatch decision forest algorithm. Thus, SDFs establish a simple standard for\nstreaming trees and forests that could readily be applied to many real-world\nproblems.\n","authors":["Haoyin Xu","Jayanta Dey","Sambit Panda","Joshua T. Vogelstein"],"pdf_url":"https://arxiv.org/pdf/2110.08483v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17297v2","updated":"2023-10-24T13:33:37Z","published":"2023-05-26T22:41:40Z","title":"Denoising Low-Rank Data Under Distribution Shift: Double Descent and\n Data Augmentation","summary":" Despite the importance of denoising in modern machine learning and ample\nempirical work on supervised denoising, its theoretical understanding is still\nrelatively scarce. One concern about studying supervised denoising is that one\nmight not always have noiseless training data from the test distribution. It is\nmore reasonable to have access to noiseless training data from a different\ndataset than the test dataset. Motivated by this, we study supervised denoising\nand noisy-input regression under distribution shift. We add three\nconsiderations to increase the applicability of our theoretical insights to\nreal-life data and modern machine learning. First, while most past theoretical\nwork assumes that the data covariance matrix is full-rank and well-conditioned,\nempirical studies have shown that real-life data is approximately low-rank.\nThus, we assume that our data matrices are low-rank. Second, we drop\nindependence assumptions on our data. Third, the rise in computational power\nand dimensionality of data have made it important to study non-classical\nregimes of learning. 
Thus, we work in the non-classical proportional regime,\nwhere data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$.\n For this setting, we derive general test error expressions for both denoising\nand noisy-input regression, and study when overfitting the noise is benign,\ntempered or catastrophic. We show that the test error exhibits double descent\nunder general distribution shift, providing insights for data augmentation and\nthe role of noise as an implicit regularizer. We also perform experiments using\nreal-life data, where we match the theoretical predictions with under 1% MSE\nerror for low-rank data.\n","authors":["Chinmaya Kausik","Kashvi Srivastava","Rishi Sonthalia"],"pdf_url":"https://arxiv.org/pdf/2305.17297v2.pdf","comment":"Complete overhaul of presentation, many new results"},{"id":"http://arxiv.org/abs/2310.15830v1","updated":"2023-10-24T13:33:19Z","published":"2023-10-24T13:33:19Z","title":"Localization of Small Leakages in Water Distribution Networks using\n Concept Drift Explanation Methods","summary":" Facing climate change the already limited availability of drinking water will\ndecrease in the future rendering drinking water an increasingly scarce\nresource. Considerable amounts of it are lost through leakages in water\ntransportation and distribution networks. Leakage detection and localization\nare challenging problems due to the complex interactions and changing demands\nin water distribution networks. Especially small leakages are hard to pinpoint\nyet their localization is vital to avoid water loss over long periods of time.\nWhile there exist different approaches to solving the tasks of leakage\ndetection and localization, they are relying on various information about the\nsystem, e.g. real-time demand measurements and the precise network topology,\nwhich is an unrealistic assumption in many real-world scenarios. In contrast,\nthis work attempts leakage localization using pressure measurements only. 
For\nthis purpose, first, leakages in the water distribution network are modeled\nemploying Bayesian networks, and the system dynamics are analyzed. We then show\nhow the problem is connected to and can be considered through the lens of\nconcept drift. In particular, we argue that model-based explanations of concept\ndrift are a promising tool for localizing leakages given limited information\nabout the network. The methodology is experimentally evaluated using realistic\nbenchmark scenarios.\n","authors":["Valerie Vaquet","Fabian Hinder","Kathrin Lammers","Jonas Vaquet","Barbara Hammer"],"pdf_url":"https://arxiv.org/pdf/2310.15830v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15827v1","updated":"2023-10-24T13:28:46Z","published":"2023-10-24T13:28:46Z","title":"Automatic Aorta Segmentation with Heavily Augmented, High-Resolution 3-D\n ResUNet: Contribution to the SEG.A Challenge","summary":" Automatic aorta segmentation from 3-D medical volumes is an important yet\ndifficult task. Several factors make the problem challenging, e.g. the\npossibility of aortic dissection or the difficulty with segmenting and\nannotating the small branches. This work presents a contribution by the MedGIFT\nteam to the SEG.A challenge organized during the MICCAI 2023 conference. We\npropose a fully automated algorithm based on deep encoder-decoder architecture.\nThe main assumption behind our work is that data preprocessing and augmentation\nare much more important than the deep architecture, especially in low data\nregimes. Therefore, the solution is based on a variant of traditional\nconvolutional U-Net. The proposed solution achieved a Dice score above 0.9 for\nall testing cases with the highest stability among all participants. The method\nscored 1st, 4th, and 3rd in terms of the clinical evaluation, quantitative\nresults, and volumetric meshing quality, respectively. 
We freely release the\nsource code, pretrained model, and provide access to the algorithm on the\nGrand-Challenge platform.\n","authors":["Marek Wodzinski","Henning Müller"],"pdf_url":"https://arxiv.org/pdf/2310.15827v1.pdf","comment":"MICCAI 2023 - SEG.A Challenge Contribution"},{"id":"http://arxiv.org/abs/2310.15826v1","updated":"2023-10-24T13:25:19Z","published":"2023-10-24T13:25:19Z","title":"One or Two Things We know about Concept Drift -- A Survey on Monitoring\n Evolving Environments","summary":" The world surrounding us is subject to constant change. These changes,\nfrequently described as concept drift, influence many industrial and technical\nprocesses. As they can lead to malfunctions and other anomalous behavior, which\nmay be safety-critical in many scenarios, detecting and analyzing concept drift\nis crucial. In this paper, we provide a literature review focusing on concept\ndrift in unsupervised data streams. While many surveys focus on supervised data\nstreams, so far, there is no work reviewing the unsupervised setting. However,\nthis setting is of particular relevance for monitoring and anomaly detection\nwhich are directly applicable to many tasks and challenges in engineering. This\nsurvey provides a taxonomy of existing work on drift detection. 
Besides, it\ncovers the current state of research on drift localization in a systematic way.\nIn addition to providing a systematic literature review, this work provides\nprecise mathematical definitions of the considered problems and contains\nstandardized experiments on parametric artificial datasets allowing for a\ndirect comparison of different strategies for detection and localization.\nThereby, the suitability of different schemes can be analyzed systematically\nand guidelines for their usage in real-world scenarios can be provided.\nFinally, there is a section on the emerging topic of explaining concept drift.\n","authors":["Fabian Hinder","Valerie Vaquet","Barbara Hammer"],"pdf_url":"https://arxiv.org/pdf/2310.15826v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11978v2","updated":"2023-10-24T13:23:58Z","published":"2023-10-18T14:05:04Z","title":"Can bin-wise scaling improve consistency and adaptivity of prediction\n uncertainty for machine learning regression ?","summary":" Binwise Variance Scaling (BVS) has recently been proposed as a post hoc\nrecalibration method for prediction uncertainties of machine learning\nregression problems that is able of more efficient corrections than uniform\nvariance (or temperature) scaling. The original version of BVS uses\nuncertainty-based binning, which is aimed to improve calibration conditionally\non uncertainty, i.e. consistency. I explore here several adaptations of BVS, in\nparticular with alternative loss functions and a binning scheme based on an\ninput-feature (X) in order to improve adaptivity, i.e. calibration conditional\non X. The performances of BVS and its proposed variants are tested on a\nbenchmark dataset for the prediction of atomization energies and compared to\nthe results of isotonic regression.\n","authors":["Pascal Pernot"],"pdf_url":"https://arxiv.org/pdf/2310.11978v2.pdf","comment":"This version corrects an error in the estimation of the Sx scores for\n the test set, affecting Fig. 
2 and Tables I-III of the initial version. The\n main points of the discussion and the conclusions are unchanged"},{"id":"http://arxiv.org/abs/2306.05131v2","updated":"2023-10-24T13:16:05Z","published":"2023-06-08T11:54:58Z","title":"Conformal Prediction for Federated Uncertainty Quantification Under\n Label Shift","summary":" Federated Learning (FL) is a machine learning framework where many clients\ncollaboratively train models while keeping the training data decentralized.\nDespite recent advances in FL, the uncertainty quantification topic (UQ)\nremains partially addressed. Among UQ methods, conformal prediction (CP)\napproaches provide distribution-free guarantees under minimal assumptions. We\ndevelop a new federated conformal prediction method based on quantile\nregression and take into account privacy constraints. This method takes\nadvantage of importance weighting to effectively address the label shift\nbetween agents and provides theoretical guarantees for both valid coverage of\nthe prediction sets and differential privacy. Extensive experimental studies\ndemonstrate that this method outperforms current competitors.\n","authors":["Vincent Plassier","Mehdi Makni","Aleksandr Rubashevskii","Eric Moulines","Maxim Panov"],"pdf_url":"https://arxiv.org/pdf/2306.05131v2.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2310.15817v1","updated":"2023-10-24T13:14:22Z","published":"2023-10-24T13:14:22Z","title":"Discriminator Guidance for Autoregressive Diffusion Models","summary":" We introduce discriminator guidance in the setting of Autoregressive\nDiffusion Models. The use of a discriminator to guide a diffusion process has\npreviously been used for continuous diffusion models, and in this work we\nderive ways of using a discriminator together with a pretrained generative\nmodel in the discrete case. First, we show that using an optimal discriminator\nwill correct the pretrained model and enable exact sampling from the underlying\ndata distribution. 
Second, to account for the realistic scenario of using a\nsub-optimal discriminator, we derive a sequential Monte Carlo algorithm which\niteratively takes the predictions from the discriminator into account during the\ngeneration process. We test these approaches on the task of generating\nmolecular graphs and show how the discriminator improves the generative\nperformance over using only the pretrained model.\n","authors":["Filip Ekström Kelvinius","Fredrik Lindsten"],"pdf_url":"https://arxiv.org/pdf/2310.15817v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15816v1","updated":"2023-10-24T13:10:43Z","published":"2023-10-24T13:10:43Z","title":"Nonlinear dimensionality reduction then and now: AIMs for dissipative\n PDEs in the ML era","summary":" This study presents a collection of purely data-driven workflows for\nconstructing reduced-order models (ROMs) for distributed dynamical systems. The\nROMs we focus on, are data-assisted models inspired by, and templated upon, the\ntheory of Approximate Inertial Manifolds (AIMs); the particular motivation is\nthe so-called post-processing Galerkin method of Garcia-Archilla, Novo and\nTiti. Its applicability can be extended: the need for accurate truncated\nGalerkin projections and for deriving closed-formed corrections can be\ncircumvented using machine learning tools. When the right latent variables are\nnot a priori known, we illustrate how autoencoders as well as Diffusion Maps (a\nmanifold learning scheme) can be used to discover good sets of latent variables\nand test their explainability. The proposed methodology can express the ROMs in\nterms of (a) theoretical (Fourier coefficients), (b) linear data-driven (POD\nmodes) and/or (c) nonlinear data-driven (Diffusion Maps) coordinates. Both\nBlack-Box and (theoretically-informed and data-corrected) Gray-Box models are\ndescribed; the necessity for the latter arises when truncated Galerkin\nprojections are so inaccurate as to not be amenable to post-processing. 
We use\nthe Chafee-Infante reaction-diffusion and the Kuramoto-Sivashinsky dissipative\npartial differential equations to illustrate and successfully test the overall\nframework.\n","authors":["Eleni D. Koronaki","Nikolaos Evangelou","Cristina P. Martin-Linares","Edriss S. Titi","Ioannis G. Kevrekidis"],"pdf_url":"https://arxiv.org/pdf/2310.15816v1.pdf","comment":"27 pages, 22 figures"},{"id":"http://arxiv.org/abs/2310.15815v1","updated":"2023-10-24T13:09:56Z","published":"2023-10-24T13:09:56Z","title":"Good Better Best: Self-Motivated Imitation Learning for noisy\n Demonstrations","summary":" Imitation Learning (IL) aims to discover a policy by minimizing the\ndiscrepancy between the agent's behavior and expert demonstrations. However, IL\nis susceptible to limitations imposed by noisy demonstrations from non-expert\nbehaviors, presenting a significant challenge due to the lack of supplementary\ninformation to assess their expertise. In this paper, we introduce\nSelf-Motivated Imitation LEarning (SMILE), a method capable of progressively\nfiltering out demonstrations collected by policies deemed inferior to the\ncurrent policy, eliminating the need for additional information. We utilize the\nforward and reverse processes of Diffusion Models to emulate the shift in\ndemonstration expertise from low to high and vice versa, thereby extracting the\nnoise information that diffuses expertise. Then, the noise information is\nleveraged to predict the diffusion steps between the current policy and\ndemonstrators, which we theoretically demonstrate its equivalence to their\nexpertise gap. We further explain in detail how the predicted diffusion steps\nare applied to filter out noisy demonstrations in a self-motivated manner and\nprovide its theoretical grounds. 
Through empirical evaluations on MuJoCo tasks,\nwe demonstrate that our method is proficient in learning the expert policy\namidst noisy demonstrations, and effectively filters out demonstrations with\nexpertise inferior to the current policy.\n","authors":["Ye Yuan","Xin Li","Yong Heng","Leiji Zhang","MingZhong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.15815v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14468v2","updated":"2023-10-24T13:06:57Z","published":"2023-10-23T00:51:24Z","title":"Revisiting Implicit Differentiation for Learning Problems in Optimal\n Control","summary":" This paper proposes a new method for differentiating through optimal\ntrajectories arising from non-convex, constrained discrete-time optimal control\n(COC) problems using the implicit function theorem (IFT). Previous works solve\na differential Karush-Kuhn-Tucker (KKT) system for the trajectory derivative,\nand achieve this efficiently by solving an auxiliary Linear Quadratic Regulator\n(LQR) problem. In contrast, we directly evaluate the matrix equations which\narise from applying variable elimination on the Lagrange multiplier terms in\nthe (differential) KKT system. By appropriately accounting for the structure of\nthe terms within the resulting equations, we show that the trajectory\nderivatives scale linearly with the number of timesteps. Furthermore, our\napproach allows for easy parallelization, significantly improved scalability\nwith model size, direct computation of vector-Jacobian products and improved\nnumerical stability compared to prior works. As an additional contribution, we\nunify prior works, addressing claims that computing trajectory derivatives\nusing IFT scales quadratically with the number of timesteps. 
We evaluate our\nmethod on both a synthetic benchmark and four challenging, learning from\ndemonstration benchmarks including a 6-DoF maneuvering quadrotor and 6-DoF\nrocket powered landing.\n","authors":["Ming Xu","Timothy Molloy","Stephen Gould"],"pdf_url":"https://arxiv.org/pdf/2310.14468v2.pdf","comment":"Accepted to NeurIPS 2023 (poster)"},{"id":"http://arxiv.org/abs/2204.06863v3","updated":"2023-10-24T12:59:27Z","published":"2022-04-14T10:29:01Z","title":"ULF: Unsupervised Labeling Function Correction using Cross-Validation\n for Weak Supervision","summary":" A cost-effective alternative to manual data labeling is weak supervision\n(WS), where data samples are automatically annotated using a predefined set of\nlabeling functions (LFs), rule-based mechanisms that generate artificial labels\nfor the associated classes. In this work, we investigate noise reduction\ntechniques for WS based on the principle of k-fold cross-validation. We\nintroduce a new algorithm ULF for Unsupervised Labeling Function correction,\nwhich denoises WS data by leveraging models trained on all but some LFs to\nidentify and correct biases specific to the held-out LFs. Specifically, ULF\nrefines the allocation of LFs to classes by re-estimating this assignment on\nhighly reliable cross-validated samples. Evaluation on multiple datasets\nconfirms ULF's effectiveness in enhancing WS learning without the need for\nmanual labeling.\n","authors":["Anastasiia Sedova","Benjamin Roth"],"pdf_url":"https://arxiv.org/pdf/2204.06863v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.03387v3","updated":"2023-10-24T12:57:08Z","published":"2023-03-02T17:30:43Z","title":"CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a\n Context Synergized Hyperbolic Network","summary":" The tremendous growth of social media users interacting in online\nconversations has led to significant growth in hate speech, affecting people\nfrom various demographics. 
Most of the prior works focus on detecting explicit\nhate speech, which is overt and leverages hateful phrases, with very little\nwork focusing on detecting hate speech that is implicit or denotes hatred\nthrough indirect or coded language. In this paper, we present CoSyn, a\ncontext-synergized neural network that explicitly incorporates user- and\nconversational context for detecting implicit hate speech in online\nconversations. CoSyn introduces novel ways to encode these external contexts\nand employs a novel context interaction mechanism that clearly captures the\ninterplay between them, making independent assessments of the amounts of\ninformation to be retrieved from these noisy contexts. Additionally, it carries\nout all these operations in the hyperbolic space to account for the scale-free\ndynamics of social media. We demonstrate the effectiveness of CoSyn on 6 hate\nspeech datasets and show that CoSyn outperforms all our baselines in detecting\nimplicit hate speech with absolute improvements in the range of 1.24% - 57.8%.\n","authors":["Sreyan Ghosh","Manan Suri","Purva Chiniya","Utkarsh Tyagi","Sonal Kumar","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2303.03387v3.pdf","comment":"Accepted to EMNLP 2023 Main Conference. Code:\n https://github.com/Sreyan88/CoSyn"},{"id":"http://arxiv.org/abs/2302.06434v2","updated":"2023-10-24T12:55:10Z","published":"2023-02-13T15:13:22Z","title":"Efficient Graph Laplacian Estimation by Proximal Newton","summary":" The Laplacian-constrained Gaussian Markov Random Field (LGMRF) is a common\nmultivariate statistical model for learning a weighted sparse dependency graph\nfrom given data. This graph learning problem can be formulated as a maximum\nlikelihood estimation (MLE) of the precision matrix, subject to Laplacian\nstructural constraints, with a sparsity-inducing penalty term. This paper aims\nto solve this learning problem accurately and efficiently. 
First, since the\ncommonly used $\\ell_1$-norm penalty is inappropriate in this setting and may\nlead to a complete graph, we employ the nonconvex minimax concave penalty\n(MCP), which promotes sparse solutions with lower estimation bias. Second, as\nopposed to existing first-order methods for this problem, we develop a\nsecond-order proximal Newton approach to obtain an efficient solver, utilizing\nseveral algorithmic features, such as using Conjugate Gradients,\npreconditioning, and splitting to active/free sets. Numerical experiments\ndemonstrate the advantages of the proposed method in terms of both\ncomputational complexity and graph learning accuracy compared to existing\nmethods.\n","authors":["Yakov Medvedovsky","Eran Treister","Tirza Routtenberg"],"pdf_url":"https://arxiv.org/pdf/2302.06434v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01320v3","updated":"2023-10-24T12:51:28Z","published":"2023-10-02T16:27:36Z","title":"Avalon's Game of Thoughts: Battle Against Deception through Recursive\n Contemplation","summary":" Recent breakthroughs in large language models (LLMs) have brought remarkable\nsuccess in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is\nthat the information processed by LLMs is consistently honest, neglecting the\npervasive deceptive or misleading information in human society and AI-generated\ncontent. This oversight makes LLMs susceptible to malicious manipulations,\npotentially resulting in detrimental outcomes. This study utilizes the\nintricate Avalon game as a testbed to explore LLMs' potential in deceptive\nenvironments. Avalon, full of misinformation and requiring sophisticated logic,\nmanifests as a \"Game-of-Thoughts\". Inspired by the efficacy of humans'\nrecursive thinking and perspective-taking in the Avalon game, we introduce a\nnovel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to\nidentify and counteract deceptive information. 
ReCon combines formulation and\nrefinement contemplation processes; formulation contemplation produces initial\nthoughts and speech, while refinement contemplation further polishes them.\nAdditionally, we incorporate first-order and second-order perspective\ntransitions into these processes respectively. Specifically, the first-order\nallows an LLM agent to infer others' mental states, and the second-order\ninvolves understanding how others perceive the agent's mental state. After\nintegrating ReCon with different LLMs, extensive experiment results from the\nAvalon game indicate its efficacy in aiding LLMs to discern and maneuver around\ndeceptive information without extra fine-tuning and data. Finally, we offer a\npossible explanation for the efficacy of ReCon and explore the current\nlimitations of LLMs in terms of safety, reasoning, speaking style, and format,\npotentially furnishing insights for subsequent research.\n","authors":["Shenzhi Wang","Chang Liu","Zilong Zheng","Siyuan Qi","Shuo Chen","Qisen Yang","Andrew Zhao","Chaofei Wang","Shiji Song","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.01320v3.pdf","comment":"40 pages"},{"id":"http://arxiv.org/abs/2310.15797v1","updated":"2023-10-24T12:48:52Z","published":"2023-10-24T12:48:52Z","title":"Random Entity Quantization for Parameter-Efficient Compositional\n Knowledge Graph Representation","summary":" Representation Learning on Knowledge Graphs (KGs) is essential for downstream\ntasks. The dominant approach, KG Embedding (KGE), represents entities with\nindependent vectors and faces the scalability challenge. Recent studies propose\nan alternative way for parameter efficiency, which represents entities by\ncomposing entity-corresponding codewords matched from predefined small-scale\ncodebooks. We refer to the process of obtaining corresponding codewords of each\nentity as entity quantization, for which previous works have designed\ncomplicated strategies. 
Surprisingly, this paper shows that simple random\nentity quantization can achieve similar results to current strategies. We\nanalyze this phenomenon and reveal that entity codes, the quantization outcomes\nfor expressing entities, have higher entropy at the code level and Jaccard\ndistance at the codeword level under random entity quantization. Therefore,\ndifferent entities become more easily distinguished, facilitating effective KG\nrepresentation. The above results show that current quantization strategies are\nnot critical for KG representation, and there is still room for improvement in\nentity distinguishability beyond current strategies. The code to reproduce our\nresults is available at https://github.com/JiaangL/RandomQuantization.\n","authors":["Jiaang Li","Quan Wang","Yi Liu","Licheng Zhang","Zhendong Mao"],"pdf_url":"https://arxiv.org/pdf/2310.15797v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15793v1","updated":"2023-10-24T12:44:09Z","published":"2023-10-24T12:44:09Z","title":"Improving generalization in large language models by learning prefix\n subspaces","summary":" This article focuses on large language models (LLMs) fine-tuning in the\nscarce data regime (also known as the \"few-shot\" learning setting). We propose\na method to increase the generalization capabilities of LLMs based on neural\nnetwork subspaces. This optimization method, recently introduced in computer\nvision, aims to improve model generalization by identifying wider local optima\nthrough the joint optimization of an entire simplex of models in parameter\nspace. Its adaptation to massive, pretrained transformers, however, poses some\nchallenges. First, their considerable number of parameters makes it difficult\nto train several models jointly, and second, their deterministic parameter\ninitialization schemes make them unfit for the subspace method as originally\nproposed. 
We show in this paper that \"Parameter Efficient Fine-Tuning\" (PEFT)\nmethods, however, are perfectly compatible with this original approach, and\npropose to learn entire simplex of continuous prefixes. We test our method on a\nvariant of the GLUE benchmark adapted to the few-shot learning setting, and\nshow that both our contributions jointly lead to a gain in average performances\ncompared to sota methods. The implementation can be found at the following\nlink: https://github.com/Liloulou/prefix_subspace\n","authors":["Louis Falissard","Vincent Guigue","Laure Soulier"],"pdf_url":"https://arxiv.org/pdf/2310.15793v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.11582v2","updated":"2023-10-24T12:43:04Z","published":"2023-04-23T08:42:45Z","title":"DiffTraj: Generating GPS Trajectory with Diffusion Probabilistic Model","summary":" Pervasive integration of GPS-enabled devices and data acquisition\ntechnologies has led to an exponential increase in GPS trajectory data,\nfostering advancements in spatial-temporal data mining research. Nonetheless,\nGPS trajectories contain personal geolocation information, rendering serious\nprivacy concerns when working with raw data. A promising approach to address\nthis issue is trajectory generation, which involves replacing original data\nwith generated, privacy-free alternatives. Despite the potential of trajectory\ngeneration, the complex nature of human behavior and its inherent stochastic\ncharacteristics pose challenges in generating high-quality trajectories. In\nthis work, we propose a spatial-temporal diffusion probabilistic model for\ntrajectory generation (DiffTraj). This model effectively combines the\ngenerative abilities of diffusion models with the spatial-temporal features\nderived from real trajectories. The core idea is to reconstruct and synthesize\ngeographic trajectories from white noise through a reverse trajectory denoising\nprocess. 
Furthermore, we propose a Trajectory UNet (Traj-UNet) deep neural\nnetwork to embed conditional information and accurately estimate noise levels\nduring the reverse process. Experiments on two real-world datasets show that\nDiffTraj can be intuitively applied to generate high-fidelity trajectories\nwhile retaining the original distributions. Moreover, the generated results can\nsupport downstream trajectory analysis tasks and significantly outperform other\nmethods in terms of geo-distribution evaluations.\n","authors":["Yuanshao Zhu","Yongchao Ye","Shiyao Zhang","Xiangyu Zhao","James J. Q. Yu"],"pdf_url":"https://arxiv.org/pdf/2304.11582v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15786v1","updated":"2023-10-24T12:34:25Z","published":"2023-10-24T12:34:25Z","title":"Amortised Inference in Neural Networks for Small-Scale Probabilistic\n Meta-Learning","summary":" The global inducing point variational approximation for BNNs is based on\nusing a set of inducing inputs to construct a series of conditional\ndistributions that accurately approximate the conditionals of the true\nposterior distribution. Our key insight is that these inducing inputs can be\nreplaced by the actual data, such that the variational distribution consists of\na set of approximate likelihoods for each datapoint. This structure lends\nitself to amortised inference, in which the parameters of each approximate\nlikelihood are obtained by passing each datapoint through a meta-model known as\nthe inference network. 
By training this inference network across related\ndatasets, we can meta-learn Bayesian inference over task-specific BNNs.\n","authors":["Matthew Ashman","Tommy Rochussen","Adrian Weller"],"pdf_url":"https://arxiv.org/pdf/2310.15786v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15767v1","updated":"2023-10-24T12:13:51Z","published":"2023-10-24T12:13:51Z","title":"Unpaired MRI Super Resolution with Self-Supervised Contrastive Learning","summary":" High-resolution (HR) magnetic resonance imaging (MRI) is crucial for\nenhancing diagnostic accuracy in clinical settings. Nonetheless, the inherent\nlimitation of MRI resolution restricts its widespread applicability. Deep\nlearning-based image super-resolution (SR) methods exhibit promise in improving\nMRI resolution without additional cost. However, these methods frequently\nrequire a substantial number of HR MRI images for training, which can be\nchallenging to acquire. In this paper, we propose an unpaired MRI SR approach\nthat employs self-supervised contrastive learning to enhance SR performance\nwith limited training data. Our approach leverages both authentic HR images and\nsynthetically generated SR images to construct positive and negative sample\npairs, thus facilitating the learning of discriminative features. Empirical\nresults presented in this study underscore significant enhancements in the peak\nsignal-to-noise ratio and structural similarity index, even when a paucity of\nHR images is available. 
These findings accentuate the potential of our approach\nin addressing the challenge of limited training data, thereby contributing to\nthe advancement of high-resolution MRI in clinical applications.\n","authors":["Hao Li","Quanwei Liu","Jianan Liu","Xiling Liu","Yanni Dong","Tao Huang","Zhihan Lv"],"pdf_url":"https://arxiv.org/pdf/2310.15767v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15766v1","updated":"2023-10-24T12:13:49Z","published":"2023-10-24T12:13:49Z","title":"Robust Learning via Conditional Prevalence Adjustment","summary":" Healthcare data often come from multiple sites in which the correlations\nbetween confounding variables can vary widely. If deep learning models exploit\nthese unstable correlations, they might fail catastrophically in unseen sites.\nAlthough many methods have been proposed to tackle unstable correlations, each\nhas its limitations. For example, adversarial training forces models to\ncompletely ignore unstable correlations, but doing so may lead to poor\npredictive performance. Other methods (e.g. Invariant risk minimization [4])\ntry to learn domain-invariant representations that rely only on stable\nassociations by assuming a causal data-generating process (input X causes class\nlabel Y ). Thus, they may be ineffective for anti-causal tasks (Y causes X),\nwhich are common in computer vision. We propose a method called CoPA\n(Conditional Prevalence-Adjustment) for anti-causal tasks. CoPA assumes that\n(1) generation mechanism is stable, i.e. label Y and confounding variable(s) Z\ngenerate X, and (2) the unstable conditional prevalence in each site E fully\naccounts for the unstable correlations between X and Y . Our crucial\nobservation is that confounding variables are routinely recorded in healthcare\nsettings and the prevalence can be readily estimated, for example, from a set\nof (Y, Z) samples (no need for corresponding samples of X). 
CoPA can work even\nif there is a single training site, a scenario which is often overlooked by\nexisting methods. Our experiments on synthetic and real data show CoPA beating\ncompetitive baselines.\n","authors":["Minh Nguyen","Alan Q. Wang","Heejong Kim","Mert R. Sabuncu"],"pdf_url":"https://arxiv.org/pdf/2310.15766v1.pdf","comment":"Accepted at WACV"},{"id":"http://arxiv.org/abs/2302.00617v3","updated":"2023-10-24T12:04:55Z","published":"2023-02-01T17:32:16Z","title":"Learning Large-scale Neural Fields via Context Pruned Meta-Learning","summary":" We introduce an efficient optimization-based meta-learning technique for\nlarge-scale neural field training by realizing significant memory savings\nthrough automated online context point selection. This is achieved by focusing\neach learning step on the subset of data with the highest expected immediate\nimprovement in model quality, resulting in the almost instantaneous modeling of\nglobal structure and subsequent refinement of high-frequency details. We\nfurther improve the quality of our meta-learned initialization by introducing a\nbootstrap correction resulting in the minimization of any error introduced by\nreduced context sets while simultaneously mitigating the well-known myopia of\noptimization-based meta-learning. Finally, we show how gradient re-scaling at\nmeta-test time allows the learning of extremely high-quality neural fields in\nsignificantly shortened optimization procedures. Our framework is\nmodel-agnostic, intuitive, straightforward to implement, and shows significant\nreconstruction improvements for a wide range of signals. We provide an\nextensive empirical evaluation on nine datasets across multiple\nmodalities, demonstrating state-of-the-art results while providing additional\ninsight through careful analysis of the algorithmic components constituting our\nmethod.
Code is available at https://github.com/jihoontack/GradNCP\n","authors":["Jihoon Tack","Subin Kim","Sihyun Yu","Jaeho Lee","Jinwoo Shin","Jonathan Richard Schwarz"],"pdf_url":"https://arxiv.org/pdf/2302.00617v3.pdf","comment":"Published as a conference proceeding for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15744v1","updated":"2023-10-24T11:36:41Z","published":"2023-10-24T11:36:41Z","title":"Analyzing Single Cell RNA Sequencing with Topological Nonnegative Matrix\n Factorization","summary":" Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that\nhas stimulated enormous interest in statistics, data science, and computational\nbiology due to the high dimensionality, complexity, and large scale associated\nwith scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique\napproach due to its meta-gene interpretation of resulting low-dimensional\ncomponents. However, NMF approaches suffer from the lack of multiscale\nanalysis. This work introduces two persistent Laplacian regularized NMF\nmethods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By\nemploying a total of 12 datasets, we demonstrate that the proposed TNMF and\nrTNMF significantly outperform all other NMF-based methods. We have also\nutilized TNMF and rTNMF for the visualization of popular Uniform Manifold\nApproximation and Projection (UMAP) and t-distributed stochastic neighbor\nembedding (t-SNE).\n","authors":["Yuta Hozumi","Guo-Wei Wei"],"pdf_url":"https://arxiv.org/pdf/2310.15744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15742v1","updated":"2023-10-24T11:34:15Z","published":"2023-10-24T11:34:15Z","title":"Improving Diffusion Models for ECG Imputation with an Augmented Template\n Prior","summary":" Pulsative signals such as the electrocardiogram (ECG) are extensively\ncollected as part of routine clinical care. 
However, noisy and poor-quality\nrecordings, leading to missing values, are a major issue for signals collected\nusing mobile health systems, decreasing the signal quality and affecting the\nautomated downstream tasks. Recent studies have explored imputation of missing\nvalues for ECG with probabilistic time-series models. Nevertheless, in\ncomparison with the deterministic models, their performance is still limited,\nas the variations across subjects and heart-beat relationships are not\nexplicitly considered in the training objective. In this work, to improve the\nECG imputation and forecasting accuracy with probabilistic models, we present\na template-guided denoising diffusion probabilistic model, PulseDiff, which is\nconditioned on an informative prior for a range of health conditions.\nSpecifically, 1) we first extract a subject-level pulsative template from the\nobservation as an informative prior of missing values, which captures the\npersonal characteristics; 2) we then add beat-level stochastic shift terms on\nthe template for prior augmentation, which considers the beat-level variance of\npositioning and amplitude; 3) we finally design a confidence score to consider\nthe health condition of subject, which ensures our prior is provided in a safe\nway. Experiments with the PTBXL dataset reveal PulseDiff improves the\nperformance of two strong DDPMs baseline models, CSDI and SSSD$^{S4}$,\nverifying our method guides the generation of DDPMs while managing the\nuncertainty.
When combining with SSSD$^{S4}$, our PulseDiff method outperforms\nthe leading deterministic model for short-interval missing data and is\ncomparable for long-interval data loss.\n","authors":["Alexander Jenkins","Zehua Chen","Fu Siong Ng","Danilo Mandic"],"pdf_url":"https://arxiv.org/pdf/2310.15742v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15842v2","updated":"2023-10-24T11:09:55Z","published":"2023-09-27T17:59:11Z","title":"Exploiting the Signal-Leak Bias in Diffusion Models","summary":" There is a bias in the inference pipeline of most diffusion models. This bias\narises from a signal leak whose distribution deviates from the noise\ndistribution, creating a discrepancy between training and inference processes.\nWe demonstrate that this signal-leak bias is particularly significant when\nmodels are tuned to a specific style, causing sub-optimal style matching.\nRecent research tries to avoid the signal leakage during training. We instead\nshow how we can exploit this signal-leak bias in existing diffusion models to\nallow more control over the generated images. This enables us to generate\nimages with more varied brightness, and images that better match a desired\nstyle or color. 
By modeling the distribution of the signal leak in the spatial\nfrequency and pixel domains, and including a signal leak in the initial latent,\nwe generate images that better match expected results without any additional\ntraining.\n","authors":["Martin Nicolas Everaert","Athanasios Fitsios","Marco Bocchio","Sami Arpa","Sabine Süsstrunk","Radhakrishna Achanta"],"pdf_url":"https://arxiv.org/pdf/2309.15842v2.pdf","comment":"corrected the author names in reference [24]"},{"id":"http://arxiv.org/abs/2305.00586v4","updated":"2023-10-24T11:09:21Z","published":"2023-04-30T21:44:21Z","title":"How does GPT-2 compute greater-than?: Interpreting mathematical\n abilities in a pre-trained language model","summary":" Pre-trained language models can be surprisingly adept at tasks they were not\nexplicitly trained on, but how they implement these capabilities is poorly\nunderstood. In this paper, we investigate the basic mathematical abilities\noften acquired by pre-trained language models. Concretely, we use mechanistic\ninterpretability techniques to explain the (limited) mathematical abilities of\nGPT-2 small. As a case study, we examine its ability to take in sentences such\nas \"The war lasted from the year 1732 to the year 17\", and predict valid\ntwo-digit end years (years > 32). We first identify a circuit, a small subset\nof GPT-2 small's computational graph that computes this task's output. Then, we\nexplain the role of each circuit component, showing that GPT-2 small's final\nmulti-layer perceptrons boost the probability of end years greater than the\nstart year. Finally, we find related tasks that activate our circuit. 
Our\nresults suggest that GPT-2 small computes greater-than using a complex but\ngeneral mechanism that activates across diverse contexts.\n","authors":["Michael Hanna","Ollie Liu","Alexandre Variengien"],"pdf_url":"https://arxiv.org/pdf/2305.00586v4.pdf","comment":"NeurIPS 2023 Camera Ready Version"},{"id":"http://arxiv.org/abs/1910.03201v6","updated":"2023-10-24T10:59:28Z","published":"2019-10-08T03:57:04Z","title":"Differentiable Sparsification for Deep Neural Networks","summary":" Deep neural networks have significantly alleviated the burden of feature\nengineering, but comparable efforts are now required to determine effective\narchitectures for these networks. Furthermore, as network sizes have become\nexcessively large, a substantial amount of resources is invested in reducing\ntheir sizes. These challenges can be effectively addressed through the\nsparsification of over-complete models. In this study, we propose a fully\ndifferentiable sparsification method for deep neural networks, which can zero\nout unimportant parameters by directly optimizing a regularized objective\nfunction with stochastic gradient descent. Consequently, the proposed method\ncan learn both the sparsified structure and weights of a network in an\nend-to-end manner. It can be directly applied to various modern deep neural\nnetworks and requires minimal modification to the training process. To the best\nof our knowledge, this is the first fully differentiable sparsification method.\n","authors":["Yognjin Lee"],"pdf_url":"https://arxiv.org/pdf/1910.03201v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.07644v2","updated":"2023-10-24T10:52:16Z","published":"2023-05-12T17:55:40Z","title":"Beware of diffusion models for synthesizing medical images -- A\n comparison with GANs in terms of memorizing brain MRI and chest x-ray images","summary":" Diffusion models were initially developed for text-to-image generation and\nare now being utilized to generate high-quality synthetic images. 
Preceded by\nGANs, diffusion models have shown impressive results using various evaluation\nmetrics. However, commonly used metrics such as FID and IS are not suitable for\ndetermining whether diffusion models are simply reproducing the training\nimages. Here we train StyleGAN and diffusion models, using BRATS20, BRATS21 and\na chest x-ray pneumonia dataset, to synthesize brain MRI and chest x-ray\nimages, and measure the correlation between the synthetic images and all\ntraining images. Our results show that diffusion models are more likely to\nmemorize the training images, compared to StyleGAN, especially for small\ndatasets and when using 2D slices from 3D volumes. Researchers should be\ncareful when using diffusion models for medical imaging, if the final goal is\nto share the synthetic images\n","authors":["Muhammad Usman Akbar","Wuhao Wang","Anders Eklund"],"pdf_url":"https://arxiv.org/pdf/2305.07644v2.pdf","comment":"12 Pages, 6 Figures"},{"id":"http://arxiv.org/abs/2310.15719v1","updated":"2023-10-24T10:51:50Z","published":"2023-10-24T10:51:50Z","title":"Recurrent Linear Transformers","summary":" The self-attention mechanism in the transformer architecture is capable of\ncapturing long-range dependencies and it is the main reason behind its\neffectiveness in processing sequential data. Nevertheless, despite their\nsuccess, transformers have two significant drawbacks that still limit their\nbroader applicability: (1) In order to remember past information, the\nself-attention mechanism requires access to the whole history to be provided as\ncontext. (2) The inference cost in transformers is expensive. In this paper we\nintroduce recurrent alternatives to the transformer self-attention mechanism\nthat offer a context-independent inference cost, leverage long-range\ndependencies effectively, and perform well in practice.
We evaluate our\napproaches in reinforcement learning problems where the aforementioned\ncomputational limitations make the application of transformers nearly\ninfeasible. We quantify the impact of the different components of our\narchitecture in a diagnostic environment and assess performance gains in 2D and\n3D pixel-based partially-observable environments. When compared to a\nstate-of-the-art architecture, GTrXL, inference in our approach is at least 40%\ncheaper while reducing memory use by more than 50%. Our approach either\nperforms similarly or better than GTrXL, improving more than 37% upon GTrXL\nperformance on harder tasks.\n","authors":["Subhojeet Pramanik","Esraa Elelimy","Marlos C. Machado","Adam White"],"pdf_url":"https://arxiv.org/pdf/2310.15719v1.pdf","comment":"transformers, reinforcement learning, partial observability"},{"id":"http://arxiv.org/abs/2310.15709v1","updated":"2023-10-24T10:38:02Z","published":"2023-10-24T10:38:02Z","title":"Causal Representation Learning Made Identifiable by Grouping of\n Observational Variables","summary":" A topic of great current interest is Causal Representation Learning (CRL),\nwhose goal is to learn a causal model for hidden features in a data-driven\nmanner. Unfortunately, CRL is severely ill-posed since it is a combination of\nthe two notoriously ill-posed problems of representation learning and causal\ndiscovery. Yet, finding practical identifiability conditions that guarantee a\nunique solution is crucial for its practical applicability. Most approaches so\nfar have been based on assumptions on the latent causal mechanisms, such as\ntemporal causality, or existence of supervision or interventions; these can be\ntoo restrictive in actual applications. Here, we show identifiability based on\nnovel, weak constraints, which requires no temporal structure, intervention,\nnor weak supervision. The approach is based on assuming the observational mixing\nexhibits a suitable grouping of the observational variables.
We also propose a\nnovel self-supervised estimation framework consistent with the model, prove its\nstatistical consistency, and experimentally show its superior CRL performances\ncompared to the state-of-the-art baselines. We further demonstrate its\nrobustness against latent confounders and causal cycles.\n","authors":["Hiroshi Morioka","Aapo Hyvärinen"],"pdf_url":"https://arxiv.org/pdf/2310.15709v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15706v1","updated":"2023-10-24T10:35:08Z","published":"2023-10-24T10:35:08Z","title":"Solving large flexible job shop scheduling instances by generating a\n diverse set of scheduling policies with deep reinforcement learning","summary":" The Flexible Job Shop Scheduling Problem (FJSSP) has been extensively studied\nin the literature, and multiple approaches have been proposed within the\nheuristic, exact, and metaheuristic methods. However, the industry's demand to\nbe able to respond in real-time to disruptive events has generated the\nnecessity to be able to generate new schedules within a few seconds. Among\nthese methods, under this constraint, only dispatching rules (DRs) are capable\nof generating schedules, even though their quality can be improved. To improve\nthe results, recent methods have been proposed for modeling the FJSSP as a\nMarkov Decision Process (MDP) and employing reinforcement learning to create a\npolicy that generates an optimal solution assigning operations to machines.\nNonetheless, there is still room for improvement, particularly in the larger\nFJSSP instances which are common in real-world scenarios. Therefore, the\nobjective of this paper is to propose a method capable of robustly solving\nlarge instances of the FJSSP. To achieve this, we propose a novel way of\nmodeling the FJSSP as an MDP using graph neural networks. We also present two\nmethods to make inference more robust: generating a diverse set of scheduling\npolicies that can be parallelized and limiting them using DRs. 
We have tested\nour approach on synthetically generated instances and various public benchmarks\nand found that our approach outperforms dispatching rules and achieves better\nresults than three other recent deep reinforcement learning methods on larger\nFJSSP instances.\n","authors":["Imanol Echeverria","Maialen Murua","Roberto Santana"],"pdf_url":"https://arxiv.org/pdf/2310.15706v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.07365v3","updated":"2023-10-24T10:34:22Z","published":"2022-08-15T17:59:31Z","title":"Unsupervised Video Domain Adaptation for Action Recognition: A\n Disentanglement Perspective","summary":" Unsupervised video domain adaptation is a practical yet challenging task. In\nthis work, for the first time, we tackle it from a disentanglement view. Our\nkey idea is to handle the spatial and temporal domain divergence separately\nthrough disentanglement. Specifically, we consider the generation of\ncross-domain videos from two sets of latent factors, one encoding the static\ninformation and another encoding the dynamic information. A Transfer Sequential\nVAE (TranSVAE) framework is then developed to model such generation. To better\nserve for adaptation, we propose several objectives to constrain the latent\nfactors. With these constraints, the spatial divergence can be readily removed\nby disentangling the static domain-specific information out, and the temporal\ndivergence is further reduced from both frame- and video-levels through\nadversarial learning. Extensive experiments on the UCF-HMDB, Jester, and\nEpic-Kitchens datasets verify the effectiveness and superiority of TranSVAE\ncompared with several state-of-the-art approaches. 
Code is publicly available.\n","authors":["Pengfei Wei","Lingdong Kong","Xinghua Qu","Yi Ren","Zhiqiang Xu","Jing Jiang","Xiang Yin"],"pdf_url":"https://arxiv.org/pdf/2208.07365v3.pdf","comment":"NeurIPS 2023; 20 pages, 9 figures, 10 tables; Code at\n https://github.com/ldkong1205/TranSVAE"},{"id":"http://arxiv.org/abs/2307.09302v2","updated":"2023-10-24T10:34:22Z","published":"2023-07-18T14:40:48Z","title":"Conformal prediction under ambiguous ground truth","summary":" Conformal Prediction (CP) allows to perform rigorous uncertainty\nquantification by constructing a prediction set $C(X)$ satisfying $\\mathbb{P}(Y\n\\in C(X))\\geq 1-\\alpha$ for a user-chosen $\\alpha \\in [0,1]$ by relying on\ncalibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\\mathbb{P}=\\mathbb{P}^{X}\n\\otimes \\mathbb{P}^{Y|X}$. It is typically implicitly assumed that\n$\\mathbb{P}^{Y|X}$ is the \"true\" posterior label distribution. However, in many\nreal-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating\nexpert opinions using a voting procedure, resulting in a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$. For such ``voted'' labels, CP guarantees are thus\nw.r.t. $\\mathbb{P}_{vote}=\\mathbb{P}^X \\otimes \\mathbb{P}_{vote}^{Y|X}$ rather\nthan the true distribution $\\mathbb{P}$. In cases with unambiguous ground truth\nlabels, the distinction between $\\mathbb{P}_{vote}$ and $\\mathbb{P}$ is\nirrelevant. However, when experts do not agree because of ambiguous labels,\napproximating $\\mathbb{P}^{Y|X}$ with a one-hot distribution\n$\\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose\nto leverage expert opinions to approximate $\\mathbb{P}^{Y|X}$ using a\nnon-degenerate distribution $\\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP\nprocedures which provide guarantees w.r.t. 
$\\mathbb{P}_{agg}=\\mathbb{P}^X\n\\otimes \\mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels\nfrom $\\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a\ncase study of skin condition classification with significant disagreement among\nexpert annotators, we show that applying CP w.r.t. $\\mathbb{P}_{vote}$\nunder-covers expert annotations: calibrated for $72\\%$ coverage, it falls short\nby on average $10\\%$; our Monte Carlo CP closes this gap both empirically and\ntheoretically.\n","authors":["David Stutz","Abhijit Guha Roy","Tatiana Matejovicova","Patricia Strachan","Ali Taylan Cemgil","Arnaud Doucet"],"pdf_url":"https://arxiv.org/pdf/2307.09302v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.05295v4","updated":"2023-10-24T10:08:10Z","published":"2022-11-10T02:05:17Z","title":"Harmonizing output imbalance for defect segmentation on\n extremely-imbalanced photovoltaic module cells images","summary":" The continuous development of the photovoltaic (PV) industry has raised high\nrequirements for the quality of monocrystalline of PV module cells. When\nlearning to segment defect regions in PV module cell images, Tiny Hidden Cracks\n(THC) lead to extremely-imbalanced samples. The ratio of defect pixels to\nnormal pixels can be as low as 1:2000. This extreme imbalance makes it\ndifficult to segment the THC of PV module cells, which is also a challenge for\nsemantic segmentation. To address the problem of segmenting defects on\nextremely-imbalanced THC data, the paper makes contributions from three\naspects: (1) it proposes an explicit measure for output imbalance; (2) it\ngeneralizes a distribution-based loss that can handle different types of output\nimbalances; and (3) it introduces a compound loss with our adaptive\nhyperparameter selection algorithm that can keep the consistency of training\nand inference for harmonizing the output imbalance on extremely-imbalanced input\ndata. 
The proposed method is evaluated on four widely-used deep learning\narchitectures and four datasets with varying degrees of input imbalance. The\nexperimental results show that the proposed method outperforms existing\nmethods.\n","authors":["Jianye Yi","Xiaopin Zhong","Weixiang Liu","Zongze Wu","Yuanlong Deng","Zhengguang Wu"],"pdf_url":"https://arxiv.org/pdf/2211.05295v4.pdf","comment":"19 pages, 16 figures, 3 appendixes"},{"id":"http://arxiv.org/abs/2310.15694v1","updated":"2023-10-24T10:05:32Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. 
Our experimental results show that COPF\noutperforms strong Continuous learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15690v1","updated":"2023-10-24T10:01:15Z","published":"2023-10-24T10:01:15Z","title":"Physics-Informed with Power-Enhanced Residual Network for Interpolation\n and Inverse Problems","summary":" This paper introduces a novel neural network structure called the\nPower-Enhancing residual network, designed to improve interpolation\ncapabilities for both smooth and non-smooth functions in 2D and 3D settings. By\nadding power terms to residual elements, the architecture boosts the network's\nexpressive power. The study explores network depth, width, and optimization\nmethods, showing the architecture's adaptability and performance advantages.\nConsistently, the results emphasize the exceptional accuracy of the proposed\nPower-Enhancing residual network, particularly for non-smooth functions.\nReal-world examples also confirm its superiority over plain neural network in\nterms of accuracy, convergence, and efficiency. The study also looks at the\nimpact of deeper network. Moreover, the proposed architecture is also applied\nto solving the inverse Burgers' equation, demonstrating superior performance.\nIn conclusion, the Power-Enhancing residual network offers a versatile solution\nthat significantly enhances neural network capabilities. The codes implemented\nare available at: \\url{https://github.com/CMMAi/ResNet_for_PINN}.\n","authors":["Amir Noorizadegan","D. L. Young","Y. C. Hon","C. S. 
Chen"],"pdf_url":"https://arxiv.org/pdf/2310.15690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.10238v2","updated":"2023-10-24T09:51:27Z","published":"2023-08-20T11:56:02Z","title":"Thompson Sampling for Real-Valued Combinatorial Pure Exploration of\n Multi-Armed Bandit","summary":" We study the real-valued combinatorial pure exploration of the multi-armed\nbandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic\narms, and the reward of each arm $s\\in\\{1, \\ldots, d\\}$ follows an unknown\ndistribution with mean $\\mu_s$. In each time step, a player pulls a single arm\nand observes its reward. The player's goal is to identify the optimal\n\\emph{action} $\\boldsymbol{\\pi}^{*} = \\argmax_{\\boldsymbol{\\pi} \\in\n\\mathcal{A}} \\boldsymbol{\\mu}^{\\top}\\boldsymbol{\\pi}$ from a finite-sized\nreal-valued \\emph{action set} $\\mathcal{A}\\subset \\mathbb{R}^{d}$ with as few\narm pulls as possible. Previous methods in the R-CPE-MAB assume that the size\nof the action set $\\mathcal{A}$ is polynomial in $d$. We introduce an algorithm\nnamed the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm,\nwhich is the first algorithm that can work even when the size of the action set\nis exponentially large in $d$. We also introduce a novel problem-dependent\nsample complexity lower bound of the R-CPE-MAB problem, and show that the\nGenTS-Explore algorithm achieves the optimal sample complexity up to a\nproblem-dependent constant factor.\n","authors":["Shintaro Nakamura","Masashi Sugiyama"],"pdf_url":"https://arxiv.org/pdf/2308.10238v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09347v2","updated":"2023-10-24T09:51:00Z","published":"2023-06-15T17:59:54Z","title":"Segment Any Point Cloud Sequences by Distilling Vision Foundation Models","summary":" Recent advancements in vision foundation models (VFMs) have opened up new\npossibilities for versatile and efficient visual perception. 
In this work, we\nintroduce Seal, a novel framework that harnesses VFMs for segmenting diverse\nautomotive point cloud sequences. Seal exhibits three appealing properties: i)\nScalability: VFMs are directly distilled into point clouds, obviating the need\nfor annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial\nand temporal relationships are enforced at both the camera-to-LiDAR and\npoint-to-segment regularization stages, facilitating cross-modal representation\nlearning. iii) Generalizability: Seal enables knowledge transfer in an\noff-the-shelf manner to downstream tasks involving diverse point clouds,\nincluding those from real/synthetic, low/high-resolution, large/small-scale,\nand clean/corrupted datasets. Extensive experiments conducted on eleven\ndifferent point cloud datasets showcase the effectiveness and superiority of\nSeal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear\nprobing, surpassing random initialization by 36.9% mIoU and outperforming prior\narts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains\nover existing methods across 20 different few-shot fine-tuning tasks on all\neleven tested point cloud datasets.\n","authors":["Youquan Liu","Lingdong Kong","Jun Cen","Runnan Chen","Wenwei Zhang","Liang Pan","Kai Chen","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2306.09347v2.pdf","comment":"NeurIPS 2023 (Spotlight); 37 pages, 16 figures, 15 tables; Code at\n https://github.com/youquanl/Segment-Any-Point-Cloud"},{"id":"http://arxiv.org/abs/2310.15681v1","updated":"2023-10-24T09:47:32Z","published":"2023-10-24T09:47:32Z","title":"Fixed-Budget Real-Valued Combinatorial Pure Exploration of Multi-Armed\n Bandit","summary":" We study the real-valued combinatorial pure exploration of the multi-armed\nbandit in the fixed-budget setting. 
We first introduce the Combinatorial\nSuccessive Asign (CSA) algorithm, which is the first algorithm that can\nidentify the best action even when the size of the action class is\nexponentially large with respect to the number of arms. We show that the upper\nbound of the probability of error of the CSA algorithm matches a lower bound up\nto a logarithmic factor in the exponent. Then, we introduce another algorithm\nnamed the Minimax Combinatorial Successive Accepts and Rejects\n(Minimax-CombSAR) algorithm for the case where the size of the action class is\npolynomial, and show that it is optimal, which matches a lower bound. Finally,\nwe experimentally compare the algorithms with previous methods and show that\nour algorithm performs better.\n","authors":["Shintaro Nakamura","Masashi Sugiyama"],"pdf_url":"https://arxiv.org/pdf/2310.15681v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2201.11808v5","updated":"2023-10-24T09:42:53Z","published":"2022-01-27T21:10:20Z","title":"LAP: An Attention-Based Module for Concept Based Self-Interpretation and\n Knowledge Injection in Convolutional Neural Networks","summary":" Despite the state-of-the-art performance of deep convolutional neural\nnetworks, they are susceptible to bias and malfunction in unseen situations.\nMoreover, the complex computation behind their reasoning is not\nhuman-understandable to develop trust. External explainer methods have tried to\ninterpret network decisions in a human-understandable way, but they are accused\nof fallacies due to their assumptions and simplifications. On the other side,\nthe inherent self-interpretability of models, while being more robust to the\nmentioned fallacies, cannot be applied to the already trained models. In this\nwork, we propose a new attention-based pooling layer, called Local Attention\nPooling (LAP), that accomplishes self-interpretability and the possibility for\nknowledge injection without performance loss. 
The module is easily pluggable\ninto any convolutional neural network, even the already trained ones. We have\ndefined a weakly supervised training scheme to learn the distinguishing\nfeatures in decision-making without depending on experts' annotations. We\nverified our claims by evaluating several LAP-extended models on two datasets,\nincluding ImageNet. The proposed framework offers more valid\nhuman-understandable and faithful-to-the-model interpretations than the\ncommonly used white-box explainer methods.\n","authors":["Rassa Ghavami Modegh","Ahmad Salimi","Alireza Dizaji","Hamid R. Rabiee"],"pdf_url":"https://arxiv.org/pdf/2201.11808v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.13060v5","updated":"2023-10-24T09:28:08Z","published":"2023-01-30T17:02:23Z","title":"Zero-One Laws of Graph Neural Networks","summary":" Graph neural networks (GNNs) are the de facto standard deep learning\narchitectures for machine learning on graphs. This has led to a large body of\nwork analyzing the capabilities and limitations of these models, particularly\npertaining to their representation and extrapolation capacity. We offer a novel\ntheoretical perspective on the representation and extrapolation capacity of\nGNNs, by answering the question: how do GNNs behave as the number of graph\nnodes become very large? Under mild assumptions, we show that when we draw\ngraphs of increasing size from the Erd\\H{o}s-R\\'enyi model, the probability\nthat such graphs are mapped to a particular output by a class of GNN\nclassifiers tends to either zero or to one. This class includes the popular\ngraph convolutional network architecture. The result establishes 'zero-one\nlaws' for these GNNs, and analogously to other convergence laws, entails\ntheoretical limitations on their capacity. 
We empirically verify our results,\nobserving that the theoretical asymptotic limits are evident already on\nrelatively small graphs.\n","authors":["Sam Adam-Day","Theodor Mihai Iliant","İsmail İlkan Ceylan"],"pdf_url":"https://arxiv.org/pdf/2301.13060v5.pdf","comment":"NeurIPS '23 camera-ready version; 10 pages + references + 10 pages\n appendices, 7 figures"},{"id":"http://arxiv.org/abs/2310.15662v1","updated":"2023-10-24T09:17:47Z","published":"2023-10-24T09:17:47Z","title":"Interactive Generalized Additive Model and Its Applications in Electric\n Load Forecasting","summary":" Electric load forecasting is an indispensable component of electric power\nsystem planning and management. Inaccurate load forecasting may lead to the\nthreat of outages or a waste of energy. Accurate electric load forecasting is\nchallenging when there is limited data or even no data, such as load\nforecasting in holiday, or under extreme weather conditions. As high-stakes\ndecision-making usually follows after load forecasting, model interpretability\nis crucial for the adoption of forecasting models. In this paper, we propose an\ninteractive GAM which is not only interpretable but also can incorporate\nspecific domain knowledge in electric power industry for improved performance.\nThis boosting-based GAM leverages piecewise linear functions and can be learned\nthrough our efficient algorithm. In both public benchmark and electricity\ndatasets, our interactive GAM outperforms current state-of-the-art methods and\ndemonstrates good generalization ability in the cases of extreme weather\nevents. 
We launched a user-friendly web-based tool based on interactive GAM and\nalready incorporated it into our eForecaster product, a unified AI platform for\nelectricity forecasting.\n","authors":["Linxiao Yang","Rui Ren","Xinyue Gu","Liang Sun"],"pdf_url":"https://arxiv.org/pdf/2310.15662v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15656v1","updated":"2023-10-24T09:10:45Z","published":"2023-10-24T09:10:45Z","title":"Momentum Gradient-based Untargeted Attack on Hypergraph Neural Networks","summary":" Hypergraph Neural Networks (HGNNs) have been successfully applied in various\nhypergraph-related tasks due to their excellent higher-order representation\ncapabilities. Recent works have shown that deep learning models are vulnerable\nto adversarial attacks. Most studies on graph adversarial attacks have focused\non Graph Neural Networks (GNNs), and the study of adversarial attacks on HGNNs\nremains largely unexplored. In this paper, we try to reduce this gap. We design\na new HGNNs attack model for the untargeted attack, namely MGHGA, which focuses\non modifying node features. We consider the process of HGNNs training and use a\nsurrogate model to implement the attack before hypergraph modeling.\nSpecifically, MGHGA consists of two parts: feature selection and feature\nmodification. We use a momentum gradient mechanism to choose the attack node\nfeatures in the feature selection module. In the feature modification module,\nwe use two feature generation approaches (direct modification and sign\ngradient) to enable MGHGA to be employed on discrete and continuous datasets.\nWe conduct extensive experiments on five benchmark datasets to validate the\nattack performance of MGHGA in the node and the visual object classification\ntasks. 
The results show that MGHGA improves performance by an average of 2%\ncompared to the baselines.\n","authors":["Yang Chen","Stjepan Picek","Zhonglin Ye","Zhaoyang Wang","Haixing Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.15656v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15654v1","updated":"2023-10-24T09:10:26Z","published":"2023-10-24T09:10:26Z","title":"A Survey on Detection of LLMs-Generated Content","summary":" The burgeoning capabilities of advanced large language models (LLMs) such as\nChatGPT have led to an increase in synthetic content generation with\nimplications across a variety of sectors, including media, cybersecurity,\npublic discourse, and education. As such, the ability to detect LLMs-generated\ncontent has become of paramount importance. We aim to provide a detailed\noverview of existing detection strategies and benchmarks, scrutinizing their\ndifferences and identifying key challenges and prospects in the field,\nadvocating for more adaptable and robust models to enhance detection accuracy.\nWe also posit the necessity for a multi-faceted approach to defend against\nvarious attacks to counter the rapidly advancing capabilities of LLMs. To the\nbest of our knowledge, this work is the first comprehensive survey on the\ndetection in the era of LLMs. We hope it will provide a broad understanding of\nthe current landscape of LLMs-generated content detection, offering a guiding\nreference for researchers and practitioners striving to uphold the integrity of\ndigital information in an era increasingly dominated by synthetic content. 
The\nrelevant papers are summarized and will be consistently updated at\nhttps://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git.\n","authors":["Xianjun Yang","Liangming Pan","Xuandong Zhao","Haifeng Chen","Linda Petzold","William Yang Wang","Wei Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.15654v1.pdf","comment":"We will keep updating at\n https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git"},{"id":"http://arxiv.org/abs/2310.15653v1","updated":"2023-10-24T09:10:14Z","published":"2023-10-24T09:10:14Z","title":"Deceptive Fairness Attacks on Graphs via Meta Learning","summary":" We study deceptive fairness attacks on graphs to answer the following\nquestion: How can we achieve poisoning attacks on a graph learning model to\nexacerbate the bias deceptively? We answer this question via a bi-level\noptimization problem and propose a meta learning-based framework named FATE.\nFATE is broadly applicable with respect to various fairness definitions and\ngraph learning models, as well as arbitrary choices of manipulation operations.\nWe further instantiate FATE to attack statistical parity and individual\nfairness on graph neural networks. We conduct extensive experimental\nevaluations on real-world datasets in the task of semi-supervised node\nclassification. The experimental results demonstrate that FATE could amplify\nthe bias of graph neural networks with or without fairness consideration while\nmaintaining the utility on the downstream task. 
We hope this paper provides\ninsights into the adversarial robustness of fair graph learning and can shed\nlight on designing robust and fair graph learning in future studies.\n","authors":["Jian Kang","Yinglong Xia","Ross Maciejewski","Jiebo Luo","Hanghang Tong"],"pdf_url":"https://arxiv.org/pdf/2310.15653v1.pdf","comment":"23 pages, 11 tables"},{"id":"http://arxiv.org/abs/2305.09836v2","updated":"2023-10-24T09:10:03Z","published":"2023-05-16T22:37:01Z","title":"Revisiting the Minimalist Approach to Offline Reinforcement Learning","summary":" Recent years have witnessed significant advancements in offline reinforcement\nlearning (RL), resulting in the development of numerous algorithms with varying\ndegrees of complexity. While these algorithms have led to noteworthy\nimprovements, many incorporate seemingly minor design choices that impact their\neffectiveness beyond core algorithmic advances. However, the effect of these\ndesign choices on established baselines remains understudied. In this work, we\naim to bridge this gap by conducting a retrospective analysis of recent works\nin offline RL and propose ReBRAC, a minimalistic algorithm that integrates such\ndesign elements built on top of the TD3+BC method. We evaluate ReBRAC on 51\ndatasets with both proprioceptive and visual state spaces using D4RL and V-D4RL\nbenchmarks, demonstrating its state-of-the-art performance among ensemble-free\nmethods in both offline and offline-to-online settings. 
To further illustrate\nthe efficacy of these design choices, we perform a large-scale ablation study\nand hyperparameter sensitivity analysis on the scale of thousands of\nexperiments.\n","authors":["Denis Tarasov","Vladislav Kurenkov","Alexander Nikulin","Sergey Kolesnikov"],"pdf_url":"https://arxiv.org/pdf/2305.09836v2.pdf","comment":"Source code: https://github.com/DT6A/ReBRAC"},{"id":"http://arxiv.org/abs/2310.15648v1","updated":"2023-10-24T09:08:20Z","published":"2023-10-24T09:08:20Z","title":"Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio\n Models","summary":" The introduction of large-scale audio datasets, such as AudioSet, paved the\nway for Transformers to conquer the audio domain and replace CNNs as the\nstate-of-the-art neural network architecture for many tasks. Audio Spectrogram\nTransformers are excellent at exploiting large datasets, creating powerful\npre-trained models that surpass CNNs when fine-tuned on downstream tasks.\nHowever, current popular Audio Spectrogram Transformers are demanding in terms\nof computational complexity compared to CNNs. Recently, we have shown that, by\nemploying Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch\nup with and even outperform Transformers on large datasets. In this work, we\nextend this line of research and increase the capacity of efficient CNNs by\nintroducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic\nconvolutions and attention mechanisms. We show that these dynamic CNNs\noutperform traditional efficient CNNs, in terms of the performance-complexity\ntrade-off and parameter efficiency, at the task of audio tagging on the\nlarge-scale AudioSet. 
Our experiments further indicate that the introduced\ndynamic CNNs achieve better performance on downstream tasks and scale up well,\nattaining Transformer performance and even outperforming them on AudioSet and\nseveral downstream tasks.\n","authors":["Florian Schmid","Khaled Koutini","Gerhard Widmer"],"pdf_url":"https://arxiv.org/pdf/2310.15648v1.pdf","comment":"Submitted to IEEE/ACM Transactions on Audio, Speech, and Language\n Processing. Source Code available at:\n https://github.com/fschmid56/EfficientAT"},{"id":"http://arxiv.org/abs/2210.01125v2","updated":"2023-10-24T09:08:04Z","published":"2022-10-03T03:07:33Z","title":"Spectral2Spectral: Image-spectral Similarity Assisted Spectral CT Deep\n Reconstruction without Reference","summary":" Spectral computed tomography based on a photon-counting detector (PCD)\nattracts more and more attentions since it has the capability to provide more\naccurate identification and quantitative analysis for biomedical materials. The\nlimited number of photons within narrow energy bins leads to imaging results of\nlow signal-noise ratio. The existing supervised deep reconstruction networks\nfor CT reconstruction are difficult to address these challenges because it is\nusually impossible to acquire noise-free clinical images with clear structures\nas references. In this paper, we propose an iterative deep reconstruction\nnetwork to synergize unsupervised method and data priors into a unified\nframework, named as Spectral2Spectral. Our Spectral2Spectral employs an\nunsupervised deep training strategy to obtain high-quality images from noisy\ndata in an end-to-end fashion. The structural similarity prior within\nimage-spectral domain is refined as a regularization term to further constrain\nthe network training. The weights of neural network are automatically updated\nto capture image features and structures within the iterative process. 
Experiments on three\nlarge-scale preclinical datasets demonstrate that\nSpectral2Spectral reconstructs images of better quality than other\nstate-of-the-art methods.\n","authors":["Xiaodong Guo","Longhui Li","Peng He","Peng Feng","Dingyue Chang","Hengyong Yu","Weiwen Wu"],"pdf_url":"https://arxiv.org/pdf/2210.01125v2.pdf","comment":"Accepted by IEEE TCI"},{"id":"http://arxiv.org/abs/2310.15645v1","updated":"2023-10-24T09:07:23Z","published":"2023-10-24T09:07:23Z","title":"Light up that Droid! On the Effectiveness of Static Analysis Features\n against App Obfuscation for Android Malware Detection","summary":" Malware authors have seen obfuscation as a means to bypass malware detectors\nbased on static analysis features. For Android, several studies have confirmed\nthat many anti-malware products are easily evaded with simple program\ntransformations. As opposed to these works, ML detection proposals for Android\nleveraging static analysis features have also been proposed as\nobfuscation-resilient. Therefore, it needs to be determined to what extent the\nuse of a specific obfuscation strategy or tool poses a risk for the validity of\nML malware detectors for Android based on static analysis features. To shed\nsome light in this regard, in this article we assess the impact of specific\nobfuscation techniques on common features extracted using static analysis and\ndetermine whether the changes are significant enough to undermine the\neffectiveness of ML malware detectors that rely on these features. The\nexperimental results suggest that obfuscation techniques affect all static\nanalysis features to varying degrees across different tools. However, certain\nfeatures retain their validity for ML malware detection even in the presence of\nobfuscation. 
Based on these findings, we propose a ML malware detector for\nAndroid that is robust against obfuscation and outperforms current\nstate-of-the-art detectors.\n","authors":["Borja Molina-Coronado","Antonio Ruggia","Usue Mori","Alessio Merlo","Alexander Mendiburu","Jose Miguel-Alonso"],"pdf_url":"https://arxiv.org/pdf/2310.15645v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13035v4","updated":"2023-10-24T09:00:20Z","published":"2023-05-22T13:39:28Z","title":"Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design","summary":" Scaling laws have been recently employed to derive compute-optimal model size\n(number of parameters) for a given compute duration. We advance and refine such\nmethods to infer compute-optimal model shapes, such as width and depth, and\nsuccessfully implement this in vision transformers. Our shape-optimized vision\ntransformer, SoViT, achieves results competitive with models that exceed twice\nits size, despite being pre-trained with an equivalent amount of compute. For\nexample, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSRCV2012,\nsurpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical\nsettings, with also less than half the inference cost. We conduct a thorough\nevaluation across multiple tasks, such as image classification, captioning, VQA\nand zero-shot transfer, demonstrating the effectiveness of our model across a\nbroad range of domains and identifying limitations. Overall, our findings\nchallenge the prevailing approach of blindly scaling up vision models and pave\na path for a more informed scaling.\n","authors":["Ibrahim Alabdulmohsin","Xiaohua Zhai","Alexander Kolesnikov","Lucas Beyer"],"pdf_url":"https://arxiv.org/pdf/2305.13035v4.pdf","comment":"10 pages, 7 figures, 9 tables. 
Version 2: Layout fixes"},{"id":"http://arxiv.org/abs/2310.15641v1","updated":"2023-10-24T08:59:40Z","published":"2023-10-24T08:59:40Z","title":"Guaranteed Coverage Prediction Intervals with Gaussian Process\n Regression","summary":" Gaussian Process Regression (GPR) is a popular regression method, which\nunlike most Machine Learning techniques, provides estimates of uncertainty for\nits predictions. These uncertainty estimates however, are based on the\nassumption that the model is well-specified, an assumption that is violated in\nmost practical applications, since the required knowledge is rarely available.\nAs a result, the produced uncertainty estimates can become very misleading; for\nexample the prediction intervals (PIs) produced for the 95\\% confidence level\nmay cover much less than 95\\% of the true labels. To address this issue, this\npaper introduces an extension of GPR based on a Machine Learning framework\ncalled, Conformal Prediction (CP). This extension guarantees the production of\nPIs with the required coverage even when the model is completely misspecified.\nThe proposed approach combines the advantages of GPR with the valid coverage\nguarantee of CP, while the performed experimental results demonstrate its\nsuperiority over existing methods.\n","authors":["Harris Papadopoulos"],"pdf_url":"https://arxiv.org/pdf/2310.15641v1.pdf","comment":"12 pages. This work has been submitted to IEEE Transactions on\n Pattern Analysis and Machine Intelligence for possible publication. 
Copyright\n may be transferred without notice, after which this version may no longer be\n accessible"},{"id":"http://arxiv.org/abs/2309.16679v2","updated":"2023-10-24T08:53:30Z","published":"2023-07-23T08:53:27Z","title":"Leveraging Deep Learning and Online Source Sentiment for Financial\n Portfolio Management","summary":" Financial portfolio management describes the task of distributing funds and\nconducting trading operations on a set of financial assets, such as stocks,\nindex funds, foreign exchange or cryptocurrencies, aiming to maximize the\nprofit while minimizing the loss incurred by said operations. Deep Learning\n(DL) methods have been consistently excelling at various tasks and automated\nfinancial trading is one of the most complex one of those. This paper aims to\nprovide insight into various DL methods for financial trading, under both the\nsupervised and reinforcement learning schemes. At the same time, taking into\nconsideration sentiment information regarding the traded assets, we discuss and\ndemonstrate their usefulness through corresponding research studies. Finally,\nwe discuss commonly found problems in training such financial agents and equip\nthe reader with the necessary knowledge to avoid these problems and apply the\ndiscussed methods in practice.\n","authors":["Paraskevi Nousi","Loukia Avramelou","Georgios Rodinos","Maria Tzelepi","Theodoros Manousis","Konstantinos Tsampazis","Kyriakos Stefanidis","Dimitris Spanos","Manos Kirtas","Pavlos Tosidis","Avraam Tsantekidis","Nikolaos Passalis","Anastasios Tefas"],"pdf_url":"https://arxiv.org/pdf/2309.16679v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15627v1","updated":"2023-10-24T08:52:04Z","published":"2023-10-24T08:52:04Z","title":"Contextual directed acyclic graphs","summary":" Estimating the structure of directed acyclic graphs (DAGs) from observational\ndata remains a significant challenge in machine learning. 
Most research in this\narea concentrates on learning a single DAG for the entire population. This\npaper considers an alternative setting where the graph structure varies across\nindividuals based on available \"contextual\" features. We tackle this contextual\nDAG problem via a neural network that maps the contextual features to a DAG,\nrepresented as a weighted adjacency matrix. The neural network is equipped with\na novel projection layer that ensures the output matrices are sparse and\nsatisfy a recently developed characterization of acyclicity. We devise a\nscalable computational framework for learning contextual DAGs and provide a\nconvergence guarantee and an analytical gradient for backpropagating through\nthe projection layer. Our experiments suggest that the new approach can recover\nthe true context-specific graph where existing approaches fail.\n","authors":["Ryan Thompson","Edwin V. Bonilla","Robert Kohn"],"pdf_url":"https://arxiv.org/pdf/2310.15627v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15624v1","updated":"2023-10-24T08:45:15Z","published":"2023-10-24T08:45:15Z","title":"GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D\n Object Detection","summary":" Geometry plays a significant role in monocular 3D object detection. It can be\nused to estimate object depth by using the perspective projection between\nobject's physical size and 2D projection in the image plane, which can\nintroduce mathematical priors into deep models. However, this projection\nprocess also introduces error amplification, where the error of the estimated\nheight is amplified and reflected into the projected depth. It leads to\nunreliable depth inferences and also impairs training stability. To tackle this\nproblem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++)\nby modeling geometry projection in a probabilistic manner. This ensures depth\npredictions are well-bounded and associated with a reasonable uncertainty. 
The\nsignificance of introducing such geometric uncertainty is two-fold: (1). It\nmodels the uncertainty propagation relationship of the geometry projection\nduring training, improving the stability and efficiency of the end-to-end model\nlearning. (2). It can be derived to a highly reliable confidence to indicate\nthe quality of the 3D detection result, enabling more reliable detection\ninference. Experiments show that the proposed approach not only obtains\n(state-of-the-art) SOTA performance in image-based monocular 3D detection but\nalso demonstrates superiority in efficacy with a simplified framework.\n","authors":["Yan Lu","Xinzhu Ma","Lei Yang","Tianzhu Zhang","Yating Liu","Qi Chu","Tong He","Yonghui Li","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2310.15624v1.pdf","comment":"18 pages, 9 figures"},{"id":"http://arxiv.org/abs/2301.10956v3","updated":"2023-10-24T08:32:04Z","published":"2023-01-26T06:28:41Z","title":"Graph Neural Networks can Recover the Hidden Features Solely from the\n Graph Structure","summary":" Graph Neural Networks (GNNs) are popular models for graph learning problems.\nGNNs show strong empirical performance in many practical tasks. However, the\ntheoretical properties have not been completely elucidated. In this paper, we\ninvestigate whether GNNs can exploit the graph structure from the perspective\nof the expressive power of GNNs. In our analysis, we consider graph generation\nprocesses that are controlled by hidden (or latent) node features, which\ncontain all information about the graph structure. A typical example of this\nframework is kNN graphs constructed from the hidden features. In our main\nresults, we show that GNNs can recover the hidden node features from the input\ngraph alone, even when all node features, including the hidden features\nthemselves and any indirect hints, are unavailable. GNNs can further use the\nrecovered node features for downstream tasks. 
These results show that GNNs can\nfully exploit the graph structure by themselves, and in effect, GNNs can use\nboth the hidden and explicit node features for downstream tasks. In the\nexperiments, we confirm the validity of our results by showing that GNNs can\naccurately recover the hidden features using a GNN architecture built based on\nour theoretical analysis.\n","authors":["Ryoma Sato"],"pdf_url":"https://arxiv.org/pdf/2301.10956v3.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2310.15612v1","updated":"2023-10-24T08:27:56Z","published":"2023-10-24T08:27:56Z","title":"Machine Translation for Nko: Tools, Corpora and Baseline Results","summary":" Currently, there is no usable machine translation system for Nko, a language\nspoken by tens of millions of people across multiple West African countries,\nwhich holds significant cultural and educational value. To address this issue,\nwe present a set of tools, resources, and baseline results aimed towards the\ndevelopment of usable machine translation systems for Nko and other languages\nthat do not currently have sufficiently large parallel text corpora available.\n(1) Friallel: A novel collaborative parallel text curation software that\nincorporates quality control through copyedit-based workflows. (2) Expansion of\nthe FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko\ntranslations in parallel with 204 and 40 other languages. (3) nicolingua-0005:\nA collection of trilingual and bilingual corpora with 130,850 parallel segments\nand monolingual corpora containing over 3 million Nko words. (4) Baseline\nbilingual and multilingual neural machine translation results with the best\nmodel scoring 30.83 English-Nko chrF++ on FLoRes-devtest.\n","authors":["Moussa Koulako Bala Doumbouya","Baba Mamadi Diané","Solo Farabado Cissé","Djibrila Diané","Abdoulaye Sow","Séré Moussa Doumbouya","Daouda Bangoura","Fodé Moriba Bayo","Ibrahima Sory 2. 
Condé","Kalo Mory Diané","Chris Piech","Christopher Manning"],"pdf_url":"https://arxiv.org/pdf/2310.15612v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.15316v3","updated":"2023-10-24T08:26:50Z","published":"2022-06-30T14:42:18Z","title":"Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory\n Models","summary":" We propose a novel anomaly detection method for echocardiogram videos. The\nintroduced method takes advantage of the periodic nature of the heart cycle to\nlearn three variants of a variational latent trajectory model (TVAE). While the\nfirst two variants (TVAE-C and TVAE-R) model strict periodic movements of the\nheart, the third (TVAE-S) is more general and allows shifts in the spatial\nrepresentation throughout the video. All models are trained on the healthy\nsamples of a novel in-house dataset of infant echocardiogram videos consisting\nof multiple chamber views to learn a normative prior of the healthy population.\nDuring inference, maximum a posteriori (MAP) based anomaly detection is\nperformed to detect out-of-distribution samples in our dataset. The proposed\nmethod reliably identifies severe congenital heart defects, such as Ebstein's\nAnomaly or Shone-complex. Moreover, it achieves superior performance over\nMAP-based anomaly detection with standard variational autoencoders when\ndetecting pulmonary hypertension and right ventricular dilation. Finally, we\ndemonstrate that the proposed method enables interpretable explanations of its\noutput through heatmaps highlighting the regions corresponding to anomalous\nheart structures.\n","authors":["Alain Ryser","Laura Manduchi","Fabian Laumer","Holger Michel","Sven Wellmann","Julia E. 
Vogt"],"pdf_url":"https://arxiv.org/pdf/2206.15316v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15610v1","updated":"2023-10-24T08:25:49Z","published":"2023-10-24T08:25:49Z","title":"Using Slisemap to interpret physical data","summary":" Manifold visualisation techniques are commonly used to visualise\nhigh-dimensional datasets in physical sciences. In this paper we apply a\nrecently introduced manifold visualisation method, called Slise, on datasets\nfrom physics and chemistry. Slisemap combines manifold visualisation with\nexplainable artificial intelligence. Explainable artificial intelligence is\nused to investigate the decision processes of black box machine learning models\nand complex simulators. With Slisemap we find an embedding such that data items\nwith similar local explanations are grouped together. Hence, Slisemap gives us\nan overview of the different behaviours of a black box model. This makes\nSlisemap into a supervised manifold visualisation method, where the patterns in\nthe embedding reflect a target property. In this paper we show how Slisemap can\nbe used and evaluated on physical data and that Slisemap is helpful in finding\nmeaningful information on classification and regression models trained on these\ndatasets.\n","authors":["Lauri Seppäläinen","Anton Björklund","Vitus Besel","Kai Puolamäki"],"pdf_url":"https://arxiv.org/pdf/2310.15610v1.pdf","comment":"17 pages, 5 + 1 figures, 1 table. The datasets and source code used\n in the paper are available at https://www.edahelsinki.fi/papers/slisemap_phys"},{"id":"http://arxiv.org/abs/2310.15605v1","updated":"2023-10-24T08:17:48Z","published":"2023-10-24T08:17:48Z","title":"tagE: Enabling an Embodied Agent to Understand Human Instructions","summary":" Natural language serves as the primary mode of communication when an\nintelligent agent with a physical presence engages with human beings. 
While a\nplethora of research focuses on natural language understanding (NLU),\nencompassing endeavors such as sentiment analysis, intent prediction, question\nanswering, and summarization, the scope of NLU directed at situations\nnecessitating tangible actions by an embodied agent remains limited. The\ninherent ambiguity and incompleteness inherent in natural language present\nchallenges for intelligent agents striving to decipher human intention. To\ntackle this predicament head-on, we introduce a novel system known as task and\nargument grounding for Embodied agents (tagE). At its core, our system employs\nan inventive neural network model designed to extract a series of tasks from\ncomplex task instructions expressed in natural language. Our proposed model\nadopts an encoder-decoder framework enriched with nested decoding to\neffectively extract tasks and their corresponding arguments from these\nintricate instructions. These extracted tasks are then mapped (or grounded) to\nthe robot's established collection of skills, while the arguments find\ngrounding in objects present within the environment. To facilitate the training\nand evaluation of our system, we have curated a dataset featuring complex\ninstructions. The results of our experiments underscore the prowess of our\napproach, as it outperforms robust baseline models.\n","authors":["Chayan Sarkar","Avik Mitra","Pradip Pramanick","Tapas Nayak"],"pdf_url":"https://arxiv.org/pdf/2310.15605v1.pdf","comment":"Accepted in EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2308.02490v3","updated":"2023-10-24T07:59:31Z","published":"2023-08-04T17:59:47Z","title":"MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities","summary":" We propose MM-Vet, an evaluation benchmark that examines large multimodal\nmodels (LMMs) on complicated multimodal tasks. 
Recent LMMs have shown various\nintriguing abilities, such as solving math problems written on the blackboard,\nreasoning about events and celebrities in news images, and explaining visual\njokes. Rapid model advancements pose challenges to evaluation benchmark\ndevelopment. Problems include: (1) How to systematically structure and evaluate\nthe complicated multimodal tasks; (2) How to design evaluation metrics that\nwork well across question and answer types; and (3) How to give model insights\nbeyond a simple performance ranking. To this end, we present MM-Vet, designed\nbased on the insight that the intriguing ability to solve complicated tasks is\noften achieved by a generalist model being able to integrate different core\nvision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and\nexamines the 16 integrations of interest derived from the capability\ncombination. For evaluation metrics, we propose an LLM-based evaluator for\nopen-ended outputs. The evaluator enables the evaluation across different\nquestion types and answer styles, resulting in a unified scoring metric. We\nevaluate representative LMMs on MM-Vet, providing insights into the\ncapabilities of different LMM system paradigms and models. Code and data are\navailable at https://github.com/yuweihao/MM-Vet.\n","authors":["Weihao Yu","Zhengyuan Yang","Linjie Li","Jianfeng Wang","Kevin Lin","Zicheng Liu","Xinchao Wang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2308.02490v3.pdf","comment":"Add results of GPT-4V. 
Code, data and leaderboard:\n https://github.com/yuweihao/MM-Vet"},{"id":"http://arxiv.org/abs/2310.15586v1","updated":"2023-10-24T07:51:29Z","published":"2023-10-24T07:51:29Z","title":"Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance\n Using Self-Supervised Deep Learning","summary":" In maritime traffic surveillance, detecting illegal activities, such as\nillegal fishing or transshipment of illicit products is a crucial task of the\ncoastal administration. In the open sea, one has to rely on Automatic\nIdentification System (AIS) message transmitted by on-board transponders, which\nare captured by surveillance satellites. However, insincere vessels often\nintentionally shut down their AIS transponders to hide illegal activities. In\nthe open sea, it is very challenging to differentiate intentional AIS shutdowns\nfrom missing reception due to protocol limitations, bad weather conditions or\nrestricting satellite positions. This paper presents a novel approach for the\ndetection of abnormal AIS missing reception based on self-supervised deep\nlearning techniques and transformer models. Using historical data, the trained\nmodel predicts if a message should be received in the upcoming minute or not.\nAfterwards, the model reports on detected anomalies by comparing the prediction\nwith what actually happens. Our method can process AIS messages in real-time,\nin particular, more than 500 Millions AIS messages per month, corresponding to\nthe trajectories of more than 60 000 ships. The method is evaluated on 1-year\nof real-world data coming from four Norwegian surveillance satellites. 
Using\nrelated research results, we validated our method by rediscovering already\ndetected intentional AIS shutdowns.\n","authors":["Pierre Bernabé","Arnaud Gotlieb","Bruno Legeard","Dusica Marijan","Frank Olaf Sem-Jacobsen","Helge Spieker"],"pdf_url":"https://arxiv.org/pdf/2310.15586v1.pdf","comment":"IEEE Transactions on Intelligent Transportation Systems"},{"id":"http://arxiv.org/abs/2310.15585v1","updated":"2023-10-24T07:51:08Z","published":"2023-10-24T07:51:08Z","title":"Multimodal Representations for Teacher-Guided Compositional Visual\n Reasoning","summary":" Neural Module Networks (NMN) are a compelling method for visual question\nanswering, enabling the translation of a question into a program consisting of\na series of reasoning sub-tasks that are sequentially executed on the image to\nproduce an answer. NMNs provide enhanced explainability compared to integrated\nmodels, allowing for a better understanding of the underlying reasoning\nprocess. To improve the effectiveness of NMNs we propose to exploit features\nobtained by a large-scale cross-modal encoder. Also, the current training\napproach of NMNs relies on the propagation of module outputs to subsequent\nmodules, leading to the accumulation of prediction errors and the generation of\nfalse answers. To mitigate this, we introduce an NMN learning strategy\ninvolving scheduled teacher guidance. Initially, the model is fully guided by\nthe ground-truth intermediate outputs, but gradually transitions to an\nautonomous behavior as training progresses. 
This reduces error accumulation,\nthus improving training efficiency and final performance. We demonstrate that by\nincorporating cross-modal features and employing more effective training\ntechniques for NMN, we achieve a favorable balance between performance and\ntransparency in the reasoning process.\n","authors":["Wafa Aissa","Marin Ferecatu","Michel Crucianu"],"pdf_url":"https://arxiv.org/pdf/2310.15585v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.07303v3","updated":"2023-10-24T07:50:45Z","published":"2022-11-14T12:32:18Z","title":"Adaptive Federated Minimax Optimization with Lower complexities","summary":" Federated learning is a popular distributed and privacy-preserving machine\nlearning paradigm. Meanwhile, minimax optimization, as an effective\nhierarchical optimization, is widely applied in machine learning. Recently,\nsome federated optimization methods have been proposed to solve the distributed\nminimax problems. However, these federated minimax methods still suffer from\nhigh gradient and communication complexities. Meanwhile, few algorithm focuses\non using adaptive learning rate to accelerate algorithms. To fill this gap, in\nthe paper, we study a class of nonconvex minimax optimization, and propose an\nefficient adaptive federated minimax optimization algorithm (i.e., AdaFGDA) to\nsolve these distributed minimax problems. Specifically, our AdaFGDA builds on\nthe momentum-based variance reduced and local-SGD techniques, and it can\nflexibly incorporate various adaptive learning rates by using the unified\nadaptive matrix. Theoretically, we provide a solid convergence analysis\nframework for our AdaFGDA algorithm under non-i.i.d. setting. Moreover, we\nprove our algorithms obtain lower gradient (i.e., stochastic first-order\noracle, SFO) complexity of $\\tilde{O}(\\epsilon^{-3})$ with lower communication\ncomplexity of $\\tilde{O}(\\epsilon^{-2})$ in finding $\\epsilon$-stationary point\nof the nonconvex minimax problems. 
Experimentally, we conduct some experiments\non the deep AUC maximization and robust neural network training tasks to verify\nefficiency of our algorithms.\n","authors":["Feihu Huang"],"pdf_url":"https://arxiv.org/pdf/2211.07303v3.pdf","comment":"Submitted to AISTATS-2024"},{"id":"http://arxiv.org/abs/2310.15584v1","updated":"2023-10-24T07:49:56Z","published":"2023-10-24T07:49:56Z","title":"Accelerating Split Federated Learning over Wireless Communication\n Networks","summary":" The development of artificial intelligence (AI) provides opportunities for\nthe promotion of deep neural network (DNN)-based applications. However, the\nlarge amount of parameters and computational complexity of DNN makes it\ndifficult to deploy it on edge devices which are resource-constrained. An\nefficient method to address this challenge is model partition/splitting, in\nwhich DNN is divided into two parts which are deployed on device and server\nrespectively for co-training or co-inference. In this paper, we consider a\nsplit federated learning (SFL) framework that combines the parallel model\ntraining mechanism of federated learning (FL) and the model splitting structure\nof split learning (SL). We consider a practical scenario of heterogeneous\ndevices with individual split points of DNN. We formulate a joint problem of\nsplit point selection and bandwidth allocation to minimize the system latency.\nBy using alternating optimization, we decompose the problem into two\nsub-problems and solve them optimally. 
Experiment results demonstrate the\nsuperiority of our work in latency reduction and accuracy improvement.\n","authors":["Ce Xu","Jinxuan Li","Yuan Liu","Yushi Ling","Miaowen Wen"],"pdf_url":"https://arxiv.org/pdf/2310.15584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15580v1","updated":"2023-10-24T07:46:10Z","published":"2023-10-24T07:46:10Z","title":"Identifiable Latent Polynomial Causal Models Through the Lens of Change","summary":" Causal representation learning aims to unveil latent high-level causal\nrepresentations from observed low-level data. One of its primary tasks is to\nprovide reliable assurance of identifying these latent causal models, known as\nidentifiability. A recent breakthrough explores identifiability by leveraging\nthe change of causal influences among latent causal variables across multiple\nenvironments \\citep{liu2022identifying}. However, this progress rests on the\nassumption that the causal relationships among latent causal variables adhere\nstrictly to linear Gaussian models. In this paper, we extend the scope of\nlatent causal models to involve nonlinear causal relationships, represented by\npolynomial models, and general noise distributions conforming to the\nexponential family. Additionally, we investigate the necessity of imposing\nchanges on all causal parameters and present partial identifiability results\nwhen part of them remains unchanged. Further, we propose a novel empirical\nestimation method, grounded in our theoretical finding, that enables learning\nconsistent latent causal representations. 
Our experimental results, obtained\nfrom both synthetic and real-world data, validate our theoretical contributions\nconcerning identifiability and consistency.\n","authors":["Yuhang Liu","Zhen Zhang","Dong Gong","Mingming Gong","Biwei Huang","Anton van den Hengel","Kun Zhang","Javen Qinfeng Shi"],"pdf_url":"https://arxiv.org/pdf/2310.15580v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15578v1","updated":"2023-10-24T07:42:04Z","published":"2023-10-24T07:42:04Z","title":"VMAF Re-implementation on PyTorch: Some Experimental Results","summary":" Based on the standard VMAF implementation we propose an implementation of\nVMAF using PyTorch framework. For this implementation comparisons with the\nstandard (libvmaf) show the discrepancy $\\lesssim 10^{-2}$ in VMAF units. We\ninvestigate gradients computation when using VMAF as an objective function and\ndemonstrate that training using this function does not result in ill-behaving\ngradients.\n","authors":["Kirill Aistov","Maxim Koroteev"],"pdf_url":"https://arxiv.org/pdf/2310.15578v1.pdf","comment":"4 pages"},{"id":"http://arxiv.org/abs/2302.04032v2","updated":"2023-10-24T07:35:41Z","published":"2023-02-08T13:08:51Z","title":"A Systematic Performance Analysis of Deep Perceptual Loss Networks:\n Breaking Transfer Learning Conventions","summary":" Deep perceptual loss is a type of loss function in computer vision that aims\nto mimic human perception by using the deep features extracted from neural\nnetworks. In recent years, the method has been applied to great effect on a\nhost of interesting computer vision tasks, especially for tasks with image or\nimage-like outputs, such as image synthesis, segmentation, depth prediction,\nand more. Many applications of the method use pretrained networks, often\nconvolutional networks, for loss calculation. 
Despite the increased interest\nand broader use, more effort is needed toward exploring which networks to use\nfor calculating deep perceptual loss and from which layers to extract the\nfeatures.\n This work aims to rectify this by systematically evaluating a host of\ncommonly used and readily available, pretrained networks for a number of\ndifferent feature extraction points on four existing use cases of deep\nperceptual loss. The use cases of perceptual similarity, super-resolution,\nimage segmentation, and dimensionality reduction, are evaluated through\nbenchmarks. The benchmarks are implementations of previous works where the\nselected networks and extraction points are evaluated. The performance on the\nbenchmarks, and attributes of the networks and extraction points are then used\nas a basis for an in-depth analysis. This analysis uncovers insight regarding\nwhich architectures provide superior performance for deep perceptual loss and\nhow to choose an appropriate extraction point for a particular task and\ndataset. Furthermore, the work discusses the implications of the results for\ndeep perceptual loss and the broader field of transfer learning. 
The results\nshow that deep perceptual loss deviates from two commonly held conventions in\ntransfer learning, which suggests that those conventions are in need of deeper\nanalysis.\n","authors":["Gustav Grund Pihlgren","Konstantina Nikolaidou","Prakash Chandra Chhipa","Nosheen Abid","Rajkumar Saini","Fredrik Sandin","Marcus Liwicki"],"pdf_url":"https://arxiv.org/pdf/2302.04032v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14948v2","updated":"2023-10-24T07:30:31Z","published":"2023-10-20T09:46:12Z","title":"Physics-Informed Graph Convolutional Networks: Towards a generalized\n framework for complex geometries","summary":" Since the seminal work of [9] and their Physics-Informed neural networks\n(PINNs), many efforts have been conducted towards solving partial differential\nequations (PDEs) with Deep Learning models. However, some challenges remain,\nfor instance the extension of such models to complex three-dimensional\ngeometries, and a study on how such approaches could be combined to classical\nnumerical solvers. In this work, we justify the use of graph neural networks\nfor these problems, based on the similarity between these architectures and the\nmeshes used in traditional numerical techniques for solving partial\ndifferential equations. After proving an issue with the Physics-Informed\nframework for complex geometries, during the computation of PDE residuals, an\nalternative procedure is proposed, by combining classical numerical solvers and\nthe Physics-Informed framework. 
Finally, we propose an implementation of this\napproach, that we test on a three-dimensional problem on an irregular geometry.\n","authors":["Marien Chenaud","José Alves","Frédéric Magoulès"],"pdf_url":"https://arxiv.org/pdf/2310.14948v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.10888v2","updated":"2023-10-24T07:26:52Z","published":"2022-09-22T09:45:10Z","title":"Amortized Variational Inference: A Systematic Review","summary":" The core principle of Variational Inference (VI) is to convert the\nstatistical inference problem of computing complex posterior probability\ndensities into a tractable optimization problem. This property enables VI to be\nfaster than several sampling-based techniques. However, the traditional VI\nalgorithm is not scalable to large data sets and is unable to readily infer\nout-of-bounds data points without re-running the optimization process. Recent\ndevelopments in the field, like stochastic-, black box-, and amortized-VI, have\nhelped address these issues. Generative modeling tasks nowadays widely make use\nof amortized VI for its efficiency and scalability, as it utilizes a\nparameterized function to learn the approximate posterior density parameters.\nIn this paper, we review the mathematical foundations of various VI techniques\nto form the basis for understanding amortized VI. Additionally, we provide an\noverview of the recent trends that address several issues of amortized VI, such\nas the amortization gap, generalization issues, inconsistent representation\nlearning, and posterior collapse. 
Finally, we analyze alternate divergence\nmeasures that improve VI optimization.\n","authors":["Ankush Ganguly","Sanjana Jain","Ukrit Watchareeruetai"],"pdf_url":"https://arxiv.org/pdf/2209.10888v2.pdf","comment":"Accepted for publication at the Journal of Artificial Intelligence\n Research (JAIR)"},{"id":"http://arxiv.org/abs/2310.15559v1","updated":"2023-10-24T07:02:47Z","published":"2023-10-24T07:02:47Z","title":"From Oja's Algorithm to the Multiplicative Weights Update Method with\n Applications","summary":" Oja's algorithm is a well known online algorithm studied mainly in the\ncontext of stochastic principal component analysis. We make a simple\nobservation, yet to the best of our knowledge a novel one, that when applied to\nany (not necessarily stochastic) sequence of symmetric matrices which share\ncommon eigenvectors, the regret of Oja's algorithm could be directly bounded in\nterms of the regret of the well known multiplicative weights update method for\nthe problem of prediction with expert advice. Several applications to\noptimization with quadratic forms over the unit sphere in $\\reals^n$ are\ndiscussed.\n","authors":["Dan Garber"],"pdf_url":"https://arxiv.org/pdf/2310.15559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15555v1","updated":"2023-10-24T06:54:50Z","published":"2023-10-24T06:54:50Z","title":"Transfer learning for day-ahead load forecasting: a case study on\n European national electricity demand time series","summary":" Short-term load forecasting (STLF) is crucial for the daily operation of\npower grids. However, the non-linearity, non-stationarity, and randomness\ncharacterizing electricity demand time series render STLF a challenging task.\nVarious forecasting approaches have been proposed for improving STLF, including\nneural network (NN) models which are trained using data from multiple\nelectricity demand series that may not necessarily include the target series. 
In\nthe present study, we investigate the performance of this special case of STLF,\ncalled transfer learning (TL), by considering a set of 27 time series that\nrepresent the national day-ahead electricity demand of indicative European\ncountries. We employ a popular and easy-to-implement NN model and perform a\nclustering analysis to identify similar patterns among the series and assist\nTL. In this context, two different TL approaches, with and without the\nclustering step, are compiled and compared against each other as well as a\ntypical NN training setup. Our results demonstrate that TL can outperform the\nconventional approach, especially when clustering techniques are considered.\n","authors":["Alexandros-Menelaos Tzortzis","Sotiris Pelekis","Evangelos Spiliotis","Spiros Mouzakitis","John Psarras","Dimitris Askounis"],"pdf_url":"https://arxiv.org/pdf/2310.15555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15317v2","updated":"2023-10-24T06:53:55Z","published":"2023-07-28T05:32:56Z","title":"DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable\n Kendall's Rank Correlation","summary":" Few-shot learning aims to adapt models trained on the base dataset to novel\ntasks where the categories were not seen by the model before. This often leads\nto a relatively uniform distribution of feature values across channels on novel\nclasses, posing challenges in determining channel importance for novel tasks.\nStandard few-shot learning methods employ geometric similarity metrics such as\ncosine similarity and negative Euclidean distance to gauge the semantic\nrelatedness between two features. However, features with high geometric\nsimilarities may carry distinct semantics, especially in the context of\nfew-shot learning. In this paper, we demonstrate that the importance ranking of\nfeature channels is a more reliable indicator for few-shot learning than\ngeometric similarity metrics. 
We observe that replacing the geometric\nsimilarity metric with Kendall's rank correlation only during inference is able\nto improve the performance of few-shot learning across a wide range of methods\nand datasets with different domains. Furthermore, we propose a carefully\ndesigned differentiable loss for meta-training to address the\nnon-differentiability issue of Kendall's rank correlation. By replacing\ngeometric similarity with differentiable Kendall's rank correlation, our method\ncan integrate with numerous existing few-shot approaches and is ready for\nintegrating with future state-of-the-art methods that rely on geometric\nsimilarity metrics. Extensive experiments validate the efficacy of the\nrank-correlation-based approach, showcasing a significant improvement in\nfew-shot learning.\n","authors":["Kaipeng Zheng","Huishuai Zhang","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2307.15317v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.05374v3","updated":"2023-10-24T06:48:55Z","published":"2023-10-09T03:10:49Z","title":"Improving End-to-End Speech Processing by Efficient Text Data\n Utilization with Latent Synthesis","summary":" Training a high performance end-to-end speech (E2E) processing model requires\nan enormous amount of labeled speech data, especially in the era of\ndata-centric artificial intelligence. However, labeled speech data are usually\nscarcer and more expensive for collection, compared to textual data. We propose\nLatent Synthesis (LaSyn), an efficient textual data utilization framework for\nE2E speech processing models. We train a latent synthesizer to convert textual\ndata into an intermediate latent representation of a pre-trained speech model.\nThese pseudo acoustic representations of textual data augment acoustic data for\nmodel training. We evaluate LaSyn on low-resource automatic speech recognition\n(ASR) and spoken language understanding (SLU) tasks. 
For ASR, LaSyn improves an\nE2E baseline trained on LibriSpeech train-clean-100, with relative word error\nrate reductions over 22.3% on different test sets. For SLU, LaSyn improves our\nE2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for\nslot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM)\nand EM-Tree accuracies on STOP respectively. With fewer parameters, the results\nof LaSyn are competitive to published state-of-the-art works. The results\ndemonstrate the quality of the augmented training data.\n","authors":["Jianqiao Lu","Wenyong Huang","Nianzu Zheng","Xingshan Zeng","Yu Ting Yeung","Xiao Chen"],"pdf_url":"https://arxiv.org/pdf/2310.05374v3.pdf","comment":"15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.15550v1","updated":"2023-10-24T06:43:56Z","published":"2023-10-24T06:43:56Z","title":"PET Synthesis via Self-supervised Adaptive Residual Estimation\n Generative Adversarial Network","summary":" Positron emission tomography (PET) is a widely used, highly sensitive\nmolecular imaging in clinical diagnosis. There is interest in reducing the\nradiation exposure from PET but also maintaining adequate image quality. Recent\nmethods using convolutional neural networks (CNNs) to generate synthesized\nhigh-quality PET images from low-dose counterparts have been reported to be\nstate-of-the-art for low-to-high image recovery methods. However, these methods\nare prone to exhibiting discrepancies in texture and structure between\nsynthesized and real images. Furthermore, the distribution shift between\nlow-dose PET and standard PET has not been fully investigated. To address these\nissues, we developed a self-supervised adaptive residual estimation generative\nadversarial network (SS-AEGAN). 
We introduce (1) An adaptive residual\nestimation mapping mechanism, AE-Net, designed to dynamically rectify the\npreliminary synthesized PET images by taking the residual map between the\nlow-dose PET and synthesized output as the input, and (2) A self-supervised\npre-training strategy to enhance the feature representation of the coarse\ngenerator. Our experiments with a public benchmark dataset of total-body PET\nimages show that SS-AEGAN consistently outperformed the state-of-the-art\nsynthesis methods with various dose reduction factors.\n","authors":["Yuxin Xue","Lei Bi","Yige Peng","Michael Fulham","David Dagan Feng","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2310.15550v1.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2310.15549v1","updated":"2023-10-24T06:40:26Z","published":"2023-10-24T06:40:26Z","title":"Algorithmic Regularization in Tensor Optimization: Towards a Lifted\n Approach in Matrix Sensing","summary":" Gradient descent (GD) is crucial for generalization in machine learning\nmodels, as it induces implicit regularization, promoting compact\nrepresentations. In this work, we examine the role of GD in inducing implicit\nregularization for tensor optimization, particularly within the context of the\nlifted matrix sensing framework. This framework has been recently proposed to\naddress the non-convex matrix sensing problem by transforming spurious\nsolutions into strict saddles when optimizing over symmetric, rank-1 tensors.\nWe show that, with sufficiently small initialization scale, GD applied to this\nlifted problem results in approximate rank-1 tensors and critical points with\nescape directions. 
Our findings underscore the significance of the tensor\nparametrization of matrix sensing, in combination with first-order methods, in\nachieving global optimality in such problems.\n","authors":["Ziye Ma","Javad Lavaei","Somayeh Sojoudi"],"pdf_url":"https://arxiv.org/pdf/2310.15549v1.pdf","comment":"NeurIPS23 Poster"},{"id":"http://arxiv.org/abs/2310.15543v1","updated":"2023-10-24T06:22:20Z","published":"2023-10-24T06:22:20Z","title":"Symmetry-preserving graph attention network to solve routing problems at\n multiple resolutions","summary":" Travelling Salesperson Problems (TSPs) and Vehicle Routing Problems (VRPs)\nhave achieved reasonable improvement in accuracy and computation time with the\nadaptation of Machine Learning (ML) methods. However, none of the previous\nworks completely respects the symmetries arising from TSPs and VRPs including\nrotation, translation, permutation, and scaling. In this work, we introduce the\nfirst-ever completely equivariant model and training to solve combinatorial\nproblems. Furthermore, it is essential to capture the multiscale structure\n(i.e. from local to global information) of the input graph, especially for the\ncases of large and long-range graphs, while previous methods are limited to\nextracting only local information that can lead to a local or sub-optimal\nsolution. To tackle the above limitation, we propose a Multiresolution scheme\nin combination with Equivariant Graph Attention network (mEGAT) architecture,\nwhich can learn the optimal route based on low-level and high-level graph\nresolutions in an efficient way. In particular, our approach constructs a\nhierarchy of coarse-graining graphs from the input graph, in which we try to\nsolve the routing problems on simple low-level graphs first, then utilize that\nknowledge for the more complex high-level graphs. 
Experimentally, we have shown\nthat our model outperforms existing baselines and proved that symmetry\npreservation and multiresolution are important recipes for solving\ncombinatorial problems in a data-driven manner. Our source code is publicly\navailable at https://github.com/HySonLab/Multires-NP-hard\n","authors":["Cong Dao Tran","Thong Bach","Truong Son Hy"],"pdf_url":"https://arxiv.org/pdf/2310.15543v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14846v2","updated":"2023-10-24T06:16:43Z","published":"2023-06-26T16:57:03Z","title":"ViNT: A Foundation Model for Visual Navigation","summary":" General-purpose pre-trained models (\"foundation models\") have enabled\npractitioners to produce generalizable solutions for individual machine\nlearning problems with datasets that are significantly smaller than those\nrequired for learning from scratch. Such models are typically trained on large\nand diverse datasets with weak supervision, consuming much more training data\nthan is available for any individual downstream application. In this paper, we\ndescribe the Visual Navigation Transformer (ViNT), a foundation model that aims\nto bring the success of general-purpose pre-trained models to vision-based\nrobotic navigation. ViNT is trained with a general goal-reaching objective that\ncan be used with any navigation dataset, and employs a flexible\nTransformer-based architecture to learn navigational affordances and enable\nefficient adaptation to a variety of downstream navigational tasks. ViNT is\ntrained on a number of existing navigation datasets, comprising hundreds of\nhours of robotic navigation from a variety of different robotic platforms, and\nexhibits positive transfer, outperforming specialist models trained on singular\ndatasets. ViNT can be augmented with diffusion-based subgoal proposals to\nexplore novel environments, and can solve kilometer-scale navigation problems\nwhen equipped with long-range heuristics. 
ViNT can also be adapted to novel\ntask specifications with a technique inspired by prompt-tuning, where the goal\nencoder is replaced by an encoding of another task modality (e.g., GPS\nwaypoints or routing commands) embedded into the same space of goal tokens.\nThis flexibility and ability to accommodate a variety of downstream problem\ndomains establishes ViNT as an effective foundation model for mobile robotics.\nFor videos, code, and model checkpoints, see our project page at\nhttps://visualnav-transformer.github.io.\n","authors":["Dhruv Shah","Ajay Sridhar","Nitish Dashora","Kyle Stachowicz","Kevin Black","Noriaki Hirose","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2306.14846v2.pdf","comment":"Accepted for oral presentation at CoRL 2023"},{"id":"http://arxiv.org/abs/2306.03552v2","updated":"2023-10-24T06:14:58Z","published":"2023-06-06T10:06:09Z","title":"State Regularized Policy Optimization on Data with Dynamics Shift","summary":" In many real-world scenarios, Reinforcement Learning (RL) algorithms are\ntrained on data with dynamics shift, i.e., with different underlying\nenvironment dynamics. A majority of current methods address such issue by\ntraining context encoders to identify environment parameters. Data with\ndynamics shift are separated according to their environment parameters to train\nthe corresponding policy. However, these methods can be sample inefficient as\ndata are used \\textit{ad hoc}, and policies trained for one dynamics cannot\nbenefit from data collected in all other environments with different dynamics.\nIn this paper, we find that in many environments with similar structures and\ndifferent dynamics, optimal policies have similar stationary state\ndistributions. We exploit such property and learn the stationary state\ndistribution from data with dynamics shift for efficient data reuse. 
Such\ndistribution is used to regularize the policy trained in a new environment,\nleading to the SRPO (\\textbf{S}tate \\textbf{R}egularized \\textbf{P}olicy\n\\textbf{O}ptimization) algorithm. To conduct theoretical analyses, the\nintuition of similar environment structures is characterized by the notion of\nhomomorphous MDPs. We then demonstrate a lower-bound performance guarantee on\npolicies regularized by the stationary state distribution. In practice, SRPO\ncan be an add-on module to context-based algorithms in both online and offline\nRL settings. Experimental results show that SRPO can make several context-based\nalgorithms far more data efficient and significantly improve their overall\nperformance.\n","authors":["Zhenghai Xue","Qingpeng Cai","Shuchang Liu","Dong Zheng","Peng Jiang","Kun Gai","Bo An"],"pdf_url":"https://arxiv.org/pdf/2306.03552v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.12671v2","updated":"2023-10-24T06:09:44Z","published":"2023-09-22T07:27:32Z","title":"How to Fine-tune the Model: Unified Model Shift and Model Bias Policy\n Optimization","summary":" Designing and deriving effective model-based reinforcement learning (MBRL)\nalgorithms with a performance improvement guarantee is challenging, mainly\nattributed to the high coupling between model learning and policy optimization.\nMany prior methods that rely on return discrepancy to guide model learning\nignore the impacts of model shift, which can lead to performance deterioration\ndue to excessive model updates. Other methods use performance difference bound\nto explicitly consider model shift. However, these methods rely on a fixed\nthreshold to constrain model shift, resulting in a heavy dependence on the\nthreshold and a lack of adaptability during the training process. In this\npaper, we theoretically derive an optimization objective that can unify model\nshift and model bias and then formulate a fine-tuning process. 
This process\nadaptively adjusts the model updates to get a performance improvement guarantee\nwhile avoiding model overfitting. Based on these, we develop a straightforward\nalgorithm USB-PO (Unified model Shift and model Bias Policy Optimization).\nEmpirical results show that USB-PO achieves state-of-the-art performance on\nseveral challenging benchmark tasks.\n","authors":["Hai Zhang","Hang Yu","Junqiao Zhao","Di Zhang","Chang Huang","Hongtu Zhou","Xiao Zhang","Chen Ye"],"pdf_url":"https://arxiv.org/pdf/2309.12671v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15526v1","updated":"2023-10-24T05:16:52Z","published":"2023-10-24T05:16:52Z","title":"Privacy Amplification for Matrix Mechanisms","summary":" Privacy amplification exploits randomness in data selection to provide\ntighter differential privacy (DP) guarantees. This analysis is key to DP-SGD's\nsuccess in machine learning, but, is not readily applicable to the newer\nstate-of-the-art algorithms. This is because these algorithms, known as\nDP-FTRL, use the matrix mechanism to add correlated noise instead of\nindependent noise as in DP-SGD.\n In this paper, we propose \"MMCC\", the first algorithm to analyze privacy\namplification via sampling for any generic matrix mechanism. MMCC is nearly\ntight in that it approaches a lower bound as $\\epsilon\\to0$. To analyze\ncorrelated outputs in MMCC, we prove that they can be analyzed as if they were\nindependent, by conditioning them on prior outputs. Our \"conditional\ncomposition theorem\" has broad utility: we use it to show that the noise added\nto binary-tree-DP-FTRL can asymptotically match the noise added to DP-SGD with\namplification. Our amplification algorithm also has practical empirical\nutility: we show it leads to significant improvement in the privacy-utility\ntrade-offs for DP-FTRL algorithms on standard benchmarks.\n","authors":["Christopher A. 
Choquette-Choo","Arun Ganesh","Thomas Steinke","Abhradeep Thakurta"],"pdf_url":"https://arxiv.org/pdf/2310.15526v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15524v1","updated":"2023-10-24T05:07:31Z","published":"2023-10-24T05:07:31Z","title":"On the Inherent Privacy Properties of Discrete Denoising Diffusion\n Models","summary":" Privacy concerns have led to a surge in the creation of synthetic datasets,\nwith diffusion models emerging as a promising avenue. Although prior studies\nhave performed empirical evaluations on these models, there has been a gap in\nproviding a mathematical characterization of their privacy-preserving\ncapabilities. To address this, we present the pioneering theoretical\nexploration of the privacy preservation inherent in discrete diffusion models\n(DDMs) for discrete dataset generation. Focusing on per-instance differential\nprivacy (pDP), our framework elucidates the potential privacy leakage for each\ndata point in a given training dataset, offering insights into data\npreprocessing to reduce privacy risks of the synthetic dataset generation via\nDDMs. Our bounds also show that training with $s$-sized data points leads to a\nsurge in privacy leakage from $(\\epsilon,\n\\mathcal{O}(\\frac{1}{s^2\\epsilon}))$-pDP to $(\\epsilon,\n\\mathcal{O}(\\frac{1}{s\\epsilon}))$-pDP during the transition from the pure\nnoise to the synthetic clean data phase, and a faster decay in diffusion\ncoefficients amplifies the privacy guarantee. Finally, we empirically verify\nour theoretical findings on both synthetic and real-world datasets.\n","authors":["Rongzhe Wei","Eleonora Kreačić","Haoyu Wang","Haoteng Yin","Eli Chien","Vamsi K. 
Potluru","Pan Li"],"pdf_url":"https://arxiv.org/pdf/2310.15524v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15523v1","updated":"2023-10-24T05:06:06Z","published":"2023-10-24T05:06:06Z","title":"Generative and Contrastive Paradigms Are Complementary for Graph\n Self-Supervised Learning","summary":" For graph self-supervised learning (GSSL), masked autoencoder (MAE) follows\nthe generative paradigm and learns to reconstruct masked graph edges or node\nfeatures. Contrastive Learning (CL) maximizes the similarity between augmented\nviews of the same graph and is widely used for GSSL. However, MAE and CL are\nconsidered separately in existing works for GSSL. We observe that the MAE and\nCL paradigms are complementary and propose the graph contrastive masked\nautoencoder (GCMAE) framework to unify them. Specifically, by focusing on local\nedges or node features, MAE cannot capture global information of the graph and\nis sensitive to particular edges and features. On the contrary, CL excels in\nextracting global information because it considers the relation between graphs.\nAs such, we equip GCMAE with an MAE branch and a CL branch, and the two\nbranches share a common encoder, which allows the MAE branch to exploit the\nglobal information extracted by the CL branch. To force GCMAE to capture global\ngraph structures, we train it to reconstruct the entire adjacency matrix\ninstead of only the masked edges as in existing works. Moreover, a\ndiscrimination loss is proposed for feature reconstruction, which improves the\ndisparity between node embeddings rather than reducing the reconstruction error\nto tackle the feature smoothing problem of MAE. We evaluate GCMAE on four\npopular graph tasks (i.e., node classification, node clustering, link\nprediction, and graph classification) and compare with 14 state-of-the-art\nbaselines. 
The results show that GCMAE consistently provides good accuracy\nacross these tasks, and the maximum accuracy improvement is up to 3.2% compared\nwith the best-performing baseline.\n","authors":["Yuxiang Wang","Xiao Yan","Chuang Hu","Fangcheng Fu","Wentao Zhang","Hao Wang","Shuo Shang","Jiawei Jiang"],"pdf_url":"https://arxiv.org/pdf/2310.15523v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15516v1","updated":"2023-10-24T04:50:32Z","published":"2023-10-24T04:50:32Z","title":"Graph Attention-based Deep Reinforcement Learning for solving the\n Chinese Postman Problem with Load-dependent costs","summary":" Recently, Deep reinforcement learning (DRL) models have shown promising\nresults in solving routing problems. However, most DRL solvers are commonly\nproposed to solve node routing problems, such as the Traveling Salesman Problem\n(TSP). Meanwhile, there has been limited research on applying neural methods to\narc routing problems, such as the Chinese Postman Problem (CPP), since they\noften feature irregular and complex solution spaces compared to TSP. To fill\nthese gaps, this paper proposes a novel DRL framework to address the CPP with\nload-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc\nrouting problem with load constraints. The novelty of our method is two-fold.\nFirst, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential\nmodel. Subsequently, we introduce an autoregressive model based on DRL, namely\nArc-DRL, consisting of an encoder and decoder to address the CPP-LC challenge\neffectively. Such a framework allows the DRL model to work efficiently and\nscalably to arc routing problems. 
Furthermore, we propose a new bio-inspired\nmeta-heuristic solution based on Evolutionary Algorithm (EA) for CPP-LC.\nExtensive experiments show that Arc-DRL outperforms existing meta-heuristic\nmethods such as Iterative Local Search (ILS) and Variable Neighborhood Search\n(VNS) proposed by (Corberan et al., 2018) on large benchmark datasets for\nCPP-LC regarding both solution quality and running time; while the EA gives the\nbest solution quality with much more running time. We release our C++\nimplementations for metaheuristics such as EA, ILS and VNS along with the code\nfor data generation and our generated data at\nhttps://github.com/HySonLab/Chinese_Postman_Problem\n","authors":["Cong Dao Tran","Truong Son Hy"],"pdf_url":"https://arxiv.org/pdf/2310.15516v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15511v1","updated":"2023-10-24T04:40:38Z","published":"2023-10-24T04:40:38Z","title":"KITAB: Evaluating LLMs on Constraint Satisfaction for Information\n Retrieval","summary":" We study the ability of state-of-the-art models to answer constraint\nsatisfaction queries for information retrieval (e.g., 'a list of ice cream\nshops in San Diego'). In the past, such queries were considered to be tasks\nthat could only be solved via web-search or knowledge bases. More recently,\nlarge language models (LLMs) have demonstrated initial emergent abilities in\nthis task. However, many current retrieval benchmarks are either saturated or\ndo not measure constraint satisfaction. Motivated by rising concerns around\nfactual incorrectness and hallucinations of LLMs, we present KITAB, a new\ndataset for measuring constraint satisfaction abilities of language models.\nKITAB consists of book-related data across more than 600 authors and 13,000\nqueries, and also offers an associated dynamic data collection and constraint\nverification approach for acquiring similar test data for other authors. 
Our\nextended experiments on GPT4 and GPT3.5 characterize and decouple common\nfailure modes across dimensions such as information popularity, constraint\ntypes, and context availability. Results show that in the absence of context,\nmodels exhibit severe limitations as measured by irrelevant information,\nfactual errors, and incompleteness, many of which exacerbate as information\npopularity decreases. While context availability mitigates irrelevant\ninformation, it is not helpful for satisfying constraints, identifying\nfundamental barriers to constraint satisfaction. We open source our\ncontributions to foster further research on improving constraint satisfaction\nabilities of future models.\n","authors":["Marah I Abdin","Suriya Gunasekar","Varun Chandrasekaran","Jerry Li","Mert Yuksekgonul","Rahee Ghosh Peshawaria","Ranjita Naik","Besmira Nushi"],"pdf_url":"https://arxiv.org/pdf/2310.15511v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2210.03116v4","updated":"2023-10-24T04:26:10Z","published":"2022-10-06T17:59:51Z","title":"Content-Based Search for Deep Generative Models","summary":" The growing proliferation of customized and pretrained generative models has\nmade it infeasible for a user to be fully cognizant of every model in\nexistence. To address this need, we introduce the task of content-based model\nsearch: given a query and a large set of generative models, finding the models\nthat best match the query. As each generative model produces a distribution of\nimages, we formulate the search task as an optimization problem to select the\nmodel with the highest probability of generating similar content as the query.\nWe introduce a formulation to approximate this probability given the query from\ndifferent modalities, e.g., image, sketch, and text. Furthermore, we propose a\ncontrastive learning framework for model retrieval, which learns to adapt\nfeatures for various query modalities. 
We demonstrate that our method\noutperforms several baselines on Generative Model Zoo, a new benchmark we\ncreate for the model retrieval task.\n","authors":["Daohan Lu","Sheng-Yu Wang","Nupur Kumari","Rohan Agarwal","Mia Tang","David Bau","Jun-Yan Zhu"],"pdf_url":"https://arxiv.org/pdf/2210.03116v4.pdf","comment":"Our project page is hosted at\n https://generative-intelligence-lab.github.io/modelverse/"},{"id":"http://arxiv.org/abs/2310.14168v2","updated":"2023-10-24T04:20:24Z","published":"2023-10-22T04:02:39Z","title":"Randomized Forward Mode of Automatic Differentiation for Optimization\n Algorithms","summary":" Backpropagation within neural networks leverages a fundamental element of\nautomatic differentiation, which is referred to as the reverse mode\ndifferentiation, or vector Jacobian Product (VJP) or, in the context of\ndifferential geometry, known as the pull-back process. The computation of\ngradient is important as update of neural network parameters is performed using\ngradient descent method. In this study, we present a generic randomized method,\nwhich updates the parameters of neural networks by using directional\nderivatives of loss functions computed efficiently by using forward mode AD or\nJacobian vector Product (JVP). These JVP are computed along the random\ndirections sampled from different probability distributions e.g., Bernoulli,\nNormal, Wigner, Laplace and Uniform distributions. The computation of gradient\nis performed during the forward pass of the neural network. 
We also present a\nrigorous analysis of the presented methods providing the rate of convergence\nalong with the computational experiments deployed in scientific Machine\nlearning in particular physics-informed neural networks and Deep Operator\nNetworks.\n","authors":["Khemraj Shukla","Yeonjong Shin"],"pdf_url":"https://arxiv.org/pdf/2310.14168v2.pdf","comment":"23 Pages, 8 Figures"},{"id":"http://arxiv.org/abs/2310.14085v2","updated":"2023-10-24T04:16:23Z","published":"2023-10-21T18:38:13Z","title":"Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and\n Exp-Concave Games with Gradient Feedback","summary":" Online gradient descent (OGD) is well known to be doubly optimal under strong\nconvexity or monotonicity assumptions: (1) in the single-agent setting, it\nachieves an optimal regret of $\\Theta(\\log T)$ for strongly convex cost\nfunctions; and (2) in the multi-agent setting of strongly monotone games, with\neach agent employing OGD, we obtain last-iterate convergence of the joint\naction to a unique Nash equilibrium at an optimal rate of\n$\\Theta(\\frac{1}{T})$. While these finite-time guarantees highlight its merits,\nOGD has the drawback that it requires knowing the strong convexity/monotonicity\nparameters. In this paper, we design a fully adaptive OGD algorithm,\n\\textsf{AdaOGD}, that does not require a priori knowledge of these parameters.\nIn the single-agent setting, our algorithm achieves $O(\\log^2(T))$ regret under\nstrong convexity, which is optimal up to a log factor. Further, if each agent\nemploys \\textsf{AdaOGD} in strongly monotone games, the joint action converges\nin a last-iterate sense to a unique Nash equilibrium at a rate of\n$O(\\frac{\\log^3 T}{T})$, again optimal up to log factors. We illustrate our\nalgorithms in a learning version of the classical newsvendor problem, where due\nto lost sales, only (noisy) gradient feedback can be observed. 
Our results\nimmediately yield the first feasible and near-optimal algorithm for both the\nsingle-retailer and multi-retailer settings. We also extend our results to the\nmore general setting of exp-concave cost functions and games, using the online\nNewton step (ONS) algorithm.\n","authors":["Michael I. Jordan","Tianyi Lin","Zhengyuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.14085v2.pdf","comment":"Accepted by Operations Research; 47 pages"},{"id":"http://arxiv.org/abs/2305.13999v3","updated":"2023-10-24T03:41:37Z","published":"2023-05-23T12:28:37Z","title":"Towards A Unified View of Sparse Feed-Forward Network in Pretraining\n Large Language Model","summary":" Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE)\nhave proven effective in scaling up Transformers model size for\n\\textit{pretraining} large language models. By only activating part of the FFN\nparameters conditioning on input, S-FFN improves generalization performance\nwhile keeping training and inference costs (in FLOPs) fixed. In this work, we\nanalyzed two major design choices of S-FFN: the memory block (a.k.a. expert)\nsize and the memory block selection method under a general conceptual framework\nof sparse neural memory. Using this unified framework, we compare several S-FFN\narchitectures for language modeling and provide insights into their relative\nefficacy and efficiency. 
We found a simpler selection method --\n\\textbf{\\texttt{Avg-K}} that selects blocks through their mean aggregated\nhidden states, achieving lower perplexity in language model pretraining\ncompared to existing MoE architectures including Switch Transformer (Fedus et\nal., 2021) and HashLayer (Roller et al., 2021).\n","authors":["Zeyu Leo Liu","Tim Dettmers","Xi Victoria Lin","Veselin Stoyanov","Xian Li"],"pdf_url":"https://arxiv.org/pdf/2305.13999v3.pdf","comment":"Accepted to EMNLP 2023"}],"Multimedia":[{"id":"http://arxiv.org/abs/2307.14491v2","updated":"2023-10-24T17:48:24Z","published":"2023-07-26T20:30:34Z","title":"A Unified Framework for Modality-Agnostic Deepfakes Detection","summary":" As AI-generated content (AIGC) thrives, deepfakes have expanded from\nsingle-modality falsification to cross-modal fake content creation, where\neither audio or visual components can be manipulated. While using two unimodal\ndetectors can detect audio-visual deepfakes, cross-modal forgery clues could be\noverlooked. Existing multimodal deepfake detection methods typically establish\ncorrespondence between the audio and visual modalities for binary real/fake\nclassification, and require the co-occurrence of both modalities. However, in\nreal-world multi-modal applications, missing modality scenarios may occur where\neither modality is unavailable. In such cases, audio-visual detection methods\nare less practical than two independent unimodal methods. Consequently, the\ndetector can not always obtain the number or type of manipulated modalities\nbeforehand, necessitating a fake-modality-agnostic audio-visual detector. In\nthis work, we introduce a comprehensive framework that is agnostic to fake\nmodalities, which facilitates the identification of multimodal deepfakes and\nhandles situations with missing modalities, regardless of the manipulations\nembedded in audio, video, or even cross-modal forms. 
To enhance the modeling of\ncross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as\na preliminary task. This efficiently extracts speech correlations across\nmodalities, a feature challenging for deepfakes to replicate. Additionally, we\npropose a dual-label detection approach that follows the structure of AVSR to\nsupport the independent detection of each modality. Extensive experiments on\nthree audio-visual datasets show that our scheme outperforms state-of-the-art\ndetection methods with promising performance on modality-agnostic audio/video\ndeepfakes.\n","authors":["Cai Yu","Peng Chen","Jiahe Tian","Jin Liu","Jiao Dai","Xi Wang","Yesheng Chai","Shan Jia","Siwei Lyu","Jizhong Han"],"pdf_url":"https://arxiv.org/pdf/2307.14491v2.pdf","comment":"This work has been submitted to the IEEE for possible publication.\n Copyright may be transferred without notice, after which this version may no\n longer be accessible"},{"id":"http://arxiv.org/abs/2310.15593v1","updated":"2023-10-24T07:58:14Z","published":"2023-10-24T07:58:14Z","title":"RecipeMeta: Metapath-enhanced Recipe Recommendation on Heterogeneous\n Recipe Network","summary":" Recipe is a set of instructions that describes how to make food. It can help\npeople from the preparation of ingredients, food cooking process, etc. to\nprepare the food, and increasingly in demand on the Web. To help users find the\nvast amount of recipes on the Web, we address the task of recipe\nrecommendation. Due to multiple data types and relationships in a recipe, we\ncan treat it as a heterogeneous network to describe its information more\naccurately. To effectively utilize the heterogeneous network, metapath was\nproposed to describe the higher-level semantic information between two entities\nby defining a compound path from peer entities. 
Therefore, we propose a\nmetapath-enhanced recipe recommendation framework, RecipeMeta, that combines\nGNN (Graph Neural Network)-based representation learning and specific\nmetapath-based information in a recipe to predict User-Recipe pairs for\nrecommendation. Through extensive experiments, we demonstrate that the proposed\nmodel, RecipeMeta, outperforms state-of-the-art methods for recipe\nrecommendation.\n","authors":["Jialiang Shi","Takahiro Komamizu","Keisuke Doman","Haruya Kyutoku","Ichiro Ide"],"pdf_url":"https://arxiv.org/pdf/2310.15593v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15317v2","updated":"2023-10-24T06:53:55Z","published":"2023-07-28T05:32:56Z","title":"DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable\n Kendall's Rank Correlation","summary":" Few-shot learning aims to adapt models trained on the base dataset to novel\ntasks where the categories were not seen by the model before. This often leads\nto a relatively uniform distribution of feature values across channels on novel\nclasses, posing challenges in determining channel importance for novel tasks.\nStandard few-shot learning methods employ geometric similarity metrics such as\ncosine similarity and negative Euclidean distance to gauge the semantic\nrelatedness between two features. However, features with high geometric\nsimilarities may carry distinct semantics, especially in the context of\nfew-shot learning. In this paper, we demonstrate that the importance ranking of\nfeature channels is a more reliable indicator for few-shot learning than\ngeometric similarity metrics. We observe that replacing the geometric\nsimilarity metric with Kendall's rank correlation only during inference is able\nto improve the performance of few-shot learning across a wide range of methods\nand datasets with different domains. 
Furthermore, we propose a carefully\ndesigned differentiable loss for meta-training to address the\nnon-differentiability issue of Kendall's rank correlation. By replacing\ngeometric similarity with differentiable Kendall's rank correlation, our method\ncan integrate with numerous existing few-shot approaches and is ready for\nintegrating with future state-of-the-art methods that rely on geometric\nsimilarity metrics. Extensive experiments validate the efficacy of the\nrank-correlation-based approach, showcasing a significant improvement in\nfew-shot learning.\n","authors":["Kaipeng Zheng","Huishuai Zhang","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2307.15317v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.12775v3","updated":"2023-10-24T01:51:20Z","published":"2023-09-22T10:34:11Z","title":"Semantic Change Driven Generative Semantic Communication Framework","summary":" The burgeoning generative artificial intelligence technology offers novel\ninsights into the development of semantic communication (SemCom) frameworks.\nThese frameworks hold the potential to address the challenges associated with\nthe black-box nature inherent in existing end-to-end training manner for the\nexisting SemCom framework, as well as deterioration of the user experience\ncaused by the inevitable error floor in deep learning-based SemCom. In this\npaper, we focus on the widespread remote monitoring scenario, and propose a\nsemantic change driven generative SemCom framework. Therein, the semantic\nencoder and semantic decoder can be optimized independently. Specifically, we\ndevelop a modular semantic encoder with value of information based semantic\nsampling function. In addition, we propose a conditional denoising diffusion\nprobabilistic mode-assisted semantic decoder that relies on received semantic\ninformation from the source, namely, the semantic map, and the local static\nscene information to remotely regenerate scenes. 
Moreover, we demonstrate the\neffectiveness of the proposed semantic encoder and decoder as well as the\nconsiderable potential in reducing energy consumption through simulation based\non the realistic $\\mathcal{F}$ composite channel fading model. The code is\navailable at https://github.com/wty2011jl/SCDGSC.git.\n","authors":["Wanting Yang","Zehui Xiong","Hongyang Du","Yanli Yuan","Tony Q. S. Quek"],"pdf_url":"https://arxiv.org/pdf/2309.12775v3.pdf","comment":null}]},"2023-10-25T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.16836v1","updated":"2023-10-25T17:59:32Z","published":"2023-10-25T17:59:32Z","title":"LLM-FP4: 4-Bit Floating-Point Quantized Transformers","summary":" We propose LLM-FP4 for quantizing both weights and activations in large\nlanguage models (LLMs) down to 4-bit floating-point values, in a post-training\nmanner. Existing post-training quantization (PTQ) solutions are primarily\ninteger-based and struggle with bit widths below 8 bits. Compared to integer\nquantization, floating-point (FP) quantization is more flexible and can better\nhandle long-tail or bell-shaped distributions, and it has emerged as a default\nchoice in many hardware platforms. One characteristic of FP quantization is\nthat its performance largely depends on the choice of exponent bits and\nclipping range. In this regard, we construct a strong FP-PTQ baseline by\nsearching for the optimal quantization parameters. Furthermore, we observe a\nhigh inter-channel variance and low intra-channel variance pattern in\nactivation distributions, which adds activation quantization difficulty. We\nrecognize this pattern to be consistent across a spectrum of transformer models\ndesigned for diverse tasks, such as LLMs, BERT, and Vision Transformer models.\nTo tackle this, we propose per-channel activation quantization and show that\nthese additional scaling factors can be reparameterized as exponential biases\nof weights, incurring a negligible cost. 
Our method, for the first time, can\nquantize both weights and activations in the LLaMA-13B to only 4-bit and\nachieves an average score of 63.1 on the common sense zero-shot reasoning\ntasks, which is only 5.8 lower than the full-precision model, significantly\noutperforming the previous state-of-the-art by 12.7 points. Code is available\nat: https://github.com/nbasyl/LLM-FP4.\n","authors":["Shih-yang Liu","Zechun Liu","Xijie Huang","Pingcheng Dong","Kwang-Ting Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.16836v1.pdf","comment":"EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.16834v1","updated":"2023-10-25T17:59:12Z","published":"2023-10-25T17:59:12Z","title":"Discrete Diffusion Language Modeling by Estimating the Ratios of the\n Data Distribution","summary":" Despite their groundbreaking performance for many generative modeling tasks,\ndiffusion models have fallen short on discrete data domains such as natural\nlanguage. Crucially, standard diffusion models rely on the well-established\ntheory of score matching, but efforts to generalize this to discrete structures\nhave not yielded the same empirical gains. In this work, we bridge this gap by\nproposing score entropy, a novel discrete score matching loss that is more\nstable than existing methods, forms an ELBO for maximum likelihood training,\nand can be efficiently optimized with a denoising variant. We scale our Score\nEntropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2,\nachieving highly competitive likelihoods while also introducing distinct\nalgorithmic advantages. In particular, when comparing similarly sized SEDD and\nGPT-2 models, SEDD attains comparable perplexities (normally within $+10\\%$ of\nand sometimes outperforming the baseline). 
Furthermore, SEDD models learn a\nmore faithful sequence distribution (around $4\\times$ better compared to GPT-2\nmodels with ancestral sampling as measured by large models), can trade off\ncompute for generation quality (needing only $16\\times$ fewer network\nevaluations to match GPT-2), and enables arbitrary infilling beyond the\nstandard left to right prompting.\n","authors":["Aaron Lou","Chenlin Meng","Stefano Ermon"],"pdf_url":"https://arxiv.org/pdf/2310.16834v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2310.16822v1","updated":"2023-10-25T17:51:56Z","published":"2023-10-25T17:51:56Z","title":"Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity\n and Relation Extraction","summary":" How can we better extract entities and relations from text? Using multimodal\nextraction with images and text obtains more signals for entities and\nrelations, and aligns them through graphs or hierarchical fusion, aiding in\nextraction. Despite attempts at various fusions, previous works have overlooked\nmany unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes\ninnovative pre-training objectives for entity-object and relation-image\nalignment, extracting objects from images and aligning them with entity and\nrelation prompts for soft pseudo-labels. These labels are used as\nself-supervised signals for pre-training, enhancing the ability to extract\nentities and relations. Experiments on three datasets show an average 3.41% F1\nimprovement over prior SOTA. Additionally, our method is orthogonal to previous\nmultimodal fusions, and using it on prior SOTA fusions further improves 5.47%\nF1.\n","authors":["Xuming Hu","Junzhe Chen","Aiwei Liu","Shiao Meng","Lijie Wen","Philip S. 
Yu"],"pdf_url":"https://arxiv.org/pdf/2310.16822v1.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2310.13613v2","updated":"2023-10-25T17:41:39Z","published":"2023-10-20T16:03:33Z","title":"Hunayn: Elevating Translation Beyond the Literal","summary":" This project introduces an advanced English-to-Arabic translator surpassing\nconventional tools. Leveraging the Helsinki transformer (MarianMT), our\napproach involves fine-tuning on a self-scraped, purely literary Arabic\ndataset. Evaluations against Google Translate show consistent outperformance in\nqualitative assessments. Notably, it excels in cultural sensitivity and context\naccuracy. This research underscores the Helsinki transformer's superiority for\nEnglish-to-Arabic translation using a Fusha dataset.\n","authors":["Nasser Almousa","Nasser Alzamil","Abdullah Alshehri","Ahmad Sait"],"pdf_url":"https://arxiv.org/pdf/2310.13613v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16810v1","updated":"2023-10-25T17:39:07Z","published":"2023-10-25T17:39:07Z","title":"Can GPT models Follow Human Summarization Guidelines? Evaluating ChatGPT\n and GPT-4 for Dialogue Summarization","summary":" This study explores the capabilities of prompt-driven Large Language Models\n(LLMs) like ChatGPT and GPT-4 in adhering to human guidelines for dialogue\nsummarization. Experiments employed DialogSum (English social conversations)\nand DECODA (French call center interactions), testing various prompts:\nincluding prompts from existing literature and those from human summarization\nguidelines, as well as a two-step prompt approach. Our findings indicate that\nGPT models often produce lengthy summaries and deviate from human summarization\nguidelines. However, using human guidelines as an intermediate step shows\npromise, outperforming direct word-length constraint prompts in some cases. The\nresults reveal that GPT models exhibit unique stylistic tendencies in their\nsummaries. 
While BERTScores did not dramatically decrease for GPT outputs\nsuggesting semantic similarity to human references and specialised pre-trained\nmodels, ROUGE scores reveal grammatical and lexical disparities between\nGPT-generated and human-written summaries. These findings shed light on the\ncapabilities and limitations of GPT models in following human instructions for\ndialogue summarization.\n","authors":["Yongxin Zhou","Fabien Ringeval","François Portet"],"pdf_url":"https://arxiv.org/pdf/2310.16810v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16803v1","updated":"2023-10-25T17:34:52Z","published":"2023-10-25T17:34:52Z","title":"Language Agnostic Code Embeddings","summary":" Recently, code language models have achieved notable advancements in\naddressing a diverse array of essential code comprehension and generation\ntasks. Yet, the field lacks a comprehensive deep dive and understanding of the\ncode embeddings of multilingual code models. In this paper, we present a\ncomprehensive study on multilingual code embeddings, focusing on the\ncross-lingual capabilities of these embeddings across different programming\nlanguages. Through probing experiments, we demonstrate that code embeddings\ncomprise two distinct components: one deeply tied to the nuances and syntax of\na specific language, and the other remaining agnostic to these details,\nprimarily focusing on semantics. 
Further, we show that when we isolate and\neliminate this language-specific component, we witness significant improvements\nin downstream code retrieval tasks, leading to an absolute increase of up to\n+17 in the Mean Reciprocal Rank (MRR).\n","authors":["Saiteja Utpala","Alex Gu","Pin Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16790v1","updated":"2023-10-25T17:23:37Z","published":"2023-10-25T17:23:37Z","title":"Improving a Named Entity Recognizer Trained on Noisy Data with a Few\n Clean Instances","summary":" To achieve state-of-the-art performance, one still needs to train NER models\non large-scale, high-quality annotated data, an asset that is both costly and\ntime-intensive to accumulate. In contrast, real-world applications often resort\nto massive low-quality labeled data through non-expert annotators via\ncrowdsourcing and external knowledge bases via distant supervision as a\ncost-effective alternative. However, these annotation methods result in noisy\nlabels, which in turn lead to a notable decline in performance. Hence, we\npropose to denoise the noisy NER data with guidance from a small set of clean\ninstances. Along with the main NER model we train a discriminator model and use\nits outputs to recalibrate the sample weights. 
The discriminator is capable of\ndetecting both span and category errors with different discriminative prompts.\nResults on public crowdsourcing and distant supervision datasets show that the\nproposed method can consistently improve performance with a small guidance set.\n","authors":["Zhendong Chu","Ruiyi Zhang","Tong Yu","Rajiv Jain","Vlad I Morariu","Jiuxiang Gu","Ani Nenkova"],"pdf_url":"https://arxiv.org/pdf/2310.16790v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2310.16789v1","updated":"2023-10-25T17:21:23Z","published":"2023-10-25T17:21:23Z","title":"Detecting Pretraining Data from Large Language Models","summary":" Although large language models (LLMs) are widely deployed, the data used to\ntrain them is rarely disclosed. Given the incredible scale of this data, up to\ntrillions of tokens, it is all but certain that it includes potentially\nproblematic text such as copyrighted materials, personally identifiable\ninformation, and test data for widely reported reference benchmarks. However,\nwe currently have no way to know which data of these types is included or in\nwhat proportions. In this paper, we study the pretraining data detection\nproblem: given a piece of text and black-box access to an LLM without knowing\nthe pretraining data, can we determine if the model was trained on the provided\ntext? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that\nuses data created before and after model training to support gold truth\ndetection. We also introduce a new detection method Min-K% Prob based on a\nsimple hypothesis: an unseen example is likely to contain a few outlier words\nwith low probabilities under the LLM, while a seen example is less likely to\nhave words with such low probabilities. 
Min-K% Prob can be applied without any\nknowledge about the pretraining corpus or any additional training, departing\nfrom previous detection methods that require training a reference model on data\nthat is similar to the pretraining data. Moreover, our experiments demonstrate\nthat Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous\nmethods. We apply Min-K% Prob to two real-world scenarios, copyrighted book\ndetection, and contaminated downstream example detection, and find it a\nconsistently effective solution.\n","authors":["Weijia Shi","Anirudh Ajith","Mengzhou Xia","Yangsibo Huang","Daogao Liu","Terra Blevins","Danqi Chen","Luke Zettlemoyer"],"pdf_url":"https://arxiv.org/pdf/2310.16789v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16787v1","updated":"2023-10-25T17:20:26Z","published":"2023-10-25T17:20:26Z","title":"The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing\n & Attribution in AI","summary":" The race to train language models on vast, diverse, and inconsistently\ndocumented datasets has raised pressing concerns about the legal and ethical\nrisks for practitioners. To remedy these practices threatening data\ntransparency and understanding, we convene a multi-disciplinary effort between\nlegal and machine learning experts to systematically audit and trace 1800+ text\ndatasets. We develop tools and standards to trace the lineage of these\ndatasets, from their source, creators, series of license conditions,\nproperties, and subsequent use. Our landscape analysis highlights the sharp\ndivides in composition and focus of commercially open vs closed datasets, with\nclosed datasets monopolizing important categories: lower resource languages,\nmore creative tasks, richer topic variety, newer and more synthetic training\ndata. 
This points to a deepening divide in the types of data that are made\navailable under different license conditions, and heightened implications for\njurisdictional legal interpretations of copyright and fair use. We also observe\nfrequent miscategorization of licenses on widely used dataset hosting sites,\nwith license omission of 72%+ and error rates of 50%+. This points to a crisis\nin misattribution and informed use of the most popular datasets driving many\nrecent breakthroughs. As a contribution to ongoing improvements in dataset\ntransparency and responsible use, we release our entire audit, with an\ninteractive UI, the Data Provenance Explorer, which allows practitioners to\ntrace and filter on data provenance for the most popular open source finetuning\ndata collections: www.dataprovenance.org.\n","authors":["Shayne Longpre","Robert Mahari","Anthony Chen","Naana Obeng-Marnu","Damien Sileo","William Brannon","Niklas Muennighoff","Nathan Khazam","Jad Kabbara","Kartik Perisetla"," Xinyi"," Wu","Enrico Shippole","Kurt Bollacker","Tongshuang Wu","Luis Villa","Sandy Pentland","Deb Roy","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2310.16787v1.pdf","comment":"30 pages (18 main), 6 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.16781v1","updated":"2023-10-25T17:15:55Z","published":"2023-10-25T17:15:55Z","title":"Kiki or Bouba? Sound Symbolism in Vision-and-Language Models","summary":" Although the mapping between sound and meaning in human language is assumed\nto be largely arbitrary, research in cognitive science has shown that there are\nnon-trivial correlations between particular sounds and meanings across\nlanguages and demographic groups, a phenomenon known as sound symbolism. Among\nthe many dimensions of meaning, sound symbolism is particularly salient and\nwell-demonstrated with regards to cross-modal associations between language and\nthe visual domain. 
In this work, we address the question of whether sound\nsymbolism is reflected in vision-and-language models such as CLIP and Stable\nDiffusion. Using zero-shot knowledge probing to investigate the inherent\nknowledge of these models, we find strong evidence that they do show this\npattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our\nwork provides a novel method for demonstrating sound symbolism and\nunderstanding its nature using computational tools. Our code will be made\npublicly available.\n","authors":["Morris Alper","Hadar Averbuch-Elor"],"pdf_url":"https://arxiv.org/pdf/2310.16781v1.pdf","comment":"Accepted to NeurIPS 2023 (spotlight). Project webpage:\n https://kiki-bouba.github.io/"},{"id":"http://arxiv.org/abs/2309.09357v3","updated":"2023-10-25T17:10:49Z","published":"2023-09-17T19:46:03Z","title":"Talk2Care: Facilitating Asynchronous Patient-Provider Communication with\n Large-Language-Model","summary":" Despite the plethora of telehealth applications to assist home-based older\nadults and healthcare providers, basic messaging and phone calls are still the\nmost common communication methods, which suffer from limited availability,\ninformation loss, and process inefficiencies. One promising solution to\nfacilitate patient-provider communication is to leverage large language models\n(LLMs) with their powerful natural conversation and summarization capability.\nHowever, there is a limited understanding of LLMs' role during the\ncommunication. We first conducted two interview studies with both older adults\n(N=10) and healthcare providers (N=9) to understand their needs and\nopportunities for LLMs in patient-provider asynchronous communication. 
Based on\nthe insights, we built an LLM-powered communication system, Talk2Care, and\ndesigned interactive components for both groups: (1) For older adults, we\nleveraged the convenience and accessibility of voice assistants (VAs) and built\nan LLM-powered VA interface for effective information collection. (2) For\nhealth providers, we built an LLM-based dashboard to summarize and present\nimportant health information based on older adults' conversations with the VA.\nWe further conducted two user studies with older adults and providers to\nevaluate the usability of the system. The results showed that Talk2Care could\nfacilitate the communication process, enrich the health information collected\nfrom older adults, and considerably save providers' efforts and time. We\nenvision our work as an initial exploration of LLMs' capability in the\nintersection of healthcare and interpersonal communication.\n","authors":["Ziqi Yang","Xuhai Xu","Bingsheng Yao","Shao Zhang","Ethan Rogers","Stephen Intille","Nawar Shara","Guodong Gordon Gao","Dakuo Wang"],"pdf_url":"https://arxiv.org/pdf/2309.09357v3.pdf","comment":"Under submission to CHI2024"},{"id":"http://arxiv.org/abs/2310.16776v1","updated":"2023-10-25T17:06:42Z","published":"2023-10-25T17:06:42Z","title":"DEFT: Data Efficient Fine-Tuning for Large Language Models via\n Unsupervised Core-Set Selection","summary":" Recent advances have led to the availability of many pre-trained language\nmodels (PLMs); however, a question that remains is how much data is truly\nneeded to fine-tune PLMs for downstream tasks? In this work, we introduce DEFT,\na data-efficient fine-tuning framework that leverages unsupervised core-set\nselection to minimize the amount of data needed to fine-tune PLMs for\ndownstream tasks. We demonstrate the efficacy of our DEFT framework in the\ncontext of text-editing LMs, and compare to the state-of-the art text-editing\nmodel, CoEDIT (Raheja et al., 2023). 
Our quantitative and qualitative results\ndemonstrate that DEFT models are just as accurate as CoEDIT while being\nfinetuned on ~70% less data.\n","authors":["Devleena Das","Vivek Khetan"],"pdf_url":"https://arxiv.org/pdf/2310.16776v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16763v1","updated":"2023-10-25T16:52:00Z","published":"2023-10-25T16:52:00Z","title":"SuperHF: Supervised Iterative Learning from Human Feedback","summary":" While large language models demonstrate remarkable capabilities, they often\npresent challenges in terms of safety, alignment with human values, and\nstability during training. Here, we focus on two prevalent methods used to\nalign these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning\nfrom Human Feedback (RLHF). SFT is simple and robust, powering a host of\nopen-source models, while RLHF is a more sophisticated method used in top-tier\nmodels like ChatGPT but also suffers from instability and susceptibility to\nreward hacking. We propose a novel approach, Supervised Iterative Learning from\nHuman Feedback (SuperHF), which seeks to leverage the strengths of both\nmethods. Our hypothesis is two-fold: that the reward model used in RLHF is\ncritical for efficient data use and model generalization and that the use of\nProximal Policy Optimization (PPO) in RLHF may not be necessary and could\ncontribute to instability issues. SuperHF replaces PPO with a simple supervised\nloss and a Kullback-Leibler (KL) divergence prior. It creates its own training\ndata by repeatedly sampling a batch of model outputs and filtering them through\nthe reward model in an online learning regime. We then break down the reward\noptimization problem into three components: robustly optimizing the training\nrewards themselves, preventing reward hacking-exploitation of the reward model\nthat degrades model performance-as measured by a novel METEOR similarity\nmetric, and maintaining good performance on downstream evaluations. 
Our\nexperimental results show SuperHF exceeds PPO-based RLHF on the training\nobjective, easily and favorably trades off high reward with low reward hacking,\nimproves downstream calibration, and performs the same on our GPT-4 based\nqualitative evaluation scheme all the while being significantly simpler to\nimplement, highlighting SuperHF's potential as a competitive language model\nalignment technique.\n","authors":["Gabriel Mukobi","Peter Chatain","Su Fong","Robert Windesheim","Gitta Kutyniok","Kush Bhatia","Silas Alberti"],"pdf_url":"https://arxiv.org/pdf/2310.16763v1.pdf","comment":"Accepted to the Socially Responsible Language Modelling Research\n (SoLaR) workshop at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16761v1","updated":"2023-10-25T16:50:24Z","published":"2023-10-25T16:50:24Z","title":"IntenDD: A Unified Contrastive Learning Approach for Intent Detection\n and Discovery","summary":" Identifying intents from dialogue utterances forms an integral component of\ntask-oriented dialogue systems. Intent-related tasks are typically formulated\neither as a classification task, where the utterances are classified into\npredefined categories or as a clustering task when new and previously unknown\nintent categories need to be discovered from these utterances. Further, the\nintent classification may be modeled in a multiclass (MC) or multilabel (ML)\nsetup. While typically these tasks are modeled as separate tasks, we propose\nIntenDD, a unified approach leveraging a shared utterance encoding backbone.\nIntenDD uses an entirely unsupervised contrastive learning strategy for\nrepresentation learning, where pseudo-labels for the unlabeled utterances are\ngenerated based on their lexical features. Additionally, we introduce a\ntwo-step post-processing setup for the classification tasks using modified\nadsorption. 
Here, first, the residuals in the training data are propagated\nfollowed by smoothing the labels both modeled in a transductive setting.\nThrough extensive evaluations on various benchmark datasets, we find that our\napproach consistently outperforms competitive baselines across all three tasks.\nOn average, IntenDD reports percentage improvements of 2.32%, 1.26%, and 1.52%\nin their respective metrics for few-shot MC, few-shot ML, and the intent\ndiscovery tasks respectively.\n","authors":["Bhavuk Singhal","Ashim Gupta","Shivasankaran V P","Amrith Krishna"],"pdf_url":"https://arxiv.org/pdf/2310.16761v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.16755v1","updated":"2023-10-25T16:41:15Z","published":"2023-10-25T16:41:15Z","title":"HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning\n in Large Language Models","summary":" Theory of Mind (ToM) is the ability to reason about one's own and others'\nmental states. ToM plays a critical role in the development of intelligence,\nlanguage understanding, and cognitive processes. While previous work has\nprimarily focused on first and second-order ToM, we explore higher-order ToM,\nwhich involves recursive reasoning on others' beliefs. We introduce HI-TOM, a\nHigher Order Theory of Mind benchmark. Our experimental evaluation using\nvarious Large Language Models (LLMs) indicates a decline in performance on\nhigher-order ToM tasks, demonstrating the limitations of current LLMs. 
We\nconduct a thorough analysis of different failure cases of LLMs, and share our\nthoughts on the implications of our findings on the future of NLP.\n","authors":["Yinghui He","Yufan Wu","Yilin Jia","Rada Mihalcea","Yulong Chen","Naihao Deng"],"pdf_url":"https://arxiv.org/pdf/2310.16755v1.pdf","comment":"Accepted at Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2306.01439v2","updated":"2023-10-25T16:40:27Z","published":"2023-06-02T10:59:44Z","title":"Interpretable and Explainable Logical Policies via Neurally Guided\n Symbolic Abstraction","summary":" The limited priors required by neural networks make them the dominating\nchoice to encode and learn policies using reinforcement learning (RL). However,\nthey are also black-boxes, making it hard to understand the agent's behaviour,\nespecially when working on the image level. Therefore, neuro-symbolic RL aims\nat creating policies that are interpretable in the first place. Unfortunately,\ninterpretability is not explainability. To achieve both, we introduce Neurally\ngUided Differentiable loGic policiEs (NUDGE). NUDGE exploits trained neural\nnetwork-based agents to guide the search of candidate-weighted logic rules,\nthen uses differentiable logic to train the logic agents. 
Our experimental\nevaluation demonstrates that NUDGE agents can induce interpretable and\nexplainable policies while outperforming purely neural ones and showing good\nflexibility to environments of different initial states and problem sizes.\n","authors":["Quentin Delfosse","Hikaru Shindo","Devendra Dhami","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2306.01439v2.pdf","comment":"9 main pages + appendix (19 in total)"},{"id":"http://arxiv.org/abs/2310.16753v1","updated":"2023-10-25T16:39:00Z","published":"2023-10-25T16:39:00Z","title":"PROMINET: Prototype-based Multi-View Network for Interpretable Email\n Response Prediction","summary":" Email is a widely used tool for business communication, and email marketing\nhas emerged as a cost-effective strategy for enterprises. While previous\nstudies have examined factors affecting email marketing performance, limited\nresearch has focused on understanding email response behavior by considering\nemail content and metadata. This study proposes a Prototype-based Multi-view\nNetwork (PROMINET) that incorporates semantic and structural information from\nemail data. By utilizing prototype learning, the PROMINET model generates\nlatent exemplars, enabling interpretable email response prediction. The model\nmaps learned semantic and structural exemplars to observed samples in the\ntraining data at different levels of granularity, such as document, sentence,\nor phrase. The approach is evaluated on two real-world email datasets: the\nEnron corpus and an in-house Email Marketing corpus. Experimental results\ndemonstrate that the PROMINET model outperforms baseline models, achieving a\n~3% improvement in F1 score on both datasets. Additionally, the model provides\ninterpretability through prototypes at different granularity levels while\nmaintaining comparable performance to non-interpretable models. 
The learned\nprototypes also show potential for generating suggestions to enhance email text\nediting and improve the likelihood of effective email responses. This research\ncontributes to enhancing sender-receiver communication and customer engagement\nin email interactions.\n","authors":["Yuqing Wang","Prashanth Vijayaraghavan","Ehsan Degan"],"pdf_url":"https://arxiv.org/pdf/2310.16753v1.pdf","comment":"Accepted at EMNLP 2023 (industry)"},{"id":"http://arxiv.org/abs/2310.16749v1","updated":"2023-10-25T16:32:02Z","published":"2023-10-25T16:32:02Z","title":"DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in\n Indo-European Languages","summary":" Disfluency correction (DC) is the process of removing disfluent elements like\nfillers, repetitions and corrections from spoken utterances to create readable\nand interpretable text. DC is a vital post-processing step applied to Automatic\nSpeech Recognition (ASR) outputs, before subsequent processing by downstream\nlanguage understanding tasks. Existing DC research has primarily focused on\nEnglish due to the unavailability of large-scale open-source datasets. Towards\nthe goal of multilingual disfluency correction, we present a high-quality\nhuman-annotated DC corpus covering four important Indo-European languages:\nEnglish, Hindi, German and French. We provide extensive analysis of results of\nstate-of-the-art DC models across all four languages obtaining F1 scores of\n97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To\ndemonstrate the benefits of DC on downstream tasks, we show that DC leads to\n5.65 points increase in BLEU scores on average when used in conjunction with a\nstate-of-the-art Machine Translation (MT) system. 
We release code to run our\nexperiments along with our annotated dataset here.\n","authors":["Vineet Bhat","Preethi Jyothi","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2310.16749v1.pdf","comment":"Accepted at EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.16746v1","updated":"2023-10-25T16:23:17Z","published":"2023-10-25T16:23:17Z","title":"HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis","summary":" Authorship Analysis, also known as stylometry, has been an essential aspect\nof Natural Language Processing (NLP) for a long time. Likewise, the recent\nadvancement of Large Language Models (LLMs) has made authorship analysis\nincreasingly crucial for distinguishing between human-written and AI-generated\ntexts. However, these authorship analysis tasks have primarily been focused on\nwritten texts, not considering spoken texts. Thus, we introduce the largest\nbenchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark).\nHANSEN encompasses meticulous curation of existing speech datasets accompanied\nby transcripts, alongside the creation of novel AI-generated spoken text\ndatasets. Together, it comprises 17 human datasets, and AI-generated spoken\ntexts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To\nevaluate and demonstrate the utility of HANSEN, we perform Authorship\nAttribution (AA) & Author Verification (AV) on human-spoken datasets and\nconduct Human vs. AI spoken text detection using state-of-the-art (SOTA)\nmodels. While SOTA methods, such as character n-gram or Transformer-based\nmodels, exhibit similar AA & AV performance in human-spoken datasets compared to\nwritten ones, there is much room for improvement in AI-generated spoken text\ndetection. 
The HANSEN benchmark is available at:\nhttps://huggingface.co/datasets/HANSEN-REPO/HANSEN.\n","authors":["Nafis Irtiza Tripto","Adaku Uchendu","Thai Le","Mattia Setzu","Fosca Giannotti","Dongwon Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16746v1.pdf","comment":"9 pages, EMNLP-23 findings, 5 pages appendix, 6 figures, 17 tables"},{"id":"http://arxiv.org/abs/2310.16738v1","updated":"2023-10-25T16:11:55Z","published":"2023-10-25T16:11:55Z","title":"Improving Conversational Recommendation Systems via Bias Analysis and\n Language-Model-Enhanced Data Augmentation","summary":" Conversational Recommendation System (CRS) is a rapidly growing research area\nthat has gained significant attention alongside advancements in language\nmodelling techniques. However, the current state of conversational\nrecommendation faces numerous challenges due to its relative novelty and\nlimited existing contributions. In this study, we delve into benchmark datasets\nfor developing CRS models and address potential biases arising from the\nfeedback loop inherent in multi-turn interactions, including selection bias and\nmultiple popularity bias variants. Drawing inspiration from the success of\ngenerative data via using language models and data augmentation techniques, we\npresent two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model\nperformance while mitigating biases. Through extensive experiments on ReDial\nand TG-ReDial benchmark datasets, we show a consistent improvement of CRS\ntechniques with our data augmentation approaches and offer additional insights\non addressing multiple newly formulated biases.\n","authors":["Xi Wang","Hossein A. 
Rahmani","Jiqun Liu","Emine Yilmaz"],"pdf_url":"https://arxiv.org/pdf/2310.16738v1.pdf","comment":"Accepted by EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.16731v1","updated":"2023-10-25T16:00:47Z","published":"2023-10-25T16:00:47Z","title":"Disentangling Extraction and Reasoning in Multi-hop Spatial Reasoning","summary":" Spatial reasoning over text is challenging as the models not only need to\nextract the direct spatial information from the text but also reason over those\nand infer implicit spatial relations. Recent studies highlight the struggles\neven large language models encounter when it comes to performing spatial\nreasoning over text. In this paper, we explore the potential benefits of\ndisentangling the processes of information extraction and reasoning in models\nto address this challenge. To explore this, we design various models that\ndisentangle extraction and reasoning (either symbolic or neural) and compare\nthem with state-of-the-art (SOTA) baselines with no explicit design for these\nparts. Our experimental results consistently demonstrate the efficacy of\ndisentangling, showcasing its ability to enhance models' generalizability\nwithin realistic data domains.\n","authors":["Roshanak Mirzaee","Parisa Kordjamshidi"],"pdf_url":"https://arxiv.org/pdf/2310.16731v1.pdf","comment":"Accepted in EMNLP-Finding 2023"},{"id":"http://arxiv.org/abs/2310.16713v1","updated":"2023-10-25T15:34:55Z","published":"2023-10-25T15:34:55Z","title":"SkyMath: Technical Report","summary":" Large language models (LLMs) have shown great potential to solve varieties of\nnatural language processing (NLP) tasks, including mathematical reasoning. In\nthis work, we present SkyMath, a large language model for mathematics with 13\nbillion parameters. By applying self-compare fine-tuning, we have enhanced\nmathematical reasoning abilities of Skywork-13B-Base remarkably. 
On GSM8K,\nSkyMath outperforms all known open-source models of similar size and has\nestablished a new SOTA performance.\n","authors":["Liu Yang","Haihua Yang","Wenjun Cheng","Lei Lin","Chenxia Li","Yifu Chen","Lunan Liu","Jianfei Pan","Tianwen Wei","Biye Li","Liang Zhao","Lijie Wang","Bo Zhu","Jujie He","Guoliang Li","Xuejie Wu","Xilin Luo","Rui Hu"],"pdf_url":"https://arxiv.org/pdf/2310.16713v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16712v1","updated":"2023-10-25T15:34:30Z","published":"2023-10-25T15:34:30Z","title":"LLM Performance Predictors are good initializers for Architecture Search","summary":" Large language models (LLMs) have become an integral component in solving a\nwide range of NLP tasks. In this work, we explore a novel use case of using\nLLMs to build performance predictors (PP): models that, given a specific deep\nneural network architecture, predict its performance on a downstream task. We\ndesign PP prompts for LLMs consisting of: (i) role: description of the role\nassigned to the LLM, (ii) instructions: set of instructions to be followed by\nthe LLM to carry out performance prediction, (iii) hyperparameters: a\ndefinition of each architecture-specific hyperparameter and (iv)\ndemonstrations: sample architectures along with their efficiency metrics and\n'training from scratch' performance. For machine translation (MT) tasks, we\ndiscover that GPT-4 with our PP prompts (LLM-PP) can predict the performance of\narchitecture with a mean absolute error matching the SOTA and a marginal\ndegradation in rank correlation coefficient compared to SOTA performance\npredictors. Further, we show that the predictions from LLM-PP can be distilled\nto a small regression model (LLM-Distill-PP). 
LLM-Distill-PP models\nsurprisingly retain the performance of LLM-PP largely and can be a\ncost-effective alternative for heavy use cases of performance estimation.\nSpecifically, for neural architecture search (NAS), we propose a Hybrid-Search\nalgorithm for NAS (HS-NAS), which uses LLM-Distill-PP for the initial part of\nsearch, resorting to the baseline predictor for rest of the search. We show\nthat HS-NAS performs very similar to SOTA NAS across benchmarks, reduces search\nhours by 50% roughly, and in some cases, improves latency, GFLOPs, and model\nsize.\n","authors":["Ganesh Jawahar","Muhammad Abdul-Mageed","Laks V. S. Lakshmanan","Dujian Ding"],"pdf_url":"https://arxiv.org/pdf/2310.16712v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.05588v2","updated":"2023-10-25T15:30:59Z","published":"2023-05-09T16:20:48Z","title":"StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure","summary":" This work presents StrAE: a Structured Autoencoder framework that through\nstrict adherence to explicit structure, and use of a novel contrastive\nobjective over tree-structured representations, enables effective learning of\nmulti-level representations. Through comparison over different forms of\nstructure, we verify that our results are directly attributable to the\ninformativeness of the structure provided as input, and show that this is not\nthe case for existing tree models. We then further extend StrAE to allow the\nmodel to define its own compositions using a simple localised-merge algorithm.\nThis variant, called Self-StrAE, outperforms baselines that don't involve\nexplicit hierarchical compositions, and is comparable to models given\ninformative structure (e.g. constituency parses). Our experiments are conducted\nin a data-constrained (circa 10M tokens) setting to help tease apart the\ncontribution of the inductive bias to effective learning. 
However, we find that\nthis framework can be robust to scale, and when extended to a much larger\ndataset (circa 100M tokens), our 430 parameter model performs comparably to a\n6-layer RoBERTa many orders of magnitude larger in size. Our findings support\nthe utility of incorporating explicit composition as an inductive bias for\neffective representation learning.\n","authors":["Mattia Opper","Victor Prokhorov","N. Siddharth"],"pdf_url":"https://arxiv.org/pdf/2305.05588v2.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2110.03427v3","updated":"2023-10-25T15:21:08Z","published":"2021-10-05T16:38:57Z","title":"Is Attention always needed? A Case Study on Language Identification from\n Speech","summary":" Language Identification (LID) is a crucial preliminary process in the field\nof Automatic Speech Recognition (ASR) that involves the identification of a\nspoken language from audio samples. Contemporary systems that can process\nspeech in multiple languages require users to expressly designate one or more\nlanguages prior to utilization. The LID task assumes a significant role in\nscenarios where ASR systems are unable to comprehend the spoken language in\nmultilingual settings, leading to unsuccessful speech recognition outcomes. The\npresent study introduces convolutional recurrent neural network (CRNN) based\nLID, designed to operate on the Mel-frequency Cepstral Coefficient (MFCC)\ncharacteristics of audio samples. Furthermore, we replicate certain\nstate-of-the-art methodologies, specifically the Convolutional Neural Network\n(CNN) and Attention-based Convolutional Recurrent Neural Network (CRNN with\nattention), and conduct a comparative analysis with our CRNN-based approach. We\nconducted comprehensive evaluations on thirteen distinct Indian languages and\nour model resulted in over 98\\% classification accuracy. The LID model exhibits\nhigh-performance levels ranging from 97% to 100% for languages that are\nlinguistically similar. 
The proposed LID model exhibits a high degree of\nextensibility to additional languages and demonstrates a strong resistance to\nnoise, achieving 91.2% accuracy in a noisy setting when applied to a European\nLanguage (EU) dataset.\n","authors":["Atanu Mandal","Santanu Pal","Indranil Dutta","Mahidas Bhattacharya","Sudip Kumar Naskar"],"pdf_url":"https://arxiv.org/pdf/2110.03427v3.pdf","comment":"Accepted for publication in Natural Language Engineering"},{"id":"http://arxiv.org/abs/2310.16685v1","updated":"2023-10-25T14:48:58Z","published":"2023-10-25T14:48:58Z","title":"Detection of news written by the ChatGPT through authorship attribution\n performed by a Bidirectional LSTM model","summary":" The large language model-based chatbot ChatGPT gained a lot of popularity\nsince its launch and has been used in a wide range of situations. This research\ncenters around a particular situation, when the ChatGPT is used to produce news\nthat will be consumed by the population, causing the facilitation in the\nproduction of fake news, spread of misinformation and lack of trust in news\nsources. Aware of these problems, this research aims to build an artificial\nintelligence model capable of performing authorship attribution on news\narticles, identifying the ones written by the ChatGPT. To achieve this goal, a\ndataset containing equal amounts of human and ChatGPT written news was\nassembled and different natural language processing techniques were used to\nextract features from it that were used to train, validate and test three\nmodels built with different techniques. 
The best performance was produced by\nthe Bidirectional Long Short Term Memory (LSTM) Neural Network model, achieving\n91.57\\% accuracy when tested against the data from the testing set.\n","authors":["Amanda Ferrari Iaquinta","Gustavo Voltani von Atzingen"],"pdf_url":"https://arxiv.org/pdf/2310.16685v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16681v1","updated":"2023-10-25T14:45:48Z","published":"2023-10-25T14:45:48Z","title":"BabyStories: Can Reinforcement Learning Teach Baby Language Models to\n Write Better Stories?","summary":" Language models have seen significant growth in the size of their corpus,\nleading to notable performance improvements. Yet, there has been limited\nprogress in developing models that handle smaller, more human-like datasets. As\npart of the BabyLM shared task, this study explores the impact of reinforcement\nlearning from human feedback (RLHF) on language models pretrained from scratch\nwith a limited training corpus. Comparing two GPT-2 variants, the larger model\nperforms better in storytelling tasks after RLHF fine-tuning. These findings\nsuggest that RLHF techniques may be more advantageous for larger models due to\ntheir higher learning and adaptation capacity, though more experiments are\nneeded to confirm this finding. These insights highlight the potential benefits\nof RLHF fine-tuning for language models within limited data, enhancing their\nability to maintain narrative focus and coherence while adhering better to\ninitial instructions in storytelling tasks. 
The code for this work is publicly\navailable at https://github.com/Zephyr1022/BabyStories-UTSA.\n","authors":["Xingmeng Zhao","Tongnian Wang","Sheri Osborn","Anthony Rios"],"pdf_url":"https://arxiv.org/pdf/2310.16681v1.pdf","comment":"Accepted to BabyLM workshop at CoNLL"},{"id":"http://arxiv.org/abs/2310.16676v1","updated":"2023-10-25T14:41:14Z","published":"2023-10-25T14:41:14Z","title":"SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning\n Framework for Emotion Recognition in Conversations","summary":" Emotion recognition in conversations (ERC) is a rapidly evolving task within\nthe natural language processing community, which aims to detect the emotions\nexpressed by speakers during a conversation. Recently, a growing number of ERC\nmethods have focused on leveraging supervised contrastive learning (SCL) to\nenhance the robustness and generalizability of learned features. However,\ncurrent SCL-based approaches in ERC are impeded by the constraint of large\nbatch sizes and the lack of compatibility with most existing ERC models. To\naddress these challenges, we propose an efficient and model-agnostic SCL\nframework named Supervised Sample-Label Contrastive Learning with Soft-HGR\nMaximal Correlation (SSLCL), which eliminates the need for a large batch size\nand can be seamlessly integrated with existing ERC models without introducing\nany model-specific assumptions. Specifically, we introduce a novel perspective\non utilizing label representations by projecting discrete labels into dense\nembeddings through a shallow multilayer perceptron, and formulate the training\nobjective to maximize the similarity between sample features and their\ncorresponding ground-truth label embeddings, while minimizing the similarity\nbetween sample features and label embeddings of disparate classes. 
Moreover, we\ninnovatively adopt the Soft-HGR maximal correlation as a measure of similarity\nbetween sample features and label embeddings, leading to significant\nperformance improvements over conventional similarity measures. Additionally,\nmultimodal cues of utterances are effectively leveraged by SSLCL as data\naugmentations to boost model performances. Extensive experiments on two ERC\nbenchmark datasets, IEMOCAP and MELD, demonstrate the compatibility and\nsuperiority of our proposed SSLCL framework compared to existing\nstate-of-the-art SCL methods. Our code is available at\n\\url{https://github.com/TaoShi1998/SSLCL}.\n","authors":["Tao Shi","Xiao Liang","Yaoyuan Liang","Xinyi Tong","Shao-Lun Huang"],"pdf_url":"https://arxiv.org/pdf/2310.16676v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16654v1","updated":"2023-10-25T14:08:39Z","published":"2023-10-25T14:08:39Z","title":"ChatGPT is a Potential Zero-Shot Dependency Parser","summary":" Pre-trained language models have been widely used in dependency parsing task\nand have achieved significant improvements in parser performance. However, it\nremains an understudied question whether pre-trained language models can\nspontaneously exhibit the ability of dependency parsing without introducing\nadditional parser structure in the zero-shot scenario. In this paper, we\npropose to explore the dependency parsing ability of large language models such\nas ChatGPT and conduct linguistic analysis. 
The experimental results\ndemonstrate that ChatGPT is a potential zero-shot dependency parser, and the\nlinguistic analysis also shows some unique preferences in parsing outputs.\n","authors":["Boda Lin","Xinyi Zhou","Binghao Tang","Xiaocheng Gong","Si Li"],"pdf_url":"https://arxiv.org/pdf/2310.16654v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2310.12127v2","updated":"2023-10-25T13:43:49Z","published":"2023-10-18T17:36:55Z","title":"A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for\n Fairer Instruction-Tuned Machine Translation","summary":" Recent instruction fine-tuned models can solve multiple NLP tasks when\nprompted to do so, with machine translation (MT) being a prominent use case.\nHowever, current research often focuses on standard performance benchmarks,\nleaving compelling fairness and ethical considerations behind. In MT, this\nmight lead to misgendered translations, resulting, among other harms, in the\nperpetuation of stereotypes and prejudices. In this work, we address this gap\nby investigating whether and to what extent such models exhibit gender bias in\nmachine translation and how we can mitigate it. Concretely, we compute\nestablished gender bias metrics on the WinoMT corpus from English to German and\nSpanish. We discover that IFT models default to male-inflected translations,\neven disregarding female occupational stereotypes. Next, using interpretability\nmethods, we unveil that models systematically overlook the pronoun indicating\nthe gender of a target occupation in misgendered translations. Finally, based\non this finding, we propose an easy-to-implement and effective bias mitigation\nsolution based on few-shot learning that leads to significantly fairer\ntranslations.\n","authors":["Giuseppe Attanasio","Flor Miriam Plaza-del-Arco","Debora Nozza","Anne Lauscher"],"pdf_url":"https://arxiv.org/pdf/2310.12127v2.pdf","comment":"Accepted at EMNLP 2023. 
Code and data at\n https://github.com/MilaNLProc/interpretability-mt-gender-bias"},{"id":"http://arxiv.org/abs/2310.11954v2","updated":"2023-10-25T13:34:13Z","published":"2023-10-18T13:31:10Z","title":"MusicAgent: An AI Agent for Music Understanding and Generation with\n Large Language Models","summary":" AI-empowered music processing is a diverse field that encompasses dozens of\ntasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension\ntasks (e.g., music classification). For developers and amateurs, it is very\ndifficult to grasp all of these tasks to satisfy their requirements in music\nprocessing, especially considering the huge differences in the representations\nof music data and the model applicability across platforms among various tasks.\nConsequently, it is necessary to build a system to organize and integrate these\ntasks, and thus help practitioners to automatically analyze their demand and\ncall suitable tools as solutions to fulfill their requirements. Inspired by the\nrecent success of large language models (LLMs) in task automation, we develop a\nsystem, named MusicAgent, which integrates numerous music-related tools and an\nautonomous workflow to address user requirements. More specifically, we build\n1) a toolset that collects tools from diverse sources, including Hugging Face,\nGitHub, and Web API, etc. 2) an autonomous workflow empowered by LLMs (e.g.,\nChatGPT) to organize these tools and automatically decompose user requests into\nmultiple sub-tasks and invoke corresponding music tools. The primary goal of\nthis system is to free users from the intricacies of AI-music tools, enabling\nthem to concentrate on the creative aspect. 
By granting users the freedom to\neffortlessly combine tools, the system offers a seamless and enriching music\nexperience.\n","authors":["Dingyao Yu","Kaitao Song","Peiling Lu","Tianyu He","Xu Tan","Wei Ye","Shikun Zhang","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2310.11954v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10307v4","updated":"2023-10-25T13:32:58Z","published":"2023-05-17T15:44:57Z","title":"FACE: Evaluating Natural Language Generation with Fourier Analysis of\n Cross-Entropy","summary":" Measuring the distance between machine-produced and human language is a\ncritical open problem. Inspired by empirical findings from psycholinguistics on\nthe periodicity of entropy in language, we propose FACE, a set of metrics based\non Fourier Analysis of the estimated Cross-Entropy of language, for measuring\nthe similarity between model-generated and human-written languages. Based on an\nopen-ended generation task and the experimental data from previous studies, we\nfind that FACE can effectively identify the human-model gap, scales with model\nsize, reflects the outcomes of different sampling methods for decoding,\ncorrelates well with other evaluation metrics and with human judgment scores.\n","authors":["Zuhao Yang","Yingfang Yuan","Yang Xu","Shuo Zhan","Huajun Bai","Kefan Chen"],"pdf_url":"https://arxiv.org/pdf/2305.10307v4.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2212.10526v3","updated":"2023-10-25T13:25:20Z","published":"2022-12-20T18:41:38Z","title":"Open Domain Multi-document Summarization: A Comprehensive Study of Model\n Brittleness under Retrieval","summary":" Multi-document summarization (MDS) assumes a set of topic-related documents\nare provided as input. In practice, this document set is not always available;\nit would need to be retrieved given an information need, i.e. a question or\ntopic statement, a setting we dub \"open-domain\" MDS. 
We study this more\nchallenging setting by formalizing the task and bootstrapping it using existing\ndatasets, retrievers and summarizers. Via extensive automatic and human\nevaluation, we determine: (1) state-of-the-art summarizers suffer large\nreductions in performance when applied to open-domain MDS, (2) additional\ntraining in the open-domain setting can reduce this sensitivity to imperfect\nretrieval, and (3) summarizers are insensitive to the retrieval of duplicate\ndocuments and the order of retrieved documents, but highly sensitive to other\nerrors, like the retrieval of irrelevant documents. Based on our results, we\nprovide practical guidelines to enable future work on open-domain MDS, e.g. how\nto choose the number of retrieved documents to summarize. Our results suggest\nthat new retrieval and summarization methods and annotated resources for\ntraining and evaluation are necessary for further progress in the open-domain\nsetting.\n","authors":["John Giorgi","Luca Soldaini","Bo Wang","Gary Bader","Kyle Lo","Lucy Lu Wang","Arman Cohan"],"pdf_url":"https://arxiv.org/pdf/2212.10526v3.pdf","comment":"Accepted to EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2310.16621v1","updated":"2023-10-25T13:20:54Z","published":"2023-10-25T13:20:54Z","title":"ArTST: Arabic Text and Speech Transformer","summary":" We present ArTST, a pre-trained Arabic text and speech transformer for\nsupporting open-source speech technologies for the Arabic language. The model\narchitecture follows the unified-modal framework, SpeechT5, that was recently\nreleased for English, and is focused on Modern Standard Arabic (MSA), with\nplans to extend the model for dialectal and code-switched Arabic in future\neditions. We pre-trained the model from scratch on MSA speech and text data,\nand fine-tuned it for the following tasks: Automatic Speech Recognition (ASR),\nText-To-Speech synthesis (TTS), and spoken dialect identification. 
In our\nexperiments comparing ArTST with SpeechT5, as well as with previously reported\nresults in these tasks, ArTST performs on a par with or exceeding the current\nstate-of-the-art in all three tasks. Moreover, we find that our pre-training is\nconducive for generalization, which is particularly evident in the low-resource\nTTS task. The pre-trained model as well as the fine-tuned ASR and TTS models\nare released for research use.\n","authors":["Hawau Olamide Toyin","Amirbek Djanibekov","Ajinkya Kulkarni","Hanan Aldarmaki"],"pdf_url":"https://arxiv.org/pdf/2310.16621v1.pdf","comment":"11 pages, 1 figure, SIGARAB ArabicNLP 2023"},{"id":"http://arxiv.org/abs/2310.13447v2","updated":"2023-10-25T13:14:40Z","published":"2023-10-20T12:26:04Z","title":"Multiscale Superpixel Structured Difference Graph Convolutional Network\n for VL Representation","summary":" Within the multimodal field, the key to integrating vision and language lies\nin establishing a good alignment strategy. Recently, benefiting from the\nsuccess of self-supervised learning, significant progress has been made in\nmultimodal semantic representation based on pre-trained models for vision and\nlanguage. However, there is still room for improvement in visual semantic\nrepresentation. The lack of spatial semantic coherence and vulnerability to\nnoise makes it challenging for current pixel or patch-based methods to\naccurately extract complex scene boundaries. To this end, this paper develops\nsuperpixel as a comprehensive compact representation of learnable image data,\nwhich effectively reduces the number of visual primitives for subsequent\nprocessing by clustering perceptually similar pixels. To mine more precise\ntopological relations, we propose a Multiscale Difference Graph Convolutional\nNetwork (MDGCN). 
It parses the entire image as a fine-to-coarse hierarchical\nstructure of constituent visual patterns, and captures multiscale features by\nprogressively merging adjacent superpixels as graph nodes. Moreover, we predict\nthe differences between adjacent nodes through the graph structure,\nfacilitating key information aggregation of graph nodes to reason actual\nsemantic relations. Afterward, we design a multi-level fusion rule in a\nbottom-up manner to avoid understanding deviation by learning complementary\nspatial information at different regional scales. Our proposed method can be\nwell applied to multiple downstream task learning. Extensive experiments\ndemonstrate that our method is competitive with other state-of-the-art methods\nin visual reasoning. Our code will be released upon publication.\n","authors":["Siyu Zhang","Yeming Chen","Sirui Cheng","Yaoru Sun","Jun Yang","Lizhi Bai"],"pdf_url":"https://arxiv.org/pdf/2310.13447v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16616v1","updated":"2023-10-25T13:12:39Z","published":"2023-10-25T13:12:39Z","title":"Context Does Matter: End-to-end Panoptic Narrative Grounding with\n Deformable Attention Refined Matching Network","summary":" Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that\naims to segment visual objects in images based on dense narrative captions. The\ncurrent state-of-the-art methods first refine the representation of phrase by\naggregating the most similar $k$ image pixels, and then match the refined text\nrepresentations with the pixels of the image feature map to generate\nsegmentation results. However, simply aggregating sampled image features\nignores the contextual information, which can lead to phrase-to-pixel\nmis-match. 
In this paper, we propose a novel learning framework called\nDeformable Attention Refined Matching Network (DRMN), whose main idea is to\nbring deformable attention in the iterative process of feature learning to\nincorporate essential context information of different scales of pixels. DRMN\niteratively re-encodes pixels with the deformable attention network after\nupdating the feature representation of the top-$k$ most similar pixels. As\nsuch, DRMN can lead to accurate yet discriminative pixel representations,\npurify the top-$k$ most similar pixels, and consequently alleviate the\nphrase-to-pixel mis-match substantially. Experimental results show that our\nnovel design significantly improves the matching results between text phrases\nand image pixels. Concretely, DRMN achieves new state-of-the-art performance on\nthe PNG benchmark with an average recall improvement of 3.5%. The codes are\navailable at: https://github.com/JaMesLiMers/DRMN.\n","authors":["Yiming Lin","Xiao-Bo Jin","Qiufeng Wang","Kaizhu Huang"],"pdf_url":"https://arxiv.org/pdf/2310.16616v1.pdf","comment":"Accepted by ICDM 2023"},{"id":"http://arxiv.org/abs/2310.16609v1","updated":"2023-10-25T13:07:07Z","published":"2023-10-25T13:07:07Z","title":"Back Transcription as a Method for Evaluating Robustness of Natural\n Language Understanding Models to Speech Recognition Errors","summary":" In a spoken dialogue system, an NLU model is preceded by a speech recognition\nsystem that can deteriorate the performance of natural language understanding.\nThis paper proposes a method for investigating the impact of speech recognition\nerrors on the performance of natural language understanding models. The\nproposed method combines the back transcription procedure with a fine-grained\ntechnique for categorizing the errors that affect the performance of NLU\nmodels. The method relies on the usage of synthesized speech for NLU\nevaluation. 
We show that the use of synthesized speech in place of audio\nrecording does not change the outcomes of the presented technique in a\nsignificant way.\n","authors":["Marek Kubis","Paweł Skórzewski","Marcin Sowański","Tomasz Ziętkiewicz"],"pdf_url":"https://arxiv.org/pdf/2310.16609v1.pdf","comment":"Accepted to EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.16607v1","updated":"2023-10-25T12:59:51Z","published":"2023-10-25T12:59:51Z","title":"On the Interplay between Fairness and Explainability","summary":" In order to build reliable and trustworthy NLP applications, models need to\nbe both fair across different demographics and explainable. Usually these two\nobjectives, fairness and explainability, are optimized and/or examined\nindependently of each other. Instead, we argue that forthcoming, trustworthy\nNLP systems should consider both. In this work, we perform a first study to\nunderstand how they influence each other: do fair(er) models rely on more\nplausible rationales? and vice versa. To this end, we conduct experiments on\ntwo English multi-class text classification datasets, BIOS and ECtHR, that\nprovide information on gender and nationality, respectively, as well as\nhuman-annotated rationales. We fine-tune pre-trained language models with\nseveral methods for (i) bias mitigation, which aims to improve fairness; (ii)\nrationale extraction, which aims to produce plausible explanations. We find\nthat bias mitigation algorithms do not always lead to fairer models. 
Moreover,\nwe discover that empirical fairness and explainability are orthogonal.\n","authors":["Stephanie Brandl","Emanuele Bugliarello","Ilias Chalkidis"],"pdf_url":"https://arxiv.org/pdf/2310.16607v1.pdf","comment":"15 pages (incl Appendix), 4 figures, 8 tables"},{"id":"http://arxiv.org/abs/2307.16200v2","updated":"2023-10-25T12:41:34Z","published":"2023-07-30T10:51:32Z","title":"A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue\n Information Extraction","summary":" This paper focuses on term-status pair extraction from medical dialogues\n(MD-TSPE), which is essential in diagnosis dialogue systems and the automatic\nscribe of electronic medical records (EMRs). In the past few years, works on\nMD-TSPE have attracted increasing research attention, especially after the\nremarkable progress made by generative methods. However, these generative\nmethods output a whole sequence consisting of term-status pairs in one stage\nand ignore integrating prior knowledge, which demands a deeper understanding to\nmodel the relationship between terms and infer the status of each term. This\npaper presents a knowledge-enhanced two-stage generative framework (KTGF) to\naddress the above challenges. Using task-specific prompts, we employ a single\nmodel to complete the MD-TSPE through two phases in a unified generative form:\nwe generate all terms the first and then generate the status of each generated\nterm. In this way, the relationship between terms can be learned more\neffectively from the sequence containing only terms in the first phase, and our\ndesigned knowledge-enhanced prompt in the second phase can leverage the\ncategory and status candidates of the generated term for status generation.\nFurthermore, our proposed special status \"not mentioned\" makes more terms\navailable and enriches the training data in the second phase, which is critical\nin the low-resource setting. 
The experiments on the Chunyu and CMDD datasets\nshow that the proposed method achieves superior results compared to the\nstate-of-the-art models in the full training and low-resource settings.\n","authors":["Zefa Hu","Ziyi Ni","Jing Shi","Shuang Xu","Bo Xu"],"pdf_url":"https://arxiv.org/pdf/2307.16200v2.pdf","comment":"Published in Machine Intelligence Research"},{"id":"http://arxiv.org/abs/2310.16582v1","updated":"2023-10-25T12:16:33Z","published":"2023-10-25T12:16:33Z","title":"Tailoring Personality Traits in Large Language Models via\n Unsupervisedly-Built Personalized Lexicons","summary":" Personality plays a pivotal role in shaping human expression patterns, and\nempowering and manipulating large language models (LLMs) with personality\ntraits holds significant promise in enhancing the user experience of LLMs.\nHowever, prior approaches either rely on fine-tuning LLMs on a corpus enriched\nwith personalized expressions or necessitate the manual crafting of prompts to\ninduce LLMs to produce personalized responses. The former approaches demand\nsubstantial time and resources for collecting sufficient training examples\nwhile the latter might fail in enabling the precise manipulation of the\npersonality traits at a fine-grained level (e.g., achieving high agreeableness\nwhile reducing openness). In this study, we introduce a novel approach for\ntailoring personality traits within LLMs, allowing for the incorporation of any\ncombination of the Big Five factors (i.e., openness, conscientiousness,\nextraversion, agreeableness, and neuroticism) in a pluggable manner. This is\nachieved by employing a set of Unsupervisedly-Built Personalized Lexicons\n(UBPL) that are utilized to adjust the probability of the next token predicted\nby the original LLMs during the decoding phase. This adjustment encourages the\nmodels to generate words present in the personalized lexicons while preserving\nthe naturalness of the generated texts. 
Extensive experimentation demonstrates\nthe effectiveness of our approach in finely manipulating LLMs' personality\ntraits. Furthermore, our method can be seamlessly integrated into other LLMs\nwithout necessitating updates to their parameters.\n","authors":["Tianlong Li","Xiaoqing Zheng","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2310.16582v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2310.16579v1","updated":"2023-10-25T12:06:55Z","published":"2023-10-25T12:06:55Z","title":"WSDMS: Debunk Fake News via Weakly Supervised Detection of Misinforming\n Sentences with Contextualized Social Wisdom","summary":" In recent years, we witness the explosion of false and unconfirmed\ninformation (i.e., rumors) that went viral on social media and shocked the\npublic. Rumors can trigger versatile, mostly controversial stance expressions\namong social media users. Rumor verification and stance detection are different\nyet relevant tasks. Fake news debunking primarily focuses on determining the\ntruthfulness of news articles, which oversimplifies the issue as fake news\noften combines elements of both truth and falsehood. Thus, it becomes crucial\nto identify specific instances of misinformation within the articles. In this\nresearch, we investigate a novel task in the field of fake news debunking,\nwhich involves detecting sentence-level misinformation. One of the major\nchallenges in this task is the absence of a training dataset with\nsentence-level annotations regarding veracity. Inspired by the Multiple\nInstance Learning (MIL) approach, we propose a model called Weakly Supervised\nDetection of Misinforming Sentences (WSDMS). This model only requires bag-level\nlabels for training but is capable of inferring both sentence-level\nmisinformation and article-level veracity, aided by relevant social media\nconversations that are attentively contextualized with news sentences. 
We\nevaluate WSDMS on three real-world benchmarks and demonstrate that it\noutperforms existing state-of-the-art baselines in debunking fake news at both\nthe sentence and article levels.\n","authors":["Ruichao Yang","Wei Gao","Jing Ma","Hongzhan Lin","Zhiwei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16579v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16570v1","updated":"2023-10-25T11:57:13Z","published":"2023-10-25T11:57:13Z","title":"Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained\n Language Models","summary":" Pre-trained Language Models (PLMs) are trained on vast unlabeled data, rich\nin world knowledge. This fact has sparked the interest of the community in\nquantifying the amount of factual knowledge present in PLMs, as this explains\ntheir performance on downstream tasks, and potentially justifies their use as\nknowledge bases. In this work, we survey methods and datasets that are used to\nprobe PLMs for factual knowledge. Our contributions are: (1) We propose a\ncategorization scheme for factual probing methods that is based on how their\ninputs, outputs and the probed PLMs are adapted; (2) We provide an overview of\nthe datasets used for factual probing; (3) We synthesize insights about\nknowledge retention and prompt optimization in PLMs, analyze obstacles to\nadopting PLMs as knowledge bases and outline directions for future work.\n","authors":["Paul Youssef","Osman Alperen Koraş","Meijie Li","Jörg Schlötterer","Christin Seifert"],"pdf_url":"https://arxiv.org/pdf/2310.16570v1.pdf","comment":"Accepted at EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2305.12920v3","updated":"2023-10-25T11:56:49Z","published":"2023-05-22T11:08:00Z","title":"A Diachronic Analysis of Paradigm Shifts in NLP Research: When, How, and\n Why?","summary":" Understanding the fundamental concepts and trends in a scientific field is\ncrucial for keeping abreast of its continuous advancement. 
In this study, we\npropose a systematic framework for analyzing the evolution of research topics\nin a scientific field using causal discovery and inference techniques. We\ndefine three variables to encompass diverse facets of the evolution of research\ntopics within NLP and utilize a causal discovery algorithm to unveil the causal\nconnections among these variables using observational data. Subsequently, we\nleverage this structure to measure the intensity of these relationships. By\nconducting extensive experiments on the ACL Anthology corpus, we demonstrate\nthat our framework effectively uncovers evolutionary trends and the underlying\ncauses for a wide range of NLP research topics. Specifically, we show that\ntasks and methods are primary drivers of research in NLP, with datasets\nfollowing, while metrics have minimal impact.\n","authors":["Aniket Pramanick","Yufang Hou","Saif M. Mohammad","Iryna Gurevych"],"pdf_url":"https://arxiv.org/pdf/2305.12920v3.pdf","comment":"accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16568v1","updated":"2023-10-25T11:51:22Z","published":"2023-10-25T11:51:22Z","title":"1-PAGER: One Pass Answer Generation and Evidence Retrieval","summary":" We present 1-Pager the first system that answers a question and retrieves\nevidence using a single Transformer-based model and decoding process. 1-Pager\nincrementally partitions the retrieval corpus using constrained decoding to\nselect a document and answer string, and we show that this is competitive with\ncomparable retrieve-and-read alternatives according to both retrieval and\nanswer accuracy metrics. 1-Pager also outperforms the equivalent closed-book\nquestion answering model, by grounding predictions in an evidence corpus. 
While\n1-Pager is not yet on-par with more expensive systems that read many more\ndocuments before generating an answer, we argue that it provides an important\nstep toward attributed generation by folding retrieval into the\nsequence-to-sequence paradigm that is currently dominant in NLP. We also show\nthat the search paths used to partition the corpus are easy to read and\nunderstand, paving a way forward for interpretable neural retrieval.\n","authors":["Palak Jain","Livio Baldini Soares","Tom Kwiatkowski"],"pdf_url":"https://arxiv.org/pdf/2310.16568v1.pdf","comment":"Accepted at EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2309.13230v3","updated":"2023-10-25T11:31:40Z","published":"2023-09-23T01:52:14Z","title":"Unify word-level and span-level tasks: NJUNLP's Participation for the\n WMT2023 Quality Estimation Shared Task","summary":" We introduce the submissions of the NJUNLP team to the WMT 2023 Quality\nEstimation (QE) shared task. Our team submitted predictions for the\nEnglish-German language pair on all two sub-tasks: (i) sentence- and word-level\nquality prediction; and (ii) fine-grained error span detection. This year, we\nfurther explore pseudo data methods for QE based on NJUQE framework\n(https://github.com/NJUNLP/njuqe). We generate pseudo MQM data using parallel\ndata from the WMT translation task. We pre-train the XLMR large model on pseudo\nQE data, then fine-tune it on real QE data. At both stages, we jointly learn\nsentence-level scores and word-level tags. Empirically, we conduct experiments\nto find the key hyper-parameters that improve the performance. Technically, we\npropose a simple method that converts the word-level outputs to fine-grained\nerror span results. 
Overall, our models achieved the best results in\nEnglish-German for both word-level and fine-grained error span detection\nsub-tasks by a considerable margin.\n","authors":["Xiang Geng","Zhejian Lai","Yu Zhang","Shimin Tao","Hao Yang","Jiajun Chen","Shujian Huang"],"pdf_url":"https://arxiv.org/pdf/2309.13230v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02031v4","updated":"2023-10-25T10:55:38Z","published":"2023-10-03T13:17:35Z","title":"OceanGPT: A Large Language Model for Ocean Science Tasks","summary":" Ocean science, which delves into the oceans that are reservoirs of life and\nbiodiversity, is of great significance given that oceans cover over 70% of our\nplanet's surface. Recently, advances in Large Language Models (LLMs) have\ntransformed the paradigm in science. Despite the success in other domains,\ncurrent LLMs often fall short in catering to the needs of domain experts like\noceanographers, and the potential of LLMs for ocean science is under-explored.\nThe intrinsic reason may be the immense and intricate nature of ocean data as\nwell as the necessity for higher granularity and richness in knowledge. To\nalleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean\ndomain, which is expert in various ocean science tasks. We propose DoInstruct,\na novel framework to automatically obtain a large volume of ocean domain\ninstruction data, which generates instructions based on multi-agent\ncollaboration. Additionally, we construct the first oceanography benchmark,\nOceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through\ncomprehensive experiments, OceanGPT not only shows a higher level of knowledge\nexpertise for ocean science tasks but also gains preliminary embodied\nintelligence capabilities in ocean technology. 
Codes, data and checkpoints will\nsoon be available at https://github.com/zjunlp/KnowLM.\n","authors":["Zhen Bi","Ningyu Zhang","Yida Xue","Yixin Ou","Daxiong Ji","Guozhou Zheng","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2310.02031v4.pdf","comment":"Work in progress. Project Website:\n https://zjunlp.github.io/project/OceanGPT/"},{"id":"http://arxiv.org/abs/2310.16538v1","updated":"2023-10-25T10:35:09Z","published":"2023-10-25T10:35:09Z","title":"FedTherapist: Mental Health Monitoring with User-Generated Linguistic\n Expressions on Smartphones via Federated Learning","summary":" Psychiatrists diagnose mental disorders via the linguistic use of patients.\nStill, due to data privacy, existing passive mental health monitoring systems\nuse alternative features such as activity, app usage, and location via mobile\ndevices. We propose FedTherapist, a mobile mental health monitoring system that\nutilizes continuous speech and keyboard input in a privacy-preserving way via\nfederated learning. We explore multiple model designs by comparing their\nperformance and overhead for FedTherapist to overcome the complex nature of\non-device language model training on smartphones. We further propose a\nContext-Aware Language Learning (CALL) methodology to effectively utilize\nsmartphones' large and noisy text for mental health signal sensing. Our\nIRB-approved evaluation of the prediction of self-reported depression, stress,\nanxiety, and mood from 46 participants shows higher accuracy of FedTherapist\ncompared with the performance with non-language features, achieving 0.15 AUROC\nimprovement and 8.21% MAE reduction.\n","authors":["Jaemin Shin","Hyungjun Yoon","Seungjoo Lee","Sungjoon Park","Yunxin Liu","Jinho D. 
Choi","Sung-Ju Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16538v1.pdf","comment":"Accepted to the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2307.15776v2","updated":"2023-10-25T10:35:03Z","published":"2023-07-28T19:33:18Z","title":"Select and Augment: Enhanced Dense Retrieval Knowledge Graph\n Augmentation","summary":" Injecting textual information into knowledge graph (KG) entity\nrepresentations has been a worthwhile expedition in terms of improving\nperformance in KG oriented tasks within the NLP community. External knowledge\noften adopted to enhance KG embeddings ranges from semantically rich lexical\ndependency parsed features to a set of relevant key words to entire text\ndescriptions supplied from an external corpus such as wikipedia and many more.\nDespite the gains this innovation (Text-enhanced KG embeddings) has made, the\nproposal in this work suggests that it can be improved even further. Instead of\nusing a single text description (which would not sufficiently represent an\nentity because of the inherent lexical ambiguity of text), we propose a\nmulti-task framework that jointly selects a set of text descriptions relevant\nto KG entities as well as align or augment KG embeddings with text\ndescriptions. Different from prior work that plugs formal entity descriptions\ndeclared in knowledge bases, this framework leverages a retriever model to\nselectively identify richer or highly relevant text descriptions to use in\naugmenting entities. Furthermore, the framework treats the number of\ndescriptions to use in augmentation process as a parameter, which allows the\nflexibility of enumerating across several numbers before identifying an\nappropriate number. 
Experiment results for Link Prediction demonstrate a 5.5%\nand 3.5% percentage increase in the Mean Reciprocal Rank (MRR) and Hits@10\nscores respectively, in comparison to text-enhanced knowledge graph\naugmentation methods using traditional CNNs.\n","authors":["Micheal Abaho","Yousef H. Alfaifi"],"pdf_url":"https://arxiv.org/pdf/2307.15776v2.pdf","comment":"Article has already been published in Journal of Artificial\n Intelligence Research (JAIR)"},{"id":"http://arxiv.org/abs/2310.16535v1","updated":"2023-10-25T10:34:02Z","published":"2023-10-25T10:34:02Z","title":"R$^3$ Prompting: Review, Rephrase and Resolve for Chain-of-Thought\n Reasoning in Large Language Models under Noisy Context","summary":" With the help of Chain-of-Thought (CoT) prompting, Large Language Models\n(LLMs) have achieved remarkable performance on various reasoning tasks.\nHowever, most of them have been evaluated under noise-free context and the\ndilemma for LLMs to produce inaccurate results under the noisy context has not\nbeen fully investigated. Existing studies utilize trigger sentences to\nencourage LLMs to concentrate on the relevant information but the trigger has\nlimited effect on final answer prediction. Inspired by interactive CoT method,\nwhere intermediate reasoning steps are promoted by multiple rounds of\ninteraction between users and LLMs, we propose a novel prompting method, namely\nR$^3$ prompting, for CoT reasoning under noisy context. Specifically, R$^3$\nprompting interacts with LLMs to perform key sentence extraction, variable\ndeclaration and answer prediction, which corresponds to a thought process of\nreviewing, rephrasing and resolving. The responses generated at the last\ninteraction will perform as hints to guide toward the responses of the next\ninteraction. Our experiments show that R$^3$ prompting significantly\noutperforms existing CoT prompting methods on five reasoning tasks under noisy\ncontext. 
With GPT-3.5-turbo, we observe 3.7% accuracy improvement on average on\nthe reasoning tasks under noisy context compared to the most competitive\nprompting baseline. More analyses and ablation studies show the robustness and\ngeneralization of R$^3$ prompting method in solving reasoning tasks in LLMs\nunder noisy context.\n","authors":["Qingyuan Tian","Hanlun Zhu","Lei Wang","Yang Li","Yunshi Lan"],"pdf_url":"https://arxiv.org/pdf/2310.16535v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16534v1","updated":"2023-10-25T10:33:17Z","published":"2023-10-25T10:33:17Z","title":"An Early Evaluation of GPT-4V(ision)","summary":" In this paper, we evaluate different abilities of GPT-4V including visual\nunderstanding, language understanding, visual puzzle solving, and understanding\nof other modalities such as depth, thermal, video, and audio. To estimate\nGPT-4V's performance, we manually construct 656 test instances and carefully\nevaluate the results of GPT-4V. The highlights of our findings are as follows:\n(1) GPT-4V exhibits impressive performance on English visual-centric benchmarks\nbut fails to recognize simple Chinese texts in the images; (2) GPT-4V shows\ninconsistent refusal behavior when answering questions related to sensitive\ntraits such as gender, race, and age; (3) GPT-4V obtains worse results than\nGPT-4 (API) on language understanding tasks including general language\nunderstanding benchmarks and visual commonsense knowledge evaluation\nbenchmarks; (4) Few-shot prompting can improve GPT-4V's performance on both\nvisual understanding and language understanding; (5) GPT-4V struggles to find\nthe nuances between two similar images and solve the easy math picture puzzles;\n(6) GPT-4V shows non-trivial performance on the tasks of similar modalities to\nimage, such as video and thermal. 
Our experimental results reveal the ability\nand limitations of GPT-4V and we hope our paper can provide some insights into\nthe application and research of GPT-4V.\n","authors":["Yang Wu","Shilong Wang","Hao Yang","Tian Zheng","Hongbo Zhang","Yanyan Zhao","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2310.16534v1.pdf","comment":"Technical Report. Data are available at\n https://github.com/albertwy/GPT-4V-Evaluation"},{"id":"http://arxiv.org/abs/2310.16528v1","updated":"2023-10-25T10:22:49Z","published":"2023-10-25T10:22:49Z","title":"CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task\n Information Retrieval","summary":" We present the Charles University system for the MRL~2023 Shared Task on\nMulti-lingual Multi-task Information Retrieval. The goal of the shared task was\nto develop systems for named entity recognition and question answering in\nseveral under-represented languages. Our solutions to both subtasks rely on the\ntranslate-test approach. We first translate the unlabeled examples into English\nusing a multilingual machine translation model. Then, we run inference on the\ntranslated data using a strong task-specific model. Finally, we project the\nlabeled data back into the original language. To keep the inferred tags on the\ncorrect positions in the original language, we propose a method based on\nscoring the candidate positions using a label-sensitive translation model. In\nboth settings, we experiment with finetuning the classification models on the\ntranslated data. 
However, due to a domain mismatch between the development data\nand the shared task validation and test sets, the finetuned models could not\noutperform our baselines.\n","authors":["Jindřich Helcl","Jindřich Libovický"],"pdf_url":"https://arxiv.org/pdf/2310.16528v1.pdf","comment":"8 pages, 2 figures; System description paper at the MRL 2023 workshop\n at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16523v1","updated":"2023-10-25T10:17:17Z","published":"2023-10-25T10:17:17Z","title":"Improving Diversity of Demographic Representation in Large Language\n Models via Collective-Critiques and Self-Voting","summary":" A crucial challenge for generative large language models (LLMs) is diversity:\nwhen a user's prompt is under-specified, models may follow implicit assumptions\nwhile generating a response, which may result in homogenization of the\nresponses, as well as certain demographic groups being under-represented or\neven erased from the generated responses. In this paper, we formalize diversity\nof representation in generative LLMs. We present evaluation datasets and\npropose metrics to measure diversity in generated responses along people and\nculture axes. We find that LLMs understand the notion of diversity, and that\nthey can reason and critique their own responses for that goal. This finding\nmotivated a new prompting technique called collective-critique and self-voting\n(CCSV) to self-improve people diversity of LLMs by tapping into its diversity\nreasoning capabilities, without relying on handcrafted examples or prompt\ntuning. 
Extensive empirical experiments with both human and automated\nevaluations show that our proposed approach is effective at improving people\nand culture diversity, and outperforms all baseline methods by a large margin.\n","authors":["Preethi Lahoti","Nicholas Blumm","Xiao Ma","Raghavendra Kotikalapudi","Sahitya Potluri","Qijun Tan","Hansa Srinivasan","Ben Packer","Ahmad Beirami","Alex Beutel","Jilin Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16523v1.pdf","comment":"To appear at EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.12541v2","updated":"2023-10-25T10:11:12Z","published":"2023-10-19T07:46:54Z","title":"Large Language Model for Multi-objective Evolutionary Optimization","summary":" Multiobjective evolutionary algorithms (MOEAs) are major methods for solving\nmultiobjective optimization problems (MOPs). Many MOEAs have been proposed in\nthe past decades, of which the search operators need a carefully handcrafted\ndesign with domain knowledge. Recently, some attempts have been made to replace\nthe manually designed operators in MOEAs with learning-based operators (e.g.,\nneural network models). However, much effort is still required for designing\nand training such models, and the learned operators might not generalize well\non new problems. To tackle the above challenges, this work investigates a novel\napproach that leverages the powerful large language model (LLM) to design MOEA\noperators. With proper prompt engineering, we successfully let a general LLM\nserve as a black-box search operator for decomposition-based MOEA (MOEA/D) in a\nzero-shot manner. In addition, by learning from the LLM behavior, we further\ndesign an explicit white-box operator with randomness and propose a new version\nof decomposition-based MOEA, termed MOEA/D-LO. Experimental studies on\ndifferent test benchmarks show that our proposed method can achieve competitive\nperformance with widely used MOEAs. 
It is also promising to see the operator\nonly learned from a few instances can have robust generalization performance on\nunseen problems with quite different patterns and settings. The results reveal\nthe potential benefits of using pre-trained LLMs in the design of MOEAs.\n","authors":["Fei Liu","Xi Lin","Zhenkun Wang","Shunyu Yao","Xialiang Tong","Mingxuan Yuan","Qingfu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.12541v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16517v1","updated":"2023-10-25T10:06:17Z","published":"2023-10-25T10:06:17Z","title":"OccuQuest: Mitigating Occupational Bias for Inclusive Large Language\n Models","summary":" The emergence of large language models (LLMs) has revolutionized natural\nlanguage processing tasks. However, existing instruction-tuning datasets suffer\nfrom occupational bias: the majority of data relates to only a few occupations,\nwhich hampers the instruction-tuned LLMs to generate helpful responses to\nprofessional queries from practitioners in specific fields. To mitigate this\nissue and promote occupation-inclusive LLMs, we create an instruction-tuning\ndataset named \\emph{OccuQuest}, which contains 110,000+ prompt-completion pairs\nand 30,000+ dialogues covering over 1,000 occupations in 26 occupational\ncategories. We systematically request ChatGPT, organizing queries\nhierarchically based on Occupation, Responsibility, Topic, and Question, to\nensure a comprehensive coverage of occupational specialty inquiries. By\ncomparing with three commonly used datasets (Dolly, ShareGPT, and WizardLM), we\nobserve that OccuQuest exhibits a more balanced distribution across\noccupations. Furthermore, we assemble three test sets for comprehensive\nevaluation, an occu-test set covering 25 occupational categories, an estate set\nfocusing on real estate, and an occu-quora set containing real-world questions\nfrom Quora. 
We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which\nsignificantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and\nWizardLM) on professional questions in GPT-4 and human evaluations. Notably, on\nthe occu-quora set, OccuLLaMA reaches a high win rate of 86.4\\% against\nWizardLM.\n","authors":["Mingfeng Xue","Dayiheng Liu","Kexin Yang","Guanting Dong","Wenqiang Lei","Zheng Yuan","Chang Zhou","Jingren Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.16517v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15694v2","updated":"2023-10-25T10:03:52Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. 
Our experimental results show that COPF\noutperforms strong Continuous learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.03232v2","updated":"2023-10-25T10:01:19Z","published":"2021-06-06T20:04:39Z","title":"A Targeted Assessment of Incremental Processing in Neural LanguageModels\n and Humans","summary":" We present a targeted, scaled-up comparison of incremental processing in\nhumans and neural language models by collecting by-word reaction time data for\nsixteen different syntactic test suites across a range of structural phenomena.\nHuman reaction time data comes from a novel online experimental paradigm called\nthe Interpolated Maze task. We compare human reaction times to by-word\nprobabilities for four contemporary language models, with different\narchitectures and trained on a range of data set sizes. We find that across\nmany phenomena, both humans and language models show increased processing\ndifficulty in ungrammatical sentence regions with human and model `accuracy'\nscores (a la Marvin and Linzen(2018)) about equal. However, although language\nmodel outputs match humans in direction, we show that models systematically\nunder-predict the difference in magnitude of incremental processing difficulty\nbetween grammatical and ungrammatical sentences. Specifically, when models\nencounter syntactic violations they fail to accurately predict the longer\nreaction times observed in the human data. These results call into question\nwhether contemporary language models are approaching human-like performance for\nsensitivity to syntactic violations.\n","authors":["Ethan Gotlieb Wilcox","Pranali Vani","Roger P. 
Levy"],"pdf_url":"https://arxiv.org/pdf/2106.03232v2.pdf","comment":"Published in the proceedings of ACL 2021"},{"id":"http://arxiv.org/abs/2305.14482v2","updated":"2023-10-25T09:13:34Z","published":"2023-05-23T19:24:42Z","title":"Is a Prestigious Job the same as a Prestigious Country? A Case Study on\n Multilingual Sentence Embeddings and European Countries","summary":" We study how multilingual sentence representations capture European countries\nand occupations and how this differs across European languages. We prompt the\nmodels with templated sentences that we machine-translate into 12 European\nlanguages and analyze the most prominent dimensions in the embeddings. Our\nanalysis reveals that the most prominent feature in the embedding is the\ngeopolitical distinction between Eastern and Western Europe and the country's\neconomic strength in terms of GDP. When prompted specifically for job prestige,\nthe embedding space clearly distinguishes high and low-prestige jobs. The\noccupational dimension is uncorrelated with the most dominant country\ndimensions in three out of four studied models. The exception is a small\ndistilled model that exhibits a connection between occupational prestige and\ncountry of origin, which is a potential source of nationality-based\ndiscrimination. 
Our findings are consistent across languages.\n","authors":["Jindřich Libovický"],"pdf_url":"https://arxiv.org/pdf/2305.14482v2.pdf","comment":"10 pages, 1 figure; Findings of EMNLP 2023, camera-ready"},{"id":"http://arxiv.org/abs/2310.16484v1","updated":"2023-10-25T09:09:55Z","published":"2023-10-25T09:09:55Z","title":"Subspace Chronicles: How Linguistic Information Emerges, Shifts and\n Interacts during Language Model Training","summary":" Representational spaces learned via language modeling are fundamental to\nNatural Language Processing (NLP), however there has been limited understanding\nregarding how and when during training various types of linguistic information\nemerge and interact. Leveraging a novel information theoretic probing suite,\nwhich enables direct comparisons of not just task performance, but their\nrepresentational subspaces, we analyze nine tasks covering syntax, semantics\nand reasoning, across 2M pre-training steps and five seeds. We identify\ncritical learning phases across tasks and time, during which subspaces emerge,\nshare information, and later disentangle to specialize. Across these phases,\nsyntactic knowledge is acquired rapidly after 0.5% of full training. Continued\nperformance improvements primarily stem from the acquisition of open-domain\nknowledge, while semantics and reasoning tasks benefit from later boosts to\nlong-range contextualization and higher specialization. Measuring cross-task\nsimilarity further reveals that linguistically related tasks share information\nthroughout training, and do so more during the critical phase of learning than\nbefore or after. 
Our findings have implications for model interpretability,\nmulti-task learning, and learning from limited data.\n","authors":["Max Müller-Eberstein","Rob van der Goot","Barbara Plank","Ivan Titov"],"pdf_url":"https://arxiv.org/pdf/2310.16484v1.pdf","comment":"Accepted at EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.10333v3","updated":"2023-10-25T08:18:40Z","published":"2023-10-16T12:17:11Z","title":"Legal NLP Meets MiCAR: Advancing the Analysis of Crypto White Papers","summary":" In the rapidly evolving field of crypto assets, white papers are essential\ndocuments for investor guidance, and are now subject to unprecedented content\nrequirements under the European Union's Markets in Crypto-Assets Regulation\n(MiCAR). Natural Language Processing (NLP) can serve as a powerful tool for\nboth analyzing these documents and assisting in regulatory compliance. This\npaper delivers two contributions to the topic. First, we survey existing\napplications of textual analysis to unregulated crypto asset white papers,\nuncovering a research gap that could be bridged with interdisciplinary\ncollaboration. We then conduct an analysis of the changes introduced by MiCAR,\nhighlighting the opportunities and challenges of integrating NLP within the new\nregulatory framework. The findings set the stage for further research, with the\npotential to benefit regulators, crypto asset issuers, and investors.\n","authors":["Carolina Camassa"],"pdf_url":"https://arxiv.org/pdf/2310.10333v3.pdf","comment":"Accepted at NLLP23"},{"id":"http://arxiv.org/abs/2309.17332v2","updated":"2023-10-25T08:16:43Z","published":"2023-09-29T15:43:42Z","title":"Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of\n Biomedical Research Articles","summary":" This paper presents the results of the shared task on Lay Summarisation of\nBiomedical Research Articles (BioLaySumm), hosted at the BioNLP Workshop at ACL\n2023. 
The goal of this shared task is to develop abstractive summarisation\nmodels capable of generating \"lay summaries\" (i.e., summaries that are\ncomprehensible to non-technical audiences) in both a controllable and\nnon-controllable setting. There are two subtasks: 1) Lay Summarisation, where\nthe goal is for participants to build models for lay summary generation only,\ngiven the full article text and the corresponding abstract as input; and 2)\nReadability-controlled Summarisation, where the goal is for participants to\ntrain models to generate both the technical abstract and the lay summary, given\nan article's main text as input. In addition to overall results, we report on\nthe setup and insights from the BioLaySumm shared task, which attracted a total\nof 20 participating teams across both subtasks.\n","authors":["Tomas Goldsack","Zheheng Luo","Qianqian Xie","Carolina Scarton","Matthew Shardlow","Sophia Ananiadou","Chenghua Lin"],"pdf_url":"https://arxiv.org/pdf/2309.17332v2.pdf","comment":"Published at BioNLP@ACL2023"},{"id":"http://arxiv.org/abs/2310.16450v1","updated":"2023-10-25T08:13:02Z","published":"2023-10-25T08:13:02Z","title":"CLEX: Continuous Length Extrapolation for Large Language Models","summary":" Transformer-based Large Language Models (LLMs) are pioneering advances in\nmany natural language processing tasks, however, their exceptional capabilities\nare restricted within the preset context window of Transformer. Position\nEmbedding (PE) scaling methods, while effective in extending the context window\nto a specific length, demonstrate either notable limitations in their\nextrapolation abilities or sacrificing partial performance within the context\nwindow. Length extrapolation methods, although theoretically capable of\nextending the context window beyond the training sequence length, often\nunderperform in practical long-context applications. To address these\nchallenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. 
We\ngeneralise the PE scaling approaches to model the continuous dynamics by\nordinary differential equations over the length scaling factor, thereby\novercoming the constraints of current PE scaling methods designed for specific\nlengths. Moreover, by extending the dynamics to desired context lengths beyond\nthe training sequence length, CLEX facilitates the length extrapolation with\nimpressive performance in practical tasks. We demonstrate that CLEX can be\nseamlessly incorporated into LLMs equipped with Rotary Position Embedding, such\nas LLaMA and GPT-NeoX, with negligible impact on training and inference\nlatency. Experimental results reveal that CLEX can effectively extend the\ncontext window to over 4x or almost 8x training length, with no deterioration\nin performance. Furthermore, when evaluated on the practical LongBench\nbenchmark, our model trained on a 4k length exhibits competitive performance\nagainst state-of-the-art open-source models trained on context lengths up to\n32k.\n","authors":["Guanzheng Chen","Xin Li","Zaiqiao Meng","Shangsong Liang","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2310.16450v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16446v1","updated":"2023-10-25T08:10:04Z","published":"2023-10-25T08:10:04Z","title":"Diversity Enhanced Narrative Question Generation for Storybooks","summary":" Question generation (QG) from a given context can enhance comprehension,\nengagement, assessment, and overall efficacy in learning or conversational\nenvironments. Despite recent advancements in QG, the challenge of enhancing or\nmeasuring the diversity of generated questions often remains unaddressed. In\nthis paper, we introduce a multi-question generation model (mQG), which is\ncapable of generating multiple, diverse, and answerable questions by focusing\non context and questions. 
To validate the answerability of the generated\nquestions, we employ a SQuAD2.0 fine-tuned question answering model,\nclassifying the questions as answerable or not. We train and evaluate mQG on\nthe FairytaleQA dataset, a well-structured QA dataset based on storybooks, with\nnarrative questions. We further apply a zero-shot adaptation on the TellMeWhy\nand SQuAD1.1 datasets. mQG shows promising results across various evaluation\nmetrics, among strong baselines.\n","authors":["Hokeun Yoon","JinYeong Bak"],"pdf_url":"https://arxiv.org/pdf/2310.16446v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16436v1","updated":"2023-10-25T08:03:10Z","published":"2023-10-25T08:03:10Z","title":"DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning\n in Language Models","summary":" A long-standing goal of AI systems is to perform complex multimodal reasoning\nlike humans. Recently, large language models (LLMs) have made remarkable\nstrides in such multi-step reasoning on the language modality solely by\nleveraging the chain of thought (CoT) to mimic human thinking. However, the\ntransfer of these advancements to multimodal contexts introduces heightened\nchallenges, including but not limited to the impractical need for\nlabor-intensive annotation and the limitations in terms of flexibility,\ngeneralizability, and explainability. To evoke CoT reasoning in multimodality,\nthis work first conducts an in-depth analysis of these challenges posed by\nmultimodality and presents two key insights: \"keeping critical thinking\" and\n\"letting everyone do their jobs\" in multimodal CoT reasoning. 
Furthermore, this\nstudy proposes a novel DDCoT prompting that maintains a critical attitude\nthrough negative-space prompting and incorporates multimodality into reasoning\nby first dividing the reasoning responsibility of LLMs into reasoning and\nrecognition and then integrating the visual recognition capability of visual\nmodels into the joint reasoning process. The rationales generated by DDCoT not\nonly improve the reasoning abilities of both large and small language models in\nzero-shot prompting and fine-tuning learning, significantly outperforming\nstate-of-the-art methods but also exhibit impressive generalizability and\nexplainability.\n","authors":["Ge Zheng","Bin Yang","Jiajin Tang","Hong-Yu Zhou","Sibei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16436v1.pdf","comment":"24 pages, 13 figures, to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15556v2","updated":"2023-10-25T07:50:52Z","published":"2023-10-24T06:56:38Z","title":"TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for\n Inference Cost Reduction","summary":" Since ChatGPT released its API for public use, the number of applications\nbuilt on top of commercial large language models (LLMs) increase exponentially.\nOne popular usage of such models is leveraging its in-context learning ability\nand generating responses given user queries leveraging knowledge obtained by\nretrieval augmentation. One problem of deploying commercial retrieval-augmented\nLLMs is the cost due to the additionally retrieved context that largely\nincreases the input token size of the LLMs. To mitigate this, we propose a\ntoken compression scheme that includes two methods: summarization compression\nand semantic compression. The first method applies a T5-based model that is\nfine-tuned by datasets generated using self-instruct containing samples with\nvarying lengths and reduce token size by doing summarization. 
The second method\nfurther compresses the token size by removing words with lower impact on the\nsemantic. In order to adequately evaluate the effectiveness of the proposed\nmethods, we propose and utilize a dataset called Food-Recommendation DB (FRDB)\nfocusing on food recommendation for women around pregnancy period or infants.\nOur summarization compression can reduce 65% of the retrieval token size with\nfurther 0.3% improvement on the accuracy; semantic compression provides a more\nflexible way to trade-off the token size with performance, for which we can\nreduce the token size by 20% with only 1.6% of accuracy drop.\n","authors":["Junyi Liu","Liangzhi Li","Tong Xiang","Bowen Wang","Yiming Qian"],"pdf_url":"https://arxiv.org/pdf/2310.15556v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.03368v4","updated":"2023-10-25T07:49:37Z","published":"2023-10-05T07:57:09Z","title":"Evaluating Hallucinations in Chinese Large Language Models","summary":" In this paper, we establish a benchmark named HalluQA (Chinese Hallucination\nQuestion-Answering) to measure the hallucination phenomenon in Chinese large\nlanguage models. HalluQA contains 450 meticulously designed adversarial\nquestions, spanning multiple domains, and takes into account Chinese historical\nculture, customs, and social phenomena. During the construction of HalluQA, we\nconsider two types of hallucinations: imitative falsehoods and factual errors,\nand we construct adversarial samples based on GLM-130B and ChatGPT. For\nevaluation, we design an automated evaluation method using GPT-4 to judge\nwhether a model output is hallucinated. We conduct extensive experiments on 24\nlarge language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk\nand etc. Out of the 24 models, 18 achieved non-hallucination rates lower than\n50%. This indicates that HalluQA is highly challenging. 
We analyze the primary\ntypes of hallucinations in different types of models and their causes.\nAdditionally, we discuss which types of hallucinations should be prioritized\nfor different types of models.\n","authors":["Qinyuan Cheng","Tianxiang Sun","Wenwei Zhang","Siyin Wang","Xiangyang Liu","Mozhi Zhang","Junliang He","Mianqiu Huang","Zhangyue Yin","Kai Chen","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2310.03368v4.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2310.16427v1","updated":"2023-10-25T07:47:01Z","published":"2023-10-25T07:47:01Z","title":"PromptAgent: Strategic Planning with Language Models Enables\n Expert-level Prompt Optimization","summary":" Highly effective, task-specific prompts are often heavily engineered by\nexperts to integrate detailed instructions and domain insights based on a deep\nunderstanding of both instincts of large language models (LLMs) and the\nintricacies of the target task. However, automating the generation of such\nexpert-level prompts remains elusive. Existing prompt optimization methods tend\nto overlook the depth of domain knowledge and struggle to efficiently explore\nthe vast space of expert-level prompts. Addressing this, we present\nPromptAgent, an optimization method that autonomously crafts prompts equivalent\nin quality to those handcrafted by experts. At its core, PromptAgent views\nprompt optimization as a strategic planning problem and employs a principled\nplanning algorithm, rooted in Monte Carlo tree search, to strategically\nnavigate the expert-level prompt space. Inspired by human-like trial-and-error\nexploration, PromptAgent induces precise expert-level insights and in-depth\ninstructions by reflecting on model errors and generating constructive error\nfeedback. 
Such a novel framework allows the agent to iteratively examine\nintermediate prompts (states), refine them based on error feedbacks (actions),\nsimulate future rewards, and search for high-reward paths leading to expert\nprompts. We apply PromptAgent to 12 tasks spanning three practical domains:\nBIG-Bench Hard (BBH), as well as domain-specific and general NLP tasks, showing\nit significantly outperforms strong Chain-of-Thought and recent prompt\noptimization baselines. Extensive analyses emphasize its capability to craft\nexpert-level, detailed, and domain-insightful prompts with great efficiency and\ngeneralizability.\n","authors":["Xinyuan Wang","Chenxi Li","Zhen Wang","Fan Bai","Haotian Luo","Jiayou Zhang","Nebojsa Jojic","Eric P. Xing","Zhiting Hu"],"pdf_url":"https://arxiv.org/pdf/2310.16427v1.pdf","comment":"34 pages, 10 figures"},{"id":"http://arxiv.org/abs/2308.03688v2","updated":"2023-10-25T07:41:24Z","published":"2023-08-07T16:08:11Z","title":"AgentBench: Evaluating LLMs as Agents","summary":" Large Language Models (LLMs) are becoming increasingly smart and autonomous,\ntargeting real-world pragmatic missions beyond traditional NLP tasks. As a\nresult, there has been an urgent need to evaluate LLMs as agents on challenging\ntasks in interactive environments. We present AgentBench, a multi-dimensional\nevolving benchmark that currently consists of 8 distinct environments to assess\nLLM-as-Agent's reasoning and decision-making abilities in a multi-turn\nopen-ended generation setting. Our extensive test over 27 API-based and\nopen-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong\nability of acting as agents in complex environments, there is a significant\ndisparity in performance between them and OSS competitors. We identify the\ntypical reasons of failures in environments and LLMs, showing that poor\nlong-term reasoning, decision-making, and instruction following abilities are\nthe main obstacles for developing usable LLM agents. 
Training on code and high\nquality multi-turn alignment data could improve agent performance. Datasets,\nenvironments, and an integrated evaluation package for AgentBench are released\nat \\url{https://github.com/THUDM/AgentBench}.\n","authors":["Xiao Liu","Hao Yu","Hanchen Zhang","Yifan Xu","Xuanyu Lei","Hanyu Lai","Yu Gu","Hangliang Ding","Kaiwen Men","Kejuan Yang","Shudan Zhang","Xiang Deng","Aohan Zeng","Zhengxiao Du","Chenhui Zhang","Sheng Shen","Tianjun Zhang","Yu Su","Huan Sun","Minlie Huang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2308.03688v2.pdf","comment":"55 pages"},{"id":"http://arxiv.org/abs/2310.05163v2","updated":"2023-10-25T07:27:12Z","published":"2023-10-08T13:45:05Z","title":"An Investigation of LLMs' Inefficacy in Understanding Converse Relations","summary":" Large Language Models (LLMs) have achieved remarkable success in many formal\nlanguage oriented tasks, such as structural data-to-text and semantic parsing.\nHowever current benchmarks mostly follow the data distribution of the\npre-training data of LLMs. Therefore, a natural question rises that do LLMs\nreally understand the structured semantics of formal languages. In this paper,\nwe investigate this problem on a special case, converse binary relation. We\nintroduce a new benchmark ConvRe focusing on converse relations, which contains\n17 relations and 1240 triples extracted from popular knowledge graph completion\ndatasets. Our ConvRE features two tasks, Re2Text and Text2Re, which are\nformulated as multi-choice question answering to evaluate LLMs' ability to\ndetermine the matching between relations and associated text. For the\nevaluation protocol, apart from different prompting methods, we further\nintroduce variants to the test text and few-shot example text. We conduct\nexperiments on three popular LLM families and have observed various scaling\ntrends. 
The results suggest that LLMs often resort to shortcut learning and\nstill face challenges on our proposed benchmark.\n","authors":["Chengwen Qi","Bowen Li","Binyuan Hui","Bailin Wang","Jinyang Li","Jinwang Wu","Yuanjun Laili"],"pdf_url":"https://arxiv.org/pdf/2310.05163v2.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16417v1","updated":"2023-10-25T07:10:42Z","published":"2023-10-25T07:10:42Z","title":"Enhanced Simultaneous Machine Translation with Word-level Policies","summary":" Recent years have seen remarkable advances in the field of Simultaneous\nMachine Translation (SiMT) due to the introduction of innovative policies that\ndictate whether to READ or WRITE at each step of the translation process.\nHowever, a common assumption in many existing studies is that operations are\ncarried out at the subword level, even though the standard unit for input and\noutput in most practical scenarios is typically at the word level. This paper\ndemonstrates that policies devised and validated at the subword level are\nsurpassed by those operating at the word level, which process multiple subwords\nto form a complete word in a single step. Additionally, we suggest a method to\nboost SiMT models using language models (LMs), wherein the proposed word-level\npolicy plays a vital role in addressing the subword disparity between LMs and\nSiMT models. Code is available at https://github.com/xl8-ai/WordSiMT.\n","authors":["Kang Kim","Hankyu Cho"],"pdf_url":"https://arxiv.org/pdf/2310.16417v1.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.16411v1","updated":"2023-10-25T06:54:39Z","published":"2023-10-25T06:54:39Z","title":"Decoding Stumpers: Large Language Models vs. 
Human Problem-Solvers","summary":" This paper investigates the problem-solving capabilities of Large Language\nModels (LLMs) by evaluating their performance on stumpers, unique single-step\nintuition problems that pose challenges for human solvers but are easily\nverifiable. We compare the performance of four state-of-the-art LLMs\n(Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our\nfindings reveal that the new-generation LLMs excel in solving stumpers and\nsurpass human performance. However, humans exhibit superior skills in verifying\nsolutions to the same problems. This research enhances our understanding of\nLLMs' cognitive abilities and provides insights for enhancing their\nproblem-solving potential across various domains.\n","authors":["Alon Goldstein","Miriam Havin","Roi Reichart","Ariel Goldstein"],"pdf_url":"https://arxiv.org/pdf/2310.16411v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.08244v2","updated":"2023-10-25T06:54:12Z","published":"2023-04-14T14:05:32Z","title":"API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs","summary":" Recent research has demonstrated that Large Language Models (LLMs) can\nenhance their capabilities by utilizing external tools. However, three pivotal\nquestions remain unanswered: (1) How effective are current LLMs in utilizing\ntools? (2) How can we enhance LLMs' ability to utilize tools? (3) What\nobstacles need to be overcome to leverage tools? To address these questions, we\nintroduce API-Bank, a groundbreaking benchmark, specifically designed for\ntool-augmented LLMs. For the first question, we develop a runnable evaluation\nsystem consisting of 73 API tools. We annotate 314 tool-use dialogues with 753\nAPI calls to assess the existing LLMs' capabilities in planning, retrieving,\nand calling APIs. For the second question, we construct a comprehensive\ntraining set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000\ndistinct domains. 
Using this dataset, we train Lynx, a tool-augmented LLM\ninitialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits\nimproved tool utilization compared to GPT-3, while GPT-4 excels in planning.\nHowever, there is still significant potential for further improvement.\nMoreover, Lynx surpasses Alpaca's tool utilization performance by more than 26\npts and approaches the effectiveness of GPT-3.5. Through error analysis, we\nhighlight the key challenges for future research in this field to answer the\nthird question.\n","authors":["Minghao Li","Yingxiu Zhao","Bowen Yu","Feifan Song","Hangyu Li","Haiyang Yu","Zhoujun Li","Fei Huang","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2304.08244v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15421v2","updated":"2023-10-25T06:46:42Z","published":"2023-10-24T00:24:11Z","title":"FANToM: A Benchmark for Stress-testing Machine Theory of Mind in\n Interactions","summary":" Theory of mind (ToM) evaluations currently focus on testing models using\npassive narratives that inherently lack interactivity. We introduce FANToM, a\nnew benchmark designed to stress-test ToM within information-asymmetric\nconversational contexts via question answering. Our benchmark draws upon\nimportant theoretical requisites from psychology and necessary empirical\nconsiderations when evaluating large language models (LLMs). In particular, we\nformulate multiple types of questions that demand the same underlying reasoning\nto identify illusory or false sense of ToM capabilities in LLMs. We show that\nFANToM is challenging for state-of-the-art LLMs, which perform significantly\nworse than humans even with chain-of-thought reasoning or fine-tuning.\n","authors":["Hyunwoo Kim","Melanie Sclar","Xuhui Zhou","Ronan Le Bras","Gunhee Kim","Yejin Choi","Maarten Sap"],"pdf_url":"https://arxiv.org/pdf/2310.15421v2.pdf","comment":"EMNLP 2023. 
Code and dataset can be found here:\n https://hyunw.kim/fantom"},{"id":"http://arxiv.org/abs/2310.16402v1","updated":"2023-10-25T06:38:42Z","published":"2023-10-25T06:38:42Z","title":"Video Referring Expression Comprehension via Transformer with\n Content-conditioned Query","summary":" Video Referring Expression Comprehension (REC) aims to localize a target\nobject in videos based on the queried natural language. Recent improvements in\nvideo REC have been made using Transformer-based methods with learnable\nqueries. However, we contend that this naive query design is not ideal given\nthe open-world nature of video REC brought by text supervision. With numerous\npotential semantic categories, relying on only a few slow-updated queries is\ninsufficient to characterize them. Our solution to this problem is to create\ndynamic queries that are conditioned on both the input video and language to\nmodel the diverse objects referred to. Specifically, we place a fixed number of\nlearnable bounding boxes throughout the frame and use corresponding region\nfeatures to provide prior information. Also, we noticed that current query\nfeatures overlook the importance of cross-modal alignment. To address this, we\nalign specific phrases in the sentence with semantically relevant visual areas,\nannotating them in existing video datasets (VID-Sentence and VidSTG). By\nincorporating these two designs, our proposed model (called ConFormer)\noutperforms other models on widely benchmarked datasets. For example, in the\ntesting split of VID-Sentence dataset, ConFormer achieves 8.75% absolute\nimprovement on Accu.@0.6 compared to the previous state-of-the-art model.\n","authors":["Ji Jiang","Meng Cao","Tengtao Song","Long Chen","Yi Wang","Yuexian Zou"],"pdf_url":"https://arxiv.org/pdf/2310.16402v1.pdf","comment":"Accepted to ACM International Conference on Multimedia Workshop (ACM\n MM), 2023. 
arXiv admin note: substantial text overlap with arXiv:2210.02953"},{"id":"http://arxiv.org/abs/2305.13808v2","updated":"2023-10-25T06:24:16Z","published":"2023-05-23T08:20:01Z","title":"Asking Clarification Questions to Handle Ambiguity in Open-Domain QA","summary":" Ambiguous questions persist in open-domain question answering, because\nformulating a precise question with a unique answer is often challenging.\nPreviously, Min et al. (2020) have tackled this issue by generating\ndisambiguated questions for all possible interpretations of the ambiguous\nquestion. This can be effective, but not ideal for providing an answer to the\nuser. Instead, we propose to ask a clarification question, where the user's\nresponse will help identify the interpretation that best aligns with the user's\nintention. We first present CAMBIGNQ, a dataset consisting of 5,654 ambiguous\nquestions, each with relevant passages, possible answers, and a clarification\nquestion. The clarification questions were efficiently created by generating\nthem using InstructGPT and manually revising them as necessary. We then define\na pipeline of tasks and design appropriate evaluation metrics. Lastly, we\nachieve 61.3 F1 on ambiguity detection and 40.5 F1 on clarification-based QA,\nproviding strong baselines for future work.\n","authors":["Dongryeol Lee","Segwang Kim","Minwoo Lee","Hwanhee Lee","Joonsuk Park","Sang-Woo Lee","Kyomin Jung"],"pdf_url":"https://arxiv.org/pdf/2305.13808v2.pdf","comment":"15 pages, 4 figures, accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2306.02858v4","updated":"2023-10-25T06:23:31Z","published":"2023-06-05T13:17:27Z","title":"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video\n Understanding","summary":" We present Video-LLaMA a multi-modal framework that empowers Large Language\nModels (LLMs) with the capability of understanding both visual and auditory\ncontent in the video. 
Video-LLaMA bootstraps cross-modal training from the\nfrozen pre-trained visual and audio encoders and the frozen LLMs. Unlike\nprevious works that complement LLMs to process the visual or audio signals\nonly, Video-LLaMA enables video comprehension by tackling two challenges: (1)\ncapturing the temporal changes in visual scenes, (2) integrating audio-visual\nsignals. To counter the first challenge, we propose a Video Q-former to\nassemble a pre-trained image encoder into our video encoder and introduce a\nvideo-to-text generation task to learn video-language correspondence. For the\nsecond challenge, we leverage ImageBind, a universal embedding model aligning\nmultiple modalities, as the pre-trained audio encoder and introduce an Audio\nQ-former on top of ImageBind to learn reasonable auditory query embeddings for\nthe LLM module. To align the output of both visual and audio encoders with\nLLM's embedding space, we first train Video-LLaMA on massive\nvideo/image-caption pairs and then tune our model with visual-instruction\ndatasets of moderate amount but higher quality. We found Video-LLaMA shows the\nability to perceive and comprehend video content and generate meaningful\nresponses grounded in the visual and auditory information presented in the\nvideos.\n","authors":["Hang Zhang","Xin Li","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2306.02858v4.pdf","comment":"Accepted by EMNLP 2023's demo track; Code, Pretrained Model, and\n Dataset: https://github.com/DAMO-NLP-SG/Video-LLaMA"},{"id":"http://arxiv.org/abs/2310.16393v1","updated":"2023-10-25T06:22:29Z","published":"2023-10-25T06:22:29Z","title":"ZGUL: Zero-shot Generalization to Unseen Languages using Multi-source\n Ensembling of Language Adapters","summary":" We tackle the problem of zero-shot cross-lingual transfer in NLP tasks via\nthe use of language adapters (LAs). 
Most of the earlier works have explored\ntraining with adapter of a single source (often English), and testing either\nusing the target LA or LA of another related language. Training target LA\nrequires unlabeled data, which may not be readily available for low resource\nunseen languages: those that are neither seen by the underlying multilingual\nlanguage model (e.g., mBERT), nor do we have any (labeled or unlabeled) data\nfor them. We posit that for more effective cross-lingual transfer, instead of\njust one source LA, we need to leverage LAs of multiple (linguistically or\ngeographically related) source languages, both at train and test-time - which\nwe investigate via our novel neural architecture, ZGUL. Extensive\nexperimentation across four language groups, covering 15 unseen target\nlanguages, demonstrates improvements of up to 3.2 average F1 points over\nstandard fine-tuning and other strong baselines on POS tagging and NER tasks.\nWe also extend ZGUL to settings where either (1) some unlabeled data or (2)\nfew-shot training examples are available for the target language. We find that\nZGUL continues to outperform baselines in these settings too.\n","authors":["Vipul Rathore","Rajdeep Dhingra","Parag Singla"," Mausam"],"pdf_url":"https://arxiv.org/pdf/2310.16393v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11069v3","updated":"2023-10-25T06:20:39Z","published":"2023-10-17T08:33:02Z","title":"VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System","summary":" Arabic is a complex language with many varieties and dialects spoken by over\n450 millions all around the world. Due to the linguistic diversity and\nvariations, it is challenging to build a robust and generalized ASR system for\nArabic. In this work, we address this gap by developing and demoing a system,\ndubbed VoxArabica, for dialect identification (DID) as well as automatic speech\nrecognition (ASR) of Arabic. 
We train a wide range of models such as HuBERT\n(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR\ntasks. Our DID models are trained to identify 17 different dialects in addition\nto MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.\nAdditionally, for the remaining dialects in ASR, we provide the option to\nchoose various models such as Whisper and MMS in a zero-shot setting. We\nintegrate these models into a single web interface with diverse features such\nas audio recording, file upload, model selection, and the option to raise flags\nfor incorrect outputs. Overall, we believe VoxArabica will be useful for a wide\nrange of audiences concerned with Arabic research. Our system is currently\nrunning at https://cdce-206-12-100-168.ngrok.io/.\n","authors":["Abdul Waheed","Bashar Talafha","Peter Suvellin","AbdelRahim Elmadany","Muhammad Abdul-Mageed"],"pdf_url":"https://arxiv.org/pdf/2310.11069v3.pdf","comment":"Accepted at ArabicNLP conference co-located with EMNLP'23. First\n three authors contributed equally"},{"id":"http://arxiv.org/abs/2103.10385v2","updated":"2023-10-25T06:15:58Z","published":"2021-03-18T17:13:50Z","title":"GPT Understands, Too","summary":" Prompting a pretrained language model with natural language patterns has been\nproved effective for natural language understanding (NLU). However, our\npreliminary study reveals that manual discrete prompts often lead to unstable\nperformance -- e.g., changing a single word in the prompt might result in\nsubstantial performance drop. We propose a novel method P-Tuning that employs\ntrainable continuous prompt embeddings in concatenation with discrete prompts.\nEmpirically, P-Tuning not only stabilizes training by minimizing the gap\nbetween various discrete prompts, but also improves performance by a sizeable\nmargin on a wide range of NLU tasks including LAMA and SuperGLUE. 
P-Tuning is\ngenerally effective for both frozen and tuned language models, under both the\nfully-supervised and few-shot settings.\n","authors":["Xiao Liu","Yanan Zheng","Zhengxiao Du","Ming Ding","Yujie Qian","Zhilin Yang","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2103.10385v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15298v2","updated":"2023-10-25T06:10:07Z","published":"2023-10-23T19:03:35Z","title":"TaskDiff: A Similarity Metric for Task-Oriented Conversations","summary":" The popularity of conversational digital assistants has resulted in the\navailability of large amounts of conversational data which can be utilized for\nimproved user experience and personalized response generation. Building these\nassistants using popular large language models like ChatGPT also require\nadditional emphasis on prompt engineering and evaluation methods. Textual\nsimilarity metrics are a key ingredient for such analysis and evaluations.\nWhile many similarity metrics have been proposed in the literature, they have\nnot proven effective for task-oriented conversations as they do not take\nadvantage of unique conversational features. To address this gap, we present\nTaskDiff, a novel conversational similarity metric that utilizes different\ndialogue components (utterances, intents, and slots) and their distributions to\ncompute similarity. 
Extensive experimental evaluation of TaskDiff on a\nbenchmark dataset demonstrates its superior performance and improved robustness\nover other related approaches.\n","authors":["Ankita Bhaumik","Praveen Venkateswaran","Yara Rizk","Vatche Isahagian"],"pdf_url":"https://arxiv.org/pdf/2310.15298v2.pdf","comment":"Accepted to the main conference at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15205v2","updated":"2023-10-25T05:56:13Z","published":"2023-10-23T11:33:41Z","title":"DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple\n Experts Fine-tuning","summary":" We propose Multiple Experts Fine-tuning Framework to build a financial large\nlanguage model (LLM), DISC-FinLLM. Our methodology improves general LLMs by\nendowing them with multi-turn question answering abilities, domain text\nprocessing capabilities, mathematical computation skills, and\nretrieval-enhanced generation capabilities. We build a financial\ninstruction-tuning dataset named DISC-FIN-SFT, including instruction samples of\nfour categories (consulting, NLP tasks, computing and retrieval-augmented\ngeneration). Evaluations conducted on multiple benchmarks demonstrate that our\nmodel performs better than baseline models in various financial scenarios.\nFurther resources can be found at https://github.com/FudanDISC/DISC-FinLLM.\n","authors":["Wei Chen","Qiushi Wang","Zefei Long","Xianyin Zhang","Zhongtian Lu","Bingxuan Li","Siyuan Wang","Jiarong Xu","Xiang Bai","Xuanjing Huang","Zhongyu Wei"],"pdf_url":"https://arxiv.org/pdf/2310.15205v2.pdf","comment":"18 pages, 13 figures, 7 tables"},{"id":"http://arxiv.org/abs/2210.02414v2","updated":"2023-10-25T05:22:43Z","published":"2022-10-05T17:34:44Z","title":"GLM-130B: An Open Bilingual Pre-trained Model","summary":" We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language\nmodel with 130 billion parameters. 
It is an attempt to open-source a 100B-scale\nmodel at least as good as GPT-3 (davinci) and unveil how models of such a scale\ncan be successfully pre-trained. Over the course of this effort, we face\nnumerous unexpected technical and engineering challenges, particularly on loss\nspikes and divergence. In this paper, we introduce the training process of\nGLM-130B including its design choices, training strategies for both efficiency\nand stability, and engineering efforts. The resultant GLM-130B model offers\nsignificant outperformance over GPT-3 175B (davinci) on a wide range of popular\nEnglish benchmarks while the performance advantage is not observed in OPT-175B\nand BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN\n3.0 260B -- the largest Chinese language model -- across related benchmarks.\nFinally, we leverage a unique scaling property of GLM-130B to reach INT4\nquantization without post training, with almost no performance loss, making it\nthe first among 100B-scale models and more importantly, allowing its effective\ninference on 4$\\times$RTX 3090 (24G) or 8$\\times$RTX 2080 Ti (11G) GPUs, the\nmost affordable GPUs required for using 100B-scale models. 
The GLM-130B model\nweights are publicly accessible and its code, training logs, related toolkit,\nand lessons learned are open-sourced at\n\\url{https://github.com/THUDM/GLM-130B/}.\n","authors":["Aohan Zeng","Xiao Liu","Zhengxiao Du","Zihan Wang","Hanyu Lai","Ming Ding","Zhuoyi Yang","Yifan Xu","Wendi Zheng","Xiao Xia","Weng Lam Tam","Zixuan Ma","Yufei Xue","Jidong Zhai","Wenguang Chen","Peng Zhang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2210.02414v2.pdf","comment":"Accepted to ICLR 2023"},{"id":"http://arxiv.org/abs/2310.16368v1","updated":"2023-10-25T05:12:35Z","published":"2023-10-25T05:12:35Z","title":"Transformer-based Live Update Generation for Soccer Matches from\n Microblog Posts","summary":" It has been known to be difficult to generate adequate sports updates from a\nsequence of vast amounts of diverse live tweets, although the live sports\nviewing experience with tweets is gaining the popularity. In this paper, we\nfocus on soccer matches and work on building a system to generate live updates\nfor soccer matches from tweets so that users can instantly grasp a match's\nprogress and enjoy the excitement of the match from raw tweets. Our proposed\nsystem is based on a large pre-trained language model and incorporates a\nmechanism to control the number of updates and a mechanism to reduce the\nredundancy of duplicate and similar updates.\n","authors":["Masashi Oshika","Kosuke Yamada","Ryohei Sasano","Koichi Takeda"],"pdf_url":"https://arxiv.org/pdf/2310.16368v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2309.06578v2","updated":"2023-10-25T04:57:41Z","published":"2023-09-07T04:15:17Z","title":"Can Large Language Models Discern Evidence for Scientific Hypotheses?\n Case Studies in the Social Sciences","summary":" Hypothesis formulation and testing are central to empirical research. A\nstrong hypothesis is a best guess based on existing evidence and informed by a\ncomprehensive view of relevant literature. 
However, with exponential increase\nin the number of scientific articles published annually, manual aggregation and\nsynthesis of evidence related to a given hypothesis is a challenge. Our work\nexplores the ability of current large language models (LLMs) to discern\nevidence in support or refute of specific hypotheses based on the text of\nscientific abstracts. We share a novel dataset for the task of scientific\nhypothesis evidencing using community-driven annotations of studies in the\nsocial sciences. We compare the performance of LLMs to several state-of-the-art\nbenchmarks and highlight opportunities for future research in this area. The\ndataset is available at\nhttps://github.com/Sai90000/ScientificHypothesisEvidencing.git\n","authors":["Sai Koneru","Jian Wu","Sarah Rajtmajer"],"pdf_url":"https://arxiv.org/pdf/2309.06578v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16361v1","updated":"2023-10-25T04:56:07Z","published":"2023-10-25T04:56:07Z","title":"InstructPTS: Instruction-Tuning LLMs for Product Title Summarization","summary":" E-commerce product catalogs contain billions of items. Most products have\nlengthy titles, as sellers pack them with product attributes to improve\nretrieval, and highlight key product aspects. This results in a gap between\nsuch unnatural products titles, and how customers refer to them. It also limits\nhow e-commerce stores can use these seller-provided titles for recommendation,\nQA, or review summarization.\n Inspired by recent work on instruction-tuned LLMs, we present InstructPTS, a\ncontrollable approach for the task of Product Title Summarization (PTS).\nTrained using a novel instruction fine-tuning strategy, our approach is able to\nsummarize product titles according to various criteria (e.g. number of words in\na summary, inclusion of specific phrases, etc.). 
Extensive evaluation on a\nreal-world e-commerce catalog shows that compared to simple fine-tuning of\nLLMs, our proposed approach can generate more accurate product name summaries,\nwith an improvement of over 14 and 8 BLEU and ROUGE points, respectively.\n","authors":["Besnik Fetahu","Zhiyu Chen","Oleg Rokhlenko","Shervin Malmasi"],"pdf_url":"https://arxiv.org/pdf/2310.16361v1.pdf","comment":"Accepted by EMNLP 2023 (Industry Track)"},{"id":"http://arxiv.org/abs/2310.16358v1","updated":"2023-10-25T04:38:02Z","published":"2023-10-25T04:38:02Z","title":"From Simple to Complex: A Progressive Framework for Document-level\n Informative Argument Extraction","summary":" Document-level Event Argument Extraction (EAE) requires the model to extract\narguments of multiple events from a single document. Considering the underlying\ndependencies between these events, recent efforts leverage the idea of\n\"memory\", where the results of already predicted events are cached and can be\nretrieved to help the prediction of upcoming events. These methods extract\nevents according to their appearance order in the document, however, the event\nthat appears in the first sentence does not mean that it is the easiest to\nextract. Existing methods might introduce noise to the extraction of upcoming\nevents if they rely on an incorrect prediction of previous events. In order to\nprovide more reliable memory, we propose a simple-to-complex progressive\nframework for document-level EAE. Specifically, we first calculate the\ndifficulty of each event and then, we conduct the extraction following a\nsimple-to-complex order. In this way, the memory will store the most certain\nresults, and the model could use these reliable sources to help the prediction\nof more difficult events. 
Experiments on WikiEvents show that our model\noutperforms SOTA by 1.4% in F1, indicating the proposed simple-to-complex\nframework is useful in the EAE task.\n","authors":["Quzhe Huang","Yanxi Zhang","Dongyan Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.16358v1.pdf","comment":"Accepted to the Findings of EMNLP 2023 (Long Paper)"},{"id":"http://arxiv.org/abs/2310.16356v1","updated":"2023-10-25T04:35:06Z","published":"2023-10-25T04:35:06Z","title":"A Multi-Modal Multilingual Benchmark for Document Image Classification","summary":" Document image classification is different from plain-text document\nclassification and consists of classifying a document by understanding the\ncontent and structure of documents such as forms, emails, and other such\ndocuments. We show that the only existing dataset for this task (Lewis et al.,\n2006) has several limitations and we introduce two newly curated multilingual\ndatasets WIKI-DOC and MULTIEURLEX-DOC that overcome these limitations. We\nfurther undertake a comprehensive study of popular visually-rich document\nunderstanding or Document AI models in previously untested setting in document\nimage classification such as 1) multi-label classification, and 2) zero-shot\ncross-lingual transfer setup. Experimental results show limitations of\nmultilingual Document AI models on cross-lingual transfer across typologically\ndistant languages. 
Our datasets and findings open the door for future research\ninto improving Document AI models.\n","authors":["Yoshinari Fujinuma","Siddharth Varia","Nishant Sankaran","Srikar Appalaraju","Bonan Min","Yogarshi Vyas"],"pdf_url":"https://arxiv.org/pdf/2310.16356v1.pdf","comment":"Accepted to EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.16350v1","updated":"2023-10-25T04:22:40Z","published":"2023-10-25T04:22:40Z","title":"Unraveling Feature Extraction Mechanisms in Neural Networks","summary":" The underlying mechanism of neural networks in capturing precise knowledge\nhas been the subject of consistent research efforts. In this work, we propose a\ntheoretical approach based on Neural Tangent Kernels (NTKs) to investigate such\nmechanisms. Specifically, considering the infinite network width, we\nhypothesize the learning dynamics of target models may intuitively unravel the\nfeatures they acquire from training data, deepening our insights into their\ninternal mechanisms. We apply our approach to several fundamental models and\nreveal how these models leverage statistical features during gradient descent\nand how they are integrated into final decisions. We also discovered that the\nchoice of activation function can affect feature extraction. For instance, the\nuse of the \\textit{ReLU} activation function could potentially introduce a bias\nin features, providing a plausible explanation for its replacement with\nalternative functions in recent pre-trained language models. Additionally, we\nfind that while self-attention and CNN models may exhibit limitations in\nlearning n-grams, multiplication-based models seem to excel in this area. We\nverify these theoretical findings through experiments and find that they can be\napplied to analyze language modeling tasks, which can be regarded as a special\nvariant of classification. 
Our contributions offer insights into the roles and\ncapacities of fundamental components within large language models, thereby\naiding the broader understanding of these complex systems.\n","authors":["Xiaobing Sun","Jiaxi Li","Wei Lu"],"pdf_url":"https://arxiv.org/pdf/2310.16350v1.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16343v1","updated":"2023-10-25T03:58:49Z","published":"2023-10-25T03:58:49Z","title":"A Comprehensive Evaluation of Constrained Text Generation for Large\n Language Models","summary":" Advancements in natural language generation (NLG) and large language models\n(LLMs) have led to proficient text generation in various tasks. However,\nintegrating intricate constraints into neural text generation, due to LLMs'\nopacity, remains challenging. This study investigates constrained text\ngeneration for LLMs, where predefined constraints are applied during LLM's\ngeneration process. Our research examines multiple LLMs, including ChatGPT and\nGPT-4, categorizing constraints into lexical, structural, and relation-based\ntypes. We also present various benchmarks to facilitate fair evaluation. The\nstudy addresses some key research questions, including the extent of LLMs'\ncompliance with constraints. Results illuminate LLMs' capacity and deficiency\nto incorporate constraints and provide insights for future developments in\nconstrained text generation. 
Codes and datasets will be released upon\nacceptance.\n","authors":["Xiang Chen","Xiaojun Wan"],"pdf_url":"https://arxiv.org/pdf/2310.16343v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2310.16042v2","updated":"2023-10-25T03:54:11Z","published":"2023-10-24T17:57:03Z","title":"WebWISE: Web Interface Control and Sequential Exploration with Large\n Language Models","summary":" The paper investigates using a Large Language Model (LLM) to automatically\nperform web software tasks using click, scroll, and text input operations.\nPrevious approaches, such as reinforcement learning (RL) or imitation learning,\nare inefficient to train and task-specific. Our method uses filtered Document\nObject Model (DOM) elements as observations and performs tasks step-by-step,\nsequentially generating small programs based on the current observations. We\nuse in-context learning, either benefiting from a single manually provided\nexample, or an automatically generated example based on a successful zero-shot\ntrial. We evaluate the proposed method on the MiniWob++ benchmark. With only\none in-context example, our WebWISE method achieves similar or better\nperformance than other methods that require many demonstrations or trials.\n","authors":["Heyi Tao","Sethuraman T V","Michal Shlapentokh-Rothman","Derek Hoiem"],"pdf_url":"https://arxiv.org/pdf/2310.16042v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16340v1","updated":"2023-10-25T03:53:31Z","published":"2023-10-25T03:53:31Z","title":"RCAgent: Cloud Root Cause Analysis by Autonomous Agents with\n Tool-Augmented Large Language Models","summary":" Large language model (LLM) applications in cloud root cause analysis (RCA)\nhave been actively explored recently. However, current methods are still\nreliant on manual workflow settings and do not unleash LLMs' decision-making\nand environment interaction capabilities. 
We present RCAgent, a tool-augmented\nLLM autonomous agent framework for practical and privacy-aware industrial RCA\nusage. Running on an internally deployed model rather than GPT families,\nRCAgent is capable of free-form data collection and comprehensive analysis with\ntools. Our framework combines a variety of enhancements, including a unique\nSelf-Consistency for action trajectories, and a suite of methods for context\nmanagement, stabilization, and importing domain knowledge. Our experiments show\nRCAgent's evident and consistent superiority over ReAct across all aspects of\nRCA -- predicting root causes, solutions, evidence, and responsibilities -- and\ntasks covered or uncovered by current rules, as validated by both automated\nmetrics and human evaluations. Furthermore, RCAgent has already been integrated\ninto the diagnosis and issue discovery workflow of the Real-time Compute\nPlatform for Apache Flink of Alibaba Cloud.\n","authors":["Zefan Wang","Zichuan Liu","Yingying Zhang","Aoxiao Zhong","Lunting Fan","Lingfei Wu","Qingsong Wen"],"pdf_url":"https://arxiv.org/pdf/2310.16340v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.11596v3","updated":"2023-10-25T03:52:07Z","published":"2023-08-22T17:44:18Z","title":"SeamlessM4T: Massively Multilingual & Multimodal Machine Translation","summary":" What does it take to create the Babel Fish, a tool that can help individuals\ntranslate speech between any two languages? While recent breakthroughs in\ntext-based models have pushed machine translation coverage beyond 200\nlanguages, unified speech-to-speech translation models have yet to achieve\nsimilar strides. More specifically, conventional speech-to-speech translation\nsystems rely on cascaded systems that perform translation progressively,\nputting high-performing unified systems out of reach. 
To address these gaps, we\nintroduce SeamlessM4T, a single model that supports speech-to-speech\ntranslation, speech-to-text translation, text-to-speech translation,\ntext-to-text translation, and automatic speech recognition for up to 100\nlanguages. To build this, we used 1 million hours of open speech audio data to\nlearn self-supervised speech representations with w2v-BERT 2.0. Subsequently,\nwe created a multimodal corpus of automatically aligned speech translations.\nFiltered and combined with human-labeled and pseudo-labeled data, we developed\nthe first multilingual system capable of translating from and into English for\nboth speech and text. On FLEURS, SeamlessM4T sets a new standard for\ntranslations into multiple target languages, achieving an improvement of 20%\nBLEU over the previous SOTA in direct speech-to-text translation. Compared to\nstrong cascaded models, SeamlessM4T improves the quality of into-English\ntranslation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in\nspeech-to-speech. Tested for robustness, our system performs better against\nbackground noises and speaker variations in speech-to-text tasks compared to\nthe current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and\nadded toxicity to assess translation safety. 
Finally, all contributions in this\nwork are open-sourced and accessible at\nhttps://github.com/facebookresearch/seamless_communication\n","authors":["Seamless Communication","Loïc Barrault","Yu-An Chung","Mariano Cora Meglioli","David Dale","Ning Dong","Paul-Ambroise Duquenne","Hady Elsahar","Hongyu Gong","Kevin Heffernan","John Hoffman","Christopher Klaiber","Pengwei Li","Daniel Licht","Jean Maillard","Alice Rakotoarison","Kaushik Ram Sadagopan","Guillaume Wenzek","Ethan Ye","Bapi Akula","Peng-Jen Chen","Naji El Hachem","Brian Ellis","Gabriel Mejia Gonzalez","Justin Haaheim","Prangthip Hansanti","Russ Howes","Bernie Huang","Min-Jae Hwang","Hirofumi Inaguma","Somya Jain","Elahe Kalbassi","Amanda Kallet","Ilia Kulikov","Janice Lam","Daniel Li","Xutai Ma","Ruslan Mavlyutov","Benjamin Peloquin","Mohamed Ramadan","Abinesh Ramakrishnan","Anna Sun","Kevin Tran","Tuan Tran","Igor Tufanov","Vish Vogeti","Carleigh Wood","Yilin Yang","Bokai Yu","Pierre Andrews","Can Balioglu","Marta R. Costa-jussà","Onur Celebi","Maha Elbayad","Cynthia Gao","Francisco Guzmán","Justine Kao","Ann Lee","Alexandre Mourachko","Juan Pino","Sravya Popuri","Christophe Ropers","Safiyyah Saleem","Holger Schwenk","Paden Tomasello","Changhan Wang","Jeff Wang","Skyler Wang"],"pdf_url":"https://arxiv.org/pdf/2308.11596v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16338v1","updated":"2023-10-25T03:40:50Z","published":"2023-10-25T03:40:50Z","title":"Generative Pre-training for Speech with Flow Matching","summary":" Generative models have gained more and more attention in recent years for\ntheir remarkable success in tasks that required estimating and sampling data\ndistribution to generate high-fidelity synthetic data. In speech,\ntext-to-speech synthesis and neural vocoder are good examples where generative\nmodels have shined. While generative models have been applied to different\napplications in speech, there exists no general-purpose generative model that\nmodels speech directly. 
In this work, we take a step toward this direction by\nshowing a single pre-trained generative model can be adapted to different\ndownstream tasks with strong performance. Specifically, we pre-trained a\ngenerative model, named SpeechFlow, on 60k hours of untranscribed speech with\nFlow Matching and masked conditions. Experiment results show the pre-trained\ngenerative model can be fine-tuned with task-specific data to match or surpass\nexisting expert models on speech enhancement, separation, and synthesis. Our\nwork suggested a foundational model for generation tasks in speech can be built\nwith generative pre-training.\n","authors":["Alexander H. Liu","Matt Le","Apoorv Vyas","Bowen Shi","Andros Tjandra","Wei-Ning Hsu"],"pdf_url":"https://arxiv.org/pdf/2310.16338v1.pdf","comment":"Preprint, under review"},{"id":"http://arxiv.org/abs/2310.15970v2","updated":"2023-10-25T03:23:30Z","published":"2023-10-24T16:10:58Z","title":"Accented Speech Recognition With Accent-specific Codebooks","summary":" Speech accents pose a significant challenge to state-of-the-art automatic\nspeech recognition (ASR) systems. Degradation in performance across\nunderrepresented accents is a severe deterrent to the inclusive adoption of\nASR. In this work, we propose a novel accent adaptation approach for end-to-end\nASR systems using cross-attention with a trainable set of codebooks. These\nlearnable codebooks capture accent-specific information and are integrated\nwithin the ASR encoder layers. The model is trained on accented English speech,\nwhile the test data also contained accents which were not seen during training.\nOn the Mozilla Common Voice multi-accented dataset, we show that our proposed\napproach yields significant performance gains not only on the seen English\naccents (up to $37\\%$ relative improvement in word error rate) but also on the\nunseen accents (up to $5\\%$ relative improvement in WER). 
Further, we\nillustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We\nalso compare the performance with other approaches based on accent adversarial\ntraining.\n","authors":["Darshan Prabhu","Preethi Jyothi","Sriram Ganapathy","Vinit Unni"],"pdf_url":"https://arxiv.org/pdf/2310.15970v2.pdf","comment":"Accepted to EMNLP 2023 Main Conference (Long Paper)"},{"id":"http://arxiv.org/abs/2310.16329v1","updated":"2023-10-25T03:21:20Z","published":"2023-10-25T03:21:20Z","title":"CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment\n of Coherence in Generated Texts","summary":" Coherence is a linguistic term that refers to the relations between small\ntextual units (sentences, propositions), which make the text logically\nconsistent and meaningful to the reader. With the advances of generative\nfoundational models in NLP, there is a pressing need to automatically assess\nthe human-perceived coherence of automatically generated texts. Up until now,\nlittle work has been done on explicitly assessing the coherence of generated\ntexts and analyzing the factors contributing to (in)coherence. Previous work on\nthe topic used other tasks, e.g., sentence reordering, as proxies of coherence,\nrather than approaching coherence detection heads on. In this paper, we\nintroduce {\\sc CoheSentia}, a novel benchmark of human-perceived coherence of\nautomatically generated texts. Our annotation protocol reflects two\nperspectives; one is global, assigning a single coherence score, and the other\nis incremental, scoring sentence by sentence. The incremental method produces\nan (in)coherence score for each text fragment and also pinpoints reasons for\nincoherence at that point. Our benchmark contains 500 automatically-generated\nand human-annotated paragraphs, each annotated in both methods, by multiple\nraters. 
Our analysis shows that the inter-annotator agreement in the\nincremental mode is higher than in the holistic alternative, and our\nexperiments show that standard LMs fine-tuned for coherence detection show\nvaried performance on the different factors contributing to (in)coherence. All\nin all, these models yield unsatisfactory performance, emphasizing the need for\ndeveloping more reliable methods for coherence assessment.\n","authors":["Aviya Maimon","Reut Tsarfaty"],"pdf_url":"https://arxiv.org/pdf/2310.16329v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16322v1","updated":"2023-10-25T03:10:52Z","published":"2023-10-25T03:10:52Z","title":"Samsung R&D Institute Philippines at WMT 2023","summary":" In this paper, we describe the constrained MT systems submitted by Samsung\nR&D Institute Philippines to the WMT 2023 General Translation Task for two\ndirections: en$\\rightarrow$he and he$\\rightarrow$en. Our systems comprise of\nTransformer-based sequence-to-sequence models that are trained with a mix of\nbest practices: comprehensive data preprocessing pipelines, synthetic\nbacktranslated data, and the use of noisy channel reranking during online\ndecoding. 
Our models perform comparably to, and sometimes outperform, strong\nbaseline unconstrained systems such as mBART50 M2M and NLLB 200 MoE despite\nhaving significantly fewer parameters on two public benchmarks: FLORES-200 and\nNTREX-128.\n","authors":["Jan Christian Blaise Cruz"],"pdf_url":"https://arxiv.org/pdf/2310.16322v1.pdf","comment":"To appear in Proceedings of the Eighth Conference on Machine\n Translation 2023 (WMT)"},{"id":"http://arxiv.org/abs/2301.12609v4","updated":"2023-10-25T03:10:01Z","published":"2023-01-30T02:05:24Z","title":"Knowledge Distillation $\\approx$ Label Smoothing: Fact or Fallacy?","summary":" Originally proposed as a method for knowledge transfer from one model to\nanother, some recent studies have suggested that knowledge distillation (KD) is\nin fact a form of regularization. Perhaps the strongest argument of all for\nthis new perspective comes from its apparent similarities with label smoothing\n(LS). Here we re-examine this stated equivalence between the two methods by\ncomparing the predictive confidences of the models they train. Experiments on\nfour text classification tasks involving models of different sizes show that:\n(a) In most settings, KD and LS drive model confidence in completely opposite\ndirections, and (b) In KD, the student inherits not only its knowledge but also\nits confidence from the teacher, reinforcing the classical knowledge transfer\nview.\n","authors":["Md Arafat Sultan"],"pdf_url":"https://arxiv.org/pdf/2301.12609v4.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16319v1","updated":"2023-10-25T03:04:57Z","published":"2023-10-25T03:04:57Z","title":"DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue\n Assessment","summary":" Dialogue assessment plays a critical role in the development of open-domain\ndialogue systems. 
Existing work are uncapable of providing an end-to-end and\nhuman-epistemic assessment dataset, while they only provide sub-metrics like\ncoherence or the dialogues are conversed between annotators far from real user\nsettings. In this paper, we release a large-scale dialogue quality assessment\ndataset (DiQAD), for automatically assessing open-domain dialogue quality.\nSpecifically, we (1) establish the assessment criteria based on the dimensions\nconforming to human judgements on dialogue qualities, and (2) annotate\nlarge-scale dialogues that conversed between real users based on these\nannotation criteria, which contains around 100,000 dialogues. We conduct\nseveral experiments and report the performances of the baselines as the\nbenchmark on DiQAD. The dataset is openly accessible at\nhttps://github.com/yukunZhao/Dataset_Dialogue_quality_evaluation.\n","authors":["Yukun Zhao","Lingyong Yan","Weiwei Sun","Chong Meng","Shuaiqiang Wang","Zhicong Cheng","Zhaochun Ren","Dawei Yin"],"pdf_url":"https://arxiv.org/pdf/2310.16319v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2306.01755v2","updated":"2023-10-25T02:56:03Z","published":"2023-05-23T04:54:26Z","title":"Training Priors Predict Text-To-Image Model Performance","summary":" Text-to-image models can often generate some relations, i.e., \"astronaut\nriding horse\", but fail to generate other relations composed of the same basic\nparts, i.e., \"horse riding astronaut\". These failures are often taken as\nevidence that models rely on training priors rather than constructing novel\nimages compositionally. This paper tests this intuition on the stablediffusion\n2.1 text-to-image model. By looking at the subject-verb-object (SVO) triads\nthat underlie these prompts (e.g., \"astronaut\", \"ride\", \"horse\"), we find that\nthe more often an SVO triad appears in the training data, the better the model\ncan generate an image aligned with that triad. 
Here, by aligned we mean that\neach of the terms appears in the generated image in the proper relation to each\nother. Surprisingly, this increased frequency also diminishes how well the\nmodel can generate an image aligned with the flipped triad. For example, if\n\"astronaut riding horse\" appears frequently in the training data, the image for\n\"horse riding astronaut\" will tend to be poorly aligned. Our results thus show\nthat current models are biased to generate images with relations seen in\ntraining, and provide new data to the ongoing debate on whether these\ntext-to-image models employ abstract compositional structure in a traditional\nsense, or rather, interpolate between relations explicitly seen in the training\ndata.\n","authors":["Charles Lovering","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2306.01755v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16303v1","updated":"2023-10-25T02:22:50Z","published":"2023-10-25T02:22:50Z","title":"URL-BERT: Training Webpage Representations via Social Media Engagements","summary":" Understanding and representing webpages is crucial to online social networks\nwhere users may share and engage with URLs. Common language model (LM) encoders\nsuch as BERT can be used to understand and represent the textual content of\nwebpages. However, these representations may not model thematic information of\nweb domains and URLs or accurately capture their appeal to social media users.\nIn this work, we introduce a new pre-training objective that can be used to\nadapt LMs to understand URLs and webpages. Our proposed framework consists of\ntwo steps: (1) scalable graph embeddings to learn shallow representations of\nURLs based on user engagement on social media and (2) a contrastive objective\nthat aligns LM representations with the aforementioned graph-based\nrepresentation. We apply our framework to the multilingual version of BERT to\nobtain the model URL-BERT. 
We experimentally demonstrate that our continued\npre-training approach improves webpage understanding on a variety of tasks and\nTwitter internal and external benchmarks.\n","authors":["Ayesha Qamar","Chetan Verma","Ahmed El-Kishky","Sumit Binnani","Sneha Mehta","Taylor Berg-Kirkpatrick"],"pdf_url":"https://arxiv.org/pdf/2310.16303v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16301v1","updated":"2023-10-25T02:18:40Z","published":"2023-10-25T02:18:40Z","title":"Is ChatGPT a Good Multi-Party Conversation Solver?","summary":" Large Language Models (LLMs) have emerged as influential instruments within\nthe realm of natural language processing; nevertheless, their capacity to\nhandle multi-party conversations (MPCs) -- a scenario marked by the presence of\nmultiple interlocutors involved in intricate information exchanges -- remains\nuncharted. In this paper, we delve into the potential of generative LLMs such\nas ChatGPT and GPT-4 within the context of MPCs. An empirical analysis is\nconducted to assess the zero-shot learning capabilities of ChatGPT and GPT-4 by\nsubjecting them to evaluation across three MPC datasets that encompass five\nrepresentative tasks. The findings reveal that ChatGPT's performance on a\nnumber of evaluated MPC tasks leaves much to be desired, whilst GPT-4's results\nportend a promising future. Additionally, we endeavor to bolster performance\nthrough the incorporation of MPC structures, encompassing both speaker and\naddressee architecture. 
This study provides an exhaustive evaluation and\nanalysis of applying generative LLMs to MPCs, casting a light upon the\nconception and creation of increasingly effective and robust MPC agents.\nConcurrently, this work underscores the challenges implicit in the utilization\nof LLMs for MPCs, such as deciphering graphical information flows and\ngenerating stylistically consistent responses.\n","authors":["Chao-Hong Tan","Jia-Chen Gu","Zhen-Hua Ling"],"pdf_url":"https://arxiv.org/pdf/2310.16301v1.pdf","comment":"Accepted by Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14563v2","updated":"2023-10-25T02:00:19Z","published":"2023-10-23T04:38:34Z","title":"NormDial: A Comparable Bilingual Synthetic Dialog Dataset for Modeling\n Social Norm Adherence and Violation","summary":" Social norms fundamentally shape interpersonal communication. We present\nNormDial, a high-quality dyadic dialogue dataset with turn-by-turn annotations\nof social norm adherences and violations for Chinese and American cultures.\nIntroducing the task of social norm observance detection, our dataset is\nsynthetically generated in both Chinese and English using a human-in-the-loop\npipeline by prompting large language models with a small collection of\nexpert-annotated social norms. We show that our generated dialogues are of high\nquality through human evaluation and further evaluate the performance of\nexisting large language models on this task. 
Our findings point towards new\ndirections for understanding the nuances of social norms as they manifest in\nconversational contexts that span across languages and cultures.\n","authors":["Oliver Li","Mallika Subramanian","Arkadiy Saakyan","Sky CH-Wang","Smaranda Muresan"],"pdf_url":"https://arxiv.org/pdf/2310.14563v2.pdf","comment":"EMNLP 2023 Main Conference, Short Paper; Data at\n https://github.com/Aochong-Li/NormDial"},{"id":"http://arxiv.org/abs/2307.13854v3","updated":"2023-10-25T01:56:14Z","published":"2023-07-25T22:59:32Z","title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","summary":" With advances in generative AI, there is now potential for autonomous agents\nto manage daily tasks via natural language commands. However, current agents\nare primarily created and tested in simplified synthetic environments, leading\nto a disconnect with real-world scenarios. In this paper, we build an\nenvironment for language-guided agents that is highly realistic and\nreproducible. Specifically, we focus on agents that perform tasks on the web,\nand create an environment with fully functional websites from four common\ndomains: e-commerce, social forum discussions, collaborative software\ndevelopment, and content management. Our environment is enriched with tools\n(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage\nhuman-like task-solving. Building upon our environment, we release a set of\nbenchmark tasks focusing on evaluating the functional correctness of task\ncompletions. The tasks in our benchmark are diverse, long-horizon, and designed\nto emulate tasks that humans routinely perform on the internet. We experiment\nwith several baseline agents, integrating recent techniques such as reasoning\nbefore acting. 
The results demonstrate that solving complex tasks is\nchallenging: our best GPT-4-based agent only achieves an end-to-end task\nsuccess rate of 14.41%, significantly lower than the human performance of\n78.24%. These results highlight the need for further development of robust\nagents, that current state-of-the-art large language models are far from\nperfect performance in these real-life tasks, and that WebArena can be used to\nmeasure such progress.\n","authors":["Shuyan Zhou","Frank F. Xu","Hao Zhu","Xuhui Zhou","Robert Lo","Abishek Sridhar","Xianyi Cheng","Tianyue Ou","Yonatan Bisk","Daniel Fried","Uri Alon","Graham Neubig"],"pdf_url":"https://arxiv.org/pdf/2307.13854v3.pdf","comment":"Our code, data, environment reproduction resources, and video\n demonstrations are publicly available at https://webarena.dev/"},{"id":"http://arxiv.org/abs/2310.16278v1","updated":"2023-10-25T01:20:17Z","published":"2023-10-25T01:20:17Z","title":"XFEVER: Exploring Fact Verification across Languages","summary":" This paper introduces the Cross-lingual Fact Extraction and VERification\n(XFEVER) dataset designed for benchmarking the fact verification models across\ndifferent languages. We constructed it by translating the claim and evidence\ntexts of the Fact Extraction and VERification (FEVER) dataset into six\nlanguages. The training and development sets were translated using machine\ntranslation, whereas the test set includes texts translated by professional\ntranslators and machine-translated texts. Using the XFEVER dataset, two\ncross-lingual fact verification scenarios, zero-shot learning and\ntranslate-train learning, are defined, and baseline models for each scenario\nare also proposed in this paper. Experimental results show that the\nmultilingual language model can be used to build fact verification models in\ndifferent languages efficiently. However, the performance varies by language\nand is somewhat inferior to the English case. 
We also found that we can\neffectively mitigate model miscalibration by considering the prediction\nsimilarity between the English and target languages. The XFEVER dataset, code,\nand model checkpoints are available at\nhttps://github.com/nii-yamagishilab/xfever.\n","authors":["Yi-Chen Chang","Canasai Kruengkrai","Junichi Yamagishi"],"pdf_url":"https://arxiv.org/pdf/2310.16278v1.pdf","comment":"Accepted for an oral presentation at the 35th Conference on\n Computational Linguistics and Speech Processing (ROCLING 2023)"},{"id":"http://arxiv.org/abs/2310.16271v1","updated":"2023-10-25T01:05:03Z","published":"2023-10-25T01:05:03Z","title":"CycleAlign: Iterative Distillation from Black-box LLM to White-box\n Models for Better Human Alignment","summary":" Language models trained on large-scale corpus often generate content that is\nharmful, toxic, or contrary to human preferences, making their alignment with\nhuman values a critical concern. Reinforcement learning from human feedback\n(RLHF) with algorithms like PPO is a prevalent approach for alignment but is\noften complex, unstable, and resource-intensive. Recently, ranking-based\nalignment methods have emerged, offering stability and effectiveness by\nreplacing the RL framework with supervised fine-tuning, but they are costly due\nto the need for annotated data. Considering that existing large language models\n(LLMs) like ChatGPT are already relatively well-aligned and cost-friendly,\nresearchers have begun to align the language model with human preference from\nAI feedback. The common practices, which unidirectionally distill the\ninstruction-following responses from LLMs, are constrained by their bottleneck.\nThus we introduce CycleAlign to distill alignment capabilities from\nparameter-invisible LLMs (black-box) to a parameter-visible model (white-box)\nin an iterative manner. 
With in-context learning (ICL) as the core of the\ncycle, the black-box models are able to rank the model-generated responses\nguided by human-craft instruction and demonstrations about their preferences.\nDuring iterative interaction, the white-box models also have a judgment about\nresponses generated by them. Consequently, the agreement ranking could be\nviewed as a pseudo label to dynamically update the in-context demonstrations\nand improve the preference ranking ability of black-box models. Through\nmultiple interactions, the CycleAlign framework could align the white-box model\nwith the black-box model effectively in a low-resource way. Empirical results\nillustrate that the model fine-tuned by CycleAlign remarkably exceeds existing\nmethods, and achieves the state-of-the-art performance in alignment with human\nvalue.\n","authors":["Jixiang Hong","Quan Tu","Changyu Chen","Xing Gao","Ji Zhang","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.16271v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16270v1","updated":"2023-10-25T01:03:35Z","published":"2023-10-25T01:03:35Z","title":"Attention Lens: A Tool for Mechanistically Interpreting the Attention\n Head Information Retrieval Mechanism","summary":" Transformer-based Large Language Models (LLMs) are the state-of-the-art for\nnatural language tasks. Recent work has attempted to decode, by reverse\nengineering the role of linear layers, the internal mechanisms by which LLMs\narrive at their final predictions for text completion tasks. Yet little is\nknown about the specific role of attention heads in producing the final token\nprediction. We propose Attention Lens, a tool that enables researchers to\ntranslate the outputs of attention heads into vocabulary tokens via learned\nattention-head-specific transformations called lenses. Preliminary findings\nfrom our trained lenses indicate that attention heads play highly specialized\nroles in language models. 
The code for Attention Lens is available at\ngithub.com/msakarvadia/AttentionLens.\n","authors":["Mansi Sakarvadia","Arham Khan","Aswathy Ajith","Daniel Grzenda","Nathaniel Hudson","André Bauer","Kyle Chard","Ian Foster"],"pdf_url":"https://arxiv.org/pdf/2310.16270v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16269v1","updated":"2023-10-25T01:01:28Z","published":"2023-10-25T01:01:28Z","title":"Multilingual Coarse Political Stance Classification of Media. The\n Editorial Line of a ChatGPT and Bard Newspaper","summary":" Neutrality is difficult to achieve and, in politics, subjective. Traditional\nmedia typically adopt an editorial line that can be used by their potential\nreaders as an indicator of the media bias. Several platforms currently rate\nnews outlets according to their political bias. The editorial line and the\nratings help readers in gathering a balanced view of news. But in the advent of\ninstruction-following language models, tasks such as writing a newspaper\narticle can be delegated to computers. Without imposing a biased persona, where\nwould an AI-based news outlet lie within the bias ratings? In this work, we use\nthe ratings of authentic news outlets to create a multilingual corpus of news\nwith coarse stance annotations (Left and Right) along with automatically\nextracted topic annotations. We show that classifiers trained on this data are\nable to identify the editorial line of most unseen newspapers in English,\nGerman, Spanish and Catalan. We then apply the classifiers to 101\nnewspaper-like articles written by ChatGPT and Bard in the 4 languages at\ndifferent time periods. 
We observe that, similarly to traditional newspapers,\nChatGPT editorial line evolves with time and, being a data-driven system, the\nstance of the generated articles differs among languages.\n","authors":["Cristina España-Bonet"],"pdf_url":"https://arxiv.org/pdf/2310.16269v1.pdf","comment":"To be published at EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2308.06975v3","updated":"2023-10-25T00:57:54Z","published":"2023-08-14T07:20:49Z","title":"Can Knowledge Graphs Simplify Text?","summary":" Knowledge Graph (KG)-to-Text Generation has seen recent improvements in\ngenerating fluent and informative sentences which describe a given KG. As KGs\nare widespread across multiple domains and contain important entity-relation\ninformation, and as text simplification aims to reduce the complexity of a text\nwhile preserving the meaning of the original text, we propose KGSimple, a novel\napproach to unsupervised text simplification which infuses KG-established\ntechniques in order to construct a simplified KG path and generate a concise\ntext which preserves the original input's meaning. Through an iterative and\nsampling KG-first approach, our model is capable of simplifying text when\nstarting from a KG by learning to keep important information while harnessing\nKG-to-text generation to output fluent and descriptive sentences. We evaluate\nvarious settings of the KGSimple model on currently-available KG-to-text\ndatasets, demonstrating its effectiveness compared to unsupervised text\nsimplification models which start with a given complex text. 
Our code is\navailable on GitHub.\n","authors":["Anthony Colas","Haodi Ma","Xuanli He","Yang Bai","Daisy Zhe Wang"],"pdf_url":"https://arxiv.org/pdf/2308.06975v3.pdf","comment":"Accepted as a Main Conference Long Paper at CIKM 2023"},{"id":"http://arxiv.org/abs/2310.13276v2","updated":"2023-10-25T00:46:42Z","published":"2023-10-20T04:45:44Z","title":"InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution","summary":" Over recent decades, significant advancements in cross-modal retrieval are\nmainly driven by breakthroughs in visual and linguistic modeling. However, a\nrecent study shows that multi-modal data representations tend to cluster within\na limited convex cone (as representation degeneration problem), which hinders\nretrieval performance due to the inseparability of these representations. In\nour study, we first empirically validate the presence of the representation\ndegeneration problem across multiple cross-modal benchmarks and methods. Next,\nto address it, we introduce a novel method, called InvGC, a post-processing\ntechnique inspired by graph convolution and average pooling. Specifically,\nInvGC defines the graph topology within the datasets and then applies graph\nconvolution in a subtractive manner. This method effectively separates\nrepresentations by increasing the distances between data points. To improve the\nefficiency and effectiveness of InvGC, we propose an advanced graph topology,\nLocalAdj, which only aims to increase the distances between each data point and\nits nearest neighbors. To understand why InvGC works, we present a detailed\ntheoretical analysis, proving that the lower bound of recall will be improved\nafter deploying InvGC. 
Extensive empirical results show that InvGC and InvGC\nw/LocalAdj significantly mitigate the representation degeneration problem,\nthereby enhancing retrieval performance.\n Our code is available at\nhttps://github.com/yimuwangcs/Better_Cross_Modal_Retrieval\n","authors":["Xiangru Jian","Yimu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.13276v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2302.14233v2","updated":"2023-10-25T00:43:04Z","published":"2023-02-28T01:32:32Z","title":"Goal Driven Discovery of Distributional Differences via Language\n Descriptions","summary":" Mining large corpora can generate useful discoveries but is time-consuming\nfor humans. We formulate a new task, D5, that automatically discovers\ndifferences between two large corpora in a goal-driven way. The task input is a\nproblem comprising a research goal \"$\\textit{comparing the side effects of drug\nA and drug B}$\" and a corpus pair (two large collections of patients'\nself-reported reactions after taking each drug). The output is a language\ndescription (discovery) of how these corpora differ (patients taking drug A\n\"$\\textit{mention feelings of paranoia}$\" more often). We build a D5 system,\nand to quantitatively measure its performance, we 1) contribute a meta-dataset,\nOpenD5, aggregating 675 open-ended problems ranging across business, social\nsciences, humanities, machine learning, and health, and 2) propose a set of\nunified evaluation metrics: validity, relevance, novelty, and significance.\nWith the dataset and the unified metrics, we confirm that language models can\nuse the goals to propose more relevant, novel, and significant candidate\ndiscoveries. 
Finally, our system produces discoveries previously unknown to the\nauthors on a wide range of applications in OpenD5, including temporal and\ndemographic differences in discussion topics, political stances and stereotypes\nin speech, insights in commercial reviews, and error patterns in NLP models.\n","authors":["Ruiqi Zhong","Peter Zhang","Steve Li","Jinwoo Ahn","Dan Klein","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2302.14233v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16263v1","updated":"2023-10-25T00:32:56Z","published":"2023-10-25T00:32:56Z","title":"Enhancing Large Language Models for Secure Code Generation: A\n Dataset-driven Study on Vulnerability Mitigation","summary":" Large language models (LLMs) have brought significant advancements to code\ngeneration, benefiting both novice and experienced developers. However, their\ntraining using unsanitized data from open-source repositories, like GitHub,\nintroduces the risk of inadvertently propagating security vulnerabilities. To\neffectively mitigate this concern, this paper presents a comprehensive study\nfocused on evaluating and enhancing code LLMs from a software security\nperspective. We introduce SecuCoGen\\footnote{SecuCoGen has been uploaded as\nsupplemental material and will be made publicly available after publication.},\na meticulously curated dataset targeting 21 critical vulnerability types.\nSecuCoGen comprises 180 samples and serves as the foundation for conducting\nexperiments on three crucial code-related tasks: code generation, code repair\nand vulnerability classification, with a strong emphasis on security. 
Our\nexperimental results reveal that existing models often overlook security\nconcerns during code generation, leading to the generation of vulnerable code.\nTo address this, we propose effective approaches to mitigate the security\nvulnerabilities and enhance the overall robustness of code generated by LLMs.\nMoreover, our study identifies weaknesses in existing models' ability to repair\nvulnerable code, even when provided with vulnerability information.\nAdditionally, certain vulnerability types pose challenges for the models,\nhindering their performance in vulnerability classification. Based on these\nfindings, we believe our study will have a positive impact on the software\nengineering community, inspiring the development of improved methods for\ntraining and utilizing LLMs, thereby leading to safer and more trustworthy\nmodel deployment.\n","authors":["Jiexin Wang","Liuwen Cao","Xitong Luo","Zhiping Zhou","Jiayuan Xie","Adam Jatowt","Yi Cai"],"pdf_url":"https://arxiv.org/pdf/2310.16263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16261v1","updated":"2023-10-25T00:31:29Z","published":"2023-10-25T00:31:29Z","title":"The Distributional Hypothesis Does Not Fully Explain the Benefits of\n Masked Language Model Pretraining","summary":" We analyze the masked language modeling pretraining objective function from\nthe perspective of the distributional hypothesis. We investigate whether better\nsample efficiency and the better generalization capability of models pretrained\nwith masked language modeling can be attributed to the semantic similarity\nencoded in the pretraining data's distributional property. Via a synthetic\ndataset, our analysis suggests that distributional property indeed leads to the\nbetter sample efficiency of pretrained masked language models, but does not\nfully explain the generalization capability. 
We also conduct analyses over two\nreal-world datasets and demonstrate that the distributional property does not\nexplain the generalization ability of pretrained natural language models\neither. Our results illustrate our limited understanding of model pretraining\nand provide future research directions.\n","authors":["Ting-Rui Chiang","Dani Yogatama"],"pdf_url":"https://arxiv.org/pdf/2310.16261v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14583v2","updated":"2023-10-25T00:08:17Z","published":"2023-05-23T23:45:20Z","title":"Natural Language Decompositions of Implicit Content Enable Better Text\n Representations","summary":" When people interpret text, they rely on inferences that go beyond the\nobserved language itself. Inspired by this observation, we introduce a method\nfor the analysis of text that takes implicitly communicated content explicitly\ninto account. We use a large language model to produce sets of propositions\nthat are inferentially related to the text that has been observed, then\nvalidate the plausibility of the generated content via human judgments.\nIncorporating these explicit representations of implicit content proves useful\nin multiple problem settings that involve the human interpretation of\nutterances: assessing the similarity of arguments, making sense of a body of\nopinion data, and modeling legislative behavior. 
Our results suggest that\nmodeling the meanings behind observed language, rather than the literal text\nalone, is a valuable direction for NLP and particularly its applications to\nsocial science.\n","authors":["Alexander Hoyle","Rupak Sarkar","Pranav Goel","Philip Resnik"],"pdf_url":"https://arxiv.org/pdf/2305.14583v2.pdf","comment":"Accepted to EMNLP 2023 (Main conference)"},{"id":"http://arxiv.org/abs/2310.17064v1","updated":"2023-10-25T23:54:04Z","published":"2023-10-25T23:54:04Z","title":"math-PVS: A Large Language Model Framework to Map Scientific\n Publications to PVS Theories","summary":" As artificial intelligence (AI) gains greater adoption in a wide variety of\napplications, it has immense potential to contribute to mathematical discovery,\nby guiding conjecture generation, constructing counterexamples, assisting in\nformalizing mathematics, and discovering connections between different\nmathematical areas, to name a few.\n While prior work has leveraged computers for exhaustive mathematical proof\nsearch, recent efforts based on large language models (LLMs) aspire to position\ncomputing platforms as co-contributors in the mathematical research process.\nDespite their current limitations in logic and mathematical tasks, there is\ngrowing interest in melding theorem proving systems with foundation models.\nThis work investigates the applicability of LLMs in formalizing advanced\nmathematical concepts and proposes a framework that can critically review and\ncheck mathematical reasoning in research papers. Given the noted reasoning\nshortcomings of LLMs, our approach synergizes the capabilities of proof\nassistants, specifically PVS, with LLMs, enabling a bridge between textual\ndescriptions in academic papers and formal specifications in PVS. 
By harnessing\nthe PVS environment, coupled with data ingestion and conversion mechanisms, we\nenvision an automated process, called \\emph{math-PVS}, to extract and formalize\nmathematical theorems from research papers, offering an innovative tool for\nacademic review and discovery.\n","authors":["Hassen Saidi","Susmit Jha","Tuhin Sahai"],"pdf_url":"https://arxiv.org/pdf/2310.17064v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17054v1","updated":"2023-10-25T23:32:12Z","published":"2023-10-25T23:32:12Z","title":"BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs'\n Generation","summary":" Large language models (LLMs) such as GPT-3 have demonstrated a strong\ncapability to generate coherent and contextually relevant text. However, amidst\ntheir successes, a crucial issue persists: their generated outputs still lack\ncommonsense at times. Moreover, fine-tuning the entire LLM towards more\ncommonsensical outputs is computationally expensive if not infeasible. In this\npaper, we present a computation-efficient framework that steers a frozen\nPre-Trained Language Model (PTLM) towards more commonsensical generation (i.e.,\nproducing a plausible output that incorporates a list of concepts in a\nmeaningful way). Specifically, we first construct a reference-free evaluator\nthat assigns a sentence with a commonsensical score by grounding the sentence\nto a dynamic commonsense knowledge base from four different relational aspects.\nWe then use the scorer as the oracle for commonsense knowledge, and extend the\ncontrollable generation method called NADO to train an auxiliary head that\nguides a fixed PTLM to better satisfy the oracle. We test our framework on a\nseries of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two\nconstrained concept-to-sentence benchmarks. 
Human evaluation results\ndemonstrate that our method consistently leads to the most commonsensical\noutputs.\n","authors":["Yufei Tian","Felix Zhang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.17054v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17041v1","updated":"2023-10-25T22:42:30Z","published":"2023-10-25T22:42:30Z","title":"On Surgical Fine-tuning for Language Encoders","summary":" Fine-tuning all the layers of a pre-trained neural language encoder (either\nusing all the parameters or using parameter-efficient methods) is often the\nde-facto way of adapting it to a new task. We show evidence that for different\ndownstream language tasks, fine-tuning only a subset of layers is sufficient to\nobtain performance that is close to and often better than fine-tuning all the\nlayers in the language encoder. We propose an efficient metric based on the\ndiagonal of the Fisher information matrix (FIM score), to select the candidate\nlayers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE\ntasks and across distinct language encoders, that this metric can effectively\nselect layers leading to a strong downstream performance. Our work highlights\nthat task-specific information corresponding to a given downstream task is\noften localized within a few layers, and tuning only those is sufficient for\nstrong performance. 
Additionally, we demonstrate the robustness of the FIM\nscore to rank layers in a manner that remains constant during the optimization\nprocess.\n","authors":["Abhilasha Lodha","Gayatri Belapurkar","Saloni Chalkapurkar","Yuanming Tao","Reshmi Ghosh","Samyadeep Basu","Dmitrii Petrov","Soundararajan Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2310.17041v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.02531v5","updated":"2023-10-25T22:31:50Z","published":"2023-05-04T03:51:31Z","title":"Can LLMs Capture Intertemporal Preferences?","summary":" We explore the viability of Large Language Models (LLMs), specifically\nOpenAI's GPT-3.5 and GPT-4, in emulating human survey respondents and eliciting\npreferences, with a focus on intertemporal choices. Leveraging the extensive\nliterature on intertemporal discounting for benchmarking, we examine responses\nfrom LLMs across various languages and compare them to human responses,\nexploring preferences between smaller, sooner, and larger, later rewards. Our\nfindings reveal that both GPT models demonstrate less patience than humans,\nwith GPT-3.5 exhibiting a lexicographic preference for earlier rewards, unlike\nhuman decision-makers. Though GPT-4 does not display lexicographic preferences,\nits measured discount rates are still considerably larger than those found in\nhumans. Interestingly, GPT models show greater patience in languages with weak\nfuture tense references, such as German and Mandarin, aligning with existing\nliterature that suggests a correlation between language structure and\nintertemporal preferences. We demonstrate how prompting GPT to explain its\ndecisions, a procedure we term ``chain-of-thought conjoint,\" can mitigate, but\ndoes not eliminate, discrepancies between LLM and human responses. 
While\ndirectly eliciting preferences using LLMs may yield misleading results,\ncombining chain-of-thought conjoint with topic modeling aids in hypothesis\ngeneration, enabling researchers to explore the underpinnings of preferences.\nChain-of-thought conjoint provides a structured framework for marketers to use\nLLMs to identify potential attributes or factors that can explain preference\nheterogeneity across different customers and contexts.\n","authors":["Ali Goli","Amandeep Singh"],"pdf_url":"https://arxiv.org/pdf/2305.02531v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17034v1","updated":"2023-10-25T22:22:18Z","published":"2023-10-25T22:22:18Z","title":"Follow-on Question Suggestion via Voice Hints for Voice Assistants","summary":" The adoption of voice assistants like Alexa or Siri has grown rapidly,\nallowing users to instantly access information via voice search. Query\nsuggestion is a standard feature of screen-based search experiences, allowing\nusers to explore additional topics. However, this is not trivial to implement\nin voice-based settings. To enable this, we tackle the novel task of suggesting\nquestions with compact and natural voice hints to allow users to ask follow-up\nquestions.\n We define the task, ground it in syntactic theory and outline linguistic\ndesiderata for spoken hints. We propose baselines and an approach using\nsequence-to-sequence Transformers to generate spoken hints from a list of\nquestions. Using a new dataset of 6681 input questions and human written hints,\nwe evaluated the models with automatic metrics and human evaluation. Results\nshow that a naive approach of concatenating suggested questions creates poor\nvoice hints. 
Our approach, which applies a linguistically-motivated pretraining\ntask was strongly preferred by humans for producing the most natural hints.\n","authors":["Besnik Fetahu","Pedro Faustini","Giuseppe Castellucci","Anjie Fang","Oleg Rokhlenko","Shervin Malmasi"],"pdf_url":"https://arxiv.org/pdf/2310.17034v1.pdf","comment":"Accepted as Long Paper at EMNLP'23 Findings"},{"id":"http://arxiv.org/abs/2310.17022v1","updated":"2023-10-25T22:00:05Z","published":"2023-10-25T22:00:05Z","title":"Controlled Decoding from Language Models","summary":" We propose controlled decoding (CD), a novel off-policy reinforcement\nlearning method to control the autoregressive generation from language models\ntowards high reward outcomes. CD solves an off-policy reinforcement learning\nproblem through a value function for the reward, which we call a prefix scorer.\nThe prefix scorer is used at inference time to steer the generation towards\nhigher reward outcomes. We show that the prefix scorer may be trained on\n(possibly) off-policy data to predict the expected reward when decoding is\ncontinued from a partially decoded response. We empirically demonstrate that CD\nis effective as a control mechanism on Reddit conversations corpus. We also\nshow that the modularity of the design of CD makes it possible to control for\nmultiple rewards, effectively solving a multi-objective reinforcement learning\nproblem with no additional complexity. Finally, we show that CD can be applied\nin a novel blockwise fashion at inference-time, again without the need for any\ntraining-time changes, essentially bridging the gap between the popular\nbest-of-$K$ strategy and token-level reinforcement learning. 
This makes CD a\npromising approach for alignment of language models.\n","authors":["Sidharth Mudgal","Jong Lee","Harish Ganapathy","YaGuang Li","Tao Wang","Yanping Huang","Zhifeng Chen","Heng-Tze Cheng","Michael Collins","Trevor Strohman","Jilin Chen","Alex Beutel","Ahmad Beirami"],"pdf_url":"https://arxiv.org/pdf/2310.17022v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17019v1","updated":"2023-10-25T21:46:34Z","published":"2023-10-25T21:46:34Z","title":"Conditionally Combining Robot Skills using Large Language Models","summary":" This paper combines two contributions. First, we introduce an extension of\nthe Meta-World benchmark, which we call \"Language-World,\" which allows a large\nlanguage model to operate in a simulated robotic environment using\nsemi-structured natural language queries and scripted skills described using\nnatural language. By using the same set of tasks as Meta-World, Language-World\nresults can be easily compared to Meta-World results, allowing for a point of\ncomparison between recent methods using Large Language Models (LLMs) and those\nusing Deep Reinforcement Learning. Second, we introduce a method we call Plan\nConditioned Behavioral Cloning (PCBC), that allows finetuning the behavior of\nhigh-level plans using end-to-end demonstrations. Using Language-World, we show\nthat PCBC is able to achieve strong performance in a variety of few-shot\nregimes, often achieving task generalization with as little as a single\ndemonstration. We have made Language-World available as open-source software at\nhttps://github.com/krzentner/language-world/.\n","authors":["K. R. Zentner","Ryan Julian","Brian Ichter","Gaurav S. 
Sukhatme"],"pdf_url":"https://arxiv.org/pdf/2310.17019v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17017v1","updated":"2023-10-25T21:37:57Z","published":"2023-10-25T21:37:57Z","title":"An Integrative Survey on Mental Health Conversational Agents to Bridge\n Computer Science and Medical Perspectives","summary":" Mental health conversational agents (a.k.a. chatbots) are widely studied for\ntheir potential to offer accessible support to those experiencing mental health\nchallenges. Previous surveys on the topic primarily consider papers published\nin either computer science or medicine, leading to a divide in understanding\nand hindering the sharing of beneficial knowledge between both domains. To\nbridge this gap, we conduct a comprehensive literature review using the PRISMA\nframework, reviewing 534 papers published in both computer science and\nmedicine. Our systematic review reveals 136 key papers on building mental\nhealth-related conversational agents with diverse characteristics of modeling\nand experimental design techniques. We find that computer science papers focus\non LLM techniques and evaluating response quality using automated metrics with\nlittle attention to the application while medical papers use rule-based\nconversational agents and outcome metrics to measure the health outcomes of\nparticipants. 
Based on our findings on transparency, ethics, and cultural\nheterogeneity in this review, we provide a few recommendations to help bridge\nthe disciplinary divide and enable the cross-disciplinary development of mental\nhealth conversational agents.\n","authors":["Young Min Cho","Sunny Rai","Lyle Ungar","João Sedoc","Sharath Chandra Guntuku"],"pdf_url":"https://arxiv.org/pdf/2310.17017v1.pdf","comment":"Accepted in EMNLP 2023 Main Conference, camera ready"},{"id":"http://arxiv.org/abs/2310.17015v1","updated":"2023-10-25T21:29:36Z","published":"2023-10-25T21:29:36Z","title":"Data Augmentation for Emotion Detection in Small Imbalanced Text Data","summary":" Emotion recognition in text, the task of identifying emotions such as joy or\nanger, is a challenging problem in NLP with many applications. One of the\nchallenges is the shortage of available datasets that have been annotated with\nemotions. Certain existing datasets are small, follow different emotion\ntaxonomies and display imbalance in their emotion distribution. In this work,\nwe studied the impact of data augmentation techniques precisely when applied to\nsmall imbalanced datasets, for which current state-of-the-art models (such as\nRoBERTa) under-perform. Specifically, we utilized four data augmentation\nmethods (Easy Data Augmentation EDA, static and contextual Embedding-based, and\nProtAugment) on three datasets that come from different sources and vary in\nsize, emotion categories and distributions. Our experimental results show that\nusing the augmented data when training the classifier model leads to\nsignificant improvements. Finally, we conducted two case studies: a) directly\nusing the popular chat-GPT API to paraphrase text using different prompts, and\nb) using external data to augment the training set. 
Results show the promising\npotential of these methods.\n","authors":["Anna Koufakou","Diego Grisales","Ragy Costa de jesus","Oscar Fox"],"pdf_url":"https://arxiv.org/pdf/2310.17015v1.pdf","comment":"Accepted paper at IEEE ICMLA 2023"},{"id":"http://arxiv.org/abs/2310.17010v1","updated":"2023-10-25T21:18:35Z","published":"2023-10-25T21:18:35Z","title":"This Reads Like That: Deep Learning for Interpretable Natural Language\n Processing","summary":" Prototype learning, a popular machine learning method designed for inherently\ninterpretable decisions, leverages similarities to learned prototypes for\nclassifying new data. While it is mainly applied in computer vision, in this\nwork, we build upon prior research and further explore the extension of\nprototypical networks to natural language processing. We introduce a learned\nweighted similarity measure that enhances the similarity computation by\nfocusing on informative dimensions of pre-trained sentence embeddings.\nAdditionally, we propose a post-hoc explainability mechanism that extracts\nprediction-relevant words from both the prototype and input sentences. Finally,\nwe empirically demonstrate that our proposed method not only improves\npredictive performance on the AG News and RT Polarity datasets over a previous\nprototype-based approach, but also improves the faithfulness of explanations\ncompared to rationale-based recurrent convolutions.\n","authors":["Claudio Fanconi","Moritz Vandenhirtz","Severin Husmann","Julia E. Vogt"],"pdf_url":"https://arxiv.org/pdf/2310.17010v1.pdf","comment":"10 pages, 1 figure, 5 tables"},{"id":"http://arxiv.org/abs/1901.05066v2","updated":"2023-10-25T20:55:00Z","published":"2019-01-15T21:56:32Z","title":"Investigating Antigram Behaviour using Distributional Semantics","summary":" The field of computational linguistics constantly presents new challenges and\ntopics for research. 
Whether it be analyzing word usage changes over time or\nidentifying relationships between pairs of seemingly unrelated words. To this\npoint, we identify Anagrams and Antigrams as words possessing such unique\nproperties. The presented work is an exploration into generating anagrams from\na given word and determining whether there exists antigram (semantically\nopposite anagrams) relationships between the pairs of generated anagrams using\nGloVe embeddings. We propose a rudimentary, yet interpretable, rule-based\nalgorithm for detecting antigrams. On a small dataset of just 12 antigrams, our\napproach yielded an accuracy of 39\\% which shows that there is much work left\nto be done in this space.\n","authors":["Saptarshi Sengupta"],"pdf_url":"https://arxiv.org/pdf/1901.05066v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16995v1","updated":"2023-10-25T20:48:16Z","published":"2023-10-25T20:48:16Z","title":"Quality > Quantity: Synthetic Corpora from Foundation Models for\n Closed-Domain Extractive Question Answering","summary":" Domain adaptation, the process of training a model in one domain and applying\nit to another, has been extensively explored in machine learning. While\ntraining a domain-specific foundation model (FM) from scratch is an option,\nrecent methods have focused on adapting pre-trained FMs for domain-specific\ntasks. However, our experiments reveal that either approach does not\nconsistently achieve state-of-the-art (SOTA) results in the target domain. In\nthis work, we study extractive question answering within closed domains and\nintroduce the concept of targeted pre-training. This involves determining and\ngenerating relevant data to further pre-train our models, as opposed to the\nconventional philosophy of utilizing domain-specific FMs trained on a wide\nrange of data. 
Our proposed framework uses Galactica to generate synthetic,\n``targeted'' corpora that align with specific writing styles and topics, such\nas research papers and radiology reports. This process can be viewed as a form\nof knowledge distillation. We apply our method to two biomedical extractive\nquestion answering datasets, COVID-QA and RadQA, achieving a new benchmark on\nthe former and demonstrating overall improvements on the latter. Code available\nat https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main.\n","authors":["Saptarshi Sengupta","Connor Heaton","Shreya Ghosh","Preslav Nakov","Prasenjit Mitra"],"pdf_url":"https://arxiv.org/pdf/2310.16995v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16992v1","updated":"2023-10-25T20:43:07Z","published":"2023-10-25T20:43:07Z","title":"How well can machine-generated texts be identified and can language\n models be trained to avoid identification?","summary":" With the rise of generative pre-trained transformer models such as GPT-3,\nGPT-NeoX, or OPT, distinguishing human-generated texts from machine-generated\nones has become important. We refined five separate language models to generate\nsynthetic tweets, uncovering that shallow learning classification algorithms,\nlike Naive Bayes, achieve detection accuracy between 0.6 and 0.8.\n Shallow learning classifiers differ from human-based detection, especially\nwhen using higher temperature values during text generation, resulting in a\nlower detection rate. Humans prioritize linguistic acceptability, which tends\nto be higher at lower temperature values. In contrast, transformer-based\nclassifiers have an accuracy of 0.9 and above. We found that using a\nreinforcement learning approach to refine our generative models can\nsuccessfully evade BERT-based classifiers with a detection accuracy of 0.15 or\nless.\n","authors":["Sinclair Schneider","Florian Steuber","Joao A. G. 
Schneider","Gabi Dreo Rodosek"],"pdf_url":"https://arxiv.org/pdf/2310.16992v1.pdf","comment":"This paper has been accepted for the upcoming 57th Hawaii\n International Conference on System Sciences (HICSS-57)"},{"id":"http://arxiv.org/abs/2310.16990v1","updated":"2023-10-25T20:41:30Z","published":"2023-10-25T20:41:30Z","title":"STEER: Semantic Turn Extension-Expansion Recognition for Voice\n Assistants","summary":" In the context of a voice assistant system, steering refers to the phenomenon\nin which a user issues a follow-up command attempting to direct or clarify a\nprevious turn. We propose STEER, a steering detection model that predicts\nwhether a follow-up turn is a user's attempt to steer the previous command.\nConstructing a training dataset for steering use cases poses challenges due to\nthe cold-start problem. To overcome this, we developed heuristic rules to\nsample opt-in usage data, approximating positive and negative samples without\nany annotation. Our experimental results show promising performance in\nidentifying steering intent, with over 95% accuracy on our sampled data.\nMoreover, STEER, in conjunction with our sampling strategy, aligns effectively\nwith real-world steering scenarios, as evidenced by its strong zero-shot\nperformance on a human-graded evaluation set. In addition to relying solely on\nuser transcripts as input, we introduce STEER+, an enhanced version of the\nmodel. STEER+ utilizes a semantic parse tree to provide more context on\nout-of-vocabulary words, such as named entities that often occur at the\nsentence boundary. This further improves model performance, reducing error rate\nin domains where entities frequently appear, such as messaging. 
Lastly, we\npresent a data analysis that highlights the improvement in user experience when\nvoice assistants support steering use cases.\n","authors":["Leon Liyang Zhang","Jiarui Lu","Joel Ruben Antony Moniz","Aditya Kulkarni","Dhivya Piraviperumal","Tien Dung Tran","Nicholas Tzou","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2310.16990v1.pdf","comment":"EMNLP 2023 Industry Track"},{"id":"http://arxiv.org/abs/2310.02255v2","updated":"2023-10-25T20:22:24Z","published":"2023-10-03T17:57:24Z","title":"MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V,\n Bard, and Other Large Multimodal Models","summary":" Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit\nimpressive problem-solving skills in many tasks and domains, but their ability\nin mathematical reasoning in visual contexts has not been systematically\nstudied. To bridge this gap, we present MathVista, a benchmark designed to\ncombine challenges from diverse mathematical and visual tasks. It consists of\n6,141 examples, derived from 28 existing multimodal datasets involving\nmathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and\nPaperQA). Completing these tasks requires fine-grained, deep visual\nunderstanding and compositional reasoning, which all state-of-the-art\nfoundation models find challenging. With MathVista, we have conducted a\ncomprehensive, quantitative evaluation of 12 prominent foundation models. The\nbest-performing GPT-4V model achieves an overall accuracy of 49.9%,\nsubstantially outperforming Bard, the second-best performer, by 15.1%. Our\nin-depth analysis reveals that the superiority of GPT-4V is mainly attributed\nto its enhanced visual perception and mathematical reasoning. However, GPT-4V\nstill falls short of human performance by 10.4%, as it often struggles to\nunderstand complex figures and perform rigorous reasoning. 
This significant gap\nunderscores the critical role that MathVista will play in the development of\ngeneral-purpose AI agents capable of tackling mathematically intensive and\nvisually rich real-world tasks. We further explore the new ability of\nself-verification, the application of self-consistency, and the interactive\nchatbot capabilities of GPT-4V, highlighting its promising potential for future\nresearch. The project is available at https://mathvista.github.io/.\n","authors":["Pan Lu","Hritik Bansal","Tony Xia","Jiacheng Liu","Chunyuan Li","Hannaneh Hajishirzi","Hao Cheng","Kai-Wei Chang","Michel Galley","Jianfeng Gao"],"pdf_url":"https://arxiv.org/pdf/2310.02255v2.pdf","comment":"112 pages, 117 figures. Work in progress"},{"id":"http://arxiv.org/abs/2310.16968v1","updated":"2023-10-25T20:09:14Z","published":"2023-10-25T20:09:14Z","title":"Understanding Social Structures from Contemporary Literary Fiction using\n Character Interaction Graph -- Half Century Chronology of Influential Bengali\n Writers","summary":" Social structures and real-world incidents often influence contemporary\nliterary fiction. Existing research in literary fiction analysis explains these\nreal-world phenomena through the manual critical analysis of stories.\nConventional Natural Language Processing (NLP) methodologies, including\nsentiment analysis, narrative summarization, and topic modeling, have\ndemonstrated substantial efficacy in analyzing and identifying similarities\nwithin fictional works. However, the intricate dynamics of character\ninteractions within fiction necessitate a more nuanced approach that\nincorporates visualization techniques. Character interaction graphs (or\nnetworks) emerge as a highly suitable means for visualization and information\nretrieval from the realm of fiction. 
Therefore, we leverage character\ninteraction graphs with NLP-derived features to explore a diverse spectrum of\nsocietal inquiries about contemporary culture's impact on the landscape of\nliterary fiction. Our study involves constructing character interaction graphs\nfrom fiction, extracting relevant graph features, and exploiting these features\nto resolve various real-life queries. Experimental evaluation of influential\nBengali fiction over half a century demonstrates that character interaction\ngraphs can be highly effective in specific assessments and information\nretrieval from literary fiction. Our data and codebase are available at\nhttps://cutt.ly/fbMgGEM\n","authors":["Nafis Irtiza Tripto","Mohammed Eunus Ali"],"pdf_url":"https://arxiv.org/pdf/2310.16968v1.pdf","comment":"8 pages, 11 figures, 6 pages appendix"},{"id":"http://arxiv.org/abs/2310.16964v1","updated":"2023-10-25T20:05:07Z","published":"2023-10-25T20:05:07Z","title":"Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text\n Generation","summary":" Hallucination of text ungrounded in the input is a well-known problem in\nneural data-to-text generation. Many methods have been proposed to mitigate it,\nbut they typically require altering model architecture or collecting additional\ndata, and thus cannot be easily applied to an existing model. In this paper, we\nexplore a new way to mitigate hallucinations by combining the probabilistic\noutput of a generator language model (LM) with the output of a special \"text\ncritic\" classifier, which guides the generation by assessing the match between\nthe input data and the text generated so far. Our method does not need any\nchanges to the underlying LM's architecture or training procedure and can thus\nbe combined with any model and decoding operating on word probabilities. The\ncritic does not need any additional training data, using the base LM's training\ndata and synthetic negative examples. 
Our experimental results show that our\nmethod improves over the baseline on the WebNLG and OpenDialKG benchmarks.\n","authors":["Mateusz Lango","Ondřej Dušek"],"pdf_url":"https://arxiv.org/pdf/2310.16964v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14702v2","updated":"2023-10-25T19:40:18Z","published":"2023-05-24T04:13:15Z","title":"DecipherPref: Analyzing Influential Factors in Human Preference\n Judgments via GPT-4","summary":" Human preference judgments are pivotal in guiding large language models\n(LLMs) to produce outputs that align with human values. Human evaluations are\nalso used in summarization tasks to compare outputs from various systems,\ncomplementing existing automatic metrics. Despite their significance, however,\nthere has been limited research probing these pairwise or $k$-wise comparisons.\nThe collective impact and relative importance of factors such as output length,\ninformativeness, fluency, and factual consistency are still not well\nunderstood. It is also unclear if there are other hidden factors influencing\nhuman judgments. In this paper, we conduct an in-depth examination of a\ncollection of pairwise human judgments released by OpenAI. Utilizing the\nBradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in\nthese human judgments. We find that the most favored factors vary across tasks\nand genres, whereas the least favored factors tend to be consistent, e.g.,\noutputs are too brief, contain excessive off-focus content or hallucinated\nfacts. 
Our findings have implications on the construction of balanced datasets\nin human preference evaluations, which is a crucial step in shaping the\nbehaviors of future LLMs.\n","authors":["Yebowen Hu","Kaiqiang Song","Sangwoo Cho","Xiaoyang Wang","Hassan Foroosh","Fei Liu"],"pdf_url":"https://arxiv.org/pdf/2305.14702v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16944v1","updated":"2023-10-25T19:25:16Z","published":"2023-10-25T19:25:16Z","title":"Zephyr: Direct Distillation of LM Alignment","summary":" We aim to produce a smaller language model that is aligned to user intent.\nPrevious research has shown that applying distilled supervised fine-tuning\n(dSFT) on larger models significantly improves task accuracy; however, these\nmodels are unaligned, i.e. they do not respond well to natural prompts. To\ndistill this property, we experiment with the use of preference data from AI\nFeedback (AIF). Starting from a dataset of outputs ranked by a teacher model,\nwe apply distilled direct preference optimization (dDPO) to learn a chat model\nwith significantly improved intent alignment. The approach requires only a few\nhours of training without any additional sampling during fine-tuning. The final\nresult, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B\nparameter models, and requires no human annotation. In particular, results on\nMT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access\nRLHF-based model. Code, models, data, and tutorials for the system are\navailable at https://github.com/huggingface/alignment-handbook.\n","authors":["Lewis Tunstall","Edward Beeching","Nathan Lambert","Nazneen Rajani","Kashif Rasul","Younes Belkada","Shengyi Huang","Leandro von Werra","Clémentine Fourrier","Nathan Habib","Nathan Sarrazin","Omar Sanseviero","Alexander M. 
Rush","Thomas Wolf"],"pdf_url":"https://arxiv.org/pdf/2310.16944v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16937v1","updated":"2023-10-25T19:04:33Z","published":"2023-10-25T19:04:33Z","title":"Learning Transfers over Several Programming Languages","summary":" Large language models (LLMs) have recently become remarkably good at\nimproving developer productivity for high-resource programming languages. These\nmodels use two kinds of data: large amounts of unlabeled code samples for\npretraining and relatively smaller amounts of labeled code samples for\nfine-tuning or in-context learning. Unfortunately, many programming languages\nare low-resource, lacking labeled samples for most tasks and often even lacking\nunlabeled samples. Therefore, users of low-resource languages (e.g., legacy or\nnew languages) miss out on the benefits of LLMs. Cross-lingual transfer\nlearning uses data from a source language to improve model performance on a\ntarget language. It has been well-studied for natural languages, but has\nreceived little attention for programming languages. This paper reports\nextensive experiments on four tasks using a transformer-based LLM and 11 to 41\nprogramming languages to explore the following questions. 
First, how well\ncross-lingual transfer works for a given task across different language pairs.\nSecond, given a task and target language, how to best choose a source language.\nThird, the characteristics of a language pair that are predictive of transfer\nperformance, and fourth, how that depends on the given task.\n","authors":["Razan Baltaji","Saurabh Pujar","Louis Mandel","Martin Hirzel","Luca Buratti","Lav Varshney"],"pdf_url":"https://arxiv.org/pdf/2310.16937v1.pdf","comment":"16 pages, 5 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.16931v1","updated":"2023-10-25T18:55:40Z","published":"2023-10-25T18:55:40Z","title":"CL-MASR: A Continual Learning Benchmark for Multilingual ASR","summary":" Modern multilingual automatic speech recognition (ASR) systems like Whisper\nhave made it possible to transcribe audio in multiple languages with a single\nmodel. However, current state-of-the-art ASR models are typically evaluated on\nindividual languages or in a multi-task setting, overlooking the challenge of\ncontinually learning new languages. There is insufficient research on how to\nadd new languages without losing valuable information from previous data.\nFurthermore, existing continual learning benchmarks focus mostly on vision and\nlanguage tasks, leaving continual learning for multilingual ASR largely\nunexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for\nstudying multilingual ASR in a continual learning setting. CL-MASR provides a\ndiverse set of continual learning methods implemented on top of large-scale\npretrained ASR models, along with common metrics to assess the effectiveness of\nlearning new languages while addressing the issue of catastrophic forgetting.\nTo the best of our knowledge, CL-MASR is the first continual learning benchmark\nfor the multilingual ASR task. 
The code is available at\nhttps://github.com/speechbrain/benchmarks.\n","authors":["Luca Della Libera","Pooneh Mousavi","Salah Zaiem","Cem Subakan","Mirco Ravanelli"],"pdf_url":"https://arxiv.org/pdf/2310.16931v1.pdf","comment":"16 pages, 5 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.16924v1","updated":"2023-10-25T18:44:14Z","published":"2023-10-25T18:44:14Z","title":"Physician Detection of Clinical Harm in Machine Translation: Quality\n Estimation Aids in Reliance and Backtranslation Identifies Critical Errors","summary":" A major challenge in the practical use of Machine Translation (MT) is that\nusers lack guidance to make informed decisions about when to rely on outputs.\nProgress in quality estimation research provides techniques to automatically\nassess MT quality, but these techniques have primarily been evaluated in vitro\nby comparison against human judgments outside of a specific context of use.\nThis paper evaluates quality estimation feedback in vivo with a human study\nsimulating decision-making in high-stakes medical settings. Using Emergency\nDepartment discharge instructions, we study how interventions based on quality\nestimation versus backtranslation assist physicians in deciding whether to show\nMT outputs to a patient. We find that quality estimation improves appropriate\nreliance on MT, but backtranslation helps physicians detect more clinically\nharmful errors that QE alone often misses.\n","authors":["Nikita Mehandru","Sweta Agrawal","Yimin Xiao","Elaine C Khoong","Ge Gao","Marine Carpuat","Niloufar Salehi"],"pdf_url":"https://arxiv.org/pdf/2310.16924v1.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.13507v3","updated":"2023-10-25T18:43:02Z","published":"2023-05-22T21:52:24Z","title":"Multimodal Automated Fact-Checking: A Survey","summary":" Misinformation is often conveyed in multiple modalities, e.g. a miscaptioned\nimage. 
Multimodal misinformation is perceived as more credible by humans, and\nspreads faster than its text-only counterparts. While an increasing body of\nresearch investigates automated fact-checking (AFC), previous surveys mostly\nfocus on text. In this survey, we conceptualise a framework for AFC including\nsubtasks unique to multimodal misinformation. Furthermore, we discuss related\nterms used in different communities and map them to our framework. We focus on\nfour modalities prevalent in real-world fact-checking: text, image, audio, and\nvideo. We survey benchmarks and models, and discuss limitations and promising\ndirections for future research\n","authors":["Mubashara Akhtar","Michael Schlichtkrull","Zhijiang Guo","Oana Cocarascu","Elena Simperl","Andreas Vlachos"],"pdf_url":"https://arxiv.org/pdf/2305.13507v3.pdf","comment":"The 2023 Conference on Empirical Methods in Natural Language\n Processing (EMNLP): Findings"},{"id":"http://arxiv.org/abs/2310.16897v1","updated":"2023-10-25T18:00:15Z","published":"2023-10-25T18:00:15Z","title":"Divide et Impera: Multi-Transformer Architectures for Complex NLP-Tasks","summary":" The growing capabilities of transformer models pave the way for solving\nincreasingly complex NLP tasks. A key to supporting application-specific\nrequirements is the ability to fine-tune. However, compiling a fine-tuning\ndataset tailored to complex tasks is tedious and results in large datasets,\nlimiting the ability to control transformer output. We present an approach in\nwhich complex tasks are divided into simpler subtasks. Multiple transformer\nmodels are fine-tuned to one subtask each, and lined up to accomplish the\ncomplex task. This simplifies the compilation of fine-tuning datasets and\nincreases overall controllability. 
Using the example of reducing gender bias as\na complex task, we demonstrate our approach and show that it performs better\nthan using a single model.\n","authors":["Solveig Helland","Elena Gavagnin","Alexandre de Spindler"],"pdf_url":"https://arxiv.org/pdf/2310.16897v1.pdf","comment":"Proceedings of the Swiss Text Analytics Conference 2023"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.16838v1","updated":"2023-10-25T17:59:41Z","published":"2023-10-25T17:59:41Z","title":"SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous\n Manipulation","summary":" Humans excel at transferring manipulation skills across diverse object\nshapes, poses, and appearances due to their understanding of semantic\ncorrespondences between different instances. To endow robots with a similar\nhigh-level understanding, we develop a Distilled Feature Field (DFF) for 3D\nscenes, leveraging large 2D vision models to distill semantic features from\nmultiview images. While current research demonstrates advanced performance in\nreconstructing DFFs from dense views, the development of learning a DFF from\nsparse views is relatively nascent, despite its prevalence in numerous\nmanipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a\nnovel method for acquiring view-consistent 3D DFFs from sparse RGBD\nobservations, enabling one-shot learning of dexterous manipulations that are\ntransferable to novel scenes. Specifically, we map the image features to the 3D\npoint cloud, allowing for propagation across the 3D space to establish a dense\nfeature field. At the core of SparseDFF is a lightweight feature refinement\nnetwork, optimized with a contrastive loss between pairwise views after\nback-projecting the image features onto the 3D point cloud. Additionally, we\nimplement a point-pruning mechanism to augment feature continuity within each\nlocal neighborhood. 
By establishing coherent feature fields on both source and\ntarget scenes, we devise an energy function that facilitates the minimization\nof feature discrepancies w.r.t. the end-effector parameters between the\ndemonstration and the target manipulation. We evaluate our approach using a\ndexterous hand, mastering real-world manipulations on both rigid and deformable\nobjects, and showcase robust generalization in the face of object and\nscene-context variations.\n","authors":["Qianxu Wang","Haotong Zhang","Congyue Deng","Yang You","Hao Dong","Yixin Zhu","Leonidas Guibas"],"pdf_url":"https://arxiv.org/pdf/2310.16838v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16836v1","updated":"2023-10-25T17:59:32Z","published":"2023-10-25T17:59:32Z","title":"LLM-FP4: 4-Bit Floating-Point Quantized Transformers","summary":" We propose LLM-FP4 for quantizing both weights and activations in large\nlanguage models (LLMs) down to 4-bit floating-point values, in a post-training\nmanner. Existing post-training quantization (PTQ) solutions are primarily\ninteger-based and struggle with bit widths below 8 bits. Compared to integer\nquantization, floating-point (FP) quantization is more flexible and can better\nhandle long-tail or bell-shaped distributions, and it has emerged as a default\nchoice in many hardware platforms. One characteristic of FP quantization is\nthat its performance largely depends on the choice of exponent bits and\nclipping range. In this regard, we construct a strong FP-PTQ baseline by\nsearching for the optimal quantization parameters. Furthermore, we observe a\nhigh inter-channel variance and low intra-channel variance pattern in\nactivation distributions, which adds activation quantization difficulty. 
We\nrecognize this pattern to be consistent across a spectrum of transformer models\ndesigned for diverse tasks, such as LLMs, BERT, and Vision Transformer models.\nTo tackle this, we propose per-channel activation quantization and show that\nthese additional scaling factors can be reparameterized as exponential biases\nof weights, incurring a negligible cost. Our method, for the first time, can\nquantize both weights and activations in the LLaMA-13B to only 4-bit and\nachieves an average score of 63.1 on the common sense zero-shot reasoning\ntasks, which is only 5.8 lower than the full-precision model, significantly\noutperforming the previous state-of-the-art by 12.7 points. Code is available\nat: https://github.com/nbasyl/LLM-FP4.\n","authors":["Shih-yang Liu","Zechun Liu","Xijie Huang","Pingcheng Dong","Kwang-Ting Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.16836v1.pdf","comment":"EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.16835v1","updated":"2023-10-25T17:59:26Z","published":"2023-10-25T17:59:26Z","title":"Proposal-Contrastive Pretraining for Object Detection from Fewer Data","summary":" The use of pretrained deep neural networks represents an attractive way to\nachieve strong results with few data available. When specialized in dense\nproblems such as object detection, learning local rather than global\ninformation in images has proven to be more efficient. However, for\nunsupervised pretraining, the popular contrastive learning requires a large\nbatch size and, therefore, a lot of resources. To address this problem, we are\ninterested in transformer-based object detectors that have recently gained\ntraction in the community with good performance and with the particularity of\ngenerating many diverse object proposals.\n In this work, we present Proposal Selection Contrast (ProSeCo), a novel\nunsupervised overall pretraining approach that leverages this property. 
ProSeCo\nuses the large number of object proposals generated by the detector for\ncontrastive learning, which allows the use of a smaller batch size, combined\nwith object-level features to learn local information in the images. To improve\nthe effectiveness of the contrastive loss, we introduce the object location\ninformation in the selection of positive examples to take into account multiple\noverlapping object proposals. When reusing pretrained backbone, we advocate for\nconsistency in learning local information between the backbone and the\ndetection head.\n We show that our method outperforms state of the art in unsupervised\npretraining for object detection on standard and novel benchmarks in learning\nwith fewer data.\n","authors":["Quentin Bouniot","Romaric Audigier","Angélique Loesch","Amaury Habrard"],"pdf_url":"https://arxiv.org/pdf/2310.16835v1.pdf","comment":"Published as a conference paper at ICLR 2023"},{"id":"http://arxiv.org/abs/2310.16832v1","updated":"2023-10-25T17:59:05Z","published":"2023-10-25T17:59:05Z","title":"LightSpeed: Light and Fast Neural Light Fields on Mobile Devices","summary":" Real-time novel-view image synthesis on mobile devices is prohibitive due to\nthe limited computational power and storage. Using volumetric rendering\nmethods, such as NeRF and its derivatives, on mobile devices is not suitable\ndue to the high computational cost of volumetric rendering. On the other hand,\nrecent advances in neural light field representations have shown promising\nreal-time view synthesis results on mobile devices. Neural light field methods\nlearn a direct mapping from a ray representation to the pixel color. The\ncurrent choice of ray representation is either stratified ray sampling or\nPl\\\"{u}cker coordinates, overlooking the classic light slab (two-plane)\nrepresentation, the preferred representation to interpolate between light field\nviews. 
In this work, we find that using the light slab representation is an\nefficient representation for learning a neural light field. More importantly,\nit is a lower-dimensional ray representation enabling us to learn the 4D ray\nspace using feature grids which are significantly faster to train and render.\nAlthough mostly designed for frontal views, we show that the light-slab\nrepresentation can be further extended to non-frontal scenes using a\ndivide-and-conquer strategy. Our method offers superior rendering quality\ncompared to previous light field methods and achieves a significantly improved\ntrade-off between rendering quality and speed.\n","authors":["Aarush Gupta","Junli Cao","Chaoyang Wang","Ju Hu","Sergey Tulyakov","Jian Ren","László A Jeni"],"pdf_url":"https://arxiv.org/pdf/2310.16832v1.pdf","comment":"Project Page: http://lightspeed-r2l.github.io/website/"},{"id":"http://arxiv.org/abs/2310.16831v1","updated":"2023-10-25T17:59:01Z","published":"2023-10-25T17:59:01Z","title":"PERF: Panoramic Neural Radiance Field from a Single Panorama","summary":" Neural Radiance Field (NeRF) has achieved substantial progress in novel view\nsynthesis given multi-view images. Recently, some works have attempted to train\na NeRF from a single image with 3D priors. They mainly focus on a limited field\nof view and there are few invisible occlusions, which greatly limits their\nscalability to real-world 360-degree panoramic scenarios with large-size\nocclusions. In this paper, we present PERF, a 360-degree novel view synthesis\nframework that trains a panoramic neural radiance field from a single panorama.\nNotably, PERF allows 3D roaming in a complex scene without expensive and\ntedious image collection. To achieve this goal, we propose a novel\ncollaborative RGBD inpainting method and a progressive inpainting-and-erasing\nmethod to lift up a 360-degree 2D scene to a 3D scene. 
Specifically, we first\npredict a panoramic depth map as initialization given a single panorama, and\nreconstruct visible 3D regions with volume rendering. Then we introduce a\ncollaborative RGBD inpainting approach into a NeRF for completing RGB images\nand depth maps from random views, which is derived from an RGB Stable Diffusion\nmodel and a monocular depth estimator. Finally, we introduce an\ninpainting-and-erasing strategy to avoid inconsistent geometry between a\nnewly-sampled view and reference views. The two components are integrated into\nthe learning of NeRFs in a unified optimization framework and achieve promising\nresults. Extensive experiments on Replica and a new dataset PERF-in-the-wild\ndemonstrate the superiority of our PERF over state-of-the-art methods. Our PERF\ncan be widely used for real-world applications, such as panorama-to-3D,\ntext-to-3D, and 3D scene stylization applications. Project page and code are\navailable at https://perf-project.github.io/.\n","authors":["Guangcong Wang","Peng Wang","Zhaoxi Chen","Wenping Wang","Chen Change Loy","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16831v1.pdf","comment":"Project page and code: https://perf-project.github.io/"},{"id":"http://arxiv.org/abs/2310.16828v1","updated":"2023-10-25T17:57:07Z","published":"2023-10-25T17:57:07Z","title":"TD-MPC2: Scalable, Robust World Models for Continuous Control","summary":" TD-MPC is a model-based reinforcement learning (RL) algorithm that performs\nlocal trajectory optimization in the latent space of a learned implicit\n(decoder-free) world model. In this work, we present TD-MPC2: a series of\nimprovements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves\nsignificantly over baselines across 104 online RL tasks spanning 4 diverse task\ndomains, achieving consistently strong results with a single set of\nhyperparameters. 
We further show that agent capabilities increase with model\nand data size, and successfully train a single 317M parameter agent to perform\n80 tasks across multiple task domains, embodiments, and action spaces. We\nconclude with an account of lessons, opportunities, and risks associated with\nlarge TD-MPC2 agents. Explore videos, models, data, code, and more at\nhttps://nicklashansen.github.io/td-mpc2\n","authors":["Nicklas Hansen","Hao Su","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16828v1.pdf","comment":"Explore videos, models, data, code, and more at\n https://nicklashansen.github.io/td-mpc2"},{"id":"http://arxiv.org/abs/2310.16825v1","updated":"2023-10-25T17:56:07Z","published":"2023-10-25T17:56:07Z","title":"CommonCanvas: An Open Diffusion Model Trained with Creative-Commons\n Images","summary":" We assemble a dataset of Creative-Commons-licensed (CC) images, which we use\nto train a set of open diffusion models that are qualitatively competitive with\nStable Diffusion 2 (SD2). This task presents two challenges: (1)\nhigh-resolution CC images lack the captions necessary to train text-to-image\ngenerative models; (2) CC images are relatively scarce. In turn, to address\nthese challenges, we use an intuitive transfer learning technique to produce a\nset of high-quality synthetic captions paired with curated CC images. We then\ndevelop a data- and compute-efficient training recipe that requires as little\nas 3% of the LAION-2B data needed to train existing SD2 models, but obtains\ncomparable quality. These results indicate that we have a sufficient number of\nCC images (~70 million) for training high-quality models. Our training recipe\nalso implements a variety of optimizations that achieve ~3X training speed-ups,\nenabling rapid model iteration. We leverage this recipe to train several\nhigh-quality text-to-image models, which we dub the CommonCanvas family. 
Our\nlargest model achieves comparable performance to SD2 on a human evaluation,\ndespite being trained on our CC dataset that is significantly smaller than\nLAION and using synthetic captions for training. We release our models, data,\nand code at\nhttps://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md\n","authors":["Aaron Gokaslan","A. Feder Cooper","Jasmine Collins","Landan Seguin","Austin Jacobson","Mihir Patel","Jonathan Frankle","Cory Stephenson","Volodymyr Kuleshov"],"pdf_url":"https://arxiv.org/pdf/2310.16825v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16818v1","updated":"2023-10-25T17:50:10Z","published":"2023-10-25T17:50:10Z","title":"DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion\n Prior","summary":" We present DreamCraft3D, a hierarchical 3D content generation method that\nproduces high-fidelity and coherent 3D objects. We tackle the problem by\nleveraging a 2D reference image to guide the stages of geometry sculpting and\ntexture boosting. A central focus of this work is to address the consistency\nissue that existing works encounter. To sculpt geometries that render\ncoherently, we perform score distillation sampling via a view-dependent\ndiffusion model. This 3D prior, alongside several training strategies,\nprioritizes the geometry consistency but compromises the texture fidelity. We\nfurther propose Bootstrapped Score Distillation to specifically boost the\ntexture. We train a personalized diffusion model, Dreambooth, on the augmented\nrenderings of the scene, imbuing it with 3D knowledge of the scene being\noptimized. The score distillation from this 3D-aware diffusion prior provides\nview-consistent guidance for the scene. 
Notably, through an alternating\noptimization of the diffusion prior and 3D scene representation, we achieve\nmutually reinforcing improvements: the optimized 3D scene aids in training the\nscene-specific diffusion model, which offers increasingly view-consistent\nguidance for 3D optimization. The optimization is thus bootstrapped and leads\nto substantial texture boosting. With tailored 3D priors throughout the\nhierarchical generation, DreamCraft3D generates coherent 3D objects with\nphotorealistic renderings, advancing the state-of-the-art in 3D content\ngeneration. Code available at https://github.com/deepseek-ai/DreamCraft3D.\n","authors":["Jingxiang Sun","Bo Zhang","Ruizhi Shao","Lizhen Wang","Wen Liu","Zhenda Xie","Yebin Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16818v1.pdf","comment":"Project Page: https://mrtornado24.github.io/DreamCraft3D/"},{"id":"http://arxiv.org/abs/2310.16809v1","updated":"2023-10-25T17:38:55Z","published":"2023-10-25T17:38:55Z","title":"Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and\n In-depth Evaluation","summary":" This paper presents a comprehensive evaluation of the Optical Character\nRecognition (OCR) capabilities of the recently released GPT-4V(ision), a Large\nMultimodal Model (LMM). We assess the model's performance across a range of OCR\ntasks, including scene text recognition, handwritten text recognition,\nhandwritten mathematical expression recognition, table structure recognition,\nand information extraction from visually-rich document. The evaluation reveals\nthat GPT-4V performs well in recognizing and understanding Latin contents, but\nstruggles with multilingual scenarios and complex tasks. Based on these\nobservations, we delve deeper into the necessity of specialized OCR models and\ndeliberate on the strategies to fully harness the pretrained general LMMs like\nGPT-4V for OCR downstream tasks. The study offers a critical reference for\nfuture research in OCR with LMMs. 
Evaluation pipeline and results are available\nat https://github.com/SCUT-DLVCLab/GPT-4V_OCR.\n","authors":["Yongxin Shi","Dezhi Peng","Wenhui Liao","Zening Lin","Xinhong Chen","Chongyu Liu","Yuyi Zhang","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2310.16809v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16808v1","updated":"2023-10-25T17:38:16Z","published":"2023-10-25T17:38:16Z","title":"Fingervein Verification using Convolutional Multi-Head Attention Network","summary":" Biometric verification systems are deployed in various security-based\naccess-control applications that require user-friendly and reliable person\nverification. Among the different biometric characteristics, fingervein\nbiometrics have been extensively studied owing to their reliable verification\nperformance. Furthermore, fingervein patterns reside inside the skin and are\nnot visible outside; therefore, they possess inherent resistance to\npresentation attacks and degradation due to external factors. In this paper, we\nintroduce a novel fingervein verification technique using a convolutional\nmultihead attention network called VeinAtnNet. The proposed VeinAtnNet is\ndesigned to achieve light weight with a smaller number of learnable parameters\nwhile extracting discriminant information from both normal and enhanced\nfingervein images. The proposed VeinAtnNet was trained on the newly constructed\nfingervein dataset with 300 unique fingervein patterns that were captured in\nmultiple sessions to obtain 92 samples per unique fingervein. Extensive\nexperiments were performed on the newly collected dataset FV-300 and the\npublicly available FV-USM and FV-PolyU fingervein dataset. 
The performance of\nthe proposed method was compared with five state-of-the-art fingervein\nverification systems, indicating the efficacy of the proposed VeinAtnNet.\n","authors":["Raghavendra Ramachandra","Sushma Venkatesh"],"pdf_url":"https://arxiv.org/pdf/2310.16808v1.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV), 2024"},{"id":"http://arxiv.org/abs/2304.03510v3","updated":"2023-10-25T17:26:16Z","published":"2023-04-07T07:03:00Z","title":"Multispectral Imaging for Differential Face Morphing Attack Detection: A\n Preliminary Study","summary":" Face morphing attack detection is emerging as an increasingly challenging\nproblem owing to advancements in high-quality and realistic morphing attack\ngeneration. Reliable detection of morphing attacks is essential because these\nattacks are targeted for border control applications. This paper presents a\nmultispectral framework for differential morphing-attack detection (D-MAD). The\nD-MAD methods are based on using two facial images that are captured from the\nePassport (also called the reference image) and the trusted device (for\nexample, Automatic Border Control (ABC) gates) to detect whether the face image\npresented in ePassport is morphed. The proposed multispectral D-MAD framework\nintroduces a multispectral image captured as a trusted capture to acquire seven\ndifferent spectral bands to detect morphing attacks. Extensive experiments were\nconducted on the newly created Multispectral Morphed Datasets (MSMD) with 143\nunique data subjects that were captured using both visible and multispectral\ncameras in multiple sessions. 
The results indicate the superior performance of\nthe proposed multispectral framework compared to visible images.\n","authors":["Raghavendra Ramachandra","Sushma Venkatesh","Naser Damer","Narayan Vetrekar","Rajendra Gad"],"pdf_url":"https://arxiv.org/pdf/2304.03510v3.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV), 2024"},{"id":"http://arxiv.org/abs/2302.02285v2","updated":"2023-10-25T17:24:18Z","published":"2023-02-05T03:01:28Z","title":"ReDi: Efficient Learning-Free Diffusion Inference via Trajectory\n Retrieval","summary":" Diffusion models show promising generation capability for a variety of data.\nDespite their high generation quality, the inference for diffusion models is\nstill time-consuming due to the numerous sampling iterations required. To\naccelerate the inference, we propose ReDi, a simple yet learning-free\nRetrieval-based Diffusion sampling framework. From a precomputed knowledge\nbase, ReDi retrieves a trajectory similar to the partially generated trajectory\nat an early stage of generation, skips a large portion of intermediate steps,\nand continues sampling from a later step in the retrieved trajectory. We\ntheoretically prove that the generation performance of ReDi is guaranteed. Our\nexperiments demonstrate that ReDi improves the model inference efficiency by 2x\nspeedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain\nimage generation such as image stylization.\n","authors":["Kexun Zhang","Xianjun Yang","William Yang Wang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2302.02285v2.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2310.16788v1","updated":"2023-10-25T17:20:38Z","published":"2023-10-25T17:20:38Z","title":"The GOOSE Dataset for Perception in Unstructured Environments","summary":" The potential for deploying autonomous systems can be significantly increased\nby improving the perception and interpretation of the environment. 
However, the\ndevelopment of deep learning-based techniques for autonomous systems in\nunstructured outdoor environments poses challenges due to limited data\navailability for training and testing. To address this gap, we present the\nGerman Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset\nspecifically designed for unstructured outdoor environments. The GOOSE dataset\nincorporates 10 000 labeled pairs of images and point clouds, which are\nutilized to train a range of state-of-the-art segmentation models on both image\nand point cloud data. We open source the dataset, along with an ontology for\nunstructured terrain, as well as dataset standards and guidelines. This\ninitiative aims to establish a common framework, enabling the seamless\ninclusion of existing datasets and a fast way to enhance the perception\ncapabilities of various robots operating in unstructured environments. The\ndataset, pre-trained models for offroad perception, and additional\ndocumentation can be found at https://goose-dataset.de/.\n","authors":["Peter Mortimer","Raphael Hagmanns","Miguel Granero","Thorsten Luettel","Janko Petereit","Hans-Joachim Wuensche"],"pdf_url":"https://arxiv.org/pdf/2310.16788v1.pdf","comment":"Preprint; Submitted to IEEE for review"},{"id":"http://arxiv.org/abs/2310.16783v1","updated":"2023-10-25T17:19:14Z","published":"2023-10-25T17:19:14Z","title":"S$^3$-TTA: Scale-Style Selection for Test-Time Augmentation in\n Biomedical Image Segmentation","summary":" Deep-learning models have been successful in biomedical image segmentation.\nTo generalize for real-world deployment, test-time augmentation (TTA) methods\nare often used to transform the test image into different versions that are\nhopefully closer to the training domain. Unfortunately, due to the vast\ndiversity of instance scale and image styles, many augmented test images\nproduce undesirable results, thus lowering the overall performance. 
This work\nproposes a new TTA framework, S$^3$-TTA, which selects the suitable image scale\nand style for each test image based on a transformation consistency metric. In\naddition, S$^3$-TTA constructs an end-to-end augmentation-segmentation\njoint-training pipeline to ensure a task-oriented augmentation. On public\nbenchmarks for cell and lung segmentation, S$^3$-TTA demonstrates improvements\nover the prior art by 3.4% and 1.3%, respectively, by simply augmenting the\ninput data in testing phase.\n","authors":["Kangxian Xie","Siyu Huang","Sebastian Cajas Ordone","Hanspeter Pfister","Donglai Wei"],"pdf_url":"https://arxiv.org/pdf/2310.16783v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16781v1","updated":"2023-10-25T17:15:55Z","published":"2023-10-25T17:15:55Z","title":"Kiki or Bouba? Sound Symbolism in Vision-and-Language Models","summary":" Although the mapping between sound and meaning in human language is assumed\nto be largely arbitrary, research in cognitive science has shown that there are\nnon-trivial correlations between particular sounds and meanings across\nlanguages and demographic groups, a phenomenon known as sound symbolism. Among\nthe many dimensions of meaning, sound symbolism is particularly salient and\nwell-demonstrated with regards to cross-modal associations between language and\nthe visual domain. In this work, we address the question of whether sound\nsymbolism is reflected in vision-and-language models such as CLIP and Stable\nDiffusion. Using zero-shot knowledge probing to investigate the inherent\nknowledge of these models, we find strong evidence that they do show this\npattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our\nwork provides a novel method for demonstrating sound symbolism and\nunderstanding its nature using computational tools. 
Our code will be made\npublicly available.\n","authors":["Morris Alper","Hadar Averbuch-Elor"],"pdf_url":"https://arxiv.org/pdf/2310.16781v1.pdf","comment":"Accepted to NeurIPS 2023 (spotlight). Project webpage:\n https://kiki-bouba.github.io/"},{"id":"http://arxiv.org/abs/2310.16777v1","updated":"2023-10-25T17:10:37Z","published":"2023-10-25T17:10:37Z","title":"MixerFlow for Image Modelling","summary":" Normalising flows are statistical models that transform a complex density\ninto a simpler density through the use of bijective transformations enabling\nboth density estimation and data generation from a single model. In the context\nof image modelling, the predominant choice has been the Glow-based\narchitecture, whereas alternative architectures remain largely unexplored in\nthe research community. In this work, we propose a novel architecture called\nMixerFlow, based on the MLP-Mixer architecture, further unifying the generative\nand discriminative modelling architectures. MixerFlow offers an effective\nmechanism for weight sharing for flow-based models. Our results demonstrate\nbetter density estimation on image datasets under a fixed computational budget\nand scales well as the image resolution increases, making MixerFlow a powerful\nyet simple alternative to the Glow-based architectures. We also show that\nMixerFlow provides more informative embeddings than Glow-based architectures.\n","authors":["Eshant English","Matthias Kirchler","Christoph Lippert"],"pdf_url":"https://arxiv.org/pdf/2310.16777v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16764v1","updated":"2023-10-25T16:52:13Z","published":"2023-10-25T16:52:13Z","title":"ConvNets Match Vision Transformers at Scale","summary":" Many researchers believe that ConvNets perform well on small or moderately\nsized datasets, but are not competitive with Vision Transformers when given\naccess to datasets on the web-scale. 
We challenge this belief by evaluating a\nperformant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset\nof images often used for training foundation models. We consider pre-training\ncompute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a\nseries of networks of increasing depth and width from the NFNet model family.\nWe observe a log-log scaling law between held out loss and compute budget.\nAfter fine-tuning on ImageNet, NFNets match the reported performance of Vision\nTransformers with comparable compute budgets. Our strongest fine-tuned model\nachieves a Top-1 accuracy of 90.4%.\n","authors":["Samuel L. Smith","Andrew Brock","Leonard Berrada","Soham De"],"pdf_url":"https://arxiv.org/pdf/2310.16764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16044v2","updated":"2023-10-25T16:40:59Z","published":"2023-10-24T17:57:58Z","title":"Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark","summary":" We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering\nBenchmark. Recent advances in inverse rendering have enabled a wide range of\nreal-world applications in 3D content generation, moving rapidly from research\nand commercial use cases to consumer devices. While the results continue to\nimprove, there is no real-world benchmark that can quantitatively assess and\ncompare the performance of various inverse rendering methods. Existing\nreal-world datasets typically only consist of the shape and multi-view images\nof objects, which are not sufficient for evaluating the quality of material\nrecovery and object relighting. Methods capable of recovering material and\nlighting often resort to synthetic data for quantitative evaluation, which on\nthe other hand does not guarantee generalization to complex real-world\nenvironments. We introduce a new dataset of real-world objects captured under a\nvariety of natural scenes with ground-truth 3D scans, multi-view images, and\nenvironment lighting. 
Using this dataset, we establish the first comprehensive\nreal-world evaluation benchmark for object inverse rendering tasks from\nin-the-wild scenes, and compare the performance of various existing methods.\n","authors":["Zhengfei Kuang","Yunzhi Zhang","Hong-Xing Yu","Samir Agarwala","Shangzhe Wu","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2310.16044v2.pdf","comment":"NeurIPS 2023 Datasets and Benchmarks Track. The first two authors\n contributed equally to this work. Project page:\n https://stanfordorb.github.io/"},{"id":"http://arxiv.org/abs/2310.16754v1","updated":"2023-10-25T16:40:09Z","published":"2023-10-25T16:40:09Z","title":"CAD -- Contextual Multi-modal Alignment for Dynamic AVQA","summary":" In the context of Audio Visual Question Answering (AVQA) tasks, the audio\nvisual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and\n3) Semantic. Existing AVQA methods suffer from two major shortcomings; the\naudio-visual (AV) information passing through the network isn't aligned on\nSpatial and Temporal levels; and, inter-modal (audio and visual) Semantic\ninformation is often not balanced within a context; this results in poor\nperformance. In this paper, we propose a novel end-to-end Contextual\nMulti-modal Alignment (CAD) network that addresses the challenges in AVQA\nmethods by i) introducing a parameter-free stochastic Contextual block that\nensures robust audio and visual alignment on the Spatial level; ii) proposing a\npre-training technique for dynamic audio and visual alignment on Temporal level\nin a self-supervised setting, and iii) introducing a cross-attention mechanism\nto balance audio and visual information on Semantic level. The proposed novel\nCAD network improves the overall performance over the state-of-the-art methods\non average by 9.4% on the MUSIC-AVQA dataset. 
We also demonstrate that our\nproposed contributions to AVQA can be added to the existing methods to improve\ntheir performance without additional complexity requirements.\n","authors":["Asmar Nadeem","Adrian Hilton","Robert Dawes","Graham Thomas","Armin Mustafa"],"pdf_url":"https://arxiv.org/pdf/2310.16754v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16750v1","updated":"2023-10-25T16:32:31Z","published":"2023-10-25T16:32:31Z","title":"Metrically Scaled Monocular Depth Estimation through Sparse Priors for\n Underwater Robots","summary":" In this work, we address the problem of real-time dense depth estimation from\nmonocular images for mobile underwater vehicles. We formulate a deep learning\nmodel that fuses sparse depth measurements from triangulated features to\nimprove the depth predictions and solve the problem of scale ambiguity. To\nallow prior inputs of arbitrary sparsity, we apply a dense parameterization\nmethod. Our model extends recent state-of-the-art approaches to monocular image\nbased depth estimation, using an efficient encoder-decoder backbone and modern\nlightweight transformer optimization stage to encode global context. The\nnetwork is trained in a supervised fashion on the forward-looking underwater\ndataset, FLSea. Evaluation results on this dataset demonstrate significant\nimprovement in depth prediction accuracy by the fusion of the sparse feature\npriors. In addition, without any retraining, our method achieves similar depth\nprediction accuracy on a downward looking dataset we collected with a diver\noperated camera rig, conducting a survey of a coral reef. The method achieves\nreal-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single\nCPU core and is suitable for direct deployment on embedded systems. 
The\nimplementation of this work is made publicly available at\nhttps://github.com/ebnerluca/uw_depth.\n","authors":["Luca Ebner","Gideon Billings","Stefan Williams"],"pdf_url":"https://arxiv.org/pdf/2310.16750v1.pdf","comment":"Submitted to ICRA 2024"},{"id":"http://arxiv.org/abs/2310.16742v1","updated":"2023-10-25T16:17:47Z","published":"2023-10-25T16:17:47Z","title":"Interferometric Neural Networks","summary":" On the one hand, artificial neural networks have many successful applications\nin the field of machine learning and optimization. On the other hand,\ninterferometers are integral parts of any field that deals with waves such as\noptics, astronomy, and quantum physics. Here, we introduce neural networks\ncomposed of interferometers and then build generative adversarial networks from\nthem. Our networks do not have any classical layer and can be realized on\nquantum computers or photonic chips. We demonstrate their applicability for\ncombinatorial optimization, image classification, and image generation. For\ncombinatorial optimization, our network consistently converges to the global\noptimum or remains within a narrow range of it. In multi-class image\nclassification tasks, our networks achieve accuracies of 93% and 83%. Lastly,\nwe show their capability to generate images of digits from 0 to 9 as well as\nhuman faces.\n","authors":["Arun Sehrawat"],"pdf_url":"https://arxiv.org/pdf/2310.16742v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2310.16732v1","updated":"2023-10-25T16:01:05Z","published":"2023-10-25T16:01:05Z","title":"A No-Reference Quality Assessment Method for Digital Human Head","summary":" In recent years, digital humans have been widely applied in augmented/virtual\nreality (A/VR), where viewers are allowed to freely observe and interact with\nthe volumetric content. 
However, the digital humans may be degraded with\nvarious distortions during the procedure of generation and transmission.\nMoreover, little effort has been put into the perceptual quality assessment of\ndigital humans. Therefore, it is urgent to carry out objective quality\nassessment methods to tackle the challenge of digital human quality assessment\n(DHQA). In this paper, we develop a novel no-reference (NR) method based on\nTransformer to deal with DHQA in a multi-task manner. Specifically, the front\n2D projections of the digital humans are rendered as inputs and the vision\ntransformer (ViT) is employed for the feature extraction. Then we design a\nmulti-task module to jointly classify the distortion types and predict the\nperceptual quality levels of digital humans. The experimental results show that\nthe proposed method well correlates with the subjective ratings and outperforms\nthe state-of-the-art quality assessment methods.\n","authors":["Yingjie Zhou","Zicheng Zhang","Wei Sun","Xiongkuo Min","Xianghe Ma","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2310.16732v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16717v1","updated":"2023-10-25T15:44:50Z","published":"2023-10-25T15:44:50Z","title":"Rebuild City Buildings from Off-Nadir Aerial Images with Offset-Building\n Model (OBM)","summary":" Accurate measurement of the offset from roof-to-footprint in\nvery-high-resolution remote sensing imagery is crucial for urban information\nextraction tasks. With the help of deep learning, existing methods typically\nrely on two-stage CNN models to extract regions of interest on building feature\nmaps. At the first stage, a Region Proposal Network (RPN) is applied to extract\nthousands of ROIs (Regions of Interest), which are then passed into a\nRegion-based Convolutional Neural Network (RCNN) to extract the wanted\ninformation. 
However, because of inflexible RPN, these methods often lack\neffective user interaction, encounter difficulties in instance correspondence,\nand struggle to keep up with the advancements in general artificial\nintelligence. This paper introduces an interactive Transformer model combined\nwith a prompt encoder to precisely extract building segmentation as well as the\noffset vectors from roofs to footprints. In our model, a powerful module,\nnamely ROAM, was tailored for common problems in predicting roof-to-footprint\noffsets. We tested our model's feasibility on the publicly available BONAI\ndataset, achieving a significant reduction in Prompt-Instance-Level offset\nerrors ranging from 14.6% to 16.3%. Additionally, we developed a Distance-NMS\nalgorithm tailored for large-scale building offsets, significantly enhancing\nthe accuracy of predicted building offset angles and lengths in a\nstraightforward and efficient manner. To further validate the model's\nrobustness, we created a new test set using 0.5m remote sensing imagery from\nHuizhou, China, for inference testing. Our code, training methods, and the\nupdated dataset will be accessible at https://github.com/likaiucas.\n","authors":["Kai Li","Yupeng Deng","Yunlong Kong","Diyou Liu","Jingbo Chen","Yu Meng","Junxian Ma"],"pdf_url":"https://arxiv.org/pdf/2310.16717v1.pdf","comment":"24 pages, 9 figures"},{"id":"http://arxiv.org/abs/2305.11772v2","updated":"2023-10-25T15:34:16Z","published":"2023-05-19T15:56:06Z","title":"Neural Foundations of Mental Simulation: Future Prediction of Latent\n Representations on Dynamic Scenes","summary":" Humans and animals have a rich and flexible understanding of the physical\nworld, which enables them to infer the underlying dynamical trajectories of\nobjects and events, plausible future states, and use that to plan and\nanticipate the consequences of actions. However, the neural mechanisms\nunderlying these computations are unclear. 
We combine a goal-driven modeling\napproach with dense neurophysiological data and high-throughput human\nbehavioral readouts to directly impinge on this question. Specifically, we\nconstruct and evaluate several classes of sensory-cognitive networks to predict\nthe future state of rich, ethologically-relevant environments, ranging from\nself-supervised end-to-end models with pixel-wise or object-centric objectives,\nto models that future predict in the latent space of purely static image-based\nor dynamic video-based pretrained foundation models. We find strong\ndifferentiation across these model classes in their ability to predict neural\nand behavioral data both within and across diverse environments. In particular,\nwe find that neural responses are currently best predicted by models trained to\npredict the future state of their environment in the latent space of pretrained\nfoundation models optimized for dynamic scenes in a self-supervised manner.\nNotably, models that future predict in the latent space of video foundation\nmodels that are optimized to support a diverse range of sensorimotor tasks,\nreasonably match both human behavioral error patterns and neural dynamics\nacross all environmental scenarios that we were able to test. 
Overall, these\nfindings suggest that the neural mechanisms and behaviors of primate mental\nsimulation are thus far most consistent with being optimized to future predict\non dynamic, reusable visual representations that are useful for Embodied AI\nmore generally.\n","authors":["Aran Nayebi","Rishi Rajalingham","Mehrdad Jazayeri","Guangyu Robert Yang"],"pdf_url":"https://arxiv.org/pdf/2305.11772v2.pdf","comment":"20 pages, 10 figures, NeurIPS 2023 Camera Ready Version (spotlight)"},{"id":"http://arxiv.org/abs/2310.16706v1","updated":"2023-10-25T15:23:33Z","published":"2023-10-25T15:23:33Z","title":"Nighttime Driver Behavior Prediction Using Taillight Signal Recognition\n via CNN-SVM Classifier","summary":" This paper aims to enhance the ability to predict nighttime driving behavior\nby identifying taillights of both human-driven and autonomous vehicles. The\nproposed model incorporates a customized detector designed to accurately detect\nfront-vehicle taillights on the road. At the beginning of the detector, a\nlearnable pre-processing block is implemented, which extracts deep features\nfrom input images and calculates the data rarity for each feature. In the next\nstep, drawing inspiration from soft attention, a weighted binary mask is\ndesigned that guides the model to focus more on predetermined regions. This\nresearch utilizes Convolutional Neural Networks (CNNs) to extract\ndistinguishing characteristics from these areas, then reduces dimensions using\nPrincipal Component Analysis (PCA). Finally, the Support Vector Machine (SVM)\nis used to predict the behavior of the vehicles. To train and evaluate the\nmodel, a large-scale dataset is collected from two types of dash-cams and\nInsta360 cameras from the rear view of Ford Motor Company vehicles. This\ndataset includes over 12k frames captured during both daytime and nighttime\nhours. 
To address the limited nighttime data, a unique pixel-wise image\nprocessing technique is implemented to convert daytime images into realistic\nnight images. The findings from the experiments demonstrate that the proposed\nmethodology can accurately categorize vehicle behavior with 92.14% accuracy,\n97.38% specificity, 92.09% sensitivity, 92.10% F1-measure, and 0.895 Cohen's\nKappa Statistic. Further details are available at\nhttps://github.com/DeepCar/Taillight_Recognition.\n","authors":["Amir Hossein Barshooi","Elmira Bagheri"],"pdf_url":"https://arxiv.org/pdf/2310.16706v1.pdf","comment":"12 pages, 10 figures"},{"id":"http://arxiv.org/abs/2310.15952v2","updated":"2023-10-25T15:11:57Z","published":"2023-10-24T15:53:07Z","title":"Improving Robustness and Reliability in Medical Image Classification\n with Latent-Guided Diffusion and Nested-Ensembles","summary":" While deep learning models have achieved remarkable success across a range of\nmedical image analysis tasks, deployment of these models in real clinical\ncontexts requires that they be robust to variability in the acquired images.\nWhile many methods apply predefined transformations to augment the training\ndata to enhance test-time robustness, these transformations may not ensure the\nmodel's robustness to the diverse variability seen in patient images. In this\npaper, we introduce a novel three-stage approach based on transformers coupled\nwith conditional diffusion models, with the goal of improving model robustness\nto the kinds of imaging variability commonly encountered in practice without\nthe need for pre-determined data augmentation strategies. To this end, multiple\nimage encoders first learn hierarchical feature representations to build\ndiscriminative latent spaces. Next, a reverse diffusion process, guided by the\nlatent code, acts on an informative prior and proposes prediction candidates in\na generative manner. 
Finally, several prediction candidates are aggregated in a\nbi-level aggregation protocol to produce the final output. Through extensive\nexperiments on medical imaging benchmark datasets, we show that our method\nimproves upon state-of-the-art methods in terms of robustness and confidence\ncalibration. Additionally, we introduce a strategy to quantify the prediction\nuncertainty at the instance level, increasing their trustworthiness to\nclinicians using them in clinical practice.\n","authors":["Xing Shen","Hengguan Huang","Brennan Nichyporuk","Tal Arbel"],"pdf_url":"https://arxiv.org/pdf/2310.15952v2.pdf","comment":"13 pages, 6 figures, 7 tables"},{"id":"http://arxiv.org/abs/2310.16695v1","updated":"2023-10-25T15:06:32Z","published":"2023-10-25T15:06:32Z","title":"From Pointwise to Powerhouse: Initialising Neural Networks with\n Generative Models","summary":" Traditional initialisation methods, e.g. He and Xavier, have been effective\nin avoiding the problem of vanishing or exploding gradients in neural networks.\nHowever, they only use simple pointwise distributions, which model\none-dimensional variables. Moreover, they ignore most information about the\narchitecture and disregard past training experiences. These limitations can be\novercome by employing generative models for initialisation. In this paper, we\nintroduce two groups of new initialisation methods. First, we locally\ninitialise weight groups by employing variational autoencoders. Secondly, we\nglobally initialise full weight sets by employing graph hypernetworks. We\nthoroughly evaluate the impact of the employed generative models on\nstate-of-the-art neural networks in terms of accuracy, convergence speed and\nensembling. Our results show that global initialisations result in higher\naccuracy and faster initial convergence speed. However, the implementation\nthrough graph hypernetworks leads to diminished ensemble performance on out of\ndistribution data. 
To counteract, we propose a modification called noise graph\nhypernetwork, which encourages diversity in the produced ensemble members.\nFurthermore, our approach might be able to transfer learned knowledge to\ndifferent image distributions. Our work provides insights into the potential,\nthe trade-offs and possible modifications of these new initialisation methods.\n","authors":["Christian Harder","Moritz Fuchs","Yuri Tolkach","Anirban Mukhopadhyay"],"pdf_url":"https://arxiv.org/pdf/2310.16695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16694v1","updated":"2023-10-25T15:04:57Z","published":"2023-10-25T15:04:57Z","title":"DSAM-GN:Graph Network based on Dynamic Similarity Adjacency Matrices for\n Vehicle Re-identification","summary":" In recent years, vehicle re-identification (Re-ID) has gained increasing\nimportance in various applications such as assisted driving systems, traffic\nflow management, and vehicle tracking, due to the growth of intelligent\ntransportation systems. However, the presence of extraneous background\ninformation and occlusions can interfere with the learning of discriminative\nfeatures, leading to significant variations in the same vehicle image across\ndifferent scenarios. This paper proposes a method, named graph network based on\ndynamic similarity adjacency matrices (DSAM-GN), which incorporates a novel\napproach for constructing adjacency matrices to capture spatial relationships\nof local features and reduce background noise. Specifically, the proposed\nmethod divides the extracted vehicle features into different patches as nodes\nwithin the graph network. A spatial attention-based similarity adjacency matrix\ngeneration (SASAMG) module is employed to compute similarity matrices of nodes,\nand a dynamic erasure operation is applied to disconnect nodes with low\nsimilarity, resulting in similarity adjacency matrices. 
Finally, the nodes and\nsimilarity adjacency matrices are fed into graph networks to extract more\ndiscriminative features for vehicle Re-ID. Experimental results on public\ndatasets VeRi-776 and VehicleID demonstrate the effectiveness of the proposed\nmethod compared with recent works.\n","authors":["Yuejun Jiao","Song Qiu","Mingsong Chen","Dingding Han","Qingli Li","Yue Lu"],"pdf_url":"https://arxiv.org/pdf/2310.16694v1.pdf","comment":"This paper has been accepted by the 20th Pacific Rim International\n Conference on Artificial Intelligence in 2023"},{"id":"http://arxiv.org/abs/2305.14095v2","updated":"2023-10-25T14:49:23Z","published":"2023-05-23T14:18:11Z","title":"S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist\n Captions","summary":" Vision-language models, such as contrastive language-image pre-training\n(CLIP), have demonstrated impressive results in natural image domains. However,\nthese models often struggle when applied to specialized domains like remote\nsensing, and adapting to such domains is challenging due to the limited number\nof image-text pairs available for training. To address this, we propose S-CLIP,\na semi-supervised learning method for training CLIP that utilizes additional\nunpaired images. S-CLIP employs two pseudo-labeling strategies specifically\ndesigned for contrastive learning and the language modality. The caption-level\npseudo-label is given by a combination of captions of paired images, obtained\nby solving an optimal transport problem between unpaired and paired images. The\nkeyword-level pseudo-label is given by a keyword in the caption of the nearest\npaired image, trained through partial label learning that assumes a candidate\nset of labels for supervision instead of the exact one. 
By combining these\nobjectives, S-CLIP significantly enhances the training of CLIP using only a few\nimage-text pairs, as demonstrated in various specialist domains, including\nremote sensing, fashion, scientific figures, and comics. For instance, S-CLIP\nimproves CLIP by 10% for zero-shot classification and 4% for image-text\nretrieval on the remote sensing benchmark, matching the performance of\nsupervised CLIP while using three times fewer image-text pairs.\n","authors":["Sangwoo Mo","Minkyu Kim","Kyungmin Lee","Jinwoo Shin"],"pdf_url":"https://arxiv.org/pdf/2305.14095v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16684v1","updated":"2023-10-25T14:47:32Z","published":"2023-10-25T14:47:32Z","title":"Local Statistics for Generative Image Detection","summary":" Diffusion models (DMs) are generative models that learn to synthesize images\nfrom Gaussian noise. DMs can be trained to do a variety of tasks such as image\ngeneration and image super-resolution. Researchers have made significant\nimprovement in the capability of synthesizing photorealistic images in the past\nfew years. These successes also hasten the need to address the potential misuse\nof synthesized images. In this paper, we highlight the effectiveness of\ncomputing local statistics, as opposed to global statistics, in distinguishing\ndigital camera images from DM-generated images. We hypothesized that local\nstatistics should be used to address the spatial non-stationarity problem in\nimages. 
We show that our approach produced promising results and it is also\nrobust to various perturbations such as image resizing and JPEG compression.\n","authors":["Yung Jer Wong","Teck Khim Ng"],"pdf_url":"https://arxiv.org/pdf/2310.16684v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02674v2","updated":"2023-10-25T14:34:03Z","published":"2023-10-04T09:26:44Z","title":"Land-cover change detection using paired OpenStreetMap data and optical\n high-resolution imagery via object-guided Transformer","summary":" Optical high-resolution imagery and OpenStreetMap (OSM) data are two\nimportant data sources for land-cover change detection. Previous studies in\nthese two data sources focus on utilizing the information in OSM data to aid\nthe change detection on multi-temporal optical high-resolution images. This\npaper pioneers the direct detection of land-cover changes utilizing paired OSM\ndata and optical imagery, thereby broadening the horizons of change detection\ntasks to encompass more dynamic earth observations. To this end, we propose an\nobject-guided Transformer (ObjFormer) architecture by naturally combining the\nprevalent object-based image analysis (OBIA) technique with the advanced vision\nTransformer architecture. The introduction of OBIA can significantly reduce the\ncomputational overhead and memory burden in the self-attention module.\nSpecifically, the proposed ObjFormer has a hierarchical pseudo-siamese encoder\nconsisting of object-guided self-attention modules that extract representative\nfeatures of different levels from OSM data and optical images; a decoder\nconsisting of object-guided cross-attention modules can progressively recover\nthe land-cover changes from the extracted heterogeneous features. 
In addition\nto the basic supervised binary change detection task, this paper raises a new\nsemi-supervised semantic change detection task that does not require any\nmanually annotated land-cover labels of optical images to train semantic change\ndetectors. Two lightweight semantic decoders are added to ObjFormer to\naccomplish this task efficiently. A converse cross-entropy loss is designed to\nfully utilize the negative samples, thereby contributing to the great\nperformance improvement in this task. The first large-scale benchmark dataset\ncontaining 1,287 map-image pairs (1024$\\times$ 1024 pixels for each sample)\ncovering 40 regions on six continents ...(see the manuscript for the full\nabstract)\n","authors":["Hongruixuan Chen","Cuiling Lan","Jian Song","Clifford Broni-Bediako","Junshi Xia","Naoto Yokoya"],"pdf_url":"https://arxiv.org/pdf/2310.02674v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16667v1","updated":"2023-10-25T14:31:02Z","published":"2023-10-25T14:31:02Z","title":"CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary\n Object Detection","summary":" Deriving reliable region-word alignment from image-text pairs is critical to\nlearn object-level vision-language representations for open-vocabulary object\ndetection. Existing methods typically rely on pre-trained or self-trained\nvision-language models for alignment, which are prone to limitations in\nlocalization accuracy or generalization capabilities. In this paper, we propose\nCoDet, a novel approach that overcomes the reliance on pre-aligned\nvision-language space by reformulating region-word alignment as a co-occurring\nobject discovery problem. Intuitively, by grouping images that mention a shared\nconcept in their captions, objects corresponding to the shared concept shall\nexhibit high co-occurrence among the group. CoDet then leverages visual\nsimilarities to discover the co-occurring objects and align them with the\nshared concept. 
Extensive experiments demonstrate that CoDet has superior\nperformances and compelling scalability in open-vocabulary detection, e.g., by\nscaling up the visual backbone, CoDet achieves 37.0 $\\text{AP}^m_{novel}$ and\n44.7 $\\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2\n$\\text{AP}^m_{novel}$ and 9.8 $\\text{AP}^m_{all}$. Code is available at\nhttps://github.com/CVMI-Lab/CoDet.\n","authors":["Chuofan Ma","Yi Jiang","Xin Wen","Zehuan Yuan","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2310.16667v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15578v2","updated":"2023-10-25T14:30:39Z","published":"2023-10-24T07:42:04Z","title":"VMAF Re-implementation on PyTorch: Some Experimental Results","summary":" Based on the standard VMAF implementation we propose an implementation of\nVMAF using PyTorch framework. For this implementation comparisons with the\nstandard (libvmaf) show the discrepancy $\\lesssim 10^{-2}$ in VMAF units. We\ninvestigate gradients computation when using VMAF as an objective function and\ndemonstrate that training using this function does not result in ill-behaving\ngradients.\n","authors":["Kirill Aistov","Maxim Koroteev"],"pdf_url":"https://arxiv.org/pdf/2310.15578v2.pdf","comment":"4 pages"},{"id":"http://arxiv.org/abs/2310.16665v1","updated":"2023-10-25T14:25:18Z","published":"2023-10-25T14:25:18Z","title":"Robust Source-Free Domain Adaptation for Fundus Image Segmentation","summary":" Unsupervised Domain Adaptation (UDA) is a learning technique that transfers\nknowledge learned in the source domain from labelled training data to the\ntarget domain with only unlabelled data. 
It is of significant importance to\nmedical image segmentation because of the usual lack of labelled training data.\nAlthough extensive efforts have been made to optimize UDA techniques to improve\nthe accuracy of segmentation models in the target domain, few studies have\naddressed the robustness of these models under UDA. In this study, we propose a\ntwo-stage training strategy for robust domain adaptation. In the source\ntraining stage, we utilize adversarial sample augmentation to enhance the\nrobustness and generalization capability of the source model. And in the target\ntraining stage, we propose a novel robust pseudo-label and pseudo-boundary\n(PLPB) method, which effectively utilizes unlabeled target data to generate\npseudo labels and pseudo boundaries that enable model self-adaptation without\nrequiring source data. Extensive experimental results on cross-domain fundus\nimage segmentation confirm the effectiveness and versatility of our method.\nSource code of this study is openly accessible at\nhttps://github.com/LinGrayy/PLPB.\n","authors":["Lingrui Li","Yanfeng Zhou","Ge Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16665v1.pdf","comment":"10 pages, WACV2024"},{"id":"http://arxiv.org/abs/2310.16662v1","updated":"2023-10-25T14:23:40Z","published":"2023-10-25T14:23:40Z","title":"Deep Learning Techniques for Cervical Cancer Diagnosis based on\n Pathology and Colposcopy Images","summary":" Cervical cancer is a prevalent disease affecting millions of women worldwide\nevery year. It requires significant attention, as early detection during the\nprecancerous stage provides an opportunity for a cure. The screening and\ndiagnosis of cervical cancer rely on cytology and colposcopy methods. Deep\nlearning, a promising technology in computer vision, has emerged as a potential\nsolution to improve the accuracy and efficiency of cervical cancer screening\ncompared to traditional clinical inspection methods that are prone to human\nerror. 
This review article discusses cervical cancer and its screening\nprocesses, followed by the Deep Learning training process and the\nclassification, segmentation, and detection tasks for cervical cancer\ndiagnosis. Additionally, we explored the most common public datasets used in\nboth cytology and colposcopy and highlighted the popular and most utilized\narchitectures that researchers have applied to both cytology and colposcopy. We\nreviewed 24 selected practical papers in this study and summarized them. This\narticle highlights the remarkable efficiency in enhancing the precision and\nspeed of cervical cancer analysis by Deep Learning, bringing us closer to early\ndiagnosis and saving lives.\n","authors":["Hana Ahmadzadeh Sarhangi","Dorsa Beigifard","Elahe Farmani","Hamidreza Bolhasani"],"pdf_url":"https://arxiv.org/pdf/2310.16662v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16656v1","updated":"2023-10-25T14:10:08Z","published":"2023-10-25T14:10:08Z","title":"A Picture is Worth a Thousand Words: Principled Recaptioning Improves\n Image Generation","summary":" Text-to-image diffusion models achieved a remarkable leap in capabilities\nover the last few years, enabling high-quality and diverse synthesis of images\nfrom a textual prompt. However, even the most advanced models often struggle to\nprecisely follow all of the directions in their prompts. The vast majority of\nthese models are trained on datasets consisting of (image, caption) pairs where\nthe images often come from the web, and the captions are their HTML alternate\ntext. A notable example is the LAION dataset, used by Stable Diffusion and\nother models. In this work we observe that these captions are often of low\nquality, and argue that this significantly affects the model's capability to\nunderstand nuanced semantics in the textual prompts. 
We show that by relabeling\nthe corpus with a specialized automatic captioning model and training a\ntext-to-image model on the recaptioned dataset, the model benefits\nsubstantially across the board. First, in overall image quality: e.g. FID 14.84\nvs. the baseline of 17.87, and 64.3% improvement in faithful image generation\naccording to human evaluation. Second, in semantic alignment, e.g. semantic\nobject accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and\npositional alignment 62.42 vs. 57.60. We analyze various ways to relabel the\ncorpus and provide evidence that this technique, which we call RECAP, both\nreduces the train-inference discrepancy and provides the model with more\ninformation per example, increasing sample efficiency and allowing the model to\nbetter understand the relations between captions and images.\n","authors":["Eyal Segalis","Dani Valevski","Danny Lumen","Yossi Matias","Yaniv Leviathan"],"pdf_url":"https://arxiv.org/pdf/2310.16656v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03022v2","updated":"2023-10-25T13:51:25Z","published":"2023-06-05T16:38:48Z","title":"Interpretable Alzheimer's Disease Classification Via a Contrastive\n Diffusion Autoencoder","summary":" In visual object classification, humans often justify their choices by\ncomparing objects to prototypical examples within that class. We may therefore\nincrease the interpretability of deep learning models by imbuing them with a\nsimilar style of reasoning. In this work, we apply this principle by\nclassifying Alzheimer's Disease based on the similarity of images to training\nexamples within the latent space. We use a contrastive loss combined with a\ndiffusion autoencoder backbone, to produce a semantically meaningful latent\nspace, such that neighbouring latents have similar image-level features. 
We\nachieve a classification accuracy comparable to black box approaches on a\ndataset of 2D MRI images, whilst producing human interpretable model\nexplanations. Therefore, this work stands as a contribution to the pertinent\ndevelopment of accurate and interpretable deep learning within medical imaging.\n","authors":["Ayodeji Ijishakin","Ahmed Abdulaal","Adamos Hadjivasiliou","Sophie Martin","James Cole"],"pdf_url":"https://arxiv.org/pdf/2306.03022v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16640v1","updated":"2023-10-25T13:43:36Z","published":"2023-10-25T13:43:36Z","title":"EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression\n Recognition","summary":" Facial Expression Recognition (FER) is a crucial task in affective computing,\nbut its conventional focus on the seven basic emotions limits its applicability\nto the complex and expanding emotional spectrum. To address the issue of new\nand unseen emotions present in dynamic in-the-wild FER, we propose a novel\nvision-language model that utilises sample-level text descriptions (i.e.\ncaptions of the context, expressions or emotional cues) as natural language\nsupervision, aiming to enhance the learning of rich latent representations, for\nzero-shot classification. To test this, we evaluate using zero-shot\nclassification of the model trained on sample-level descriptions on four\npopular dynamic FER datasets. Our findings show that this approach yields\nsignificant improvements when compared to baseline methods. Specifically, for\nzero-shot video FER, we outperform CLIP by over 10\\% in terms of Weighted\nAverage Recall and 5\\% in terms of Unweighted Average Recall on several\ndatasets. Furthermore, we evaluate the representations obtained from the\nnetwork trained using sample-level descriptions on the downstream task of\nmental health symptom estimation, achieving performance comparable or superior\nto state-of-the-art methods and strong agreement with human experts. 
Namely, we\nachieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia\nsymptom severity estimation, which is comparable to human experts' agreement.\nThe code is publicly available at: https://github.com/NickyFot/EmoCLIP.\n","authors":["Niki Maria Foteinopoulou","Ioannis Patras"],"pdf_url":"https://arxiv.org/pdf/2310.16640v1.pdf","comment":"10 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.16639v1","updated":"2023-10-25T13:39:04Z","published":"2023-10-25T13:39:04Z","title":"Driving through the Concept Gridlock: Unraveling Explainability\n Bottlenecks","summary":" Concept bottleneck models have been successfully used for explainable machine\nlearning by encoding information within the model with a set of human-defined\nconcepts. In the context of human-assisted or autonomous driving,\nexplainability models can help user acceptance and understanding of decisions\nmade by the autonomous vehicle, which can be used to rationalize and explain\ndriver or vehicle behavior. We propose a new approach using concept bottlenecks\nas visual features for control command predictions and explanations of user and\nvehicle behavior. We learn a human-understandable concept layer that we use to\nexplain sequential driving scenes while learning vehicle control commands. This\napproach can then be used to determine whether a change in a preferred gap or\nsteering commands from a human (or autonomous vehicle) is led by an external\nstimulus or change in preferences. 
We achieve competitive performance to latent\nvisual features while gaining interpretability within our model setup.\n","authors":["Jessica Echterhoff","An Yan","Kyungtae Han","Amr Abdelraouf","Rohit Gupta","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2310.16639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16629v1","updated":"2023-10-25T13:27:56Z","published":"2023-10-25T13:27:56Z","title":"EdgeCalib: Multi-Frame Weighted Edge Features for Automatic Targetless\n LiDAR-Camera Calibration","summary":" In multimodal perception systems, achieving precise extrinsic calibration\nbetween LiDAR and camera is of critical importance. Previous calibration\nmethods often required specific targets or manual adjustments, making them both\nlabor-intensive and costly. Online calibration methods based on features have\nbeen proposed, but these methods encounter challenges such as imprecise feature\nextraction, unreliable cross-modality associations, and high scene-specific\nrequirements. To address this, we introduce an edge-based approach for\nautomatic online calibration of LiDAR and cameras in real-world scenarios. The\nedge features, which are prevalent in various environments, are aligned in both\nimages and point clouds to determine the extrinsic parameters. Specifically,\nstable and robust image edge features are extracted using a SAM-based method\nand the edge features extracted from the point cloud are weighted through a\nmulti-frame weighting strategy for feature filtering. Finally, accurate\nextrinsic parameters are optimized based on edge correspondence constraints. We\nconducted evaluations on both the KITTI dataset and our dataset. 
The results\nshow a state-of-the-art rotation accuracy of 0.086{\\deg} and a translation\naccuracy of 0.977 cm, outperforming existing edge-based calibration methods in\nboth precision and robustness.\n","authors":["Xingchen Li","Yifan Duan","Beibei Wang","Haojie Ren","Guoliang You","Yu Sheng","Jianmin Ji","Yanyong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.16629v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13447v2","updated":"2023-10-25T13:14:40Z","published":"2023-10-20T12:26:04Z","title":"Multiscale Superpixel Structured Difference Graph Convolutional Network\n for VL Representation","summary":" Within the multimodal field, the key to integrating vision and language lies\nin establishing a good alignment strategy. Recently, benefiting from the\nsuccess of self-supervised learning, significant progress has been made in\nmultimodal semantic representation based on pre-trained models for vision and\nlanguage. However, there is still room for improvement in visual semantic\nrepresentation. The lack of spatial semantic coherence and vulnerability to\nnoise makes it challenging for current pixel or patch-based methods to\naccurately extract complex scene boundaries. To this end, this paper develops\nsuperpixel as a comprehensive compact representation of learnable image data,\nwhich effectively reduces the number of visual primitives for subsequent\nprocessing by clustering perceptually similar pixels. To mine more precise\ntopological relations, we propose a Multiscale Difference Graph Convolutional\nNetwork (MDGCN). It parses the entire image as a fine-to-coarse hierarchical\nstructure of constituent visual patterns, and captures multiscale features by\nprogressively merging adjacent superpixels as graph nodes. Moreover, we predict\nthe differences between adjacent nodes through the graph structure,\nfacilitating key information aggregation of graph nodes to reason actual\nsemantic relations. 
Afterward, we design a multi-level fusion rule in a\nbottom-up manner to avoid understanding deviation by learning complementary\nspatial information at different regional scales. Our proposed method can be\nwell applied to multiple downstream task learning. Extensive experiments\ndemonstrate that our method is competitive with other state-of-the-art methods\nin visual reasoning. Our code will be released upon publication.\n","authors":["Siyu Zhang","Yeming Chen","Sirui Cheng","Yaoru Sun","Jun Yang","Lizhi Bai"],"pdf_url":"https://arxiv.org/pdf/2310.13447v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16618v1","updated":"2023-10-25T13:14:12Z","published":"2023-10-25T13:14:12Z","title":"Real-time 6-DoF Pose Estimation by an Event-based Camera using Active\n LED Markers","summary":" Real-time applications for autonomous operations depend largely on fast and\nrobust vision-based localization systems. Since image processing tasks require\nprocessing large amounts of data, the computational resources often limit the\nperformance of other processes. To overcome this limitation, traditional\nmarker-based localization systems are widely used since they are easy to\nintegrate and achieve reliable accuracy. However, classical marker-based\nlocalization systems significantly depend on standard cameras with low frame\nrates, which often lack accuracy due to motion blur. In contrast, event-based\ncameras provide high temporal resolution and a high dynamic range, which can be\nutilized for fast localization tasks, even under challenging visual conditions.\nThis paper proposes a simple but effective event-based pose estimation system\nusing active LED markers (ALM) for fast and accurate pose estimation. 
The\nproposed algorithm is able to operate in real time with a latency below\n\\SI{0.5}{\\milli\\second} while maintaining output rates of \\SI{3}{\\kilo \\hertz}.\nExperimental results in static and dynamic scenarios are presented to\ndemonstrate the performance of the proposed approach in terms of computational\nspeed and absolute accuracy, using the OptiTrack system as the basis for\nmeasurement.\n","authors":["Gerald Ebmer","Adam Loch","Minh Nhat Vu","Germain Haessig","Roberto Mecca","Markus Vincze","Christian Hartl-Nesic","Andreas Kugi"],"pdf_url":"https://arxiv.org/pdf/2310.16618v1.pdf","comment":"14 pages, 12 figures, this paper has been accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.16616v1","updated":"2023-10-25T13:12:39Z","published":"2023-10-25T13:12:39Z","title":"Context Does Matter: End-to-end Panoptic Narrative Grounding with\n Deformable Attention Refined Matching Network","summary":" Panoramic Narrative Grounding (PNG) is an emerging visual grounding task that\naims to segment visual objects in images based on dense narrative captions. The\ncurrent state-of-the-art methods first refine the representation of phrase by\naggregating the most similar $k$ image pixels, and then match the refined text\nrepresentations with the pixels of the image feature map to generate\nsegmentation results. However, simply aggregating sampled image features\nignores the contextual information, which can lead to phrase-to-pixel\nmis-match. In this paper, we propose a novel learning framework called\nDeformable Attention Refined Matching Network (DRMN), whose main idea is to\nbring deformable attention in the iterative process of feature learning to\nincorporate essential context information of different scales of pixels. DRMN\niteratively re-encodes pixels with the deformable attention network after\nupdating the feature representation of the top-$k$ most similar pixels. 
As\nsuch, DRMN can lead to accurate yet discriminative pixel representations,\npurify the top-$k$ most similar pixels, and consequently alleviate the\nphrase-to-pixel mis-match substantially. Experimental results show that our\nnovel design significantly improves the matching results between text phrases\nand image pixels. Concretely, DRMN achieves new state-of-the-art performance on\nthe PNG benchmark with an average recall improvement of 3.5%. The code is\navailable at: https://github.com/JaMesLiMers/DRMN.\n","authors":["Yiming Lin","Xiao-Bo Jin","Qiufeng Wang","Kaizhu Huang"],"pdf_url":"https://arxiv.org/pdf/2310.16616v1.pdf","comment":"Accepted by ICDM 2023"},{"id":"http://arxiv.org/abs/2307.10455v2","updated":"2023-10-25T13:06:01Z","published":"2023-07-19T20:54:08Z","title":"A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect\n Dataset","summary":" In an effort to catalog insect biodiversity, we propose a new large dataset\nof hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is\ntaxonomically classified by an expert, and also has associated genetic\ninformation including raw nucleotide barcode sequences and assigned barcode\nindex numbers, which are genetically-based proxies for species classification.\nThis paper presents a curated million-image dataset, primarily to train\ncomputer-vision models capable of providing image-based taxonomic assessment;\nhowever, the dataset also presents compelling characteristics, the study of\nwhich would be of interest to the broader machine learning community. Driven by\nthe biological nature inherent to the dataset, a characteristic long-tailed\nclass-imbalance distribution is exhibited. Furthermore, taxonomic labelling is\na hierarchical classification scheme, presenting a highly fine-grained\nclassification problem at lower levels. 
Beyond spurring interest in\nbiodiversity research within the machine learning community, progress on\ncreating an image-based taxonomic classifier will also further the ultimate\ngoal of all BIOSCAN research: to lay the foundation for a comprehensive survey\nof global biodiversity. This paper introduces the dataset and explores the\nclassification task through the implementation and analysis of a baseline\nclassifier.\n","authors":["Zahra Gharaee","ZeMing Gong","Nicholas Pellegrino","Iuliia Zarubiieva","Joakim Bruslund Haurum","Scott C. Lowe","Jaclyn T. A. McKeown","Chris C. Y. Ho","Joschka McLeod","Yi-Yun C Wei","Jireh Agda","Sujeevan Ratnasingham","Dirk Steinke","Angel X. Chang","Graham W. Taylor","Paul Fieguth"],"pdf_url":"https://arxiv.org/pdf/2307.10455v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.02814v4","updated":"2023-10-25T12:44:30Z","published":"2021-12-06T06:56:00Z","title":"A Survey of Deep Learning for Low-Shot Object Detection","summary":" Object detection has achieved a huge breakthrough with deep neural networks\nand massive annotated data. However, current detection methods cannot be\ndirectly transferred to the scenario where the annotated data is scarce due to\nthe severe overfitting problem. Although few-shot learning and zero-shot\nlearning have been extensively explored in the field of image classification,\nit is indispensable to design new methods for object detection in the\ndata-scarce scenario since object detection has an additional challenging\nlocalization task. Low-Shot Object Detection (LSOD) is an emerging research\ntopic of detecting objects from a few or even no annotated samples, consisting\nof One-Shot Object Detection (OSOD), Few-Shot Object Detection (FSOD) and\nZero-Shot Object Detection (ZSD). This survey provides a comprehensive review\nof LSOD methods. 
First, we propose a thorough taxonomy of LSOD methods and\nanalyze them systematically, comprising some extensional topics of LSOD\n(semi-supervised LSOD, weakly-supervised LSOD, and incremental LSOD). Then, we\nindicate the pros and cons of current LSOD methods with a comparison of their\nperformance. Finally, we discuss the challenges and promising directions of\nLSOD to provide guidance for future works.\n","authors":["Qihan Huang","Haofei Zhang","Mengqi Xue","Jie Song","Mingli Song"],"pdf_url":"https://arxiv.org/pdf/2112.02814v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.05946v3","updated":"2023-10-25T12:27:47Z","published":"2022-12-12T14:59:11Z","title":"Evaluation and Improvement of Interpretability for Self-Explainable\n Part-Prototype Networks","summary":" Part-prototype networks (e.g., ProtoPNet, ProtoTree, and ProtoPool) have\nattracted broad research interest for their intrinsic interpretability and\ncomparable accuracy to non-interpretable counterparts. However, recent works\nfind that the interpretability from prototypes is fragile, due to the semantic\ngap between the similarities in the feature space and that in the input space.\nIn this work, we strive to address this challenge by making the first attempt\nto quantitatively and objectively evaluate the interpretability of the\npart-prototype networks. Specifically, we propose two evaluation metrics,\ntermed as consistency score and stability score, to evaluate the explanation\nconsistency across images and the explanation robustness against perturbations,\nrespectively, both of which are essential for explanations taken into practice.\nFurthermore, we propose an elaborated part-prototype network with a\nshallow-deep feature alignment (SDFA) module and a score aggregation (SA)\nmodule to improve the interpretability of prototypes. We conduct systematical\nevaluation experiments and provide substantial discussions to uncover the\ninterpretability of existing part-prototype networks. 
Experiments on three\nbenchmarks across nine architectures demonstrate that our model achieves\nsignificantly superior performance to the state of the art, in both the\naccuracy and interpretability. Our code is available at\nhttps://github.com/hqhQAQ/EvalProtoPNet.\n","authors":["Qihan Huang","Mengqi Xue","Wenqi Huang","Haofei Zhang","Jie Song","Yongcheng Jing","Mingli Song"],"pdf_url":"https://arxiv.org/pdf/2212.05946v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16590v1","updated":"2023-10-25T12:25:53Z","published":"2023-10-25T12:25:53Z","title":"$\\mathbb{VD}$-$\\mathbb{GR}$: Boosting $\\mathbb{V}$isual\n $\\mathbb{D}$ialog with Cascaded Spatial-Temporal Multi-Modal\n $\\mathbb{GR}$aphs","summary":" We propose $\\mathbb{VD}$-$\\mathbb{GR}$ - a novel visual dialog model that\ncombines pre-trained language models (LMs) with graph neural networks (GNNs).\nPrior works mainly focused on one class of models at the expense of the other,\nthus missing out on the opportunity of combining their respective benefits. At\nthe core of $\\mathbb{VD}$-$\\mathbb{GR}$ is a novel integration mechanism that\nalternates between spatial-temporal multi-modal GNNs and BERT layers, and that\ncovers three distinct contributions: First, we use multi-modal GNNs to process\nthe features of each modality (image, question, and dialog history) and exploit\ntheir local structures before performing BERT global attention. Second, we\npropose hub-nodes that link to all other nodes within one modality graph,\nallowing the model to propagate information from one GNN (modality) to the\nother in a cascaded manner. Third, we augment the BERT hidden states with\nfine-grained multi-modal GNN features before passing them to the next\n$\\mathbb{VD}$-$\\mathbb{GR}$ layer. 
Evaluations on VisDial v1.0, VisDial v0.9,\nVisDialConv, and VisPro show that $\\mathbb{VD}$-$\\mathbb{GR}$ achieves new\nstate-of-the-art results across all four datasets.\n","authors":["Adnen Abdessaied","Lei Shi","Andreas Bulling"],"pdf_url":"https://arxiv.org/pdf/2310.16590v1.pdf","comment":"WACV 2024"},{"id":"http://arxiv.org/abs/2310.16587v1","updated":"2023-10-25T12:22:18Z","published":"2023-10-25T12:22:18Z","title":"Adaptive Uncertainty Estimation via High-Dimensional Testing on Latent\n Representations","summary":" Uncertainty estimation aims to evaluate the confidence of a trained deep\nneural network. However, existing uncertainty estimation approaches rely on\nlow-dimensional distributional assumptions and thus suffer from the high\ndimensionality of latent features. Existing approaches tend to focus on\nuncertainty on discrete classification probabilities, which leads to poor\ngeneralizability to uncertainty estimation for other tasks. Moreover, most of\nthe literature requires seeing the out-of-distribution (OOD) data in the\ntraining for better estimation of uncertainty, which limits the uncertainty\nestimation performance in practice because the OOD data are typically unseen.\nTo overcome these limitations, we propose a new framework using data-adaptive\nhigh-dimensional hypothesis testing for uncertainty estimation, which leverages\nthe statistical properties of the feature representations. Our method directly\noperates on latent representations and thus does not require retraining the\nfeature encoder under a modified objective. The test statistic relaxes the\nfeature distribution assumptions to high dimensionality, and it is more\ndiscriminative to uncertainties in the latent representations. We demonstrate\nthat encoding features with Bayesian neural networks can enhance testing\nperformance and lead to more accurate uncertainty estimation. 
We further\nintroduce a family-wise testing procedure to determine the optimal threshold of\nOOD detection, which minimizes the false discovery rate (FDR). Extensive\nexperiments validate the satisfactory performance of our framework on\nuncertainty estimation and task-specific prediction over a variety of\ncompetitors. The experiments on the OOD detection task also show satisfactory\nperformance of our method when the OOD data are unseen in the training. Codes\nare available at https://github.com/HKU-MedAI/bnn_uncertainty.\n","authors":["Tsai Hor Chan","Kin Wai Lau","Jiajun Shen","Guosheng Yin","Lequan Yu"],"pdf_url":"https://arxiv.org/pdf/2310.16587v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16584v1","updated":"2023-10-25T12:18:00Z","published":"2023-10-25T12:18:00Z","title":"Learning to Explain: A Model-Agnostic Framework for Explaining Black Box\n Models","summary":" We present Learning to Explain (LTX), a model-agnostic framework designed for\nproviding post-hoc explanations for vision models. The LTX framework introduces\nan \"explainer\" model that generates explanation maps, highlighting the crucial\nregions that justify the predictions made by the model being explained. To\ntrain the explainer, we employ a two-stage process consisting of initial\npretraining followed by per-instance finetuning. During both stages of\ntraining, we utilize a unique configuration where we compare the explained\nmodel's prediction for a masked input with its original prediction for the\nunmasked input. This approach enables the use of a novel counterfactual\nobjective, which aims to anticipate the model's output using masked versions of\nthe input image. Importantly, the LTX framework is not restricted to a specific\nmodel architecture and can provide explanations for both Transformer-based and\nconvolutional models. 
Through our evaluations, we demonstrate that LTX\nsignificantly outperforms the current state-of-the-art in explainability across\nvarious metrics.\n","authors":["Oren Barkan","Yuval Asher","Amit Eshel","Yehonatan Elisha","Noam Koenigstein"],"pdf_url":"https://arxiv.org/pdf/2310.16584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13384v2","updated":"2023-10-25T11:58:40Z","published":"2023-06-23T09:10:41Z","title":"DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch\n Diffusion in Histopathology","summary":" We present DiffInfinite, a hierarchical diffusion model that generates\narbitrarily large histological images while preserving long-range correlation\nstructural information. Our approach first generates synthetic segmentation\nmasks, subsequently used as conditions for the high-fidelity generative\ndiffusion process. The proposed sampling method can be scaled up to any desired\nimage size while only requiring small patches for fast training. Moreover, it\ncan be parallelized more efficiently than previous large-content generation\nmethods while avoiding tiling artifacts. The training leverages classifier-free\nguidance to augment a small, sparsely annotated dataset with unlabelled data.\nOur method alleviates unique challenges in histopathological imaging practice:\nlarge-scale information, costly manual annotation, and protective data\nhandling. The biological plausibility of DiffInfinite data is evaluated in a\nsurvey by ten experienced pathologists as well as a downstream classification\nand segmentation task. 
Samples from the model score strongly on anti-copying\nmetrics which is relevant for the protection of patient data.\n","authors":["Marco Aversa","Gabriel Nobis","Miriam Hägele","Kai Standvoss","Mihaela Chirica","Roderick Murray-Smith","Ahmed Alaa","Lukas Ruff","Daniela Ivanova","Wojciech Samek","Frederick Klauschen","Bruno Sanguinetti","Luis Oala"],"pdf_url":"https://arxiv.org/pdf/2306.13384v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16573v1","updated":"2023-10-25T11:58:14Z","published":"2023-10-25T11:58:14Z","title":"Adapt Anything: Tailor Any Image Classifiers across Domains And\n Categories Using Text-to-Image Diffusion Models","summary":" We do not pursue a novel method in this paper, but aim to study if a modern\ntext-to-image diffusion model can tailor any task-adaptive image classifier\nacross domains and categories. Existing domain adaptive image classification\nworks exploit both source and target data for domain alignment so as to\ntransfer the knowledge learned from the labeled source data to the unlabeled\ntarget data. However, as the development of the text-to-image diffusion model,\nwe wonder if the high-fidelity synthetic data from the text-to-image generator\ncan serve as a surrogate of the source data in real world. In this way, we do\nnot need to collect and annotate the source data for each domain adaptation\ntask in a one-for-one manner. Instead, we utilize only one off-the-shelf\ntext-to-image model to synthesize images with category labels derived from the\ncorresponding text prompts, and then leverage the surrogate data as a bridge to\ntransfer the knowledge embedded in the task-agnostic text-to-image generator to\nthe task-oriented image classifier via domain adaptation. 
Such a one-for-all\nadaptation paradigm allows us to adapt anything in the world using only one\ntext-to-image generator as well as the corresponding unlabeled target data.\nExtensive experiments validate the feasibility of the proposed idea, which even\nsurpasses the state-of-the-art domain adaptation works using the source data\ncollected and annotated in real world.\n","authors":["Weijie Chen","Haoyu Wang","Shicai Yang","Lei Zhang","Wei Wei","Yanning Zhang","Luojun Lin","Di Xie","Yueting Zhuang"],"pdf_url":"https://arxiv.org/pdf/2310.16573v1.pdf","comment":"11 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.16569v1","updated":"2023-10-25T11:54:21Z","published":"2023-10-25T11:54:21Z","title":"Flow-Attention-based Spatio-Temporal Aggregation Network for 3D Mask\n Detection","summary":" Anti-spoofing detection has become a necessity for face recognition systems\ndue to the security threat posed by spoofing attacks. Despite great success in\ntraditional attacks, most deep-learning-based methods perform poorly in 3D\nmasks, which can highly simulate real faces in appearance and structure,\nsuffering generalizability insufficiency while focusing only on the spatial\ndomain with single frame input. This has been mitigated by the recent\nintroduction of a biomedical technology called rPPG (remote\nphotoplethysmography). However, rPPG-based methods are sensitive to noisy\ninterference and require at least one second (> 25 frames) of observation time,\nwhich induces high computational overhead. To address these challenges, we\npropose a novel 3D mask detection framework, called FASTEN\n(Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the\nnetwork for focusing more on fine-grained details in large movements, which can\neliminate redundant spatio-temporal feature interference and quickly capture\nsplicing traces of 3D masks in fewer frames. 
Our proposed network contains\nthree key modules: 1) a facial optical flow network to obtain non-RGB\ninter-frame flow information; 2) flow attention to assign different\nsignificance to each frame; 3) spatio-temporal aggregation to aggregate\nhigh-level spatial features and temporal transition features. Through extensive\nexperiments, FASTEN only requires five frames of input and outperforms eight\ncompetitors for both intra-dataset and cross-dataset evaluations in terms of\nmultiple detection metrics. Moreover, FASTEN has been deployed in real-world\nmobile devices for practical 3D mask detection.\n","authors":["Yuxin Cao","Yian Li","Yumeng Zhu","Derui Wang","Minhui Xue"],"pdf_url":"https://arxiv.org/pdf/2310.16569v1.pdf","comment":"13 pages, 5 figures. Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.13472v2","updated":"2023-10-25T11:30:04Z","published":"2023-09-23T20:12:32Z","title":"Edge Aware Learning for 3D Point Cloud","summary":" This paper proposes an innovative approach to Hierarchical Edge Aware 3D\nPoint Cloud Learning (HEA-Net) that seeks to address the challenges of noise in\npoint cloud data, and improve object recognition and segmentation by focusing\non edge features. In this study, we present an innovative edge-aware learning\nmethodology, specifically designed to enhance point cloud classification and\nsegmentation. Drawing inspiration from the human visual system, the concept of\nedge-awareness has been incorporated into this methodology, contributing to\nimproved object recognition while simultaneously reducing computational time.\nOur research has led to the development of an advanced 3D point cloud learning\nframework that effectively manages object classification and segmentation\ntasks. A unique fusion of local and global network learning paradigms has been\nemployed, enriched by edge-focused local and global embeddings, thereby\nsignificantly augmenting the model's interpretative prowess. 
Further, we have\napplied a hierarchical transformer architecture to boost point cloud processing\nefficiency, thus providing nuanced insights into structural understanding. Our\napproach demonstrates significant promise in managing noisy point cloud data\nand highlights the potential of edge-aware strategies in 3D point cloud\nlearning. The proposed approach is shown to outperform existing techniques in\nobject classification and segmentation tasks, as demonstrated by experiments on\nModelNet40 and ShapeNet datasets.\n","authors":["Lei Li"],"pdf_url":"https://arxiv.org/pdf/2309.13472v2.pdf","comment":"CGI 2023"},{"id":"http://arxiv.org/abs/2310.01164v3","updated":"2023-10-25T11:28:35Z","published":"2023-10-02T12:49:20Z","title":"Segment Any Building For Remote Sensing","summary":" The task of identifying and segmenting buildings within remote sensing\nimagery has perennially stood at the forefront of scholarly investigations.\nThis manuscript accentuates the potency of harnessing diversified datasets in\ntandem with cutting-edge representation learning paradigms for building\nsegmentation in such images. Through the strategic amalgamation of disparate\ndatasets, we have not only expanded the informational horizon accessible for\nmodel training but also manifested unparalleled performance metrics across\nmultiple datasets. Our avant-garde joint training regimen underscores the merit\nof our approach, bearing significant implications in pivotal domains such as\nurban infrastructural development, disaster mitigation strategies, and\necological surveillance. Our methodology, predicated upon the fusion of\ndatasets and gleaning insights from pre-trained models, carves a new benchmark\nin the annals of building segmentation endeavors. 
The outcomes of this research\nboth fortify the foundations for ensuing scholarly pursuits and presage a\nhorizon replete with innovative applications in the discipline of building\nsegmentation.\n","authors":["Lei Li"],"pdf_url":"https://arxiv.org/pdf/2310.01164v3.pdf","comment":"Accepted by CGI 2023"},{"id":"http://arxiv.org/abs/2310.07440v2","updated":"2023-10-25T11:24:28Z","published":"2023-10-11T12:46:11Z","title":"Distance Weighted Trans Network for Image Completion","summary":" The challenge of image generation has been effectively modeled as a problem\nof structure priors or transformation. However, existing models have\nunsatisfactory performance in understanding the global input image structures\nbecause of particular inherent features (for example, local inductive prior).\nRecent studies have shown that self-attention is an efficient modeling\ntechnique for image completion problems. In this paper, we propose a new\narchitecture that relies on Distance-based Weighted Transformer (DWT) to better\nunderstand the relationships between an image's components. In our model, we\nleverage the strengths of both Convolutional Neural Networks (CNNs) and DWT\nblocks to enhance the image completion process. Specifically, CNNs are used to\naugment the local texture information of coarse priors and DWT blocks are used\nto recover certain coarse textures and coherent visual structures. Unlike\ncurrent approaches that generally use CNNs to create feature maps, we use the\nDWT to encode global dependencies and compute distance-based weighted feature\nmaps, which substantially minimizes the problem of visual ambiguities.\nMeanwhile, to better produce repeated textures, we introduce Residual Fast\nFourier Convolution (Res-FFC) blocks to combine the encoder's skip features\nwith the coarse features provided by our generator. 
Furthermore, a simple yet\neffective technique is proposed to normalize the non-zero values of\nconvolutions, and fine-tune the network layers for regularization of the\ngradient norms to provide an efficient training stabiliser. Extensive\nquantitative and qualitative experiments on three challenging datasets\ndemonstrate the superiority of our proposed model compared to existing\napproaches.\n","authors":["Pourya Shamsolmoali","Masoumeh Zareapoor","Huiyu Zhou","Xuelong Li","Yue Lu"],"pdf_url":"https://arxiv.org/pdf/2310.07440v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.16572v2","updated":"2023-10-25T11:10:57Z","published":"2023-08-31T09:13:30Z","title":"CL-MAE: Curriculum-Learned Masked Autoencoders","summary":" Masked image modeling has been demonstrated as a powerful pretext task for\ngenerating robust representations that can be effectively generalized across\nmultiple downstream tasks. Typically, this approach involves randomly masking\npatches (tokens) in input images, with the masking strategy remaining unchanged\nduring training. In this paper, we propose a curriculum learning approach that\nupdates the masking strategy to continually increase the complexity of the\nself-supervised reconstruction task. We conjecture that, by gradually\nincreasing the task complexity, the model can learn more sophisticated and\ntransferable representations. To facilitate this, we introduce a novel\nlearnable masking module that possesses the capability to generate masks of\ndifferent complexities, and integrate the proposed module into masked\nautoencoders (MAE). Our module is jointly trained with the MAE, while adjusting\nits behavior during training, transitioning from a partner to the MAE\n(optimizing the same reconstruction loss) to an adversary (optimizing the\nopposite loss), while passing through a neutral state. 
The transition between\nthese behaviors is smooth, being regulated by a factor that is multiplied by\nthe reconstruction loss of the masking module. The resulting training procedure\ngenerates an easy-to-hard curriculum. We train our Curriculum-Learned Masked\nAutoencoder (CL-MAE) on ImageNet and show that it exhibits superior\nrepresentation learning capabilities compared to MAE. The empirical results on\nfive downstream tasks confirm our conjecture, demonstrating that curriculum\nlearning can be successfully used to self-supervise masked autoencoders. We\nrelease our code at https://github.com/ristea/cl-mae.\n","authors":["Neelu Madan","Nicolae-Catalin Ristea","Kamal Nasrollahi","Thomas B. Moeslund","Radu Tudor Ionescu"],"pdf_url":"https://arxiv.org/pdf/2308.16572v2.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2310.16542v1","updated":"2023-10-25T10:45:38Z","published":"2023-10-25T10:45:38Z","title":"ParisLuco3D: A high-quality target dataset for domain generalization of\n LiDAR perception","summary":" LiDAR is a sensor system that supports autonomous driving by gathering\nprecise geometric information about the scene. Exploiting this information for\nperception is interesting as the amount of available data increases.\n As the quantitative performance of various perception tasks has improved, the\nfocus has shifted from source-to-source perception to domain adaptation and\ndomain generalization for perception. These new goals require access to a large\nvariety of domains for evaluation. Unfortunately, the various annotation\nstrategies of data providers complicate the computation of cross-domain\nperformance based on the available data.\n This paper provides a novel dataset, specifically designed for cross-domain\nevaluation to make it easier to evaluate the performance of various source\ndatasets. 
Alongside the dataset, a flexible online benchmark is provided to\nensure a fair comparison across methods.\n","authors":["Jules Sanchez","Louis Soum-Fontez","Jean-Emmanuel Deschaud","Francois Goulette"],"pdf_url":"https://arxiv.org/pdf/2310.16542v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16540v1","updated":"2023-10-25T10:39:51Z","published":"2023-10-25T10:39:51Z","title":"Dual Defense: Adversarial, Traceable, and Invisible Robust Watermarking\n against Face Swapping","summary":" The malicious applications of deep forgery, represented by face swapping,\nhave introduced security threats such as misinformation dissemination and\nidentity fraud. While some research has proposed the use of robust watermarking\nmethods to trace the copyright of facial images for post-event traceability,\nthese methods cannot effectively prevent the generation of forgeries at the\nsource and curb their dissemination. To address this problem, we propose a\nnovel comprehensive active defense mechanism that combines traceability and\nadversariality, called Dual Defense. Dual Defense invisibly embeds a single\nrobust watermark within the target face to actively respond to sudden cases of\nmalicious face swapping. It disrupts the output of the face swapping model\nwhile maintaining the integrity of watermark information throughout the entire\ndissemination process. This allows for watermark extraction at any stage of\nimage tracking for traceability. Specifically, we introduce a watermark\nembedding network based on original-domain feature impersonation attack. This\nnetwork learns robust adversarial features of target facial images and embeds\nwatermarks, seeking a well-balanced trade-off between watermark invisibility,\nadversariality, and traceability through perceptual adversarial encoding\nstrategies. 
Extensive experiments demonstrate that Dual Defense achieves\noptimal overall defense success rates and exhibits promising universality in\nanti-face swapping tasks and dataset generalization ability. It maintains\nimpressive adversariality and traceability in both original and robust\nsettings, surpassing current forgery defense methods that possess only one of\nthese capabilities, including CMUA-Watermark, Anti-Forgery, FakeTagger, or PGD\nmethods.\n","authors":["Yunming Zhang","Dengpan Ye","Caiyun Xie","Long Tang","Chuanxi Chen","Ziyi Liu","Jiacheng Deng"],"pdf_url":"https://arxiv.org/pdf/2310.16540v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16534v1","updated":"2023-10-25T10:33:17Z","published":"2023-10-25T10:33:17Z","title":"An Early Evaluation of GPT-4V(ision)","summary":" In this paper, we evaluate different abilities of GPT-4V including visual\nunderstanding, language understanding, visual puzzle solving, and understanding\nof other modalities such as depth, thermal, video, and audio. To estimate\nGPT-4V's performance, we manually construct 656 test instances and carefully\nevaluate the results of GPT-4V. The highlights of our findings are as follows:\n(1) GPT-4V exhibits impressive performance on English visual-centric benchmarks\nbut fails to recognize simple Chinese texts in the images; (2) GPT-4V shows\ninconsistent refusal behavior when answering questions related to sensitive\ntraits such as gender, race, and age; (3) GPT-4V obtains worse results than\nGPT-4 (API) on language understanding tasks including general language\nunderstanding benchmarks and visual commonsense knowledge evaluation\nbenchmarks; (4) Few-shot prompting can improve GPT-4V's performance on both\nvisual understanding and language understanding; (5) GPT-4V struggles to find\nthe nuances between two similar images and solve the easy math picture puzzles;\n(6) GPT-4V shows non-trivial performance on the tasks of similar modalities to\nimage, such as video and thermal. 
Our experimental results reveal the ability\nand limitations of GPT-4V and we hope our paper can provide some insights into\nthe application and research of GPT-4V.\n","authors":["Yang Wu","Shilong Wang","Hao Yang","Tian Zheng","Hongbo Zhang","Yanyan Zhao","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2310.16534v1.pdf","comment":"Technical Report. Data are available at\n https://github.com/albertwy/GPT-4V-Evaluation"},{"id":"http://arxiv.org/abs/2310.16532v1","updated":"2023-10-25T10:26:07Z","published":"2023-10-25T10:26:07Z","title":"Learning Robust Deep Visual Representations from EEG Brain Recordings","summary":" Decoding the human brain has been a hallmark of neuroscientists and\nArtificial Intelligence researchers alike. Reconstruction of visual images from\nbrain Electroencephalography (EEG) signals has garnered a lot of interest due\nto its applications in brain-computer interfacing. This study proposes a\ntwo-stage method where the first step is to obtain EEG-derived features for\nrobust learning of deep representations and subsequently utilize the learned\nrepresentation for image generation and classification. We demonstrate the\ngeneralizability of our feature extraction pipeline across three different\ndatasets using deep-learning architectures with supervised and contrastive\nlearning methods. We have performed the zero-shot EEG classification task to\nsupport the generalizability claim further. We observed that a subject\ninvariant linearly separable visual representation was learned using EEG data\nalone in an unimodal setting that gives better k-means accuracy as compared to\na joint representation learning between EEG and images. Finally, we propose a\nnovel framework to transform unseen images into the EEG space and reconstruct\nthem with approximation, showcasing the potential for image reconstruction from\nEEG signals. 
Our proposed image synthesis method from EEG shows 62.9% and\n36.13% inception score improvement on the EEGCVPR40 and the Thoughtviz\ndatasets, which is better than state-of-the-art performance in GAN.\n","authors":["Prajwal Singh","Dwip Dalal","Gautam Vashishtha","Krishna Miyapuram","Shanmuganathan Raman"],"pdf_url":"https://arxiv.org/pdf/2310.16532v1.pdf","comment":"Accepted in WACV 2024"},{"id":"http://arxiv.org/abs/2310.16527v1","updated":"2023-10-25T10:22:30Z","published":"2023-10-25T10:22:30Z","title":"Enhancing Document Information Analysis with Multi-Task Pre-training: A\n Robust Approach for Information Extraction in Visually-Rich Documents","summary":" This paper introduces a deep learning model tailored for document information\nanalysis, emphasizing document classification, entity relation extraction, and\ndocument visual question answering. The proposed model leverages\ntransformer-based models to encode all the information present in a document\nimage, including textual, visual, and layout information. The model is\npre-trained and subsequently fine-tuned for various document image analysis\ntasks. The proposed model incorporates three additional tasks during the\npre-training phase, including reading order identification of different layout\nsegments in a document image, layout segments categorization as per PubLayNet,\nand generation of the text sequence within a given layout segment (text block).\nThe model also incorporates a collective pre-training scheme where losses of\nall the tasks under consideration, including pre-training and fine-tuning tasks\nwith all datasets, are considered. Additional encoder and decoder blocks are\nadded to the RoBERTa network to generate results for all tasks. 
The proposed\nmodel achieved impressive results across all tasks, with an accuracy of 95.87%\non the RVL-CDIP dataset for document classification, F1 scores of 0.9306,\n0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets\nrespectively for entity relation extraction, and an ANLS score of 0.8468 on the\nDocVQA dataset for visual question answering. The results highlight the\neffectiveness of the proposed model in understanding and interpreting complex\ndocument layouts and content, making it a promising tool for document analysis\ntasks.\n","authors":["Tofik Ali","Partha Pratim Roy"],"pdf_url":"https://arxiv.org/pdf/2310.16527v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16494v1","updated":"2023-10-25T09:26:16Z","published":"2023-10-25T09:26:16Z","title":"Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph\n prediction","summary":" 3D scene graphs are an emerging 3D scene representation, that models both the\nobjects present in the scene as well as their relationships. However, learning\n3D scene graphs is a challenging task because it requires not only object\nlabels but also relationship annotations, which are very scarce in datasets.\nWhile it is widely accepted that pre-training is an effective approach to\nimprove model performance in low data regimes, in this paper, we find that\nexisting pre-training methods are ill-suited for 3D scene graphs. To solve this\nissue, we present the first language-based pre-training approach for 3D scene\ngraphs, whereby we exploit the strong relationship between scene graphs and\nlanguage. To this end, we leverage the language encoder of CLIP, a popular\nvision-language model, to distill its knowledge into our graph-based network.\nWe formulate a contrastive pre-training, which aligns text embeddings of\nrelationships (subject-predicate-object triplets) and predicted 3D graph\nfeatures. 
Our method achieves state-of-the-art results on the main semantic 3D\nscene graph benchmark by showing improved effectiveness over pre-training\nbaselines and outperforming all the existing fully supervised scene graph\nprediction methods by a significant margin. Furthermore, since our scene graph\nfeatures are language-aligned, it allows us to query the language space of the\nfeatures in a zero-shot manner. In this paper, we show an example of utilizing\nthis property of the features to predict the room type of a scene without\nfurther training.\n","authors":["Sebastian Koch","Pedro Hermosilla","Narunas Vaskevicius","Mirco Colosi","Timo Ropinski"],"pdf_url":"https://arxiv.org/pdf/2310.16494v1.pdf","comment":"3DV 2024. Project page: https://kochsebastian.com/lang3dsg"},{"id":"http://arxiv.org/abs/2310.16492v1","updated":"2023-10-25T09:19:45Z","published":"2023-10-25T09:19:45Z","title":"On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection","summary":" Successful detection of Out-of-Distribution (OoD) data is becoming\nincreasingly important to ensure safe deployment of neural networks. One of the\nmain challenges in OoD detection is that neural networks output overconfident\npredictions on OoD data, making it difficult to determine OoD-ness of data solely\nbased on their predictions. Outlier exposure addresses this issue by\nintroducing an additional loss that encourages low-confidence predictions on\nOoD data during training. While outlier exposure has shown promising potential\nin improving OoD detection performance, all previous studies on outlier\nexposure have been limited to utilizing visual outliers. Drawing inspiration\nfrom the recent advancements in vision-language pre-training, this paper\nventures out to the uncharted territory of textual outlier exposure. First, we\nuncover the benefits of using textual outliers by replacing real or virtual\noutliers in the image-domain with textual equivalents. 
Then, we propose various\nways of generating preferable textual outliers. Our extensive experiments\ndemonstrate that generated textual outliers achieve competitive performance on\nlarge-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical\nanalyses of textual outliers to provide primary criteria for designing\nadvantageous textual outliers: near-distribution, descriptiveness, and\ninclusion of visual semantics.\n","authors":["Sangha Park","Jisoo Mok","Dahuin Jung","Saehyung Lee","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2310.16492v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.12795v3","updated":"2023-10-25T09:11:50Z","published":"2023-06-22T10:53:10Z","title":"Learning Unseen Modality Interaction","summary":" Multimodal learning assumes all modality combinations of interest are\navailable during training to learn cross-modal correspondences. In this paper,\nwe challenge this modality-complete assumption for multimodal learning and\ninstead strive for generalization to unseen modality combinations during\ninference. We pose the problem of unseen modality interaction and introduce a\nfirst solution. It exploits a module that projects the multidimensional\nfeatures of different modalities into a common space with rich information\npreserved. This allows the information to be accumulated with a simple\nsummation operation across available modalities. To reduce overfitting to less\ndiscriminative modality combinations during training, we further improve the\nmodel learning with pseudo-supervision indicating the reliability of a\nmodality's prediction. We demonstrate that our approach is effective for\ndiverse tasks and modalities by evaluating it for multimodal video\nclassification, robot state regression, and multimedia retrieval. Project\nwebsite: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.\n","authors":["Yunhua Zhang","Hazel Doughty","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2306.12795v3.pdf","comment":"Published at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16483v1","updated":"2023-10-25T09:08:58Z","published":"2023-10-25T09:08:58Z","title":"Gramian Attention Heads are Strong yet Efficient Vision Learners","summary":" We introduce a novel architecture design that enhances expressiveness by\nincorporating multiple head classifiers (\\ie, classification heads) instead of\nrelying on channel expansion or additional building blocks. Our approach\nemploys attention-based aggregation, utilizing pairwise feature similarity to\nenhance multiple lightweight heads with minimal resource overhead. We compute\nthe Gramian matrices to reinforce class tokens in an attention layer for each\nhead. This enables the heads to learn more discriminative representations,\nenhancing their aggregation capabilities. Furthermore, we propose a learning\nalgorithm that encourages heads to complement each other by reducing\ncorrelation for aggregation. Our models eventually surpass state-of-the-art\nCNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K and\ndeliver remarkable performance across various downstream tasks, such as COCO\nobject instance segmentation, ADE20k semantic segmentation, and fine-grained\nvisual classification datasets. The effectiveness of our framework is\nsubstantiated by practical experimental results and further underpinned by\ngeneralization error bound. 
We release the code publicly at:\nhttps://github.com/Lab-LVM/imagenet-models.\n","authors":["Jongbin Ryu","Dongyoon Han","Jongwoo Lim"],"pdf_url":"https://arxiv.org/pdf/2310.16483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16477v1","updated":"2023-10-25T08:55:48Z","published":"2023-10-25T08:55:48Z","title":"Show from Tell: Audio-Visual Modelling in Clinical Settings","summary":" Auditory and visual signals usually present together and correlate with each\nother, not only in natural environments but also in clinical settings. However,\nthe audio-visual modelling in the latter case can be more challenging, due to\nthe different sources of audio/video signals and the noise (both signal-level\nand semantic-level) in auditory signals -- usually speech. In this paper, we\nconsider audio-visual modelling in a clinical setting, providing a solution to\nlearn medical representations that benefit various clinical tasks, without\nhuman expert annotation. A simple yet effective multi-modal self-supervised\nlearning framework is proposed for this purpose. The proposed approach is able\nto localise anatomical regions of interest during ultrasound imaging, with only\nspeech audio as a reference. Experimental evaluations on a large-scale clinical\nmulti-modal ultrasound video dataset show that the proposed self-supervised\nmethod learns good transferable anatomical representations that boost the\nperformance of automated downstream clinical tasks, even outperforming\nfully-supervised solutions.\n","authors":["Jianbo Jiao","Mohammad Alsharid","Lior Drukker","Aris T. Papageorghiou","Andrew Zisserman","J. 
Alison Noble"],"pdf_url":"https://arxiv.org/pdf/2310.16477v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10790v2","updated":"2023-10-25T08:39:38Z","published":"2023-09-19T17:39:20Z","title":"Guide Your Agent with Adaptive Multimodal Rewards","summary":" Developing an agent capable of adapting to unseen environments remains a\ndifficult challenge in imitation learning. This work presents Adaptive\nReturn-conditioned Policy (ARP), an efficient framework designed to enhance the\nagent's generalization ability using natural language task descriptions and\npre-trained multimodal encoders. Our key idea is to calculate a similarity\nbetween visual observations and natural language instructions in the\npre-trained multimodal embedding space (such as CLIP) and use it as a reward\nsignal. We then train a return-conditioned policy using expert demonstrations\nlabeled with multimodal rewards. Because the multimodal rewards provide\nadaptive signals at each timestep, our ARP effectively mitigates the goal\nmisgeneralization. This results in superior generalization performances even\nwhen faced with unseen text instructions, compared to existing text-conditioned\npolicies. To improve the quality of rewards, we also introduce a fine-tuning\nmethod for pre-trained multimodal encoders, further enhancing the performance.\nVideo demonstrations and source code are available on the project website:\n\\url{https://sites.google.com/view/2023arp}.\n","authors":["Changyeon Kim","Younggyo Seo","Hao Liu","Lisa Lee","Jinwoo Shin","Honglak Lee","Kimin Lee"],"pdf_url":"https://arxiv.org/pdf/2309.10790v2.pdf","comment":"Accepted to NeurIPS 2023. 
Project webpage:\n https://sites.google.com/view/2023arp"},{"id":"http://arxiv.org/abs/2310.16459v1","updated":"2023-10-25T08:34:05Z","published":"2023-10-25T08:34:05Z","title":"DualMatch: Robust Semi-Supervised Learning with Dual-Level Interaction","summary":" Semi-supervised learning provides an expressive framework for exploiting\nunlabeled data when labels are insufficient. Previous semi-supervised learning\nmethods typically match model predictions of different data-augmented views in\na single-level interaction manner, which highly relies on the quality of\npseudo-labels and results in semi-supervised learning not robust. In this\npaper, we propose a novel SSL method called DualMatch, in which the class\nprediction jointly invokes feature embedding in a dual-level interaction\nmanner. DualMatch requires consistent regularizations for data augmentation,\nspecifically, 1) ensuring that different augmented views are regulated with\nconsistent class predictions, and 2) ensuring that different data of one class\nare regulated with similar feature embeddings. Extensive experiments\ndemonstrate the effectiveness of DualMatch. In the standard SSL setting, the\nproposal achieves 9% error reduction compared with SOTA methods, even in a more\nchallenging class-imbalanced setting, the proposal can still achieve 6% error\nreduction. Code is available at https://github.com/CWangAI/DualMatch\n","authors":["Cong Wang","Xiaofeng Cao","Lanzhe Guo2","Zenglin Shi"],"pdf_url":"https://arxiv.org/pdf/2310.16459v1.pdf","comment":"14 pages, 8 figures, Accepted by ECMLPKDD 2023"},{"id":"http://arxiv.org/abs/2310.16457v1","updated":"2023-10-25T08:31:04Z","published":"2023-10-25T08:31:04Z","title":"Towards Explainability in Monocular Depth Estimation","summary":" The estimation of depth in two-dimensional images has long been a challenging\nand extensively studied subject in computer vision. 
Recently, significant\nprogress has been made with the emergence of Deep Learning-based approaches,\nwhich have proven highly successful. This paper focuses on the explainability\nin monocular depth estimation methods, in terms of how humans perceive depth.\nThis preliminary study emphasizes on one of the most significant visual cues,\nthe relative size, which is prominent in almost all viewed images. We designed\na specific experiment to mimic the experiments in humans and have tested\nstate-of-the-art methods to indirectly assess the explainability in the context\ndefined. In addition, we observed that measuring the accuracy required further\nattention and a particular approach is proposed to this end. The results show\nthat a mean accuracy of around 77% across methods is achieved, with some of the\nmethods performing markedly better, thus, indirectly revealing their\ncorresponding potential to uncover monocular depth cues, like relative size.\n","authors":["Vasileios Arampatzakis","George Pavlidis","Kyriakos Pantoglou","Nikolaos Mitianoudis","Nikos Papamarkos"],"pdf_url":"https://arxiv.org/pdf/2310.16457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16447v1","updated":"2023-10-25T08:11:02Z","published":"2023-10-25T08:11:02Z","title":"ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors","summary":" Understanding the behavior of non-human primates is crucial for improving\nanimal welfare, modeling social behavior, and gaining insights into\ndistinctively human and phylogenetically shared behaviors. However, the lack of\ndatasets on non-human primate behavior hinders in-depth exploration of primate\nsocial interactions, posing challenges to research on our closest living\nrelatives. To address these limitations, we present ChimpACT, a comprehensive\ndataset for quantifying the longitudinal behavior and social relations of\nchimpanzees within a social group. 
Spanning from 2015 to 2018, ChimpACT\nfeatures videos of a group of over 20 chimpanzees residing at the Leipzig Zoo,\nGermany, with a particular focus on documenting the developmental trajectory of\none young male, Azibo. ChimpACT is both comprehensive and challenging,\nconsisting of 163 videos with a cumulative 160,500 frames, each richly\nannotated with detection, identification, pose estimation, and fine-grained\nspatiotemporal behavior labels. We benchmark representative methods of three\ntracks on ChimpACT: (i) tracking and identification, (ii) pose estimation, and\n(iii) spatiotemporal action detection of the chimpanzees. Our experiments\nreveal that ChimpACT offers ample opportunities for both devising new methods\nand adapting existing ones to solve fundamental computer vision tasks applied\nto chimpanzee groups, such as detection, pose estimation, and behavior\nanalysis, ultimately deepening our comprehension of communication and sociality\nin non-human primates.\n","authors":["Xiaoxuan Ma","Stephan P. Kaufhold","Jiajun Su","Wentao Zhu","Jack Terwilliger","Andres Meza","Yixin Zhu","Federico Rossano","Yizhou Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16447v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2207.03182v3","updated":"2023-10-25T08:05:48Z","published":"2022-07-07T09:23:03Z","title":"Chilled Sampling for Uncertainty Quantification: A Motivation From A\n Meteorological Inverse Problem","summary":" Atmospheric motion vectors (AMVs) extracted from satellite imagery are the\nonly wind observations with good global coverage. They are important features\nfor feeding numerical weather prediction (NWP) models. Several Bayesian models\nhave been proposed to estimate AMVs. Although critical for correct assimilation\ninto NWP models, very few methods provide a thorough characterization of the\nestimation errors. 
The difficulty of estimating errors stems from the\nspecificity of the posterior distribution, which is both very high dimensional,\nand highly ill-conditioned due to a singular likelihood. Motivated by this\ndifficult inverse problem, this work studies the evaluation of the (expected)\nestimation errors using gradient-based Markov Chain Monte Carlo (MCMC)\nalgorithms. The main contribution is to propose a general strategy, called here\nchilling, which amounts to sampling a local approximation of the posterior\ndistribution in the neighborhood of a point estimate. From a theoretical point\nof view, we show that under regularity assumptions, the family of chilled\nposterior distributions converges in distribution as temperature decreases to\nan optimal Gaussian approximation at a point estimate given by the Maximum A\nPosteriori, also known as the Laplace approximation. Chilled sampling therefore\nprovides access to this approximation generally out of reach in such\nhigh-dimensional nonlinear contexts. From an empirical perspective, we evaluate\nthe proposed approach based on some quantitative Bayesian criteria. Our\nnumerical simulations are performed on synthetic and real meteorological data.\nThey reveal that not only the proposed chilling exhibits a significant gain in\nterms of accuracy of the point estimates and of their associated expected\nerrors, but also a substantial acceleration in the convergence speed of the\nMCMC algorithms.\n","authors":["Patrick Héas","Frédéric Cérou","Mathias Rousset"],"pdf_url":"https://arxiv.org/pdf/2207.03182v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16436v1","updated":"2023-10-25T08:03:10Z","published":"2023-10-25T08:03:10Z","title":"DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning\n in Language Models","summary":" A long-standing goal of AI systems is to perform complex multimodal reasoning\nlike humans. 
Recently, large language models (LLMs) have made remarkable\nstrides in such multi-step reasoning on the language modality solely by\nleveraging the chain of thought (CoT) to mimic human thinking. However, the\ntransfer of these advancements to multimodal contexts introduces heightened\nchallenges, including but not limited to the impractical need for\nlabor-intensive annotation and the limitations in terms of flexibility,\ngeneralizability, and explainability. To evoke CoT reasoning in multimodality,\nthis work first conducts an in-depth analysis of these challenges posed by\nmultimodality and presents two key insights: \"keeping critical thinking\" and\n\"letting everyone do their jobs\" in multimodal CoT reasoning. Furthermore, this\nstudy proposes a novel DDCoT prompting that maintains a critical attitude\nthrough negative-space prompting and incorporates multimodality into reasoning\nby first dividing the reasoning responsibility of LLMs into reasoning and\nrecognition and then integrating the visual recognition capability of visual\nmodels into the joint reasoning process. The rationales generated by DDCoT not\nonly improve the reasoning abilities of both large and small language models in\nzero-shot prompting and fine-tuning learning, significantly outperforming\nstate-of-the-art methods but also exhibit impressive generalizability and\nexplainability.\n","authors":["Ge Zheng","Bin Yang","Jiajin Tang","Hong-Yu Zhou","Sibei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16436v1.pdf","comment":"24 pages, 13 figures, to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16435v1","updated":"2023-10-25T08:02:27Z","published":"2023-10-25T08:02:27Z","title":"On Pixel-level Performance Assessment in Anomaly Detection","summary":" Anomaly detection methods have demonstrated remarkable success across various\napplications. 
However, assessing their performance, particularly at the\npixel-level, presents a complex challenge due to the severe imbalance that is\nmost commonly present between normal and abnormal samples. Commonly adopted\nevaluation metrics designed for pixel-level detection may not effectively\ncapture the nuanced performance variations arising from this class imbalance.\nIn this paper, we dissect the intricacies of this challenge, underscored by\nvisual evidence and statistical analysis, leading to delve into the need for\nevaluation metrics that account for the imbalance. We offer insights into more\naccurate metrics, using eleven leading contemporary anomaly detection methods\non twenty-one anomaly detection problems. Overall, from this extensive\nexperimental evaluation, we can conclude that Precision-Recall-based metrics\ncan better capture relative method performance, making them more suitable for\nthe task.\n","authors":["Mehdi Rafiei","Toby P. Breckon","Alexandros Iosifidis"],"pdf_url":"https://arxiv.org/pdf/2310.16435v1.pdf","comment":"5 pages, 5 figures, 1 table"},{"id":"http://arxiv.org/abs/2310.16430v1","updated":"2023-10-25T07:55:02Z","published":"2023-10-25T07:55:02Z","title":"An Integrative Paradigm for Enhanced Stroke Prediction: Synergizing\n XGBoost and xDeepFM Algorithms","summary":" Stroke prediction plays a crucial role in preventing and managing this\ndebilitating condition. In this study, we address the challenge of stroke\nprediction using a comprehensive dataset, and propose an ensemble model that\ncombines the power of XGBoost and xDeepFM algorithms. Our work aims to improve\nupon existing stroke prediction models by achieving higher accuracy and\nrobustness. Through rigorous experimentation, we validate the effectiveness of\nour ensemble model using the AUC metric. Through comparing our findings with\nthose of other models in the field, we gain valuable insights into the merits\nand drawbacks of various approaches. 
This, in turn, contributes significantly\nto the progress of machine learning and deep learning techniques specifically\nin the domain of stroke prediction.\n","authors":["Weinan Dai","Yifeng Jiang","Chengjie Mou","Chongyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.16430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16402v1","updated":"2023-10-25T06:38:42Z","published":"2023-10-25T06:38:42Z","title":"Video Referring Expression Comprehension via Transformer with\n Content-conditioned Query","summary":" Video Referring Expression Comprehension (REC) aims to localize a target\nobject in videos based on the queried natural language. Recent improvements in\nvideo REC have been made using Transformer-based methods with learnable\nqueries. However, we contend that this naive query design is not ideal given\nthe open-world nature of video REC brought by text supervision. With numerous\npotential semantic categories, relying on only a few slow-updated queries is\ninsufficient to characterize them. Our solution to this problem is to create\ndynamic queries that are conditioned on both the input video and language to\nmodel the diverse objects referred to. Specifically, we place a fixed number of\nlearnable bounding boxes throughout the frame and use corresponding region\nfeatures to provide prior information. Also, we noticed that current query\nfeatures overlook the importance of cross-modal alignment. To address this, we\nalign specific phrases in the sentence with semantically relevant visual areas,\nannotating them in existing video datasets (VID-Sentence and VidSTG). By\nincorporating these two designs, our proposed model (called ConFormer)\noutperforms other models on widely benchmarked datasets. 
For example, in the\ntesting split of VID-Sentence dataset, ConFormer achieves 8.75% absolute\nimprovement on Accu.@0.6 compared to the previous state-of-the-art model.\n","authors":["Ji Jiang","Meng Cao","Tengtao Song","Long Chen","Yi Wang","Yuexian Zou"],"pdf_url":"https://arxiv.org/pdf/2310.16402v1.pdf","comment":"Accepted to ACM International Conference on Multimedia Workshop (ACM\n MM), 2023. arXiv admin note: substantial text overlap with arXiv:2210.02953"},{"id":"http://arxiv.org/abs/2310.16400v1","updated":"2023-10-25T06:35:01Z","published":"2023-10-25T06:35:01Z","title":"Fuse Your Latents: Video Editing with Multi-source Latent Diffusion\n Models","summary":" Latent Diffusion Models (LDMs) are renowned for their powerful capabilities\nin image and video synthesis. Yet, video editing methods suffer from\ninsufficient pre-training data or video-by-video re-training cost. In\naddressing this gap, we propose FLDM (Fused Latent Diffusion Model), a\ntraining-free framework to achieve text-guided video editing by applying\noff-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses\nlatents from an image LDM and a video LDM during the denoising process. In\nthis way, temporal consistency can be kept with video LDM while high-fidelity\nfrom the image LDM can also be exploited. Meanwhile, FLDM possesses high\nflexibility since both image LDM and video LDM can be replaced so advanced\nimage editing methods such as InstructPix2Pix and ControlNet can be exploited.\nTo the best of our knowledge, FLDM is the first method to adapt off-the-shelf\nimage editing methods into video LDMs for video editing. 
Extensive quantitative\nand qualitative experiments demonstrate that FLDM can improve the textual\nalignment and temporal consistency of edited videos.\n","authors":["Tianyi Lu","Xing Zhang","Jiaxi Gu","Hang Xu","Renjing Pei","Songcen Xu","Zuxuan Wu"],"pdf_url":"https://arxiv.org/pdf/2310.16400v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.02858v4","updated":"2023-10-25T06:23:31Z","published":"2023-06-05T13:17:27Z","title":"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video\n Understanding","summary":" We present Video-LLaMA a multi-modal framework that empowers Large Language\nModels (LLMs) with the capability of understanding both visual and auditory\ncontent in the video. Video-LLaMA bootstraps cross-modal training from the\nfrozen pre-trained visual and audio encoders and the frozen LLMs. Unlike\nprevious works that complement LLMs to process the visual or audio signals\nonly, Video-LLaMA enables video comprehension by tackling two challenges: (1)\ncapturing the temporal changes in visual scenes, (2) integrating audio-visual\nsignals. To counter the first challenge, we propose a Video Q-former to\nassemble a pre-trained image encoder into our video encoder and introduce a\nvideo-to-text generation task to learn video-language correspondence. For the\nsecond challenge, we leverage ImageBind, a universal embedding model aligning\nmultiple modalities, as the pre-trained audio encoder and introduce an Audio\nQ-former on top of ImageBind to learn reasonable auditory query embeddings for\nthe LLM module. To align the output of both visual and audio encoders with\nLLM's embedding space, we first train Video-LLaMA on massive\nvideo/image-caption pairs and then tune our model with visual-instruction\ndatasets of moderate amount but higher quality. 
We found Video-LLaMA shows the\nability to perceive and comprehend video content and generate meaningful\nresponses grounded in the visual and auditory information presented in the\nvideos.\n","authors":["Hang Zhang","Xin Li","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2306.02858v4.pdf","comment":"Accepted by EMNLP 2023's demo track; Code, Pretrained Model, and\n Dataset: https://github.com/DAMO-NLP-SG/Video-LLaMA"},{"id":"http://arxiv.org/abs/2310.14664v2","updated":"2023-10-25T06:19:05Z","published":"2023-10-23T08:00:03Z","title":"Data Pruning via Moving-one-Sample-out","summary":" In this paper, we propose a novel data-pruning approach called\nmoving-one-sample-out (MoSo), which aims to identify and remove the least\ninformative samples from the training set. The core insight behind MoSo is to\ndetermine the importance of each sample by assessing its impact on the optimal\nempirical risk. This is achieved by measuring the extent to which the empirical\nrisk changes when a particular sample is excluded from the training set.\nInstead of using the computationally expensive leaving-one-out-retraining\nprocedure, we propose an efficient first-order approximator that only requires\ngradient information from different training stages. The key idea behind our\napproximation is that samples with gradients that are consistently aligned with\nthe average gradient of the training set are more informative and should\nreceive higher scores, which could be intuitively understood as follows: if the\ngradient from a specific sample is consistent with the average gradient vector,\nit implies that optimizing the network using the sample will yield a similar\neffect on all remaining samples. 
Experimental results demonstrate that MoSo\neffectively mitigates severe performance degradation at high pruning ratios and\nachieves satisfactory performance across various settings.\n","authors":["Haoru Tan","Sitong Wu","Fei Du","Yukang Chen","Zhibin Wang","Fan Wang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2310.14664v2.pdf","comment":"Accepted by the Thirty-seventh Conference on Neural Information\n Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.16391v1","updated":"2023-10-25T06:10:57Z","published":"2023-10-25T06:10:57Z","title":"Winning Prize Comes from Losing Tickets: Improve Invariant Learning by\n Exploring Variant Parameters for Out-of-Distribution Generalization","summary":" Out-of-Distribution (OOD) Generalization aims to learn robust models that\ngeneralize well to various environments without fitting to\ndistribution-specific features. Recent studies based on Lottery Ticket\nHypothesis (LTH) address this problem by minimizing the learning target to find\nsome of the parameters that are critical to the task. However, in OOD problems,\nsuch solutions are suboptimal as the learning task contains severe distribution\nnoises, which can mislead the optimization process. Therefore, apart from\nfinding the task-related parameters (i.e., invariant parameters), we propose\nExploring Variant parameters for Invariant Learning (EVIL) which also leverages\nthe distribution knowledge to find the parameters that are sensitive to\ndistribution shift (i.e., variant parameters). Once the variant parameters are\nleft out of invariant learning, a robust subnetwork that is resistant to\ndistribution shift can be found. Additionally, the parameters that are\nrelatively stable across distributions can be considered invariant ones to\nimprove invariant learning. By fully exploring both variant and invariant\nparameters, our EVIL can effectively identify a robust subnetwork to improve\nOOD generalization. 
In extensive experiments on integrated testbed: DomainBed,\nEVIL can effectively and efficiently enhance many popular methods, such as ERM,\nIRM, SAM, etc.\n","authors":["Zhuo Huang","Muyang Li","Li Shen","Jun Yu","Chen Gong","Bo Han","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16391v1.pdf","comment":"27 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.16389v1","updated":"2023-10-25T06:10:07Z","published":"2023-10-25T06:10:07Z","title":"MVFAN: Multi-View Feature Assisted Network for 4D Radar Object Detection","summary":" 4D radar is recognized for its resilience and cost-effectiveness under\nadverse weather conditions, thus playing a pivotal role in autonomous driving.\nWhile cameras and LiDAR are typically the primary sensors used in perception\nmodules for autonomous vehicles, radar serves as a valuable supplementary\nsensor. Unlike LiDAR and cameras, radar remains unimpaired by harsh weather\nconditions, thereby offering a dependable alternative in challenging\nenvironments. Developing radar-based 3D object detection not only augments the\ncompetency of autonomous vehicles but also provides economic benefits. In\nresponse, we propose the Multi-View Feature Assisted Network (\\textit{MVFAN}),\nan end-to-end, anchor-free, and single-stage framework for 4D-radar-based 3D\nobject detection for autonomous vehicles. We tackle the issue of insufficient\nfeature utilization by introducing a novel Position Map Generation module to\nenhance feature learning by reweighing foreground and background points, and\ntheir features, considering the irregular distribution of radar point clouds.\nAdditionally, we propose a pioneering backbone, the Radar Feature Assisted\nbackbone, explicitly crafted to fully exploit the valuable Doppler velocity and\nreflectivity data provided by the 4D radar sensor. Comprehensive experiments\nand ablation studies carried out on Astyx and VoD datasets attest to the\nefficacy of our framework. 
The incorporation of Doppler velocity and RCS\nreflectivity dramatically improves the detection performance for small moving\nobjects such as pedestrians and cyclists. Consequently, our approach culminates\nin a highly optimized 4D-radar-based 3D object detection capability for\nautonomous driving systems, setting a new standard in the field.\n","authors":["Qiao Yan","Yihan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16389v1.pdf","comment":"19 Pages, 7 figures, Accepted by ICONIP 2023"},{"id":"http://arxiv.org/abs/2310.16388v1","updated":"2023-10-25T06:00:37Z","published":"2023-10-25T06:00:37Z","title":"Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles","summary":" In the dynamic realm of deepfake detection, this work presents an innovative\napproach to validate video content. The methodology blends advanced\n2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is\nuniquely tailored to capture spatiotemporal features via sliding filters,\nextending through both spatial and temporal dimensions. This configuration\nenables nuanced pattern recognition in pixel arrangement and temporal evolution\nacross frames. Simultaneously, the 2D model leverages EfficientNet\narchitecture, harnessing auto-scaling in Convolutional Neural Networks.\nNotably, this ensemble integrates Voting Ensembles and Adaptive Weighted\nEnsembling. Strategic prioritization of the 3-dimensional model's output\ncapitalizes on its exceptional spatio-temporal feature extraction. Experimental\nvalidation underscores the effectiveness of this strategy, showcasing its\npotential in countering deepfake generation's deceptive practices.\n","authors":["Aagam Bakliwal","Amit D. 
Joshi"],"pdf_url":"https://arxiv.org/pdf/2310.16388v1.pdf","comment":"6 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.16387v1","updated":"2023-10-25T05:59:25Z","published":"2023-10-25T05:59:25Z","title":"Frequency-Aware Transformer for Learned Image Compression","summary":" Learned image compression (LIC) has gained traction as an effective solution\nfor image storage and transmission in recent years. However, existing LIC\nmethods are redundant in latent representation due to limitations in capturing\nanisotropic frequency components and preserving directional details. To\novercome these challenges, we propose a novel frequency-aware transformer (FAT)\nblock that for the first time achieves multiscale directional analysis for\nLIC. The FAT block comprises frequency-decomposition window attention (FDWA)\nmodules to capture multiscale and directional frequency components of natural\nimages. Additionally, we introduce frequency-modulation feed-forward network\n(FMFFN) to adaptively modulate different frequency components, improving\nrate-distortion performance. Furthermore, we present a transformer-based\nchannel-wise autoregressive (T-CA) model that effectively exploits channel\ndependencies. 
Experiments show that our method achieves state-of-the-art\nrate-distortion performance compared to existing LIC methods, and evidently\noutperforms latest standardized codec VTM-12.1 by 14.5%, 15.1%, 13.0% in\nBD-rate on the Kodak, Tecnick, and CLIC datasets.\n","authors":["Han Li","Shaohui Li","Wenrui Dai","Chenglin Li","Junni Zou","Hongkai Xiong"],"pdf_url":"https://arxiv.org/pdf/2310.16387v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16383v1","updated":"2023-10-25T05:43:14Z","published":"2023-10-25T05:43:14Z","title":"Open-NeRF: Towards Open Vocabulary NeRF Decomposition","summary":" In this paper, we address the challenge of decomposing Neural Radiance Fields\n(NeRF) into objects from an open vocabulary, a critical task for object\nmanipulation in 3D reconstruction and view synthesis. Current techniques for\nNeRF decomposition involve a trade-off between the flexibility of processing\nopen-vocabulary queries and the accuracy of 3D segmentation. We present,\nOpen-vocabulary Embedded Neural Radiance Fields (Open-NeRF), that leverage\nlarge-scale, off-the-shelf, segmentation models like the Segment Anything Model\n(SAM) and introduce an integrate-and-distill paradigm with hierarchical\nembeddings to achieve both the flexibility of open-vocabulary querying and 3D\nsegmentation accuracy. Open-NeRF first utilizes large-scale foundation models\nto generate hierarchical 2D mask proposals from varying viewpoints. These\nproposals are then aligned via tracking approaches and integrated within the 3D\nspace and subsequently distilled into the 3D field. This process ensures\nconsistent recognition and granularity of objects from different viewpoints,\neven in challenging scenarios involving occlusion and indistinct features. Our\nexperimental results show that the proposed Open-NeRF outperforms\nstate-of-the-art methods such as LERF \\cite{lerf} and FFD \\cite{ffd} in\nopen-vocabulary scenarios. 
Open-NeRF offers a promising solution to NeRF\ndecomposition, guided by open-vocabulary queries, enabling novel applications\nin robotics and vision-language interaction in open-world 3D scenes.\n","authors":["Hao Zhang","Fang Li","Narendra Ahuja"],"pdf_url":"https://arxiv.org/pdf/2310.16383v1.pdf","comment":"Accepted by WACV 2024"},{"id":"http://arxiv.org/abs/2304.08965v3","updated":"2023-10-25T05:09:56Z","published":"2023-04-18T12:58:21Z","title":"PointDC:Unsupervised Semantic Segmentation of 3D Point Clouds via\n Cross-modal Distillation and Super-Voxel Clustering","summary":" Semantic segmentation of point clouds usually requires exhausting efforts of\nhuman annotations, hence it attracts wide attention to the challenging topic of\nlearning from unlabeled or weaker forms of annotations. In this paper, we take\nthe first attempt for fully unsupervised semantic segmentation of point clouds,\nwhich aims to delineate semantically meaningful objects without any form of\nannotations. Previous works of unsupervised pipeline on 2D images fails in this\ntask of point clouds, due to: 1) Clustering Ambiguity caused by limited\nmagnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity\ncaused by the irregular sparsity of point cloud. Therefore, we propose a novel\nframework, PointDC, which is comprised of two steps that handle the\naforementioned problems respectively: Cross-Modal Distillation (CMD) and\nSuper-Voxel Clustering (SVC). In the first stage of CMD, multi-view visual\nfeatures are back-projected to the 3D space and aggregated to a unified point\nfeature to distill the training of the point representation. In the second\nstage of SVC, the point features are aggregated to super-voxels and then fed to\nthe iterative clustering process for excavating semantic classes. 
PointDC\nyields a significant improvement over the prior state-of-the-art unsupervised\nmethods, on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic\nsegmentation benchmarks.\n","authors":["Zisheng Chen","Hongbin Xu","Weitao Chen","Zhipeng Chen","Haihong Xiao","Baigui Sun","Xuansong Xie","Wenxiong Kang"],"pdf_url":"https://arxiv.org/pdf/2304.08965v3.pdf","comment":"Accepted by International Conference on Computer Vision (ICCV) 2023"},{"id":"http://arxiv.org/abs/2310.16364v1","updated":"2023-10-25T05:04:47Z","published":"2023-10-25T05:04:47Z","title":"Towards Large-scale Masked Face Recognition","summary":" During the COVID-19 coronavirus epidemic, almost everyone is wearing masks,\nwhich poses a huge challenge for deep learning-based face recognition\nalgorithms. In this paper, we will present our \\textbf{championship} solutions\nin ICCV MFR WebFace260M and InsightFace unconstrained tracks. We will focus on\nfour challenges in large-scale masked face recognition, i.e., super-large scale\ntraining, data noise handling, masked and non-masked face recognition accuracy\nbalancing, and how to design inference-friendly model architecture. We hope\nthat the discussion on these four aspects can guide future research towards\nmore robust masked face recognition systems.\n","authors":["Manyuan Zhang","Bingqi Ma","Guanglu Song","Yunxiao Wang","Hongsheng Li","Yu Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16364v1.pdf","comment":"the top1 solution for ICCV2021-MFR challenge"},{"id":"http://arxiv.org/abs/2306.01293v3","updated":"2023-10-25T04:22:02Z","published":"2023-06-02T06:33:08Z","title":"LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning","summary":" We present a novel vision-language prompt learning approach for few-shot\nout-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD\nimages from classes that are unseen during training using only a few labeled\nin-distribution (ID) images. 
While prompt learning methods such as CoOp have\nshown effectiveness and efficiency in few-shot ID classification, they still\nface limitations in OOD detection due to the potential presence of\nID-irrelevant information in text embeddings. To address this issue, we\nintroduce a new approach called Local regularized Context Optimization\n(LoCoOp), which performs OOD regularization that utilizes the portions of CLIP\nlocal features as OOD features during training. CLIP's local features have a\nlot of ID-irrelevant nuisances (e.g., backgrounds), and by learning to push\nthem away from the ID class text embeddings, we can remove the nuisances in the\nID class text embeddings and enhance the separation between ID and OOD.\nExperiments on the large-scale ImageNet OOD detection benchmarks demonstrate\nthe superiority of our LoCoOp over zero-shot, fully supervised detection\nmethods and prompt learning methods. Notably, even in a one-shot setting --\njust one label per class, LoCoOp outperforms existing zero-shot and fully\nsupervised detection methods. The code will be available via\nhttps://github.com/AtsuMiyai/LoCoOp.\n","authors":["Atsuyuki Miyai","Qing Yu","Go Irie","Kiyoharu Aizawa"],"pdf_url":"https://arxiv.org/pdf/2306.01293v3.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16349v1","updated":"2023-10-25T04:17:13Z","published":"2023-10-25T04:17:13Z","title":"DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object\n Detection","summary":" Denoising diffusion models show remarkable performances in generative tasks,\nand their potential applications in perception tasks are gaining interest. In\nthis paper, we introduce a novel framework named DiffRef3D which adopts the\ndiffusion process on 3D object detection with point clouds for the first time.\nSpecifically, we formulate the proposal refinement stage of two-stage 3D object\ndetectors as a conditional diffusion process. 
During training, DiffRef3D\ngradually adds noise to the residuals between proposals and target objects,\nthen applies the noisy residuals to proposals to generate hypotheses. The\nrefinement module utilizes these hypotheses to denoise the noisy residuals and\ngenerate accurate box predictions. In the inference phase, DiffRef3D generates\ninitial hypotheses by sampling noise from a Gaussian distribution as residuals\nand refines the hypotheses through iterative steps. DiffRef3D is a versatile\nproposal refinement framework that consistently improves the performance of\nexisting 3D object detection models. We demonstrate the significance of\nDiffRef3D through extensive experiments on the KITTI benchmark. Code will be\navailable.\n","authors":["Se-Ho Kim","Inyong Koo","Inyoung Lee","Byeongjun Park","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2310.16349v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.13369v3","updated":"2023-10-25T03:32:42Z","published":"2023-08-25T13:29:31Z","title":"Distribution-Aligned Diffusion for Human Mesh Recovery","summary":" Recovering a 3D human mesh from a single RGB image is a challenging task due\nto depth ambiguity and self-occlusion, resulting in a high degree of\nuncertainty. Meanwhile, diffusion models have recently seen much success in\ngenerating high-quality outputs by progressively denoising noisy inputs.\nInspired by their capability, we explore a diffusion-based approach for human\nmesh recovery, and propose a Human Mesh Diffusion (HMDiff) framework which\nframes mesh recovery as a reverse diffusion process. We also propose a\nDistribution Alignment Technique (DAT) that infuses prior distribution\ninformation into the mesh distribution diffusion process, and provides useful\nprior knowledge to facilitate the mesh recovery task. Our method achieves\nstate-of-the-art performance on three widely used datasets. 
Project page:\nhttps://gongjia0208.github.io/HMDiff/.\n","authors":["Lin Geng Foo","Jia Gong","Hossein Rahmani","Jun Liu"],"pdf_url":"https://arxiv.org/pdf/2308.13369v3.pdf","comment":"Accepted to ICCV 2023"},{"id":"http://arxiv.org/abs/2306.01755v2","updated":"2023-10-25T02:56:03Z","published":"2023-05-23T04:54:26Z","title":"Training Priors Predict Text-To-Image Model Performance","summary":" Text-to-image models can often generate some relations, i.e., \"astronaut\nriding horse\", but fail to generate other relations composed of the same basic\nparts, i.e., \"horse riding astronaut\". These failures are often taken as\nevidence that models rely on training priors rather than constructing novel\nimages compositionally. This paper tests this intuition on the stablediffusion\n2.1 text-to-image model. By looking at the subject-verb-object (SVO) triads\nthat underlie these prompts (e.g., \"astronaut\", \"ride\", \"horse\"), we find that\nthe more often an SVO triad appears in the training data, the better the model\ncan generate an image aligned with that triad. Here, by aligned we mean that\neach of the terms appears in the generated image in the proper relation to each\nother. Surprisingly, this increased frequency also diminishes how well the\nmodel can generate an image aligned with the flipped triad. For example, if\n\"astronaut riding horse\" appears frequently in the training data, the image for\n\"horse riding astronaut\" will tend to be poorly aligned. 
Our results thus show\nthat current models are biased to generate images with relations seen in\ntraining, and provide new data to the ongoing debate on whether these\ntext-to-image models employ abstract compositional structure in a traditional\nsense, or rather, interpolate between relations explicitly seen in the training\ndata.\n","authors":["Charles Lovering","Ellie Pavlick"],"pdf_url":"https://arxiv.org/pdf/2306.01755v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16305v1","updated":"2023-10-25T02:26:04Z","published":"2023-10-25T02:26:04Z","title":"Dolfin: Diffusion Layout Transformers without Autoencoder","summary":" In this paper, we introduce a novel generative model, Diffusion Layout\nTransformers without Autoencoder (Dolfin), which significantly improves the\nmodeling capability with reduced complexity compared to existing methods.\nDolfin employs a Transformer-based diffusion process to model layout\ngeneration. In addition to an efficient bi-directional (non-causal joint)\nsequence representation, we further propose an autoregressive diffusion model\n(Dolfin-AR) that is especially adept at capturing rich semantic correlations\nfor the neighboring objects, such as alignment, size, and overlap. When\nevaluated against standard generative layout benchmarks, Dolfin notably\nimproves performance across various metrics (fid, alignment, overlap, MaxIoU\nand DocSim scores), enhancing transparency and interoperability in the process.\nMoreover, Dolfin's applications extend beyond layout generation, making it\nsuitable for modeling geometric structures, such as line segments. 
Our\nexperiments present both qualitative and quantitative results to demonstrate\nthe advantages of Dolfin.\n","authors":["Yilin Wang","Zeyuan Chen","Liangjun Zhong","Zheng Ding","Zhizhou Sha","Zhuowen Tu"],"pdf_url":"https://arxiv.org/pdf/2310.16305v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.09338v2","updated":"2023-10-25T02:25:14Z","published":"2023-06-15T17:59:27Z","title":"Understanding Optimization of Deep Learning via Jacobian Matrix and\n Lipschitz Constant","summary":" This article provides a comprehensive understanding of optimization in deep\nlearning, with a primary focus on the challenges of gradient vanishing and\ngradient exploding, which normally lead to diminished model representational\nability and training instability, respectively. We analyze these two challenges\nthrough several strategic measures, including the improvement of gradient flow\nand the imposition of constraints on a network's Lipschitz constant. To help\nunderstand the current optimization methodologies, we categorize them into two\nclasses: explicit optimization and implicit optimization. Explicit optimization\nmethods involve direct manipulation of optimizer parameters, including weight,\ngradient, learning rate, and weight decay. Implicit optimization methods, by\ncontrast, focus on improving the overall landscape of a network by enhancing\nits modules, such as residual shortcuts, normalization methods, attention\nmechanisms, and activations. In this article, we provide an in-depth analysis\nof these two optimization classes and undertake a thorough examination of the\nJacobian matrices and the Lipschitz constants of many widely used deep learning\nmodules, highlighting existing issues as well as potential improvements.\nMoreover, we also conduct a series of analytical experiments to substantiate\nour theoretical discussions. This article does not aim to propose a new\noptimizer or network. 
Rather, our intention is to present a comprehensive\nunderstanding of optimization in deep learning. We hope that this article will\nassist readers in gaining a deeper insight in this field and encourages the\ndevelopment of more robust, efficient, and high-performing models.\n","authors":["Xianbiao Qi","Jianan Wang","Lei Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.09338v2.pdf","comment":"International Digital Economy Academy (IDEA)"},{"id":"http://arxiv.org/abs/2310.16295v1","updated":"2023-10-25T02:07:39Z","published":"2023-10-25T02:07:39Z","title":"Instance-wise Linearization of Neural Network for Model Interpretation","summary":" Neural network have achieved remarkable successes in many scientific fields.\nHowever, the interpretability of the neural network model is still a major\nbottlenecks to deploy such technique into our daily life. The challenge can\ndive into the non-linear behavior of the neural network, which rises a critical\nquestion that how a model use input feature to make a decision. The classical\napproach to address this challenge is feature attribution, which assigns an\nimportant score to each input feature and reveal its importance of current\nprediction. However, current feature attribution approaches often indicate the\nimportance of each input feature without detail of how they are actually\nprocessed by a model internally. These attribution approaches often raise a\nconcern that whether they highlight correct features for a model prediction.\n For a neural network model, the non-linear behavior is often caused by\nnon-linear activation units of a model. However, the computation behavior of a\nprediction from a neural network model is locally linear, because one\nprediction has only one activation pattern. Base on the observation, we propose\nan instance-wise linearization approach to reformulates the forward computation\nprocess of a neural network prediction. 
This approach reformulates different\nlayers of convolution neural networks into linear matrix multiplication.\nAggregating all layers' computation, a prediction complex convolution neural\nnetwork operations can be described as a linear matrix multiplication $F(x) = W\n\\cdot x + b$. This equation can not only provides a feature attribution map\nthat highlights the important of the input features but also tells how each\ninput feature contributes to a prediction exactly. Furthermore, we discuss the\napplication of this technique in both supervise classification and unsupervised\nneural network learning parametric t-SNE dimension reduction.\n","authors":["Zhimin Li","Shusen Liu","Kailkhura Bhavya","Timo Bremer","Valerio Pascucci"],"pdf_url":"https://arxiv.org/pdf/2310.16295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08872v3","updated":"2023-10-25T02:07:27Z","published":"2023-10-13T05:48:42Z","title":"R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image\n Generation","summary":" Recent text-to-image (T2I) diffusion models have achieved remarkable progress\nin generating high-quality images given text-prompts as input. However, these\nmodels fail to convey appropriate spatial composition specified by a layout\ninstruction. In this work, we probe into zero-shot grounded T2I generation with\ndiffusion models, that is, generating images corresponding to the input layout\ninformation without training auxiliary modules or finetuning diffusion models.\nWe propose a Region and Boundary (R&B) aware cross-attention guidance approach\nthat gradually modulates the attention maps of diffusion model during\ngenerative process, and assists the model to synthesize images (1) with high\nfidelity, (2) highly compatible with textual input, and (3) interpreting layout\ninstructions accurately. 
Specifically, we leverage the discrete sampling to\nbridge the gap between consecutive attention maps and discrete layout\nconstraints, and design a region-aware loss to refine the generative layout\nduring diffusion process. We further propose a boundary-aware loss to\nstrengthen object discriminability within the corresponding regions.\nExperimental results show that our method outperforms existing state-of-the-art\nzero-shot grounded T2I generation methods by a large margin both qualitatively\nand quantitatively on several benchmarks.\n","authors":["Jiayu Xiao","Liang Li","Henglei Lv","Shuhui Wang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08872v3.pdf","comment":"Preprint. Under review. Project page:\n https://sagileo.github.io/Region-and-Boundary"},{"id":"http://arxiv.org/abs/2310.16288v1","updated":"2023-10-25T01:46:35Z","published":"2023-10-25T01:46:35Z","title":"MotionAGFormer: Enhancing 3D Human Pose Estimation with a\n Transformer-GCNFormer Network","summary":" Recent transformer-based approaches have demonstrated excellent performance\nin 3D human pose estimation. However, they have a holistic view and by encoding\nglobal relationships between all the joints, they do not capture the local\ndependencies precisely. In this paper, we present a novel Attention-GCNFormer\n(AGFormer) block that divides the number of channels by using two parallel\ntransformer and GCNFormer streams. Our proposed GCNFormer module exploits the\nlocal relationship between adjacent joints, outputting a new representation\nthat is complementary to the transformer output. By fusing these two\nrepresentation in an adaptive way, AGFormer exhibits the ability to better\nlearn the underlying 3D structure. By stacking multiple AGFormer blocks, we\npropose MotionAGFormer in four different variants, which can be chosen based on\nthe speed-accuracy trade-off. We evaluate our model on two popular benchmark\ndatasets: Human3.6M and MPI-INF-3DHP. 
MotionAGFormer-B achieves\nstate-of-the-art results, with P1 errors of 38.4mm and 16.2mm, respectively.\nRemarkably, it uses a quarter of the parameters and is three times more\ncomputationally efficient than the previous leading model on Human3.6M dataset.\nCode and models are available at https://github.com/TaatiTeam/MotionAGFormer.\n","authors":["Soroush Mehraban","Vida Adeli","Babak Taati"],"pdf_url":"https://arxiv.org/pdf/2310.16288v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16279v1","updated":"2023-10-25T01:24:12Z","published":"2023-10-25T01:24:12Z","title":"TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer","summary":" Estimating the 6D object pose is an essential task in many applications. Due\nto the lack of depth information, existing RGB-based methods are sensitive to\nocclusion and illumination changes. How to extract and utilize the geometry\nfeatures in depth information is crucial to achieve accurate predictions. To\nthis end, we propose TransPose, a novel 6D pose framework that exploits\nTransformer Encoder with geometry-aware module to develop better learning of\npoint cloud feature representations. Specifically, we first uniformly sample\npoint cloud and extract local geometry features with the designed local feature\nextractor base on graph convolution network. To improve robustness to\nocclusion, we adopt Transformer to perform the exchange of global information,\nmaking each local feature contains global information. Finally, we introduce\ngeometry-aware module in Transformer Encoder, which to form an effective\nconstrain for point cloud feature learning and makes the global information\nexchange more tightly coupled with point cloud tasks. 
Extensive experiments\nindicate the effectiveness of TransPose, our pose estimation pipeline achieves\ncompetitive results on three benchmark datasets.\n","authors":["Xiao Lin","Deming Wang","Guangliang Zhou","Chengju Liu","Qijun Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16279v1.pdf","comment":"10 pages, 5 figures, IEEE Journal"},{"id":"http://arxiv.org/abs/2310.16273v1","updated":"2023-10-25T01:06:18Z","published":"2023-10-25T01:06:18Z","title":"Deep Learning for Plant Identification and Disease Classification from\n Leaf Images: Multi-prediction Approaches","summary":" Deep learning plays an important role in modern agriculture, especially in\nplant pathology using leaf images where convolutional neural networks (CNN) are\nattracting a lot of attention. While numerous reviews have explored the\napplications of deep learning within this research domain, there remains a\nnotable absence of an empirical study to offer insightful comparisons due to\nthe employment of varied datasets in the evaluation. Furthermore, a majority of\nthese approaches tend to address the problem as a singular prediction task,\noverlooking the multifaceted nature of predicting various aspects of plant\nspecies and disease types. Lastly, there is an evident need for a more profound\nconsideration of the semantic relationships that underlie plant species and\ndisease types. In this paper, we start our study by surveying current deep\nlearning approaches for plant identification and disease classification. We\ncategorise the approaches into multi-model, multi-label, multi-output, and\nmulti-task, in which different backbone CNNs can be employed. Furthermore,\nbased on the survey of existing approaches in plant pathology and the study of\navailable approaches in machine learning, we propose a new model named\nGeneralised Stacking Multi-output CNN (GSMo-CNN). 
To investigate the\neffectiveness of different backbone CNNs and learning approaches, we conduct an\nintensive experiment on three benchmark datasets Plant Village, Plant Leaves,\nand PlantDoc. The experimental results demonstrate that InceptionV3 can be a\ngood choice for a backbone CNN as its performance is better than AlexNet,\nVGG16, ResNet101, EfficientNet, MobileNet, and a custom CNN developed by us.\nInterestingly, empirical results support the hypothesis that using a single\nmodel can be comparable or better than using two models. Finally, we show that\nthe proposed GSMo-CNN achieves state-of-the-art performance on three benchmark\ndatasets.\n","authors":["Jianping Yao","Son N. Tran","Saurabh Garg","Samantha Sawyer"],"pdf_url":"https://arxiv.org/pdf/2310.16273v1.pdf","comment":"Jianping and Son are joint first authors (equal contribution)"},{"id":"http://arxiv.org/abs/2310.13276v2","updated":"2023-10-25T00:46:42Z","published":"2023-10-20T04:45:44Z","title":"InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution","summary":" Over recent decades, significant advancements in cross-modal retrieval are\nmainly driven by breakthroughs in visual and linguistic modeling. However, a\nrecent study shows that multi-modal data representations tend to cluster within\na limited convex cone (as representation degeneration problem), which hinders\nretrieval performance due to the inseparability of these representations. In\nour study, we first empirically validate the presence of the representation\ndegeneration problem across multiple cross-modal benchmarks and methods. Next,\nto address it, we introduce a novel method, called InvGC, a post-processing\ntechnique inspired by graph convolution and average pooling. Specifically,\nInvGC defines the graph topology within the datasets and then applies graph\nconvolution in a subtractive manner. This method effectively separates\nrepresentations by increasing the distances between data points. 
To improve the\nefficiency and effectiveness of InvGC, we propose an advanced graph topology,\nLocalAdj, which only aims to increase the distances between each data point and\nits nearest neighbors. To understand why InvGC works, we present a detailed\ntheoretical analysis, proving that the lower bound of recall will be improved\nafter deploying InvGC. Extensive empirical results show that InvGC and InvGC\nw/LocalAdj significantly mitigate the representation degeneration problem,\nthereby enhancing retrieval performance.\n Our code is available at\nhttps://github.com/yimuwangcs/Better_Cross_Modal_Retrieval\n","authors":["Xiangru Jian","Yimu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.13276v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16267v1","updated":"2023-10-25T00:46:26Z","published":"2023-10-25T00:46:26Z","title":"SCB-ST-Dataset4: Extending the Spatio-Temporal Behavior Dataset in\n Student Classroom Scenarios Through Image Dataset Method","summary":" Using deep learning methods to detect students' classroom behavior\nautomatically is a promising approach for analyzing their class performance and\nimproving teaching effectiveness. However, the lack of publicly available\nspatio-temporal datasets on student behavior, as well as the high cost of\nmanually labeling such datasets, pose significant challenges for researchers in\nthis field. To address this issue, we proposed a method for extending the\nspatio-temporal behavior dataset in Student Classroom Scenarios\n(SCB-ST-Dataset4) through image dataset. Our SCB-ST-Dataset4 comprises 754094\nimages with 25670 labels, focusing on 3 behaviors: hand-raising, reading,\nwriting. Our proposed method can rapidly generate spatio-temporal behavioral\ndatasets without requiring annotation. Furthermore, we proposed a Behavior\nSimilarity Index (BSI) to explore the similarity of behaviors. 
We evaluated the\ndataset using the YOLOv5, YOLOv7, YOLOv8, and SlowFast algorithms, achieving a\nmean average precision (map) of up to 82.3%. The experiment further\ndemonstrates the effectiveness of our method. This dataset provides a robust\nfoundation for future research in student behavior detection, potentially\ncontributing to advancements in this field. The SCB-ST-Dataset4 is available\nfor download at: https://github.com/Whiffe/SCB-dataset.\n","authors":["Fan Yang","Xiaofei Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16267v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2310.02522;\n text overlap with arXiv:2306.03318"},{"id":"http://arxiv.org/abs/2308.16355v2","updated":"2023-10-25T00:22:50Z","published":"2023-08-30T23:03:49Z","title":"A Recycling Training Strategy for Medical Image Segmentation with\n Diffusion Denoising Models","summary":" Denoising diffusion models have found applications in image segmentation by\ngenerating segmented masks conditioned on images. Existing studies\npredominantly focus on adjusting model architecture or improving inference,\nsuch as test-time sampling strategies. In this work, we focus on improving the\ntraining strategy and propose a novel recycling method. During each training\nstep, a segmentation mask is first predicted given an image and a random noise.\nThis predicted mask, which replaces the conventional ground truth mask, is used\nfor denoising task during training. This approach can be interpreted as\naligning the training strategy with inference by eliminating the dependence on\nground truth masks for generating noisy samples. Our proposed method\nsignificantly outperforms standard diffusion training, self-conditioning, and\nexisting recycling strategies across multiple medical imaging data sets: muscle\nultrasound, abdominal CT, prostate MR, and brain MR. 
This holds for two widely\nadopted sampling strategies: denoising diffusion probabilistic model and\ndenoising diffusion implicit model. Importantly, existing diffusion models\noften display a declining or unstable performance during inference, whereas our\nnovel recycling consistently enhances or maintains performance. We show that,\nunder a fair comparison with the same network architectures and computing\nbudget, the proposed recycling-based diffusion models achieved on-par\nperformance with non-diffusion-based supervised training. By ensembling the\nproposed diffusion and the non-diffusion models, significant improvements to\nthe non-diffusion models have been observed across all applications,\ndemonstrating the value of this novel training method. This paper summarizes\nthese quantitative results and discusses their values, with a fully\nreproducible JAX-based implementation, released at\nhttps://github.com/mathpluscode/ImgX-DiffSeg.\n","authors":["Yunguan Fu","Yiwen Li","Shaheer U Saeed","Matthew J Clarkson","Yipeng Hu"],"pdf_url":"https://arxiv.org/pdf/2308.16355v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16255v1","updated":"2023-10-25T00:20:37Z","published":"2023-10-25T00:20:37Z","title":"UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception","summary":" Tremendous variations coupled with large degrees of freedom in UAV-based\nimaging conditions lead to a significant lack of data in adequately learning\nUAV-based perception models. Using various synthetic renderers in conjunction\nwith perception models is prevalent to create synthetic data to augment the\nlearning in the ground-based imaging domain. However, severe challenges in the\naustere UAV-based domain require distinctive solutions to image synthesis for\ndata augmentation. 
In this work, we leverage recent advancements in neural\nrendering to improve static and dynamic novel-view UAV-based image synthesis,\nespecially from high altitudes, capturing salient scene attributes. Finally, we\ndemonstrate a considerable performance boost is achieved when a state-of-the-art\ndetection model is optimized primarily on hybrid sets of real and synthetic\ndata instead of the real or synthetic data separately.\n","authors":["Christopher Maxey","Jaehoon Choi","Hyungtae Lee","Dinesh Manocha","Heesung Kwon"],"pdf_url":"https://arxiv.org/pdf/2310.16255v1.pdf","comment":"Video Link: https://www.youtube.com/watch?v=ucPzbPLqqpI"},{"id":"http://arxiv.org/abs/2310.17050v1","updated":"2023-10-25T23:23:57Z","published":"2023-10-25T23:23:57Z","title":"Exploring Question Decomposition for Zero-Shot VQA","summary":" Visual question answering (VQA) has traditionally been treated as a\nsingle-step task where each question receives the same amount of effort, unlike\nnatural human question-answering strategies. We explore a question\ndecomposition strategy for VQA to overcome this limitation. We probe the\nability of recently developed large vision-language models to use human-written\ndecompositions and produce their own decompositions of visual questions,\nfinding they are capable of learning both tasks from demonstrations alone.\nHowever, we show that naive application of model-written decompositions can\nhurt performance. We introduce a model-driven selective decomposition approach\nfor second-guessing predictions and correcting errors, and validate its\neffectiveness on eight VQA tasks across three domains, showing consistent\nimprovements in accuracy, including improvements of >20% on medical VQA\ndatasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA\nreformulation of the challenging Winoground task. 
Project Site:\nhttps://zaidkhan.me/decomposition-0shot-vqa/\n","authors":["Zaid Khan","Vijay Kumar BG","Samuel Schulter","Manmohan Chandraker","Yun Fu"],"pdf_url":"https://arxiv.org/pdf/2310.17050v1.pdf","comment":"NeurIPS 2023 Camera Ready"},{"id":"http://arxiv.org/abs/2304.12470v2","updated":"2023-10-25T23:13:46Z","published":"2023-04-24T21:58:14Z","title":"Recurrent Transformer Encoders for Vision-based Estimation of Fatigue\n and Engagement in Cognitive Training Sessions","summary":" Computerized cognitive training (CCT) is a scalable, well-tolerated\nintervention that has promise for slowing cognitive decline. Outcomes from CCT\nare limited by a lack of effective engagement, which is decreased by factors\nsuch as mental fatigue, particularly in older adults at risk for dementia.\nThere is a need for scalable, automated measures that can monitor mental\nfatigue during CCT. Here, we develop and validate a novel Recurrent Video\nTransformer (RVT) method for monitoring real-time mental fatigue in older\nadults with mild cognitive impairment from video-recorded facial gestures\nduring CCT. The RVT model achieved the highest balanced accuracy(78%) and\nprecision (0.82) compared to the prior state-of-the-art models for binary and\nmulti-class classification of mental fatigue and was additionally validated via\nsignificant association (p=0.023) with CCT reaction time. 
By leveraging dynamic\ntemporal information, the RVT model demonstrates the potential to accurately\nmeasure real-time mental fatigue, laying the foundation for future personalized\nCCT that increase effective engagement.\n","authors":["Yanchen Wang","Yunlong Xu","Feng Vankee Lin","Ehsan Adeli"],"pdf_url":"https://arxiv.org/pdf/2304.12470v2.pdf","comment":"23 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.17042v1","updated":"2023-10-25T22:45:31Z","published":"2023-10-25T22:45:31Z","title":"StochGradAdam: Accelerating Neural Networks Training with Stochastic\n Gradient Sampling","summary":" In the rapidly advancing domain of deep learning optimization, this paper\nunveils the StochGradAdam optimizer, a novel adaptation of the well-regarded\nAdam algorithm. Central to StochGradAdam is its gradient sampling technique.\nThis method not only ensures stable convergence but also leverages the\nadvantages of selective gradient consideration, fostering robust training by\npotentially mitigating the effects of noisy or outlier data and enhancing the\nexploration of the loss landscape for more dependable convergence. In both\nimage classification and segmentation tasks, StochGradAdam has demonstrated\nsuperior performance compared to the traditional Adam optimizer. By judiciously\nsampling a subset of gradients at each iteration, the optimizer is optimized\nfor managing intricate models. The paper provides a comprehensive exploration\nof StochGradAdam's methodology, from its mathematical foundations to bias\ncorrection strategies, heralding a promising advancement in deep learning\ntraining techniques.\n","authors":["Juyoung Yun"],"pdf_url":"https://arxiv.org/pdf/2310.17042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.12406v2","updated":"2023-10-25T22:28:12Z","published":"2023-04-24T19:37:23Z","title":"AutoFocusFormer: Image Segmentation off the Grid","summary":" Real world images often have highly imbalanced content density. 
Some areas\nare very uniform, e.g., large patches of blue sky, while other areas are\nscattered with many small objects. Yet, the commonly used successive grid\ndownsampling strategy in convolutional deep networks treats all areas equally.\nHence, small objects are represented in very few spatial locations, leading to\nworse results in tasks such as segmentation. Intuitively, retaining more pixels\nrepresenting small objects during downsampling helps to preserve important\ninformation. To achieve this, we propose AutoFocusFormer (AFF), a\nlocal-attention transformer image recognition backbone, which performs adaptive\ndownsampling by learning to retain the most important pixels for the task.\nSince adaptive downsampling generates a set of pixels irregularly distributed\non the image plane, we abandon the classic grid structure. Instead, we develop\na novel point-based local attention block, facilitated by a balanced clustering\nmodule and a learnable neighborhood merging module, which yields\nrepresentations for our point-based versions of state-of-the-art segmentation\nheads. Experiments show that our AutoFocusFormer (AFF) improves significantly\nover baseline models of similar sizes.\n","authors":["Chen Ziwen","Kaushik Patnaik","Shuangfei Zhai","Alvin Wan","Zhile Ren","Alex Schwing","Alex Colburn","Li Fuxin"],"pdf_url":"https://arxiv.org/pdf/2304.12406v2.pdf","comment":"CVPR 2023"},{"id":"http://arxiv.org/abs/2104.02206v7","updated":"2023-10-25T21:51:26Z","published":"2021-04-06T00:53:01Z","title":"Tuned Compositional Feature Replays for Efficient Stream Learning","summary":" Our brains extract durable, generalizable knowledge from transient\nexperiences of the world. Artificial neural networks come nowhere close: when\ntasked with learning to classify objects by training on non-repeating video\nframes in temporal order (online stream learning), models that learn well from\nshuffled datasets catastrophically forget old knowledge upon learning new\nstimuli. 
We propose a new continual learning algorithm, Compositional Replay\nUsing Memory Blocks (CRUMB), which mitigates forgetting by replaying feature\nmaps reconstructed by recombining generic parts. CRUMB concatenates trainable\nand re-usable \"memory block\" vectors to compositionally reconstruct feature map\ntensors in convolutional neural networks, like crumbs forming a loaf of bread.\nCRUMB stores the indices of memory blocks used to reconstruct new stimuli,\nenabling replay of specific memories during later tasks. This reconstruction\nmechanism also primes the neural network to minimize catastrophic forgetting by\nforcing it to attend to information about object shapes more than information\nabout image textures, and stabilizes the network during stream learning by\nproviding a shared feature-level basis for all training examples. These\nproperties allow CRUMB to outperform an otherwise identical algorithm that\nstores and replays raw images while occupying only 3.6% as much memory. We\nstress-tested CRUMB alongside 13 competing methods on 7 challenging datasets.\nTo address the limited number of existing online stream learning datasets, we\nintroduce 2 new benchmarks by adapting existing datasets for stream learning.\nWith about 4% as much memory and 30% as much runtime, CRUMB mitigates\ncatastrophic forgetting more effectively than the prior state-of-the-art. Our\ncode is available on GitHub at https://github.com/MorganBDT/crumb.\n","authors":["Morgan B. Talbot","Rushikesh Zawar","Rohil Badkundri","Mengmi Zhang","Gabriel Kreiman"],"pdf_url":"https://arxiv.org/pdf/2104.02206v7.pdf","comment":"Copyright 2023 IEEE. 
Personal use of this material is permitted.\n Permission from IEEE must be obtained for all other uses, in any current or\n future media, including reprinting/republishing this material for advertising\n or promotional purposes, creating new collective works, for resale or\n redistribution to servers or lists, or reuse of any copyrighted component of\n this work in other works"},{"id":"http://arxiv.org/abs/2303.00905v2","updated":"2023-10-25T21:45:24Z","published":"2023-03-02T01:55:10Z","title":"Open-World Object Manipulation using Pre-trained Vision-Language Models","summary":" For robots to follow instructions from people, they must be able to connect\nthe rich semantic information in human vocabulary, e.g. \"can you get me the\npink stuffed whale?\" to their sensory observations and actions. This brings up\na notably difficult challenge for robots: while robot learning approaches allow\nrobots to learn many different behaviors from first-hand experience, it is\nimpractical for robots to have first-hand experiences that span all of this\nsemantic information. We would like a robot's policy to be able to perceive and\npick up the pink stuffed whale, even if it has never seen any data interacting\nwith a stuffed whale before. Fortunately, static data on the internet has vast\nsemantic information, and this information is captured in pre-trained\nvision-language models. In this paper, we study whether we can interface robot\npolicies with these pre-trained models, with the aim of allowing robots to\ncomplete instructions involving object categories that the robot has never seen\nfirst-hand. We develop a simple approach, which we call Manipulation of\nOpen-World Objects (MOO), which leverages a pre-trained vision-language model\nto extract object-identifying information from the language command and image,\nand conditions the robot policy on the current image, the instruction, and the\nextracted object information. 
In a variety of experiments on a real mobile\nmanipulator, we find that MOO generalizes zero-shot to a wide range of novel\nobject categories and environments. In addition, we show how MOO generalizes to\nother, non-language-based input modalities to specify the object of interest\nsuch as finger pointing, and how it can be further extended to enable\nopen-world navigation and manipulation. The project's website and evaluation\nvideos can be found at https://robot-moo.github.io/\n","authors":["Austin Stone","Ted Xiao","Yao Lu","Keerthana Gopalakrishnan","Kuang-Huei Lee","Quan Vuong","Paul Wohlhart","Sean Kirmani","Brianna Zitkovich","Fei Xia","Chelsea Finn","Karol Hausman"],"pdf_url":"https://arxiv.org/pdf/2303.00905v2.pdf","comment":"Accepted at the 7th Conference on Robot Learning (CoRL 2023)"},{"id":"http://arxiv.org/abs/2304.03659v2","updated":"2023-10-25T21:36:46Z","published":"2023-04-07T14:26:11Z","title":"Probing Conceptual Understanding of Large Visual-Language Models","summary":" In recent years large visual-language (V+L) models have achieved great\nsuccess in various downstream tasks. However, it is not well studied whether\nthese models have a conceptual grasp of the visual content. In this work we\nfocus on conceptual understanding of these large V+L models.To facilitate this\nstudy, we propose novel benchmarking datasets for probing three different\naspects of content understanding, 1) relations, 2) composition and 3) context.\nOur probes are grounded in cognitive science and help determine if a V+L model\ncan, for example, determine if ``snow garnished with a man'' is implausible, or\nif it can identify beach furniture by knowing it is located on a beach. We\nexperimented with five different state-of-the-art V+L models and observe that\nthese models mostly fail to demonstrate a conceptual understanding. 
This study\nreveals several interesting insights such as cross-attention helps learning\nconceptual understanding, and that CNNs are better with texture and patterns,\nwhile Transformers are better at color and shape. We further utilize some of\nthese insights and propose a baseline for improving performance by a simple\nfinetuning technique that rewards the three conceptual understanding measures\nwith promising initial results. We believe that the proposed benchmarks will\nhelp the community assess and improve the conceptual understanding capabilities\nof large V+L models.\n","authors":["Madeline Chantry Schiappa","Michael Cogswell","Ajay Divakaran","Yogesh Singh Rawat"],"pdf_url":"https://arxiv.org/pdf/2304.03659v2.pdf","comment":"All code and dataset is available at:\n https://tinyurl.com/vlm-robustness"},{"id":"http://arxiv.org/abs/2310.16999v1","updated":"2023-10-25T20:55:07Z","published":"2023-10-25T20:55:07Z","title":"Trust, but Verify: Robust Image Segmentation using Deep Learning","summary":" We describe a method for verifying the output of a deep neural network for\nmedical image segmentation that is robust to several classes of random as well\nas worst-case perturbations i.e. adversarial attacks. This method is based on a\ngeneral approach recently developed by the authors called ``Trust, but Verify\"\nwherein an auxiliary verification network produces predictions about certain\nmasked features in the input image using the segmentation as an input. A\nwell-designed auxiliary network will produce high-quality predictions when the\ninput segmentations are accurate, but will produce low-quality predictions when\nthe segmentations are incorrect. Checking the predictions of such a network\nwith the original image allows us to detect bad segmentations. However, to\nensure the verification method is truly robust, we need a method for checking\nthe quality of the predictions that does not itself rely on a black-box neural\nnetwork. 
Indeed, we show that previous methods for segmentation evaluation that\ndo use deep neural regression networks are vulnerable to false negatives i.e.\ncan inaccurately label bad segmentations as good. We describe the design of a\nverification network that avoids such vulnerability and present results to\ndemonstrate its robustness compared to previous methods.\n","authors":["Fahim Ahmed Zaman","Xiaodong Wu","Weiyu Xu","Milan Sonka","Raghuraman Mudumbai"],"pdf_url":"https://arxiv.org/pdf/2310.16999v1.pdf","comment":"5 Pages, 8 Figures, conference"},{"id":"http://arxiv.org/abs/2306.00650v2","updated":"2023-10-25T20:50:19Z","published":"2023-06-01T13:16:10Z","title":"Universal Test-time Adaptation through Weight Ensembling, Diversity\n Weighting, and Prior Correction","summary":" Since distribution shifts are likely to occur during test-time and can\ndrastically decrease the model's performance, online test-time adaptation (TTA)\ncontinues to update the model after deployment, leveraging the current test\ndata. Clearly, a method proposed for online TTA has to perform well for all\nkinds of environmental conditions. By introducing the variable factors domain\nnon-stationarity and temporal correlation, we first unfold all practically\nrelevant settings and define the entity as universal TTA. We want to highlight\nthat this is the first work that covers such a broad spectrum, which is\nindispensable for the use in practice. To tackle the problem of universal TTA,\nwe identify and highlight several challenges a self-training based method has\nto deal with: 1) model bias and the occurrence of trivial solutions when\nperforming entropy minimization on varying sequence lengths with and without\nmultiple domain shifts, 2) loss of generalization which exacerbates the\nadaptation to multiple domain shifts and the occurrence of catastrophic\nforgetting, and 3) performance degradation due to shifts in class prior. 
To\nprevent the model from becoming biased, we leverage a dataset and\nmodel-agnostic certainty and diversity weighting. In order to maintain\ngeneralization and prevent catastrophic forgetting, we propose to continually\nweight-average the source and adapted model. To compensate for disparities in\nthe class prior during test-time, we propose an adaptive prior correction\nscheme that reweights the model's predictions. We evaluate our approach, named\nROID, on a wide range of settings, datasets, and models, setting new standards\nin the field of universal TTA. Code is available at:\nhttps://github.com/mariodoebler/test-time-adaptation\n","authors":["Robert A. Marsden","Mario Döbler","Bin Yang"],"pdf_url":"https://arxiv.org/pdf/2306.00650v2.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2310.16991v1","updated":"2023-10-25T20:42:20Z","published":"2023-10-25T20:42:20Z","title":"An Efficient Deep Learning-based approach for Recognizing Agricultural\n Pests in the Wild","summary":" One of the biggest challenges that the farmers go through is to fight insect\npests during agricultural product yields. The problem can be solved easily and\navoid economic losses by taking timely preventive measures. This requires\nidentifying insect pests in an easy and effective manner. Most of the insect\nspecies have similarities between them. Without proper help from the\nagriculturist academician it is very challenging for the farmers to identify\nthe crop pests accurately. To address this issue we have done extensive\nexperiments considering different methods to find out the best method among\nall. This paper presents a detailed overview of the experiments done on mainly\na robust dataset named IP102 including transfer learning with finetuning,\nattention mechanism and custom architecture. 
Some example from another dataset\nD0 is also shown to show robustness of our experimented techniques.\n","authors":["Mohtasim Hadi Rafi","Mohammad Ratul Mahjabin","Md Sabbir Rahman"],"pdf_url":"https://arxiv.org/pdf/2310.16991v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16979v1","updated":"2023-10-25T20:31:07Z","published":"2023-10-25T20:31:07Z","title":"Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo\n Label Self-Refinement","summary":" Deep learning-based solutions for semantic segmentation suffer from\nsignificant performance degradation when tested on data with different\ncharacteristics than what was used during the training. Adapting the models\nusing annotated data from the new domain is not always practical. Unsupervised\nDomain Adaptation (UDA) approaches are crucial in deploying these models in the\nactual operating conditions. Recent state-of-the-art (SOTA) UDA methods employ\na teacher-student self-training approach, where a teacher model is used to\ngenerate pseudo-labels for the new data which in turn guide the training\nprocess of the student model. Though this approach has seen a lot of success,\nit suffers from the issue of noisy pseudo-labels being propagated in the\ntraining process. To address this issue, we propose an auxiliary pseudo-label\nrefinement network (PRN) for online refining of the pseudo labels and also\nlocalizing the pixels whose predicted labels are likely to be noisy. Being able\nto improve the quality of pseudo labels and select highly reliable ones, PRN\nhelps self-training of segmentation models to be robust against pseudo label\nnoise propagation during different stages of adaptation. 
We evaluate our\napproach on benchmark datasets with three different domain shifts, and our\napproach consistently performs significantly better than the previous\nstate-of-the-art methods.\n","authors":["Xingchen Zhao","Niluthpol Chowdhury Mithun","Abhinav Rajvanshi","Han-Pang Chiu","Supun Samarasekera"],"pdf_url":"https://arxiv.org/pdf/2310.16979v1.pdf","comment":"WACV 2024"},{"id":"http://arxiv.org/abs/2310.16978v1","updated":"2023-10-25T20:28:22Z","published":"2023-10-25T20:28:22Z","title":"The Significance of Machine Learning in Clinical Disease Diagnosis: A\n Review","summary":" The global need for effective disease diagnosis remains substantial, given\nthe complexities of various disease mechanisms and diverse patient symptoms. To\ntackle these challenges, researchers, physicians, and patients are turning to\nmachine learning (ML), an artificial intelligence (AI) discipline, to develop\nsolutions. By leveraging sophisticated ML and AI methods, healthcare\nstakeholders gain enhanced diagnostic and treatment capabilities. However,\nthere is a scarcity of research focused on ML algorithms for enhancing the\naccuracy and computational efficiency. This research investigates the capacity\nof machine learning algorithms to improve the transmission of heart rate data\nin time series healthcare metrics, concentrating particularly on optimizing\naccuracy and efficiency. By exploring various ML algorithms used in healthcare\napplications, the review presents the latest trends and approaches in ML-based\ndisease diagnosis (MLBDD). The factors under consideration include the\nalgorithm utilized, the types of diseases targeted, the data types employed,\nthe applications, and the evaluation metrics. This review aims to shed light on\nthe prospects of ML in healthcare, particularly in disease diagnosis. 
By\nanalyzing the current literature, the study provides insights into\nstate-of-the-art methodologies and their performance metrics.\n","authors":["S M Atikur Rahman","Sifat Ibtisum","Ehsan Bazgir","Tumpa Barai"],"pdf_url":"https://arxiv.org/pdf/2310.16978v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2306.01874v3","updated":"2023-10-25T20:25:41Z","published":"2023-06-02T19:07:52Z","title":"SACSoN: Scalable Autonomous Control for Social Navigation","summary":" Machine learning provides a powerful tool for building socially compliant\nrobotic systems that go beyond simple predictive models of human behavior. By\nobserving and understanding human interactions from past experiences, learning\ncan enable effective social navigation behaviors directly from data. In this\npaper, our goal is to develop methods for training policies for socially\nunobtrusive navigation, such that robots can navigate among humans in ways that\ndon't disturb human behavior. We introduce a definition for such behavior based\non the counterfactual perturbation of the human: if the robot had not intruded\ninto the space, would the human have acted in the same way? By minimizing this\ncounterfactual perturbation, we can induce robots to behave in ways that do not\nalter the natural behavior of humans in the shared space. Instantiating this\nprinciple requires training policies to minimize their effect on human\nbehavior, and this in turn requires data that allows us to model the behavior\nof humans in the presence of robots. Therefore, our approach is based on two\nkey contributions. First, we collect a large dataset where an indoor mobile\nrobot interacts with human bystanders. Second, we utilize this dataset to train\npolicies that minimize counterfactual perturbation. 
We provide supplementary\nvideos and make publicly available the largest-of-its-kind visual navigation\ndataset on our project page.\n","authors":["Noriaki Hirose","Dhruv Shah","Ajay Sridhar","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2306.01874v3.pdf","comment":"11 pages, 15 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.02255v2","updated":"2023-10-25T20:22:24Z","published":"2023-10-03T17:57:24Z","title":"MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V,\n Bard, and Other Large Multimodal Models","summary":" Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit\nimpressive problem-solving skills in many tasks and domains, but their ability\nin mathematical reasoning in visual contexts has not been systematically\nstudied. To bridge this gap, we present MathVista, a benchmark designed to\ncombine challenges from diverse mathematical and visual tasks. It consists of\n6,141 examples, derived from 28 existing multimodal datasets involving\nmathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and\nPaperQA). Completing these tasks requires fine-grained, deep visual\nunderstanding and compositional reasoning, which all state-of-the-art\nfoundation models find challenging. With MathVista, we have conducted a\ncomprehensive, quantitative evaluation of 12 prominent foundation models. The\nbest-performing GPT-4V model achieves an overall accuracy of 49.9%,\nsubstantially outperforming Bard, the second-best performer, by 15.1%. Our\nin-depth analysis reveals that the superiority of GPT-4V is mainly attributed\nto its enhanced visual perception and mathematical reasoning. However, GPT-4V\nstill falls short of human performance by 10.4%, as it often struggles to\nunderstand complex figures and perform rigorous reasoning. 
This significant gap\nunderscores the critical role that MathVista will play in the development of\ngeneral-purpose AI agents capable of tackling mathematically intensive and\nvisually rich real-world tasks. We further explore the new ability of\nself-verification, the application of self-consistency, and the interactive\nchatbot capabilities of GPT-4V, highlighting its promising potential for future\nresearch. The project is available at https://mathvista.github.io/.\n","authors":["Pan Lu","Hritik Bansal","Tony Xia","Jiacheng Liu","Chunyuan Li","Hannaneh Hajishirzi","Hao Cheng","Kai-Wei Chang","Michel Galley","Jianfeng Gao"],"pdf_url":"https://arxiv.org/pdf/2310.02255v2.pdf","comment":"112 pages, 117 figures. Work in progress"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.16738v1","updated":"2023-10-25T16:11:55Z","published":"2023-10-25T16:11:55Z","title":"Improving Conversational Recommendation Systems via Bias Analysis and\n Language-Model-Enhanced Data Augmentation","summary":" Conversational Recommendation System (CRS) is a rapidly growing research area\nthat has gained significant attention alongside advancements in language\nmodelling techniques. However, the current state of conversational\nrecommendation faces numerous challenges due to its relative novelty and\nlimited existing contributions. In this study, we delve into benchmark datasets\nfor developing CRS models and address potential biases arising from the\nfeedback loop inherent in multi-turn interactions, including selection bias and\nmultiple popularity bias variants. Drawing inspiration from the success of\ngenerative data via using language models and data augmentation techniques, we\npresent two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model\nperformance while mitigating biases. 
Through extensive experiments on ReDial\nand TG-ReDial benchmark datasets, we show a consistent improvement of CRS\ntechniques with our data augmentation approaches and offer additional insights\non addressing multiple newly formulated biases.\n","authors":["Xi Wang","Hossein A. Rahmani","Jiqun Liu","Emine Yilmaz"],"pdf_url":"https://arxiv.org/pdf/2310.16738v1.pdf","comment":"Accepted by EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.16673v1","updated":"2023-10-25T14:38:40Z","published":"2023-10-25T14:38:40Z","title":"Exploring Large Language Models for Code Explanation","summary":" Automating code documentation through explanatory text can prove highly\nbeneficial in code understanding. Large Language Models (LLMs) have made\nremarkable strides in Natural Language Processing, especially within software\nengineering tasks such as code generation and code summarization. This study\nspecifically delves into the task of generating natural-language summaries for\ncode snippets, using various LLMs. The findings indicate that Code LLMs\noutperform their generic counterparts, and zero-shot methods yield superior\nresults when dealing with datasets with dissimilar distributions between\ntraining and testing sets.\n","authors":["Paheli Bhattacharya","Manojit Chakraborty","Kartheek N S N Palepu","Vikas Pandey","Ishan Dindorkar","Rakesh Rajpurohit","Rishabh Gupta"],"pdf_url":"https://arxiv.org/pdf/2310.16673v1.pdf","comment":"Accepted at the Forum for Information Retrieval Evaluation 2023 (IRSE\n Track)"},{"id":"http://arxiv.org/abs/2304.09097v2","updated":"2023-10-25T14:05:10Z","published":"2023-04-07T07:03:54Z","title":"Sheaf Neural Networks for Graph-based Recommender Systems","summary":" Recent progress in Graph Neural Networks has resulted in wide adoption by\nmany applications, including recommendation systems. 
The reason for Graph\nNeural Networks' superiority over other approaches is that many problems in\nrecommendation systems can be naturally modeled as graphs, where nodes can be\neither users or items and edges represent preference relationships. In current\nGraph Neural Network approaches, nodes are represented with a static vector\nlearned at training time. This static vector might only be suitable to capture\nsome of the nuances of users or items they define. To overcome this limitation,\nwe propose using a recently proposed model inspired by category theory: Sheaf\nNeural Networks. Sheaf Neural Networks, and its connected Laplacian, can\naddress the previous problem by associating every node (and edge) with a vector\nspace instead than a single vector. The vector space representation is richer\nand allows picking the proper representation at inference time. This approach\ncan be generalized for different related tasks on graphs and achieves\nstate-of-the-art performance in terms of F1-Score@N in collaborative filtering\nand Hits@20 in link prediction. For collaborative filtering, the approach is\nevaluated on the MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a\n5.4% improvement and on Book-Crossing with a 2.8% improvement, while for link\nprediction on the ogbl-ddi dataset with a 1.6% refinement with respect to the\nrespective baselines.\n","authors":["Antonio Purificato","Giulia Cassarà","Pietro Liò","Fabrizio Silvestri"],"pdf_url":"https://arxiv.org/pdf/2304.09097v2.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.16605v1","updated":"2023-10-25T12:50:34Z","published":"2023-10-25T12:50:34Z","title":"Distributionally Robust Unsupervised Dense Retrieval Training on Web\n Graphs","summary":" This paper introduces Web-DRO, an unsupervised dense retrieval model, which\nclusters documents based on web structures and reweights the groups during\ncontrastive training. 
Specifically, we first leverage web graph links and\ncontrastively train an embedding model for clustering anchor-document pairs.\nThen we use Group Distributional Robust Optimization to reweight different\nclusters of anchor-document pairs, which guides the model to assign more\nweights to the group with higher contrastive loss and pay more attention to the\nworst case during training. Our experiments on MS MARCO and BEIR show that our\nmodel, Web-DRO, significantly improves the retrieval effectiveness in\nunsupervised scenarios. A comparison of clustering techniques shows that\ntraining on the web graph combining URL information reaches optimal performance\non clustering. Further analysis confirms that group weights are stable and\nvalid, indicating consistent model preferences as well as effective\nup-weighting of valuable groups and down-weighting of uninformative ones. The\ncode of this paper can be obtained from https://github.com/OpenMatch/Web-DRO.\n","authors":["Peixuan Han","Zhenghao Liu","Zhiyuan Liu","Chenyan Xiong"],"pdf_url":"https://arxiv.org/pdf/2310.16605v1.pdf","comment":"9 pages, 5 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.16566v1","updated":"2023-10-25T11:43:29Z","published":"2023-10-25T11:43:29Z","title":"Model-enhanced Contrastive Reinforcement Learning for Sequential\n Recommendation","summary":" Reinforcement learning (RL) has been widely applied in recommendation systems\ndue to its potential in optimizing the long-term engagement of users. From the\nperspective of RL, recommendation can be formulated as a Markov decision\nprocess (MDP), where the recommendation system (agent) can interact with users\n(environment) and acquire feedback (reward signals). However, it is impractical\nto conduct online interactions with the concern on user experience and\nimplementation complexity, and we can only train RL recommenders with offline\ndatasets containing limited reward signals and state transitions.
Therefore,\nthe data sparsity issue of reward signals and state transitions is very severe,\nwhile it has long been overlooked by existing RL recommenders. Worse still, RL\nmethods learn through the trial-and-error mode, but negative feedback cannot be\nobtained in implicit feedback recommendation tasks, which aggravates the\noverestimation problem of offline RL recommenders. To address these challenges,\nwe propose a novel RL recommender named model-enhanced contrastive\nreinforcement learning (MCRL). On the one hand, we learn a value function to\nestimate the long-term engagement of users, together with a conservative value\nlearning mechanism to alleviate the overestimation problem. On the other hand,\nwe construct some positive and negative state-action pairs to model the reward\nfunction and state transition function with contrastive learning to exploit the\ninternal structure information of MDP. Experiments demonstrate that the\nproposed method significantly outperforms existing offline RL and\nself-supervised RL methods with different representative backbone networks on\ntwo real-world datasets.\n","authors":["Chengpeng Li","Zhengyi Yang","Jizhi Zhang","Jiancan Wu","Dingxian Wang","Xiangnan He","Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16566v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.16452v1","updated":"2023-10-25T08:14:49Z","published":"2023-10-25T08:14:49Z","title":"Faithful Path Language Modelling for Explainable Recommendation over\n Knowledge Graph","summary":" Path reasoning methods over knowledge graphs have gained popularity for their\npotential to improve transparency in recommender systems. However, the\nresulting models still rely on pre-trained knowledge graph embeddings, fail to\nfully exploit the interdependence between entities and relations in the KG for\nrecommendation, and may generate inaccurate explanations.
In this paper, we\nintroduce PEARLM, a novel approach that efficiently captures user behaviour and\nproduct-side knowledge through language modelling. With our approach, knowledge\ngraph embeddings are directly learned from paths over the KG by the language\nmodel, which also unifies entities and relations in the same optimisation\nspace. Constraints on the sequence decoding additionally guarantee path\nfaithfulness with respect to the KG. Experiments on two datasets show the\neffectiveness of our approach compared to state-of-the-art baselines. Source\ncode and datasets: AVAILABLE AFTER GETTING ACCEPTED.\n","authors":["Giacomo Balloccu","Ludovico Boratto","Christian Cancedda","Gianni Fenu","Mirko Marras"],"pdf_url":"https://arxiv.org/pdf/2310.16452v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15556v2","updated":"2023-10-25T07:50:52Z","published":"2023-10-24T06:56:38Z","title":"TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for\n Inference Cost Reduction","summary":" Since ChatGPT released its API for public use, the number of applications\nbuilt on top of commercial large language models (LLMs) increase exponentially.\nOne popular usage of such models is leveraging its in-context learning ability\nand generating responses given user queries leveraging knowledge obtained by\nretrieval augmentation. One problem of deploying commercial retrieval-augmented\nLLMs is the cost due to the additionally retrieved context that largely\nincreases the input token size of the LLMs. To mitigate this, we propose a\ntoken compression scheme that includes two methods: summarization compression\nand semantic compression. The first method applies a T5-based model that is\nfine-tuned by datasets generated using self-instruct containing samples with\nvarying lengths and reduce token size by doing summarization. The second method\nfurther compresses the token size by removing words with lower impact on the\nsemantic. 
In order to adequately evaluate the effectiveness of the proposed\nmethods, we propose and utilize a dataset called Food-Recommendation DB (FRDB)\nfocusing on food recommendation for women around pregnancy period or infants.\nOur summarization compression can reduce 65% of the retrieval token size with\nfurther 0.3% improvement on the accuracy; semantic compression provides a more\nflexible way to trade-off the token size with performance, for which we can\nreduce the token size by 20% with only 1.6% of accuracy drop.\n","authors":["Junyi Liu","Liangzhi Li","Tong Xiang","Bowen Wang","Yiming Qian"],"pdf_url":"https://arxiv.org/pdf/2310.15556v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.16409v1","updated":"2023-10-25T06:49:19Z","published":"2023-10-25T06:49:19Z","title":"Multiple Key-value Strategy in Recommendation Systems Incorporating\n Large Language Model","summary":" Recommendation system (RS) plays significant roles in matching users\ninformation needs for Internet applications, and it usually utilizes the\nvanilla neural network as the backbone to handle embedding details. Recently,\nthe large language model (LLM) has exhibited emergent abilities and achieved\ngreat breakthroughs both in the CV and NLP communities. Thus, it is logical to\nincorporate RS with LLM better, which has become an emerging research\ndirection. Although some existing works have made their contributions to this\nissue, they mainly consider the single key situation (e.g. historical\ninteractions), especially in sequential recommendation. The situation of\nmultiple key-value data is simply neglected. This significant scenario is\nmainstream in real practical applications, where the information of users (e.g.\nage, occupation, etc) and items (e.g. title, category, etc) has more than one\nkey. Therefore, we aim to implement sequential recommendations based on\nmultiple key-value data by incorporating RS with LLM. 
In particular, we\ninstruction-tune a prevalent open-source LLM (Llama 7B) in order to inject\ndomain knowledge of RS into the pre-trained LLM. Since we adopt multiple\nkey-value strategies, it is hard for the LLM to learn well across these keys. Thus\ngeneral shuffle and mask strategies, as an innovative manner of\ndata augmentation, are designed. To demonstrate the effectiveness of our approach,\nextensive experiments are conducted on the popular and suitable dataset\nMovieLens, which contains multiple key-value data. The experimental results\ndemonstrate that our approach can nicely and effectively address this\nchallenging issue.\n","authors":["Dui Wang","Xiangyu Hou","Xiaohui Yang","Bo Zhang","Renbing Chen","Daiyue Xue"],"pdf_url":"https://arxiv.org/pdf/2310.16409v1.pdf","comment":"Accepted by CIKM2023 workshop at GenRec'23"},{"id":"http://arxiv.org/abs/2310.07554v2","updated":"2023-10-25T04:35:09Z","published":"2023-10-11T14:59:53Z","title":"Retrieve Anything To Augment Large Language Models","summary":" Large language models (LLMs) face significant challenges stemming from their\ninherent limitations in knowledge, memory, alignment, and action. These\nchallenges cannot be addressed by LLMs alone, but should rely on assistance\nfrom the external world, such as knowledge base, memory store, demonstration\nexamples, and tools. Retrieval augmentation stands as a vital mechanism for\nbridging the gap between LLMs and the external assistance. However,\nconventional methods encounter two pressing issues. On the one hand, the\ngeneral-purpose retrievers are not properly optimized for the retrieval\naugmentation of LLMs. On the other hand, the task-specific retrievers lack the\nrequired versatility, hindering their performance across the diverse retrieval\naugmentation scenarios.\n In this work, we present a novel approach, the LLM-Embedder, which\ncomprehensively supports the diverse retrieval augmentation needs of LLMs with\none unified embedding model.
Training such a unified model is non-trivial, as\nvarious retrieval tasks aim to capture distinct semantic relationships, often\nsubject to mutual interference. To address this challenge, we systematically\noptimize our training methodology. This includes reward formulation based on\nLLMs' feedback, the stabilization of knowledge distillation, multi-task\nfine-tuning with explicit instructions, and homogeneous in-batch negative\nsampling. These optimization strategies contribute to the outstanding empirical\nperformance of the LLM-Embedder. Notably, it yields remarkable enhancements in\nretrieval augmentation for LLMs, surpassing both general-purpose and\ntask-specific retrievers in various evaluation scenarios. Our checkpoint and\nsource code are publicly available at\nhttps://github.com/FlagOpen/FlagEmbedding.\n","authors":["Peitian Zhang","Shitao Xiao","Zheng Liu","Zhicheng Dou","Jian-Yun Nie"],"pdf_url":"https://arxiv.org/pdf/2310.07554v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16303v1","updated":"2023-10-25T02:22:50Z","published":"2023-10-25T02:22:50Z","title":"URL-BERT: Training Webpage Representations via Social Media Engagements","summary":" Understanding and representing webpages is crucial to online social networks\nwhere users may share and engage with URLs. Common language model (LM) encoders\nsuch as BERT can be used to understand and represent the textual content of\nwebpages. However, these representations may not model thematic information of\nweb domains and URLs or accurately capture their appeal to social media users.\nIn this work, we introduce a new pre-training objective that can be used to\nadapt LMs to understand URLs and webpages. Our proposed framework consists of\ntwo steps: (1) scalable graph embeddings to learn shallow representations of\nURLs based on user engagement on social media and (2) a contrastive objective\nthat aligns LM representations with the aforementioned graph-based\nrepresentation. 
We apply our framework to the multilingual version of BERT to\nobtain the model URL-BERT. We experimentally demonstrate that our continued\npre-training approach improves webpage understanding on a variety of tasks and\nTwitter internal and external benchmarks.\n","authors":["Ayesha Qamar","Chetan Verma","Ahmed El-Kishky","Sumit Binnani","Sneha Mehta","Taylor Berg-Kirkpatrick"],"pdf_url":"https://arxiv.org/pdf/2310.16303v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17041v1","updated":"2023-10-25T22:42:30Z","published":"2023-10-25T22:42:30Z","title":"On Surgical Fine-tuning for Language Encoders","summary":" Fine-tuning all the layers of a pre-trained neural language encoder (either\nusing all the parameters or using parameter-efficient methods) is often the\nde-facto way of adapting it to a new task. We show evidence that for different\ndownstream language tasks, fine-tuning only a subset of layers is sufficient to\nobtain performance that is close to and often better than fine-tuning all the\nlayers in the language encoder. We propose an efficient metric based on the\ndiagonal of the Fisher information matrix (FIM score), to select the candidate\nlayers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE\ntasks and across distinct language encoders, that this metric can effectively\nselect layers leading to a strong downstream performance. Our work highlights\nthat task-specific information corresponding to a given downstream task is\noften localized within a few layers, and tuning only those is sufficient for\nstrong performance. 
Additionally, we demonstrate the robustness of the FIM\nscore to rank layers in a manner that remains constant during the optimization\nprocess.\n","authors":["Abhilasha Lodha","Gayatri Belapurkar","Saloni Chalkapurkar","Yuanming Tao","Reshmi Ghosh","Samyadeep Basu","Dmitrii Petrov","Soundararajan Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2310.17041v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2309.05035v2","updated":"2023-10-25T20:23:38Z","published":"2023-09-10T14:13:54Z","title":"Duplicate Question Retrieval and Confirmation Time Prediction in\n Software Communities","summary":" Community Question Answering (CQA) in different domains is growing at a large\nscale because of the availability of several platforms and huge shareable\ninformation among users. With the rapid growth of such online platforms, a\nmassive amount of archived data makes it difficult for moderators to retrieve\npossible duplicates for a new question and identify and confirm existing\nquestion pairs as duplicates at the right time. This problem is even more\ncritical in CQAs corresponding to large software systems like askubuntu where\nmoderators need to be experts to comprehend something as a duplicate. Note that\nthe prime challenge in such CQA platforms is that the moderators are themselves\nexperts and are therefore usually extremely busy with their time being\nextraordinarily expensive. To facilitate the task of the moderators, in this\nwork, we have tackled two significant issues for the askubuntu CQA platform:\n(1) retrieval of duplicate questions given a new question and (2) duplicate\nquestion confirmation time prediction. In the first task, we focus on\nretrieving duplicate questions from a question pool for a particular newly\nposted question. In the second task, we solve a regression problem to rank a\npair of questions that could potentially take a long time to get confirmed as\nduplicates. 
For duplicate question retrieval, we propose a Siamese neural\nnetwork based approach by exploiting both text and network-based features,\nwhich outperforms several state-of-the-art baseline techniques. Our method\noutperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate\nconfirmation time prediction, we have used both the standard machine learning\nmodels and neural network along with the text and graph-based features. We\nobtain Spearman's rank correlation of 0.20 and 0.213 (statistically\nsignificant) for text and graph based features respectively.\n","authors":["Rima Hazra","Debanjan Saha","Amruit Sahoo","Somnath Banerjee","Animesh Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2309.05035v2.pdf","comment":"Full paper accepted at ASONAM 2023: The 2023 IEEE/ACM International\n Conference on Advances in Social Networks Analysis and Mining"},{"id":"http://arxiv.org/abs/2310.16972v1","updated":"2023-10-25T20:14:39Z","published":"2023-10-25T20:14:39Z","title":"The Word2vec Graph Model for Author Attribution and Genre Detection in\n Literary Analysis","summary":" Analyzing the writing styles of authors and articles is a key to supporting\nvarious literary analyses such as author attribution and genre detection. Over\nthe years, rich sets of features that include stylometry, bag-of-words, n-grams\nhave been widely used to perform such analysis. However, the effectiveness of\nthese features largely depends on the linguistic aspects of a particular\nlanguage and datasets specific characteristics. Consequently, techniques based\non these feature sets cannot give desired results across domains. In this\npaper, we propose a novel Word2vec graph based modeling of a document that can\nrightly capture both context and style of the document. By using these Word2vec\ngraph based features, we perform classification to perform author attribution\nand genre detection tasks. 
Our detailed experimental study with a comprehensive\nset of literary writings shows the effectiveness of this method over\ntraditional feature based approaches. Our code and data are publicly available\nat https://cutt.ly/svLjSgk\n","authors":["Nafis Irtiza Tripto","Mohammed Eunus Ali"],"pdf_url":"https://arxiv.org/pdf/2310.16972v1.pdf","comment":"12 pages, 6 figures"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2307.13885v4","updated":"2023-10-25T17:59:56Z","published":"2023-07-26T01:10:29Z","title":"Efficient Estimation of Average-Case Robustness for Multi-Class\n Classification","summary":" Robustness in machine learning is commonly studied in the adversarial\nsetting, yet real-world noise (such as measurement noise) is random rather than\nadversarial. Model behavior under such noise is captured by average-case\nrobustness, i.e., the probability of obtaining consistent predictions in a\nlocal region around an input. However, the na\\\"ive approach to computing\naverage-case robustness based on Monte-Carlo sampling is statistically\ninefficient, especially for high-dimensional data, leading to prohibitive\ncomputational costs for large-scale applications. In this work, we develop the\nfirst analytical estimators to efficiently compute average-case robustness of\nmulti-class discriminative models. These estimators linearize models in the\nlocal region around an input and analytically compute the robustness of the\nresulting linear models. We show empirically that these estimators efficiently\ncompute the robustness of standard deep learning models and demonstrate these\nestimators' usefulness for various tasks involving robustness, such as\nmeasuring robustness bias and identifying dataset samples that are vulnerable\nto noise perturbation. 
In doing so, this work not only proposes a new framework\nfor robustness, but also makes its computation practical, enabling the use of\naverage-case robustness in downstream applications.\n","authors":["Tessa Han","Suraj Srinivas","Himabindu Lakkaraju"],"pdf_url":"https://arxiv.org/pdf/2307.13885v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.11959v4","updated":"2023-10-25T17:59:45Z","published":"2021-06-22T17:58:10Z","title":"Revisiting Deep Learning Models for Tabular Data","summary":" The existing literature on deep learning for tabular data proposes a wide\nrange of novel architectures and reports competitive results on various\ndatasets. However, the proposed models are usually not properly compared to\neach other and existing works often use different benchmarks and experiment\nprotocols. As a result, it is unclear for both researchers and practitioners\nwhat models perform best. Additionally, the field still lacks effective\nbaselines, that is, the easy-to-use models that provide competitive performance\nacross different problems.\n In this work, we perform an overview of the main families of DL architectures\nfor tabular data and raise the bar of baselines in tabular DL by identifying\ntwo simple and powerful deep architectures. The first one is a ResNet-like\narchitecture which turns out to be a strong baseline that is often missing in\nprior works. The second model is our simple adaptation of the Transformer\narchitecture for tabular data, which outperforms other solutions on most tasks.\nBoth models are compared to many existing architectures on a diverse set of\ntasks under the same training and tuning protocols. We also compare the best DL\nmodels with Gradient Boosted Decision Trees and conclude that there is still no\nuniversally superior solution.\n","authors":["Yury Gorishniy","Ivan Rubachev","Valentin Khrulkov","Artem Babenko"],"pdf_url":"https://arxiv.org/pdf/2106.11959v4.pdf","comment":"NeurIPS 2021 camera-ready. 
Code:\n https://github.com/yandex-research/tabular-dl-revisiting-models (v4: minor\n update)"},{"id":"http://arxiv.org/abs/2305.16985v2","updated":"2023-10-25T17:59:44Z","published":"2023-05-26T14:40:46Z","title":"Inverse Dynamics Pretraining Learns Good Representations for Multitask\n Imitation","summary":" In recent years, domains such as natural language processing and image\nrecognition have popularized the paradigm of using large datasets to pretrain\nrepresentations that can be effectively transferred to downstream tasks. In\nthis work we evaluate how such a paradigm should be done in imitation learning,\nwhere both pretraining and finetuning data are trajectories collected by\nexperts interacting with an unknown environment. Namely, we consider a setting\nwhere the pretraining corpus consists of multitask demonstrations and the task\nfor each demonstration is set by an unobserved latent context variable. The\ngoal is to use the pretraining corpus to learn a low dimensional representation\nof the high dimensional (e.g., visual) observation space which can be\ntransferred to a novel context for finetuning on a limited dataset of\ndemonstrations. Among a variety of possible pretraining objectives, we argue\nthat inverse dynamics modeling -- i.e., predicting an action given the\nobservations appearing before and after it in the demonstration -- is\nwell-suited to this setting. 
We provide empirical evidence of this claim\nthrough evaluations on a variety of simulated visuomotor manipulation problems.\nWhile previous work has attempted various theoretical explanations regarding\nthe benefit of inverse dynamics modeling, we find that these arguments are\ninsufficient to explain the empirical advantages often observed in our\nsettings, and so we derive a novel analysis using a simple but general\nenvironment model.\n","authors":["David Brandfonbrener","Ofir Nachum","Joan Bruna"],"pdf_url":"https://arxiv.org/pdf/2305.16985v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16837v1","updated":"2023-10-25T17:59:34Z","published":"2023-10-25T17:59:34Z","title":"RDBench: ML Benchmark for Relational Databases","summary":" Benefiting from high-quality datasets and standardized evaluation metrics,\nmachine learning (ML) has achieved sustained progress and widespread\napplications. However, while applying machine learning to relational databases\n(RDBs), the absence of a well-established benchmark remains a significant\nobstacle to the development of ML. To address this issue, we introduce ML\nBenchmark For Relational Databases (RDBench), a standardized benchmark that\naims to promote reproducible ML research on RDBs that include multiple tables.\nRDBench offers diverse RDB datasets of varying scales, domains, and relational\nstructures, organized into 4 levels. Notably, to simplify the adoption of\nRDBench for diverse ML domains, for any given database, RDBench exposes three\ntypes of interfaces including tabular data, homogeneous graphs, and\nheterogeneous graphs, sharing the same underlying task definition. For the\nfirst time, RDBench enables meaningful comparisons between ML methods from\ndiverse domains, ranging from XGBoost to Graph Neural Networks, under RDB\nprediction tasks. 
We design multiple classification and regression tasks for\neach RDB dataset and report averaged results over the same dataset, further\nenhancing the robustness of the experimental findings. RDBench is implemented\nwith DBGym, a user-friendly platform for ML research and application on\ndatabases, enabling benchmarking new ML methods with RDBench at ease.\n","authors":["Zizhao Zhang","Yi Yang","Lutong Zou","He Wen","Tao Feng","Jiaxuan You"],"pdf_url":"https://arxiv.org/pdf/2310.16837v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16835v1","updated":"2023-10-25T17:59:26Z","published":"2023-10-25T17:59:26Z","title":"Proposal-Contrastive Pretraining for Object Detection from Fewer Data","summary":" The use of pretrained deep neural networks represents an attractive way to\nachieve strong results with few data available. When specialized in dense\nproblems such as object detection, learning local rather than global\ninformation in images has proven to be more efficient. However, for\nunsupervised pretraining, the popular contrastive learning requires a large\nbatch size and, therefore, a lot of resources. To address this problem, we are\ninterested in transformer-based object detectors that have recently gained\ntraction in the community with good performance and with the particularity of\ngenerating many diverse object proposals.\n In this work, we present Proposal Selection Contrast (ProSeCo), a novel\nunsupervised overall pretraining approach that leverages this property. ProSeCo\nuses the large number of object proposals generated by the detector for\ncontrastive learning, which allows the use of a smaller batch size, combined\nwith object-level features to learn local information in the images. To improve\nthe effectiveness of the contrastive loss, we introduce the object location\ninformation in the selection of positive examples to take into account multiple\noverlapping object proposals. 
When reusing pretrained backbone, we advocate for\nconsistency in learning local information between the backbone and the\ndetection head.\n We show that our method outperforms state of the art in unsupervised\npretraining for object detection on standard and novel benchmarks in learning\nwith fewer data.\n","authors":["Quentin Bouniot","Romaric Audigier","Angélique Loesch","Amaury Habrard"],"pdf_url":"https://arxiv.org/pdf/2310.16835v1.pdf","comment":"Published as a conference paper at ICLR 2023"},{"id":"http://arxiv.org/abs/2310.16834v1","updated":"2023-10-25T17:59:12Z","published":"2023-10-25T17:59:12Z","title":"Discrete Diffusion Language Modeling by Estimating the Ratios of the\n Data Distribution","summary":" Despite their groundbreaking performance for many generative modeling tasks,\ndiffusion models have fallen short on discrete data domains such as natural\nlanguage. Crucially, standard diffusion models rely on the well-established\ntheory of score matching, but efforts to generalize this to discrete structures\nhave not yielded the same empirical gains. In this work, we bridge this gap by\nproposing score entropy, a novel discrete score matching loss that is more\nstable than existing methods, forms an ELBO for maximum likelihood training,\nand can be efficiently optimized with a denoising variant. We scale our Score\nEntropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2,\nachieving highly competitive likelihoods while also introducing distinct\nalgorithmic advantages. In particular, when comparing similarly sized SEDD and\nGPT-2 models, SEDD attains comparable perplexities (normally within $+10\\%$ of\nand sometimes outperforming the baseline). 
Furthermore, SEDD models learn a\nmore faithful sequence distribution (around $4\\times$ better compared to GPT-2\nmodels with ancestral sampling as measured by large models), can trade off\ncompute for generation quality (needing only $16\\times$ fewer network\nevaluations to match GPT-2), and enables arbitrary infilling beyond the\nstandard left to right prompting.\n","authors":["Aaron Lou","Chenlin Meng","Stefano Ermon"],"pdf_url":"https://arxiv.org/pdf/2310.16834v1.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2310.16831v1","updated":"2023-10-25T17:59:01Z","published":"2023-10-25T17:59:01Z","title":"PERF: Panoramic Neural Radiance Field from a Single Panorama","summary":" Neural Radiance Field (NeRF) has achieved substantial progress in novel view\nsynthesis given multi-view images. Recently, some works have attempted to train\na NeRF from a single image with 3D priors. They mainly focus on a limited field\nof view and there are few invisible occlusions, which greatly limits their\nscalability to real-world 360-degree panoramic scenarios with large-size\nocclusions. In this paper, we present PERF, a 360-degree novel view synthesis\nframework that trains a panoramic neural radiance field from a single panorama.\nNotably, PERF allows 3D roaming in a complex scene without expensive and\ntedious image collection. To achieve this goal, we propose a novel\ncollaborative RGBD inpainting method and a progressive inpainting-and-erasing\nmethod to lift up a 360-degree 2D scene to a 3D scene. Specifically, we first\npredict a panoramic depth map as initialization given a single panorama, and\nreconstruct visible 3D regions with volume rendering. Then we introduce a\ncollaborative RGBD inpainting approach into a NeRF for completing RGB images\nand depth maps from random views, which is derived from an RGB Stable Diffusion\nmodel and a monocular depth estimator. 
Finally, we introduce an\ninpainting-and-erasing strategy to avoid inconsistent geometry between a\nnewly-sampled view and reference views. The two components are integrated into\nthe learning of NeRFs in a unified optimization framework and achieve promising\nresults. Extensive experiments on Replica and a new dataset PERF-in-the-wild\ndemonstrate the superiority of our PERF over state-of-the-art methods. Our PERF\ncan be widely used for real-world applications, such as panorama-to-3D,\ntext-to-3D, and 3D scene stylization applications. Project page and code are\navailable at https://perf-project.github.io/.\n","authors":["Guangcong Wang","Peng Wang","Zhaoxi Chen","Wenping Wang","Chen Change Loy","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16831v1.pdf","comment":"Project page and code: https://perf-project.github.io/"},{"id":"http://arxiv.org/abs/2310.16828v1","updated":"2023-10-25T17:57:07Z","published":"2023-10-25T17:57:07Z","title":"TD-MPC2: Scalable, Robust World Models for Continuous Control","summary":" TD-MPC is a model-based reinforcement learning (RL) algorithm that performs\nlocal trajectory optimization in the latent space of a learned implicit\n(decoder-free) world model. In this work, we present TD-MPC2: a series of\nimprovements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves\nsignificantly over baselines across 104 online RL tasks spanning 4 diverse task\ndomains, achieving consistently strong results with a single set of\nhyperparameters. We further show that agent capabilities increase with model\nand data size, and successfully train a single 317M parameter agent to perform\n80 tasks across multiple task domains, embodiments, and action spaces. We\nconclude with an account of lessons, opportunities, and risks associated with\nlarge TD-MPC2 agents. 
Explore videos, models, data, code, and more at\nhttps://nicklashansen.github.io/td-mpc2\n","authors":["Nicklas Hansen","Hao Su","Xiaolong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16828v1.pdf","comment":"Explore videos, models, data, code, and more at\n https://nicklashansen.github.io/td-mpc2"},{"id":"http://arxiv.org/abs/2310.16826v1","updated":"2023-10-25T17:56:28Z","published":"2023-10-25T17:56:28Z","title":"Deep machine learning for meteor monitoring: advances with transfer\n learning and gradient-weighted class activation mapping","summary":" In recent decades, the use of optical detection systems for meteor studies\nhas increased dramatically, resulting in huge amounts of data being analyzed.\nAutomated meteor detection tools are essential for studying the continuous\nmeteoroid incoming flux, recovering fresh meteorites, and achieving a better\nunderstanding of our Solar System. Concerning meteor detection, distinguishing\nfalse positives between meteor and non-meteor images has traditionally been\nperformed by hand, which is significantly time-consuming. To address this\nissue, we developed a fully automated pipeline that uses Convolutional Neural\nNetworks (CNNs) to classify candidate meteor detections. Our new method is able\nto detect meteors even in images that contain static elements such as clouds,\nthe Moon, and buildings. To accurately locate the meteor within each frame, we\nemploy the Gradient-weighted Class Activation Mapping (Grad-CAM) technique.\nThis method facilitates the identification of the region of interest by\nmultiplying the activations from the last convolutional layer with the average\nof the gradients across the feature map of that layer. By combining these\nfindings with the activation map derived from the first convolutional layer, we\neffectively pinpoint the most probable pixel location of the meteor. 
We trained\nand evaluated our model on a large dataset collected by the Spanish Meteor\nNetwork (SPMN) and achieved a precision of 98\\%. The methodology presented\nhere has the potential to reduce the workload of meteor scientists and station\noperators and improve the accuracy of meteor tracking and classification.\n","authors":["Eloy Peña-Asensio","Josep M. Trigo-Rodríguez","Pau Grèbol-Tomàs","David Regordosa-Avellana","Albert Rimola"],"pdf_url":"https://arxiv.org/pdf/2310.16826v1.pdf","comment":"Accepted in Planetary and Space Science"},{"id":"http://arxiv.org/abs/2310.16819v1","updated":"2023-10-25T17:51:07Z","published":"2023-10-25T17:51:07Z","title":"CATE Lasso: Conditional Average Treatment Effect Estimation with\n High-Dimensional Linear Regression","summary":" In causal inference about two treatments, Conditional Average Treatment\nEffects (CATEs) play an important role as a quantity representing an\nindividualized causal effect, defined as a difference between the expected\noutcomes of the two treatments conditioned on covariates. This study assumes\ntwo linear regression models between a potential outcome and covariates of the\ntwo treatments and defines CATEs as a difference between the linear regression\nmodels. Then, we propose a method for consistently estimating CATEs even under\nhigh-dimensional and non-sparse parameters. In our study, we demonstrate that\ndesirable theoretical properties, such as consistency, remain attainable even\nwithout assuming sparsity explicitly if we adopt a weaker assumption called\nimplicit sparsity originating from the definition of CATEs. Under this assumption,\nwe suppose that parameters of linear models in potential outcomes can be\ndivided into treatment-specific and common parameters, where the\ntreatment-specific parameters take different values in each linear\nregression model, while the common parameters remain identical. 
Thus, in a\ndifference between two linear regression models, the common parameters\ndisappear, leaving only differences in the treatment-specific parameters.\nConsequently, the non-zero parameters in CATEs correspond to the differences in\nthe treatment-specific parameters. Leveraging this assumption, we develop a\nLasso regression method specialized for CATE estimation and show that the\nestimator is consistent. Finally, we confirm the soundness of the proposed\nmethod by simulation studies.\n","authors":["Masahiro Kato","Masaaki Imaizumi"],"pdf_url":"https://arxiv.org/pdf/2310.16819v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16804v1","updated":"2023-10-25T17:35:01Z","published":"2023-10-25T17:35:01Z","title":"Learning COVID-19 Regional Transmission Using Universal Differential\n Equations in a SIR model","summary":" Highly interconnected societies make it difficult to model the spread of infectious\ndiseases such as COVID-19. Single-region SIR models fail to account for\nincoming forces of infection and expanding them to a large number of\ninteracting regions involves many assumptions that do not hold in the real\nworld. We propose using Universal Differential Equations (UDEs) to capture the\ninfluence of neighboring regions and improve the model's predictions in a\ncombined SIR+UDE model. UDEs are differential equations totally or partially\ndefined by a deep neural network (DNN). We include an additive term to the SIR\nequations composed of a DNN that learns the incoming force of infection from\nthe other regions. The learning is performed using automatic differentiation\nand gradient descent to approach the change in the target system caused by the\nstate of the neighboring regions. We compared the proposed model using a\nsimulated COVID-19 outbreak against a single-region SIR and a fully data-driven\nmodel composed only of a DNN. 
The proposed UDE+SIR model generates predictions\nthat capture the outbreak dynamic more accurately, but a decay in performance\nis observed at the last stages of the outbreak. The single-area SIR and the\nfully data-driven approach do not capture the proper dynamics accurately. Once\nthe predictions were obtained, we employed the SINDy algorithm to substitute\nthe DNN with a regression, removing the black box element of the model with no\nconsiderable increase in the error levels.\n","authors":["Adrian Rojas-Campos","Lukas Stelz","Pascal Nieters"],"pdf_url":"https://arxiv.org/pdf/2310.16804v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2310.16803v1","updated":"2023-10-25T17:34:52Z","published":"2023-10-25T17:34:52Z","title":"Language Agnostic Code Embeddings","summary":" Recently, code language models have achieved notable advancements in\naddressing a diverse array of essential code comprehension and generation\ntasks. Yet, the field lacks a comprehensive deep dive and understanding of the\ncode embeddings of multilingual code models. In this paper, we present a\ncomprehensive study on multilingual code embeddings, focusing on the\ncross-lingual capabilities of these embeddings across different programming\nlanguages. Through probing experiments, we demonstrate that code embeddings\ncomprise two distinct components: one deeply tied to the nuances and syntax of\na specific language, and the other remaining agnostic to these details,\nprimarily focusing on semantics. 
Further, we show that when we isolate and\neliminate this language-specific component, we witness significant improvements\nin downstream code retrieval tasks, leading to an absolute increase of up to\n+17 in the Mean Reciprocal Rank (MRR).\n","authors":["Saiteja Utpala","Alex Gu","Pin Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16803v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16802v1","updated":"2023-10-25T17:32:23Z","published":"2023-10-25T17:32:23Z","title":"From Molecules to Materials: Pre-training Large Generalizable Models for\n Atomic Property Prediction","summary":" Foundation models have been transformational in machine learning fields such\nas natural language processing and computer vision. Similar success in atomic\nproperty prediction has been limited due to the challenges of training\neffective models across multiple chemical domains. To address this, we\nintroduce Joint Multi-domain Pre-training (JMP), a supervised pre-training\nstrategy that simultaneously trains on multiple datasets from different\nchemical domains, treating each dataset as a unique pre-training task within a\nmulti-task framework. Our combined training dataset consists of $\\sim$120M\nsystems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and\ngeneralization by fine-tuning over a diverse set of downstream tasks and\ndatasets including: QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP\ndemonstrates an average improvement of 59% over training from scratch, and\nmatches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the\npotential of pre-training strategies that utilize diverse data to advance\nproperty prediction across chemical domains, especially for low-data tasks.\n","authors":["Nima Shoghi","Adeesh Kolluru","John R. Kitchin","Zachary W. Ulissi","C. Lawrence Zitnick","Brandon M. 
Wood"],"pdf_url":"https://arxiv.org/pdf/2310.16802v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16795v1","updated":"2023-10-25T17:24:53Z","published":"2023-10-25T17:24:53Z","title":"QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models","summary":" Mixture-of-Experts (MoE) architectures offer a general solution to the high\ninference costs of large language models (LLMs) via sparse routing, bringing\nfaster and more accurate models, at the cost of massive parameter counts. For\nexample, the SwitchTransformer-c2048 model has 1.6 trillion parameters,\nrequiring 3.2TB of accelerator memory to run efficiently, which makes practical\ndeployment challenging and expensive. In this paper, we present a solution to\nthis memory problem, in form of a new compression and execution framework\ncalled QMoE. Specifically, QMoE consists of a scalable algorithm which\naccurately compresses trillion-parameter MoEs to less than 1 bit per parameter,\nin a custom format co-designed with bespoke GPU decoding kernels to facilitate\nefficient end-to-end compressed inference, with minor runtime overheads\nrelative to uncompressed execution. Concretely, QMoE can compress the 1.6\ntrillion parameter SwitchTransformer-c2048 model to less than 160GB (20x\ncompression, 0.8 bits per parameter) at only minor accuracy loss, in less than\na day on a single GPU. This enables, for the first time, the execution of a\ntrillion-parameter model on affordable commodity hardware, like a single server\nwith 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead\nrelative to ideal uncompressed inference. 
The source code and compressed models\nare available at github.com/IST-DASLab/qmoe.\n","authors":["Elias Frantar","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2310.16795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.02285v2","updated":"2023-10-25T17:24:18Z","published":"2023-02-05T03:01:28Z","title":"ReDi: Efficient Learning-Free Diffusion Inference via Trajectory\n Retrieval","summary":" Diffusion models show promising generation capability for a variety of data.\nDespite their high generation quality, the inference for diffusion models is\nstill time-consuming due to the numerous sampling iterations required. To\naccelerate the inference, we propose ReDi, a simple yet learning-free\nRetrieval-based Diffusion sampling framework. From a precomputed knowledge\nbase, ReDi retrieves a trajectory similar to the partially generated trajectory\nat an early stage of generation, skips a large portion of intermediate steps,\nand continues sampling from a later step in the retrieved trajectory. We\ntheoretically prove that the generation performance of ReDi is guaranteed. Our\nexperiments demonstrate that ReDi improves the model inference efficiency by 2x\nspeedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain\nimage generation such as image stylization.\n","authors":["Kexun Zhang","Xianjun Yang","William Yang Wang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2302.02285v2.pdf","comment":"ICML 2023"},{"id":"http://arxiv.org/abs/2310.16792v1","updated":"2023-10-25T17:24:01Z","published":"2023-10-25T17:24:01Z","title":"Learning Independent Program and Architecture Representations for\n Generalizable Performance Modeling","summary":" This paper proposes PerfVec, a novel deep learning-based performance modeling\nframework that learns high-dimensional, independent/orthogonal program and\nmicroarchitecture representations. 
Once learned, a program representation can\nbe used to predict its performance on any microarchitecture, and likewise, a\nmicroarchitecture representation can be applied in the performance prediction\nof any program. Additionally, PerfVec yields a foundation model that captures\nthe performance essence of instructions, which can be directly used by\ndevelopers in numerous performance modeling related tasks without incurring its\ntraining cost. The evaluation demonstrates that PerfVec is more general,\nefficient, and accurate than previous approaches.\n","authors":["Lingda Li","Thomas Flynn","Adolfy Hoisie"],"pdf_url":"https://arxiv.org/pdf/2310.16792v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16791v1","updated":"2023-10-25T17:23:57Z","published":"2023-10-25T17:23:57Z","title":"Covert Planning against Imperfect Observers","summary":" Covert planning refers to a class of constrained planning problems where an\nagent aims to accomplish a task with minimal information leaked to a passive\nobserver to avoid detection. However, existing methods of covert planning often\nconsider deterministic environments or do not exploit the observer's imperfect\ninformation. This paper studies how covert planning can leverage the coupling\nof stochastic dynamics and the observer's imperfect observation to achieve\noptimal task performance without being detected. Specifically, we employ a\nMarkov decision process to model the interaction between the agent and its\nstochastic environment, and a partial observation function to capture the\nleaked information to a passive observer. Assuming the observer employs\nhypothesis testing to detect if the observation deviates from a nominal policy,\nthe covert planning agent aims to maximize the total discounted reward while\nkeeping the probability of being detected as an adversary below a given\nthreshold. We prove that finite-memory policies are more powerful than\nMarkovian policies in covert planning. 
Then, we develop a primal-dual proximal\npolicy gradient method with a two-time-scale update to compute a (locally)\noptimal covert policy. We demonstrate the effectiveness of our methods using a\nstochastic gridworld example. Our experimental results illustrate that the\nproposed method computes a policy that maximizes the adversary's expected\nreward without violating the detection constraint, and empirically demonstrates\nhow the environmental noises can influence the performance of the covert\npolicies.\n","authors":["Haoxiang Ma","Chongyang Shi","Shuo Han","Michael R. Dorothy","Jie Fu"],"pdf_url":"https://arxiv.org/pdf/2310.16791v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16790v1","updated":"2023-10-25T17:23:37Z","published":"2023-10-25T17:23:37Z","title":"Improving a Named Entity Recognizer Trained on Noisy Data with a Few\n Clean Instances","summary":" To achieve state-of-the-art performance, one still needs to train NER models\non large-scale, high-quality annotated data, an asset that is both costly and\ntime-intensive to accumulate. In contrast, real-world applications often resort\nto massive low-quality labeled data through non-expert annotators via\ncrowdsourcing and external knowledge bases via distant supervision as a\ncost-effective alternative. However, these annotation methods result in noisy\nlabels, which in turn lead to a notable decline in performance. Hence, we\npropose to denoise the noisy NER data with guidance from a small set of clean\ninstances. Along with the main NER model we train a discriminator model and use\nits outputs to recalibrate the sample weights. 
The discriminator is capable of\ndetecting both span and category errors with different discriminative prompts.\nResults on public crowdsourcing and distant supervision datasets show that the\nproposed method can consistently improve performance with a small guidance set.\n","authors":["Zhendong Chu","Ruiyi Zhang","Tong Yu","Rajiv Jain","Vlad I Morariu","Jiuxiang Gu","Ani Nenkova"],"pdf_url":"https://arxiv.org/pdf/2310.16790v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2310.16789v1","updated":"2023-10-25T17:21:23Z","published":"2023-10-25T17:21:23Z","title":"Detecting Pretraining Data from Large Language Models","summary":" Although large language models (LLMs) are widely deployed, the data used to\ntrain them is rarely disclosed. Given the incredible scale of this data, up to\ntrillions of tokens, it is all but certain that it includes potentially\nproblematic text such as copyrighted materials, personally identifiable\ninformation, and test data for widely reported reference benchmarks. However,\nwe currently have no way to know which data of these types is included or in\nwhat proportions. In this paper, we study the pretraining data detection\nproblem: given a piece of text and black-box access to an LLM without knowing\nthe pretraining data, can we determine if the model was trained on the provided\ntext? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that\nuses data created before and after model training to support gold truth\ndetection. We also introduce a new detection method Min-K% Prob based on a\nsimple hypothesis: an unseen example is likely to contain a few outlier words\nwith low probabilities under the LLM, while a seen example is less likely to\nhave words with such low probabilities. 
Min-K% Prob can be applied without any\nknowledge about the pretraining corpus or any additional training, departing\nfrom previous detection methods that require training a reference model on data\nthat is similar to the pretraining data. Moreover, our experiments demonstrate\nthat Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous\nmethods. We apply Min-K% Prob to two real-world scenarios, copyrighted book\ndetection, and contaminated downstream example detection, and find it a\nconsistently effective solution.\n","authors":["Weijia Shi","Anirudh Ajith","Mengzhou Xia","Yangsibo Huang","Daogao Liu","Terra Blevins","Danqi Chen","Luke Zettlemoyer"],"pdf_url":"https://arxiv.org/pdf/2310.16789v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16788v1","updated":"2023-10-25T17:20:38Z","published":"2023-10-25T17:20:38Z","title":"The GOOSE Dataset for Perception in Unstructured Environments","summary":" The potential for deploying autonomous systems can be significantly increased\nby improving the perception and interpretation of the environment. However, the\ndevelopment of deep learning-based techniques for autonomous systems in\nunstructured outdoor environments poses challenges due to limited data\navailability for training and testing. To address this gap, we present the\nGerman Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset\nspecifically designed for unstructured outdoor environments. The GOOSE dataset\nincorporates 10 000 labeled pairs of images and point clouds, which are\nutilized to train a range of state-of-the-art segmentation models on both image\nand point cloud data. We open source the dataset, along with an ontology for\nunstructured terrain, as well as dataset standards and guidelines. This\ninitiative aims to establish a common framework, enabling the seamless\ninclusion of existing datasets and a fast way to enhance the perception\ncapabilities of various robots operating in unstructured environments. 
The\ndataset, pre-trained models for offroad perception, and additional\ndocumentation can be found at https://goose-dataset.de/.\n","authors":["Peter Mortimer","Raphael Hagmanns","Miguel Granero","Thorsten Luettel","Janko Petereit","Hans-Joachim Wuensche"],"pdf_url":"https://arxiv.org/pdf/2310.16788v1.pdf","comment":"Preprint; Submitted to IEEE for review"},{"id":"http://arxiv.org/abs/2310.16787v1","updated":"2023-10-25T17:20:26Z","published":"2023-10-25T17:20:26Z","title":"The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing\n & Attribution in AI","summary":" The race to train language models on vast, diverse, and inconsistently\ndocumented datasets has raised pressing concerns about the legal and ethical\nrisks for practitioners. To remedy these practices threatening data\ntransparency and understanding, we convene a multi-disciplinary effort between\nlegal and machine learning experts to systematically audit and trace 1800+ text\ndatasets. We develop tools and standards to trace the lineage of these\ndatasets, from their source, creators, series of license conditions,\nproperties, and subsequent use. Our landscape analysis highlights the sharp\ndivides in composition and focus of commercially open vs closed datasets, with\nclosed datasets monopolizing important categories: lower resource languages,\nmore creative tasks, richer topic variety, newer and more synthetic training\ndata. This points to a deepening divide in the types of data that are made\navailable under different license conditions, and heightened implications for\njurisdictional legal interpretations of copyright and fair use. We also observe\nfrequent miscategorization of licenses on widely used dataset hosting sites,\nwith license omission of 72%+ and error rates of 50%+. This points to a crisis\nin misattribution and informed use of the most popular datasets driving many\nrecent breakthroughs. 
As a contribution to ongoing improvements in dataset\ntransparency and responsible use, we release our entire audit, with an\ninteractive UI, the Data Provenance Explorer, which allows practitioners to\ntrace and filter on data provenance for the most popular open source finetuning\ndata collections: www.dataprovenance.org.\n","authors":["Shayne Longpre","Robert Mahari","Anthony Chen","Naana Obeng-Marnu","Damien Sileo","William Brannon","Niklas Muennighoff","Nathan Khazam","Jad Kabbara","Kartik Perisetla"," Xinyi"," Wu","Enrico Shippole","Kurt Bollacker","Tongshuang Wu","Luis Villa","Sandy Pentland","Deb Roy","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2310.16787v1.pdf","comment":"30 pages (18 main), 6 figures, 5 tables"},{"id":"http://arxiv.org/abs/2310.16786v1","updated":"2023-10-25T17:20:19Z","published":"2023-10-25T17:20:19Z","title":"The Simplest Inflationary Potentials","summary":" Inflation is a highly favoured theory for the early Universe. It is\ncompatible with current observations of the cosmic microwave background and\nlarge scale structure and is a driver in the quest to detect primordial\ngravitational waves. It is also, given the current quality of the data, highly\nunder-determined with a large number of candidate implementations. We use a new\nmethod in symbolic regression to generate all possible simple scalar field\npotentials for one of two possible basis sets of operators. Treating these as\nsingle-field, slow-roll inflationary models we then score them with an\ninformation-theoretic metric (\"minimum description length\") that quantifies\ntheir efficiency in compressing the information in the Planck data. We explore\ntwo possible priors on the parameter space of potentials, one related to the\nfunctions' structural complexity and one that uses a Katz back-off language\nmodel to prefer functions that may be theoretically motivated. 
This enables us\nto identify the inflaton potentials that optimally balance simplicity with\naccuracy at explaining the Planck data, which may subsequently find theoretical\nmotivation. Our exploratory study opens the door to extraction of fundamental\nphysics directly from data, and may be augmented with more refined theoretical\npriors in the quest for a complete understanding of the early Universe.\n","authors":["Tomás Sousa","Deaglan J. Bartlett","Harry Desmond","Pedro G. Ferreira"],"pdf_url":"https://arxiv.org/pdf/2310.16786v1.pdf","comment":"13+4 pages, 4 figures; submitted to Physical Review D"},{"id":"http://arxiv.org/abs/2310.16781v1","updated":"2023-10-25T17:15:55Z","published":"2023-10-25T17:15:55Z","title":"Kiki or Bouba? Sound Symbolism in Vision-and-Language Models","summary":" Although the mapping between sound and meaning in human language is assumed\nto be largely arbitrary, research in cognitive science has shown that there are\nnon-trivial correlations between particular sounds and meanings across\nlanguages and demographic groups, a phenomenon known as sound symbolism. Among\nthe many dimensions of meaning, sound symbolism is particularly salient and\nwell-demonstrated with regards to cross-modal associations between language and\nthe visual domain. In this work, we address the question of whether sound\nsymbolism is reflected in vision-and-language models such as CLIP and Stable\nDiffusion. Using zero-shot knowledge probing to investigate the inherent\nknowledge of these models, we find strong evidence that they do show this\npattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our\nwork provides a novel method for demonstrating sound symbolism and\nunderstanding its nature using computational tools. Our code will be made\npublicly available.\n","authors":["Morris Alper","Hadar Averbuch-Elor"],"pdf_url":"https://arxiv.org/pdf/2310.16781v1.pdf","comment":"Accepted to NeurIPS 2023 (spotlight). 
Project webpage:\n https://kiki-bouba.github.io/"},{"id":"http://arxiv.org/abs/2310.16779v1","updated":"2023-10-25T17:11:21Z","published":"2023-10-25T17:11:21Z","title":"Multi-scale Diffusion Denoised Smoothing","summary":" Along with recent diffusion models, randomized smoothing has become one of a\nfew tangible approaches that offers adversarial robustness to models at scale,\ne.g., those of large pre-trained models. Specifically, one can perform\nrandomized smoothing on any classifier via a simple \"denoise-and-classify\"\npipeline, so-called denoised smoothing, given that an accurate denoiser is\navailable, such as a diffusion model. In this paper, we investigate the\ntrade-off between accuracy and certified robustness of denoised smoothing: for\nexample, we ask which representation of the diffusion model would maximize\nthe certified robustness of denoised smoothing. We consider a new objective\nthat aims at collective robustness of smoothed classifiers across multiple noise\nlevels at a shared diffusion model, which also suggests a new way to compensate\nfor the cost of accuracy in randomized smoothing for its certified robustness. This\nobjective motivates us to fine-tune the diffusion model (a) to perform consistent\ndenoising whenever the original image is recoverable, but (b) to generate\nrather diverse outputs otherwise. 
Our experiments show that this fine-tuning\nscheme of diffusion models combined with the multi-scale smoothing makes\nstrong certified robustness possible at the highest noise level while maintaining\naccuracy closer to that of non-smoothed classifiers.\n","authors":["Jongheon Jeong","Jinwoo Shin"],"pdf_url":"https://arxiv.org/pdf/2310.16779v1.pdf","comment":"24 pages; NeurIPS 2023; Code is available at\n https://github.com/jh-jeong/smoothing-multiscale"},{"id":"http://arxiv.org/abs/2310.16777v1","updated":"2023-10-25T17:10:37Z","published":"2023-10-25T17:10:37Z","title":"MixerFlow for Image Modelling","summary":" Normalising flows are statistical models that transform a complex density\ninto a simpler density through the use of bijective transformations enabling\nboth density estimation and data generation from a single model. In the context\nof image modelling, the predominant choice has been the Glow-based\narchitecture, whereas alternative architectures remain largely unexplored in\nthe research community. In this work, we propose a novel architecture called\nMixerFlow, based on the MLP-Mixer architecture, further unifying the generative\nand discriminative modelling architectures. MixerFlow offers an effective\nmechanism for weight sharing for flow-based models. Our results demonstrate\nbetter density estimation on image datasets under a fixed computational budget\nand good scaling as the image resolution increases, making MixerFlow a powerful\nyet simple alternative to the Glow-based architectures. 
We also show that\nMixerFlow provides more informative embeddings than Glow-based architectures.\n","authors":["Eshant English","Matthias Kirchler","Christoph Lippert"],"pdf_url":"https://arxiv.org/pdf/2310.16777v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15016v3","updated":"2023-10-25T17:06:36Z","published":"2023-05-24T10:58:09Z","title":"Estimating Class Separability of Datasets Using Persistent Homology with\n Application to LLM Fine-Tuning","summary":" This paper proposes a method to estimate the class separability of an\nunlabeled text dataset by inspecting the topological characteristics of\nsentence-transformer embeddings of the text. Experiments conducted involve both\nbinary and multi-class cases, with balanced and imbalanced scenarios. The\nresults demonstrate a clear correlation and a better consistency between the\nproposed method and other separability and classification metrics, such as\nThornton's method and the AUC score of a logistic regression classifier, as\nwell as unsupervised methods. Finally, we empirically show that the proposed\nmethod can be part of a stopping criterion for fine-tuning language-model\nclassifiers. 
By monitoring the class separability of the embedding space after\neach training iteration, we can detect when the training process stops\nimproving the separability of the embeddings without using additional labels.\n","authors":["Najah Ghalyan","Kostis Gourgoulias","Yash Satsangi","Sean Moran","Maxime Labonne","Joseph Sabelja"],"pdf_url":"https://arxiv.org/pdf/2305.15016v3.pdf","comment":"Rewrite of the manuscript with more baselines, extended related works\n section, and discussion"},{"id":"http://arxiv.org/abs/2310.16764v1","updated":"2023-10-25T16:52:13Z","published":"2023-10-25T16:52:13Z","title":"ConvNets Match Vision Transformers at Scale","summary":" Many researchers believe that ConvNets perform well on small or moderately\nsized datasets, but are not competitive with Vision Transformers when given\naccess to datasets on the web-scale. We challenge this belief by evaluating a\nperformant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset\nof images often used for training foundation models. We consider pre-training\ncompute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a\nseries of networks of increasing depth and width from the NFNet model family.\nWe observe a log-log scaling law between held out loss and compute budget.\nAfter fine-tuning on ImageNet, NFNets match the reported performance of Vision\nTransformers with comparable compute budgets. Our strongest fine-tuned model\nachieves a Top-1 accuracy of 90.4%.\n","authors":["Samuel L. Smith","Andrew Brock","Leonard Berrada","Soham De"],"pdf_url":"https://arxiv.org/pdf/2310.16764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16763v1","updated":"2023-10-25T16:52:00Z","published":"2023-10-25T16:52:00Z","title":"SuperHF: Supervised Iterative Learning from Human Feedback","summary":" While large language models demonstrate remarkable capabilities, they often\npresent challenges in terms of safety, alignment with human values, and\nstability during training. 
Here, we focus on two prevalent methods used to\nalign these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning\nfrom Human Feedback (RLHF). SFT is simple and robust, powering a host of\nopen-source models, while RLHF is a more sophisticated method used in top-tier\nmodels like ChatGPT but also suffers from instability and susceptibility to\nreward hacking. We propose a novel approach, Supervised Iterative Learning from\nHuman Feedback (SuperHF), which seeks to leverage the strengths of both\nmethods. Our hypothesis is two-fold: that the reward model used in RLHF is\ncritical for efficient data use and model generalization and that the use of\nProximal Policy Optimization (PPO) in RLHF may not be necessary and could\ncontribute to instability issues. SuperHF replaces PPO with a simple supervised\nloss and a Kullback-Leibler (KL) divergence prior. It creates its own training\ndata by repeatedly sampling a batch of model outputs and filtering them through\nthe reward model in an online learning regime. We then break down the reward\noptimization problem into three components: robustly optimizing the training\nrewards themselves, preventing reward hacking (exploitation of the reward model\nthat degrades model performance) as measured by a novel METEOR similarity\nmetric, and maintaining good performance on downstream evaluations. 
Our\nexperimental results show SuperHF exceeds PPO-based RLHF on the training\nobjective, easily and favorably trades off high reward with low reward hacking,\nimproves downstream calibration, and performs the same on our GPT-4 based\nqualitative evaluation scheme all the while being significantly simpler to\nimplement, highlighting SuperHF's potential as a competitive language model\nalignment technique.\n","authors":["Gabriel Mukobi","Peter Chatain","Su Fong","Robert Windesheim","Gitta Kutyniok","Kush Bhatia","Silas Alberti"],"pdf_url":"https://arxiv.org/pdf/2310.16763v1.pdf","comment":"Accepted to the Socially Responsible Language Modelling Research\n (SoLaR) workshop at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.01439v2","updated":"2023-10-25T16:40:27Z","published":"2023-06-02T10:59:44Z","title":"Interpretable and Explainable Logical Policies via Neurally Guided\n Symbolic Abstraction","summary":" The limited priors required by neural networks make them the dominating\nchoice to encode and learn policies using reinforcement learning (RL). However,\nthey are also black-boxes, making it hard to understand the agent's behaviour,\nespecially when working on the image level. Therefore, neuro-symbolic RL aims\nat creating policies that are interpretable in the first place. Unfortunately,\ninterpretability is not explainability. To achieve both, we introduce Neurally\ngUided Differentiable loGic policiEs (NUDGE). NUDGE exploits trained neural\nnetwork-based agents to guide the search of candidate-weighted logic rules,\nthen uses differentiable logic to train the logic agents. 
Our experimental\nevaluation demonstrates that NUDGE agents can induce interpretable and\nexplainable policies while outperforming purely neural ones and showing good\nflexibility to environments of different initial states and problem sizes.\n","authors":["Quentin Delfosse","Hikaru Shindo","Devendra Dhami","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2306.01439v2.pdf","comment":"9 main pages + appendix (19 in total)"},{"id":"http://arxiv.org/abs/2310.16753v1","updated":"2023-10-25T16:39:00Z","published":"2023-10-25T16:39:00Z","title":"PROMINET: Prototype-based Multi-View Network for Interpretable Email\n Response Prediction","summary":" Email is a widely used tool for business communication, and email marketing\nhas emerged as a cost-effective strategy for enterprises. While previous\nstudies have examined factors affecting email marketing performance, limited\nresearch has focused on understanding email response behavior by considering\nemail content and metadata. This study proposes a Prototype-based Multi-view\nNetwork (PROMINET) that incorporates semantic and structural information from\nemail data. By utilizing prototype learning, the PROMINET model generates\nlatent exemplars, enabling interpretable email response prediction. The model\nmaps learned semantic and structural exemplars to observed samples in the\ntraining data at different levels of granularity, such as document, sentence,\nor phrase. The approach is evaluated on two real-world email datasets: the\nEnron corpus and an in-house Email Marketing corpus. Experimental results\ndemonstrate that the PROMINET model outperforms baseline models, achieving a\n~3% improvement in F1 score on both datasets. Additionally, the model provides\ninterpretability through prototypes at different granularity levels while\nmaintaining comparable performance to non-interpretable models. 
The learned\nprototypes also show potential for generating suggestions to enhance email text\nediting and improve the likelihood of effective email responses. This research\ncontributes to enhancing sender-receiver communication and customer engagement\nin email interactions.\n","authors":["Yuqing Wang","Prashanth Vijayaraghavan","Ehsan Degan"],"pdf_url":"https://arxiv.org/pdf/2310.16753v1.pdf","comment":"Accepted at EMNLP 2023 (industry)"},{"id":"http://arxiv.org/abs/2310.16752v1","updated":"2023-10-25T16:37:45Z","published":"2023-10-25T16:37:45Z","title":"Simple, Scalable and Effective Clustering via One-Dimensional\n Projections","summary":" Clustering is a fundamental problem in unsupervised machine learning with\nmany applications in data analysis. Popular clustering algorithms such as\nLloyd's algorithm and $k$-means++ can take $\\Omega(ndk)$ time when clustering\n$n$ points in a $d$-dimensional space (represented by an $n\\times d$ matrix\n$X$) into $k$ clusters. In applications with moderate to large $k$, the\nmultiplicative $k$ factor can become very expensive. We introduce a simple\nrandomized clustering algorithm that provably runs in expected time\n$O(\\mathrm{nnz}(X) + n\\log n)$ for arbitrary $k$. Here $\\mathrm{nnz}(X)$ is the\ntotal number of non-zero entries in the input dataset $X$, which is upper\nbounded by $nd$ and can be significantly smaller for sparse datasets. We prove\nthat our algorithm achieves approximation ratio $\\smash{\\widetilde{O}(k^4)}$ on\nany input dataset for the $k$-means objective. We also believe that our\ntheoretical analysis is of independent interest, as we show that the\napproximation ratio of a $k$-means algorithm is approximately preserved under a\nclass of projections and that $k$-means++ seeding can be implemented in\nexpected $O(n \\log n)$ time in one dimension. 
Finally, we show experimentally\nthat our clustering algorithm gives a new tradeoff between running time and\ncluster quality compared to previous state-of-the-art methods for these tasks.\n","authors":["Moses Charikar","Monika Henzinger","Lunjia Hu","Maxmilian Vötsch","Erik Waingarten"],"pdf_url":"https://arxiv.org/pdf/2310.16752v1.pdf","comment":"41 pages, 6 figures, to appear in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2307.03305v2","updated":"2023-10-25T16:35:19Z","published":"2023-07-06T21:38:13Z","title":"A Vulnerability of Attribution Methods Using Pre-Softmax Scores","summary":" We discuss a vulnerability involving a category of attribution methods used\nto provide explanations for the outputs of convolutional neural networks\nworking as classifiers. It is known that this type of networks are vulnerable\nto adversarial attacks, in which imperceptible perturbations of the input may\nalter the outputs of the model. In contrast, here we focus on effects that\nsmall modifications in the model may cause on the attribution method without\naltering the model outputs.\n","authors":["Miguel Lerma","Mirtha Lucas"],"pdf_url":"https://arxiv.org/pdf/2307.03305v2.pdf","comment":"7 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.16742v1","updated":"2023-10-25T16:17:47Z","published":"2023-10-25T16:17:47Z","title":"Interferometric Neural Networks","summary":" On the one hand, artificial neural networks have many successful applications\nin the field of machine learning and optimization. On the other hand,\ninterferometers are integral parts of any field that deals with waves such as\noptics, astronomy, and quantum physics. Here, we introduce neural networks\ncomposed of interferometers and then build generative adversarial networks from\nthem. Our networks do not have any classical layer and can be realized on\nquantum computers or photonic chips. We demonstrate their applicability for\ncombinatorial optimization, image classification, and image generation. 
For\ncombinatorial optimization, our network consistently converges to the global\noptimum or remains within a narrow range of it. In multi-class image\nclassification tasks, our networks achieve accuracies of 93% and 83%. Lastly,\nwe show their capability to generate images of digits from 0 to 9 as well as\nhuman faces.\n","authors":["Arun Sehrawat"],"pdf_url":"https://arxiv.org/pdf/2310.16742v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2310.16741v1","updated":"2023-10-25T16:17:00Z","published":"2023-10-25T16:17:00Z","title":"Stochastic Latent Transformer: Efficient Modelling of Stochastically\n Forced Zonal Jets","summary":" We introduce the 'Stochastic Latent Transformer', a probabilistic deep\nlearning approach for efficient reduced-order modelling of stochastic partial\ndifferential equations (SPDEs). Despite recent advances in deep learning for\nfluid mechanics, limited research has explored modelling stochastically driven\nflows - which play a crucial role in understanding a broad spectrum of\nphenomena, from jets on giant planets to ocean circulation and the variability\nof midlatitude weather. The model architecture consists of a\nstochastically-forced transformer, paired with a translation-equivariant\nautoencoder, that we demonstrate is capable of reproducing system dynamics\nacross various integration periods. We demonstrate its effectiveness applied to\na well-researched zonal jet system, with the neural network achieving a\nfive-order-of-magnitude speedup compared to numerical integration. This\nfacilitates the cost-effective generation of large ensembles, enabling the\nexploration of statistical questions concerning probabilities of spontaneous\ntransition events.\n","authors":["Ira J. S. Shokar","Rich R. Kerswell","Peter H. 
Haynes"],"pdf_url":"https://arxiv.org/pdf/2310.16741v1.pdf","comment":"23 pages, 9 figures"},{"id":"http://arxiv.org/abs/2302.08982v2","updated":"2023-10-25T16:09:19Z","published":"2023-02-17T16:37:08Z","title":"(S)GD over Diagonal Linear Networks: Implicit Regularisation, Large\n Stepsizes and Edge of Stability","summary":" In this paper, we investigate the impact of stochasticity and large stepsizes\non the implicit regularisation of gradient descent (GD) and stochastic gradient\ndescent (SGD) over diagonal linear networks. We prove the convergence of GD and\nSGD with macroscopic stepsizes in an overparametrised regression setting and\ncharacterise their solutions through an implicit regularisation problem. Our\ncrisp characterisation leads to qualitative insights about the impact of\nstochasticity and stepsizes on the recovered solution. Specifically, we show\nthat large stepsizes consistently benefit SGD for sparse regression problems,\nwhile they can hinder the recovery of sparse solutions for GD. These effects\nare magnified for stepsizes in a tight window just below the divergence\nthreshold, in the \"edge of stability\" regime. Our findings are supported by\nexperimental results.\n","authors":["Mathieu Even","Scott Pesme","Suriya Gunasekar","Nicolas Flammarion"],"pdf_url":"https://arxiv.org/pdf/2302.08982v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.00441v3","updated":"2023-10-25T16:08:52Z","published":"2023-02-01T13:39:07Z","title":"Scaling Laws for Hyperparameter Optimization","summary":" Hyperparameter optimization is an important subfield of machine learning that\nfocuses on tuning the hyperparameters of a chosen algorithm to achieve peak\nperformance. Recently, there has been a stream of methods that tackle the issue\nof hyperparameter optimization, however, most of the methods do not exploit the\ndominant power law nature of learning curves for Bayesian optimization. 
In this\nwork, we propose Deep Power Laws (DPL), an ensemble of neural network models\nconditioned to yield predictions that follow a power-law scaling pattern. Our\nmethod dynamically decides which configurations to pause and train\nincrementally by making use of gray-box evaluations. We compare our method\nagainst 7 state-of-the-art competitors on 3 benchmarks related to tabular,\nimage, and NLP datasets covering 59 diverse tasks. Our method achieves the best\nresults across all benchmarks by obtaining the best any-time results compared\nto all competitors.\n","authors":["Arlind Kadra","Maciej Janowski","Martin Wistuba","Josif Grabocka"],"pdf_url":"https://arxiv.org/pdf/2302.00441v3.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2304.00488v2","updated":"2023-10-25T16:02:09Z","published":"2023-04-02T08:53:43Z","title":"Saddle-to-Saddle Dynamics in Diagonal Linear Networks","summary":" In this paper we fully describe the trajectory of gradient flow over diagonal\nlinear networks in the limit of vanishing initialisation. We show that the\nlimiting flow successively jumps from a saddle of the training loss to another\nuntil reaching the minimum $\\ell_1$-norm solution. This saddle-to-saddle\ndynamics translates to an incremental learning process as each saddle\ncorresponds to the minimiser of the loss constrained to an active set outside\nof which the coordinates must be zero. We explicitly characterise the visited\nsaddles as well as the jumping times through a recursive algorithm reminiscent\nof the LARS algorithm used for computing the Lasso path. Our proof leverages a\nconvenient arc-length time-reparametrisation which enables to keep track of the\nheteroclinic transitions between the jumps. Our analysis requires negligible\nassumptions on the data, applies to both under and overparametrised settings\nand covers complex cases where there is no monotonicity of the number of active\ncoordinates. 
We provide numerical experiments to support our findings.\n","authors":["Scott Pesme","Nicolas Flammarion"],"pdf_url":"https://arxiv.org/pdf/2304.00488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12747v2","updated":"2023-10-25T15:59:19Z","published":"2023-06-22T09:01:58Z","title":"Don't be so Monotone: Relaxing Stochastic Line Search in\n Over-Parameterized Models","summary":" Recent works have shown that line search methods can speed up Stochastic\nGradient Descent (SGD) and Adam in modern over-parameterized settings. However,\nexisting line searches may take steps that are smaller than necessary since\nthey require a monotone decrease of the (mini-)batch objective function. We\nexplore nonmonotone line search methods to relax this condition and possibly\naccept larger step sizes. Despite the lack of a monotonic decrease, we prove\nthe same fast rates of convergence as in the monotone case. Our experiments\nshow that nonmonotone methods improve the speed of convergence and\ngeneralization properties of SGD/Adam even beyond the previous monotone line\nsearches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained\nby combining a nonmonotone line search with a Polyak initial step size.\nFurthermore, we develop a new resetting technique that in the majority of the\niterations reduces the amount of backtracks to zero while still maintaining a\nlarge initial step size. 
To the best of our knowledge, a first runtime\ncomparison shows that the epoch-wise advantage of line-search-based methods\ngets reflected in the overall computational time.\n","authors":["Leonardo Galli","Holger Rauhut","Mark Schmidt"],"pdf_url":"https://arxiv.org/pdf/2306.12747v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16730v1","updated":"2023-10-25T15:58:51Z","published":"2023-10-25T15:58:51Z","title":"MultiPrompter: Cooperative Prompt Optimization with Multi-Agent\n Reinforcement Learning","summary":" Recently, there has been an increasing interest in automated prompt\noptimization based on reinforcement learning (RL). This approach offers\nimportant advantages, such as generating interpretable prompts and being\ncompatible with black-box foundation models. However, the substantial prompt\nspace size poses challenges for RL-based methods, often leading to suboptimal\npolicy convergence. This paper introduces MultiPrompter, a new framework that\nviews prompt optimization as a cooperative game between prompters which take\nturns composing a prompt together. 
Our cooperative prompt optimization\neffectively reduces the problem size and helps prompters learn optimal prompts.\nWe test our method on the text-to-image task and show its ability to generate\nhigher-quality images than baselines.\n","authors":["Dong-Ki Kim","Sungryull Sohn","Lajanugen Logeswaran","Dongsub Shim","Honglak Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.15804v2","updated":"2023-10-25T15:57:02Z","published":"2023-07-28T20:52:22Z","title":"On Single Index Models beyond Gaussian Data","summary":" Sparse high-dimensional functions have arisen as a rich framework to study\nthe behavior of gradient-descent methods using shallow neural networks,\nshowcasing their ability to perform feature learning beyond linear models.\nAmongst those functions, the simplest are single-index models $f(x) = \\phi( x\n\\cdot \\theta^*)$, where the labels are generated by an arbitrary non-linear\nscalar link function $\\phi$ applied to an unknown one-dimensional projection\n$\\theta^*$ of the input data. By focusing on Gaussian data, several recent\nworks have built a remarkable picture, where the so-called information exponent\n(related to the regularity of the link function) controls the required sample\ncomplexity. In essence, these tools exploit the stability and spherical\nsymmetry of Gaussian distributions. In this work, building from the framework\nof \\cite{arous2020online}, we explore extensions of this picture beyond the\nGaussian setting, where both stability or symmetry might be violated. 
Focusing\non the planted setting where $\\phi$ is known, our main results establish that\nStochastic Gradient Descent can efficiently recover the unknown direction\n$\\theta^*$ in the high-dimensional regime, under assumptions that extend\nprevious works \\cite{yehudai2020learning,wu2022learning}.\n","authors":["Joan Bruna","Loucas Pillaud-Vivien","Aaron Zweig"],"pdf_url":"https://arxiv.org/pdf/2307.15804v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.01191v2","updated":"2023-10-25T15:56:24Z","published":"2022-08-02T01:23:50Z","title":"Implicit Two-Tower Policies","summary":" We present a new class of structured reinforcement learning\npolicy-architectures, Implicit Two-Tower (ITT) policies, where the actions are\nchosen based on the attention scores of their learnable latent representations\nwith those of the input states. By explicitly disentangling action from state\nprocessing in the policy stack, we achieve two main goals: substantial\ncomputational gains and better performance. Our architectures are compatible\nwith both: discrete and continuous action spaces. By conducting tests on 15\nenvironments from OpenAI Gym and DeepMind Control Suite, we show that\nITT-architectures are particularly suited for blackbox/evolutionary\noptimization and the corresponding policy training algorithms outperform their\nvanilla unstructured implicit counterparts as well as commonly used explicit\npolicies. 
We complement our analysis by showing how techniques such as hashing\nand lazy tower updates, critically relying on the two-tower structure of ITTs,\ncan be applied to obtain additional computational improvements.\n","authors":["Yunfan Zhao","Qingkai Pan","Krzysztof Choromanski","Deepali Jain","Vikas Sindhwani"],"pdf_url":"https://arxiv.org/pdf/2208.01191v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16727v1","updated":"2023-10-25T15:55:50Z","published":"2023-10-25T15:55:50Z","title":"AI Hazard Management: A framework for the systematic management of root\n causes for AI risks","summary":" Recent advancements in the field of Artificial Intelligence (AI) establish\nthe basis to address challenging tasks. However, with the integration of AI,\nnew risks arise. Therefore, to benefit from its advantages, it is essential to\nadequately handle the risks associated with AI. Existing risk management\nprocesses in related fields, such as software systems, need to sufficiently\nconsider the specifics of AI. A key challenge is to systematically and\ntransparently identify and address AI risks' root causes - also called AI\nhazards. This paper introduces the AI Hazard Management (AIHM) framework, which\nprovides a structured process to systematically identify, assess, and treat AI\nhazards. The proposed process is conducted in parallel with the development to\nensure that any AI hazard is captured at the earliest possible stage of the AI\nsystem's life cycle. In addition, to ensure the AI system's auditability, the\nproposed framework systematically documents evidence that the potential impact\nof identified AI hazards could be reduced to a tolerable level. The framework\nbuilds upon an AI hazard list from a comprehensive state-of-the-art analysis.\nAlso, we provide a taxonomy that supports the optimal treatment of the\nidentified AI hazards. 
Additionally, we illustrate how the AIHM framework can\nincrease the overall quality of a power grid AI use case by systematically\nreducing the impact of identified hazards to an acceptable level.\n","authors":["Ronald Schnitzer","Andreas Hapfelmeier","Sven Gaube","Sonja Zillner"],"pdf_url":"https://arxiv.org/pdf/2310.16727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.07235v2","updated":"2023-10-25T15:49:30Z","published":"2023-10-11T06:53:05Z","title":"Are GATs Out of Balance?","summary":" While the expressive power and computational capabilities of graph neural\nnetworks (GNNs) have been theoretically studied, their optimization and\nlearning dynamics, in general, remain largely unexplored. Our study undertakes\nthe Graph Attention Network (GAT), a popular GNN architecture in which a node's\nneighborhood aggregation is weighted by parameterized attention coefficients.\nWe derive a conservation law of GAT gradient flow dynamics, which explains why\na high portion of parameters in GATs with standard initialization struggle to\nchange during training. This effect is amplified in deeper GATs, which perform\nsignificantly worse than their shallow counterparts. To alleviate this problem,\nwe devise an initialization scheme that balances the GAT network. Our approach\ni) allows more effective propagation of gradients and in turn enables\ntrainability of deeper networks, and ii) attains a considerable speedup in\ntraining and convergence time in comparison to the standard initialization. Our\nmain theorem serves as a stepping stone to studying the learning dynamics of\npositive homogeneous models with attention mechanisms.\n","authors":["Nimrah Mustafa","Aleksandar Bojchevski","Rebekka Burkholz"],"pdf_url":"https://arxiv.org/pdf/2310.07235v2.pdf","comment":"25 pages. 
To be published in Advances in Neural Information\n Processing Systems (NeurIPS), 2023"},{"id":"http://arxiv.org/abs/2309.16928v2","updated":"2023-10-25T15:38:19Z","published":"2023-09-29T02:04:24Z","title":"Learning to Receive Help: Intervention-Aware Concept Embedding Models","summary":" Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures\nby constructing and explaining their predictions using a set of high-level\nconcepts. A special property of these models is that they permit concept\ninterventions, wherein users can correct mispredicted concepts and thus improve\nthe model's performance. Recent work, however, has shown that intervention\nefficacy can be highly dependent on the order in which concepts are intervened\non and on the model's architecture and training hyperparameters. We argue that\nthis is rooted in a CBM's lack of train-time incentives for the model to be\nappropriately receptive to concept interventions. To address this, we propose\nIntervention-aware Concept Embedding models (IntCEMs), a novel CBM-based\narchitecture and training paradigm that improves a model's receptiveness to\ntest-time interventions. Our model learns a concept intervention policy in an\nend-to-end fashion from where it can sample meaningful intervention\ntrajectories at train-time. This conditions IntCEMs to effectively select and\nreceive concept interventions when deployed at test-time. Our experiments show\nthat IntCEMs significantly outperform state-of-the-art concept-interpretable\nmodels when provided with test-time concept interventions, demonstrating the\neffectiveness of our approach.\n","authors":["Mateo Espinosa Zarlenga","Katherine M. 
Collins","Krishnamurthy Dvijotham","Adrian Weller","Zohreh Shams","Mateja Jamnik"],"pdf_url":"https://arxiv.org/pdf/2309.16928v2.pdf","comment":"Accepted as a spotlight at the Thirty-seventh Conference on Neural\n Information Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2305.05588v2","updated":"2023-10-25T15:30:59Z","published":"2023-05-09T16:20:48Z","title":"StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure","summary":" This work presents StrAE: a Structured Autoencoder framework that through\nstrict adherence to explicit structure, and use of a novel contrastive\nobjective over tree-structured representations, enables effective learning of\nmulti-level representations. Through comparison over different forms of\nstructure, we verify that our results are directly attributable to the\ninformativeness of the structure provided as input, and show that this is not\nthe case for existing tree models. We then further extend StrAE to allow the\nmodel to define its own compositions using a simple localised-merge algorithm.\nThis variant, called Self-StrAE, outperforms baselines that don't involve\nexplicit hierarchical compositions, and is comparable to models given\ninformative structure (e.g. constituency parses). Our experiments are conducted\nin a data-constrained (circa 10M tokens) setting to help tease apart the\ncontribution of the inductive bias to effective learning. However, we find that\nthis framework can be robust to scale, and when extended to a much larger\ndataset (circa 100M tokens), our 430 parameter model performs comparably to a\n6-layer RoBERTa many orders of magnitude larger in size. Our findings support\nthe utility of incorporating explicit composition as an inductive bias for\neffective representation learning.\n","authors":["Mattia Opper","Victor Prokhorov","N. 
Siddharth"],"pdf_url":"https://arxiv.org/pdf/2305.05588v2.pdf","comment":"EMNLP 2023 Main"},{"id":"http://arxiv.org/abs/2110.03427v3","updated":"2023-10-25T15:21:08Z","published":"2021-10-05T16:38:57Z","title":"Is Attention always needed? A Case Study on Language Identification from\n Speech","summary":" Language Identification (LID) is a crucial preliminary process in the field\nof Automatic Speech Recognition (ASR) that involves the identification of a\nspoken language from audio samples. Contemporary systems that can process\nspeech in multiple languages require users to expressly designate one or more\nlanguages prior to utilization. The LID task assumes a significant role in\nscenarios where ASR systems are unable to comprehend the spoken language in\nmultilingual settings, leading to unsuccessful speech recognition outcomes. The\npresent study introduces convolutional recurrent neural network (CRNN) based\nLID, designed to operate on the Mel-frequency Cepstral Coefficient (MFCC)\ncharacteristics of audio samples. Furthermore, we replicate certain\nstate-of-the-art methodologies, specifically the Convolutional Neural Network\n(CNN) and Attention-based Convolutional Recurrent Neural Network (CRNN with\nattention), and conduct a comparative analysis with our CRNN-based approach. We\nconducted comprehensive evaluations on thirteen distinct Indian languages and\nour model resulted in over 98\\% classification accuracy. The LID model exhibits\nhigh-performance levels ranging from 97% to 100% for languages that are\nlinguistically similar. 
The proposed LID model exhibits a high degree of\nextensibility to additional languages and demonstrates a strong resistance to\nnoise, achieving 91.2% accuracy in a noisy setting when applied to a European\nLanguage (EU) dataset.\n","authors":["Atanu Mandal","Santanu Pal","Indranil Dutta","Mahidas Bhattacharya","Sudip Kumar Naskar"],"pdf_url":"https://arxiv.org/pdf/2110.03427v3.pdf","comment":"Accepted for publication in Natural Language Engineering"},{"id":"http://arxiv.org/abs/2310.16705v1","updated":"2023-10-25T15:20:53Z","published":"2023-10-25T15:20:53Z","title":"Wasserstein Gradient Flow over Variational Parameter Space for\n Variational Inference","summary":" Variational inference (VI) can be cast as an optimization problem in which\nthe variational parameters are tuned to closely align a variational\ndistribution with the true posterior. The optimization task can be approached\nthrough vanilla gradient descent in black-box VI or natural-gradient descent in\nnatural-gradient VI. In this work, we reframe VI as the optimization of an\nobjective that concerns probability distributions defined over a\n\\textit{variational parameter space}. Subsequently, we propose Wasserstein\ngradient descent for tackling this optimization problem. Notably, the\noptimization techniques, namely black-box VI and natural-gradient VI, can be\nreinterpreted as specific instances of the proposed Wasserstein gradient\ndescent. To enhance the efficiency of optimization, we develop practical\nmethods for numerically solving the discrete gradient flows. 
We validate the\neffectiveness of the proposed methods through empirical experiments on a\nsynthetic dataset, supplemented by theoretical analyses.\n","authors":["Dai Hai Nguyen","Tetsuya Sakurai","Hiroshi Mamitsuka"],"pdf_url":"https://arxiv.org/pdf/2310.16705v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15141v2","updated":"2023-10-25T15:20:46Z","published":"2023-05-24T13:36:06Z","title":"From Tempered to Benign Overfitting in ReLU Neural Networks","summary":" Overparameterized neural networks (NNs) are observed to generalize well even\nwhen trained to perfectly fit noisy data. This phenomenon motivated a large\nbody of work on \"benign overfitting\", where interpolating predictors achieve\nnear-optimal performance. Recently, it was conjectured and empirically observed\nthat the behavior of NNs is often better described as \"tempered overfitting\",\nwhere the performance is non-optimal yet also non-trivial, and degrades as a\nfunction of the noise level. However, a theoretical justification of this claim\nfor non-linear NNs has been lacking so far. In this work, we provide several\nresults that aim at bridging these complementing views. We study a simple\nclassification setting with 2-layer ReLU NNs, and prove that under various\nassumptions, the type of overfitting transitions from tempered in the extreme\ncase of one-dimensional data, to benign in high dimensions. 
Thus, we show that\nthe input dimension has a crucial role on the type of overfitting in this\nsetting, which we also validate empirically for intermediate dimensions.\nOverall, our results shed light on the intricate connections between the\ndimension, sample size, architecture and training algorithm on the one hand,\nand the type of resulting overfitting on the other hand.\n","authors":["Guy Kornowski","Gilad Yehudai","Ohad Shamir"],"pdf_url":"https://arxiv.org/pdf/2305.15141v2.pdf","comment":"NeurIPS 2023 camera ready version"},{"id":"http://arxiv.org/abs/2310.11518v2","updated":"2023-10-25T15:20:24Z","published":"2023-10-17T18:33:21Z","title":"Guarantees for Self-Play in Multiplayer Games via Polymatrix\n Decomposability","summary":" Self-play is a technique for machine learning in multi-agent systems where a\nlearning algorithm learns by interacting with copies of itself. Self-play is\nuseful for generating large quantities of data for learning, but has the\ndrawback that the agents the learner will face post-training may have\ndramatically different behavior than the learner came to expect by interacting\nwith itself. For the special case of two-player constant-sum games, self-play\nthat reaches Nash equilibrium is guaranteed to produce strategies that perform\nwell against any post-training opponent; however, no such guarantee exists for\nmultiplayer games. We show that in games that approximately decompose into a\nset of two-player constant-sum games (called constant-sum polymatrix games)\nwhere global $\\epsilon$-Nash equilibria are boundedly far from Nash equilibria\nin each subgame (called subgame stability), any no-external-regret algorithm\nthat learns by self-play will produce a strategy with bounded vulnerability.\nFor the first time, our results identify a structural property of multiplayer\ngames that enable performance guarantees for the strategies produced by a broad\nclass of self-play algorithms. 
We demonstrate our findings through experiments\non Leduc poker.\n","authors":["Revan MacQueen","James R. Wright"],"pdf_url":"https://arxiv.org/pdf/2310.11518v2.pdf","comment":"To appear at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.05515v2","updated":"2023-10-25T15:15:28Z","published":"2023-06-08T19:12:42Z","title":"PeFLL: Personalized Federated Learning by Learning to Learn","summary":" We present PeFLL, a new personalized federated learning algorithm that\nimproves over the state-of-the-art in three aspects: 1) it produces more\naccurate models, especially in the low-data regime, and not only for clients\npresent during its training phase, but also for any that may emerge in the\nfuture; 2) it reduces the amount of on-client computation and client-server\ncommunication by providing future clients with ready-to-use personalized models\nthat require no additional finetuning or optimization; 3) it comes with\ntheoretical guarantees that establish generalization from the observed clients\nto future ones. At the core of PeFLL lies a learning-to-learn approach that\njointly trains an embedding network and a hypernetwork. The embedding network\nis used to represent clients in a latent descriptor space in a way that\nreflects their similarity to each other. The hypernetwork takes as input such\ndescriptors and outputs the parameters of fully personalized client models. In\ncombination, both networks constitute a learning algorithm that achieves\nstate-of-the-art performance in several personalized federated learning\nbenchmarks.\n","authors":["Jonathan Scott","Hossein Zakerinia","Christoph H. Lampert"],"pdf_url":"https://arxiv.org/pdf/2306.05515v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.08677v2","updated":"2023-10-25T15:13:57Z","published":"2022-10-17T00:44:43Z","title":"Regularized Data Programming with Automated Bayesian Prior Selection","summary":" The cost of manual data labeling can be a significant obstacle in supervised\nlearning. 
Data programming (DP) offers a weakly supervised solution for\ntraining dataset creation, wherein the outputs of user-defined programmatic\nlabeling functions (LFs) are reconciled through unsupervised learning. However,\nDP can fail to outperform an unweighted majority vote in some scenarios,\nincluding low-data contexts. This work introduces a Bayesian extension of\nclassical DP that mitigates failures of unsupervised learning by augmenting the\nDP objective with regularization terms. Regularized learning is achieved\nthrough maximum a posteriori estimation with informative priors. Majority vote\nis proposed as a proxy signal for automated prior parameter selection. Results\nsuggest that regularized DP improves performance relative to maximum likelihood\nand majority voting, confers greater interpretability, and bolsters performance\nin low-data regimes.\n","authors":["Jacqueline R. M. A. Maasch","Hao Zhang","Qian Yang","Fei Wang","Volodymyr Kuleshov"],"pdf_url":"https://arxiv.org/pdf/2210.08677v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15952v2","updated":"2023-10-25T15:11:57Z","published":"2023-10-24T15:53:07Z","title":"Improving Robustness and Reliability in Medical Image Classification\n with Latent-Guided Diffusion and Nested-Ensembles","summary":" While deep learning models have achieved remarkable success across a range of\nmedical image analysis tasks, deployment of these models in real clinical\ncontexts requires that they be robust to variability in the acquired images.\nWhile many methods apply predefined transformations to augment the training\ndata to enhance test-time robustness, these transformations may not ensure the\nmodel's robustness to the diverse variability seen in patient images. 
In this\npaper, we introduce a novel three-stage approach based on transformers coupled\nwith conditional diffusion models, with the goal of improving model robustness\nto the kinds of imaging variability commonly encountered in practice without\nthe need for pre-determined data augmentation strategies. To this end, multiple\nimage encoders first learn hierarchical feature representations to build\ndiscriminative latent spaces. Next, a reverse diffusion process, guided by the\nlatent code, acts on an informative prior and proposes prediction candidates in\na generative manner. Finally, several prediction candidates are aggregated in a\nbi-level aggregation protocol to produce the final output. Through extensive\nexperiments on medical imaging benchmark datasets, we show that our method\nimproves upon state-of-the-art methods in terms of robustness and confidence\ncalibration. Additionally, we introduce a strategy to quantify the prediction\nuncertainty at the instance level, increasing their trustworthiness to\nclinicians using them in clinical practice.\n","authors":["Xing Shen","Hengguan Huang","Brennan Nichyporuk","Tal Arbel"],"pdf_url":"https://arxiv.org/pdf/2310.15952v2.pdf","comment":"13 pages, 6 figures, 7 tables"},{"id":"http://arxiv.org/abs/2310.16696v1","updated":"2023-10-25T15:06:57Z","published":"2023-10-25T15:06:57Z","title":"Interpretable time series neural representation for classification\n purposes","summary":" Deep learning has made significant advances in creating efficient\nrepresentations of time series data by automatically identifying complex\npatterns. However, these approaches lack interpretability, as the time series\nis transformed into a latent vector that is not easily interpretable. On the\nother hand, Symbolic Aggregate approximation (SAX) methods allow the creation\nof symbolic representations that can be interpreted but do not capture complex\npatterns effectively. 
In this work, we propose a set of requirements for a\nneural representation of univariate time series to be interpretable. We propose\na new unsupervised neural architecture that meets these requirements. The\nproposed model produces consistent, discrete, interpretable, and visualizable\nrepresentations. The model is learned independently of any downstream tasks in\nan unsupervised setting to ensure robustness. As a demonstration of the\neffectiveness of the proposed model, we propose experiments on classification\ntasks using UCR archive datasets. The obtained results are extensively compared\nto other interpretable models and state-of-the-art neural representation\nlearning models. The experiments show that the proposed model yields, on\naverage, better results than other interpretable approaches on multiple\ndatasets. We also present qualitative experiments to assess the interpretability\nof the approach.\n","authors":["Etienne Le Naour","Ghislain Agoua","Nicolas Baskiotis","Vincent Guigue"],"pdf_url":"https://arxiv.org/pdf/2310.16696v1.pdf","comment":"International Conference on Data Science and Advanced Analytics\n (DSAA) 2023"},{"id":"http://arxiv.org/abs/2310.16695v1","updated":"2023-10-25T15:06:32Z","published":"2023-10-25T15:06:32Z","title":"From Pointwise to Powerhouse: Initialising Neural Networks with\n Generative Models","summary":" Traditional initialisation methods, e.g. He and Xavier, have been effective\nin avoiding the problem of vanishing or exploding gradients in neural networks.\nHowever, they only use simple pointwise distributions, which model\none-dimensional variables. Moreover, they ignore most information about the\narchitecture and disregard past training experiences. These limitations can be\novercome by employing generative models for initialisation. In this paper, we\nintroduce two groups of new initialisation methods. First, we locally\ninitialise weight groups by employing variational autoencoders. 
Secondly, we\nglobally initialise full weight sets by employing graph hypernetworks. We\nthoroughly evaluate the impact of the employed generative models on\nstate-of-the-art neural networks in terms of accuracy, convergence speed and\nensembling. Our results show that global initialisations result in higher\naccuracy and faster initial convergence speed. However, the implementation\nthrough graph hypernetworks leads to diminished ensemble performance on out of\ndistribution data. To counteract, we propose a modification called noise graph\nhypernetwork, which encourages diversity in the produced ensemble members.\nFurthermore, our approach might be able to transfer learned knowledge to\ndifferent image distributions. Our work provides insights into the potential,\nthe trade-offs and possible modifications of these new initialisation methods.\n","authors":["Christian Harder","Moritz Fuchs","Yuri Tolkach","Anirban Mukhopadhyay"],"pdf_url":"https://arxiv.org/pdf/2310.16695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18258v2","updated":"2023-10-25T14:59:32Z","published":"2023-05-29T17:25:26Z","title":"Maximize to Explore: One Objective Function Fusing Estimation, Planning,\n and Exploration","summary":" In online reinforcement learning (online RL), balancing exploration and\nexploitation is crucial for finding an optimal policy in a sample-efficient\nway. To achieve this, existing sample-efficient online RL algorithms typically\nconsist of three components: estimation, planning, and exploration. However, in\norder to cope with general function approximators, most of them involve\nimpractical algorithmic components to incentivize exploration, such as\noptimization within data-dependent level-sets or complicated sampling\nprocedures. 
To address this challenge, we propose an easy-to-implement RL\nframework called \\textit{Maximize to Explore} (\\texttt{MEX}), which only needs\nto optimize \\emph{unconstrainedly} a single objective that integrates the\nestimation and planning components while balancing exploration and exploitation\nautomatically. Theoretically, we prove that \\texttt{MEX} achieves a sublinear\nregret with general function approximations for Markov decision processes (MDP)\nand is further extendable to two-player zero-sum Markov games (MG). Meanwhile,\nwe adapt deep RL baselines to design practical versions of \\texttt{MEX}, in\nboth model-free and model-based manners, which can outperform baselines by a\nstable margin in various MuJoCo environments with sparse rewards. Compared with\nexisting sample-efficient online RL algorithms with general function\napproximations, \\texttt{MEX} achieves similar sample efficiency while enjoying\na lower computational cost and is more compatible with modern deep RL methods.\n","authors":["Zhihan Liu","Miao Lu","Wei Xiong","Han Zhong","Hao Hu","Shenao Zhang","Sirui Zheng","Zhuoran Yang","Zhaoran Wang"],"pdf_url":"https://arxiv.org/pdf/2305.18258v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16688v1","updated":"2023-10-25T14:50:15Z","published":"2023-10-25T14:50:15Z","title":"Learning-based adaption of robotic friction models","summary":" In the Fourth Industrial Revolution, wherein artificial intelligence and the\nautomation of machines occupy a central role, the deployment of robots is\nindispensable. However, the manufacturing process using robots, especially in\ncollaboration with humans, is highly intricate. In particular, modeling the\nfriction torque in robotic joints is a longstanding problem due to the lack of\na good mathematical description. This motivates the usage of data-driven\nmethods in recent works. 
However, model-based and data-driven models often\nexhibit limitations in their ability to generalize beyond the specific dynamics\nthey were trained on, as we demonstrate in this paper. To address this\nchallenge, we introduce a novel approach based on residual learning, which aims\nto adapt an existing friction model to new dynamics using as little data as\npossible. We validate our approach by training a base neural network on a\nsymmetric friction data set to learn an accurate relation between the velocity\nand the friction torque. Subsequently, to adapt to more complex asymmetric\nsettings, we train a second network on a small dataset, focusing on predicting\nthe residual of the initial network's output. By combining the output of both\nnetworks in a suitable manner, our proposed estimator outperforms the\nconventional model-based approach and the base neural network significantly.\nFurthermore, we evaluate our method on trajectories involving external loads\nand still observe a substantial improvement, approximately 60-70\\%, over the\nconventional approach. Our method does not rely on data with external load\nduring training, eliminating the need for external torque sensors. 
This\ndemonstrates the generalization capability of our approach, even with a small\namount of data (only 43 seconds of robot movement), enabling adaptation to\ndiverse scenarios based on prior knowledge about friction in different\nsettings.\n","authors":["Philipp Scholl","Maged Iskandar","Sebastian Wolf","Jinoh Lee","Aras Bacho","Alexander Dietrich","Alin Albu-Schäffer","Gitta Kutyniok"],"pdf_url":"https://arxiv.org/pdf/2310.16688v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16686v1","updated":"2023-10-25T14:50:05Z","published":"2023-10-25T14:50:05Z","title":"Dynamics Generalisation in Reinforcement Learning via Adaptive\n Context-Aware Policies","summary":" While reinforcement learning has achieved remarkable successes in several\ndomains, its real-world application is limited due to many methods failing to\ngeneralise to unfamiliar conditions. In this work, we consider the problem of\ngeneralising to new transition dynamics, corresponding to cases in which the\nenvironment's response to the agent's actions differs. For example, the\ngravitational force exerted on a robot depends on its mass and changes the\nrobot's mobility. Consequently, in such cases, it is necessary to condition an\nagent's actions on extrinsic state information and pertinent contextual\ninformation reflecting how the environment responds. While the need for\ncontext-sensitive policies has been established, the manner in which context is\nincorporated architecturally has received less attention. Thus, in this work,\nwe present an investigation into how context information should be incorporated\ninto behaviour learning to improve generalisation. To this end, we introduce a\nneural network architecture, the Decision Adapter, which generates the weights\nof an adapter module and conditions the behaviour of an agent on the context\ninformation. 
We show that the Decision Adapter is a useful generalisation of a\npreviously proposed architecture and empirically demonstrate that it results in\nsuperior generalisation performance compared to previous approaches in several\nenvironments. Beyond this, the Decision Adapter is more robust to irrelevant\ndistractor variables than several alternative methods.\n","authors":["Michael Beukman","Devon Jarvis","Richard Klein","Steven James","Benjamin Rosman"],"pdf_url":"https://arxiv.org/pdf/2310.16686v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.14095v2","updated":"2023-10-25T14:49:23Z","published":"2023-05-23T14:18:11Z","title":"S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist\n Captions","summary":" Vision-language models, such as contrastive language-image pre-training\n(CLIP), have demonstrated impressive results in natural image domains. However,\nthese models often struggle when applied to specialized domains like remote\nsensing, and adapting to such domains is challenging due to the limited number\nof image-text pairs available for training. To address this, we propose S-CLIP,\na semi-supervised learning method for training CLIP that utilizes additional\nunpaired images. S-CLIP employs two pseudo-labeling strategies specifically\ndesigned for contrastive learning and the language modality. The caption-level\npseudo-label is given by a combination of captions of paired images, obtained\nby solving an optimal transport problem between unpaired and paired images. The\nkeyword-level pseudo-label is given by a keyword in the caption of the nearest\npaired image, trained through partial label learning that assumes a candidate\nset of labels for supervision instead of the exact one. By combining these\nobjectives, S-CLIP significantly enhances the training of CLIP using only a few\nimage-text pairs, as demonstrated in various specialist domains, including\nremote sensing, fashion, scientific figures, and comics. 
For instance, S-CLIP\nimproves CLIP by 10% for zero-shot classification and 4% for image-text\nretrieval on the remote sensing benchmark, matching the performance of\nsupervised CLIP while using three times fewer image-text pairs.\n","authors":["Sangwoo Mo","Minkyu Kim","Kyungmin Lee","Jinwoo Shin"],"pdf_url":"https://arxiv.org/pdf/2305.14095v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16678v1","updated":"2023-10-25T14:43:03Z","published":"2023-10-25T14:43:03Z","title":"Robust and Actively Secure Serverless Collaborative Learning","summary":" Collaborative machine learning (ML) is widely used to enable institutions to\nlearn better models from distributed data. While collaborative approaches to\nlearning intuitively protect user data, they remain vulnerable to either the\nserver, the clients, or both, deviating from the protocol. Indeed, because the\nprotocol is asymmetric, a malicious server can abuse its power to reconstruct\nclient data points. Conversely, malicious clients can corrupt learning with\nmalicious updates. Thus, both clients and servers require a guarantee when the\nother cannot be trusted to fully cooperate. In this work, we propose a\npeer-to-peer (P2P) learning scheme that is secure against malicious servers and\nrobust to malicious clients. Our core contribution is a generic framework that\ntransforms any (compatible) algorithm for robust aggregation of model updates\nto the setting where servers and clients can act maliciously. Finally, we\ndemonstrate the computational efficiency of our approach even with 1-million\nparameter models trained by 100s of peers on standard datasets.\n","authors":["Olive Franzese","Adam Dziedzic","Christopher A. Choquette-Choo","Mark R. 
Thomas","Muhammad Ahmad Kaleem","Stephan Rabanser","Congyu Fang","Somesh Jha","Nicolas Papernot","Xiao Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16678v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15578v2","updated":"2023-10-25T14:30:39Z","published":"2023-10-24T07:42:04Z","title":"VMAF Re-implementation on PyTorch: Some Experimental Results","summary":" Based on the standard VMAF implementation we propose an implementation of\nVMAF using PyTorch framework. For this implementation comparisons with the\nstandard (libvmaf) show the discrepancy $\\lesssim 10^{-2}$ in VMAF units. We\ninvestigate gradients computation when using VMAF as an objective function and\ndemonstrate that training using this function does not result in ill-behaving\ngradients.\n","authors":["Kirill Aistov","Maxim Koroteev"],"pdf_url":"https://arxiv.org/pdf/2310.15578v2.pdf","comment":"4 pages"},{"id":"http://arxiv.org/abs/2202.13490v4","updated":"2023-10-25T14:29:27Z","published":"2022-02-28T00:20:12Z","title":"Limitations of Deep Learning for Inverse Problems on Digital Hardware","summary":" Deep neural networks have seen tremendous success over the last years. Since\nthe training is performed on digital hardware, in this paper, we analyze what\nactually can be computed on current hardware platforms modeled as Turing\nmachines, which would lead to inherent restrictions of deep learning. For this,\nwe focus on the class of inverse problems, which, in particular, encompasses\nany task to reconstruct data from measurements. We prove that\nfinite-dimensional inverse problems are not Banach-Mazur computable for small\nrelaxation parameters. 
Even more, our results introduce a lower bound on the\naccuracy that can be obtained algorithmically.\n","authors":["Holger Boche","Adalbert Fono","Gitta Kutyniok"],"pdf_url":"https://arxiv.org/pdf/2202.13490v4.pdf","comment":"To be published in IEEE Transactions on Information Theory"},{"id":"http://arxiv.org/abs/2310.16659v1","updated":"2023-10-25T14:21:22Z","published":"2023-10-25T14:21:22Z","title":"UAV Pathfinding in Dynamic Obstacle Avoidance with Multi-agent\n Reinforcement Learning","summary":" Multi-agent reinforcement learning based methods are significant for online\nplanning of feasible and safe paths for agents in dynamic and uncertain\nscenarios. Although some methods like fully centralized and fully decentralized\nmethods achieve a certain measure of success, they also encounter problems such\nas dimension explosion and poor convergence, respectively. In this paper, we\npropose a novel centralized training with decentralized execution method based\non multi-agent reinforcement learning to solve the dynamic obstacle avoidance\nproblem online. In this approach, each agent communicates only with the central\nplanner or only with its neighbors, respectively, to plan feasible and safe\npaths online. We improve our methods based on the idea of model predictive\ncontrol to increase the training efficiency and sample utilization of agents.\nThe experimental results in both simulation, indoor, and outdoor environments\nvalidate the effectiveness of our method. 
The video is available at\nhttps://www.bilibili.com/video/BV1gw41197hV/?vd_source=9de61aecdd9fb684e546d032ef7fe7bf\n","authors":["Qizhen Wu","Lei Chen","Kexin Liu","Jinhu Lv"],"pdf_url":"https://arxiv.org/pdf/2310.16659v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.09576v2","updated":"2023-10-25T14:13:57Z","published":"2023-04-19T11:27:09Z","title":"Leveraging the two timescale regime to demonstrate convergence of neural\n networks","summary":" We study the training dynamics of shallow neural networks, in a two-timescale\nregime in which the stepsizes for the inner layer are much smaller than those\nfor the outer layer. In this regime, we prove convergence of the gradient flow\nto a global optimum of the non-convex optimization problem in a simple\nunivariate setting. The number of neurons need not be asymptotically large for\nour result to hold, distinguishing our result from popular recent approaches\nsuch as the neural tangent kernel or mean-field regimes. Experimental\nillustration is provided, showing that the stochastic gradient descent behaves\naccording to our description of the gradient flow and thus converges to a\nglobal optimum in the two-timescale regime, but can fail outside of this\nregime.\n","authors":["Pierre Marion","Raphaël Berthier"],"pdf_url":"https://arxiv.org/pdf/2304.09576v2.pdf","comment":"NeurIPS 2023. 34 pages, 10 figures"},{"id":"http://arxiv.org/abs/2310.16656v1","updated":"2023-10-25T14:10:08Z","published":"2023-10-25T14:10:08Z","title":"A Picture is Worth a Thousand Words: Principled Recaptioning Improves\n Image Generation","summary":" Text-to-image diffusion models achieved a remarkable leap in capabilities\nover the last few years, enabling high-quality and diverse synthesis of images\nfrom a textual prompt. However, even the most advanced models often struggle to\nprecisely follow all of the directions in their prompts. 
The vast majority of\nthese models are trained on datasets consisting of (image, caption) pairs where\nthe images often come from the web, and the captions are their HTML alternate\ntext. A notable example is the LAION dataset, used by Stable Diffusion and\nother models. In this work we observe that these captions are often of low\nquality, and argue that this significantly affects the model's capability to\nunderstand nuanced semantics in the textual prompts. We show that by relabeling\nthe corpus with a specialized automatic captioning model and training a\ntext-to-image model on the recaptioned dataset, the model benefits\nsubstantially across the board. First, in overall image quality: e.g. FID 14.84\nvs. the baseline of 17.87, and 64.3% improvement in faithful image generation\naccording to human evaluation. Second, in semantic alignment, e.g. semantic\nobject accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and\npositional alignment 62.42 vs. 57.60. We analyze various ways to relabel the\ncorpus and provide evidence that this technique, which we call RECAP, both\nreduces the train-inference discrepancy and provides the model with more\ninformation per example, increasing sample efficiency and allowing the model to\nbetter understand the relations between captions and images.\n","authors":["Eyal Segalis","Dani Valevski","Danny Lumen","Yossi Matias","Yaniv Leviathan"],"pdf_url":"https://arxiv.org/pdf/2310.16656v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16655v1","updated":"2023-10-25T14:09:53Z","published":"2023-10-25T14:09:53Z","title":"Towards Control-Centric Representations in Reinforcement Learning from\n Images","summary":" Image-based Reinforcement Learning is a practical yet challenging task. A\nmajor hurdle lies in extracting control-centric representations while\ndisregarding irrelevant information. 
While approaches that follow the\nbisimulation principle exhibit the potential in learning state representations\nto address this issue, they still grapple with the limited expressive capacity\nof latent dynamics and the inadaptability to sparse reward environments. To\naddress these limitations, we introduce ReBis, which aims to capture\ncontrol-centric information by integrating reward-free control information\nalongside reward-specific knowledge. ReBis utilizes a transformer architecture\nto implicitly model the dynamics and incorporates block-wise masking to\neliminate spatiotemporal redundancy. Moreover, ReBis combines\nbisimulation-based loss with asymmetric reconstruction loss to prevent feature\ncollapse in environments with sparse rewards. Empirical studies on two large\nbenchmarks, including Atari games and DeepMind Control Suite, demonstrate that\nReBis has superior performance compared to existing methods, proving its\neffectiveness.\n","authors":["Chen Liu","Hongyu Zang","Xin Li","Yong Heng","Yifei Wang","Zhen Fang","Yisen Wang","Mingzhong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16655v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.09097v2","updated":"2023-10-25T14:05:10Z","published":"2023-04-07T07:03:54Z","title":"Sheaf Neural Networks for Graph-based Recommender Systems","summary":" Recent progress in Graph Neural Networks has resulted in wide adoption by\nmany applications, including recommendation systems. The reason for Graph\nNeural Networks' superiority over other approaches is that many problems in\nrecommendation systems can be naturally modeled as graphs, where nodes can be\neither users or items and edges represent preference relationships. In current\nGraph Neural Network approaches, nodes are represented with a static vector\nlearned at training time. This static vector might only be suitable to capture\nsome of the nuances of users or items they define. 
To overcome this limitation,\nwe propose using a recently proposed model inspired by category theory: Sheaf\nNeural Networks. Sheaf Neural Networks, and their connected Laplacian, can\naddress the previous problem by associating every node (and edge) with a vector\nspace rather than a single vector. The vector space representation is richer\nand allows picking the proper representation at inference time. This approach\ncan be generalized for different related tasks on graphs and achieves\nstate-of-the-art performance in terms of F1-Score@N in collaborative filtering\nand Hits@20 in link prediction. For collaborative filtering, the approach is\nevaluated on the MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a\n5.4% improvement and on Book-Crossing with a 2.8% improvement, while for link\nprediction on the ogbl-ddi dataset with a 1.6% refinement with respect to the\nrespective baselines.\n","authors":["Antonio Purificato","Giulia Cassarà","Pietro Liò","Fabrizio Silvestri"],"pdf_url":"https://arxiv.org/pdf/2304.09097v2.pdf","comment":"9 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.16652v1","updated":"2023-10-25T14:03:11Z","published":"2023-10-25T14:03:11Z","title":"How Robust is Federated Learning to Communication Error? A Comparison\n Study Between Uplink and Downlink Channels","summary":" Because of its privacy-preserving capability, federated learning (FL) has\nattracted significant attention from both academia and industry. However, when\nbeing implemented over wireless networks, it is not clear how much\ncommunication error can be tolerated by FL. This paper investigates the\nrobustness of FL to the uplink and downlink communication error. Our\ntheoretical analysis reveals that the robustness depends on two critical\nparameters, namely the number of clients and the numerical range of model\nparameters. 
It is also shown that the uplink communication in FL can tolerate a\nhigher bit error rate (BER) than downlink communication, and this difference is\nquantified by a proposed formula. The findings and theoretical analyses are\nfurther validated by extensive experiments.\n","authors":["Linping Qu","Shenghui Song","Chi-Ying Tsui","Yuyi Mao"],"pdf_url":"https://arxiv.org/pdf/2310.16652v1.pdf","comment":"Submitted to IEEE for possible publication"},{"id":"http://arxiv.org/abs/2209.15543v2","updated":"2023-10-25T13:58:30Z","published":"2022-09-30T15:46:30Z","title":"Bayesian Neural Networks for Geothermal Resource Assessment: Prediction\n with Uncertainty","summary":" We consider the application of machine learning to the evaluation of\ngeothermal resource potential. A supervised learning problem is defined where\nmaps of 10 geological and geophysical features within the state of Nevada, USA\nare used to define geothermal potential across a broad region. We have\navailable a relatively small set of positive training sites (known resources or\nactive power plants) and negative training sites (known drill sites with\nunsuitable geothermal conditions) and use these to constrain and optimize\nartificial neural networks for this classification task. The main objective is\nto predict the geothermal resource potential at unknown sites within a large\ngeographic area where the defining features are known. These predictions could\nbe used to target promising areas for further detailed investigations. We\ndescribe the evolution of our work from defining a specific neural network\narchitecture to training and optimization trials. 
Upon analysis we expose the\ninevitable problems of model variability and resulting prediction uncertainty.\nFinally, to address these problems we apply the concept of Bayesian neural\nnetworks, a heuristic approach to regularization in network training, and make\nuse of the practical interpretation of the formal uncertainty measures they\nprovide.\n","authors":["Stephen Brown","William L. Rodi","Marco Seracini","Chen Gu","Michael Fehler","James Faulds","Connor M. Smith","Sven Treitel"],"pdf_url":"https://arxiv.org/pdf/2209.15543v2.pdf","comment":"27 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.16648v1","updated":"2023-10-25T13:56:02Z","published":"2023-10-25T13:56:02Z","title":"Posterior Consistency for Missing Data in Variational Autoencoders","summary":" We consider the problem of learning Variational Autoencoders (VAEs), i.e., a\ntype of deep generative model, from data with missing values. Such data is\nomnipresent in real-world applications of machine learning because complete\ndata is often impossible or too costly to obtain. We particularly focus on\nimproving a VAE's amortized posterior inference, i.e., the encoder, which in\nthe case of missing data can be susceptible to learning inconsistent posterior\ndistributions regarding the missingness. To this end, we provide a formal\ndefinition of posterior consistency and propose an approach for regularizing an\nencoder's posterior distribution which promotes this consistency. We observe\nthat the proposed regularization suggests a different training objective than\nthat typically considered in the literature when facing missing values.\nFurthermore, we empirically demonstrate that our regularization leads to\nimproved performance in missing value settings in terms of reconstruction\nquality and downstream tasks utilizing uncertainty in the latent space. 
This\nimproved performance can be observed for many classes of VAEs including VAEs\nequipped with normalizing flows.\n","authors":["Timur Sudak","Sebastian Tschiatschek"],"pdf_url":"https://arxiv.org/pdf/2310.16648v1.pdf","comment":"First published in ECML PKDD 2023, Proceedings, Part II, by Springer\n Nature (https://doi.org/10.1007/978-3-031-43415-0_30). This version of the\n work has been extended with the addition of an Appendix, which includes\n proofs, the derivation of the posterior regularization, additional background\n information on technical topics, an extended related work section, and\n additional experimental results"},{"id":"http://arxiv.org/abs/2310.16647v1","updated":"2023-10-25T13:55:35Z","published":"2023-10-25T13:55:35Z","title":"Achieving Constraints in Neural Networks: A Stochastic Augmented\n Lagrangian Approach","summary":" Regularizing Deep Neural Networks (DNNs) is essential for improving\ngeneralizability and preventing overfitting. Fixed penalty methods, though\ncommon, lack adaptability and suffer from hyperparameter sensitivity. In this\npaper, we propose a novel approach to DNN regularization by framing the\ntraining process as a constrained optimization problem. Where the data fidelity\nterm is the minimization objective and the regularization terms serve as\nconstraints. Then, we employ the Stochastic Augmented Lagrangian (SAL) method\nto achieve a more flexible and efficient regularization mechanism. Our approach\nextends beyond black-box regularization, demonstrating significant improvements\nin white-box models, where weights are often subject to hard constraints to\nensure interpretability. Experimental results on image-based classification on\nMNIST, CIFAR10, and CIFAR100 datasets validate the effectiveness of our\napproach. 
SAL consistently achieves higher Accuracy while also achieving better\nconstraint satisfaction, thus showcasing its potential for optimizing DNNs\nunder constrained settings.\n","authors":["Diogo Lavado","Cláudia Soares","Alessandra Micheletti"],"pdf_url":"https://arxiv.org/pdf/2310.16647v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16646v1","updated":"2023-10-25T13:55:14Z","published":"2023-10-25T13:55:14Z","title":"Model predictive control-based value estimation for efficient\n reinforcement learning","summary":" Reinforcement learning suffers from limitations in real practices primarily\ndue to the numbers of required interactions with virtual environments. It\nresults in a challenging problem that we are implausible to obtain an optimal\nstrategy only with a few attempts for many learning method. Hereby, we design\nan improved reinforcement learning method based on model predictive control\nthat models the environment through a data-driven approach. Based on learned\nenvironmental model, it performs multi-step prediction to estimate the value\nfunction and optimize the policy. The method demonstrates higher learning\nefficiency, faster convergent speed of strategies tending to the optimal value,\nand fewer sample capacity space required by experience replay buffers.\nExperimental results, both in classic databases and in a dynamic obstacle\navoidance scenario for unmanned aerial vehicle, validate the proposed\napproaches.\n","authors":["Qizhen Wu","Kexin Liu","Lei Chen"],"pdf_url":"https://arxiv.org/pdf/2310.16646v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03022v2","updated":"2023-10-25T13:51:25Z","published":"2023-06-05T16:38:48Z","title":"Interpretable Alzheimer's Disease Classification Via a Contrastive\n Diffusion Autoencoder","summary":" In visual object classification, humans often justify their choices by\ncomparing objects to prototypical examples within that class. 
We may therefore\nincrease the interpretability of deep learning models by imbuing them with a\nsimilar style of reasoning. In this work, we apply this principle by\nclassifying Alzheimer's Disease based on the similarity of images to training\nexamples within the latent space. We use a contrastive loss combined with a\ndiffusion autoencoder backbone, to produce a semantically meaningful latent\nspace, such that neighbouring latents have similar image-level features. We\nachieve a classification accuracy comparable to black box approaches on a\ndataset of 2D MRI images, whilst producing human interpretable model\nexplanations. Therefore, this work stands as a contribution to the pertinent\ndevelopment of accurate and interpretable deep learning within medical imaging.\n","authors":["Ayodeji Ijishakin","Ahmed Abdulaal","Adamos Hadjivasiliou","Sophie Martin","James Cole"],"pdf_url":"https://arxiv.org/pdf/2306.03022v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2109.03501v2","updated":"2023-10-25T13:44:14Z","published":"2021-09-08T08:50:56Z","title":"How do I update my model? On the resilience of Predictive Process\n Monitoring models to change","summary":" Existing well investigated Predictive Process Monitoring techniques typically\nconstruct a predictive model based on past process executions, and then use it\nto predict the future of new ongoing cases, without the possibility of updating\nit with new cases when they complete their execution. This can make Predictive\nProcess Monitoring too rigid to deal with the variability of processes working\nin real environments that continuously evolve and/or exhibit new variant\nbehaviours over time. As a solution to this problem, we evaluate the use of\nthree different strategies that allow the periodic rediscovery or incremental\nconstruction of the predictive model so as to exploit new available data. 
The\nevaluation focuses on the performance of the new learned predictive models, in\nterms of accuracy and time, against the original one, and uses a number of real\nand synthetic datasets with and without explicit Concept Drift. The results\nprovide an evidence of the potential of incremental learning algorithms for\npredicting process monitoring in real environments.\n","authors":["Williams Rizzi","Chiara Di Francescomarino","Chiara Ghidini","Fabrizio Maria Maggi"],"pdf_url":"https://arxiv.org/pdf/2109.03501v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12127v2","updated":"2023-10-25T13:43:49Z","published":"2023-10-18T17:36:55Z","title":"A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for\n Fairer Instruction-Tuned Machine Translation","summary":" Recent instruction fine-tuned models can solve multiple NLP tasks when\nprompted to do so, with machine translation (MT) being a prominent use case.\nHowever, current research often focuses on standard performance benchmarks,\nleaving compelling fairness and ethical considerations behind. In MT, this\nmight lead to misgendered translations, resulting, among other harms, in the\nperpetuation of stereotypes and prejudices. In this work, we address this gap\nby investigating whether and to what extent such models exhibit gender bias in\nmachine translation and how we can mitigate it. Concretely, we compute\nestablished gender bias metrics on the WinoMT corpus from English to German and\nSpanish. We discover that IFT models default to male-inflected translations,\neven disregarding female occupational stereotypes. Next, using interpretability\nmethods, we unveil that models systematically overlook the pronoun indicating\nthe gender of a target occupation in misgendered translations. 
Finally, based\non this finding, we propose an easy-to-implement and effective bias mitigation\nsolution based on few-shot learning that leads to significantly fairer\ntranslations.\n","authors":["Giuseppe Attanasio","Flor Miriam Plaza-del-Arco","Debora Nozza","Anne Lauscher"],"pdf_url":"https://arxiv.org/pdf/2310.12127v2.pdf","comment":"Accepted at EMNLP 2023. Code and data at\n https://github.com/MilaNLProc/interpretability-mt-gender-bias"},{"id":"http://arxiv.org/abs/2310.16639v1","updated":"2023-10-25T13:39:04Z","published":"2023-10-25T13:39:04Z","title":"Driving through the Concept Gridlock: Unraveling Explainability\n Bottlenecks","summary":" Concept bottleneck models have been successfully used for explainable machine\nlearning by encoding information within the model with a set of human-defined\nconcepts. In the context of human-assisted or autonomous driving,\nexplainability models can help user acceptance and understanding of decisions\nmade by the autonomous vehicle, which can be used to rationalize and explain\ndriver or vehicle behavior. We propose a new approach using concept bottlenecks\nas visual features for control command predictions and explanations of user and\nvehicle behavior. We learn a human-understandable concept layer that we use to\nexplain sequential driving scenes while learning vehicle control commands. This\napproach can then be used to determine whether a change in a preferred gap or\nsteering commands from a human (or autonomous vehicle) is led by an external\nstimulus or change in preferences. 
We achieve competitive performance to latent\nvisual features while gaining interpretability within our model setup.\n","authors":["Jessica Echterhoff","An Yan","Kyungtae Han","Amr Abdelraouf","Rohit Gupta","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2310.16639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16638v1","updated":"2023-10-25T13:38:29Z","published":"2023-10-25T13:38:29Z","title":"Covariate Shift Adaptation Robust to Density-Ratio Estimation","summary":" Consider a scenario where we have access to train data with both covariates\nand outcomes while test data only contains covariates. In this scenario, our\nprimary aim is to predict the missing outcomes of the test data. With this\nobjective in mind, we train parametric regression models under a covariate\nshift, where covariate distributions are different between the train and test\ndata. For this problem, existing studies have proposed covariate shift\nadaptation via importance weighting using the density ratio. This approach\naverages the train data losses, each weighted by an estimated ratio of the\ncovariate densities between the train and test data, to approximate the\ntest-data risk. Although it allows us to obtain a test-data risk minimizer, its\nperformance heavily relies on the accuracy of the density ratio estimation.\nMoreover, even if the density ratio can be consistently estimated, the\nestimation errors of the density ratio also yield bias in the estimators of the\nregression model's parameters of interest. To mitigate these challenges, we\nintroduce a doubly robust estimator for covariate shift adaptation via\nimportance weighting, which incorporates an additional estimator for the\nregression function. Leveraging double machine learning techniques, our\nestimator reduces the bias arising from the density ratio estimation errors. 
We\ndemonstrate the asymptotic distribution of the regression parameter estimator.\nNotably, our estimator remains consistent if either the density ratio estimator\nor the regression function is consistent, showcasing its robustness against\npotential errors in density ratio estimation. Finally, we confirm the soundness\nof our proposed method via simulation studies.\n","authors":["Masahiro Kato"],"pdf_url":"https://arxiv.org/pdf/2310.16638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19706v2","updated":"2023-10-25T13:35:21Z","published":"2023-05-31T10:03:04Z","title":"Necessary and Sufficient Conditions for Optimal Decision Trees using\n Dynamic Programming","summary":" Global optimization of decision trees has shown to be promising in terms of\naccuracy, size, and consequently human comprehensibility. However, many of the\nmethods used rely on general-purpose solvers for which scalability remains an\nissue. Dynamic programming methods have been shown to scale much better because\nthey exploit the tree structure by solving subtrees as independent subproblems.\nHowever, this only works when an objective can be optimized separately for\nsubtrees. We explore this relationship in detail and show necessary and\nsufficient conditions for such separability and generalize previous dynamic\nprogramming approaches into a framework that can optimize any combination of\nseparable objectives and constraints. Experiments on five application domains\nshow the general applicability of this framework, while outperforming the\nscalability of general-purpose solvers by a large margin.\n","authors":["Jacobus G. M. van der Linden","Mathijs M. 
de Weerdt","Emir Demirović"],"pdf_url":"https://arxiv.org/pdf/2305.19706v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16633v1","updated":"2023-10-25T13:33:40Z","published":"2023-10-25T13:33:40Z","title":"Photometric Redshifts with Copula Entropy","summary":" In this paper we propose to apply copula entropy (CE) to photometric\nredshifts. CE is used to measure the correlations between photometric\nmeasurements and redshifts and then the measurements associated with high CEs\nare selected for predicting redshifts. We verified the proposed method on the\nSDSS quasar data. Experimental results show that the accuracy of photometric\nredshifts is improved with the selected measurements compared to the results\nwith all the measurements used in the experiments, especially for the samples\nwith high redshifts. The measurements selected with CE include luminosity\nmagnitude, the brightness in ultraviolet band with standard deviation, and the\nbrightness of the other four bands. Since CE is a rigorously defined\nmathematical concept, the models such derived is interpretable.\n","authors":["Jian Ma"],"pdf_url":"https://arxiv.org/pdf/2310.16633v1.pdf","comment":"15 pages, 7 figures, 1 table"},{"id":"http://arxiv.org/abs/2310.16624v1","updated":"2023-10-25T13:23:08Z","published":"2023-10-25T13:23:08Z","title":"Free-form Flows: Make Any Architecture a Normalizing Flow","summary":" Normalizing Flows are generative models that directly maximize the\nlikelihood. Previously, the design of normalizing flows was largely constrained\nby the need for analytical invertibility. We overcome this constraint by a\ntraining procedure that uses an efficient estimator for the gradient of the\nchange of variables formula. This enables any dimension-preserving neural\nnetwork to serve as a generative model through maximum likelihood training. Our\napproach allows placing the emphasis on tailoring inductive biases precisely to\nthe task at hand. 
Specifically, we achieve excellent results in molecule\ngeneration benchmarks utilizing $E(n)$-equivariant networks. Moreover, our\nmethod is competitive in an inverse problem benchmark, while employing\noff-the-shelf ResNet architectures.\n","authors":["Felix Draxler","Peter Sorrenson","Lea Zimmermann","Armand Rousselot","Ullrich Köthe"],"pdf_url":"https://arxiv.org/pdf/2310.16624v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16620v1","updated":"2023-10-25T13:15:17Z","published":"2023-10-25T13:15:17Z","title":"SpikingJelly: An open-source machine learning infrastructure platform\n for spike-based intelligence","summary":" Spiking neural networks (SNNs) aim to realize brain-inspired intelligence on\nneuromorphic chips with high energy efficiency by introducing neural dynamics\nand spike properties. As the emerging spiking deep learning paradigm attracts\nincreasing interest, traditional programming frameworks cannot meet the demands\nof the automatic differentiation, parallel computation acceleration, and high\nintegration of processing neuromorphic datasets and deployment. In this work,\nwe present the SpikingJelly framework to address the aforementioned dilemma. We\ncontribute a full-stack toolkit for pre-processing neuromorphic datasets,\nbuilding deep SNNs, optimizing their parameters, and deploying SNNs on\nneuromorphic chips. Compared to existing methods, the training of deep SNNs can\nbe accelerated $11\\times$, and the superior extensibility and flexibility of\nSpikingJelly enable users to accelerate custom models at low costs through\nmultilevel inheritance and semiautomatic code generation. 
SpikingJelly paves\nthe way for synthesizing truly energy-efficient SNN-based machine intelligence\nsystems, which will enrich the ecology of neuromorphic computing.\n","authors":["Wei Fang","Yanqi Chen","Jianhao Ding","Zhaofei Yu","Timothée Masquelier","Ding Chen","Liwei Huang","Huihui Zhou","Guoqi Li","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2310.16620v1.pdf","comment":"Accepted in Science Advances\n (https://www.science.org/doi/10.1126/sciadv.adi1480)"},{"id":"http://arxiv.org/abs/2308.08561v2","updated":"2023-10-25T13:13:01Z","published":"2023-08-14T13:18:40Z","title":"Implementation of The Future of Drug Discovery: QuantumBased Machine\n Learning Simulation (QMLS)","summary":" The Research & Development (R&D) phase of drug development is a lengthy and\ncostly process. To revolutionize this process, we introduce our new concept\nQMLS to shorten the whole R&D phase to three to six months and decrease the\ncost to merely fifty to eighty thousand USD. For Hit Generation, Machine\nLearning Molecule Generation (MLMG) generates possible hits according to the\nmolecular structure of the target protein while the Quantum Simulation (QS)\nfilters molecules from the primary essay based on the reaction and binding\neffectiveness with the target protein. Then, For Lead Optimization, the\nresultant molecules generated and filtered from MLMG and QS are compared, and\nmolecules that appear as a result of both processes will be made into dozens of\nmolecular variations through Machine Learning Molecule Variation (MLMV), while\nothers will only be made into a few variations. Lastly, all optimized molecules\nwould undergo multiple rounds of QS filtering with a high standard for reaction\neffectiveness and safety, creating a few dozen pre-clinical-trail-ready drugs.\nThis paper is based on our first paper, where we pitched the concept of machine\nlearning combined with quantum simulations. 
In this paper we will go over the\ndetailed design and framework of QMLS, including MLMG, MLMV, and QS.\n","authors":["Yifan Zhou","Yew Kee Wong","Yan Shing Liang","Haichuan Qiu","Yu Xi Wu","Bin He"],"pdf_url":"https://arxiv.org/pdf/2308.08561v2.pdf","comment":"13 pages, 6 figures"},{"id":"http://arxiv.org/abs/2307.10455v2","updated":"2023-10-25T13:06:01Z","published":"2023-07-19T20:54:08Z","title":"A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect\n Dataset","summary":" In an effort to catalog insect biodiversity, we propose a new large dataset\nof hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is\ntaxonomically classified by an expert, and also has associated genetic\ninformation including raw nucleotide barcode sequences and assigned barcode\nindex numbers, which are genetically-based proxies for species classification.\nThis paper presents a curated million-image dataset, primarily to train\ncomputer-vision models capable of providing image-based taxonomic assessment,\nhowever, the dataset also presents compelling characteristics, the study of\nwhich would be of interest to the broader machine learning community. Driven by\nthe biological nature inherent to the dataset, a characteristic long-tailed\nclass-imbalance distribution is exhibited. Furthermore, taxonomic labelling is\na hierarchical classification scheme, presenting a highly fine-grained\nclassification problem at lower levels. Beyond spurring interest in\nbiodiversity research within the machine learning community, progress on\ncreating an image-based taxonomic classifier will also further the ultimate\ngoal of all BIOSCAN research: to lay the foundation for a comprehensive survey\nof global biodiversity. 
This paper introduces the dataset and explores the\nclassification task through the implementation and analysis of a baseline\nclassifier.\n","authors":["Zahra Gharaee","ZeMing Gong","Nicholas Pellegrino","Iuliia Zarubiieva","Joakim Bruslund Haurum","Scott C. Lowe","Jaclyn T. A. McKeown","Chris C. Y. Ho","Joschka McLeod","Yi-Yun C Wei","Jireh Agda","Sujeevan Ratnasingham","Dirk Steinke","Angel X. Chang","Graham W. Taylor","Paul Fieguth"],"pdf_url":"https://arxiv.org/pdf/2307.10455v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16608v1","updated":"2023-10-25T13:02:45Z","published":"2023-10-25T13:02:45Z","title":"Performative Prediction: Past and Future","summary":" Predictions in the social world generally influence the target of prediction,\na phenomenon known as performativity. Self-fulfilling and self-negating\npredictions are examples of performativity. Of fundamental importance to\neconomics, finance, and the social sciences, the notion has been absent from\nthe development of machine learning. In machine learning applications,\nperformativity often surfaces as distribution shift. A predictive model\ndeployed on a digital platform, for example, influences consumption and thereby\nchanges the data-generating distribution. We survey the recently founded area\nof performative prediction that provides a definition and conceptual framework\nto study performativity in machine learning. A consequence of performative\nprediction is a natural equilibrium notion that gives rise to new optimization\nchallenges. Another consequence is a distinction between learning and steering,\ntwo mechanisms at play in performative prediction. The notion of steering is in\nturn intimately related to questions of power in digital markets. We review the\nnotion of performative power that gives an answer to the question how much a\nplatform can steer participants through its predictions. 
We end on a discussion\nof future directions, such as the role that performativity plays in contesting\nalgorithmic systems.\n","authors":["Moritz Hardt","Celestine Mendler-Dünner"],"pdf_url":"https://arxiv.org/pdf/2310.16608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16606v1","updated":"2023-10-25T12:51:38Z","published":"2023-10-25T12:51:38Z","title":"AirFL-Mem: Improving Communication-Learning Trade-Off by Long-Term\n Memory","summary":" Addressing the communication bottleneck inherent in federated learning (FL),\nover-the-air FL (AirFL) has emerged as a promising solution, which is, however,\nhampered by deep fading conditions. In this paper, we propose AirFL-Mem, a\nnovel scheme designed to mitigate the impact of deep fading by implementing a\n\\emph{long-term} memory mechanism. Convergence bounds are provided that account\nfor long-term memory, as well as for existing AirFL variants with short-term\nmemory, for general non-convex objectives. The theory demonstrates that\nAirFL-Mem exhibits the same convergence rate of federated averaging (FedAvg)\nwith ideal communication, while the performance of existing schemes is\ngenerally limited by error floors. 
The theoretical results are also leveraged\nto propose a novel convex optimization strategy for the truncation threshold\nused for power control in the presence of Rayleigh fading channels.\nExperimental results validate the analysis, confirming the advantages of a\nlong-term memory mechanism for the mitigation of deep fading.\n","authors":["Haifeng Wen","Hong Xing","Osvaldo Simeone"],"pdf_url":"https://arxiv.org/pdf/2310.16606v1.pdf","comment":"8 pages, 3 figures, this is the full version of the conference\n version that is submitted to IEEE WCNC2024 for possible publication"},{"id":"http://arxiv.org/abs/2310.16602v1","updated":"2023-10-25T12:46:34Z","published":"2023-10-25T12:46:34Z","title":"Parcel loss prediction in last-mile delivery: deep and non-deep\n approaches with insights from Explainable AI","summary":" Within the domain of e-commerce retail, an important objective is the\nreduction of parcel loss during the last-mile delivery phase. The\never-increasing availability of data, including product, customer, and order\ninformation, has made it possible for the application of machine learning in\nparcel loss prediction. However, a significant challenge arises from the\ninherent imbalance in the data, i.e., only a very low percentage of parcels are\nlost. In this paper, we propose two machine learning approaches, namely, Data\nBalance with Supervised Learning (DBSL) and Deep Hybrid Ensemble Learning\n(DHEL), to accurately predict parcel loss. The practical implication of such\npredictions is their value in aiding e-commerce retailers in optimizing\ninsurance-related decision-making policies. We conduct a comprehensive\nevaluation of the proposed machine learning models using one year data from\nBelgian shipments. The findings show that the DHEL model, which combines a\nfeed-forward autoencoder with a random forest, achieves the highest\nclassification performance. 
Furthermore, we use the techniques from Explainable\nAI (XAI) to illustrate how prediction models can be used in enhancing business\nprocesses and augmenting the overall value proposition for e-commerce retailers\nin the last mile delivery.\n","authors":["Jan de Leeuw","Zaharah Bukhsh","Yingqian Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.16602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16600v1","updated":"2023-10-25T12:45:49Z","published":"2023-10-25T12:45:49Z","title":"Balancing central and marginal rejection when combining independent\n significance tests","summary":" A common approach to evaluating the significance of a collection of\n$p$-values combines them with a pooling function, in particular when the\noriginal data are not available. These pooled $p$-values convert a sample of\n$p$-values into a single number which behaves like a univariate $p$-value. To\nclarify discussion of these functions, a telescoping series of alternative\nhypotheses are introduced that communicate the strength and prevalence of\nnon-null evidence in the $p$-values before general pooling formulae are\ndiscussed. A pattern noticed in the UMP pooled $p$-value for a particular\nalternative motivates the definition and discussion of central and marginal\nrejection levels at $\\alpha$. It is proven that central rejection is always\ngreater than or equal to marginal rejection, motivating a quotient to measure\nthe balance between the two for pooled $p$-values. A combining function based\non the $\\chi^2_{\\kappa}$ quantile transformation is proposed to control this\nquotient and shown to be robust to mis-specified parameters relative to the\nUMP. Different powers for different parameter settings motivate a map of\nplausible alternatives based on where this pooled $p$-value is minimized.\n","authors":["Chris Salahub","R. 
Wayne Oldford"],"pdf_url":"https://arxiv.org/pdf/2310.16600v1.pdf","comment":"55 page, 18 figures, public technical report"},{"id":"http://arxiv.org/abs/2310.16597v1","updated":"2023-10-25T12:38:36Z","published":"2023-10-25T12:38:36Z","title":"Beyond IID weights: sparse and low-rank deep Neural Networks are also\n Gaussian Processes","summary":" The infinitely wide neural network has been proven a useful and manageable\nmathematical model that enables the understanding of many phenomena appearing\nin deep learning. One example is the convergence of random deep networks to\nGaussian processes that allows a rigorous analysis of the way the choice of\nactivation function and network weights impacts the training dynamics. In this\npaper, we extend the seminal proof of Matthews et al. (2018) to a larger class\nof initial weight distributions (which we call PSEUDO-IID), including the\nestablished cases of IID and orthogonal weights, as well as the emerging\nlow-rank and structured sparse settings celebrated for their computational\nspeed-up benefits. We show that fully-connected and convolutional networks\ninitialized with PSEUDO-IID distributions are all effectively equivalent up to\ntheir variance. Using our results, one can identify the Edge-of-Chaos for a\nbroader class of neural networks and tune them at criticality in order to\nenhance their training.\n","authors":["Thiziri Nait-Saada","Alireza Naderi","Jared Tanner"],"pdf_url":"https://arxiv.org/pdf/2310.16597v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16592v1","updated":"2023-10-25T12:28:20Z","published":"2023-10-25T12:28:20Z","title":"Over-the-air Federated Policy Gradient","summary":" In recent years, over-the-air aggregation has been widely considered in\nlarge-scale distributed learning, optimization, and sensing. 
In this paper, we\npropose the over-the-air federated policy gradient algorithm, where all agents\nsimultaneously broadcast an analog signal carrying local information to a\ncommon wireless channel, and a central controller uses the received aggregated\nwaveform to update the policy parameters. We investigate the effect of noise\nand channel distortion on the convergence of the proposed algorithm, and\nestablish the complexities of communication and sampling for finding an\n$\\epsilon$-approximate stationary point. Finally, we present some simulation\nresults to show the effectiveness of the algorithm.\n","authors":["Huiwen Yang","Lingying Huang","Subhrakanti Dey","Ling Shi"],"pdf_url":"https://arxiv.org/pdf/2310.16592v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16588v1","updated":"2023-10-25T12:24:56Z","published":"2023-10-25T12:24:56Z","title":"Multi-parallel-task Time-delay Reservoir Computing combining a Silicon\n Microring with WDM","summary":" We numerically demonstrate a microring-based time-delay reservoir computing\nscheme that simultaneously solves three tasks involving time-series prediction,\nclassification, and wireless channel equalization. Each task performed on a\nwavelength-multiplexed channel achieves state-of-the-art performance with\noptimized power and frequency detuning.\n","authors":["Bernard J. Giron Castro","Christophe Peucheret","Darko Zibar","Francesco Da Ros"],"pdf_url":"https://arxiv.org/pdf/2310.16588v1.pdf","comment":"3 pages, 2 figures, Submitted to Optical Fiber Communication\n Conference (OFC) 2024"},{"id":"http://arxiv.org/abs/2310.16587v1","updated":"2023-10-25T12:22:18Z","published":"2023-10-25T12:22:18Z","title":"Adaptive Uncertainty Estimation via High-Dimensional Testing on Latent\n Representations","summary":" Uncertainty estimation aims to evaluate the confidence of a trained deep\nneural network. 
However, existing uncertainty estimation approaches rely on\nlow-dimensional distributional assumptions and thus suffer from the high\ndimensionality of latent features. Existing approaches tend to focus on\nuncertainty on discrete classification probabilities, which leads to poor\ngeneralizability to uncertainty estimation for other tasks. Moreover, most of\nthe literature requires seeing the out-of-distribution (OOD) data in the\ntraining for better estimation of uncertainty, which limits the uncertainty\nestimation performance in practice because the OOD data are typically unseen.\nTo overcome these limitations, we propose a new framework using data-adaptive\nhigh-dimensional hypothesis testing for uncertainty estimation, which leverages\nthe statistical properties of the feature representations. Our method directly\noperates on latent representations and thus does not require retraining the\nfeature encoder under a modified objective. The test statistic relaxes the\nfeature distribution assumptions to high dimensionality, and it is more\ndiscriminative to uncertainties in the latent representations. We demonstrate\nthat encoding features with Bayesian neural networks can enhance testing\nperformance and lead to more accurate uncertainty estimation. We further\nintroduce a family-wise testing procedure to determine the optimal threshold of\nOOD detection, which minimizes the false discovery rate (FDR). Extensive\nexperiments validate the satisfactory performance of our framework on\nuncertainty estimation and task-specific prediction over a variety of\ncompetitors. The experiments on the OOD detection task also show satisfactory\nperformance of our method when the OOD data are unseen in the training. 
Codes\nare available at https://github.com/HKU-MedAI/bnn_uncertainty.\n","authors":["Tsai Hor Chan","Kin Wai Lau","Jiajun Shen","Guosheng Yin","Lequan Yu"],"pdf_url":"https://arxiv.org/pdf/2310.16587v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2301.08110v4","updated":"2023-10-25T12:05:01Z","published":"2023-01-19T15:01:00Z","title":"AtMan: Understanding Transformer Predictions Through Memory Efficient\n Attention Manipulation","summary":" Generative transformer models have become increasingly complex, with large\nnumbers of parameters and the ability to process multiple input modalities.\nCurrent methods for explaining their predictions are resource-intensive. Most\ncrucially, they require prohibitively large amounts of extra memory, since they\nrely on backpropagation which allocates almost twice as much GPU memory as the\nforward pass. This makes it difficult, if not impossible, to use them in\nproduction. We present AtMan that provides explanations of generative\ntransformer models at almost no extra cost. Specifically, AtMan is a\nmodality-agnostic perturbation method that manipulates the attention mechanisms\nof transformers to produce relevance maps for the input with respect to the\noutput prediction. Instead of using backpropagation, AtMan applies a\nparallelizable token-based search method based on cosine similarity\nneighborhood in the embedding space. Our exhaustive experiments on text and\nimage-text benchmarks demonstrate that AtMan outperforms current\nstate-of-the-art gradient-based methods on several metrics while being\ncomputationally efficient. 
As such, AtMan is suitable for use in large model\ninference deployments.\n","authors":["Björn Deiseroth","Mayukh Deb","Samuel Weinbach","Manuel Brack","Patrick Schramowski","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2301.08110v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10692v3","updated":"2023-10-25T12:01:38Z","published":"2023-10-15T14:57:14Z","title":"ACES: Generating Diverse Programming Puzzles with Autotelic Language\n Models and Semantic Descriptors","summary":" Finding and selecting new and interesting problems to solve is at the heart\nof curiosity, science and innovation. We here study automated problem\ngeneration in the context of the open-ended space of python programming\npuzzles. Existing generative models often aim at modeling a reference\ndistribution without any explicit diversity optimization. Other methods\nexplicitly optimizing for diversity do so either in limited hand-coded\nrepresentation spaces or in uninterpretable learned embedding spaces that may\nnot align with human perceptions of interesting variations. With ACES\n(Autotelic Code Exploration via Semantic descriptors), we introduce a new\nautotelic generation method that leverages semantic descriptors produced by a\nlarge language model (LLM) to directly optimize for interesting diversity, as\nwell as few-shot-based generation. Each puzzle is labeled along 10 dimensions,\neach capturing a programming skill required to solve it. ACES generates and\npursues novel and feasible goals to explore that abstract semantic space,\nslowly discovering a diversity of solvable programming puzzles in any given\nrun. Across a set of experiments, we show that ACES discovers a richer\ndiversity of puzzles than existing diversity-maximizing algorithms as measured\nacross a range of diversity metrics. 
We further study whether and in which\nconditions this diversity can translate into the successful training of puzzle\nsolving models.\n","authors":["Julien Pourcel","Cédric Colas","Pierre-Yves Oudeyer","Laetitia Teodorescu"],"pdf_url":"https://arxiv.org/pdf/2310.10692v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16577v1","updated":"2023-10-25T12:00:45Z","published":"2023-10-25T12:00:45Z","title":"Mapping the magnetic field using a magnetometer array with noisy input\n Gaussian process regression","summary":" Ferromagnetic materials in indoor environments give rise to disturbances in\nthe ambient magnetic field. Maps of these magnetic disturbances can be used for\nindoor localisation. A Gaussian process can be used to learn the spatially\nvarying magnitude of the magnetic field using magnetometer measurements and\ninformation about the position of the magnetometer. The position of the\nmagnetometer, however, is frequently only approximately known. This negatively\naffects the quality of the magnetic field map. In this paper, we investigate\nhow an array of magnetometers can be used to improve the quality of the\nmagnetic field map. The position of the array is approximately known, but the\nrelative locations of the magnetometers on the array are known. We include this\ninformation in a novel method to make a map of the ambient magnetic field. We\nstudy the properties of our method in simulation and show that our method\nimproves the map quality. 
We also demonstrate the efficacy of our method with\nexperimental data for the mapping of the magnetic field using an array of 30\nmagnetometers.\n","authors":["Thomas Edridge","Manon Kok"],"pdf_url":"https://arxiv.org/pdf/2310.16577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13384v2","updated":"2023-10-25T11:58:40Z","published":"2023-06-23T09:10:41Z","title":"DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch\n Diffusion in Histopathology","summary":" We present DiffInfinite, a hierarchical diffusion model that generates\narbitrarily large histological images while preserving long-range correlation\nstructural information. Our approach first generates synthetic segmentation\nmasks, subsequently used as conditions for the high-fidelity generative\ndiffusion process. The proposed sampling method can be scaled up to any desired\nimage size while only requiring small patches for fast training. Moreover, it\ncan be parallelized more efficiently than previous large-content generation\nmethods while avoiding tiling artifacts. The training leverages classifier-free\nguidance to augment a small, sparsely annotated dataset with unlabelled data.\nOur method alleviates unique challenges in histopathological imaging practice:\nlarge-scale information, costly manual annotation, and protective data\nhandling. The biological plausibility of DiffInfinite data is evaluated in a\nsurvey by ten experienced pathologists as well as a downstream classification\nand segmentation task. 
Samples from the model score strongly on anti-copying\nmetrics which is relevant for the protection of patient data.\n","authors":["Marco Aversa","Gabriel Nobis","Miriam Hägele","Kai Standvoss","Mihaela Chirica","Roderick Murray-Smith","Ahmed Alaa","Lukas Ruff","Daniela Ivanova","Wojciech Samek","Frederick Klauschen","Bruno Sanguinetti","Luis Oala"],"pdf_url":"https://arxiv.org/pdf/2306.13384v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16574v1","updated":"2023-10-25T11:58:18Z","published":"2023-10-25T11:58:18Z","title":"Large-scale magnetic field maps using structured kernel interpolation\n for Gaussian process regression","summary":" We present a mapping algorithm to compute large-scale magnetic field maps in\nindoor environments with approximate Gaussian process (GP) regression. Mapping\nthe spatial variations in the ambient magnetic field can be used for\nlocalization algorithms in indoor areas. To compute such a map, GP regression\nis a suitable tool because it provides predictions of the magnetic field at new\nlocations along with uncertainty quantification. Because full GP regression has\na complexity that grows cubically with the number of data points,\napproximations for GPs have been extensively studied. In this paper, we build\non the structured kernel interpolation (SKI) framework, speeding up inference\nby exploiting efficient Krylov subspace methods. More specifically, we\nincorporate SKI with derivatives (D-SKI) into the scalar potential model for\nmagnetic field modeling and compute both predictive mean and covariance with a\ncomplexity that is linear in the data points. In our simulations, we show that\nour method achieves better accuracy than current state-of-the-art methods on\nmagnetic field maps with a growing mapping area. 
In our large-scale\nexperiments, we construct magnetic field maps from up to 40000\nthree-dimensional magnetic field measurements in less than two minutes on a\nstandard laptop.\n","authors":["Clara Menzen","Marnix Fetter","Manon Kok"],"pdf_url":"https://arxiv.org/pdf/2310.16574v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16566v1","updated":"2023-10-25T11:43:29Z","published":"2023-10-25T11:43:29Z","title":"Model-enhanced Contrastive Reinforcement Learning for Sequential\n Recommendation","summary":" Reinforcement learning (RL) has been widely applied in recommendation systems\ndue to its potential in optimizing the long-term engagement of users. From the\nperspective of RL, recommendation can be formulated as a Markov decision\nprocess (MDP), where recommendation system (agent) can interact with users\n(environment) and acquire feedback (reward signals). However, it is impractical\nto conduct online interactions with the concern on user experience and\nimplementation complexity, and we can only train RL recommenders with offline\ndatasets containing limited reward signals and state transitions. Therefore,\nthe data sparsity issue of reward signals and state transitions is very severe,\nwhile it has long been overlooked by existing RL recommenders. Worse still, RL\nmethods learn through the trial-and-error mode, but negative feedback cannot be\nobtained in implicit feedback recommendation tasks, which aggravates the\noverestimation problem of offline RL recommender. To address these challenges,\nwe propose a novel RL recommender named model-enhanced contrastive\nreinforcement learning (MCRL). 
On the one hand, we learn a value function to\nestimate the long-term engagement of users, together with a conservative value\nlearning mechanism to alleviate the overestimation problem. On the other hand,\nwe construct some positive and negative state-action pairs to model the reward\nfunction and state transition function with contrastive learning to exploit the\ninternal structure information of MDP. Experiments demonstrate that the\nproposed method significantly outperforms existing offline RL and\nself-supervised RL methods with different representative backbone networks on\ntwo real-world datasets.\n","authors":["Chengpeng Li","Zhengyi Yang","Jizhi Zhang","Jiancan Wu","Dingxian Wang","Xiangnan He","Xiang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16566v1.pdf","comment":"11 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.16560v1","updated":"2023-10-25T11:28:26Z","published":"2023-10-25T11:28:26Z","title":"Label Propagation for Graph Label Noise","summary":" Label noise is a common challenge in large datasets, as it can significantly\ndegrade the generalization ability of deep neural networks. Most existing\nstudies focus on noisy labels in computer vision; however, graph models\nencompass both node features and graph topology as input, and become more\nsusceptible to label noise through message-passing mechanisms. Recently, only a\nfew works have been proposed to tackle the label noise on graphs. One major\nlimitation is that they assume the graph is homophilous and the labels are\nsmoothly distributed. Nevertheless, real-world graphs may contain varying\ndegrees of heterophily or even be heterophily-dominated, leading to the\ninadequacy of current methods. In this paper, we study graph label noise in the\ncontext of arbitrary heterophily, with the aim of rectifying noisy labels and\nassigning labels to previously unlabeled nodes. We begin by conducting two\nempirical analyses to explore the impact of graph homophily on graph label\nnoise. 
Following observations, we propose a simple yet efficient algorithm,\ndenoted as LP4GLN. Specifically, LP4GLN is an iterative algorithm with three\nsteps: (1) reconstruct the graph to recover the homophily property, (2) utilize\nlabel propagation to rectify the noisy labels, (3) select high-confidence\nlabels to retain for the next iteration. By iterating these steps, we obtain a\nset of correct labels, ultimately achieving high accuracy in the node\nclassification task. The theoretical analysis is also provided to demonstrate\nits remarkable denoising \"effect\". Finally, we conduct experiments on 10\nbenchmark datasets under varying graph heterophily levels and noise types,\ncomparing the performance of LP4GLN with 7 typical baselines. Our results\nillustrate the superior performance of the proposed LP4GLN.\n","authors":["Yao Cheng","Caihua Shan","Yifei Shen","Xiang Li","Siqiang Luo","Dongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.16560v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.16572v2","updated":"2023-10-25T11:10:57Z","published":"2023-08-31T09:13:30Z","title":"CL-MAE: Curriculum-Learned Masked Autoencoders","summary":" Masked image modeling has been demonstrated as a powerful pretext task for\ngenerating robust representations that can be effectively generalized across\nmultiple downstream tasks. Typically, this approach involves randomly masking\npatches (tokens) in input images, with the masking strategy remaining unchanged\nduring training. In this paper, we propose a curriculum learning approach that\nupdates the masking strategy to continually increase the complexity of the\nself-supervised reconstruction task. We conjecture that, by gradually\nincreasing the task complexity, the model can learn more sophisticated and\ntransferable representations. 
To facilitate this, we introduce a novel\nlearnable masking module that possesses the capability to generate masks of\ndifferent complexities, and integrate the proposed module into masked\nautoencoders (MAE). Our module is jointly trained with the MAE, while adjusting\nits behavior during training, transitioning from a partner to the MAE\n(optimizing the same reconstruction loss) to an adversary (optimizing the\nopposite loss), while passing through a neutral state. The transition between\nthese behaviors is smooth, being regulated by a factor that is multiplied with\nthe reconstruction loss of the masking module. The resulting training procedure\ngenerates an easy-to-hard curriculum. We train our Curriculum-Learned Masked\nAutoencoder (CL-MAE) on ImageNet and show that it exhibits superior\nrepresentation learning capabilities compared to MAE. The empirical results on\nfive downstream tasks confirm our conjecture, demonstrating that curriculum\nlearning can be successfully used to self-supervise masked autoencoders. We\nrelease our code at https://github.com/ristea/cl-mae.\n","authors":["Neelu Madan","Nicolae-Catalin Ristea","Kamal Nasrollahi","Thomas B. Moeslund","Radu Tudor Ionescu"],"pdf_url":"https://arxiv.org/pdf/2308.16572v2.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2310.16552v1","updated":"2023-10-25T11:10:08Z","published":"2023-10-25T11:10:08Z","title":"DECWA : Density-Based Clustering using Wasserstein Distance","summary":" Clustering is a data analysis method for extracting knowledge by discovering\ngroups of data called clusters. Among these methods, state-of-the-art\ndensity-based clustering methods have proven to be effective for\narbitrary-shaped clusters. Despite their encouraging results, they suffer to\nfind low-density clusters, near clusters with similar densities, and\nhigh-dimensional data. 
Our proposals are a new characterization of clusters and\na new clustering algorithm based on spatial density and probabilistic approach.\nFirst of all, sub-clusters are built using spatial density represented as\nprobability density function ($p.d.f$) of pairwise distances between points. A\nmethod is then proposed to agglomerate similar sub-clusters by using both their\ndensity ($p.d.f$) and their spatial distance. The key idea we propose is to use\nthe Wasserstein metric, a powerful tool to measure the distance between $p.d.f$\nof sub-clusters. We show that our approach outperforms other state-of-the-art\ndensity-based clustering methods on a wide variety of datasets.\n","authors":["Nabil El Malki","Robin Cugny","Olivier Teste","Franck Ravat"],"pdf_url":"https://arxiv.org/pdf/2310.16552v1.pdf","comment":"6 pages, CIKM 2020"},{"id":"http://arxiv.org/abs/2211.12421v5","updated":"2023-10-25T11:04:13Z","published":"2022-11-11T02:14:28Z","title":"Data-Driven Network Neuroscience: On Data Collection and Benchmark","summary":" This paper presents a comprehensive and quality collection of functional\nhuman brain network data for potential research in the intersection of\nneuroscience, machine learning, and graph analytics. Anatomical and functional\nMRI images have been used to understand the functional connectivity of the\nhuman brain and are particularly important in identifying underlying\nneurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism.\nRecently, the study of the brain in the form of brain networks using machine\nlearning and graph analytics has become increasingly popular, especially to\npredict the early onset of these conditions. A brain network, represented as a\ngraph, retains rich structural and positional information that traditional\nexamination methods are unable to capture. However, the lack of publicly\naccessible brain network data prevents researchers from data-driven\nexplorations. 
One of the main difficulties lies in the complicated\ndomain-specific preprocessing steps and the exhaustive computation required to\nconvert the data from MRI images into brain networks. We bridge this gap by\ncollecting a large amount of MRI images from public databases and a private\nsource, working with domain experts to make sensible design choices, and\npreprocessing the MRI images to produce a collection of brain network datasets.\nThe datasets originate from 6 different sources, cover 4 brain conditions, and\nconsist of a total of 2,702 subjects. We test our graph datasets on 12 machine\nlearning models to provide baselines and validate the data quality on a recent\ngraph analysis model. To lower the barrier to entry and promote the research in\nthis interdisciplinary field, we release our brain network data and complete\npreprocessing details including codes at\nhttps://doi.org/10.17608/k6.auckland.21397377 and\nhttps://github.com/brainnetuoa/data_driven_network_neuroscience.\n","authors":["Jiaxing Xu","Yunhan Yang","David Tse Jung Huang","Sophi Shilpa Gururajapathy","Yiping Ke","Miao Qiao","Alan Wang","Haribalan Kumar","Josh McGeown","Eryn Kwon"],"pdf_url":"https://arxiv.org/pdf/2211.12421v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.02031v4","updated":"2023-10-25T10:55:38Z","published":"2023-10-03T13:17:35Z","title":"OceanGPT: A Large Language Model for Ocean Science Tasks","summary":" Ocean science, which delves into the oceans that are reservoirs of life and\nbiodiversity, is of great significance given that oceans cover over 70% of our\nplanet's surface. Recently, advances in Large Language Models (LLMs) have\ntransformed the paradigm in science. 
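Despite the success in other domains,\ncurrent LLMs often fall short in catering to the needs of domain experts like\noceanographers, and the potential of LLMs for ocean science is under-explored.\nThe intrinsic reason may be the immense and intricate nature of ocean data as\nwell as the necessity for higher granularity and richness in knowledge. To\nalleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean\ndomain, which is expert in various ocean science tasks. We propose DoInstruct,\na novel framework to automatically obtain a large volume of ocean domain\ninstruction data, which generates instructions based on multi-agent\ncollaboration. Additionally, we construct the first oceanography benchmark,\nOceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through\ncomprehensive experiments, OceanGPT not only shows a higher level of knowledge\nexpertise for ocean science tasks but also gains preliminary embodied\nintelligence capabilities in ocean technology. Codes, data and checkpoints will\nsoon be available at https://github.com/zjunlp/KnowLM.\n","authors":["Zhen Bi","Ningyu Zhang","Yida Xue","Yixin Ou","Daxiong Ji","Guozhou Zheng","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2310.02031v4.pdf","comment":"Work in progress. Project Website:\n https://zjunlp.github.io/project/OceanGPT/"},{"id":"http://arxiv.org/abs/2310.16546v1","updated":"2023-10-25T10:53:04Z","published":"2023-10-25T10:53:04Z","title":"Pitfall of Optimism: Distributional Reinforcement Learning by\n Randomizing Risk Criterion","summary":" Distributional reinforcement learning algorithms have attempted to utilize\nestimated uncertainty for exploration, such as optimism in the face of\nuncertainty. However, using the estimated variance for optimistic exploration\nmay cause biased data collection and hinder convergence or performance. In this\npaper, we present a novel distributional reinforcement learning algorithm that\nselects actions by randomizing risk criterion to avoid one-sided tendency on\nrisk. We provide a perturbed distributional Bellman optimality operator by\ndistorting the risk measure and prove the convergence and optimality of the\nproposed method with the weaker contraction property. Our theoretical results\nsupport that the proposed method does not fall into biased exploration and is\nguaranteed to converge to an optimal return. Finally, we empirically show that\nour method outperforms other existing distribution-based algorithms in various\nenvironments including Atari 55 games.\n","authors":["Taehyun Cho","Seungyub Han","Heesoo Lee","Kyungjae Lee","Jungwoo Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16546v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.17190v2","updated":"2023-10-25T10:50:35Z","published":"2023-05-26T18:28:28Z","title":"Multiplication-Free Transformer Training via Piecewise Affine Operations","summary":" Multiplications are responsible for most of the computational cost involved\nin neural network training and inference. Recent research has thus looked for\nways to reduce the cost associated with them. Inspired by Mogami (2020), we\nreplace multiplication with a cheap piecewise affine approximation that is\nachieved by adding the bit representation of the floating point numbers\ntogether as integers. We show that transformers can be trained with the\nresulting modified matrix multiplications on both vision and language tasks\nwith little to no performance impact, and without changes to the training\nhyperparameters. We further replace all non-linearities in the networks making\nthem fully and jointly piecewise affine in both inputs and weights. 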
In this\npaper, we present a novel distributional reinforcement learning algorithm that\nselects actions by randomizing risk criterion to avoid one-sided tendency on\nrisk. We provide a perturbed distributional Bellman optimality operator by\ndistorting the risk measure and prove the convergence and optimality of the\nproposed method with the weaker contraction property. Our theoretical results\nsupport that the proposed method does not fall into biased exploration and is\nguaranteed to converge to an optimal return. Finally, we empirically show that\nour method outperforms other existing distribution-based algorithms in various\nenvironments including Atari 55 games.\n","authors":["Taehyun Cho","Seungyub Han","Heesoo Lee","Kyungjae Lee","Jungwoo Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16546v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.17190v2","updated":"2023-10-25T10:50:35Z","published":"2023-05-26T18:28:28Z","title":"Multiplication-Free Transformer Training via Piecewise Affine Operations","summary":" Multiplications are responsible for most of the computational cost involved\nin neural network training and inference. Recent research has thus looked for\nways to reduce the cost associated with them. Inspired by Mogami (2020), we\nreplace multiplication with a cheap piecewise affine approximation that is\nachieved by adding the bit representation of the floating point numbers\ntogether as integers. We show that transformers can be trained with the\nresulting modified matrix multiplications on both vision and language tasks\nwith little to no performance impact, and without changes to the training\nhyperparameters. We further replace all non-linearities in the networks making\nthem fully and jointly piecewise affine in both inputs and weights. 
Finally, we\nshow that we can eliminate all multiplications in the entire training process,\nincluding operations in the forward pass, backward pass and optimizer update,\ndemonstrating the first successful training of modern neural network\narchitectures in a fully multiplication-free fashion.\n","authors":["Atli Kosson","Martin Jaggi"],"pdf_url":"https://arxiv.org/pdf/2305.17190v2.pdf","comment":"Accepted to the 37th Conference on Neural Information Processing\n Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2302.13875v3","updated":"2023-10-25T10:42:47Z","published":"2023-02-27T15:25:21Z","title":"Evaluating Robustness and Uncertainty of Graph Models Under Structural\n Distributional Shifts","summary":" In reliable decision-making systems based on machine learning, models have to\nbe robust to distributional shifts or provide the uncertainty of their\npredictions. In node-level problems of graph learning, distributional shifts\ncan be especially complex since the samples are interdependent. To evaluate the\nperformance of graph models, it is important to test them on diverse and\nmeaningful distributional shifts. However, most graph benchmarks considering\ndistributional shifts for node-level problems focus mainly on node features,\nwhile structural properties are also essential for graph problems. In this\nwork, we propose a general approach for inducing diverse distributional shifts\nbased on graph structure. We use this approach to create data splits according\nto several structural node properties: popularity, locality, and density. In\nour experiments, we thoroughly evaluate the proposed distributional shifts and\nshow that they can be quite challenging for existing graph models. We also\nreveal that simple models often outperform more sophisticated methods on the\nconsidered structural shifts. 
Finally, our experiments provide evidence that\nthere is a trade-off between the quality of learned representations for the\nbase classification task under structural distributional shift and the ability\nto separate the nodes from different distributions using these representations.\n","authors":["Gleb Bazhenov","Denis Kuznedelev","Andrey Malinin","Artem Babenko","Liudmila Prokhorenkova"],"pdf_url":"https://arxiv.org/pdf/2302.13875v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16538v1","updated":"2023-10-25T10:35:09Z","published":"2023-10-25T10:35:09Z","title":"FedTherapist: Mental Health Monitoring with User-Generated Linguistic\n Expressions on Smartphones via Federated Learning","summary":" Psychiatrists diagnose mental disorders via the linguistic use of patients.\nStill, due to data privacy, existing passive mental health monitoring systems\nuse alternative features such as activity, app usage, and location via mobile\ndevices. We propose FedTherapist, a mobile mental health monitoring system that\nutilizes continuous speech and keyboard input in a privacy-preserving way via\nfederated learning. We explore multiple model designs by comparing their\nperformance and overhead for FedTherapist to overcome the complex nature of\non-device language model training on smartphones. We further propose a\nContext-Aware Language Learning (CALL) methodology to effectively utilize\nsmartphones' large and noisy text for mental health signal sensing. Our\nIRB-approved evaluation of the prediction of self-reported depression, stress,\nanxiety, and mood from 46 participants shows higher accuracy of FedTherapist\ncompared with the performance with non-language features, achieving 0.15 AUROC\nimprovement and 8.21% MAE reduction.\n","authors":["Jaemin Shin","Hyungjun Yoon","Seungjoo Lee","Sungjoon Park","Yunxin Liu","Jinho D. 
Choi","Sung-Ju Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16538v1.pdf","comment":"Accepted to the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2310.16527v1","updated":"2023-10-25T10:22:30Z","published":"2023-10-25T10:22:30Z","title":"Enhancing Document Information Analysis with Multi-Task Pre-training: A\n Robust Approach for Information Extraction in Visually-Rich Documents","summary":" This paper introduces a deep learning model tailored for document information\nanalysis, emphasizing document classification, entity relation extraction, and\ndocument visual question answering. The proposed model leverages\ntransformer-based models to encode all the information present in a document\nimage, including textual, visual, and layout information. The model is\npre-trained and subsequently fine-tuned for various document image analysis\ntasks. The proposed model incorporates three additional tasks during the\npre-training phase, including reading order identification of different layout\nsegments in a document image, layout segments categorization as per PubLayNet,\nand generation of the text sequence within a given layout segment (text block).\nThe model also incorporates a collective pre-training scheme where losses of\nall the tasks under consideration, including pre-training and fine-tuning tasks\nwith all datasets, are considered. Additional encoder and decoder blocks are\nadded to the RoBERTa network to generate results for all tasks. The proposed\nmodel achieved impressive results across all tasks, with an accuracy of 95.87%\non the RVL-CDIP dataset for document classification, F1 scores of 0.9306,\n0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets\nrespectively for entity relation extraction, and an ANLS score of 0.8468 on the\nDocVQA dataset for visual question answering. 
The results highlight the\neffectiveness of the proposed model in understanding and interpreting complex\ndocument layouts and content, making it a promising tool for document analysis\ntasks.\n","authors":["Tofik Ali","Partha Pratim Roy"],"pdf_url":"https://arxiv.org/pdf/2310.16527v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16525v1","updated":"2023-10-25T10:19:03Z","published":"2023-10-25T10:19:03Z","title":"Cyclic Directed Probabilistic Graphical Model: A Proposal Based on\n Structured Outcomes","summary":" In the process of building (structural learning) a probabilistic graphical\nmodel from a set of observed data, the directional, cyclic dependencies between\nthe random variables of the model are often found. Existing graphical models\nsuch as Bayesian and Markov networks can reflect such dependencies. However,\nthis requires complicating those models, such as adding additional variables or\ndividing the model graph into separate subgraphs. Herein, we describe a\nprobabilistic graphical model - probabilistic relation network - that allows\nthe direct capture of directional cyclic dependencies during structural\nlearning. This model is based on the simple idea that each sample of the\nobserved data can be represented by an arbitrary graph (structured outcome),\nwhich reflects the structure of the dependencies of the variables included in\nthe sample. Each of the outcomes contains only a part of the graphical model\nstructure; however, a complete graph of the probabilistic model is obtained by\ncombining different outcomes. Such a graph, unlike Bayesian and Markov\nnetworks, can be directed and can have cycles. We explored the full joint\ndistribution and conditional distribution and conditional independence\nproperties of variables in the proposed model. We defined the algorithms for\nconstructing of the model from the dataset and for calculating the conditional\nand full joint distributions. 
We also performed a numerical comparison with\nBayesian and Markov networks. This model does not violate the probability\naxioms, and it supports learning from observed data. Notably, it supports\nprobabilistic inference, making it a prospective tool in data analysis and in\nexpert and design-making applications.\n","authors":["Oleksii Sirotkin"],"pdf_url":"https://arxiv.org/pdf/2310.16525v1.pdf","comment":"41 pages, 11 figures, arXiv:2206.06089v1"},{"id":"http://arxiv.org/abs/2310.16524v1","updated":"2023-10-25T10:18:44Z","published":"2023-10-25T10:18:44Z","title":"Can You Rely on Your Model Evaluation? Improving Model Evaluation with\n Synthetic Test Data","summary":" Evaluating the performance of machine learning models on diverse and\nunderrepresented subgroups is essential for ensuring fairness and reliability\nin real-world applications. However, accurately assessing model performance\nbecomes challenging due to two main issues: (1) a scarcity of test data,\nespecially for small subgroups, and (2) possible distributional shifts in the\nmodel's deployment setting, which may not align with the available test data.\nIn this work, we introduce 3S Testing, a deep generative modeling framework to\nfacilitate model evaluation by generating synthetic test sets for small\nsubgroups and simulating distributional shifts. Our experiments demonstrate\nthat 3S Testing outperforms traditional baselines -- including real test data\nalone -- in estimating model performance on minority subgroups and under\nplausible distributional shifts. In addition, 3S offers intervals around its\nperformance estimates, exhibiting superior coverage of the ground truth\ncompared to existing approaches. 
Overall, these results raise the question of\nwhether we need a paradigm shift away from limited real test data towards\nsynthetic test data.\n","authors":["Boris van Breugel","Nabeel Seedat","Fergus Imrie","Mihaela van der Schaar"],"pdf_url":"https://arxiv.org/pdf/2310.16524v1.pdf","comment":"Advances in Neural Information Processing Systems 36 (NeurIPS 2023).\n Van Breugel & Seedat contributed equally"},{"id":"http://arxiv.org/abs/2310.16520v1","updated":"2023-10-25T10:10:07Z","published":"2023-10-25T10:10:07Z","title":"Towards Self-Interpretable Graph-Level Anomaly Detection","summary":" Graph-level anomaly detection (GLAD) aims to identify graphs that exhibit\nnotable dissimilarity compared to the majority in a collection. However,\ncurrent works primarily focus on evaluating graph-level abnormality while\nfailing to provide meaningful explanations for the predictions, which largely\nlimits their reliability and application scope. In this paper, we investigate a\nnew challenging problem, explainable GLAD, where the learning objective is to\npredict the abnormality of each graph sample with corresponding explanations,\ni.e., the vital subgraph that leads to the predictions. To address this\nchallenging problem, we propose a Self-Interpretable Graph aNomaly dETection\nmodel (SIGNET for short) that detects anomalous graphs as well as generates\ninformative explanations simultaneously. Specifically, we first introduce the\nmulti-view subgraph information bottleneck (MSIB) framework, serving as the\ndesign basis of our self-interpretable GLAD approach. This way SIGNET is able\nto not only measure the abnormality of each graph based on cross-view mutual\ninformation but also provide informative graph rationales by extracting\nbottleneck subgraphs from the input graph and its dual hypergraph in a\nself-supervised way. 
Extensive experiments on 16 datasets demonstrate the\nanomaly detection capability and self-interpretability of SIGNET.\n","authors":["Yixin Liu","Kaize Ding","Qinghua Lu","Fuyi Li","Leo Yu Zhang","Shirui Pan"],"pdf_url":"https://arxiv.org/pdf/2310.16520v1.pdf","comment":"23 pages; accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.15217v2","updated":"2023-10-25T10:09:24Z","published":"2023-06-27T05:37:23Z","title":"Unsupervised Episode Generation for Graph Meta-learning","summary":" We investigate Unsupervised Episode Generation methods to solve Few-Shot\nNode-Classification (FSNC) task via Meta-learning without labels. Dominant\nmeta-learning methodologies for FSNC were developed under the existence of\nabundant labeled nodes from diverse base classes for training, which however\nmay not be possible to obtain in the real-world. Although a few studies tried\nto tackle the label-scarcity problem in graph meta-learning, they still rely on\na few labeled nodes, which hinders the full utilization of the information of\nall nodes in a graph. Despite the effectiveness of graph contrastive learning\n(GCL) methods in the FSNC task without using the label information, they mainly\nlearn generic node embeddings without consideration of the downstream task to\nbe solved, which may limit its performance in the FSNC task. To this end, we\npropose a simple yet effective unsupervised episode generation method to\nbenefit from the generalization ability of meta-learning for the FSNC task,\nwhile resolving the label-scarcity problem. Our proposed method, called\nNeighbors as Queries (NaQ), generates training episodes based on pre-calculated\nnode-node similarity. Moreover, NaQ is model-agnostic; hence, it can be used to\ntrain any existing supervised graph meta-learning methods in an unsupervised\nmanner, while not sacrificing much of their performance or sometimes even\nimproving them. 
Extensive experimental results demonstrate the potential of our\nunsupervised episode generation methods for graph meta-learning towards the\nFSNC task. Our code is available at: https://github.com/JhngJng/NaQ-PyTorch\n","authors":["Jihyeong Jung","Sangwoo Seo","Sungwon Kim","Chanyoung Park"],"pdf_url":"https://arxiv.org/pdf/2306.15217v2.pdf","comment":"12 pages, 12 figures, Preprint version"},{"id":"http://arxiv.org/abs/2310.16516v1","updated":"2023-10-25T10:05:42Z","published":"2023-10-25T10:05:42Z","title":"Particle-based Variational Inference with Generalized Wasserstein\n Gradient Flow","summary":" Particle-based variational inference methods (ParVIs) such as Stein\nvariational gradient descent (SVGD) update the particles based on the\nkernelized Wasserstein gradient flow for the Kullback-Leibler (KL) divergence.\nHowever, the design of kernels is often non-trivial and can be restrictive for\nthe flexibility of the method. Recent works show that functional gradient flow\napproximations with quadratic form regularization terms can improve\nperformance. In this paper, we propose a ParVI framework, called generalized\nWasserstein gradient descent (GWG), based on a generalized Wasserstein gradient\nflow of the KL divergence, which can be viewed as a functional gradient method\nwith a broader class of regularizers induced by convex functions. We show that\nGWG exhibits strong convergence guarantees. We also provide an adaptive version\nthat automatically chooses Wasserstein metric to accelerate convergence. 
In\nexperiments, we demonstrate the effectiveness and efficiency of the proposed\nframework on both simulated and real data problems.\n","authors":["Ziheng Cheng","Shiyue Zhang","Longlin Yu","Cheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.16516v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15694v2","updated":"2023-10-25T10:03:52Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. 
Our experimental results show that COPF\noutperforms strong continual learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.04574v2","updated":"2023-10-25T09:58:53Z","published":"2023-05-08T09:32:05Z","title":"TAPS: Connecting Certified and Adversarial Training","summary":" Training certifiably robust neural networks remains a notoriously hard\nproblem. On one side, adversarial training optimizes under-approximations of\nthe worst-case loss, which leads to insufficient regularization for\ncertification, while on the other, sound certified training methods optimize\nloose over-approximations, leading to over-regularization and poor (standard)\naccuracy. In this work we propose TAPS, an (unsound) certified training method\nthat combines IBP and PGD training to yield precise, although not necessarily\nsound, worst-case loss approximations, reducing over-regularization and\nincreasing certified and standard accuracies. Empirically, TAPS achieves a new\nstate-of-the-art in many settings, e.g., reaching a certified accuracy of\n$22\\%$ on TinyImageNet for $\\ell_\\infty$-perturbations with radius\n$\\epsilon=1/255$. 
We make our implementation and networks public at\nhttps://github.com/eth-sri/taps.\n","authors":["Yuhao Mao","Mark Niklas Müller","Marc Fischer","Martin Vechev"],"pdf_url":"https://arxiv.org/pdf/2305.04574v2.pdf","comment":"NeurIPS'23"},{"id":"http://arxiv.org/abs/2310.16506v1","updated":"2023-10-25T09:47:15Z","published":"2023-10-25T09:47:15Z","title":"Identifying Reasons for Bias: An Argumentation-Based Approach","summary":" As algorithmic decision-making systems become more prevalent in society,\nensuring the fairness of these systems is becoming increasingly important.\nWhilst there has been substantial research in building fair algorithmic\ndecision-making systems, the majority of these methods require access to the\ntraining data, including personal characteristics, and are not transparent\nregarding which individuals are classified unfairly. In this paper, we propose\na novel model-agnostic argumentation-based method to determine why an\nindividual is classified differently in comparison to similar individuals. Our\nmethod uses a quantitative argumentation framework to represent attribute-value\npairs of an individual and of those similar to them, and uses a well-known\nsemantics to identify the attribute-value pairs in the individual contributing\nmost to their different classification. We evaluate our method on two datasets\ncommonly used in the fairness literature and illustrate its effectiveness in\nthe identification of bias.\n","authors":["Madeleine Waller","Odinaldo Rodrigues","Oana Cocarascu"],"pdf_url":"https://arxiv.org/pdf/2310.16506v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2310.16499v1","updated":"2023-10-25T09:33:57Z","published":"2023-10-25T09:33:57Z","title":"Data Optimization in Deep Learning: A Survey","summary":" Large-scale, high-quality data are considered an essential factor for the\nsuccessful application of many deep learning techniques. 
Meanwhile, numerous\nreal-world deep learning tasks still have to contend with the lack of\nsufficient amounts of high-quality data. Additionally, issues such as model\nrobustness, fairness, and trustworthiness are also closely related to training\ndata. Consequently, a huge number of studies in the existing literature have\nfocused on the data aspect in deep learning tasks. Some typical data\noptimization techniques include data augmentation, logit perturbation, sample\nweighting, and data condensation. These techniques usually come from different\ndeep learning divisions and their theoretical inspirations or heuristic\nmotivations may seem unrelated to each other. This study aims to organize a\nwide range of existing data optimization methodologies for deep learning from\nthe previous literature, and makes the effort to construct a comprehensive\ntaxonomy for them. The constructed taxonomy considers the diversity of split\ndimensions, and deep sub-taxonomies are constructed for each dimension. On the\nbasis of the taxonomy, connections among the extensive data optimization\nmethods for deep learning are built in terms of four aspects. We probe into\nrendering several promising and interesting future directions. The constructed\ntaxonomy and the revealed connections will enlighten the better understanding\nof existing methods and the design of novel data optimization techniques.\nFurthermore, our aspiration for this survey is to promote data optimization as\nan independent subdivision of deep learning. 
A curated, up-to-date list of\nresources related to data optimization in deep learning is available at\nhttps://github.com/YaoRujing/Data-Optimization.\n","authors":["Ou Wu","Rujing Yao"],"pdf_url":"https://arxiv.org/pdf/2310.16499v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16496v1","updated":"2023-10-25T09:30:22Z","published":"2023-10-25T09:30:22Z","title":"Citizen participation: crowd-sensed sustainable indoor location services","summary":" In the present era of sustainable innovation, the circular economy paradigm\ndictates the optimal use and exploitation of existing finite resources. At the\nsame time, the transition to smart infrastructures requires considerable\ninvestment in capital, resources and people. In this work, we present a general\nmachine learning approach for offering indoor location awareness without the\nneed to invest in additional and specialised hardware. We explore use cases\nwhere visitors equipped with their smartphones would interact with the\navailable WiFi infrastructure to estimate their location, since the indoor\nrequirement poses a limitation to standard GPS solutions. Results have shown\nthat the proposed approach achieves an accuracy of less than 2 m and the model is\nresilient even in the case where a substantial number of BSSIDs are dropped.\n","authors":["Ioannis Nasios","Konstantinos Vogklis","Avleen Malhi","Anastasia Vayona","Panos Chatziadam","Vasilis Katos"],"pdf_url":"https://arxiv.org/pdf/2310.16496v1.pdf","comment":"Preprint submitted to Elsevier"},{"id":"http://arxiv.org/abs/2310.16492v1","updated":"2023-10-25T09:19:45Z","published":"2023-10-25T09:19:45Z","title":"On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection","summary":" Successful detection of Out-of-Distribution (OoD) data is becoming\nincreasingly important to ensure safe deployment of neural networks. 
One of the\nmain challenges in OoD detection is that neural networks output overconfident\npredictions on OoD data, making it difficult to determine OoD-ness of data solely\nbased on their predictions. Outlier exposure addresses this issue by\nintroducing an additional loss that encourages low-confidence predictions on\nOoD data during training. While outlier exposure has shown promising potential\nin improving OoD detection performance, all previous studies on outlier\nexposure have been limited to utilizing visual outliers. Drawing inspiration\nfrom the recent advancements in vision-language pre-training, this paper\nventures out into the uncharted territory of textual outlier exposure. First, we\nuncover the benefits of using textual outliers by replacing real or virtual\noutliers in the image-domain with textual equivalents. Then, we propose various\nways of generating preferable textual outliers. Our extensive experiments\ndemonstrate that generated textual outliers achieve competitive performance on\nlarge-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical\nanalyses of textual outliers to provide primary criteria for designing\nadvantageous textual outliers: near-distribution, descriptiveness, and\ninclusion of visual semantics.\n","authors":["Sangha Park","Jisoo Mok","Dahuin Jung","Saehyung Lee","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2310.16492v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16491v1","updated":"2023-10-25T09:19:40Z","published":"2023-10-25T09:19:40Z","title":"TSONN: Time-stepping-oriented neural network for solving partial\n differential equations","summary":" Deep neural networks (DNNs), especially physics-informed neural networks\n(PINNs), have recently become a new popular method for solving forward and\ninverse problems governed by partial differential equations (PDEs). 
However,\nthese methods still face challenges in achieving stable training and obtaining\ncorrect results in many problems, since minimizing PDE residuals with PDE-based\nsoft constraints makes the problem ill-conditioned. Different from all existing\nmethods that directly minimize PDE residuals, this work integrates\ntime-stepping method with deep learning, and transforms the original\nill-conditioned optimization problem into a series of well-conditioned\nsub-problems over given pseudo time intervals. The convergence of model\ntraining is significantly improved by following the trajectory of the pseudo\ntime-stepping process, yielding a robust optimization-based PDE solver. Our\nresults show that the proposed method achieves stable training and correct\nresults in many problems that standard PINNs fail to solve, requiring only a\nsimple modification on the loss function. In addition, we demonstrate several\nnovel properties and advantages of time-stepping methods within the framework\nof neural network-based optimization approach, in comparison to traditional\ngrid-based numerical method. Specifically, explicit scheme allows significantly\nlarger time step, while implicit scheme can be implemented as straightforwardly\nas explicit scheme.\n","authors":["Wenbo Cao","Weiwei Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.16491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16487v1","updated":"2023-10-25T09:17:25Z","published":"2023-10-25T09:17:25Z","title":"Hyperparameter Optimization for Multi-Objective Reinforcement Learning","summary":" Reinforcement learning (RL) has emerged as a powerful approach for tackling\ncomplex problems. The recent introduction of multi-objective reinforcement\nlearning (MORL) has further expanded the scope of RL by enabling agents to make\ntrade-offs among multiple objectives. This advancement not only has broadened\nthe range of problems that can be tackled but also created numerous\nopportunities for exploration and advancement. 
Yet, the effectiveness of RL\nagents heavily relies on appropriately setting their hyperparameters. In\npractice, this task often proves to be challenging, leading to unsuccessful\ndeployments of these techniques in various instances. Hence, prior research has\nexplored hyperparameter optimization in RL to address this concern.\n This paper presents an initial investigation into the challenge of\nhyperparameter optimization specifically for MORL. We formalize the problem,\nhighlight its distinctive challenges, and propose a systematic methodology to\naddress it. The proposed methodology is applied to a well-known environment\nusing a state-of-the-art MORL algorithm, and preliminary results are reported.\nOur findings indicate that the proposed methodology can effectively provide\nhyperparameter configurations that significantly enhance the performance of\nMORL agents. Furthermore, this study identifies various future research\nopportunities to further advance the field of hyperparameter optimization for\nMORL.\n","authors":["Florian Felten","Daniel Gareev","El-Ghazali Talbi","Grégoire Danoy"],"pdf_url":"https://arxiv.org/pdf/2310.16487v1.pdf","comment":"Presented at the MODeM workshop https://modem2023.vub.ac.be/#"},{"id":"http://arxiv.org/abs/2310.16485v1","updated":"2023-10-25T09:13:19Z","published":"2023-10-25T09:13:19Z","title":"A Comprehensive Python Library for Deep Learning-Based Event Detection\n in Multivariate Time Series Data and Information Retrieval in NLP","summary":" Event detection in time series data is crucial in various domains, including\nfinance, healthcare, cybersecurity, and science. Accurately identifying events\nin time series data is vital for making informed decisions, detecting\nanomalies, and predicting future trends. Despite extensive research exploring\ndiverse methods for event detection in time series, with deep learning\napproaches being among the most advanced, there is still room for improvement\nand innovation in this field. 
In this paper, we present a new deep learning\nsupervised method for detecting events in multivariate time series data. Our\nmethod combines four distinct novelties compared to existing deep-learning\nsupervised methods. Firstly, it is based on regression instead of binary\nclassification. Secondly, it does not require labeled datasets where each point\nis labeled; instead, it only requires reference events defined as time points\nor intervals of time. Thirdly, it is designed to be robust by using a stacked\nensemble learning meta-model that combines deep learning models, ranging from\nclassic feed-forward neural networks (FFNs) to state-of-the-art architectures\nlike transformers. This ensemble approach can mitigate individual model\nweaknesses and biases, resulting in more robust predictions. Finally, to\nfacilitate practical implementation, we have developed a Python package to\naccompany our proposed method. The package, called eventdetector-ts, can be\ninstalled through the Python Package Index (PyPI). In this paper, we present\nour method and provide a comprehensive guide on the usage of the package. We\nshowcase its versatility and effectiveness through different real-world use\ncases from natural language processing (NLP) to financial security domains.\n","authors":["Menouar Azib","Benjamin Renard","Philippe Garnier","Vincent Génot","Nicolas André"],"pdf_url":"https://arxiv.org/pdf/2310.16485v1.pdf","comment":"Accepted for the 22nd International Conference on Machine Learning\n and Applications (ICMLA)"},{"id":"http://arxiv.org/abs/2306.12795v3","updated":"2023-10-25T09:11:50Z","published":"2023-06-22T10:53:10Z","title":"Learning Unseen Modality Interaction","summary":" Multimodal learning assumes all modality combinations of interest are\navailable during training to learn cross-modal correspondences. 
In this paper,\nwe challenge this modality-complete assumption for multimodal learning and\ninstead strive for generalization to unseen modality combinations during\ninference. We pose the problem of unseen modality interaction and introduce a\nfirst solution. It exploits a module that projects the multidimensional\nfeatures of different modalities into a common space with rich information\npreserved. This allows the information to be accumulated with a simple\nsummation operation across available modalities. To reduce overfitting to less\ndiscriminative modality combinations during training, we further improve the\nmodel learning with pseudo-supervision indicating the reliability of a\nmodality's prediction. We demonstrate that our approach is effective for\ndiverse tasks and modalities by evaluating it for multimodal video\nclassification, robot state regression, and multimedia retrieval. Project\nwebsite: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.\n","authors":["Yunhua Zhang","Hazel Doughty","Cees G. M. Snoek"],"pdf_url":"https://arxiv.org/pdf/2306.12795v3.pdf","comment":"Published at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.11974v3","updated":"2023-10-25T09:09:48Z","published":"2023-06-21T02:02:41Z","title":"Universal adversarial perturbations for multiple classification tasks\n with quantum classifiers","summary":" Quantum adversarial machine learning is an emerging field that studies the\nvulnerability of quantum learning systems against adversarial perturbations and\ndevelops possible defense strategies. Quantum universal adversarial\nperturbations are small perturbations, which can make different input samples\ninto adversarial examples that may deceive a given quantum classifier. This is\na field that was rarely looked into but worthwhile investigating because\nuniversal perturbations might simplify malicious attacks to a large extent,\ncausing unexpected devastation to quantum machine learning models. 
In this\npaper, we take a step forward and explore the quantum universal perturbations\nin the context of heterogeneous classification tasks. In particular, we find\nthat quantum classifiers that achieve almost state-of-the-art accuracy on two\ndifferent classification tasks can be both conclusively deceived by one\ncarefully-crafted universal perturbation. This result is explicitly\ndemonstrated with well-designed quantum continual learning models with elastic\nweight consolidation method to avoid catastrophic forgetting, as well as\nreal-life heterogeneous datasets from hand-written digits and medical MRI\nimages. Our results provide a simple and efficient way to generate universal\nperturbations on heterogeneous classification tasks and thus would provide\nvaluable guidance for future quantum learning technologies.\n","authors":["Yun-Zhong Qiu"],"pdf_url":"https://arxiv.org/pdf/2306.11974v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09863v3","updated":"2023-10-25T09:06:44Z","published":"2023-03-17T10:01:32Z","title":"Deep Nonparametric Estimation of Intrinsic Data Structures by Chart\n Autoencoders: Generalization Error and Robustness","summary":" Autoencoders have demonstrated remarkable success in learning low-dimensional\nlatent features of high-dimensional data across various applications. Assuming\nthat data are sampled near a low-dimensional manifold, we employ chart\nautoencoders, which encode data into low-dimensional latent features on a\ncollection of charts, preserving the topology and geometry of the data\nmanifold. Our paper establishes statistical guarantees on the generalization\nerror of chart autoencoders, and we demonstrate their denoising capabilities by\nconsidering $n$ noisy training samples, along with their noise-free\ncounterparts, on a $d$-dimensional manifold. By training autoencoders, we show\nthat chart autoencoders can effectively denoise the input data with normal\nnoise. 
We prove that, under proper network architectures, chart autoencoders\nachieve a squared generalization error in the order of $\\displaystyle\nn^{-\\frac{2}{d+2}}\\log^4 n$, which depends on the intrinsic dimension of the\nmanifold and only weakly depends on the ambient dimension and noise level. We\nfurther extend our theory on data with noise containing both normal and\ntangential components, where chart autoencoders still exhibit a denoising\neffect for the normal component. As a special case, our theory also applies to\nclassical autoencoders, as long as the data manifold has a global\nparametrization. Our results provide a solid theoretical foundation for the\neffectiveness of autoencoders, which is further validated through several\nnumerical experiments.\n","authors":["Hao Liu","Alex Havrilla","Rongjie Lai","Wenjing Liao"],"pdf_url":"https://arxiv.org/pdf/2303.09863v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16473v1","updated":"2023-10-25T08:53:51Z","published":"2023-10-25T08:53:51Z","title":"Symphony of experts: orchestration with adversarial insights in\n reinforcement learning","summary":" Structured reinforcement learning leverages policies with advantageous\nproperties to reach better performance, particularly in scenarios where\nexploration poses challenges. We explore this field through the concept of\norchestration, where a (small) set of expert policies guides decision-making;\nthe modeling thereof constitutes our first contribution. We then establish\nvalue-functions regret bounds for orchestration in the tabular setting by\ntransferring regret-bound results from adversarial settings. We generalize and\nextend the analysis of natural policy gradient in Agarwal et al. [2021, Section\n5.3] to arbitrary adversarial aggregation strategies. We also extend it to the\ncase of estimated advantage functions, providing insights into sample\ncomplexity both in expectation and high probability. 
A key point of our\napproach lies in its arguably more transparent proofs compared to existing\nmethods. Finally, we present simulations for a stochastic matching toy model.\n","authors":["Matthieu Jonckheere","Chiara Mignacco","Gilles Stoltz"],"pdf_url":"https://arxiv.org/pdf/2310.16473v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16466v1","updated":"2023-10-25T08:44:05Z","published":"2023-10-25T08:44:05Z","title":"Learning Continuous Network Emerging Dynamics from Scarce Observations\n via Data-Adaptive Stochastic Processes","summary":" Learning network dynamics from the empirical structure and spatio-temporal\nobservation data is crucial to revealing the interaction mechanisms of complex\nnetworks in a wide range of domains. However, most existing methods only aim at\nlearning network dynamic behaviors generated by a specific ordinary\ndifferential equation instance, resulting in ineffectiveness for new ones, and\ngenerally require dense observations. The observed data, especially from\nnetwork emerging dynamics, are usually difficult to obtain, which brings\ntrouble to model learning. Therefore, how to learn accurate network dynamics\nwith sparse, irregularly-sampled, partial, and noisy observations remains a\nfundamental challenge. We introduce Neural ODE Processes for Network Dynamics\n(NDP4ND), a new class of stochastic processes governed by stochastic\ndata-adaptive network dynamics, to overcome the challenge and learn continuous\nnetwork dynamics from scarce observations. 
Intensive experiments conducted on\nvarious network dynamics in ecological population evolution, phototaxis\nmovement, brain activity, epidemic spreading, and real-world empirical systems,\ndemonstrate that the proposed method has excellent data adaptability and\ncomputational efficiency, and can adapt to unseen network emerging dynamics,\nproducing accurate interpolation and extrapolation with reducing the ratio of\nrequired observation data to only about 6\\% and improving the learning speed\nfor new dynamics by three orders of magnitude.\n","authors":["Jiaxu Cui","Bingyi Sun","Jiming Liu","Bo Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16466v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2309.10790v2","updated":"2023-10-25T08:39:38Z","published":"2023-09-19T17:39:20Z","title":"Guide Your Agent with Adaptive Multimodal Rewards","summary":" Developing an agent capable of adapting to unseen environments remains a\ndifficult challenge in imitation learning. This work presents Adaptive\nReturn-conditioned Policy (ARP), an efficient framework designed to enhance the\nagent's generalization ability using natural language task descriptions and\npre-trained multimodal encoders. Our key idea is to calculate a similarity\nbetween visual observations and natural language instructions in the\npre-trained multimodal embedding space (such as CLIP) and use it as a reward\nsignal. We then train a return-conditioned policy using expert demonstrations\nlabeled with multimodal rewards. Because the multimodal rewards provide\nadaptive signals at each timestep, our ARP effectively mitigates the goal\nmisgeneralization. This results in superior generalization performances even\nwhen faced with unseen text instructions, compared to existing text-conditioned\npolicies. 
To improve the quality of rewards, we also introduce a fine-tuning\nmethod for pre-trained multimodal encoders, further enhancing the performance.\nVideo demonstrations and source code are available on the project website:\nhttps://sites.google.com/view/2023arp.\n","authors":["Changyeon Kim","Younggyo Seo","Hao Liu","Lisa Lee","Jinwoo Shin","Honglak Lee","Kimin Lee"],"pdf_url":"https://arxiv.org/pdf/2309.10790v2.pdf","comment":"Accepted to NeurIPS 2023. Project webpage:\n https://sites.google.com/view/2023arp"},{"id":"http://arxiv.org/abs/2310.16457v1","updated":"2023-10-25T08:31:04Z","published":"2023-10-25T08:31:04Z","title":"Towards Explainability in Monocular Depth Estimation","summary":" The estimation of depth in two-dimensional images has long been a challenging\nand extensively studied subject in computer vision. Recently, significant\nprogress has been made with the emergence of Deep Learning-based approaches,\nwhich have proven highly successful. This paper focuses on the explainability\nin monocular depth estimation methods, in terms of how humans perceive depth.\nThis preliminary study focuses on one of the most significant visual cues,\nthe relative size, which is prominent in almost all viewed images. We designed\na specific experiment to mimic the experiments in humans and have tested\nstate-of-the-art methods to indirectly assess the explainability in the context\ndefined. In addition, we observed that measuring the accuracy required further\nattention and a particular approach is proposed to this end. 
The results show\nthat a mean accuracy of around 77% across methods is achieved, with some of the\nmethods performing markedly better, thus, indirectly revealing their\ncorresponding potential to uncover monocular depth cues, like relative size.\n","authors":["Vasileios Arampatzakis","George Pavlidis","Kyriakos Pantoglou","Nikolaos Mitianoudis","Nikos Papamarkos"],"pdf_url":"https://arxiv.org/pdf/2310.16457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2111.07058v3","updated":"2023-10-25T08:20:12Z","published":"2021-11-13T06:54:36Z","title":"Bolstering Stochastic Gradient Descent with Model Building","summary":" Stochastic gradient descent method and its variants constitute the core\noptimization algorithms that achieve good convergence rates for solving machine\nlearning problems. These rates are obtained especially when these algorithms\nare fine-tuned for the application at hand. Although this tuning process can\nrequire large computational costs, recent work has shown that these costs can\nbe reduced by line search methods that iteratively adjust the step length. We\npropose an alternative approach to stochastic line search by using a new\nalgorithm based on forward step model building. This model building step\nincorporates second-order information that allows adjusting not only the step\nlength but also the search direction. Noting that deep learning model\nparameters come in groups (layers of tensors), our method builds its model and\ncalculates a new step for each parameter group. This novel diagonalization\napproach makes the selected step lengths adaptive. We provide convergence rate\nanalysis, and experimentally show that the proposed algorithm achieves faster\nconvergence and better generalization in well-known test problems. More\nprecisely, SMB requires less tuning, and shows comparable performance to other\nadaptive methods.\n","authors":["S. 
Ilker Birbil","Ozgur Martin","Gonenc Onay","Figen Oztoprak"],"pdf_url":"https://arxiv.org/pdf/2111.07058v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16453v1","updated":"2023-10-25T08:16:55Z","published":"2023-10-25T08:16:55Z","title":"ClearMark: Intuitive and Robust Model Watermarking via Transposed Model\n Training","summary":" Due to costly efforts during data acquisition and model training, Deep Neural\nNetworks (DNNs) belong to the intellectual property of the model creator.\nHence, unauthorized use, theft, or modification may lead to legal\nrepercussions. Existing DNN watermarking methods for ownership proof are often\nnon-intuitive, embed human-invisible marks, require trust in algorithmic\nassessment that lacks human-understandable attributes, and rely on rigid\nthresholds, making them susceptible to failure in cases of partial watermark\nerasure.\n This paper introduces ClearMark, the first DNN watermarking method designed\nfor intuitive human assessment. ClearMark embeds visible watermarks, enabling\nhuman decision-making without rigid value thresholds while allowing\ntechnology-assisted evaluations. ClearMark defines a transposed model\narchitecture that allows the model to be used in a backward fashion to interweave\nthe watermark with the main task within all model parameters. Compared to\nexisting watermarking methods, ClearMark produces visual watermarks that are\neasy for humans to understand without requiring complex verification algorithms\nor strict thresholds. The watermark is embedded within all model parameters and\nentangled with the main task, exhibiting superior robustness. 
It shows an\n8,544-bit watermark capacity comparable to the strongest existing work.\nCrucially, ClearMark's effectiveness is model and dataset-agnostic, and\nresilient against adversarial model manipulations, as demonstrated in a\ncomprehensive study performed with four datasets and seven architectures.\n","authors":["Torsten Krauß","Jasper Stang","Alexandra Dmitrienko"],"pdf_url":"https://arxiv.org/pdf/2310.16453v1.pdf","comment":"20 pages, 18 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.16452v1","updated":"2023-10-25T08:14:49Z","published":"2023-10-25T08:14:49Z","title":"Faithful Path Language Modelling for Explainable Recommendation over\n Knowledge Graph","summary":" Path reasoning methods over knowledge graphs have gained popularity for their\npotential to improve transparency in recommender systems. However, the\nresulting models still rely on pre-trained knowledge graph embeddings, fail to\nfully exploit the interdependence between entities and relations in the KG for\nrecommendation, and may generate inaccurate explanations. In this paper, we\nintroduce PEARLM, a novel approach that efficiently captures user behaviour and\nproduct-side knowledge through language modelling. With our approach, knowledge\ngraph embeddings are directly learned from paths over the KG by the language\nmodel, which also unifies entities and relations in the same optimisation\nspace. Constraints on the sequence decoding additionally guarantee path\nfaithfulness with respect to the KG. Experiments on two datasets show the\neffectiveness of our approach compared to state-of-the-art baselines. 
Source\ncode and datasets: AVAILABLE AFTER GETTING ACCEPTED.\n","authors":["Giacomo Balloccu","Ludovico Boratto","Christian Cancedda","Gianni Fenu","Mirko Marras"],"pdf_url":"https://arxiv.org/pdf/2310.16452v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.03944v2","updated":"2023-10-25T08:12:18Z","published":"2023-03-07T14:55:05Z","title":"On Momentum-Based Gradient Methods for Bilevel Optimization with\n Nonconvex Lower-Level","summary":" Bilevel optimization is a popular two-level hierarchical optimization, which\nhas been widely applied to many machine learning tasks such as hyperparameter\nlearning, meta learning and continual learning. Although many bilevel\noptimization methods recently have been developed, the bilevel methods are not\nwell studied when the lower-level problem is nonconvex. To fill this gap, in\nthe paper, we study a class of nonconvex bilevel optimization problems, where\nboth upper-level and lower-level problems are nonconvex, and the lower-level\nproblem satisfies Polyak-{\\L}ojasiewicz (PL) condition. We propose an efficient\nmomentum-based gradient bilevel method (MGBiO) to solve these deterministic\nproblems. Meanwhile, we propose a class of efficient momentum-based stochastic\ngradient bilevel methods (MSGBiO and VR-MSGBiO) to solve these stochastic\nproblems. Moreover, we provide a useful convergence analysis framework for our\nmethods. Specifically, under some mild conditions, we prove that our MGBiO\nmethod has a sample (or gradient) complexity of $O(\\epsilon^{-2})$ for finding\nan $\\epsilon$-stationary solution of the deterministic bilevel problems (i.e.,\n$\\|\\nabla F(x)\\|\\leq \\epsilon$), which improves the existing best results by a\nfactor of $O(\\epsilon^{-1})$. 
Meanwhile, we prove that our MSGBiO and VR-MSGBiO\nmethods have sample complexities of $\\tilde{O}(\\epsilon^{-4})$ and\n$\\tilde{O}(\\epsilon^{-3})$, respectively, in finding an $\\epsilon$-stationary\nsolution of the stochastic bilevel problems (i.e., $\\mathbb{E}\\|\\nabla\nF(x)\\|\\leq \\epsilon$), which improves the existing best results by a factor of\n$\\tilde{O}(\\epsilon^{-3})$. Extensive experimental results on bilevel PL game\nand hyper-representation learning demonstrate the efficiency of our algorithms.\nThis paper commemorates the mathematician Boris Polyak (1935 -2023).\n","authors":["Feihu Huang"],"pdf_url":"https://arxiv.org/pdf/2303.03944v2.pdf","comment":"In new version of our paper, we relaxed some assumptions, updated our\n algorithms and added some numerical experiments"},{"id":"http://arxiv.org/abs/2310.16441v1","updated":"2023-10-25T08:08:44Z","published":"2023-10-25T08:08:44Z","title":"Grokking in Linear Estimators -- A Solvable Model that Groks without\n Understanding","summary":" Grokking is the intriguing phenomenon where a model learns to generalize long\nafter it has fit the training data. We show both analytically and numerically\nthat grokking can surprisingly occur in linear networks performing linear tasks\nin a simple teacher-student setup with Gaussian inputs. In this setting, the\nfull training dynamics is derived in terms of the training and generalization\ndata covariance matrix. We present exact predictions on how the grokking time\ndepends on input and output dimensionality, train sample size, regularization,\nand network initialization. We demonstrate that the sharp increase in\ngeneralization accuracy may not imply a transition from \"memorization\" to\n\"understanding\", but can simply be an artifact of the accuracy measure. 
We\nprovide empirical verification for our calculations, along with preliminary\nresults indicating that some predictions also hold for deeper networks, with\nnon-linear activations.\n","authors":["Noam Levi","Alon Beck","Yohai Bar-Sinai"],"pdf_url":"https://arxiv.org/pdf/2310.16441v1.pdf","comment":"17 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.16437v1","updated":"2023-10-25T08:03:17Z","published":"2023-10-25T08:03:17Z","title":"Non-isotropic Persistent Homology: Leveraging the Metric Dependency of\n PH","summary":" Persistent Homology is a widely used topological data analysis tool that\ncreates a concise description of the topological properties of a point cloud\nbased on a specified filtration. Most filtrations used for persistent homology\ndepend (implicitly) on a chosen metric, which is typically agnostically chosen\nas the standard Euclidean metric on $\\mathbb{R}^n$. Recent work has tried to\nuncover the 'true' metric on the point cloud using distance-to-measure\nfunctions, in order to obtain more meaningful persistent homology results. Here\nwe propose an alternative look at this problem: we posit that information on\nthe point cloud is lost when restricting persistent homology to a single\n(correct) distance function. Instead, we show how by varying the distance\nfunction on the underlying space and analysing the corresponding shifts in the\npersistence diagrams, we can extract additional topological and geometrical\ninformation. Finally, we numerically show that non-isotropic persistent\nhomology can extract information on orientation, orientational variance, and\nscaling of randomly generated point clouds with good accuracy and conduct some\nexperiments on real-world data.\n","authors":["Vincent P. Grande","Michael T. 
Schaub"],"pdf_url":"https://arxiv.org/pdf/2310.16437v1.pdf","comment":"30 pages, 17 figures, comments welcome!"},{"id":"http://arxiv.org/abs/1912.13490v3","updated":"2023-10-25T07:56:24Z","published":"2019-12-31T18:45:33Z","title":"A Neurocomputational Account of Consciousness: The Goal-Aligning\n Representation Internal Manipulation Theory (GARIM)","summary":" Consciousness, a central element of human cognition, has been studied with\nmultiple scientific approaches spanning neuroscience, psychology, artificial\nintelligence and robotics. Unfortunately, poor integration between these fields\nlimits a full and clear understanding of consciousness. Here we contribute to\nimproving this integration by proposing, within a neurocomputational framework,\nthe `Goal-Aligning Representations Internal Manipulation' (GARIM) theory of\nconsciousness. The central idea of the GARIM theory is that consciousness\nsupports the active manipulation of goal-relevant internal representations\n(e.g., world states, objects, and action sequences), making them more aligned\nwith the goals pursued. These manipulations allow the conscious agent to\ninternally produce the knowledge it lacks to cope with novel conditions and\ngoals, increasing the flexibility of goal-directed behaviour. The manipulation\nof representations is supported by four neuro-functional macro-systems\n(hierarchical perceptual working memories, abstract working memory, internal\nmanipulator, motivational systems) that operate through a set of computational\nmanipulation operations (abstraction, specification, decomposition,\ncomposition). The theory also presents the concept of `GARIM agency', proposing\nthat subjective conscious experience derives from the ability of agents to\ngenerate and control a vivid internally simulated reality. 
Furthermore, the\ntheory highlights the criticalities of the experimental investigation of\nconsciousness, suggesting a new approach to testing consciousness in biological\nand artificial agents. Finally, the GARIM theory can benefit technological\nfields such as machine learning and autonomous robotics (e.g., the manipulation\nprocesses proposed by the theory could be linked to the operations performed by\nsystems based on transformers).\n","authors":["Gianluca Baldassarre","Giovanni Granato"],"pdf_url":"https://arxiv.org/pdf/1912.13490v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.03271v2","updated":"2023-10-25T07:56:21Z","published":"2023-04-06T17:55:27Z","title":"Making AI Less \"Thirsty\": Uncovering and Addressing the Secret Water\n Footprint of AI Models","summary":" The growing carbon footprint of artificial intelligence (AI) models,\nespecially large ones such as GPT-3, has been undergoing public scrutiny.\nUnfortunately, however, the equally important and enormous water (withdrawal\nand consumption) footprint of AI models has remained under the radar. For\nexample, training GPT-3 in Microsoft's state-of-the-art U.S. data centers can\ndirectly evaporate 700,000 liters of clean freshwater, but such information has\nbeen kept a secret. More critically, the global AI demand might be accountable\nfor 4.2 -- 6.6 billion cubic meters of water withdrawal in 2027, which is more\nthan the total annual water withdrawal of 4 -- 6 Denmark or half of the United\nKingdom. This is very concerning, as freshwater scarcity has become one of the\nmost pressing challenges shared by all of us in the wake of the rapidly growing\npopulation, depleting water resources, and aging water infrastructures. To\nrespond to the global water challenges, AI models can, and also must, take\nsocial responsibility and lead by example by addressing their own water\nfootprint. 
In this paper, we provide a principled methodology to estimate the\nwater footprint of AI models, and also discuss the unique spatial-temporal\ndiversities of AI models' runtime water efficiency. Finally, we highlight the\nnecessity of holistically addressing water footprint along with carbon\nfootprint to enable truly sustainable AI.\n","authors":["Pengfei Li","Jianyi Yang","Mohammad A. Islam","Shaolei Ren"],"pdf_url":"https://arxiv.org/pdf/2304.03271v2.pdf","comment":"New updates include discussion on water withdrawal and water\n consumption, scope definition for water, and new estimates of GPT-3's water\n footprint based on Microsoft's new WUE and PUE data. Source codes available\n at: https://github.com/Ren-Research/Making-AI-Less-Thirsty"},{"id":"http://arxiv.org/abs/2309.00082v2","updated":"2023-10-25T07:42:11Z","published":"2023-08-31T18:43:04Z","title":"RePo: Resilient Model-Based Reinforcement Learning by Regularizing\n Posterior Predictability","summary":" Visual model-based RL methods typically encode image observations into\nlow-dimensional representations in a manner that does not eliminate redundant\ninformation. This leaves them susceptible to spurious variations -- changes in\ntask-irrelevant components such as background distractors or lighting\nconditions. In this paper, we propose a visual model-based RL method that\nlearns a latent representation resilient to such spurious variations. Our\ntraining objective encourages the representation to be maximally predictive of\ndynamics and reward, while constraining the information flow from the\nobservation to the latent representation. We demonstrate that this objective\nsignificantly bolsters the resilience of visual model-based RL methods to\nvisual distractors, allowing them to operate in dynamic environments. We then\nshow that while the learned encoder is resilient to spurious variations, it is\nnot invariant under significant distribution shift. 
To address this, we propose\na simple reward-free alignment procedure that enables test time adaptation of\nthe encoder. This allows for quick adaptation to widely differing environments\nwithout having to relearn the dynamics and policy. Our effort is a step towards\nmaking model-based RL a practical and useful tool for dynamic, diverse domains.\nWe show its effectiveness in simulation benchmarks with significant spurious\nvariations as well as a real-world egocentric navigation task with noisy TVs in\nthe background. Videos and code at https://zchuning.github.io/repo-website/.\n","authors":["Chuning Zhu","Max Simchowitz","Siri Gadipudi","Abhishek Gupta"],"pdf_url":"https://arxiv.org/pdf/2309.00082v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03688v2","updated":"2023-10-25T07:41:24Z","published":"2023-08-07T16:08:11Z","title":"AgentBench: Evaluating LLMs as Agents","summary":" Large Language Models (LLMs) are becoming increasingly smart and autonomous,\ntargeting real-world pragmatic missions beyond traditional NLP tasks. As a\nresult, there has been an urgent need to evaluate LLMs as agents on challenging\ntasks in interactive environments. We present AgentBench, a multi-dimensional\nevolving benchmark that currently consists of 8 distinct environments to assess\nLLM-as-Agent's reasoning and decision-making abilities in a multi-turn\nopen-ended generation setting. Our extensive test over 27 API-based and\nopen-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong\nability of acting as agents in complex environments, there is a significant\ndisparity in performance between them and OSS competitors. We identify the\ntypical reasons of failures in environments and LLMs, showing that poor\nlong-term reasoning, decision-making, and instruction following abilities are\nthe main obstacles for developing usable LLM agents. Training on code and high\nquality multi-turn alignment data could improve agent performance. 
Datasets,\nenvironments, and an integrated evaluation package for AgentBench are released\nat \\url{https://github.com/THUDM/AgentBench}.\n","authors":["Xiao Liu","Hao Yu","Hanchen Zhang","Yifan Xu","Xuanyu Lei","Hanyu Lai","Yu Gu","Hangliang Ding","Kaiwen Men","Kejuan Yang","Shudan Zhang","Xiang Deng","Aohan Zeng","Zhengxiao Du","Chenhui Zhang","Sheng Shen","Tianjun Zhang","Yu Su","Huan Sun","Minlie Huang","Yuxiao Dong","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2308.03688v2.pdf","comment":"55 pages"},{"id":"http://arxiv.org/abs/2205.12787v2","updated":"2023-10-25T07:23:16Z","published":"2022-05-25T14:02:02Z","title":"Impartial Games: A Challenge for Reinforcement Learning","summary":" AlphaZero-style reinforcement learning (RL) algorithms excel in various board\ngames but face challenges with impartial games, where players share pieces. We\npresent a concrete example of a game - namely the children's game of nim - and\nother impartial games that seem to be a stumbling block for AlphaZero-style and\nsimilar reinforcement learning algorithms.\n Our findings are consistent with recent studies showing that AlphaZero-style\nalgorithms are vulnerable to adversarial attacks and adversarial perturbations,\nshowing the difficulty of learning to master the games in all legal states.\n We show that nim can be learned on small boards, but AlphaZero-style\nalgorithms learning dramatically slows down when the board size increases.\nIntuitively, the difference between impartial games like nim and partisan games\nlike Chess and Go can be explained by the fact that if a tiny amount of noise\nis added to the system (e.g. if a small part of the board is covered), for\nimpartial games, it is typically not possible to predict whether the position\nis good or bad (won or lost). There is often zero correlation between the\nvisible part of a partly blanked-out position and its correct evaluation. 
This\nsituation starkly contrasts partisan games where a partly blanked-out\nconfiguration typically provides abundant or at least non-trifle information\nabout the value of the fully uncovered position.\n","authors":["Bei Zhou","Søren Riis"],"pdf_url":"https://arxiv.org/pdf/2205.12787v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.06399v3","updated":"2023-10-25T07:16:23Z","published":"2023-08-11T21:46:45Z","title":"Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via\n Mixed-Effect Models and Hierarchical Clustering","summary":" Maize is a major crop providing vital calories in sub-Saharan Africa, Asia\nand Latin America, with a global cultivation area of 197 million hectares in\n2021. Therefore, many statistical models (such as mixed-effect and random\ncoefficients models) and machine learning models (such as random forests and\ndeep learning architectures) have been developed to predict maize yield and how\nit is affected by genotype, environment and genotype-environment interaction\nfactors, including field management. However, these models do not fully\nleverage the network of causal relationships between these factors and the\nhierarchical structure of the agronomic data arising from data collection.\n Bayesian networks (BNs) provide a powerful framework for modelling causal and\nprobabilistic relationships using directed acyclic graphs to illustrate the\nconnections between variables. This study introduces a novel approach that\nintegrates random effects into BN learning. Rooted in the linear mixed-effects\nmodels framework, it is particularly well-suited to hierarchical data. Results\nfrom a real-world agronomic trial suggest that the proposed approach enhances\nBN learning, leading to a more interpretable model and discovering new causal\nconnections. At the same time, the error rate of maize yield prediction is\nreduced from 28% to 17%. 
Therefore, we argue that BNs should be the tool of\nchoice to construct practical decision support tools for hierarchical agronomic\ndata that allow for causal inference.\n","authors":["Lorenzo Valleggi","Marco Scutari","Federico Mattia Stefanini"],"pdf_url":"https://arxiv.org/pdf/2308.06399v3.pdf","comment":"34 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.16412v1","updated":"2023-10-25T06:57:59Z","published":"2023-10-25T06:57:59Z","title":"FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness\n for Semi-Supervised Learning","summary":" Semi-Supervised Learning (SSL) has been an effective way to leverage abundant\nunlabeled data with extremely scarce labeled data. However, most SSL methods\nare commonly based on instance-wise consistency between different data\ntransformations. Therefore, the label guidance on labeled data is hard to be\npropagated to unlabeled data. Consequently, the learning process on labeled\ndata is much faster than on unlabeled data which is likely to fall into a local\nminima that does not favor unlabeled data, leading to sub-optimal\ngeneralization performance. In this paper, we propose FlatMatch which minimizes\na cross-sharpness measure to ensure consistent learning performance between the\ntwo datasets. Specifically, we increase the empirical risk on labeled data to\nobtain a worst-case model which is a failure case that needs to be enhanced.\nThen, by leveraging the richness of unlabeled data, we penalize the prediction\ndifference (i.e., cross-sharpness) between the worst-case model and the\noriginal model so that the learning direction is beneficial to generalization\non unlabeled data. Therefore, we can calibrate the learning process without\nbeing limited to insufficient label information. As a result, the mismatched\nlearning performance can be mitigated, further enabling the effective\nexploitation of unlabeled data and improving SSL performance. 
Through\ncomprehensive validation, we show FlatMatch achieves state-of-the-art results\nin many SSL settings.\n","authors":["Zhuo Huang","Li Shen","Jun Yu","Bo Han","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16412v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2212.04407v4","updated":"2023-10-25T06:57:08Z","published":"2022-12-06T19:51:12Z","title":"Dynamic Decision Frequency with Continuous Options","summary":" In classic reinforcement learning algorithms, agents make decisions at\ndiscrete and fixed time intervals. The duration between decisions becomes a\ncrucial hyperparameter, as setting it too short may increase the problem's\ndifficulty by requiring the agent to make numerous decisions to achieve its\ngoal while setting it too long can result in the agent losing control over the\nsystem. However, physical systems do not necessarily require a constant control\nfrequency, and for learning agents, it is often preferable to operate with a\nlow frequency when possible and a high frequency when necessary. We propose a\nframework called Continuous-Time Continuous-Options (CTCO), where the agent\nchooses options as sub-policies of variable durations. These options are\ntime-continuous and can interact with the system at any desired frequency\nproviding a smooth change of actions. We demonstrate the effectiveness of CTCO\nby comparing its performance to classical RL and temporal-abstraction RL\nmethods on simulated continuous control tasks with various action-cycle times.\nWe show that our algorithm's performance is not affected by the choice of\nenvironment interaction frequency. Furthermore, we demonstrate the efficacy of\nCTCO in facilitating exploration in a real-world visual reaching task for a 7\nDOF robotic arm with sparse rewards.\n","authors":["Amirmohammad Karimi","Jun Jin","Jun Luo","A. 
Rupam Mahmood","Martin Jagersand","Samuele Tosatto"],"pdf_url":"https://arxiv.org/pdf/2212.04407v4.pdf","comment":"Appears in the Proceedings of the 2023 International Conference on\n Intelligent Robots and Systems (IROS). Source code at\n https://github.com/amir-karimi96/continuous-time-continuous-option-policy-gradient.git"},{"id":"http://arxiv.org/abs/2310.15074v2","updated":"2023-10-25T06:50:49Z","published":"2023-10-23T16:32:18Z","title":"MGAS: Multi-Granularity Architecture Search for Effective and Efficient\n Neural Networks","summary":" Differentiable architecture search (DAS) revolutionizes neural architecture\nsearch (NAS) with time-efficient automation, transitioning from discrete\ncandidate sampling and evaluation to differentiable super-net optimization and\ndiscretization. However, existing DAS methods either only conduct\ncoarse-grained operation-level search or manually define the remaining ratios\nfor fine-grained kernel-level and weight-level units, which fail to\nsimultaneously optimize model size and model performance. Furthermore, these\nmethods compromise search quality to reduce memory consumption. To tackle these\nissues, we introduce multi-granularity architecture search (MGAS), a unified\nframework which aims to comprehensively and memory-efficiently explore the\nmulti-granularity search space to discover both effective and efficient neural\nnetworks. Specifically, we learn discretization functions specific to each\ngranularity level to adaptively determine the remaining ratios according to the\nevolving architecture. This ensures an optimal balance among units of different\ngranularity levels for different target model sizes. Considering the memory\ndemands, we break down the super-net optimization and discretization into\nmultiple sub-net stages. Nevertheless, the greedy nature of this approach may\nintroduce bias in the early stages. 
To compensate for the bias, we propose\nprogressive re-evaluation to allow for re-pruning and regrowing of previous\nunits during subsequent stages. Extensive experiments on CIFAR-10, CIFAR-100\nand ImageNet demonstrate that MGAS outperforms other state-of-the-art methods\nin achieving a better trade-off between model performance and model size.\n","authors":["Xiaoyun Liu","Divya Saxena","Jiannong Cao","Yuqing Zhao","Penghui Ruan"],"pdf_url":"https://arxiv.org/pdf/2310.15074v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16410v1","updated":"2023-10-25T06:49:26Z","published":"2023-10-25T06:49:26Z","title":"Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in\n AlphaZero","summary":" Artificial Intelligence (AI) systems have made remarkable progress, attaining\nsuper-human performance across various domains. This presents us with an\nopportunity to further human knowledge and improve human expert performance by\nleveraging the hidden knowledge encoded within these highly performant AI\nsystems. Yet, this knowledge is often hard to extract, and may be hard to\nunderstand or learn from. Here, we show that this is possible by proposing a\nnew method that allows us to extract new chess concepts in AlphaZero, an AI\nsystem that mastered the game of chess via self-play without human supervision.\nOur analysis indicates that AlphaZero may encode knowledge that extends beyond\nthe existing human knowledge, but knowledge that is ultimately not beyond human\ngrasp, and can be successfully learned from. In a human study, we show that\nthese concepts are learnable by top human experts, as four top chess\ngrandmasters show improvements in solving the presented concept prototype\npositions. 
This marks an important first milestone in advancing the frontier of\nhuman knowledge by leveraging AI; a development that could bear profound\nimplications and help us shape how we interact with AI systems across many AI\napplications.\n","authors":["Lisa Schut","Nenad Tomasev","Tom McGrath","Demis Hassabis","Ulrich Paquet","Been Kim"],"pdf_url":"https://arxiv.org/pdf/2310.16410v1.pdf","comment":"61 pages, 29 figures"},{"id":"http://arxiv.org/abs/2310.16409v1","updated":"2023-10-25T06:49:19Z","published":"2023-10-25T06:49:19Z","title":"Multiple Key-value Strategy in Recommendation Systems Incorporating\n Large Language Model","summary":" Recommendation system (RS) plays significant roles in matching users\ninformation needs for Internet applications, and it usually utilizes the\nvanilla neural network as the backbone to handle embedding details. Recently,\nthe large language model (LLM) has exhibited emergent abilities and achieved\ngreat breakthroughs both in the CV and NLP communities. Thus, it is logical to\nincorporate RS with LLM better, which has become an emerging research\ndirection. Although some existing works have made their contributions to this\nissue, they mainly consider the single key situation (e.g. historical\ninteractions), especially in sequential recommendation. The situation of\nmultiple key-value data is simply neglected. This significant scenario is\nmainstream in real practical applications, where the information of users (e.g.\nage, occupation, etc) and items (e.g. title, category, etc) has more than one\nkey. Therefore, we aim to implement sequential recommendations based on\nmultiple key-value data by incorporating RS with LLM. In particular, we\ninstruct tuning a prevalent open-source LLM (Llama 7B) in order to inject\ndomain knowledge of RS into the pre-trained LLM. Since we adopt multiple\nkey-value strategies, LLM is hard to learn well among these keys. 
Thus the\ngeneral and innovative shuffle and mask strategies, as an innovative manner of\ndata argument, are designed. To demonstrate the effectiveness of our approach,\nextensive experiments are conducted on the popular and suitable dataset\nMovieLens which contains multiple keys-value. The experimental results\ndemonstrate that our approach can nicely and effectively complete this\nchallenging issue.\n","authors":["Dui Wang","Xiangyu Hou","Xiaohui Yang","Bo Zhang","Renbing Chen","Daiyue Xue"],"pdf_url":"https://arxiv.org/pdf/2310.16409v1.pdf","comment":"Accepted by CIKM2023 workshop at GenRec'23"},{"id":"http://arxiv.org/abs/2310.16407v1","updated":"2023-10-25T06:46:48Z","published":"2023-10-25T06:46:48Z","title":"Information-Theoretic Generalization Analysis for Topology-aware\n Heterogeneous Federated Edge Learning over Noisy Channels","summary":" With the rapid growth of edge intelligence, the deployment of federated\nlearning (FL) over wireless networks has garnered increasing attention, which\nis called Federated Edge Learning (FEEL). In FEEL, both mobile devices\ntransmitting model parameters over noisy channels and collecting data in\ndiverse environments pose challenges to the generalization of trained models.\nMoreover, devices can engage in decentralized FL via Device-to-Device\ncommunication while the communication topology of connected devices also\nimpacts the generalization of models. Most recent theoretical studies overlook\nthe incorporation of all these effects into FEEL when developing generalization\nanalyses. In contrast, our work presents an information-theoretic\ngeneralization analysis for topology-aware FEEL in the presence of data\nheterogeneity and noisy channels. Additionally, we propose a novel\nregularization method called Federated Global Mutual Information Reduction\n(FedGMIR) to enhance the performance of models based on our analysis. 
Numerical\nresults validate our theoretical findings and provide evidence for the\neffectiveness of the proposed method.\n","authors":["Zheshun Wu","Zenglin Xu","Hongfang Yu","Jie Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16407v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16401v1","updated":"2023-10-25T06:38:24Z","published":"2023-10-25T06:38:24Z","title":"Graph Neural Networks with a Distribution of Parametrized Graphs","summary":" Traditionally, graph neural networks have been trained using a single\nobserved graph. However, the observed graph represents only one possible\nrealization. In many applications, the graph may encounter uncertainties, such\nas having erroneous or missing edges, as well as edge weights that provide\nlittle informative value. To address these challenges and capture additional\ninformation previously absent in the observed graph, we introduce latent\nvariables to parameterize and generate multiple graphs. We obtain the maximum\nlikelihood estimate of the network parameters in an Expectation-Maximization\n(EM) framework based on the multiple graphs. Specifically, we iteratively\ndetermine the distribution of the graphs using a Markov Chain Monte Carlo\n(MCMC) method, incorporating the principles of PAC-Bayesian theory. 
Numerical\nexperiments demonstrate improvements in performance against baseline models on\nnode classification for heterogeneous graphs and graph regression on chemistry\ndatasets.\n","authors":["See Hian Lee","Feng Ji","Kelin Xia","Wee Peng Tay"],"pdf_url":"https://arxiv.org/pdf/2310.16401v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16397v1","updated":"2023-10-25T06:32:47Z","published":"2023-10-25T06:32:47Z","title":"Learning Efficient Surrogate Dynamic Models with Graph Spline Networks","summary":" While complex simulations of physical systems have been widely used in\nengineering and scientific computing, lowering their often prohibitive\ncomputational requirements has only recently been tackled by deep learning\napproaches. In this paper, we present GraphSplineNets, a novel deep-learning\nmethod to speed up the forecasting of physical systems by reducing the grid\nsize and number of iteration steps of deep surrogate models. Our method uses\ntwo differentiable orthogonal spline collocation methods to efficiently predict\nresponse at any location in time and space. Additionally, we introduce an\nadaptive collocation strategy in space to prioritize sampling from the most\nimportant regions. 
GraphSplineNets improve the accuracy-speedup tradeoff in\nforecasting various dynamical systems with increasing complexity, including the\nheat equation, damped wave propagation, Navier-Stokes equations, and real-world\nocean currents in both regular and irregular domains.\n","authors":["Chuanbo Hua","Federico Berto","Michael Poli","Stefano Massaroli","Jinkyoo Park"],"pdf_url":"https://arxiv.org/pdf/2310.16397v1.pdf","comment":"Published as a conference paper in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.14664v2","updated":"2023-10-25T06:19:05Z","published":"2023-10-23T08:00:03Z","title":"Data Pruning via Moving-one-Sample-out","summary":" In this paper, we propose a novel data-pruning approach called\nmoving-one-sample-out (MoSo), which aims to identify and remove the least\ninformative samples from the training set. The core insight behind MoSo is to\ndetermine the importance of each sample by assessing its impact on the optimal\nempirical risk. This is achieved by measuring the extent to which the empirical\nrisk changes when a particular sample is excluded from the training set.\nInstead of using the computationally expensive leaving-one-out-retraining\nprocedure, we propose an efficient first-order approximator that only requires\ngradient information from different training stages. The key idea behind our\napproximation is that samples with gradients that are consistently aligned with\nthe average gradient of the training set are more informative and should\nreceive higher scores, which could be intuitively understood as follows: if the\ngradient from a specific sample is consistent with the average gradient vector,\nit implies that optimizing the network using the sample will yield a similar\neffect on all remaining samples. 
Experimental results demonstrate that MoSo\neffectively mitigates severe performance degradation at high pruning ratios and\nachieves satisfactory performance across various settings.\n","authors":["Haoru Tan","Sitong Wu","Fei Du","Yukang Chen","Zhibin Wang","Fan Wang","Xiaojuan Qi"],"pdf_url":"https://arxiv.org/pdf/2310.14664v2.pdf","comment":"Accepted by the Thirty-seventh Conference on Neural Information\n Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2103.10385v2","updated":"2023-10-25T06:15:58Z","published":"2021-03-18T17:13:50Z","title":"GPT Understands, Too","summary":" Prompting a pretrained language model with natural language patterns has been\nproved effective for natural language understanding (NLU). However, our\npreliminary study reveals that manual discrete prompts often lead to unstable\nperformance -- e.g., changing a single word in the prompt might result in\nsubstantial performance drop. We propose a novel method P-Tuning that employs\ntrainable continuous prompt embeddings in concatenation with discrete prompts.\nEmpirically, P-Tuning not only stabilizes training by minimizing the gap\nbetween various discrete prompts, but also improves performance by a sizeable\nmargin on a wide range of NLU tasks including LAMA and SuperGLUE. 
P-Tuning is\ngenerally effective for both frozen and tuned language models, under both the\nfully-supervised and few-shot settings.\n","authors":["Xiao Liu","Yanan Zheng","Zhengxiao Du","Ming Ding","Yujie Qian","Zhilin Yang","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2103.10385v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16391v1","updated":"2023-10-25T06:10:57Z","published":"2023-10-25T06:10:57Z","title":"Winning Prize Comes from Losing Tickets: Improve Invariant Learning by\n Exploring Variant Parameters for Out-of-Distribution Generalization","summary":" Out-of-Distribution (OOD) Generalization aims to learn robust models that\ngeneralize well to various environments without fitting to\ndistribution-specific features. Recent studies based on Lottery Ticket\nHypothesis (LTH) address this problem by minimizing the learning target to find\nsome of the parameters that are critical to the task. However, in OOD problems,\nsuch solutions are suboptimal as the learning task contains severe distribution\nnoises, which can mislead the optimization process. Therefore, apart from\nfinding the task-related parameters (i.e., invariant parameters), we propose\nExploring Variant parameters for Invariant Learning (EVIL) which also leverages\nthe distribution knowledge to find the parameters that are sensitive to\ndistribution shift (i.e., variant parameters). Once the variant parameters are\nleft out of invariant learning, a robust subnetwork that is resistant to\ndistribution shift can be found. Additionally, the parameters that are\nrelatively stable across distributions can be considered invariant ones to\nimprove invariant learning. By fully exploring both variant and invariant\nparameters, our EVIL can effectively identify a robust subnetwork to improve\nOOD generalization. 
In extensive experiments on integrated testbed: DomainBed,\nEVIL can effectively and efficiently enhance many popular methods, such as ERM,\nIRM, SAM, etc.\n","authors":["Zhuo Huang","Muyang Li","Li Shen","Jun Yu","Chen Gong","Bo Han","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16391v1.pdf","comment":"27 pages, 9 figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.16822v1","updated":"2023-10-25T17:51:56Z","published":"2023-10-25T17:51:56Z","title":"Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity\n and Relation Extraction","summary":" How can we better extract entities and relations from text? Using multimodal\nextraction with images and text obtains more signals for entities and\nrelations, and aligns them through graphs or hierarchical fusion, aiding in\nextraction. Despite attempts at various fusions, previous works have overlooked\nmany unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes\ninnovative pre-training objectives for entity-object and relation-image\nalignment, extracting objects from images and aligning them with entity and\nrelation prompts for soft pseudo-labels. These labels are used as\nself-supervised signals for pre-training, enhancing the ability to extract\nentities and relations. Experiments on three datasets show an average 3.41% F1\nimprovement over prior SOTA. Additionally, our method is orthogonal to previous\nmultimodal fusions, and using it on prior SOTA fusions further improves 5.47%\nF1.\n","authors":["Xuming Hu","Junzhe Chen","Aiwei Liu","Shiao Meng","Lijie Wen","Philip S. 
Yu"],"pdf_url":"https://arxiv.org/pdf/2310.16822v1.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2310.02674v2","updated":"2023-10-25T14:34:03Z","published":"2023-10-04T09:26:44Z","title":"Land-cover change detection using paired OpenStreetMap data and optical\n high-resolution imagery via object-guided Transformer","summary":" Optical high-resolution imagery and OpenStreetMap (OSM) data are two\nimportant data sources for land-cover change detection. Previous studies in\nthese two data sources focus on utilizing the information in OSM data to aid\nthe change detection on multi-temporal optical high-resolution images. This\npaper pioneers the direct detection of land-cover changes utilizing paired OSM\ndata and optical imagery, thereby broadening the horizons of change detection\ntasks to encompass more dynamic earth observations. To this end, we propose an\nobject-guided Transformer (ObjFormer) architecture by naturally combining the\nprevalent object-based image analysis (OBIA) technique with the advanced vision\nTransformer architecture. The introduction of OBIA can significantly reduce the\ncomputational overhead and memory burden in the self-attention module.\nSpecifically, the proposed ObjFormer has a hierarchical pseudo-siamese encoder\nconsisting of object-guided self-attention modules that extract representative\nfeatures of different levels from OSM data and optical images; a decoder\nconsisting of object-guided cross-attention modules can progressively recover\nthe land-cover changes from the extracted heterogeneous features. In addition\nto the basic supervised binary change detection task, this paper raises a new\nsemi-supervised semantic change detection task that does not require any\nmanually annotated land-cover labels of optical images to train semantic change\ndetectors. Two lightweight semantic decoders are added to ObjFormer to\naccomplish this task efficiently. 
A converse cross-entropy loss is designed to\nfully utilize the negative samples, thereby contributing to the great\nperformance improvement in this task. The first large-scale benchmark dataset\ncontaining 1,287 map-image pairs (1024$\\times$ 1024 pixels for each sample)\ncovering 40 regions on six continents ...(see the manuscript for the full\nabstract)\n","authors":["Hongruixuan Chen","Cuiling Lan","Jian Song","Clifford Broni-Bediako","Junshi Xia","Naoto Yokoya"],"pdf_url":"https://arxiv.org/pdf/2310.02674v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11954v2","updated":"2023-10-25T13:34:13Z","published":"2023-10-18T13:31:10Z","title":"MusicAgent: An AI Agent for Music Understanding and Generation with\n Large Language Models","summary":" AI-empowered music processing is a diverse field that encompasses dozens of\ntasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension\ntasks (e.g., music classification). For developers and amateurs, it is very\ndifficult to grasp all of these tasks to satisfy their requirements in music\nprocessing, especially considering the huge differences in the representations\nof music data and the model applicability across platforms among various tasks.\nConsequently, it is necessary to build a system to organize and integrate these\ntasks, and thus help practitioners to automatically analyze their demands and\ncall suitable tools as solutions to fulfill their requirements. Inspired by the\nrecent success of large language models (LLMs) in task automation, we develop a\nsystem, named MusicAgent, which integrates numerous music-related tools and an\nautonomous workflow to address user requirements. More specifically, we build\n1) a toolset that collects tools from diverse sources, including Hugging Face,\nGitHub, and Web API, etc. 
2) an autonomous workflow empowered by LLMs (e.g.,\nChatGPT) to organize these tools and automatically decompose user requests into\nmultiple sub-tasks and invoke corresponding music tools. The primary goal of\nthis system is to free users from the intricacies of AI-music tools, enabling\nthem to concentrate on the creative aspect. By granting users the freedom to\neffortlessly combine tools, the system offers a seamless and enriching music\nexperience.\n","authors":["Dingyao Yu","Kaitao Song","Peiling Lu","Tianyu He","Xu Tan","Wei Ye","Shikun Zhang","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2310.11954v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16573v1","updated":"2023-10-25T11:58:14Z","published":"2023-10-25T11:58:14Z","title":"Adapt Anything: Tailor Any Image Classifiers across Domains And\n Categories Using Text-to-Image Diffusion Models","summary":" We do not pursue a novel method in this paper, but aim to study if a modern\ntext-to-image diffusion model can tailor any task-adaptive image classifier\nacross domains and categories. Existing domain adaptive image classification\nworks exploit both source and target data for domain alignment so as to\ntransfer the knowledge learned from the labeled source data to the unlabeled\ntarget data. However, with the development of text-to-image diffusion models,\nwe wonder if the high-fidelity synthetic data from the text-to-image generator\ncan serve as a surrogate of the source data in the real world. In this way, we do\nnot need to collect and annotate the source data for each domain adaptation\ntask in a one-for-one manner. Instead, we utilize only one off-the-shelf\ntext-to-image model to synthesize images with category labels derived from the\ncorresponding text prompts, and then leverage the surrogate data as a bridge to\ntransfer the knowledge embedded in the task-agnostic text-to-image generator to\nthe task-oriented image classifier via domain adaptation. 
Such a one-for-all\nadaptation paradigm allows us to adapt anything in the world using only one\ntext-to-image generator as well as the corresponding unlabeled target data.\nExtensive experiments validate the feasibility of the proposed idea, which even\nsurpasses the state-of-the-art domain adaptation works using the source data\ncollected and annotated in real world.\n","authors":["Weijie Chen","Haoyu Wang","Shicai Yang","Lei Zhang","Wei Wei","Yanning Zhang","Luojun Lin","Di Xie","Yueting Zhuang"],"pdf_url":"https://arxiv.org/pdf/2310.16573v1.pdf","comment":"11 pages, 6 figures"},{"id":"http://arxiv.org/abs/2306.12795v3","updated":"2023-10-25T09:11:50Z","published":"2023-06-22T10:53:10Z","title":"Learning Unseen Modality Interaction","summary":" Multimodal learning assumes all modality combinations of interest are\navailable during training to learn cross-modal correspondences. In this paper,\nwe challenge this modality-complete assumption for multimodal learning and\ninstead strive for generalization to unseen modality combinations during\ninference. We pose the problem of unseen modality interaction and introduce a\nfirst solution. It exploits a module that projects the multidimensional\nfeatures of different modalities into a common space with rich information\npreserved. This allows the information to be accumulated with a simple\nsummation operation across available modalities. To reduce overfitting to less\ndiscriminative modality combinations during training, we further improve the\nmodel learning with pseudo-supervision indicating the reliability of a\nmodality's prediction. We demonstrate that our approach is effective for\ndiverse tasks and modalities by evaluating it for multimodal video\nclassification, robot state regression, and multimedia retrieval. Project\nwebsite: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.\n","authors":["Yunhua Zhang","Hazel Doughty","Cees G. M. 
Snoek"],"pdf_url":"https://arxiv.org/pdf/2306.12795v3.pdf","comment":"Published at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16334v1","updated":"2023-10-25T03:30:37Z","published":"2023-10-25T03:30:37Z","title":"AccoMontage-3: Full-Band Accompaniment Arrangement via Sequential Style\n Transfer and Multi-Track Function Prior","summary":" We propose AccoMontage-3, a symbolic music automation system capable of\ngenerating multi-track, full-band accompaniment based on the input of a lead\nmelody with chords (i.e., a lead sheet). The system contains three modular\ncomponents, each modelling a vital aspect of full-band composition. The first\ncomponent is a piano arranger that generates piano accompaniment for the lead\nsheet by transferring texture styles to the chords using latent chord-texture\ndisentanglement and heuristic retrieval of texture donors. The second component\norchestrates the piano accompaniment score into full-band arrangement according\nto the orchestration style encoded by individual track functions. The third\ncomponent, which connects the previous two, is a prior model characterizing the\nglobal structure of orchestration style over the whole piece of music. From end\nto end, the system learns to generate full-band accompaniment in a\nself-supervised fashion, applying style transfer at two levels of polyphonic\ncomposition: texture and orchestration. 
Experiments show that our system\noutperforms the baselines significantly, and the modular design offers\neffective controls in a musically meaningful way.\n","authors":["Jingwei Zhao","Gus Xia","Ye Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16334v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13276v2","updated":"2023-10-25T00:46:42Z","published":"2023-10-20T04:45:44Z","title":"InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution","summary":" Over recent decades, significant advancements in cross-modal retrieval are\nmainly driven by breakthroughs in visual and linguistic modeling. However, a\nrecent study shows that multi-modal data representations tend to cluster within\na limited convex cone (as representation degeneration problem), which hinders\nretrieval performance due to the inseparability of these representations. In\nour study, we first empirically validate the presence of the representation\ndegeneration problem across multiple cross-modal benchmarks and methods. Next,\nto address it, we introduce a novel method, called InvGC, a post-processing\ntechnique inspired by graph convolution and average pooling. Specifically,\nInvGC defines the graph topology within the datasets and then applies graph\nconvolution in a subtractive manner. This method effectively separates\nrepresentations by increasing the distances between data points. To improve the\nefficiency and effectiveness of InvGC, we propose an advanced graph topology,\nLocalAdj, which only aims to increase the distances between each data point and\nits nearest neighbors. To understand why InvGC works, we present a detailed\ntheoretical analysis, proving that the lower bound of recall will be improved\nafter deploying InvGC. 
Extensive empirical results show that InvGC and InvGC\nw/LocalAdj significantly mitigate the representation degeneration problem,\nthereby enhancing retrieval performance.\n Our code is available at\nhttps://github.com/yimuwangcs/Better_Cross_Modal_Retrieval\n","authors":["Xiangru Jian","Yimu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.13276v2.pdf","comment":"Findings of EMNLP 2023"}]},"2023-10-26T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.17644v1","updated":"2023-10-26T17:57:15Z","published":"2023-10-26T17:57:15Z","title":"torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free\n Deep Learning Studies: A Case Study on NLP","summary":" Reproducibility in scientific work has been becoming increasingly important\nin research communities such as machine learning, natural language processing,\nand computer vision communities due to the rapid development of the research\ndomains supported by recent advances in deep learning. In this work, we present\na significantly upgraded version of torchdistill, a modular-driven coding-free\ndeep learning framework significantly upgraded from the initial release, which\nsupports only image classification and object detection tasks for reproducible\nknowledge distillation experiments. To demonstrate that the upgraded framework\ncan support more tasks with third-party libraries, we reproduce the GLUE\nbenchmark results of BERT models using a script based on the upgraded\ntorchdistill, harmonizing with various Hugging Face libraries. All the 27\nfine-tuned BERT models and configurations to reproduce the results are\npublished at Hugging Face, and the model weights have already been widely used\nin research communities. 
We also reimplement popular small-sized models and new\nknowledge distillation methods and perform additional experiments for computer\nvision tasks.\n","authors":["Yoshitomo Matsubara"],"pdf_url":"https://arxiv.org/pdf/2310.17644v1.pdf","comment":"Accepted at the 3rd Workshop for Natural Language Processing Open\n Source Software (NLP-OSS) at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17639v1","updated":"2023-10-26T17:54:52Z","published":"2023-10-26T17:54:52Z","title":"In-Context Learning Dynamics with Random Binary Sequences","summary":" Large language models (LLMs) trained on huge corpora of text datasets\ndemonstrate complex, emergent capabilities, achieving state-of-the-art\nperformance on tasks they were not explicitly trained for. The precise nature\nof LLM capabilities is often mysterious, and different prompts can elicit\ndifferent capabilities through in-context learning. We propose a Cognitive\nInterpretability framework that enables us to analyze in-context learning\ndynamics to understand latent concepts in LLMs underlying behavioral patterns.\nThis provides a more nuanced understanding than success-or-failure evaluation\nbenchmarks, but does not require observing internal activations as a\nmechanistic interpretation of circuits would. Inspired by the cognitive science\nof human randomness perception, we use random binary sequences as context and\nstudy dynamics of in-context learning by manipulating properties of context\ndata, such as sequence length. In the latest GPT-3.5+ models, we find emergent\nabilities to generate pseudo-random numbers and learn basic formal languages,\nwith striking in-context learning dynamics where model outputs transition\nsharply from pseudo-random behaviors to deterministic repetition.\n","authors":["Eric J. Bigelow","Ekdeep Singh Lubana","Robert P. Dick","Hidenori Tanaka","Tomer D. 
Ullman"],"pdf_url":"https://arxiv.org/pdf/2310.17639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17631v1","updated":"2023-10-26T17:48:58Z","published":"2023-10-26T17:48:58Z","title":"JudgeLM: Fine-tuned Large Language Models are Scalable Judges","summary":" Evaluating Large Language Models (LLMs) in open-ended scenarios is\nchallenging because existing benchmarks and metrics can not measure them\ncomprehensively. To address this problem, we propose to fine-tune LLMs as\nscalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in\nopen-ended benchmarks. We first propose a comprehensive, large-scale,\nhigh-quality dataset containing task seeds, LLMs-generated answers, and\nGPT-4-generated judgments for fine-tuning high-performance judges, as well as a\nnew benchmark for evaluating the judges. We train JudgeLM at different scales\nfrom 7B, 13B, to 33B parameters, and conduct a systematic analysis of its\ncapabilities and behaviors. We then analyze the key biases in fine-tuning LLM\nas a judge and consider them as position bias, knowledge bias, and format bias.\nTo address these issues, JudgeLM introduces a bag of techniques including swap\naugmentation, reference support, and reference drop, which clearly enhance the\njudge's performance. JudgeLM obtains the state-of-the-art judge performance on\nboth the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM\nis efficient and the JudgeLM-7B only needs 3 minutes to judge 5K samples with 8\nA100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an\nagreement exceeding 90% that even surpasses human-to-human agreement. 
JudgeLM\nalso demonstrates extended capabilities in being judges of the single answer,\nmultimodal models, multiple answers, and multi-turn chat.\n","authors":["Lianghui Zhu","Xinggang Wang","Xinlong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17631v1.pdf","comment":"30 pages, 23 figures"},{"id":"http://arxiv.org/abs/2310.17630v1","updated":"2023-10-26T17:48:45Z","published":"2023-10-26T17:48:45Z","title":"InstOptima: Evolutionary Multi-objective Instruction Optimization via\n Large Language Model-based Instruction Operators","summary":" Instruction-based language modeling has received significant attention in\npretrained language models. However, the efficiency of instruction engineering\nremains low and hinders the development of instruction studies. Recent studies\nhave focused on automating instruction generation, but they primarily aim to\nimprove performance without considering other crucial objectives that impact\ninstruction quality, such as instruction length and perplexity. Therefore, we\npropose a novel approach (i.e., InstOptima) that treats instruction generation\nas an evolutionary multi-objective optimization problem. In contrast to text\nedition-based methods, our approach utilizes a large language model (LLM) to\nsimulate instruction operators, including mutation and crossover. Furthermore,\nwe introduce an objective-guided mechanism for these operators, allowing the\nLLM to comprehend the objectives and enhance the quality of the generated\ninstructions. 
Experimental results demonstrate improved fine-tuning performance\nand the generation of a diverse set of high-quality instructions.\n","authors":["Heng Yang","Ke Li"],"pdf_url":"https://arxiv.org/pdf/2310.17630v1.pdf","comment":"Accepted by EMNLP Findings"},{"id":"http://arxiv.org/abs/2310.17623v1","updated":"2023-10-26T17:43:13Z","published":"2023-10-26T17:43:13Z","title":"Proving Test Set Contamination in Black Box Language Models","summary":" Large language models are trained on vast amounts of internet data, prompting\nconcerns and speculation that they have memorized public benchmarks. Going from\nspeculation to proof of contamination is challenging, as the pretraining data\nused by proprietary models are often not publicly accessible. We show that it\nis possible to provide provable guarantees of test set contamination in\nlanguage models without access to pretraining data or model weights. Our\napproach leverages the fact that when there is no data contamination, all\norderings of an exchangeable benchmark should be equally likely. In contrast,\nthe tendency for language models to memorize example order means that a\ncontaminated language model will find certain canonical orderings to be much\nmore likely than others. Our test flags potential contamination whenever the\nlikelihood of a canonically ordered benchmark dataset is significantly higher\nthan the likelihood after shuffling the examples. We demonstrate that our\nprocedure is sensitive enough to reliably prove test set contamination in\nchallenging situations, including models as small as 1.4 billion parameters, on\nsmall test sets of only 1000 examples, and datasets that appear only a few\ntimes in the pretraining corpus. Using our test, we audit five popular publicly\naccessible language models for test set contamination and find little evidence\nfor pervasive contamination.\n","authors":["Yonatan Oren","Nicole Meister","Niladri Chatterji","Faisal Ladhak","Tatsunori B. 
Hashimoto"],"pdf_url":"https://arxiv.org/pdf/2310.17623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17611v1","updated":"2023-10-26T17:34:32Z","published":"2023-10-26T17:34:32Z","title":"Uncovering Meanings of Embeddings via Partial Orthogonality","summary":" Machine learning tools often rely on embedding text as vectors of real\nnumbers. In this paper, we study how the semantic structure of language is\nencoded in the algebraic structure of such embeddings. Specifically, we look at\na notion of ``semantic independence'' capturing the idea that, e.g.,\n``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such\nexamples are intuitive, it is difficult to formalize such a notion of semantic\nindependence. The key observation here is that any sensible formalization\nshould obey a set of so-called independence axioms, and thus any algebraic\nencoding of this structure should also obey these axioms. This leads us\nnaturally to use partial orthogonality as the relevant algebraic structure. We\ndevelop theory and methods that allow us to demonstrate that partial\northogonality does indeed capture semantic independence. Complementary to this,\nwe also introduce the concept of independence preserving embeddings where\nembeddings preserve the conditional independence structures of a distribution,\nand we prove the existence of such embeddings and approximations to them.\n","authors":["Yibo Jiang","Bryon Aragam","Victor Veitch"],"pdf_url":"https://arxiv.org/pdf/2310.17611v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17609v1","updated":"2023-10-26T17:32:55Z","published":"2023-10-26T17:32:55Z","title":"LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset","summary":" As an important component of intelligent legal systems, legal case retrieval\nplays a critical role in ensuring judicial justice and fairness. 
However, the\ndevelopment of legal case retrieval technologies in the Chinese legal system is\nrestricted by three problems in existing datasets: limited data size, narrow\ndefinitions of legal relevance, and naive candidate pooling strategies used in\ndata sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale\nLegal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192\ncandidates extracted from 4.3 million criminal case documents. To the best of\nour knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval\ndatasets, providing extensive coverage of criminal charges. Additionally, we\nenrich the existing relevance criteria by considering three key aspects:\ncharacterization, penalty, and procedure. These comprehensive criteria enrich the\ndataset and may provide a more holistic perspective. Furthermore, we propose a\ntwo-level candidate set pooling strategy that effectively identifies potential\ncandidates for each query case. It's important to note that all cases in the\ndataset have been annotated by multiple legal experts specializing in criminal\nlaw. Their expertise ensures the accuracy and reliability of the annotations.\nWe evaluate several state-of-the-art retrieval models on LeCaRDv2,\ndemonstrating that there is still significant room for improvement in legal\ncase retrieval. 
The details of LeCaRDv2 can be found at the anonymous website\nhttps://github.com/anonymous1113243/LeCaRDv2.\n","authors":["Haitao Li","Yunqiu Shao","Yueyue Wu","Qingyao Ai","Yixiao Ma","Yiqun Liu"],"pdf_url":"https://arxiv.org/pdf/2310.17609v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17606v1","updated":"2023-10-26T17:30:13Z","published":"2023-10-26T17:30:13Z","title":"Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in\n Ghana","summary":" This paper reports on a set of three recent experiments utilizing large-scale\nspeech models to evaluate the oral reading fluency (ORF) of students in Ghana.\nWhile ORF is a well-established measure of foundational literacy, assessing it\ntypically requires one-on-one sessions between a student and a trained\nevaluator, a process that is time-consuming and costly. Automating the\nevaluation of ORF could support better literacy instruction, particularly in\neducation contexts where formative assessment is uncommon due to large class\nsizes and limited resources. To our knowledge, this research is among the first\nto examine the use of the most recent versions of large-scale speech models\n(Whisper V2 wav2vec2.0) for ORF assessment in the Global South.\n We find that Whisper V2 produces transcriptions of Ghanaian students reading\naloud with a Word Error Rate of 13.5. This is close to the model's average WER\non adult speech (12.8) and would have been considered state-of-the-art for\nchildren's speech transcription only a few years ago. We also find that when\nthese transcriptions are used to produce fully automated ORF scores, they\nclosely align with scores generated by expert human graders, with a correlation\ncoefficient of 0.96. Importantly, these results were achieved on a\nrepresentative dataset (i.e., students with regional accents, recordings taken\nin actual classrooms), using a free and publicly available speech model out of\nthe box (i.e., no fine-tuning). 
This suggests that using large-scale speech\nmodels to assess ORF may be feasible to implement and scale in lower-resource,\nlinguistically diverse educational contexts.\n","authors":["Owen Henkel","Hannah Horne-Robinson","Libby Hills","Bill Roberts","Joshua McGrane"],"pdf_url":"https://arxiv.org/pdf/2310.17606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17591v1","updated":"2023-10-26T17:13:07Z","published":"2023-10-26T17:13:07Z","title":"Lil-Bevo: Explorations of Strategies for Training Language Models in\n More Humanlike Ways","summary":" We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained\nour masked language models with three ingredients: an initial pretraining with\nmusic data, training on shorter sequences before training on longer ones, and\nmasking specific tokens to target some of the BLiMP subtasks. Overall, our\nbaseline models performed above chance, but far below the performance levels of\nlarger LLMs trained on more data. We found that training on short sequences\nperformed better than training on longer sequences. Pretraining on music may\nhelp performance marginally, but, if so, the effect seems small. Our targeted\nMasked Language Modeling augmentation did not seem to improve model performance\nin general, but did seem to help on some of the specific BLiMP tasks that we\nwere targeting (e.g., Negative Polarity Items). Training performant LLMs on\nsmall amounts of data is a difficult but potentially informative task. While\nsome of our techniques showed some promise, more work is needed to explore\nwhether they can improve performance more than the modest gains here. 
Our code\nis available at https://github.com/venkatasg/Lil-Bevo and our models at\nhttps://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a\n","authors":["Venkata S Govindarajan","Juan Diego Rodriguez","Kaj Bostrom","Kyle Mahowald"],"pdf_url":"https://arxiv.org/pdf/2310.17591v1.pdf","comment":"Proceedings of the BabyLM Challenge"},{"id":"http://arxiv.org/abs/2310.17589v1","updated":"2023-10-26T17:11:42Z","published":"2023-10-26T17:11:42Z","title":"An Open Source Data Contamination Report for Llama Series Models","summary":" Data contamination in language model evaluation is increasingly prevalent as\nthe popularity of large language models grows. It allows models to \"cheat\" via\nmemorisation instead of displaying true capabilities. Therefore, contamination\nanalysis has become a crucial part of reliable model evaluation to validate\nresults. However, existing contamination analysis is usually conducted\ninternally by LLM developers and often lacks transparency and completeness.\nThis paper presents an open source data contamination report for the Llama\nseries models. We analyse six popular multi-choice QA benchmarks and quantify\ntheir overlap with the training set of Llama. Various levels of\ncontamination ranging from 1\\% to 8.7\\% are found across benchmarks. Our\ncomparison also reveals that Llama models can gain over 5\\% higher accuracy on\ncontaminated subsets versus clean subsets. 
Data and code are available at:\nhttps://github.com/liyucheng09/Contamination_Detector.\n","authors":["Yucheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.17589v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17588v1","updated":"2023-10-26T17:09:13Z","published":"2023-10-26T17:09:13Z","title":"PAC-tuning:Fine-tuning Pretrained Language Models with PAC-driven\n Perturbed Gradient Descent","summary":" Fine-tuning pretrained language models (PLMs) for downstream tasks is a\nlarge-scale optimization problem, in which the choice of the training algorithm\ncritically determines how well the trained model can generalize to unseen test\ndata, especially in the context of few-shot learning. To achieve good\ngeneralization performance and avoid overfitting, techniques such as data\naugmentation and pruning are often applied. However, adding these\nregularizations necessitates heavy tuning of the hyperparameters of\noptimization algorithms, such as the popular Adam optimizer. In this paper, we\npropose a two-stage fine-tuning method, PAC-tuning, to address this\noptimization challenge. First, based on PAC-Bayes training, PAC-tuning directly\nminimizes the PAC-Bayes generalization bound to learn proper parameter\ndistribution. Second, PAC-tuning modifies the gradient by injecting noise with\nthe variance learned in the first stage into the model parameters during\ntraining, resulting in a variant of perturbed gradient descent (PGD). In the\npast, the few-shot scenario posed difficulties for PAC-Bayes training because\nthe PAC-Bayes bound, when applied to large models with limited training data,\nmight not be stringent. 
Our experimental results across 5 GLUE benchmark tasks\ndemonstrate that PAC-tuning successfully handles the challenges of fine-tuning\ntasks and outperforms strong baseline methods by a visible margin, further\nconfirming the potential to apply PAC training for any other settings where the\nAdam optimizer is currently used for training.\n","authors":["Guangliang Liu","Zhiyu Xue","Xitong Zhang","Kristen Marie Johnson","Rongrong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17588v1.pdf","comment":"Accepted to EMNLP23 main"},{"id":"http://arxiv.org/abs/2310.17586v1","updated":"2023-10-26T17:07:50Z","published":"2023-10-26T17:07:50Z","title":"Global Voices, Local Biases: Socio-Cultural Prejudices across Languages","summary":" Human biases are ubiquitous but not uniform: disparities exist across\nlinguistic, cultural, and societal borders. As large amounts of recent\nliterature suggest, language models (LMs) trained on human data can reflect and\noften amplify the effects of these social biases. However, the vast majority of\nexisting studies on bias are heavily skewed towards Western and European\nlanguages. In this work, we scale the Word Embedding Association Test (WEAT) to\n24 languages, enabling broader studies and yielding interesting findings about\nLM bias. We additionally enhance this data with culturally relevant information\nfor each language, capturing local contexts on a global scale. Further, to\nencompass more widely prevalent societal biases, we examine new bias dimensions\nacross toxicity, ableism, and more. Moreover, we delve deeper into the Indian\nlinguistic landscape, conducting a comprehensive regional bias analysis across\nsix prevalent Indian languages. Finally, we highlight the significance of these\nsocial biases and the new dimensions through an extensive comparison of\nembedding methods, reinforcing the need to address them in pursuit of more\nequitable language models. 
All code, data and results are available here:\nhttps://github.com/iamshnoo/weathub.\n","authors":["Anjishnu Mukherjee","Chahat Raj","Ziwei Zhu","Antonios Anastasopoulos"],"pdf_url":"https://arxiv.org/pdf/2310.17586v1.pdf","comment":"accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.13632v2","updated":"2023-10-26T17:02:56Z","published":"2023-05-23T02:59:25Z","title":"Detecting and Mitigating Hallucinations in Multilingual Summarisation","summary":" Hallucinations pose a significant challenge to the reliability of neural\nmodels for abstractive summarisation. While automatically generated summaries\nmay be fluent, they often lack faithfulness to the original document. This\nissue becomes even more pronounced in low-resource settings, such as\ncross-lingual transfer. With the existing faithful metrics focusing on English,\neven measuring the extent of this phenomenon in cross-lingual settings is hard.\nTo address this, we first develop a novel metric, mFACT, evaluating the\nfaithfulness of non-English summaries, leveraging translation-based transfer\nfrom multiple English faithfulness metrics. We then propose a simple but\neffective method to reduce hallucinations with a cross-lingual transfer, which\nweighs the loss of each training example by its faithfulness score. Through\nextensive experiments in multiple languages, we demonstrate that mFACT is the\nmetric that is most suited to detect hallucinations. Moreover, we find that our\nproposed loss weighting method drastically increases both performance and\nfaithfulness according to both automatic and human evaluation when compared to\nstrong baselines for cross-lingual transfer such as MAD-X. Our code and dataset\nare available at https://github.com/yfqiu-nlp/mfact-summ.\n","authors":["Yifu Qiu","Yftah Ziser","Anna Korhonen","Edoardo M. Ponti","Shay B. 
Cohen"],"pdf_url":"https://arxiv.org/pdf/2305.13632v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17576v1","updated":"2023-10-26T17:01:22Z","published":"2023-10-26T17:01:22Z","title":"1D-Touch: NLP-Assisted Coarse Text Selection via a Semi-Direct Gesture","summary":" Existing text selection techniques on touchscreen focus on improving the\ncontrol for moving the carets. Coarse-grained text selection on word and phrase\nlevels has not received much support beyond word-snapping and entity\nrecognition. We introduce 1D-Touch, a novel text selection method that\ncomplements the carets-based sub-word selection by facilitating the selection\nof semantic units of words and above. This method employs a simple vertical\nslide gesture to expand and contract a selection area from a word. The\nexpansion can be by words or by semantic chunks ranging from sub-phrases to\nsentences. This technique shifts the concept of text selection, from defining a\nrange by locating the first and last words, towards a dynamic process of\nexpanding and contracting a textual semantic entity. To understand the effects\nof our approach, we prototyped and tested two variants: WordTouch, which offers\na straightforward word-by-word expansion, and ChunkTouch, which leverages NLP\nto chunk text into syntactic units, allowing the selection to grow by\nsemantically meaningful units in response to the sliding gesture. 
Our\nevaluation, focused on the coarse-grained selection tasks handled by 1D-Touch,\nshows a 20% improvement over the default word-snapping selection method on\nAndroid.\n","authors":["Peiling Jiang","Li Feng","Fuling Sun","Parakrant Sarkar","Haijun Xia","Can Liu"],"pdf_url":"https://arxiv.org/pdf/2310.17576v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17570v1","updated":"2023-10-26T16:58:14Z","published":"2023-10-26T16:58:14Z","title":"DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct\n Speech-to-Speech Translation","summary":" While Diffusion Generative Models have achieved great success on image\ngeneration tasks, how to efficiently and effectively incorporate them into\nspeech generation especially translation tasks remains a non-trivial problem.\nSpecifically, due to the low information density of speech data, the\ntransformed discrete speech unit sequence is much longer than the corresponding\ntext transcription, posing significant challenges to existing auto-regressive\nmodels. Furthermore, it is not optimal to brutally apply discrete diffusion on\nthe speech unit sequence while disregarding the continuous space structure,\nwhich will degrade the generation performance significantly. In this paper, we\npropose a novel diffusion model by applying the diffusion forward process in\nthe \\textit{continuous} speech representation space, while employing the\ndiffusion backward process in the \\textit{discrete} speech unit space. In this\nway, we preserve the semantic structure of the continuous speech representation\nspace in the diffusion process and integrate the continuous and discrete\ndiffusion models. 
We conduct extensive experiments on the textless direct\nspeech-to-speech translation task, where the proposed method achieves\ncomparable results to the computationally intensive auto-regressive baselines\n(500 steps on average) with significantly fewer decoding steps (50 steps).\n","authors":["Yongxin Zhu","Zhujin Gao","Xinyuan Zhou","Zhongyi Ye","Linli Xu"],"pdf_url":"https://arxiv.org/pdf/2310.17570v1.pdf","comment":"Accepted in EMNLP2023 main conference"},{"id":"http://arxiv.org/abs/2310.17568v1","updated":"2023-10-26T16:56:01Z","published":"2023-10-26T16:56:01Z","title":"Navigating to Success in Multi-Modal Human-Robot Collaboration: Analysis\n and Corpus Release","summary":" Human-guided robotic exploration is a useful approach to gathering\ninformation at remote locations, especially those that might be too risky,\ninhospitable, or inaccessible for humans. Maintaining common ground between the\nremotely-located partners is a challenge, one that can be facilitated by\nmulti-modal communication. In this paper, we explore how participants utilized\nmultiple modalities to investigate a remote location with the help of a robotic\npartner. Participants issued spoken natural language instructions and received\nfrom the robot: text-based feedback, continuous 2D LIDAR mapping, and\nupon-request static photographs. We noticed that different strategies were\nadopted in terms of use of the modalities, and hypothesize that these\ndifferences may be correlated with success at several exploration sub-tasks. We\nfound that requesting photos may have improved the identification and counting\nof some key entities (doorways in particular) and that this strategy did not\nhinder the amount of overall area exploration. Future work with larger samples\nmay reveal the effects of more nuanced photo and dialogue strategies, which can\ninform the training of robotic agents. 
Additionally, we announce the release of\nour unique multi-modal corpus of human-robot communication in an exploration\ncontext: SCOUT, the Situated Corpus on Understanding Transactions.\n","authors":["Stephanie M. Lukin","Kimberly A. Pollard","Claire Bonial","Taylor Hudson","Ron Arstein","Clare Voss","David Traum"],"pdf_url":"https://arxiv.org/pdf/2310.17568v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2310.17567v1","updated":"2023-10-26T16:55:05Z","published":"2023-10-26T16:55:05Z","title":"Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models","summary":" With LLMs shifting their role from statistical modeling of language to\nserving as general-purpose AI agents, how should LLM evaluations change?\nArguably, a key ability of an AI agent is to flexibly combine, as needed, the\nbasic skills it has learned. The capability to combine skills plays an\nimportant role in (human) pedagogy and also in a paper on emergence phenomena\n(Arora & Goyal, 2023).\n This work introduces Skill-Mix, a new evaluation to measure ability to\ncombine skills. Using a list of $N$ skills the evaluator repeatedly picks\nrandom subsets of $k$ skills and asks the LLM to produce text combining that\nsubset of skills. Since the number of subsets grows like $N^k$, for even modest\n$k$ this evaluation will, with high probability, require the LLM to produce\ntext significantly different from any text in the training set. The paper\ndevelops a methodology for (a) designing and administering such an evaluation,\nand (b) automatic grading (plus spot-checking by humans) of the results using\nGPT-4 as well as the open LLaMA-2 70B model.\n Administering a version of to popular chatbots gave results that, while\ngenerally in line with prior expectations, contained surprises. 
Sizeable\ndifferences exist among model capabilities that are not captured by their\nranking on popular LLM leaderboards (\"cramming for the leaderboard\").\nFurthermore, simple probability calculations indicate that GPT-4's reasonable\nperformance on $k=5$ is suggestive of going beyond \"stochastic parrot\" behavior\n(Bender et al., 2021), i.e., it combines skills in ways that it had not seen\nduring training.\n We sketch how the methodology can lead to a Skill-Mix based eco-system of\nopen evaluations for AI capabilities of future models.\n","authors":["Dingli Yu","Simran Kaur","Arushi Gupta","Jonah Brown-Cohen","Anirudh Goyal","Sanjeev Arora"],"pdf_url":"https://arxiv.org/pdf/2310.17567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17558v1","updated":"2023-10-26T16:47:52Z","published":"2023-10-26T16:47:52Z","title":"Towards Matching Phones and Speech Representations","summary":" Learning phone types from phone instances has been a long-standing problem,\nwhile still being open. In this work, we revisit this problem in the context of\nself-supervised learning, and pose it as the problem of matching cluster\ncentroids to phone embeddings. We study two key properties that enable\nmatching, namely, whether cluster centroids of self-supervised representations\nreduce the variability of phone instances and respect the relationship among\nphones. We then use the matching result to produce pseudo-labels and introduce\na new loss function for improving self-supervised representations. Our\nexperiments show that the matching result captures the relationship among\nphones. 
Training the new loss function jointly with the regular self-supervised\nlosses, such as APC and CPC, significantly improves the downstream phone\nclassification.\n","authors":["Gene-Ping Yang","Hao Tang"],"pdf_url":"https://arxiv.org/pdf/2310.17558v1.pdf","comment":"Accepted to ASRU 2023"},{"id":"http://arxiv.org/abs/2310.17551v1","updated":"2023-10-26T16:45:40Z","published":"2023-10-26T16:45:40Z","title":"Unpacking the Ethical Value Alignment in Big Models","summary":" Big models have greatly advanced AI's ability to understand, generate, and\nmanipulate information and content, enabling numerous applications. However, as\nthese models become increasingly integrated into everyday life, their inherent\nethical values and potential biases pose unforeseen risks to society. This\npaper provides an overview of the risks and challenges associated with big\nmodels, surveys existing AI ethics guidelines, and examines the ethical\nimplications arising from the limitations of these models. Taking a normative\nethics perspective, we propose a reassessment of recent normative guidelines,\nhighlighting the importance of collaborative efforts in academia to establish a\nunified and universal AI ethics framework. Furthermore, we investigate the\nmoral inclinations of current mainstream LLMs using the Moral Foundation\ntheory, analyze existing alignment algorithms, and outline the unique\nchallenges encountered in aligning ethical values within them. 
To address these\nchallenges, we introduce a novel conceptual paradigm for aligning the ethical\nvalues of big models and discuss promising research directions for alignment\ncriteria, evaluation, and method, representing an initial step towards the\ninterdisciplinary construction of the ethically aligned AI.\n This paper is a modified English version of our Chinese paper\nhttps://crad.ict.ac.cn/cn/article/doi/10.7544/issn1000-1239.202330553, intended\nto help non-Chinese native speakers better understand our work.\n","authors":["Xiaoyuan Yi","Jing Yao","Xiting Wang","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2310.17551v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15064v3","updated":"2023-10-26T16:44:39Z","published":"2023-05-24T11:52:23Z","title":"AutoPlan: Automatic Planning of Interactive Decision-Making Tasks With\n Large Language Models","summary":" Recent large language models (LLMs) are promising for making decisions in\ngrounded environments. However, LLMs frequently fail in complex decision-making\ntasks due to the misalignment between the pre-trained knowledge in LLMs and the\nactual rules in the environment. Existing methods require either costly\ngradient computation or lengthy in-context demonstrations. In this paper, we\npropose AutoPlan, an approach to guide LLM-based agents to accomplish\ninteractive decision-making tasks. AutoPlan augments the LLM prompt with a\ntask-solving plan and optimizes it through iterative experience collection and\nreflection. Our experiments show that AutoPlan, though using no in-context\ndemonstrations, achieves success rates on par with the baselines using\nhuman-written demonstrations on ALFWorld and even outperforms them by 8% on\nHotpotQA. 
The code is available at https://github.com/owaski/AutoPlan.\n","authors":["Siqi Ouyang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2305.15064v3.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2306.08937v3","updated":"2023-10-26T16:23:15Z","published":"2023-06-15T08:21:15Z","title":"DocumentNet: Bridging the Data Gap in Document Pre-Training","summary":" Document understanding tasks, in particular, Visually-rich Document Entity\nRetrieval (VDER), have gained significant attention in recent years thanks to\ntheir broad applications in enterprise AI. However, publicly available data\nhave been scarce for these tasks due to strict privacy constraints and high\nannotation costs. To make things worse, the non-overlapping entity spaces from\ndifferent datasets hinder the knowledge transfer between document types. In\nthis paper, we propose a method to collect massive-scale and weakly labeled\ndata from the web to benefit the training of VDER models. The collected\ndataset, named DocumentNet, does not depend on specific document types or\nentity sets, making it universally applicable to all VDER tasks. The current\nDocumentNet consists of 30M documents spanning nearly 400 document types\norganized in a four-level ontology. Experiments on a set of broadly adopted\nVDER tasks show significant improvements when DocumentNet is incorporated into\nthe pre-training for both classic and few-shot learning settings. With the\nrecent emergence of large language models (LLMs), DocumentNet provides a large\ndata source to extend their multi-modal capabilities for VDER.\n","authors":["Lijun Yu","Jin Miao","Xiaoyu Sun","Jiayi Chen","Alexander G. 
Hauptmann","Hanjun Dai","Wei Wei"],"pdf_url":"https://arxiv.org/pdf/2306.08937v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17530v1","updated":"2023-10-26T16:19:19Z","published":"2023-10-26T16:19:19Z","title":"Evaluating Bias and Fairness in Gender-Neutral Pretrained\n Vision-and-Language Models","summary":" Pretrained machine learning models are known to perpetuate and even amplify\nexisting biases in data, which can result in unfair outcomes that ultimately\nimpact user experience. Therefore, it is crucial to understand the mechanisms\nbehind those prejudicial biases to ensure that model performance does not\nresult in discriminatory behaviour toward certain groups or populations. In\nthis work, we define gender bias as our case study. We quantify bias\namplification in pretraining and after fine-tuning on three families of\nvision-and-language models. We investigate the connection, if any, between the\ntwo learning stages, and evaluate how bias amplification reflects on model\nperformance. Overall, we find that bias amplification in pretraining and after\nfine-tuning are independent. We then examine the effect of continued\npretraining on gender-neutral data, finding that this reduces group\ndisparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without\nsignificantly compromising task performance.\n","authors":["Laura Cabello","Emanuele Bugliarello","Stephanie Brandl","Desmond Elliott"],"pdf_url":"https://arxiv.org/pdf/2310.17530v1.pdf","comment":"To appear in EMNLP 2024"},{"id":"http://arxiv.org/abs/2310.17526v1","updated":"2023-10-26T16:18:30Z","published":"2023-10-26T16:18:30Z","title":"Can large language models replace humans in the systematic review\n process? Evaluating GPT-4's efficacy in screening and extracting data from\n peer-reviewed and grey literature in multiple languages","summary":" Systematic reviews are vital for guiding practice, research, and policy, yet\nthey are often slow and labour-intensive. 
Large language models (LLMs) could\noffer a way to speed up and automate systematic reviews, but their performance\nin such tasks has not been comprehensively evaluated against humans, and no\nstudy has tested GPT-4, the biggest LLM so far. This pre-registered study\nevaluates GPT-4's capability in title/abstract screening, full-text review, and\ndata extraction across various literature types and languages using a\n'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human\nperformance in most tasks, results were skewed by chance agreement and dataset\nimbalance. After adjusting for these, there was a moderate level of performance\nfor data extraction, and - barring studies that used highly reliable prompts -\nscreening performance levelled at none to moderate for different stages and\nlanguages. When screening full-text literature using highly reliable prompts,\nGPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key\nstudies using highly reliable prompts improved its performance even more. Our\nfindings indicate that, currently, substantial caution should be used if LLMs\nare being used to conduct systematic reviews, but suggest that, for certain\nsystematic review tasks delivered under reliable prompts, LLMs can rival human\nperformance.\n","authors":["Qusai Khraisha","Sophie Put","Johanna Kappenberg","Azza Warraitch","Kristin Hadfield"],"pdf_url":"https://arxiv.org/pdf/2310.17526v1.pdf","comment":"9 pages, 2 figures, 1 table"},{"id":"http://arxiv.org/abs/2310.17514v1","updated":"2023-10-26T16:11:04Z","published":"2023-10-26T16:11:04Z","title":"The Validity of Evaluation Results: Assessing Concurrence Across\n Compositionality Benchmarks","summary":" NLP models have progressed drastically in recent years, according to numerous\ndatasets proposed to evaluate performance. Questions remain, however, about how\nparticular dataset design choices may impact the conclusions we draw about\nmodel capabilities. 
In this work, we investigate this question in the domain of\ncompositional generalization. We examine the performance of six modeling\napproaches across 4 datasets, split according to 8 compositional splitting\nstrategies, ranking models by 18 compositional generalization splits in total.\nOur results show that: i) the datasets, although all designed to evaluate\ncompositional generalization, rank modeling approaches differently; ii)\ndatasets generated by humans align better with each other than they do with\nsynthetic datasets, or than synthetic datasets among themselves; iii)\ngenerally, whether datasets are sampled from the same source is more predictive\nof the resulting model ranking than whether they maintain the same\ninterpretation of compositionality; and iv) which lexical items are used in the\ndata can strongly impact conclusions. Overall, our results demonstrate that\nmuch work remains to be done when it comes to assessing whether popular\nevaluation datasets measure what they intend to measure, and suggest that\nelucidating more rigorous standards for establishing the validity of evaluation\nsets could benefit the field.\n","authors":["Kaiser Sun","Adina Williams","Dieuwke Hupkes"],"pdf_url":"https://arxiv.org/pdf/2310.17514v1.pdf","comment":"CoNLL2023"},{"id":"http://arxiv.org/abs/2310.17513v1","updated":"2023-10-26T16:08:33Z","published":"2023-10-26T16:08:33Z","title":"The Expressive Power of Low-Rank Adaptation","summary":" Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that\nleverages low-rank adaptation of weight matrices, has emerged as a prevalent\ntechnique for fine-tuning pre-trained models such as large language models and\ndiffusion models. Despite its huge success in practice, the theoretical\nunderpinnings of LoRA have largely remained unexplored. This paper takes the\nfirst step to bridge this gap by theoretically analyzing the expressive power\nof LoRA. 
We prove that, for fully connected neural networks, LoRA can adapt any\nmodel $f$ to accurately represent any smaller target model $\\overline{f}$ if\nLoRA-rank $\\geq(\\text{width of }f) \\times \\frac{\\text{depth of\n}\\overline{f}}{\\text{depth of }f}$. We also quantify the approximation error\nwhen LoRA-rank is lower than the threshold. For Transformer networks, we show\nany model can be adapted to a target model of the same size with\nrank-$(\\frac{\\text{embedding size}}{2})$ LoRA adapters.\n","authors":["Yuchen Zeng","Kangwook Lee"],"pdf_url":"https://arxiv.org/pdf/2310.17513v1.pdf","comment":"40 pages,5 figures"},{"id":"http://arxiv.org/abs/2310.17512v1","updated":"2023-10-26T16:06:20Z","published":"2023-10-26T16:06:20Z","title":"CompeteAI: Understanding the Competition Behaviors in Large Language\n Model-based Agents","summary":" Large language models (LLMs) have been widely used as agents to complete\ndifferent tasks, such as personal assistance or event planning. While most work\nhas focused on cooperation and collaboration between agents, little work\nexplores competition, another important mechanism that fosters the development\nof society and economy. In this paper, we seek to examine the competition\nbehaviors in LLM-based agents. We first propose a general framework to study\nthe competition between agents. Then, we implement a practical competitive\nenvironment using GPT-4 to simulate a virtual town with two types of agents,\nincluding restaurant agents and customer agents. Specifically, restaurant\nagents compete with each other to attract more customers, where the competition\nfosters them to transform, such as cultivating new operating strategies. The\nresults of our experiments reveal several interesting findings ranging from\nsocial learning to Matthew Effect, which aligns well with existing sociological\nand economic theories. We believe that competition between agents deserves\nfurther investigation to help us understand society better. 
The code will be\nreleased soon.\n","authors":["Qinlin Zhao","Jindong Wang","Yixuan Zhang","Yiqiao Jin","Kaijie Zhu","Hao Chen","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2310.17512v1.pdf","comment":"Technical report; 21 pages"},{"id":"http://arxiv.org/abs/2310.17499v1","updated":"2023-10-26T15:53:29Z","published":"2023-10-26T15:53:29Z","title":"The IMS Toucan System for the Blizzard Challenge 2023","summary":" For our contribution to the Blizzard Challenge 2023, we improved on the\nsystem we submitted to the Blizzard Challenge 2021. Our approach entails a\nrule-based text-to-phoneme processing system that includes rule-based\ndisambiguation of homographs in the French language. It then transforms the\nphonemes to spectrograms as intermediate representations using a fast and\nefficient non-autoregressive synthesis architecture based on Conformer and\nGlow. A GAN-based neural vocoder that combines recent state-of-the-art\napproaches converts the spectrogram to the final wave. We carefully designed\nthe data processing, training, and inference procedures for the challenge data.\nOur system identifier is G. Open source code and demo are available.\n","authors":["Florian Lux","Julia Koch","Sarina Meyer","Thomas Bott","Nadja Schauffler","Pavel Denisov","Antje Schweitzer","Ngoc Thang Vu"],"pdf_url":"https://arxiv.org/pdf/2310.17499v1.pdf","comment":"Published at the Blizzard Challenge Workshop 2023, colocated with the\n Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023"},{"id":"http://arxiv.org/abs/2305.04798v2","updated":"2023-10-26T15:52:30Z","published":"2023-05-04T13:13:44Z","title":"Multi-grained Hypergraph Interest Modeling for Conversational\n Recommendation","summary":" Conversational recommender system (CRS) interacts with users through\nmulti-turn dialogues in natural language, which aims to provide high-quality\nrecommendations for user's instant information need. 
Although great efforts\nhave been made to develop effective CRS, most of them still focus on the\ncontextual information from the current dialogue, usually suffering from the\ndata scarcity issue. Therefore, we consider leveraging historical dialogue data\nto enrich the limited contexts of the current dialogue session.\n In this paper, we propose a novel multi-grained hypergraph interest modeling\napproach to capture user interest beneath intricate historical data from\ndifferent perspectives. As the core idea, we employ hypergraph to represent\ncomplicated semantic relations underlying historical dialogues. In our\napproach, we first employ the hypergraph structure to model users' historical\ndialogue sessions and form a session-based hypergraph, which captures\ncoarse-grained, session-level relations. Second, to alleviate the issue of data\nscarcity, we use an external knowledge graph and construct a knowledge-based\nhypergraph considering fine-grained, entity-level semantics. We further conduct\nmulti-grained hypergraph convolution on the two kinds of hypergraphs, and\nutilize the enhanced representations to develop interest-aware CRS. Extensive\nexperiments on two benchmarks ReDial and TG-ReDial validate the effectiveness\nof our approach on both recommendation and conversation tasks. Code is\navailable at: https://github.com/RUCAIBox/MHIM.\n","authors":["Chenzhan Shang","Yupeng Hou","Wayne Xin Zhao","Yaliang Li","Jing Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.04798v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17490v1","updated":"2023-10-26T15:45:12Z","published":"2023-10-26T15:45:12Z","title":"Improving Zero-shot Reader by Reducing Distractions from Irrelevant\n Documents in Open-Domain Question Answering","summary":" Large language models (LLMs) enable zero-shot approaches in open-domain\nquestion answering (ODQA), yet with limited advancements as the reader is\ncompared to the retriever. 
This study aims at the feasibility of a zero-shot\nreader that addresses the challenges of computational cost and the need for\nlabeled data. We find that LLMs are distracted due to irrelevant documents in\nthe retrieved set and the overconfidence of the generated answers when they are\nexploited as zero-shot readers. To tackle these problems, we mitigate the\nimpact of such documents via Distraction-aware Answer Selection (DAS) with a\nnegation-based instruction and score adjustment for proper answer selection.\nExperimental results show that our approach successfully handles distraction\nacross diverse scenarios, enhancing the performance of zero-shot readers.\nFurthermore, unlike supervised readers struggling with unseen data, zero-shot\nreaders demonstrate outstanding transferability without any training.\n","authors":["Sukmin Cho","Jeong yeon Seo","Soyeong Jeong","Jong C. Park"],"pdf_url":"https://arxiv.org/pdf/2310.17490v1.pdf","comment":"Findings of EMNLP 2023 Camera Ready"},{"id":"http://arxiv.org/abs/2310.17488v1","updated":"2023-10-26T15:44:57Z","published":"2023-10-26T15:44:57Z","title":"LightLM: A Lightweight Deep and Narrow Language Model for Generative\n Recommendation","summary":" This paper presents LightLM, a lightweight Transformer-based language model\nfor generative recommendation. While Transformer-based generative modeling has\ngained importance in various AI sub-fields such as NLP and vision, generative\nrecommendation is still in its infancy due to its unique demand on personalized\ngenerative modeling. Existing works on generative recommendation often use\nNLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are\nheavy-weight and are not specifically designed for recommendation tasks.\nLightLM tackles the issue by introducing a light-weight deep and narrow\nTransformer architecture, which is specifically tailored for direct generation\nof recommendation items. 
This structure is especially apt for straightforward\ngenerative recommendation and stems from the observation that language model\ndoes not have to be too wide for this task, as the input predominantly consists\nof short tokens that are well-suited for the model's capacity. We also show\nthat our devised user and item ID indexing methods, i.e., Spectral\nCollaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables\nthe deep and narrow Transformer architecture to outperform large-scale language\nmodels for recommendation. Besides, to address the hallucination problem of\ngenerating items as output, we propose the constrained generation process for\ngenerative recommenders. Experiments on real-world datasets show that LightLM\noutperforms various competitive baselines in terms of both recommendation\naccuracy and efficiency. The code can be found at\nhttps://github.com/dongyuanjushi/LightLM.\n","authors":["Kai Mei","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17488v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14956v3","updated":"2023-10-26T15:38:38Z","published":"2023-05-24T09:50:54Z","title":"Editing Common Sense in Transformers","summary":" Editing model parameters directly in Transformers makes updating open-source\ntransformer-based models possible without re-training (Meng et al., 2023).\nHowever, these editing methods have only been evaluated on statements about\nencyclopedic knowledge with a single correct answer. Commonsense knowledge with\nmultiple correct answers, e.g., an apple can be green or red but not\ntransparent, has not been studied but is as essential for enhancing\ntransformers' reliability and usefulness. In this paper, we investigate whether\ncommonsense judgments are causally associated with localized, editable\nparameters in Transformers, and we provide an affirmative answer. 
We find that\ndirectly applying the MEMIT editing algorithm results in sub-par performance\nand improve it for the commonsense domain by varying edit tokens and improving\nthe layer selection strategy, i.e., $MEMIT_{CSK}$. GPT-2 Large and XL models\nedited using $MEMIT_{CSK}$ outperform best-fine-tuned baselines by 10.97% and\n10.73% F1 scores on PEP3k and 20Q datasets. In addition, we propose a novel\nevaluation dataset, PROBE SET, that contains unaffected and affected\nneighborhoods, affected paraphrases, and affected reasoning challenges.\n$MEMIT_{CSK}$ performs well across the metrics while fine-tuning baselines show\nsignificant trade-offs between unaffected and affected metrics. These results\nsuggest a compelling future direction for incorporating feedback about common\nsense into Transformers through direct model editing.\n","authors":["Anshita Gupta","Debanjan Mondal","Akshay Krishna Sheshadri","Wenlong Zhao","Xiang Lorraine Li","Sarah Wiegreffe","Niket Tandon"],"pdf_url":"https://arxiv.org/pdf/2305.14956v3.pdf","comment":"Accepted to EMNLP 2023 Main Conference. Anshita, Debanjan, Akshay are\n co-first authors. Code and datasets for all experiments are available at\n https://github.com/anshitag/memit_csk"},{"id":"http://arxiv.org/abs/2106.02397v5","updated":"2023-10-26T15:22:23Z","published":"2021-06-04T10:23:48Z","title":"On Classifying Continuous Constraint Satisfaction Problems","summary":" A continuous constraint satisfaction problem (CCSP) is a constraint\nsatisfaction problem (CSP) with an interval domain $U \\subset \\mathbb{R}$. We\nengage in a systematic study to classify CCSPs that are complete of the\nExistential Theory of the Reals, i.e., ER-complete. To define this class, we\nfirst consider the problem ETR, which also stands for Existential Theory of the\nReals. 
In an instance of this problem we are given some sentence of the form\n$\\exists x_1, \\ldots, x_n \\in \\mathbb{R} : \\Phi(x_1, \\ldots, x_n)$, where\n$\\Phi$ is a well-formed quantifier-free formula consisting of the symbols $\\{0,\n1, +, \\cdot, \\geq, >, \\wedge, \\vee, \\neg\\}$, the goal is to check whether this\nsentence is true. Now the class ER is the family of all problems that admit a\npolynomial-time many-one reduction to ETR. It is known that NP $\\subseteq$ ER\n$\\subseteq$ PSPACE.\n We restrict our attention on CCSPs with addition constraints ($x + y = z$)\nand some other mild technical condition. Previously, it was shown that\nmultiplication constraints ($x \\cdot y = z$), squaring constraints ($x^2 = y$),\nor inversion constraints ($x\\cdot y = 1$) are sufficient to establish\nER-completeness. We extend this in the strongest possible sense for equality\nconstraints as follows. We show that CCSPs (with addition constraints and some\nother mild technical condition) that have any one well-behaved curved equality\nconstraint ($f(x,y) = 0$) are ER-complete. We further extend our results to\ninequality constraints. We show that any well-behaved convexly curved and any\nwell-behaved concavely curved inequality constraint ($f(x,y) \\geq 0$ and\n$g(x,y) \\geq 0$) imply ER-completeness on the class of such CCSPs.\n","authors":["Tillmann Miltzow","Reinier F. Schmiermann"],"pdf_url":"https://arxiv.org/pdf/2106.02397v5.pdf","comment":"39 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.00100v3","updated":"2023-10-26T15:18:01Z","published":"2023-09-29T19:20:27Z","title":"Multilingual Natural Language Processing Model for Radiology Reports --\n The Summary is all you need!","summary":" The impression section of a radiology report summarizes important radiology\nfindings and plays a critical role in communicating these findings to\nphysicians. However, the preparation of these summaries is time-consuming and\nerror-prone for radiologists. 
Recently, numerous models for radiology report\nsummarization have been developed. Nevertheless, there is currently no model\nthat can summarize these reports in multiple languages. Such a model could\ngreatly improve future research and the development of Deep Learning models\nthat incorporate data from patients with different ethnic backgrounds. In this\nstudy, the generation of radiology impressions in different languages was\nautomated by fine-tuning a model, publicly available, based on a multilingual\ntext-to-text Transformer to summarize findings available in English,\nPortuguese, and German radiology reports. In a blind test, two board-certified\nradiologists indicated that for at least 70% of the system-generated summaries,\nthe quality matched or exceeded the corresponding human-written summaries,\nsuggesting substantial clinical reliability. Furthermore, this study showed\nthat the multilingual model outperformed other models that specialized in\nsummarizing radiology reports in only one language, as well as models that were\nnot specifically designed for summarizing radiology reports, such as ChatGPT.\n","authors":["Mariana Lindo","Ana Sofia Santos","André Ferreira","Jianning Li","Gijs Luijten","Gustavo Correia","Moon Kim","Jens Kleesiek","Jan Egger","Victor Alves"],"pdf_url":"https://arxiv.org/pdf/2310.00100v3.pdf","comment":"Problems with the model"},{"id":"http://arxiv.org/abs/2310.17448v1","updated":"2023-10-26T14:57:08Z","published":"2023-10-26T14:57:08Z","title":"Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech\n Systems for the MADASR 2023 Challenge","summary":" This paper describes Tallinn University of Technology (TalTech) systems\ndeveloped for the ASRU MADASR 2023 Challenge. The challenge focuses on\nautomatic speech recognition of dialect-rich Indian languages with limited\ntraining audio and text data. 
TalTech participated in two tracks of the\nchallenge: Track 1 that allowed using only the provided training data and Track\n3 which allowed using additional audio data. In both tracks, we relied on\nwav2vec2.0 models. Our methodology diverges from the traditional procedure of\nfinetuning pretrained wav2vec2.0 models in two key points: firstly, through the\nimplementation of the aligned data augmentation technique to enhance the\nlinguistic diversity of the training data, and secondly, via the application of\ndeep prefix tuning for dialect adaptation of wav2vec2.0 models. In both tracks,\nour approach yielded significant improvements over the provided baselines,\nachieving the lowest word error rates across all participating teams.\n","authors":["Tanel Alumäe","Jiaming Kong","Daniil Robnikov"],"pdf_url":"https://arxiv.org/pdf/2310.17448v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17428v1","updated":"2023-10-26T14:34:06Z","published":"2023-10-26T14:34:06Z","title":"''Fifty Shades of Bias'': Normative Ratings of Gender Bias in GPT\n Generated English Text","summary":" Language serves as a powerful tool for the manifestation of societal belief\nsystems. In doing so, it also perpetuates the prevalent biases in our society.\nGender bias is one of the most pervasive biases in our society and is seen in\nonline and offline discourses. With LLMs increasingly gaining human-like\nfluency in text generation, gaining a nuanced understanding of the biases these\nsystems can generate is imperative. Prior work often treats gender bias as a\nbinary classification task. However, acknowledging that bias must be perceived\nat a relative scale; we investigate the generation and consequent receptivity\nof manual annotators to bias of varying degrees. Specifically, we create the\nfirst dataset of GPT-generated English text with normative ratings of gender\nbias. Ratings were obtained using Best--Worst Scaling -- an efficient\ncomparative annotation framework. 
Next, we systematically analyze the variation\nof themes of gender biases in the observed ranking and show that\nidentity-attack is most closely related to gender bias. Finally, we show the\nperformance of existing automated models trained on related concepts on our\ndataset.\n","authors":["Rishav Hada","Agrima Seth","Harshita Diddee","Kalika Bali"],"pdf_url":"https://arxiv.org/pdf/2310.17428v1.pdf","comment":"Camera-ready version in EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17415v1","updated":"2023-10-26T14:20:44Z","published":"2023-10-26T14:20:44Z","title":"PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word\n Tokenization on Downstream Applications","summary":" Large protein language models are adept at capturing the underlying\nevolutionary information in primary structures, offering significant practical\nvalue for protein engineering. Compared to natural language models, protein\namino acid sequences have a smaller data volume and a limited combinatorial\nspace. Choosing an appropriate vocabulary size to optimize the pre-trained\nmodel is a pivotal issue. Moreover, despite the wealth of benchmarks and\nstudies in the natural language community, there remains a lack of a\ncomprehensive benchmark for systematically evaluating protein language model\nquality. Given these challenges, PETA trained language models with 14 different\nvocabulary sizes under three tokenization methods. It conducted thousands of\ntests on 33 diverse downstream datasets to assess the models' transfer learning\ncapabilities, incorporating two classification heads and three random seeds to\nmitigate potential biases. Extensive experiments indicate that vocabulary sizes\nbetween 50 and 200 optimize the model, whereas sizes exceeding 800\ndetrimentally affect the model's representational performance. 
Our code, model\nweights and datasets are available at\nhttps://github.com/ginnm/ProteinPretraining.\n","authors":["Yang Tan","Mingchen Li","Pan Tan","Ziyi Zhou","Huiqun Yu","Guisheng Fan","Liang Hong"],"pdf_url":"https://arxiv.org/pdf/2310.17415v1.pdf","comment":"46 pages, 4 figures, 9 tables"},{"id":"http://arxiv.org/abs/2310.17413v1","updated":"2023-10-26T14:19:48Z","published":"2023-10-26T14:19:48Z","title":"Harnessing GPT-3.5-turbo for Rhetorical Role Prediction in Legal Cases","summary":" We propose a comprehensive study of one-stage elicitation techniques for\nquerying a large pre-trained generative transformer (GPT-3.5-turbo) in the\nrhetorical role prediction task of legal cases. This task is known as requiring\ntextual context to be addressed. Our study explores strategies such as zero-few\nshots, task specification with definitions and clarification of annotation\nambiguities, textual context and reasoning with general prompts and specific\nquestions. We show that the number of examples, the definition of labels, the\npresentation of the (labelled) textual context and specific questions about\nthis context have a positive influence on the performance of the model. Given\nnon-equivalent test set configurations, we observed that prompting with a few\nlabelled examples from direct context can lead the model to a better\nperformance than a supervised fine-tuned multi-class classifier based on the\nBERT encoder (weighted F1 score of = 72%). 
But there is still a gap to reach\nthe performance of the best systems (= 86%) in the LegalEval 2023 task which, on\nthe other hand, require dedicated resources, architectures and training.\n","authors":["Anas Belfathi","Nicolas Hernandez","Laura Monceaux"],"pdf_url":"https://arxiv.org/pdf/2310.17413v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17408v1","updated":"2023-10-26T14:09:57Z","published":"2023-10-26T14:09:57Z","title":"Tackling the Matrix Multiplication Micro-kernel Generation with Exo","summary":" The optimization of the matrix multiplication (or GEMM) has been a need\nduring the last decades. This operation is considered the flagship of current\nlinear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its\nwidespread use in a large variety of scientific applications. The GEMM is\nusually implemented following the GotoBLAS philosophy, which tiles the GEMM\noperands and uses a series of nested loops for performance improvement. These\napproaches extract the maximum computational power of the architectures through\nsmall pieces of hardware-oriented, high-performance code called micro-kernel.\nHowever, this approach forces developers to generate, with a non-negligible\neffort, a dedicated micro-kernel for each new hardware.\n In this work, we present a step-by-step procedure for generating\nmicro-kernels with the Exo compiler that performs close to (or even better\nthan) manually developed microkernels written with intrinsic functions or\nassembly language. Our solution also improves the portability of the generated\ncode, since a hardware target is fully specified by a concise library-based\ndescription of its instructions.\n","authors":["Adrián Castelló","Julian Bellavita","Grace Dinh","Yuka Ikarashi","Héctor Martínez"],"pdf_url":"https://arxiv.org/pdf/2310.17408v1.pdf","comment":"11 pages, 18 figures. Presented at CGO 2024. 
It includes a software\n artifact step-by-step execution"},{"id":"http://arxiv.org/abs/2310.17407v1","updated":"2023-10-26T14:06:14Z","published":"2023-10-26T14:06:14Z","title":"Meaning and understanding in large language models","summary":" Can a machine understand the meanings of natural language? Recent\ndevelopments in the generative large language models (LLMs) of artificial\nintelligence have led to the belief that traditional philosophical assumptions\nabout machine understanding of language need to be revised. This article\ncritically evaluates the prevailing tendency to regard machine language\nperformance as mere syntactic manipulation and the simulation of understanding,\nwhich is only partial and very shallow, without sufficient referential\ngrounding in the world. The aim is to highlight the conditions crucial to\nattributing natural language understanding to state-of-the-art LLMs, where it\ncan be legitimately argued that LLMs not only use syntax but also semantics,\ntheir understanding not being simulated but duplicated; and determine how they\nground the meanings of linguistic expressions.\n","authors":["Vladimír Havlík"],"pdf_url":"https://arxiv.org/pdf/2310.17407v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2310.17389v1","updated":"2023-10-26T13:35:41Z","published":"2023-10-26T13:35:41Z","title":"ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in\n Real-World User-AI Conversation","summary":" Despite remarkable advances that large language models have achieved in\nchatbots, maintaining a non-toxic user-AI interactive environment has become\nincreasingly critical nowadays. However, previous efforts in toxicity detection\nhave been mostly based on benchmarks derived from social media content, leaving\nthe unique challenges inherent to real-world user-AI interactions\ninsufficiently explored. In this work, we introduce ToxicChat, a novel\nbenchmark based on real user queries from an open-source chatbot. 
This\nbenchmark contains the rich, nuanced phenomena that can be tricky for current\ntoxicity detection models to identify, revealing a significant domain\ndifference compared to social media content. Our systematic evaluation of\nmodels trained on existing toxicity datasets has shown their shortcomings when\napplied to this unique domain of ToxicChat. Our work illuminates the\npotentially overlooked challenges of toxicity detection in real-world user-AI\nconversations. In the future, ToxicChat can be a valuable resource to drive\nfurther advancements toward building a safe and healthy environment for user-AI\ninteractions.\n","authors":["Zi Lin","Zihan Wang","Yongqi Tong","Yangkun Wang","Yuxin Guo","Yujia Wang","Jingbo Shang"],"pdf_url":"https://arxiv.org/pdf/2310.17389v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.01488v4","updated":"2023-10-26T13:27:31Z","published":"2022-12-02T23:43:18Z","title":"Event knowledge in large language models: the gap between the impossible\n and the unlikely","summary":" Word co-occurrence patterns in language corpora contain a surprising amount\nof conceptual knowledge. Large language models (LLMs), trained to predict words\nin context, leverage these patterns to achieve impressive performance on\ndiverse semantic tasks requiring world knowledge. An important but understudied\nquestion about LLMs' semantic abilities is whether they acquire generalized\nknowledge of common events. Here, we test whether five pre-trained LLMs (from\n2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions\nof agent-patient interactions than to minimally different implausible versions\nof the same event. Using three curated sets of minimal sentence pairs (total\nn=1,215), we found that pre-trained LLMs possess substantial event knowledge,\noutperforming other distributional language models. In particular, they almost\nalways assign higher likelihood to possible vs. impossible events (The teacher\nbought the laptop vs. 
The laptop bought the teacher). However, LLMs show less\nconsistent preferences for likely vs. unlikely events (The nanny tutored the\nboy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM\nscores are driven by both plausibility and surface-level sentence features,\n(ii) LLM scores generalize well across syntactic variants (active vs. passive\nconstructions) but less well across semantic variants (synonymous sentences),\n(iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence\nplausibility serves as an organizing dimension in internal LLM representations.\nOverall, our results show that important aspects of event knowledge naturally\nemerge from distributional linguistic patterns, but also highlight a gap\nbetween representations of possible/impossible and likely/unlikely events.\n","authors":["Carina Kauf","Anna A. Ivanova","Giulia Rambelli","Emmanuele Chersoni","Jingyuan Selena She","Zawad Chowdhury","Evelina Fedorenko","Alessandro Lenci"],"pdf_url":"https://arxiv.org/pdf/2212.01488v4.pdf","comment":"The two lead authors have contributed equally to this work"},{"id":"http://arxiv.org/abs/2310.17372v1","updated":"2023-10-26T13:07:01Z","published":"2023-10-26T13:07:01Z","title":"Dialogue-based generation of self-driving simulation scenarios using\n Large Language Models","summary":" Simulation is an invaluable tool for developing and evaluating controllers\nfor self-driving cars. Current simulation frameworks are driven by\nhighly-specialist domain specific languages, and so a natural language\ninterface would greatly enhance usability. But there is often a gap, consisting\nof tacit assumptions the user is making, between a concise English utterance\nand the executable code that captures the user's intent. 
In this paper we\ndescribe a system that addresses this issue by supporting an extended\nmultimodal interaction: the user can follow up prior instructions with\nrefinements or revisions, in reaction to the simulations that have been\ngenerated from their utterances so far. We use Large Language Models (LLMs) to\nmap the user's English utterances in this interaction into domain-specific\ncode, and so we explore the extent to which LLMs capture the context\nsensitivity that's necessary for computing the speaker's intended message in\ndiscourse.\n","authors":["Antonio Valerio Miceli-Barone","Alex Lascarides","Craig Innes"],"pdf_url":"https://arxiv.org/pdf/2310.17372v1.pdf","comment":"12 pages, 6 figures, SpLU-RoboNLP 2023"},{"id":"http://arxiv.org/abs/2305.03598v2","updated":"2023-10-26T13:02:44Z","published":"2023-05-05T15:03:01Z","title":"NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial\n Reports","summary":" How can we interpret and retrieve medical evidence to support clinical\ndecisions? Clinical trial reports (CTR) amassed over the years contain\nindispensable information for the development of personalized medicine.\nHowever, it is practically infeasible to manually inspect over 400,000+\nclinical trial reports in order to find the best evidence for experimental\ntreatments. Natural Language Inference (NLI) offers a potential solution to\nthis problem, by allowing the scalable computation of textual entailment.\nHowever, existing NLI models perform poorly on biomedical corpora, and\npreviously published datasets fail to capture the full complexity of inference\nover CTRs. In this work, we present a novel resource to advance research on NLI\nfor reasoning on CTRs. The resource includes two main tasks. Firstly, to\ndetermine the inference relation between a natural language statement, and a\nCTR. 
Secondly, to retrieve supporting facts to justify the predicted relation.\nWe provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these\ntasks. Baselines on this corpus expose the limitations of existing NLI models,\nwith 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To\nthe best of our knowledge, we are the first to design a task that covers the\ninterpretation of full CTRs. To encourage further work on this challenging\ndataset, we make the corpus, competition leaderboard, website and code to\nreplicate the baseline experiments available at:\nhttps://github.com/ai-systems/nli4ct\n","authors":["Maël Jullien","Marco Valentino","Hannah Frost","Paul O'Regan","Donal Landers","André Freitas"],"pdf_url":"https://arxiv.org/pdf/2305.03598v2.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2310.17369v1","updated":"2023-10-26T13:00:26Z","published":"2023-10-26T13:00:26Z","title":"Language and Mental Health: Measures of Emotion Dynamics from Text as\n Linguistic Biosocial Markers","summary":" Research in psychopathology has shown that, at an aggregate level, the\npatterns of emotional change over time -- emotion dynamics -- are indicators of\none's mental health. One's patterns of emotion change have traditionally been\ndetermined through self-reports of emotions; however, there are known issues\nwith accuracy, bias, and convenience. Recent approaches to determining emotion\ndynamics from one's everyday utterances, addresses many of these concerns, but\nit is not yet known whether these measures of utterance emotion dynamics (UED)\ncorrelate with mental health diagnoses. Here, for the first time, we study the\nrelationship between tweet emotion dynamics and mental health disorders. We\nfind that each of the UED metrics studied varied by the user's self-disclosed\ndiagnosis. 
For example: average valence was significantly higher (i.e., more\npositive text) in the control group compared to users with ADHD, MDD, and PTSD.\nValence variability was significantly lower in the control group compared to\nADHD, depression, bipolar disorder, MDD, PTSD, and OCD but not PPD. Rise and\nrecovery rates of valence also exhibited significant differences from the\ncontrol. This work provides important early evidence for how linguistic cues\npertaining to emotion dynamics can play a crucial role as biosocial markers for\nmental illnesses and aid in the understanding, diagnosis, and management of\nmental health disorders.\n","authors":["Daniela Teodorescu","Tiffany Cheng","Alona Fyshe","Saif M. Mohammad"],"pdf_url":"https://arxiv.org/pdf/2310.17369v1.pdf","comment":"9 pages, 3 figures"},{"id":"http://arxiv.org/abs/2308.13056v2","updated":"2023-10-26T12:54:30Z","published":"2023-08-24T19:49:30Z","title":"Lexical Diversity in Kinship Across Languages and Dialects","summary":" Languages are known to describe the world in diverse ways. Across lexicons,\ndiversity is pervasive, appearing through phenomena such as lexical gaps and\nuntranslatability. However, in computational resources, such as multilingual\nlexical databases, diversity is hardly ever represented. In this paper, we\nintroduce a method to enrich computational lexicons with content relating to\nlinguistic diversity. The method is verified through two large-scale case\nstudies on kinship terminology, a domain known to be diverse across languages\nand cultures: one case study deals with seven Arabic dialects, while the other\none with three Indonesian languages. 
Our results, made available as browseable\nand downloadable computational resources, extend prior linguistics research on\nkinship terminology, and provide insight into the extent of diversity even\nwithin linguistically and culturally close communities.\n","authors":["Hadi Khalilia","Gábor Bella","Abed Alhakim Freihat","Shandy Darma","Fausto Giunchiglia"],"pdf_url":"https://arxiv.org/pdf/2308.13056v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15080v2","updated":"2023-10-26T12:51:07Z","published":"2023-05-24T11:59:13Z","title":"Visually-Situated Natural Language Understanding with Contrastive\n Reading Model and Frozen Large Language Models","summary":" Recent advances in Large Language Models (LLMs) have stimulated a surge of\nresearch aimed at extending their applications to the visual domain. While\nthese models exhibit promise in generating abstract image captions and\nfacilitating natural conversations, their performance on text-rich images still\nrequires improvement. In this paper, we introduce Contrastive Reading Model\n(Cream), a novel neural architecture designed to enhance the language-image\nunderstanding capability of LLMs by capturing intricate details that are often\noverlooked in existing methods. Cream combines vision and auxiliary encoders,\nfortified by a contrastive feature alignment technique, to achieve a more\neffective comprehension of language information in visually situated contexts\nwithin the images. Our approach bridges the gap between vision and language\nunderstanding, paving the way for the development of more sophisticated\nDocument Intelligence Assistants. Through rigorous evaluations across diverse\nvisually-situated language understanding tasks that demand reasoning\ncapabilities, we demonstrate the compelling performance of Cream, positioning\nit as a prominent model in the field of visual document understanding. 
We\nprovide our codebase and newly-generated datasets at\nhttps://github.com/naver-ai/cream .\n","authors":["Geewook Kim","Hodong Lee","Daehee Kim","Haeji Jung","Sanghee Park","Yoonsik Kim","Sangdoo Yun","Taeho Kil","Bado Lee","Seunghyun Park"],"pdf_url":"https://arxiv.org/pdf/2305.15080v2.pdf","comment":"22 pages; To appear at EMNLP 2023 Main Conference (Project page:\n https://naver-ai.github.io/cream )"},{"id":"http://arxiv.org/abs/2310.17353v1","updated":"2023-10-26T12:39:20Z","published":"2023-10-26T12:39:20Z","title":"Cultural Adaptation of Recipes","summary":" Building upon the considerable advances in Large Language Models (LLMs), we\nare now equipped to address more sophisticated tasks demanding a nuanced\nunderstanding of cross-cultural contexts. A key example is recipe adaptation,\nwhich goes beyond simple translation to include a grasp of ingredients,\nculinary techniques, and dietary preferences specific to a given culture. We\nintroduce a new task involving the translation and cultural adaptation of\nrecipes between Chinese and English-speaking cuisines. To support this\ninvestigation, we present CulturalRecipes, a unique dataset comprised of\nautomatically paired recipes written in Mandarin Chinese and English. This\ndataset is further enriched with a human-written and curated test set. In this\nintricate task of cross-cultural recipe adaptation, we evaluate the performance\nof various methods, including GPT-4 and other LLMs, traditional machine\ntranslation, and information retrieval techniques. Our comprehensive analysis\nincludes both automatic and human evaluation metrics. While GPT-4 exhibits\nimpressive abilities in adapting Chinese recipes into English, it still lags\nbehind human expertise when translating English recipes into Chinese. This\nunderscores the multifaceted nature of cultural adaptations. 
We anticipate that\nthese insights will significantly contribute to future research on\nculturally-aware language models and their practical application in culturally\ndiverse contexts.\n","authors":["Yong Cao","Yova Kementchedjhieva","Ruixiang Cui","Antonia Karamolegkou","Li Zhou","Megan Dare","Lucia Donatelli","Daniel Hershcovich"],"pdf_url":"https://arxiv.org/pdf/2310.17353v1.pdf","comment":"Accepted to TACL"},{"id":"http://arxiv.org/abs/2309.00359v3","updated":"2023-10-26T12:18:51Z","published":"2023-09-01T09:34:49Z","title":"Large Content And Behavior Models To Understand, Simulate, And Optimize\n Content And Behavior","summary":" Shannon, in his seminal paper introducing information theory, divided the\ncommunication into three levels: technical, semantic, and effectiveness. While\nthe technical level is concerned with accurate reconstruction of transmitted\nsymbols, the semantic and effectiveness levels deal with the inferred meaning\nand its effect on the receiver. Thanks to telecommunications, the first level\nproblem has produced great advances like the internet. Large Language Models\n(LLMs) make some progress towards the second goal, but the third level still\nremains largely untouched. The third problem deals with predicting and\noptimizing communication for desired receiver behavior. LLMs, while showing\nwide generalization capabilities across a wide range of tasks, are unable to\nsolve for this. One reason for the underperformance could be a lack of\n``behavior tokens'' in LLMs' training corpora. Behavior tokens define receiver\nbehavior over a communication, such as shares, likes, clicks, purchases,\nretweets, etc. While preprocessing data for LLM training, behavior tokens are\noften removed from the corpora as noise. Therefore, in this paper, we make some\ninitial progress towards reintroducing behavior tokens in LLM training. 
The\ntrained models, other than showing similar performance to LLMs on content\nunderstanding tasks, show generalization capabilities on behavior simulation,\ncontent simulation, behavior understanding, and behavior domain adaptation.\nUsing a wide range of tasks on two corpora, we show results on all these\ncapabilities. We call these models Large Content and Behavior Models (LCBMs).\nFurther, to spur more research on LCBMs, we release our new Content Behavior\nCorpus (CBC), a repository containing communicator, message, and corresponding\nreceiver behavior.\n","authors":["Ashmit Khandelwal","Aditya Agrawal","Aanisha Bhattacharyya","Yaman K Singla","Somesh Singh","Uttaran Bhattacharya","Ishita Dasgupta","Stefano Petrangeli","Rajiv Ratn Shah","Changyou Chen","Balaji Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2309.00359v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17342v1","updated":"2023-10-26T12:16:25Z","published":"2023-10-26T12:16:25Z","title":"ACT-SQL: In-Context Learning for Text-to-SQL with\n Automatically-Generated Chain-of-Thought","summary":" Recently Large Language Models (LLMs) have been proven to have strong\nabilities in various domains and tasks. We study the problem of prompt\ndesigning in the text-to-SQL task and attempt to improve the LLMs' reasoning\nability when generating SQL queries. Besides the trivial few-shot in-context\nlearning setting, we design our chain-of-thought (CoT) prompt with a similar\nmethod to schema linking. We provide a method named ACT-SQL to automatically\ngenerate auto-CoT exemplars and thus the whole process doesn't need manual\nlabeling. Our approach is cost-saving since we only use the LLMs' API call once\nwhen generating one SQL query. Furthermore, we extend our in-context learning\nmethod to the multi-turn text-to-SQL task. The experiment results show that the\nLLMs' performance can benefit from our ACT-SQL approach. 
Our approach achieves\nSOTA performance on the Spider dev set among existing in-context learning\napproaches.\n","authors":["Hanchong Zhang","Ruisheng Cao","Lu Chen","Hongshen Xu","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2310.17342v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17333v1","updated":"2023-10-26T11:59:45Z","published":"2023-10-26T11:59:45Z","title":"Arabic Fine-Grained Entity Recognition","summary":" Traditional NER systems are typically trained to recognize coarse-grained\nentities, and less attention is given to classifying entities into a hierarchy\nof fine-grained lower-level subtypes. This article aims to advance Arabic NER\nwith fine-grained entities. We chose to extend Wojood (an open-source Nested\nArabic Named Entity Corpus) with subtypes. In particular, four main entity\ntypes in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG),\nand facility (FAC), are extended with 31 subtypes. To do this, we first revised\nWojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's\nACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC,\nORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE\nsub-types. We refer to this extended version of Wojood as WojoodFine. To\nevaluate our annotations, we measured the inter-annotator agreement (IAA) using\nboth Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively.\nTo compute the baselines of WojoodFine, we fine-tune three pre-trained Arabic\nBERT encoders in three settings: flat NER, nested NER and nested NER with\nsubtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. 
Our\ncorpus and models are open-source and available at\nhttps://sina.birzeit.edu/wojood/.\n","authors":["Haneen Liqreina","Mustafa Jarrar","Mohammed Khalilia","Ahmed Oumar El-Shangiti","Muhammad AbdulMageed"],"pdf_url":"https://arxiv.org/pdf/2310.17333v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15113v2","updated":"2023-10-26T11:45:28Z","published":"2023-10-23T17:21:03Z","title":"Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into\n the Morphological Capabilities of a Large Language Model","summary":" Large language models (LLMs) have recently reached an impressive level of\nlinguistic capability, prompting comparisons with human language skills.\nHowever, there have been relatively few systematic inquiries into the\nlinguistic capabilities of the latest generation of LLMs, and those studies\nthat do exist (i) ignore the remarkable ability of humans to generalize, (ii)\nfocus only on English, and (iii) investigate syntax or semantics and overlook\nother capabilities that lie at the heart of human language, like morphology.\nHere, we close these gaps by conducting the first rigorous analysis of the\nmorphological capabilities of ChatGPT in four typologically varied languages\n(specifically, English, German, Tamil, and Turkish). We apply a version of\nBerko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for\nthe four examined languages. We find that ChatGPT massively underperforms\npurpose-built systems, particularly in English. Overall, our results -- through\nthe lens of morphology -- cast a new light on the linguistic capabilities of\nChatGPT, suggesting that claims of human-like language skills are premature and\nmisleading.\n","authors":["Leonie Weissweiler","Valentin Hofmann","Anjali Kantharuban","Anna Cai","Ritam Dutt","Amey Hengle","Anubha Kabra","Atharva Kulkarni","Abhishek Vijayakumar","Haofei Yu","Hinrich Schütze","Kemal Oflazer","David R. 
Mortensen"],"pdf_url":"https://arxiv.org/pdf/2310.15113v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2303.04053v3","updated":"2023-10-26T11:35:03Z","published":"2023-03-07T17:01:25Z","title":"Describe me an Aucklet: Generating Grounded Perceptual Category\n Descriptions","summary":" Human speakers can generate descriptions of perceptual concepts, abstracted\nfrom the instance-level. Moreover, such descriptions can be used by other\nspeakers to learn provisional representations of those concepts. Learning and\nusing abstract perceptual concepts is under-investigated in the\nlanguage-and-vision field. The problem is also highly relevant to the field of\nrepresentation learning in multi-modal NLP. In this paper, we introduce a\nframework for testing category-level perceptual grounding in multi-modal\nlanguage models. In particular, we train separate neural networks to generate\nand interpret descriptions of visual categories. We measure the communicative\nsuccess of the two models with the zero-shot classification performance of the\ninterpretation model, which we argue is an indicator of perceptual grounding.\nUsing this framework, we compare the performance of prototype- and\nexemplar-based representations. 
Finally, we show that communicative success\nexposes performance issues in the generation model, not captured by traditional\nintrinsic NLG evaluation metrics, and argue that these issues stem from a\nfailure to properly ground language in vision at the category level.\n","authors":["Bill Noble","Nikolai Ilinykh"],"pdf_url":"https://arxiv.org/pdf/2303.04053v3.pdf","comment":"To appear in Proceedings of the 2023 Conference on Empirical Methods\n in Natural Language Processing (EMNLP, Main)"},{"id":"http://arxiv.org/abs/2303.07914v2","updated":"2023-10-26T11:27:24Z","published":"2023-03-14T13:56:36Z","title":"Adapting Offline Speech Translation Models for Streaming with\n Future-Aware Distillation and Inference","summary":" A popular approach to streaming speech translation is to employ a single\noffline model with a wait-k policy to support different latency requirements,\nwhich is simpler than training multiple online models with different latency\nconstraints. However, there is a mismatch problem in using a model trained with\ncomplete utterances for streaming inference with partial input. We demonstrate\nthat speech representations extracted at the end of a streaming input are\nsignificantly different from those extracted from a complete utterance. To\naddress this issue, we propose a new approach called Future-Aware Streaming\nTranslation (FAST) that adapts an offline ST model for streaming input. FAST\nincludes a Future-Aware Inference (FAI) strategy that incorporates future\ncontext through a trainable masked embedding, and a Future-Aware Distillation\n(FAD) framework that transfers future context from an approximation of full\nspeech to streaming input. Our experiments on the MuST-C EnDe, EnEs, and EnFr\nbenchmarks show that FAST achieves better trade-offs between translation\nquality and latency than strong baselines. 
Extensive analyses suggest that our\nmethods effectively alleviate the aforementioned mismatch problem between\noffline training and online inference.\n","authors":["Biao Fu","Minpeng Liao","Kai Fan","Zhongqiang Huang","Boxing Chen","Yidong Chen","Xiaodong Shi"],"pdf_url":"https://arxiv.org/pdf/2303.07914v2.pdf","comment":"Accepted to EMNLP 2023 main conference"},{"id":"http://arxiv.org/abs/2310.17315v1","updated":"2023-10-26T11:23:05Z","published":"2023-10-26T11:23:05Z","title":"Nabra: Syrian Arabic Dialects with Morphological Annotations","summary":" This paper presents Nabra, a corpus of Syrian Arabic dialects with\nmorphological annotations. A team of Syrian natives collected more than 6K\nsentences containing about 60K words from several sources including social\nmedia posts, scripts of movies and series, lyrics of songs and local proverbs\nto build Nabra. Nabra covers several local Syrian dialects including those of\nAleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and\nSuwayda. A team of nine annotators annotated the 60K tokens with full\nmorphological annotations across sentence contexts. We trained the annotators\nto follow methodological annotation guidelines to ensure unique morpheme\nannotations, and normalized the annotations. F1 and kappa agreement scores\nranged between 74% and 98% across features, showing the excellent quality of\nNabra annotations. 
Our corpora are open-source and publicly available as part\nof the Currasat portal https://sina.birzeit.edu/currasat.\n","authors":["Amal Nayouf","Tymaa Hammouda","Mustafa Jarrar","Fadi Zaraket","Mohamad-Bassam Kurdy"],"pdf_url":"https://arxiv.org/pdf/2310.17315v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17312v1","updated":"2023-10-26T11:17:03Z","published":"2023-10-26T11:17:03Z","title":"An Ensemble Method Based on the Combination of Transformers with\n Convolutional Neural Networks to Detect Artificially Generated Text","summary":" Thanks to the state-of-the-art Large Language Models (LLMs), language\ngeneration has reached outstanding levels. These models are capable of\ngenerating high quality content, thus making it a challenging task to detect\ngenerated text from human-written content. Despite the advantages provided by\nNatural Language Generation, the inability to distinguish automatically\ngenerated text can raise ethical concerns in terms of authenticity.\nConsequently, it is important to design and develop methodologies to detect\nartificial content. In our work, we present some classification models\nconstructed by ensembling transformer models such as Sci-BERT, DeBERTa and\nXLNet, with Convolutional Neural Networks (CNNs). Our experiments demonstrate\nthat the considered ensemble architectures surpass the performance of the\nindividual transformer models for classification. 
Furthermore, the proposed\nSciBERT-CNN ensemble model produced an F1-score of 98.36% on the ALTA shared\ntask 2023 data.\n","authors":["Vijini Liyanage","Davide Buscaldi"],"pdf_url":"https://arxiv.org/pdf/2310.17312v1.pdf","comment":"In Proceedings of the 21st Annual Workshop of the Australasian\n Language Technology Association (ALTA 2023)"},{"id":"http://arxiv.org/abs/2310.17306v1","updated":"2023-10-26T11:05:15Z","published":"2023-10-26T11:05:15Z","title":"FormaT5: Abstention and Examples for Conditional Table Formatting with\n Natural Language","summary":" Formatting is an important property in tables for visualization,\npresentation, and analysis. Spreadsheet software allows users to automatically\nformat their tables by writing data-dependent conditional formatting (CF)\nrules. Writing such rules is often challenging for users as it requires them to\nunderstand and implement the underlying logic. We present FormaT5, a\ntransformer-based model that can generate a CF rule given the target table and\na natural language description of the desired formatting logic. We find that\nuser descriptions for these tasks are often under-specified or ambiguous,\nmaking it harder for code generation systems to accurately learn the desired\nrule in a single step. To tackle this problem of under-specification and\nminimise argument errors, FormaT5 learns to predict placeholders through an\nabstention objective. These placeholders can then be filled by a second model\nor, when examples of rows that should be formatted are available, by a\nprogramming-by-example system. To evaluate FormaT5 on diverse and real\nscenarios, we create an extensive benchmark of 1053 CF tasks, containing\nreal-world descriptions collected from four different sources. We release our\nbenchmarks to encourage research in this area. Abstention and filling allow\nFormaT5 to outperform 8 different neural approaches on our benchmarks, both\nwith and without examples. 
Our results illustrate the value of building\ndomain-specific learning systems.\n","authors":["Mukul Singh","José Cambronero","Sumit Gulwani","Vu Le","Carina Negreanu","Elnaz Nouri","Mohammad Raza","Gust Verbruggen"],"pdf_url":"https://arxiv.org/pdf/2310.17306v1.pdf","comment":"VLDB 2024, 14 pages"},{"id":"http://arxiv.org/abs/2310.17300v1","updated":"2023-10-26T10:45:26Z","published":"2023-10-26T10:45:26Z","title":"Comparing Photorealistic and Animated Embodied Conversational Agents in\n Serious Games: An Empirical Study on User Experience","summary":" Embodied conversational agents (ECAs) are paradigms of conversational user\ninterfaces in the form of embodied characters. While ECAs offer various\nmanipulable features, this paper focuses on a study conducted to explore two\ndistinct levels of presentation realism. The two agent versions are\nphotorealistic and animated. The study aims to provide insights and design\nsuggestions for speech-enabled ECAs within serious game environments. A\nwithin-subjects, two-by-two factorial design was employed for this research\nwith a cohort of 36 participants balanced for gender. The results showed that\nboth the photorealistic and the animated versions were perceived as highly\nusable, with overall mean scores of 5.76 and 5.71, respectively. However, 69.4\nper cent of the participants stated they preferred the photorealistic version,\n25 per cent stated they preferred the animated version and 5.6 per cent had no\nstated preference. The photorealistic agents were perceived as more realistic\nand human-like, while the animated characters made the task feel more like a\ngame. Even though the agents' realism had no significant effect on usability,\nit positively influenced participants' perceptions of the agent. 
This research\naims to lay the groundwork for future studies on ECA realism's impact in\nserious games across diverse contexts.\n","authors":["Danai Korre"],"pdf_url":"https://arxiv.org/pdf/2310.17300v1.pdf","comment":"21 pages, 14 figures, preprint to be published in HCI INTERNATIONAL\n 2023 25TH INTERNATIONAL CONFERENCE ON HUMAN-COMPUTER INTERACTION proceedings"},{"id":"http://arxiv.org/abs/2305.11685v2","updated":"2023-10-26T10:43:07Z","published":"2023-05-19T14:07:43Z","title":"Recycle-and-Distill: Universal Compression Strategy for\n Transformer-based Speech SSL Models with Attention Map Reusing and Masking\n Distillation","summary":" Transformer-based speech self-supervised learning (SSL) models, such as\nHuBERT, show surprising performance in various speech processing tasks.\nHowever, huge number of parameters in speech SSL models necessitate the\ncompression to a more compact model for wider usage in academia or small\ncompanies. In this study, we suggest to reuse attention maps across the\nTransformer layers, so as to remove key and query parameters while retaining\nthe number of layers. Furthermore, we propose a novel masking distillation\nstrategy to improve the student model's speech representation quality. We\nextend the distillation loss to utilize both masked and unmasked speech frames\nto fully leverage the teacher model's high-quality representation. Our\nuniversal compression strategy yields the student model that achieves phoneme\nerror rate (PER) of 7.72% and word error rate (WER) of 9.96% on the SUPERB\nbenchmark.\n","authors":["Kangwook Jang","Sungnyun Kim","Se-Young Yun","Hoirin Kim"],"pdf_url":"https://arxiv.org/pdf/2305.11685v2.pdf","comment":"Proceedings of Interspeech 2023. 
Code URL:\n https://github.com/sungnyun/ARMHuBERT"},{"id":"http://arxiv.org/abs/2305.09758v3","updated":"2023-10-26T10:08:31Z","published":"2023-05-16T19:13:11Z","title":"A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In\n Zero Shot","summary":" Multimedia content, such as advertisements and story videos, exhibit a rich\nblend of creativity and multiple modalities. They incorporate elements like\ntext, visuals, audio, and storytelling techniques, employing devices like\nemotions, symbolism, and slogans to convey meaning. There is a dearth of large\nannotated training datasets in the multimedia domain hindering the development\nof supervised learning models with satisfactory performance for real-world\napplications. On the other hand, the rise of large language models (LLMs) has\nwitnessed remarkable zero-shot performance in various natural language\nprocessing (NLP) tasks, such as emotion classification, question-answering, and\ntopic classification. To leverage such advanced techniques to bridge this\nperformance gap in multimedia understanding, we propose verbalizing long videos\nto generate their descriptions in natural language, followed by performing\nvideo-understanding tasks on the generated story as opposed to the original\nvideo. Through extensive experiments on fifteen video-understanding tasks, we\ndemonstrate that our method, despite being zero-shot, achieves significantly\nbetter results than supervised baselines for video understanding. Furthermore,\nto alleviate a lack of story understanding benchmarks, we publicly release the\nfirst dataset on a crucial task in computational social science on persuasion\nstrategy identification.\n","authors":["Aanisha Bhattacharya","Yaman K Singla","Balaji Krishnamurthy","Rajiv Ratn Shah","Changyou Chen"],"pdf_url":"https://arxiv.org/pdf/2305.09758v3.pdf","comment":"Accepted to EMNLP-23 TL;DR: Video understanding lags far behind NLP;\n LLMs excel in zero-shot. 
Our approach utilizes LLMs to verbalize videos,\n creating stories for zero-shot video understanding. This yields\n state-of-the-art results across five datasets, covering fifteen tasks"},{"id":"http://arxiv.org/abs/2310.17284v1","updated":"2023-10-26T10:04:31Z","published":"2023-10-26T10:04:31Z","title":"Learning to Abstract with Nonparametric Variational Information\n Bottleneck","summary":" Learned representations at the level of characters, sub-words, words and\nsentences, have each contributed to advances in understanding different NLP\ntasks and linguistic phenomena. However, learning textual embeddings is costly\nas they are tokenization specific and require different models to be trained\nfor each level of abstraction. We introduce a novel language representation\nmodel which can learn to compress to different levels of abstraction at\ndifferent layers of the same model. We apply Nonparametric Variational\nInformation Bottleneck (NVIB) to stacked Transformer self-attention layers in\nthe encoder, which encourages an information-theoretic compression of the\nrepresentations through the model. We find that the layers within the model\ncorrespond to increasing levels of abstraction and that their representations\nare more linguistically informed. Finally, we show that NVIB compression\nresults in a model which is more robust to adversarial perturbations.\n","authors":["Melika Behjati","Fabio Fehr","James Henderson"],"pdf_url":"https://arxiv.org/pdf/2310.17284v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17279v1","updated":"2023-10-26T10:00:24Z","published":"2023-10-26T10:00:24Z","title":"Automatic Logical Forms improve fidelity in Table-to-Text generation","summary":" Table-to-text systems generate natural language statements from structured\ndata like tables. 
While end-to-end techniques suffer from low factual\ncorrectness (fidelity), a previous study reported gains when using manual\nlogical forms (LF) that represent the selected content and the semantics of the\ntarget text. Given the manual step, it was not clear whether automatic LFs\nwould be effective, or whether the improvement came from content selection\nalone. We present TlT which, given a table and a selection of the content,\nfirst produces LFs and then the textual statement. We show for the first time\nthat automatic LFs improve quality, with an increase in fidelity of 30 points\nover a comparable system not using LFs. Our experiments allow to quantify the\nremaining challenges for high factual correctness, with automatic selection of\ncontent coming first, followed by better Logic-to-Text generation and, to a\nlesser extent, better Table-to-Logic parsing.\n","authors":["Iñigo Alonso","Eneko Agirre"],"pdf_url":"https://arxiv.org/pdf/2310.17279v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17271v1","updated":"2023-10-26T09:47:50Z","published":"2023-10-26T09:47:50Z","title":"Understanding the Role of Input Token Characters in Language Models: How\n Does Information Loss Affect Performance?","summary":" Understanding how and what pre-trained language models (PLMs) learn about\nlanguage is an open challenge in natural language processing. Previous work has\nfocused on identifying whether they capture semantic and syntactic information,\nand how the data or the pre-training objective affects their performance.\nHowever, to the best of our knowledge, no previous work has specifically\nexamined how information loss in input token characters affects the performance\nof PLMs. In this study, we address this gap by pre-training language models\nusing small subsets of characters from individual tokens. Surprisingly, we find\nthat pre-training even under extreme settings, i.e. 
using only one character of\neach token, the performance retention in standard NLU benchmarks and probing\ntasks compared to full-token models is high. For instance, a model pre-trained\nonly on single first characters from tokens achieves performance retention of\napproximately $90$\\% and $77$\\% of the full-token model in SuperGLUE and GLUE\ntasks, respectively.\n","authors":["Ahmed Alajrami","Katerina Margatina","Nikolaos Aletras"],"pdf_url":"https://arxiv.org/pdf/2310.17271v1.pdf","comment":"To appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.14159v2","updated":"2023-10-26T09:27:13Z","published":"2023-10-22T03:01:38Z","title":"Can Language Models Laugh at YouTube Short-form Videos?","summary":" As short-form funny videos on social networks are gaining popularity, it\nbecomes demanding for AI models to understand them for better communication\nwith humans. Unfortunately, previous video humor datasets target specific\ndomains, such as speeches or sitcoms, and mostly focus on verbal cues. We\ncurate a user-generated dataset of 10K multimodal funny videos from YouTube,\ncalled ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both\nverbal and visual elements contributing to humor. After filtering, we annotate\neach video with timestamps and text explanations for funny moments. Our\nExFunTube is unique over existing datasets in that our videos cover a wide\nrange of domains with various types of humor that necessitate a multimodal\nunderstanding of the content. Also, we develop a zero-shot video-to-text\nprompting to maximize video humor understanding of large language models\n(LLMs). 
With three different evaluation methods using automatic scores,\nrationale quality experiments, and human evaluations, we show that our\nprompting significantly improves LLMs' ability for humor explanation.\n","authors":["Dayoon Ko","Sangho Lee","Gunhee Kim"],"pdf_url":"https://arxiv.org/pdf/2310.14159v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15694v3","updated":"2023-10-26T09:08:34Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. 
Our experimental results show that COPF\noutperforms strong Continuous learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16713v2","updated":"2023-10-26T09:01:16Z","published":"2023-10-25T15:34:55Z","title":"SkyMath: Technical Report","summary":" Large language models (LLMs) have shown great potential to solve varieties of\nnatural language processing (NLP) tasks, including mathematical reasoning. In\nthis work, we present SkyMath, a large language model for mathematics with 13\nbillion parameters. By applying self-compare fine-tuning, we have enhanced\nmathematical reasoning abilities of Skywork-13B-Base remarkably. On GSM8K,\nSkyMath outperforms all known open-source models of similar size and has\nestablished a new SOTA performance.\n","authors":["Liu Yang","Haihua Yang","Wenjun Cheng","Lei Lin","Chenxia Li","Yifu Chen","Lunan Liu","Jianfei Pan","Tianwen Wei","Biye Li","Liang Zhao","Lijie Wang","Bo Zhu","Guoliang Li","Xuejie Wu","Xilin Luo","Rui Hu"],"pdf_url":"https://arxiv.org/pdf/2310.16713v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.13001v2","updated":"2023-10-26T08:43:58Z","published":"2023-03-23T02:50:38Z","title":"Is ChatGPT A Good Keyphrase Generator? A Preliminary Study","summary":" The emergence of ChatGPT has recently garnered significant attention from the\ncomputational linguistics community. To demonstrate its capabilities as a\nkeyphrase generator, we conduct a preliminary evaluation of ChatGPT for the\nkeyphrase generation task. We evaluate its performance in various aspects,\nincluding keyphrase generation prompts, keyphrase generation diversity, and\nlong document understanding. 
Our evaluation is based on six benchmark datasets,\nand we adopt the prompt suggested by OpenAI while extending it to six candidate\nprompts. We find that ChatGPT performs exceptionally well on all six candidate\nprompts, with minor performance differences observed across the datasets. Based\non our findings, we conclude that ChatGPT has great potential for keyphrase\ngeneration. Moreover, we discover that ChatGPT still faces challenges when it\ncomes to generating absent keyphrases. Meanwhile, in the final section, we also\npresent some limitations and future expansions of this report.\n","authors":["Mingyang Song","Haiyun Jiang","Shuming Shi","Songfang Yao","Shilong Lu","Yi Feng","Huafeng Liu","Liping Jing"],"pdf_url":"https://arxiv.org/pdf/2303.13001v2.pdf","comment":"Technical Report, 7 pages"},{"id":"http://arxiv.org/abs/2310.17238v1","updated":"2023-10-26T08:36:39Z","published":"2023-10-26T08:36:39Z","title":"Joint Entity and Relation Extraction with Span Pruning and Hypergraph\n Neural Networks","summary":" Entity and Relation Extraction (ERE) is an important task in information\nextraction. Recent marker-based pipeline models achieve state-of-the-art\nperformance, but still suffer from the error propagation issue. Also, most of\ncurrent ERE models do not take into account higher-order interactions between\nmultiple entities and relations, while higher-order modeling could be\nbeneficial. In this work, we propose HyperGraph neural network for ERE\n($\\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based\npipeline model). To alleviate error propagation, we use a high-recall pruner\nmechanism to transfer the burden of entity identification and labeling from the\nNER module to the joint module of our model. 
For higher-order modeling, we\nbuild a hypergraph, where nodes are entities (provided by the span pruner) and\nrelations thereof, and hyperedges encode interactions between two different\nrelations or between a relation and its associated subject and object entities.\nWe then run a hypergraph neural network for higher-order inference by applying\nmessage passing over the built hypergraph. Experiments on three widely used\nbenchmarks (\\acef{}, \\ace{} and \\scierc{}) for ERE task show significant\nimprovements over the previous state-of-the-art PL-marker.\n","authors":["Zhaohui Yan","Songlin Yang","Wei Liu","Kewei Tu"],"pdf_url":"https://arxiv.org/pdf/2310.17238v1.pdf","comment":"Accepted to Proceedings of EMNLP, 2023"},{"id":"http://arxiv.org/abs/2310.17233v1","updated":"2023-10-26T08:31:00Z","published":"2023-10-26T08:31:00Z","title":"EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual\n Representation Learning","summary":" Expressing universal semantics common to all languages is helpful in\nunderstanding the meanings of complex and culture-specific sentences. The\nresearch theme underlying this scenario focuses on learning universal\nrepresentations across languages with the usage of massive parallel corpora.\nHowever, due to the sparsity and scarcity of parallel data, there is still a\nbig challenge in learning authentic ``universals'' for any two languages. In\nthis paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm,\nto learn (X)Cross-lingual universals with the aid of excessive multilingual\nnon-parallel data. EMMA-X unifies the cross-lingual representation learning\ntask and an extra semantic relation prediction task within an EM framework.\nBoth the extra semantic classifier and the cross-lingual sentence encoder\napproximate the semantic relation of two sentences, and supervise each other\nuntil convergence. 
To evaluate EMMA-X, we conduct experiments on XRETE, a newly\nintroduced benchmark containing 12 widely studied cross-lingual tasks that\nfully depend on sentence-level representations. Results reveal that EMMA-X\nachieves state-of-the-art performance. Further geometric analysis of the built\nrepresentation space with three requirements demonstrates the superiority of\nEMMA-X over advanced models.\n","authors":["Ping Guo","Xiangpeng Wei","Yue Hu","Baosong Yang","Dayiheng Liu","Fei Huang","Jun Xie"],"pdf_url":"https://arxiv.org/pdf/2310.17233v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17230v1","updated":"2023-10-26T08:28:48Z","published":"2023-10-26T08:28:48Z","title":"Codebook Features: Sparse and Discrete Interpretability for Neural\n Networks","summary":" Understanding neural networks is challenging in part because of the dense,\ncontinuous nature of their hidden states. We explore whether we can train\nneural networks to have hidden states that are sparse, discrete, and more\ninterpretable by quantizing their continuous features into what we call\ncodebook features. Codebook features are produced by finetuning neural networks\nwith vector quantization bottlenecks at each layer, producing a network whose\nhidden features are the sum of a small number of discrete vector codes chosen\nfrom a larger codebook. Surprisingly, we find that neural networks can operate\nunder this extreme bottleneck with only modest degradation in performance. This\nsparse, discrete bottleneck also provides an intuitive way of controlling\nneural network behavior: first, find codes that activate when the desired\nbehavior is present, then activate those same codes during generation to elicit\nthat behavior. We validate our approach by training codebook Transformers on\nseveral different datasets. First, we explore a finite state machine dataset\nwith far more hidden states than neurons. 
In this setting, our approach\novercomes the superposition problem by assigning states to distinct codes, and\nwe find that we can make the neural network behave as if it is in a different\nstate by activating the code for that state. Second, we train Transformer\nlanguage models with up to 410M parameters on two natural language datasets. We\nidentify codes in these models representing diverse, disentangled concepts\n(ranging from negative emotions to months of the year) and find that we can\nguide the model to generate different topics by activating the appropriate\ncodes during inference. Overall, codebook features appear to be a promising\nunit of analysis and control for neural networks and interpretability. Our\ncodebase and models are open-sourced at\nhttps://github.com/taufeeque9/codebook-features.\n","authors":["Alex Tamkin","Mohammad Taufeeque","Noah D. Goodman"],"pdf_url":"https://arxiv.org/pdf/2310.17230v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17228v1","updated":"2023-10-26T08:27:36Z","published":"2023-10-26T08:27:36Z","title":"TST$^\\mathrm{R}$: Target Similarity Tuning Meets the Real World","summary":" Target similarity tuning (TST) is a method of selecting relevant examples in\nnatural language (NL) to code generation through large language models (LLMs)\nto improve performance. Its goal is to adapt a sentence embedding model to have\nthe similarity between two NL inputs match the similarity between their\nassociated code outputs. In this paper, we propose different methods to apply\nand improve TST in the real world. 
First, we replace the sentence transformer\nwith embeddings from a larger model, which reduces sensitivity to the language\ndistribution and thus provides more flexibility in synthetic generation of\nexamples, and we train a tiny model that transforms these embeddings to a space\nwhere embedding similarity matches code similarity, which allows the model to\nremain a black box and only requires a few matrix multiplications at inference\ntime. Second, we show how to efficiently select a smaller number of training\nexamples to train the TST model. Third, we introduce a ranking-based evaluation\nfor TST that does not require end-to-end code generation experiments, which can\nbe expensive to perform.\n","authors":["Anirudh Khatry","Sumit Gulwani","Priyanshu Gupta","Vu Le","Ananya Singha","Mukul Singh","Gust Verbruggen"],"pdf_url":"https://arxiv.org/pdf/2310.17228v1.pdf","comment":"Accepted for EMNLP-Findings, 2023"},{"id":"http://arxiv.org/abs/2302.04449v3","updated":"2023-10-26T08:23:48Z","published":"2023-02-09T05:47:03Z","title":"Read and Reap the Rewards: Learning to Play Atari with the Help of\n Instruction Manuals","summary":" High sample complexity has long been a challenge for RL. On the other hand,\nhumans learn to perform tasks not only from interaction or demonstrations, but\nalso by reading unstructured text documents, e.g., instruction manuals.\nInstruction manuals and wiki pages are among the most abundant data that could\ninform agents of valuable features and policies or task-specific environmental\ndynamics and reward structures. Therefore, we hypothesize that the ability to\nutilize human-written instruction manuals to assist learning policies for\nspecific tasks should lead to a more efficient and better-performing agent. We\npropose the Read and Reward framework. Read and Reward speeds up RL algorithms\non Atari games by reading manuals released by the Atari game developers. 
Our\nframework consists of a QA Extraction module that extracts and summarizes\nrelevant information from the manual and a Reasoning module that evaluates\nobject-agent interactions based on information from the manual. An auxiliary\nreward is then provided to a standard A2C RL agent, when interaction is\ndetected. Experimentally, various RL algorithms obtain significant improvement\nin performance and training speed when assisted by our design.\n","authors":["Yue Wu","Yewen Fan","Paul Pu Liang","Amos Azaria","Yuanzhi Li","Tom M. Mitchell"],"pdf_url":"https://arxiv.org/pdf/2302.04449v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.01522v2","updated":"2023-10-26T08:15:30Z","published":"2023-09-04T11:05:10Z","title":"What are Public Concerns about ChatGPT? A Novel Self-Supervised Neural\n Topic Model Tells You","summary":" The recently released artificial intelligence conversational agent, ChatGPT,\nhas gained significant attention in academia and real life. A multitude of\nearly ChatGPT users eagerly explore its capabilities and share their opinions\non it via social media. Both user queries and social media posts express public\nconcerns regarding this advanced dialogue system. To mine public concerns about\nChatGPT, a novel Self-Supervised neural Topic Model (SSTM), which formalizes\ntopic modeling as a representation learning procedure, is proposed in this\npaper. Extensive experiments have been conducted on Twitter posts about ChatGPT\nand queries asked by ChatGPT users. 
Experimental results demonstrate that\nthe proposed approach could extract higher quality public concerns with\nimproved interpretability and diversity, surpassing the performance of\nstate-of-the-art approaches.\n","authors":["Rui Wang","Xing Liu","Yanan Wang","Haiping Huang"],"pdf_url":"https://arxiv.org/pdf/2309.01522v2.pdf","comment":"The paper requires major revision"},{"id":"http://arxiv.org/abs/2310.17217v1","updated":"2023-10-26T08:08:43Z","published":"2023-10-26T08:08:43Z","title":"Beyond MLE: Convex Learning for Text Generation","summary":" Maximum likelihood estimation (MLE) is a statistical method used to estimate\nthe parameters of a probability distribution that best explain the observed\ndata. In the context of text generation, MLE is often used to train generative\nlanguage models, which can then be used to generate new text. However, we argue\nthat MLE is not always necessary and optimal, especially for closed-ended text\ngeneration tasks like machine translation. In these tasks, the goal of the model is\nto generate the most appropriate response, which does not necessarily require\nit to estimate the entire data distribution with MLE. To this end, we propose a\nnovel class of training objectives based on convex functions, which enables\ntext generation models to focus on highly probable outputs without having to\nestimate the entire data distribution. We investigate the theoretical\nproperties of the optimal predicted distribution when applying convex functions\nto the loss, demonstrating that convex functions can sharpen the optimal\ndistribution, thereby enabling the model to better capture outputs with high\nprobabilities. Experiments on various text generation tasks and models show the\neffectiveness of our approach. 
It enables autoregressive models to bridge the\ngap between greedy and beam search, and facilitates the learning of\nnon-autoregressive models with a maximum improvement of 9+ BLEU points.\nMoreover, our approach also exhibits significant impact on large language\nmodels (LLMs), substantially enhancing their generative capability on various\ntasks. Source code is available at\n\\url{https://github.com/ictnlp/Convex-Learning}.\n","authors":["Chenze Shao","Zhengrui Ma","Min Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.17217v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17207v1","updated":"2023-10-26T07:49:25Z","published":"2023-10-26T07:49:25Z","title":"Efficient Data Fusion using the Tsetlin Machine","summary":" We propose a novel way of assessing and fusing noisy dynamic data using a\nTsetlin Machine. Our approach consists in monitoring how explanations in form\nof logical clauses that a TM learns changes with possible noise in dynamic\ndata. This way TM can recognize the noise by lowering weights of previously\nlearned clauses, or reflect it in the form of new clauses. We also perform a\ncomprehensive experimental study using notably different datasets that\ndemonstrated high performance of the proposed approach.\n","authors":["Rupsa Saha","Vladimir I. Zadorozhny","Ole-Christoffer Granmo"],"pdf_url":"https://arxiv.org/pdf/2310.17207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.01458v4","updated":"2023-10-26T07:19:58Z","published":"2023-07-04T03:34:19Z","title":"CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity\n and Infant Care","summary":" The recent advances in natural language processing (NLP), have led to a new\ntrend of applying large language models (LLMs) to real-world scenarios. While\nthe latest LLMs are astonishingly fluent when interacting with humans, they\nsuffer from the misinformation problem by unintentionally generating factually\nfalse statements. 
This can lead to harmful consequences, especially when\nproduced within sensitive contexts, such as healthcare. Yet few previous works\nhave focused on evaluating misinformation in the long-form (LF) generation of\nLLMs, especially for knowledge-intensive topics. Moreover, although LLMs have\nbeen shown to perform well in different languages, misinformation evaluation\nhas been mostly conducted in English. To this end, we present a benchmark,\nCARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic,\nspecifically the maternity and infant care domain; and 2) a language other than\nEnglish, namely Chinese. Most importantly, we provide an innovative paradigm\nfor building LF generation evaluation benchmarks that can be transferred to\nother knowledge-intensive domains and low-resourced languages. Our proposed\nbenchmark fills the gap between the extensive usage of LLMs and the lack of\ndatasets for assessing the misinformation generated by these models. It\ncontains 1,612 expert-checked questions, accompanied with human-selected\nreferences. Using our benchmark, we conduct extensive experiments and found\nthat current Chinese LLMs are far from perfect in the topic of maternity and\ninfant care. 
In an effort to minimize the reliance on human resources for\nperformance evaluation, we offer off-the-shelf judgment models for\nautomatically assessing the LF output of LLMs given benchmark questions.\nMoreover, we compare potential solutions for LF generation evaluation and\nprovide insights for building better automated metrics.\n","authors":["Tong Xiang","Liangzhi Li","Wangyue Li","Mingbai Bai","Lu Wei","Bowen Wang","Noa Garcia"],"pdf_url":"https://arxiv.org/pdf/2307.01458v4.pdf","comment":"NeurIPS 2023 Datasets and Benchmarks Track"},{"id":"http://arxiv.org/abs/2310.17191v1","updated":"2023-10-26T07:10:31Z","published":"2023-10-26T07:10:31Z","title":"How do Language Models Bind Entities in Context?","summary":" To correctly use in-context information, language models (LMs) must bind\nentities to their attributes. For example, given a context describing a \"green\nsquare\" and a \"blue circle\", LMs must bind the shapes to their respective\ncolors. We analyze LM representations and identify the binding ID mechanism: a\ngeneral mechanism for solving the binding problem, which we observe in every\nsufficiently large model from the Pythia and LLaMA families. Using causal\ninterventions, we show that LMs' internal activations represent binding\ninformation by attaching binding ID vectors to corresponding entities and\nattributes. 
We further show that binding ID vectors form a continuous subspace,\nin which distances between binding ID vectors reflect their discernability.\nOverall, our results uncover interpretable strategies in LMs for representing\nsymbolic knowledge in-context, providing a step towards understanding general\nin-context reasoning in large-scale LMs.\n","authors":["Jiahai Feng","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2310.17191v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04618v2","updated":"2023-10-26T06:59:50Z","published":"2023-06-07T17:47:03Z","title":"Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis,\n and LLMs Evaluations","summary":" This paper reexamines the research on out-of-distribution (OOD) robustness in\nthe field of NLP. We find that the distribution shift settings in previous\nstudies commonly lack adequate challenges, hindering the accurate evaluation of\nOOD robustness. To address these issues, we propose a benchmark construction\nprotocol that ensures clear differentiation and challenging distribution\nshifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution\nrobustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we\nconduct a series of experiments on pre-trained language models for analysis and\nevaluation of OOD robustness. First, for vanilla fine-tuning, we examine the\nrelationship between in-distribution (ID) and OOD performance. We identify\nthree typical types that unveil the inner learning mechanism, which could\npotentially facilitate the forecasting of OOD robustness, correlating with the\nadvancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and\nfind that, despite exhibiting some effectiveness in specific cases, they do not\noffer significant improvement compared to vanilla fine-tuning. 
Further, we\nevaluate 5 LLMs with various adaptation paradigms and find that when sufficient\nID data is available, fine-tuning domain-specific models outperform LLMs on ID\nexamples significantly. However, in the case of OOD instances, prioritizing\nLLMs with in-context learning yields better results. We identify that both\nfine-tuned small models and LLMs face challenges in effectively addressing\ndownstream tasks. The code is public at\n\\url{https://github.com/lifan-yuan/OOD_NLP}.\n","authors":["Lifan Yuan","Yangyi Chen","Ganqu Cui","Hongcheng Gao","Fangyuan Zou","Xingyi Cheng","Heng Ji","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2306.04618v2.pdf","comment":"Accepted to NeurIPS 2023 Dataset and Benchmark Track. Code is\n available at \\url{https://github.com/lifan-yuan/OOD_NLP}"},{"id":"http://arxiv.org/abs/2302.04089v2","updated":"2023-10-26T06:42:40Z","published":"2023-02-07T18:55:28Z","title":"ZipLM: Inference-Aware Structured Pruning of Language Models","summary":" The breakthrough performance of large language models (LLMs) comes with major\ncomputational footprints and high deployment costs. In this paper, we progress\ntowards resolving this problem by proposing a novel structured compression\napproach for LLMs, called ZipLM. ZipLM achieves state-of-the-art\naccuracy-vs-speedup, while matching a set of desired target runtime speedups in\nany given inference environment. Specifically, given a model, a dataset, an\ninference environment, as well as a set of speedup targets, ZipLM iteratively\nidentifies and removes components with the worst loss-runtime trade-off. Unlike\nprior methods that specialize in either the post-training/one-shot or the\ngradual compression setting, and only for specific families of models such as\nBERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed\nmodels across all these settings. 
Furthermore, ZipLM achieves superior results\nfor a fraction of the computational cost relative to prior distillation and\npruning techniques, making it a cost-effective approach for generating an\nentire family of smaller, faster, and highly accurate models, guaranteed to\nmeet the desired inference specifications. In particular, ZipLM outperforms all\nprior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and\nTinyBERT. Moreover, it matches the performance of the heavily optimized\nMobileBERT model, obtained via extensive architecture search, by simply pruning\nthe baseline BERT-large model. When compressing GPT2, ZipLM outperforms\nDistilGPT2 while being 60% smaller and 30% faster. Our code is available at:\nhttps://github.com/IST-DASLab/ZipLM.\n","authors":["Eldar Kurtic","Elias Frantar","Dan Alistarh"],"pdf_url":"https://arxiv.org/pdf/2302.04089v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.14802v2","updated":"2023-10-26T06:09:05Z","published":"2023-05-24T06:55:09Z","title":"Estimating Large Language Model Capabilities without Labeled Test Data","summary":" Large Language Models (LLMs) have the impressive ability to perform\nin-context learning (ICL) from only a few examples, but the success of ICL\nvaries widely from task to task. Thus, it is important to quickly determine\nwhether ICL is applicable to a new task, but directly evaluating ICL accuracy\ncan be expensive in situations where test data is expensive to annotate -- the\nexact situations where ICL is most appealing. In this paper, we propose the\ntask of ICL accuracy estimation, in which we predict the accuracy of an LLM\nwhen doing in-context learning on a new task given only unlabeled test data for\nthat task. To perform ICL accuracy estimation, we propose a method that trains\na meta-model using LLM confidence scores as features. 
We compare our method to\nseveral strong accuracy estimation baselines on a new benchmark that covers 4\nLLMs and 3 task collections. The meta-model improves over all baselines across\n8 out of 12 settings and achieves the same estimation performance as directly\nevaluating on 40 collected labeled test examples per task. At the same time, no\nexisting approach provides an accurate and reliable ICL accuracy estimation in\nevery setting, highlighting the need for better ways to measure the uncertainty\nof LLM predictions.\n","authors":["Harvey Yiyun Fu","Qinyuan Ye","Albert Xu","Xiang Ren","Robin Jia"],"pdf_url":"https://arxiv.org/pdf/2305.14802v2.pdf","comment":"Accepted to EMNLP 2023 Findings. Camera-ready version. Code:\n https://github.com/harvey-fin/icl-estimate"},{"id":"http://arxiv.org/abs/2305.15814v3","updated":"2023-10-26T05:57:27Z","published":"2023-05-25T07:53:23Z","title":"Bhasha-Abhijnaanam: Native-script and romanized Language Identification\n for 22 Indic languages","summary":" We create publicly available language identification (LID) datasets and\nmodels in all 22 Indian languages listed in the Indian constitution in both\nnative-script and romanized text. First, we create Bhasha-Abhijnaanam, a\nlanguage identification test set for native-script as well as romanized text\nwhich spans all 22 Indic languages. We also train IndicLID, a language\nidentifier for all the above-mentioned languages in both native and romanized\nscript. For native-script text, it has better language coverage than existing\nLIDs and is competitive or better than other LIDs. IndicLID is the first LID\nfor romanized text in Indian languages. Two major challenges for romanized text\nLID are the lack of training data and low-LID performance when languages are\nsimilar. We provide simple and effective solutions to these problems. 
In\ngeneral, there has been limited work on romanized text in any language, and our\nfindings are relevant to other languages that need romanized language\nidentification. Our models are publicly available at\nhttps://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training\nand test sets are also publicly available at\nhttps://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses.\n","authors":["Yash Madhani","Mitesh M. Khapra","Anoop Kunchukuttan"],"pdf_url":"https://arxiv.org/pdf/2305.15814v3.pdf","comment":"Accepted to ACL 2023"},{"id":"http://arxiv.org/abs/2310.17166v1","updated":"2023-10-26T05:39:49Z","published":"2023-10-26T05:39:49Z","title":"X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity","summary":" Cross-lingual transfer (XLT) is an emergent ability of multilingual language\nmodels that preserves their performance on a task to a significant extent when\nevaluated in languages that were not included in the fine-tuning process. While\nEnglish, due to its widespread usage, is typically regarded as the primary\nlanguage for model adaption in various tasks, recent studies have revealed that\nthe efficacy of XLT can be amplified by selecting the most appropriate source\nlanguages based on specific conditions. In this work, we propose the\nutilization of sub-network similarity between two languages as a proxy for\npredicting the compatibility of the languages in the context of XLT. Our\napproach is model-oriented, better reflecting the inner workings of foundation\nmodels. In addition, it requires only a moderate amount of raw text from\ncandidate languages, distinguishing it from the majority of previous methods\nthat rely on external resources. In experiments, we demonstrate that our method\nis more effective than baselines across diverse tasks. Specifically, it shows\nproficiency in ranking candidates for zero-shot XLT, achieving an improvement\nof 4.6% on average in terms of NDCG@3. 
We also provide extensive analyses that\nconfirm the utility of sub-networks for XLT prediction.\n","authors":["Taejun Yun","Jinhyeon Kim","Deokyeong Kang","Seong Hoon Lim","Jihoon Kim","Taeuk Kim"],"pdf_url":"https://arxiv.org/pdf/2310.17166v1.pdf","comment":"Accepted to EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2305.13850v2","updated":"2023-10-26T05:32:22Z","published":"2023-05-23T09:18:47Z","title":"Global Structure Knowledge-Guided Relation Extraction Method for\n Visually-Rich Document","summary":" Visual Relation Extraction (VRE) is a powerful means of discovering\nrelationships between entities within visually-rich documents. Existing methods\noften focus on manipulating entity features to find pairwise relations, yet\nneglect the more fundamental structural information that links disparate entity\npairs together. The absence of global structure information may make the model\nstruggle to learn long-range relations and easily predict conflicted results.\nTo alleviate such limitations, we propose a \\textbf{G}l\\textbf{O}bal\n\\textbf{S}tructure knowledge-guided relation \\textbf{E}xtraction\n(\\textbf{\\model}) framework. {\\model} initiates by generating preliminary\nrelation predictions on entity pairs extracted from a scanned image of the\ndocument. Subsequently, global structural knowledge is captured from the\npreceding iterative predictions, which are then incorporated into the\nrepresentations of the entities. This ``generate-capture-incorporate'' cycle is\nrepeated multiple times, allowing entity representations and global structure\nknowledge to be mutually reinforced. 
Extensive experiments validate that\n{\\model} not only outperforms existing methods in the standard fine-tuning\nsetting but also reveals superior cross-lingual learning capabilities; indeed,\neven yields stronger data-efficient performance in the low-resource setting.\nThe code for GOSE will be available at https://github.com/chenxn2020/GOSE.\n","authors":["Xiangnan Chen","Qian Xiao","Juncheng Li","Duo Dong","Jun Lin","Xiaozhong Liu","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2305.13850v2.pdf","comment":"Accepted by EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2205.03018v2","updated":"2023-10-26T05:21:20Z","published":"2022-05-06T05:13:12Z","title":"Aksharantar: Open Indic-language Transliteration datasets and models for\n the Next Billion Users","summary":" Transliteration is very important in the Indian language context due to the\nusage of multiple scripts and the widespread use of romanized inputs. However,\nfew training and evaluation sets are publicly available. We introduce\nAksharantar, the largest publicly available transliteration dataset for Indian\nlanguages created by mining from monolingual and parallel corpora, as well as\ncollecting data from human annotators. The dataset contains 26 million\ntransliteration pairs for 21 Indic languages from 3 language families using 12\nscripts. Aksharantar is 21 times larger than existing datasets and is the first\npublicly available dataset for 7 languages and 1 language family. We also\nintroduce the Aksharantar testset comprising 103k word pairs spanning 19\nlanguages that enables a fine-grained analysis of transliteration models on\nnative origin words, foreign words, frequent words, and rare words. Using the\ntraining set, we trained IndicXlit, a multilingual transliteration model that\nimproves accuracy by 15% on the Dakshina test set, and establishes strong\nbaselines on the Aksharantar testset introduced in this work. 
The models,\nmining scripts, transliteration guidelines, and datasets are available at\nhttps://github.com/AI4Bharat/IndicXlit under open-source licenses. We hope the\navailability of these large-scale, open resources will spur innovation for\nIndic language transliteration and downstream applications.\n","authors":["Yash Madhani","Sushane Parthan","Priyanka Bedekar","Gokul NC","Ruchi Khapra","Anoop Kunchukuttan","Pratyush Kumar","Mitesh M. Khapra"],"pdf_url":"https://arxiv.org/pdf/2205.03018v2.pdf","comment":"This manuscript is an extended version of the paper accepted to EMNLP\n Findings 2023. You can find the EMNLP Findings version at\n https://anoopkunchukuttan.gitlab.io/publications/emnlp_findings_2023_aksharantar.pdf"},{"id":"http://arxiv.org/abs/2305.14327v2","updated":"2023-10-26T05:10:18Z","published":"2023-05-23T17:56:26Z","title":"Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation","summary":" Instruction tuning has emerged to enhance the capabilities of large language\nmodels (LLMs) to comprehend instructions and generate appropriate responses.\nExisting methods either manually annotate or employ LLM (e.g., GPT-series) to\ngenerate data for instruction tuning. However, they often overlook associating\ninstructions with existing annotated datasets. In this paper, we propose\nDynosaur, a dynamic growth paradigm for the automatic curation of\ninstruction-tuning data. 
Based on the metadata of existing datasets, we use\nLLMs to automatically construct instruction-tuning data by identifying relevant\ndata fields and generating appropriate instructions.\n By leveraging the existing annotated datasets, Dynosaur offers several\nadvantages: 1) it reduces the API cost for generating instructions (e.g., it\ncosts less than $12 USD by calling GPT-3.5-turbo for generating 800K\ninstruction tuning samples); 2) it provides high-quality data for instruction\ntuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform\nwith comparable data sizes); and 3) it supports the continuous improvement of\nmodels by generating instruction-tuning data when a new annotated dataset\nbecomes available. We further investigate a continual learning scheme for\nlearning with the ever-growing instruction-tuning dataset, and demonstrate that\nreplaying tasks with diverse instruction embeddings not only helps mitigate\nforgetting issues but generalizes to unseen tasks better.\n Code and data are available at https://github.com/WadeYin9712/Dynosaur.\n","authors":["Da Yin","Xiao Liu","Fan Yin","Ming Zhong","Hritik Bansal","Jiawei Han","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2305.14327v2.pdf","comment":"EMNLP 2023. Code and data are available at\n https://github.com/WadeYin9712/Dynosaur"},{"id":"http://arxiv.org/abs/2307.16200v3","updated":"2023-10-26T04:55:52Z","published":"2023-07-30T10:51:32Z","title":"A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue\n Information Extraction","summary":" This paper focuses on term-status pair extraction from medical dialogues\n(MD-TSPE), which is essential in diagnosis dialogue systems and the automatic\nscribe of electronic medical records (EMRs). In the past few years, works on\nMD-TSPE have attracted increasing research attention, especially after the\nremarkable progress made by generative methods. 
However, these generative\nmethods output a whole sequence consisting of term-status pairs in one stage\nand ignore integrating prior knowledge, which demands a deeper understanding to\nmodel the relationship between terms and infer the status of each term. This\npaper presents a knowledge-enhanced two-stage generative framework (KTGF) to\naddress the above challenges. Using task-specific prompts, we employ a single\nmodel to complete the MD-TSPE through two phases in a unified generative form:\nwe first generate all terms and then generate the status of each generated\nterm. In this way, the relationship between terms can be learned more\neffectively from the sequence containing only terms in the first phase, and our\ndesigned knowledge-enhanced prompt in the second phase can leverage the\ncategory and status candidates of the generated term for status generation.\nFurthermore, our proposed special status \"not mentioned\" makes more terms\navailable and enriches the training data in the second phase, which is critical\nin the low-resource setting. The experiments on the Chunyu and CMDD datasets\nshow that the proposed method achieves superior results compared to the\nstate-of-the-art models in the full training and low-resource settings.\n","authors":["Zefa Hu","Ziyi Ni","Jing Shi","Shuang Xu","Bo Xu"],"pdf_url":"https://arxiv.org/pdf/2307.16200v3.pdf","comment":"Published in Machine Intelligence Research"},{"id":"http://arxiv.org/abs/2310.17143v1","updated":"2023-10-26T04:35:00Z","published":"2023-10-26T04:35:00Z","title":"Supercharging academic writing with generative AI: framework,\n techniques, and caveats","summary":" Academic writing is an indispensable yet laborious part of the research\nenterprise. This Perspective maps out principles and methods for using\ngenerative artificial intelligence (AI), specifically large language models\n(LLMs), to elevate the quality and efficiency of academic writing. 
We introduce\na human-AI collaborative framework that delineates the rationale (why), process\n(how), and nature (what) of AI engagement in writing. The framework pinpoints\nboth short-term and long-term reasons for engagement and their underlying\nmechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals\nthe role of AI throughout the writing process, conceptualized through a\ntwo-stage model for human-AI collaborative writing, and the nature of AI\nassistance in writing, represented through a model of writing-assistance types\nand levels. Building on this framework, we describe effective prompting\ntechniques for incorporating AI into the writing routine (outlining, drafting,\nand editing) as well as strategies for maintaining rigorous scholarship,\nadhering to varied journal policies, and avoiding overreliance on AI.\nUltimately, the prudent integration of AI into academic writing can ease the\ncommunication burden, empower authors, accelerate discovery, and promote\ndiversity in science.\n","authors":["Zhicheng Lin"],"pdf_url":"https://arxiv.org/pdf/2310.17143v1.pdf","comment":"14 pages, 2 figures, 1 table, 1 box"},{"id":"http://arxiv.org/abs/2310.17140v1","updated":"2023-10-26T04:22:23Z","published":"2023-10-26T04:22:23Z","title":"Symbolic Planning and Code Generation for Grounded Dialogue","summary":" Large language models (LLMs) excel at processing and generating both text and\ncode. However, LLMs have had limited applicability in grounded task-oriented\ndialogue as they are difficult to steer toward task objectives and fail to\nhandle novel grounding. We present a modular and interpretable grounded\ndialogue system that addresses these shortcomings by composing LLMs with a\nsymbolic planner and grounded code execution. Our system consists of a reader\nand planner: the reader leverages an LLM to convert partner utterances into\nexecutable code, calling functions that perform grounding. 
The translated\ncode's output is stored to track dialogue state, while a symbolic planner\ndetermines the next appropriate response. We evaluate our system's performance\non the demanding OneCommon dialogue task, involving collaborative reference\nresolution on abstract images of scattered dots. Our system substantially\noutperforms the previous state-of-the-art, including improving task success in\nhuman evaluations from 56% to 69% in the most challenging setting.\n","authors":["Justin T. Chiu","Wenting Zhao","Derek Chen","Saujas Vaduguru","Alexander M. Rush","Daniel Fried"],"pdf_url":"https://arxiv.org/pdf/2310.17140v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.16436v2","updated":"2023-10-26T04:16:52Z","published":"2023-10-25T08:03:10Z","title":"DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning\n in Language Models","summary":" A long-standing goal of AI systems is to perform complex multimodal reasoning\nlike humans. Recently, large language models (LLMs) have made remarkable\nstrides in such multi-step reasoning on the language modality solely by\nleveraging the chain of thought (CoT) to mimic human thinking. However, the\ntransfer of these advancements to multimodal contexts introduces heightened\nchallenges, including but not limited to the impractical need for\nlabor-intensive annotation and the limitations in terms of flexibility,\ngeneralizability, and explainability. To evoke CoT reasoning in multimodality,\nthis work first conducts an in-depth analysis of these challenges posed by\nmultimodality and presents two key insights: \"keeping critical thinking\" and\n\"letting everyone do their jobs\" in multimodal CoT reasoning. 
Furthermore, this\nstudy proposes a novel DDCoT prompting that maintains a critical attitude\nthrough negative-space prompting and incorporates multimodality into reasoning\nby first dividing the reasoning responsibility of LLMs into reasoning and\nrecognition and then integrating the visual recognition capability of visual\nmodels into the joint reasoning process. The rationales generated by DDCoT not\nonly improve the reasoning abilities of both large and small language models in\nzero-shot prompting and fine-tuning learning, significantly outperforming\nstate-of-the-art methods but also exhibit impressive generalizability and\nexplainability.\n","authors":["Ge Zheng","Bin Yang","Jiajin Tang","Hong-Yu Zhou","Sibei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16436v2.pdf","comment":"24 pages, 13 figures, to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17133v1","updated":"2023-10-26T04:13:49Z","published":"2023-10-26T04:13:49Z","title":"Incorporating Probing Signals into Multimodal Machine Translation via\n Visual Question-Answering Pairs","summary":" This paper presents an in-depth study of multimodal machine translation\n(MMT), examining the prevailing understanding that MMT systems exhibit\ndecreased sensitivity to visual information when text inputs are complete.\nInstead, we attribute this phenomenon to insufficient cross-modal interaction,\nrather than image information redundancy. A novel approach is proposed to\ngenerate parallel Visual Question-Answering (VQA) style pairs from the source\ntext, fostering more robust cross-modal interaction. Using Large Language\nModels (LLMs), we explicitly model the probing signal in MMT to convert it into\nVQA-style data to create the Multi30K-VQA dataset. An MMT-VQA multitask\nlearning framework is introduced to incorporate explicit probing signals from\nthe dataset into the MMT training process. 
Experimental results on two\nwidely-used benchmarks demonstrate the effectiveness of this novel approach.\nOur code and data would be available at:\n\\url{https://github.com/libeineu/MMT-VQA}.\n","authors":["Yuxin Zuo","Bei Li","Chuanhao Lv","Tong Zheng","Tong Xiao","Jingbo Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.17133v1.pdf","comment":"Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.17130v1","updated":"2023-10-26T04:10:16Z","published":"2023-10-26T04:10:16Z","title":"M2C: Towards Automatic Multimodal Manga Complement","summary":" Multimodal manga analysis focuses on enhancing manga understanding with\nvisual and textual features, which has attracted considerable attention from\nboth natural language processing and computer vision communities. Currently,\nmost comics are hand-drawn and prone to problems such as missing pages, text\ncontamination, and aging, resulting in missing comic text content and seriously\nhindering human comprehension. In other words, the Multimodal Manga Complement\n(M2C) task has not been investigated, which aims to handle the aforementioned\nissues by providing a shared semantic space for vision and language\nunderstanding. To this end, we first propose the Multimodal Manga Complement\ntask by establishing a new M2C benchmark dataset covering two languages. First,\nwe design a manga argumentation method called MCoT to mine event knowledge in\ncomics with large language models. Then, an effective baseline FVP-M$^{2}$\nusing fine-grained visual prompts is proposed to support manga complement.\nExtensive experimental results show the effectiveness of FVP-M$^{2}$ method for\nMultimodal Manga Complement.\n","authors":["Hongcheng Guo","Boyang Wang","Jiaqi Bai","Jiaheng Liu","Jian Yang","Zhoujun Li"],"pdf_url":"https://arxiv.org/pdf/2310.17130v1.pdf","comment":"EMNLP2023. 
arXiv admin note: text overlap with arXiv:2210.15461"},{"id":"http://arxiv.org/abs/2307.08701v3","updated":"2023-10-26T04:08:51Z","published":"2023-07-17T17:59:40Z","title":"AlpaGasus: Training A Better Alpaca with Fewer Data","summary":" Large language models~(LLMs) strengthen instruction-following capability\nthrough instruction-finetuning (IFT) on supervised instruction/response data.\nHowever, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly\ncontain many low-quality instances with incorrect or irrelevant responses,\nwhich are misleading and detrimental to IFT. In this paper, we propose a simple\nand effective data selection strategy that automatically identifies and filters\nout low-quality data using a strong LLM (e.g., ChatGPT). To this end, we\nintroduce AlpaGasus, which is finetuned on only 9k high-quality data filtered\nfrom the 52k Alpaca data. AlpaGasus significantly outperforms the original\nAlpaca as evaluated by GPT-4 on multiple test sets and the controlled human\nevaluation. Its 13B variant matches $>90\\%$ performance of its teacher LLM\n(i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also\nprovides 5.7x faster training, reducing the training time for a 7B variant from\n80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the\nefficacy of our method across diverse datasets, base models, and LLM filters.\nOverall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be\ngenerally applied to instruction-tuning data, leading to faster training and\nbetter instruction-following models. 
Our project page is available at:\n\\url{https://lichang-chen.github.io/AlpaGasus/}\n","authors":["Lichang Chen","Shiyang Li","Jun Yan","Hai Wang","Kalpa Gunaratna","Vikas Yadav","Zheng Tang","Vijay Srinivasan","Tianyi Zhou","Heng Huang","Hongxia Jin"],"pdf_url":"https://arxiv.org/pdf/2307.08701v3.pdf","comment":"32 Pages; 29 Figures; 15 Tables"},{"id":"http://arxiv.org/abs/2310.17121v1","updated":"2023-10-26T03:41:32Z","published":"2023-10-26T03:41:32Z","title":"Test-time Augmentation for Factual Probing","summary":" Factual probing is a method that uses prompts to test if a language model\n\"knows\" certain world knowledge facts. A problem in factual probing is that\nsmall changes to the prompt can lead to large changes in model output. Previous\nwork aimed to alleviate this problem by optimizing prompts via text mining or\nfine-tuning. However, such approaches are relation-specific and do not\ngeneralize to unseen relation types. Here, we propose to use test-time\naugmentation (TTA) as a relation-agnostic method for reducing sensitivity to\nprompt variations by automatically augmenting and ensembling prompts at test\ntime. Experiments show improved model calibration, i.e., with TTA, model\nconfidence better reflects prediction accuracy. Improvements in prediction\naccuracy are observed for some models, but for other models, TTA leads to\ndegradation. 
Error analysis identifies the difficulty of producing high-quality\nprompt variations as the main challenge for TTA.\n","authors":["Go Kamoda","Benjamin Heinzerling","Keisuke Sakaguchi","Kentaro Inui"],"pdf_url":"https://arxiv.org/pdf/2310.17121v1.pdf","comment":"12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper)"},{"id":"http://arxiv.org/abs/2310.17120v1","updated":"2023-10-26T03:37:51Z","published":"2023-10-26T03:37:51Z","title":"Topic Segmentation of Semi-Structured and Unstructured Conversational\n Datasets using Language Models","summary":" Breaking down a document or a conversation into multiple contiguous segments\nbased on its semantic structure is an important and challenging problem in NLP,\nwhich can assist many downstream tasks. However, current works on topic\nsegmentation often focus on segmentation of structured texts. In this paper, we\ncomprehensively analyze the generalization capabilities of state-of-the-art\ntopic segmentation models on unstructured texts. We find that: (a) Current\nstrategies of pre-training on a large corpus of structured text such as\nWiki-727K do not help in transferability to unstructured conversational data.\n(b) Training from scratch with only a relatively small-sized dataset of the\ntarget unstructured domain improves the segmentation results by a significant\nmargin. We stress-test our proposed Topic Segmentation approach by\nexperimenting with multiple loss functions, in order to mitigate effects of\nimbalance in unstructured conversational datasets. Our empirical evaluation\nindicates that Focal Loss function is a robust alternative to Cross-Entropy and\nre-weighted Cross-Entropy loss function when segmenting unstructured and\nsemi-structured chats.\n","authors":["Reshmi Ghosh","Harjeet Singh Kajal","Sharanya Kamath","Dhuri Shrivastava","Samyadeep Basu","Hansi Zeng","Soundararajan Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2310.17120v1.pdf","comment":"Accepted to IntelliSys 2023. 
arXiv admin note: substantial text\n overlap with arXiv:2211.14954"},{"id":"http://arxiv.org/abs/2310.17119v1","updated":"2023-10-26T03:28:30Z","published":"2023-10-26T03:28:30Z","title":"FLEEK: Factual Error Detection and Correction with Evidence Retrieved\n from External Knowledge","summary":" Detecting factual errors in textual information, whether generated by large\nlanguage models (LLM) or curated by humans, is crucial for making informed\ndecisions. LLMs' inability to attribute their claims to external knowledge and\ntheir tendency to hallucinate makes it difficult to rely on their responses.\nHumans, too, are prone to factual errors in their writing. Since manual\ndetection and correction of factual errors is labor-intensive, developing an\nautomatic approach can greatly reduce human effort. We present FLEEK, a\nprototype tool that automatically extracts factual claims from text, gathers\nevidence from external knowledge sources, evaluates the factuality of each\nclaim, and suggests revisions for identified errors using the collected\nevidence. Initial empirical evaluation on fact error detection (77-85\\% F1)\nshows the potential of FLEEK. A video demo of FLEEK can be found at\nhttps://youtu.be/NapJFUlkPdQ.\n","authors":["Farima Fatahi Bayat","Kun Qian","Benjamin Han","Yisi Sang","Anton Belyi","Samira Khorshidi","Fei Wu","Ihab F. Ilyas","Yunyao Li"],"pdf_url":"https://arxiv.org/pdf/2310.17119v1.pdf","comment":"EMNLP 2023 (Demonstration Track)"},{"id":"http://arxiv.org/abs/2310.16350v2","updated":"2023-10-26T03:26:30Z","published":"2023-10-25T04:22:40Z","title":"Unraveling Feature Extraction Mechanisms in Neural Networks","summary":" The underlying mechanism of neural networks in capturing precise knowledge\nhas been the subject of consistent research efforts. In this work, we propose a\ntheoretical approach based on Neural Tangent Kernels (NTKs) to investigate such\nmechanisms. 
Specifically, considering the infinite network width, we\nhypothesize the learning dynamics of target models may intuitively unravel the\nfeatures they acquire from training data, deepening our insights into their\ninternal mechanisms. We apply our approach to several fundamental models and\nreveal how these models leverage statistical features during gradient descent\nand how they are integrated into final decisions. We also discovered that the\nchoice of activation function can affect feature extraction. For instance, the\nuse of the \\textit{ReLU} activation function could potentially introduce a bias\nin features, providing a plausible explanation for its replacement with\nalternative functions in recent pre-trained language models. Additionally, we\nfind that while self-attention and CNN models may exhibit limitations in\nlearning n-grams, multiplication-based models seem to excel in this area. We\nverify these theoretical findings through experiments and find that they can be\napplied to analyze language modeling tasks, which can be regarded as a special\nvariant of classification. Our contributions offer insights into the roles and\ncapacities of fundamental components within large language models, thereby\naiding the broader understanding of these complex systems.\n","authors":["Xiaobing Sun","Jiaxi Li","Wei Lu"],"pdf_url":"https://arxiv.org/pdf/2310.16350v2.pdf","comment":"Accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.13588v2","updated":"2023-10-26T03:17:37Z","published":"2023-10-20T15:32:26Z","title":"Simultaneous Machine Translation with Tailored Reference","summary":" Simultaneous machine translation (SiMT) generates translation while reading\nthe whole source sentence. However, existing SiMT models are typically trained\nusing the same reference disregarding the varying amounts of available source\ninformation at different latency. 
Training the model with ground-truth at low\nlatency may introduce forced anticipations, whereas utilizing reference\nconsistent with the source word order at high latency results in performance\ndegradation. Consequently, it is crucial to train the SiMT model with\nappropriate reference that avoids forced anticipations during training while\nmaintaining high quality. In this paper, we propose a novel method that\nprovides tailored reference for the SiMT models trained at different latency by\nrephrasing the ground-truth. Specifically, we introduce the tailor, induced by\nreinforcement learning, to modify ground-truth to the tailored reference. The\nSiMT model is trained with the tailored reference and jointly optimized with\nthe tailor to enhance performance. Importantly, our method is applicable to a\nwide range of current SiMT approaches. Experiments on three translation tasks\ndemonstrate that our method achieves state-of-the-art performance in both fixed\nand adaptive policies.\n","authors":["Shoutao Guo","Shaolei Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.13588v2.pdf","comment":"Accepted to EMNLP 2023; 15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2305.12945v2","updated":"2023-10-26T03:15:41Z","published":"2023-05-22T11:45:42Z","title":"ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist\n Examination","summary":" As ChatGPT and GPT-4 spearhead the development of Large Language Models\n(LLMs), more researchers are investigating their performance across various\ntasks. But more research needs to be done on the interpretability capabilities\nof LLMs, that is, the ability to generate reasons after an answer has been\ngiven. Existing explanation datasets are mostly English-language general\nknowledge questions, which leads to insufficient thematic and linguistic\ndiversity. 
To address the language bias and lack of medical resources in\ngenerating rationales QA datasets, we present ExplainCPE (over 7k instances), a\nchallenging medical benchmark in Simplified Chinese. We analyzed the errors of\nChatGPT and GPT-4, pointing out the limitations of current LLMs in\nunderstanding text and computational reasoning. During the experiment, we also\nfound that different LLMs have different preferences for in-context learning.\nExplainCPE presents a significant challenge, but its potential for further\ninvestigation is promising, and it can be used to evaluate the ability of a\nmodel to generate explanations. AI safety and trustworthiness need more\nattention, and this work makes the first step to explore the medical\ninterpretability of LLMs. The dataset is available at\nhttps://github.com/HITsz-TMG/ExplainCPE.\n","authors":["Dongfang Li","Jindi Yu","Baotian Hu","Zhenran Xu","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.12945v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.16776v2","updated":"2023-10-26T03:13:26Z","published":"2023-10-25T17:06:42Z","title":"DEFT: Data Efficient Fine-Tuning for Large Language Models via\n Unsupervised Core-Set Selection","summary":" Recent advances have led to the availability of many pre-trained language\nmodels (PLMs); however, a question that remains is how much data is truly\nneeded to fine-tune PLMs for downstream tasks? In this work, we introduce DEFT,\na data-efficient fine-tuning framework that leverages unsupervised core-set\nselection to minimize the amount of data needed to fine-tune PLMs for\ndownstream tasks. We demonstrate the efficacy of our DEFT framework in the\ncontext of text-editing LMs, and compare to the state-of-the-art text-editing\nmodel, CoEDIT. 
Our quantitative and qualitative results demonstrate that DEFT\nmodels are just as accurate as CoEDIT while being finetuned on ~70% less data.\n","authors":["Devleena Das","Vivek Khetan"],"pdf_url":"https://arxiv.org/pdf/2310.16776v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2106.09462v2","updated":"2023-10-26T03:10:16Z","published":"2021-06-17T13:15:07Z","title":"pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks","summary":" In recent years, the extraction of opinions and information from\nuser-generated text has attracted a lot of interest, largely due to the\nunprecedented volume of content in Social Media. However, social researchers\nface some issues in adopting cutting-edge tools for these tasks, as they are\nusually behind commercial APIs, unavailable for other languages than English,\nor very complex to use for non-experts. To address these issues, we present\npysentimiento, a comprehensive multilingual Python toolkit designed for opinion\nmining and other Social NLP tasks. This open-source library brings\nstate-of-the-art models for Spanish, English, Italian, and Portuguese in an\neasy-to-use Python library, allowing researchers to leverage these techniques.\nWe present a comprehensive assessment of performance for several pre-trained\nlanguage models across a variety of tasks, languages, and datasets, including\nan evaluation of fairness in the results.\n","authors":["Juan Manuel Pérez","Mariela Rajngewerc","Juan Carlos Giudici","Damián A. 
Furman","Franco Luque","Laura Alonso Alemany","María Vanina Martínez"],"pdf_url":"https://arxiv.org/pdf/2106.09462v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11191v2","updated":"2023-10-26T02:54:41Z","published":"2023-10-17T12:14:03Z","title":"Medical Text Simplification: Optimizing for Readability with\n Unlikelihood Training and Reranked Beam Search Decoding","summary":" Text simplification has emerged as an increasingly useful application of AI\nfor bridging the communication gap in specialized fields such as medicine,\nwhere the lexicon is often dominated by technical jargon and complex\nconstructs. Despite notable progress, methods in medical simplification\nsometimes result in the generated text having lower quality and diversity. In\nthis work, we explore ways to further improve the readability of text\nsimplification in the medical domain. We propose (1) a new unlikelihood loss\nthat encourages generation of simpler terms and (2) a reranked beam search\ndecoding method that optimizes for simplicity, which achieve better performance\non readability metrics on three datasets. This study's findings offer promising\navenues for improving text simplification in the medical field.\n","authors":["Lorenzo Jaime Yu Flores","Heyuan Huang","Kejian Shi","Sophie Chheang","Arman Cohan"],"pdf_url":"https://arxiv.org/pdf/2310.11191v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2305.10355v3","updated":"2023-10-26T02:52:40Z","published":"2023-05-17T16:34:01Z","title":"Evaluating Object Hallucination in Large Vision-Language Models","summary":" Inspired by the superior language abilities of large language models (LLM),\nlarge vision-language models (LVLM) have been recently explored by integrating\npowerful LLMs for improving the performance on complex multimodal tasks.\nDespite the promising progress on LVLMs, we find that LVLMs suffer from the\nhallucination problem, i.e. 
they tend to generate objects that are inconsistent\nwith the target images in the descriptions. To investigate it, this work\npresents the first systematic study on object hallucination of LVLMs. We\nconduct the evaluation experiments on several representative LVLMs, and show\nthat they mostly suffer from severe object hallucination issue. We further\ndiscuss that the visual instructions may influence the hallucination, and find\nthat: objects that frequently occur in the visual instructions or co-occur with\nthe image objects, are obviously prone to be hallucinated by LVLMs. Besides, we\nfind that existing evaluation methods might be affected by the input\ninstructions and generation styles of LVLMs. Thus, we further design an\nimproved evaluation method for object hallucination by proposing a\npolling-based query method called POPE. Experiment results demonstrate that our\nPOPE can evaluate the object hallucination in a more stable and flexible way.\nOur codes and data are publicly available at https://github.com/RUCAIBox/POPE.\n","authors":["Yifan Li","Yifan Du","Kun Zhou","Jinpeng Wang","Wayne Xin Zhao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2305.10355v3.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.07224v4","updated":"2023-10-26T02:32:54Z","published":"2023-05-12T03:31:24Z","title":"Asymmetric feature interaction for interpreting model predictions","summary":" In natural language processing (NLP), deep neural networks (DNNs) could model\ncomplex interactions between context and have achieved impressive results on a\nrange of NLP tasks. Prior works on feature interaction attribution mainly focus\non studying symmetric interaction that only explains the additional influence\nof a set of words in combination, which fails to capture asymmetric influence\nthat contributes to model prediction. 
In this work, we propose an asymmetric\nfeature interaction attribution explanation model that aims to explore\nasymmetric higher-order feature interactions in the inference of deep neural\nNLP models. By representing our explanation with a directed interaction graph,\nwe experimentally demonstrate interpretability of the graph to discover\nasymmetric feature interactions. Experimental results on two sentiment\nclassification datasets show the superiority of our model against the\nstate-of-the-art feature interaction attribution methods in identifying\ninfluential features for model predictions. Our code is available at\nhttps://github.com/StillLu/ASIV.\n","authors":["Xiaolei Lu","Jianghong Ma","Haode Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.07224v4.pdf","comment":"Accepted by Findings of the Association for Computational\n Linguistics: ACL 2023 (long paper)"},{"id":"http://arxiv.org/abs/2305.16264v4","updated":"2023-10-26T02:24:24Z","published":"2023-05-25T17:18:55Z","title":"Scaling Data-Constrained Language Models","summary":" The current trend of scaling language models involves increasing both\nparameter count and training dataset size. Extrapolating this trend suggests\nthat training dataset size may soon be limited by the amount of text data\navailable on the internet. Motivated by this limit, we investigate scaling\nlanguage models in data-constrained regimes. Specifically, we run a large set\nof experiments varying the extent of data repetition and compute budget,\nranging up to 900 billion training tokens and 9 billion parameter models. We\nfind that with constrained data for a fixed compute budget, training with up to\n4 epochs of repeated data yields negligible changes to loss compared to having\nunique data. However, with more repetition, the value of adding compute\neventually decays to zero. We propose and empirically validate a scaling law\nfor compute optimality that accounts for the decreasing value of repeated\ntokens and excess parameters. 
Finally, we experiment with approaches mitigating\ndata scarcity, including augmenting the training dataset with code data or\nremoving commonly used filters. Models and datasets from our 400 training runs\nare freely available at https://github.com/huggingface/datablations.\n","authors":["Niklas Muennighoff","Alexander M. Rush","Boaz Barak","Teven Le Scao","Aleksandra Piktus","Nouamane Tazi","Sampo Pyysalo","Thomas Wolf","Colin Raffel"],"pdf_url":"https://arxiv.org/pdf/2305.16264v4.pdf","comment":"50 pages (9 main), 39 figures, 15 tables"},{"id":"http://arxiv.org/abs/2305.00633v3","updated":"2023-10-26T01:43:17Z","published":"2023-05-01T02:37:59Z","title":"Self-Evaluation Guided Beam Search for Reasoning","summary":" Breaking down a problem into intermediate steps has demonstrated impressive\nperformance in Large Language Model (LLM) reasoning. However, the growth of the\nreasoning chain introduces uncertainty and error accumulation, making it\nchallenging to elicit accurate final results. To tackle this challenge of\nuncertainty in multi-step reasoning, we introduce a stepwise self-evaluation\nmechanism to guide and calibrate the reasoning process of LLMs. We propose a\ndecoding algorithm integrating the self-evaluation guidance via stochastic beam\nsearch. The self-evaluation guidance serves as a better-calibrated automatic\ncriterion, facilitating an efficient search in the reasoning space and\nresulting in superior prediction quality. Stochastic beam search balances\nexploitation and exploration of the search space with temperature-controlled\nrandomness. Our approach surpasses the corresponding Codex-backboned baselines\nin few-shot accuracy by $6.34\\%$, $9.56\\%$, and $5.46\\%$ on the GSM8K, AQuA,\nand StrategyQA benchmarks, respectively. Experiment results with Llama-2 on\narithmetic reasoning demonstrate the efficiency of our method in outperforming\nthe baseline methods with comparable computational budgets. 
Further analysis in\nmulti-step reasoning finds our self-evaluation guidance pinpoints logic\nfailures and leads to higher consistency and robustness. Our code is publicly\navailable at https://guideddecoding.github.io/.\n","authors":["Yuxi Xie","Kenji Kawaguchi","Yiran Zhao","Xu Zhao","Min-Yen Kan","Junxian He","Qizhe Xie"],"pdf_url":"https://arxiv.org/pdf/2305.00633v3.pdf","comment":"NeurIPS 2023. 10 pages, 7 figures, 4 tables (33 pages, 14 figures, 15\n tables including references and appendices)"},{"id":"http://arxiv.org/abs/2307.04721v2","updated":"2023-10-26T01:27:29Z","published":"2023-07-10T17:32:13Z","title":"Large Language Models as General Pattern Machines","summary":" We observe that pre-trained large language models (LLMs) are capable of\nautoregressively completing complex token sequences -- from arbitrary ones\nprocedurally generated by probabilistic context-free grammars (PCFG), to more\nrich spatial patterns found in the Abstraction and Reasoning Corpus (ARC), a\ngeneral AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern\ncompletion proficiency can be partially retained even when the sequences are\nexpressed using tokens randomly sampled from the vocabulary. These results\nsuggest that without any additional training, LLMs can serve as general\nsequence modelers, driven by in-context learning. In this work, we investigate\nhow these zero-shot capabilities may be applied to problems in robotics -- from\nextrapolating sequences of numbers that represent states over time to complete\nsimple motions, to least-to-most prompting of reward-conditioned trajectories\nthat can discover and represent closed-loop policies (e.g., a stabilizing\ncontroller for CartPole). 
While difficult to deploy today for real systems due\nto latency, context size limitations, and compute costs, the approach of using\nLLMs to drive low-level control may provide an exciting glimpse into how the\npatterns among words could be transferred to actions.\n","authors":["Suvir Mirchandani","Fei Xia","Pete Florence","Brian Ichter","Danny Driess","Montserrat Gonzalez Arenas","Kanishka Rao","Dorsa Sadigh","Andy Zeng"],"pdf_url":"https://arxiv.org/pdf/2307.04721v2.pdf","comment":"21 pages, 25 figures. To appear at Conference on Robot Learning\n (CoRL) 2023"},{"id":"http://arxiv.org/abs/2305.13272v2","updated":"2023-10-26T01:25:23Z","published":"2023-05-22T17:35:05Z","title":"CLASS: A Design Framework for building Intelligent Tutoring Systems\n based on Learning Science principles","summary":" We present a design framework called Conversational Learning with Analytical\nStep-by-Step Strategies (CLASS) for building advanced Intelligent Tutoring\nSystems (ITS) powered by high-performance Large Language Models (LLMs). The\nCLASS framework empowers ITS with two key capabilities. First, through a\ncarefully curated scaffolding dataset, CLASS equips ITS with essential\nproblem-solving strategies, enabling it to provide tutor-like, step-by-step\nguidance to students. Second, by using a dynamic conversational dataset, CLASS\nassists ITS in facilitating natural language interactions, fostering engaging\nstudent-tutor conversations. The CLASS framework also provides valuable\ninsights into ITS' internal decision-making process which allows seamless\nintegration of user feedback, thus enabling continuous refinement and\nimprovement. We also present a proof-of-concept ITS, referred to as SPOCK,\nwhich is trained using the CLASS framework with a focus on introductory\ncollege-level biology content. A carefully constructed protocol was developed\nfor SPOCK's preliminary evaluation, examining aspects such as the factual\naccuracy and relevance of its responses. 
Experts in the field of biology\noffered favorable remarks, particularly highlighting SPOCK's capability to\nbreak down questions into manageable subproblems and provide encouraging\nresponses to students. Code and models are available at\nhttps://github.com/luffycodes/Tutorbot-Spock.\n","authors":["Shashank Sonkar","Naiming Liu","Debshila Basu Mallick","Richard G. Baraniuk"],"pdf_url":"https://arxiv.org/pdf/2305.13272v2.pdf","comment":"Paper accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17086v1","updated":"2023-10-26T01:08:47Z","published":"2023-10-26T01:08:47Z","title":"Transformers Learn Higher-Order Optimization Methods for In-Context\n Learning: A Study with Linear Models","summary":" Transformers are remarkably good at in-context learning (ICL) -- learning\nfrom demonstrations without parameter updates -- but how they perform ICL\nremains a mystery. Recent work suggests that Transformers may learn in-context\nby internally running Gradient Descent, a first-order optimization method. In\nthis paper, we instead demonstrate that Transformers learn to implement\nhigher-order optimization methods to perform ICL. Focusing on in-context linear\nregression, we show that Transformers learn to implement an algorithm very\nsimilar to Iterative Newton's Method, a higher-order optimization method,\nrather than Gradient Descent. Empirically, we show that predictions from\nsuccessive Transformer layers closely match different iterations of Newton's\nMethod linearly, with each middle layer roughly computing 3 iterations. In\ncontrast, exponentially more Gradient Descent steps are needed to match an\nadditional Transformer layer; this suggests that Transformers have a\ncomparable rate of convergence with high-order methods such as Iterative\nNewton, which are exponentially faster than Gradient Descent. We also show that\nTransformers can learn in-context on ill-conditioned data, a setting where\nGradient Descent struggles but Iterative Newton succeeds. 
Finally, we show\ntheoretical results which support our empirical findings and have a close\ncorrespondence with them: we prove that Transformers can implement $k$\niterations of Newton's method with $\\mathcal{O}(k)$ layers.\n","authors":["Deqing Fu","Tian-Qi Chen","Robin Jia","Vatsal Sharan"],"pdf_url":"https://arxiv.org/pdf/2310.17086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14331v2","updated":"2023-10-26T01:01:11Z","published":"2023-05-23T17:57:12Z","title":"What Else Do I Need to Know? The Effect of Background Information on\n Users' Reliance on QA Systems","summary":" NLP systems have shown impressive performance at answering questions by\nretrieving relevant context. However, with the increasingly large models, it is\nimpossible and often undesirable to constrain models' knowledge or reasoning to\nonly the retrieved context. This leads to a mismatch between the information\nthat the models access to derive the answer and the information that is\navailable to the user to assess the model predicted answer. In this work, we\nstudy how users interact with QA systems in the absence of sufficient\ninformation to assess their predictions. Further, we ask whether adding the\nrequisite background helps mitigate users' over-reliance on predictions. Our\nstudy reveals that users rely on model predictions even in the absence of\nsufficient information needed to assess the model's correctness. Providing the\nrelevant background, however, helps users better catch model errors, reducing\nover-reliance on incorrect predictions. On the flip side, background\ninformation also increases users' confidence in their accurate as well as\ninaccurate judgments. Our work highlights that supporting users' verification\nof QA predictions is an important, yet challenging, problem.\n","authors":["Navita Goyal","Eleftheria Briakou","Amanda Liu","Connor Baumler","Claire Bonial","Jeffrey Micher","Clare R. 
Voss","Marine Carpuat","Hal Daumé III"],"pdf_url":"https://arxiv.org/pdf/2305.14331v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14907v2","updated":"2023-10-26T01:01:01Z","published":"2023-05-24T08:58:28Z","title":"Coverage-based Example Selection for In-Context Learning","summary":" In-context learning (ICL), the ability of large language models to perform\nnovel tasks by conditioning on a prompt with a few task examples, requires\nthese examples to be informative about the test instance. The standard approach\nof independently ranking and selecting the most similar examples selects\nredundant examples while omitting important information. In this work, we show\nthat BERTScore-Recall (BSR) selects better examples that demonstrate more of\nthe salient aspects, e.g. reasoning patterns, of the test input. We further\nextend BSR and many standard metrics to easily optimizable set-level metrics,\ngiving still better coverage of those salient aspects. On 15 datasets spanning\n6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric\nfor in-context example selection across the board, and (2) for compositional\ntasks, set selection using Set-BSR outperforms independent ranking by up to 17\npoints on average and, despite being training-free, surpasses methods that\nleverage task or LLM-specific training.\n","authors":["Shivanshu Gupta","Matt Gardner","Sameer Singh"],"pdf_url":"https://arxiv.org/pdf/2305.14907v2.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.16218v2","updated":"2023-10-26T00:45:42Z","published":"2023-10-24T22:18:13Z","title":"Knowledge Editing for Large Language Models: A Survey","summary":" Large language models (LLMs) have recently transformed both the academic and\nindustrial landscapes due to their remarkable capacity to understand, analyze,\nand generate texts based on their vast knowledge and reasoning ability.\nNevertheless, one major drawback of LLMs is their substantial computational\ncost 
for pre-training due to their unprecedented amounts of parameters. The\ndisadvantage is exacerbated when new knowledge frequently needs to be\nintroduced into the pre-trained model. Therefore, it is imperative to develop\neffective and efficient techniques to update pre-trained LLMs. Traditional\nmethods encode new knowledge in pre-trained LLMs through direct fine-tuning.\nHowever, naively re-training LLMs can be computationally intensive and risks\ndegenerating valuable pre-trained knowledge irrelevant to the update in the\nmodel. Recently, Knowledge-based Model Editing (KME) has attracted increasing\nattention, which aims to precisely modify the LLMs to incorporate specific\nknowledge, without negatively influencing other irrelevant knowledge. In this\nsurvey, we aim to provide a comprehensive and in-depth overview of recent\nadvances in the field of KME. We first introduce a general formulation of KME\nto encompass different KME strategies. Afterward, we provide an innovative\ntaxonomy of KME techniques based on how the new knowledge is introduced into\npre-trained LLMs, and investigate existing KME strategies while analyzing key\ninsights, advantages, and limitations of methods from each category. Moreover,\nrepresentative metrics, datasets, and applications of KME are introduced\naccordingly. Finally, we provide an in-depth analysis regarding the\npracticality and remaining challenges of KME and suggest promising research\ndirections for further advancement in this field.\n","authors":["Song Wang","Yaochen Zhu","Haochen Liu","Zaiyi Zheng","Chen Chen","Jundong Li"],"pdf_url":"https://arxiv.org/pdf/2310.16218v2.pdf","comment":"31 pages"},{"id":"http://arxiv.org/abs/2310.17811v1","updated":"2023-10-26T23:06:38Z","published":"2023-10-26T23:06:38Z","title":"Style-Aware Radiology Report Generation with RadGraph and Few-Shot\n Prompting","summary":" Automatically generated reports from medical images promise to improve the\nworkflow of radiologists. 
Existing methods consider an image-to-report modeling\ntask by directly generating a fully-fledged report from an image. However, this\nconflates the content of the report (e.g., findings and their attributes) with\nits style (e.g., format and choice of words), which can lead to clinically\ninaccurate reports. To address this, we propose a two-step approach for\nradiology report generation. First, we extract the content from an image; then,\nwe verbalize the extracted content into a report that matches the style of a\nspecific radiologist. For this, we leverage RadGraph -- a graph representation\nof reports -- together with large language models (LLMs). In our quantitative\nevaluations, we find that our approach leads to beneficial performance. Our\nhuman evaluation with clinical raters highlights that the AI-generated reports\nare indistinguishably tailored to the style of individual radiologist despite\nleveraging only a few examples as context.\n","authors":["Benjamin Yan","Ruochen Liu","David E. Kuo","Subathra Adithan","Eduardo Pontes Reis","Stephen Kwak","Vasantha Kumar Venugopal","Chloe P. O'Connell","Agustina Saenz","Pranav Rajpurkar","Michael Moor"],"pdf_url":"https://arxiv.org/pdf/2310.17811v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2306.15162v2","updated":"2023-10-26T22:57:49Z","published":"2023-06-27T02:44:07Z","title":"YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English\n Parallel Corpus","summary":" Machine learning for sign languages is bottlenecked by data. In this paper,\nwe present YouTube-ASL, a large-scale, open-domain corpus of American Sign\nLanguage (ASL) videos and accompanying English captions drawn from YouTube.\nWith ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as\nlarge and has ~10x as many unique signers as the largest prior ASL dataset. 
We\ntrain baseline models for ASL to English translation on YouTube-ASL and\nevaluate them on How2Sign, where we achieve a new finetuned state of the art of\n12.39 BLEU and, for the first time, report zero-shot results.\n","authors":["David Uthus","Garrett Tanzer","Manfred Georg"],"pdf_url":"https://arxiv.org/pdf/2306.15162v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11129v2","updated":"2023-10-26T22:43:26Z","published":"2023-05-18T17:22:53Z","title":"mLongT5: A Multilingual and Efficient Text-To-Text Transformer for\n Longer Sequences","summary":" We present our work on developing a multilingual, efficient text-to-text\ntransformer that is suitable for handling long inputs. This model, called\nmLongT5, builds upon the architecture of LongT5, while leveraging the\nmultilingual datasets used for pretraining mT5 and the pretraining tasks of\nUL2. We evaluate this model on a variety of multilingual summarization and\nquestion-answering tasks, and the results show stronger performance for mLongT5\nwhen compared to existing multilingual models such as mBART or M-BERT.\n","authors":["David Uthus","Santiago Ontañón","Joshua Ainslie","Mandy Guo"],"pdf_url":"https://arxiv.org/pdf/2305.11129v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17802v1","updated":"2023-10-26T22:23:38Z","published":"2023-10-26T22:23:38Z","title":"TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the\n Automatic Ordering of Events in News Articles","summary":" Temporal relation extraction models have thus far been hindered by a number\nof issues in existing temporal relation-annotated news datasets, including: (1)\nlow inter-annotator agreement due to the lack of specificity of their\nannotation guidelines in terms of what counts as a temporal relation; (2) the\nexclusion of long-distance relations within a given document (those spanning\nacross different paragraphs); and (3) the exclusion of events that are not\ncentred on verbs. 
This paper aims to alleviate these issues by presenting a new\nannotation scheme that clearly defines the criteria based on which temporal\nrelations should be annotated. Additionally, the scheme includes events even if\nthey are not expressed as verbs (e.g., nominalised events). Furthermore, we\npropose a method for annotating all temporal relations -- including\nlong-distance ones -- which automates the process, hence reducing time and\nmanual effort on the part of annotators. The result is a new dataset, the\nTIMELINE corpus, in which improved inter-annotator agreement was obtained, in\ncomparison with previously reported temporal relation datasets. We report the\nresults of training and evaluating baseline temporal relation extraction models\non the new corpus, and compare them with results obtained on the widely used\nMATRES corpus.\n","authors":["Sarah Alsayyahi","Riza Batista-Navarro"],"pdf_url":"https://arxiv.org/pdf/2310.17802v1.pdf","comment":"Accepted for publication in EMNLP 2023: 13 pages, 3 figures and 14\n tables"},{"id":"http://arxiv.org/abs/2310.17793v1","updated":"2023-10-26T21:47:59Z","published":"2023-10-26T21:47:59Z","title":"\"You Are An Expert Linguistic Annotator\": Limits of LLMs as Analyzers of\n Abstract Meaning Representation","summary":" Large language models (LLMs) show amazing proficiency and fluency in the use\nof language. Does this mean that they have also acquired insightful linguistic\nknowledge about the language, to an extent that they can serve as an \"expert\nlinguistic annotator\"? In this paper, we examine the successes and limitations\nof the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaning\nstructure, focusing on the Abstract Meaning Representation (AMR; Banarescu et\nal. 2013) parsing formalism, which provides rich graphical representations of\nsentence meaning structure while abstracting away from surface forms. 
We\ncompare models' analysis of this semantic structure across two settings: 1)\ndirect production of AMR parses based on zero- and few-shot prompts, and 2)\nindirect partial reconstruction of AMR via metalinguistic natural language\nqueries (e.g., \"Identify the primary event of this sentence, and the predicate\ncorresponding to that event.\"). Across these settings, we find that models can\nreliably reproduce the basic format of AMR, and can often capture core event,\nargument, and modifier structure -- however, model outputs are prone to\nfrequent and major errors, and holistic analysis of parse acceptability shows\nthat even with few-shot demonstrations, models have virtually 0% success in\nproducing fully accurate parses. Eliciting natural language responses produces\nsimilar patterns of errors. Overall, our findings indicate that these models\nout-of-the-box can capture aspects of semantic structure, but there remain key\nlimitations in their ability to support fully accurate semantic analyses or\nparses.\n","authors":["Allyson Ettinger","Jena D. Hwang","Valentina Pyatkin","Chandra Bhagavatula","Yejin Choi"],"pdf_url":"https://arxiv.org/pdf/2310.17793v1.pdf","comment":"EMNLP 2023 Findings (short)"},{"id":"http://arxiv.org/abs/2310.17788v1","updated":"2023-10-26T21:36:06Z","published":"2023-10-26T21:36:06Z","title":"Utilizing Language Models for Energy Load Forecasting","summary":" Energy load forecasting plays a crucial role in optimizing resource\nallocation and managing energy consumption in buildings and cities. In this\npaper, we propose a novel approach that leverages language models for energy\nload forecasting. We employ prompting techniques to convert energy consumption\ndata into descriptive sentences, enabling fine-tuning of language models. By\nadopting an autoregressive generating approach, our proposed method enables\npredictions of various horizons of future energy load consumption. 
Through\nextensive experiments on real-world datasets, we demonstrate the effectiveness\nand accuracy of our proposed method. Our results indicate that utilizing\nlanguage models for energy load forecasting holds promise for enhancing energy\nefficiency and facilitating intelligent decision-making in energy systems.\n","authors":["Hao Xue","Flora D. Salim"],"pdf_url":"https://arxiv.org/pdf/2310.17788v1.pdf","comment":"BuildSys 2023 Accepted"},{"id":"http://arxiv.org/abs/2310.17787v1","updated":"2023-10-26T21:32:24Z","published":"2023-10-26T21:32:24Z","title":"Evaluation of large language models using an Indian language LGBTI+\n lexicon","summary":" Large language models (LLMs) are typically evaluated on the basis of\ntask-based benchmarks such as MMLU. Such benchmarks do not examine responsible\nbehaviour of LLMs in specific contexts. This is particularly true in the LGBTI+\ncontext where social stereotypes may result in variation in LGBTI+ terminology.\nTherefore, domain-specific lexicons or dictionaries may be useful as a\nrepresentative list of words against which the LLM's behaviour needs to be\nevaluated. This paper presents a methodology for evaluation of LLMs using an\nLGBTI+ lexicon in Indian languages. The methodology consists of four steps:\nformulating NLP tasks relevant to the expected behaviour, creating prompts that\ntest LLMs, using the LLMs to obtain the output and, finally, manually\nevaluating the results. Our qualitative analysis shows that the three LLMs we\nexperiment on are unable to detect underlying hateful content. Similarly, we\nobserve limitations in using machine translation as means to evaluate natural\nlanguage understanding in languages other than English. The methodology\npresented in this paper can be useful for LGBTI+ lexicons in other languages as\nwell as other domain-specific lexicons. 
The work done in this paper opens\navenues for responsible behaviour of LLMs, as demonstrated in the context of\nprevalent social perception of the LGBTI+ community.\n","authors":["Aditya Joshi","Shruta Rawat","Alpana Dange"],"pdf_url":"https://arxiv.org/pdf/2310.17787v1.pdf","comment":"Selected for publication in the AI Ethics Journal published by the\n Artificial Intelligence Robotics Ethics Society (AIRES)"},{"id":"http://arxiv.org/abs/2309.04679v2","updated":"2023-10-26T21:26:16Z","published":"2023-09-09T04:27:18Z","title":"Embedding structure matters: Comparing methods to adapt multilingual\n vocabularies to new languages","summary":" Pre-trained multilingual language models underpin a large portion of modern\nNLP tools outside of English. A strong baseline for specializing these models\nfor specific languages is Language-Adaptive Pre-Training (LAPT). However,\nretaining a large cross-lingual vocabulary and embedding matrix comes at\nconsiderable excess computational cost during adaptation. In this study, we\npropose several simple techniques to replace a cross-lingual vocabulary with a\ncompact, language-specific one. Namely, we address strategies for\nre-initializing the token embedding matrix after vocabulary specialization. We\nthen provide a systematic experimental comparison of our techniques, in\naddition to the recently-proposed Focus method. We demonstrate that: 1)\nEmbedding-replacement techniques in the monolingual transfer literature are\ninadequate for adapting multilingual models. 2) Replacing cross-lingual\nvocabularies with smaller specialized ones provides an efficient method to\nimprove performance in low-resource languages. 3) Simple embedding\nre-initialization techniques based on script-wise sub-distributions rival\ntechniques such as Focus, which rely on similarity scores obtained from an\nauxiliary model.\n","authors":["C. M. 
Downey","Terra Blevins","Nora Goldfine","Shane Steinert-Threlkeld"],"pdf_url":"https://arxiv.org/pdf/2309.04679v2.pdf","comment":"Camera-ready for Proceedings of the 3rd Workshop on Multilingual\n Representation Learning"},{"id":"http://arxiv.org/abs/2303.02260v2","updated":"2023-10-26T21:24:47Z","published":"2023-03-03T23:19:42Z","title":"Learning to reason over visual objects","summary":" A core component of human intelligence is the ability to identify abstract\npatterns inherent in complex, high-dimensional perceptual data, as exemplified\nby visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated\nby the goal of designing AI systems with this capacity, recent work has focused\non evaluating whether neural networks can learn to solve RPM-like problems.\nPrevious work has generally found that strong performance on these problems\nrequires the incorporation of inductive biases that are specific to the RPM\nproblem format, raising the question of whether such models might be more\nbroadly useful. Here, we investigated the extent to which a general-purpose\nmechanism for processing visual scenes in terms of objects might help promote\nabstract visual reasoning. We found that a simple model, consisting only of an\nobject-centric encoder and a transformer reasoning module, achieved\nstate-of-the-art results on both of two challenging RPM-like benchmarks (PGM\nand I-RAVEN), as well as a novel benchmark with greater visual complexity\n(CLEVR-Matrices). These results suggest that an inductive bias for\nobject-centric processing may be a key component of abstract visual reasoning,\nobviating the need for problem-specific inductive biases.\n","authors":["Shanka Subhra Mondal","Taylor Webb","Jonathan D. 
Cohen"],"pdf_url":"https://arxiv.org/pdf/2303.02260v2.pdf","comment":"ICLR 2023"},{"id":"http://arxiv.org/abs/2306.09539v3","updated":"2023-10-26T21:09:43Z","published":"2023-06-15T22:48:08Z","title":"Block-State Transformers","summary":" State space models (SSMs) have shown impressive results on tasks that require\nmodeling long-range dependencies and efficiently scale to long sequences owing\nto their subquadratic runtime complexity. Originally designed for continuous\nsignals, SSMs have shown superior performance on a plethora of tasks, in vision\nand audio; however, SSMs still lag Transformer performance in Language Modeling\ntasks. In this work, we propose a hybrid layer named Block-State Transformer\n(BST), that internally combines an SSM sublayer for long-range\ncontextualization, and a Block Transformer sublayer for short-term\nrepresentation of sequences. We study three different, and completely\nparallelizable, variants that integrate SSMs and block-wise attention. We show\nthat our model outperforms similar Transformer-based architectures on language\nmodeling perplexity and generalizes to longer sequences. In addition, the\nBlock-State Transformer demonstrates more than tenfold increase in speed at the\nlayer level compared to the Block-Recurrent Transformer when model\nparallelization is employed.\n","authors":["Mahan Fathi","Jonathan Pilault","Pierre-Luc Bacon","Christopher Pal","Orhan Firat","Ross Goroshin"],"pdf_url":"https://arxiv.org/pdf/2306.09539v3.pdf","comment":"NeurIPS'23 - Thirty-seventh Conference on Neural Information\n Processing Systems"},{"id":"http://arxiv.org/abs/2310.17774v1","updated":"2023-10-26T20:55:29Z","published":"2023-10-26T20:55:29Z","title":"Words, Subwords, and Morphemes: What Really Matters in the\n Surprisal-Reading Time Relationship?","summary":" An important assumption that comes with using LLMs on psycholinguistic data\nhas gone unverified. 
LLM-based predictions are based on subword tokenization,\nnot decomposition of words into morphemes. Does that matter? We carefully test\nthis by comparing surprisal estimates using orthographic, morphological, and\nBPE tokenization against reading time data. Our results replicate previous\nfindings and provide evidence that in the aggregate, predictions using BPE\ntokenization do not suffer relative to morphological and orthographic\nsegmentation. However, a finer-grained analysis points to potential issues with\nrelying on BPE-based tokenization, as well as providing promising results\ninvolving morphologically-aware surprisal estimates and suggesting a new method\nfor evaluating morphological prediction.\n","authors":["Sathvik Nair","Philip Resnik"],"pdf_url":"https://arxiv.org/pdf/2310.17774v1.pdf","comment":"Accepted to Findings of EMNLP 2023; 10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2305.13661v2","updated":"2023-10-26T20:45:39Z","published":"2023-05-23T04:10:26Z","title":"On the Risk of Misinformation Pollution with Large Language Models","summary":" In this paper, we comprehensively investigate the potential misuse of modern\nLarge Language Models (LLMs) for generating credible-sounding misinformation\nand its subsequent impact on information-intensive applications, particularly\nOpen-Domain Question Answering (ODQA) systems. We establish a threat model and\nsimulate potential misuse scenarios, both unintentional and intentional, to\nassess the extent to which LLMs can be utilized to produce misinformation. Our\nstudy reveals that LLMs can act as effective misinformation generators, leading\nto a significant degradation in the performance of ODQA systems. To mitigate\nthe harm caused by LLM-generated misinformation, we explore three defense\nstrategies: prompting, misinformation detection, and majority voting. 
While\ninitial results show promising trends for these defensive strategies, much more\nwork needs to be done to address the challenge of misinformation pollution. Our\nwork highlights the need for further research and interdisciplinary\ncollaboration to address LLM-generated misinformation and to promote\nresponsible use of LLMs.\n","authors":["Yikang Pan","Liangming Pan","Wenhu Chen","Preslav Nakov","Min-Yen Kan","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.13661v2.pdf","comment":"EMNLP 2023 (Findings; Long Paper)"},{"id":"http://arxiv.org/abs/2305.05480v3","updated":"2023-10-26T20:36:36Z","published":"2023-05-09T14:30:29Z","title":"Effects of sub-word segmentation on performance of transformer language\n models","summary":" Language modeling is a fundamental task in natural language processing, which\nhas been thoroughly explored with various architectures and hyperparameters.\nHowever, few studies focus on the effect of sub-word segmentation on the\nperformance of language models (LMs). In this paper, we compare GPT and BERT\nmodels trained with the statistical segmentation algorithm BPE vs. two\nunsupervised algorithms for morphological segmentation -- Morfessor and\nStateMorph. We train the models for several languages -- including ones with\nvery rich morphology -- and compare their performance with different\nsegmentation algorithms, vocabulary sizes, and model sizes. The results show\nthat training with morphological segmentation allows the LMs to: 1. achieve\nlower perplexity, 2. converge more efficiently in terms of training time, and\n3. achieve equivalent or better evaluation scores on downstream tasks. Lastly,\nwe show 4. that LMs of smaller size using morphological segmentation can\nperform comparably to models of larger size trained with BPE -- both in terms\nof (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) impact\non sustainability of LMs, since they reduce the model cost: size and\ncomputation time. 
While (2) reduces cost only in the training phase, (4) does\nso also in the inference phase.\n","authors":["Jue Hou","Anisia Katinskaia","Anh-Duc Vu","Roman Yangarber"],"pdf_url":"https://arxiv.org/pdf/2305.05480v3.pdf","comment":"This submission published in EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17770v1","updated":"2023-10-26T20:27:16Z","published":"2023-10-26T20:27:16Z","title":"GROOViST: A Metric for Grounding Objects in Visual Storytelling","summary":" A proper evaluation of stories generated for a sequence of images -- the task\ncommonly referred to as visual storytelling -- must consider multiple aspects,\nsuch as coherence, grammatical correctness, and visual grounding. In this work,\nwe focus on evaluating the degree of grounding, that is, the extent to which a\nstory is about the entities shown in the images. We analyze current metrics,\nboth designed for this purpose and for general vision-text alignment. Given\ntheir observed shortcomings, we propose a novel evaluation tool, GROOViST, that\naccounts for cross-modal dependencies, temporal misalignments (the fact that\nthe order in which entities appear in the story and the image sequence may not\nmatch), and human intuitions on visual grounding. An additional advantage of\nGROOViST is its modular design, where the contribution of each component can be\nassessed and interpreted individually.\n","authors":["Aditya K Surikuchi","Sandro Pezzelle","Raquel Fernández"],"pdf_url":"https://arxiv.org/pdf/2310.17770v1.pdf","comment":"In EMNLP 2023 main conference proceedings (to appear)"},{"id":"http://arxiv.org/abs/2310.17769v1","updated":"2023-10-26T20:27:03Z","published":"2023-10-26T20:27:03Z","title":"Social Contract AI: Aligning AI Assistants with Implicit Group Norms","summary":" We explore the idea of aligning an AI assistant by inverting a model of\nusers' (unknown) preferences from observed interactions. 
To validate our\nproposal, we run proof-of-concept simulations in the economic ultimatum game,\nformalizing user preferences as policies that guide the actions of simulated\nplayers. We find that the AI assistant accurately aligns its behavior to match\nstandard policies from the economic literature (e.g., selfish, altruistic).\nHowever, the assistant's learned policies lack robustness and exhibit limited\ngeneralization in an out-of-distribution setting when confronted with a\ncurrency (e.g., grams of medicine) that was not included in the assistant's\ntraining distribution. Additionally, we find that when there is inconsistency\nin the relationship between language use and an unknown policy (e.g., an\naltruistic policy combined with rude language), the assistant's learning of the\npolicy is slowed. Overall, our preliminary results suggest that developing\nsimulation frameworks in which AI assistants need to infer preferences from\ndiverse users can provide a valuable approach for studying practical alignment\nquestions.\n","authors":["Jan-Philipp Fränken","Sam Kwok","Peixuan Ye","Kanishk Gandhi","Dilip Arumugam","Jared Moore","Alex Tamkin","Tobias Gerstenberg","Noah D. Goodman"],"pdf_url":"https://arxiv.org/pdf/2310.17769v1.pdf","comment":"SoLaR NeurIPS 2023 Workshop (https://solar-neurips.github.io/)"},{"id":"http://arxiv.org/abs/2305.01028v2","updated":"2023-10-26T20:19:37Z","published":"2023-05-01T18:36:06Z","title":"Company classification using zero-shot learning","summary":" In recent years, natural language processing (NLP) has become increasingly\nimportant in a variety of business applications, including sentiment analysis,\ntext classification, and named entity recognition. In this paper, we propose an\napproach for company classification using NLP and zero-shot learning. 
Our\nmethod utilizes pre-trained transformer models to extract features from company\ndescriptions, and then applies zero-shot learning to classify companies into\nrelevant categories without the need for specific training data for each\ncategory. We evaluate our approach on a dataset obtained through the Wharton\nResearch Data Services (WRDS), which comprises textual descriptions of publicly\ntraded companies. We demonstrate that the approach can streamline the process\nof company classification, thereby reducing the time and resources required in\ntraditional approaches such as the Global Industry Classification Standard\n(GICS). The results show that this method has potential for automation of\ncompany classification, making it a promising avenue for future research in\nthis area.\n","authors":["Maryan Rizinski","Andrej Jankov","Vignesh Sankaradas","Eugene Pinsky","Igor Miskovski","Dimitar Trajanov"],"pdf_url":"https://arxiv.org/pdf/2305.01028v2.pdf","comment":"6 pages, 1 figure, 4 tables, conference paper, published in the 20th\n International Conference on Informatics and Information Technologies (CIIT\n 2023)"},{"id":"http://arxiv.org/abs/2310.09520v3","updated":"2023-10-26T20:04:47Z","published":"2023-10-14T07:19:47Z","title":"Reward-Augmented Decoding: Efficient Controlled Text Generation With a\n Unidirectional Reward Model","summary":" While large language models have proven effective in a huge range of\ndownstream applications, they often generate text that is problematic or lacks\na desired attribute. In this paper, we introduce Reward-Augmented Decoding\n(RAD), a text generation procedure that uses a small unidirectional reward\nmodel to encourage a language model to generate text that has certain\nproperties. Specifically, RAD uses the reward model to score generations as\nthey are produced and rescales sampling probabilities to favor high-reward\ntokens. 
By using a unidirectional reward model, RAD can cache activations from\nprior generation steps to decrease computational overhead. Through experiments\non generating non-toxic and sentiment-controlled text, we demonstrate that RAD\nperforms best among methods that change only the generation procedure and\nmatches the performance of state-of-the-art methods that involve re-training\nthe language model. We further validate that RAD is effective on very large\nlanguage models while incurring a minimal computational overhead.\n","authors":["Haikang Deng","Colin Raffel"],"pdf_url":"https://arxiv.org/pdf/2310.09520v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.09674v2","updated":"2023-10-26T19:50:29Z","published":"2022-12-19T18:00:09Z","title":"LR-Sum: Summarization for Less-Resourced Languages","summary":" This preprint describes work in progress on LR-Sum, a new\npermissively-licensed dataset created with the goal of enabling further\nresearch in automatic summarization for less-resourced languages. LR-Sum\ncontains human-written summaries for 40 languages, many of which are\nless-resourced. We describe our process for extracting and filtering the\ndataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). 
The\nsource data is public domain newswire collected from Voice of America\nwebsites, and LR-Sum is released under a Creative Commons license (CC BY 4.0),\nmaking it one of the most openly-licensed multilingual summarization datasets.\nWe describe how we plan to use the data for modeling experiments and discuss\nlimitations of the dataset.\n","authors":["Chester Palen-Michel","Constantine Lignos"],"pdf_url":"https://arxiv.org/pdf/2212.09674v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17750v1","updated":"2023-10-26T19:45:06Z","published":"2023-10-26T19:45:06Z","title":"A Framework for Automated Measurement of Responsible AI Harms in\n Generative AI Applications","summary":" We present a framework for the automated measurement of responsible AI (RAI)\nmetrics for large language models (LLMs) and associated products and services.\nOur framework for automatically measuring harms from LLMs builds on existing\ntechnical and sociotechnical expertise and leverages the capabilities of\nstate-of-the-art LLMs, such as GPT-4. We use this framework to run through\nseveral case studies investigating how different LLMs may violate a range of\nRAI-related principles. The framework may be employed alongside domain-specific\nsociotechnical expertise to create measurements for new harm areas in the\nfuture. 
By implementing this framework, we aim to enable more advanced harm\nmeasurement efforts and further the responsible use of LLMs.\n","authors":["Ahmed Magooda","Alec Helyar","Kyle Jackson","David Sullivan","Chad Atalla","Emily Sheng","Dan Vann","Richard Edgar","Hamid Palangi","Roman Lutz","Hongliang Kong","Vincent Yun","Eslam Kamal","Federico Zarfati","Hanna Wallach","Sarah Bird","Mei Chen"],"pdf_url":"https://arxiv.org/pdf/2310.17750v1.pdf","comment":"This is a living document"},{"id":"http://arxiv.org/abs/2310.17749v1","updated":"2023-10-26T19:44:06Z","published":"2023-10-26T19:44:06Z","title":"Salespeople vs SalesBot: Exploring the Role of Educational Value in\n Conversational Recommender Systems","summary":" Making big purchases requires consumers to research or consult a salesperson\nto gain domain expertise. However, existing conversational recommender systems\n(CRS) often overlook users' lack of background knowledge, focusing solely on\ngathering preferences. In this work, we define a new problem space for\nconversational agents that aim to provide both product recommendations and\neducational value through mixed-type mixed-initiative dialog. We introduce\nSalesOps, a framework that facilitates the simulation and evaluation of such\nsystems by leveraging recent advancements in large language models (LLMs). We\nbuild SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate\neither side of the framework. A comprehensive human study compares SalesBot\nagainst professional salespeople, revealing that although SalesBot approaches\nprofessional performance in terms of fluency and informativeness, it lags\nbehind in recommendation quality. We emphasize the distinct limitations both\nface in providing truthful information, highlighting the challenges of ensuring\nfaithfulness in the CRS context. 
We release our code and make all data\navailable.\n","authors":["Lidiya Murakhovs'ka","Philippe Laban","Tian Xie","Caiming Xiong","Chien-Sheng Wu"],"pdf_url":"https://arxiv.org/pdf/2310.17749v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17743v1","updated":"2023-10-26T19:31:22Z","published":"2023-10-26T19:31:22Z","title":"StyleBART: Decorate Pretrained Model with Style Adapters for\n Unsupervised Stylistic Headline Generation","summary":" Stylistic headline generation is the task to generate a headline that not\nonly summarizes the content of an article, but also reflects a desired style\nthat attracts users. As style-specific article-headline pairs are scarce,\nprevious researches focus on unsupervised approaches with a standard headline\ngeneration dataset and mono-style corpora. In this work, we follow this line\nand propose StyleBART, an unsupervised approach for stylistic headline\ngeneration. Our method decorates the pretrained BART model with adapters that\nare responsible for different styles and allows the generation of headlines\nwith diverse styles by simply switching the adapters. Different from previous\nworks, StyleBART separates the task of style learning and headline generation,\nmaking it possible to freely combine the base model and the style adapters\nduring inference. We further propose an inverse paraphrasing task to enhance\nthe style adapters. 
Extensive automatic and human evaluations show that\nStyleBART achieves new state-of-the-art performance in the unsupervised\nstylistic headline generation task, producing high-quality headlines with the\ndesired style.\n","authors":["Hanqing Wang","Yajing Luo","Boya Xiong","Guanhua Chen","Yun Chen"],"pdf_url":"https://arxiv.org/pdf/2310.17743v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14695v2","updated":"2023-10-26T19:02:27Z","published":"2023-05-24T03:59:18Z","title":"A Causal View of Entity Bias in (Large) Language Models","summary":" Entity bias widely affects pretrained (large) language models, causing them\nto rely on (biased) parametric knowledge to make unfaithful predictions.\nAlthough causality-inspired methods have shown great potential to mitigate\nentity bias, it is hard to precisely estimate the parameters of underlying\ncausal models in practice. The rise of black-box LLMs also makes the situation\neven worse, because of their inaccessible parameters and uncalibrated logits.\nTo address these problems, we propose a specific structured causal model (SCM)\nwhose parameters are comparatively easier to estimate. Building upon this SCM,\nwe propose causal intervention techniques to mitigate entity bias for both\nwhite-box and black-box settings. The proposed causal intervention perturbs the\noriginal entity with neighboring entities. This intervention reduces specific\nbiasing information pertaining to the original entity while still preserving\nsufficient semantic information from similar entities. Under the white-box\nsetting, our training-time intervention improves OOD performance of PLMs on\nrelation extraction (RE) and machine reading comprehension (MRC) by 5.7 points\nand by 9.1 points, respectively. 
Under the black-box setting, our in-context\nintervention effectively reduces the entity-based knowledge conflicts of\nGPT-3.5, achieving up to 20.5 points of improvement of exact match accuracy on\nMRC and up to 17.6 points of reduction in memorization ratio on RE. Our code is\navailable at https://github.com/luka-group/Causal-View-of-Entity-Bias.\n","authors":["Fei Wang","Wenjie Mo","Yiwei Wang","Wenxuan Zhou","Muhao Chen"],"pdf_url":"https://arxiv.org/pdf/2305.14695v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17737v1","updated":"2023-10-26T18:58:52Z","published":"2023-10-26T18:58:52Z","title":"ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural\n Languages","summary":" Building multi-modal language models has been a trend in the recent years,\nwhere additional modalities such as image, video, speech, etc. are jointly\nlearned along with natural languages (i.e., textual information). Despite the\nsuccess of these multi-modal language models with different modalities, there\nis no existing solution for neural network architectures and natural languages.\nProviding neural architectural information as a new modality allows us to\nprovide fast architecture-2-text and text-2-architecture retrieval/generation\nservices on the cloud with a single inference. Such solution is valuable in\nterms of helping beginner and intermediate ML users to come up with better\nneural architectures or AutoML approaches with a simple text query. In this\npaper, we propose ArchBERT, a bi-modal model for joint learning and\nunderstanding of neural architectures and natural languages, which opens up new\navenues for research in this area. We also introduce a pre-training strategy\nnamed Masked Architecture Modeling (MAM) for a more generalized joint learning.\nMoreover, we introduce and publicly release two new bi-modal datasets for\ntraining and validating our methods. 
The ArchBERT's performance is verified\nthrough a set of numerical experiments on different downstream tasks such as\narchitecture-oriented reasoning, question answering, and captioning\n(summarization). Datasets, codes, and demos are available supplementary\nmaterials.\n","authors":["Mohammad Akbari","Saeed Ranjbar Alvar","Behnam Kamranian","Amin Banitalebi-Dehkordi","Yong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17737v1.pdf","comment":"CoNLL 2023"},{"id":"http://arxiv.org/abs/2310.17734v1","updated":"2023-10-26T18:50:04Z","published":"2023-10-26T18:50:04Z","title":"Investigating Multilingual Coreference Resolution by Universal\n Annotations","summary":" Multilingual coreference resolution (MCR) has been a long-standing and\nchallenging task. With the newly proposed multilingual coreference dataset,\nCorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by\nusing its harmonized universal morphosyntactic and coreference annotations.\nFirst, we study coreference by examining the ground truth data at different\nlinguistic levels, namely mention, entity and document levels, and across\ndifferent genres, to gain insights into the characteristics of coreference\nacross multiple languages. Second, we perform an error analysis of the most\nchallenging cases that the SotA system fails to resolve in the CRAC 2022 shared\ntask using the universal annotations. Last, based on this analysis, we extract\nfeatures from universal morphosyntactic annotations and integrate these\nfeatures into a baseline system to assess their potential benefits for the MCR\ntask. 
Our results show that our best configuration of features improves the\nbaseline by 0.9% F1 score.\n","authors":["Haixia Chai","Michael Strube"],"pdf_url":"https://arxiv.org/pdf/2310.17734v1.pdf","comment":"Accepted at Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.17723v1","updated":"2023-10-26T18:34:41Z","published":"2023-10-26T18:34:41Z","title":"ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training\n Quantization Framework for W8A8 Transformers","summary":" Quantization techniques are pivotal in reducing the memory and computational\ndemands of deep neural network inference. Existing solutions, such as\nZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook\ncrucial memory-bounded operators and the complexities of per-token\nquantization. Addressing these gaps, we present a novel, fully\nhardware-enhanced robust optimized post-training W8A8 quantization framework,\nZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and\ncompute-intensive operators, aiming for optimal hardware performance.\nAdditionally, it offers flexibility by allowing specific INT8 modules to switch\nto FP16/BF16 mode, enhancing accuracy.\n","authors":["Zhewei Yao","Reza Yazdani Aminabadi","Stephen Youn","Xiaoxia Wu","Elton Zheng","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2310.17723v1.pdf","comment":"8 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.17722v1","updated":"2023-10-26T18:32:05Z","published":"2023-10-26T18:32:05Z","title":"Large Language Models as Generalizable Policies for Embodied Tasks","summary":" We show that large language models (LLMs) can be adapted to be generalizable\npolicies for embodied visual tasks. Our approach, called Large LAnguage model\nReinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take\nas input text instructions and visual egocentric observations and output\nactions directly in the environment. 
Using reinforcement learning, we train\nLLaRP to see and act solely through environmental interactions. We show that\nLLaRP is robust to complex paraphrasings of task instructions and can\ngeneralize to new tasks that require novel optimal behavior. In particular, on\n1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other\ncommon learned baselines or zero-shot applications of LLMs. Finally, to aid the\ncommunity in studying language conditioned, massively multi-task, embodied AI\nproblems we release a novel benchmark, Language Rearrangement, consisting of\n150,000 training and 1,000 testing tasks for language-conditioned\nrearrangement. Video examples of LLaRP in unseen Language Rearrangement\ninstructions are at https://llm-rl.github.io.\n","authors":["Andrew Szot","Max Schwarzer","Harsh Agrawal","Bogdan Mazoure","Walter Talbott","Katherine Metcalf","Natalie Mackraz","Devon Hjelm","Alexander Toshev"],"pdf_url":"https://arxiv.org/pdf/2310.17722v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17721v1","updated":"2023-10-26T18:30:37Z","published":"2023-10-26T18:30:37Z","title":"From Transcripts to Insights: Uncovering Corporate Risks Using\n Generative AI","summary":" We explore the value of generative AI tools, such as ChatGPT, in helping\ninvestors uncover dimensions of corporate risk. We develop and validate\nfirm-level measures of risk exposure to political, climate, and AI-related\nrisks. Using the GPT 3.5 model to generate risk summaries and assessments from\nthe context provided by earnings call transcripts, we show that GPT-based\nmeasures possess significant information content and outperform the existing\nrisk measures in predicting (abnormal) firm-level volatility and firms' choices\nsuch as investment and innovation. Importantly, information in risk assessments\ndominates that in risk summaries, establishing the value of general AI\nknowledge. 
We also find that generative AI is effective at detecting emerging\nrisks, such as AI risk, which has soared in recent quarters. Our measures\nperform well both within and outside the GPT's training window and are priced\nin equity markets. Taken together, an AI-based approach to risk measurement\nprovides useful insights to users of corporate disclosures at a low cost.\n","authors":["Alex Kim","Maximilian Muhn","Valeri Nikolaev"],"pdf_url":"https://arxiv.org/pdf/2310.17721v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16564v3","updated":"2023-10-26T18:27:00Z","published":"2023-06-28T21:11:15Z","title":"Automatic Calibration and Error Correction for Generative Large Language\n Models via Pareto Optimal Self-Supervision","summary":" Generative Large language models (LLMs) have demonstrated remarkable\ncapabilities for a wide range of applications, but reducing ungrounded or\nerroneous responses remains a major growth area. Unlike task-specific models,\nthere lack an effective method to calibrate the confidence level of LLM\nresponses to indicate potential errors and facilitate human-in-the-loop\nverification. An important source of calibration stems from expert-stipulated\nprogrammatic supervision, which is often available at low cost but has its own\nlimitations such as noise and coverage. In this paper, we introduce a Pareto\noptimal self-supervision framework that can leverage available programmatic\nsupervision to systematically calibrate LLM responses by producing a risk score\nfor every LLM response, without any additional manual efforts. This is\naccomplished by learning a harmonizer model to align with LLM output as well as\nother weak supervision sources. The model assigns higher risk scores to more\nuncertain LLM responses and facilitate error correction. 
Experiments on\nstandard relation extraction and classification tasks in biomedical and general\ndomains demonstrate that the proposed risk score is highly correlated with the\nactual LLM error rate. By using a dynamic prompting strategy based on the risk\nscore, we observed significant accuracy improvement for off-the-shelf LLMs,\nboosting GPT-3.5 results past state-of-the-art (SOTA) weak supervision model\nand GPT-4 results past SOTA supervised results on challenging evaluation\ndatasets.\n","authors":["Theodore Zhao","Mu Wei","J. Samuel Preston","Hoifung Poon"],"pdf_url":"https://arxiv.org/pdf/2306.16564v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17715v1","updated":"2023-10-26T18:22:13Z","published":"2023-10-26T18:22:13Z","title":"Outlier Dimensions Encode Task-Specific Knowledge","summary":" Representations from large language models (LLMs) are known to be dominated\nby a small subset of dimensions with exceedingly high variance. Previous works\nhave argued that although ablating these outlier dimensions in LLM\nrepresentations hurts downstream performance, outlier dimensions are\ndetrimental to the representational quality of embeddings. In this study, we\ninvestigate how fine-tuning impacts outlier dimensions and show that 1) outlier\ndimensions that occur in pre-training persist in fine-tuned models and 2) a\nsingle outlier dimension can complete downstream tasks with a minimal error\nrate. 
Our results suggest that outlier dimensions can encode crucial\ntask-specific knowledge and that the value of a representation in a single\noutlier dimension drives downstream model decisions.\n","authors":["William Rudman","Catherine Chen","Carsten Eickhoff"],"pdf_url":"https://arxiv.org/pdf/2310.17715v1.pdf","comment":"Camera-ready version for EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.14282v3","updated":"2023-10-26T18:21:30Z","published":"2023-05-23T17:27:22Z","title":"INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained\n Feedback","summary":" Automatically evaluating the quality of language generation is critical.\nAlthough recent learned metrics show high correlation with human judgement,\nthese metrics can not explain their verdict or associate the scores with\ndefects in generated text. To address this limitation, we present\nInstructScore, an explainable evaluation metric for text generation. By\nharnessing both explicit human instruction and the implicit knowledge of GPT-4,\nwe fine-tune a text evaluation metric based on LLaMA, producing both a score\nfor generated text and a human readable diagnostic report. We evaluate\nInstructScore on a variety of generation tasks, including translation,\ncaptioning, data-to-text and commonsense generation. Experiments show that our\n7B model surpasses all other unsupervised metrics, including those based on\n175B GPT-3 and GPT-4. 
Surprisingly, our InstructScore, even without direct\nsupervision from human-rated data, achieves performance levels on par with\nstate-of-the-art metrics like COMET22, which were fine-tuned on human ratings.\n","authors":["Wenda Xu","Danqing Wang","Liangming Pan","Zhenqiao Song","Markus Freitag","William Yang Wang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2305.14282v3.pdf","comment":"Accepted to EMNLP2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.17714v1","updated":"2023-10-26T18:19:56Z","published":"2023-10-26T18:19:56Z","title":"Nearest Neighbor Search over Vectorized Lexico-Syntactic Patterns for\n Relation Extraction from Financial Documents","summary":" Relation extraction (RE) has achieved remarkable progress with the help of\npre-trained language models. However, existing RE models are usually incapable\nof handling two situations: implicit expressions and long-tail relation\nclasses, caused by language complexity and data sparsity. Further, these\napproaches and models are largely inaccessible to users who don't have direct\naccess to large language models (LLMs) and/or infrastructure for supervised\ntraining or fine-tuning. Rule-based systems also struggle with implicit\nexpressions. Apart from this, Real world financial documents such as various\n10-X reports (including 10-K, 10-Q, etc.) of publicly traded companies pose\nanother challenge to rule-based systems in terms of longer and complex\nsentences. In this paper, we introduce a simple approach that consults training\nrelations at test time through a nearest-neighbor search over dense vectors of\nlexico-syntactic patterns and provides a simple yet effective means to tackle\nthe above issues. We evaluate our approach on REFinD and show that our method\nachieves state-of-the-art performance. 
We further show that it can provide a\ngood start for human in the loop setup when a small number of annotations are\navailable and it is also beneficial when domain experts can provide high\nquality patterns.\n","authors":["Pawan Kumar Rajpoot","Ankur Parikh"],"pdf_url":"https://arxiv.org/pdf/2310.17714v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17711v1","updated":"2023-10-26T18:12:02Z","published":"2023-10-26T18:12:02Z","title":"Is Explanation the Cure? Misinformation Mitigation in the Short Term and\n Long Term","summary":" With advancements in natural language processing (NLP) models, automatic\nexplanation generation has been proposed to mitigate misinformation on social\nmedia platforms in addition to adding warning labels to identified fake news.\nWhile many researchers have focused on generating good explanations, how these\nexplanations can really help humans combat fake news is under-explored. In this\nstudy, we compare the effectiveness of a warning label and the state-of-the-art\ncounterfactual explanations generated by GPT-4 in debunking misinformation. In\na two-wave, online human-subject study, participants (N = 215) were randomly\nassigned to a control group in which false contents are shown without any\nintervention, a warning tag group in which the false claims were labeled, or an\nexplanation group in which the false contents were accompanied by GPT-4\ngenerated explanations. Our results show that both interventions significantly\ndecrease participants' self-reported belief in fake claims in an equivalent\nmanner for the short-term and long-term. 
We discuss the implications of our\nfindings and directions for future NLP-based misinformation debunking\nstrategies.\n","authors":["Yi-Li Hsu","Shih-Chieh Dai","Aiping Xiong","Lun-Wei Ku"],"pdf_url":"https://arxiv.org/pdf/2310.17711v1.pdf","comment":"EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2310.17703v1","updated":"2023-10-26T18:03:46Z","published":"2023-10-26T18:03:46Z","title":"The impact of using an AI chatbot to respond to patient messages","summary":" Documentation burden is a major contributor to clinician burnout, which is\nrising nationally and is an urgent threat to our ability to care for patients.\nArtificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician\nburden by assisting with documentation. Although many hospitals are actively\nintegrating such systems into electronic medical record systems, AI chatbots\nutility and impact on clinical decision-making have not been studied for this\nintended use. We are the first to examine the utility of large language models\nin assisting clinicians draft responses to patient questions. In our two-stage\ncross-sectional study, 6 oncologists responded to 100 realistic synthetic\ncancer patient scenarios and portal messages developed to reflect common\nmedical situations, first manually, then with AI assistance.\n We find AI-assisted responses were longer, less readable, but provided\nacceptable drafts without edits 58% of time. AI assistance improved efficiency\n77% of time, with low harm risk (82% safe). However, 7.7% unedited AI responses\ncould severely harm. In 31% cases, physicians thought AI drafts were\nhuman-written. AI assistance led to more patient education recommendations,\nfewer clinical actions than manual responses. Results show promise for AI to\nimprove clinician efficiency and patient care through assisting documentation,\nif used judiciously. 
Monitoring model outputs and human-AI interaction remains\ncrucial for safe implementation.\n","authors":["Shan Chen","Marco Guevara","Shalini Moningi","Frank Hoebers","Hesham Elhalawani","Benjamin H. Kann","Fallon E. Chipidza","Jonathan Leeman","Hugo J. W. L. Aerts","Timothy Miller","Guergana K. Savova","Raymond H. Mak","Maryam Lustberg","Majid Afshar","Danielle S. Bitterman"],"pdf_url":"https://arxiv.org/pdf/2310.17703v1.pdf","comment":"4 figures and tables in main, submitted for review"},{"id":"http://arxiv.org/abs/2310.17690v1","updated":"2023-10-26T18:00:00Z","published":"2023-10-26T18:00:00Z","title":"Non-contrastive sentence representations via self-supervision","summary":" Sample contrastive methods, typically referred to simply as contrastive are\nthe foundation of most unsupervised methods to learn text and sentence\nembeddings. On the other hand, a different class of self-supervised loss\nfunctions and methods have been considered in the computer vision community and\nreferred to as dimension contrastive. In this paper, we thoroughly compare this\nclass of methods with the standard baseline for contrastive sentence\nembeddings, SimCSE. We find that self-supervised embeddings trained using\ndimension contrastive objectives can outperform SimCSE on downstream tasks\nwithout needing auxiliary loss functions.\n","authors":["Marco Farina","Duccio Pappadopulo"],"pdf_url":"https://arxiv.org/pdf/2310.17690v1.pdf","comment":"Submitted and rejected by EMNLP 2023. Contact the authors for a copy\n of the \"reviews\""}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.17653v1","updated":"2023-10-26T17:59:46Z","published":"2023-10-26T17:59:46Z","title":"Fantastic Gains and Where to Find Them: On the Existence and Prospect of\n General Knowledge Transfer between Any Pretrained Model","summary":" Training deep networks requires various design decisions regarding for\ninstance their architecture, data augmentation, or optimization. 
In this work,\nwe find these training variations to result in networks learning unique feature\nsets from the data. Using public model libraries comprising thousands of models\ntrained on canonical datasets like ImageNet, we observe that for arbitrary\npairings of pretrained models, one model extracts significant data context\nunavailable in the other -- independent of overall performance. Given any\narbitrary pairing of pretrained models and no external rankings (such as\nseparate test sets, e.g. due to data privacy), we investigate if it is possible\nto transfer such \"complementary\" knowledge from one model to another without\nperformance degradation -- a task made particularly difficult as additional\nknowledge can be contained in stronger, equiperformant or weaker models. Yet\nfacilitating robust transfer in scenarios agnostic to pretrained model pairings\nwould unlock auxiliary gains and knowledge fusion from any model repository\nwithout restrictions on model and problem specifics - including from weaker,\nlower-performance models. This work therefore provides an initial, in-depth\nexploration on the viability of such general-purpose knowledge transfer. Across\nlarge-scale experiments, we first reveal the shortcomings of standard knowledge\ndistillation techniques, and then propose a much more general extension through\ndata partitioning for successful transfer between nearly all pretrained models,\nwhich we show can also be done unsupervised. 
Finally, we assess both the\nscalability and impact of fundamental model properties on successful\nmodel-agnostic knowledge transfer.\n","authors":["Karsten Roth","Lukas Thede","Almut Sophia Koepke","Oriol Vinyals","Olivier Hénaff","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2310.17653v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17650v1","updated":"2023-10-26T17:59:19Z","published":"2023-10-26T17:59:19Z","title":"A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised\n Video Anomaly Detection","summary":" Detection of anomalous events in videos is an important problem in\napplications such as surveillance. Video anomaly detection (VAD) is\nwell-studied in the one-class classification (OCC) and weakly supervised (WS)\nsettings. However, fully unsupervised (US) video anomaly detection methods,\nwhich learn a complete system without any annotation or human supervision, have\nnot been explored in depth. This is because the lack of any ground truth\nannotations significantly increases the magnitude of the VAD challenge. To\naddress this challenge, we propose a simple-but-effective two-stage\npseudo-label generation framework that produces segment-level (normal/anomaly)\npseudo-labels, which can be further used to train a segment-level anomaly\ndetector in a supervised manner. The proposed coarse-to-fine pseudo-label\n(C2FPL) generator employs carefully-designed hierarchical divisive clustering\nand statistical hypothesis testing to identify anomalous video segments from a\nset of completely unlabeled videos. The trained anomaly detector can be\ndirectly applied on segments of an unseen test video to obtain segment-level,\nand subsequently, frame-level anomaly predictions. 
Extensive studies on two\nlarge-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that\nthe proposed unsupervised approach achieves superior performance compared to\nall existing OCC and US methods , while yielding comparable performance to the\nstate-of-the-art WS methods.\n","authors":["Anas Al-lahham","Nurbek Tastan","Zaigham Zaheer","Karthik Nandakumar"],"pdf_url":"https://arxiv.org/pdf/2310.17650v1.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV), 2024"},{"id":"http://arxiv.org/abs/2310.17649v1","updated":"2023-10-26T17:59:12Z","published":"2023-10-26T17:59:12Z","title":"6-DoF Stability Field via Diffusion Models","summary":" A core capability for robot manipulation is reasoning over where and how to\nstably place objects in cluttered environments. Traditionally, robots have\nrelied on object-specific, hand-crafted heuristics in order to perform such\nreasoning, with limited generalizability beyond a small number of object\ninstances and object interaction patterns. Recent approaches instead learn\nnotions of physical interaction, namely motion prediction, but require\nsupervision in the form of labeled object information or come at the cost of\nhigh sample complexity, and do not directly reason over stability or object\nplacement. We present 6-DoFusion, a generative model capable of generating 3D\nposes of an object that produces a stable configuration of a given scene.\nUnderlying 6-DoFusion is a diffusion model that incrementally refines a\nrandomly initialized SE(3) pose to generate a sample from a learned,\ncontext-dependent distribution over stable poses. We evaluate our model on\ndifferent object placement and stacking tasks, demonstrating its ability to\nconstruct stable scenes that involve novel object classes as well as to improve\nthe accuracy of state-of-the-art 3D pose estimation methods.\n","authors":["Takuma Yoneda","Tianchong Jiang","Gregory Shakhnarovich","Matthew R. 
Walter"],"pdf_url":"https://arxiv.org/pdf/2310.17649v1.pdf","comment":"In submission"},{"id":"http://arxiv.org/abs/2310.17645v1","updated":"2023-10-26T17:58:08Z","published":"2023-10-26T17:58:08Z","title":"Defending Against Transfer Attacks From Public Models","summary":" Adversarial attacks have been a looming and unaddressed threat in the\nindustry. However, through a decade-long history of the robustness evaluation\nliterature, we have learned that mounting a strong or optimal attack is\nchallenging. It requires both machine learning and domain expertise. In other\nwords, the white-box threat model, religiously assumed by a large majority of\nthe past literature, is unrealistic. In this paper, we propose a new practical\nthreat model where the adversary relies on transfer attacks through publicly\navailable surrogate models. We argue that this setting will become the most\nprevalent for security-sensitive applications in the future. We evaluate the\ntransfer attacks in this setting and propose a specialized defense method based\non a game-theoretic perspective. The defenses are evaluated under 24 public\nmodels and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and\nImageNet). Under this threat model, our defense, PubDef, outperforms the\nstate-of-the-art white-box adversarial training by a large margin with almost\nno loss in the normal accuracy. For instance, on ImageNet, our defense achieves\n62% accuracy under the strongest transfer attack vs only 36% of the best\nadversarially trained model. Its accuracy when not under attack is only 2%\nlower than that of an undefended model (78% vs 80%). We release our code at\nhttps://github.com/wagner-group/pubdef.\n","authors":["Chawin Sitawarin","Jaewon Chang","David Huang","Wesson Altoyan","David Wagner"],"pdf_url":"https://arxiv.org/pdf/2310.17645v1.pdf","comment":"Under submission. 
Code available at\n https://github.com/wagner-group/pubdef"},{"id":"http://arxiv.org/abs/2310.17644v1","updated":"2023-10-26T17:57:15Z","published":"2023-10-26T17:57:15Z","title":"torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free\n Deep Learning Studies: A Case Study on NLP","summary":" Reproducibility in scientific work has been becoming increasingly important\nin research communities such as machine learning, natural language processing,\nand computer vision communities due to the rapid development of the research\ndomains supported by recent advances in deep learning. In this work, we present\na significantly upgraded version of torchdistill, a modular-driven coding-free\ndeep learning framework significantly upgraded from the initial release, which\nsupports only image classification and object detection tasks for reproducible\nknowledge distillation experiments. To demonstrate that the upgraded framework\ncan support more tasks with third-party libraries, we reproduce the GLUE\nbenchmark results of BERT models using a script based on the upgraded\ntorchdistill, harmonizing with various Hugging Face libraries. All the 27\nfine-tuned BERT models and configurations to reproduce the results are\npublished at Hugging Face, and the model weights have already been widely used\nin research communities. 
We also reimplement popular small-sized models and new\nknowledge distillation methods and perform additional experiments for computer\nvision tasks.\n","authors":["Yoshitomo Matsubara"],"pdf_url":"https://arxiv.org/pdf/2310.17644v1.pdf","comment":"Accepted at the 3rd Workshop for Natural Language Processing Open\n Source Software (NLP-OSS) at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17642v1","updated":"2023-10-26T17:56:35Z","published":"2023-10-26T17:56:35Z","title":"Drive Anywhere: Generalizable End-to-end Autonomous Driving with\n Multi-modal Foundation Models","summary":" As autonomous driving technology matures, end-to-end methodologies have\nemerged as a leading strategy, promising seamless integration from perception\nto control via deep learning. However, existing systems grapple with challenges\nsuch as unexpected open set environments and the complexity of black-box\nmodels. At the same time, the evolution of deep learning introduces larger,\nmultimodal foundational models, offering multi-modal visual and textual\nunderstanding. In this paper, we harness these multimodal foundation models to\nenhance the robustness and adaptability of autonomous driving systems, enabling\nout-of-distribution, end-to-end, multimodal, and more explainable autonomy.\nSpecifically, we present an approach to apply end-to-end open-set (any\nenvironment/scene) autonomous driving that is capable of providing driving\ndecisions from representations queryable by image and text. To do so, we\nintroduce a method to extract nuanced spatial (pixel/patch-aligned) features\nfrom transformers to enable the encapsulation of both spatial and semantic\nfeatures. 
Our approach (i) demonstrates unparalleled results in diverse tests\nwhile achieving significantly greater robustness in out-of-distribution\nsituations, and (ii) allows the incorporation of latent space simulation (via\ntext) for improved training (data augmentation via text) and policy debugging.\nWe encourage the reader to check our explainer video at\nhttps://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the\ncode and demos on our project webpage at https://drive-anywhere.github.io/.\n","authors":["Tsun-Hsuan Wang","Alaa Maalouf","Wei Xiao","Yutong Ban","Alexander Amini","Guy Rosman","Sertac Karaman","Daniela Rus"],"pdf_url":"https://arxiv.org/pdf/2310.17642v1.pdf","comment":"Project webpage: https://drive-anywhere.github.io Explainer video:\n https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be"},{"id":"http://arxiv.org/abs/2310.17632v1","updated":"2023-10-26T17:50:10Z","published":"2023-10-26T17:50:10Z","title":"DeepShaRM: Multi-View Shape and Reflectance Map Recovery Under Unknown\n Lighting","summary":" Geometry reconstruction of textureless, non-Lambertian objects under unknown\nnatural illumination (i.e., in the wild) remains challenging as correspondences\ncannot be established and the reflectance cannot be expressed in simple\nanalytical forms. We derive a novel multi-view method, DeepShaRM, that achieves\nstate-of-the-art accuracy on this challenging task. Unlike past methods that\nformulate this as inverse-rendering, i.e., estimation of reflectance,\nillumination, and geometry from images, our key idea is to realize that\nreflectance and illumination need not be disentangled and instead estimated as\na compound reflectance map. We introduce a novel deep reflectance map\nestimation network that recovers the camera-view reflectance maps from the\nsurface normals of the current geometry estimate and the input multi-view\nimages. 
The network also explicitly estimates per-pixel confidence scores to\nhandle global light transport effects. A deep shape-from-shading network then\nupdates the geometry estimate expressed with a signed distance function using\nthe recovered reflectance maps. By alternating between these two, and, most\nimportant, by bypassing the ill-posed problem of reflectance and illumination\ndecomposition, the method accurately recovers object geometry in these\nchallenging settings. Extensive experiments on both synthetic and real-world\ndata clearly demonstrate its state-of-the-art accuracy.\n","authors":["Kohei Yamashita","Shohei Nobuhara","Ko Nishino"],"pdf_url":"https://arxiv.org/pdf/2310.17632v1.pdf","comment":"3DV 2024"},{"id":"http://arxiv.org/abs/2310.17626v1","updated":"2023-10-26T17:45:26Z","published":"2023-10-26T17:45:26Z","title":"A Survey on Transferability of Adversarial Examples across Deep Neural\n Networks","summary":" The emergence of Deep Neural Networks (DNNs) has revolutionized various\ndomains, enabling the resolution of complex tasks spanning image recognition,\nnatural language processing, and scientific problem-solving. However, this\nprogress has also exposed a concerning vulnerability: adversarial examples.\nThese crafted inputs, imperceptible to humans, can manipulate machine learning\nmodels into making erroneous predictions, raising concerns for safety-critical\napplications. An intriguing property of this phenomenon is the transferability\nof adversarial examples, where perturbations crafted for one model can deceive\nanother, often with a different architecture. This intriguing property enables\n\"black-box\" attacks, circumventing the need for detailed knowledge of the\ntarget model. This survey explores the landscape of the adversarial\ntransferability of adversarial examples. We categorize existing methodologies\nto enhance adversarial transferability and discuss the fundamental principles\nguiding each approach. 
While the predominant body of research primarily\nconcentrates on image classification, we also extend our discussion to\nencompass other vision tasks and beyond. Challenges and future prospects are\ndiscussed, highlighting the importance of fortifying DNNs against adversarial\nvulnerabilities in an evolving landscape.\n","authors":["Jindong Gu","Xiaojun Jia","Pau de Jorge","Wenqain Yu","Xinwei Liu","Avery Ma","Yuan Xun","Anjun Hu","Ashkan Khakzar","Zhijiang Li","Xiaochun Cao","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2310.17626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17596v1","updated":"2023-10-26T17:17:31Z","published":"2023-10-26T17:17:31Z","title":"MimicGen: A Data Generation System for Scalable Robot Learning using\n Human Demonstrations","summary":" Imitation learning from a large set of human demonstrations has proved to be\nan effective paradigm for building capable robot agents. However, the\ndemonstrations can be extremely costly and time-consuming to collect. We\nintroduce MimicGen, a system for automatically synthesizing large-scale, rich\ndatasets from only a small number of human demonstrations by adapting them to\nnew contexts. We use MimicGen to generate over 50K demonstrations across 18\ntasks with diverse scene configurations, object instances, and robot arms from\njust ~200 human demonstrations. We show that robot agents can be effectively\ntrained on this generated dataset by imitation learning to achieve strong\nperformance in long-horizon and high-precision tasks, such as multi-part\nassembly and coffee preparation, across broad initial state distributions. We\nfurther demonstrate that the effectiveness and utility of MimicGen data compare\nfavorably to collecting additional human demonstrations, making it a powerful\nand economical approach towards scaling up robot learning. 
Datasets, simulation\nenvironments, videos, and more at https://mimicgen.github.io .\n","authors":["Ajay Mandlekar","Soroush Nasiriany","Bowen Wen","Iretiayo Akinola","Yashraj Narang","Linxi Fan","Yuke Zhu","Dieter Fox"],"pdf_url":"https://arxiv.org/pdf/2310.17596v1.pdf","comment":"Conference on Robot Learning (CoRL) 2023"},{"id":"http://arxiv.org/abs/2310.17594v1","updated":"2023-10-26T17:13:48Z","published":"2023-10-26T17:13:48Z","title":"SPA: A Graph Spectral Alignment Perspective for Domain Adaptation","summary":" Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to\nextend the in-domain model to the distinctive target domains where the data\ndistributions differ. Most prior works focus on capturing the inter-domain\ntransferability but largely overlook rich intra-domain structures, which\nempirically results in even worse discriminability. In this work, we introduce\na novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The\ncore of our method is briefly condensed as follows: (i)-by casting the DA\nproblem to graph primitives, SPA composes a coarse graph alignment mechanism\nwith a novel spectral regularizer towards aligning the domain graphs in\neigenspaces; (ii)-we further develop a fine-grained message propagation module\n-- upon a novel neighbor-aware self-training mechanism -- in order for enhanced\ndiscriminability in the target domain. On standardized benchmarks, the\nextensive experiments of SPA demonstrate that its performance has surpassed the\nexisting cutting-edge DA methods. Coupled with dense model analysis, we\nconclude that our approach indeed possesses superior efficacy, robustness,\ndiscriminability, and transferability. 
Code and data are available at:\nhttps://github.com/CrownX/SPA.\n","authors":["Zhiqing Xiao","Haobo Wang","Ying Jin","Lei Feng","Gang Chen","Fei Huang","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.17594v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17590v1","updated":"2023-10-26T17:12:26Z","published":"2023-10-26T17:12:26Z","title":"Noise-Free Score Distillation","summary":" Score Distillation Sampling (SDS) has emerged as the de facto approach for\ntext-to-content generation in non-image domains. In this paper, we reexamine\nthe SDS process and introduce a straightforward interpretation that demystifies\nthe necessity for large Classifier-Free Guidance (CFG) scales, rooted in the\ndistillation of an undesired noise term. Building upon our interpretation, we\npropose a novel Noise-Free Score Distillation (NFSD) process, which requires\nminimal modifications to the original SDS framework. Through this streamlined\ndesign, we achieve more effective distillation of pre-trained text-to-image\ndiffusion models while using a nominal CFG scale. This strategic choice allows\nus to prevent the over-smoothing of results, ensuring that the generated data\nis both realistic and complies with the desired prompt. 
To demonstrate the\nefficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as\nwell as several other methods.\n","authors":["Oren Katzir","Or Patashnik","Daniel Cohen-Or","Dani Lischinski"],"pdf_url":"https://arxiv.org/pdf/2310.17590v1.pdf","comment":"Project page at https://orenkatzir.github.io/nfsd/"},{"id":"http://arxiv.org/abs/2310.01164v4","updated":"2023-10-26T17:08:34Z","published":"2023-10-02T12:49:20Z","title":"Segment Any Building","summary":" The task of identifying and segmenting buildings within remote sensing\nimagery has perennially stood at the forefront of scholarly investigations.\nThis manuscript accentuates the potency of harnessing diversified datasets in\ntandem with cutting-edge representation learning paradigms for building\nsegmentation in such images. Through the strategic amalgamation of disparate\ndatasets, we have not only expanded the informational horizon accessible for\nmodel training but also manifested unparalleled performance metrics across\nmultiple datasets. Our avant-garde joint training regimen underscores the merit\nof our approach, bearing significant implications in pivotal domains such as\nurban infrastructural development, disaster mitigation strategies, and\necological surveillance. Our methodology, predicated upon the fusion of\ndatasets and gleaning insights from pre-trained models, carves a new benchmark\nin the annals of building segmentation endeavors. 
The outcomes of this research\nboth fortify the foundations for ensuing scholarly pursuits and presage a\nhorizon replete with innovative applications in the discipline of building\nsegmentation.\n","authors":["Lei Li"],"pdf_url":"https://arxiv.org/pdf/2310.01164v4.pdf","comment":"CGI 2023"},{"id":"http://arxiv.org/abs/2310.17577v1","updated":"2023-10-26T17:01:52Z","published":"2023-10-26T17:01:52Z","title":"Global Structure-Aware Diffusion Process for Low-Light Image Enhancement","summary":" This paper studies a diffusion-based framework to address the low-light image\nenhancement problem. To harness the capabilities of diffusion models, we delve\ninto this intricate process and advocate for the regularization of its inherent\nODE-trajectory. To be specific, inspired by the recent research that low\ncurvature ODE-trajectory results in a stable and effective diffusion process,\nwe formulate a curvature regularization term anchored in the intrinsic\nnon-local structures of image data, i.e., global structure-aware\nregularization, which gradually facilitates the preservation of complicated\ndetails and the augmentation of contrast during the diffusion process. This\nincorporation mitigates the adverse effects of noise and artifacts resulting\nfrom the diffusion process, leading to a more precise and flexible enhancement.\nTo additionally promote learning in challenging regions, we introduce an\nuncertainty-guided regularization technique, which wisely relaxes constraints\non the most extreme regions of the image. Experimental evaluations reveal that\nthe proposed diffusion-based framework, complemented by rank-informed\nregularization, attains distinguished performance in low-light enhancement. The\noutcomes indicate substantial advancements in image quality, noise suppression,\nand contrast amplification in comparison with state-of-the-art methods. 
We\nbelieve this innovative approach will stimulate further exploration and\nadvancement in low-light image processing, with potential implications for\nother applications of diffusion models. The code is publicly available at\nhttps://github.com/jinnh/GSAD.\n","authors":["Jinhui Hou","Zhiyu Zhu","Junhui Hou","Hui Liu","Huanqiang Zeng","Hui Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.17577v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17569v1","updated":"2023-10-26T16:58:01Z","published":"2023-10-26T16:58:01Z","title":"SD4Match: Learning to Prompt Stable Diffusion Model for Semantic\n Matching","summary":" In this paper, we address the challenge of matching semantically similar\nkeypoints across image pairs. Existing research indicates that the intermediate\noutput of the UNet within the Stable Diffusion (SD) can serve as robust image\nfeature maps for such a matching task. We demonstrate that by employing a basic\nprompt tuning technique, the inherent potential of Stable Diffusion can be\nharnessed, resulting in a significant enhancement in accuracy over previous\napproaches. We further introduce a novel conditional prompting module that\nconditions the prompt on the local details of the input image pairs, leading to\na further improvement in performance. We designate our approach as SD4Match,\nshort for Stable Diffusion for Semantic Matching. Comprehensive evaluations of\nSD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets\nnew benchmarks in accuracy across all these datasets. 
Particularly, SD4Match\noutperforms the previous state-of-the-art by a margin of 12 percentage points\non the challenging SPair-71k dataset.\n","authors":["Xinghui Li","Jingyi Lu","Kai Han","Victor Prisacariu"],"pdf_url":"https://arxiv.org/pdf/2310.17569v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14267v2","updated":"2023-10-26T16:51:28Z","published":"2023-05-23T17:19:54Z","title":"SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from\n Diffusion Models","summary":" A potent class of generative models known as Diffusion Probabilistic Models\n(DPMs) has become prominent. A forward diffusion process adds gradually noise\nto data, while a model learns to gradually denoise. Sampling from pre-trained\nDPMs is obtained by solving differential equations (DE) defined by the learnt\nmodel, a process which has shown to be prohibitively slow. Numerous efforts on\nspeeding-up this process have consisted on crafting powerful ODE solvers.\nDespite being quick, such solvers do not usually reach the optimal quality\nachieved by available slow SDE solvers. Our goal is to propose SDE solvers that\nreach optimal quality without requiring several hundreds or thousands of NFEs\nto achieve that goal. We propose Stochastic Explicit Exponential\nDerivative-free Solvers (SEEDS), improving and generalizing Exponential\nIntegrator approaches to the stochastic case on several frameworks. After\ncarefully analyzing the formulation of exact solutions of diffusion SDEs, we\ncraft SEEDS to analytically compute the linear part of such solutions. Inspired\nby the Exponential Time-Differencing method, SEEDS use a novel treatment of the\nstochastic components of solutions, enabling the analytical computation of\ntheir variance, and contains high-order terms allowing to reach optimal quality\nsampling $\\sim3$-$5\\times$ faster than previous SDE methods. 
We validate our\napproach on several image generation benchmarks, showing that SEEDS outperform\nor are competitive with previous SDE solvers. Contrary to the latter, SEEDS are\nderivative and training free, and we fully prove strong convergence guarantees\nfor them.\n","authors":["Martin Gonzalez","Nelson Fernandez","Thuy Tran","Elies Gherbi","Hatem Hajri","Nader Masmoudi"],"pdf_url":"https://arxiv.org/pdf/2305.14267v2.pdf","comment":"60 pages. Camera-Ready version for the 37th Conference on Neural\n Information Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.17559v1","updated":"2023-10-26T16:48:36Z","published":"2023-10-26T16:48:36Z","title":"Instability of computer vision models is a necessary result of the task\n itself","summary":" Adversarial examples resulting from instability of current computer vision\nmodels are an extremely important topic due to their potential to compromise\nany application. In this paper we demonstrate that instability is inevitable\ndue to a) symmetries (translational invariance) of the data, b) the categorical\nnature of the classification task, and c) the fundamental discrepancy of\nclassifying images as objects themselves. The issue is further exacerbated by\nnon-exhaustive labelling of the training data. Therefore we conclude that\ninstability is a necessary result of how the problem of computer vision is\ncurrently formulated. While the problem cannot be eliminated, through the\nanalysis of the causes, we have arrived at ways how it can be partially\nalleviated. 
These include i) increasing the resolution of images, ii) providing\ncontextual information for the image, iii) exhaustive labelling of training\ndata, and iv) preventing attackers from frequent access to the computer vision\nsystem.\n","authors":["Oliver Turnbull","George Cevora"],"pdf_url":"https://arxiv.org/pdf/2310.17559v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16002v2","updated":"2023-10-26T16:30:44Z","published":"2023-10-24T16:55:07Z","title":"Integrating View Conditions for Image Synthesis","summary":" In the field of image processing, applying intricate semantic modifications\nwithin existing images remains an enduring challenge. This paper introduces a\npioneering framework that integrates viewpoint information to enhance the\ncontrol of image editing tasks. By surveying existing object editing\nmethodologies, we distill three essential criteria, consistency,\ncontrollability, and harmony, that should be met for an image editing method.\nIn contrast to previous approaches, our method takes the lead in satisfying all\nthree requirements for addressing the challenge of image synthesis. Through\ncomprehensive experiments, encompassing both quantitative assessments and\nqualitative comparisons with contemporary state-of-the-art methods, we present\ncompelling evidence of our framework's superior performance across multiple\ndimensions. 
This work establishes a promising avenue for advancing image\nsynthesis techniques and empowering precise object modifications while\npreserving the visual coherence of the entire composition.\n","authors":["Jinbin Bai","Zhen Dong","Aosong Feng","Xiao Zhang","Tian Ye","Kaicheng Zhou","Mike Zheng Shou"],"pdf_url":"https://arxiv.org/pdf/2310.16002v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.07813v4","updated":"2023-10-26T16:28:29Z","published":"2023-07-15T14:34:25Z","title":"TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for\n Gaze Estimation","summary":" Intelligent edge vision tasks encounter the critical challenge of ensuring\npower and latency efficiency due to the typically heavy computational load they\nimpose on edge platforms. This work leverages one of the first \"AI in sensor\"\nvision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power\nend-to-end edge vision applications. We evaluate the IMX500 and compare it to\nother edge platforms, such as the Google Coral Dev Micro and Sony Spresense, by\nexploring gaze estimation as a case study. We propose TinyTracker, a highly\nefficient, fully quantized model for 2D gaze estimation designed to maximize\nthe performance of the edge vision systems considered in this study.\nTinyTracker achieves a 41x size reduction (600Kb) compared to iTracker [1]\nwithout significant loss in gaze estimation accuracy (maximum of 0.16 cm when\nfully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor\nresults in end-to-end latency of around 19ms. The camera takes around 17.9ms to\nread, process and transmit the pixels to the accelerator. The inference time of\nthe network is 0.86ms with an additional 0.24 ms for retrieving the results\nfrom the sensor. The overall energy consumption of the end-to-end system is 4.9\nmJ, including 0.06 mJ for inference. 
The end-to-end study shows that IMX500 is\n1.7x faster than CoralMicro (19ms vs 34.4ms) and 7x more power efficient (4.9mJ\nvs 34.2mJ)\n","authors":["Pietro Bonazzi","Thomas Ruegg","Sizhen Bian","Yawei Li","Michele Magno"],"pdf_url":"https://arxiv.org/pdf/2307.07813v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17534v1","updated":"2023-10-26T16:23:40Z","published":"2023-10-26T16:23:40Z","title":"SoK: Pitfalls in Evaluating Black-Box Attacks","summary":" Numerous works study black-box attacks on image classifiers. However, these\nworks make different assumptions on the adversary's knowledge and current\nliterature lacks a cohesive organization centered around the threat model. To\nsystematize knowledge in this area, we propose a taxonomy over the threat space\nspanning the axes of feedback granularity, the access of interactive queries,\nand the quality and quantity of the auxiliary data available to the attacker.\nOur new taxonomy provides three key insights. 1) Despite extensive literature,\nnumerous under-explored threat spaces exist, which cannot be trivially solved\nby adapting techniques from well-explored settings. We demonstrate this by\nestablishing a new state-of-the-art in the less-studied setting of access to\ntop-k confidence scores by adapting techniques from well-explored settings of\naccessing the complete confidence vector, but show how it still falls short of\nthe more restrictive setting that only obtains the prediction label,\nhighlighting the need for more research. 2) Identifying the threat model of\ndifferent attacks uncovers stronger baselines that challenge prior\nstate-of-the-art claims. We demonstrate this by enhancing an initially weaker\nbaseline (under interactive query access) via surrogate models, effectively\noverturning claims in the respective paper. 3) Our taxonomy reveals\ninteractions between attacker knowledge that connect well to related areas,\nsuch as model inversion and extraction attacks. 
We discuss how advances in\nother areas can enable potentially stronger black-box attacks. Finally, we\nemphasize the need for a more realistic assessment of attack success by\nfactoring in local attack runtime. This approach reveals the potential for\ncertain attacks to achieve notably higher success rates and the need to\nevaluate attacks in diverse and harder settings, highlighting the need for\nbetter selection criteria.\n","authors":["Fnu Suya","Anshuman Suri","Tingwei Zhang","Jingtao Hong","Yuan Tian","David Evans"],"pdf_url":"https://arxiv.org/pdf/2310.17534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17530v1","updated":"2023-10-26T16:19:19Z","published":"2023-10-26T16:19:19Z","title":"Evaluating Bias and Fairness in Gender-Neutral Pretrained\n Vision-and-Language Models","summary":" Pretrained machine learning models are known to perpetuate and even amplify\nexisting biases in data, which can result in unfair outcomes that ultimately\nimpact user experience. Therefore, it is crucial to understand the mechanisms\nbehind those prejudicial biases to ensure that model performance does not\nresult in discriminatory behaviour toward certain groups or populations. In\nthis work, we define gender bias as our case study. We quantify bias\namplification in pretraining and after fine-tuning on three families of\nvision-and-language models. We investigate the connection, if any, between the\ntwo learning stages, and evaluate how bias amplification reflects on model\nperformance. Overall, we find that bias amplification in pretraining and after\nfine-tuning are independent. 
We then examine the effect of continued\npretraining on gender-neutral data, finding that this reduces group\ndisparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without\nsignificantly compromising task performance.\n","authors":["Laura Cabello","Emanuele Bugliarello","Stephanie Brandl","Desmond Elliott"],"pdf_url":"https://arxiv.org/pdf/2310.17530v1.pdf","comment":"To appear in EMNLP 2024"},{"id":"http://arxiv.org/abs/2310.17527v1","updated":"2023-10-26T16:18:38Z","published":"2023-10-26T16:18:38Z","title":"Masked Space-Time Hash Encoding for Efficient Dynamic Scene\n Reconstruction","summary":" In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel\nmethod for efficiently reconstructing dynamic 3D scenes from multi-view or\nmonocular videos. Based on the observation that dynamic scenes often contain\nsubstantial static areas that result in redundancy in storage and computations,\nMSTH represents a dynamic scene as a weighted combination of a 3D hash encoding\nand a 4D hash encoding. The weights for the two components are represented by a\nlearnable mask which is guided by an uncertainty-based objective to reflect the\nspatial and temporal importance of each 3D position. With this design, our\nmethod can reduce the hash collision rate by avoiding redundant queries and\nmodifications on static areas, making it feasible to represent a large number\nof space-time voxels by hash tables with small size. Besides, without the\nrequirements to fit the large numbers of temporally redundant features\nindependently, our method is easier to optimize and converge rapidly with only\ntwenty minutes of training for a 300-frame dynamic scene. As a result, MSTH\nobtains consistently better results than previous methods with only 20 minutes\nof training time and 130 MB of memory storage. 
Code is available at\nhttps://github.com/masked-spacetime-hashing/msth\n","authors":["Feng Wang","Zilong Chen","Guokang Wang","Yafei Song","Huaping Liu"],"pdf_url":"https://arxiv.org/pdf/2310.17527v1.pdf","comment":"NeurIPS 2023 (Spotlight)"},{"id":"http://arxiv.org/abs/2310.17519v1","updated":"2023-10-26T16:13:00Z","published":"2023-10-26T16:13:00Z","title":"FLARE: Fast Learning of Animatable and Relightable Mesh Avatars","summary":" Our goal is to efficiently learn personalized animatable 3D head avatars from\nvideos that are geometrically accurate, realistic, relightable, and compatible\nwith current rendering systems. While 3D meshes enable efficient processing and\nare highly portable, they lack realism in terms of shape and appearance. Neural\nrepresentations, on the other hand, are realistic but lack compatibility and\nare slow to train and render. Our key insight is that it is possible to\nefficiently learn high-fidelity 3D mesh representations via differentiable\nrendering by exploiting highly-optimized methods from traditional computer\ngraphics and approximating some of the components with neural networks. To that\nend, we introduce FLARE, a technique that enables the creation of animatable\nand relightable mesh avatars from a single monocular video. First, we learn a\ncanonical geometry using a mesh representation, enabling efficient\ndifferentiable rasterization and straightforward animation via learned\nblendshapes and linear blend skinning weights. Second, we follow\nphysically-based rendering and factor observed colors into intrinsic albedo,\nroughness, and a neural representation of the illumination, allowing the\nlearned avatars to be relit in novel scenes. Since our input videos are\ncaptured on a single device with a narrow field of view, modeling the\nsurrounding environment light is non-trivial. 
Based on the split-sum\napproximation for modeling specular reflections, we address this by\napproximating the pre-filtered environment map with a multi-layer perceptron\n(MLP) modulated by the surface roughness, eliminating the need to explicitly\nmodel the light. We demonstrate that our mesh-based avatar formulation,\ncombined with learned deformation, material, and lighting MLPs, produces\navatars with high-quality geometry and appearance, while also being efficient\nto train and render compared to existing approaches.\n","authors":["Shrisha Bharadwaj","Yufeng Zheng","Otmar Hilliges","Michael J. Black","Victoria Fernandez-Abrevaya"],"pdf_url":"https://arxiv.org/pdf/2310.17519v1.pdf","comment":"15 pages, Accepted: ACM Transactions on Graphics (Proceedings of\n SIGGRAPH Asia), 2023"},{"id":"http://arxiv.org/abs/2305.19693v3","updated":"2023-10-26T16:02:56Z","published":"2023-05-31T09:36:34Z","title":"Spontaneous Symmetry Breaking in Generative Diffusion Models","summary":" Generative diffusion models have recently emerged as a leading approach for\ngenerating high-dimensional data. In this paper, we show that the dynamics of\nthese models exhibit a spontaneous symmetry breaking that divides the\ngenerative dynamics into two distinct phases: 1) A linear steady-state dynamics\naround a central fixed-point and 2) an attractor dynamics directed towards the\ndata manifold. These two \"phases\" are separated by the change in stability of\nthe central fixed-point, with the resulting window of instability being\nresponsible for the diversity of the generated samples. Using both theoretical\nand empirical evidence, we show that an accurate simulation of the early\ndynamics does not significantly contribute to the final generation, since early\nfluctuations are reverted to the central fixed point. 
To leverage this insight,\nwe propose a Gaussian late initialization scheme, which significantly improves\nmodel performance, achieving up to 3x FID improvements on fast samplers, while\nalso increasing sample diversity (e.g., racial composition of generated CelebA\nimages). Our work offers a new way to understand the generative dynamics of\ndiffusion models that has the potential to bring about higher performance and\nless biased fast-samplers.\n","authors":["Gabriel Raya","Luca Ambrogioni"],"pdf_url":"https://arxiv.org/pdf/2305.19693v3.pdf","comment":"As published at NeurIPS 2023, and the size of the file has been\n optimized for fast downloading"},{"id":"http://arxiv.org/abs/2301.00157v2","updated":"2023-10-26T15:56:50Z","published":"2022-12-31T08:58:39Z","title":"Ponder: Point Cloud Pre-training via Neural Rendering","summary":" We propose a novel approach to self-supervised learning of point cloud\nrepresentations by differentiable neural rendering. Motivated by the fact that\ninformative point cloud features should be able to encode rich geometry and\nappearance cues and render realistic images, we train a point-cloud encoder\nwithin a devised point-based neural renderer by comparing the rendered images\nwith real images on massive RGB-D data. The learned point-cloud encoder can be\neasily integrated into various downstream tasks, including not only high-level\ntasks like 3D detection and segmentation, but low-level tasks like 3D\nreconstruction and image synthesis. 
Extensive experiments on various tasks\ndemonstrate the superiority of our approach compared to existing pre-training\nmethods.\n","authors":["Di Huang","Sida Peng","Tong He","Honghui Yang","Xiaowei Zhou","Wanli Ouyang"],"pdf_url":"https://arxiv.org/pdf/2301.00157v2.pdf","comment":"Project page: https://dihuang.me/ponder/"},{"id":"http://arxiv.org/abs/2310.17504v1","updated":"2023-10-26T15:54:43Z","published":"2023-10-26T15:54:43Z","title":"Revisiting the Distillation of Image Representations into Point Clouds\n for Autonomous Driving","summary":" Self-supervised image networks can be used to address complex 2D tasks (e.g.,\nsemantic segmentation, object discovery) very efficiently and with little or no\ndownstream supervision. However, self-supervised 3D networks on lidar data do\nnot perform as well for now. A few methods therefore propose to distill\nhigh-quality self-supervised 2D features into 3D networks. The most recent ones\ndoing so on autonomous driving data show promising results. Yet, a performance\ngap persists between these distilled features and fully-supervised ones. In\nthis work, we revisit 2D-to-3D distillation. First, we propose, for semantic\nsegmentation, a simple approach that leads to a significant improvement\ncompared to prior 3D distillation methods. Second, we show that distillation in\nhigh capacity 3D networks is key to reach high quality 3D features. This\nactually allows us to significantly close the gap between unsupervised\ndistilled 3D features and fully-supervised ones. 
Last, we show that our\nhigh-quality distilled representations can also be used for open-vocabulary\nsegmentation and background/foreground discovery.\n","authors":["Gilles Puy","Spyros Gidaris","Alexandre Boulch","Oriane Siméoni","Corentin Sautier","Patrick Pérez","Andrei Bursuc","Renaud Marlet"],"pdf_url":"https://arxiv.org/pdf/2310.17504v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17493v1","updated":"2023-10-26T15:49:35Z","published":"2023-10-26T15:49:35Z","title":"A Hybrid Graph Network for Complex Activity Detection in Video","summary":" Interpretation and understanding of video presents a challenging computer\nvision task in numerous fields - e.g. autonomous driving and sports analytics.\nExisting approaches to interpreting the actions taking place within a video\nclip are based upon Temporal Action Localisation (TAL), which typically\nidentifies short-term actions. The emerging field of Complex Activity Detection\n(CompAD) extends this analysis to long-term activities, with a deeper\nunderstanding obtained by modelling the internal structure of a complex\nactivity taking place within the video. We address the CompAD problem using a\nhybrid graph neural network which combines attention applied to a graph\nencoding the local (short-term) dynamic scene with a temporal graph modelling\nthe overall long-duration activity. Our approach is as follows: i) Firstly, we\npropose a novel feature extraction technique which, for each video snippet,\ngenerates spatiotemporal `tubes' for the active elements (`agents') in the\n(local) scene by detecting individual objects, tracking them and then\nextracting 3D features from all the agent tubes as well as the overall scene.\nii) Next, we construct a local scene graph where each node (representing either\nan agent tube or the scene) is connected to all other nodes. Attention is then\napplied to this graph to obtain an overall representation of the local dynamic\nscene. 
iii) Finally, all local scene graph representations are interconnected\nvia a temporal graph, to estimate the complex activity class together with its\nstart and end time. The proposed framework outperforms all previous\nstate-of-the-art methods on all three datasets including ActivityNet-1.3,\nThumos-14, and ROAD.\n","authors":["Salman Khan","Izzeddin Teeti","Andrew Bradley","Mohamed Elhoseiny","Fabio Cuzzolin"],"pdf_url":"https://arxiv.org/pdf/2310.17493v1.pdf","comment":"This paper is Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2306.14685v3","updated":"2023-10-26T15:32:32Z","published":"2023-06-26T13:30:38Z","title":"DiffSketcher: Text Guided Vector Sketch Synthesis through Latent\n Diffusion Models","summary":" Even though trained mainly on images, we discover that pretrained diffusion\nmodels show impressive power in guiding sketch synthesis. In this paper, we\npresent DiffSketcher, an innovative algorithm that creates \\textit{vectorized}\nfree-hand sketches using natural language input. DiffSketcher is developed\nbased on a pre-trained text-to-image diffusion model. It performs the task by\ndirectly optimizing a set of B\\'ezier curves with an extended version of the\nscore distillation sampling (SDS) loss, which allows us to use a raster-level\ndiffusion model as a prior for optimizing a parametric vectorized sketch\ngenerator. Furthermore, we explore attention maps embedded in the diffusion\nmodel for effective stroke initialization to speed up the generation process.\nThe generated sketches demonstrate multiple levels of abstraction while\nmaintaining recognizability, underlying structure, and essential visual details\nof the subject drawn. Our experiments show that DiffSketcher achieves greater\nquality than prior work. 
The code and demo of DiffSketcher can be found at\nhttps://ximinng.github.io/DiffSketcher-project/.\n","authors":["Ximing Xing","Chuang Wang","Haitao Zhou","Jing Zhang","Qian Yu","Dong Xu"],"pdf_url":"https://arxiv.org/pdf/2306.14685v3.pdf","comment":"Accepted by NIPS 2023. Project page:\n https://ximinng.github.io/DiffSketcher-project/"},{"id":"http://arxiv.org/abs/2306.00612v3","updated":"2023-10-26T15:20:31Z","published":"2023-06-01T12:32:52Z","title":"AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud\n Dataset","summary":" It is a long-term vision for Autonomous Driving (AD) community that the\nperception models can learn from a large-scale point cloud dataset, to obtain\nunified representations that can achieve promising results on different tasks\nor benchmarks. Previous works mainly focus on the self-supervised pre-training\npipeline, meaning that they perform the pre-training and fine-tuning on the\nsame benchmark, which is difficult to attain the performance scalability and\ncross-dataset application for the pre-training checkpoint. In this paper, for\nthe first time, we are committed to building a large-scale pre-training\npoint-cloud dataset with diverse data distribution, and meanwhile learning\ngeneralizable representations from such a diverse pre-training dataset. We\nformulate the point-cloud pre-training task as a semi-supervised problem, which\nleverages the few-shot labeled and massive unlabeled point-cloud data to\ngenerate the unified backbone representations that can be directly applied to\nmany baseline models and benchmarks, decoupling the AD-related pre-training\nprocess and downstream fine-tuning task. 
During the period of backbone\npre-training, by enhancing the scene- and instance-level distribution diversity\nand exploiting the backbone's ability to learn from unknown instances, we\nachieve significant performance gains on a series of downstream perception\nbenchmarks including Waymo, nuScenes, and KITTI, under different baseline\nmodels like PV-RCNN++, SECOND, CenterPoint.\n","authors":["Jiakang Yuan","Bo Zhang","Xiangchao Yan","Tao Chen","Botian Shi","Yikang Li","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2306.00612v3.pdf","comment":"Accepted by NeurIPS 2023. Project page:\n https://jiakangyuan.github.io/AD-PT.github.io/"},{"id":"http://arxiv.org/abs/2306.00984v2","updated":"2023-10-26T15:16:57Z","published":"2023-06-01T17:59:51Z","title":"StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual\n Representation Learners","summary":" We investigate the potential of learning visual representations using\nsynthetic images generated by text-to-image models. This is a natural question\nin the light of the excellent performance of such models in generating\nhigh-quality images. We consider specifically the Stable Diffusion, one of the\nleading open source text-to-image models. We show that (1) when the generative\nmodel is configured with proper classifier-free guidance scale, training\nself-supervised methods on synthetic images can match or beat the real image\ncounterpart; (2) by treating the multiple images generated from the same text\nprompt as positives for each other, we develop a multi-positive contrastive\nlearning method, which we call StableRep. With solely synthetic images, the\nrepresentations learned by StableRep surpass the performance of representations\nlearned by SimCLR and CLIP using the same set of text prompts and corresponding\nreal images, on large scale datasets. 
When we further add language supervision,\nStableRep trained with 20M synthetic images achieves better accuracy than CLIP\ntrained with 50M real images.\n","authors":["Yonglong Tian","Lijie Fan","Phillip Isola","Huiwen Chang","Dilip Krishnan"],"pdf_url":"https://arxiv.org/pdf/2306.00984v2.pdf","comment":"code is available at:\n https://github.com/google-research/syn-rep-learn"},{"id":"http://arxiv.org/abs/2310.16639v2","updated":"2023-10-26T15:15:39Z","published":"2023-10-25T13:39:04Z","title":"Driving through the Concept Gridlock: Unraveling Explainability\n Bottlenecks in Automated Driving","summary":" Concept bottleneck models have been successfully used for explainable machine\nlearning by encoding information within the model with a set of human-defined\nconcepts. In the context of human-assisted or autonomous driving,\nexplainability models can help user acceptance and understanding of decisions\nmade by the autonomous vehicle, which can be used to rationalize and explain\ndriver or vehicle behavior. We propose a new approach using concept bottlenecks\nas visual features for control command predictions and explanations of user and\nvehicle behavior. We learn a human-understandable concept layer that we use to\nexplain sequential driving scenes while learning vehicle control commands. This\napproach can then be used to determine whether a change in a preferred gap or\nsteering commands from a human (or autonomous vehicle) is led by an external\nstimulus or change in preferences. 
We achieve competitive performance to latent\nvisual features while gaining interpretability within our model setup.\n","authors":["Jessica Echterhoff","An Yan","Kyungtae Han","Amr Abdelraouf","Rohit Gupta","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2310.16639v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17468v1","updated":"2023-10-26T15:15:11Z","published":"2023-10-26T15:15:11Z","title":"Cross-modal Active Complementary Learning with Self-refining\n Correspondence","summary":" Recently, image-text matching has attracted more and more attention from\nacademia and industry, which is fundamental to understanding the latent\ncorrespondence across visual and textual modalities. However, most existing\nmethods implicitly assume the training pairs are well-aligned while ignoring\nthe ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby\ninevitably leading to a performance drop. Although some methods attempt to\naddress such noise, they still face two challenging problems: excessive\nmemorizing/overfitting and unreliable correction for NC, especially under high\nnoise. To address the two problems, we propose a generalized Cross-modal Robust\nComplementary Learning framework (CRCL), which benefits from a novel Active\nComplementary Loss (ACL) and an efficient Self-refining Correspondence\nCorrection (SCC) to improve the robustness of existing methods. Specifically,\nACL exploits active and complementary learning losses to reduce the risk of\nproviding erroneous supervision, leading to theoretically and experimentally\ndemonstrated robustness against NC. SCC utilizes multiple self-refining\nprocesses with momentum correction to enlarge the receptive field for\ncorrecting correspondences, thereby alleviating error accumulation and\nachieving accurate and stable corrections. 
We carry out extensive experiments\non three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify\nthe superior robustness of our CRCL against synthetic and real-world noisy\ncorrespondences.\n","authors":["Yang Qin","Yuan Sun","Dezhong Peng","Joey Tianyi Zhou","Xi Peng","Peng Hu"],"pdf_url":"https://arxiv.org/pdf/2310.17468v1.pdf","comment":"This paper is accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17462v1","updated":"2023-10-26T15:10:10Z","published":"2023-10-26T15:10:10Z","title":"Towards Learning Monocular 3D Object Localization From 2D Labels using\n the Physical Laws of Motion","summary":" We present a novel method for precise 3D object localization in single images\nfrom a single calibrated camera using only 2D labels. No expensive 3D labels\nare needed. Thus, instead of using 3D labels, our model is trained with\neasy-to-annotate 2D labels along with the physical knowledge of the object's\nmotion. Given this information, the model can infer the latent third dimension,\neven though it has never seen this information during training. Our method is\nevaluated on both synthetic and real-world datasets, and we are able to achieve\na mean distance error of just 6 cm in our experiments on real data. The results\nindicate the method's potential as a step towards learning 3D object location\nestimation, where collecting 3D data for training is not feasible.\n","authors":["Daniel Kienzle","Julian Lorenz","Katja Ludwig","Rainer Lienhart"],"pdf_url":"https://arxiv.org/pdf/2310.17462v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17455v1","updated":"2023-10-26T15:01:54Z","published":"2023-10-26T15:01:54Z","title":"OTMatch: Improving Semi-Supervised Learning with Optimal Transport","summary":" Semi-supervised learning has made remarkable strides by effectively utilizing\na limited amount of labeled data while capitalizing on the abundant information\npresent in unlabeled data. 
However, current algorithms often prioritize\naligning image predictions with specific classes generated through\nself-training techniques, thereby neglecting the inherent relationships that\nexist within these classes. In this paper, we present a new approach called\nOTMatch, which leverages semantic relationships among classes by employing an\noptimal transport loss function. By utilizing optimal transport, our proposed\nmethod consistently outperforms established state-of-the-art methods. Notably,\nwe observed a substantial improvement of a certain percentage in accuracy\ncompared to the current state-of-the-art method, FreeMatch. OTMatch achieves\n3.18%, 3.46%, and 1.28% error rate reduction over FreeMatch on CIFAR-10 with 1\nlabel per class, STL-10 with 4 labels per class, and ImageNet with 100 labels\nper class, respectively. This demonstrates the effectiveness and superiority of\nour approach in harnessing semantic relationships to enhance learning\nperformance in a semi-supervised setting.\n","authors":["Zhiquan Tan","Kaipeng Zheng","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17455v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17451v1","updated":"2023-10-26T15:00:21Z","published":"2023-10-26T15:00:21Z","title":"Generating by Understanding: Neural Visual Generation with Logical\n Symbol Groundings","summary":" Despite the great success of neural visual generative models in recent years,\nintegrating them with strong symbolic knowledge reasoning systems remains a\nchallenging task. The main challenges are two-fold: one is symbol assignment,\ni.e. bonding latent factors of neural visual generators with meaningful symbols\nfrom knowledge reasoning systems. Another is rule learning, i.e. learning new\nrules, which govern the generative process of the data, to augment the\nknowledge reasoning systems. 
To deal with these symbol grounding problems, we\npropose a neural-symbolic learning approach, Abductive Visual Generation\n(AbdGen), for integrating logic programming systems with neural visual\ngenerative models based on the abductive learning framework. To achieve\nreliable and efficient symbol assignment, the quantized abduction method is\nintroduced for generating abduction proposals by the nearest-neighbor lookups\nwithin semantic codebooks. To achieve precise rule learning, the contrastive\nmeta-abduction method is proposed to eliminate wrong rules with positive cases\nand avoid less-informative rules with negative cases simultaneously.\nExperimental results on various benchmark datasets show that compared to the\nbaselines, AbdGen requires significantly fewer instance-level labeling\ninformation for symbol assignment. Furthermore, our approach can effectively\nlearn underlying logical generative rules from data, which is out of the\ncapability of existing approaches.\n","authors":["Yifei Peng","Yu Jin","Zhexu Luo","Yao-Xiang Ding","Wang-Zhou Dai","Zhong Ren","Kun Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.17451v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17437v1","updated":"2023-10-26T14:47:11Z","published":"2023-10-26T14:47:11Z","title":"Sign Languague Recognition without frame-sequencing constraints: A proof\n of concept on the Argentinian Sign Language","summary":" Automatic sign language recognition (SLR) is an important topic within the\nareas of human-computer interaction and machine learning. On the one hand, it\nposes a complex challenge that requires the intervention of various knowledge\nareas, such as video processing, image processing, intelligent systems and\nlinguistics. 
On the other hand, robust recognition of sign language could\nassist in the translation process and the integration of hearing-impaired\npeople, as well as the teaching of sign language for the hearing population.\n SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or\nsimilar models to recognize signs. Such techniques exploit the sequential\nordering of frames to reduce the number of hypothesis. This paper presents a\ngeneral probabilistic model for sign classification that combines\nsub-classifiers based on different types of features such as position, movement\nand handshape. The model employs a bag-of-words approach in all classification\nsteps, to explore the hypothesis that ordering is not essential for\nrecognition. The proposed model achieved an accuracy rate of 97% on an\nArgentinian Sign Language dataset containing 64 classes of signs and 3200\nsamples, providing some evidence that indeed recognition without ordering is\npossible.\n","authors":["Franco Ronchetti","Facundo Manuel Quiroga","César Estrebou","Laura Lanzarini","Alejandro Rosete"],"pdf_url":"https://arxiv.org/pdf/2310.17437v1.pdf","comment":"IBERAMIA 2016"},{"id":"http://arxiv.org/abs/2310.17436v1","updated":"2023-10-26T14:47:10Z","published":"2023-10-26T14:47:10Z","title":"Uncertainty-weighted Loss Functions for Improved Adversarial Attacks on\n Semantic Segmentation","summary":" State-of-the-art deep neural networks have been shown to be extremely\npowerful in a variety of perceptual tasks like semantic segmentation. However,\nthese networks are vulnerable to adversarial perturbations of the input which\nare imperceptible for humans but lead to incorrect predictions. Treating image\nsegmentation as a sum of pixel-wise classifications, adversarial attacks\ndeveloped for classification models were shown to be applicable to segmentation\nmodels as well. 
In this work, we present simple uncertainty-based weighting\nschemes for the loss functions of such attacks that (i) put higher weights on\npixel classifications which can more easily perturbed and (ii) zero-out the\npixel-wise losses corresponding to those pixels that are already confidently\nmisclassified. The weighting schemes can be easily integrated into the loss\nfunction of a range of well-known adversarial attackers with minimal additional\ncomputational overhead, but lead to significant improved perturbation\nperformance, as we demonstrate in our empirical analysis on several datasets\nand models.\n","authors":["Kira Maag","Asja Fischer"],"pdf_url":"https://arxiv.org/pdf/2310.17436v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17429v1","updated":"2023-10-26T14:37:01Z","published":"2023-10-26T14:37:01Z","title":"LSA64: An Argentinian Sign Language Dataset","summary":" Automatic sign language recognition is a research area that encompasses\nhuman-computer interaction, computer vision and machine learning. Robust\nautomatic recognition of sign language could assist in the translation process\nand the integration of hearing-impaired people, as well as the teaching of sign\nlanguage to the hearing population. Sign languages differ significantly in\ndifferent countries and even regions, and their syntax and semantics are\ndifferent as well from those of written languages. While the techniques for\nautomatic sign language recognition are mostly the same for different\nlanguages, training a recognition system for a new language requires having an\nentire dataset for that language. This paper presents a dataset of 64 signs\nfrom the Argentinian Sign Language (LSA). The dataset, called LSA64, contains\n3200 videos of 64 different LSA signs recorded by 10 subjects, and is a first\nstep towards building a comprehensive research-level dataset of Argentinian\nsigns, specifically tailored to sign language recognition or other machine\nlearning tasks. 
The subjects that performed the signs wore colored gloves to\nease the hand tracking and segmentation steps, allowing experiments on the\ndataset to focus specifically on the recognition of signs. We also present a\npre-processed version of the dataset, from which we computed statistics of\nmovement, position and handshape of the signs.\n","authors":["Franco Ronchetti","Facundo Manuel Quiroga","César Estrebou","Laura Lanzarini","Alejandro Rosete"],"pdf_url":"https://arxiv.org/pdf/2310.17429v1.pdf","comment":"Published in CACIC XXII"},{"id":"http://arxiv.org/abs/2307.10922v3","updated":"2023-10-26T14:34:55Z","published":"2023-07-20T14:47:50Z","title":"Language-based Action Concept Spaces Improve Video Self-Supervised\n Learning","summary":" Recent contrastive language image pre-training has led to learning highly\ntransferable and robust image representations. However, adapting these models\nto video domains with minimal supervision remains an open problem. We explore a\nsimple step in that direction, using language tied self-supervised learning to\nadapt an image CLIP model to the video domain. A backbone modified for temporal\nmodeling is trained under self-distillation settings with train objectives\noperating in an action concept space. Feature vectors of various action\nconcepts extracted from a language encoder using relevant textual prompts\nconstruct this space. We introduce two train objectives, concept distillation\nand concept alignment, that retain generality of original representations while\nenforcing relations between actions and their attributes. 
Our approach improves\nzero-shot and linear probing performance on three action recognition\nbenchmarks.\n","authors":["Kanchana Ranasinghe","Michael Ryoo"],"pdf_url":"https://arxiv.org/pdf/2307.10922v3.pdf","comment":"Presented at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17427v1","updated":"2023-10-26T14:32:44Z","published":"2023-10-26T14:32:44Z","title":"Handshape recognition for Argentinian Sign Language using ProbSom","summary":" Automatic sign language recognition is an important topic within the areas of\nhuman-computer interaction and machine learning. On the one hand, it poses a\ncomplex challenge that requires the intervention of various knowledge areas,\nsuch as video processing, image processing, intelligent systems and\nlinguistics. On the other hand, robust recognition of sign language could\nassist in the translation process and the integration of hearing-impaired\npeople.\n This paper offers two main contributions: first, the creation of a database\nof handshapes for the Argentinian Sign Language (LSA), which is a topic that\nhas barely been discussed so far. Secondly, a technique for image processing,\ndescriptor extraction and subsequent handshape classification using a\nsupervised adaptation of self-organizing maps that is called ProbSom. 
This\ntechnique is compared to others in the state of the art, such as Support Vector\nMachines (SVM), Random Forests, and Neural Networks.\n The database that was built contains 800 images with 16 LSA handshapes, and\nis a first step towards building a comprehensive database of Argentinian signs.\nThe ProbSom-based neural classifier, using the proposed descriptor, achieved an\naccuracy rate above 90%.\n","authors":["Franco Ronchetti","Facundo Manuel Quiroga","César Estrebou","Laura Lanzarini"],"pdf_url":"https://arxiv.org/pdf/2310.17427v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17421v1","updated":"2023-10-26T14:24:57Z","published":"2023-10-26T14:24:57Z","title":"Distribution of Action Movements (DAM): A Descriptor for Human Action\n Recognition","summary":" Human action recognition from skeletal data is an important and active area\nof research in which the state of the art has not yet achieved near-perfect\naccuracy on many well-known datasets. In this paper, we introduce the\nDistribution of Action Movements Descriptor, a novel action descriptor based on\nthe distribution of the directions of the motions of the joints between frames,\nover the set of all possible motions in the dataset. The descriptor is computed\nas a normalized histogram over a set of representative directions of the\njoints, which are in turn obtained via clustering. 
While the descriptor is\nglobal in the sense that it represents the overall distribution of movement\ndirections of an action, it is able to partially retain its temporal structure\nby applying a windowing scheme.\n The descriptor, together with a standard classifier, outperforms several\nstate-of-the-art techniques on many well-known datasets.\n","authors":["Facundo Manuel Quiroga","Franco Ronchetti","Laura Lanzarini","Cesar Eestrebou"],"pdf_url":"https://arxiv.org/pdf/2310.17421v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17419v1","updated":"2023-10-26T14:23:45Z","published":"2023-10-26T14:23:45Z","title":"AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image\n Detectors","summary":" Deep generative models can create remarkably photorealistic fake images while\nraising concerns about misinformation and copyright infringement, known as\ndeepfake threats. Deepfake detection technique is developed to distinguish\nbetween real and fake images, where the existing methods typically learn\nclassifiers in the image domain or various feature domains. However, the\ngeneralizability of deepfake detection against emerging and more advanced\ngenerative models remains challenging. In this paper, being inspired by the\nzero-shot advantages of Vision-Language Models (VLMs), we propose a novel\napproach using VLMs (e.g. InstructBLIP) and prompt tuning techniques to improve\nthe deepfake detection accuracy over unseen data. We formulate deepfake\ndetection as a visual question answering problem, and tune soft prompts for\nInstructBLIP to answer the real/fake information of a query image. We conduct\nfull-spectrum experiments on datasets from 3 held-in and 13 held-out generative\nmodels, covering modern text-to-image generation, image editing and image\nattacks. 
Results demonstrate that (1) the deepfake detection accuracy can be\nsignificantly and consistently improved (from 58.8% to 91.31%, in average\naccuracy over unseen data) using pretrained vision-language models with prompt\ntuning; (2) our superior performance is at less cost of trainable parameters,\nresulting in an effective and efficient solution for deepfake detection. Code\nand models can be found at https://github.com/nctu-eva-lab/AntifakePrompt.\n","authors":["You-Ming Chang","Chen Yeh","Wei-Chen Chiu","Ning Yu"],"pdf_url":"https://arxiv.org/pdf/2310.17419v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17418v1","updated":"2023-10-26T14:22:43Z","published":"2023-10-26T14:22:43Z","title":"Circuit as Set of Points","summary":" As the size of circuit designs continues to grow rapidly, artificial\nintelligence technologies are being extensively used in Electronic Design\nAutomation (EDA) to assist with circuit design. Placement and routing are the\nmost time-consuming parts of the physical design process, and how to quickly\nevaluate the placement has become a hot research topic. Prior works either\ntransformed circuit designs into images using hand-crafted methods and then\nused Convolutional Neural Networks (CNN) to extract features, which are limited\nby the quality of the hand-crafted methods and could not achieve end-to-end\ntraining, or treated the circuit design as a graph structure and used Graph\nNeural Networks (GNN) to extract features, which require time-consuming\npreprocessing. In our work, we propose a novel perspective for circuit design\nby treating circuit components as point clouds and using Transformer-based\npoint cloud perception methods to extract features from the circuit. 
This\napproach enables direct feature extraction from raw data without any\npreprocessing, allows for end-to-end training, and results in high performance.\nExperimental results show that our method achieves state-of-the-art performance\nin congestion prediction tasks on both the CircuitNet and ISPD2015 datasets, as\nwell as in design rule check (DRC) violation prediction tasks on the CircuitNet\ndataset. Our method establishes a bridge between the relatively mature point\ncloud perception methods and the fast-developing EDA algorithms, enabling us to\nleverage more collective intelligence to solve this task. To facilitate the\nresearch of open EDA design, source codes and pre-trained models are released\nat https://github.com/hustvl/circuitformer.\n","authors":["Jialv Zou","Xinggang Wang","Jiahao Guo","Wenyu Liu","Qian Zhang","Chang Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17418v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17403v1","updated":"2023-10-26T13:56:12Z","published":"2023-10-26T13:56:12Z","title":"Detection Defenses: An Empty Promise against Adversarial Patch Attacks\n on Optical Flow","summary":" Adversarial patches undermine the reliability of optical flow predictions\nwhen placed in arbitrary scene locations. Therefore, they pose a realistic\nthreat to real-world motion detection and its downstream applications.\nPotential remedies are defense strategies that detect and remove adversarial\npatches, but their influence on the underlying motion prediction has not been\ninvestigated. In this paper, we thoroughly examine the currently available\ndetect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art\noptical flow methods, and illuminate their side effects on the quality and\nrobustness of the final flow predictions. In particular, we implement\ndefense-aware attacks to investigate whether current defenses are able to\nwithstand attacks that take the defense mechanism into account. 
Our experiments\nyield two surprising results: Detect-and-remove defenses do not only lower the\noptical flow quality on benign scenes, in doing so, they also harm the\nrobustness under patch attacks for all tested optical flow methods except\nFlowNetC. As currently employed detect-and-remove defenses fail to deliver the\npromised adversarial robustness for optical flow, they evoke a false sense of\nsecurity. The code is available at\nhttps://github.com/cv-stuttgart/DetectionDefenses.\n","authors":["Erik Scheurer","Jenny Schmalfuss","Alexander Lis","Andrés Bruhn"],"pdf_url":"https://arxiv.org/pdf/2310.17403v1.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.17395v1","updated":"2023-10-26T13:46:20Z","published":"2023-10-26T13:46:20Z","title":"Learning Temporal Sentence Grounding From Narrated EgoVideos","summary":" The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens\npresents a new challenge for the task of Temporal Sentence Grounding (TSG).\nCompared to traditional benchmarks on which this task is evaluated, these\ndatasets offer finer-grained sentences to ground in notably longer videos. In\nthis paper, we develop an approach for learning to ground sentences in these\ndatasets using only narrations and their corresponding rough narration\ntimestamps. We propose to artificially merge clips to train for temporal\ngrounding in a contrastive manner using text-conditioning attention. This Clip\nMerging (CliMer) approach is shown to be effective when compared with a high\nperforming TSG method -- e.g. mean R@1 improves from 3.9 to 5.7 on Ego4D and\nfrom 10.7 to 13.0 on EPIC-Kitchens. 
Code and data splits available from:\nhttps://github.com/keflanagan/CliMer\n","authors":["Kevin Flanagan","Dima Damen","Michael Wray"],"pdf_url":"https://arxiv.org/pdf/2310.17395v1.pdf","comment":"Accepted in BMVC 2023"},{"id":"http://arxiv.org/abs/2304.08965v4","updated":"2023-10-26T13:17:13Z","published":"2023-04-18T12:58:21Z","title":"PointDC:Unsupervised Semantic Segmentation of 3D Point Clouds via\n Cross-modal Distillation and Super-Voxel Clustering","summary":" Semantic segmentation of point clouds usually requires exhausting efforts of\nhuman annotations, hence it attracts wide attention to the challenging topic of\nlearning from unlabeled or weaker forms of annotations. In this paper, we take\nthe first attempt for fully unsupervised semantic segmentation of point clouds,\nwhich aims to delineate semantically meaningful objects without any form of\nannotations. Previous works of unsupervised pipeline on 2D images fails in this\ntask of point clouds, due to: 1) Clustering Ambiguity caused by limited\nmagnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity\ncaused by the irregular sparsity of point cloud. Therefore, we propose a novel\nframework, PointDC, which is comprised of two steps that handle the\naforementioned problems respectively: Cross-Modal Distillation (CMD) and\nSuper-Voxel Clustering (SVC). In the first stage of CMD, multi-view visual\nfeatures are back-projected to the 3D space and aggregated to a unified point\nfeature to distill the training of the point representation. In the second\nstage of SVC, the point features are aggregated to super-voxels and then fed to\nthe iterative clustering process for excavating semantic classes. 
PointDC\nyields a significant improvement over the prior state-of-the-art unsupervised\nmethods, on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic\nsegmentation benchmarks.\n","authors":["Zisheng Chen","Hongbin Xu","Weitao Chen","Zhipeng Zhou","Haihong Xiao","Baigui Sun","Xuansong Xie","Wenxiong Kang"],"pdf_url":"https://arxiv.org/pdf/2304.08965v4.pdf","comment":"Accepted by International Conference on Computer Vision (ICCV) 2023"},{"id":"http://arxiv.org/abs/2310.17379v1","updated":"2023-10-26T13:16:27Z","published":"2023-10-26T13:16:27Z","title":"YOLO-BEV: Generating Bird's-Eye View in the Same Way as 2D Object\n Detection","summary":" Vehicle perception systems strive to achieve comprehensive and rapid visual\ninterpretation of their surroundings for improved safety and navigation. We\nintroduce YOLO-BEV, an efficient framework that harnesses a unique surrounding\ncameras setup to generate a 2D bird's-eye view of the vehicular environment. By\nstrategically positioning eight cameras, each at a 45-degree interval, our\nsystem captures and integrates imagery into a coherent 3x3 grid format, leaving\nthe center blank, providing an enriched spatial representation that facilitates\nefficient processing. In our approach, we employ YOLO's detection mechanism,\nfavoring its inherent advantages of swift response and compact model structure.\nInstead of leveraging the conventional YOLO detection head, we augment it with\na custom-designed detection head, translating the panoramically captured data\ninto a unified bird's-eye view map of ego car. Preliminary results validate the\nfeasibility of YOLO-BEV in real-time vehicular perception tasks. 
With its\nstreamlined architecture and potential for rapid deployment due to minimized\nparameters, YOLO-BEV poses as a promising tool that may reshape future\nperspectives in autonomous driving systems.\n","authors":["Chang Liu","Liguo Zhou","Yanliang Huang","Alois Knoll"],"pdf_url":"https://arxiv.org/pdf/2310.17379v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.06874v2","updated":"2023-10-26T13:05:07Z","published":"2023-01-17T13:30:03Z","title":"Training Methods of Multi-label Prediction Classifiers for Hyperspectral\n Remote Sensing Images","summary":" With their combined spectral depth and geometric resolution, hyperspectral\nremote sensing images embed a wealth of complex, non-linear information that\nchallenges traditional computer vision techniques. Yet, deep learning methods\nknown for their representation learning capabilities prove more suitable for\nhandling such complexities. Unlike applications that focus on single-label,\npixel-level classification methods for hyperspectral remote sensing images, we\npropose a multi-label, patch-level classification method based on a\ntwo-component deep-learning network. We use patches of reduced spatial\ndimension and a complete spectral depth extracted from the remote sensing\nimages. Additionally, we investigate three training schemes for our network:\nIterative, Joint, and Cascade. Experiments suggest that the Joint scheme is the\nbest-performing scheme; however, its application requires an expensive search\nfor the best weight combination of the loss constituents. The Iterative scheme\nenables the sharing of features between the two parts of the network at the\nearly stages of training. 
It performs better on complex data with multi-labels.\nFurther experiments showed that methods designed with different architectures\nperformed well when trained on patches extracted and labeled according to our\nsampling method.\n","authors":["Salma Haidar","José Oramas"],"pdf_url":"https://arxiv.org/pdf/2301.06874v2.pdf","comment":"1- Added references. 2- updated methodology figure and added new\n figures to visualise the different training schemes and 3- Correcting typos\n 4- Revised introduction, no change in results or discussion"},{"id":"http://arxiv.org/abs/2309.07510v3","updated":"2023-10-26T12:54:40Z","published":"2023-09-14T08:24:32Z","title":"Learning Environment-Aware Affordance for 3D Articulated Object\n Manipulation under Occlusions","summary":" Perceiving and manipulating 3D articulated objects in diverse environments is\nessential for home-assistant robots. Recent studies have shown that point-level\naffordance provides actionable priors for downstream manipulation tasks.\nHowever, existing works primarily focus on single-object scenarios with\nhomogeneous agents, overlooking the realistic constraints imposed by the\nenvironment and the agent's morphology, e.g., occlusions and physical\nlimitations. In this paper, we propose an environment-aware affordance\nframework that incorporates both object-level actionable priors and environment\nconstraints. Unlike object-centric affordance approaches, learning\nenvironment-aware affordance faces the challenge of combinatorial explosion due\nto the complexity of various occlusions, characterized by their quantities,\ngeometries, positions and poses. To address this and enhance data efficiency,\nwe introduce a novel contrastive affordance learning framework capable of\ntraining on scenes containing a single occluder and generalizing to scenes with\ncomplex occluder combinations. 
Experiments demonstrate the effectiveness of our\nproposed approach in learning affordance considering environment constraints.\nProject page at https://chengkaiacademycity.github.io/EnvAwareAfford/\n","authors":["Ruihai Wu","Kai Cheng","Yan Shen","Chuanruo Ning","Guanqi Zhan","Hao Dong"],"pdf_url":"https://arxiv.org/pdf/2309.07510v3.pdf","comment":"In 37th Conference on Neural Information Processing Systems (NeurIPS\n 2023). Website at https://chengkaiacademycity.github.io/EnvAwareAfford/"},{"id":"http://arxiv.org/abs/2310.17359v1","updated":"2023-10-26T12:47:26Z","published":"2023-10-26T12:47:26Z","title":"SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D\n Object Pose Estimation","summary":" In this paper, we introduce an SE(3) diffusion model-based point cloud\nregistration framework for 6D object pose estimation in real-world scenarios.\nOur approach formulates the 3D registration task as a denoising diffusion\nprocess, which progressively refines the pose of the source point cloud to\nobtain a precise alignment with the model point cloud. Training our framework\ninvolves two operations: An SE(3) diffusion process and an SE(3) reverse\nprocess. The SE(3) diffusion process gradually perturbs the optimal rigid\ntransformation of a pair of point clouds by continuously injecting noise\n(perturbation transformation). By contrast, the SE(3) reverse process focuses\non learning a denoising network that refines the noisy transformation\nstep-by-step, bringing it closer to the optimal transformation for accurate\npose estimation. Unlike standard diffusion models used in linear Euclidean\nspaces, our diffusion model operates on the SE(3) manifold. This requires\nexploiting the linear Lie algebra $\\mathfrak{se}(3)$ associated with SE(3) to\nconstrain the transformation transitions during the diffusion and reverse\nprocesses. 
Additionally, to effectively train our denoising network, we derive\na registration-specific variational lower bound as the optimization objective\nfor model learning. Furthermore, we show that our denoising network can be\nconstructed with a surrogate registration model, making our approach applicable\nto different deep registration networks. Extensive experiments demonstrate that\nour diffusion registration framework presents outstanding pose estimation\nperformance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets.\n","authors":["Haobo Jiang","Mathieu Salzmann","Zheng Dang","Jin Xie","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2310.17359v1.pdf","comment":"Accepted by NeurIPS-2023"},{"id":"http://arxiv.org/abs/2310.17356v1","updated":"2023-10-26T12:44:45Z","published":"2023-10-26T12:44:45Z","title":"Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning","summary":" Ahead-of-time forecasting of the output power of power plants is essential\nfor the stability of the electricity grid and ensuring uninterrupted service.\nHowever, forecasting renewable energy sources is difficult due to the chaotic\nbehavior of natural energy sources. This paper presents a new approach to\nestimate short-term solar irradiance from sky images. The proposed algorithm\nextracts features from sky images and uses learning-based techniques to estimate\nthe solar irradiance. The performance of the proposed machine learning (ML)\nalgorithm is evaluated using two publicly available datasets of sky images.\nThe datasets contain over 350,000 images for an interval of 16 years, from 2004\nto 2020, with the corresponding global horizontal irradiance (GHI) of each\nimage as the ground truth. 
Compared to the state-of-the-art computationally\nheavy algorithms proposed in the literature, our approach achieves competitive\nresults with much less computational complexity for both nowcasting and\nforecasting up to 4 h ahead of time.\n","authors":["Anas Al-lahham","Obaidah Theeb","Khaled Elalem","Tariq A. Alshawi","Saleh A. Alshebeili"],"pdf_url":"https://arxiv.org/pdf/2310.17356v1.pdf","comment":"Published in MDPI Electronics Journal"},{"id":"http://arxiv.org/abs/2310.16020v2","updated":"2023-10-26T12:37:00Z","published":"2023-10-24T17:30:26Z","title":"ConvBKI: Real-Time Probabilistic Semantic Mapping Network with\n Quantifiable Uncertainty","summary":" In this paper, we develop a modular neural network for real-time semantic\nmapping in uncertain environments, which explicitly updates per-voxel\nprobabilistic distributions within a neural network layer. Our approach\ncombines the reliability of classical probabilistic algorithms with the\nperformance and efficiency of modern neural networks. Although robotic\nperception is often divided between modern differentiable methods and classical\nexplicit methods, a union of both is necessary for real-time and trustworthy\nperformance. We introduce a novel Convolutional Bayesian Kernel Inference\n(ConvBKI) layer which incorporates semantic segmentation predictions online\ninto a 3D map through a depthwise convolution layer by leveraging conjugate\npriors. We compare ConvBKI against state-of-the-art deep learning approaches\nand probabilistic algorithms for mapping to evaluate reliability and\nperformance. 
We also create a Robot Operating System (ROS) package of ConvBKI\nand test it on real-world perceptually challenging off-road driving data.\n","authors":["Joey Wilson","Yuewei Fu","Joshua Friesen","Parker Ewen","Andrew Capodieci","Paramsothy Jayakumar","Kira Barton","Maani Ghaffari"],"pdf_url":"https://arxiv.org/pdf/2310.16020v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2209.10663"},{"id":"http://arxiv.org/abs/2310.17347v1","updated":"2023-10-26T12:27:56Z","published":"2023-10-26T12:27:56Z","title":"CADS: Unleashing the Diversity of Diffusion Models through\n Condition-Annealed Sampling","summary":" While conditional diffusion models are known to have good coverage of the\ndata distribution, they still face limitations in output diversity,\nparticularly when sampled with a high classifier-free guidance scale for\noptimal image quality or when trained on small datasets. We attribute this\nproblem to the role of the conditioning signal in inference and offer an\nimproved sampling strategy for diffusion models that can increase generation\ndiversity, especially at high guidance scales, with minimal loss of sample\nquality. Our sampling strategy anneals the conditioning signal by adding\nscheduled, monotonically decreasing Gaussian noise to the conditioning vector\nduring inference to balance diversity and condition alignment. Our\nCondition-Annealed Diffusion Sampler (CADS) can be used with any pretrained\nmodel and sampling algorithm, and we show that it boosts the diversity of\ndiffusion models in various conditional generation tasks. Further, using an\nexisting pretrained diffusion model, CADS achieves a new state-of-the-art FID\nof 1.70 and 2.31 for class-conditional ImageNet generation at 256$\\times$256\nand 512$\\times$512 respectively.\n","authors":["Seyedmorteza Sadat","Jakob Buhmann","Derek Bradely","Otmar Hilliges","Romann M. 
Weber"],"pdf_url":"https://arxiv.org/pdf/2310.17347v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.00359v3","updated":"2023-10-26T12:18:51Z","published":"2023-09-01T09:34:49Z","title":"Large Content And Behavior Models To Understand, Simulate, And Optimize\n Content And Behavior","summary":" Shannon, in his seminal paper introducing information theory, divided the\ncommunication into three levels: technical, semantic, and effectiveness. While\nthe technical level is concerned with accurate reconstruction of transmitted\nsymbols, the semantic and effectiveness levels deal with the inferred meaning\nand its effect on the receiver. Thanks to telecommunications, the first level\nproblem has produced great advances like the internet. Large Language Models\n(LLMs) make some progress towards the second goal, but the third level still\nremains largely untouched. The third problem deals with predicting and\noptimizing communication for desired receiver behavior. LLMs, while showing\nwide generalization capabilities across a wide range of tasks, are unable to\nsolve for this. One reason for the underperformance could be a lack of\n``behavior tokens'' in LLMs' training corpora. Behavior tokens define receiver\nbehavior over a communication, such as shares, likes, clicks, purchases,\nretweets, etc. While preprocessing data for LLM training, behavior tokens are\noften removed from the corpora as noise. Therefore, in this paper, we make some\ninitial progress towards reintroducing behavior tokens in LLM training. The\ntrained models, other than showing similar performance to LLMs on content\nunderstanding tasks, show generalization capabilities on behavior simulation,\ncontent simulation, behavior understanding, and behavior domain adaptation.\nUsing a wide range of tasks on two corpora, we show results on all these\ncapabilities. 
We call these models Large Content and Behavior Models (LCBMs).\nFurther, to spur more research on LCBMs, we release our new Content Behavior\nCorpus (CBC), a repository containing communicator, message, and corresponding\nreceiver behavior.\n","authors":["Ashmit Khandelwal","Aditya Agrawal","Aanisha Bhattacharyya","Yaman K Singla","Somesh Singh","Uttaran Bhattacharya","Ishita Dasgupta","Stefano Petrangeli","Rajiv Ratn Shah","Changyou Chen","Balaji Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2309.00359v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15134v2","updated":"2023-10-26T11:53:27Z","published":"2023-05-24T13:27:11Z","title":"Networks are Slacking Off: Understanding Generalization Problem in Image\n Deraining","summary":" Deep deraining networks consistently encounter substantial generalization\nissues when deployed in real-world applications, although they are successful\nin laboratory benchmarks. A prevailing perspective in deep learning encourages\nusing highly complex data for training, with the expectation that richer image\nbackground content will facilitate overcoming the generalization problem.\nHowever, through comprehensive and systematic experimentation, we discover that\nthis strategy does not enhance the generalization capability of these networks.\nOn the contrary, it exacerbates the tendency of networks to overfit specific\ndegradations. Our experiments reveal that better generalization in a deraining\nnetwork can be achieved by simplifying the complexity of the training\nbackground images. This is because the networks are ``slacking off''\nduring training, that is, learning the least complex elements in the image\nbackground and degradation to minimize training loss. When the background\nimages are less complex than the rain streaks, the network will prioritize the\nbackground reconstruction, thereby suppressing overfitting the rain patterns\nand leading to improved generalization performance. 
Our research offers a\nvaluable perspective and methodology for better understanding the\ngeneralization problem in low-level vision tasks and displays promising\npotential for practical application.\n","authors":["Jinjin Gu","Xianzheng Ma","Xiangtao Kong","Yu Qiao","Chao Dong"],"pdf_url":"https://arxiv.org/pdf/2305.15134v2.pdf","comment":"This article has been accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2211.00990v2","updated":"2023-10-26T11:47:25Z","published":"2022-11-02T09:51:15Z","title":"A weighted-variance variational autoencoder model for speech enhancement","summary":" We address speech enhancement based on variational autoencoders, which\ninvolves learning a speech prior distribution in the time-frequency (TF)\ndomain. A zero-mean complex-valued Gaussian distribution is usually assumed for\nthe generative model, where the speech information is encoded in the variance\nas a function of a latent variable. In contrast to this commonly used approach,\nwe propose a weighted variance generative model, where the contribution of each\nspectrogram time-frame in parameter learning is weighted. We impose a Gamma\nprior distribution on the weights, which would effectively lead to a Student's\nt-distribution instead of Gaussian for speech generative modeling. We develop\nefficient training and speech enhancement algorithms based on the proposed\ngenerative model. 
Our experimental results on spectrogram auto-encoding and\nspeech enhancement demonstrate the effectiveness and robustness of the proposed\napproach compared to the standard unweighted variance model.\n","authors":["Ali Golmakani","Mostafa Sadeghi","Xavier Alameda-Pineda","Romain Serizel"],"pdf_url":"https://arxiv.org/pdf/2211.00990v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17325v1","updated":"2023-10-26T11:44:42Z","published":"2023-10-26T11:44:42Z","title":"C-Disentanglement: Discovering Causally-Independent Generative Factors\n under an Inductive Bias of Confounder","summary":" Representation learning assumes that real-world data is generated by a few\nsemantically meaningful generative factors (i.e., sources of variation) and\naims to discover them in the latent space. These factors are expected to be\ncausally disentangled, meaning that distinct factors are encoded into separate\nlatent variables, and changes in one factor will not affect the values of the\nothers. Compared to statistical independence, causal disentanglement allows\nmore controllable data generation, improved robustness, and better\ngeneralization. However, most existing work assumes unconfoundedness in the\ndiscovery process, that there are no common causes to the generative factors\nand thus obtain only statistical independence. In this paper, we recognize the\nimportance of modeling confounders in discovering causal generative factors.\nUnfortunately, such factors are not identifiable without proper inductive bias.\nWe fill the gap by introducing a framework entitled Confounded-Disentanglement\n(C-Disentanglement), the first framework that explicitly introduces the\ninductive bias of confounder via labels from domain expertise. In addition, we\naccordingly propose an approach to sufficiently identify the causally\ndisentangled factors under any inductive bias of the confounder. We conduct\nextensive experiments on both synthetic and real-world datasets. 
Our method\ndemonstrates competitive results compared to various SOTA baselines in\nobtaining causally disentangled features and downstream tasks under domain\nshifts.\n","authors":["Xiaoyu Liu","Jiaxin Yuan","Bang An","Yuancheng Xu","Yifan Yang","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17325v1.pdf","comment":"accepted to Neurips 2023"},{"id":"http://arxiv.org/abs/2310.17323v1","updated":"2023-10-26T11:44:29Z","published":"2023-10-26T11:44:29Z","title":"IndustReal: A Dataset for Procedure Step Recognition Handling Execution\n Errors in Egocentric Videos in an Industrial-Like Setting","summary":" Although action recognition for procedural tasks has received notable\nattention, it has a fundamental flaw in that no measure of success for actions\nis provided. This limits the applicability of such systems especially within\nthe industrial domain, since the outcome of procedural actions is often\nsignificantly more important than the mere execution. To address this\nlimitation, we define the novel task of procedure step recognition (PSR),\nfocusing on recognizing the correct completion and order of procedural steps.\nAlongside the new task, we also present the multi-modal IndustReal dataset.\nUnlike currently available datasets, IndustReal contains procedural errors\n(such as omissions) as well as execution errors. A significant part of these\nerrors are exclusively present in the validation and test sets, making\nIndustReal suitable to evaluate robustness of algorithms to new, unseen\nmistakes. Additionally, to encourage reproducibility and allow for scalable\napproaches trained on synthetic data, the 3D models of all parts are publicly\navailable. Annotations and benchmark performance are provided for action\nrecognition and assembly state detection, as well as the new PSR task.\nIndustReal, along with the code and model weights, is available at:\nhttps://github.com/TimSchoonbeek/IndustReal .\n","authors":["Tim J. 
Schoonbeek","Tim Houben","Hans Onvlee","Peter H. N. de With","Fons van der Sommen"],"pdf_url":"https://arxiv.org/pdf/2310.17323v1.pdf","comment":"Accepted for WACV 2024. 15 pages, 9 figures, including supplementary\n materials"},{"id":"http://arxiv.org/abs/2310.17316v1","updated":"2023-10-26T11:23:24Z","published":"2023-10-26T11:23:24Z","title":"Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with\n Rich Semantics","summary":" Defect inspection is paramount within the closed-loop manufacturing system.\nHowever, existing datasets for defect inspection often lack precision and\nsemantic granularity required for practical applications. In this paper, we\nintroduce the Defect Spectrum, a comprehensive benchmark that offers precise,\nsemantic-abundant, and large-scale annotations for a wide range of industrial\ndefects. Building on four key industrial benchmarks, our dataset refines\nexisting annotations and introduces rich semantic details, distinguishing\nmultiple defect types within a single image. Furthermore, we introduce\nDefect-Gen, a two-stage diffusion-based generator designed to create\nhigh-quality and diverse defective images, even when working with limited\ndatasets. The synthetic images generated by Defect-Gen significantly enhance\nthe efficacy of defect inspection models. Overall, The Defect Spectrum dataset\ndemonstrates its potential in defect inspection research, offering a solid\nplatform for testing and refining advanced models.\n","authors":["Shuai Yang","Zhifei Chen","Pengguang Chen","Xi Fang","Shu Liu","Yingcong Chen"],"pdf_url":"https://arxiv.org/pdf/2310.17316v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13847v2","updated":"2023-10-26T11:04:19Z","published":"2023-09-25T03:20:09Z","title":"Tuning Multi-mode Token-level Prompt Alignment across Modalities","summary":" Advancements in prompt tuning of vision-language models have underscored\ntheir potential in enhancing open-world visual concept comprehension. 
However,\nprior works only primarily focus on single-mode (only one prompt for each\nmodality) and holistic level (image or sentence) semantic alignment, which\nfails to capture the sample diversity, leading to sub-optimal prompt discovery.\nTo address the limitation, we propose a multi-mode token-level tuning framework\nthat leverages the optimal transportation to learn and align a set of prompt\ntokens across modalities. Specifically, we rely on two essential factors: 1)\nmulti-mode prompts discovery, which guarantees diverse semantic\nrepresentations, and 2) token-level alignment, which helps explore fine-grained\nsimilarity. Consequently, the similarity can be calculated as a hierarchical\ntransportation problem between the modality-specific sets. Extensive\nexperiments on popular image recognition benchmarks show the superior\ngeneralization and few-shot abilities of our approach. The qualitative analysis\ndemonstrates that the learned prompt tokens have the ability to capture diverse\nvisual concepts.\n","authors":["Dongsheng Wang","Miaoge Li","Xinyang Liu","MingSheng Xu","Bo Chen","Hanwang Zhang"],"pdf_url":"https://arxiv.org/pdf/2309.13847v2.pdf","comment":"In Proceedings of NeurIPS2023"},{"id":"http://arxiv.org/abs/2304.12760v3","updated":"2023-10-26T11:01:59Z","published":"2023-04-25T12:19:18Z","title":"Parallel Spiking Neurons with High Efficiency and Ability to Learn\n Long-term Dependencies","summary":" Vanilla spiking neurons in Spiking Neural Networks (SNNs) use\ncharge-fire-reset neuronal dynamics, which can only be simulated serially and\ncan hardly learn long-time dependencies. 
We find that when removing reset, the\nneuronal dynamics can be reformulated in a non-iterative form and parallelized.\nBy rewriting neuronal dynamics without reset to a general formulation, we\npropose the Parallel Spiking Neuron (PSN), which generates hidden states that\nare independent of their predecessors, resulting in parallelizable neuronal\ndynamics and extremely high simulation speed. The weights of inputs in the PSN\nare fully connected, which maximizes the utilization of temporal information.\nTo avoid the use of future inputs for step-by-step inference, the weights of\nthe PSN can be masked, resulting in the masked PSN. By sharing weights across\ntime-steps based on the masked PSN, the sliding PSN is proposed to handle\nsequences of varying lengths. We evaluate the PSN family on simulation speed\nand temporal/static data classification, and the results show the overwhelming\nadvantage of the PSN family in efficiency and accuracy. To the best of our\nknowledge, this is the first study about parallelizing spiking neurons and can\nbe a cornerstone for the spiking deep learning research. Our codes are\navailable at \\url{https://github.com/fangwei123456/Parallel-Spiking-Neuron}.\n","authors":["Wei Fang","Zhaofei Yu","Zhaokun Zhou","Ding Chen","Yanqi Chen","Zhengyu Ma","Timothée Masquelier","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2304.12760v3.pdf","comment":"Accepted in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2303.04909v3","updated":"2023-10-26T10:53:03Z","published":"2023-03-08T21:55:15Z","title":"Robotic Fabric Flattening with Wrinkle Direction Detection","summary":" Deformable Object Manipulation (DOM) is an important field of research as it\ncontributes to practical tasks such as automatic cloth handling, cable routing,\nsurgical operation, etc. Perception is considered one of the major challenges\nin DOM due to the complex dynamics and high degree of freedom of deformable\nobjects. 
In this paper, we develop a novel image-processing algorithm based on\nGabor filters to extract useful features from cloth, and based on this, devise\na strategy for cloth flattening tasks. We also evaluate the overall framework\nexperimentally and compare it with three human operators. The results show that\nour algorithm can determine the direction of wrinkles on the cloth accurately\nin simulation as well as in real robot experiments. Furthermore, our\ndewrinkling strategy compares favorably to baseline methods. The experiment\nvideo is available on\nhttps://sites.google.com/view/robotic-fabric-flattening/home\n","authors":["Yulei Qiu","Jihong Zhu","Cosimo Della Santina","Michael Gienger","Jens Kober"],"pdf_url":"https://arxiv.org/pdf/2303.04909v3.pdf","comment":"Accepted by the 18th International Symposium on Experimental Robotics\n (ISER 2023)"},{"id":"http://arxiv.org/abs/2310.17294v1","updated":"2023-10-26T10:18:51Z","published":"2023-10-26T10:18:51Z","title":"Scale-Adaptive Feature Aggregation for Efficient Space-Time Video\n Super-Resolution","summary":" The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual\nquality of videos, by simultaneously performing video frame interpolation (VFI)\nand video super-resolution (VSR). However, facing the challenge of the\nadditional temporal dimension and scale inconsistency, most existing STVSR\nmethods are complex and inflexible in dynamically modeling different motion\namplitudes. In this work, we find that choosing an appropriate processing scale\nachieves remarkable benefits in flow-based feature propagation. We propose a\nnovel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects\nsub-networks with different processing scales for individual samples.\nExperiments on four public STVSR benchmarks demonstrate that SAFA achieves\nstate-of-the-art performance. 
Our SAFA network outperforms recent\nstate-of-the-art methods such as TMNet and VideoINR by an average improvement\nof over 0.5dB on PSNR, while requiring less than half the number of parameters\nand only 1/3 computational costs.\n","authors":["Zhewei Huang","Ailin Huang","Xiaotao Hu","Chen Hu","Jun Xu","Shuchang Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.17294v1.pdf","comment":"WACV2024, 16 pages"},{"id":"http://arxiv.org/abs/2310.17290v1","updated":"2023-10-26T10:15:21Z","published":"2023-10-26T10:15:21Z","title":"RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open\n Environments","summary":" Intention-oriented object detection aims to detect desired objects based on\nspecific intentions or requirements. For instance, when we desire to \"lie down\nand rest\", we instinctively seek out a suitable option such as a \"bed\" or a\n\"sofa\" that can fulfill our needs. Previous work in this area is limited either\nby the number of intention descriptions or by the affordance vocabulary\navailable for intention objects. These limitations make it challenging to\nhandle intentions in open environments effectively. To facilitate this\nresearch, we construct a comprehensive dataset called Reasoning\nIntention-Oriented Objects (RIO). In particular, RIO is specifically designed\nto incorporate diverse real-world scenarios and a wide range of object\ncategories. It offers the following key features: 1) intention descriptions in\nRIO are represented as natural sentences rather than a mere word or verb\nphrase, making them more practical and meaningful; 2) the intention\ndescriptions are contextually relevant to the scene, enabling a broader range\nof potential functionalities associated with the objects; 3) the dataset\ncomprises a total of 40,214 images and 130,585 intention-object pairs. 
With the\nproposed RIO, we evaluate the ability of some existing models to reason\nintention-oriented objects in open environments.\n","authors":["Mengxue Qu","Yu Wu","Wu Liu","Xiaodan Liang","Jingkuan Song","Yao Zhao","Yunchao Wei"],"pdf_url":"https://arxiv.org/pdf/2310.17290v1.pdf","comment":"NeurIPS 2023 D&B accepted. See our project page for more details:\n https://reasonio.github.io/"},{"id":"http://arxiv.org/abs/2305.09758v3","updated":"2023-10-26T10:08:31Z","published":"2023-05-16T19:13:11Z","title":"A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In\n Zero Shot","summary":" Multimedia content, such as advertisements and story videos, exhibit a rich\nblend of creativity and multiple modalities. They incorporate elements like\ntext, visuals, audio, and storytelling techniques, employing devices like\nemotions, symbolism, and slogans to convey meaning. There is a dearth of large\nannotated training datasets in the multimedia domain hindering the development\nof supervised learning models with satisfactory performance for real-world\napplications. On the other hand, the rise of large language models (LLMs) has\nwitnessed remarkable zero-shot performance in various natural language\nprocessing (NLP) tasks, such as emotion classification, question-answering, and\ntopic classification. To leverage such advanced techniques to bridge this\nperformance gap in multimedia understanding, we propose verbalizing long videos\nto generate their descriptions in natural language, followed by performing\nvideo-understanding tasks on the generated story as opposed to the original\nvideo. Through extensive experiments on fifteen video-understanding tasks, we\ndemonstrate that our method, despite being zero-shot, achieves significantly\nbetter results than supervised baselines for video understanding. 
Furthermore,\nto alleviate a lack of story understanding benchmarks, we publicly release the\nfirst dataset on a crucial task in computational social science on persuasion\nstrategy identification.\n","authors":["Aanisha Bhattacharya","Yaman K Singla","Balaji Krishnamurthy","Rajiv Ratn Shah","Changyou Chen"],"pdf_url":"https://arxiv.org/pdf/2305.09758v3.pdf","comment":"Accepted to EMNLP-23 TL;DR: Video understanding lags far behind NLP;\n LLMs excel in zero-shot. Our approach utilizes LLMs to verbalize videos,\n creating stories for zero-shot video understanding. This yields\n state-of-the-art results across five datasets, covering fifteen tasks"},{"id":"http://arxiv.org/abs/2301.08951v4","updated":"2023-10-26T10:07:02Z","published":"2023-01-21T13:39:39Z","title":"Time-Conditioned Generative Modeling of Object-Centric Representations\n for Video Decomposition and Prediction","summary":" When perceiving the world from multiple viewpoints, humans have the ability\nto reason about the complete objects in a compositional manner even when an\nobject is completely occluded from certain viewpoints. Meanwhile, humans are\nable to imagine novel views after observing multiple viewpoints. Recent\nremarkable advances in multi-view object-centric learning still leaves some\nunresolved problems: 1) The shapes of partially or completely occluded objects\ncan not be well reconstructed. 2) The novel viewpoint prediction depends on\nexpensive viewpoint annotations rather than implicit rules in view\nrepresentations. In this paper, we introduce a time-conditioned generative\nmodel for videos. To reconstruct the complete shape of an object accurately, we\nenhance the disentanglement between the latent representations of objects and\nviews, where the latent representations of time-conditioned views are jointly\ninferred with a Transformer and then are input to a sequential extension of\nSlot Attention to learn object-centric representations. 
In addition, Gaussian\nprocesses are employed as priors of view latent variables for video generation\nand novel-view prediction without viewpoint annotations. Experiments on\nmultiple datasets demonstrate that the proposed model can perform object-centric\nvideo decomposition, reconstruct the complete shapes of occluded objects, and\nmake novel-view predictions.\n","authors":["Chengmin Gao","Bin Li"],"pdf_url":"https://arxiv.org/pdf/2301.08951v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17281v1","updated":"2023-10-26T10:02:33Z","published":"2023-10-26T10:02:33Z","title":"BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point\n Clouds","summary":" We present a surprisingly simple and efficient method for self-supervision of\na 3D backbone on automotive Lidar point clouds. We design a contrastive loss\nbetween features of Lidar scans captured in the same scene. Several such\napproaches have been proposed in the literature from PointContrast, which uses\na contrast at the level of points, to the state-of-the-art TARL, which uses a\ncontrast at the level of segments, roughly corresponding to objects. While the\nformer enjoys great simplicity of implementation, it is surpassed by the\nlatter, which however requires costly pre-processing. 
In BEVContrast, we\ndefine our contrast at the level of 2D cells in the Bird's Eye View plane.\nResulting cell-level representations offer a good trade-off between the\npoint-level representations exploited in PointContrast and segment-level\nrepresentations exploited in TARL: we retain the simplicity of PointContrast\n(cell representations are cheap to compute) while surpassing the performance of\nTARL in downstream semantic segmentation.\n","authors":["Corentin Sautier","Gilles Puy","Alexandre Boulch","Renaud Marlet","Vincent Lepetit"],"pdf_url":"https://arxiv.org/pdf/2310.17281v1.pdf","comment":"Accepted to 3DV 2024"},{"id":"http://arxiv.org/abs/2309.01270v2","updated":"2023-10-26T09:58:37Z","published":"2023-09-03T20:50:53Z","title":"COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action\n Spotting using Transformers","summary":" We present COMEDIAN, a novel pipeline to initialize spatiotemporal\ntransformers for action spotting, which involves self-supervised learning and\nknowledge distillation. Action spotting is a timestamp-level temporal action\ndetection task. Our pipeline consists of three steps, with two initialization\nstages. First, we perform self-supervised initialization of a spatial\ntransformer using short videos as input. Additionally, we initialize a temporal\ntransformer that enhances the spatial transformer's outputs with global context\nthrough knowledge distillation from a pre-computed feature bank aligned with\neach short video segment. In the final step, we fine-tune the transformers to\nthe action spotting task. The experiments, conducted on the SoccerNet-v2\ndataset, demonstrate state-of-the-art performance and validate the\neffectiveness of COMEDIAN's pretraining paradigm. 
Our results highlight several\nadvantages of our pretraining pipeline, including improved performance and\nfaster convergence compared to non-pretrained models.\n","authors":["Julien Denize","Mykola Liashuha","Jaonary Rabarisoa","Astrid Orcesi","Romain Hérault"],"pdf_url":"https://arxiv.org/pdf/2309.01270v2.pdf","comment":"Source code is available here:\n https://github.com/juliendenize/eztorch"},{"id":"http://arxiv.org/abs/2304.01091v2","updated":"2023-10-26T09:37:16Z","published":"2023-04-03T15:51:42Z","title":"Changes to Captions: An Attentive Network for Remote Sensing Change\n Captioning","summary":" In recent years, advanced research has focused on the direct learning and\nanalysis of remote sensing images using natural language processing (NLP)\ntechniques. The ability to accurately describe changes occurring in\nmulti-temporal remote sensing images is becoming increasingly important for\ngeospatial understanding and land planning. Unlike natural image change\ncaptioning tasks, remote sensing change captioning aims to capture the most\nsignificant changes, irrespective of various influential factors such as\nillumination, seasonal effects, and complex land covers. In this study, we\nhighlight the significance of accurately describing changes in remote sensing\nimages and present a comparison of the change captioning task for natural and\nsynthetic images and remote sensing images. To address the challenge of\ngenerating accurate captions, we propose an attentive changes-to-captions\nnetwork, called Chg2Cap for short, for bi-temporal remote sensing images. 
The\nnetwork comprises three main components: 1) a Siamese CNN-based feature\nextractor to collect high-level representations for each image pair; 2) an\nattentive decoder that includes a hierarchical self-attention block to locate\nchange-related features and a residual block to generate the image embedding;\nand 3) a transformer-based caption generator to decode the relationship between\nthe image embedding and the word embedding into a description. The proposed\nChg2Cap network is evaluated on two representative remote sensing datasets, and\na comprehensive experimental analysis is provided. The code and pre-trained\nmodels will be available online at https://github.com/ShizhenChang/Chg2Cap.\n","authors":["Shizhen Chang","Pedram Ghamisi"],"pdf_url":"https://arxiv.org/pdf/2304.01091v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.01254v2","updated":"2023-10-26T09:33:29Z","published":"2022-08-02T05:25:35Z","title":"A Robust Morphological Approach for Semantic Segmentation of Very High\n Resolution Images","summary":" State-of-the-art methods for semantic segmentation of images involve\ncomputationally intensive neural network architectures. Most of these methods\nare not adaptable to high-resolution image segmentation due to memory and other\ncomputational issues. Typical approaches in literature involve design of neural\nnetwork architectures that can fuse global information from low-resolution\nimages and local information from the high-resolution counterparts. However,\narchitectures designed for processing high resolution images are unnecessarily\ncomplex and involve a lot of hyper parameters that can be difficult to tune.\nAlso, most of these architectures require ground truth annotations of the high\nresolution images to train, which can be hard to obtain. 
In this article, we\ndevelop a robust pipeline based on mathematical morphological (MM) operators\nthat can seamlessly extend any existing semantic segmentation algorithm to high\nresolution images. Our method does not require the ground truth annotations of\nthe high resolution images. It is based on efficiently utilizing information\nfrom the low-resolution counterparts, and gradient information on the\nhigh-resolution images. We obtain high quality seeds from the inferred labels\non low-resolution images using traditional morphological operators and\npropagate seed labels using a random walker to refine the semantic labels at\nthe boundaries. We show that the semantic segmentation results obtained by our\nmethod beat the existing state-of-the-art algorithms on high-resolution images.\nWe empirically prove the robustness of our approach to the hyper parameters\nused in our pipeline. Further, we characterize some necessary conditions under\nwhich our pipeline is applicable and provide an in-depth analysis of the\nproposed approach.\n","authors":["Siddharth Saravanan","Aditya Challa","Sravan Danda"],"pdf_url":"https://arxiv.org/pdf/2208.01254v2.pdf","comment":"Under review at Computer Vision and Image Understanding"},{"id":"http://arxiv.org/abs/2310.14159v2","updated":"2023-10-26T09:27:13Z","published":"2023-10-22T03:01:38Z","title":"Can Language Models Laugh at YouTube Short-form Videos?","summary":" As short-form funny videos on social networks are gaining popularity, it\nbecomes demanding for AI models to understand them for better communication\nwith humans. Unfortunately, previous video humor datasets target specific\ndomains, such as speeches or sitcoms, and mostly focus on verbal cues. We\ncurate a user-generated dataset of 10K multimodal funny videos from YouTube,\ncalled ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both\nverbal and visual elements contributing to humor. 
After filtering, we annotate\neach video with timestamps and text explanations for funny moments. Our\nExFunTube is unique over existing datasets in that our videos cover a wide\nrange of domains with various types of humor that necessitate a multimodal\nunderstanding of the content. Also, we develop a zero-shot video-to-text\nprompting to maximize video humor understanding of large language models\n(LLMs). With three different evaluation methods using automatic scores,\nrationale quality experiments, and human evaluations, we show that our\nprompting significantly improves LLMs' ability for humor explanation.\n","authors":["Dayoon Ko","Sangho Lee","Gunhee Kim"],"pdf_url":"https://arxiv.org/pdf/2310.14159v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17261v1","updated":"2023-10-26T09:25:09Z","published":"2023-10-26T09:25:09Z","title":"Attribute Based Interpretable Evaluation Metrics for Generative Models","summary":" When the training dataset comprises a 1:1 proportion of dogs to cats, a\ngenerative model that produces 1:1 dogs and cats better resembles the training\nspecies distribution than another model with 3:1 dogs and cats. Can we capture\nthis phenomenon using existing metrics? Unfortunately, we cannot, because these\nmetrics do not provide any interpretability beyond \"diversity\". In this\ncontext, we propose a new evaluation protocol that measures the divergence of a\nset of generated images from the training set regarding the distribution of\nattribute strengths as follows. Single-attribute Divergence (SaD) measures the\ndivergence regarding PDFs of a single attribute. Paired-attribute Divergence\n(PaD) measures the divergence regarding joint PDFs of a pair of attributes.\nThey provide which attributes the models struggle. For measuring the attribute\nstrengths of an image, we propose Heterogeneous CLIPScore (HCS) which measures\nthe cosine similarity between image and text vectors with heterogeneous initial\npoints. 
With SaD and PaD, we reveal the following about existing generative\nmodels. ProjectedGAN generates implausible attribute relationships such as a\nbaby with a beard even though it has competitive scores of existing metrics.\nDiffusion models struggle to capture diverse colors in the datasets. The larger\nsampling timesteps of latent diffusion model generate the more minor objects\nincluding earrings and necklaces. Stable Diffusion v1.5 better captures the\nattributes than v2.1. Our metrics lay a foundation for explainable evaluations\nof generative models.\n","authors":["Dongkyun Kim","Mingi Kwon","Youngjung Uh"],"pdf_url":"https://arxiv.org/pdf/2310.17261v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08645v2","updated":"2023-10-26T09:16:25Z","published":"2023-06-14T17:23:07Z","title":"Training-free Diffusion Model Adaptation for Variable-Sized\n Text-to-Image Synthesis","summary":" Diffusion models (DMs) have recently gained attention with state-of-the-art\nperformance in text-to-image synthesis. Abiding by the tradition in deep\nlearning, DMs are trained and evaluated on the images with fixed sizes.\nHowever, users are demanding for various images with specific sizes and various\naspect ratio. This paper focuses on adapting text-to-image diffusion models to\nhandle such variety while maintaining visual fidelity. First we observe that,\nduring the synthesis, lower resolution images suffer from incomplete object\nportrayal, while higher resolution images exhibit repetitively disordered\npresentation. Next, we establish a statistical relationship indicating that\nattention entropy changes with token quantity, suggesting that models aggregate\nspatial information in proportion to image resolution. The subsequent\ninterpretation on our observations is that objects are incompletely depicted\ndue to limited spatial information for low resolutions, while repetitively\ndisorganized presentation arises from redundant spatial information for high\nresolutions. 
From this perspective, we propose a scaling factor to alleviate\nthe change of attention entropy and mitigate the defective pattern observed.\nExtensive experimental results validate the efficacy of the proposed scaling\nfactor, enabling models to achieve better visual effects, image quality, and\ntext alignment. Notably, these improvements are achieved without additional\ntraining or fine-tuning techniques.\n","authors":["Zhiyu Jin","Xuli Shen","Bin Li","Xiangyang Xue"],"pdf_url":"https://arxiv.org/pdf/2306.08645v2.pdf","comment":"Accepted by NeurIPS 2023. 23 pages, 13 figures"},{"id":"http://arxiv.org/abs/2310.17255v1","updated":"2023-10-26T09:11:55Z","published":"2023-10-26T09:11:55Z","title":"Generalizing to Unseen Domains in Diabetic Retinopathy Classification","summary":" Diabetic retinopathy (DR) is caused by long-standing diabetes and is among\nthe fifth leading cause for visual impairments. The process of early diagnosis\nand treatments could be helpful in curing the disease; however, the detection\nprocedure is rather challenging and mostly tedious. Therefore, automated\ndiabetic retinopathy classification using deep learning techniques has gained\ninterest in the medical imaging community. Akin to several other real-world\napplications of deep learning, the typical assumption of i.i.d. data is also\nviolated in DR classification that relies on deep learning. Therefore,\ndeveloping DR classification methods robust to unseen distributions is of great\nvalue. In this paper, we study the problem of generalizing a model to unseen\ndistributions or domains (a.k.a. domain generalization) in DR classification. To\nthis end, we propose a simple and effective domain generalization (DG) approach\nthat achieves self-distillation in vision transformers (ViT) via a novel\nprediction softening mechanism. This prediction softening is an adaptive convex\ncombination of one-hot labels with the model's own knowledge. 
We perform extensive\nexperiments on challenging open-source DR classification datasets under both\nmulti-source and single-source DG settings with three different ViT backbones\nto establish the efficacy and applicability of our approach against competing\nmethods. For the first time, we report the performance of several\nstate-of-the-art DG methods on open-source DR classification datasets after\nconducting thorough experiments. Finally, our method is also capable of\ndelivering improved calibration performance than other methods, showing its\nsuitability for safety-critical applications, including healthcare. We hope\nthat our contributions would investigate more DG research across the medical\nimaging community.\n","authors":["Chamuditha Jayanga Galappaththige","Gayal Kuruppu","Muhammad Haris Khan"],"pdf_url":"https://arxiv.org/pdf/2310.17255v1.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2305.04247v3","updated":"2023-10-26T09:05:11Z","published":"2023-05-07T11:18:39Z","title":"Estimation of control area in badminton doubles with pose information\n from top and back view drone videos","summary":" The application of visual tracking to the performance analysis of sports\nplayers in dynamic competitions is vital for effective coaching. In doubles\nmatches, coordinated positioning is crucial for maintaining control of the\ncourt and minimizing opponents' scoring opportunities. The analysis of such\nteamwork plays a vital role in understanding the dynamics of the game. However,\nprevious studies have primarily focused on analyzing and assessing singles\nplayers without considering occlusion in broadcast videos. These studies have\nrelied on discrete representations, which involve the analysis and\nrepresentation of specific actions (e.g., strokes) or events that occur during\nthe game while overlooking the meaningful spatial distribution. 
In this work,\nwe present the first annotated drone dataset from top and back views in\nbadminton doubles and propose a framework to estimate the control area\nprobability map, which can be used to evaluate teamwork performance. We present\nan efficient framework of deep neural networks that enables the calculation of\nfull probability surfaces. This framework utilizes the embedding of a Gaussian\nmixture map of players' positions and employs graph convolution on their poses.\nIn the experiment, we verify our approach by comparing various baselines and\ndiscovering the correlations between the score and control area. Additionally,\nwe propose a practical application for assessing optimal positioning to provide\ninstructions during a game. Our approach offers both visual and quantitative\nevaluations of players' movements, thereby providing valuable insights into\ndoubles teamwork. The dataset and related project code is available at\nhttps://github.com/Ning-D/Drone_BD_ControlArea\n","authors":["Ning Ding","Kazuya Takeda","Wenhui Jin","Yingjiu Bei","Keisuke Fujii"],"pdf_url":"https://arxiv.org/pdf/2305.04247v3.pdf","comment":"15 pages, 10 figures, to appear in Multimedia Tools and Applications"},{"id":"http://arxiv.org/abs/2303.07963v2","updated":"2023-10-26T08:55:20Z","published":"2023-03-14T15:07:51Z","title":"RoCNet: 3D Robust Registration of Point-Clouds using Deep Learning","summary":" This paper introduces a new method for 3D point cloud registration based on\ndeep learning. The architecture is composed of three distinct blocs: (i) an\nencoder composed of a convolutional graph-based descriptor that encodes the\nimmediate neighbourhood of each point and an attention mechanism that encodes\nthe variations of the surface normals. Such descriptors are refined by\nhighlighting attention between the points of the same set and then between the\npoints of the two sets. (ii) a matching process that estimates a matrix of\ncorrespondences using the Sinkhorn algorithm. 
(iii) Finally, the rigid\ntransformation between the two point clouds is calculated by RANSAC using the\nKc best scores from the correspondence matrix. We conduct experiments on the\nModelNet40 dataset, and our proposed architecture shows very promising results,\noutperforming state-of-the-art methods in most of the simulated configurations,\nincluding partial overlap and data augmentation with Gaussian noise.\n","authors":["Karim Slimani","Brahim Tamadazte","Catherine Achard"],"pdf_url":"https://arxiv.org/pdf/2303.07963v2.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2310.17218v1","updated":"2023-10-26T08:12:53Z","published":"2023-10-26T08:12:53Z","title":"Prototypical Contrastive Learning-based CLIP Fine-tuning for Object\n Re-identification","summary":" This work aims to adapt large-scale pre-trained vision-language models, such\nas contrastive language-image pretraining (CLIP), to enhance the performance of\nobject reidentification (Re-ID) across various supervision settings. Although\nprompt learning has enabled a recent work named CLIP-ReID to achieve promising\nperformance, the underlying mechanisms and the necessity of prompt learning\nremain unclear due to the absence of semantic labels in ReID tasks. In this\nwork, we first analyze the role prompt learning in CLIP-ReID and identify its\nlimitations. Based on our investigations, we propose a simple yet effective\napproach to adapt CLIP for supervised object Re-ID. Our approach directly\nfine-tunes the image encoder of CLIP using a prototypical contrastive learning\n(PCL) loss, eliminating the need for prompt learning. Experimental results on\nboth person and vehicle Re-ID datasets demonstrate the competitiveness of our\nmethod compared to CLIP-ReID. 
Furthermore, we extend our PCL-based CLIP\nfine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art\nperformance.\n","authors":["Jiachen Li","Xiaojin Gong"],"pdf_url":"https://arxiv.org/pdf/2310.17218v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.09412v4","updated":"2023-10-26T08:08:43Z","published":"2023-03-16T15:44:31Z","title":"NeRFtrinsic Four: An End-To-End Trainable NeRF Jointly Optimizing\n Diverse Intrinsic and Extrinsic Camera Parameters","summary":" Novel view synthesis using neural radiance fields (NeRF) is the\nstate-of-the-art technique for generating high-quality images from novel\nviewpoints. Existing methods require a priori knowledge about extrinsic and\nintrinsic camera parameters. This limits their applicability to synthetic\nscenes, or real-world scenarios with the necessity of a preprocessing step.\nCurrent research on the joint optimization of camera parameters and NeRF\nfocuses on refining noisy extrinsic camera parameters and often relies on the\npreprocessing of intrinsic camera parameters. Other approaches are limited to\ncovering only a single camera intrinsic. To address these limitations, we\npropose a novel end-to-end trainable approach called NeRFtrinsic Four. We\nutilize Gaussian Fourier features to estimate extrinsic camera parameters and\ndynamically predict varying intrinsic camera parameters through the supervision\nof the projection error. Our approach outperforms existing joint optimization\nmethods on LLFF and BLEFF. In addition to these existing datasets, we introduce\na new dataset called iFF with varying intrinsic camera parameters. 
NeRFtrinsic\nFour is a step forward in joint optimization NeRF-based view synthesis and\nenables more realistic and flexible rendering in real-world scenarios with\nvarying camera parameters.\n","authors":["Hannah Schieber","Fabian Deuser","Bernhard Egger","Norbert Oswald","Daniel Roth"],"pdf_url":"https://arxiv.org/pdf/2303.09412v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17216v1","updated":"2023-10-26T08:08:17Z","published":"2023-10-26T08:08:17Z","title":"Three-dimensional Bone Image Synthesis with Generative Adversarial\n Networks","summary":" Medical image processing has been highlighted as an area where deep\nlearning-based models have the greatest potential. However, in the medical\nfield in particular, problems of data availability and privacy are hampering\nresearch progress and thus rapid implementation in clinical routine. The\ngeneration of synthetic data not only ensures privacy, but also allows to\n\\textit{draw} new patients with specific characteristics, enabling the\ndevelopment of data-driven models on a much larger scale. This work\ndemonstrates that three-dimensional generative adversarial networks (GANs) can\nbe efficiently trained to generate high-resolution medical volumes with finely\ndetailed voxel-based architectures. In addition, GAN inversion is successfully\nimplemented for the three-dimensional setting and used for extensive research\non model interpretability and applications such as image morphing, attribute\nediting and style mixing. 
The results are comprehensively validated on a\ndatabase of three-dimensional HR-pQCT instances representing the bone\nmicro-architecture of the distal radius.\n","authors":["Christoph Angermann","Johannes Bereiter-Payr","Kerstin Stock","Markus Haltmeier","Gerald Degenhart"],"pdf_url":"https://arxiv.org/pdf/2310.17216v1.pdf","comment":"Submitted to the journal Artificial Intelligence in Medicine"},{"id":"http://arxiv.org/abs/2106.07368v2","updated":"2023-10-26T08:04:17Z","published":"2021-06-14T12:40:46Z","title":"Quality-Aware Network for Face Parsing","summary":" This is a very short technical report, which introduces the solution of the\nTeam BUPT-CASIA for Short-video Face Parsing Track of The 3rd Person in Context\n(PIC) Workshop and Challenge at CVPR 2021.\n Face parsing has recently attracted increasing interest due to its numerous\napplication potentials. Generally speaking, it has a lot in common with human\nparsing, such as task setting, data characteristics, number of categories and\nso on. Therefore, this work applies state-of-the-art human parsing method to\nface parsing task to explore the similarities and differences between them. Our\nsubmission achieves 86.84% score and wins the 2nd place in the challenge.\n","authors":["Lu Yang","Qing Song","Xueshi Xin","Wenhe Jia","Zhiwei Liu"],"pdf_url":"https://arxiv.org/pdf/2106.07368v2.pdf","comment":"2nd place in Short-video Face Parsing Track of The 3rd Person in\n Context (PIC) Workshop and Challenge at CVPR 2021"},{"id":"http://arxiv.org/abs/2310.17212v1","updated":"2023-10-26T07:56:17Z","published":"2023-10-26T07:56:17Z","title":"Emotion Recognition by Video: A review","summary":" Video emotion recognition is an important branch of affective computing, and\nits solutions can be applied in different fields such as human-computer\ninteraction (HCI) and intelligent medical treatment. 
Although the number of\npapers published in the field of emotion recognition is increasing, there are\nfew comprehensive literature reviews covering related research on video emotion\nrecognition. Therefore, this paper selects articles published from 2015 to 2023\nto systematize the existing trends in video emotion recognition in related\nstudies. In this paper, we first discuss two typical emotion models, then we\ndiscuss databases that are frequently utilized for video emotion\nrecognition, including unimodal databases and multimodal databases. Next, we\nexamine and classify the specific structure and performance of modern unimodal\nand multimodal video emotion recognition methods, discuss the benefits and\ndrawbacks of each, and then compare them in detail in the tables. Further,\nwe summarize the primary difficulties currently faced by video emotion\nrecognition tasks and point out some of the most promising future\ndirections, such as establishing an open benchmark database and better multimodal\nfusion strategies. The essential objective of this paper is to help academic\nand industrial researchers keep up to date with the most recent advances and\nnew developments in this fast-moving, high-impact field of video emotion\nrecognition.\n","authors":["Junxiao Xue","Jie Wang","Xuecheng Wu","Liangyu Fu"],"pdf_url":"https://arxiv.org/pdf/2310.17212v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17209v1","updated":"2023-10-26T07:54:47Z","published":"2023-10-26T07:54:47Z","title":"Weakly-Supervised Surgical Phase Recognition","summary":" A key element of computer-assisted surgery systems is phase recognition of\nsurgical videos. Existing phase recognition algorithms require frame-wise\nannotation of a large number of videos, which is time-consuming and costly. In\nthis work we join concepts of graph segmentation with self-supervised learning\nto derive a random-walk solution for per-frame phase prediction. 
Furthermore,\nwe utilize within our method two forms of weak supervision: sparse timestamps\nor few-shot learning. The proposed algorithm enjoys low complexity and can\noperate in low-data regimes. We validate our method by running experiments with\nthe public Cholec80 dataset of laparoscopic cholecystectomy videos,\ndemonstrating promising performance in multiple setups.\n","authors":["Roy Hirsch","Regev Cohen","Mathilde Caron","Tomer Golany","Daniel Freedman","Ehud Rivlin"],"pdf_url":"https://arxiv.org/pdf/2310.17209v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13268v2","updated":"2023-10-26T07:51:47Z","published":"2023-10-20T04:23:12Z","title":"DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model\n Statistics","summary":" Diffusion probabilistic models (DPMs) have exhibited excellent performance\nfor high-fidelity image generation while suffering from inefficient sampling.\nRecent works accelerate the sampling procedure by proposing fast ODE solvers\nthat leverage the specific ODE form of DPMs. However, they rely heavily on\nspecific parameterization during inference (such as noise/data prediction),\nwhich might not be the optimal choice. In this work, we propose a novel\nformulation towards the optimal parameterization during sampling that minimizes\nthe first-order discretization error of the ODE solution. Based on such\nformulation, we propose \\textit{DPM-Solver-v3}, a new fast ODE solver for DPMs\nby introducing several coefficients efficiently computed on the pretrained\nmodel, which we call \\textit{empirical model statistics}. We further\nincorporate multistep methods and a predictor-corrector framework, and propose\nsome techniques for improving sample quality at small numbers of function\nevaluations (NFE) or large guidance scales. 
Experiments show that DPM-Solver-v3\nachieves consistently better or comparable performance in both unconditional\nand conditional sampling with both pixel-space and latent-space DPMs,\nespecially in 5$\\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE)\non unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable\nDiffusion, bringing a speed-up of 15\\%$\\sim$30\\% compared to previous\nstate-of-the-art training-free methods. Code is available at\n\\url{https://github.com/thu-ml/DPM-Solver-v3}.\n","authors":["Kaiwen Zheng","Cheng Lu","Jianfei Chen","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.13268v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.08231v3","updated":"2023-10-26T07:41:12Z","published":"2023-08-16T09:06:32Z","title":"DDF-HO: Hand-Held Object Reconstruction via Conditional Directed\n Distance Field","summary":" Reconstructing hand-held objects from a single RGB image is an important and\nchallenging problem. Existing works utilizing Signed Distance Fields (SDF)\nreveal limitations in comprehensively capturing the complex hand-object\ninteractions, since SDF is only reliable within the proximity of the target,\nand hence, infeasible to simultaneously encode local hand and object cues. To\naddress this issue, we propose DDF-HO, a novel approach leveraging Directed\nDistance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in\n3D space, consisting of an origin and a direction, to corresponding DDF values,\nincluding a binary visibility signal determining whether the ray intersects the\nobjects and a distance value measuring the distance from origin to target in\nthe given direction. We randomly sample multiple rays and collect local to\nglobal geometric features for them by introducing a novel 2D ray-based feature\naggregation scheme and a 3D intersection-aware hand pose embedding, combining\n2D-3D features to model hand-object interactions. 
Extensive experiments on\nsynthetic and real-world datasets demonstrate that DDF-HO consistently\noutperforms all baseline methods by a large margin, especially under Chamfer\nDistance, with about 80% leap forward. Codes are available at\nhttps://github.com/ZhangCYG/DDFHO.\n","authors":["Chenyangguang Zhang","Yan Di","Ruida Zhang","Guangyao Zhai","Fabian Manhardt","Federico Tombari","Xiangyang Ji"],"pdf_url":"https://arxiv.org/pdf/2308.08231v3.pdf","comment":"Camera Ready for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.15105v2","updated":"2023-10-26T07:32:44Z","published":"2023-10-23T17:12:01Z","title":"FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained\n Models in Few-Shot Learning","summary":" Due to the limited availability of data, existing few-shot learning methods\ntrained from scratch fail to achieve satisfactory performance. In contrast,\nlarge-scale pre-trained models such as CLIP demonstrate remarkable few-shot and\nzero-shot capabilities. To enhance the performance of pre-trained models for\ndownstream tasks, fine-tuning the model on downstream data is frequently\nnecessary. However, fine-tuning the pre-trained model leads to a decrease in\nits generalizability in the presence of distribution shift, while the limited\nnumber of samples in few-shot learning makes the model highly susceptible to\noverfitting. Consequently, existing methods for fine-tuning few-shot learning\nprimarily focus on fine-tuning the model's classification head or introducing\nadditional structure. In this paper, we introduce a fine-tuning approach termed\nFeature Discrimination Alignment (FD-Align). Our method aims to bolster the\nmodel's generalizability by preserving the consistency of spurious features\nacross the fine-tuning process. Extensive experimental results validate the\nefficacy of our approach for both ID and OOD tasks. Once fine-tuned, the model\ncan seamlessly integrate with existing methods, leading to performance\nimprovements. 
Our code can be found in https://github.com/skingorz/FD-Align.\n","authors":["Kun Song","Huimin Ma","Bochao Zou","Huishuai Zhang","Weiran Huang"],"pdf_url":"https://arxiv.org/pdf/2310.15105v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17190v1","updated":"2023-10-26T07:05:38Z","published":"2023-10-26T07:05:38Z","title":"Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction\n Network for Tone Mapping","summary":" Tone mapping aims to convert high dynamic range (HDR) images to low dynamic\nrange (LDR) representations, a critical task in the camera imaging pipeline. In\nrecent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained\nattention due to their ability to strike a favorable balance between\nenhancement performance and computational efficiency. However, these methods\noften fail to deliver satisfactory results in local areas since the look-up\ntable is a global operator for tone mapping, which works based on pixel values\nand fails to incorporate crucial local information. To this end, this paper\naims to address this issue by exploring a novel strategy that integrates global\nand local operators by utilizing closed-form Laplacian pyramid decomposition\nand reconstruction. Specifically, we employ image-adaptive 3D LUTs to\nmanipulate the tone in the low-frequency image by leveraging the specific\ncharacteristics of the frequency information. Furthermore, we utilize local\nLaplacian filters to refine the edge details in the high-frequency components\nin an adaptive manner. Local Laplacian filters are widely used to preserve edge\ndetails in photographs, but their conventional usage involves manual tuning and\nfixed implementation within camera imaging pipelines or photo editing tools. We\npropose to learn parameter value maps progressively for local Laplacian filters\nfrom annotated data using a lightweight network. 
Our model achieves\nsimultaneous global tone manipulation and local edge detail preservation in an\nend-to-end manner. Extensive experimental results on two benchmark datasets\ndemonstrate that the proposed method performs favorably against\nstate-of-the-art methods.\n","authors":["Feng Zhang","Ming Tian","Zhiqiang Li","Bin Xu","Qingbo Lu","Changxin Gao","Nong Sang"],"pdf_url":"https://arxiv.org/pdf/2310.17190v1.pdf","comment":"12 pages, 6 figures, accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.04826v2","updated":"2023-10-26T07:05:19Z","published":"2023-08-09T09:24:56Z","title":"WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields","summary":" Neural Radiance Field (NeRF) has shown impressive performance in novel view\nsynthesis via implicit scene representation. However, it usually suffers from\npoor scalability as requiring densely sampled images for each new scene.\nSeveral studies have attempted to mitigate this problem by integrating\nMulti-View Stereo (MVS) technique into NeRF while they still entail a\ncumbersome fine-tuning process for new scenes. Notably, the rendering quality\nwill drop severely without this fine-tuning process and the errors mainly\nappear around the high-frequency features. In the light of this observation, we\ndesign WaveNeRF, which integrates wavelet frequency decomposition into MVS and\nNeRF to achieve generalizable yet high-quality synthesis without any per-scene\noptimization. To preserve high-frequency information when generating 3D feature\nvolumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating\nthe discrete wavelet transform into the classical cascade MVS, which\ndisentangles high-frequency information explicitly. 
With that, disentangled\nfrequency features can be injected into classic NeRF via a novel hybrid neural\nrenderer to yield faithful high-frequency details, and an intuitive\nfrequency-guided sampling strategy can be designed to suppress artifacts around\nhigh-frequency regions. Extensive experiments over three widely studied\nbenchmarks show that WaveNeRF achieves superior generalizable radiance field\nmodeling when only given three images as input.\n","authors":["Muyu Xu","Fangneng Zhan","Jiahui Zhang","Yingchen Yu","Xiaoqin Zhang","Christian Theobalt","Ling Shao","Shijian Lu"],"pdf_url":"https://arxiv.org/pdf/2308.04826v2.pdf","comment":"Accepted to ICCV 2023. Project website:\n https://mxuai.github.io/WaveNeRF/"},{"id":"http://arxiv.org/abs/2310.17189v1","updated":"2023-10-26T07:04:44Z","published":"2023-10-26T07:04:44Z","title":"Exploring Iterative Refinement with Diffusion Models for Video Grounding","summary":" Video grounding aims to localize the target moment in an untrimmed video\ncorresponding to a given sentence query. Existing methods typically select the\nbest prediction from a set of predefined proposals or directly regress the\ntarget span in a single-shot manner, resulting in the absence of a systematic\nprediction refinement process. In this paper, we propose DiffusionVG, a novel\nframework with diffusion models that formulates video grounding as a\nconditional generation task, where the target span is generated from Gaussian\nnoise inputs and iteratively refined in the reverse diffusion process. During\ntraining, DiffusionVG progressively adds noise to the target span with a fixed\nforward diffusion process and learns to recover the target span in the reverse\ndiffusion process. In inference, DiffusionVG can generate the target span from\nGaussian noise inputs by the learned reverse diffusion process conditioned on\nthe video-sentence representations. 
Our DiffusionVG follows the encoder-decoder\narchitecture, which firstly encodes the video-sentence features and iteratively\ndenoises the predicted spans in its specialized span refining decoder. Without\nbells and whistles, our DiffusionVG demonstrates competitive or even superior\nperformance compared to existing well-crafted models on mainstream Charades-STA\nand ActivityNet Captions benchmarks.\n","authors":["Xiao Liang","Tao Shi","Yaoyuan Liang","Te Tao","Shao-Lun Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17189v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17188v1","updated":"2023-10-26T07:00:18Z","published":"2023-10-26T07:00:18Z","title":"Blind Image Super-resolution with Rich Texture-Aware Codebooks","summary":" Blind super-resolution (BSR) methods based on high-resolution (HR)\nreconstruction codebooks have achieved promising results in recent years.\nHowever, we find that a codebook based on HR reconstruction may not effectively\ncapture the complex correlations between low-resolution (LR) and HR images. In\ndetail, multiple HR images may produce similar LR versions due to complex blind\ndegradations, causing the HR-dependent only codebooks having limited texture\ndiversity when faced with confusing LR inputs. To alleviate this problem, we\npropose the Rich Texture-aware Codebook-based Network (RTCNet), which consists\nof the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware\nTexture Prior Module (PTPM). DTPM effectively mines the cross-resolution\ncorrelation of textures between LR and HR images by exploiting the\ncross-resolution correspondence of textures. PTPM uses patch-wise semantic\npre-training to correct the misperception of texture similarity in the\nhigh-level semantic regularization. By taking advantage of this, RTCNet\neffectively gets rid of the misalignment of confusing textures between HR and\nLR in the BSR scenarios. 
Experiments show that RTCNet outperforms\nstate-of-the-art methods on various benchmarks by up to 0.16 ~ 0.46dB.\n","authors":["Rui Qin","Ming Sun","Fangyuan Zhang","Xing Wen","Bin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17188v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16818v2","updated":"2023-10-26T06:54:22Z","published":"2023-10-25T17:50:10Z","title":"DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion\n Prior","summary":" We present DreamCraft3D, a hierarchical 3D content generation method that\nproduces high-fidelity and coherent 3D objects. We tackle the problem by\nleveraging a 2D reference image to guide the stages of geometry sculpting and\ntexture boosting. A central focus of this work is to address the consistency\nissue that existing works encounter. To sculpt geometries that render\ncoherently, we perform score distillation sampling via a view-dependent\ndiffusion model. This 3D prior, alongside several training strategies,\nprioritizes the geometry consistency but compromises the texture fidelity. We\nfurther propose Bootstrapped Score Distillation to specifically boost the\ntexture. We train a personalized diffusion model, Dreambooth, on the augmented\nrenderings of the scene, imbuing it with 3D knowledge of the scene being\noptimized. The score distillation from this 3D-aware diffusion prior provides\nview-consistent guidance for the scene. Notably, through an alternating\noptimization of the diffusion prior and 3D scene representation, we achieve\nmutually reinforcing improvements: the optimized 3D scene aids in training the\nscene-specific diffusion model, which offers increasingly view-consistent\nguidance for 3D optimization. The optimization is thus bootstrapped and leads\nto substantial texture boosting. With tailored 3D priors throughout the\nhierarchical generation, DreamCraft3D generates coherent 3D objects with\nphotorealistic renderings, advancing the state-of-the-art in 3D content\ngeneration. 
Code available at https://github.com/deepseek-ai/DreamCraft3D.\n","authors":["Jingxiang Sun","Bo Zhang","Ruizhi Shao","Lizhen Wang","Wen Liu","Zhenda Xie","Yebin Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16818v2.pdf","comment":"Project Page: https://mrtornado24.github.io/DreamCraft3D/"},{"id":"http://arxiv.org/abs/2309.14660v2","updated":"2023-10-26T06:48:29Z","published":"2023-09-26T04:32:38Z","title":"CoFiI2P: Coarse-to-Fine Correspondences for Image-to-Point Cloud\n Registration","summary":" Image-to-point cloud (I2P) registration is a fundamental task in the field of\nautonomous vehicles and transportation systems for cross-modality data fusion\nand localization. Existing I2P registration methods estimate correspondences at\nthe point/pixel level, often overlooking global alignment. However, I2P\nmatching can easily converge to a local optimum when performed without\nhigh-level guidance from global constraints. To address this issue, this paper\nintroduces CoFiI2P, a novel I2P registration network that extracts\ncorrespondences in a coarse-to-fine manner to achieve the globally optimal\nsolution. First, the image and point cloud data are processed through a Siamese\nencoder-decoder network for hierarchical feature extraction. Second, a\ncoarse-to-fine matching module is designed to leverage these features and\nestablish robust feature correspondences. Specifically, In the coarse matching\nphase, a novel I2P transformer module is employed to capture both homogeneous\nand heterogeneous global information from the image and point cloud data. This\nenables the estimation of coarse super-point/super-pixel matching pairs with\ndiscriminative descriptors. In the fine matching module, point/pixel pairs are\nestablished with the guidance of super-point/super-pixel correspondences.\nFinally, based on matching pairs, the transform matrix is estimated with the\nEPnP-RANSAC algorithm. 
Extensive experiments conducted on the KITTI dataset\ndemonstrate that CoFiI2P achieves impressive results, with a relative rotation\nerror (RRE) of 1.14 degrees and a relative translation error (RTE) of 0.29\nmeters. These results represent a significant improvement of 84\\% in RRE and\n89\\% in RTE compared to the current state-of-the-art (SOTA) method. Qualitative\nresults are available at https://youtu.be/ovbedasXuZE. The source code will be\npublicly released at https://github.com/kang-1-2-3/CoFiI2P.\n","authors":["Shuhao Kang","Youqi Liao","Jianping Li","Fuxun Liang","Yuhao Li","Fangning Li","Zhen Dong","Bisheng Yang"],"pdf_url":"https://arxiv.org/pdf/2309.14660v2.pdf","comment":"demo video: https://youtu.be/ovbedasXuZE; source code (will be\n public): https://github.com/kang-1-2-3/CoFiI2P"},{"id":"http://arxiv.org/abs/2310.15712v2","updated":"2023-10-26T06:40:23Z","published":"2023-10-24T10:40:51Z","title":"GNeSF: Generalizable Neural Semantic Fields","summary":" 3D scene segmentation based on neural implicit representation has emerged\nrecently with the advantage of training only on 2D supervision. However,\nexisting approaches still requires expensive per-scene optimization that\nprohibits generalization to novel scenes during inference. To circumvent this\nproblem, we introduce a generalizable 3D segmentation framework based on\nimplicit representation. Specifically, our framework takes in multi-view image\nfeatures and semantic maps as the inputs instead of only spatial information to\navoid overfitting to scene-specific geometric and semantic information. We\npropose a novel soft voting mechanism to aggregate the 2D semantic information\nfrom different views for each 3D point. In addition to the image features, view\ndifference information is also encoded in our framework to predict the voting\nscores. Intuitively, this allows the semantic information from nearby views to\ncontribute more compared to distant ones. 
Furthermore, a visibility module is\nalso designed to detect and filter out detrimental information from occluded\nviews. Due to the generalizability of our proposed method, we can synthesize\nsemantic maps or conduct 3D semantic segmentation for novel scenes with solely\n2D semantic supervision. Experimental results show that our approach achieves\ncomparable performance with scene-specific approaches. More importantly, our\napproach can even outperform existing strong supervision-based approaches with\nonly 2D annotations. Our source code is available at:\nhttps://github.com/HLinChen/GNeSF.\n","authors":["Hanlin Chen","Chen Li","Mengqi Guo","Zhiwen Yan","Gim Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2310.15712v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17183v1","updated":"2023-10-26T06:30:39Z","published":"2023-10-26T06:30:39Z","title":"Understanding the Effects of Projectors in Knowledge Distillation","summary":" Conventionally, during the knowledge distillation process (e.g. feature\ndistillation), an additional projector is often required to perform feature\ntransformation due to the dimension mismatch between the teacher and the\nstudent networks. Interestingly, we discovered that even if the student and the\nteacher have the same feature dimensions, adding a projector still helps to\nimprove the distillation performance. In addition, projectors even improve\nlogit distillation if we add them to the architecture too. Inspired by these\nsurprising findings and the general lack of understanding of the projectors in\nthe knowledge distillation process from existing literature, this paper\ninvestigates the implicit role that projectors play but so far have been\noverlooked. 
Our empirical study shows that the student with a projector (1)\nobtains a better trade-off between the training accuracy and the testing\naccuracy compared to the student without a projector when it has the same\nfeature dimensions as the teacher, (2) better preserves its similarity to the\nteacher beyond shallow and numeric resemblance, from the view of Centered\nKernel Alignment (CKA), and (3) avoids being over-confident as the teacher does\nat the testing phase. Motivated by the positive effects of projectors, we\npropose a projector ensemble-based feature distillation method to further\nimprove distillation performance. Despite the simplicity of the proposed\nstrategy, empirical results from the evaluation of classification tasks on\nbenchmark datasets demonstrate the superior classification performance of our\nmethod on a broad range of teacher-student pairs and verify from the aspects of\nCKA and model calibration that the student's features are of improved quality\nwith the projector ensemble design.\n","authors":["Yudong Chen","Sen Wang","Jiajun Liu","Xuwei Xu","Frank de Hoog","Brano Kusy","Zi Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17183v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2210.15274"},{"id":"http://arxiv.org/abs/2310.17177v1","updated":"2023-10-26T06:03:18Z","published":"2023-10-26T06:03:18Z","title":"Bridging The Gaps Between Token Pruning and Full Pre-training via Masked\n Fine-tuning","summary":" Despite the success of transformers on various computer vision tasks, they\nsuffer from excessive memory and computational cost. Some works present dynamic\nvision transformers to accelerate inference by pruning redundant tokens. A key\nto improving token pruning is using well-trained models as initialization for\nfaster convergence and better performance. 
However, current base models usually\nadopt full image training, i.e., using full images as inputs and keeping the\nwhole feature maps through the forward process, which causes inconsistencies\nwith dynamic models that gradually reduce tokens, including calculation\npattern, information amount and token selection strategy inconsistencies.\nInspired by MAE which performs masking and reconstruction self-supervised task,\nwe devise masked fine-tuning to bridge the gaps between pre-trained base models\nused for initialization and token pruning based dynamic vision transformers, by\nmasking image patches and predicting the image class label based on left\nunmasked patches. Extensive experiments on ImageNet demonstrate that base\nmodels via masked fine-tuning gain strong occlusion robustness and ability\nagainst information loss. With this better initialization, Dynamic ViT achieves\nhigher accuracies, especially under large token pruning ratios (e.g., 81.9% vs.\n81.3%, and 62.3% vs. 58.9% for DeiT based Dynamic ViT/0.8 and Dynamic ViT/0.3).\nMoreover, we apply our method into different token pruning based dynamic vision\ntransformers, different pre-trained models and randomly initialized models to\ndemonstrate the generalization ability.\n","authors":["Fengyuan Shi","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17177v1.pdf","comment":"Submitted to TIP"},{"id":"http://arxiv.org/abs/2310.17176v1","updated":"2023-10-26T06:01:25Z","published":"2023-10-26T06:01:25Z","title":"A Deep Learning Approach to Teeth Segmentation and Orientation from\n Panoramic X-rays","summary":" Accurate teeth segmentation and orientation are fundamental in modern oral\nhealthcare, enabling precise diagnosis, treatment planning, and dental implant\ndesign. In this study, we present a comprehensive approach to teeth\nsegmentation and orientation from panoramic X-ray images, leveraging deep\nlearning techniques. 
We build our model based on FUSegNet, a popular model\noriginally developed for wound segmentation, and introduce modifications by\nincorporating grid-based attention gates into the skip connections. We\nintroduce oriented bounding box (OBB) generation through principal component\nanalysis (PCA) for precise tooth orientation estimation. Evaluating our\napproach on the publicly available DNS dataset, comprising 543 panoramic X-ray\nimages, we achieve the highest Intersection-over-Union (IoU) score of 82.43%\nand Dice Similarity Coefficient (DSC) score of 90.37% among compared models in\nteeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU)\nscore of 82.82%. We also conduct detailed analyses of individual tooth labels\nand categorical performance, shedding light on strengths and weaknesses. The\nproposed model's accuracy and versatility offer promising prospects for\nimproving dental diagnoses, treatment planning, and personalized healthcare in\nthe oral domain. Our generated OBB coordinates and codes are available at\nhttps://github.com/mrinal054/Instance_teeth_segmentation.\n","authors":["Mrinal Kanti Dhar","Mou Deb","D. Madhab","Zeyun Yu"],"pdf_url":"https://arxiv.org/pdf/2310.17176v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17170v1","updated":"2023-10-26T05:49:44Z","published":"2023-10-26T05:49:44Z","title":"MO-YOLO: End-to-End Multiple-Object Tracking Method with YOLO and MOTR","summary":" This paper aims to address critical issues in the field of Multi-Object\nTracking (MOT) by proposing an efficient and computationally resource-efficient\nend-to-end multi-object tracking model, named MO-YOLO. Traditional MOT methods\ntypically involve two separate steps: object detection and object tracking,\nleading to computational complexity and error propagation issues. Recent\nresearch has demonstrated outstanding performance in end-to-end MOT models\nbased on Transformer architectures, but they require substantial hardware\nsupport. 
MO-YOLO combines the strengths of YOLO and RT-DETR models to construct\na high-efficiency, lightweight, and resource-efficient end-to-end multi-object\ntracking network, offering new opportunities in the multi-object tracking\ndomain. On the MOT17 dataset, MOTR\\cite{zeng2022motr} requires training with 8\nGeForce 2080 Ti GPUs for 4 days to achieve satisfactory results, while MO-YOLO\nonly requires 1 GeForce 2080 Ti GPU and 12 hours of training to achieve\ncomparable performance.\n","authors":["Liao Pan","Yang Feng","Wu Di","Liu Bo","Zhang Xingle"],"pdf_url":"https://arxiv.org/pdf/2310.17170v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17167v1","updated":"2023-10-26T05:43:07Z","published":"2023-10-26T05:43:07Z","title":"Improving Denoising Diffusion Models via Simultaneous Estimation of\n Image and Noise","summary":" This paper introduces two key contributions aimed at improving the speed and\nquality of images generated through inverse diffusion processes. The first\ncontribution involves reparameterizing the diffusion process in terms of the\nangle on a quarter-circular arc between the image and noise, specifically\nsetting the conventional $\\displaystyle \\sqrt{\\bar{\\alpha}}=\\cos(\\eta)$. This\nreparameterization eliminates two singularities and allows for the expression\nof diffusion evolution as a well-behaved ordinary differential equation (ODE).\nIn turn, this allows higher order ODE solvers such as Runge-Kutta methods to be\nused effectively. The second contribution is to directly estimate both the\nimage ($\\mathbf{x}_0$) and noise ($\\mathbf{\\epsilon}$) using our network, which\nenables more stable calculations of the update step in the inverse diffusion\nsteps, as accurate estimation of both the image and noise are crucial at\ndifferent stages of the process. 
Together with these changes, our model\nachieves faster generation, with the ability to converge on high-quality images\nmore quickly, and higher quality of the generated images, as measured by\nmetrics such as Frechet Inception Distance (FID), spatial Frechet Inception\nDistance (sFID), precision, and recall.\n","authors":["Zhenkai Zhang","Krista A. Ehinger","Tom Drummond"],"pdf_url":"https://arxiv.org/pdf/2310.17167v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.13959v2","updated":"2023-10-26T05:41:20Z","published":"2022-09-28T09:43:02Z","title":"Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual\n Grounding","summary":" Multimodal transformer exhibits high capacity and flexibility to align image\nand text for visual grounding. However, the existing encoder-only grounding\nframework (e.g., TransVG) suffers from heavy computation due to the\nself-attention operation with quadratic time complexity. To address this issue,\nwe present a new multimodal transformer architecture, coined as Dynamic\nMultimodal DETR (Dynamic MDETR), by decoupling the whole grounding process into\nencoding and decoding phases. The key observation is that there exists high\nspatial redundancy in images. Thus, we devise a new dynamic multimodal\ntransformer decoder by exploiting this sparsity prior to speed up the visual\ngrounding process. Specifically, our dynamic decoder is composed of a 2D\nadaptive sampling module and a text guided decoding module. The sampling module\naims to select these informative patches by predicting the offsets with respect\nto a reference point, while the decoding module works for extracting the\ngrounded object information by performing cross attention between image\nfeatures and text features. 
These two modules are stacked alternatively to\ngradually bridge the modality gap and iteratively refine the reference point of\ngrounded object, eventually realizing the objective of visual grounding.\nExtensive experiments on five benchmarks demonstrate that our proposed Dynamic\nMDETR achieves competitive trade-offs between computation and accuracy.\nNotably, using only 9% feature points in the decoder, we can reduce ~44% GFLOPs\nof the multimodal transformer, but still get higher accuracy than the\nencoder-only counterpart. In addition, to verify its generalization ability and\nscale up our Dynamic MDETR, we build the first one-stage CLIP empowered visual\ngrounding framework, and achieve the state-of-the-art performance on these\nbenchmarks.\n","authors":["Fengyuan Shi","Ruopeng Gao","Weilin Huang","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2209.13959v2.pdf","comment":"Accepted by IEEE Transactions on Pattern Analysis and Machine\n Intelligence (TPAMI) in October 2023"},{"id":"http://arxiv.org/abs/2012.05435v2","updated":"2023-10-26T05:39:49Z","published":"2020-12-10T03:24:53Z","title":"Optimization-Inspired Learning with Architecture Augmentations and\n Control Mechanisms for Low-Level Vision","summary":" In recent years, there has been a growing interest in combining learnable\nmodules with numerical optimization to solve low-level vision tasks. However,\nmost existing approaches focus on designing specialized schemes to generate\nimage/feature propagation. There is a lack of unified consideration to\nconstruct propagative modules, provide theoretical analysis tools, and design\neffective learning mechanisms. To mitigate the above issues, this paper\nproposes a unified optimization-inspired learning framework to aggregate\nGenerative, Discriminative, and Corrective (GDC for short) principles with\nstrong generalization for diverse optimization models. 
Specifically, by\nintroducing a general energy minimization model and formulating its descent\ndirection from different viewpoints (i.e., in a generative manner, based on the\ndiscriminative metric and with optimality-based correction), we construct three\npropagative modules to effectively solve the optimization models with flexible\ncombinations. We design two control mechanisms that provide the non-trivial\ntheoretical guarantees for both fully- and partially-defined optimization\nformulations. Under the support of theoretical guarantees, we can introduce\ndiverse architecture augmentation strategies such as normalization and search\nto ensure stable propagation with convergence and seamlessly integrate the\nsuitable modules into the propagation respectively. Extensive experiments\nacross varied low-level vision tasks validate the efficacy and adaptability of\nGDC. The codes are available at\nhttps://github.com/LiuZhu-CV/GDC-OptimizationLearning\n","authors":["Risheng Liu","Zhu Liu","Pan Mu","Xin Fan","Zhongxuan Luo"],"pdf_url":"https://arxiv.org/pdf/2012.05435v2.pdf","comment":"14 pages. The codes are available at\n https://github.com/LiuZhu-CV/GDC-OptimizationLearning"},{"id":"http://arxiv.org/abs/2310.00917v3","updated":"2023-10-26T05:33:06Z","published":"2023-10-02T06:08:01Z","title":"Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards\n Enhancing Text Spotting Performance","summary":" The adaptation capability to a wide range of domains is crucial for scene\ntext spotting models when deployed to real-world conditions. However, existing\nstate-of-the-art (SOTA) approaches usually incorporate scene text detection and\nrecognition simply by pretraining on natural scene text datasets, which do not\ndirectly exploit the intermediate feature representations between multiple\ndomains. 
Here, we investigate the problem of domain-adaptive scene text\nspotting, i.e., training a model on multi-domain source data such that it can\ndirectly adapt to target domains rather than being specialized for a specific\ndomain or scenario. Further, we investigate a transformer baseline called\nSwin-TESTR to focus on solving scene-text spotting for both regular and\narbitrary-shaped scene text along with an exhaustive evaluation. The results\nclearly demonstrate the potential of intermediate representations to achieve\nsignificant performance on text spotting benchmarks across multiple domains\n(e.g. language, synth-to-real, and documents), both in terms of accuracy and\nefficiency.\n","authors":["Alloy Das","Sanket Biswas","Ayan Banerjee","Saumik Bhattacharya","Josep Lladós","Umapada Pal"],"pdf_url":"https://arxiv.org/pdf/2310.00917v3.pdf","comment":"Accepted to the 2024 IEEE/CVF Winter Conference on Applications of\n Computer Vision (WACV 2024)"},{"id":"http://arxiv.org/abs/2310.17164v1","updated":"2023-10-26T05:32:33Z","published":"2023-10-26T05:32:33Z","title":"Bridging Phylogeny and Taxonomy with Protein-protein Interaction\n Networks","summary":" The protein-protein interaction (PPI) network provides an overview of the\ncomplex biological reactions vital to an organism's metabolism and survival.\nEven though in the past PPI networks were compared across organisms in detail,\nthere has not been large-scale research on how individual PPI networks reflect\non the species relationships. In this study we aim to increase our\nunderstanding of the tree of life and taxonomy by gleaning information from the\nPPI networks. We successfully created (1) a predictor of network statistics based\non known traits of existing species in the phylogeny, and (2) a taxonomic\nclassifier of organisms using the known protein network statistics, whether\nexperimentally determined or predicted de novo. 
With the knowledge of protein\ninteractions at its core, our two models effectively connect two fields with\nwidely diverging methodologies - the phylogeny and taxonomy of species.\n","authors":["Long-Huei Chen","Mohana Prasad Sathya Moorthy","Pratyaksh Sharma"],"pdf_url":"https://arxiv.org/pdf/2310.17164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17163v1","updated":"2023-10-26T05:28:32Z","published":"2023-10-26T05:28:32Z","title":"Low-Dimensional Gradient Helps Out-of-Distribution Detection","summary":" Detecting out-of-distribution (OOD) samples is essential for ensuring the\nreliability of deep neural networks (DNNs) in real-world scenarios. While\nprevious research has predominantly investigated the disparity between\nin-distribution (ID) and OOD data through forward information analysis, the\ndiscrepancy in parameter gradients during the backward process of DNNs has\nreceived insufficient attention. Existing studies on gradient disparities\nmainly focus on the utilization of gradient norms, neglecting the wealth of\ninformation embedded in gradient directions. To bridge this gap, in this paper,\nwe conduct a comprehensive investigation into leveraging the entirety of\ngradient information for OOD detection. The primary challenge arises from the\nhigh dimensionality of gradients due to the large number of network parameters.\nTo solve this problem, we propose performing linear dimension reduction on the\ngradient using a designated subspace that comprises principal components. This\ninnovative technique enables us to obtain a low-dimensional representation of\nthe gradient with minimal information loss. Subsequently, by integrating the\nreduced gradient with various existing detection score functions, our approach\ndemonstrates superior performance across a wide range of detection tasks. 
For\ninstance, on the ImageNet benchmark, our method achieves an average reduction\nof 11.15% in the false positive rate at 95% recall (FPR95) compared to the\ncurrent state-of-the-art approach. The code would be released.\n","authors":["Yingwen Wu","Tao Li","Xinwen Cheng","Jie Yang","Xiaolin Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17163v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.11699v4","updated":"2023-10-26T05:25:17Z","published":"2022-07-24T09:37:42Z","title":"Semi-supervised Deep Multi-view Stereo","summary":" Significant progress has been witnessed in learning-based Multi-view Stereo\n(MVS) under supervised and unsupervised settings. To combine their respective\nmerits in accuracy and completeness, meantime reducing the demand for expensive\nlabeled data, this paper explores the problem of learning-based MVS in a\nsemi-supervised setting that only a tiny part of the MVS data is attached with\ndense depth ground truth. However, due to huge variation of scenarios and\nflexible settings in views, it may break the basic assumption in classic\nsemi-supervised learning, that unlabeled data and labeled data share the same\nlabel space and data distribution, named as semi-supervised distribution-gap\nambiguity in the MVS problem. To handle these issues, we propose a novel\nsemi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the\nsimple case that the basic assumption works in MVS data, consistency\nregularization encourages the model predictions to be consistent between\noriginal sample and randomly augmented sample. For further troublesome case\nthat the basic assumption is conflicted in MVS data, we propose a novel style\nconsistency loss to alleviate the negative effect caused by the distribution\ngap. The visual style of unlabeled sample is transferred to labeled sample to\nshrink the gap, and the model prediction of generated sample is further\nsupervised with the label in original labeled sample. 
The experimental results\nin semi-supervised settings of multiple MVS datasets show the superior\nperformance of the proposed method. With the same settings in backbone network,\nour proposed SDA-MVS outperforms its fully-supervised and unsupervised\nbaselines.\n","authors":["Hongbin Xu","Weitao Chen","Yang Liu","Zhipeng Zhou","Haihong Xiao","Baigui Sun","Xuansong Xie","Wenxiong Kang"],"pdf_url":"https://arxiv.org/pdf/2207.11699v4.pdf","comment":"This paper is accepted in ACMMM-2023. The code is released at:\n https://github.com/ToughStoneX/Semi-MVS"},{"id":"http://arxiv.org/abs/2310.09478v2","updated":"2023-10-26T05:21:06Z","published":"2023-10-14T03:22:07Z","title":"MiniGPT-v2: large language model as a unified interface for\n vision-language multi-task learning","summary":" Large language models have shown their remarkable capabilities as a general\ninterface for various language-related applications. Motivated by this, we\ntarget to build a unified interface for completing many vision-language tasks\nincluding image description, visual question answering, and visual grounding,\namong others. The challenge is to use a single model for performing diverse\nvision-language tasks effectively with simple multi-modal instructions. Towards\nthis objective, we introduce MiniGPT-v2, a model that can be treated as a\nunified interface for better handling various vision-language tasks. We propose\nusing unique identifiers for different tasks when training the model. These\nidentifiers enable our model to better distinguish each task instruction\neffortlessly and also improve the model learning efficiency for each task.\nAfter the three-stage training, the experimental results show that MiniGPT-v2\nachieves strong performance on many visual question-answering and visual\ngrounding benchmarks compared to other vision-language generalist models. 
Our\nmodel and codes are available at https://minigpt-v2.github.io/\n","authors":["Jun Chen","Deyao Zhu","Xiaoqian Shen","Xiang Li","Zechun Liu","Pengchuan Zhang","Raghuraman Krishnamoorthi","Vikas Chandra","Yunyang Xiong","Mohamed Elhoseiny"],"pdf_url":"https://arxiv.org/pdf/2310.09478v2.pdf","comment":"fix small typos"},{"id":"http://arxiv.org/abs/2302.06961v3","updated":"2023-10-26T05:18:43Z","published":"2023-02-14T10:40:20Z","title":"DualStreamFoveaNet: A Dual Stream Fusion Architecture with Anatomical\n Awareness for Robust Fovea Localization","summary":" Accurate fovea localization is essential for analyzing retinal diseases to\nprevent irreversible vision loss. While current deep learning-based methods\noutperform traditional ones, they still face challenges such as the lack of\nlocal anatomical landmarks around the fovea, the inability to robustly handle\ndiseased retinal images, and the variations in image conditions. In this paper,\nwe propose a novel transformer-based architecture called DualStreamFoveaNet\n(DSFN) for multi-cue fusion. This architecture explicitly incorporates\nlong-range connections and global features using retina and vessel\ndistributions for robust fovea localization. We introduce a spatial attention\nmechanism in the dual-stream encoder to extract and fuse self-learned\nanatomical information, focusing more on features distributed along blood\nvessels and significantly reducing computational costs by decreasing token\nnumbers. Our extensive experiments show that the proposed architecture achieves\nstate-of-the-art performance on two public datasets and one large-scale private\ndataset. 
Furthermore, we demonstrate that the DSFN is more robust on both\nnormal and diseased retina images and has better generalization capacity in\ncross-dataset experiments.\n","authors":["Sifan Song","Jinfeng Wang","Zilong Wang","Shaopeng Wang","Jionglong Su","Xiaowei Ding","Kang Dang"],"pdf_url":"https://arxiv.org/pdf/2302.06961v3.pdf","comment":"This paper is prepared for IEEE Transactions on Biomedical\n Engineering"},{"id":"http://arxiv.org/abs/2310.15447v2","updated":"2023-10-26T05:07:55Z","published":"2023-10-24T01:44:11Z","title":"DeepIron: Predicting Unwarped Garment Texture from a Single Image","summary":" Realistic reconstruction of 3D clothing from an image has wide applications,\nsuch as avatar creation and virtual try-on. This paper presents a novel\nframework that reconstructs the texture map for 3D garments from a single image\nwith pose. Assuming that 3D garments are modeled by stitching 2D garment sewing\npatterns, our specific goal is to generate a texture image for the sewing\npatterns. A key component of our framework, the Texture Unwarper, infers the\noriginal texture image from the input clothing image, which exhibits warping\nand occlusion of texture due to the user's body shape and pose. The Texture\nUnwarper effectively transforms between the input and output images by mapping\nthe latent spaces of the two images. By inferring the unwarped original texture\nof the input garment, our method helps reconstruct 3D garment models that can\nshow high-quality texture images realistically deformed for new poses. 
We\nvalidate the effectiveness of our approach through a comparison with other\nmethods and ablation studies.\n","authors":["Hyun-Song Kwon","Sung-Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2310.15447v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06700v2","updated":"2023-10-26T05:04:17Z","published":"2023-04-13T17:52:29Z","title":"Control3Diff: Learning Controllable 3D Diffusion Models from Single-view\n Images","summary":" Diffusion models have recently become the de-facto approach for generative\nmodeling in the 2D domain. However, extending diffusion models to 3D is\nchallenging due to the difficulties in acquiring 3D ground truth data for\ntraining. On the other hand, 3D GANs that integrate implicit 3D representations\ninto GANs have shown remarkable 3D-aware generation when trained only on\nsingle-view image datasets. However, 3D GANs do not provide straightforward\nways to precisely control image synthesis. To address these challenges, We\npresent Control3Diff, a 3D diffusion model that combines the strengths of\ndiffusion models and 3D GANs for versatile, controllable 3D-aware image\nsynthesis for single-view datasets. Control3Diff explicitly models the\nunderlying latent distribution (optionally conditioned on external inputs),\nthus enabling direct control during the diffusion process. Moreover, our\napproach is general and applicable to any type of controlling input, allowing\nus to train it with the same diffusion objective without any auxiliary\nsupervision. We validate the efficacy of Control3Diff on standard image\ngeneration benchmarks, including FFHQ, AFHQ, and ShapeNet, using various\nconditioning inputs such as images, sketches, and text prompts. 
Please see the\nproject website (\\url{https://jiataogu.me/control3diff}) for video comparisons.\n","authors":["Jiatao Gu","Qingzhe Gao","Shuangfei Zhai","Baoquan Chen","Lingjie Liu","Josh Susskind"],"pdf_url":"https://arxiv.org/pdf/2304.06700v2.pdf","comment":"Accepted by 3DV24"},{"id":"http://arxiv.org/abs/2310.17158v1","updated":"2023-10-26T05:02:19Z","published":"2023-10-26T05:02:19Z","title":"CosmosDSR -- a methodology for automated detection and tracking of\n orbital debris using the Unscented Kalman Filter","summary":" The Kessler syndrome refers to the escalating space debris from frequent\nspace activities, threatening future space exploration. Addressing this issue\nis vital. Several AI models, including Convolutional Neural Networks (CNN),\nKernel Principal Component Analysis (KPCA), and Model-Agnostic Meta-Learning\n(MAML), have been assessed with various data types. Earlier studies highlighted\nthe combination of the YOLO object detector and a linear Kalman filter for\nobject detection and tracking. Building on this, our project introduces\nCosmosDSR, a novel methodology combining YOLOv3 with an Unscented Kalman Filter\nfor tracking satellites in sequential images, compared to a linear Kalman\nfilter. Using the SPARK dataset from the University of Luxembourg for training\nand testing, the YOLOv3 precisely detected and classified all satellite\ncategories (mAP=97.18%, F1=0.95) with few errors (TP=4163, FP=209, FN=237).\nBoth CosmosDSR and the LKF tracked satellites accurately (UKF:\nMSE=2.83/RMSE=1.66, LKF: MSE=2.84/RMSE=1.66). Despite concerns of class\nimbalance and the absence of real images, the model shows promise. Future work\nshould address these limitations, increase tracking sample size, and improve\nmetrics. This research suggests the algorithm's potential in detecting and\ntracking satellites, paving the way for solutions to the Kessler syndrome.\n","authors":["Daniel S. 
Roll","Zeyneb Kurt","Wai Lok Woo"],"pdf_url":"https://arxiv.org/pdf/2310.17158v1.pdf","comment":"7 figures, 15 pages inc refs"},{"id":"http://arxiv.org/abs/2310.17156v1","updated":"2023-10-26T05:00:41Z","published":"2023-10-26T05:00:41Z","title":"Learning depth from monocular video sequences","summary":" Learning single image depth estimation model from monocular video sequence is\na very challenging problem. In this paper, we propose a novel training loss\nwhich enables us to include more images for supervision during the training\nprocess. We propose a simple yet effective model to account the frame to frame\npixel motion. We also design a novel network architecture for single image\nestimation. When combined, our method produces state of the art results for\nmonocular depth estimation on the KITTI dataset in the self-supervised setting.\n","authors":["Zhenwei Luo"],"pdf_url":"https://arxiv.org/pdf/2310.17156v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.07254v2","updated":"2023-10-26T04:59:17Z","published":"2023-09-13T18:43:13Z","title":"Mitigate Replication and Copying in Diffusion Models with Generalized\n Caption and Dual Fusion Enhancement","summary":" While diffusion models demonstrate a remarkable capability for generating\nhigh-quality images, their tendency to `replicate' training data raises privacy\nconcerns. Although recent research suggests that this replication may stem from\nthe insufficient generalization of training data captions and duplication of\ntraining images, effective mitigation strategies remain elusive. To address\nthis gap, our paper first introduces a generality score that measures the\ncaption generality and employ large language model (LLM) to generalize training\ncaptions. Subsequently, we leverage generalized captions and propose a novel\ndual fusion enhancement approach to mitigate the replication of diffusion\nmodels. 
Our empirical results demonstrate that our proposed methods can\nsignificantly reduce replication by 43.5% compared to the original diffusion\nmodel while maintaining the diversity and quality of generations.\n","authors":["Chenghao Li","Dake Chen","Yuke Zhang","Peter A. Beerel"],"pdf_url":"https://arxiv.org/pdf/2309.07254v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17154v1","updated":"2023-10-26T04:54:39Z","published":"2023-10-26T04:54:39Z","title":"Deep Imbalanced Regression via Hierarchical Classification Adjustment","summary":" Regression tasks in computer vision, such as age estimation or counting, are\noften formulated into classification by quantizing the target space into\nclasses. Yet real-world data is often imbalanced -- the majority of training\nsamples lie in a head range of target values, while a minority of samples span\na usually larger tail range. By selecting the class quantization, one can\nadjust imbalanced regression targets into balanced classification outputs,\nthough there are trade-offs in balancing classification accuracy and\nquantization error. To improve regression performance over the entire range of\ndata, we propose to construct hierarchical classifiers for solving imbalanced\nregression tasks. The fine-grained classifiers limit the quantization error\nwhile being modulated by the coarse predictions to ensure high accuracy.\nStandard hierarchical classification approaches, however, when applied to the\nregression problem, fail to ensure that predicted ranges remain consistent\nacross the hierarchy. As such, we propose a range-preserving distillation\nprocess that can effectively learn a single classifier from the set of\nhierarchical classifiers. Our novel hierarchical classification adjustment\n(HCA) for imbalanced regression shows superior results on three diverse tasks:\nage estimation, crowd counting and depth estimation. 
We will release the source\ncode upon acceptance.\n","authors":["Haipeng Xiong","Angela Yao"],"pdf_url":"https://arxiv.org/pdf/2310.17154v1.pdf","comment":"14 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.17152v1","updated":"2023-10-26T04:52:25Z","published":"2023-10-26T04:52:25Z","title":"Technical Note: Feasibility of translating 3.0T-trained Deep-Learning\n Segmentation Models Out-of-the-Box on Low-Field MRI 0.55T Knee-MRI of Healthy\n Controls","summary":" In the current study, our purpose is to evaluate the feasibility of applying\ndeep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in\nhealthy controls scanned at 0.55T and compared with 3.0T. The current study\nassesses the performance of standard in-practice bone, and cartilage\nsegmentation algorithms at 0.55T, both qualitatively and quantitatively, in\nterms of comparing segmentation performance, areas of improvement, and\ncompartment-wise cartilage thickness values between 0.55T vs. 3.0T. Initial\nresults demonstrate a usable to good technical feasibility of translating\nexisting quantitative deep-learning-based image segmentation techniques,\ntrained on 3.0T, out of 0.55T for knee MRI, in a multi-vendor acquisition\nenvironment. Especially in terms of segmenting cartilage compartments, the\nmodels perform almost equivalent to 3.0T in terms of Likert ranking. The 0.55T\nlow-field sustainable and easy-to-install MRI, as demonstrated, thus, can be\nutilized for evaluating knee cartilage thickness and bone segmentations aided\nby established DL algorithms trained at higher-field strengths out-of-the-box\ninitially. This could be utilized at the far-spread point-of-care locations\nwith a lack of radiologists available to manually segment low-field images, at\nleast till a decent base of low-field data pool is collated. 
With further\nfine-tuning with manual labeling of low-field data or utilizing synthesized\nhigher SNR images from low-field images, OA biomarker quantification\nperformance is potentially guaranteed to be further improved.\n","authors":["Rupsa Bhattacharjee","Zehra Akkaya","Johanna Luitjens","Pan Su","Yang Yang","Valentina Pedoia","Sharmila Majumdar"],"pdf_url":"https://arxiv.org/pdf/2310.17152v1.pdf","comment":"11 Pages, 3 Figures, 2 Tables"},{"id":"http://arxiv.org/abs/2310.17147v1","updated":"2023-10-26T04:42:57Z","published":"2023-10-26T04:42:57Z","title":"Simple Baselines for Projection-based Full-reference and No-reference\n Point Cloud Quality Assessment","summary":" Point clouds are widely used in 3D content representation and have various\napplications in multimedia. However, compression and simplification processes\ninevitably result in the loss of quality-aware information under storage and\nbandwidth constraints. Therefore, there is an increasing need for effective\nmethods to quantify the degree of distortion in point clouds. In this paper, we\npropose simple baselines for projection-based point cloud quality assessment\n(PCQA) to tackle this challenge. We use multi-projections obtained via a common\ncube-like projection process from the point clouds for both full-reference (FR)\nand no-reference (NR) PCQA tasks. Quality-aware features are extracted with\npopular vision backbones. The FR quality representation is computed as the\nsimilarity between the feature maps of reference and distorted projections\nwhile the NR quality representation is obtained by simply squeezing the feature\nmaps of distorted projections with average pooling The corresponding quality\nrepresentations are regressed into visual quality scores by fully-connected\nlayers. 
Taking part in the ICIP 2023 PCVQA Challenge, we succeeded in achieving\nthe top spot in four out of the five competition tracks.\n","authors":["Zicheng Zhang","Yingjie Zhou","Wei Sun","Xiongkuo Min","Guangtao Zhai"],"pdf_url":"https://arxiv.org/pdf/2310.17147v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04824v2","updated":"2023-10-26T04:30:30Z","published":"2023-02-09T18:25:24Z","title":"Lithium Metal Battery Quality Control via Transformer-CNN Segmentation","summary":" Lithium metal battery (LMB) has the potential to be the next-generation\nbattery system because of its high theoretical energy density. However, defects\nknown as dendrites are formed by heterogeneous lithium (Li) plating, which\nhinders the development and utilization of LMBs. Non-destructive techniques to\nobserve the dendrite morphology often use X-ray computed tomography (XCT) to\nprovide cross-sectional views. To retrieve three-dimensional structures inside\na battery, image segmentation becomes essential to quantitatively analyze XCT\nimages. This work proposes a new semantic segmentation approach using a\ntransformer-based neural network called TransforCNN that is capable of\nsegmenting out dendrites from XCT data. In addition, we compare the performance\nof the proposed TransforCNN with three other algorithms, such as U-Net, Y-Net,\nand E-Net, consisting of an Ensemble Network model for XCT analysis. 
Our\nresults show the advantages of using TransforCNN when evaluating\nover-segmentation metrics, such as mean Intersection over Union (mIoU) and mean\nDice Similarity Coefficient (mDSC) as well as through several qualitatively\ncomparative visualizations.\n","authors":["Jerome Quenum","Iryna Zenyuk","Daniela Ushizima"],"pdf_url":"https://arxiv.org/pdf/2302.04824v2.pdf","comment":"15 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.17138v1","updated":"2023-10-26T04:20:39Z","published":"2023-10-26T04:20:39Z","title":"A Classifier Using Global Character Level and Local Sub-unit Level\n Features for Hindi Online Handwritten Character Recognition","summary":" A classifier is developed that defines a joint distribution of global\ncharacter features, number of sub-units and local sub-unit features to model\nHindi online handwritten characters. The classifier uses latent variables to\nmodel the structure of sub-units. The classifier uses histograms of points,\norientations, and dynamics of orientations (HPOD) features to represent\ncharacters at global character level and local sub-unit level and is\nindependent of character stroke order and stroke direction variations. The\nparameters of the classifier is estimated using maximum likelihood method.\nDifferent classifiers and features used in other studies are considered in this\nstudy for classification performance comparison with the developed classifier.\nThe classifiers considered are Second Order Statistics (SOS), Sub-space (SS),\nFisher Discriminant (FD), Feedforward Neural Network (FFN) and Support Vector\nMachines (SVM) and the features considered are Spatio Temporal (ST), Discrete\nFourier Transform (DFT), Discrete Cosine Transform (SCT), Discrete Wavelet\nTransform (DWT), Spatial (SP) and Histograms of Oriented Gradients (HOG). Hindi\ncharacter datasets used for training and testing the developed classifier\nconsist of samples of handwritten characters from 96 different character\nclasses. 
There are 12832 samples with an average of 133 samples per character\nclass in the training set and 2821 samples with an average of 29 samples per\ncharacter class in the testing set. The developed classifier has the highest\naccuracy of 93.5\\% on the testing set compared to that of the classifiers\ntrained on different features extracted from the same training set and\nevaluated on the same testing set considered in this study.\n","authors":["Anand Sharma","A. G. Ramakrishnan"],"pdf_url":"https://arxiv.org/pdf/2310.17138v1.pdf","comment":"23 pages, 8 jpg figures. arXiv admin note: text overlap with\n arXiv:2310.08222"},{"id":"http://arxiv.org/abs/2310.17135v1","updated":"2023-10-26T04:18:00Z","published":"2023-10-26T04:18:00Z","title":"Comparison of Cross-Entropy, Dice, and Focal Loss for Sea Ice Type\n Segmentation","summary":" Up-to-date sea ice charts are crucial for safer navigation in ice-infested\nwaters. Recently, Convolutional Neural Network (CNN) models show the potential\nto accelerate the generation of ice maps for large regions. However, results\nfrom CNN models still need to undergo scrutiny as higher metrics performance\nnot always translate to adequate outputs. Sea ice type classes are imbalanced,\nrequiring special treatment during training. We evaluate how three different\nloss functions, some developed for imbalanced class problems, affect the\nperformance of CNN models trained to predict the dominant ice type in\nSentinel-1 images. 
Despite the fact that Dice and Focal loss produce higher\nmetrics, results from cross-entropy seem generally more physically consistent.\n","authors":["Rafael Pires de Lima","Behzad Vahedi","Morteza Karimzadeh"],"pdf_url":"https://arxiv.org/pdf/2310.17135v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16436v2","updated":"2023-10-26T04:16:52Z","published":"2023-10-25T08:03:10Z","title":"DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning\n in Language Models","summary":" A long-standing goal of AI systems is to perform complex multimodal reasoning\nlike humans. Recently, large language models (LLMs) have made remarkable\nstrides in such multi-step reasoning on the language modality solely by\nleveraging the chain of thought (CoT) to mimic human thinking. However, the\ntransfer of these advancements to multimodal contexts introduces heightened\nchallenges, including but not limited to the impractical need for\nlabor-intensive annotation and the limitations in terms of flexibility,\ngeneralizability, and explainability. To evoke CoT reasoning in multimodality,\nthis work first conducts an in-depth analysis of these challenges posed by\nmultimodality and presents two key insights: \"keeping critical thinking\" and\n\"letting everyone do their jobs\" in multimodal CoT reasoning. Furthermore, this\nstudy proposes a novel DDCoT prompting that maintains a critical attitude\nthrough negative-space prompting and incorporates multimodality into reasoning\nby first dividing the reasoning responsibility of LLMs into reasoning and\nrecognition and then integrating the visual recognition capability of visual\nmodels into the joint reasoning process. 
The rationales generated by DDCoT not\nonly improve the reasoning abilities of both large and small language models in\nzero-shot prompting and fine-tuning learning, significantly outperforming\nstate-of-the-art methods but also exhibit impressive generalizability and\nexplainability.\n","authors":["Ge Zheng","Bin Yang","Jiajin Tang","Hong-Yu Zhou","Sibei Yang"],"pdf_url":"https://arxiv.org/pdf/2310.16436v2.pdf","comment":"24 pages, 13 figures, to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17131v1","updated":"2023-10-26T04:11:34Z","published":"2023-10-26T04:11:34Z","title":"Virtual Accessory Try-On via Keypoint Hallucination","summary":" The virtual try-on task refers to fitting the clothes from one image onto\nanother portrait image. In this paper, we focus on virtual accessory try-on,\nwhich fits accessory (e.g., glasses, ties) onto a face or portrait image.\nUnlike clothing try-on, which relies on human silhouette as guidance, accessory\ntry-on warps the accessory into an appropriate location and shape to generate a\nplausible composite image. In contrast to previous try-on methods that treat\nforeground (i.e., accessories) and background (i.e., human faces or bodies)\nequally, we propose a background-oriented network to utilize the prior\nknowledge of human bodies and accessories. Specifically, our approach learns\nthe human body priors and hallucinates the target locations of specified\nforeground keypoints in the background. Then our approach will inject\nforeground information with accessory priors into the background UNet. Based on\nthe hallucinated target locations, the warping parameters are calculated to\nwarp the foreground. Moreover, this background-oriented network can also easily\nincorporate auxiliary human face/body semantic segmentation supervision to\nfurther boost performance. 
Experiments conducted on STRAT dataset validate the\neffectiveness of our proposed method.\n","authors":["Junhong Gou","Bo Zhang","Li Niu","Jianfu Zhang","Jianlou Si","Chen Qian","Liqing Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17128v1","updated":"2023-10-26T04:08:07Z","published":"2023-10-26T04:08:07Z","title":"Task-driven Prompt Evolution for Foundation Models","summary":" Promptable foundation models, particularly Segment Anything Model (SAM), have\nemerged as a promising alternative to the traditional task-specific supervised\nlearning for image segmentation. However, many evaluation studies have found\nthat their performance on medical imaging modalities to be underwhelming\ncompared to conventional deep learning methods. In the world of large\npre-trained language and vision-language models, learning prompt from\ndownstream tasks has achieved considerable success in improving performance. In\nthis work, we propose a plug-and-play Prompt Optimization Technique for\nfoundation models like SAM (SAMPOT) that utilizes the downstream segmentation\ntask to optimize the human-provided prompt to obtain improved performance. We\ndemonstrate the utility of SAMPOT on lung segmentation in chest X-ray images\nand obtain an improvement on a significant number of cases ($\\sim75\\%$) over\nhuman-provided initial prompts. We hope this work will lead to further\ninvestigations in the nascent field of automatic visual prompt-tuning.\n","authors":["Rachana Sathish","Rahul Venkataramani","K S Shriram","Prasad Sudhakar"],"pdf_url":"https://arxiv.org/pdf/2310.17128v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.09249v2","updated":"2023-10-26T03:56:58Z","published":"2022-03-17T11:18:17Z","title":"Fine-tuning Global Model via Data-Free Knowledge Distillation for\n Non-IID Federated Learning","summary":" Federated Learning (FL) is an emerging distributed learning paradigm under\nprivacy constraint. 
Data heterogeneity is one of the main challenges in FL,\nwhich results in slow convergence and degraded performance. Most existing\napproaches only tackle the heterogeneity challenge by restricting the local\nmodel update in client, ignoring the performance drop caused by direct global\nmodel aggregation. Instead, we propose a data-free knowledge distillation\nmethod to fine-tune the global model in the server (FedFTG), which relieves the\nissue of direct model aggregation. Concretely, FedFTG explores the input space\nof local models through a generator, and uses it to transfer the knowledge from\nlocal models to the global model. Besides, we propose a hard sample mining\nscheme to achieve effective knowledge distillation throughout the training. In\naddition, we develop customized label sampling and class-level ensemble to\nderive maximum utilization of knowledge, which implicitly mitigates the\ndistribution discrepancy across clients. Extensive experiments show that our\nFedFTG significantly outperforms the state-of-the-art (SOTA) FL algorithms and\ncan serve as a strong plugin for enhancing FedAvg, FedProx, FedDyn, and\nSCAFFOLD.\n","authors":["Lin Zhang","Li Shen","Liang Ding","Dacheng Tao","Ling-Yu Duan"],"pdf_url":"https://arxiv.org/pdf/2203.09249v2.pdf","comment":"This paper is accepted by CVPR2022"},{"id":"http://arxiv.org/abs/2310.17126v1","updated":"2023-10-26T03:52:54Z","published":"2023-10-26T03:52:54Z","title":"Deep Learning on SAR Imagery: Transfer Learning Versus Randomly\n Initialized Weights","summary":" Deploying deep learning on Synthetic Aperture Radar (SAR) data is becoming\nmore common for mapping purposes. One such case is sea ice, which is highly\ndynamic and rapidly changes as a result of the combined effect of wind,\ntemperature, and ocean currents. Therefore, frequent mapping of sea ice is\nnecessary to ensure safe marine navigation. However, there is a general\nshortage of expert-labeled data to train deep learning algorithms. 
Fine-tuning\na pre-trained model on SAR imagery is a potential solution. In this paper, we\ncompare the performance of deep learning models trained from scratch using\nrandomly initialized weights against pre-trained models that we fine-tune for\nthis purpose. Our results show that pre-trained models lead to better results,\nespecially on test samples from the melt season.\n","authors":["Morteza Karimzadeh","Rafael Pires de Lima"],"pdf_url":"https://arxiv.org/pdf/2310.17126v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17122v1","updated":"2023-10-26T03:43:28Z","published":"2023-10-26T03:43:28Z","title":"Enhancing sea ice segmentation in Sentinel-1 images with atrous\n convolutions","summary":" Due to the growing volume of remote sensing data and the low latency required\nfor safe marine navigation, machine learning (ML) algorithms are being\ndeveloped to accelerate sea ice chart generation, currently a manual\ninterpretation task. However, the low signal-to-noise ratio of the freely\navailable Sentinel-1 Synthetic Aperture Radar (SAR) imagery, the ambiguity of\nbackscatter signals for ice types, and the scarcity of open-source\nhigh-resolution labelled data makes automating sea ice mapping challenging. We\nuse Extreme Earth version 2, a high-resolution benchmark dataset generated for\nML training and evaluation, to investigate the effectiveness of ML for\nautomated sea ice mapping. Our customized pipeline combines ResNets and Atrous\nSpatial Pyramid Pooling for SAR image segmentation. We investigate the\nperformance of our model for: i) binary classification of sea ice and open\nwater in a segmentation framework; and ii) a multiclass segmentation of five\nsea ice types. For binary ice-water classification, models trained with our\nlargest training set have weighted F1 scores all greater than 0.95 for January\nand July test scenes. Specifically, the median weighted F1 score was 0.98,\nindicating high performance for both months. 
By comparison, a competitive\nbaseline U-Net has a weighted average F1 score of ranging from 0.92 to 0.94\n(median 0.93) for July, and 0.97 to 0.98 (median 0.97) for January. Multiclass\nice type classification is more challenging, and even though our models achieve\n2% improvement in weighted F1 average compared to the baseline U-Net, test\nweighted F1 is generally between 0.6 and 0.80. Our approach can efficiently\nsegment full SAR scenes in one run, is faster than the baseline U-Net, retains\nspatial resolution and dimension, and is more robust against noise compared to\napproaches that rely on patch classification.\n","authors":["Rafael Pires de Lima","Behzad Vahedi","Nick Hughes","Andrew P. Barrett","Walter Meier","Morteza Karimzadeh"],"pdf_url":"https://arxiv.org/pdf/2310.17122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10350v2","updated":"2023-10-26T03:41:06Z","published":"2023-07-19T17:47:12Z","title":"Improving Multimodal Datasets with Image Captioning","summary":" Massive web datasets play a key role in the success of large vision-language\nmodels like CLIP and Flamingo. However, the raw web data is noisy, and existing\nfiltering methods to reduce noise often come at the expense of data diversity.\nOur work focuses on caption quality as one major source of noise, and studies\nhow generated captions can increase the utility of web-scraped datapoints with\nnondescript text. Through exploring different mixing strategies for raw and\ngenerated captions, we outperform the best filtering method proposed by the\nDataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a\ncandidate pool of 128M image-text pairs. Our best approach is also 2x better at\nFlickr and MS-COCO retrieval. We then analyze what makes synthetic captions an\neffective source of text supervision. 
In experimenting with different image\ncaptioning models, we also demonstrate that the performance of a model on\nstandard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable\nindicator of the utility of the captions it generates for multimodal training.\nFinally, our experiments with using generated captions at DataComp's large\nscale (1.28B image-text pairs) offer insights into the limitations of synthetic\ntext, as well as the importance of image curation with increasing training data\nquantity. The synthetic captions used in our experiments are now available on\nHuggingFace.\n","authors":["Thao Nguyen","Samir Yitzhak Gadre","Gabriel Ilharco","Sewoong Oh","Ludwig Schmidt"],"pdf_url":"https://arxiv.org/pdf/2307.10350v2.pdf","comment":"Accepted at NeurIPS 2023 Datasets & Benchmarks"},{"id":"http://arxiv.org/abs/2305.02324v2","updated":"2023-10-26T03:38:48Z","published":"2023-05-03T10:31:35Z","title":"Cross-Stream Contrastive Learning for Self-Supervised Skeleton-Based\n Action Recognition","summary":" Self-supervised skeleton-based action recognition enjoys a rapid growth along\nwith the development of contrastive learning. The existing methods rely on\nimposing invariance to augmentations of 3D skeleton within a single data\nstream, which merely leverages the easy positive pairs and limits the ability\nto explore the complicated movement patterns. In this paper, we advocate that\nthe defect of single-stream contrast and the lack of necessary feature\ntransformation are responsible for easy positives, and therefore propose a\nCross-Stream Contrastive Learning framework for skeleton-based action\nRepresentation learning (CSCLR). Specifically, the proposed CSCLR not only\nutilizes intra-stream contrast pairs, but introduces inter-stream contrast\npairs as hard samples to formulate a better representation learning. 
Besides,\nto further exploit the potential of positive pairs and increase the robustness\nof self-supervised representation learning, we propose a Positive Feature\nTransformation (PFT) strategy which adopts feature-level manipulation to\nincrease the variance of positive pairs. To validate the effectiveness of our\nmethod, we conduct extensive experiments on three benchmark datasets NTU-RGB+D\n60, NTU-RGB+D 120 and PKU-MMD. Experimental results show that our proposed\nCSCLR exceeds the state-of-the-art methods on a diverse range of evaluation\nprotocols.\n","authors":["Ding Li","Yongqiang Tang","Zhizhong Zhang","Wensheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.02324v2.pdf","comment":"15 pages, 7 figures"},{"id":"http://arxiv.org/abs/2306.00595v5","updated":"2023-10-26T03:13:30Z","published":"2023-06-01T12:12:22Z","title":"Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language\n Perspective","summary":" We focus on the weakly-supervised audio-visual video parsing task (AVVP),\nwhich aims to identify and locate all the events in audio/visual modalities.\nPrevious works only concentrate on video-level overall label denoising across\nmodalities, but overlook the segment-level label noise, where adjacent video\nsegments (i.e., 1-second video clips) may contain different events. However,\nrecognizing events in the segment is challenging because its label could be any\ncombination of events that occur in the video. To address this issue, we\nconsider tackling AVVP from the language perspective, since language could\nfreely describe how various events appear in each segment beyond fixed labels.\nSpecifically, we design language prompts to describe all cases of event\nappearance for each video. Then, the similarity between language prompts and\nsegments is calculated, where the event of the most similar prompt is regarded\nas the segment-level label. 
In addition, to deal with the mislabeled segments,\nwe propose to perform dynamic re-weighting on the unreliable segments to adjust\ntheir labels. Experiments show that our simple yet effective approach\noutperforms state-of-the-art methods by a large margin.\n","authors":["Yingying Fan","Yu Wu","Bo Du","Yutian Lin"],"pdf_url":"https://arxiv.org/pdf/2306.00595v5.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.08247v4","updated":"2023-10-26T03:13:18Z","published":"2023-06-14T05:25:06Z","title":"Diffusion in Diffusion: Cyclic One-Way Diffusion for\n Text-Vision-Conditioned Generation","summary":" Originating from the diffusion phenomenon in physics that describes particle\nmovement, the diffusion generative models inherit the characteristics of\nstochastic random walk in the data space along the denoising trajectory.\nHowever, the intrinsic mutual interference among image regions contradicts the\nneed for practical downstream application scenarios where the preservation of\nlow-level pixel information from given conditioning is desired (e.g.,\ncustomization tasks like personalized generation and inpainting based on a\nuser-provided single image). In this work, we investigate the diffusion\n(physics) in diffusion (machine learning) properties and propose our Cyclic\nOne-Way Diffusion (COW) method to control the direction of diffusion phenomenon\ngiven a pre-trained frozen diffusion model for versatile customization\napplication scenarios, where the low-level pixel information from the\nconditioning needs to be preserved. Notably, unlike most current methods that\nincorporate additional conditions by fine-tuning the base text-to-image\ndiffusion model or learning auxiliary networks, our method provides a novel\nperspective to understand the task needs and is applicable to a wider range of\ncustomization scenarios in a learning-free manner. 
Extensive experiment results\nshow that our proposed COW can achieve more flexible customization based on\nstrict visual conditions in different application settings.\n","authors":["Ruoyu Wang","Yongqi Yang","Zhihao Qian","Ye Zhu","Yu Wu"],"pdf_url":"https://arxiv.org/pdf/2306.08247v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.00198v2","updated":"2023-10-26T03:10:45Z","published":"2023-03-01T03:06:29Z","title":"Convolutional Visual Prompt for Robust Visual Perception","summary":" Vision models are often vulnerable to out-of-distribution (OOD) samples\nwithout adapting. While visual prompts offer a lightweight method of\ninput-space adaptation for large-scale vision models, they rely on a\nhigh-dimensional additive vector and labeled data. This leads to overfitting\nwhen adapting models in a self-supervised test-time setting without labels. We\nintroduce convolutional visual prompts (CVP) for label-free test-time\nadaptation for robust visual perception. The structured nature of CVP demands\nfewer trainable parameters, less than 1\\% compared to standard visual prompts,\ncombating overfitting. Extensive experiments and analysis on a wide variety of\nOOD visual perception tasks show that our approach is effective, improving\nrobustness by up to 5.87% over several large-scale models.\n","authors":["Yun-Yun Tsai","Chengzhi Mao","Junfeng Yang"],"pdf_url":"https://arxiv.org/pdf/2303.00198v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10355v3","updated":"2023-10-26T02:52:40Z","published":"2023-05-17T16:34:01Z","title":"Evaluating Object Hallucination in Large Vision-Language Models","summary":" Inspired by the superior language abilities of large language models (LLM),\nlarge vision-language models (LVLM) have been recently explored by integrating\npowerful LLMs for improving the performance on complex multimodal tasks.\nDespite the promising progress on LVLMs, we find that LVLMs suffer from the\nhallucination problem, i.e. 
they tend to generate objects that are inconsistent\nwith the target images in the descriptions. To investigate it, this work\npresents the first systematic study on object hallucination of LVLMs. We\nconduct the evaluation experiments on several representative LVLMs, and show\nthat they mostly suffer from severe object hallucination issue. We further\ndiscuss that the visual instructions may influence the hallucination, and find\nthat: objects that frequently occur in the visual instructions or co-occur with\nthe image objects, are obviously prone to be hallucinated by LVLMs. Besides, we\nfind that existing evaluation methods might be affected by the input\ninstructions and generation styles of LVLMs. Thus, we further design an\nimproved evaluation method for object hallucination by proposing a\npolling-based query method called POPE. Experiment results demonstrate that our\nPOPE can evaluate the object hallucination in a more stable and flexible way.\nOur codes and data are publicly available at https://github.com/RUCAIBox/POPE.\n","authors":["Yifan Li","Yifan Du","Kun Zhou","Jinpeng Wang","Wayne Xin Zhao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2305.10355v3.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2109.14251v3","updated":"2023-10-26T02:52:39Z","published":"2021-09-29T07:51:49Z","title":"Road Network Guided Fine-Grained Urban Traffic Flow Inference","summary":" Accurate inference of fine-grained traffic flow from coarse-grained one is an\nemerging yet crucial problem, which can help greatly reduce the number of the\nrequired traffic monitoring sensors for cost savings. 
In this work, we notice\nthat traffic flow has a high correlation with road network, which was either\ncompletely ignored or simply treated as an external factor in previous works.\nTo facilitate this problem, we propose a novel Road-Aware Traffic Flow\nMagnifier (RATFM) that explicitly exploits the prior knowledge of road networks\nto fully learn the road-aware spatial distribution of fine-grained traffic\nflow. Specifically, a multi-directional 1D convolutional layer is first\nintroduced to extract the semantic feature of the road network. Subsequently,\nwe incorporate the road network feature and coarse-grained flow feature to\nregularize the short-range spatial distribution modeling of road-relative\ntraffic flow. Furthermore, we take the road network feature as a query to\ncapture the long-range spatial distribution of traffic flow with a transformer\narchitecture. Benefiting from the road-aware inference mechanism, our method\ncan generate high-quality fine-grained traffic flow maps. Extensive experiments\non three real-world datasets show that the proposed RATFM outperforms\nstate-of-the-art models under various scenarios. Our code and datasets are\nreleased at {\\url{https://github.com/luimoli/RATFM}}.\n","authors":["Lingbo Liu","Mengmeng Liu","Guanbin Li","Ziyi Wu","Junfan Lin","Liang Lin"],"pdf_url":"https://arxiv.org/pdf/2109.14251v3.pdf","comment":"This work has been accepted to TNNLS"},{"id":"http://arxiv.org/abs/2310.17109v1","updated":"2023-10-26T02:37:08Z","published":"2023-10-26T02:37:08Z","title":"LP-OVOD: Open-Vocabulary Object Detection by Linear Probing","summary":" This paper addresses the challenging problem of open-vocabulary object\ndetection (OVOD) where an object detector must identify both seen and unseen\nclasses in test images without labeled examples of the unseen classes in\ntraining. A typical approach for OVOD is to use joint text-image embeddings of\nCLIP to assign box proposals to their closest text label. 
However, this method\nhas a critical issue: many low-quality boxes, such as over- and\nunder-covered-object boxes, have the same similarity score as high-quality\nboxes since CLIP is not trained on exact object location information. To\naddress this issue, we propose a novel method, LP-OVOD, that discards\nlow-quality boxes by training a sigmoid linear classifier on pseudo labels\nretrieved from the top relevant region proposals to the novel text.\nExperimental results on COCO affirm the superior performance of our approach\nover the state of the art, achieving $\\textbf{40.5}$ in $\\text{AP}_{novel}$\nusing ResNet50 as the backbone and without external datasets or knowing novel\nclasses during training. Our code will be available at\nhttps://github.com/VinAIResearch/LP-OVOD.\n","authors":["Chau Pham","Truong Vu","Khoi Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.17109v1.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.08872v4","updated":"2023-10-26T02:24:32Z","published":"2023-10-13T05:48:42Z","title":"R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image\n Generation","summary":" Recent text-to-image (T2I) diffusion models have achieved remarkable progress\nin generating high-quality images given text-prompts as input. However, these\nmodels fail to convey appropriate spatial composition specified by a layout\ninstruction. In this work, we probe into zero-shot grounded T2I generation with\ndiffusion models, that is, generating images corresponding to the input layout\ninformation without training auxiliary modules or finetuning diffusion models.\nWe propose a Region and Boundary (R&B) aware cross-attention guidance approach\nthat gradually modulates the attention maps of diffusion model during\ngenerative process, and assists the model to synthesize images (1) with high\nfidelity, (2) highly compatible with textual input, and (3) interpreting layout\ninstructions accurately. 
Specifically, we leverage the discrete sampling to\nbridge the gap between consecutive attention maps and discrete layout\nconstraints, and design a region-aware loss to refine the generative layout\nduring diffusion process. We further propose a boundary-aware loss to\nstrengthen object discriminability within the corresponding regions.\nExperimental results show that our method outperforms existing state-of-the-art\nzero-shot grounded T2I generation methods by a large margin both qualitatively\nand quantitatively on several benchmarks.\n","authors":["Jiayu Xiao","Liang Li","Henglei Lv","Shuhui Wang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2310.08872v4.pdf","comment":"Preprint. Under review. Project page:\n https://sagileo.github.io/Region-and-Boundary"},{"id":"http://arxiv.org/abs/2310.17097v1","updated":"2023-10-26T01:40:28Z","published":"2023-10-26T01:40:28Z","title":"Navigating Data Heterogeneity in Federated Learning: A Semi-Supervised\n Approach for Object Detection","summary":" Federated Learning (FL) has emerged as a potent framework for training models\nacross distributed data sources while maintaining data privacy. Nevertheless,\nit faces challenges with limited high-quality labels and non-IID client data,\nparticularly in applications like autonomous driving. To address these hurdles,\nwe navigate the uncharted waters of Semi-Supervised Federated Object Detection\n(SSFOD). We present a pioneering SSFOD framework, designed for scenarios where\nlabeled data reside only at the server while clients possess unlabeled data.\nNotably, our method represents the inaugural implementation of SSFOD for\nclients with 0% labeled non-IID data, a stark contrast to previous studies that\nmaintain some subset of labels at each client. We propose FedSTO, a two-stage\nstrategy encompassing Selective Training followed by Orthogonally enhanced\nfull-parameter training, to effectively address data shift (e.g. weather\nconditions) between server and clients. 
Our contributions include selectively\nrefining the backbone of the detector to avert overfitting, orthogonality\nregularization to boost representation divergence, and local EMA-driven pseudo\nlabel assignment to yield high-quality pseudo labels. Extensive validation on\nprominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M)\nattests to the efficacy of our approach, demonstrating state-of-the-art\nresults. Remarkably, FedSTO, using just 20-30% of labels, performs nearly as\nwell as fully-supervised centralized training methods.\n","authors":["Taehyeon Kim","Eric Lin","Junu Lee","Christian Lau","Vaikkunth Mugunthan"],"pdf_url":"https://arxiv.org/pdf/2310.17097v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.06118v5","updated":"2023-10-26T01:16:13Z","published":"2023-09-12T10:33:19Z","title":"CHITNet: A Complementary to Harmonious Information Transfer Network for\n Infrared and Visible Image Fusion","summary":" Current infrared and visible image fusion (IVIF) methods go to great lengths\nto excavate complementary features and design complex fusion strategies, which\nis extremely challenging. To this end, we rethink the IVIF outside the box,\nproposing a complementary to harmonious information transfer network (CHITNet).\nIt reasonably transfers complementary information into harmonious one, which\nintegrates both the shared and complementary features from two modalities.\nSpecifically, to skillfully sidestep aggregating complementary information in\nIVIF, we design a mutual information transfer (MIT) module to mutually\nrepresent features from two modalities, roughly transferring complementary\ninformation into harmonious one. Then, a harmonious information acquisition\nsupervised by source image (HIASSI) module is devised to further ensure the\ncomplementary to harmonious information transfer after MIT. 
Meanwhile, we also\npropose a structure information preservation (SIP) module to guarantee that the\nedge structure information of the source images can be transferred to the\nfusion results. Moreover, a mutual promotion training paradigm (MPTP) with\ninteraction loss is adopted to facilitate better collaboration among MIT,\nHIASSI and SIP. In this way, the proposed method is able to generate fused\nimages with higher qualities. Extensive experimental results demonstrate the\nsuperiority of our CHITNet over state-of-the-art algorithms in terms of visual\nquality and quantitative evaluations.\n","authors":["Yafei Zhang","Keying Du","Huafeng Li","Zhengtao Yu","Yu Liu"],"pdf_url":"https://arxiv.org/pdf/2309.06118v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04249v3","updated":"2023-10-26T01:14:34Z","published":"2023-06-07T08:40:25Z","title":"DEMIST: A deep-learning-based task-specific denoising approach for\n myocardial perfusion SPECT","summary":" There is an important need for methods to process myocardial perfusion\nimaging (MPI) SPECT images acquired at lower radiation dose and/or acquisition\ntime such that the processed images improve observer performance on the\nclinical task of detecting perfusion defects. To address this need, we build\nupon concepts from model-observer theory and our understanding of the human\nvisual system to propose a Detection task-specific deep-learning-based approach\nfor denoising MPI SPECT images (DEMIST). The approach, while performing\ndenoising, is designed to preserve features that influence observer performance\non detection tasks. We objectively evaluated DEMIST on the task of detecting\nperfusion defects using a retrospective study with anonymized clinical data in\npatients who underwent MPI studies across two scanners (N = 338). The\nevaluation was performed at low-dose levels of 6.25%, 12.5% and 25% and using\nan anthropomorphic channelized Hotelling observer. 
Performance was quantified\nusing area under the receiver operating characteristics curve (AUC). Images\ndenoised with DEMIST yielded significantly higher AUC compared to corresponding\nlow-dose images and images denoised with a commonly used task-agnostic DL-based\ndenoising method. Similar results were observed with stratified analysis based\non patient sex and defect type. Additionally, DEMIST improved visual fidelity\nof the low-dose images as quantified using root mean squared error and\nstructural similarity index metric. A mathematical analysis revealed that\nDEMIST preserved features that assist in detection tasks while improving the\nnoise properties, resulting in improved observer performance. The results\nprovide strong evidence for further clinical evaluation of DEMIST to denoise\nlow-count images in MPI SPECT.\n","authors":["Md Ashequr Rahman","Zitong Yu","Richard Laforest","Craig K. Abbey","Barry A. Siegel","Abhinav K. Jha"],"pdf_url":"https://arxiv.org/pdf/2306.04249v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17080v1","updated":"2023-10-26T00:45:19Z","published":"2023-10-26T00:45:19Z","title":"Automating lichen monitoring in ecological studies using instance\n segmentation of time-lapse images","summary":" Lichens are symbiotic organisms composed of fungi, algae, and/or\ncyanobacteria that thrive in a variety of environments. They play important\nroles in carbon and nitrogen cycling, and contribute directly and indirectly to\nbiodiversity. Ecologists typically monitor lichens by using them as indicators\nto assess air quality and habitat conditions. In particular, epiphytic lichens,\nwhich live on trees, are key markers of air quality and environmental health. A\nnew method of monitoring epiphytic lichens involves using time-lapse cameras to\ngather images of lichen populations. 
These cameras are used by ecologists in\nNewfoundland and Labrador to subsequently analyze and manually segment the\nimages to determine lichen thalli condition and change. These methods are\ntime-consuming and susceptible to observer bias. In this work, we aim to\nautomate the monitoring of lichens over extended periods and to estimate their\nbiomass and condition to facilitate the task of ecologists. To accomplish this,\nour proposed framework uses semantic segmentation with an effective training\napproach to automate monitoring and biomass estimation of epiphytic lichens on\ntime-lapse images. We show that our method has the potential to significantly\nimprove the accuracy and efficiency of lichen population monitoring, making it\na valuable tool for forest ecologists and environmental scientists to evaluate\nthe impact of climate change on Canada's forests. To the best of our knowledge,\nthis is the first time that such an approach has been used to assist ecologists\nin monitoring and analyzing epiphytic lichens.\n","authors":["Safwen Naimi","Olfa Koubaa","Wassim Bouachir","Guillaume-Alexandre Bilodeau","Gregory Jeddore","Patricia Baines","David Correia","Andre Arsenault"],"pdf_url":"https://arxiv.org/pdf/2310.17080v1.pdf","comment":"6 pages, 3 Figures, 8 Tables, Accepted for publication in IEEE\n International Conference on Machine Learning and Applications (ICMLA),\n copyright IEEE"},{"id":"http://arxiv.org/abs/2310.17078v1","updated":"2023-10-26T00:43:15Z","published":"2023-10-26T00:43:15Z","title":"HCT: Hybrid Convnet-Transformer for Parkinson's disease detection and\n severity prediction from gait","summary":" In this paper, we propose a novel deep learning method based on a new Hybrid\nConvNet-Transformer architecture to detect and stage Parkinson's disease (PD)\nfrom gait data. We adopt a two-step approach by dividing the problem into two\nsub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy\nversus parkinsonian patients. 
If the patient is parkinsonian, a multi-class\nHybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to\nassess the PD severity stage. Our hybrid architecture exploits the strengths of\nboth Convolutional Neural Networks (ConvNets) and Transformers to accurately\ndetect PD and determine the severity stage. In particular, we take advantage of\nConvNets to capture local patterns and correlations in the data, while we\nexploit Transformers for handling long-term dependencies in the input signal.\nWe show that our hybrid method achieves superior performance when compared to\nother state-of-the-art methods, with a PD detection accuracy of 97% and a\nseverity staging accuracy of 87%. Our source code is available at:\nhttps://github.com/SafwenNaimi\n","authors":["Safwen Naimi","Wassim Bouachir","Guillaume-Alexandre Bilodeau"],"pdf_url":"https://arxiv.org/pdf/2310.17078v1.pdf","comment":"6 pages, 6 figures, 3 tables, Accepted for publication in IEEE\n International Conference on Machine Learning and Applications (ICMLA),\n copyright IEEE"},{"id":"http://arxiv.org/abs/2310.17075v1","updated":"2023-10-26T00:36:03Z","published":"2023-10-26T00:36:03Z","title":"HyperFields: Towards Zero-Shot Generation of NeRFs from Text","summary":" We introduce HyperFields, a method for generating text-conditioned Neural\nRadiance Fields (NeRFs) with a single forward pass and (optionally) some\nfine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns\na smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF\ndistillation training, which distills scenes encoded in individual NeRFs into\none dynamic hypernetwork. These techniques enable a single network to fit over\na hundred unique scenes. We further demonstrate that HyperFields learns a more\ngeneral map between text and NeRFs, and consequently is capable of predicting\nnovel in-distribution and out-of-distribution scenes -- either zero-shot or\nwith a few finetuning steps. 
Finetuning HyperFields benefits from accelerated\nconvergence thanks to the learned general map, and is capable of synthesizing\nnovel scenes 5 to 10 times faster than existing neural optimization-based\nmethods. Our ablation experiments show that both the dynamic architecture and\nNeRF distillation are critical to the expressivity of HyperFields.\n","authors":["Sudarshan Babu","Richard Liu","Avery Zhou","Michael Maire","Greg Shakhnarovich","Rana Hanocka"],"pdf_url":"https://arxiv.org/pdf/2310.17075v1.pdf","comment":"Project page: https://threedle.github.io/hyperfields/"},{"id":"http://arxiv.org/abs/1908.10907v3","updated":"2023-10-26T09:06:52Z","published":"2019-08-28T19:07:40Z","title":"DFPENet-geology: A Deep Learning Framework for High Precision\n Recognition and Segmentation of Co-seismic Landslides","summary":" Automatic recognition and segmentation methods now become the essential\nrequirement in identifying co-seismic landslides, which are fundamental for\ndisaster assessment and mitigation in large-scale earthquakes. This approach\nused to be carried out through pixel-based or object-oriented methods. However,\ndue to the massive amount of remote sensing data, variations in different\nearthquake scenarios, and the efficiency requirement for post-earthquake\nrescue, these methods are difficult to develop into an accurate, rapid,\ncomprehensive, and general (cross-scene) solution for co-seismic landslide\nrecognition. This paper develops a robust model, Dense Feature Pyramid with\nEncoder-decoder Network (DFPENet), to understand and fuse the multi-scale\nfeatures of objects in remote sensing images. The proposed method achieves a\ncompetitive segmentation accuracy on the public ISPRS 2D Semantic. 
Furthermore,\na comprehensive and widely-used scheme is proposed for co-seismic landslide\nrecognition, which integrates image features extracted from the DFPENet model,\ngeologic features, temporal resolution, landslide spatial analysis, and\ntransfer learning, while only RGB images are used. To corroborate its\nfeasibility and applicability, the proposed scheme is applied to two\nearthquake-triggered landslides in Jiuzhaigou (China) and Hokkaido (Japan),\nusing available pre- and post-earthquake remote sensing images.\n","authors":["Qingsong Xu","Chaojun Ouyang","Tianhai Jiang","Xuanmei Fan","Duoxiang Cheng"],"pdf_url":"https://arxiv.org/pdf/1908.10907v3.pdf","comment":"35 pages, 11 figures"},{"id":"http://arxiv.org/abs/2305.12672v2","updated":"2023-10-26T23:37:38Z","published":"2023-05-22T03:27:30Z","title":"Block Coordinate Plug-and-Play Methods for Blind Inverse Problems","summary":" Plug-and-play (PnP) prior is a well-known class of methods for solving\nimaging inverse problems by computing fixed-points of operators combining\nphysical measurement models and learned image denoisers. While PnP methods have\nbeen extensively used for image recovery with known measurement operators,\nthere is little work on PnP for solving blind inverse problems. We address this\ngap by presenting a new block-coordinate PnP (BC-PnP) method that efficiently\nsolves this joint estimation problem by introducing learned denoisers as priors\non both the unknown image and the unknown measurement operator. We present a\nnew convergence theory for BC-PnP compatible with blind inverse problems by\nconsidering nonconvex data-fidelity terms and expansive denoisers. Our theory\nanalyzes the convergence of BC-PnP to a stationary point of an implicit\nfunction associated with an approximate minimum mean-squared error (MMSE)\ndenoiser. 
We numerically validate our method on two blind inverse problems:\nautomatic coil sensitivity estimation in magnetic resonance imaging (MRI) and\nblind image deblurring. Our results show that BC-PnP provides an efficient and\nprincipled framework for using denoisers as PnP priors for jointly estimating\nmeasurement operators and images.\n","authors":["Weijie Gan","Shirin Shoushtari","Yuyang Hu","Jiaming Liu","Hongyu An","Ulugbek S. Kamilov"],"pdf_url":"https://arxiv.org/pdf/2305.12672v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14500v3","updated":"2023-10-26T22:58:04Z","published":"2023-09-25T19:50:47Z","title":"Assessment of a new GeoAI foundation model for flood inundation mapping","summary":" Vision foundation models are a new frontier in Geospatial Artificial\nIntelligence (GeoAI), an interdisciplinary research area that applies and\nextends AI for geospatial problem solving and geographic knowledge discovery,\nbecause of their potential to enable powerful image analysis by learning and\nextracting important image features from vast amounts of geospatial data. This\npaper evaluates the performance of the first-of-its-kind geospatial foundation\nmodel, IBM-NASA's Prithvi, to support a crucial geospatial analysis task: flood\ninundation mapping. This model is compared with convolutional neural network\nand vision transformer-based architectures in terms of mapping accuracy for\nflooded areas. A benchmark dataset, Sen1Floods11, is used in the experiments,\nand the models' predictability, generalizability, and transferability are\nevaluated based on both a test dataset and a dataset that is completely unseen\nby the model. Results show the good transferability of the Prithvi model,\nhighlighting its performance advantages in segmenting flooded areas in\npreviously unseen regions. 
The findings also indicate areas for improvement for\nthe Prithvi model in terms of adopting multi-scale representation learning,\ndeveloping more end-to-end pipelines for high-level image analysis tasks, and\noffering more flexibility in terms of input data bands.\n","authors":["Wenwen Li","Hyunho Lee","Sizhe Wang","Chia-Yu Hsu","Samantha T. Arundel"],"pdf_url":"https://arxiv.org/pdf/2309.14500v3.pdf","comment":"8 pages, 4 figures, Accepted for the 6th ACM SIGSPATIAL International\n Workshop on AI for Geographic Knowledge Discovery"},{"id":"http://arxiv.org/abs/2306.15162v2","updated":"2023-10-26T22:57:49Z","published":"2023-06-27T02:44:07Z","title":"YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English\n Parallel Corpus","summary":" Machine learning for sign languages is bottlenecked by data. In this paper,\nwe present YouTube-ASL, a large-scale, open-domain corpus of American Sign\nLanguage (ASL) videos and accompanying English captions drawn from YouTube.\nWith ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as\nlarge and has ~10x as many unique signers as the largest prior ASL dataset. We\ntrain baseline models for ASL to English translation on YouTube-ASL and\nevaluate them on How2Sign, where we achieve a new finetuned state of the art of\n12.39 BLEU and, for the first time, report zero-shot results.\n","authors":["David Uthus","Garrett Tanzer","Manfred Georg"],"pdf_url":"https://arxiv.org/pdf/2306.15162v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06706v3","updated":"2023-10-26T22:19:56Z","published":"2023-04-13T17:55:12Z","title":"Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields","summary":" Neural Radiance Field training can be accelerated through the use of\ngrid-based representations in NeRF's learned mapping from spatial coordinates\nto colors and volumetric density. 
However, these grid-based approaches lack an\nexplicit understanding of scale and therefore often introduce aliasing, usually\nin the form of jaggies or missing scene content. Anti-aliasing has previously\nbeen addressed by mip-NeRF 360, which reasons about sub-volumes along a cone\nrather than points along a ray, but this approach is not natively compatible\nwith current grid-based techniques. We show how ideas from rendering and signal\nprocessing can be used to construct a technique that combines mip-NeRF 360 and\ngrid-based models such as Instant NGP to yield error rates that are 8% - 77%\nlower than either prior technique, and that trains 24x faster than mip-NeRF\n360.\n","authors":["Jonathan T. Barron","Ben Mildenhall","Dor Verbin","Pratul P. Srinivasan","Peter Hedman"],"pdf_url":"https://arxiv.org/pdf/2304.06706v3.pdf","comment":"Project page: https://jonbarron.info/zipnerf/"},{"id":"http://arxiv.org/abs/2303.06705v3","updated":"2023-10-26T22:19:35Z","published":"2023-03-12T16:54:08Z","title":"Retinexformer: One-stage Retinex-based Transformer for Low-light Image\n Enhancement","summary":" When enhancing low-light images, many deep learning algorithms are based on\nthe Retinex theory. However, the Retinex model does not consider the\ncorruptions hidden in the dark or introduced by the light-up process. Besides,\nthese methods usually require a tedious multi-stage training pipeline and rely\non convolutional neural networks, showing limitations in capturing long-range\ndependencies. In this paper, we formulate a simple yet principled One-stage\nRetinex-based Framework (ORF). ORF first estimates the illumination information\nto light up the low-light image and then restores the corruption to produce the\nenhanced image. We design an Illumination-Guided Transformer (IGT) that\nutilizes illumination representations to direct the modeling of non-local\ninteractions of regions with different lighting conditions. 
By plugging IGT\ninto ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative\nand qualitative experiments demonstrate that our Retinexformer significantly\noutperforms state-of-the-art methods on thirteen benchmarks. The user study and\napplication on low-light object detection also reveal the latent practical\nvalues of our method. Code, models, and results are available at\nhttps://github.com/caiyuanhao1998/Retinexformer\n","authors":["Yuanhao Cai","Hao Bian","Jing Lin","Haoqian Wang","Radu Timofte","Yulun Zhang"],"pdf_url":"https://arxiv.org/pdf/2303.06705v3.pdf","comment":"ICCV 2023; The first Transformer-based method for low-light image\n enhancement"},{"id":"http://arxiv.org/abs/2310.17801v1","updated":"2023-10-26T22:17:37Z","published":"2023-10-26T22:17:37Z","title":"Image Prior and Posterior Conditional Probability Representation for\n Efficient Damage Assessment","summary":" It is important to quantify Damage Assessment (DA) for Human Assistance and\nDisaster Response (HADR) applications. In this paper, to achieve efficient and\nscalable DA in HADR, an image prior and posterior conditional probability\n(IP2CP) is developed as an effective computational imaging representation.\nEquipped with the IP2CP representation, the matching pre- and post-disaster\nimages are effectively encoded into one image that is then processed using deep\nlearning approaches to determine the damage levels. Two scenarios of crucial\nimportance for the practical use of DA in HADR applications are examined:\npixel-wise semantic segmentation and patch-based contrastive learning-based\nglobal damage classification. 
Results achieved by IP2CP in both scenarios\ndemonstrate promising performances, showing that our IP2CP-based methods within\nthe deep learning framework can effectively achieve data and computational\nefficiency, which is of utmost importance for the DA in HADR applications.\n","authors":["Jie Wei","Weicong Feng","Erik Blasch","Erika Ardiles-Cruz","Haibin Ling"],"pdf_url":"https://arxiv.org/pdf/2310.17801v1.pdf","comment":"6 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.17796v1","updated":"2023-10-26T21:57:21Z","published":"2023-10-26T21:57:21Z","title":"ControlLLM: Augment Language Models with Tools by Searching on Graphs","summary":" We present ControlLLM, a novel framework that enables large language models\n(LLMs) to utilize multi-modal tools for solving complex real-world tasks.\nDespite the remarkable performance of LLMs, they still struggle with tool\ninvocation due to ambiguous user prompts, inaccurate tool selection and\nparameterization, and inefficient tool scheduling. To overcome these\nchallenges, our framework comprises three key components: (1) a \\textit{task\ndecomposer} that breaks down a complex task into clear subtasks with\nwell-defined inputs and outputs; (2) a \\textit{Thoughts-on-Graph (ToG)\nparadigm} that searches the optimal solution path on a pre-built tool graph,\nwhich specifies the parameter and dependency relations among different tools;\nand (3) an \\textit{execution engine with a rich toolbox} that interprets the\nsolution path and runs the tools efficiently on different computational\ndevices. 
We evaluate our framework on diverse tasks involving image, audio, and\nvideo processing, demonstrating its superior accuracy, efficiency, and\nversatility compared to existing methods.\n","authors":["Zhaoyang Liu","Zeqiang Lai","Zhangwei Gao","Erfei Cui","Xizhou Zhu","Lewei Lu","Qifeng Chen","Yu Qiao","Jifeng Dai","Wenhai Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17796v1.pdf","comment":"22 pages, 9 figures, 10 tables"},{"id":"http://arxiv.org/abs/2309.10900v2","updated":"2023-10-26T21:38:08Z","published":"2023-09-19T19:49:03Z","title":"Incremental Multimodal Surface Mapping via Self-Organizing Gaussian\n Mixture Models","summary":" This letter describes an incremental multimodal surface mapping methodology,\nwhich represents the environment as a continuous probabilistic model. This\nmodel enables high-resolution reconstruction while simultaneously compressing\nspatial and intensity point cloud data. The strategy employed in this work\nutilizes Gaussian mixture models (GMMs) to represent the environment. While\nprior GMM-based mapping works have developed methodologies to determine the\nnumber of mixture components using information-theoretic techniques, these\napproaches either operate on individual sensor observations, making them\nunsuitable for incremental mapping, or are not real-time viable, especially for\napplications where high-fidelity modeling is required. To bridge this gap, this\nletter introduces a spatial hash map for rapid GMM submap extraction combined\nwith an approach to determine relevant and redundant data in a point cloud.\nThese contributions increase computational speed by an order of magnitude\ncompared to state-of-the-art incremental GMM-based mapping. In addition, the\nproposed approach yields a superior tradeoff in map accuracy and size when\ncompared to state-of-the-art mapping methodologies (both GMM- and not\nGMM-based). 
Evaluations are conducted using both simulated and real-world data.\nThe software is released open-source to benefit the robotics community.\n","authors":["Kshitij Goel","Wennie Tabib"],"pdf_url":"https://arxiv.org/pdf/2309.10900v2.pdf","comment":"8 pages, 7 figures, published in IEEE Robotics and Automation Letters"},{"id":"http://arxiv.org/abs/2303.02260v2","updated":"2023-10-26T21:24:47Z","published":"2023-03-03T23:19:42Z","title":"Learning to reason over visual objects","summary":" A core component of human intelligence is the ability to identify abstract\npatterns inherent in complex, high-dimensional perceptual data, as exemplified\nby visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated\nby the goal of designing AI systems with this capacity, recent work has focused\non evaluating whether neural networks can learn to solve RPM-like problems.\nPrevious work has generally found that strong performance on these problems\nrequires the incorporation of inductive biases that are specific to the RPM\nproblem format, raising the question of whether such models might be more\nbroadly useful. Here, we investigated the extent to which a general-purpose\nmechanism for processing visual scenes in terms of objects might help promote\nabstract visual reasoning. We found that a simple model, consisting only of an\nobject-centric encoder and a transformer reasoning module, achieved\nstate-of-the-art results on both of two challenging RPM-like benchmarks (PGM\nand I-RAVEN), as well as a novel benchmark with greater visual complexity\n(CLEVR-Matrices). These results suggest that an inductive bias for\nobject-centric processing may be a key component of abstract visual reasoning,\nobviating the need for problem-specific inductive biases.\n","authors":["Shanka Subhra Mondal","Taylor Webb","Jonathan D. 
Cohen"],"pdf_url":"https://arxiv.org/pdf/2303.02260v2.pdf","comment":"ICLR 2023"},{"id":"http://arxiv.org/abs/2310.17780v1","updated":"2023-10-26T21:09:47Z","published":"2023-10-26T21:09:47Z","title":"AutoCT: Automated CT registration, segmentation, and quantification","summary":" The processing and analysis of computed tomography (CT) imaging is important\nfor both basic scientific development and clinical applications. In AutoCT, we\nprovide a comprehensive pipeline that integrates an end-to-end automatic\npreprocessing, registration, segmentation, and quantitative analysis of 3D CT\nscans. The engineered pipeline enables atlas-based CT segmentation and\nquantification leveraging diffeomorphic transformations through efficient\nforward and inverse mappings. The extracted localized features from the\ndeformation field allow for downstream statistical learning that may facilitate\nmedical diagnostics. On a lightweight and portable software platform, AutoCT\nprovides a new toolkit for the CT imaging community to underpin the deployment\nof artificial intelligence-driven applications.\n","authors":["Zhe Bai","Abdelilah Essiari","Talita Perciano","Kristofer E. Bouchard"],"pdf_url":"https://arxiv.org/pdf/2310.17780v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17773v1","updated":"2023-10-26T20:51:24Z","published":"2023-10-26T20:51:24Z","title":"Graph Convolutional Networks for Complex Traffic Scenario Classification","summary":" A scenario-based testing approach can reduce the time required to obtain\nstatistically significant evidence of the safety of Automated Driving Systems\n(ADS). Identifying these scenarios in an automated manner is a challenging\ntask. Most methods on scenario classification do not work for complex scenarios\nwith diverse environments (highways, urban) and interaction with other traffic\nagents. 
This is mirrored in their approaches which model an individual vehicle\nin relation to its environment, but neglect the interaction between multiple\nvehicles (e.g. cut-ins, stationary lead vehicle). Furthermore, existing\ndatasets lack diversity and do not have per-frame annotations to accurately\nlearn the start and end time of a scenario. We propose a method for complex\ntraffic scenario classification that is able to model the interaction of a\nvehicle with the environment, as well as other agents. We use Graph\nConvolutional Networks to model spatial and temporal aspects of these\nscenarios. Expanding the nuScenes and Argoverse 2 driving datasets, we\nintroduce a scenario-labeled dataset, which covers different driving\nenvironments and is annotated per frame. Training our method on this dataset,\nwe present a promising baseline for future research on per-frame complex\nscenario classification.\n","authors":["Tobias Hoek","Holger Caesar","Andreas Falkovén","Tommy Johansson"],"pdf_url":"https://arxiv.org/pdf/2310.17773v1.pdf","comment":"Netherlands Conference on Computer Vision (NCCV) 2023 camera-ready +\n supplementary material"},{"id":"http://arxiv.org/abs/2310.17770v1","updated":"2023-10-26T20:27:16Z","published":"2023-10-26T20:27:16Z","title":"GROOViST: A Metric for Grounding Objects in Visual Storytelling","summary":" A proper evaluation of stories generated for a sequence of images -- the task\ncommonly referred to as visual storytelling -- must consider multiple aspects,\nsuch as coherence, grammatical correctness, and visual grounding. In this work,\nwe focus on evaluating the degree of grounding, that is, the extent to which a\nstory is about the entities shown in the images. We analyze current metrics,\nboth designed for this purpose and for general vision-text alignment. 
Given\ntheir observed shortcomings, we propose a novel evaluation tool, GROOViST, that\naccounts for cross-modal dependencies, temporal misalignments (the fact that\nthe order in which entities appear in the story and the image sequence may not\nmatch), and human intuitions on visual grounding. An additional advantage of\nGROOViST is its modular design, where the contribution of each component can be\nassessed and interpreted individually.\n","authors":["Aditya K Surikuchi","Sandro Pezzelle","Raquel Fernández"],"pdf_url":"https://arxiv.org/pdf/2310.17770v1.pdf","comment":"In EMNLP 2023 main conference proceedings (to appear)"},{"id":"http://arxiv.org/abs/2310.17768v1","updated":"2023-10-26T20:26:50Z","published":"2023-10-26T20:26:50Z","title":"A Dataset of Relighted 3D Interacting Hands","summary":" The two-hand interaction is one of the most challenging signals to analyze\ndue to the self-similarity, complicated articulations, and occlusions of hands.\nAlthough several datasets have been proposed for the two-hand interaction\nanalysis, all of them do not achieve 1) diverse and realistic image appearances\nand 2) diverse and large-scale groundtruth (GT) 3D poses at the same time. In\nthis work, we propose Re:InterHand, a dataset of relighted 3D interacting hands\nthat achieve the two goals. To this end, we employ a state-of-the-art hand\nrelighting network with our accurately tracked two-hand 3D poses. We compare\nour Re:InterHand with existing 3D interacting hands datasets and show the\nbenefit of it. 
Our Re:InterHand is available in\nhttps://mks0601.github.io/ReInterHand/.\n","authors":["Gyeongsik Moon","Shunsuke Saito","Weipeng Xu","Rohan Joshi","Julia Buffalini","Harley Bellan","Nicholas Rosen","Jesse Richardson","Mallorie Mize","Philippe de Bree","Tomas Simon","Bo Peng","Shubham Garg","Kevyn McPhail","Takaaki Shiratori"],"pdf_url":"https://arxiv.org/pdf/2310.17768v1.pdf","comment":"Accepted by NeurIPS 2023 (Datasets and Benchmarks Track)"},{"id":"http://arxiv.org/abs/2310.17764v1","updated":"2023-10-26T20:13:44Z","published":"2023-10-26T20:13:44Z","title":"SynergyNet: Bridging the Gap between Discrete and Continuous\n Representations for Precise Medical Image Segmentation","summary":" In recent years, continuous latent space (CLS) and discrete latent space\n(DLS) deep learning models have been proposed for medical image analysis for\nimproved performance. However, these models encounter distinct challenges. CLS\nmodels capture intricate details but often lack interpretability in terms of\nstructural representation and robustness due to their emphasis on low-level\nfeatures. Conversely, DLS models offer interpretability, robustness, and the\nability to capture coarse-grained information thanks to their structured latent\nspace. However, DLS models have limited efficacy in capturing fine-grained\ndetails. To address the limitations of both DLS and CLS models, we propose\nSynergyNet, a novel bottleneck architecture designed to enhance existing\nencoder-decoder segmentation frameworks. SynergyNet seamlessly integrates\ndiscrete and continuous representations to harness complementary information\nand successfully preserves both fine and coarse-grained details in the learned\nrepresentations. Our extensive experiment on multi-organ segmentation and\ncardiac datasets demonstrates that SynergyNet outperforms other state of the\nart methods, including TransUNet: dice scores improving by 2.16%, and Hausdorff\nscores improving by 11.13%, respectively. 
When evaluating skin lesion and brain\ntumor segmentation datasets, we observe a remarkable improvement of 1.71% in\nIntersection-over Union scores for skin lesion segmentation and of 8.58% for\nbrain tumor segmentation. Our innovative approach paves the way for enhancing\nthe overall performance and capabilities of deep learning models in the\ncritical domain of medical image analysis.\n","authors":["Vandan Gorade","Sparsh Mittal","Debesh Jha","Ulas Bagci"],"pdf_url":"https://arxiv.org/pdf/2310.17764v1.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2310.16832v2","updated":"2023-10-26T20:02:03Z","published":"2023-10-25T17:59:05Z","title":"LightSpeed: Light and Fast Neural Light Fields on Mobile Devices","summary":" Real-time novel-view image synthesis on mobile devices is prohibitive due to\nthe limited computational power and storage. Using volumetric rendering\nmethods, such as NeRF and its derivatives, on mobile devices is not suitable\ndue to the high computational cost of volumetric rendering. On the other hand,\nrecent advances in neural light field representations have shown promising\nreal-time view synthesis results on mobile devices. Neural light field methods\nlearn a direct mapping from a ray representation to the pixel color. The\ncurrent choice of ray representation is either stratified ray sampling or\nPlucker coordinates, overlooking the classic light slab (two-plane)\nrepresentation, the preferred representation to interpolate between light field\nviews. In this work, we find that using the light slab representation is an\nefficient representation for learning a neural light field. More importantly,\nit is a lower-dimensional ray representation enabling us to learn the 4D ray\nspace using feature grids which are significantly faster to train and render.\nAlthough mostly designed for frontal views, we show that the light-slab\nrepresentation can be further extended to non-frontal scenes using a\ndivide-and-conquer strategy. 
Our method offers superior rendering quality\ncompared to previous light field methods and achieves a significantly improved\ntrade-off between rendering quality and speed.\n","authors":["Aarush Gupta","Junli Cao","Chaoyang Wang","Ju Hu","Sergey Tulyakov","Jian Ren","László A Jeni"],"pdf_url":"https://arxiv.org/pdf/2310.16832v2.pdf","comment":"Project Page: http://lightspeed-r2l.github.io/ . Add camera ready\n version"},{"id":"http://arxiv.org/abs/2310.17755v1","updated":"2023-10-26T19:48:08Z","published":"2023-10-26T19:48:08Z","title":"Alzheimers Disease Diagnosis by Deep Learning Using MRI-Based Approaches","summary":" The most frequent kind of dementia of the nervous system, Alzheimer's\ndisease, weakens several brain processes (such as memory) and eventually\nresults in death. The clinical study uses magnetic resonance imaging to\ndiagnose AD. Deep learning algorithms are capable of pattern recognition and\nfeature extraction from the inputted raw data. As early diagnosis and stage\ndetection are the most crucial elements in enhancing patient care and treatment\noutcomes, deep learning algorithms for MRI images have recently allowed for\ndiagnosing a medical condition at the beginning stage and identifying\nparticular symptoms of Alzheimer's disease. As a result, we aimed to analyze\nfive specific studies focused on AD diagnosis using MRI-based deep learning\nalgorithms between 2021 and 2023 in this study. 
To completely illustrate the\ndifferences between these techniques and comprehend how deep learning\nalgorithms function, we attempted to explore selected approaches in depth.\n","authors":["Sarasadat Foroughipoor","Kimia Moradi","Hamidreza Bolhasani"],"pdf_url":"https://arxiv.org/pdf/2310.17755v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19181v2","updated":"2023-10-26T19:03:36Z","published":"2023-05-30T16:25:16Z","title":"Table Detection for Visually Rich Document Images","summary":" Table Detection (TD) is a fundamental task to enable visually rich document\nunderstanding, which requires the model to extract information without\ninformation loss. However, popular Intersection over Union (IoU) based\nevaluation metrics and IoU-based loss functions for the detection models cannot\ndirectly represent the degree of information loss for the prediction results.\nTherefore, we propose to decouple IoU into a ground truth coverage term and a\nprediction coverage term, in which the former can be used to measure the\ninformation loss of the prediction results. Besides, considering the sparse\ndistribution of tables in document images, we use SparseR-CNN as the base model\nand further improve the model by using Gaussian Noise Augmented Image Size\nregion proposals and many-to-one label assignments. 
Results under comprehensive\nexperiments show that the proposed method can consistently outperform\nstate-of-the-art methods with different IoU-based metrics under various\ndatasets and demonstrate that the proposed decoupled IoU loss can enable the\nmodel to alleviate information loss.\n","authors":["Bin Xiao","Murat Simsek","Burak Kantarci","Ala Abu Alkheir"],"pdf_url":"https://arxiv.org/pdf/2305.19181v2.pdf","comment":"Accepted by Knowledge-Based Systems"},{"id":"http://arxiv.org/abs/2305.19478v4","updated":"2023-10-26T18:43:33Z","published":"2023-05-31T01:12:32Z","title":"Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment\n Alignment","summary":" This paper presents an unsupervised transformer-based framework for temporal\nactivity segmentation which leverages not only frame-level cues but also\nsegment-level cues. This is in contrast with previous methods which often rely\non frame-level information only. Our approach begins with a frame-level\nprediction module which estimates framewise action classes via a transformer\nencoder. The frame-level prediction module is trained in an unsupervised manner\nvia temporal optimal transport. To exploit segment-level information, we\nutilize a segment-level prediction module and a frame-to-segment alignment\nmodule. The former includes a transformer decoder for estimating video\ntranscripts, while the latter matches frame-level features with segment-level\nfeatures, yielding permutation-aware segmentation results. Moreover, inspired\nby temporal optimal transport, we introduce simple-yet-effective pseudo labels\nfor unsupervised training of the above modules. 
Our experiments on four public\ndatasets, i.e., 50 Salads, YouTube Instructions, Breakfast, and Desktop\nAssembly show that our approach achieves comparable or better performance than\nprevious methods in unsupervised activity segmentation.\n","authors":["Quoc-Huy Tran","Ahmed Mehmood","Muhammad Ahmed","Muhammad Naufil","Anas Zafar","Andrey Konin","M. Zeeshan Zia"],"pdf_url":"https://arxiv.org/pdf/2305.19478v4.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.17729v1","updated":"2023-10-26T18:40:28Z","published":"2023-10-26T18:40:28Z","title":"Improving Traffic Density Forecasting in Intelligent Transportation\n Systems Using Gated Graph Neural Networks","summary":" This study delves into the application of graph neural networks in the realm\nof traffic forecasting, a crucial facet of intelligent transportation systems.\nAccurate traffic predictions are vital for functions like trip planning,\ntraffic control, and vehicle routing in such systems. Three prominent GNN\narchitectures, Graph Convolutional Networks (GCNs), GraphSAGE (Graph Sample and\nAggregation), and Gated Graph Neural Networks (GGNNs), are explored within the\ncontext of traffic prediction. Each architecture's methodology is thoroughly\nexamined, including layer configurations, activation functions, and\nhyperparameters. The primary\ngoal is to minimize prediction errors, with GGNNs emerging as the most\neffective choice among the three models. The research outlines outcomes for\neach architecture, elucidating their predictive performance through root mean\nsquared error (RMSE) and mean absolute error (MAE). Hypothetical results reveal\nintriguing insights: GCNs display an RMSE of 9.10 and an MAE of 8.00, while\nGraphSAGE shows improvement with an RMSE of 8.3 and an MAE of 7.5. 
Gated Graph\nNeural Networks (GGNNs) exhibit the lowest RMSE at 9.15 and an impressive MAE\nof 7.1, positioning them as the frontrunner.\n","authors":["Razib Hayat Khan","Jonayet Miah","S M Yasir Arafat","M M Mahbubul Syeed","Duc M Ca"],"pdf_url":"https://arxiv.org/pdf/2310.17729v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17720v1","updated":"2023-10-26T18:27:20Z","published":"2023-10-26T18:27:20Z","title":"Advancing Brain Tumor Detection: A Thorough Investigation of CNNs,\n Clustering, and SoftMax Classification in the Analysis of MRI Images","summary":" Brain tumors pose a significant global health challenge due to their high\nprevalence and mortality rates across all age groups. Detecting brain tumors at\nan early stage is crucial for effective treatment and patient outcomes. This\nstudy presents a comprehensive investigation into the use of Convolutional\nNeural Networks (CNNs) for brain tumor detection using Magnetic Resonance\nImaging (MRI) images. The dataset, consisting of MRI scans from both healthy\nindividuals and patients with brain tumors, was processed and fed into the CNN\narchitecture. The SoftMax Fully Connected layer was employed to classify the\nimages, achieving an accuracy of 98%. To evaluate the CNN's performance, two\nother classifiers, Radial Basis Function (RBF) and Decision Tree (DT), were\nutilized, yielding accuracy rates of 98.24% and 95.64%, respectively. The study\nalso introduced a clustering method for feature extraction, improving CNN's\naccuracy. Sensitivity, Specificity, and Precision were employed alongside\naccuracy to comprehensively evaluate the network's performance. Notably, the\nSoftMax classifier demonstrated the highest accuracy among the categorizers,\nachieving 99.52% accuracy on test data. The presented research contributes to\nthe growing field of deep learning in medical image analysis. 
The combination\nof CNNs and MRI data offers a promising tool for accurately detecting brain\ntumors, with potential implications for early diagnosis and improved patient\ncare.\n","authors":["Jonayet Miah","Duc M Cao","Md Abu Sayed3","Md Siam Taluckder","Md Sabbirul Haque","Fuad Mahmud"],"pdf_url":"https://arxiv.org/pdf/2310.17720v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.06822v2","updated":"2023-10-26T18:27:01Z","published":"2023-10-10T17:50:09Z","title":"Neural Bounding","summary":" Bounding volumes are an established concept in computer graphics and vision\ntasks but have seen little change since their early inception. In this work, we\nstudy the use of neural networks as bounding volumes. Our key observation is\nthat bounding, which so far has primarily been considered a problem of\ncomputational geometry, can be redefined as a problem of learning to classify\nspace into free or occupied. This learning-based approach is particularly\nadvantageous in high-dimensional spaces, such as animated scenes with complex\nqueries, where neural networks are known to excel. However, unlocking neural\nbounding requires a twist: allowing -- but also limiting -- false positives,\nwhile ensuring that the number of false negatives is strictly zero. We enable\nsuch tight and conservative results using a dynamically-weighted asymmetric\nloss function. Our results show that our neural bounding produces up to an\norder of magnitude fewer false positives than traditional methods.\n","authors":["Wenxin Liu","Michael Fischer","Paul D. 
Yoo","Tobias Ritschel"],"pdf_url":"https://arxiv.org/pdf/2310.06822v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.13755v4","updated":"2023-10-26T18:13:05Z","published":"2023-07-25T18:26:22Z","title":"Training-based Model Refinement and Representation Disagreement for\n Semi-Supervised Object Detection","summary":" Semi-supervised object detection (SSOD) aims to improve the performance and\ngeneralization of existing object detectors by utilizing limited labeled data\nand extensive unlabeled data. Despite many advances, recent SSOD methods are\nstill challenged by inadequate model refinement using the classical exponential\nmoving average (EMA) strategy, the consensus of Teacher-Student models in the\nlatter stages of training (i.e., losing their distinctiveness), and\nnoisy/misleading pseudo-labels. This paper proposes a novel training-based\nmodel refinement (TMR) stage and a simple yet effective representation\ndisagreement (RD) strategy to address the limitations of classical EMA and the\nconsensus problem. The TMR stage of Teacher-Student models optimizes the\nlightweight scaling operation to refine the model's weights and prevent\noverfitting or forgetting learned patterns from unlabeled data. Meanwhile, the\nRD strategy helps keep these models diverged to encourage the student model to\nexplore additional patterns in unlabeled data. Our approach can be integrated\ninto established SSOD methods and is empirically validated using two baseline\nmethods, with and without cascade regression, to generate more reliable\npseudo-labels. Extensive experiments demonstrate the superior performance of\nour approach over state-of-the-art SSOD methods. 
Specifically, the proposed\napproach outperforms the baseline Unbiased-Teacher-v2 (& Unbiased-Teacher-v1)\nmethod by an average mAP margin of 2.23, 2.1, and 3.36 (& 2.07, 1.9, and 3.27)\non COCO-standard, COCO-additional, and Pascal VOC datasets, respectively.\n","authors":["Seyed Mojtaba Marvasti-Zadeh","Nilanjan Ray","Nadir Erbilgin"],"pdf_url":"https://arxiv.org/pdf/2307.13755v4.pdf","comment":"Accepted in IEEE/CVF Winter Applications of Computer Vision (WACV)\n 2024"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.17609v1","updated":"2023-10-26T17:32:55Z","published":"2023-10-26T17:32:55Z","title":"LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset","summary":" As an important component of intelligent legal systems, legal case retrieval\nplays a critical role in ensuring judicial justice and fairness. However, the\ndevelopment of legal case retrieval technologies in the Chinese legal system is\nrestricted by three problems in existing datasets: limited data size, narrow\ndefinitions of legal relevance, and naive candidate pooling strategies used in\ndata sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale\nLegal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192\ncandidates extracted from 4.3 million criminal case documents. To the best of\nour knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval\ndatasets, providing extensive coverage of criminal charges. Additionally, we\nenrich the existing relevance criteria by considering three key aspects:\ncharacterization, penalty, and procedure. These comprehensive criteria enrich the\ndataset and may provide a more holistic perspective. Furthermore, we propose a\ntwo-level candidate set pooling strategy that effectively identifies potential\ncandidates for each query case. It's important to note that all cases in the\ndataset have been annotated by multiple legal experts specializing in criminal\nlaw. 
Their expertise ensures the accuracy and reliability of the annotations.\nWe evaluate several state-of-the-art retrieval models on LeCaRDv2,\ndemonstrating that there is still significant room for improvement in legal\ncase retrieval. The details of LeCaRDv2 can be found at the anonymous website\nhttps://github.com/anonymous1113243/LeCaRDv2.\n","authors":["Haitao Li","Yunqiu Shao","Yueyue Wu","Qingyao Ai","Yixiao Ma","Yiqun Liu"],"pdf_url":"https://arxiv.org/pdf/2310.17609v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08937v3","updated":"2023-10-26T16:23:15Z","published":"2023-06-15T08:21:15Z","title":"DocumentNet: Bridging the Data Gap in Document Pre-Training","summary":" Document understanding tasks, in particular, Visually-rich Document Entity\nRetrieval (VDER), have gained significant attention in recent years thanks to\ntheir broad applications in enterprise AI. However, publicly available data\nhave been scarce for these tasks due to strict privacy constraints and high\nannotation costs. To make things worse, the non-overlapping entity spaces from\ndifferent datasets hinder the knowledge transfer between document types. In\nthis paper, we propose a method to collect massive-scale and weakly labeled\ndata from the web to benefit the training of VDER models. The collected\ndataset, named DocumentNet, does not depend on specific document types or\nentity sets, making it universally applicable to all VDER tasks. The current\nDocumentNet consists of 30M documents spanning nearly 400 document types\norganized in a four-level ontology. Experiments on a set of broadly adopted\nVDER tasks show significant improvements when DocumentNet is incorporated into\nthe pre-training for both classic and few-shot learning settings. With the\nrecent emergence of large language models (LLMs), DocumentNet provides a large\ndata source to extend their multi-modal capabilities for VDER.\n","authors":["Lijun Yu","Jin Miao","Xiaoyu Sun","Jiayi Chen","Alexander G. 
Hauptmann","Hanjun Dai","Wei Wei"],"pdf_url":"https://arxiv.org/pdf/2306.08937v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.04798v2","updated":"2023-10-26T15:52:30Z","published":"2023-05-04T13:13:44Z","title":"Multi-grained Hypergraph Interest Modeling for Conversational\n Recommendation","summary":" Conversational recommender system (CRS) interacts with users through\nmulti-turn dialogues in natural language, which aims to provide high-quality\nrecommendations for user's instant information need. Although great efforts\nhave been made to develop effective CRS, most of them still focus on the\ncontextual information from the current dialogue, usually suffering from the\ndata scarcity issue. Therefore, we consider leveraging historical dialogue data\nto enrich the limited contexts of the current dialogue session.\n In this paper, we propose a novel multi-grained hypergraph interest modeling\napproach to capture user interest beneath intricate historical data from\ndifferent perspectives. As the core idea, we employ hypergraph to represent\ncomplicated semantic relations underlying historical dialogues. In our\napproach, we first employ the hypergraph structure to model users' historical\ndialogue sessions and form a session-based hypergraph, which captures\ncoarse-grained, session-level relations. Second, to alleviate the issue of data\nscarcity, we use an external knowledge graph and construct a knowledge-based\nhypergraph considering fine-grained, entity-level semantics. We further conduct\nmulti-grained hypergraph convolution on the two kinds of hypergraphs, and\nutilize the enhanced representations to develop interest-aware CRS. Extensive\nexperiments on two benchmarks ReDial and TG-ReDial validate the effectiveness\nof our approach on both recommendation and conversation tasks. 
Code is\navailable at: https://github.com/RUCAIBox/MHIM.\n","authors":["Chenzhan Shang","Yupeng Hou","Wayne Xin Zhao","Yaliang Li","Jing Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.04798v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17488v1","updated":"2023-10-26T15:44:57Z","published":"2023-10-26T15:44:57Z","title":"LightLM: A Lightweight Deep and Narrow Language Model for Generative\n Recommendation","summary":" This paper presents LightLM, a lightweight Transformer-based language model\nfor generative recommendation. While Transformer-based generative modeling has\ngained importance in various AI sub-fields such as NLP and vision, generative\nrecommendation is still in its infancy due to its unique demand on personalized\ngenerative modeling. Existing works on generative recommendation often use\nNLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are\nheavy-weight and are not specifically designed for recommendation tasks.\nLightLM tackles the issue by introducing a light-weight deep and narrow\nTransformer architecture, which is specifically tailored for direct generation\nof recommendation items. This structure is especially apt for straightforward\ngenerative recommendation and stems from the observation that language model\ndoes not have to be too wide for this task, as the input predominantly consists\nof short tokens that are well-suited for the model's capacity. We also show\nthat our devised user and item ID indexing methods, i.e., Spectral\nCollaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables\nthe deep and narrow Transformer architecture to outperform large-scale language\nmodels for recommendation. Besides, to address the hallucination problem of\ngenerating items as output, we propose the constrained generation process for\ngenerative recommenders. 
Experiments on real-world datasets show that LightLM\noutperforms various competitive baselines in terms of both recommendation\naccuracy and efficiency. The code can be found at\nhttps://github.com/dongyuanjushi/LightLM.\n","authors":["Kai Mei","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17488v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17373v1","updated":"2023-10-26T13:10:59Z","published":"2023-10-26T13:10:59Z","title":"FMMRec: Fairness-aware Multimodal Recommendation","summary":" Recently, multimodal recommendations have gained increasing attention for\neffectively addressing the data sparsity problem by incorporating\nmodality-based representations. Although multimodal recommendations excel in\naccuracy, the introduction of different modalities (e.g., images, text, and\naudio) may expose more users' sensitive information (e.g., gender and age) to\nrecommender systems, resulting in potentially more serious unfairness issues.\nDespite many efforts on fairness, existing fairness-aware methods are either\nincompatible with multimodal scenarios, or lead to suboptimal fairness\nperformance due to neglecting sensitive information of multimodal content. To\nachieve counterfactual fairness in multimodal recommendations, we propose a\nnovel fairness-aware multimodal recommendation approach (dubbed as FMMRec) to\ndisentangle the sensitive and non-sensitive information from modal\nrepresentations and leverage the disentangled modal representations to guide\nfairer representation learning. Specifically, we first disentangle biased and\nfiltered modal representations by maximizing and minimizing their sensitive\nattribute prediction ability respectively. 
With the disentangled modal\nrepresentations, we mine the modality-based unfair and fair (corresponding to\nbiased and filtered) user-user structures for enhancing explicit user\nrepresentation with the biased and filtered neighbors from the corresponding\nstructures, followed by adversarially filtering out sensitive information.\nExperiments on two real-world public datasets demonstrate the superiority of\nour FMMRec relative to the state-of-the-art baselines. Our source code is\navailable at https://anonymous.4open.science/r/FMMRec.\n","authors":["Weixin Chen","Li Chen","Yongxin Ni","Yuhan Zhao","Fajie Yuan","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17373v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17370v1","updated":"2023-10-26T13:02:45Z","published":"2023-10-26T13:02:45Z","title":"Exploring the Potential of Generative AI for the World Wide Web","summary":" Generative Artificial Intelligence (AI) is a cutting-edge technology capable\nof producing text, images, and various media content leveraging generative\nmodels and user prompts. Between 2022 and 2023, generative AI surged in\npopularity with a plethora of applications spanning from AI-powered movies to\nchatbots. In this paper, we delve into the potential of generative AI within\nthe realm of the World Wide Web, specifically focusing on image generation. Web\ndevelopers already harness generative AI to help crafting text and images,\nwhile Web browsers might use it in the future to locally generate images for\ntasks like repairing broken webpages, conserving bandwidth, and enhancing\nprivacy. To explore this research area, we have developed WebDiffusion, a tool\nthat allows to simulate a Web powered by stable diffusion, a popular\ntext-to-image model, from both a client and server perspective. WebDiffusion\nfurther supports crowdsourcing of user opinions, which we use to evaluate the\nquality and accuracy of 409 AI-generated images sourced from 60 webpages. 
Our\nfindings suggest that generative AI is already capable of producing pertinent\nand high-quality Web images, even without requiring Web designers to manually\ninput prompts, just by leveraging contextual information available within the\nwebpages. However, we acknowledge that direct in-browser image generation\nremains a challenge, as only highly powerful GPUs, such as the A40 and A100,\ncan (partially) compete with classic image downloads. Nevertheless, this\napproach could be valuable for a subset of the images, for example when fixing\nbroken webpages or handling highly private content.\n","authors":["Nouar AlDahoul","Joseph Hong","Matteo Varvello","Yasir Zaki"],"pdf_url":"https://arxiv.org/pdf/2310.17370v1.pdf","comment":"11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.16605v2","updated":"2023-10-26T00:50:59Z","published":"2023-10-25T12:50:34Z","title":"Distributionally Robust Unsupervised Dense Retrieval Training on Web\n Graphs","summary":" This paper introduces Web-DRO, an unsupervised dense retrieval model, which\nclusters documents based on web structures and reweights the groups during\ncontrastive training. Specifically, we first leverage web graph links and\ncontrastively train an embedding model for clustering anchor-document pairs.\nThen we use Group Distributional Robust Optimization to reweight different\nclusters of anchor-document pairs, which guides the model to assign more\nweights to the group with higher contrastive loss and pay more attention to the\nworst case during training. Our experiments on MS MARCO and BEIR show that our\nmodel, Web-DRO, significantly improves the retrieval effectiveness in\nunsupervised scenarios. A comparison of clustering techniques shows that\ntraining on the web graph combining URL information reaches optimal performance\non clustering. 
Further analysis confirms that group weights are stable and\nvalid, indicating consistent model preferences as well as effective\nup-weighting of valuable groups and down-weighting of uninformative ones. The\ncode of this paper can be obtained from https://github.com/OpenMatch/Web-DRO.\n","authors":["Peixuan Han","Zhenghao Liu","Zhiyuan Liu","Chenyan Xiong"],"pdf_url":"https://arxiv.org/pdf/2310.16605v2.pdf","comment":"9 pages, 5 figures, 5 tables"},{"id":"http://arxiv.org/abs/2305.19181v2","updated":"2023-10-26T19:03:36Z","published":"2023-05-30T16:25:16Z","title":"Table Detection for Visually Rich Document Images","summary":" Table Detection (TD) is a fundamental task to enable visually rich document\nunderstanding, which requires the model to extract information without\ninformation loss. However, popular Intersection over Union (IoU) based\nevaluation metrics and IoU-based loss functions for the detection models cannot\ndirectly represent the degree of information loss for the prediction results.\nTherefore, we propose to decouple IoU into a ground truth coverage term and a\nprediction coverage term, in which the former can be used to measure the\ninformation loss of the prediction results. Besides, considering the sparse\ndistribution of tables in document images, we use SparseR-CNN as the base model\nand further improve the model by using Gaussian Noise Augmented Image Size\nregion proposals and many-to-one label assignments. 
Results under comprehensive\nexperiments show that the proposed method can consistently outperform\nstate-of-the-art methods with different IoU-based metrics under various\ndatasets and demonstrate that the proposed decoupled IoU loss can enable the\nmodel to alleviate information loss.\n","authors":["Bin Xiao","Murat Simsek","Burak Kantarci","Ala Abu Alkheir"],"pdf_url":"https://arxiv.org/pdf/2305.19181v2.pdf","comment":"Accepted by Knowledge-Based Systems"},{"id":"http://arxiv.org/abs/2310.17732v1","updated":"2023-10-26T18:43:16Z","published":"2023-10-26T18:43:16Z","title":"GNN-GMVO: Graph Neural Networks for Optimizing Gross Merchandise Value\n in Similar Item Recommendation","summary":" Similar item recommendation is a critical task in the e-Commerce industry,\nwhich helps customers explore similar and relevant alternatives based on their\ninterested products. Despite the traditional machine learning models, Graph\nNeural Networks (GNNs), by design, can understand complex relations like\nsimilarity between products. However, in contrast to their wide usage in\nretrieval tasks and their focus on optimizing the relevance, the current GNN\narchitectures are not tailored toward maximizing revenue-related objectives\nsuch as Gross Merchandise Value (GMV), which is one of the major business\nmetrics for e-Commerce companies. In addition, defining accurate edge relations\nin GNNs is non-trivial in large-scale e-Commerce systems, due to the\nheterogeneity nature of the item-item relationships. This work aims to address\nthese issues by designing a new GNN architecture called GNN-GMVO (Graph Neural\nNetwork - Gross Merchandise Value Optimizer). This model directly optimizes GMV\nwhile considering the complex relations between items. In addition, we propose\na customized edge construction method to tailor the model toward similar item\nrecommendation task and alleviate the noisy and complex item-item relations. 
In\nour comprehensive experiments on three real-world datasets, we show higher\nprediction performance and expected GMV for top ranked items recommended by our\nmodel when compared with selected state-of-the-art benchmark models.\n","authors":["Ramin Giahi","Reza Yousefi Maragheh","Nima Farrokhsiar","Jianpeng Xu","Jason Cho","Evren Korpeoglu","Sushant Kumar","Kannan Achan"],"pdf_url":"https://arxiv.org/pdf/2310.17732v1.pdf","comment":"9 pages, 3 figures, 43 citations"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.17653v1","updated":"2023-10-26T17:59:46Z","published":"2023-10-26T17:59:46Z","title":"Fantastic Gains and Where to Find Them: On the Existence and Prospect of\n General Knowledge Transfer between Any Pretrained Model","summary":" Training deep networks requires various design decisions regarding for\ninstance their architecture, data augmentation, or optimization. In this work,\nwe find these training variations to result in networks learning unique feature\nsets from the data. Using public model libraries comprising thousands of models\ntrained on canonical datasets like ImageNet, we observe that for arbitrary\npairings of pretrained models, one model extracts significant data context\nunavailable in the other -- independent of overall performance. Given any\narbitrary pairing of pretrained models and no external rankings (such as\nseparate test sets, e.g. due to data privacy), we investigate if it is possible\nto transfer such \"complementary\" knowledge from one model to another without\nperformance degradation -- a task made particularly difficult as additional\nknowledge can be contained in stronger, equiperformant or weaker models. Yet\nfacilitating robust transfer in scenarios agnostic to pretrained model pairings\nwould unlock auxiliary gains and knowledge fusion from any model repository\nwithout restrictions on model and problem specifics - including from weaker,\nlower-performance models. 
This work therefore provides an initial, in-depth\nexploration on the viability of such general-purpose knowledge transfer. Across\nlarge-scale experiments, we first reveal the shortcomings of standard knowledge\ndistillation techniques, and then propose a much more general extension through\ndata partitioning for successful transfer between nearly all pretrained models,\nwhich we show can also be done unsupervised. Finally, we assess both the\nscalability and impact of fundamental model properties on successful\nmodel-agnostic knowledge transfer.\n","authors":["Karsten Roth","Lukas Thede","Almut Sophia Koepke","Oriol Vinyals","Olivier Hénaff","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2310.17653v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.14338v2","updated":"2023-10-26T17:59:37Z","published":"2023-07-26T17:58:07Z","title":"TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023","summary":" Deep learning (DL) models for tabular data problems (e.g. classification,\nregression) are currently receiving increasingly more attention from\nresearchers. However, despite the recent efforts, the non-DL algorithms based\non gradient-boosted decision trees (GBDT) remain a strong go-to solution for\nthese problems. One of the research directions aimed at improving the position\nof tabular DL involves designing so-called retrieval-augmented models. For a\ntarget object, such models retrieve other objects (e.g. the nearest neighbors)\nfrom the available training data and use their features and labels to make a\nbetter prediction.\n In this work, we present TabR -- essentially, a feed-forward network with a\ncustom k-Nearest-Neighbors-like component in the middle. 
On a set of public\nbenchmarks with datasets up to several million objects, TabR marks a big step\nforward for tabular DL: it demonstrates the best average performance among\ntabular DL models, becomes the new state-of-the-art on several datasets, and\neven outperforms GBDT models on the recently proposed \"GBDT-friendly\" benchmark\n(see Figure 1). Among the important findings and technical details powering\nTabR, the main ones lie in the attention-like mechanism that is responsible for\nretrieving the nearest neighbors and extracting valuable signal from them. In\naddition to the much higher performance, TabR is simple and significantly more\nefficient compared to prior retrieval-based tabular DL models.\n","authors":["Yury Gorishniy","Ivan Rubachev","Nikolay Kartashev","Daniil Shlenskii","Akim Kotelnikov","Artem Babenko"],"pdf_url":"https://arxiv.org/pdf/2307.14338v2.pdf","comment":"Code: https://github.com/yandex-research/tabular-dl-tabr"},{"id":"http://arxiv.org/abs/2310.17651v1","updated":"2023-10-26T17:59:32Z","published":"2023-10-26T17:59:32Z","title":"High-Dimensional Prediction for Sequential Decision Making","summary":" We study the problem of making predictions of an adversarially chosen\nhigh-dimensional state that are unbiased subject to an arbitrary collection of\nconditioning events, with the goal of tailoring these events to downstream\ndecision makers. 
We give efficient algorithms for solving this problem, as well\nas a number of applications that stem from choosing an appropriate set of\nconditioning events.\n","authors":["Georgy Noarov","Ramya Ramalingam","Aaron Roth","Stephan Xie"],"pdf_url":"https://arxiv.org/pdf/2310.17651v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17650v1","updated":"2023-10-26T17:59:19Z","published":"2023-10-26T17:59:19Z","title":"A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised\n Video Anomaly Detection","summary":" Detection of anomalous events in videos is an important problem in\napplications such as surveillance. Video anomaly detection (VAD) is\nwell-studied in the one-class classification (OCC) and weakly supervised (WS)\nsettings. However, fully unsupervised (US) video anomaly detection methods,\nwhich learn a complete system without any annotation or human supervision, have\nnot been explored in depth. This is because the lack of any ground truth\nannotations significantly increases the magnitude of the VAD challenge. To\naddress this challenge, we propose a simple-but-effective two-stage\npseudo-label generation framework that produces segment-level (normal/anomaly)\npseudo-labels, which can be further used to train a segment-level anomaly\ndetector in a supervised manner. The proposed coarse-to-fine pseudo-label\n(C2FPL) generator employs carefully-designed hierarchical divisive clustering\nand statistical hypothesis testing to identify anomalous video segments from a\nset of completely unlabeled videos. The trained anomaly detector can be\ndirectly applied on segments of an unseen test video to obtain segment-level,\nand subsequently, frame-level anomaly predictions. 
Extensive studies on two\nlarge-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that\nthe proposed unsupervised approach achieves superior performance compared to\nall existing OCC and US methods, while yielding comparable performance to the\nstate-of-the-art WS methods.\n","authors":["Anas Al-lahham","Nurbek Tastan","Zaigham Zaheer","Karthik Nandakumar"],"pdf_url":"https://arxiv.org/pdf/2310.17650v1.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV), 2024"},{"id":"http://arxiv.org/abs/2310.17646v1","updated":"2023-10-26T17:58:12Z","published":"2023-10-26T17:58:12Z","title":"Do Graph Neural Networks Dream of Landau Damping? Insights from Kinetic\n Simulations of a Plasma Sheet Model","summary":" We explore the possibility of fully replacing a plasma physics kinetic\nsimulator with a graph neural network-based simulator. We focus on this class\nof surrogate models given the similarity between their message-passing update\nmechanism and the traditional physics solver update, and the possibility of\nenforcing known physical priors into the graph construction and update. We show\nthat our model learns the kinetic plasma dynamics of the one-dimensional plasma\nmodel, a predecessor of contemporary kinetic plasma simulation codes, and\nrecovers a wide range of well-known kinetic plasma processes, including plasma\nthermalization, electrostatic fluctuations about thermal equilibrium, and the\ndrag on a fast sheet and Landau damping. We compare the performance against the\noriginal plasma model in terms of run-time, conservation laws, and temporal\nevolution of key physical quantities. 
The limitations of the model are\npresented and possible directions for higher-dimensional surrogate models for\nkinetic plasmas are discussed.\n","authors":["Diogo D Carvalho","Diogo R Ferreira","Luis O Silva"],"pdf_url":"https://arxiv.org/pdf/2310.17646v1.pdf","comment":"27 pages, 14 figures"},{"id":"http://arxiv.org/abs/2310.17645v1","updated":"2023-10-26T17:58:08Z","published":"2023-10-26T17:58:08Z","title":"Defending Against Transfer Attacks From Public Models","summary":" Adversarial attacks have been a looming and unaddressed threat in the\nindustry. However, through a decade-long history of the robustness evaluation\nliterature, we have learned that mounting a strong or optimal attack is\nchallenging. It requires both machine learning and domain expertise. In other\nwords, the white-box threat model, religiously assumed by a large majority of\nthe past literature, is unrealistic. In this paper, we propose a new practical\nthreat model where the adversary relies on transfer attacks through publicly\navailable surrogate models. We argue that this setting will become the most\nprevalent for security-sensitive applications in the future. We evaluate the\ntransfer attacks in this setting and propose a specialized defense method based\non a game-theoretic perspective. The defenses are evaluated under 24 public\nmodels and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and\nImageNet). Under this threat model, our defense, PubDef, outperforms the\nstate-of-the-art white-box adversarial training by a large margin with almost\nno loss in the normal accuracy. For instance, on ImageNet, our defense achieves\n62% accuracy under the strongest transfer attack vs only 36% of the best\nadversarially trained model. Its accuracy when not under attack is only 2%\nlower than that of an undefended model (78% vs 80%). 
We release our code at\nhttps://github.com/wagner-group/pubdef.\n","authors":["Chawin Sitawarin","Jaewon Chang","David Huang","Wesson Altoyan","David Wagner"],"pdf_url":"https://arxiv.org/pdf/2310.17645v1.pdf","comment":"Under submission. Code available at\n https://github.com/wagner-group/pubdef"},{"id":"http://arxiv.org/abs/2310.17644v1","updated":"2023-10-26T17:57:15Z","published":"2023-10-26T17:57:15Z","title":"torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free\n Deep Learning Studies: A Case Study on NLP","summary":" Reproducibility in scientific work has been becoming increasingly important\nin research communities such as machine learning, natural language processing,\nand computer vision communities due to the rapid development of the research\ndomains supported by recent advances in deep learning. In this work, we present\na significantly upgraded version of torchdistill, a modular-driven coding-free\ndeep learning framework significantly upgraded from the initial release, which\nsupports only image classification and object detection tasks for reproducible\nknowledge distillation experiments. To demonstrate that the upgraded framework\ncan support more tasks with third-party libraries, we reproduce the GLUE\nbenchmark results of BERT models using a script based on the upgraded\ntorchdistill, harmonizing with various Hugging Face libraries. All the 27\nfine-tuned BERT models and configurations to reproduce the results are\npublished at Hugging Face, and the model weights have already been widely used\nin research communities. 
We also reimplement popular small-sized models and new\nknowledge distillation methods and perform additional experiments for computer\nvision tasks.\n","authors":["Yoshitomo Matsubara"],"pdf_url":"https://arxiv.org/pdf/2310.17644v1.pdf","comment":"Accepted at the 3rd Workshop for Natural Language Processing Open\n Source Software (NLP-OSS) at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17643v1","updated":"2023-10-26T17:56:50Z","published":"2023-10-26T17:56:50Z","title":"Where you go is who you are -- A study on machine learning based\n semantic privacy attacks","summary":" Concerns about data privacy are omnipresent, given the increasing usage of\ndigital applications and their underlying business model that includes selling\nuser data. Location data is particularly sensitive since they allow us to infer\nactivity patterns and interests of users, e.g., by categorizing visited\nlocations based on nearby points of interest (POI). On top of that, machine\nlearning methods provide new powerful tools to interpret big data. In light of\nthese considerations, we raise the following question: What is the actual risk\nthat realistic, machine learning based privacy attacks can obtain meaningful\nsemantic information from raw location data, subject to inaccuracies in the\ndata? In response, we present a systematic analysis of two attack scenarios,\nnamely location categorization and user profiling. Experiments on the\nFoursquare dataset and tracking data demonstrate the potential for abuse of\nhigh-quality spatial information, leading to a significant privacy loss even\nwith location inaccuracy of up to 200m. With location obfuscation of more than\n1 km, spatial information hardly adds any value, but a high privacy risk solely\nfrom temporal information remains. The availability of public context data such\nas POIs plays a key role in inference based on spatial information. 
Our\nfindings point out the risks of ever-growing databases of tracking data and\nspatial context data, which policymakers should consider for privacy\nregulations, and which could guide individuals in their personal location\nprotection measures.\n","authors":["Nina Wiedemann","Ourania Kounadi","Martin Raubal","Krzysztof Janowicz"],"pdf_url":"https://arxiv.org/pdf/2310.17643v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17642v1","updated":"2023-10-26T17:56:35Z","published":"2023-10-26T17:56:35Z","title":"Drive Anywhere: Generalizable End-to-end Autonomous Driving with\n Multi-modal Foundation Models","summary":" As autonomous driving technology matures, end-to-end methodologies have\nemerged as a leading strategy, promising seamless integration from perception\nto control via deep learning. However, existing systems grapple with challenges\nsuch as unexpected open set environments and the complexity of black-box\nmodels. At the same time, the evolution of deep learning introduces larger,\nmultimodal foundational models, offering multi-modal visual and textual\nunderstanding. In this paper, we harness these multimodal foundation models to\nenhance the robustness and adaptability of autonomous driving systems, enabling\nout-of-distribution, end-to-end, multimodal, and more explainable autonomy.\nSpecifically, we present an approach to apply end-to-end open-set (any\nenvironment/scene) autonomous driving that is capable of providing driving\ndecisions from representations queryable by image and text. To do so, we\nintroduce a method to extract nuanced spatial (pixel/patch-aligned) features\nfrom transformers to enable the encapsulation of both spatial and semantic\nfeatures. 
Our approach (i) demonstrates unparalleled results in diverse tests\nwhile achieving significantly greater robustness in out-of-distribution\nsituations, and (ii) allows the incorporation of latent space simulation (via\ntext) for improved training (data augmentation via text) and policy debugging.\nWe encourage the reader to check our explainer video at\nhttps://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the\ncode and demos on our project webpage at https://drive-anywhere.github.io/.\n","authors":["Tsun-Hsuan Wang","Alaa Maalouf","Wei Xiao","Yutong Ban","Alexander Amini","Guy Rosman","Sertac Karaman","Daniela Rus"],"pdf_url":"https://arxiv.org/pdf/2310.17642v1.pdf","comment":"Project webpage: https://drive-anywhere.github.io Explainer video:\n https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be"},{"id":"http://arxiv.org/abs/2310.17639v1","updated":"2023-10-26T17:54:52Z","published":"2023-10-26T17:54:52Z","title":"In-Context Learning Dynamics with Random Binary Sequences","summary":" Large language models (LLMs) trained on huge corpora of text datasets\ndemonstrate complex, emergent capabilities, achieving state-of-the-art\nperformance on tasks they were not explicitly trained for. The precise nature\nof LLM capabilities is often mysterious, and different prompts can elicit\ndifferent capabilities through in-context learning. We propose a Cognitive\nInterpretability framework that enables us to analyze in-context learning\ndynamics to understand latent concepts in LLMs underlying behavioral patterns.\nThis provides a more nuanced understanding than success-or-failure evaluation\nbenchmarks, but does not require observing internal activations as a\nmechanistic interpretation of circuits would. Inspired by the cognitive science\nof human randomness perception, we use random binary sequences as context and\nstudy dynamics of in-context learning by manipulating properties of context\ndata, such as sequence length. 
In the latest GPT-3.5+ models, we find emergent\nabilities to generate pseudo-random numbers and learn basic formal languages,\nwith striking in-context learning dynamics where model outputs transition\nsharply from pseudo-random behaviors to deterministic repetition.\n","authors":["Eric J. Bigelow","Ekdeep Singh Lubana","Robert P. Dick","Hidenori Tanaka","Tomer D. Ullman"],"pdf_url":"https://arxiv.org/pdf/2310.17639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17638v1","updated":"2023-10-26T17:53:24Z","published":"2023-10-26T17:53:24Z","title":"Generative Fractional Diffusion Models","summary":" We generalize the continuous time framework for score-based generative models\nfrom an underlying Brownian motion (BM) to an approximation of fractional\nBrownian motion (FBM). We derive a continuous reparameterization trick and the\nreverse time model by representing FBM as a stochastic integral over a family\nof Ornstein-Uhlenbeck processes to define generative fractional diffusion\nmodels (GFDM) with driving noise converging to a non-Markovian process of\ninfinite quadratic variation. The Hurst index $H\\in(0,1)$ of FBM enables\ncontrol of the roughness of the distribution transforming path. To the best of\nour knowledge, this is the first attempt to build a generative model upon a\nstochastic process with infinite quadratic variation.\n","authors":["Gabriel Nobis","Marco Aversa","Maximilian Springenberg","Michael Detzel","Stefano Ermon","Shinichi Nakajima","Roderick Murray-Smith","Sebastian Lapuschkin","Christoph Knochenhauer","Luis Oala","Wojciech Samek"],"pdf_url":"https://arxiv.org/pdf/2310.17638v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17634v1","updated":"2023-10-26T17:51:46Z","published":"2023-10-26T17:51:46Z","title":"Grow Your Limits: Continuous Improvement with Real-World RL for Robotic\n Locomotion","summary":" Deep reinforcement learning (RL) can enable robots to autonomously acquire\ncomplex behaviors, such as legged locomotion. 
However, RL in the real world is\ncomplicated by constraints on efficiency, safety, and overall training\nstability, which limits its practical applicability. We present APRL, a policy\nregularization framework that modulates the robot's exploration over the course\nof training, striking a balance between flexible improvement potential and\nfocused, efficient exploration. APRL enables a quadrupedal robot to efficiently\nlearn to walk entirely in the real world within minutes and continue to improve\nwith more training where prior work saturates in performance. We demonstrate\nthat continued training with APRL results in a policy that is substantially\nmore capable of navigating challenging situations and is able to adapt to\nchanges in dynamics with continued training.\n","authors":["Laura Smith","Yunhao Cao","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2310.17634v1.pdf","comment":"First two authors contributed equally. Project website:\n https://sites.google.com/berkeley.edu/aprl"},{"id":"http://arxiv.org/abs/2009.01742v3","updated":"2023-10-26T17:51:42Z","published":"2020-09-03T15:39:55Z","title":"Online Estimation and Community Detection of Network Point Processes for\n Event Streams","summary":" A common goal in network modeling is to uncover the latent community\nstructure present among nodes. For many real-world networks, the true\nconnections consist of events arriving as streams, which are then aggregated to\nform edges, ignoring the dynamic temporal component. A natural way to take\naccount of these temporal dynamics of interactions is to use point processes as\nthe foundation of network models for community detection. Computational\ncomplexity hampers the scalability of such approaches to large sparse networks.\nTo circumvent this challenge, we propose a fast online variational inference\nalgorithm for estimating the latent structure underlying dynamic event arrivals\non a network, using continuous-time point process latent network models. 
We\ndescribe this procedure for network models capturing community structure. This\nstructure can be learned as new events are observed on the network, updating\nthe inferred community assignments. We investigate the theoretical properties\nof such an inference scheme, and provide regret bounds on the loss function of\nthis procedure. The proposed inference procedure is then thoroughly compared,\nusing both simulation studies and real data, to non-online variants. We\ndemonstrate that online inference can obtain comparable performance, in terms\nof community recovery, to non-online variants, while realising computational\ngains. Our proposed inference framework can also be readily modified to\nincorporate other popular network structures.\n","authors":["Guanhua Fang","Owen G. Ward","Tian Zheng"],"pdf_url":"https://arxiv.org/pdf/2009.01742v3.pdf","comment":"45 pages"},{"id":"http://arxiv.org/abs/2310.02987v2","updated":"2023-10-26T17:47:10Z","published":"2023-10-04T17:24:45Z","title":"Variance Reduced Halpern Iteration for Finite-Sum Monotone Inclusions","summary":" Machine learning approaches relying on such criteria as adversarial\nrobustness or multi-agent settings have raised the need for solving\ngame-theoretic equilibrium problems. Of particular relevance to these\napplications are methods targeting finite-sum structure, which generically\narises in empirical variants of learning problems in these contexts. Further,\nmethods with computable approximation errors are highly desirable, as they\nprovide verifiable exit criteria. Motivated by these applications, we study\nfinite-sum monotone inclusion problems, which model broad classes of\nequilibrium problems. Our main contributions are variants of the classical\nHalpern iteration that employ variance reduction to obtain improved complexity\nguarantees in which $n$ component operators in the finite sum are ``on\naverage'' either cocoercive or Lipschitz continuous and monotone, with\nparameter $L$. 
The resulting oracle complexity of our methods, which provide\nguarantees for the last iterate and for a (computable) operator norm residual,\nis $\\widetilde{\\mathcal{O}}( n + \\sqrt{n}L\\varepsilon^{-1})$, which improves\nupon existing methods by a factor up to $\\sqrt{n}$. This constitutes the first\nvariance reduction-type result for general finite-sum monotone inclusions and\nfor more specific problems such as convex-concave optimization when operator\nnorm residual is the optimality measure. We further argue that, up to\npoly-logarithmic factors, this complexity is unimprovable in the monotone\nLipschitz setting; i.e., the provided result is near-optimal.\n","authors":["Xufeng Cai","Ahmet Alacaoglu","Jelena Diakonikolas"],"pdf_url":"https://arxiv.org/pdf/2310.02987v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17623v1","updated":"2023-10-26T17:43:13Z","published":"2023-10-26T17:43:13Z","title":"Proving Test Set Contamination in Black Box Language Models","summary":" Large language models are trained on vast amounts of internet data, prompting\nconcerns and speculation that they have memorized public benchmarks. Going from\nspeculation to proof of contamination is challenging, as the pretraining data\nused by proprietary models are often not publicly accessible. We show that it\nis possible to provide provable guarantees of test set contamination in\nlanguage models without access to pretraining data or model weights. Our\napproach leverages the fact that when there is no data contamination, all\norderings of an exchangeable benchmark should be equally likely. In contrast,\nthe tendency for language models to memorize example order means that a\ncontaminated language model will find certain canonical orderings to be much\nmore likely than others. Our test flags potential contamination whenever the\nlikelihood of a canonically ordered benchmark dataset is significantly higher\nthan the likelihood after shuffling the examples. 
We demonstrate that our\nprocedure is sensitive enough to reliably prove test set contamination in\nchallenging situations, including models as small as 1.4 billion parameters, on\nsmall test sets of only 1000 examples, and datasets that appear only a few\ntimes in the pretraining corpus. Using our test, we audit five popular publicly\naccessible language models for test set contamination and find little evidence\nfor pervasive contamination.\n","authors":["Yonatan Oren","Nicole Meister","Niladri Chatterji","Faisal Ladhak","Tatsunori B. Hashimoto"],"pdf_url":"https://arxiv.org/pdf/2310.17623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17622v1","updated":"2023-10-26T17:41:11Z","published":"2023-10-26T17:41:11Z","title":"Combating Representation Learning Disparity with Geometric Harmonization","summary":" Self-supervised learning (SSL) as an effective paradigm of representation\nlearning has achieved tremendous success on various curated datasets in diverse\nscenarios. Nevertheless, when facing the long-tailed distribution in real-world\napplications, it is still hard for existing methods to capture transferable and\nrobust representation. Conventional SSL methods, pursuing sample-level\nuniformity, easily lead to representation learning disparity where head\nclasses dominate the feature regime but tail classes passively collapse. To\naddress this problem, we propose a novel Geometric Harmonization (GH) method to\nencourage category-level uniformity in representation learning, which is more\nbenign to the minority and almost does not hurt the majority under long-tailed\ndistribution. Specifically, GH measures the population statistics of the embedding\nspace on top of self-supervised learning, and then infers a fine-grained\ninstance-wise calibration to constrain the space expansion of head classes and\navoid the passive collapse of tail classes. 
Our proposal does not alter the\nsetting of SSL and can be easily integrated into existing methods in a low-cost\nmanner. Extensive results on a range of benchmark datasets show the\neffectiveness of GH with high tolerance to the distribution skewness. Our code\nis available at https://github.com/MediaBrain-SJTU/Geometric-Harmonization.\n","authors":["Zhihan Zhou","Jiangchao Yao","Feng Hong","Ya Zhang","Bo Han","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17622v1.pdf","comment":"Accepted to NeurIPS 2023 (spotlight)"},{"id":"http://arxiv.org/abs/2310.17611v1","updated":"2023-10-26T17:34:32Z","published":"2023-10-26T17:34:32Z","title":"Uncovering Meanings of Embeddings via Partial Orthogonality","summary":" Machine learning tools often rely on embedding text as vectors of real\nnumbers. In this paper, we study how the semantic structure of language is\nencoded in the algebraic structure of such embeddings. Specifically, we look at\na notion of ``semantic independence'' capturing the idea that, e.g.,\n``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such\nexamples are intuitive, it is difficult to formalize such a notion of semantic\nindependence. The key observation here is that any sensible formalization\nshould obey a set of so-called independence axioms, and thus any algebraic\nencoding of this structure should also obey these axioms. This leads us\nnaturally to use partial orthogonality as the relevant algebraic structure. We\ndevelop theory and methods that allow us to demonstrate that partial\northogonality does indeed capture semantic independence. 
Complementary to this,\nwe also introduce the concept of independence preserving embeddings where\nembeddings preserve the conditional independence structures of a distribution,\nand we prove the existence of such embeddings and approximations to them.\n","authors":["Yibo Jiang","Bryon Aragam","Victor Veitch"],"pdf_url":"https://arxiv.org/pdf/2310.17611v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17610v1","updated":"2023-10-26T17:33:52Z","published":"2023-10-26T17:33:52Z","title":"A qualitative difference between gradient flows of convex functions in\n finite- and infinite-dimensional Hilbert spaces","summary":" We consider gradient flow/gradient descent and heavy ball/accelerated\ngradient descent optimization for convex objective functions. In the gradient\nflow case, we prove the following:\n 1. If $f$ does not have a minimizer, the convergence $f(x_t)\\to \\inf f$ can\nbe arbitrarily slow.\n 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \\inf f$ is\nintegrable/summable in time. In particular, $f(x_t) - \\inf f = o(1/t)$ as\n$t\\to\\infty$.\n 3. In Hilbert spaces, this is optimal: $f(x_t) - \\inf f$ can decay to $0$ as\nslowly as any given function which is monotone decreasing and integrable at\n$\\infty$, even for a fixed quadratic objective.\n 4. In finite dimension (or more generally, for all gradient flow curves of\nfinite length), this is not optimal: We prove that there are convex monotone\ndecreasing integrable functions $g(t)$ which decrease to zero slower than\n$f(x_t)-\\inf f$ for the gradient flow of any convex function on $\\mathbb R^d$.\nFor instance, we show that any gradient flow $x_t$ of a convex function $f$ in\nfinite dimension satisfies $\\liminf_{t\\to\\infty} \\big(t\\cdot \\log^2(t)\\cdot\n\\big\\{f(x_t) -\\inf f\\big\\}\\big)=0$.\n This improves on the commonly reported $O(1/t)$ rate and provides a sharp\ncharacterization of the energy decay law. 
We also note that it is impossible to\nestablish a rate $O(1/(t\\phi(t)))$ for any function $\\phi$ which satisfies\n$\\lim_{t\\to\\infty}\\phi(t) = \\infty$, even asymptotically.\n Similar results are obtained in related settings for (1) discrete time\ngradient descent, (2) stochastic gradient descent with multiplicative noise and\n(3) the heavy ball ODE. In the case of stochastic gradient descent, the\nsummability of $\\mathbb E[f(x_n) - \\inf f]$ is used to prove that $f(x_n)\\to\n\\inf f$ almost surely - an improvement on the convergence almost surely up to a\nsubsequence which follows from the $O(1/n)$ decay estimate.\n","authors":["Jonathan W. Siegel","Stephan Wojtowytsch"],"pdf_url":"https://arxiv.org/pdf/2310.17610v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.14534v2","updated":"2023-10-26T17:27:56Z","published":"2023-06-26T09:18:30Z","title":"CEIL: Generalized Contextual Imitation Learning","summary":" In this paper, we present \\textbf{C}ont\\textbf{E}xtual \\textbf{I}mitation\n\\textbf{L}earning~(CEIL), a general and broadly applicable algorithm for\nimitation learning (IL). Inspired by the formulation of hindsight information\nmatching, we derive CEIL by explicitly learning a hindsight embedding function\ntogether with a contextual policy using the hindsight embeddings. To achieve\nthe expert matching objective for IL, we advocate for optimizing a contextual\nvariable such that it biases the contextual policy towards mimicking expert\nbehaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL\nis a generalist that can be effectively applied to multiple settings including:\n1)~learning from observations (LfO), 2)~offline IL, 3)~cross-domain IL\n(mismatched experts), and 4) one-shot IL settings. 
Empirically, we evaluate\nCEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline).\nCompared to prior state-of-the-art baselines, we show that CEIL is more\nsample-efficient in most online IL tasks and achieves better or competitive\nperformances in offline tasks.\n","authors":["Jinxin Liu","Li He","Yachen Kang","Zifeng Zhuang","Donglin Wang","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2306.14534v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.07273v2","updated":"2023-10-26T17:24:29Z","published":"2023-06-12T17:57:05Z","title":"Gaussian Membership Inference Privacy","summary":" We propose a novel and practical privacy notion called $f$-Membership\nInference Privacy ($f$-MIP), which explicitly considers the capabilities of\nrealistic adversaries under the membership inference attack threat model.\nConsequently, $f$-MIP offers interpretable privacy guarantees and improved\nutility (e.g., better classification accuracy). In particular, we derive a\nparametric family of $f$-MIP guarantees that we refer to as $\\mu$-Gaussian\nMembership Inference Privacy ($\\mu$-GMIP) by theoretically analyzing likelihood\nratio-based membership inference attacks on stochastic gradient descent (SGD).\nOur analysis highlights that models trained with standard SGD already offer an\nelementary level of MIP. Additionally, we show how $f$-MIP can be amplified by\nadding noise to gradient updates. Our analysis further yields an analytical\nmembership inference attack that offers two distinct advantages over previous\napproaches. First, unlike existing state-of-the-art attacks that require\ntraining hundreds of shadow models, our attack does not require any shadow\nmodel. Second, our analytical attack enables straightforward auditing of our\nprivacy notion $f$-MIP. 
Finally, we quantify how various hyperparameters (e.g.,\nbatch size, number of model parameters) and specific data characteristics\ndetermine an attacker's ability to accurately infer a point's membership in the\ntraining set. We demonstrate the effectiveness of our method on models trained\non vision and tabular datasets.\n","authors":["Tobias Leemann","Martin Pawelczyk","Gjergji Kasneci"],"pdf_url":"https://arxiv.org/pdf/2306.07273v2.pdf","comment":"NeurIPS 2023 camera-ready. The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2302.13534v2","updated":"2023-10-26T17:21:36Z","published":"2023-02-27T06:09:10Z","title":"Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL\n with General Regularizers and Multiple Optimal Arms","summary":" We study the problem of designing adaptive multi-armed bandit algorithms that\nperform optimally in both the stochastic setting and the adversarial setting\nsimultaneously (often known as a best-of-both-world guarantee). A line of\nrecent works shows that when configured and analyzed properly, the\nFollow-the-Regularized-Leader (FTRL) algorithm, originally designed for the\nadversarial setting, can in fact optimally adapt to the stochastic setting as\nwell. Such results, however, critically rely on an assumption that there exists\none unique optimal arm. Recently, Ito (2021) took the first step to remove such\nan undesirable uniqueness assumption for one particular FTRL algorithm with the\n$\\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly\nimprove and generalize this result, showing that uniqueness is unnecessary for\nFTRL with a broad family of regularizers and a new learning rate schedule. For\nsome regularizers, our regret bounds also improve upon prior results even when\nuniqueness holds. 
We further provide an application of our results to the\ndecoupled exploration and exploitation problem, demonstrating that our\ntechniques are broadly applicable.\n","authors":["Tiancheng Jin","Junyan Liu","Haipeng Luo"],"pdf_url":"https://arxiv.org/pdf/2302.13534v2.pdf","comment":"Update the camera-ready version for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17596v1","updated":"2023-10-26T17:17:31Z","published":"2023-10-26T17:17:31Z","title":"MimicGen: A Data Generation System for Scalable Robot Learning using\n Human Demonstrations","summary":" Imitation learning from a large set of human demonstrations has proved to be\nan effective paradigm for building capable robot agents. However, the\ndemonstrations can be extremely costly and time-consuming to collect. We\nintroduce MimicGen, a system for automatically synthesizing large-scale, rich\ndatasets from only a small number of human demonstrations by adapting them to\nnew contexts. We use MimicGen to generate over 50K demonstrations across 18\ntasks with diverse scene configurations, object instances, and robot arms from\njust ~200 human demonstrations. We show that robot agents can be effectively\ntrained on this generated dataset by imitation learning to achieve strong\nperformance in long-horizon and high-precision tasks, such as multi-part\nassembly and coffee preparation, across broad initial state distributions. We\nfurther demonstrate that the effectiveness and utility of MimicGen data compare\nfavorably to collecting additional human demonstrations, making it a powerful\nand economical approach towards scaling up robot learning. 
Datasets, simulation\nenvironments, videos, and more at https://mimicgen.github.io .\n","authors":["Ajay Mandlekar","Soroush Nasiriany","Bowen Wen","Iretiayo Akinola","Yashraj Narang","Linxi Fan","Yuke Zhu","Dieter Fox"],"pdf_url":"https://arxiv.org/pdf/2310.17596v1.pdf","comment":"Conference on Robot Learning (CoRL) 2023"},{"id":"http://arxiv.org/abs/2310.16779v2","updated":"2023-10-26T17:15:48Z","published":"2023-10-25T17:11:21Z","title":"Multi-scale Diffusion Denoised Smoothing","summary":" Along with recent diffusion models, randomized smoothing has become one of a\nfew tangible approaches that offers adversarial robustness to models at scale,\ne.g., those of large pre-trained models. Specifically, one can perform\nrandomized smoothing on any classifier via a simple \"denoise-and-classify\"\npipeline, so-called denoised smoothing, given that an accurate denoiser is\navailable - such as a diffusion model. In this paper, we present scalable methods\nto address the current trade-off between certified robustness and accuracy in\ndenoised smoothing. Our key idea is to \"selectively\" apply smoothing among\nmultiple noise scales, coined multi-scale smoothing, which can be efficiently\nimplemented with a single diffusion model. This approach also suggests a new\nobjective to compare the collective robustness of multi-scale smoothed\nclassifiers, and questions which representation of the diffusion model would\nmaximize the objective. To address this, we propose to further fine-tune\nthe diffusion model (a) to perform consistent denoising whenever the original image\nis recoverable, but (b) to generate rather diverse outputs otherwise. 
Our\nexperiments show that the proposed multi-scale smoothing scheme combined with\ndiffusion fine-tuning enables strong certified robustness available with high\nnoise level while maintaining its accuracy closer to non-smoothed classifiers.\n","authors":["Jongheon Jeong","Jinwoo Shin"],"pdf_url":"https://arxiv.org/pdf/2310.16779v2.pdf","comment":"Published as a conference paper at NeurIPS 2023; Code is available at\n https://github.com/jh-jeong/smoothing-multiscale"},{"id":"http://arxiv.org/abs/2305.17380v3","updated":"2023-10-26T17:12:30Z","published":"2023-05-27T06:10:17Z","title":"No-Regret Online Reinforcement Learning with Adversarial Losses and\n Transitions","summary":" Existing online learning algorithms for adversarial Markov Decision Processes\nachieve ${O}(\\sqrt{T})$ regret after $T$ rounds of interactions even if the\nloss functions are chosen arbitrarily by an adversary, with the caveat that the\ntransition function has to be fixed. This is because it has been shown that\nadversarial transition functions make no-regret learning impossible. Despite\nsuch impossibility results, in this work, we develop algorithms that can handle\nboth adversarial losses and adversarial transitions, with regret increasing\nsmoothly in the degree of maliciousness of the adversary. More concretely, we\nfirst propose an algorithm that enjoys $\\widetilde{{O}}(\\sqrt{T} +\nC^{\\textsf{P}})$ regret where $C^{\\textsf{P}}$ measures how adversarial the\ntransition functions are and can be at most ${O}(T)$. While this algorithm\nitself requires knowledge of $C^{\\textsf{P}}$, we further develop a black-box\nreduction approach that removes this requirement. 
Moreover, we also show that\nfurther refinements of the algorithm not only maintain the same regret bound,\nbut also simultaneously adapt to easier environments (where losses are\ngenerated in a certain stochastically constrained manner as in Jin et al.\n[2021]) and achieve $\\widetilde{{O}}(U + \\sqrt{UC^{\\textsf{L}}} +\nC^{\\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient\nand $C^{\\textsf{L}}$ is the amount of corruption on losses.\n","authors":["Tiancheng Jin","Junyan Liu","Chloé Rouyer","William Chang","Chen-Yu Wei","Haipeng Luo"],"pdf_url":"https://arxiv.org/pdf/2305.17380v3.pdf","comment":"Update the camera-ready version for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17588v1","updated":"2023-10-26T17:09:13Z","published":"2023-10-26T17:09:13Z","title":"PAC-tuning:Fine-tuning Pretrained Language Models with PAC-driven\n Perturbed Gradient Descent","summary":" Fine-tuning pretrained language models (PLMs) for downstream tasks is a\nlarge-scale optimization problem, in which the choice of the training algorithm\ncritically determines how well the trained model can generalize to unseen test\ndata, especially in the context of few-shot learning. To achieve good\ngeneralization performance and avoid overfitting, techniques such as data\naugmentation and pruning are often applied. However, adding these\nregularizations necessitates heavy tuning of the hyperparameters of\noptimization algorithms, such as the popular Adam optimizer. In this paper, we\npropose a two-stage fine-tuning method, PAC-tuning, to address this\noptimization challenge. First, based on PAC-Bayes training, PAC-tuning directly\nminimizes the PAC-Bayes generalization bound to learn proper parameter\ndistribution. Second, PAC-tuning modifies the gradient by injecting noise with\nthe variance learned in the first stage into the model parameters during\ntraining, resulting in a variant of perturbed gradient descent (PGD). 
In the\npast, the few-shot scenario posed difficulties for PAC-Bayes training because\nthe PAC-Bayes bound, when applied to large models with limited training data,\nmight not be stringent. Our experimental results across 5 GLUE benchmark tasks\ndemonstrate that PAC-tuning successfully handles the challenges of fine-tuning\ntasks and outperforms strong baseline methods by a visible margin, further\nconfirming the potential to apply PAC training for any other settings where the\nAdam optimizer is currently used for training.\n","authors":["Guangliang Liu","Zhiyu Xue","Xitong Zhang","Kristen Marie Johnson","Rongrong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17588v1.pdf","comment":"Accepted to EMNLP23 main"},{"id":"http://arxiv.org/abs/2301.12593v2","updated":"2023-10-26T17:07:59Z","published":"2023-01-30T00:37:06Z","title":"Risk-Averse Model Uncertainty for Distributionally Robust Safe\n Reinforcement Learning","summary":" Many real-world domains require safe decision making in uncertain\nenvironments. In this work, we introduce a deep reinforcement learning\nframework for approaching this important problem. We consider a distribution\nover transition models, and apply a risk-averse perspective towards model\nuncertainty through the use of coherent distortion risk measures. We provide\nrobustness guarantees for this framework by showing it is equivalent to a\nspecific class of distributionally robust safe reinforcement learning problems.\nUnlike existing approaches to robustness in deep reinforcement learning,\nhowever, our formulation does not involve minimax optimization. This leads to\nan efficient, model-free implementation of our approach that only requires\nstandard data collection from a single training environment. 
In experiments on\ncontinuous control tasks with safety constraints, we demonstrate that our\nframework produces robust performance and safety at deployment time across a\nrange of perturbed test environments.\n","authors":["James Queeney","Mouhacine Benosman"],"pdf_url":"https://arxiv.org/pdf/2301.12593v2.pdf","comment":"37th Conference on Neural Information Processing Systems (NeurIPS\n 2023)"},{"id":"http://arxiv.org/abs/2310.17584v1","updated":"2023-10-26T17:07:43Z","published":"2023-10-26T17:07:43Z","title":"A minimax optimal control approach for robust neural ODEs","summary":" In this paper, we address the adversarial training of neural ODEs from a\nrobust control perspective. This is an alternative to the classical training\nvia empirical risk minimization, and it is widely used to enforce reliable\noutcomes for input perturbations. Neural ODEs allow the interpretation of deep\nneural networks as discretizations of control systems, unlocking powerful tools\nfrom control theory for the development and the understanding of machine\nlearning. In this specific case, we formulate the adversarial training with\nperturbed data as a minimax optimal control problem, for which we derive first\norder optimality conditions in the form of Pontryagin's Maximum Principle. 
We\nprovide a novel interpretation of robust training leading to an alternative\nweighted technique, which we test on a low-dimensional classification task.\n","authors":["Cristina Cipriani","Alessandro Scagliotti","Tobias Wöhrer"],"pdf_url":"https://arxiv.org/pdf/2310.17584v1.pdf","comment":"6 pages, 2 figures and 1 table"},{"id":"http://arxiv.org/abs/2310.17582v1","updated":"2023-10-26T17:06:23Z","published":"2023-10-26T17:06:23Z","title":"Convergence of flow-based generative models via proximal gradient\n descent in Wasserstein space","summary":" Flow-based generative models enjoy certain advantages in computing the data\ngeneration and the likelihood, and have recently shown competitive empirical\nperformance. Compared to the accumulating theoretical studies on related\nscore-based diffusion models, analysis of flow-based models, which are\ndeterministic in both forward (data-to-noise) and reverse (noise-to-data)\ndirections, remains sparse. In this paper, we provide a theoretical guarantee of\ngenerating data distribution by a progressive flow model, the so-called JKO\nflow model, which implements the Jordan-Kinderlehrer-Otto (JKO) scheme in a\nnormalizing flow network. Leveraging the exponential convergence of the\nproximal gradient descent (GD) in Wasserstein space, we prove the\nKullback-Leibler (KL) guarantee of data generation by a JKO flow model to be\n$O(\\varepsilon^2)$ when using $N \\lesssim \\log (1/\\varepsilon)$ many JKO steps\n($N$ Residual Blocks in the flow) where $\\varepsilon $ is the error in the\nper-step first-order condition. The assumption on data density is merely a\nfinite second moment, and the theory extends to data distributions without\ndensity and when there are inversion errors in the reverse process where we\nobtain KL-$W_2$ mixed error guarantees. 
The non-asymptotic convergence rate of\nthe JKO-type $W_2$-proximal GD is proved for a general class of convex\nobjective functionals that includes the KL divergence as a special case, which\ncan be of independent interest.\n","authors":["Xiuyuan Cheng","Jianfeng Lu","Yixin Tan","Yao Xie"],"pdf_url":"https://arxiv.org/pdf/2310.17582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17579v1","updated":"2023-10-26T17:03:14Z","published":"2023-10-26T17:03:14Z","title":"BLIS-Net: Classifying and Analyzing Signals on Graphs","summary":" Graph neural networks (GNNs) have emerged as a powerful tool for tasks such\nas node classification and graph classification. However, much less work has\nbeen done on signal classification, where the data consists of many functions\n(referred to as signals) defined on the vertices of a single graph. These tasks\nrequire networks designed differently from those designed for traditional GNN\ntasks. Indeed, traditional GNNs rely on localized low-pass filters, and signals\nof interest may have intricate multi-frequency behavior and exhibit long range\ninteractions. This motivates us to introduce the BLIS-Net (Bi-Lipschitz\nScattering Net), a novel GNN that builds on the previously introduced geometric\nscattering transform. Our network is able to capture both local and global\nsignal structure and is able to capture both low-frequency and high-frequency\ninformation. 
We make several crucial changes to the original geometric\nscattering architecture which we prove increase the ability of our network to\ncapture information about the input signal and show that BLIS-Net achieves\nsuperior performance on both synthetic and real-world data sets based on\ntraffic flow and fMRI data.\n","authors":["Charles Xu","Laney Goldman","Valentina Guo","Benjamin Hollander-Bodie","Maedee Trank-Greene","Ian Adelstein","Edward De Brouwer","Rex Ying","Smita Krishnaswamy","Michael Perlmutter"],"pdf_url":"https://arxiv.org/pdf/2310.17579v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13632v2","updated":"2023-10-26T17:02:56Z","published":"2023-05-23T02:59:25Z","title":"Detecting and Mitigating Hallucinations in Multilingual Summarisation","summary":" Hallucinations pose a significant challenge to the reliability of neural\nmodels for abstractive summarisation. While automatically generated summaries\nmay be fluent, they often lack faithfulness to the original document. This\nissue becomes even more pronounced in low-resource settings, such as\ncross-lingual transfer. With the existing faithful metrics focusing on English,\neven measuring the extent of this phenomenon in cross-lingual settings is hard.\nTo address this, we first develop a novel metric, mFACT, evaluating the\nfaithfulness of non-English summaries, leveraging translation-based transfer\nfrom multiple English faithfulness metrics. We then propose a simple but\neffective method to reduce hallucinations with a cross-lingual transfer, which\nweighs the loss of each training example by its faithfulness score. Through\nextensive experiments in multiple languages, we demonstrate that mFACT is the\nmetric that is most suited to detect hallucinations. Moreover, we find that our\nproposed loss weighting method drastically increases both performance and\nfaithfulness according to both automatic and human evaluation when compared to\nstrong baselines for cross-lingual transfer such as MAD-X. 
Our code and dataset\nare available at https://github.com/yfqiu-nlp/mfact-summ.\n","authors":["Yifu Qiu","Yftah Ziser","Anna Korhonen","Edoardo M. Ponti","Shay B. Cohen"],"pdf_url":"https://arxiv.org/pdf/2305.13632v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2107.07420v2","updated":"2023-10-26T17:02:12Z","published":"2021-07-15T16:05:48Z","title":"Optimal Scoring Rule Design under Partial Knowledge","summary":" This paper studies the design of optimal proper scoring rules when the\nprincipal has partial knowledge of an agent's signal distribution. Recent work\ncharacterizes the proper scoring rules that maximize the increase of an agent's\npayoff when the agent chooses to access a costly signal to refine a posterior\nbelief from her prior prediction, under the assumption that the agent's signal\ndistribution is fully known to the principal. In our setting, the principal\nonly knows about a set of distributions where the agent's signal distribution\nbelongs. We formulate the scoring rule design problem as a max-min optimization\nthat maximizes the worst-case increase in payoff across the set of\ndistributions.\n We propose an efficient algorithm to compute an optimal scoring rule when the\nset of distributions is finite, and devise a fully polynomial-time\napproximation scheme that accommodates various infinite sets of distributions.\nWe further remark that widely used scoring rules, such as the quadratic and log\nrules, as well as previously identified optimal scoring rules under full\nknowledge, can be far from optimal in our partial knowledge settings.\n","authors":["Yiling Chen","Fang-Yi Yu"],"pdf_url":"https://arxiv.org/pdf/2107.07420v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17569v1","updated":"2023-10-26T16:58:01Z","published":"2023-10-26T16:58:01Z","title":"SD4Match: Learning to Prompt Stable Diffusion Model for Semantic\n Matching","summary":" In this paper, we address the challenge of matching semantically similar\nkeypoints 
across image pairs. Existing research indicates that the intermediate\noutput of the UNet within the Stable Diffusion (SD) can serve as robust image\nfeature maps for such a matching task. We demonstrate that by employing a basic\nprompt tuning technique, the inherent potential of Stable Diffusion can be\nharnessed, resulting in a significant enhancement in accuracy over previous\napproaches. We further introduce a novel conditional prompting module that\nconditions the prompt on the local details of the input image pairs, leading to\na further improvement in performance. We designate our approach as SD4Match,\nshort for Stable Diffusion for Semantic Matching. Comprehensive evaluations of\nSD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets\nnew benchmarks in accuracy across all these datasets. Particularly, SD4Match\noutperforms the previous state-of-the-art by a margin of 12 percentage points\non the challenging SPair-71k dataset.\n","authors":["Xinghui Li","Jingyi Lu","Kai Han","Victor Prisacariu"],"pdf_url":"https://arxiv.org/pdf/2310.17569v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17567v1","updated":"2023-10-26T16:55:05Z","published":"2023-10-26T16:55:05Z","title":"Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models","summary":" With LLMs shifting their role from statistical modeling of language to\nserving as general-purpose AI agents, how should LLM evaluations change?\nArguably, a key ability of an AI agent is to flexibly combine, as needed, the\nbasic skills it has learned. The capability to combine skills plays an\nimportant role in (human) pedagogy and also in a paper on emergence phenomena\n(Arora & Goyal, 2023).\n This work introduces Skill-Mix, a new evaluation to measure ability to\ncombine skills. Using a list of $N$ skills the evaluator repeatedly picks\nrandom subsets of $k$ skills and asks the LLM to produce text combining that\nsubset of skills. 
Since the number of subsets grows like $N^k$, for even modest\n$k$ this evaluation will, with high probability, require the LLM to produce\ntext significantly different from any text in the training set. The paper\ndevelops a methodology for (a) designing and administering such an evaluation,\nand (b) automatic grading (plus spot-checking by humans) of the results using\nGPT-4 as well as the open LLaMA-2 70B model.\n Administering a version of Skill-Mix to popular chatbots gave results that, while\ngenerally in line with prior expectations, contained surprises. Sizeable\ndifferences exist among model capabilities that are not captured by their\nranking on popular LLM leaderboards (\"cramming for the leaderboard\").\nFurthermore, simple probability calculations indicate that GPT-4's reasonable\nperformance on $k=5$ is suggestive of going beyond \"stochastic parrot\" behavior\n(Bender et al., 2021), i.e., it combines skills in ways that it had not seen\nduring training.\n We sketch how the methodology can lead to a Skill-Mix based eco-system of\nopen evaluations for AI capabilities of future models.\n","authors":["Dingli Yu","Simran Kaur","Arushi Gupta","Jonah Brown-Cohen","Anirudh Goyal","Sanjeev Arora"],"pdf_url":"https://arxiv.org/pdf/2310.17567v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14267v2","updated":"2023-10-26T16:51:28Z","published":"2023-05-23T17:19:54Z","title":"SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from\n Diffusion Models","summary":" A potent class of generative models known as Diffusion Probabilistic Models\n(DPMs) has become prominent. A forward diffusion process gradually adds noise\nto data, while a model learns to gradually denoise. Sampling from pre-trained\nDPMs is obtained by solving differential equations (DE) defined by the learnt\nmodel, a process which has been shown to be prohibitively slow. 
Numerous efforts on\nspeeding-up this process have consisted on crafting powerful ODE solvers.\nDespite being quick, such solvers do not usually reach the optimal quality\nachieved by available slow SDE solvers. Our goal is to propose SDE solvers that\nreach optimal quality without requiring several hundreds or thousands of NFEs\nto achieve that goal. We propose Stochastic Explicit Exponential\nDerivative-free Solvers (SEEDS), improving and generalizing Exponential\nIntegrator approaches to the stochastic case on several frameworks. After\ncarefully analyzing the formulation of exact solutions of diffusion SDEs, we\ncraft SEEDS to analytically compute the linear part of such solutions. Inspired\nby the Exponential Time-Differencing method, SEEDS use a novel treatment of the\nstochastic components of solutions, enabling the analytical computation of\ntheir variance, and contains high-order terms allowing to reach optimal quality\nsampling $\\sim3$-$5\\times$ faster than previous SDE methods. We validate our\napproach on several image generation benchmarks, showing that SEEDS outperform\nor are competitive with previous SDE solvers. Contrary to the latter, SEEDS are\nderivative and training free, and we fully prove strong convergence guarantees\nfor them.\n","authors":["Martin Gonzalez","Nelson Fernandez","Thuy Tran","Elies Gherbi","Hatem Hajri","Nader Masmoudi"],"pdf_url":"https://arxiv.org/pdf/2305.14267v2.pdf","comment":"60 pages. Camera-Ready version for the 37th Conference on Neural\n Information Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.17561v1","updated":"2023-10-26T16:49:44Z","published":"2023-10-26T16:49:44Z","title":"Bifurcations and loss jumps in RNN training","summary":" Recurrent neural networks (RNNs) are popular machine learning tools for\nmodeling and forecasting sequential data and for inferring dynamical systems\n(DS) from observed time series. 
Concepts from DS theory (DST) have variously\nbeen used to further our understanding of both, how trained RNNs solve complex\ntasks, and the training process itself. Bifurcations are particularly important\nphenomena in DS, including RNNs, that refer to topological (qualitative)\nchanges in a system's dynamical behavior as one or more of its parameters are\nvaried. Knowing the bifurcation structure of an RNN will thus allow to deduce\nmany of its computational and dynamical properties, like its sensitivity to\nparameter variations or its behavior during training. In particular,\nbifurcations may account for sudden loss jumps observed in RNN training that\ncould severely impede the training process. Here we first mathematically prove\nfor a particular class of ReLU-based RNNs that certain bifurcations are indeed\nassociated with loss gradients tending toward infinity or zero. We then\nintroduce a novel heuristic algorithm for detecting all fixed points and\nk-cycles in ReLU-based RNNs and their existence and stability regions, hence\nbifurcation manifolds in parameter space. In contrast to previous numerical\nalgorithms for finding fixed points and common continuation methods, our\nalgorithm provides exact results and returns fixed points and cycles up to high\norders with surprisingly good scaling behavior. We exemplify the algorithm on\nthe analysis of the training process of RNNs, and find that the recently\nintroduced technique of generalized teacher forcing completely avoids certain\ntypes of bifurcations in training. 
Thus, besides facilitating the DST analysis\nof trained RNNs, our algorithm provides a powerful instrument for analyzing the\ntraining process itself.\n","authors":["Lukas Eisenmann","Zahra Monfared","Niclas Alexander Göring","Daniel Durstewitz"],"pdf_url":"https://arxiv.org/pdf/2310.17561v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17558v1","updated":"2023-10-26T16:47:52Z","published":"2023-10-26T16:47:52Z","title":"Towards Matching Phones and Speech Representations","summary":" Learning phone types from phone instances has been a long-standing problem,\nwhile still being open. In this work, we revisit this problem in the context of\nself-supervised learning, and pose it as the problem of matching cluster\ncentroids to phone embeddings. We study two key properties that enable\nmatching, namely, whether cluster centroids of self-supervised representations\nreduce the variability of phone instances and respect the relationship among\nphones. We then use the matching result to produce pseudo-labels and introduce\na new loss function for improving self-supervised representations. Our\nexperiments show that the matching result captures the relationship among\nphones. Training the new loss function jointly with the regular self-supervised\nlosses, such as APC and CPC, significantly improves the downstream phone\nclassification.\n","authors":["Gene-Ping Yang","Hao Tang"],"pdf_url":"https://arxiv.org/pdf/2310.17558v1.pdf","comment":"Accepted to ASRU 2023"},{"id":"http://arxiv.org/abs/2310.17556v1","updated":"2023-10-26T16:46:13Z","published":"2023-10-26T16:46:13Z","title":"Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient\n Descent","summary":" We propose a new algorithm for efficiently solving the damped Fisher matrix\nin large-scale scenarios where the number of parameters significantly exceeds\nthe number of available samples. This problem is fundamental for natural\ngradient descent and stochastic reconfiguration. 
Our algorithm is based on\nCholesky decomposition and is generally applicable. Benchmark results show that\nthe algorithm is significantly faster than existing methods.\n","authors":["Yixiao Chen","Hao Xie","Han Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17556v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17555v1","updated":"2023-10-26T16:46:12Z","published":"2023-10-26T16:46:12Z","title":"Interactive Robot Learning from Verbal Correction","summary":" The ability to learn and refine behavior after deployment has become ever\nmore important for robots as we design them to operate in unstructured\nenvironments like households. In this work, we design a new learning system\nbased on a large language model (LLM), OLAF, that allows everyday users to teach\na robot using verbal corrections when the robot makes mistakes, e.g., by saying\n\"Stop what you're doing. You should move closer to the cup.\" A key feature of\nOLAF is its ability to update the robot's visuomotor neural policy based on the\nverbal feedback to avoid repeating mistakes in the future. This is in contrast\nto existing LLM-based robotic systems, which only follow verbal commands or\ncorrections but do not learn from them. We demonstrate the efficacy of our design\nin experiments where a user teaches a robot to perform long-horizon\nmanipulation tasks both in simulation and on physical hardware, achieving on\naverage 20.0% improvement in policy success rate. 
Videos and more results are\nat https://ut-austin-rpl.github.io/olaf/\n","authors":["Huihan Liu","Alice Chen","Yuke Zhu","Adith Swaminathan","Andrey Kolobov","Ching-An Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.17555v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17552v1","updated":"2023-10-26T16:45:44Z","published":"2023-10-26T16:45:44Z","title":"Model-Based Runtime Monitoring with Interactive Imitation Learning","summary":" Robot learning methods have recently made great strides, but generalization\nand robustness challenges still hinder their widespread deployment. Failing to\ndetect and address potential failures renders state-of-the-art learning systems\nnot combat-ready for high-stakes tasks. Recent advances in interactive\nimitation learning have presented a promising framework for human-robot\nteaming, enabling the robots to operate safely and continually improve their\nperformances over long-term deployments. Nonetheless, existing methods\ntypically require constant human supervision and preemptive feedback, limiting\ntheir practicality in realistic domains. This work aims to endow a robot with\nthe ability to monitor and detect errors during task execution. We introduce a\nmodel-based runtime monitoring algorithm that learns from deployment data to\ndetect system anomalies and anticipate failures. Unlike prior work that cannot\nforesee future failures or requires failure experiences for training, our\nmethod learns a latent-space dynamics model and a failure classifier, enabling\nour method to simulate future action outcomes and detect out-of-distribution\nand high-risk states preemptively. We train our method within an interactive\nimitation learning framework, where it continually updates the model from the\nexperiences of the human-robot team collected using trustworthy deployments.\nConsequently, our method reduces the human workload needed over time while\nensuring reliable task execution. 
Our method outperforms the baselines across\nsystem-level and unit-test metrics, with 23% and 40% higher success rates in\nsimulation and on physical hardware, respectively. More information at\nhttps://ut-austin-rpl.github.io/sirius-runtime-monitor/\n","authors":["Huihan Liu","Shivin Dass","Roberto Martín-Martín","Yuke Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.17552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17550v1","updated":"2023-10-26T16:45:34Z","published":"2023-10-26T16:45:34Z","title":"Human-Guided Complexity-Controlled Abstractions","summary":" Neural networks often learn task-specific latent representations that fail to\ngeneralize to novel settings or tasks. Conversely, humans learn discrete\nrepresentations (i.e., concepts or words) at a variety of abstraction levels\n(e.g., ``bird'' vs. ``sparrow'') and deploy the appropriate abstraction based\non task. Inspired by this, we train neural models to generate a spectrum of\ndiscrete representations, and control the complexity of the representations\n(roughly, how many bits are allocated for encoding inputs) by tuning the\nentropy of the distribution over representations. In finetuning experiments,\nusing only a small number of labeled examples for a new task, we show that (1)\ntuning the representation to a task-appropriate complexity level supports the\nhighest finetuning performance, and (2) in a human-participant study, users\nwere able to identify the appropriate complexity level for a downstream task\nusing visualizations of discrete representations. 
Our results indicate a\npromising direction for rapid model finetuning by leveraging human insight.\n","authors":["Andi Peng","Mycal Tucker","Eoin Kenny","Noga Zaslavsky","Pulkit Agrawal","Julie Shah"],"pdf_url":"https://arxiv.org/pdf/2310.17550v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17544v1","updated":"2023-10-26T16:40:09Z","published":"2023-10-26T16:40:09Z","title":"Hierarchical Ensemble-Based Feature Selection for Time Series\n Forecasting","summary":" We study a novel ensemble approach for feature selection based on\nhierarchical stacking in cases of non-stationarity and limited number of\nsamples with large number of features. Our approach exploits the co-dependency\nbetween features using a hierarchical structure. Initially, a machine learning\nmodel is trained using a subset of features, and then the model's output is\nupdated using another algorithm with the remaining features to minimize the\ntarget loss. This hierarchical structure allows for flexible depth and feature\nselection. By exploiting feature co-dependency hierarchically, our proposed\napproach overcomes the limitations of traditional feature selection methods and\nfeature importance scores. The effectiveness of the approach is demonstrated on\nsynthetic and real-life datasets, indicating improved performance with\nscalability and stability compared to the traditional methods and\nstate-of-the-art approaches.\n","authors":["Aysin Tumay","Mustafa E. Aydin","Suleyman S. Kozat"],"pdf_url":"https://arxiv.org/pdf/2310.17544v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17540v1","updated":"2023-10-26T16:32:34Z","published":"2023-10-26T16:32:34Z","title":"EqDrive: Efficient Equivariant Motion Forecasting with Multi-Modality\n for Autonomous Driving","summary":" Forecasting vehicular motions in autonomous driving requires a deep\nunderstanding of agent interactions and the preservation of motion equivariance\nunder Euclidean geometric transformations. 
Traditional models often lack the\nsophistication needed to handle the intricate dynamics inherent to autonomous\nvehicles and the interaction relationships among agents in the scene. As a\nresult, these models have a lower model capacity, which then leads to higher\nprediction errors and lower training efficiency. In our research, we employ\nEqMotion, a leading equivariant particle, and human prediction model that also\naccounts for invariant agent interactions, for the task of multi-agent vehicle\nmotion forecasting. In addition, we use a multi-modal prediction mechanism to\naccount for multiple possible future paths in a probabilistic manner. By\nleveraging EqMotion, our model achieves state-of-the-art (SOTA) performance\nwith fewer parameters (1.2 million) and a significantly reduced training time\n(less than 2 hours).\n","authors":["Yuping Wang","Jier Chen"],"pdf_url":"https://arxiv.org/pdf/2310.17540v1.pdf","comment":"6 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.17538v1","updated":"2023-10-26T16:28:29Z","published":"2023-10-26T16:28:29Z","title":"Little Exploration is All You Need","summary":" The prevailing principle of \"Optimism in the Face of Uncertainty\" advocates\nfor the incorporation of an exploration bonus, generally assumed to be\nproportional to the inverse square root of the visit count ($1/\\sqrt{n}$),\nwhere $n$ is the number of visits to a particular state-action pair. This\napproach, however, exclusively focuses on \"uncertainty,\" neglecting the\ninherent \"difficulty\" of different options. To address this gap, we introduce a\nnovel modification of standard UCB algorithm in the multi-armed bandit problem,\nproposing an adjusted bonus term of $1/n^\\tau$, where $\\tau > 1/2$, that\naccounts for task difficulty. Our proposed algorithm, denoted as UCB$^\\tau$, is\nsubstantiated through comprehensive regret and risk analyses, confirming its\ntheoretical robustness. 
Comparative evaluations with standard UCB and Thompson\nSampling algorithms on synthetic datasets demonstrate that UCB$^\\tau$ not only\noutperforms in efficacy but also exhibits lower risk across various\nenvironmental conditions and hyperparameter settings.\n","authors":["Henry H. H. Chen","Jiaming Lu"],"pdf_url":"https://arxiv.org/pdf/2310.17538v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17537v1","updated":"2023-10-26T16:28:17Z","published":"2023-10-26T16:28:17Z","title":"Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic\n Forgetting in Curiosity","summary":" Deep reinforcement learning methods exhibit impressive performance on a range\nof tasks but still struggle on hard exploration tasks in large environments\nwith sparse rewards. To address this, intrinsic rewards can be generated using\nforward model prediction errors that decrease as the environment becomes known,\nand incentivize an agent to explore novel states. While prediction-based\nintrinsic rewards can help agents solve hard exploration tasks, they can suffer\nfrom catastrophic forgetting and actually increase at visited states. We first\nexamine the conditions and causes of catastrophic forgetting in grid world\nenvironments. We then propose a new method FARCuriosity, inspired by how humans\nand animals learn. The method depends on fragmentation and recall: an agent\nfragments an environment based on surprisal, and uses different local curiosity\nmodules (prediction-based intrinsic reward functions) for each fragment so that\nmodules are not trained on the entire environment. At each fragmentation event,\nthe agent stores the current module in long-term memory (LTM) and either\ninitializes a new module or recalls a previously stored module based on its\nmatch with the current state. With fragmentation and recall, FARCuriosity\nachieves less forgetting and better overall performance in games with varied\nand heterogeneous environments in the Atari benchmark suite of tasks. 
Thus,\nthis work highlights the problem of catastrophic forgetting in prediction-based\ncuriosity methods and proposes a solution.\n","authors":["Jaedong Hwang","Zhang-Wei Hong","Eric Chen","Akhilan Boopathy","Pulkit Agrawal","Ila Fiete"],"pdf_url":"https://arxiv.org/pdf/2310.17537v1.pdf","comment":"NeurIPS 2023 Workshop - Intrinsically Motivated Open-ended Learning"},{"id":"http://arxiv.org/abs/2310.17534v1","updated":"2023-10-26T16:23:40Z","published":"2023-10-26T16:23:40Z","title":"SoK: Pitfalls in Evaluating Black-Box Attacks","summary":" Numerous works study black-box attacks on image classifiers. However, these\nworks make different assumptions on the adversary's knowledge and current\nliterature lacks a cohesive organization centered around the threat model. To\nsystematize knowledge in this area, we propose a taxonomy over the threat space\nspanning the axes of feedback granularity, the access of interactive queries,\nand the quality and quantity of the auxiliary data available to the attacker.\nOur new taxonomy provides three key insights. 1) Despite extensive literature,\nnumerous under-explored threat spaces exist, which cannot be trivially solved\nby adapting techniques from well-explored settings. We demonstrate this by\nestablishing a new state-of-the-art in the less-studied setting of access to\ntop-k confidence scores by adapting techniques from well-explored settings of\naccessing the complete confidence vector, but show how it still falls short of\nthe more restrictive setting that only obtains the prediction label,\nhighlighting the need for more research. 2) Identifying the threat model of\ndifferent attacks uncovers stronger baselines that challenge prior\nstate-of-the-art claims. We demonstrate this by enhancing an initially weaker\nbaseline (under interactive query access) via surrogate models, effectively\noverturning claims in the respective paper. 
3) Our taxonomy reveals\ninteractions between attacker knowledge that connect well to related areas,\nsuch as model inversion and extraction attacks. We discuss how advances in\nother areas can enable potentially stronger black-box attacks. Finally, we\nemphasize the need for a more realistic assessment of attack success by\nfactoring in local attack runtime. This approach reveals the potential for\ncertain attacks to achieve notably higher success rates and the need to\nevaluate attacks in diverse and harder settings, highlighting the need for\nbetter selection criteria.\n","authors":["Fnu Suya","Anshuman Suri","Tingwei Zhang","Jingtao Hong","Yuan Tian","David Evans"],"pdf_url":"https://arxiv.org/pdf/2310.17534v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17531v1","updated":"2023-10-26T16:19:24Z","published":"2023-10-26T16:19:24Z","title":"Learning Regularized Graphon Mean-Field Games with Unknown Graphons","summary":" We design and analyze reinforcement learning algorithms for Graphon\nMean-Field Games (GMFGs). In contrast to previous works that require the\nprecise values of the graphons, we aim to learn the Nash Equilibrium (NE) of\nthe regularized GMFGs when the graphons are unknown. Our contributions are\nthreefold. First, we propose the Proximal Policy Optimization for GMFG\n(GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$\nafter $T$ iterations with an estimation oracle, improving on a previous work by\nXie et al. (ICML, 2021). Second, using kernel embedding of distributions, we\ndesign efficient algorithms to estimate the transition kernels, reward\nfunctions, and graphons from sampled agents. Convergence rates are then derived\nwhen the positions of the agents are either known or unknown. Results for the\ncombination of the optimization algorithm GMFG-PPO and the estimation algorithm\nare then provided. These algorithms are the first specifically designed for\nlearning graphons from sampled agents. 
Finally, the efficacy of the proposed\nalgorithms is corroborated through simulations. These simulations demonstrate\nthat learning the unknown graphons reduces the exploitability effectively.\n","authors":["Fengzhuo Zhang","Vincent Y. F. Tan","Zhaoran Wang","Zhuoran Yang"],"pdf_url":"https://arxiv.org/pdf/2310.17531v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17530v1","updated":"2023-10-26T16:19:19Z","published":"2023-10-26T16:19:19Z","title":"Evaluating Bias and Fairness in Gender-Neutral Pretrained\n Vision-and-Language Models","summary":" Pretrained machine learning models are known to perpetuate and even amplify\nexisting biases in data, which can result in unfair outcomes that ultimately\nimpact user experience. Therefore, it is crucial to understand the mechanisms\nbehind those prejudicial biases to ensure that model performance does not\nresult in discriminatory behaviour toward certain groups or populations. In\nthis work, we define gender bias as our case study. We quantify bias\namplification in pretraining and after fine-tuning on three families of\nvision-and-language models. We investigate the connection, if any, between the\ntwo learning stages, and evaluate how bias amplification reflects on model\nperformance. Overall, we find that bias amplification in pretraining and after\nfine-tuning are independent. We then examine the effect of continued\npretraining on gender-neutral data, finding that this reduces group\ndisparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without\nsignificantly compromising task performance.\n","authors":["Laura Cabello","Emanuele Bugliarello","Stephanie Brandl","Desmond Elliott"],"pdf_url":"https://arxiv.org/pdf/2310.17530v1.pdf","comment":"To appear in EMNLP 2024"},{"id":"http://arxiv.org/abs/2310.17526v1","updated":"2023-10-26T16:18:30Z","published":"2023-10-26T16:18:30Z","title":"Can large language models replace humans in the systematic review\n process? 
Evaluating GPT-4's efficacy in screening and extracting data from\n peer-reviewed and grey literature in multiple languages","summary":" Systematic reviews are vital for guiding practice, research, and policy, yet\nthey are often slow and labour-intensive. Large language models (LLMs) could\noffer a way to speed up and automate systematic reviews, but their performance\nin such tasks has not been comprehensively evaluated against humans, and no\nstudy has tested GPT-4, the biggest LLM so far. This pre-registered study\nevaluates GPT-4's capability in title/abstract screening, full-text review, and\ndata extraction across various literature types and languages using a\n'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human\nperformance in most tasks, results were skewed by chance agreement and dataset\nimbalance. After adjusting for these, there was a moderate level of performance\nfor data extraction, and - barring studies that used highly reliable prompts -\nscreening performance levelled at none to moderate for different stages and\nlanguages. When screening full-text literature using highly reliable prompts,\nGPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key\nstudies using highly reliable prompts improved its performance even more. Our\nfindings indicate that, currently, substantial caution should be used if LLMs\nare being used to conduct systematic reviews, but suggest that, for certain\nsystematic review tasks delivered under reliable prompts, LLMs can rival human\nperformance.\n","authors":["Qusai Khraisha","Sophie Put","Johanna Kappenberg","Azza Warraitch","Kristin Hadfield"],"pdf_url":"https://arxiv.org/pdf/2310.17526v1.pdf","comment":"9 pages, 2 figures, 1 table"},{"id":"http://arxiv.org/abs/2303.11249v4","updated":"2023-10-26T16:10:27Z","published":"2023-03-20T16:34:39Z","title":"What Makes Data Suitable for a Locally Connected Neural Network? 
A\n Necessary and Sufficient Condition Based on Quantum Entanglement","summary":" The question of what makes a data distribution suitable for deep learning is\na fundamental open problem. Focusing on locally connected neural networks (a\nprevalent family of architectures that includes convolutional and recurrent\nneural networks as well as local self-attention models), we address this\nproblem by adopting theoretical tools from quantum physics. Our main\ntheoretical result states that a certain locally connected neural network is\ncapable of accurate prediction over a data distribution if and only if the data\ndistribution admits low quantum entanglement under certain canonical partitions\nof features. As a practical application of this result, we derive a\npreprocessing method for enhancing the suitability of a data distribution to\nlocally connected neural networks. Experiments with widespread models over\nvarious datasets demonstrate our findings. We hope that our use of quantum\nentanglement will encourage further adoption of tools from physics for formally\nreasoning about the relation between deep learning and real-world data.\n","authors":["Yotam Alexander","Nimrod De La Vega","Noam Razin","Nadav Cohen"],"pdf_url":"https://arxiv.org/pdf/2303.11249v4.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17513v1","updated":"2023-10-26T16:08:33Z","published":"2023-10-26T16:08:33Z","title":"The Expressive Power of Low-Rank Adaptation","summary":" Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that\nleverages low-rank adaptation of weight matrices, has emerged as a prevalent\ntechnique for fine-tuning pre-trained models such as large language models and\ndiffusion models. Despite its huge success in practice, the theoretical\nunderpinnings of LoRA have largely remained unexplored. This paper takes the\nfirst step to bridge this gap by theoretically analyzing the expressive power\nof LoRA. 
We prove that, for fully connected neural networks, LoRA can adapt any\nmodel $f$ to accurately represent any smaller target model $\\overline{f}$ if\nLoRA-rank $\\geq(\\text{width of }f) \\times \\frac{\\text{depth of\n}\\overline{f}}{\\text{depth of }f}$. We also quantify the approximation error\nwhen LoRA-rank is lower than the threshold. For Transformer networks, we show\nany model can be adapted to a target model of the same size with\nrank-$(\\frac{\\text{embedding size}}{2})$ LoRA adapters.\n","authors":["Yuchen Zeng","Kangwook Lee"],"pdf_url":"https://arxiv.org/pdf/2310.17513v1.pdf","comment":"40 pages,5 figures"},{"id":"http://arxiv.org/abs/2305.19693v3","updated":"2023-10-26T16:02:56Z","published":"2023-05-31T09:36:34Z","title":"Spontaneous Symmetry Breaking in Generative Diffusion Models","summary":" Generative diffusion models have recently emerged as a leading approach for\ngenerating high-dimensional data. In this paper, we show that the dynamics of\nthese models exhibit a spontaneous symmetry breaking that divides the\ngenerative dynamics into two distinct phases: 1) A linear steady-state dynamics\naround a central fixed-point and 2) an attractor dynamics directed towards the\ndata manifold. These two \"phases\" are separated by the change in stability of\nthe central fixed-point, with the resulting window of instability being\nresponsible for the diversity of the generated samples. Using both theoretical\nand empirical evidence, we show that an accurate simulation of the early\ndynamics does not significantly contribute to the final generation, since early\nfluctuations are reverted to the central fixed point. To leverage this insight,\nwe propose a Gaussian late initialization scheme, which significantly improves\nmodel performance, achieving up to 3x FID improvements on fast samplers, while\nalso increasing sample diversity (e.g., racial composition of generated CelebA\nimages). 
Our work offers a new way to understand the generative dynamics of\ndiffusion models that has the potential to bring about higher performance and\nless biased fast-samplers.\n","authors":["Gabriel Raya","Luca Ambrogioni"],"pdf_url":"https://arxiv.org/pdf/2305.19693v3.pdf","comment":"As published at NeurIPS 2023, and the size of the file has been\n optimized for fast downloading"},{"id":"http://arxiv.org/abs/2207.04306v3","updated":"2023-10-26T15:57:11Z","published":"2022-07-09T17:21:21Z","title":"Out-of-Distribution Detection in Time-Series Domain: A Novel Seasonal\n Ratio Scoring Approach","summary":" Safe deployment of time-series classifiers for real-world applications relies\non the ability to detect the data which is not generated from the same\ndistribution as training data. This task is referred to as out-of-distribution\n(OOD) detection. We consider the novel problem of OOD detection for the\ntime-series domain. We discuss the unique challenges posed by time-series data\nand explain why prior methods from the image domain will perform poorly.\nMotivated by these challenges, this paper proposes a novel {\\em Seasonal Ratio\nScoring (SRS)} approach. SRS consists of three key algorithmic steps. First,\neach input is decomposed into class-wise semantic component and remainder.\nSecond, this decomposition is employed to estimate the class-wise conditional\nlikelihoods of the input and remainder using deep generative models. The\nseasonal ratio score is computed from these estimates. 
Third, a threshold\ninterval is identified from the in-distribution data to detect OOD examples.\nExperiments on diverse real-world benchmarks demonstrate that the SRS method is\nwell-suited for time-series OOD detection when compared to baseline methods.\nOpen-source code for SRS method is provided at\nhttps://github.com/tahabelkhouja/SRS\n","authors":["Taha Belkhouja","Yan Yan","Janardhan Rao Doppa"],"pdf_url":"https://arxiv.org/pdf/2207.04306v3.pdf","comment":"Accepted for publication at ACM Transactions on Intelligent Systems\n and Technology (TIST)"},{"id":"http://arxiv.org/abs/2310.17502v1","updated":"2023-10-26T15:54:12Z","published":"2023-10-26T15:54:12Z","title":"Controllable Generation of Artificial Speaker Embeddings through\n Discovery of Principal Directions","summary":" Customizing voice and speaking style in a speech synthesis system with\nintuitive and fine-grained controls is challenging, given that little data with\nappropriate labels is available. Furthermore, editing an existing human's voice\nalso comes with ethical concerns. In this paper, we propose a method to\ngenerate artificial speaker embeddings that cannot be linked to a real human\nwhile offering intuitive and fine-grained control over the voice and speaking\nstyle of the embeddings, without requiring any labels for speaker or style. 
The\nartificial and controllable embeddings can be fed to a speech synthesis system,\nconditioned on embeddings of real humans during training, without sacrificing\nprivacy during inference.\n","authors":["Florian Lux","Pascal Tilli","Sarina Meyer","Ngoc Thang Vu"],"pdf_url":"https://arxiv.org/pdf/2310.17502v1.pdf","comment":"Published at ISCA Interspeech 2023\n https://www.isca-speech.org/archive/interspeech_2023/lux23_interspeech.html"},{"id":"http://arxiv.org/abs/2310.17499v1","updated":"2023-10-26T15:53:29Z","published":"2023-10-26T15:53:29Z","title":"The IMS Toucan System for the Blizzard Challenge 2023","summary":" For our contribution to the Blizzard Challenge 2023, we improved on the\nsystem we submitted to the Blizzard Challenge 2021. Our approach entails a\nrule-based text-to-phoneme processing system that includes rule-based\ndisambiguation of homographs in the French language. It then transforms the\nphonemes to spectrograms as intermediate representations using a fast and\nefficient non-autoregressive synthesis architecture based on Conformer and\nGlow. A GAN based neural vocoder that combines recent state-of-the-art\napproaches converts the spectrogram to the final wave. We carefully designed\nthe data processing, training, and inference procedures for the challenge data.\nOur system identifier is G. Open source code and demo are available.\n","authors":["Florian Lux","Julia Koch","Sarina Meyer","Thomas Bott","Nadja Schauffler","Pavel Denisov","Antje Schweitzer","Ngoc Thang Vu"],"pdf_url":"https://arxiv.org/pdf/2310.17499v1.pdf","comment":"Published at the Blizzard Challenge Workshop 2023, colocated with the\n Speech Synthesis Workshop 2023, a satellite event of the Interspeech 2023"},{"id":"http://arxiv.org/abs/2310.17498v1","updated":"2023-10-26T15:53:18Z","published":"2023-10-26T15:53:18Z","title":"CBD: A Certified Backdoor Detector Based on Local Dominant Probability","summary":" Backdoor attack is a common threat to deep neural networks. 
During testing,\nsamples embedded with a backdoor trigger will be misclassified as an\nadversarial target by a backdoored model, while samples without the backdoor\ntrigger will be correctly classified. In this paper, we present the first\ncertified backdoor detector (CBD), which is based on a novel, adjustable\nconformal prediction scheme based on our proposed statistic local dominant\nprobability. For any classifier under inspection, CBD provides 1) a detection\ninference, 2) the condition under which the attacks are guaranteed to be\ndetectable for the same classification domain, and 3) a probabilistic upper\nbound for the false positive rate. Our theoretical results show that attacks\nwith triggers that are more resilient to test-time noise and have smaller\nperturbation magnitudes are more likely to be detected with guarantees.\nMoreover, we conduct extensive experiments on four benchmark datasets\nconsidering various backdoor types, such as BadNet, CB, and Blend. CBD achieves\ncomparable or even higher detection accuracy than state-of-the-art detectors,\nand it in addition provides detection certification. 
Notably, for backdoor\nattacks with random perturbation triggers bounded by $\\ell_2\\leq0.75$ which\nachieves more than 90\\% attack success rate, CBD achieves 100\\% (98\\%), 100\\%\n(84\\%), 98\\% (98\\%), and 72\\% (40\\%) empirical (certified) detection true\npositive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and\nTinyImageNet, respectively, with low false positive rates.\n","authors":["Zhen Xiang","Zidi Xiong","Bo Li"],"pdf_url":"https://arxiv.org/pdf/2310.17498v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.13633v2","updated":"2023-10-26T15:52:44Z","published":"2023-08-25T18:58:53Z","title":"Adaptive whitening with fast gain modulation and slow synaptic\n plasticity","summary":" Neurons in early sensory areas rapidly adapt to changing sensory statistics,\nboth by normalizing the variance of their individual responses and by reducing\ncorrelations between their responses. Together, these transformations may be\nviewed as an adaptive form of statistical whitening. Existing mechanistic\nmodels of adaptive whitening exclusively use either synaptic plasticity or gain\nmodulation as the biological substrate for adaptation; however, on their own,\neach of these models has significant limitations. In this work, we unify these\napproaches in a normative multi-timescale mechanistic model that adaptively\nwhitens its responses with complementary computational roles for synaptic\nplasticity and gain modulation. Gains are modified on a fast timescale to adapt\nto the current statistical context, whereas synapses are modified on a slow\ntimescale to match structural properties of the input statistics that are\ninvariant across contexts. Our model is derived from a novel multi-timescale\nwhitening objective that factorizes the inverse whitening matrix into basis\nvectors, which correspond to synaptic weights, and a diagonal matrix, which\ncorresponds to neuronal gains. 
We test our model on synthetic and natural\ndatasets and find that the synapses learn optimal configurations over long\ntimescales that enable adaptive whitening on short timescales using gain\nmodulation.\n","authors":["Lyndon R. Duong","Eero P. Simoncelli","Dmitri B. Chklovskii","David Lipshutz"],"pdf_url":"https://arxiv.org/pdf/2308.13633v2.pdf","comment":"NeurIPS 2023 Spotlight; 18 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.17496v1","updated":"2023-10-26T15:52:34Z","published":"2023-10-26T15:52:34Z","title":"Tackling Interference Induced by Data Training Loops in A/B Tests: A\n Weighted Training Approach","summary":" In modern recommendation systems, the standard pipeline involves training\nmachine learning models on historical data to predict user behaviors and\nimprove recommendations continuously. However, these data training loops can\nintroduce interference in A/B tests, where data generated by control and\ntreatment algorithms, potentially with different distributions, are combined.\nTo address these challenges, we introduce a novel approach called weighted\ntraining. This approach entails training a model to predict the probability of\neach data point appearing in either the treatment or control data and\nsubsequently applying weighted losses during model training. We demonstrate\nthat this approach achieves the least variance among all estimators without\ncausing shifts in the training distributions. 
Through simulation studies, we\ndemonstrate the lower bias and variance of our approach compared to other\nmethods.\n","authors":["Nian Si"],"pdf_url":"https://arxiv.org/pdf/2310.17496v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17492v1","updated":"2023-10-26T15:47:51Z","published":"2023-10-26T15:47:51Z","title":"Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation\n Models: A Multi-Agent Deep Reinforcement Learning Approach","summary":" The efficient deployment and fine-tuning of foundation models are pivotal in\ncontemporary artificial intelligence. In this study, we present a\ngroundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation\nmodels, specifically designed to enhance local task performance on user\nequipment (UE). Central to our approach is the innovative Emulator-Adapter\narchitecture, segmenting the foundation model into two cohesive modules. This\ndesign not only conserves computational resources but also ensures adaptability\nand fine-tuning efficiency for downstream tasks. Additionally, we introduce an\nadvanced resource allocation mechanism that is fine-tuned to the needs of the\nEmulator-Adapter structure in decentralized settings. To address the challenges\npresented by this system, we employ a hybrid multi-agent Deep Reinforcement\nLearning (DRL) strategy, adept at handling mixed discrete-continuous action\nspaces, ensuring dynamic and optimal resource allocations. Our comprehensive\nsimulations and validations underscore the practical viability of our approach,\ndemonstrating its robustness, efficiency, and scalability. 
Collectively, this\nwork offers a fresh perspective on deploying foundation models and balancing\ncomputational efficiency with task proficiency.\n","authors":["Wenhan Yu","Terence Jie Chua","Jun Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.17492v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17491v1","updated":"2023-10-26T15:47:44Z","published":"2023-10-26T15:47:44Z","title":"FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine\n Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation\n Models with Mobile Edge Computing","summary":" The emergence of foundation models, including language and vision models, has\nreshaped AI's landscape, offering capabilities across various applications.\nDeploying and fine-tuning these large models, like GPT-3 and BERT, presents\nchallenges, especially in the current foundation model era. We introduce\nEmulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning\n(PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we\nexpand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses\nadapters, emulators, and PEFT for federated model tuning, enhancing model\nprivacy and memory efficiency. Adapters adjust pre-trained models, while\nemulators give a compact representation of original models, addressing both\nprivacy and efficiency. Adaptable to various neural networks, our approach also\nuses deep reinforcement learning for hyper-parameter optimization. 
We tested\nFedPEAT in a unique scenario with a server participating in collaborative\nfederated tuning, showcasing its potential in tackling foundation model\nchallenges.\n","authors":["Terence Jie Chua","Wenhan Yu","Jun Zhao","Kwok-Yan Lam"],"pdf_url":"https://arxiv.org/pdf/2310.17491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17489v1","updated":"2023-10-26T15:45:01Z","published":"2023-10-26T15:45:01Z","title":"Bias in Evaluation Processes: An Optimization-Based Model","summary":" Biases with respect to socially-salient attributes of individuals have been\nwell documented in evaluation processes used in settings such as admissions and\nhiring. We view such an evaluation process as a transformation of a\ndistribution of the true utility of an individual for a task to an observed\ndistribution and model it as a solution to a loss minimization problem subject\nto an information constraint. Our model has two parameters that have been\nidentified as factors leading to biases: the resource-information trade-off\nparameter in the information constraint and the risk-averseness parameter in\nthe loss function. We characterize the distributions that arise from our model\nand study the effect of the parameters on the observed distribution. The\noutputs of our model enrich the class of distributions that can be used to\ncapture variation across groups in the observed evaluations. We empirically\nvalidate our model by fitting real-world datasets and use it to study the\neffect of interventions in a downstream selection task. These results\ncontribute to an understanding of the emergence of bias in evaluation processes\nand provide tools to guide the deployment of interventions to mitigate biases.\n","authors":["L. Elisa Celis","Amit Kumar","Anay Mehrotra","Nisheeth K. 
Vishnoi"],"pdf_url":"https://arxiv.org/pdf/2310.17489v1.pdf","comment":"The conference version of this paper appears in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17485v1","updated":"2023-10-26T15:42:29Z","published":"2023-10-26T15:42:29Z","title":"Fair collaborative vehicle routing: A deep multi-agent reinforcement\n learning approach","summary":" Collaborative vehicle routing occurs when carriers collaborate through\nsharing their transportation requests and performing transportation requests on\nbehalf of each other. This achieves economies of scale, thus reducing cost,\ngreenhouse gas emissions and road congestion. But which carrier should partner\nwith whom, and how much should each carrier be compensated? Traditional game\ntheoretic solution concepts are expensive to calculate as the characteristic\nfunction scales exponentially with the number of agents. This would require\nsolving the vehicle routing problem (NP-hard) an exponential number of times.\nWe therefore propose to model this problem as a coalitional bargaining game\nsolved using deep multi-agent reinforcement learning, where - crucially -\nagents are not given access to the characteristic function. Instead, we\nimplicitly reason about the characteristic function; thus, when deployed in\nproduction, we only need to evaluate the expensive post-collaboration vehicle\nrouting problem once. Our contribution is that we are the first to consider\nboth the route allocation problem and gain sharing problem simultaneously -\nwithout access to the expensive characteristic function. 
Through decentralised\nmachine learning, our agents bargain with each other and agree to outcomes that\ncorrelate well with the Shapley value - a fair profit allocation mechanism.\nImportantly, we are able to achieve a reduction in run-time of 88%.\n","authors":["Stephen Mak","Liming Xu","Tim Pearce","Michael Ostroumov","Alexandra Brintrup"],"pdf_url":"https://arxiv.org/pdf/2310.17485v1.pdf","comment":"Final, published version can be found here:\n https://www.sciencedirect.com/science/article/pii/S0968090X23003662"},{"id":"http://arxiv.org/abs/2305.11982v2","updated":"2023-10-26T15:39:02Z","published":"2023-05-19T20:03:31Z","title":"Sequential Memory with Temporal Predictive Coding","summary":" Forming accurate memory of sequential stimuli is a fundamental function of\nbiological agents. However, the computational mechanism underlying sequential\nmemory in the brain remains unclear. Inspired by neuroscience theories and\nrecent successes in applying predictive coding (PC) to \\emph{static} memory\ntasks, in this work we propose a novel PC-based model for \\emph{sequential}\nmemory, called \\emph{temporal predictive coding} (tPC). We show that our tPC\nmodels can memorize and retrieve sequential inputs accurately with a\nbiologically plausible neural implementation. Importantly, our analytical study\nreveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN)\nwith an implicit statistical whitening process, which leads to more stable\nperformance in sequential memory tasks of structured inputs. Moreover, we find\nthat tPC exhibits properties consistent with behavioral observations and\ntheories in neuroscience, thereby strengthening its biological relevance. 
Our\nwork establishes a possible computational mechanism underlying sequential\nmemory in the brain that can also be theoretically interpreted using existing\nmemory model frameworks.\n","authors":["Mufeng Tang","Helen Barron","Rafal Bogacz"],"pdf_url":"https://arxiv.org/pdf/2305.11982v2.pdf","comment":"37th Conference on Neural Information Processing Systems (NeurIPS\n 2023)"},{"id":"http://arxiv.org/abs/2305.16905v2","updated":"2023-10-26T15:37:22Z","published":"2023-05-26T13:19:15Z","title":"Improving Neural Additive Models with Bayesian Principles","summary":" Neural additive models (NAMs) can improve the interpretability of deep neural\nnetworks by handling input features in separate additive sub-networks. However,\nthey lack inherent mechanisms that provide calibrated uncertainties and enable\nselection of relevant features and interactions. Approaching NAMs from a\nBayesian perspective, we enhance them in three primary ways, namely by a)\nproviding credible intervals for the individual additive sub-networks; b)\nestimating the marginal likelihood to perform an implicit selection of features\nvia an empirical Bayes procedure; and c) enabling a ranking of feature pairs as\ncandidates for second-order interaction in fine-tuned models. 
In particular, we\ndevelop Laplace-approximated NAMs (LA-NAMs), which show improved empirical\nperformance on tabular datasets and challenging real-world medical tasks.\n","authors":["Kouroche Bouchiat","Alexander Immer","Hugo Yèche","Gunnar Rätsch","Vincent Fortuin"],"pdf_url":"https://arxiv.org/pdf/2305.16905v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.09933v2","updated":"2023-10-26T15:36:48Z","published":"2022-10-18T15:30:14Z","title":"Explanations Based on Item Response Theory (eXirt): A Model-Specific\n Method to Explain Tree-Ensemble Model in Trust Perspective","summary":" In recent years, XAI researchers have been formalizing proposals and\ndeveloping new methods to explain black box models, with no general consensus\nin the community on which method to use to explain these models, with this\nchoice being almost directly linked to the popularity of a specific method.\nMethods such as Ciu, Dalex, Eli5, Lofo, Shap and Skater emerged with the\nproposal to explain black box models through global rankings of feature\nrelevance, which based on different methodologies, generate global explanations\nthat indicate how the model's inputs explain its predictions. In this context,\n41 datasets, 4 tree-ensemble algorithms (Light Gradient Boosting, CatBoost,\nRandom Forest, and Gradient Boosting), and 6 XAI methods were used to support\nthe launch of a new XAI method, called eXirt, based on Item Response Theory -\nIRT and aimed at tree-ensemble black box models that use tabular data referring\nto binary classification problems. In the first set of analyses, the 164 global\nfeature relevance ranks of the eXirt were compared with 984 ranks of the other\nXAI methods present in the literature, seeking to highlight their similarities\nand differences. In a second analysis, exclusive explanations of the eXirt\nbased on Explanation-by-example were presented that help in understanding the\nmodel trust. 
Thus, it was verified that eXirt is able to generate global\nexplanations of tree-ensemble models and also local explanations of instances\nof models through IRT, showing how this consolidated theory can be used in\nmachine learning in order to obtain explainable and reliable models.\n","authors":["José Ribeiro","Lucas Cardoso","Raíssa Silva","Vitor Cirilo","Níkolas Carneiro","Ronnie Alves"],"pdf_url":"https://arxiv.org/pdf/2210.09933v2.pdf","comment":"54 pages, 15 Figures, 3 Equations, 7 table"},{"id":"http://arxiv.org/abs/2310.17477v1","updated":"2023-10-26T15:27:55Z","published":"2023-10-26T15:27:55Z","title":"Secure short-term load forecasting for smart grids with\n transformer-based federated learning","summary":" Electricity load forecasting is an essential task within smart grids to\nassist demand and supply balance. While advanced deep learning models require\nlarge amounts of high-resolution data for accurate short-term load predictions,\nfine-grained load profiles can expose users' electricity consumption behaviors,\nwhich raises privacy and security concerns. One solution to improve data\nprivacy is federated learning, where models are trained locally on private\ndata, and only the trained model parameters are merged and updated on a global\nserver. Therefore, this paper presents a novel transformer-based deep learning\napproach with federated learning for short-term electricity load prediction. To\nevaluate our results, we benchmark our federated learning architecture against\ncentral and local learning and compare the performance of our model to long\nshort-term memory models and convolutional neural networks. 
Our simulations are\nbased on a dataset from a German university campus and show that\ntransformer-based forecasting is a promising alternative to state-of-the-art\nmodels within federated learning.\n","authors":["Jonas Sievers","Thomas Blank"],"pdf_url":"https://arxiv.org/pdf/2310.17477v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.03284v3","updated":"2023-10-26T15:25:57Z","published":"2023-03-06T16:59:14Z","title":"The Wasserstein Believer: Learning Belief Updates for Partially\n Observable Environments through Reliable Latent Space Models","summary":" Partially Observable Markov Decision Processes (POMDPs) are used to model\nenvironments where the full state cannot be perceived by an agent. As such the\nagent needs to reason taking into account the past observations and actions.\nHowever, simply remembering the full history is generally intractable due to\nthe exponential growth in the history space. Maintaining a probability\ndistribution that models the belief over what the true state is can be used as\na sufficient statistic of the history, but its computation requires access to\nthe model of the environment and is often intractable. While SOTA algorithms\nuse Recurrent Neural Networks to compress the observation-action history aiming\nto learn a sufficient statistic, they lack guarantees of success and can lead\nto sub-optimal policies. To overcome this, we propose the Wasserstein Belief\nUpdater, an RL algorithm that learns a latent model of the POMDP and an\napproximation of the belief update. Our approach comes with theoretical\nguarantees on the quality of our approximation ensuring that our outputted\nbeliefs allow for learning the optimal value function.\n","authors":["Raphael Avalos","Florent Delgrange","Ann Nowé","Guillermo A. Pérez","Diederik M. 
Roijers"],"pdf_url":"https://arxiv.org/pdf/2303.03284v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17471v1","updated":"2023-10-26T15:19:40Z","published":"2023-10-26T15:19:40Z","title":"Foundation Model Based Native AI Framework in 6G with Cloud-Edge-End\n Collaboration","summary":" Future wireless communication networks are in a position to move beyond\ndata-centric, device-oriented connectivity and offer intelligent, immersive\nexperiences based on task-oriented connections, especially in the context of\nthe thriving development of pre-trained foundation models (PFM) and the\nevolving vision of 6G native artificial intelligence (AI). Therefore,\nredefining modes of collaboration between devices and servers and constructing\nnative intelligence libraries become critically important in 6G. In this paper,\nwe analyze the challenges of achieving 6G native AI from the perspectives of\ndata, intelligence, and networks. Then, we propose a 6G native AI framework\nbased on foundation models, provide a customization approach for intent-aware\nPFM, present a construction of a task-oriented AI toolkit, and outline a novel\ncloud-edge-end collaboration paradigm. As a practical use case, we apply this\nframework for orchestration, achieving the maximum sum rate within a wireless\ncommunication system, and presenting preliminary evaluation results. Finally,\nwe outline research directions for achieving native AI in 6G.\n","authors":["Xiang Chen","Zhiheng Guo","Xijun Wang","Howard H. Yang","Chenyuan Feng","Junshen Su","Sihui Zheng","Tony Q. S. Quek"],"pdf_url":"https://arxiv.org/pdf/2310.17471v1.pdf","comment":"8 pages, 4 figures, 1 table"},{"id":"http://arxiv.org/abs/2307.04204v2","updated":"2023-10-26T15:16:38Z","published":"2023-07-09T15:16:45Z","title":"Trajectory Alignment: Understanding the Edge of Stability Phenomenon via\n Bifurcation Theory","summary":" Cohen et al. 
(2021) empirically study the evolution of the largest eigenvalue\nof the loss Hessian, also known as sharpness, along the gradient descent (GD)\ntrajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness\nincreases at the early phase of training (referred to as progressive\nsharpening), and eventually saturates close to the threshold of $2 /\n\\text{(step size)}$. In this paper, we start by demonstrating through empirical\nstudies that when the EoS phenomenon occurs, different GD trajectories (after a\nproper reparameterization) align on a specific bifurcation diagram independent\nof initialization. We then rigorously prove this trajectory alignment\nphenomenon for a two-layer fully-connected linear network and a single-neuron\nnonlinear network trained with a single data point. Our trajectory alignment\nanalysis establishes both progressive sharpening and EoS phenomena,\nencompassing and extending recent findings in the literature.\n","authors":["Minhak Song","Chulhee Yun"],"pdf_url":"https://arxiv.org/pdf/2307.04204v2.pdf","comment":"NeurIPS 2023 camera-ready; 51 pages"},{"id":"http://arxiv.org/abs/2310.16639v2","updated":"2023-10-26T15:15:39Z","published":"2023-10-25T13:39:04Z","title":"Driving through the Concept Gridlock: Unraveling Explainability\n Bottlenecks in Automated Driving","summary":" Concept bottleneck models have been successfully used for explainable machine\nlearning by encoding information within the model with a set of human-defined\nconcepts. In the context of human-assisted or autonomous driving,\nexplainability models can help user acceptance and understanding of decisions\nmade by the autonomous vehicle, which can be used to rationalize and explain\ndriver or vehicle behavior. We propose a new approach using concept bottlenecks\nas visual features for control command predictions and explanations of user and\nvehicle behavior. 
We learn a human-understandable concept layer that we use to\nexplain sequential driving scenes while learning vehicle control commands. This\napproach can then be used to determine whether a change in a preferred gap or\nsteering commands from a human (or autonomous vehicle) is led by an external\nstimulus or change in preferences. We achieve competitive performance to latent\nvisual features while gaining interpretability within our model setup.\n","authors":["Jessica Echterhoff","An Yan","Kyungtae Han","Amr Abdelraouf","Rohit Gupta","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2310.16639v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17468v1","updated":"2023-10-26T15:15:11Z","published":"2023-10-26T15:15:11Z","title":"Cross-modal Active Complementary Learning with Self-refining\n Correspondence","summary":" Recently, image-text matching has attracted more and more attention from\nacademia and industry, which is fundamental to understanding the latent\ncorrespondence across visual and textual modalities. However, most existing\nmethods implicitly assume the training pairs are well-aligned while ignoring\nthe ubiquitous annotation noise, a.k.a noisy correspondence (NC), thereby\ninevitably leading to a performance drop. Although some methods attempt to\naddress such noise, they still face two challenging problems: excessive\nmemorizing/overfitting and unreliable correction for NC, especially under high\nnoise. To address the two problems, we propose a generalized Cross-modal Robust\nComplementary Learning framework (CRCL), which benefits from a novel Active\nComplementary Loss (ACL) and an efficient Self-refining Correspondence\nCorrection (SCC) to improve the robustness of existing methods. Specifically,\nACL exploits active and complementary learning losses to reduce the risk of\nproviding erroneous supervision, leading to theoretically and experimentally\ndemonstrated robustness against NC. 
SCC utilizes multiple self-refining\nprocesses with momentum correction to enlarge the receptive field for\ncorrecting correspondences, thereby alleviating error accumulation and\nachieving accurate and stable corrections. We carry out extensive experiments\non three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify\nthe superior robustness of our CRCL against synthetic and real-world noisy\ncorrespondences.\n","authors":["Yang Qin","Yuan Sun","Dezhong Peng","Joey Tianyi Zhou","Xi Peng","Peng Hu"],"pdf_url":"https://arxiv.org/pdf/2310.17468v1.pdf","comment":"This paper is accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17467v1","updated":"2023-10-26T15:15:01Z","published":"2023-10-26T15:15:01Z","title":"The statistical thermodynamics of generative diffusion models","summary":" Generative diffusion models have achieved spectacular performance in many\nareas of generative modeling. While the fundamental ideas behind these models\ncome from non-equilibrium physics, in this paper we show that many aspects of\nthese models can be understood using the tools of equilibrium statistical\nmechanics. Using this reformulation, we show that generative diffusion models\nundergo second-order phase transitions corresponding to symmetry breaking\nphenomena. We argue that this leads to a form of instability that lies at the\nheart of their generative capabilities and that can be described by a set of\nmean field critical exponents. 
We conclude by analyzing recent work connecting\ndiffusion models and associative memory networks in view of the thermodynamic\nformulations.\n","authors":["Luca Ambrogioni"],"pdf_url":"https://arxiv.org/pdf/2310.17467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17463v1","updated":"2023-10-26T15:10:35Z","published":"2023-10-26T15:10:35Z","title":"Bayesian Neural Controlled Differential Equations for Treatment Effect\n Estimation","summary":" Treatment effect estimation in continuous time is crucial for personalized\nmedicine. However, existing methods for this task are limited to point\nestimates of the potential outcomes, whereas uncertainty estimates have been\nignored. Needless to say, uncertainty quantification is crucial for reliable\ndecision-making in medical applications. To fill this gap, we propose a novel\nBayesian neural controlled differential equation (BNCDE) for treatment effect\nestimation in continuous time. In our BNCDE, the time dimension is modeled\nthrough a coupled system of neural controlled differential equations and neural\nstochastic differential equations, where the neural stochastic differential\nequations allow for tractable variational Bayesian inference. Thereby, for an\nassigned sequence of treatments, our BNCDE provides meaningful posterior\npredictive distributions of the potential outcomes. To the best of our\nknowledge, ours is the first tailored neural method to provide uncertainty\nestimates of treatment effects in continuous time. 
As such, our method is of\ndirect practical value for promoting reliable decision-making in medicine.\n","authors":["Konstantin Hess","Valentyn Melnychuk","Dennis Frauen","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2310.17463v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17462v1","updated":"2023-10-26T15:10:10Z","published":"2023-10-26T15:10:10Z","title":"Towards Learning Monocular 3D Object Localization From 2D Labels using\n the Physical Laws of Motion","summary":" We present a novel method for precise 3D object localization in single images\nfrom a single calibrated camera using only 2D labels. No expensive 3D labels\nare needed. Thus, instead of using 3D labels, our model is trained with\neasy-to-annotate 2D labels along with the physical knowledge of the object's\nmotion. Given this information, the model can infer the latent third dimension,\neven though it has never seen this information during training. Our method is\nevaluated on both synthetic and real-world datasets, and we are able to achieve\na mean distance error of just 6 cm in our experiments on real data. The results\nindicate the method's potential as a step towards learning 3D object location\nestimation, where collecting 3D data for training is not feasible.\n","authors":["Daniel Kienzle","Julian Lorenz","Katja Ludwig","Rainer Lienhart"],"pdf_url":"https://arxiv.org/pdf/2310.17462v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17458v1","updated":"2023-10-26T15:04:23Z","published":"2023-10-26T15:04:23Z","title":"Coalitional Bargaining via Reinforcement Learning: An Application to\n Collaborative Vehicle Routing","summary":" Collaborative Vehicle Routing is where delivery companies cooperate by\nsharing their delivery information and performing delivery requests on behalf\nof each other. This achieves economies of scale and thus reduces cost,\ngreenhouse gas emissions, and road congestion. 
But which company should partner\nwith whom, and how much should each company be compensated? Traditional game\ntheoretic solution concepts, such as the Shapley value or nucleolus, are\ndifficult to calculate for the real-world problem of Collaborative Vehicle\nRouting due to the characteristic function scaling exponentially with the\nnumber of agents. This would require solving the Vehicle Routing Problem (an\nNP-Hard problem) an exponential number of times. We therefore propose to model\nthis problem as a coalitional bargaining game where - crucially - agents are\nnot given access to the characteristic function. Instead, we implicitly reason\nabout the characteristic function, and thus eliminate the need to evaluate the\nVRP an exponential number of times - we only need to evaluate it once. Our\ncontribution is that our decentralised approach is both scalable and considers\nthe self-interested nature of companies. The agents learn using a modified\nIndependent Proximal Policy Optimisation. Our RL agents outperform a strong\nheuristic bot. The agents correctly identify the optimal coalitions 79% of the\ntime with an average optimality gap of 4.2% and reduction in run-time of 62%.\n","authors":["Stephen Mak","Liming Xu","Tim Pearce","Michael Ostroumov","Alexandra Brintrup"],"pdf_url":"https://arxiv.org/pdf/2310.17458v1.pdf","comment":"Accepted to NeurIPS 2021 Workshop on Cooperative AI"},{"id":"http://arxiv.org/abs/2310.14814v2","updated":"2023-10-26T15:02:41Z","published":"2023-10-23T11:30:06Z","title":"Leveraging Ensemble Diversity for Robust Self-Training in the Presence\n of Sample Selection Bias","summary":" Self-training is a well-known approach for semi-supervised learning. It\nconsists of iteratively assigning pseudo-labels to unlabeled data for which the\nmodel is confident and treating them as labeled examples. 
For neural networks,\nsoftmax prediction probabilities are often used as a confidence measure,\ndespite the fact that they are known to be overconfident, even for wrong\npredictions. This phenomenon is particularly intensified in the presence of\nsample selection bias, i.e., when data labeling is subject to some constraint.\nTo address this issue, we propose a novel confidence measure, called\n$\\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of\nlinear classifiers. We provide the theoretical analysis of our approach by\nstudying stationary points and describing the relationship between the\ndiversity of the individual members and their performance. We empirically\ndemonstrate the benefit of our confidence measure for three different\npseudo-labeling policies on classification datasets of various data modalities.\n","authors":["Ambroise Odonnat","Vasilii Feofanov","Ievgen Redko"],"pdf_url":"https://arxiv.org/pdf/2310.14814v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.03004v4","updated":"2023-10-26T15:02:02Z","published":"2023-09-06T13:48:40Z","title":"A Theoretical Explanation of Activation Sparsity through Flat Minima and\n Adversarial Robustness","summary":" A recent empirical observation (Li et al., 2022b) of activation sparsity in\nMLP blocks offers an opportunity to drastically reduce computation costs for\nfree. Although having attributed it to training dynamics, existing theoretical\nexplanations of activation sparsity are restricted to shallow networks, small\ntraining steps and special training, despite its emergence in deep models\nstandardly trained for a large number of steps. To fill these gaps, we propose\nthe notion of gradient sparsity as one source of activation sparsity and a\ntheoretical explanation based on it that sees sparsity a necessary step to\nadversarial robustness w.r.t. hidden features and parameters, which is\napproximately the flatness of minima for well-learned models. 
The theory\napplies to standardly trained LayerNorm-ed MLPs, and further to Transformers or\nother architectures trained with weight noises. Eliminating other sources of\nflatness except for sparsity, we discover the phenomenon that the ratio between\nthe largest and smallest non-zero singular values of weight matrices is small.\nWhen discussing the emergence of this spectral concentration, we use random\nmatrix theory (RMT) as a powerful tool to analyze stochastic gradient noises.\nValidational experiments are conducted to verify our gradient-sparsity-based\nexplanation. We propose two plug-and-play modules for both training and\nfinetuning for sparsity. Experiments on ImageNet-1k and C4 demonstrate their\n50% sparsity improvements, indicating further potential cost reduction in both\ntraining and inference.\n","authors":["Ze Peng","Lei Qi","Yinghuan Shi","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2309.03004v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19470v3","updated":"2023-10-26T14:59:16Z","published":"2023-05-31T00:38:55Z","title":"Label Embedding via Low-Coherence Matrices","summary":" Label embedding is a framework for multiclass classification problems where\neach label is represented by a distinct vector of some fixed dimension, and\ntraining involves matching model output to the vector representing the correct\nlabel. While label embedding has been successfully applied in extreme\nclassification and zero-shot learning, and offers both computational and\nstatistical advantages, its theoretical foundations remain poorly understood.\nThis work presents an analysis of label embedding in the context of extreme\nmulticlass classification, where the number of classes $C$ is very large. 
We\npresent an excess risk bound that reveals a trade-off between computational and\nstatistical efficiency, quantified via the coherence of the embedding matrix.\nWe further show that under the Massart noise condition, the statistical penalty\nfor label embedding vanishes with sufficiently low coherence. Our analysis\nsupports an algorithm that is simple, scalable, and easily parallelizable, and\nexperimental results demonstrate its effectiveness in large-scale applications.\n","authors":["Jianxin Zhang","Clayton Scott"],"pdf_url":"https://arxiv.org/pdf/2305.19470v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04220v4","updated":"2023-10-26T14:48:49Z","published":"2023-06-07T07:51:05Z","title":"Look Beneath the Surface: Exploiting Fundamental Symmetry for\n Sample-Efficient Offline RL","summary":" Offline reinforcement learning (RL) offers an appealing approach to\nreal-world tasks by learning policies from pre-collected datasets without\ninteracting with the environment. However, the performance of existing offline\nRL algorithms heavily depends on the scale and state-action space coverage of\ndatasets. Real-world data collection is often expensive and uncontrollable,\nleading to small and narrowly covered datasets and posing significant\nchallenges for practical deployments of offline RL. In this paper, we provide a\nnew insight that leveraging the fundamental symmetry of system dynamics can\nsubstantially enhance offline RL performance under small datasets.\nSpecifically, we propose a Time-reversal symmetry (T-symmetry) enforced\nDynamics Model (TDM), which establishes consistency between a pair of forward\nand reverse latent dynamics. TDM provides both well-behaved representations for\nsmall datasets and a new reliability measure for OOD samples based on\ncompliance with the T-symmetry. 
These can be readily used to construct a new\noffline RL algorithm (TSRL) with less conservative policy constraints and a\nreliable latent space data augmentation procedure. Based on extensive\nexperiments, we find TSRL achieves great performance on small benchmark\ndatasets with as few as 1% of the original samples, which significantly\noutperforms the recent offline RL algorithms in terms of data efficiency and\ngeneralizability. Code is available at: https://github.com/pcheng2/TSRL\n","authors":["Peng Cheng","Xianyuan Zhan","Zhihao Wu","Wenjia Zhang","Shoucheng Song","Han Wang","Youfang Lin","Li Jiang"],"pdf_url":"https://arxiv.org/pdf/2306.04220v4.pdf","comment":"The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2310.17437v1","updated":"2023-10-26T14:47:11Z","published":"2023-10-26T14:47:11Z","title":"Sign Languague Recognition without frame-sequencing constraints: A proof\n of concept on the Argentinian Sign Language","summary":" Automatic sign language recognition (SLR) is an important topic within the\nareas of human-computer interaction and machine learning. On the one hand, it\nposes a complex challenge that requires the intervention of various knowledge\nareas, such as video processing, image processing, intelligent systems and\nlinguistics. On the other hand, robust recognition of sign language could\nassist in the translation process and the integration of hearing-impaired\npeople, as well as the teaching of sign language for the hearing population.\n SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or\nsimilar models to recognize signs. Such techniques exploit the sequential\nordering of frames to reduce the number of hypotheses. This paper presents a\ngeneral probabilistic model for sign classification that combines\nsub-classifiers based on different types of features such as position, movement\nand handshape. 
The model employs a bag-of-words approach in all classification\nsteps, to explore the hypothesis that ordering is not essential for\nrecognition. The proposed model achieved an accuracy rate of 97% on an\nArgentinian Sign Language dataset containing 64 classes of signs and 3200\nsamples, providing some evidence that indeed recognition without ordering is\npossible.\n","authors":["Franco Ronchetti","Facundo Manuel Quiroga","César Estrebou","Laura Lanzarini","Alejandro Rosete"],"pdf_url":"https://arxiv.org/pdf/2310.17437v1.pdf","comment":"IBERAMIA 2016"},{"id":"http://arxiv.org/abs/2302.02209v4","updated":"2023-10-26T14:44:27Z","published":"2023-02-04T17:40:03Z","title":"A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge\n Graphs","summary":" Graph neural networks are prominent models for representation learning over\ngraph-structured data. While the capabilities and limitations of these models\nare well-understood for simple graphs, our understanding remains incomplete in\nthe context of knowledge graphs. Our goal is to provide a systematic\nunderstanding of the landscape of graph neural networks for knowledge graphs\npertaining to the prominent task of link prediction. Our analysis entails a\nunifying perspective on seemingly unrelated models and unlocks a series of\nother models. The expressive power of various models is characterized via a\ncorresponding relational Weisfeiler-Leman algorithm. This analysis is extended\nto provide a precise logical characterization of the class of functions\ncaptured by a class of graph neural networks. 
The theoretical findings\npresented in this paper explain the benefits of some widely employed practical\ndesign choices, which are validated empirically.\n","authors":["Xingyue Huang","Miguel Romero Orth","İsmail İlkan Ceylan","Pablo Barceló"],"pdf_url":"https://arxiv.org/pdf/2302.02209v4.pdf","comment":"Proceedings of the Thirty-Seventh Annual Conference on Advances in\n Neural Information Processing Systems (NeurIPS 2023). Code available at:\n https://github.com/HxyScotthuang/CMPNN"},{"id":"http://arxiv.org/abs/2310.17432v1","updated":"2023-10-26T14:40:30Z","published":"2023-10-26T14:40:30Z","title":"Likelihood-based Out-of-Distribution Detection with Denoising Diffusion\n Probabilistic Models","summary":" Out-of-Distribution detection between dataset pairs has been extensively\nexplored with generative models. We show that likelihood-based\nOut-of-Distribution detection can be extended to diffusion models by leveraging\nthe fact that they, like other likelihood-based generative models, are\ndramatically affected by the input sample complexity. Currently, all\nOut-of-Distribution detection methods with Diffusion Models are\nreconstruction-based. We propose a new likelihood ratio for Out-of-Distribution\ndetection with Deep Denoising Diffusion Models, which we call the Complexity\nCorrected Likelihood Ratio. Our likelihood ratio is constructed using Evidence\nLower-Bound evaluations from an individual model at various noising levels. We\npresent results that are comparable to state-of-the-art Out-of-Distribution\ndetection methods with generative models.\n","authors":["Joseph Goodier","Neill D. F. 
Campbell"],"pdf_url":"https://arxiv.org/pdf/2310.17432v1.pdf","comment":"9 pages (main paper), 3 pages (acknowledgements & references), 3\n figures, 2 tables, 1 algorithm, work accepted for BMVC 2023"},{"id":"http://arxiv.org/abs/2307.10922v3","updated":"2023-10-26T14:34:55Z","published":"2023-07-20T14:47:50Z","title":"Language-based Action Concept Spaces Improve Video Self-Supervised\n Learning","summary":" Recent contrastive language image pre-training has led to learning highly\ntransferable and robust image representations. However, adapting these models\nto video domains with minimal supervision remains an open problem. We explore a\nsimple step in that direction, using language tied self-supervised learning to\nadapt an image CLIP model to the video domain. A backbone modified for temporal\nmodeling is trained under self-distillation settings with train objectives\noperating in an action concept space. Feature vectors of various action\nconcepts extracted from a language encoder using relevant textual prompts\nconstruct this space. We introduce two train objectives, concept distillation\nand concept alignment, that retain generality of original representations while\nenforcing relations between actions and their attributes. Our approach improves\nzero-shot and linear probing performance on three action recognition\nbenchmarks.\n","authors":["Kanchana Ranasinghe","Michael Ryoo"],"pdf_url":"https://arxiv.org/pdf/2307.10922v3.pdf","comment":"Presented at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17427v1","updated":"2023-10-26T14:32:44Z","published":"2023-10-26T14:32:44Z","title":"Handshape recognition for Argentinian Sign Language using ProbSom","summary":" Automatic sign language recognition is an important topic within the areas of\nhuman-computer interaction and machine learning. 
On the one hand, it poses a\ncomplex challenge that requires the intervention of various knowledge areas,\nsuch as video processing, image processing, intelligent systems and\nlinguistics. On the other hand, robust recognition of sign language could\nassist in the translation process and the integration of hearing-impaired\npeople.\n This paper offers two main contributions: first, the creation of a database\nof handshapes for the Argentinian Sign Language (LSA), which is a topic that\nhas barely been discussed so far. Secondly, a technique for image processing,\ndescriptor extraction and subsequent handshape classification using a\nsupervised adaptation of self-organizing maps that is called ProbSom. This\ntechnique is compared to others in the state of the art, such as Support Vector\nMachines (SVM), Random Forests, and Neural Networks.\n The database that was built contains 800 images with 16 LSA handshapes, and\nis a first step towards building a comprehensive database of Argentinian signs.\nThe ProbSom-based neural classifier, using the proposed descriptor, achieved an\naccuracy rate above 90%.\n","authors":["Franco Ronchetti","Facundo Manuel Quiroga","César Estrebou","Laura Lanzarini"],"pdf_url":"https://arxiv.org/pdf/2310.17427v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17405v1","updated":"2023-10-26T14:01:17Z","published":"2023-10-26T14:01:17Z","title":"Causal Modeling with Stationary Diffusions","summary":" We develop a novel approach towards causal inference. Rather than structural\nequations over a causal graph, we learn stochastic differential equations\n(SDEs) whose stationary densities model a system's behavior under\ninterventions. These stationary diffusion models do not require the formalism\nof causal graphs, let alone the common assumption of acyclicity. We show that\nin several cases, they generalize to unseen interventions on their variables,\noften better than classical approaches. 
Our inference method is based on a new\ntheoretical result that expresses a stationarity condition on the diffusion's\ngenerator in a reproducing kernel Hilbert space. The resulting kernel deviation\nfrom stationarity (KDS) is an objective function of independent interest.\n","authors":["Lars Lorch","Andreas Krause","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2310.17405v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17404v1","updated":"2023-10-26T13:59:39Z","published":"2023-10-26T13:59:39Z","title":"Invariance Measures for Neural Networks","summary":" Invariances in neural networks are useful and necessary for many tasks.\nHowever, the representation of the invariance of most neural network models has\nnot been characterized. We propose measures to quantify the invariance of\nneural networks in terms of their internal representation. The measures are\nefficient and interpretable, and can be applied to any neural network model.\nThey are also more sensitive to invariance than previously defined measures. We\nvalidate the measures and their properties in the domain of affine\ntransformations and the CIFAR10 and MNIST datasets, including their stability\nand interpretability. Using the measures, we perform a first analysis of CNN\nmodels and show that their internal invariance is remarkably stable to random\nweight initializations, but not to changes in dataset or transformation. 
We\nbelieve the measures will enable new avenues of research in invariance\nrepresentation.\n","authors":["Facundo Manuel Quiroga","Jordina Torrents-Barrena","Laura Cristina Lanzarini","Domenec Puig-Valls"],"pdf_url":"https://arxiv.org/pdf/2310.17404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17403v1","updated":"2023-10-26T13:56:12Z","published":"2023-10-26T13:56:12Z","title":"Detection Defenses: An Empty Promise against Adversarial Patch Attacks\n on Optical Flow","summary":" Adversarial patches undermine the reliability of optical flow predictions\nwhen placed in arbitrary scene locations. Therefore, they pose a realistic\nthreat to real-world motion detection and its downstream applications.\nPotential remedies are defense strategies that detect and remove adversarial\npatches, but their influence on the underlying motion prediction has not been\ninvestigated. In this paper, we thoroughly examine the currently available\ndetect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art\noptical flow methods, and illuminate their side effects on the quality and\nrobustness of the final flow predictions. In particular, we implement\ndefense-aware attacks to investigate whether current defenses are able to\nwithstand attacks that take the defense mechanism into account. Our experiments\nyield two surprising results: Detect-and-remove defenses do not only lower the\noptical flow quality on benign scenes, in doing so, they also harm the\nrobustness under patch attacks for all tested optical flow methods except\nFlowNetC. As currently employed detect-and-remove defenses fail to deliver the\npromised adversarial robustness for optical flow, they evoke a false sense of\nsecurity. 
The code is available at\nhttps://github.com/cv-stuttgart/DetectionDefenses.\n","authors":["Erik Scheurer","Jenny Schmalfuss","Alexander Lis","Andrés Bruhn"],"pdf_url":"https://arxiv.org/pdf/2310.17403v1.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2310.17394v1","updated":"2023-10-26T13:46:18Z","published":"2023-10-26T13:46:18Z","title":"Enhancing Graph Neural Networks with Structure-Based Prompt","summary":" Graph Neural Networks (GNNs) are powerful in learning semantics of graph\ndata. Recently, a new paradigm \"pre-train, prompt\" has shown promising results\nin adapting GNNs to various tasks with less supervised data. The success of\nsuch paradigm can be attributed to the more consistent objectives of\npre-training and task-oriented prompt tuning, where the pre-trained knowledge\ncan be effectively transferred to downstream tasks. However, an overlooked\nissue of existing studies is that the structure information of graph is usually\nexploited during pre-training for learning node representations, while\nneglected in the prompt tuning stage for learning task-specific parameters. To\nbridge this gap, we propose a novel structure-based prompting method for GNNs,\nnamely SAP, which consistently exploits structure information in both\npre-training and prompt tuning stages. In particular, SAP 1) employs a\ndual-view contrastive learning to align the latent semantic spaces of node\nattributes and graph structure, and 2) incorporates structure information in\nprompted graph to elicit more pre-trained knowledge in prompt tuning. We\nconduct extensive experiments on node classification and graph classification\ntasks to show the effectiveness of SAP. 
Moreover, we show that SAP can lead to\nbetter performance in more challenging few-shot scenarios on both homophilous\nand heterophilous graphs.\n","authors":["Qingqing Ge","Zeyuan Zhao","Yiding Liu","Anfeng Cheng","Xiang Li","Shuaiqiang Wang","Dawei Yin"],"pdf_url":"https://arxiv.org/pdf/2310.17394v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17386v1","updated":"2023-10-26T13:33:26Z","published":"2023-10-26T13:33:26Z","title":"A Challenge in Reweighting Data with Bilevel Optimization","summary":" In many scenarios, one uses a large training set to train a model with the\ngoal of performing well on a smaller testing set with a different distribution.\nLearning a weight for each data point of the training set is an appealing\nsolution, as it ideally allows one to automatically learn the importance of\neach training point for generalization on the testing set. This task is usually\nformalized as a bilevel optimization problem. Classical bilevel solvers are\nbased on a warm-start strategy where both the parameters of the models and the\ndata weights are learned at the same time. We show that this joint dynamic may\nlead to sub-optimal solutions, for which the final data weights are very\nsparse. This finding illustrates the difficulty of data reweighting and offers\na clue as to why this method is rarely used in practice.\n","authors":["Anastasia Ivanova","Pierre Ablin"],"pdf_url":"https://arxiv.org/pdf/2310.17386v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17385v1","updated":"2023-10-26T13:32:49Z","published":"2023-10-26T13:32:49Z","title":"Multitask Online Learning: Listen to the Neighborhood Buzz","summary":" We study multitask online learning in a setting where agents can only\nexchange information with their neighbors on an arbitrary communication\nnetwork. We introduce $\\texttt{MT-CO}_2\\texttt{OL}$, a decentralized algorithm\nfor this setting whose regret depends on the interplay between the task\nsimilarities and the network structure. 
Our analysis shows that the regret of\n$\\texttt{MT-CO}_2\\texttt{OL}$ is never worse (up to constants) than the bound\nobtained when agents do not share information. On the other hand, our bounds\nsignificantly improve when neighboring agents operate on similar tasks. In\naddition, we prove that our algorithm can be made differentially private with a\nnegligible impact on the regret when the losses are linear. Finally, we provide\nexperimental support for our theory.\n","authors":["Juliette Achddou","Nicolò Cesa-Bianchi","Pierre Laforgue"],"pdf_url":"https://arxiv.org/pdf/2310.17385v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17383v1","updated":"2023-10-26T13:27:23Z","published":"2023-10-26T13:27:23Z","title":"On the recognition of the game type based on physiological signals and\n eye tracking","summary":" Automated interpretation of signals yields many impressive applications from\nthe area of affective computing and human activity recognition (HAR). In this\npaper we ask the question about possibility of cognitive activity recognition\non the base of particular set of signals. We use recognition of the game played\nby the participant as a playground for exploration of the problem. We build\nclassifier of three different games (Space Invaders, Tetris, Tower Defence) and\ninter-game pause. We validate classifier in the player-independent and\nplayer-dependent scenario. We discuss the improvement in the player-dependent\nscenario in the context of biometric person recognition. 
On the base of the\nresults obtained in game classification, we consider potential applications in\nsmart surveillance and quantified self.\n","authors":["Łukasz Czekaj","Łukasz Radzinski","Mateusz Kolimaga","Jakub Domaszewicz","Robert Kitłowski","Mariusz Szwoch","Włodzisław Duch"],"pdf_url":"https://arxiv.org/pdf/2310.17383v1.pdf","comment":"5 pages, 3 figures, extended version of ESM paper"},{"id":"http://arxiv.org/abs/2305.16427v2","updated":"2023-10-26T13:22:56Z","published":"2023-05-25T18:56:34Z","title":"Neural (Tangent Kernel) Collapse","summary":" This work bridges two important concepts: the Neural Tangent Kernel (NTK),\nwhich captures the evolution of deep neural networks (DNNs) during training,\nand the Neural Collapse (NC) phenomenon, which refers to the emergence of\nsymmetry and structure in the last-layer features of well-trained\nclassification DNNs. We adopt the natural assumption that the empirical NTK\ndevelops a block structure aligned with the class labels, i.e., samples within\nthe same class have stronger correlations than samples from different classes.\nUnder this assumption, we derive the dynamics of DNNs trained with mean squared\n(MSE) loss and break them into interpretable phases. Moreover, we identify an\ninvariant that captures the essence of the dynamics, and use it to prove the\nemergence of NC in DNNs with block-structured NTK. 
We provide large-scale\nnumerical experiments on three common DNN architectures and three benchmark\ndatasets to support our theory.\n","authors":["Mariia Seleznova","Dana Weitzner","Raja Giryes","Gitta Kutyniok","Hung-Hsu Chou"],"pdf_url":"https://arxiv.org/pdf/2305.16427v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17378v1","updated":"2023-10-26T13:14:13Z","published":"2023-10-26T13:14:13Z","title":"Optimization dependent generalization bound for ReLU networks based on\n sensitivity in the tangent bundle","summary":" Recent advances in deep learning have given us some very promising results on\nthe generalization ability of deep neural networks, however literature still\nlacks a comprehensive theory explaining why heavily over-parametrized models\nare able to generalize well while fitting the training data. In this paper we\npropose a PAC type bound on the generalization error of feedforward ReLU\nnetworks via estimating the Rademacher complexity of the set of networks\navailable from an initial parameter vector via gradient descent. The key idea\nis to bound the sensitivity of the network's gradient to perturbation of the\ninput data along the optimization trajectory. The obtained bound does not\nexplicitly depend on the depth of the network. Our results are experimentally\nverified on the MNIST and CIFAR-10 datasets.\n","authors":["Dániel Rácz","Mihály Petreczky","András Csertán","Bálint Daróczy"],"pdf_url":"https://arxiv.org/pdf/2310.17378v1.pdf","comment":"17 pages, 5 figures, OPT2023: 15th Annual Workshop on Optimization\n for Machine Learning at the 37th NeurIPS 2023, New Orleans, LA, USA"},{"id":"http://arxiv.org/abs/2209.06589v4","updated":"2023-10-26T13:13:16Z","published":"2022-09-14T12:13:59Z","title":"Towards Better Generalization with Flexible Representation of\n Multi-Module Graph Neural Networks","summary":" Graph neural networks (GNNs) have become compelling models designed to\nperform learning and inference on graph-structured data. 
However, little work\nhas been done to understand the fundamental limitations of GNNs for scaling to\nlarger graphs and generalizing to out-of-distribution (OOD) inputs. In this\npaper, we use a random graph generator to systematically investigate how the\ngraph size and structural properties affect the predictive performance of GNNs.\nWe present specific evidence that the average node degree is a key feature in\ndetermining whether GNNs can generalize to unseen graphs, and that the use of\nmultiple node update functions can improve the generalization performance of\nGNNs when dealing with graphs of multimodal degree distributions. Accordingly,\nwe propose a multi-module GNN framework that allows the network to adapt\nflexibly to new graphs by generalizing a single canonical nonlinear\ntransformation over aggregated inputs. Our results show that the multi-module\nGNNs improve the OOD generalization on a variety of inference tasks in the\ndirection of diverse structural features.\n","authors":["Hyungeun Lee","Kijung Yoon"],"pdf_url":"https://arxiv.org/pdf/2209.06589v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03598v2","updated":"2023-10-26T13:02:44Z","published":"2023-05-05T15:03:01Z","title":"NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial\n Reports","summary":" How can we interpret and retrieve medical evidence to support clinical\ndecisions? Clinical trial reports (CTR) amassed over the years contain\nindispensable information for the development of personalized medicine.\nHowever, it is practically infeasible to manually inspect over 400,000+\nclinical trial reports in order to find the best evidence for experimental\ntreatments. Natural Language Inference (NLI) offers a potential solution to\nthis problem, by allowing the scalable computation of textual entailment.\nHowever, existing NLI models perform poorly on biomedical corpora, and\npreviously published datasets fail to capture the full complexity of inference\nover CTRs. 
In this work, we present a novel resource to advance research on NLI\nfor reasoning on CTRs. The resource includes two main tasks. Firstly, to\ndetermine the inference relation between a natural language statement, and a\nCTR. Secondly, to retrieve supporting facts to justify the predicted relation.\nWe provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these\ntasks. Baselines on this corpus expose the limitations of existing NLI models,\nwith 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To\nthe best of our knowledge, we are the first to design a task that covers the\ninterpretation of full CTRs. To encourage further work on this challenging\ndataset, we make the corpus, competition leaderboard, website and code to\nreplicate the baseline experiments available at:\nhttps://github.com/ai-systems/nli4ct\n","authors":["Maël Jullien","Marco Valentino","Hannah Frost","Paul O'Regan","Donal Landers","André Freitas"],"pdf_url":"https://arxiv.org/pdf/2305.03598v2.pdf","comment":"15 pages"},{"id":"http://arxiv.org/abs/2301.11113v3","updated":"2023-10-26T12:49:51Z","published":"2023-01-26T14:06:26Z","title":"Finding Regions of Counterfactual Explanations via Robust Optimization","summary":" Counterfactual explanations play an important role in detecting bias and\nimproving the explainability of data-driven classification models. A\ncounterfactual explanation (CE) is a minimal perturbed data point for which the\ndecision of the model changes. Most of the existing methods can only provide\none CE, which may not be achievable for the user. In this work we derive an\niterative method to calculate robust CEs, i.e. CEs that remain valid even after\nthe features are slightly perturbed. To this end, our method provides a whole\nregion of CEs allowing the user to choose a suitable recourse to obtain a\ndesired outcome. 
We use algorithmic ideas from robust optimization and prove\nconvergence results for the most common machine learning methods including\nlogistic regression, decision trees, random forests, and neural networks. Our\nexperiments show that our method can efficiently generate globally optimal\nrobust CEs for a variety of common data sets and classification models.\n","authors":["Donato Maragno","Jannis Kurtz","Tabea E. Röber","Rob Goedhart","Ş. Ilker Birbil","Dick den Hertog"],"pdf_url":"https://arxiv.org/pdf/2301.11113v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.00855v2","updated":"2023-10-26T12:49:27Z","published":"2023-08-01T21:43:22Z","title":"A Comprehensive Study of Groundbreaking Machine Learning Research:\n Analyzing highly cited and impactful publications across six decades","summary":" Machine learning (ML) has emerged as a prominent field of research in\ncomputer science and other related fields, thereby driving advancements in\nother domains of interest. As the field continues to evolve, it is crucial to\nunderstand the landscape of highly cited publications to identify key trends,\ninfluential authors, and significant contributions made thus far. In this\npaper, we present a comprehensive bibliometric analysis of highly cited ML\npublications. We collected a dataset consisting of the top-cited papers from\nreputable ML conferences and journals, covering a period of several years from\n1959 to 2022. We employed various bibliometric techniques to analyze the data,\nincluding citation analysis, co-authorship analysis, keyword analysis, and\npublication trends. Our findings reveal the most influential papers, highly\ncited authors, and collaborative networks within the machine learning\ncommunity. We identify popular research themes and uncover emerging topics that\nhave recently gained significant attention. 
Furthermore, we examine the\ngeographical distribution of highly cited publications, highlighting the\ndominance of certain countries in ML research. By shedding light on the\nlandscape of highly cited ML publications, our study provides valuable insights\nfor researchers, policymakers, and practitioners seeking to understand the key\ndevelopments and trends in this rapidly evolving field.\n","authors":["Absalom E. Ezugwu","Japie Greeff","Yuh-Shan Ho"],"pdf_url":"https://arxiv.org/pdf/2308.00855v2.pdf","comment":"Journal of Engineering Research (2023)"},{"id":"http://arxiv.org/abs/2310.17360v1","updated":"2023-10-26T12:48:43Z","published":"2023-10-26T12:48:43Z","title":"Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal\n Graph Learning","summary":" Spatio-temporal graph learning is a fundamental problem in the Web of Things\nera, which enables a plethora of Web applications such as smart cities, human\nmobility and climate analysis. Existing approaches tackle different learning\ntasks independently, tailoring their models to unique task characteristics.\nThese methods, however, fall short of modeling intrinsic uncertainties in the\nspatio-temporal data. Meanwhile, their specialized designs limit their\nuniversality as general spatio-temporal learning solutions. In this paper, we\npropose to model the learning tasks in a unified perspective, viewing them as\npredictions based on conditional information with shared spatio-temporal\npatterns. Based on this proposal, we introduce Unified Spatio-Temporal\nDiffusion Models (USTD) to address the tasks uniformly within the\nuncertainty-aware diffusion framework. USTD is holistically designed,\ncomprising a shared spatio-temporal encoder and attention-based denoising\nnetworks that are task-specific. The shared encoder, optimized by a\npre-training strategy, effectively captures conditional spatio-temporal\npatterns. 
The denoising networks, utilizing both cross- and self-attention,\nintegrate conditional dependencies and generate predictions. Opting for\nforecasting and kriging as downstream tasks, we design Spatial Gated Attention (SGA)\nand Temporal Gated Attention (TGA) for each task, with different emphases on\nthe spatial and temporal dimensions, respectively. By combining the advantages\nof deterministic encoders and probabilistic diffusion models, USTD achieves\nstate-of-the-art performances compared to deterministic and probabilistic\nbaselines in both tasks, while also providing valuable uncertainty estimates.\n","authors":["Junfeng Hu","Xu Liu","Zhencheng Fan","Yuxuan Liang","Roger Zimmermann"],"pdf_url":"https://arxiv.org/pdf/2310.17360v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17356v1","updated":"2023-10-26T12:44:45Z","published":"2023-10-26T12:44:45Z","title":"Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning","summary":" Ahead-of-time forecasting of the output power of power plants is essential\nfor the stability of the electricity grid and ensuring uninterrupted service.\nHowever, forecasting renewable energy sources is difficult due to the chaotic\nbehavior of natural energy sources. This paper presents a new approach to\nestimate short-term solar irradiance from sky images. The proposed algorithm\nextracts features from sky images and uses learning-based techniques to estimate\nthe solar irradiance. The performance of the proposed machine learning (ML)\nalgorithm is evaluated using two publicly available datasets of sky images.\nThe datasets contain over 350,000 images for an interval of 16 years, from 2004\nto 2020, with the corresponding global horizontal irradiance (GHI) of each\nimage as the ground truth. 
Compared to the state-of-the-art computationally\nheavy algorithms proposed in the literature, our approach achieves competitive\nresults with much less computational complexity for both nowcasting and\nforecasting up to 4 h ahead of time.\n","authors":["Anas Al-lahham","Obaidah Theeb","Khaled Elalem","Tariq A. Alshawi","Saleh A. Alshebeili"],"pdf_url":"https://arxiv.org/pdf/2310.17356v1.pdf","comment":"Published in MDPI Electronics Journal"},{"id":"http://arxiv.org/abs/2310.17355v1","updated":"2023-10-26T12:44:33Z","published":"2023-10-26T12:44:33Z","title":"Exploring the Trie of Rules: a fast data structure for the\n representation of association rules","summary":" Association rule mining techniques can generate a large volume of sequential\ndata when implemented on transactional databases. Extracting insights from a\nlarge set of association rules has been found to be a challenging process. When\nexamining a ruleset, the fundamental question is how to summarise and represent\nmeaningful mined knowledge efficiently. Many algorithms and strategies have\nbeen developed to address issue of knowledge extraction; however, the\neffectiveness of this process can be limited by the data structures. A better\ndata structure can sufficiently affect the speed of the knowledge extraction\nprocess. This paper proposes a novel data structure, called the Trie of rules,\nfor storing a ruleset that is generated by association rule mining. The\nresulting data structure is a prefix-tree graph structure made of pre-mined\nrules. This graph stores the rules as paths within the prefix-tree in a way\nthat similar rules overlay each other. Each node in the tree represents a rule\nwhere a consequent is this node, and an antecedent is a path from this node to\nthe root of the tree. The evaluation showed that the proposed representation\ntechnique is promising. 
It compresses a ruleset with almost no data loss and\nbenefits in terms of time for basic operations such as searching for a specific\nrule and sorting, which is the base for many knowledge discovery methods.\nMoreover, our method demonstrated a significant improvement in traversing time,\nachieving an 8-fold increase compared to traditional data structures.\n","authors":["Mikhail Kudriavtsev","Dr Marija Bezbradica","Dr Andrew McCarren"],"pdf_url":"https://arxiv.org/pdf/2310.17355v1.pdf","comment":"12 pages, 13 figures, preprint of journal article"},{"id":"http://arxiv.org/abs/2305.20081v2","updated":"2023-10-26T12:25:02Z","published":"2023-05-31T17:55:21Z","title":"Efficient Diffusion Policies for Offline Reinforcement Learning","summary":" Offline reinforcement learning (RL) aims to learn optimal policies from\noffline datasets, where the parameterization of policies is crucial but often\noverlooked. Recently, Diffusion-QL significantly boosts the performance of\noffline RL by representing a policy with a diffusion model, whose success\nrelies on a parametrized Markov Chain with hundreds of steps for sampling.\nHowever, Diffusion-QL suffers from two critical limitations. 1) It is\ncomputationally inefficient to forward and backward through the whole Markov\nchain during training. 2) It is incompatible with maximum likelihood-based RL\nalgorithms (e.g., policy gradient methods) as the likelihood of diffusion\nmodels is intractable. Therefore, we propose efficient diffusion policy (EDP)\nto overcome these two challenges. EDP approximately constructs actions from\ncorrupted ones at training to avoid running the sampling chain. We conduct\nextensive experiments on the D4RL benchmark. The results show that EDP can\nreduce the diffusion policy training time from 5 days to 5 hours on\ngym-locomotion tasks. 
Moreover, we show that EDP is compatible with various\noffline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on\nD4RL by large margins over previous methods. Our code is available at\nhttps://github.com/sail-sg/edp.\n","authors":["Bingyi Kang","Xiao Ma","Chao Du","Tianyu Pang","Shuicheng Yan"],"pdf_url":"https://arxiv.org/pdf/2305.20081v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17341v1","updated":"2023-10-26T12:15:56Z","published":"2023-10-26T12:15:56Z","title":"De-novo Chemical Reaction Generation by Means of Temporarily\n Convolutional Neural Networks","summary":" We present here a combination of two networks, Recurrent Neural Networks\n(RNN) and Temporarily Convolutional Neural Networks (TCN) in de novo reaction\ngeneration using the novel Reaction Smiles-like representation of reactions\n(CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks\nare known for their autoregressive properties and are frequently used in\nlanguage modelling with direct application to SMILES generation. The relatively\nnovel TCNs possess similar properties with wide receptive field while obeying\nthe causality required for natural language processing (NLP). The combination\nof both latent representations expressed through TCN and RNN results in an\noverall better performance compared to RNN alone. Additionally, it is shown\nthat different fine-tuning protocols have a profound impact on generative scope\nof the model when applied on a dataset of interest via transfer learning.\n","authors":["Andrei Buin","Hung Yi Chiang","S. Andrew Gadsden","Faraz A. 
Alderson"],"pdf_url":"https://arxiv.org/pdf/2310.17341v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.05556v4","updated":"2023-10-26T12:11:02Z","published":"2022-03-10T18:59:21Z","title":"On Embeddings for Numerical Features in Tabular Deep Learning","summary":" Recently, Transformer-like deep architectures have shown strong performance\non tabular data problems. Unlike traditional models, e.g., MLP, these\narchitectures map scalar values of numerical features to high-dimensional\nembeddings before mixing them in the main backbone. In this work, we argue that\nembeddings for numerical features are an underexplored degree of freedom in\ntabular DL, which allows constructing more powerful DL models and competing\nwith GBDT on some traditionally GBDT-friendly benchmarks. We start by\ndescribing two conceptually different approaches to building embedding modules:\nthe first one is based on a piecewise linear encoding of scalar values, and the\nsecond one utilizes periodic activations. Then, we empirically demonstrate that\nthese two approaches can lead to significant performance boosts compared to the\nembeddings based on conventional blocks such as linear layers and ReLU\nactivations. Importantly, we also show that embedding numerical features is\nbeneficial for many backbones, not only for Transformers. Specifically, after\nproper embeddings, simple MLP-like models can perform on par with the\nattention-based architectures. Overall, we highlight embeddings for numerical\nfeatures as an important design aspect with good potential for further\nimprovements in tabular DL.\n","authors":["Yury Gorishniy","Ivan Rubachev","Artem Babenko"],"pdf_url":"https://arxiv.org/pdf/2203.05556v4.pdf","comment":"NeurIPS 2022 camera-ready. 
Code:\n https://github.com/yandex-research/tabular-dl-num-embeddings (v3-v4: minor\n changes)"},{"id":"http://arxiv.org/abs/2310.17335v1","updated":"2023-10-26T12:01:47Z","published":"2023-10-26T12:01:47Z","title":"A multi-artifact EEG denoising by frequency-based deep learning","summary":" Electroencephalographic (EEG) signals are fundamental to neuroscience\nresearch and clinical applications such as brain-computer interfaces and\nneurological disorder diagnosis. These signals are typically a combination of\nneurological activity and noise, originating from various sources, including\nphysiological artifacts like ocular and muscular movements. Under this setting,\nwe tackle the challenge of distinguishing neurological activity from\nnoise-related sources. We develop a novel EEG denoising model that operates in\nthe frequency domain, leveraging prior knowledge about noise spectral features\nto adaptively compute optimal convolutional filters for noise separation. The\nmodel is trained to learn an empirical relationship connecting the spectral\ncharacteristics of noise and noisy signal to a non-linear transformation which\nallows signal denoising. Performance evaluation on the EEGdenoiseNet dataset\nshows that the proposed model achieves optimal results according to both\ntemporal and spectral metrics. The model is found to remove physiological\nartifacts from input EEG data, thus achieving effective EEG denoising. 
Indeed,\nthe model performance either matches or outperforms that achieved by benchmark\nmodels, proving to effectively remove both muscle and ocular artifacts without\nthe need to perform any training on the particular type of artifact.\n","authors":["Matteo Gabardi","Aurora Saibene","Francesca Gasparini","Daniele Rizzo","Fabio Antonio Stella"],"pdf_url":"https://arxiv.org/pdf/2310.17335v1.pdf","comment":"Accepted at the Italian Workshop on Artificial Intelligence for\n Human-Machine Interaction (AIxHMI 2023), November 06, 2023, Rome, Italy"},{"id":"http://arxiv.org/abs/2106.11959v5","updated":"2023-10-26T12:00:03Z","published":"2021-06-22T17:58:10Z","title":"Revisiting Deep Learning Models for Tabular Data","summary":" The existing literature on deep learning for tabular data proposes a wide\nrange of novel architectures and reports competitive results on various\ndatasets. However, the proposed models are usually not properly compared to\neach other and existing works often use different benchmarks and experiment\nprotocols. As a result, it is unclear for both researchers and practitioners\nwhat models perform best. Additionally, the field still lacks effective\nbaselines, that is, the easy-to-use models that provide competitive performance\nacross different problems.\n In this work, we perform an overview of the main families of DL architectures\nfor tabular data and raise the bar of baselines in tabular DL by identifying\ntwo simple and powerful deep architectures. The first one is a ResNet-like\narchitecture which turns out to be a strong baseline that is often missing in\nprior works. The second model is our simple adaptation of the Transformer\narchitecture for tabular data, which outperforms other solutions on most tasks.\nBoth models are compared to many existing architectures on a diverse set of\ntasks under the same training and tuning protocols. 
We also compare the best DL\nmodels with Gradient Boosted Decision Trees and conclude that there is still no\nuniversally superior solution.\n","authors":["Yury Gorishniy","Ivan Rubachev","Valentin Khrulkov","Artem Babenko"],"pdf_url":"https://arxiv.org/pdf/2106.11959v5.pdf","comment":"NeurIPS 2021 camera-ready. Code:\n https://github.com/yandex-research/tabular-dl-revisiting-models (v3-v5: minor\n changes)"},{"id":"http://arxiv.org/abs/2310.17332v1","updated":"2023-10-26T11:55:30Z","published":"2023-10-26T11:55:30Z","title":"On Forecast Stability","summary":" Forecasts are typically not produced in a vacuum but in a business context,\nwhere forecasts are generated on a regular basis and interact with each other.\nFor decisions, it may be important that forecasts do not change arbitrarily,\nand are stable in some sense. However, this area has received only limited\nattention in the forecasting literature. In this paper, we explore two types of\nforecast stability that we call vertical stability and horizontal stability.\nThe existing works in the literature are only applicable to certain base models\nand extending these frameworks to be compatible with any base model is not\nstraightforward. Furthermore, these frameworks can only stabilise the forecasts\nvertically. To fill this gap, we propose a simple linear-interpolation-based\napproach that is applicable to stabilise the forecasts provided by any base\nmodel vertically and horizontally. The approach can produce both accurate and\nstable forecasts. 
Using N-BEATS, Pooled Regression and LightGBM as the base\nmodels, in our evaluation on four publicly available datasets, the proposed\nframework is able to achieve significantly higher stability and/or accuracy\ncompared to a set of benchmarks including a state-of-the-art forecast\nstabilisation method across three error metrics and six stability metrics.\n","authors":["Rakshitha Godahewa","Christoph Bergmeir","Zeynep Erkin Baz","Chengjun Zhu","Zhangdi Song","Salvador García","Dario Benavides"],"pdf_url":"https://arxiv.org/pdf/2310.17332v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15807v2","updated":"2023-10-26T11:52:12Z","published":"2023-05-25T07:41:35Z","title":"Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with\n Application to Fairness","summary":" We consider contextual bandit problems with knapsacks [CBwK], a problem where\nat each round, a scalar reward is obtained and vector-valued costs are\nsuffered. The learner aims to maximize the cumulative rewards while ensuring\nthat the cumulative costs are lower than some predetermined cost constraints.\nWe assume that contexts come from a continuous set, that costs can be signed,\nand that the expected reward and cost functions, while unknown, may be\nuniformly estimated -- a typical assumption in the literature. In this setting,\ntotal cost constraints had so far to be at least of order $T^{3/4}$, where $T$\nis the number of rounds, and were even typically assumed to depend linearly on\n$T$. We are however motivated to use CBwK to impose a fairness constraint of\nequalized average costs between groups: the budget associated with the\ncorresponding cost constraints should be as close as possible to the natural\ndeviations, of order $\\sqrt{T}$. 
To that end, we introduce a dual strategy\nbased on projected-gradient-descent updates, that is able to deal with\ntotal-cost constraints of the order of $\\sqrt{T}$ up to poly-logarithmic terms.\nThis strategy is more direct and simpler than existing strategies in the\nliterature. It relies on a careful, adaptive, tuning of the step size.\n","authors":["Evgenii Chzhen","Christophe Giraud","Zhen Li","Gilles Stoltz"],"pdf_url":"https://arxiv.org/pdf/2305.15807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17330v1","updated":"2023-10-26T11:50:58Z","published":"2023-10-26T11:50:58Z","title":"CQM: Curriculum Reinforcement Learning with a Quantized World Model","summary":" Recent curriculum Reinforcement Learning (RL) has shown notable progress in\nsolving complex tasks by proposing sequences of surrogate tasks. However, the\nprevious approaches often face challenges when they generate curriculum goals\nin a high-dimensional space. Thus, they usually rely on manually specified goal\nspaces. To alleviate this limitation and improve the scalability of the\ncurriculum, we propose a novel curriculum method that automatically defines the\nsemantic goal space which contains vital information for the curriculum\nprocess, and suggests curriculum goals over it. To define the semantic goal\nspace, our method discretizes continuous observations via vector\nquantized-variational autoencoders (VQ-VAE) and restores the temporal relations\nbetween the discretized observations by a graph. Concurrently, ours suggests\nuncertainty and temporal distance-aware curriculum goals that converges to the\nfinal goals over the automatically composed goal space. We demonstrate that the\nproposed method allows efficient explorations in an uninformed environment with\nraw goal examples only. 
Also, ours outperforms the state-of-the-art curriculum\nRL methods on data efficiency and performance, in various goal-reaching tasks\neven with ego-centric visual inputs.\n","authors":["Seungjae Lee","Daesol Cho","Jonghae Park","H. Jin Kim"],"pdf_url":"https://arxiv.org/pdf/2310.17330v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2211.00990v2","updated":"2023-10-26T11:47:25Z","published":"2022-11-02T09:51:15Z","title":"A weighted-variance variational autoencoder model for speech enhancement","summary":" We address speech enhancement based on variational autoencoders, which\ninvolves learning a speech prior distribution in the time-frequency (TF)\ndomain. A zero-mean complex-valued Gaussian distribution is usually assumed for\nthe generative model, where the speech information is encoded in the variance\nas a function of a latent variable. In contrast to this commonly used approach,\nwe propose a weighted variance generative model, where the contribution of each\nspectrogram time-frame in parameter learning is weighted. We impose a Gamma\nprior distribution on the weights, which would effectively lead to a Student's\nt-distribution instead of Gaussian for speech generative modeling. We develop\nefficient training and speech enhancement algorithms based on the proposed\ngenerative model. 
Our experimental results on spectrogram auto-encoding and\nspeech enhancement demonstrate the effectiveness and robustness of the proposed\napproach compared to the standard unweighted variance model.\n","authors":["Ali Golmakani","Mostafa Sadeghi","Xavier Alameda-Pineda","Romain Serizel"],"pdf_url":"https://arxiv.org/pdf/2211.00990v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17325v1","updated":"2023-10-26T11:44:42Z","published":"2023-10-26T11:44:42Z","title":"C-Disentanglement: Discovering Causally-Independent Generative Factors\n under an Inductive Bias of Confounder","summary":" Representation learning assumes that real-world data is generated by a few\nsemantically meaningful generative factors (i.e., sources of variation) and\naims to discover them in the latent space. These factors are expected to be\ncausally disentangled, meaning that distinct factors are encoded into separate\nlatent variables, and changes in one factor will not affect the values of the\nothers. Compared to statistical independence, causal disentanglement allows\nmore controllable data generation, improved robustness, and better\ngeneralization. However, most existing work assumes unconfoundedness in the\ndiscovery process, that there are no common causes to the generative factors\nand thus obtain only statistical independence. In this paper, we recognize the\nimportance of modeling confounders in discovering causal generative factors.\nUnfortunately, such factors are not identifiable without proper inductive bias.\nWe fill the gap by introducing a framework entitled Confounded-Disentanglement\n(C-Disentanglement), the first framework that explicitly introduces the\ninductive bias of confounder via labels from domain expertise. In addition, we\naccordingly propose an approach to sufficiently identify the causally\ndisentangled factors under any inductive bias of the confounder. We conduct\nextensive experiments on both synthetic and real-world datasets. 
Our method\ndemonstrates competitive results compared to various SOTA baselines in\nobtaining causally disentangled features and downstream tasks under domain\nshifts.\n","authors":["Xiaoyu Liu","Jiaxin Yuan","Bang An","Yuancheng Xu","Yifan Yang","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17325v1.pdf","comment":"accepted to Neurips 2023"},{"id":"http://arxiv.org/abs/2205.15403v3","updated":"2023-10-26T11:13:53Z","published":"2022-05-30T20:00:19Z","title":"Neural Optimal Transport with General Cost Functionals","summary":" We introduce a novel neural network-based algorithm to compute optimal\ntransport (OT) plans for general cost functionals. In contrast to common\nEuclidean costs, i.e., $\\ell^1$ or $\\ell^2$, such functionals provide more\nflexibility and allow using auxiliary information, such as class labels, to\nconstruct the required transport map. Existing methods for general costs are\ndiscrete and have limitations in practice, i.e. they do not provide an\nout-of-sample estimation. We address the challenge of designing a continuous OT\napproach for general costs that generalizes to new data points in\nhigh-dimensional spaces, such as images. Additionally, we provide the\ntheoretical error analysis for our recovered transport plans. As an\napplication, we construct a cost functional to map data distributions while\npreserving the class-wise structure.\n","authors":["Arip Asadulaev","Alexander Korotin","Vage Egiazarian","Petr Mokrov","Evgeny Burnaev"],"pdf_url":"https://arxiv.org/pdf/2205.15403v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2112.12458v3","updated":"2023-10-26T11:11:26Z","published":"2021-12-23T10:55:33Z","title":"Local Advantage Networks for Cooperative Multi-Agent Reinforcement\n Learning","summary":" Many recent successful off-policy multi-agent reinforcement learning (MARL)\nalgorithms for cooperative partially observable environments focus on finding\nfactorized value functions, leading to convoluted network structures. 
Building\non the structure of independent Q-learners, our LAN algorithm takes a radically\ndifferent approach, leveraging a dueling architecture to learn for each agent a\ndecentralized best-response policies via individual advantage functions. The\nlearning is stabilized by a centralized critic whose primary objective is to\nreduce the moving target problem of the individual advantages. The critic,\nwhose network's size is independent of the number of agents, is cast aside\nafter learning. Evaluation on the StarCraft II multi-agent challenge benchmark\nshows that LAN reaches state-of-the-art performance and is highly scalable with\nrespect to the number of agents, opening up a promising alternative direction\nfor MARL research.\n","authors":["Raphaël Avalos","Mathieu Reymond","Ann Nowé","Diederik M. Roijers"],"pdf_url":"https://arxiv.org/pdf/2112.12458v3.pdf","comment":"https://openreview.net/forum?id=adpKzWQunW"},{"id":"http://arxiv.org/abs/2310.17303v1","updated":"2023-10-26T10:54:47Z","published":"2023-10-26T10:54:47Z","title":"Demonstration-Regularized RL","summary":" Incorporating expert demonstrations has empirically helped to improve the\nsample efficiency of reinforcement learning (RL). This paper quantifies\ntheoretically to what extent this extra information reduces RL's sample\ncomplexity. In particular, we study the demonstration-regularized reinforcement\nlearning that leverages the expert demonstrations by KL-regularization for a\npolicy learned by behavior cloning. 
Our findings reveal that using\n$N^{\\mathrm{E}}$ expert demonstrations enables the identification of an optimal\npolicy at a sample complexity of order\n$\\widetilde{\\mathcal{O}}(\\mathrm{Poly}(S,A,H)/(\\varepsilon^2 N^{\\mathrm{E}}))$\nin finite and $\\widetilde{\\mathcal{O}}(\\mathrm{Poly}(d,H)/(\\varepsilon^2\nN^{\\mathrm{E}}))$ in linear Markov decision processes, where $\\varepsilon$ is\nthe target precision, $H$ the horizon, $A$ the number of action, $S$ the number\nof states in the finite case and $d$ the dimension of the feature space in the\nlinear case. As a by-product, we provide tight convergence guarantees for the\nbehaviour cloning procedure under general assumptions on the policy classes.\nAdditionally, we establish that demonstration-regularized methods are provably\nefficient for reinforcement learning from human feedback (RLHF). In this\nrespect, we provide theoretical evidence showing the benefits of\nKL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid\npessimism injection by employing computationally feasible regularization to\nhandle reward estimation uncertainty, thus setting our approach apart from the\nprior works.\n","authors":["Daniil Tiapkin","Denis Belomestny","Daniele Calandriello","Eric Moulines","Alexey Naumov","Pierre Perrault","Michal Valko","Pierre Menard"],"pdf_url":"https://arxiv.org/pdf/2310.17303v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16826v2","updated":"2023-10-26T10:49:13Z","published":"2023-10-25T17:56:28Z","title":"Deep machine learning for meteor monitoring: advances with transfer\n learning and gradient-weighted class activation mapping","summary":" In recent decades, the use of optical detection systems for meteor studies\nhas increased dramatically, resulting in huge amounts of data being analyzed.\nAutomated meteor detection tools are essential for studying the continuous\nmeteoroid incoming flux, recovering fresh meteorites, and achieving a better\nunderstanding of our 
Solar System. Concerning meteor detection, distinguishing\nfalse positives between meteor and non-meteor images has traditionally been\nperformed by hand, which is significantly time-consuming. To address this\nissue, we developed a fully automated pipeline that uses Convolutional Neural\nNetworks (CNNs) to classify candidate meteor detections. Our new method is able\nto detect meteors even in images that contain static elements such as clouds,\nthe Moon, and buildings. To accurately locate the meteor within each frame, we\nemploy the Gradient-weighted Class Activation Mapping (Grad-CAM) technique.\nThis method facilitates the identification of the region of interest by\nmultiplying the activations from the last convolutional layer with the average\nof the gradients across the feature map of that layer. By combining these\nfindings with the activation map derived from the first convolutional layer, we\neffectively pinpoint the most probable pixel location of the meteor. We trained\nand evaluated our model on a large dataset collected by the Spanish Meteor\nNetwork (SPMN) and achieved a precision of 98\\%. Our new methodology presented\nhere has the potential to reduce the workload of meteor scientists and station\noperators and improve the accuracy of meteor tracking and classification.\n","authors":["Eloy Peña-Asensio","Josep M. 
Trigo-Rodríguez","Pau Grèbol-Tomàs","David Regordosa-Avellana","Albert Rimola"],"pdf_url":"https://arxiv.org/pdf/2310.16826v2.pdf","comment":"Accepted in Planetary and Space Science"},{"id":"http://arxiv.org/abs/2305.11685v2","updated":"2023-10-26T10:43:07Z","published":"2023-05-19T14:07:43Z","title":"Recycle-and-Distill: Universal Compression Strategy for\n Transformer-based Speech SSL Models with Attention Map Reusing and Masking\n Distillation","summary":" Transformer-based speech self-supervised learning (SSL) models, such as\nHuBERT, show surprising performance in various speech processing tasks.\nHowever, huge number of parameters in speech SSL models necessitate the\ncompression to a more compact model for wider usage in academia or small\ncompanies. In this study, we suggest to reuse attention maps across the\nTransformer layers, so as to remove key and query parameters while retaining\nthe number of layers. Furthermore, we propose a novel masking distillation\nstrategy to improve the student model's speech representation quality. We\nextend the distillation loss to utilize both masked and unmasked speech frames\nto fully leverage the teacher model's high-quality representation. Our\nuniversal compression strategy yields the student model that achieves phoneme\nerror rate (PER) of 7.72% and word error rate (WER) of 9.96% on the SUPERB\nbenchmark.\n","authors":["Kangwook Jang","Sungnyun Kim","Se-Young Yun","Hoirin Kim"],"pdf_url":"https://arxiv.org/pdf/2305.11685v2.pdf","comment":"Proceedings of Interspeech 2023. Code URL:\n https://github.com/sungnyun/ARMHuBERT"},{"id":"http://arxiv.org/abs/2302.01757v2","updated":"2023-10-26T10:37:31Z","published":"2023-01-31T01:40:26Z","title":"RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers\n via Randomized Deletion","summary":" Randomized smoothing is a leading approach for constructing classifiers that\nare certifiably robust against adversarial examples. 
Existing work on\nrandomized smoothing has focused on classifiers with continuous inputs, such as\nimages, where $\\ell_p$-norm bounded adversaries are commonly studied. However,\nthere has been limited work for classifiers with discrete or variable-size\ninputs, such as for source code, which require different threat models and\nsmoothing mechanisms. In this work, we adapt randomized smoothing for discrete\nsequence classifiers to provide certified robustness against edit\ndistance-bounded adversaries. Our proposed smoothing mechanism randomized\ndeletion (RS-Del) applies random deletion edits, which are (perhaps\nsurprisingly) sufficient to confer robustness against adversarial deletion,\ninsertion and substitution edits. Our proof of certification deviates from the\nestablished Neyman-Pearson approach, which is intractable in our setting, and\nis instead organized around longest common subsequences. We present a case\nstudy on malware detection--a binary classification problem on byte sequences\nwhere classifier evasion is a well-established threat model. When applied to\nthe popular MalConv malware detection model, our smoothing mechanism RS-Del\nachieves a certified accuracy of 91% at an edit distance radius of 128 bytes.\n","authors":["Zhuoqun Huang","Neil G. Marchant","Keane Lucas","Lujo Bauer","Olga Ohrimenko","Benjamin I. P. Rubinstein"],"pdf_url":"https://arxiv.org/pdf/2302.01757v2.pdf","comment":"To be published in NeurIPS 2023. 36 pages, 7 figures, 12 tables.\n Includes 20 pages of appendices"},{"id":"http://arxiv.org/abs/2305.13082v2","updated":"2023-10-26T10:33:34Z","published":"2023-05-22T14:51:40Z","title":"Sketch-and-Project Meets Newton Method: Global $\\mathcal O(k^{-2})$\n Convergence with Low-Rank Updates","summary":" In this paper, we propose the first sketch-and-project Newton method with\nfast $\\mathcal O(k^{-2})$ global convergence rate for self-concordant\nfunctions. 
Our method, SGN, can be viewed in three ways: i) as a\nsketch-and-project algorithm projecting updates of Newton method, ii) as a\ncubically regularized Newton method in sketched subspaces, and iii) as a damped\nNewton method in sketched subspaces. SGN inherits best of all three worlds:\ncheap iteration costs of sketch-and-project methods, state-of-the-art $\\mathcal\nO(k^{-2})$ global convergence rate of full-rank Newton-like methods and the\nalgorithm simplicity of damped Newton methods. Finally, we demonstrate its\ncomparable empirical performance to baseline algorithms.\n","authors":["Slavomír Hanzely"],"pdf_url":"https://arxiv.org/pdf/2305.13082v2.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2301.12906v3","updated":"2023-10-26T10:18:20Z","published":"2023-01-30T14:10:30Z","title":"Curvature Filtrations for Graph Generative Model Evaluation","summary":" Graph generative model evaluation necessitates understanding differences\nbetween graphs on the distributional level. This entails being able to harness\nsalient attributes of graphs in an efficient manner. Curvature constitutes one\nsuch property that has recently proved its utility in characterising graphs.\nIts expressive properties, stability, and practical utility in model evaluation\nremain largely unexplored, however. 
We combine graph curvature descriptors with\nemerging methods from topological data analysis to obtain robust, expressive\ndescriptors for evaluating graph generative models.\n","authors":["Joshua Southern","Jeremy Wayland","Michael Bronstein","Bastian Rieck"],"pdf_url":"https://arxiv.org/pdf/2301.12906v3.pdf","comment":"Accepted at the 37th Conference on Neural Information Processing\n Systems (NeurIPS) 2023"},{"id":"http://arxiv.org/abs/2301.08951v4","updated":"2023-10-26T10:07:02Z","published":"2023-01-21T13:39:39Z","title":"Time-Conditioned Generative Modeling of Object-Centric Representations\n for Video Decomposition and Prediction","summary":" When perceiving the world from multiple viewpoints, humans have the ability\nto reason about the complete objects in a compositional manner even when an\nobject is completely occluded from certain viewpoints. Meanwhile, humans are\nable to imagine novel views after observing multiple viewpoints. Recent\nremarkable advances in multi-view object-centric learning still leaves some\nunresolved problems: 1) The shapes of partially or completely occluded objects\ncan not be well reconstructed. 2) The novel viewpoint prediction depends on\nexpensive viewpoint annotations rather than implicit rules in view\nrepresentations. In this paper, we introduce a time-conditioned generative\nmodel for videos. To reconstruct the complete shape of an object accurately, we\nenhance the disentanglement between the latent representations of objects and\nviews, where the latent representations of time-conditioned views are jointly\ninferred with a Transformer and then are input to a sequential extension of\nSlot Attention to learn object-centric representations. In addition, Gaussian\nprocesses are employed as priors of view latent variables for video generation\nand novel-view prediction without viewpoint annotations. 
Experiments on\nmultiple datasets demonstrate that the proposed model can make object-centric\nvideo decomposition, reconstruct the complete shapes of occluded objects, and\nmake novel-view predictions.\n","authors":["Chengmin Gao","Bin Li"],"pdf_url":"https://arxiv.org/pdf/2301.08951v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.04110v2","updated":"2023-10-26T10:03:26Z","published":"2023-07-09T06:53:59Z","title":"Learning Space-Time Continuous Neural PDEs from Partially Observed\n States","summary":" We introduce a novel grid-independent model for learning partial differential\nequations (PDEs) from noisy and partial observations on irregular\nspatiotemporal grids. We propose a space-time continuous latent neural PDE\nmodel with an efficient probabilistic framework and a novel encoder design for\nimproved data efficiency and grid independence. The latent state dynamics are\ngoverned by a PDE model that combines the collocation method and the method of\nlines. We employ amortized variational inference for approximate posterior\nestimation and utilize a multiple shooting technique for enhanced training\nspeed and stability. Our model demonstrates state-of-the-art performance on\ncomplex synthetic and real-world datasets, overcoming limitations of previous\napproaches and effectively handling partially-observed data. 
The proposed model\noutperforms recent methods, showing its potential to advance data-driven PDE\nmodeling and enabling robust, grid-independent modeling of complex\npartially-observed dynamic processes.\n","authors":["Valerii Iakovlev","Markus Heinonen","Harri Lähdesmäki"],"pdf_url":"https://arxiv.org/pdf/2307.04110v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17281v1","updated":"2023-10-26T10:02:33Z","published":"2023-10-26T10:02:33Z","title":"BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point\n Clouds","summary":" We present a surprisingly simple and efficient method for self-supervision of\n3D backbone on automotive Lidar point clouds. We design a contrastive loss\nbetween features of Lidar scans captured in the same scene. Several such\napproaches have been proposed in the literature from PointContrast, which uses\na contrast at the level of points, to the state-of-the-art TARL, which uses a\ncontrast at the level of segments, roughly corresponding to objects. While the\nformer enjoys a great simplicity of implementation, it is surpassed by the\nlatter, which however requires a costly pre-processing. 
In BEVContrast, we\ndefine our contrast at the level of 2D cells in the Bird's Eye View plane.\nResulting cell-level representations offer a good trade-off between the\npoint-level representations exploited in PointContrast and segment-level\nrepresentations exploited in TARL: we retain the simplicity of PointContrast\n(cell representations are cheap to compute) while surpassing the performance of\nTARL in downstream semantic segmentation.\n","authors":["Corentin Sautier","Gilles Puy","Alexandre Boulch","Renaud Marlet","Vincent Lepetit"],"pdf_url":"https://arxiv.org/pdf/2310.17281v1.pdf","comment":"Accepted to 3DV 2024"},{"id":"http://arxiv.org/abs/2309.01270v2","updated":"2023-10-26T09:58:37Z","published":"2023-09-03T20:50:53Z","title":"COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action\n Spotting using Transformers","summary":" We present COMEDIAN, a novel pipeline to initialize spatiotemporal\ntransformers for action spotting, which involves self-supervised learning and\nknowledge distillation. Action spotting is a timestamp-level temporal action\ndetection task. Our pipeline consists of three steps, with two initialization\nstages. First, we perform self-supervised initialization of a spatial\ntransformer using short videos as input. Additionally, we initialize a temporal\ntransformer that enhances the spatial transformer's outputs with global context\nthrough knowledge distillation from a pre-computed feature bank aligned with\neach short video segment. In the final step, we fine-tune the transformers to\nthe action spotting task. The experiments, conducted on the SoccerNet-v2\ndataset, demonstrate state-of-the-art performance and validate the\neffectiveness of COMEDIAN's pretraining paradigm. 
Our results highlight several\nadvantages of our pretraining pipeline, including improved performance and\nfaster convergence compared to non-pretrained models.\n","authors":["Julien Denize","Mykola Liashuha","Jaonary Rabarisoa","Astrid Orcesi","Romain Hérault"],"pdf_url":"https://arxiv.org/pdf/2309.01270v2.pdf","comment":"Source code is available here:\n https://github.com/juliendenize/eztorch"},{"id":"http://arxiv.org/abs/2211.05321v3","updated":"2023-10-26T09:51:39Z","published":"2022-11-10T03:53:17Z","title":"Fairness and bias correction in machine learning for depression\n prediction: results from four study populations","summary":" A significant level of stigma and inequality exists in mental healthcare,\nespecially in under-served populations. Inequalities are reflected in the data\ncollected for scientific purposes. When not properly accounted for, machine\nlearning (ML) models learnt from data can reinforce these structural\ninequalities or biases. Here, we present a systematic study of bias in ML\nmodels designed to predict depression in four different case studies covering\ndifferent countries and populations. We find that standard ML approaches show\nregularly biased behaviors. We also show that mitigation techniques, both\nstandard and our own post-hoc method, can be effective in reducing the level of\nunfair bias. No single best ML model for depression prediction provides\nequality of outcomes. This emphasizes the importance of analyzing fairness\nduring model selection and transparent reporting about the impact of debiasing\ninterventions. Finally, we provide practical recommendations to develop\nbias-aware ML models for depression risk prediction.\n","authors":["Vien Ngoc Dang","Anna Cascarano","Rosa H. Mulder","Charlotte Cecil","Maria A. 
Zuluaga","Jerónimo Hernández-González","Karim Lekadir"],"pdf_url":"https://arxiv.org/pdf/2211.05321v3.pdf","comment":"11 pages, 2 figures"},{"id":"http://arxiv.org/abs/2310.17273v1","updated":"2023-10-26T09:50:31Z","published":"2023-10-26T09:50:31Z","title":"Looping in the Human: Collaborative and Explainable Bayesian\n Optimization","summary":" Like many optimizers, Bayesian optimization often falls short of gaining user\ntrust due to opacity. While attempts have been made to develop human-centric\noptimizers, they typically assume user knowledge is well-specified and\nerror-free, employing users mainly as supervisors of the optimization process.\nWe relax these assumptions and propose a more balanced human-AI partnership\nwith our Collaborative and Explainable Bayesian Optimization (CoExBO)\nframework. Instead of explicitly requiring a user to provide a knowledge model,\nCoExBO employs preference learning to seamlessly integrate human insights into\nthe optimization, resulting in algorithmic suggestions that resonate with user\npreference. CoExBO explains its candidate selection every iteration to foster\ntrust, empowering users with a clearer grasp of the optimization. Furthermore,\nCoExBO offers a no-harm guarantee, allowing users to make mistakes; even with\nextreme adversarial interventions, the algorithm converges asymptotically to a\nvanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI\nteaming experiments in lithium-ion battery design, highlighting substantial\nimprovements over conventional methods.\n","authors":["Masaki Adachi","Brady Planden","David A. Howey","Krikamol Maundet","Michael A. 
Osborne","Siu Lun Chau"],"pdf_url":"https://arxiv.org/pdf/2310.17273v1.pdf","comment":"22 pages, 9 figures"},{"id":"http://arxiv.org/abs/2210.10485v2","updated":"2023-10-26T09:41:22Z","published":"2022-10-19T11:48:01Z","title":"Learning Transferable Adversarial Robust Representations via Multi-view\n Consistency","summary":" Despite the success on few-shot learning problems, most meta-learned models\nonly focus on achieving good performance on clean examples and thus easily\nbreak down when given adversarially perturbed samples. While some recent works\nhave shown that a combination of adversarial learning and meta-learning could\nenhance the robustness of a meta-learner against adversarial attacks, they fail\nto achieve generalizable adversarial robustness to unseen domains and tasks,\nwhich is the ultimate goal of meta-learning. To address this challenge, we\npropose a novel meta-adversarial multi-view representation learning framework\nwith dual encoders. Specifically, we introduce the discrepancy across the two\ndifferently augmented samples of the same data instance by first updating the\nencoder parameters with them and further imposing a novel label-free\nadversarial attack to maximize their discrepancy. Then, we maximize the\nconsistency across the views to learn transferable robust representations\nacross domains and tasks. 
Through experimental validation on multiple\nbenchmarks, we demonstrate the effectiveness of our framework on few-shot\nlearning tasks from unseen domains, achieving over 10\\% robust accuracy\nimprovements against previous adversarial meta-learning baselines.\n","authors":["Minseon Kim","Hyeonjeong Ha","Dong Bok Lee","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2210.10485v2.pdf","comment":"*Equal contribution (Author ordering determined by coin flip).\n NeurIPS SafetyML workshop 2022, Under review"},{"id":"http://arxiv.org/abs/2310.17264v1","updated":"2023-10-26T09:31:32Z","published":"2023-10-26T09:31:32Z","title":"Variance of ML-based software fault predictors: are we really improving\n fault prediction?","summary":" Software quality assurance activities become increasingly difficult as\nsoftware systems become more and more complex and continuously grow in size.\nMoreover, testing becomes even more expensive when dealing with large-scale\nsystems. Thus, to effectively allocate quality assurance resources, researchers\nhave proposed fault prediction (FP) which utilizes machine learning (ML) to\npredict fault-prone code areas. However, ML algorithms typically make use of\nstochastic elements to increase the prediction models' generalizability and\nefficiency of the training process. These stochastic elements, also known as\nnondeterminism-introducing (NI) factors, lead to variance in the training\nprocess and as a result, lead to variance in prediction accuracy and training\ntime. This variance poses a challenge for reproducibility in research. More\nimportantly, while fault prediction models may have shown good performance in\nthe lab (e.g., often-times involving multiple runs and averaging outcomes),\nhigh variance of results can pose the risk that these models show low\nperformance when applied in practice. In this work, we experimentally analyze\nthe variance of a state-of-the-art fault prediction approach. 
Our experimental\nresults indicate that NI factors can indeed cause considerable variance in the\nfault prediction models' accuracy. We observed a maximum variance of 10.10% in\nterms of the per-class accuracy metric. We thus, also discuss how to deal with\nsuch variance.\n","authors":["Xhulja Shahini","Domenic Bubel","Andreas Metzger"],"pdf_url":"https://arxiv.org/pdf/2310.17264v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14077v2","updated":"2023-10-26T09:27:53Z","published":"2023-05-23T13:56:29Z","title":"Mind the spikes: Benign overfitting of kernels and neural networks in\n fixed dimension","summary":" The success of over-parameterized neural networks trained to near-zero\ntraining error has caused great interest in the phenomenon of benign\noverfitting, where estimators are statistically consistent even though they\ninterpolate noisy training data. While benign overfitting in fixed dimension\nhas been established for some learning methods, current literature suggests\nthat for regression with typical kernel methods and wide neural networks,\nbenign overfitting requires a high-dimensional setting where the dimension\ngrows with the sample size. In this paper, we show that the smoothness of the\nestimators, and not the dimension, is the key: benign overfitting is possible\nif and only if the estimator's derivatives are large enough. We generalize\nexisting inconsistency results to non-interpolating models and more kernels to\nshow that benign overfitting with moderate derivatives is impossible in fixed\ndimension. Conversely, we show that rate-optimal benign overfitting is possible\nfor regression with a sequence of spiky-smooth kernels with large derivatives.\nUsing neural tangent kernels, we translate our results to wide neural networks.\nWe prove that while infinite-width networks do not overfit benignly with the\nReLU activation, this can be fixed by adding small high-frequency fluctuations\nto the activation function. 
Our experiments verify that such neural networks,\nwhile overfitting, can indeed generalize well even on low-dimensional data\nsets.\n","authors":["Moritz Haas","David Holzmüller","Ulrike von Luxburg","Ingo Steinwart"],"pdf_url":"https://arxiv.org/pdf/2305.14077v2.pdf","comment":"We provide Python code to reproduce all of our experimental results\n at https://github.com/moritzhaas/mind-the-spikes"},{"id":"http://arxiv.org/abs/2210.10482v2","updated":"2023-10-26T09:18:23Z","published":"2022-10-19T11:43:39Z","title":"Effective Targeted Attacks for Adversarial Self-Supervised Learning","summary":" Recently, unsupervised adversarial training (AT) has been highlighted as a\nmeans of achieving robustness in models without any label information. Previous\nstudies in unsupervised AT have mostly focused on implementing self-supervised\nlearning (SSL) frameworks, which maximize the instance-wise classification loss\nto generate adversarial examples. However, we observe that simply maximizing\nthe self-supervised training loss with an untargeted adversarial attack often\nresults in generating ineffective adversaries that may not help improve the\nrobustness of the trained model, especially for non-contrastive SSL frameworks\nwithout negative examples. To tackle this problem, we propose a novel positive\nmining for targeted adversarial attack to generate effective adversaries for\nadversarial SSL frameworks. Specifically, we introduce an algorithm that\nselects the most confusing yet similar target example for a given instance\nbased on entropy and similarity, and subsequently perturbs the given instance\ntowards the selected target. 
Our method demonstrates significant enhancements\nin robustness when applied to non-contrastive SSL frameworks, and less but\nconsistent robustness improvements with contrastive SSL frameworks, on the\nbenchmark datasets.\n","authors":["Minseon Kim","Hyeonjeong Ha","Sooel Son","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2210.10482v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.08645v2","updated":"2023-10-26T09:16:25Z","published":"2023-06-14T17:23:07Z","title":"Training-free Diffusion Model Adaptation for Variable-Sized\n Text-to-Image Synthesis","summary":" Diffusion models (DMs) have recently gained attention with state-of-the-art\nperformance in text-to-image synthesis. Abiding by the tradition in deep\nlearning, DMs are trained and evaluated on the images with fixed sizes.\nHowever, users are demanding for various images with specific sizes and various\naspect ratio. This paper focuses on adapting text-to-image diffusion models to\nhandle such variety while maintaining visual fidelity. First we observe that,\nduring the synthesis, lower resolution images suffer from incomplete object\nportrayal, while higher resolution images exhibit repetitively disordered\npresentation. Next, we establish a statistical relationship indicating that\nattention entropy changes with token quantity, suggesting that models aggregate\nspatial information in proportion to image resolution. The subsequent\ninterpretation on our observations is that objects are incompletely depicted\ndue to limited spatial information for low resolutions, while repetitively\ndisorganized presentation arises from redundant spatial information for high\nresolutions. From this perspective, we propose a scaling factor to alleviate\nthe change of attention entropy and mitigate the defective pattern observed.\nExtensive experimental results validate the efficacy of the proposed scaling\nfactor, enabling models to achieve better visual effects, image quality, and\ntext alignment. 
Notably, these improvements are achieved without additional\ntraining or fine-tuning techniques.\n","authors":["Zhiyu Jin","Xuli Shen","Bin Li","Xiangyang Xue"],"pdf_url":"https://arxiv.org/pdf/2306.08645v2.pdf","comment":"Accepted by NeurIPS 2023. 23 pages, 13 figures"},{"id":"http://arxiv.org/abs/2302.03857v5","updated":"2023-10-26T09:15:14Z","published":"2023-02-08T03:20:14Z","title":"Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset\n Selection","summary":" Adversarial contrastive learning (ACL) does not require expensive data\nannotations but outputs a robust representation that withstands adversarial\nattacks and also generalizes to a wide range of downstream tasks. However, ACL\nneeds tremendous running time to generate the adversarial variants of all\ntraining data, which limits its scalability to large datasets. To speed up ACL,\nthis paper proposes a robustness-aware coreset selection (RCS) method. RCS does\nnot require label information and searches for an informative subset that\nminimizes a representational divergence, which is the distance of the\nrepresentation between natural data and their virtual adversarial variants. The\nvanilla solution of RCS via traversing all possible subsets is computationally\nprohibitive. Therefore, we theoretically transform RCS into a surrogate problem\nof submodular maximization, of which the greedy search is an efficient solution\nwith an optimality guarantee for the original problem. Empirically, our\ncomprehensive results corroborate that RCS can speed up ACL by a large margin\nwithout significantly hurting the robustness transferability. Notably, to the\nbest of our knowledge, we are the first to conduct ACL efficiently on the\nlarge-scale ImageNet-1K dataset to obtain an effective robust representation\nvia RCS. 
Our source code is at\nhttps://github.com/GodXuxilie/Efficient_ACL_via_RCS.\n","authors":["Xilie Xu","Jingfeng Zhang","Feng Liu","Masashi Sugiyama","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2302.03857v5.pdf","comment":"NeurIPS 2023 Spotlight"},{"id":"http://arxiv.org/abs/2310.17256v1","updated":"2023-10-26T09:13:15Z","published":"2023-10-26T09:13:15Z","title":"fairret: a Framework for Differentiable Fairness Regularization Terms","summary":" Current tools for machine learning fairness only admit a limited range of\nfairness definitions and have seen little integration with automatic\ndifferentiation libraries, despite the central role these libraries play in\nmodern machine learning pipelines.\n We introduce a framework of fairness regularization terms (fairrets) which\nquantify bias as modular objectives that are easily integrated in automatic\ndifferentiation pipelines. By employing a general definition of fairness in\nterms of linear-fractional statistics, a wide class of fairrets can be computed\nefficiently. Experiments show the behavior of their gradients and their utility\nin enforcing fairness with minimal loss of predictive power compared to\nbaselines. Our contribution includes a PyTorch implementation of the fairret\nframework.\n","authors":["Maarten Buyl","MaryBeth Defrance","Tijl De Bie"],"pdf_url":"https://arxiv.org/pdf/2310.17256v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15694v3","updated":"2023-10-26T09:08:34Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. 
Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. Our experimental results show that COPF\noutperforms strong Continuous learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17250v1","updated":"2023-10-26T08:58:29Z","published":"2023-10-26T08:58:29Z","title":"IDENAS: Internal Dependency Exploration for Neural Architecture Search","summary":" Machine learning is a powerful tool for extracting valuable information and\nmaking various predictions from diverse datasets. Traditional algorithms rely\non well-defined input and output variables however, there are scenarios where\nthe distinction between the input and output variables and the underlying,\nassociated (input and output) layers of the model, are unknown. Neural\nArchitecture Search (NAS) and Feature Selection have emerged as promising\nsolutions in such scenarios. 
This research proposes IDENAS, an Internal\nDependency-based Exploration for Neural Architecture Search, integrating NAS\nwith feature selection. The methodology explores internal dependencies in the\ncomplete parameter space for classification involving 1D sensor and 2D image\ndata as well. IDENAS employs a modified encoder-decoder model and the\nSequential Forward Search (SFS) algorithm, combining input-output configuration\nsearch with embedded feature selection. Experimental results demonstrate\nIDENAS's superior performance in comparison to other algorithms, showcasing its\neffectiveness in model development pipelines and automated machine learning. On\naverage, IDENAS achieved significant modelling improvements, underscoring its\nsignificant contribution to advancing the state-of-the-art in neural\narchitecture search and feature selection integration.\n","authors":["Anh T. Hoang","Zsolt J. Viharos"],"pdf_url":"https://arxiv.org/pdf/2310.17250v1.pdf","comment":"57 pages, 19 figures + appendix, the related software code can be\n found under the link: https://github.com/viharoszsolt/IDENAS"},{"id":"http://arxiv.org/abs/2305.18353v2","updated":"2023-10-26T08:49:13Z","published":"2023-05-26T14:39:46Z","title":"Emergent representations in networks trained with the Forward-Forward\n algorithm","summary":" The Backpropagation algorithm has often been criticised for its lack of\nbiological realism. In an attempt to find a more biologically plausible\nalternative, the recently introduced Forward-Forward algorithm replaces the\nforward and backward passes of Backpropagation with two forward passes. In this\nwork, we show that the internal representations obtained by the Forward-Forward\nalgorithm can organise into category-specific ensembles exhibiting high\nsparsity - i.e. composed of an extremely low number of active units. 
This\nsituation is reminiscent of what has been observed in cortical sensory areas,\nwhere neuronal ensembles are suggested to serve as the functional building\nblocks for perception and action. Interestingly, while this sparse pattern does\nnot typically arise in models trained with standard Backpropagation, it can\nemerge in networks trained with Backpropagation on the same objective proposed\nfor the Forward-Forward algorithm. These results suggest that the learning\nprocedure proposed by Forward-Forward may be superior to Backpropagation in\nmodelling learning in the cortex, even when a backward pass is used.\n","authors":["Niccolò Tosato","Lorenzo Basile","Emanuele Ballarin","Giuseppe de Alteriis","Alberto Cazzaniga","Alessio Ansuini"],"pdf_url":"https://arxiv.org/pdf/2305.18353v2.pdf","comment":"16 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.17247v1","updated":"2023-10-26T08:47:42Z","published":"2023-10-26T08:47:42Z","title":"Grokking Beyond Neural Networks: An Empirical Exploration with Model\n Complexity","summary":" In some settings neural networks exhibit a phenomenon known as grokking,\nwhere they achieve perfect or near-perfect accuracy on the validation set long\nafter the same performance has been achieved on the training set. In this\npaper, we discover that grokking is not limited to neural networks but occurs\nin other settings such as Gaussian process (GP) classification, GP regression\nand linear regression. We also uncover a mechanism by which to induce grokking\non algorithmic datasets via the addition of dimensions containing spurious\ninformation. The presence of the phenomenon in non-neural architectures\nprovides evidence that grokking is not specific to SGD or weight norm\nregularisation. Instead, grokking may be possible in any setting where solution\nsearch is guided by complexity and error. 
Based on this insight and further\ntrends we see in the training trajectories of a Bayesian neural network (BNN)\nand GP regression model, we make progress towards a more general theory of\ngrokking. Specifically, we hypothesise that the phenomenon is governed by the\naccessibility of certain regions in the error and complexity landscapes.\n","authors":["Jack Miller","Charles O'Neill","Thang Bui"],"pdf_url":"https://arxiv.org/pdf/2310.17247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17245v1","updated":"2023-10-26T08:45:23Z","published":"2023-10-26T08:45:23Z","title":"CROP: Conservative Reward for Model-based Offline Policy Optimization","summary":" Offline reinforcement learning (RL) aims to optimize policy using collected\ndata without online interactions. Model-based approaches are particularly\nappealing for addressing offline RL challenges due to their capability to\nmitigate the limitations of offline data through data generation using models.\nPrior research has demonstrated that introducing conservatism into the model or\nQ-function during policy optimization can effectively alleviate the prevalent\ndistribution drift problem in offline RL. However, the investigation into the\nimpacts of conservatism in reward estimation is still lacking. This paper\nproposes a novel model-based offline RL algorithm, Conservative Reward for\nmodel-based Offline Policy optimization (CROP), which conservatively estimates\nthe reward in model training. To achieve a conservative reward estimation, CROP\nsimultaneously minimizes the estimation error and the reward of random actions.\nTheoretical analysis shows that this conservative reward mechanism leads to a\nconservative policy evaluation and helps mitigate distribution drift.\nExperiments on D4RL benchmarks showcase that the performance of CROP is\ncomparable to the state-of-the-art baselines. 
Notably, CROP establishes an\ninnovative connection between offline and online RL, highlighting that offline\nRL problems can be tackled by adopting online RL techniques to the empirical\nMarkov decision process trained with a conservative reward. The source code is\navailable with https://github.com/G0K0URURI/CROP.git.\n","authors":["Hao Li","Xiao-Hu Zhou","Xiao-Liang Xie","Shi-Qi Liu","Zhen-Qiu Feng","Xiao-Yin Liu","Mei-Jiang Gui","Tian-Yu Xiang","De-Xing Huang","Bo-Xian Yao","Zeng-Guang Hou"],"pdf_url":"https://arxiv.org/pdf/2310.17245v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17238v1","updated":"2023-10-26T08:36:39Z","published":"2023-10-26T08:36:39Z","title":"Joint Entity and Relation Extraction with Span Pruning and Hypergraph\n Neural Networks","summary":" Entity and Relation Extraction (ERE) is an important task in information\nextraction. Recent marker-based pipeline models achieve state-of-the-art\nperformance, but still suffer from the error propagation issue. Also, most of\ncurrent ERE models do not take into account higher-order interactions between\nmultiple entities and relations, while higher-order modeling could be\nbeneficial. In this work, we propose HyperGraph neural network for ERE\n($\\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based\npipeline model). To alleviate error propagation, we use a high-recall pruner\nmechanism to transfer the burden of entity identification and labeling from the\nNER module to the joint module of our model. For higher-order modeling, we\nbuild a hypergraph, where nodes are entities (provided by the span pruner) and\nrelations thereof, and hyperedges encode interactions between two different\nrelations or between a relation and its associated subject and object entities.\nWe then run a hypergraph neural network for higher-order inference by applying\nmessage passing over the built hypergraph. 
Experiments on three widely used\nbenchmarks (\\acef{}, \\ace{} and \\scierc{}) for ERE task show significant\nimprovements over the previous state-of-the-art PL-marker.\n","authors":["Zhaohui Yan","Songlin Yang","Wei Liu","Kewei Tu"],"pdf_url":"https://arxiv.org/pdf/2310.17238v1.pdf","comment":"Accepted to Proceedings of EMNLP, 2023"},{"id":"http://arxiv.org/abs/2310.17230v1","updated":"2023-10-26T08:28:48Z","published":"2023-10-26T08:28:48Z","title":"Codebook Features: Sparse and Discrete Interpretability for Neural\n Networks","summary":" Understanding neural networks is challenging in part because of the dense,\ncontinuous nature of their hidden states. We explore whether we can train\nneural networks to have hidden states that are sparse, discrete, and more\ninterpretable by quantizing their continuous features into what we call\ncodebook features. Codebook features are produced by finetuning neural networks\nwith vector quantization bottlenecks at each layer, producing a network whose\nhidden features are the sum of a small number of discrete vector codes chosen\nfrom a larger codebook. Surprisingly, we find that neural networks can operate\nunder this extreme bottleneck with only modest degradation in performance. This\nsparse, discrete bottleneck also provides an intuitive way of controlling\nneural network behavior: first, find codes that activate when the desired\nbehavior is present, then activate those same codes during generation to elicit\nthat behavior. We validate our approach by training codebook Transformers on\nseveral different datasets. First, we explore a finite state machine dataset\nwith far more hidden states than neurons. In this setting, our approach\novercomes the superposition problem by assigning states to distinct codes, and\nwe find that we can make the neural network behave as if it is in a different\nstate by activating the code for that state. 
Second, we train Transformer\nlanguage models with up to 410M parameters on two natural language datasets. We\nidentify codes in these models representing diverse, disentangled concepts\n(ranging from negative emotions to months of the year) and find that we can\nguide the model to generate different topics by activating the appropriate\ncodes during inference. Overall, codebook features appear to be a promising\nunit of analysis and control for neural networks and interpretability. Our\ncodebase and models are open-sourced at\nhttps://github.com/taufeeque9/codebook-features.\n","authors":["Alex Tamkin","Mohammad Taufeeque","Noah D. Goodman"],"pdf_url":"https://arxiv.org/pdf/2310.17230v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04449v3","updated":"2023-10-26T08:23:48Z","published":"2023-02-09T05:47:03Z","title":"Read and Reap the Rewards: Learning to Play Atari with the Help of\n Instruction Manuals","summary":" High sample complexity has long been a challenge for RL. On the other hand,\nhumans learn to perform tasks not only from interaction or demonstrations, but\nalso by reading unstructured text documents, e.g., instruction manuals.\nInstruction manuals and wiki pages are among the most abundant data that could\ninform agents of valuable features and policies or task-specific environmental\ndynamics and reward structures. Therefore, we hypothesize that the ability to\nutilize human-written instruction manuals to assist learning policies for\nspecific tasks should lead to a more efficient and better-performing agent. We\npropose the Read and Reward framework. Read and Reward speeds up RL algorithms\non Atari games by reading manuals released by the Atari game developers. Our\nframework consists of a QA Extraction module that extracts and summarizes\nrelevant information from the manual and a Reasoning module that evaluates\nobject-agent interactions based on information from the manual. 
An auxiliary\nreward is then provided to a standard A2C RL agent, when interaction is\ndetected. Experimentally, various RL algorithms obtain significant improvement\nin performance and training speed when assisted by our design.\n","authors":["Yue Wu","Yewen Fan","Paul Pu Liang","Amos Azaria","Yuanzhi Li","Tom M. Mitchell"],"pdf_url":"https://arxiv.org/pdf/2302.04449v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17217v1","updated":"2023-10-26T08:08:43Z","published":"2023-10-26T08:08:43Z","title":"Beyond MLE: Convex Learning for Text Generation","summary":" Maximum likelihood estimation (MLE) is a statistical method used to estimate\nthe parameters of a probability distribution that best explain the observed\ndata. In the context of text generation, MLE is often used to train generative\nlanguage models, which can then be used to generate new text. However, we argue\nthat MLE is not always necessary and optimal, especially for closed-ended text\ngeneration tasks like machine translation. In these tasks, the goal of model is\nto generate the most appropriate response, which does not necessarily require\nit to estimate the entire data distribution with MLE. To this end, we propose a\nnovel class of training objectives based on convex functions, which enables\ntext generation models to focus on highly probable outputs without having to\nestimate the entire data distribution. We investigate the theoretical\nproperties of the optimal predicted distribution when applying convex functions\nto the loss, demonstrating that convex functions can sharpen the optimal\ndistribution, thereby enabling the model to better capture outputs with high\nprobabilities. Experiments on various text generation tasks and models show the\neffectiveness of our approach. 
It enables autoregressive models to bridge the\ngap between greedy and beam search, and facilitates the learning of\nnon-autoregressive models with a maximum improvement of 9+ BLEU points.\nMoreover, our approach also exhibits significant impact on large language\nmodels (LLMs), substantially enhancing their generative capability on various\ntasks. Source code is available at\n\\url{https://github.com/ictnlp/Convex-Learning}.\n","authors":["Chenze Shao","Zhengrui Ma","Min Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.17217v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17209v1","updated":"2023-10-26T07:54:47Z","published":"2023-10-26T07:54:47Z","title":"Weakly-Supervised Surgical Phase Recognition","summary":" A key element of computer-assisted surgery systems is phase recognition of\nsurgical videos. Existing phase recognition algorithms require frame-wise\nannotation of a large number of videos, which is time-consuming and costly. In\nthis work we join concepts of graph segmentation with self-supervised learning\nto derive a random-walk solution for per-frame phase prediction. Furthermore,\nwe utilize within our method two forms of weak supervision: sparse timestamps\nor few-shot learning. The proposed algorithm enjoys low complexity and can\noperate in low-data regimes. 
We validate our method by running experiments with\nthe public Cholec80 dataset of laparoscopic cholecystectomy videos,\ndemonstrating promising performance in multiple setups.\n","authors":["Roy Hirsch","Regev Cohen","Mathilde Caron","Tomer Golany","Daniel Freedman","Ehud Rivlin"],"pdf_url":"https://arxiv.org/pdf/2310.17209v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13268v2","updated":"2023-10-26T07:51:47Z","published":"2023-10-20T04:23:12Z","title":"DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model\n Statistics","summary":" Diffusion probabilistic models (DPMs) have exhibited excellent performance\nfor high-fidelity image generation while suffering from inefficient sampling.\nRecent works accelerate the sampling procedure by proposing fast ODE solvers\nthat leverage the specific ODE form of DPMs. However, they highly rely on\nspecific parameterization during inference (such as noise/data prediction),\nwhich might not be the optimal choice. In this work, we propose a novel\nformulation towards the optimal parameterization during sampling that minimizes\nthe first-order discretization error of the ODE solution. Based on such\nformulation, we propose \\textit{DPM-Solver-v3}, a new fast ODE solver for DPMs\nby introducing several coefficients efficiently computed on the pretrained\nmodel, which we call \\textit{empirical model statistics}. We further\nincorporate multistep methods and a predictor-corrector framework, and propose\nsome techniques for improving sample quality at small numbers of function\nevaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3\nachieves consistently better or comparable performance in both unconditional\nand conditional sampling with both pixel-space and latent-space DPMs,\nespecially in 5$\\sim$10 NFEs. 
We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE)\non unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable\nDiffusion, bringing a speed-up of 15\\%$\\sim$30\\% compared to previous\nstate-of-the-art training-free methods. Code is available at\n\\url{https://github.com/thu-ml/DPM-Solver-v3}.\n","authors":["Kaiwen Zheng","Cheng Lu","Jianfei Chen","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.13268v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17202v1","updated":"2023-10-26T07:37:44Z","published":"2023-10-26T07:37:44Z","title":"miditok: A Python package for MIDI file tokenization","summary":" Recent progress in natural language processing has been adapted to the\nsymbolic music modality. Language models, such as Transformers, have been used\nwith symbolic music for a variety of tasks among which music generation,\nmodeling or transcription, with state-of-the-art performances. These models are\nbeginning to be used in production products. To encode and decode music for the\nbackbone model, they need to rely on tokenizers, whose role is to serialize\nmusic into sequences of distinct elements called tokens. MidiTok is an\nopen-source library allowing to tokenize symbolic music with great flexibility\nand extended features. It features the most popular music tokenizations, under\na unified API. It is made to be easily used and extensible for everyone.\n","authors":["Nathan Fradet","Jean-Pierre Briot","Fabien Chhel","Amal El Fallah Seghrouchni","Nicolas Gutowski"],"pdf_url":"https://arxiv.org/pdf/2310.17202v1.pdf","comment":"Updated and comprehensive report. 
Original ISMIR 2021 document at\n https://archives.ismir.net/ismir2021/latebreaking/000005.pdf"},{"id":"http://arxiv.org/abs/2310.17200v1","updated":"2023-10-26T07:32:52Z","published":"2023-10-26T07:32:52Z","title":"Taming Gradient Variance in Federated Learning with Networked Control\n Variates","summary":" Federated learning, a decentralized approach to machine learning, faces\nsignificant challenges such as extensive communication overheads, slow\nconvergence, and unstable improvements. These challenges primarily stem from\nthe gradient variance due to heterogeneous client data distributions. To\naddress this, we introduce a novel Networked Control Variates (FedNCV)\nframework for Federated Learning. We adopt the REINFORCE Leave-One-Out (RLOO)\nas a fundamental control variate unit in the FedNCV framework, implemented at\nboth client and server levels. At the client level, the RLOO control variate is\nemployed to optimize local gradient updates, mitigating the variance introduced\nby data samples. Once relayed to the server, the RLOO-based estimator further\nprovides an unbiased and low-variance aggregated gradient, leading to robust\nglobal updates. This dual-side application is formalized as a linear\ncombination of composite control variates. We provide a mathematical expression\ncapturing this integration of double control variates within FedNCV and present\nthree theoretical results with corresponding proofs. This unique dual structure\nequips FedNCV to address data heterogeneity and scalability issues, thus\npotentially paving the way for large-scale applications. 
Moreover, we tested\nFedNCV on six diverse datasets under a Dirichlet distribution with {\\alpha} =\n0.1, and benchmarked its performance against six SOTA methods, demonstrating\nits superiority.\n","authors":["Xingyan Chen","Yaling Liu","Huaming Du","Mu Wang","Yu Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.17200v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2302.01681v2","updated":"2023-10-26T07:25:50Z","published":"2023-02-03T12:10:24Z","title":"Improving the Timing Resolution of Positron Emission Tomography\n Detectors Using Boosted Learning -- A Residual Physics Approach","summary":" Artificial intelligence (AI) is entering medical imaging, mainly enhancing\nimage reconstruction. Nevertheless, improvements throughout the entire\nprocessing, from signal detection to computation, potentially offer significant\nbenefits. This work presents a novel and versatile approach to detector\noptimization using machine learning (ML) and residual physics. We apply the\nconcept to positron emission tomography (PET), intending to improve the\ncoincidence time resolution (CTR). PET visualizes metabolic processes in the\nbody by detecting photons with scintillation detectors. Improved CTR\nperformance offers the advantage of reducing radioactive dose exposure for\npatients. Modern PET detectors with sophisticated concepts and read-out\ntopologies represent complex physical and electronic systems requiring\ndedicated calibration techniques. Traditional methods primarily depend on\nanalytical formulations successfully describing the main detector\ncharacteristics. However, when accounting for higher-order effects, additional\ncomplexities arise matching theoretical models to experimental reality. Our\nwork addresses this challenge by combining traditional calibration with AI and\nresidual physics, presenting a highly promising approach. We present a residual\nphysics-based strategy using gradient tree boosting and physics-guided data\ngeneration. 
The explainable AI framework SHapley Additive exPlanations (SHAP)\nwas used to identify known physical effects with learned patterns. In addition,\nthe models were tested against basic physical laws. We were able to improve the\nCTR significantly (more than 20%) for clinically relevant detectors of 19 mm\nheight, reaching CTRs of 185 ps (450-550 keV).\n","authors":["Stephan Naunheim","Yannick Kuhl","David Schug","Volkmar Schulz","Florian Mueller"],"pdf_url":"https://arxiv.org/pdf/2302.01681v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.11588v2","updated":"2023-10-26T07:20:02Z","published":"2023-01-27T08:25:45Z","title":"Bounding Box-based Multi-objective Bayesian Optimization of Risk\n Measures under Input Uncertainty","summary":" In this study, we propose a novel multi-objective Bayesian optimization\n(MOBO) method to efficiently identify the Pareto front (PF) defined by risk\nmeasures for black-box functions under the presence of input uncertainty (IU).\nExisting BO methods for Pareto optimization in the presence of IU are\nrisk-specific or without theoretical guarantees, whereas our proposed method\naddresses general risk measures and has theoretical guarantees. The basic idea\nof the proposed method is to assume a Gaussian process (GP) model for the\nblack-box function and to construct high-probability bounding boxes for the\nrisk measures using the GP model. Furthermore, in order to reduce the\nuncertainty of non-dominated bounding boxes, we propose a method of selecting\nthe next evaluation point using a maximin distance defined by the maximum value\nof a quasi distance based on bounding boxes. As theoretical analysis, we prove\nthat the algorithm can return an arbitrary-accurate solution in a finite number\nof iterations with high probability, for various risk measures such as Bayes\nrisk, worst-case risk, and value-at-risk. 
We also give a theoretical analysis\nthat takes into account approximation errors because there exist non-negligible\napproximation errors (e.g., finite approximation of PFs and sampling-based\napproximation of bounding boxes) in practice. We confirm that the proposed\nmethod outperforms compared with existing methods not only in the setting with\nIU but also in the setting of ordinary MOBO through numerical experiments.\n","authors":["Yu Inatsu","Shion Takeno","Hiroyuki Hanada","Kazuki Iwata","Ichiro Takeuchi"],"pdf_url":"https://arxiv.org/pdf/2301.11588v2.pdf","comment":"39 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.17191v1","updated":"2023-10-26T07:10:31Z","published":"2023-10-26T07:10:31Z","title":"How do Language Models Bind Entities in Context?","summary":" To correctly use in-context information, language models (LMs) must bind\nentities to their attributes. For example, given a context describing a \"green\nsquare\" and a \"blue circle\", LMs must bind the shapes to their respective\ncolors. We analyze LM representations and identify the binding ID mechanism: a\ngeneral mechanism for solving the binding problem, which we observe in every\nsufficiently large model from the Pythia and LLaMA families. Using causal\ninterventions, we show that LMs' internal activations represent binding\ninformation by attaching binding ID vectors to corresponding entities and\nattributes. 
We further show that binding ID vectors form a continuous subspace,\nin which distances between binding ID vectors reflect their discernability.\nOverall, our results uncover interpretable strategies in LMs for representing\nsymbolic knowledge in-context, providing a step towards understanding general\nin-context reasoning in large-scale LMs.\n","authors":["Jiahai Feng","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2310.17191v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.17310v3","updated":"2023-10-26T07:01:53Z","published":"2023-09-29T15:08:28Z","title":"Leave-one-out Distinguishability in Machine Learning","summary":" We introduce a new analytical framework to quantify the changes in a machine\nlearning algorithm's output distribution following the inclusion of a few data\npoints in its training set, a notion we define as leave-one-out\ndistinguishability (LOOD). This problem is key to measuring data\n**memorization** and **information leakage** in machine learning, and the\n**influence** of training data points on model predictions. We illustrate how\nour method broadens and refines existing empirical measures of memorization and\nprivacy risks associated with training data. We use Gaussian processes to model\nthe randomness of machine learning algorithms, and validate LOOD with extensive\nempirical analysis of information leakage using membership inference attacks.\nOur theoretical framework enables us to investigate the causes of information\nleakage and where the leakage is high. For example, we analyze the influence of\nactivation functions, on data memorization. Additionally, our method allows us\nto optimize queries that disclose the most significant information about the\ntraining data in the leave-one-out setting. 
We illustrate how optimal queries\ncan be used for accurate **reconstruction** of training data.\n","authors":["Jiayuan Ye","Anastasia Borovykh","Soufiane Hayou","Reza Shokri"],"pdf_url":"https://arxiv.org/pdf/2309.17310v3.pdf","comment":"Fixed typos"},{"id":"http://arxiv.org/abs/2310.00270v5","updated":"2023-10-26T07:01:02Z","published":"2023-09-30T06:20:21Z","title":"SpatialRank: Urban Event Ranking with NDCG Optimization on\n Spatiotemporal Data","summary":" The problem of urban event ranking aims at predicting the top-k most risky\nlocations of future events such as traffic accidents and crimes. This problem\nis of fundamental importance to public safety and urban administration\nespecially when limited resources are available. The problem is, however,\nchallenging due to complex and dynamic spatio-temporal correlations between\nlocations, uneven distribution of urban events in space, and the difficulty to\ncorrectly rank nearby locations with similar features. Prior works on event\nforecasting mostly aim at accurately predicting the actual risk score or counts\nof events for all the locations. Rankings obtained as such usually have low\nquality due to prediction errors. Learning-to-rank methods directly optimize\nmeasures such as Normalized Discounted Cumulative Gain (NDCG), but cannot\nhandle the spatiotemporal autocorrelation existing among locations. In this\npaper, we bridge the gap by proposing a novel spatial event ranking approach\nnamed SpatialRank. SpatialRank features adaptive graph convolution layers that\ndynamically learn the spatiotemporal dependencies across locations from data.\nIn addition, the model optimizes through surrogates a hybrid NDCG loss with a\nspatial component to better rank neighboring spatial locations. We design an\nimportance-sampling with a spatial filtering algorithm to effectively evaluate\nthe loss during training. 
Comprehensive experiments on three real-world\ndatasets demonstrate that SpatialRank can effectively identify the top riskiest\nlocations of crimes and traffic accidents and outperform state-of-art methods\nin terms of NDCG by up to 12.7%.\n","authors":["Bang An","Xun Zhou","Yongjian Zhong","Tianbao Yang"],"pdf_url":"https://arxiv.org/pdf/2310.00270v5.pdf","comment":"37th Conference on Neural Information Processing Systems (NeurIPS\n 2023)"},{"id":"http://arxiv.org/abs/2306.04618v2","updated":"2023-10-26T06:59:50Z","published":"2023-06-07T17:47:03Z","title":"Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis,\n and LLMs Evaluations","summary":" This paper reexamines the research on out-of-distribution (OOD) robustness in\nthe field of NLP. We find that the distribution shift settings in previous\nstudies commonly lack adequate challenges, hindering the accurate evaluation of\nOOD robustness. To address these issues, we propose a benchmark construction\nprotocol that ensures clear differentiation and challenging distribution\nshifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution\nrobustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we\nconduct a series of experiments on pre-trained language models for analysis and\nevaluation of OOD robustness. First, for vanilla fine-tuning, we examine the\nrelationship between in-distribution (ID) and OOD performance. We identify\nthree typical types that unveil the inner learning mechanism, which could\npotentially facilitate the forecasting of OOD robustness, correlating with the\nadvancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and\nfind that, despite exhibiting some effectiveness in specific cases, they do not\noffer significant improvement compared to vanilla fine-tuning. 
Further, we\nevaluate 5 LLMs with various adaptation paradigms and find that when sufficient\nID data is available, fine-tuning domain-specific models outperform LLMs on ID\nexamples significantly. However, in the case of OOD instances, prioritizing\nLLMs with in-context learning yields better results. We identify that both\nfine-tuned small models and LLMs face challenges in effectively addressing\ndownstream tasks. The code is public at\n\\url{https://github.com/lifan-yuan/OOD_NLP}.\n","authors":["Lifan Yuan","Yangyi Chen","Ganqu Cui","Hongcheng Gao","Fangyuan Zou","Xingyi Cheng","Heng Ji","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2306.04618v2.pdf","comment":"Accepted to NeurIPS 2023 Dataset and Benchmark Track. Code is\n available at \\url{https://github.com/lifan-yuan/OOD_NLP}"},{"id":"http://arxiv.org/abs/2310.13923v2","updated":"2023-10-26T06:50:44Z","published":"2023-10-21T07:16:09Z","title":"Diversified Outlier Exposure for Out-of-Distribution Detection via\n Informative Extrapolation","summary":" Out-of-distribution (OOD) detection is important for deploying reliable\nmachine learning models on real-world applications. Recent advances in outlier\nexposure have shown promising results on OOD detection via fine-tuning model\nwith informatively sampled auxiliary outliers. However, previous methods assume\nthat the collected outliers can be sufficiently large and representative to\ncover the boundary between ID and OOD data, which might be impractical and\nchallenging. In this work, we propose a novel framework, namely, Diversified\nOutlier Exposure (DivOE), for effective OOD detection via informative\nextrapolation based on the given auxiliary outliers. Specifically, DivOE\nintroduces a new learning objective, which diversifies the auxiliary\ndistribution by explicitly synthesizing more informative outliers for\nextrapolation during training. 
It leverages a multi-step optimization method to\ngenerate novel outliers beyond the original ones, which is compatible with many\nvariants of outlier exposure. Extensive experiments and analyses have been\nconducted to characterize and demonstrate the effectiveness of the proposed\nDivOE. The code is publicly available at: https://github.com/tmlr-group/DivOE.\n","authors":["Jianing Zhu","Geng Yu","Jiangchao Yao","Tongliang Liu","Gang Niu","Masashi Sugiyama","Bo Han"],"pdf_url":"https://arxiv.org/pdf/2310.13923v2.pdf","comment":"accepted by NeurIPS 2023"}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.17388v1","updated":"2023-10-26T13:35:34Z","published":"2023-10-26T13:35:34Z","title":"Architecture Design of a Networked Music Performance Platform for a\n Chamber Choir","summary":" This paper describes an architecture design process for Networked Music\nPerformance (NMP) platform for medium-sized conducted music ensembles, based on\nremote rehearsals of Academic Choir of Gdansk University of Technology. The\nissues of real-time remote communication, in-person music performance, and NMP\nare described. Three iterative steps defining and extending the architecture of\nthe NMP platform with additional features to enhance its utility in remote\nrehearsals are presented. The first iteration uses a regular video conferencing\nplatform, the second iteration uses dedicated NMP devices and tools, and the\nthird iteration adds video transmission and utilizes professional low-latency\naudio and video workstations. For each iteration, the platform architecture is\ndefined and deployed with simultaneous usability tests. Its strengths and\nweaknesses are identified through qualitative and quantitative measurements -\nstatistical analysis shows a significant improvement in rehearsal quality after\neach iteration. 
The final optimal architecture is described and concluded with\nguidelines for creating NMP systems for said music ensembles.\n","authors":["Jan Cychnerski","Bartłomiej Mróz"],"pdf_url":"https://arxiv.org/pdf/2310.17388v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17300v1","updated":"2023-10-26T10:45:26Z","published":"2023-10-26T10:45:26Z","title":"Comparing Photorealistic and Animated Embodied Conversational Agents in\n Serious Games: An Empirical Study on User Experience","summary":" Embodied conversational agents (ECAs) are paradigms of conversational user\ninterfaces in the form of embodied characters. While ECAs offer various\nmanipulable features, this paper focuses on a study conducted to explore two\ndistinct levels of presentation realism. The two agent versions are\nphotorealistic and animated. The study aims to provide insights and design\nsuggestions for speech-enabled ECAs within serious game environments. A\nwithin-subjects, two-by-two factorial design was employed for this research\nwith a cohort of 36 participants balanced for gender. The results showed that\nboth the photorealistic and the animated versions were perceived as highly\nusable, with overall mean scores of 5.76 and 5.71, respectively. However, 69.4\nper cent of the participants stated they preferred the photorealistic version,\n25 per cent stated they preferred the animated version and 5.6 per cent had no\nstated preference. The photorealistic agents were perceived as more realistic\nand human-like, while the animated characters made the task feel more like a\ngame. Even though the agents' realism had no significant effect on usability,\nit positively influenced participants' perceptions of the agent. 
This research\naims to lay the groundwork for future studies on ECA realism's impact in\nserious games across diverse contexts.\n","authors":["Danai Korre"],"pdf_url":"https://arxiv.org/pdf/2310.17300v1.pdf","comment":"21 pages, 14 figures, preprint to be published in HCI INTERNATIONAL\n 2023 25TH INTERNATIONAL CONFERENCE ON HUMAN-COMPUTER INTERACTION proceedings"},{"id":"http://arxiv.org/abs/2310.17193v1","updated":"2023-10-26T07:15:40Z","published":"2023-10-26T07:15:40Z","title":"Automatic Edge Error Judgment in Figure Skating Using 3D Pose Estimation\n from a Monocular Camera and IMUs","summary":" Automatic evaluating systems are fundamental issues in sports technologies.\nIn many sports, such as figure skating, automated evaluating methods based on\npose estimation have been proposed. However, previous studies have evaluated\nskaters' skills in 2D analysis. In this paper, we propose an automatic edge\nerror judgment system with a monocular smartphone camera and inertial sensors,\nwhich enable us to analyze 3D motions. Edge error is one of the most\nsignificant scoring items and is challenging to automatically judge due to its\n3D motion. The results show that the model using 3D joint position coordinates\nestimated from the monocular camera as the input feature had the highest\naccuracy at 83% for unknown skaters' data. We also analyzed the detailed motion\nanalysis for edge error judgment. These results indicate that the monocular\ncamera can be used to judge edge errors automatically. 
We will provide the\nfigure skating single Lutz jump dataset, including pre-processed videos and\nlabels, at https://github.com/ryota-takedalab/JudgeAI-LutzEdge.\n","authors":["Ryota Tanaka","Tomohiro Suzuki","Kazuya Takeda","Keisuke Fujii"],"pdf_url":"https://arxiv.org/pdf/2310.17193v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10355v3","updated":"2023-10-26T02:52:40Z","published":"2023-05-17T16:34:01Z","title":"Evaluating Object Hallucination in Large Vision-Language Models","summary":" Inspired by the superior language abilities of large language models (LLM),\nlarge vision-language models (LVLM) have been recently explored by integrating\npowerful LLMs for improving the performance on complex multimodal tasks.\nDespite the promising progress on LVLMs, we find that LVLMs suffer from the\nhallucination problem, i.e. they tend to generate objects that are inconsistent\nwith the target images in the descriptions. To investigate it, this work\npresents the first systematic study on object hallucination of LVLMs. We\nconduct the evaluation experiments on several representative LVLMs, and show\nthat they mostly suffer from severe object hallucination issue. We further\ndiscuss that the visual instructions may influence the hallucination, and find\nthat: objects that frequently occur in the visual instructions or co-occur with\nthe image objects, are obviously prone to be hallucinated by LVLMs. Besides, we\nfind that existing evaluation methods might be affected by the input\ninstructions and generation styles of LVLMs. Thus, we further design an\nimproved evaluation method for object hallucination by proposing a\npolling-based query method called POPE. 
Experiment results demonstrate that our\nPOPE can evaluate the object hallucination in a more stable and flexible way.\nOur codes and data are publicly available at https://github.com/RUCAIBox/POPE.\n","authors":["Yifan Li","Yifan Du","Kun Zhou","Jinpeng Wang","Wayne Xin Zhao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2305.10355v3.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17796v1","updated":"2023-10-26T21:57:21Z","published":"2023-10-26T21:57:21Z","title":"ControlLLM: Augment Language Models with Tools by Searching on Graphs","summary":" We present ControlLLM, a novel framework that enables large language models\n(LLMs) to utilize multi-modal tools for solving complex real-world tasks.\nDespite the remarkable performance of LLMs, they still struggle with tool\ninvocation due to ambiguous user prompts, inaccurate tool selection and\nparameterization, and inefficient tool scheduling. To overcome these\nchallenges, our framework comprises three key components: (1) a \\textit{task\ndecomposer} that breaks down a complex task into clear subtasks with\nwell-defined inputs and outputs; (2) a \\textit{Thoughts-on-Graph (ToG)\nparadigm} that searches the optimal solution path on a pre-built tool graph,\nwhich specifies the parameter and dependency relations among different tools;\nand (3) an \\textit{execution engine with a rich toolbox} that interprets the\nsolution path and runs the tools efficiently on different computational\ndevices. 
We evaluate our framework on diverse tasks involving image, audio, and\nvideo processing, demonstrating its superior accuracy, efficiency, and\nversatility compared to existing methods.\n","authors":["Zhaoyang Liu","Zeqiang Lai","Zhangwei Gao","Erfei Cui","Xizhou Zhu","Lewei Lu","Qifeng Chen","Yu Qiao","Jifeng Dai","Wenhai Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17796v1.pdf","comment":"22 pages, 9 figures, 10 tables"}]},"2023-10-27T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.18313v1","updated":"2023-10-27T17:59:51Z","published":"2023-10-27T17:59:51Z","title":"FP8-LM: Training FP8 Large Language Models","summary":" In this paper, we explore FP8 low-bit data formats for efficient training of\nlarge language models (LLMs). Our key insight is that most variables, such as\ngradients and optimizer states, in LLM training can employ low-precision data\nformats without compromising model accuracy and requiring no changes to\nhyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision\nframework for training LLMs. This framework offers three levels of FP8\nutilization to streamline mixed-precision and distributed parallel training for\nLLMs. It gradually incorporates 8-bit gradients, optimizer states, and\ndistributed learning in an incremental manner. Experiment results show that,\nduring the training of GPT-175B model on H100 GPU platform, our FP8\nmixed-precision training framework not only achieved a remarkable 42% reduction\nin real memory usage but also ran 64% faster than the widely adopted BF16\nframework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer\nEngine by 17%. This largely reduces the training costs for large foundation\nmodels. Furthermore, our FP8 mixed-precision training methodology is generic.\nIt can be seamlessly applied to other tasks such as LLM instruction tuning and\nreinforcement learning with human feedback, offering savings in fine-tuning\nexpenses. 
Our FP8 low-precision training framework is open-sourced at\nhttps://github.com/Azure/MS-AMP (aka.ms/MS.AMP).\n","authors":["Houwen Peng","Kan Wu","Yixuan Wei","Guoshuai Zhao","Yuxiang Yang","Ze Liu","Yifan Xiong","Ziyue Yang","Bolin Ni","Jingcheng Hu","Ruihang Li","Miaosen Zhang","Chen Li","Jia Ning","Ruizhe Wang","Zheng Zhang","Shuguang Liu","Joe Chau","Han Hu","Peng Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.18313v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13548v3","updated":"2023-10-27T17:45:26Z","published":"2023-10-20T14:46:48Z","title":"Towards Understanding Sycophancy in Language Models","summary":" Human feedback is commonly utilized to finetune AI assistants. But human\nfeedback may also encourage model responses that match user beliefs over\ntruthful ones, a behaviour known as sycophancy. We investigate the prevalence\nof sycophancy in models whose finetuning procedure made use of human feedback,\nand the potential role of human preference judgments in such behavior. We first\ndemonstrate that five state-of-the-art AI assistants consistently exhibit\nsycophancy across four varied free-form text-generation tasks. To understand if\nhuman preferences drive this broadly observed behavior, we analyze existing\nhuman preference data. We find that when a response matches a user's views, it\nis more likely to be preferred. Moreover, both humans and preference models\n(PMs) prefer convincingly-written sycophantic responses over correct ones a\nnon-negligible fraction of the time. Optimizing model outputs against PMs also\nsometimes sacrifices truthfulness in favor of sycophancy. Overall, our results\nindicate that sycophancy is a general behavior of state-of-the-art AI\nassistants, likely driven in part by human preference judgments favoring\nsycophantic responses.\n","authors":["Mrinank Sharma","Meg Tong","Tomasz Korbak","David Duvenaud","Amanda Askell","Samuel R. Bowman","Newton Cheng","Esin Durmus","Zac Hatfield-Dodds","Scott R. 
Johnston","Shauna Kravec","Timothy Maxwell","Sam McCandlish","Kamal Ndousse","Oliver Rausch","Nicholas Schiefer","Da Yan","Miranda Zhang","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2310.13548v3.pdf","comment":"32 pages, 20 figures"},{"id":"http://arxiv.org/abs/2310.18290v1","updated":"2023-10-27T17:28:23Z","published":"2023-10-27T17:28:23Z","title":"An Approach to Automatically generating Riddles aiding Concept\n Attainment","summary":" One of the primary challenges in online learning environments, is to retain\nlearner engagement. Several different instructional strategies are proposed\nboth in online and offline environments to enhance learner engagement. The\nConcept Attainment Model is one such instructional strategy that focuses on\nlearners acquiring a deeper understanding of a concept rather than just its\ndictionary definition. This is done by searching and listing the properties\nused to distinguish examples from non-examples of various concepts. Our work\nattempts to apply the Concept Attainment Model to build conceptual riddles, to\ndeploy over online learning environments. The approach involves creating\nfactual triples from learning resources, classifying them based on their\nuniqueness to a concept into `Topic Markers' and `Common', followed by\ngenerating riddles based on the Concept Attainment Model's format and capturing\nall possible solutions to those riddles. 
The results obtained from the human\nevaluation of riddles prove encouraging.\n","authors":["Niharika Sri Parasa","Chaitali Diwan","Srinath Srinivasa"],"pdf_url":"https://arxiv.org/pdf/2310.18290v1.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2212.10015v3","updated":"2023-10-27T17:24:04Z","published":"2022-12-20T06:03:51Z","title":"Benchmarking Spatial Relationships in Text-to-Image Generation","summary":" Spatial understanding is a fundamental aspect of computer vision and integral\nfor human-level reasoning about images, making it an important component for\ngrounded language understanding. While recent text-to-image synthesis (T2I)\nmodels have shown unprecedented improvements in photorealism, it is unclear\nwhether they have reliable spatial understanding capabilities. We investigate\nthe ability of T2I models to generate correct spatial relationships among\nobjects and present VISOR, an evaluation metric that captures how accurately\nthe spatial relationship described in text is generated in the image. To\nbenchmark existing models, we introduce a dataset, $\\mathrm{SR}_{2D}$, that\ncontains sentences describing two or more objects and the spatial relationships\nbetween them. We construct an automated evaluation pipeline to recognize\nobjects and their spatial relationships, and employ it in a large-scale\nevaluation of T2I models. Our experiments reveal a surprising finding that,\nalthough state-of-the-art T2I models exhibit high image quality, they are\nseverely limited in their ability to generate multiple objects or the specified\nspatial relations between them. Our analyses demonstrate several biases and\nartifacts of T2I models such as the difficulty with generating multiple\nobjects, a bias towards generating the first object mentioned, spatially\ninconsistent outputs for equivalent relationships, and a correlation between\nobject co-occurrence and spatial understanding capabilities. 
We conduct a human\nstudy that shows the alignment between VISOR and human judgement about spatial\nunderstanding. We offer the $\\mathrm{SR}_{2D}$ dataset and the VISOR metric to\nthe community in support of T2I reasoning research.\n","authors":["Tejas Gokhale","Hamid Palangi","Besmira Nushi","Vibhav Vineet","Eric Horvitz","Ece Kamar","Chitta Baral","Yezhou Yang"],"pdf_url":"https://arxiv.org/pdf/2212.10015v3.pdf","comment":"preprint; Code and Data at https://github.com/microsoft/VISOR and\n https://huggingface.co/datasets/tgokhale/sr2d_visor"},{"id":"http://arxiv.org/abs/2305.12029v2","updated":"2023-10-27T17:01:50Z","published":"2023-05-19T22:50:02Z","title":"MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational\n Transcript Cleanup","summary":" Current disfluency detection models focus on individual utterances each from\na single speaker. However, numerous discontinuity phenomena in spoken\nconversational transcripts occur across multiple turns, hampering human\nreadability and the performance of downstream NLP tasks. This study addresses\nthese phenomena by proposing an innovative Multi-Turn Cleanup task for spoken\nconversational transcripts and collecting a new dataset, MultiTurnCleanup. We\ndesign a data labeling schema to collect the high-quality dataset and provide\nextensive data analysis. Furthermore, we leverage two modeling approaches for\nexperimental evaluation as benchmarks for future research.\n","authors":["Hua Shen","Vicky Zayats","Johann C. Rocholl","Daniel D. Walker","Dirk Padfield"],"pdf_url":"https://arxiv.org/pdf/2305.12029v2.pdf","comment":"EMNLP 2023 main conference. 
Dataset:\n https://github.com/huashen218/MultiTurnCleanup"},{"id":"http://arxiv.org/abs/2310.18263v1","updated":"2023-10-27T16:51:29Z","published":"2023-10-27T16:51:29Z","title":"MalFake: A Multimodal Fake News Identification for Malayalam using\n Recurrent Neural Networks and VGG-16","summary":" The amount of news being consumed online has substantially expanded in recent\nyears. Fake news has become increasingly common, especially in regional\nlanguages like Malayalam, due to the rapid publication and lack of editorial\nstandards on some online sites. Fake news may have a terrible effect on\nsociety, causing people to make bad judgments, lose faith in authorities, and\neven engage in violent behavior. When we take into the context of India, there\nare many regional languages, and fake news is spreading in every language.\nTherefore, providing efficient techniques for identifying false information in\nregional tongues is crucial. Until now, little to no work has been done in\nMalayalam, extracting features from multiple modalities to classify fake news.\nMultimodal approaches are more accurate in detecting fake news, as features\nfrom multiple modalities are extracted to build the deep learning\nclassification model. As far as we know, this is the first piece of work in\nMalayalam that uses multimodal deep learning to tackle false information.\nModels trained with more than one modality typically outperform models taught\nwith only one modality. Our study in the Malayalam language utilizing\nmultimodal deep learning is a significant step toward more effective\nmisinformation detection and mitigation.\n","authors":["Adhish S. Sujan","Ajitha. V","Aleena Benny","Amiya M. P.","V. S. 
Anoop"],"pdf_url":"https://arxiv.org/pdf/2310.18263v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11403v4","updated":"2023-10-27T16:38:40Z","published":"2023-03-20T19:20:34Z","title":"eP-ALM: Efficient Perceptual Augmentation of Language Models","summary":" Large Language Models (LLMs) have so far impressed the world, with\nunprecedented capabilities that emerge in models at large scales. On the vision\nside, transformer models (i.e., ViT) are following the same trend, achieving\nthe best performance on challenging benchmarks. With the abundance of such\nunimodal models, a natural question arises; do we need also to follow this\ntrend to tackle multimodal tasks? In this work, we propose to rather direct\neffort to efficient adaptations of existing models, and propose to augment\nLanguage Models with perception. Existing approaches for adapting pretrained\nmodels for vision-language tasks still rely on several key components that\nhinder their efficiency. In particular, they still train a large number of\nparameters, rely on large multimodal pretraining, use encoders (e.g., CLIP)\ntrained on huge image-text datasets, and add significant inference overhead. In\naddition, most of these approaches have focused on Zero-Shot and In Context\nLearning, with little to no effort on direct finetuning. We investigate the\nminimal computational effort needed to adapt unimodal models for multimodal\ntasks and propose a new challenging setup, alongside different approaches, that\nefficiently adapts unimodal pretrained models. We show that by freezing more\nthan 99% of total parameters, training only one linear projection layer, and\nprepending only one trainable token, our approach (dubbed eP-ALM) significantly\noutperforms other baselines on VQA and Captioning across Image, Video, and\nAudio modalities, following the proposed setup. 
The code is available here:\nhttps://github.com/mshukor/eP-ALM.\n","authors":["Mustafa Shukor","Corentin Dancette","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2303.11403v4.pdf","comment":"Accepted at ICCV 2023. Project page:\n https://mshukor.github.io/eP-ALM.github.io/"},{"id":"http://arxiv.org/abs/2310.18239v1","updated":"2023-10-27T16:24:24Z","published":"2023-10-27T16:24:24Z","title":"Fine-Tuning Language Models Using Formal Methods Feedback","summary":" Although pre-trained language models encode generic knowledge beneficial for\nplanning and control, they may fail to generate appropriate control policies\nfor domain-specific tasks. Existing fine-tuning methods use human feedback to\naddress this limitation, however, sourcing human feedback is labor intensive\nand costly. We present a fully automated approach to fine-tune pre-trained\nlanguage models for applications in autonomous systems, bridging the gap\nbetween generic knowledge and domain-specific requirements while reducing cost.\nThe method synthesizes automaton-based controllers from pre-trained models\nguided by natural language task descriptions. These controllers are verifiable\nagainst independently provided specifications within a world model, which can\nbe abstract or obtained from a high-fidelity simulator. Controllers with high\ncompliance with the desired specifications receive higher ranks, guiding the\niterative fine-tuning process. We provide quantitative evidences, primarily in\nautonomous driving, to demonstrate the method's effectiveness across multiple\ntasks. The results indicate an improvement in percentage of specifications\nsatisfied by the controller from 60% to 90%.\n","authors":["Yunhao Yang","Neel P. 
Bhatt","Tyler Ingebrand","William Ward","Steven Carr","Zhangyang Wang","Ufuk Topcu"],"pdf_url":"https://arxiv.org/pdf/2310.18239v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18235v1","updated":"2023-10-27T16:20:10Z","published":"2023-10-27T16:20:10Z","title":"Davidsonian Scene Graph: Improving Reliability in Fine-grained\n Evaluation for Text-Image Generation","summary":" Evaluating text-to-image models is notoriously difficult. A strong recent\napproach for assessing text-image faithfulness is based on QG/A (question\ngeneration and answering), which uses pre-trained foundational models to\nautomatically generate a set of questions and answers from the prompt, and\noutput images are scored based on whether these answers extracted with a visual\nquestion answering model are consistent with the prompt-based answers. This\nkind of evaluation is naturally dependent on the quality of the underlying QG\nand QA models. We identify and address several reliability challenges in\nexisting QG/A work: (a) QG questions should respect the prompt (avoiding\nhallucinations, duplications, and omissions) and (b) VQA answers should be\nconsistent (not asserting that there is no motorcycle in an image while also\nclaiming the motorcycle is blue). We address these issues with Davidsonian\nScene Graph (DSG), an empirically grounded evaluation framework inspired by\nformal semantics. DSG is an automatic, graph-based QG/A that is modularly\nimplemented to be adaptable to any QG/A module. DSG produces atomic and unique\nquestions organized in dependency graphs, which (i) ensure appropriate semantic\ncoverage and (ii) sidestep inconsistent answers. With extensive experimentation\nand human evaluation on a range of model configurations (LLM, VQA, and T2I), we\nempirically demonstrate that DSG addresses the challenges noted above. 
Finally,\nwe present DSG-1k, an open-sourced evaluation benchmark that includes 1,060\nprompts, covering a wide range of fine-grained semantic categories with a\nbalanced distribution. We will release the DSG-1k prompts and the corresponding\nDSG questions.\n","authors":["Jaemin Cho","Yushi Hu","Roopal Garg","Peter Anderson","Ranjay Krishna","Jason Baldridge","Mohit Bansal","Jordi Pont-Tuset","Su Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18235v1.pdf","comment":"Project website: https://google.github.io/DSG"},{"id":"http://arxiv.org/abs/2305.09770v6","updated":"2023-10-27T16:08:32Z","published":"2023-05-16T19:48:49Z","title":"ConvXAI: Delivering Heterogeneous AI Explanations via Conversations to\n Support Human-AI Scientific Writing","summary":" Despite a surge collection of XAI methods, users still struggle to obtain\nrequired AI explanations. Previous research suggests chatbots as dynamic\nsolutions, but the effective design of conversational XAI agents for practical\nhuman needs remains under-explored. This paper focuses on Conversational XAI\nfor AI-assisted scientific writing tasks. Drawing from human linguistic\ntheories and formative studies, we identify four design rationales:\n\"multifaceted\", \"controllability\", \"mix-initiative\", \"context-aware\ndrill-down\". We incorporate them into an interactive prototype, ConvXAI, which\nfacilitates heterogeneous AI explanations for scientific writing through\ndialogue. In two studies with 21 users, ConvXAI outperforms a GUI-based\nbaseline on improving human-perceived understanding and writing improvement.\nThe paper further discusses the practical human usage patterns in interacting\nwith ConvXAI for scientific co-writing.\n","authors":["Hua Shen","Chieh-Yang Huang","Tongshuang Wu","Ting-Hao 'Kenneth' Huang"],"pdf_url":"https://arxiv.org/pdf/2305.09770v6.pdf","comment":"CSCW 2023 Demo. 
ConvXAI system code:\n https://github.com/huashen218/convxai.git"},{"id":"http://arxiv.org/abs/2310.18229v1","updated":"2023-10-27T16:08:15Z","published":"2023-10-27T16:08:15Z","title":"Revising with a Backward Glance: Regressions and Skips during Reading as\n Cognitive Signals for Revision Policies in Incremental Processing","summary":" In NLP, incremental processors produce output in instalments, based on\nincoming prefixes of the linguistic input. Some tokens trigger revisions,\ncausing edits to the output hypothesis, but little is known about why models\nrevise when they revise. A policy that detects the time steps where revisions\nshould happen can improve efficiency. Still, retrieving a suitable signal to\ntrain a revision policy is an open problem, since it is not naturally available\nin datasets. In this work, we investigate the appropriateness of regressions\nand skips in human reading eye-tracking data as signals to inform revision\npolicies in incremental sequence labelling. Using generalised mixed-effects\nmodels, we find that the probability of regressions and skips by humans can\npotentially serve as useful predictors for revisions in BiLSTMs and Transformer\nmodels, with consistent results for various languages.\n","authors":["Brielen Madureira","Pelin Çelikkol","David Schlangen"],"pdf_url":"https://arxiv.org/pdf/2310.18229v1.pdf","comment":"Accepted to CoNLL 2023"},{"id":"http://arxiv.org/abs/2310.14088v2","updated":"2023-10-27T16:00:49Z","published":"2023-10-21T18:59:41Z","title":"MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark\n for Language Model Evaluation","summary":" Curated datasets for healthcare are often limited due to the need of human\nannotations from experts. In this paper, we present MedEval, a multi-level,\nmulti-task, and multi-domain medical benchmark to facilitate the development of\nlanguage models for healthcare. 
MedEval is comprehensive and consists of data\nfrom several healthcare systems and spans 35 human body regions from 8\nexamination modalities. With 22,779 collected sentences and 21,228 reports, we\nprovide expert annotations at multiple levels, offering a granular potential\nusage of the data and supporting a wide range of tasks. Moreover, we\nsystematically evaluated 10 generic and domain-specific language models under\nzero-shot and finetuning settings, from domain-adapted baselines in healthcare\nto general-purposed state-of-the-art large language models (e.g., ChatGPT). Our\nevaluations reveal varying effectiveness of the two categories of language\nmodels across different tasks, from which we notice the importance of\ninstruction tuning for few-shot usage of large language models. Our\ninvestigation paves the way toward benchmarking language models for healthcare\nand provides valuable insights into the strengths and limitations of adopting\nlarge language models in medical domains, informing their practical\napplications and future advancements.\n","authors":["Zexue He","Yu Wang","An Yan","Yao Liu","Eric Y. Chang","Amilcare Gentili","Julian McAuley","Chun-Nan Hsu"],"pdf_url":"https://arxiv.org/pdf/2310.14088v2.pdf","comment":"Accepted to EMNLP 2023. Camera-ready version: added more evaluation\n results on LLMs such as GPT4, LLaMa2, and LLaMa2-chat"},{"id":"http://arxiv.org/abs/2305.14815v2","updated":"2023-10-27T15:43:53Z","published":"2023-05-24T07:09:56Z","title":"Machine Reading Comprehension using Case-based Reasoning","summary":" We present an accurate and interpretable method for answer extraction in\nmachine reading comprehension that is reminiscent of case-based reasoning (CBR)\nfrom classical AI. Our method (CBR-MRC) builds upon the hypothesis that\ncontextualized answers to similar questions share semantic similarities with\neach other. 
Given a test question, CBR-MRC first retrieves a set of similar\ncases from a non-parametric memory and then predicts an answer by selecting the\nspan in the test context that is most similar to the contextualized\nrepresentations of answers in the retrieved cases. The semi-parametric nature\nof our approach allows it to attribute a prediction to the specific set of\nevidence cases, making it a desirable choice for building reliable and\ndebuggable QA systems. We show that CBR-MRC provides high accuracy comparable\nwith large reader models and outperforms baselines by 11.5 and 8.4 EM on\nNaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability\nof CBR-MRC in identifying not just the correct answer tokens but also the span\nwith the most relevant supporting evidence. Lastly, we observe that contexts\nfor certain question types show higher lexical diversity than others and find\nthat CBR-MRC is robust to these variations while performance using\nfully-parametric methods drops.\n","authors":["Dung Thai","Dhruv Agarwal","Mudit Chaudhary","Rajarshi Das","Manzil Zaheer","Jay-Yoon Lee","Hannaneh Hajishirzi","Andrew McCallum"],"pdf_url":"https://arxiv.org/pdf/2305.14815v2.pdf","comment":"9 pages, 2 figures"},{"id":"http://arxiv.org/abs/2207.07051v2","updated":"2023-10-27T15:35:45Z","published":"2022-07-14T16:51:09Z","title":"Language models show human-like content effects on reasoning tasks","summary":" Abstract reasoning is a key ability for an intelligent system. Large language\nmodels (LMs) achieve above-chance performance on abstract reasoning tasks, but\nexhibit many imperfections. However, human abstract reasoning is also\nimperfect. 
For example, human reasoning is affected by our real-world knowledge\nand beliefs, and shows notable \"content effects\"; humans reason more reliably\nwhen the semantic content of a problem supports the correct logical inferences.\nThese content-entangled reasoning patterns play a central role in debates about\nthe fundamental nature of human intelligence. Here, we investigate whether\nlanguage models $\\unicode{x2014}$ whose prior expectations capture some aspects\nof human knowledge $\\unicode{x2014}$ similarly mix content into their answers\nto logical problems. We explored this question across three logical reasoning\ntasks: natural language inference, judging the logical validity of syllogisms,\nand the Wason selection task. We evaluate state of the art large language\nmodels, as well as humans, and find that the language models reflect many of\nthe same patterns observed in humans across these tasks $\\unicode{x2014}$ like\nhumans, models answer more accurately when the semantic content of a task\nsupports the logical inferences. These parallels are reflected both in answer\npatterns, and in lower-level features like the relationship between model\nanswer distributions and human response times. Our findings have implications\nfor understanding both these cognitive effects in humans, and the factors that\ncontribute to language model performance.\n","authors":["Ishita Dasgupta","Andrew K. Lampinen","Stephanie C. Y. Chan","Hannah R. Sheahan Antonia Creswell","Dharshan Kumaran","James L. 
McClelland","Felix Hill"],"pdf_url":"https://arxiv.org/pdf/2207.07051v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18208v1","updated":"2023-10-27T15:31:22Z","published":"2023-10-27T15:31:22Z","title":"ArcheType: A Novel Framework for Open-Source Column Type Annotation\n using Large Language Models","summary":" Existing deep-learning approaches to semantic column type annotation (CTA)\nhave important shortcomings: they rely on semantic types which are fixed at\ntraining time; require a large number of training samples per type and incur\nlarge run-time inference costs; and their performance can degrade when\nevaluated on novel datasets, even when types remain constant. Large language\nmodels have exhibited strong zero-shot classification performance on a wide\nrange of tasks and in this paper we explore their use for CTA. We introduce\nArcheType, a simple, practical method for context sampling, prompt\nserialization, model querying, and label remapping, which enables large\nlanguage models to solve column type annotation problems in a fully zero-shot\nmanner. We ablate each component of our method separately, and establish that\nimprovements to context sampling and label remapping provide the most\nconsistent gains. ArcheType establishes new state-of-the-art performance on\nboth zero-shot and fine-tuned CTA, including three new domain-specific\nbenchmarks, which we release, along with the code to reproduce our results at\nhttps://github.com/penfever/ArcheType.\n","authors":["Benjamin Feuer","Yurong Liu","Chinmay Hegde","Juliana Freire"],"pdf_url":"https://arxiv.org/pdf/2310.18208v1.pdf","comment":"17 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.18207v1","updated":"2023-10-27T15:31:16Z","published":"2023-10-27T15:31:16Z","title":"INA: An Integrative Approach for Enhancing Negotiation Strategies with\n Reward-Based Dialogue System","summary":" In this paper, we propose a novel negotiation dialogue agent designed for the\nonline marketplace. 
Our agent is integrative in nature i.e, it possesses the\ncapability to negotiate on price as well as other factors, such as the addition\nor removal of items from a deal bundle, thereby offering a more flexible and\ncomprehensive negotiation experience. We create a new dataset called\nIntegrative Negotiation Dataset (IND) to enable this functionality. For this\ndataset creation, we introduce a new semi-automated data creation method, which\ncombines defining negotiation intents, actions, and intent-action simulation\nbetween users and the agent to generate potential dialogue flows. Finally, the\nprompting of GPT-J, a state-of-the-art language model, is done to generate\ndialogues for a given intent, with a human-in-the-loop process for post-editing\nand refining minor errors to ensure high data quality. We employ a set of novel\nrewards, specifically tailored for the negotiation task to train our\nNegotiation Agent, termed as the Integrative Negotiation Agent (INA). These\nrewards incentivize the chatbot to learn effective negotiation strategies that\ncan adapt to various contextual requirements and price proposals. By leveraging\nthe IND, we train our model and conduct experiments to evaluate the\neffectiveness of our reward-based dialogue system for negotiation. Our results\ndemonstrate that the proposed approach and reward system significantly enhance\nthe agent's negotiation capabilities. 
The INA successfully engages in\nintegrative negotiations, displaying the ability to dynamically adjust prices\nand negotiate the inclusion or exclusion of items in a bundle deal.\n","authors":["Zishan Ahmad","Suman Saurabh","Vaishakh Sreekanth Menon","Asif Ekbal","Roshni Ramnani","Anutosh Maitra"],"pdf_url":"https://arxiv.org/pdf/2310.18207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18205v1","updated":"2023-10-27T15:28:12Z","published":"2023-10-27T15:28:12Z","title":"Lost in Translation, Found in Spans: Identifying Claims in Multilingual\n Social Media","summary":" Claim span identification (CSI) is an important step in fact-checking\npipelines, aiming to identify text segments that contain a checkworthy claim or\nassertion in a social media post. Despite its importance to journalists and\nhuman fact-checkers, it remains a severely understudied problem, and the scarce\nresearch on this topic so far has only focused on English. Here we aim to\nbridge this gap by creating a novel dataset, X-CLAIM, consisting of 7K\nreal-world claims collected from numerous social media platforms in five Indian\nlanguages and English. We report strong baselines with state-of-the-art\nencoder-only language models (e.g., XLM-R) and we demonstrate the benefits of\ntraining on multiple languages over alternative cross-lingual transfer methods\nsuch as zero-shot transfer, or training on translated data, from a\nhigh-resource language such as English. 
We evaluate generative large language\nmodels from the GPT series using prompting methods on the X-CLAIM dataset and\nwe find that they underperform the smaller encoder-only language models for\nlow-resource languages.\n","authors":["Shubham Mittal","Megha Sundriyal","Preslav Nakov"],"pdf_url":"https://arxiv.org/pdf/2310.18205v1.pdf","comment":"EMNLP 2023 (main)"},{"id":"http://arxiv.org/abs/2305.14215v2","updated":"2023-10-27T15:21:38Z","published":"2023-05-23T16:32:36Z","title":"Exploring Chain-of-Thought Style Prompting for Text-to-SQL","summary":" In-context learning with large language models (LLMs) has recently caught\nincreasing attention due to its superior few-shot performance on various tasks.\nHowever, its performance on text-to-SQL parsing still has much room for\nimprovement. In this paper, we hypothesize that a crucial aspect of LLMs to\nimprove for text-to-SQL parsing is their multi-step reasoning ability. Thus, we\nsystematically study how to enhance LLMs' reasoning ability through chain of\nthought (CoT) style prompting, including the original chain-of-thought\nprompting (Wei et al., 2022b) and least-to-most prompting (Zhou et al., 2023).\nOur experiments demonstrate that iterative prompting as in Zhou et al. (2023)\nmay be unnecessary for text-to-SQL parsing, and using detailed reasoning steps\ntends to have more error propagation issues. Based on these findings, we\npropose a new CoT-style prompting method for text-to-SQL parsing. 
It brings 5.2\nand 6.5 point absolute gains on the Spider development set and the Spider\nRealistic set, respectively, compared to the standard prompting method without\nreasoning steps; 2.4 and 1.5 point absolute gains, compared to the\nleast-to-most prompting method.\n","authors":["Chang-You Tai","Ziru Chen","Tianshu Zhang","Xiang Deng","Huan Sun"],"pdf_url":"https://arxiv.org/pdf/2305.14215v2.pdf","comment":"EMNLP 2023 main; long paper"},{"id":"http://arxiv.org/abs/2309.06520v2","updated":"2023-10-27T14:42:29Z","published":"2023-09-12T18:51:10Z","title":"Minimum Bayes' Risk Decoding for System Combination of Grammatical Error\n Correction Systems","summary":" For sequence-to-sequence tasks it is challenging to combine individual system\noutputs. Further, there is also often a mismatch between the decoding criterion\nand the one used for assessment. Minimum Bayes' Risk (MBR) decoding can be used\nto combine system outputs in a manner that encourages better alignment with the\nfinal assessment criterion. This paper examines MBR decoding for Grammatical\nError Correction (GEC) systems, where performance is usually evaluated in terms\nof edits and an associated F-score. Hence, we propose a novel MBR loss function\ndirectly linked to this form of criterion. Furthermore, an approach to expand\nthe possible set of candidate sentences is described. This builds on a current\nmax-voting combination scheme, as well as individual edit-level selection.\nExperiments on three popular GEC datasets and with state-of-the-art GEC systems\ndemonstrate the efficacy of the proposed MBR approach. 
Additionally, the paper\nhighlights how varying reward metrics within the MBR decoding framework can\nprovide control over precision, recall, and the F-score in combined GEC\nsystems.\n","authors":["Vyas Raina","Mark Gales"],"pdf_url":"https://arxiv.org/pdf/2309.06520v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18169v1","updated":"2023-10-27T14:28:41Z","published":"2023-10-27T14:28:41Z","title":"Style Description based Text-to-Speech with Conditional Prosodic Layer\n Normalization based Diffusion GAN","summary":" In this paper, we present a Diffusion GAN based approach (Prosodic Diff-TTS)\nto generate the corresponding high-fidelity speech based on the style\ndescription and content text as an input to generate speech samples within only\n4 denoising steps. It leverages the novel conditional prosodic layer\nnormalization to incorporate the style embeddings into the multi head attention\nbased phoneme encoder and mel spectrogram decoder based generator architecture\nto generate the speech. The style embedding is generated by fine tuning the\npretrained BERT model on auxiliary tasks such as pitch, speaking speed,\nemotion, and gender classification. We demonstrate the efficacy of our proposed\narchitecture on multi-speaker LibriTTS and PromptSpeech datasets, using\nmultiple quantitative metrics that measure generated accuracy and MOS.\n","authors":["Neeraj Kumar","Ankur Narang","Brejesh Lall"],"pdf_url":"https://arxiv.org/pdf/2310.18169v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18168v1","updated":"2023-10-27T14:27:43Z","published":"2023-10-27T14:27:43Z","title":"Personas as a Way to Model Truthfulness in Language Models","summary":" Large Language Models are trained on vast amounts of text from the internet,\nwhich contains both factual and misleading information about the world. 
Can\nlanguage models discern truth from falsehood in this contradicting data?\nExpanding on the view that LLMs can model different agents producing the\ncorpora, we hypothesize that they can cluster truthful text by modeling a\ntruthful persona: a group of agents that are likely to produce truthful text\nand share similar features. For example, trustworthy sources like Wikipedia and\nScience usually use formal writing styles and make consistent claims. By\nmodeling this persona, LLMs can generalize truthfulness beyond the specific\ncontexts in which each agent generated the training text. For example, the\nmodel can infer that the agent \"Wikipedia\" will behave truthfully on topics\nthat were only generated by \"Science\" because they share a persona. We first\nshow evidence for the persona hypothesis via two observations: (1) we can probe\nwhether a model's answer will be truthful before it is generated; (2)\nfinetuning a model on a set of facts improves its truthfulness on unseen\ntopics. Next, using arithmetics as a synthetic environment, we show that\nlanguage models can separate true and false statements, and generalize\ntruthfulness across agents; but only if agents in the training data share a\ntruthful generative process that enables the creation of a truthful persona.\nOverall, our findings suggest that models can exploit hierarchical structures\nin the data to learn abstract concepts like truthfulness.\n","authors":["Nitish Joishi","Javier Rando","Abulhair Saparov","Najoung Kim","He He"],"pdf_url":"https://arxiv.org/pdf/2310.18168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10088v2","updated":"2023-10-27T14:24:31Z","published":"2023-07-19T15:57:24Z","title":"Android in the Wild: A Large-Scale Dataset for Android Device Control","summary":" There is a growing interest in device-control systems that can interpret\nhuman natural language instructions and execute them on a digital device by\ndirectly controlling its user interface. 
We present a dataset for\ndevice-control research, Android in the Wild (AITW), which is orders of\nmagnitude larger than current datasets. The dataset contains human\ndemonstrations of device interactions, including the screens and actions, and\ncorresponding natural language instructions. It consists of 715k episodes\nspanning 30k unique instructions, four versions of Android (v10-13), and eight\ndevice types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It\ncontains multi-step tasks that require semantic understanding of language and\nvisual context. This dataset poses a new challenge: actions available through\nthe user interface must be inferred from their visual appearance. And, instead\nof simple UI element-based actions, the action space consists of precise\ngestures (e.g., horizontal scrolls to operate carousel widgets). We organize\nour dataset to encourage robustness analysis of device-control systems, i.e.,\nhow well a system performs in the presence of new task descriptions, new\napplications, or new platform versions. We develop two agents and report\nperformance across the dataset. The dataset is available at\nhttps://github.com/google-research/google-research/tree/master/android_in_the_wild.\n","authors":["Christopher Rawles","Alice Li","Daniel Rodriguez","Oriana Riva","Timothy Lillicrap"],"pdf_url":"https://arxiv.org/pdf/2307.10088v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18167v1","updated":"2023-10-27T14:24:06Z","published":"2023-10-27T14:24:06Z","title":"MPrompt: Exploring Multi-level Prompt Tuning for Machine Reading\n Comprehension","summary":" Large language models have achieved superior performance on various\nnatural language tasks. One major drawback of such approaches is that they are\nresource-intensive to fine-tune on new datasets. Soft-prompt tuning presents a\nresource-efficient solution to fine-tune the pre-trained language models (PLMs)\nwhile keeping their weights frozen. 
Existing soft prompt methods mainly focus on\ndesigning the input-independent prompts that steer the model to fit the domain\nof the new dataset. Those methods often ignore the fine-grained information\nabout the task and context of the text. In this paper, we propose a multi-level\nprompt tuning (MPrompt) method for machine reading comprehension. It utilizes\nprompts at task-specific, domain-specific, and context-specific levels to\nenhance the comprehension of input semantics at different granularities. We\nalso propose an independence constraint to steer each domain-specific prompt to\nfocus on information within its domain to avoid redundancy. Moreover, we\npresent a prompt generator that incorporates context-related knowledge in the\nprompt generation to enhance contextual relevancy. We conducted extensive\nexperiments on 12 benchmarks of various QA formats and achieved an average\nimprovement of 1.94\\% over the state-of-the-art methods.\n","authors":["Guoxin Chen","Yiming Qian","Bowen Wang","Liangzhi Li"],"pdf_url":"https://arxiv.org/pdf/2310.18167v1.pdf","comment":"13 pages, 5 figures, accepted by EMNLP2023-Findings"},{"id":"http://arxiv.org/abs/2310.14735v2","updated":"2023-10-27T14:22:43Z","published":"2023-10-23T09:15:18Z","title":"Unleashing the potential of prompt engineering in Large Language Models:\n a comprehensive review","summary":" This paper delves into the pivotal role of prompt engineering in unleashing\nthe capabilities of Large Language Models (LLMs). Prompt engineering is the\nprocess of structuring input text for LLMs and is a technique integral to\noptimizing the efficacy of LLMs. This survey elucidates foundational principles\nof prompt engineering, such as role-prompting, one-shot, and few-shot\nprompting, as well as more advanced methodologies such as the chain-of-thought\nand tree-of-thoughts prompting. 
The paper sheds light on how external\nassistance in the form of plugins can assist in this task, and reduce machine\nhallucination by retrieving external knowledge. We subsequently delineate\nprospective directions in prompt engineering research, emphasizing the need for\na deeper understanding of structures and the role of agents in Artificial\nIntelligence-Generated Content (AIGC) tools. We discuss how to assess the\nefficacy of prompt methods from different perspectives and using different\nmethods. Finally, we gather information about the application of prompt\nengineering in such fields as education and programming, showing its\ntransformative potential. This comprehensive survey aims to serve as a friendly\nguide for anyone venturing through the big world of LLMs and prompt\nengineering.\n","authors":["Banghao Chen","Zhaofeng Zhang","Nicolas Langrené","Shengxin Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.14735v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18155v1","updated":"2023-10-27T14:03:30Z","published":"2023-10-27T14:03:30Z","title":"Elevating Code-mixed Text Handling through Auditory Information of Words","summary":" With the growing popularity of code-mixed data, there is an increasing need\nfor better handling of this type of data, which poses a number of challenges,\nsuch as dealing with spelling variations, multiple languages, different\nscripts, and a lack of resources. Current language models face difficulty in\neffectively handling code-mixed data as they primarily focus on the semantic\nrepresentation of words and ignore the auditory phonetic features. This leads\nto difficulties in handling spelling variations in code-mixed text. In this\npaper, we propose an effective approach for creating language models for\nhandling code-mixed textual data using auditory information of words from\nSOUNDEX. 
Our approach includes a pre-training step based on\nmasked-language-modelling, which includes SOUNDEX representations (SAMLM) and a\nnew method of providing input data to the pre-trained model. Through\nexperimentation on various code-mixed datasets (of different languages) for\nsentiment, offensive and aggression classification tasks, we establish that our\nnovel language modeling approach (SAMLM) results in improved robustness towards\nadversarial attacks on code-mixed classification tasks. Additionally, our SAMLM\nbased approach also results in better classification results over the popular\nbaselines for code-mixed tasks. We use the explainability technique, SHAP\n(SHapley Additive exPlanations) to explain how the auditory features\nincorporated through SAMLM assist the model to handle the code-mixed text\neffectively and increase robustness against adversarial attacks\n\\footnote{Source code has been made available on\n\\url{https://github.com/20118/DefenseWithPhonetics},\n\\url{https://www.iitp.ac.in/~ai-nlp-ml/resources.html\\#Phonetics}}.\n","authors":[" Mamta","Zishan Ahmad","Asif Ekbal"],"pdf_url":"https://arxiv.org/pdf/2310.18155v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17015v2","updated":"2023-10-27T14:02:43Z","published":"2023-10-25T21:29:36Z","title":"Data Augmentation for Emotion Detection in Small Imbalanced Text Data","summary":" Emotion recognition in text, the task of identifying emotions such as joy or\nanger, is a challenging problem in NLP with many applications. One of the\nchallenges is the shortage of available datasets that have been annotated with\nemotions. Certain existing datasets are small, follow different emotion\ntaxonomies and display imbalance in their emotion distribution. In this work,\nwe studied the impact of data augmentation techniques precisely when applied to\nsmall imbalanced datasets, for which current state-of-the-art models (such as\nRoBERTa) under-perform. 
Specifically, we utilized four data augmentation\nmethods (Easy Data Augmentation EDA, static and contextual Embedding-based, and\nProtAugment) on three datasets that come from different sources and vary in\nsize, emotion categories and distributions. Our experimental results show that\nusing the augmented data when training the classifier model leads to\nsignificant improvements. Finally, we conducted two case studies: a) directly\nusing the popular chat-GPT API to paraphrase text using different prompts, and\nb) using external data to augment the training set. Results show the promising\npotential of these methods.\n","authors":["Anna Koufakou","Diego Grisales","Ragy Costa de jesus","Oscar Fox"],"pdf_url":"https://arxiv.org/pdf/2310.17015v2.pdf","comment":"To be published in the Proceedings of IEEE International Conference\n on Machine Learning Applications IEEE (ICMLA 2023)"},{"id":"http://arxiv.org/abs/2310.18152v1","updated":"2023-10-27T14:00:04Z","published":"2023-10-27T14:00:04Z","title":"Disentangled Representation Learning with Large Language Models for\n Text-Attributed Graphs","summary":" Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs\nsuch as citation networks, e-commerce networks and social networks has\nattracted considerable attention in the web community. Recently, large language\nmodels (LLMs) have demonstrated exceptional capabilities across a wide range of\ntasks. However, the existing works focus on harnessing the potential of LLMs\nsolely relying on prompts to convey graph structure information to LLMs, thus\nsuffering from insufficient understanding of the complex structural\nrelationships within TAGs. To address this problem, in this paper we present\nthe Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the\nreasoning and predicting capabilities of LLMs for TAGs. 
Our proposed DGTL model\nincorporates graph structure information through tailored disentangled graph\nneural network (GNN) layers, enabling LLMs to capture the intricate\nrelationships hidden in text-attributed graphs from multiple structural\nfactors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing\ncomputational costs and allowing much more flexibility in combining with\ndifferent LLM models. Experimental evaluations demonstrate the effectiveness of\nthe proposed DGTL model on achieving superior or comparable performance over\nstate-of-the-art baselines. Additionally, we also demonstrate that our DGTL\nmodel can offer natural language explanations for predictions, thereby\nsignificantly enhancing model interpretability.\n","authors":["Yijian Qin","Xin Wang","Ziwei Zhang","Wenwu Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.18152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11672v2","updated":"2023-10-27T13:50:09Z","published":"2023-10-18T02:45:54Z","title":"Open-ended Commonsense Reasoning with Unrestricted Answer Scope","summary":" Open-ended Commonsense Reasoning is defined as solving a commonsense question\nwithout providing 1) a short list of answer candidates and 2) a pre-defined\nanswer scope. Conventional ways of formulating the commonsense question into a\nquestion-answering form or utilizing external knowledge to learn\nretrieval-based methods are less applicable in the open-ended setting due to an\ninherent challenge. Without pre-defining an answer scope or a few candidates,\nopen-ended commonsense reasoning entails predicting answers by searching over\nan extremely large searching space. Moreover, most questions require implicit\nmulti-hop reasoning, which presents even more challenges to our problem. In\nthis work, we leverage pre-trained language models to iteratively retrieve\nreasoning paths on the external knowledge base, which does not require\ntask-specific supervision. 
The reasoning paths can help to identify the most\nprecise answer to the commonsense question. We conduct experiments on two\ncommonsense benchmark datasets. Compared to other approaches, our proposed\nmethod achieves better performance both quantitatively and qualitatively.\n","authors":["Chen Ling","Xuchao Zhang","Xujiang Zhao","Yanchi Liu","Wei Cheng","Mika Oishi","Takao Osaki","Katsushi Matsuda","Haifeng Chen","Liang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.11672v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11069v4","updated":"2023-10-27T13:32:15Z","published":"2023-10-17T08:33:02Z","title":"VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System","summary":" Arabic is a complex language with many varieties and dialects spoken by over\n450 millions all around the world. Due to the linguistic diversity and\nvariations, it is challenging to build a robust and generalized ASR system for\nArabic. In this work, we address this gap by developing and demoing a system,\ndubbed VoxArabica, for dialect identification (DID) as well as automatic speech\nrecognition (ASR) of Arabic. We train a wide range of models such as HuBERT\n(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR\ntasks. Our DID models are trained to identify 17 different dialects in addition\nto MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.\nAdditionally, for the remaining dialects in ASR, we provide the option to\nchoose various models such as Whisper and MMS in a zero-shot setting. We\nintegrate these models into a single web interface with diverse features such\nas audio recording, file upload, model selection, and the option to raise flags\nfor incorrect outputs. Overall, we believe VoxArabica will be useful for a wide\nrange of audiences concerned with Arabic research. 
Our system is currently\nrunning at https://cdce-206-12-100-168.ngrok.io/.\n","authors":["Abdul Waheed","Bashar Talafha","Peter Sullivan","AbdelRahim Elmadany","Muhammad Abdul-Mageed"],"pdf_url":"https://arxiv.org/pdf/2310.11069v4.pdf","comment":"Accepted at ArabicNLP conference co-located with EMNLP'23. First\n three authors contributed equally"},{"id":"http://arxiv.org/abs/2310.18130v1","updated":"2023-10-27T13:23:02Z","published":"2023-10-27T13:23:02Z","title":"DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial\n Issues","summary":" Controversy is a reflection of our zeitgeist, and an important aspect of any\ndiscourse. The rise of large language models (LLMs) as conversational systems\nhas increased public reliance on these systems for answers to their various\nquestions. Consequently, it is crucial to systematically examine how these\nmodels respond to questions pertaining to ongoing debates. However, few\nsuch datasets exist that provide human-annotated labels reflecting\ncontemporary discussions. To foster research in this area, we propose a novel\nconstruction of a controversial questions dataset, expanding upon the publicly\nreleased Quora Question Pairs Dataset. This dataset presents challenges\nconcerning knowledge recency, safety, fairness, and bias. We evaluate different\nLLMs using a subset of this dataset, illuminating how they handle controversial\nissues and the stances they adopt. This research ultimately contributes to our\nunderstanding of LLMs' interaction with controversial issues, paving the way\nfor improvements in their comprehension and handling of complex societal\ndebates.\n","authors":["David Q. Sun","Artem Abzaliev","Hadas Kotek","Zidi Xiu","Christopher Klein","Jason D. 
Williams"],"pdf_url":"https://arxiv.org/pdf/2310.18130v1.pdf","comment":"Accepted to EMNLP Industry Track 2023"},{"id":"http://arxiv.org/abs/2310.18127v1","updated":"2023-10-27T13:19:19Z","published":"2023-10-27T13:19:19Z","title":"Ask more, know better: Reinforce-Learned Prompt Questions for Decision\n Making with Large Language Models","summary":" Large language models (LLMs) demonstrate their promise in tackling\ncomplicated practical challenges by combining action-based policies with chain\nof thought (CoT) reasoning. Having high-quality prompts on hand, however, is\nvital to the framework's effectiveness. Currently, these prompts are\nhandcrafted utilizing extensive human labor, resulting in CoT policies that\nfrequently fail to generalize. Human intervention is also required in order to\ndevelop grounding functions that ensure low-level controllers appropriately\nprocess CoT reasoning. In this paper, we take the first step towards a fully\nintegrated end-to-end framework for task-solving in real settings employing\ncomplicated reasoning. To that purpose, we offer a new leader-follower bilevel\nframework capable of learning to ask relevant questions (prompts) and\nsubsequently undertaking reasoning to guide the learning of actions to be\nperformed in an environment. A good prompt should make introspective revisions\nbased on historical findings, leading the CoT to consider the anticipated\ngoals. A prompt-generator policy has its own aim in our system, allowing it to\nadapt to the action policy and automatically root the CoT process towards\noutputs that lead to decisive, high-performing actions. Meanwhile, the action\npolicy is learning how to use the CoT outputs to take specific actions. 
Our\nempirical data reveal that our system outperforms leading methods in agent\nlearning benchmarks such as Overcooked and FourRoom.\n","authors":["Xue Yan","Yan Song","Xinyu Cui","Filippos Christianos","Haifeng Zhang","David Henry Mguni","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18122v1","updated":"2023-10-27T13:09:54Z","published":"2023-10-27T13:09:54Z","title":"OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization","summary":" Opinion summarization sets itself apart from other types of summarization\ntasks due to its distinctive focus on aspects and sentiments. Although certain\nautomated evaluation methods like ROUGE have gained popularity, we have found\nthem to be unreliable measures for assessing the quality of opinion summaries.\nIn this paper, we present OpinSummEval, a dataset comprising human judgments\nand outputs from 14 opinion summarization models. We further explore the\ncorrelation between 24 automatic metrics and human ratings across four\ndimensions. Our findings indicate that metrics based on neural networks\ngenerally outperform non-neural ones. However, even metrics built on powerful\nbackbones, such as BART and GPT-3/3.5, do not consistently correlate well\nacross all dimensions, highlighting the need for advancements in automated\nevaluation methods for opinion summarization. 
The code and data are publicly\navailable at https://github.com/A-Chicharito-S/OpinSummEval/tree/main.\n","authors":["Yuchen Shen","Xiaojun Wan"],"pdf_url":"https://arxiv.org/pdf/2310.18122v1.pdf","comment":"preprint, 19 pages, 4 figures, 10 tables"},{"id":"http://arxiv.org/abs/2310.18119v1","updated":"2023-10-27T13:06:24Z","published":"2023-10-27T13:06:24Z","title":"Towards a Unified Conversational Recommendation System: Multi-task\n Learning via Contextualized Knowledge Distillation","summary":" In Conversational Recommendation System (CRS), an agent is asked to recommend\na set of items to users within natural language conversations. To address the\nneed for both conversational capability and personalized recommendations, prior\nworks have utilized separate recommendation and dialogue modules. However, such\napproach inevitably results in a discrepancy between recommendation results and\ngenerated responses. To bridge the gap, we propose a multi-task learning for a\nunified CRS, where a single model jointly learns both tasks via Contextualized\nKnowledge Distillation (ConKD). We introduce two versions of ConKD: hard gate\nand soft gate. The former selectively gates between two task-specific teachers,\nwhile the latter integrates knowledge from both teachers. Our gates are\ncomputed on-the-fly in a context-specific manner, facilitating flexible\nintegration of relevant knowledge. 
Extensive experiments demonstrate that our\nsingle model significantly improves recommendation performance while enhancing\nfluency, and achieves comparable results in terms of diversity.\n","authors":["Yeongseo Jung","Eunseo Jung","Lei Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18119v1.pdf","comment":"EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.18098v1","updated":"2023-10-27T12:33:40Z","published":"2023-10-27T12:33:40Z","title":"Mind the Gap: Automated Corpus Creation for Enthymeme Detection and\n Reconstruction in Learner Arguments","summary":" Writing strong arguments can be challenging for learners. It requires to\nselect and arrange multiple argumentative discourse units (ADUs) in a logical\nand coherent way as well as to decide which ADUs to leave implicit, so called\nenthymemes. However, when important ADUs are missing, readers might not be able\nto follow the reasoning or understand the argument's main point. This paper\nintroduces two new tasks for learner arguments: to identify gaps in arguments\n(enthymeme detection) and to fill such gaps (enthymeme reconstruction).\nApproaches to both tasks may help learners improve their argument quality. We\nstudy how corpora for these tasks can be created automatically by deleting ADUs\nfrom an argumentative text that are central to the argument and its quality,\nwhile maintaining the text's naturalness. Based on the ICLEv3 corpus of\nargumentative learner essays, we create 40,089 argument instances for enthymeme\ndetection and reconstruction. Through manual studies, we provide evidence that\nthe proposed corpus creation process leads to the desired quality reduction,\nand results in arguments that are similarly natural to those written by\nlearners. 
Finally, first baseline approaches to enthymeme detection and\nreconstruction demonstrate the corpus' usefulness.\n","authors":["Maja Stahl","Nick Düsterhus","Mei-Hua Chen","Henning Wachsmuth"],"pdf_url":"https://arxiv.org/pdf/2310.18098v1.pdf","comment":"Accepted to Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18089v1","updated":"2023-10-27T12:21:55Z","published":"2023-10-27T12:21:55Z","title":"Lost in Translation -- Multilingual Misinformation and its Evolution","summary":" Misinformation and disinformation are growing threats in the digital age,\nspreading rapidly across languages and borders. This paper investigates the\nprevalence and dynamics of multilingual misinformation through an analysis of\nover 250,000 unique fact-checks spanning 95 languages. First, we find that\nwhile the majority of misinformation claims are only fact-checked once, 11.7%,\ncorresponding to more than 21,000 claims, are checked multiple times. Using\nfact-checks as a proxy for the spread of misinformation, we find 33% of\nrepeated claims cross linguistic boundaries, suggesting that some\nmisinformation permeates language barriers. However, spreading patterns exhibit\nstrong homophily, with misinformation more likely to spread within the same\nlanguage. To study the evolution of claims over time and mutations across\nlanguages, we represent fact-checks with multilingual sentence embeddings and\ncluster semantically similar claims. We analyze the connected components and\nshortest paths connecting different versions of a claim finding that claims\ngradually drift over time and undergo greater alteration when traversing\nlanguages. Overall, this novel investigation of multilingual misinformation\nprovides key insights. It quantifies redundant fact-checking efforts,\nestablishes that some claims diffuse across languages, measures linguistic\nhomophily, and models the temporal and cross-lingual evolution of claims. 
The\nfindings advocate for expanded information sharing between fact-checkers\nglobally while underscoring the importance of localized verification.\n","authors":["Dorian Quelle","Calvin Cheng","Alexandre Bovet","Scott A. Hale"],"pdf_url":"https://arxiv.org/pdf/2310.18089v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.19068v2","updated":"2023-10-27T12:16:14Z","published":"2023-05-30T14:29:24Z","title":"Complex Query Answering on Eventuality Knowledge Graph with Implicit\n Logical Constraints","summary":" Querying knowledge graphs (KGs) using deep learning approaches can naturally\nleverage the reasoning and generalization ability to learn to infer better\nanswers. Traditional neural complex query answering (CQA) approaches mostly\nwork on entity-centric KGs. However, in the real world, we also need to make\nlogical inferences about events, states, and activities (i.e., eventualities or\nsituations) to push learning systems from System I to System II, as proposed by\nYoshua Bengio. Querying logically from an EVentuality-centric KG (EVKG) can\nnaturally provide references to such kind of intuitive and logical inference.\nThus, in this paper, we propose a new framework to leverage neural methods to\nanswer complex logical queries based on an EVKG, which can satisfy not only\ntraditional first-order logic constraints but also implicit logical constraints\nover eventualities concerning their occurrences and orders. For instance, if we\nknow that \"Food is bad\" happens before \"PersonX adds soy sauce\", then \"PersonX\nadds soy sauce\" is unlikely to be the cause of \"Food is bad\" due to implicit\ntemporal constraint. To facilitate consistent reasoning on EVKGs, we propose\nComplex Eventuality Query Answering (CEQA), a more rigorous definition of CQA\nthat considers the implicit logical constraints governing the temporal order\nand occurrence of eventualities. 
In this manner, we propose to leverage theorem\nprovers for constructing benchmark datasets to ensure the answers satisfy\nimplicit logical constraints. We also propose a Memory-Enhanced Query Encoding\n(MEQE) approach to significantly improve the performance of state-of-the-art\nneural query encoders on the CEQA task.\n","authors":["Jiaxin Bai","Xin Liu","Weiqi Wang","Chen Luo","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2305.19068v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17526v2","updated":"2023-10-27T12:14:27Z","published":"2023-10-26T16:18:30Z","title":"Can large language models replace humans in the systematic review\n process? Evaluating GPT-4's efficacy in screening and extracting data from\n peer-reviewed and grey literature in multiple languages","summary":" Systematic reviews are vital for guiding practice, research, and policy, yet\nthey are often slow and labour-intensive. Large language models (LLMs) could\noffer a way to speed up and automate systematic reviews, but their performance\nin such tasks has not been comprehensively evaluated against humans, and no\nstudy has tested GPT-4, the biggest LLM so far. This pre-registered study\nevaluates GPT-4's capability in title/abstract screening, full-text review, and\ndata extraction across various literature types and languages using a\n'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human\nperformance in most tasks, results were skewed by chance agreement and dataset\nimbalance. After adjusting for these, there was a moderate level of performance\nfor data extraction, and - barring studies that used highly reliable prompts -\nscreening performance levelled at none to moderate for different stages and\nlanguages. When screening full-text literature using highly reliable prompts,\nGPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key\nstudies using highly reliable prompts improved its performance even more. 
Our\nfindings indicate that, currently, substantial caution should be used if LLMs\nare being used to conduct systematic reviews, but suggest that, for certain\nsystematic review tasks delivered under reliable prompts, LLMs can rival human\nperformance.\n","authors":["Qusai Khraisha","Sophie Put","Johanna Kappenberg","Azza Warraitch","Kristin Hadfield"],"pdf_url":"https://arxiv.org/pdf/2310.17526v2.pdf","comment":"9 pages, 2 figures, 1 table"},{"id":"http://arxiv.org/abs/2304.09542v2","updated":"2023-10-27T12:11:16Z","published":"2023-04-19T10:16:03Z","title":"Is ChatGPT Good at Search? Investigating Large Language Models as\n Re-Ranking Agents","summary":" Large Language Models (LLMs) have demonstrated remarkable zero-shot\ngeneralization across various language-related tasks, including search engines.\nHowever, existing work utilizes the generative ability of LLMs for Information\nRetrieval (IR) rather than direct passage ranking. The discrepancy between the\npre-training objectives of LLMs and the ranking objective poses another\nchallenge. In this paper, we first investigate generative LLMs such as ChatGPT\nand GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal\nthat properly instructed LLMs can deliver competitive, even superior results to\nstate-of-the-art supervised methods on popular IR benchmarks. Furthermore, to\naddress concerns about data contamination of LLMs, we collect a new test set\ncalled NovelEval, based on the latest knowledge and aiming to verify the\nmodel's ability to rank unknown knowledge. Finally, to improve efficiency in\nreal-world applications, we delve into the potential for distilling the ranking\ncapabilities of ChatGPT into small specialized models using a permutation\ndistillation scheme. Our evaluation results turn out that a distilled 440M\nmodel outperforms a 3B supervised model on the BEIR benchmark. 
The code to\nreproduce our results is available at www.github.com/sunnweiwei/RankGPT.\n","authors":["Weiwei Sun","Lingyong Yan","Xinyu Ma","Shuaiqiang Wang","Pengjie Ren","Zhumin Chen","Dawei Yin","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2304.09542v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2309.06364v2","updated":"2023-10-27T12:10:54Z","published":"2023-09-06T15:00:44Z","title":"Framework-Based Qualitative Analysis of Free Responses of Large Language\n Models: Algorithmic Fidelity","summary":" Today, using Large-scale generative Language Models (LLMs) it is possible to\nsimulate free responses to interview questions like those traditionally\nanalyzed using qualitative research methods. Qualitative methodology\nencompasses a broad family of techniques involving manual analysis of\nopen-ended interviews or conversations conducted freely in natural language.\nHere we consider whether artificial \"silicon participants\" generated by LLMs\nmay be productively studied using qualitative methods aiming to produce\ninsights that could generalize to real human populations. The key concept in\nour analysis is algorithmic fidelity, a term introduced by Argyle et al. (2023)\ncapturing the degree to which LLM-generated outputs mirror human\nsub-populations' beliefs and attitudes. By definition, high algorithmic\nfidelity suggests latent beliefs elicited from LLMs may generalize to real\nhumans, whereas low algorithmic fidelity renders such research invalid. Here we\nused an LLM to generate interviews with silicon participants matching specific\ndemographic characteristics one-for-one with a set of human participants. Using\nframework-based qualitative analysis, we showed the key themes obtained from\nboth human and silicon participants were strikingly similar. However, when we\nanalyzed the structure and tone of the interviews we found even more striking\ndifferences. We also found evidence of the hyper-accuracy distortion described\nby Aher et al. 
(2023). We conclude that the LLM we tested (GPT-3.5) does not\nhave sufficient algorithmic fidelity to expect research on it to generalize to\nhuman populations. However, the rapid pace of LLM research makes it plausible\nthis could change in the future. Thus we stress the need to establish epistemic\nnorms now around how to assess validity of LLM-based qualitative research,\nespecially concerning the need to ensure representation of heterogeneous lived\nexperiences.\n","authors":["Aliya Amirova","Theodora Fteropoulli","Nafiso Ahmed","Martin R. Cowie","Joel Z. Leibo"],"pdf_url":"https://arxiv.org/pdf/2309.06364v2.pdf","comment":"46 pages, 5 tables, 5 figures"},{"id":"http://arxiv.org/abs/2310.18077v1","updated":"2023-10-27T11:45:16Z","published":"2023-10-27T11:45:16Z","title":"Detrimental Contexts in Open-Domain Question Answering","summary":" For knowledge intensive NLP tasks, it has been widely accepted that accessing\nmore information is a contributing factor to improvements in the model's\nend-to-end performance. However, counter-intuitively, too much context can have\na negative impact on the model when evaluated on common question answering (QA)\ndatasets. In this paper, we analyze how passages can have a detrimental effect\non retrieve-then-read architectures used in question answering. Our empirical\nevidence indicates that the current read architecture does not fully leverage\nthe retrieved passages and significantly degrades its performance when using\nthe whole passages compared to utilizing subsets of them. Our findings\ndemonstrate that model accuracy can be improved by 10% on two popular QA\ndatasets by filtering out detrimental passages. Additionally, these outcomes\nare attained by utilizing existing retrieval methods without further training\nor data. We further highlight the challenges associated with identifying the\ndetrimental passages. 
First, even with the correct context, the model can make\nan incorrect prediction, posing a challenge in determining which passages are\nmost influential. Second, evaluation typically considers lexical matching,\nwhich is not robust to variations of correct answers. Despite these\nlimitations, our experimental results underscore the pivotal role of\nidentifying and removing these detrimental passages for the context-efficient\nretrieve-then-read pipeline. Code and data are available at\nhttps://github.com/xfactlab/emnlp2023-damaging-retrieval\n","authors":["Philhoon Oh","James Thorne"],"pdf_url":"https://arxiv.org/pdf/2310.18077v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18076v1","updated":"2023-10-27T11:44:06Z","published":"2023-10-27T11:44:06Z","title":"Knowledge Corpus Error in Question Answering","summary":" Recent works in open-domain question answering (QA) have explored generating\ncontext passages from large language models (LLMs), replacing the traditional\nretrieval step in the QA pipeline. However, it is not well understood why\ngenerated passages can be more effective than retrieved ones. This study\nrevisits the conventional formulation of QA and introduces the concept of\nknowledge corpus error. This error arises when the knowledge corpus used for\nretrieval is only a subset of the entire string space, potentially excluding\nmore helpful passages that exist outside the corpus. LLMs may mitigate this\nshortcoming by generating passages in a larger space. We come up with an\nexperiment of paraphrasing human-annotated gold context using LLMs to observe\nknowledge corpus error empirically. Our results across three QA benchmarks\nreveal an increased performance (10% - 13%) when using paraphrased passage,\nindicating a signal for the existence of knowledge corpus error. 
Our code is\navailable at https://github.com/xfactlab/emnlp2023-knowledge-corpus-error\n","authors":["Yejoon Lee","Philhoon Oh","James Thorne"],"pdf_url":"https://arxiv.org/pdf/2310.18076v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18075v1","updated":"2023-10-27T11:43:46Z","published":"2023-10-27T11:43:46Z","title":"DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking","summary":" Inspired by the dual-process theory of human cognition, we introduce DUMA, a\nnovel conversational agent framework that embodies a dual-mind mechanism\nthrough the utilization of two generative Large Language Models (LLMs)\ndedicated to fast and slow thinking respectively. The fast thinking model\nserves as the primary interface for external interactions and initial response\ngeneration, evaluating the necessity for engaging the slow thinking model based\non the complexity of the complete response. When invoked, the slow thinking\nmodel takes over the conversation, engaging in meticulous planning, reasoning,\nand tool utilization to provide a well-analyzed response. This dual-mind\nconfiguration allows for a seamless transition between intuitive responses and\ndeliberate problem-solving processes based on the situation. We have\nconstructed a conversational agent to handle online inquiries in the real\nestate industry. The experiment proves that our method balances effectiveness\nand efficiency, and has a significant improvement compared to the baseline.\n","authors":["Xiaoyu Tian","Liangyu Chen","Na Liu","Yaxuan Liu","Wei Zou","Kaijiang Chen","Ming Cui"],"pdf_url":"https://arxiv.org/pdf/2310.18075v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18073v1","updated":"2023-10-27T11:40:32Z","published":"2023-10-27T11:40:32Z","title":"A Scalable Framework for Table of Contents Extraction from Complex ESG\n Annual Reports","summary":" Table of contents (ToC) extraction centres on structuring documents in a\nhierarchical manner. 
In this paper, we propose a new dataset, ESGDoc,\ncomprising 1,093 ESG annual reports from 563 companies spanning from 2001 to\n2022. These reports pose significant challenges due to their diverse structures\nand extensive length. To address these challenges, we propose a new framework\nfor Toc extraction, consisting of three steps: (1) Constructing an initial tree\nof text blocks based on reading order and font sizes; (2) Modelling each tree\nnode (or text block) independently by considering its contextual information\ncaptured in node-centric subtree; (3) Modifying the original tree by taking\nappropriate action on each tree node (Keep, Delete, or Move). This\nconstruction-modelling-modification (CMM) process offers several benefits. It\neliminates the need for pairwise modelling of section headings as in previous\napproaches, making document segmentation practically feasible. By incorporating\nstructured information, each section heading can leverage both local and\nlong-distance context relevant to itself. Experimental results show that our\napproach outperforms the previous state-of-the-art baseline with a fraction of\nrunning time. Our framework proves its scalability by effectively handling\ndocuments of any length.\n","authors":["Xinyu Wang","Lin Gui","Yulan He"],"pdf_url":"https://arxiv.org/pdf/2310.18073v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18070v1","updated":"2023-10-27T11:36:18Z","published":"2023-10-27T11:36:18Z","title":"Multi-grained Evidence Inference for Multi-choice Reading Comprehension","summary":" Multi-choice Machine Reading Comprehension (MRC) is a major and challenging\ntask for machines to answer questions according to provided options. Answers in\nmulti-choice MRC cannot be directly extracted in the given passages, and\nessentially require machines capable of reasoning from accurate extracted\nevidence. 
However, the critical evidence may be as simple as just one word or\nphrase, while it is hidden in the given redundant, noisy passage with multiple\nlinguistic hierarchies from phrase, fragment, sentence until the entire\npassage. We thus propose a novel general-purpose model enhancement which\nintegrates multi-grained evidence comprehensively, named Multi-grained evidence\ninferencer (Mugen), to make up for the inability. Mugen extracts three\ndifferent granularities of evidence: coarse-, middle- and fine-grained\nevidence, and integrates evidence with the original passages, achieving\nsignificant and consistent performance improvement on four multi-choice MRC\nbenchmarks.\n","authors":["Yilin Zhao","Hai Zhao","Sufeng Duan"],"pdf_url":"https://arxiv.org/pdf/2310.18070v1.pdf","comment":"Accepted by TASLP 2023, vol. 31, pp. 3896-3907"},{"id":"http://arxiv.org/abs/2304.02247v2","updated":"2023-10-27T11:35:04Z","published":"2023-04-05T06:35:41Z","title":"Disentangling Structure and Style: Political Bias Detection in News by\n Inducing Document Hierarchy","summary":" We address an important gap in detecting political bias in news articles.\nPrevious works that perform document classification can be influenced by the\nwriting style of each news outlet, leading to overfitting and limited\ngeneralizability. Our approach overcomes this limitation by considering both\nthe sentence-level semantics and the document-level rhetorical structure,\nresulting in a more robust and style-agnostic approach to detecting political\nbias in news articles. We introduce a novel multi-head hierarchical attention\nmodel that effectively encodes the structure of long documents through a\ndiverse ensemble of attention heads. While journalism follows a formalized\nrhetorical structure, the writing style may vary by news outlet. We demonstrate\nthat our method overcomes this domain dependency and outperforms previous\napproaches for robustness and accuracy. 
Further analysis and human evaluation\ndemonstrate the ability of our model to capture common discourse structures in\njournalism. Our code is available at:\nhttps://github.com/xfactlab/emnlp2023-Document-Hierarchy\n","authors":["Jiwoo Hong","Yejin Cho","Jaemin Jung","Jiyoung Han","James Thorne"],"pdf_url":"https://arxiv.org/pdf/2304.02247v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18063v1","updated":"2023-10-27T11:26:27Z","published":"2023-10-27T11:26:27Z","title":"\"Honey, Tell Me What's Wrong\", Global Explanation of Textual\n Discriminative Models through Cooperative Generation","summary":" The ubiquity of complex machine learning has raised the importance of\nmodel-agnostic explanation algorithms. These methods create artificial\ninstances by slightly perturbing real instances, capturing shifts in model\ndecisions. However, such methods rely on initial data and only provide\nexplanations of the decision for these. To tackle these problems, we propose\nTherapy, the first global and model-agnostic explanation method adapted to text\nwhich requires no input dataset. Therapy generates texts following the\ndistribution learned by a classifier through cooperative generation. Because it\ndoes not rely on initial samples, it allows to generate explanations even when\ndata is absent (e.g., for confidentiality reasons). Moreover, conversely to\nexisting methods that combine multiple local explanations into a global one,\nTherapy offers a global overview of the model behavior on the input space. 
Our\nexperiments show that although using no input data to generate samples, Therapy\nprovides insightful information about features used by the classifier that is\ncompetitive with the ones from methods relying on input samples and outperforms\nthem when input samples are not specific to the studied model.\n","authors":["Antoine Chaffin","Julien Delaunay"],"pdf_url":"https://arxiv.org/pdf/2310.18063v1.pdf","comment":"8 pages plus references and 2 pages of appendices. 7 figures and 2\n tables"},{"id":"http://arxiv.org/abs/2305.13788v2","updated":"2023-10-27T11:25:00Z","published":"2023-05-23T07:55:34Z","title":"Can Large Language Models Capture Dissenting Human Voices?","summary":" Large language models (LLMs) have shown impressive achievements in solving a\nbroad range of tasks. Augmented by instruction fine-tuning, LLMs have also been\nshown to generalize in zero-shot settings as well. However, whether LLMs\nclosely align with the human disagreement distribution has not been\nwell-studied, especially within the scope of natural language inference (NLI).\nIn this paper, we evaluate the performance and alignment of LLM distribution\nwith humans using two different techniques to estimate the multinomial\ndistribution: Monte Carlo Estimation (MCE) and Log Probability Estimation\n(LPE). As a result, we show LLMs exhibit limited ability in solving NLI tasks\nand simultaneously fail to capture human disagreement distribution. The\ninference and human alignment performances plunge even further on data samples\nwith high human disagreement levels, raising concerns about their natural\nlanguage understanding (NLU) ability and their representativeness to a larger\nhuman population. 
The source code for the experiments is available at\nhttps://github.com/xfactlab/emnlp2023-LLM-Disagreement\n","authors":["Noah Lee","Na Min An","James Thorne"],"pdf_url":"https://arxiv.org/pdf/2305.13788v2.pdf","comment":"To appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18046v1","updated":"2023-10-27T10:44:50Z","published":"2023-10-27T10:44:50Z","title":"ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model\n for Visual Question Answering in Vietnamese","summary":" In recent years, Visual Question Answering (VQA) has gained significant\nattention for its diverse applications, including intelligent car assistance,\naiding visually impaired individuals, and document image information retrieval\nusing natural language queries. VQA requires effective integration of\ninformation from questions and images to generate accurate answers. Neural\nmodels for VQA have made remarkable progress on large-scale datasets, with a\nprimary focus on resource-rich languages like English. To address this, we\nintroduce the ViCLEVR dataset, a pioneering collection for evaluating various\nvisual reasoning capabilities in Vietnamese while mitigating biases. The\ndataset comprises over 26,000 images and 30,000 question-answer pairs (QAs),\neach question annotated to specify the type of reasoning involved. Leveraging\nthis dataset, we conduct a comprehensive analysis of contemporary visual\nreasoning systems, offering valuable insights into their strengths and\nlimitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion\nthat identifies objects in images based on questions. The architecture\neffectively employs transformers to enable simultaneous reasoning over textual\nand visual data, merging both modalities at an early model stage. The\nexperimental findings demonstrate that our proposed model achieves\nstate-of-the-art performance across four evaluation metrics. 
The accompanying\ncode and dataset have been made publicly accessible at\nhttps://github.com/kvt0012/ViCLEVR. This provision seeks to stimulate\nadvancements within the research community, fostering the development of more\nmultimodal fusion algorithms, specifically tailored to address the nuances of\nlow-resource languages, exemplified by Vietnamese.\n","authors":["Khiem Vinh Tran","Hao Phu Phan","Kiet Van Nguyen","Ngan Luu Thuy Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.18046v1.pdf","comment":"A pre-print version and submitted to journal"},{"id":"http://arxiv.org/abs/2310.18038v1","updated":"2023-10-27T10:36:54Z","published":"2023-10-27T10:36:54Z","title":"On General Language Understanding","summary":" Natural Language Processing prides itself to be an empirically-minded, if not\noutright empiricist field, and yet lately it seems to get itself into\nessentialist debates on issues of meaning and measurement (\"Do Large Language\nModels Understand Language, And If So, How Much?\"). This is not by accident:\nHere, as everywhere, the evidence underspecifies the understanding. As a\nremedy, this paper sketches the outlines of a model of understanding, which can\nground questions of the adequacy of current methods of measurement of model\nquality. 
The paper makes three claims: A) That different language use situation\ntypes have different characteristics, B) That language understanding is a\nmultifaceted phenomenon, bringing together individualistic and social\nprocesses, and C) That the choice of Understanding Indicator marks the limits\nof benchmarking, and the beginnings of considerations of the ethics of NLP use.\n","authors":["David Schlangen"],"pdf_url":"https://arxiv.org/pdf/2310.18038v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2306.09312v2","updated":"2023-10-27T10:34:20Z","published":"2023-06-15T17:47:31Z","title":"Semantic HELM: A Human-Readable Memory for Reinforcement Learning","summary":" Reinforcement learning agents deployed in the real world often have to cope\nwith partially observable environments. Therefore, most agents employ memory\nmechanisms to approximate the state of the environment. Recently, there have\nbeen impressive success stories in mastering partially observable environments,\nmostly in the realm of computer games like Dota 2, StarCraft II, or MineCraft.\nHowever, existing methods lack interpretability in the sense that it is not\ncomprehensible for humans what the agent stores in its memory. In this regard,\nwe propose a novel memory mechanism that represents past events in human\nlanguage. Our method uses CLIP to associate visual inputs with language tokens.\nThen we feed these tokens to a pretrained language model that serves the agent\nas memory and provides it with a coherent and human-readable representation of\nthe past. We train our memory mechanism on a set of partially observable\nenvironments and find that it excels on tasks that require a memory component,\nwhile mostly attaining performance on-par with strong baselines on tasks that\ndo not. On a challenging continuous recognition task, where memorizing the past\nis crucial, our memory mechanism converges two orders of magnitude faster than\nprior methods. 
Since our memory mechanism is human-readable, we can peek at an\nagent's memory and check whether crucial pieces of information have been\nstored. This significantly enhances troubleshooting and paves the way toward\nmore interpretable agents.\n","authors":["Fabian Paischer","Thomas Adler","Markus Hofmarcher","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2306.09312v2.pdf","comment":"To appear at NeurIPS 2023, 10 pages (+ references and appendix),\n Code: https://github.com/ml-jku/helm"},{"id":"http://arxiv.org/abs/2305.13455v2","updated":"2023-10-27T10:27:37Z","published":"2023-05-22T19:56:10Z","title":"Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as\n Conversational Agents","summary":" Recent work has proposed a methodology for the systematic evaluation of\n\"Situated Language Understanding Agents\"-agents that operate in rich linguistic\nand non-linguistic contexts-through testing them in carefully constructed\ninteractive settings. Other recent work has argued that Large Language Models\n(LLMs), if suitably set up, can be understood as (simulators of) such agents. A\nconnection suggests itself, which this paper explores: Can LLMs be evaluated\nmeaningfully by exposing them to constrained game-like settings that are built\nto challenge specific capabilities? As a proof of concept, this paper\ninvestigates five interaction settings, showing that current chat-optimised\nLLMs are, to an extent, capable to follow game-play instructions. Both this\ncapability and the quality of the game play, measured by how well the\nobjectives of the different games are met, follows the development cycle, with\nnewer models performing better. The metrics even for the comparatively simple\nexample games are far from being saturated, suggesting that the proposed\ninstrument will remain to have diagnostic value. 
Our general framework for\nimplementing and evaluating games with LLMs is available at\nhttps://github.com/clp-research/clembench.\n","authors":["Kranti Chalamalasetti","Jana Götze","Sherzod Hakimov","Brielen Madureira","Philipp Sadler","David Schlangen"],"pdf_url":"https://arxiv.org/pdf/2305.13455v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18025v1","updated":"2023-10-27T10:03:21Z","published":"2023-10-27T10:03:21Z","title":"Large language models for aspect-based sentiment analysis","summary":" Large language models (LLMs) offer unprecedented text completion\ncapabilities. As general models, they can fulfill a wide range of roles,\nincluding those of more specialized models. We assess the performance of GPT-4\nand GPT-3.5 in zero shot, few shot and fine-tuned settings on the aspect-based\nsentiment analysis (ABSA) task. Fine-tuned GPT-3.5 achieves a state-of-the-art\nF1 score of 83.8 on the joint aspect term extraction and polarity\nclassification task of the SemEval-2014 Task 4, improving upon InstructABSA\n[@scaria_instructabsa_2023] by 5.7%. However, this comes at the price of 1000\ntimes more model parameters and thus increased inference cost. We discuss the\ncost-performance trade-offs of different models, and analyze the typical\nerrors that they make. Our results also indicate that detailed prompts improve\nperformance in zero-shot and few-shot settings but are not necessary for\nfine-tuned models. This evidence is relevant for practitioners that are faced\nwith the choice of prompt engineering versus fine-tuning when using LLMs for\nABSA.\n","authors":["Paul F. 
Simmering","Paavo Huoviala"],"pdf_url":"https://arxiv.org/pdf/2310.18025v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18023v1","updated":"2023-10-27T09:59:24Z","published":"2023-10-27T09:59:24Z","title":"SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment\n Analysis","summary":" Code-mixing is a well-studied linguistic phenomenon when two or more\nlanguages are mixed in text or speech. Several datasets have been built with\nthe goal of training computational models for code-mixing. Although it is very\ncommon to observe code-mixing with multiple languages, most datasets available\ncontain code-mixing between only two languages. In this paper, we introduce\nSentMix-3L, a novel dataset for sentiment analysis containing code-mixed data\nbetween three languages Bangla, English, and Hindi. We carry out a\ncomprehensive evaluation using SentMix-3L. We show that zero-shot prompting\nwith GPT-3.5 outperforms all transformer-based models on SentMix-3L.\n","authors":["Md Nishat Raihan","Dhiman Goswami","Antara Mahmud","Antonios Anstasopoulos","Marcos Zampieri"],"pdf_url":"https://arxiv.org/pdf/2310.18023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18018v1","updated":"2023-10-27T09:48:29Z","published":"2023-10-27T09:48:29Z","title":"NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination\n for each Benchmark","summary":" In this position paper, we argue that the classical evaluation on Natural\nLanguage Processing (NLP) tasks using annotated benchmarks is in trouble. The\nworst kind of data contamination happens when a Large Language Model (LLM) is\ntrained on the test split of a benchmark, and then evaluated in the same\nbenchmark. The extent of the problem is unknown, as it is not straightforward\nto measure. Contamination causes an overestimation of the performance of a\ncontaminated model in a target benchmark and associated task with respect to\ntheir non-contaminated counterparts. 
The consequences can be very harmful, with\nwrong scientific conclusions being published while other correct ones are\ndiscarded. This position paper defines different levels of data contamination\nand argues for a community effort, including the development of automatic and\nsemi-automatic measures to detect when data from a benchmark was exposed to a\nmodel, and suggestions for flagging papers with conclusions that are\ncompromised by data contamination.\n","authors":["Oscar Sainz","Jon Ander Campos","Iker García-Ferrero","Julen Etxaniz","Oier Lopez de Lacalle","Eneko Agirre"],"pdf_url":"https://arxiv.org/pdf/2310.18018v1.pdf","comment":"Accepted at EMNLP2024-Findings"},{"id":"http://arxiv.org/abs/2310.17976v1","updated":"2023-10-27T08:42:18Z","published":"2023-10-27T08:42:18Z","title":"Does Role-Playing Chatbots Capture the Character Personalities?\n Assessing Personality Traits for Role-Playing Chatbots","summary":" The emergence of large-scale pretrained language models has revolutionized\nthe capabilities of new AI application, especially in the realm of crafting\nchatbots with distinct personas. Given the \"stimulus-response\" nature of\nchatbots, this paper unveils an innovative open-ended interview-style approach\nfor personality assessment on role-playing chatbots, which offers a richer\ncomprehension of their intrinsic personalities. We conduct personality\nassessments on 32 role-playing chatbots created by the ChatHaruhi library,\nacross both the Big Five and MBTI dimensions, and measure their alignment with\nhuman perception. Evaluation results underscore that modern role-playing\nchatbots based on LLMs can effectively portray personality traits of\ncorresponding characters, with an alignment rate of 82.8% compared with\nhuman-perceived personalities. Besides, we also suggest potential strategies\nfor shaping chatbots' personalities. 
Hence, this paper serves as a cornerstone\nstudy for role-playing chatbots that intersects computational linguistics and\npsychology. Our resources are available at\nhttps://github.com/LC1332/Chat-Haruhi-Suzumiya\n","authors":["Xintao Wang","Xintao Wang","Yaying Fei","Ziang Leng","Cheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.17976v1.pdf","comment":"A Personality Traits Test Over ChatHaruhi"},{"id":"http://arxiv.org/abs/2310.17408v2","updated":"2023-10-27T08:28:03Z","published":"2023-10-26T14:09:57Z","title":"Tackling the Matrix Multiplication Micro-kernel Generation with Exo","summary":" The optimization of the matrix multiplication (or GEMM) has been a need\nduring the last decades. This operation is considered the flagship of current\nlinear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its\nwidespread use in a large variety of scientific applications. The GEMM is\nusually implemented following the GotoBLAS philosophy, which tiles the GEMM\noperands and uses a series of nested loops for performance improvement. These\napproaches extract the maximum computational power of the architectures through\nsmall pieces of hardware-oriented, high-performance code called micro-kernel.\nHowever, this approach forces developers to generate, with a non-negligible\neffort, a dedicated micro-kernel for each new hardware.\n In this work, we present a step-by-step procedure for generating\nmicro-kernels with the Exo compiler that performs close to (or even better\nthan) manually developed microkernels written with intrinsic functions or\nassembly language. Our solution also improves the portability of the generated\ncode, since a hardware target is fully specified by a concise library-based\ndescription of its instructions.\n","authors":["Adrián Castelló","Julian Bellavita","Grace Dinh","Yuka Ikarashi","Héctor Martínez"],"pdf_url":"https://arxiv.org/pdf/2310.17408v2.pdf","comment":"11 pages, 18 figures. Presented at CGO 2024. 
It includes a software\n artifact step-by-step execution"},{"id":"http://arxiv.org/abs/2310.17956v1","updated":"2023-10-27T08:05:21Z","published":"2023-10-27T08:05:21Z","title":"Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General\n Healthcare","summary":" Large Language Models (LLMs) have introduced a new era of proficiency in\ncomprehending complex healthcare and biomedical topics. However, there is a\nnoticeable lack of models in languages other than English and models that can\ninterpret multi-modal input, which is crucial for global healthcare\naccessibility. In response, this study introduces Qilin-Med-VL, the first\nChinese large vision-language model designed to integrate the analysis of\ntextual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer\n(ViT) with a foundational LLM. It undergoes a thorough two-stage curriculum\ntraining process that includes feature alignment and instruction tuning. This\nmethod enhances the model's ability to generate medical captions and answer\ncomplex medical queries. We also release ChiMed-VL, a dataset consisting of\nmore than 1M image-text pairs. This dataset has been carefully curated to\nenable detailed and comprehensive interpretation of medical data using various\ntypes of images.\n","authors":["Junling Liu","Ziming Wang","Qichen Ye","Dading Chong","Peilin Zhou","Yining Hua"],"pdf_url":"https://arxiv.org/pdf/2310.17956v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17953v1","updated":"2023-10-27T08:01:55Z","published":"2023-10-27T08:01:55Z","title":"Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed\n Languages","summary":" Recently Whisper has approached human-level robustness and accuracy in\nEnglish automatic speech recognition (ASR), while in minor language and mixed\nlanguage speech recognition, there remains a compelling need for further\nimprovement. 
In this work, we present the impressive results of Whisper-MCE,\nour finetuned Whisper model, which was trained using our self-collected\ndataset, Mixed Cantonese and English audio dataset (MCE). Meanwhile,\nconsidering word error rate (WER) poses challenges when it comes to evaluating\nits effectiveness in minor language and mixed-language contexts, we present a\nnovel rating mechanism. By comparing our model to the baseline whisper-large-v2\nmodel, we demonstrate its superior ability to accurately capture the content of\nthe original audio, achieve higher recognition accuracy, and exhibit faster\nrecognition speed. Notably, our model outperforms other existing models in the\nspecific task of recognizing mixed language.\n","authors":["Peng Xie","XingYuan Liu","ZiWei Chen","Kani Chen","Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17953v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17940v1","updated":"2023-10-27T07:34:51Z","published":"2023-10-27T07:34:51Z","title":"Unified Segment-to-Segment Framework for Simultaneous Sequence\n Generation","summary":" Simultaneous sequence generation is a pivotal task for real-time scenarios,\nsuch as streaming speech recognition, simultaneous machine translation and\nsimultaneous speech translation, where the target sequence is generated while\nreceiving the source sequence. The crux of achieving high-quality generation\nwith low latency lies in identifying the optimal moments for generating,\naccomplished by learning a mapping between the source and target sequences.\nHowever, existing methods often rely on task-specific heuristics for different\nsequence types, limiting the model's capacity to adaptively learn the\nsource-target mapping and hindering the exploration of multi-task learning for\nvarious simultaneous tasks. In this paper, we propose a unified\nsegment-to-segment framework (Seg2Seg) for simultaneous sequence generation,\nwhich learns the mapping in an adaptive and unified manner. 
During the process\nof simultaneous generation, the model alternates between waiting for a source\nsegment and generating a target segment, making the segment serve as the\nnatural bridge between the source and target. To accomplish this, Seg2Seg\nintroduces a latent segment as the pivot between source to target and explores\nall potential source-target mappings via the proposed expectation training,\nthereby learning the optimal moments for generating. Experiments on multiple\nsimultaneous generation tasks demonstrate that Seg2Seg achieves\nstate-of-the-art performance and exhibits better generality across various\ntasks.\n","authors":["Shaolei Zhang","Yang Feng"],"pdf_url":"https://arxiv.org/pdf/2310.17940v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17936v1","updated":"2023-10-27T07:21:37Z","published":"2023-10-27T07:21:37Z","title":"Transformers as Graph-to-Graph Models","summary":" We argue that Transformers are essentially graph-to-graph models, with\nsequences just being a special case. Attention weights are functionally\nequivalent to graph edges. Our Graph-to-Graph Transformer architecture makes\nthis ability explicit, by inputting graph edges into the attention weight\ncomputations and predicting graph edges with attention-like functions, thereby\nintegrating explicit graphs into the latent graphs learned by pretrained\nTransformers. Adding iterative graph refinement provides a joint embedding of\ninput, output, and latent graphs, allowing non-autoregressive graph prediction\nto optimise the complete graph without any bespoke pipeline or decoding\nstrategy. Empirical results show that this architecture achieves\nstate-of-the-art accuracies for modelling a variety of linguistic structures,\nintegrating very effectively with the latent linguistic representations learned\nby pretraining.\n","authors":["James Henderson","Alireza Mohammadshahi","Andrei C. 
Coman","Lesly Miculicich"],"pdf_url":"https://arxiv.org/pdf/2310.17936v1.pdf","comment":"Accepted to Big Picture workshop at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17924v1","updated":"2023-10-27T06:48:48Z","published":"2023-10-27T06:48:48Z","title":"SOUL: Towards Sentiment and Opinion Understanding of Language","summary":" Sentiment analysis is a well-established natural language processing task,\nwith sentiment polarity classification being one of its most popular and\nrepresentative tasks. However, despite the success of pre-trained language\nmodels in this area, they often fall short of capturing the broader\ncomplexities of sentiment analysis. To address this issue, we propose a new\ntask called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims\nto evaluate sentiment understanding through two subtasks: Review Comprehension\n(RC) and Justification Generation (JG). RC seeks to validate statements that\nfocus on subjective information based on a review text, while JG requires\nmodels to provide explanations for their sentiment predictions. To enable\ncomprehensive evaluation, we annotate a new dataset comprising 15,028\nstatements from 3,638 reviews. Experimental results indicate that SOUL is a\nchallenging task for both small and large language models, with a performance\ngap of up to 27% when compared to human performance. Furthermore, evaluations\nconducted with both human experts and GPT-4 highlight the limitations of the\nsmall language model in generating reasoning-based justifications. These\nfindings underscore the challenging nature of the SOUL task for existing\nmodels, emphasizing the need for further advancements in sentiment analysis to\naddress its complexities. 
The new dataset and code are available at\nhttps://github.com/DAMO-NLP-SG/SOUL.\n","authors":["Yue Deng","Wenxuan Zhang","Sinno Jialin Pan","Lidong Bing"],"pdf_url":"https://arxiv.org/pdf/2310.17924v1.pdf","comment":"EMNLP 2023 Main Conference, Short Paper"},{"id":"http://arxiv.org/abs/2310.17918v1","updated":"2023-10-27T06:22:14Z","published":"2023-10-27T06:22:14Z","title":"Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection\n Method","summary":" Large Language Models (LLMs) have shown great potential in Natural Language\nProcessing (NLP) tasks. However, recent literature reveals that LLMs generate\nnonfactual responses intermittently, which impedes the LLMs' reliability for\nfurther utilization. In this paper, we propose a novel self-detection method to\ndetect which questions an LLM does not know and is therefore prone to answer with\nnonfactual results. Specifically, we first diversify the textual expressions\nfor a given question and collect the corresponding answers. Then we examine the\ndivergencies between the generated answers to identify the questions for which the\nmodel may generate falsehoods. All of the above steps can be accomplished by\nprompting the LLMs themselves without referring to any other external\nresources. We conduct comprehensive experiments and demonstrate the\neffectiveness of our method on recently released LLMs, e.g., Vicuna, ChatGPT,\nand GPT-4.\n","authors":["Yukun Zhao","Lingyong Yan","Weiwei Sun","Guoliang Xing","Chong Meng","Shuaiqiang Wang","Zhicong Cheng","Zhaochun Ren","Dawei Yin"],"pdf_url":"https://arxiv.org/pdf/2310.17918v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17914v1","updated":"2023-10-27T06:15:30Z","published":"2023-10-27T06:15:30Z","title":"3D-Aware Visual Question Answering about Parts, Poses and Occlusions","summary":" Despite rapid progress in Visual question answering (VQA), existing datasets\nand models mainly focus on testing reasoning in 2D.
However, it is important\nthat VQA models also understand the 3D structure of visual scenes, for example\nto support tasks like navigation or manipulation. This includes an\nunderstanding of the 3D object pose, their parts and occlusions. In this work,\nwe introduce the task of 3D-aware VQA, which focuses on challenging questions\nthat require a compositional reasoning over the 3D structure of visual scenes.\nWe address 3D-aware VQA from both the dataset and the model perspective. First,\nwe introduce Super-CLEVR-3D, a compositional reasoning dataset that contains\nquestions about object parts, their 3D poses, and occlusions. Second, we\npropose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas:\nprobabilistic neural symbolic program execution for reasoning and deep neural\nnetworks with 3D generative representations of objects for robust visual\nrecognition. Our experimental results show our model PO3D-VQA outperforms\nexisting methods significantly, but we still observe a significant performance\ngap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an\nimportant open research area.\n","authors":["Xingrui Wang","Wufei Ma","Zhuowan Li","Adam Kortylewski","Alan Yuille"],"pdf_url":"https://arxiv.org/pdf/2310.17914v1.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2306.02531v2","updated":"2023-10-27T05:53:16Z","published":"2023-06-05T01:36:39Z","title":"PLANNER: Generating Diversified Paragraph via Latent Language Diffusion\n Model","summary":" Autoregressive models for text sometimes generate repetitive and low-quality\noutput because errors accumulate during the steps of generation. This issue is\noften attributed to exposure bias - the difference between how a model is\ntrained, and how it is used during inference. Denoising diffusion models\nprovide an alternative approach in which a model can revisit and revise its\noutput. 
However, they can be computationally expensive and prior efforts on\ntext have led to models that produce less fluent output compared to\nautoregressive models, especially for longer text and paragraphs. In this\npaper, we propose PLANNER, a model that combines latent semantic diffusion with\nautoregressive generation, to generate fluent text while exercising global\ncontrol over paragraphs. The model achieves this by combining an autoregressive\n\"decoding\" module with a \"planning\" module that uses latent diffusion to\ngenerate semantic paragraph embeddings in a coarse-to-fine manner. The proposed\nmethod is evaluated on various conditional generation tasks, and results on\nsemantic generation, text completion and summarization show its effectiveness\nin generating high-quality long-form text in an efficient manner.\n","authors":["Yizhe Zhang","Jiatao Gu","Zhuofeng Wu","Shuangfei Zhai","Josh Susskind","Navdeep Jaitly"],"pdf_url":"https://arxiv.org/pdf/2306.02531v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.09887v2","updated":"2023-10-27T05:14:46Z","published":"2023-02-20T10:30:16Z","title":"90% F1 Score in Relational Triple Extraction: Is it Real ?","summary":" Extracting relational triples from text is a crucial task for constructing\nknowledge bases. Recent advancements in joint entity and relation extraction\nmodels have demonstrated remarkable F1 scores ($\\ge 90\\%$) in accurately\nextracting relational triples from free text. However, these models have been\nevaluated under restrictive experimental settings and unrealistic datasets.\nThey overlook sentences with zero triples (zero-cardinality), thereby\nsimplifying the task. In this paper, we present a benchmark study of\nstate-of-the-art joint entity and relation extraction models under a more\nrealistic setting. We include sentences that lack any triples in our\nexperiments, providing a comprehensive evaluation. 
Our findings reveal a\nsignificant decline (approximately 10-15\\% in one dataset and 6-14\\% in another\ndataset) in the models' F1 scores within this realistic experimental setup.\nFurthermore, we propose a two-step modeling approach that utilizes a simple\nBERT-based classifier. This approach leads to overall performance improvement\nin these models within the realistic experimental setting.\n","authors":["Pratik Saini","Samiran Pal","Tapas Nayak","Indrajit Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2302.09887v2.pdf","comment":"Accepted in GenBench workshop @ EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17894v1","updated":"2023-10-27T05:01:20Z","published":"2023-10-27T05:01:20Z","title":"Natural Language Interfaces for Tabular Data Querying and Visualization:\n A Survey","summary":" The emergence of natural language processing has revolutionized the way users\ninteract with tabular data, enabling a shift from traditional query languages\nand manual plotting to more intuitive, language-based interfaces. The rise of\nlarge language models (LLMs) such as ChatGPT and its successors has further\nadvanced this field, opening new avenues for natural language processing\ntechniques. This survey presents a comprehensive overview of natural language\ninterfaces for tabular data querying and visualization, which allow users to\ninteract with data using natural language queries. We introduce the fundamental\nconcepts and techniques underlying these interfaces with a particular emphasis\non semantic parsing, the key technology facilitating the translation from\nnatural language to SQL queries or data visualization commands. We then delve\ninto the recent advancements in Text-to-SQL and Text-to-Vis problems from the\nperspectives of datasets, methodologies, metrics, and system designs. This\nincludes a deep dive into the influence of LLMs, highlighting their strengths,\nlimitations, and potential for future improvements. 
Through this survey, we aim\nto provide a roadmap for researchers and practitioners interested in developing\nand applying natural language interfaces for data interaction in the era of\nlarge language models.\n","authors":["Weixu Zhang","Yifei Wang","Yuanfeng Song","Victor Junqiu Wei","Yuxing Tian","Yiyan Qi","Jonathan H. Chan","Raymond Chi-Wing Wong","Haiqin Yang"],"pdf_url":"https://arxiv.org/pdf/2310.17894v1.pdf","comment":"20 pages, 4 figures, 5 tables. Submitted to IEEE TKDE"},{"id":"http://arxiv.org/abs/2310.04691v2","updated":"2023-10-27T04:56:13Z","published":"2023-10-07T05:37:41Z","title":"EMO: Earth Mover Distance Optimization for Auto-Regressive Language\n Modeling","summary":" Neural language models are probabilistic models of human text. They are\npredominantly trained using maximum likelihood estimation (MLE), which is\nequivalent to minimizing the forward cross-entropy between the empirical data\ndistribution and the model distribution. However, various degeneration\nphenomena are still widely observed when decoding from the distributions\nlearned by such models. We establish that the forward cross-entropy is\nsuboptimal as a distance metric for aligning human and model distribution due\nto its (1) recall-prioritization (2) negative diversity ignorance and (3)\ntrain-test mismatch. In this paper, we propose Earth Mover Distance\nOptimization (EMO) for auto-regressive language modeling. EMO capitalizes on\nthe inherent properties of earth mover distance to address the aforementioned\nchallenges. Due to the high complexity of direct computation, we further\nintroduce a feasible upper bound for EMO to ease end-to-end training. Upon\nextensive evaluation of language models trained using EMO and MLE, we find that\nEMO demonstrates a consistently better language modeling performance than MLE\nacross domains.
Moreover, EMO demonstrates noteworthy enhancements in\ndownstream performance with minimal fine-tuning on merely 25,000 sentences.\nThis highlights the tremendous potential of EMO as a lightweight calibration\nmethod for enhancing large-scale pre-trained language models.\n","authors":["Siyu Ren","Zhiyong Wu","Kenny Q. Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.04691v2.pdf","comment":"Update experimental results of instruction-tuning and Github link"},{"id":"http://arxiv.org/abs/2308.02122v2","updated":"2023-10-27T04:51:56Z","published":"2023-08-04T03:48:28Z","title":"ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned\n Samples in NLP","summary":" Backdoor attacks have emerged as a prominent threat to natural language\nprocessing (NLP) models, where the presence of specific triggers in the input\ncan lead poisoned models to misclassify these inputs to predetermined target\nclasses. Current detection mechanisms are limited by their inability to address\nmore covert backdoor strategies, such as style-based attacks. In this work, we\npropose an innovative test-time poisoned sample detection framework that hinges\non the interpretability of model predictions, grounded in the semantic meaning\nof inputs. We contend that triggers (e.g., infrequent words) are not supposed\nto fundamentally alter the underlying semantic meanings of poisoned samples as\nthey want to stay stealthy. Based on this observation, we hypothesize that\nwhile the model's predictions for paraphrased clean samples should remain\nstable, predictions for poisoned samples should revert to their true labels\nupon the mutations applied to triggers during the paraphrasing process. We\nemploy ChatGPT, a state-of-the-art large language model, as our paraphraser and\nformulate the trigger-removal task as a prompt engineering problem. 
We adopt\nfuzzing, a technique commonly used for unearthing software vulnerabilities, to\ndiscover optimal paraphrase prompts that can effectively eliminate triggers\nwhile concurrently maintaining input semantics. Experiments on 4 types of\nbackdoor attacks, including the subtle style backdoors, and 4 distinct datasets\ndemonstrate that our approach surpasses baseline methods, including STRIP, RAP,\nand ONION, in precision and recall.\n","authors":["Lu Yan","Zhuo Zhang","Guanhong Tao","Kaiyuan Zhang","Xuan Chen","Guangyu Shen","Xiangyu Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.02122v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13850v3","updated":"2023-10-27T04:42:12Z","published":"2023-05-23T09:18:47Z","title":"Global Structure Knowledge-Guided Relation Extraction Method for\n Visually-Rich Document","summary":" Visual Relation Extraction (VRE) is a powerful means of discovering\nrelationships between entities within visually-rich documents. Existing methods\noften focus on manipulating entity features to find pairwise relations, yet\nneglect the more fundamental structural information that links disparate entity\npairs together. The absence of global structure information may make the model\nstruggle to learn long-range relations and easily predict conflicted results.\nTo alleviate such limitations, we propose a GlObal Structure knowledge-guided\nrelation Extraction (GOSE) framework. GOSE initiates by generating preliminary\nrelation predictions on entity pairs extracted from a scanned image of the\ndocument. Subsequently, global structural knowledge is captured from the\npreceding iterative predictions, which are then incorporated into the\nrepresentations of the entities. This \"generate-capture-incorporate\" cycle is\nrepeated multiple times, allowing entity representations and global structure\nknowledge to be mutually reinforced. 
Extensive experiments validate that GOSE\nnot only outperforms existing methods in the standard fine-tuning setting but\nalso reveals superior cross-lingual learning capabilities; indeed, it even yields\nstronger data-efficient performance in the low-resource setting. The code for\nGOSE will be available at https://github.com/chenxn2020/GOSE.\n","authors":["Xiangnan Chen","Qian Xiao","Juncheng Li","Duo Dong","Jun Lin","Xiaozhong Liu","Siliang Tang"],"pdf_url":"https://arxiv.org/pdf/2305.13850v3.pdf","comment":"Accepted by EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2307.04090v2","updated":"2023-10-27T04:27:41Z","published":"2023-07-09T04:19:19Z","title":"DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge\n Graphs","summary":" Recent work within the Argument Mining community has shown the applicability\nof Natural Language Processing systems for solving problems found within\ncompetitive debate. One of the most important tasks within competitive debate\nis for debaters to create high quality debate cases. We show that effective\ndebate cases can be constructed using constrained shortest path traversals on\nArgumentative Semantic Knowledge Graphs. We study this potential in the context\nof a type of American Competitive Debate, called Policy Debate, which already\nhas a large scale dataset targeting it called DebateSum. We significantly\nimprove upon DebateSum by introducing 53180 new examples, as well as further\nuseful metadata for every example, to the dataset. We leverage the txtai\nsemantic search and knowledge graph toolchain to produce and contribute 9\nsemantic knowledge graphs built on this dataset. We create a unique method for\nevaluating which knowledge graphs are better in the context of producing policy\ndebate cases.
A demo which automatically generates debate cases, along with all\nother code and the Knowledge Graphs, are open-sourced and made available to the\npublic here: https://huggingface.co/spaces/Hellisotherpeople/DebateKG\n","authors":["Allen Roush","David Mezzetti"],"pdf_url":"https://arxiv.org/pdf/2307.04090v2.pdf","comment":"8 pages, Accepted to The 4th New Frontiers in Summarization Workshop\n (EMNLP 2023), System Demonstration paper"},{"id":"http://arxiv.org/abs/2212.10767v2","updated":"2023-10-27T04:19:50Z","published":"2022-12-21T05:01:01Z","title":"How Does Beam Search improve Span-Level Confidence Estimation in\n Generative Sequence Labeling?","summary":" Sequence labeling is a core task in text understanding for IE/IR systems.\nText generation models have increasingly become the go-to solution for such\ntasks (e.g., entity extraction and dialog slot filling). While most research\nhas focused on the labeling accuracy, a key aspect -- of vital practical\nimportance -- has slipped through the cracks: understanding model confidence.\nMore specifically, we lack a principled understanding of how to reliably gauge\nthe confidence of a model in its predictions for each labeled span. This paper\naims to provide some empirical insights on estimating model confidence for\ngenerative sequence labeling. Most notably, we find that simply using the\ndecoder's output probabilities \\textbf{is not} the best in realizing\nwell-calibrated confidence estimates. 
As verified over six public datasets of\ndifferent tasks, we show that our proposed approach -- which leverages\nstatistics from top-$k$ predictions by a beam search -- significantly reduces\ncalibration errors of the predictions of a generative sequence labeling model.\n","authors":["Kazuma Hashimoto","Iftekhar Naim","Karthik Raman"],"pdf_url":"https://arxiv.org/pdf/2212.10767v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17884v1","updated":"2023-10-27T04:15:30Z","published":"2023-10-27T04:15:30Z","title":"Can LLMs Keep a Secret? Testing Privacy Implications of Language Models\n via Contextual Integrity Theory","summary":" The interactive use of large language models (LLMs) in AI assistants (at\nwork, home, etc.) introduces a new set of inference-time privacy risks: LLMs\nare fed different types of information from multiple sources in their inputs\nand are expected to reason about what to share in their outputs, for what\npurpose and with whom, within a given context. In this work, we draw attention\nto the highly critical yet overlooked notion of contextual privacy by proposing\nConfAIde, a benchmark designed to identify critical weaknesses in the privacy\nreasoning capabilities of instruction-tuned LLMs. Our experiments show that\neven the most capable models such as GPT-4 and ChatGPT reveal private\ninformation in contexts that humans would not, 39% and 57% of the time,\nrespectively. This leakage persists even when we employ privacy-inducing\nprompts or chain-of-thought reasoning. 
Our work underscores the immediate need\nto explore novel inference-time privacy-preserving approaches, based on\nreasoning and theory of mind.\n","authors":["Niloofar Mireshghallah","Hyunwoo Kim","Xuhui Zhou","Yulia Tsvetkov","Maarten Sap","Reza Shokri","Yejin Choi"],"pdf_url":"https://arxiv.org/pdf/2310.17884v1.pdf","comment":"confaide.github.io"},{"id":"http://arxiv.org/abs/2310.17877v1","updated":"2023-10-27T03:39:51Z","published":"2023-10-27T03:39:51Z","title":"ASPIRO: Any-shot Structured Parsing-error-Induced ReprOmpting for\n Consistent Data-to-Text Generation","summary":" We present ASPIRO, an approach for structured data verbalisation into short\ntemplate sentences in zero to few-shot settings. Unlike previous methods, our\napproach prompts large language models (LLMs) to directly produce\nentity-agnostic templates, rather than relying on LLMs to faithfully copy the\ngiven example entities, or validating/crafting the templates manually. We\nincorporate LLM re-prompting, triggered by algorithmic parsing checks, as well\nas the PARENT metric induced consistency validation to identify and rectify\ntemplate generation problems in real-time. ASPIRO, compared to direct LLM\noutput, averages 66\\% parsing error rate reduction in generated verbalisations\nof RDF triples on the DART dataset. 
Our best 5-shot text-davinci-003 setup,\nscoring BLEU of 50.62, METEOR of 45.16, BLEURT of 0.82, NUBIA of 0.87, and\nPARENT of 0.8962 on the Rel2Text dataset, competes effectively with recent\nfine-tuned pre-trained language models.\n","authors":["Martin Vejvar","Yasutaka Fujimoto"],"pdf_url":"https://arxiv.org/pdf/2310.17877v1.pdf","comment":"Accepted to Findings of EMNLP2023, code available at\n https://github.com/vejvarm/ASPIRO"},{"id":"http://arxiv.org/abs/2310.17876v1","updated":"2023-10-27T03:32:17Z","published":"2023-10-27T03:32:17Z","title":"TarGEN: Targeted Data Generation with Large Language Models","summary":" The rapid advancement of large language models (LLMs) has sparked interest in\ndata synthesis techniques, aiming to generate diverse and high-quality\nsynthetic datasets. However, these synthetic datasets often suffer from a lack\nof diversity and added noise. In this paper, we present TarGEN, a multi-step\nprompting strategy for generating high-quality synthetic datasets utilizing a\nLLM. An advantage of TarGEN is its seedless nature; it does not require\nspecific task instances, broadening its applicability beyond task replication.\nWe augment TarGEN with a method known as self-correction empowering LLMs to\nrectify inaccurately labeled instances during dataset creation, ensuring\nreliable labels. To assess our technique's effectiveness, we emulate 8 tasks\nfrom the SuperGLUE benchmark and finetune various language models, including\nencoder-only, encoder-decoder, and decoder-only models on both synthetic and\noriginal training sets. Evaluation on the original test set reveals that models\ntrained on datasets generated by TarGEN perform approximately 1-2% points\nbetter than those trained on original datasets (82.84% via syn. vs. 81.12% on\nog. using Flan-T5). When incorporating instruction tuning, the performance\nincreases to 84.54% on synthetic data vs. 81.49% on original data by Flan-T5. 
A\ncomprehensive analysis of the synthetic dataset compared to the original\ndataset reveals that the synthetic dataset demonstrates similar or higher\nlevels of dataset complexity and diversity. Furthermore, the synthetic dataset\ndisplays a bias level that aligns closely with the original dataset. Finally,\nwhen pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive\nresults on the OpenLLM leaderboard, surpassing the model trained on the\nSelf-Instruct dataset by 4.14% points. We hope that TarGEN can be helpful for\nquality data generation and reducing the human efforts to create complex\nbenchmarks.\n","authors":["Himanshu Gupta","Kevin Scaria","Ujjwala Anantheswaran","Shreyas Verma","Mihir Parmar","Saurabh Arjun Sawant","Swaroop Mishra","Chitta Baral"],"pdf_url":"https://arxiv.org/pdf/2310.17876v1.pdf","comment":"10 pages, 6 tables, 5 figures, 5 pages references, 17 pages appendix"},{"id":"http://arxiv.org/abs/2310.15970v3","updated":"2023-10-27T02:54:29Z","published":"2023-10-24T16:10:58Z","title":"Accented Speech Recognition With Accent-specific Codebooks","summary":" Speech accents pose a significant challenge to state-of-the-art automatic\nspeech recognition (ASR) systems. Degradation in performance across\nunderrepresented accents is a severe deterrent to the inclusive adoption of\nASR. In this work, we propose a novel accent adaptation approach for end-to-end\nASR systems using cross-attention with a trainable set of codebooks. These\nlearnable codebooks capture accent-specific information and are integrated\nwithin the ASR encoder layers. 
The model is trained on accented English speech,\nwhile the test data also contained accents which were not seen during training.\nOn the Mozilla Common Voice multi-accented dataset, we show that our proposed\napproach yields significant performance gains not only on the seen English\naccents (up to $37\\%$ relative improvement in word error rate) but also on the\nunseen accents (up to $5\\%$ relative improvement in WER). Further, we\nillustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We\nalso compare the performance with other approaches based on accent adversarial\ntraining.\n","authors":["Darshan Prabhu","Preethi Jyothi","Sriram Ganapathy","Vinit Unni"],"pdf_url":"https://arxiv.org/pdf/2310.15970v3.pdf","comment":"Accepted to EMNLP 2023 Main Conference (Long Paper)"},{"id":"http://arxiv.org/abs/2301.12534v3","updated":"2023-10-27T02:42:25Z","published":"2023-01-29T20:39:21Z","title":"Vicarious Offense and Noise Audit of Offensive Speech Classifiers:\n Unifying Human and Machine Disagreement on What is Offensive","summary":" Offensive speech detection is a key component of content moderation. However,\nwhat is offensive can be highly subjective. This paper investigates how machine\nand human moderators disagree on what is offensive when it comes to real-world\nsocial web political discourse. We show that (1) there is extensive\ndisagreement among the moderators (humans and machines); and (2) human and\nlarge-language-model classifiers are unable to predict how other human raters\nwill respond, based on their political leanings. For (1), we conduct a noise\naudit at an unprecedented scale that combines both machine and human responses.\nFor (2), we introduce a first-of-its-kind dataset of vicarious offense. Our\nnoise audit reveals that moderation outcomes vary wildly across different\nmachine moderators. 
Our experiments with human moderators suggest that\npolitical leanings combined with sensitive issues affect both first-person and\nvicarious offense. The dataset is available through\nhttps://github.com/Homan-Lab/voiced.\n","authors":["Tharindu Cyril Weerasooriya","Sujan Dutta","Tharindu Ranasinghe","Marcos Zampieri","Christopher M. Homan","Ashiqur R. KhudaBukhsh"],"pdf_url":"https://arxiv.org/pdf/2301.12534v3.pdf","comment":"Accepted to appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17513v2","updated":"2023-10-27T02:36:44Z","published":"2023-10-26T16:08:33Z","title":"The Expressive Power of Low-Rank Adaptation","summary":" Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that\nleverages low-rank adaptation of weight matrices, has emerged as a prevalent\ntechnique for fine-tuning pre-trained models such as large language models and\ndiffusion models. Despite its huge success in practice, the theoretical\nunderpinnings of LoRA have largely remained unexplored. This paper takes the\nfirst step to bridge this gap by theoretically analyzing the expressive power\nof LoRA. We prove that, for fully connected neural networks, LoRA can adapt any\nmodel $f$ to accurately represent any smaller target model $\\overline{f}$ if\nLoRA-rank $\\geq(\\text{width of }f) \\times \\frac{\\text{depth of\n}\\overline{f}}{\\text{depth of }f}$. We also quantify the approximation error\nwhen LoRA-rank is lower than the threshold. 
For Transformer networks, we show\nany model can be adapted to a target model of the same size with\nrank-$(\\frac{\\text{embedding size}}{2})$ LoRA adapters.\n","authors":["Yuchen Zeng","Kangwook Lee"],"pdf_url":"https://arxiv.org/pdf/2310.17513v2.pdf","comment":"40 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.17857v1","updated":"2023-10-27T02:18:10Z","published":"2023-10-27T02:18:10Z","title":"From Values to Opinions: Predicting Human Behaviors and Stances Using\n Value-Injected Large Language Models","summary":" Being able to predict people's opinions on issues and behaviors in realistic\nscenarios can be helpful in various domains, such as politics and marketing.\nHowever, conducting large-scale surveys like the European Social Survey to\nsolicit people's opinions on individual issues can incur prohibitive costs.\nLeveraging prior research showing influence of core human values on individual\ndecisions and actions, we propose to use value-injected large language models\n(LLM) to predict opinions and behaviors. To this end, we present Value\nInjection Method (VIM), a collection of two methods -- argument generation and\nquestion answering -- designed to inject targeted value distributions into LLMs\nvia fine-tuning. We then conduct a series of experiments on four tasks to test\nthe effectiveness of VIM and the possibility of using value-injected LLMs to\npredict opinions and behaviors of people. We find that LLMs value-injected with\nvariations of VIM substantially outperform the baselines. 
Also, the results\nsuggest that opinions and behaviors can be better predicted using\nvalue-injected LLMs than the baseline approaches.\n","authors":["Dongjun Kang","Joonsuk Park","Yohan Jo","JinYeong Bak"],"pdf_url":"https://arxiv.org/pdf/2310.17857v1.pdf","comment":"EMNLP 2023 main paper accepted"},{"id":"http://arxiv.org/abs/2308.10397v2","updated":"2023-10-27T01:54:26Z","published":"2023-08-21T00:25:17Z","title":"FairMonitor: A Four-Stage Automatic Framework for Detecting Stereotypes\n and Biases in Large Language Models","summary":" Detecting stereotypes and biases in Large Language Models (LLMs) can enhance\nfairness and reduce adverse impacts on individuals or groups when these LLMs\nare applied. However, the majority of existing methods focus on measuring the\nmodel's preference towards sentences containing biases and stereotypes within\ndatasets, which lacks interpretability and cannot detect implicit biases and\nstereotypes in the real world. To address this gap, this paper introduces a\nfour-stage framework to directly evaluate stereotypes and biases in the\ngenerated content of LLMs, including direct inquiry testing, serial or adapted\nstory testing, implicit association testing, and unknown situation testing.\nAdditionally, the paper proposes multi-dimensional evaluation metrics and\nexplainable zero-shot prompts for automated evaluation. Using the education\nsector as a case study, we constructed the Edu-FairMonitor based on the\nfour-stage framework, which encompasses 12,632 open-ended questions covering\nnine sensitive factors and 26 educational scenarios. Experimental results\nreveal varying degrees of stereotypes and biases in five LLMs evaluated on\nEdu-FairMonitor. 
Moreover, the results of our proposed automated evaluation\nmethod have shown a high correlation with human annotations.\n","authors":["Yanhong Bai","Jiabao Zhao","Jinxin Shi","Tingjiang Wei","Xingjiao Wu","Liang He"],"pdf_url":"https://arxiv.org/pdf/2308.10397v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08623v2","updated":"2023-10-27T01:51:48Z","published":"2023-07-14T05:41:22Z","title":"HYTREL: Hypergraph-enhanced Tabular Data Representation Learning","summary":" Language models pretrained on large collections of tabular data have\ndemonstrated their effectiveness in several downstream tasks. However, many of\nthese models do not take into account the row/column permutation invariances,\nhierarchical structure, etc. that exist in tabular data. To alleviate these\nlimitations, we propose HYTREL, a tabular language model, that captures the\npermutation invariances and three more structural properties of tabular data by\nusing hypergraphs - where the table cells make up the nodes and the cells\noccurring jointly together in each row, column, and the entire table are used\nto form three different types of hyperedges. We show that HYTREL is maximally\ninvariant under certain conditions for tabular data, i.e., two tables obtain\nthe same representations via HYTREL iff the two tables are identical up to\npermutations. Our empirical results demonstrate that HYTREL consistently\noutperforms other competitive baselines on four downstream tasks with minimal\npretraining, illustrating the advantages of incorporating the inductive biases\nassociated with tabular data into the representations. 
Finally, our qualitative\nanalyses showcase that HYTREL can assimilate the table structures to generate\nrobust representations for the cells, rows, columns, and the entire table.\n","authors":["Pei Chen","Soumajyoti Sarkar","Leonard Lausen","Balasubramaniam Srinivasan","Sheng Zha","Ruihong Huang","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2307.08623v2.pdf","comment":"NeurIPS 2023 (spotlight)"},{"id":"http://arxiv.org/abs/2305.15328v2","updated":"2023-10-27T01:44:27Z","published":"2023-05-24T16:42:17Z","title":"Visual Programming for Text-to-Image Generation and Evaluation","summary":" As large language models have demonstrated impressive performance in many\ndomains, recent works have adopted language models (LMs) as controllers of\nvisual modules for vision-and-language tasks. While existing work focuses on\nequipping LMs with visual understanding, we propose two novel\ninterpretable/explainable visual programming frameworks for text-to-image (T2I)\ngeneration and evaluation. First, we introduce VPGen, an interpretable\nstep-by-step T2I generation framework that decomposes T2I generation into three\nsteps: object/count generation, layout generation, and image generation. We\nemploy an LM to handle the first two steps (object/count generation and layout\ngeneration), by finetuning it on text-layout pairs. Our step-by-step T2I\ngeneration framework provides stronger spatial control than end-to-end models,\nthe dominant approach for this task. Furthermore, we leverage the world\nknowledge of pretrained LMs, overcoming the limitation of previous\nlayout-guided T2I works that can only handle predefined object classes. We\ndemonstrate that our VPGen has improved control in counts/spatial\nrelations/scales of objects than state-of-the-art T2I generation models.\nSecond, we introduce VPEval, an interpretable and explainable evaluation\nframework for T2I generation based on visual programming. 
Unlike previous T2I\nevaluations with a single scoring model that is accurate in some skills but\nunreliable in others, VPEval produces evaluation programs that invoke a set of\nvisual modules that are experts in different skills, and also provides\nvisual+textual explanations of the evaluation results. Our analysis shows that\nVPEval provides a more human-correlated evaluation for skill-specific and\nopen-ended prompts than widely used single model-based evaluation. We hope that\nour work encourages future progress on interpretable/explainable generation and\nevaluation for T2I models.\n","authors":["Jaemin Cho","Abhay Zala","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2305.15328v2.pdf","comment":"NeurIPS 2023; Project website: https://vp-t2i.github.io"},{"id":"http://arxiv.org/abs/2306.01708v2","updated":"2023-10-27T01:09:31Z","published":"2023-06-02T17:31:32Z","title":"TIES-Merging: Resolving Interference When Merging Models","summary":" Transfer learning - i.e., further fine-tuning a pre-trained model on a\ndownstream task - can confer significant advantages, including improved\ndownstream performance, faster convergence, and better sample efficiency. These\nadvantages have led to a proliferation of task-specific fine-tuned models,\nwhich typically can only perform a single task and do not benefit from one\nanother. Recently, model merging techniques have emerged as a solution to\ncombine multiple task-specific models into a single multitask model without\nperforming additional training. However, existing merging methods often ignore\nthe interference between parameters of different models, resulting in large\nperformance drops when merging multiple models. In this paper, we demonstrate\nthat prior merging techniques inadvertently lose valuable information due to\ntwo major sources of interference: (a) interference due to redundant parameter\nvalues and (b) disagreement on the sign of a given parameter's values across\nmodels. 
To address this, we propose our method, TRIM, ELECT SIGN & MERGE\n(TIES-Merging), which introduces three novel steps when merging models: (1)\nresetting parameters that only changed a small amount during fine-tuning, (2)\nresolving sign conflicts, and (3) merging only the parameters that are in\nalignment with the final agreed-upon sign. We find that TIES-Merging\noutperforms several existing methods in diverse settings covering a range of\nmodalities, domains, number of tasks, model sizes, architectures, and\nfine-tuning settings. We further analyze the impact of different types of\ninterference on model parameters, and highlight the importance of resolving\nsign interference. Our code is available at\nhttps://github.com/prateeky2806/ties-merging\n","authors":["Prateek Yadav","Derek Tam","Leshem Choshen","Colin Raffel","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2306.01708v2.pdf","comment":"Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables"},{"id":"http://arxiv.org/abs/2306.11207v3","updated":"2023-10-27T23:56:33Z","published":"2023-06-20T00:14:47Z","title":"Quilt-1M: One Million Image-Text Pairs for Histopathology","summary":" Recent accelerations in multi-modal applications have been made possible with\nthe plethora of image and text data available online. However, the scarcity of\nanalogous data in the medical field, specifically in histopathology, has slowed\ncomparable progress. To enable similar representation learning for\nhistopathology, we turn to YouTube, an untapped resource of videos, offering\n$1,087$ hours of valuable educational histopathology videos from expert\nclinicians. From YouTube, we curate QUILT: a large-scale vision-language\ndataset consisting of $802, 144$ image and text pairs. QUILT was automatically\ncurated using a mixture of models, including large language models, handcrafted\nalgorithms, human knowledge databases, and automatic speech recognition. 
In\ncomparison, the most comprehensive datasets curated for histopathology amass\nonly around $200$K samples. We combine QUILT with datasets from other sources,\nincluding Twitter, research papers, and the internet in general, to create an\neven larger dataset: QUILT-1M, with $1$M paired image-text samples, marking it\nas the largest vision-language histopathology dataset to date. We demonstrate\nthe value of QUILT-1M by fine-tuning a pre-trained CLIP model. Our model\noutperforms state-of-the-art models on both zero-shot and linear probing tasks\nfor classifying new histopathology images across $13$ diverse patch-level\ndatasets of $8$ different sub-pathologies and cross-modal retrieval tasks.\n","authors":["Wisdom Oluchi Ikezogwo","Mehmet Saygin Seyfioglu","Fatemeh Ghezloo","Dylan Stefan Chan Geva","Fatwir Sheikh Mohammed","Pavan Kumar Anand","Ranjay Krishna","Linda Shapiro"],"pdf_url":"https://arxiv.org/pdf/2306.11207v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.08577v2","updated":"2023-10-27T23:48:01Z","published":"2023-02-16T20:46:36Z","title":"Keep it Neutral: Using Natural Language Inference to Improve Generation","summary":" We explore incorporating natural language inference (NLI) into the text\ngenerative pipeline by using a pre-trained NLI model to assess whether a\ngenerated sentence entails, contradicts, or is neutral to the prompt and\npreceding text. First, we show that the NLI task is predictive of generation\nerrors made by GPT-3. We use these results to develop an NLI-informed\ngeneration procedure for GPT-J. Then, we evaluate these generations by\nobtaining human annotations on error types and overall quality. We find that an\nNLI strategy of maximizing entailment improves text generation when the nucleus\nsampling randomness parameter value is high, while one which maximizes\ncontradiction is in fact productive when the parameter value is low. 
Overall,\nthough, we demonstrate that an NLI strategy of maximizing the neutral class\nprovides the highest quality of generated text (significantly better than the\nvanilla generations), regardless of parameter value.\n","authors":["Michail Mersinias","Kyle Mahowald"],"pdf_url":"https://arxiv.org/pdf/2302.08577v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18538v1","updated":"2023-10-27T23:36:14Z","published":"2023-10-27T23:36:14Z","title":"Evaluating Cross-Domain Text-to-SQL Models and Benchmarks","summary":" Text-to-SQL benchmarks play a crucial role in evaluating the progress made in\nthe field and the ranking of different models. However, accurately matching a\nmodel-generated SQL query to a reference SQL query in a benchmark fails for\nvarious reasons, such as underspecified natural language queries, inherent\nassumptions in both model-generated and reference queries, and the\nnon-deterministic nature of SQL output under certain conditions. In this paper,\nwe conduct an extensive study of several prominent cross-domain text-to-SQL\nbenchmarks and re-evaluate some of the top-performing models within these\nbenchmarks, by both manually evaluating the SQL queries and rewriting them in\nequivalent expressions. Our evaluation reveals that attaining a perfect\nperformance on these benchmarks is unfeasible due to the multiple\ninterpretations that can be derived from the provided samples. Furthermore, we\nfind that the true performance of the models is underestimated and their\nrelative performance changes after a re-evaluation. Most notably, our\nevaluation reveals a surprising discovery: a recent GPT4-based model surpasses\nthe gold standard reference queries in the Spider benchmark in our human\nevaluation. 
This finding highlights the importance of interpreting benchmark\nevaluations cautiously, while also acknowledging the critical role of\nadditional independent evaluations in driving advancements in the field.\n","authors":["Mohammadreza Pourreza","Davood Rafiei"],"pdf_url":"https://arxiv.org/pdf/2310.18538v1.pdf","comment":"To appear in EMNLP 2023"},{"id":"http://arxiv.org/abs/2211.09944v2","updated":"2023-10-27T23:12:55Z","published":"2022-11-17T23:38:29Z","title":"MelHuBERT: A simplified HuBERT on Mel spectrograms","summary":" Self-supervised models have had great success in learning speech\nrepresentations that can generalize to various downstream tasks. However, most\nself-supervised models require a large amount of compute and multiple GPUs to\ntrain, significantly hampering the development of self-supervised learning. In\nan attempt to reduce the computation of training, we revisit the training of\nHuBERT, a highly successful self-supervised model. We improve and simplify\nseveral key components, including the loss function, input representation, and\ntraining in multiple stages. Our model, MelHuBERT, is able to achieve favorable\nperformance on phone recognition, speaker identification, and automatic speech\nrecognition against HuBERT, while saving 31.2% of the pre-training time, or\nequivalently 33.5% MACs per one second speech. The code and pre-trained models\nare available in https://github.com/nervjack2/MelHuBERT.\n","authors":["Tzu-Quan Lin","Hung-yi Lee","Hao Tang"],"pdf_url":"https://arxiv.org/pdf/2211.09944v2.pdf","comment":"ASRU 2023"},{"id":"http://arxiv.org/abs/2310.18502v1","updated":"2023-10-27T21:31:34Z","published":"2023-10-27T21:31:34Z","title":"On the Automatic Generation and Simplification of Children's Stories","summary":" With recent advances in large language models (LLMs), the concept of\nautomatically generating children's educational materials has become\nincreasingly realistic. 
Working toward the goal of age-appropriate simplicity\nin generated educational texts, we first examine the ability of several popular\nLLMs to generate stories with properly adjusted lexical and readability levels.\nWe find that, in spite of the growing capabilities of LLMs, they do not yet\npossess the ability to limit their vocabulary to levels appropriate for younger\nage groups. As a second experiment, we explore the ability of state-of-the-art\nlexical simplification models to generalize to the domain of children's stories\nand, thus, create an efficient pipeline for their automatic generation. In\norder to test these models, we develop a dataset of child-directed lexical\nsimplification instances, with examples taken from the LLM-generated stories in\nour first experiment. We find that, while the strongest-performing current\nlexical simplification models do not perform as well on material designed for\nchildren due to their reliance on large language models behind the scenes, some\nmodels that still achieve fairly strong results on general data can mimic or\neven improve their performance on children-directed data with proper\nfine-tuning, which we conduct using our newly created child-directed\nsimplification dataset.\n","authors":["Maria Valentini","Jennifer Weber","Jesus Salcido","Téa Wright","Eliana Colunga","Katharina Kann"],"pdf_url":"https://arxiv.org/pdf/2310.18502v1.pdf","comment":"Accepted to EMNLP 2023 (main conference)"},{"id":"http://arxiv.org/abs/2310.18491v1","updated":"2023-10-27T21:08:51Z","published":"2023-10-27T21:08:51Z","title":"Publicly Detectable Watermarking for Language Models","summary":" We construct the first provable watermarking scheme for language models with\npublic detectability or verifiability: we use a private key for watermarking\nand a public key for watermark detection. 
Our protocol is the first\nwatermarking scheme that does not embed a statistical signal in generated text.\nRather, we directly embed a publicly-verifiable cryptographic signature using a\nform of rejection sampling. We show that our construction meets strong formal\nsecurity guarantees and preserves many desirable properties found in schemes in\nthe private-key watermarking setting. In particular, our watermarking scheme\nretains distortion-freeness and model agnosticity. We implement our scheme and\nmake empirical measurements over open models in the 7B parameter range. Our\nexperiments suggest that our watermarking scheme meets our formal claims while\npreserving text quality.\n","authors":["Jaiden Fairoze","Sanjam Garg","Somesh Jha","Saeed Mahloujifar","Mohammad Mahmoody","Mingyuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18491v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.12552v2","updated":"2023-10-27T20:55:02Z","published":"2023-06-21T20:36:55Z","title":"SituatedGen: Incorporating Geographical and Temporal Contexts into\n Generative Commonsense Reasoning","summary":" Recently, commonsense reasoning in text generation has attracted much\nattention. Generative commonsense reasoning is the task that requires machines,\ngiven a group of keywords, to compose a single coherent sentence with\ncommonsense plausibility. While existing datasets targeting generative\ncommonsense reasoning focus on everyday scenarios, it is unclear how well\nmachines reason under specific geographical and temporal contexts. We formalize\nthis challenging task as SituatedGen, where machines with commonsense should\ngenerate a pair of contrastive sentences given a group of keywords including\ngeographical or temporal entities. 
We introduce a corresponding English dataset\nconsisting of 8,268 contrastive sentence pairs, which are built upon several\nexisting commonsense reasoning benchmarks with minimal manual labor.\nExperiments show that state-of-the-art generative language models struggle to\ngenerate sentences with commonsense plausibility and still lag far behind human\nperformance. Our dataset is publicly available at\nhttps://github.com/yunx-z/situated_gen.\n","authors":["Yunxiang Zhang","Xiaojun Wan"],"pdf_url":"https://arxiv.org/pdf/2306.12552v2.pdf","comment":"Accepted to NeurIPS 2023 Datasets and Benchmarks Track"},{"id":"http://arxiv.org/abs/2307.06290v2","updated":"2023-10-27T20:53:22Z","published":"2023-07-12T16:37:31Z","title":"Instruction Mining: When Data Mining Meets Large Language Model\n Finetuning","summary":" Large language models (LLMs) are initially pretrained for broad capabilities\nand then finetuned with instruction-following datasets to improve their\nperformance in interacting with humans. Despite advances in finetuning, a\nstandardized guideline for selecting high-quality datasets to optimize this\nprocess remains elusive. In this paper, we first propose InstructMining, an\ninnovative method designed for automatically selecting premium\ninstruction-following data for finetuning LLMs. Specifically, InstructMining\nutilizes natural language indicators as a measure of data quality, applying\nthem to evaluate unseen datasets. During experimentation, we discover that\ndouble descent phenomenon exists in large language model finetuning. Based on\nthis observation, we further leverage BlendSearch to help find the best subset\namong the entire dataset (i.e., 2,532 out of 100,000). 
Experiment results show\nthat InstructMining-7B achieves state-of-the-art performance on two of the most\npopular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.\n","authors":["Yihan Cao","Yanbin Kang","Chi Wang","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2307.06290v2.pdf","comment":"22 pages, 7 figures"},{"id":"http://arxiv.org/abs/2305.19164v2","updated":"2023-10-27T20:32:10Z","published":"2023-05-30T16:09:16Z","title":"LANCE: Stress-testing Visual Models by Generating Language-guided\n Counterfactual Images","summary":" We propose an automated algorithm to stress-test a trained visual model by\ngenerating language-guided counterfactual test images (LANCE). Our method\nleverages recent progress in large language modeling and text-based image\nediting to augment an IID test set with a suite of diverse, realistic, and\nchallenging test images without altering model weights. We benchmark the\nperformance of a diverse set of pre-trained models on our generated data and\nobserve significant and consistent performance drops. We further analyze model\nsensitivity across different types of edits, and demonstrate its applicability\nat surfacing previously unknown class-level model biases in ImageNet. Code is\navailable at https://github.com/virajprabhu/lance.\n","authors":["Viraj Prabhu","Sriram Yenamandra","Prithvijit Chattopadhyay","Judy Hoffman"],"pdf_url":"https://arxiv.org/pdf/2305.19164v2.pdf","comment":"NeurIPS 2023 camera ready. Project webpage:\n https://virajprabhu.github.io/lance-web/"},{"id":"http://arxiv.org/abs/2310.18463v1","updated":"2023-10-27T20:15:23Z","published":"2023-10-27T20:15:23Z","title":"PeTailor: Improving Large Language Model by Tailored Chunk Scorer in\n Biomedical Triple Extraction","summary":" The automatic extraction of biomedical entities and their interaction from\nunstructured data remains a challenging task due to the limited availability of\nexpert-labeled standard datasets. 
In this paper, we introduce PETAILOR, a\nretrieval-based language framework that is augmented by a tailored chunk scorer.\nUnlike previous retrieval-augmented language models (LM) that retrieve relevant\ndocuments by calculating the similarity between the input sentence and the\ncandidate document set, PETAILOR segments the sentence into chunks and\nretrieves the relevant chunk from our pre-computed chunk-based relational\nkey-value memory. Moreover, in order to comprehend the specific requirements of\nthe LM, PETAILOR adapts the tailored chunk scorer to the LM. We also introduce\nGM-CIHT, an expert annotated biomedical triple extraction dataset with more\nrelation types. This dataset is centered on the non-drug treatment and general\nbiomedical domain. Additionally, we investigate the efficacy of triple\nextraction models trained on general domains when applied to the biomedical\ndomain. Our experiments reveal that PETAILOR achieves state-of-the-art\nperformance on GM-CIHT\n","authors":["Mingchen Li","M. Chen","Huixue Zhou","Rui Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.18463v1.pdf","comment":"this is the first preprint version"},{"id":"http://arxiv.org/abs/2310.18458v1","updated":"2023-10-27T20:11:38Z","published":"2023-10-27T20:11:38Z","title":"Do Not Harm Protected Groups in Debiasing Language Representation Models","summary":" Language Representation Models (LRMs) trained with real-world data may\ncapture and exacerbate undesired bias and cause unfair treatment of people in\nvarious demographic groups. Several techniques have been investigated for\napplying interventions to LRMs to remove bias in benchmark evaluations on, for\nexample, word embeddings. However, the negative side effects of debiasing\ninterventions are usually not revealed in the downstream tasks. We propose\nxGAP-DEBIAS, a set of evaluations on assessing the fairness of debiasing. 
In\nthis work, we examine four debiasing techniques on a real-world text\nclassification task and show that reducing bias comes at the cost of degrading\nperformance for all demographic groups, including those the debiasing\ntechniques aim to protect. We advocate that a debiasing technique should have\ngood downstream performance with the constraint of ensuring no harm to the\nprotected group.\n","authors":["Chloe Qinyu Zhu","Rickard Stureborg","Brandon Fain"],"pdf_url":"https://arxiv.org/pdf/2310.18458v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18454v1","updated":"2023-10-27T20:04:57Z","published":"2023-10-27T20:04:57Z","title":"T5 meets Tybalt: Author Attribution in Early Modern English Drama Using\n Large Language Models","summary":" Large language models have shown breakthrough potential in many NLP domains.\nHere we consider their use for stylometry, specifically authorship\nidentification in Early Modern English drama. We find both promising and\nconcerning results; LLMs are able to accurately predict the author of\nsurprisingly short passages but are also prone to confidently misattribute\ntexts to specific authors. A fine-tuned t5-large model outperforms all tested\nbaselines, including logistic regression, SVM with a linear kernel, and cosine\ndelta, at attributing small passages. However, we see indications that the\npresence of certain authors in the model's pre-training data affects predictive\nresults in ways that are difficult to assess.\n","authors":["Rebecca M. M. Hicke","David Mimno"],"pdf_url":"https://arxiv.org/pdf/2310.18454v1.pdf","comment":"Published in CHR 2023"},{"id":"http://arxiv.org/abs/2310.18440v1","updated":"2023-10-27T19:27:59Z","published":"2023-10-27T19:27:59Z","title":"Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement","summary":" Generative language models (LMs) are increasingly used for document\nclass-prediction tasks and promise enormous improvements in cost and\nefficiency. 
Existing research often examines simple classification tasks, but\nthe capability of LMs to classify on complex or specialized tasks is less well\nunderstood. We consider a highly complex task that is challenging even for\nhumans: the classification of legal reasoning according to jurisprudential\nphilosophy. Using a novel dataset of historical United States Supreme Court\nopinions annotated by a team of domain experts, we systematically test the\nperformance of a variety of LMs. We find that generative models perform poorly\nwhen given instructions (i.e. prompts) equal to the instructions presented to\nhuman annotators through our codebook. Our strongest results derive from\nfine-tuning models on the annotated dataset; the best performing model is an\nin-domain model, LEGAL-BERT. We apply predictions from this fine-tuned model to\nstudy historical trends in jurisprudence, an exercise that both aligns with\nprominent qualitative historical accounts and points to areas of possible\nrefinement in those accounts. Our findings generally sound a note of caution in\nthe use of generative LMs on complex tasks without fine-tuning and point to the\ncontinued relevance of human annotation-intensive classification methods.\n","authors":["Rosamond Thalken","Edward H. Stiglitz","David Mimno","Matthew Wilkens"],"pdf_url":"https://arxiv.org/pdf/2310.18440v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18435v1","updated":"2023-10-27T19:21:50Z","published":"2023-10-27T19:21:50Z","title":"Expanding the Set of Pragmatic Considerations in Conversational AI","summary":" Despite considerable performance improvements, current conversational AI\nsystems often fail to meet user expectations. We discuss several pragmatic\nlimitations of current conversational AI systems. We illustrate pragmatic\nlimitations with examples that are syntactically appropriate, but have clear\npragmatic deficiencies. 
We label our complaints as \"Turing Test Triggers\"\n(TTTs) as they indicate where current conversational AI systems fall short\ncompared to human behavior. We develop a taxonomy of pragmatic considerations\nintended to identify what pragmatic competencies a conversational AI system\nrequires and discuss implications for the design and evaluation of\nconversational AI systems.\n","authors":["S. M. Seals","Valerie L. Shalin"],"pdf_url":"https://arxiv.org/pdf/2310.18435v1.pdf","comment":"Pre-print version of paper that appeared at Multidisciplinary\n Perspectives on COntext-aware embodied Spoken Interactions (MP-COSIN)\n workshop at IEEE RO-MAN 2023"},{"id":"http://arxiv.org/abs/2305.14292v2","updated":"2023-10-27T19:11:55Z","published":"2023-05-23T17:37:36Z","title":"WikiChat: Stopping the Hallucination of Large Language Model Chatbots by\n Few-Shot Grounding on Wikipedia","summary":" This paper presents the first few-shot LLM-based chatbot that almost never\nhallucinates and has high conversationality and low latency. WikiChat is\ngrounded on the English Wikipedia, the largest curated free-text corpus.\n WikiChat generates a response from an LLM, retains only the grounded facts,\nand combines them with additional information it retrieves from the corpus to\nform factual and engaging responses. We distill WikiChat based on GPT-4 into a\n7B-parameter LLaMA model with minimal loss of quality, to significantly improve\nits latency, cost and privacy, and facilitate research and deployment.\n Using a novel hybrid human-and-LLM evaluation methodology, we show that our\nbest system achieves 97.3% factual accuracy in simulated conversations. 
It\nsignificantly outperforms all retrieval-based and LLM-based baselines, and by\n3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4.\nCompared to previous state-of-the-art retrieval-based chatbots, WikiChat is\nalso significantly more informative and engaging, just like an LLM.\n WikiChat achieves 97.9% factual accuracy in conversations with human users\nabout recent topics, 55.0% better than GPT-4, while receiving significantly\nhigher user ratings and more favorable comments.\n","authors":["Sina J. Semnani","Violet Z. Yao","Heidi C. Zhang","Monica S. Lam"],"pdf_url":"https://arxiv.org/pdf/2305.14292v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18431v1","updated":"2023-10-27T19:09:30Z","published":"2023-10-27T19:09:30Z","title":"SDOH-NLI: a Dataset for Inferring Social Determinants of Health from\n Clinical Notes","summary":" Social and behavioral determinants of health (SDOH) play a significant role\nin shaping health outcomes, and extracting these determinants from clinical\nnotes is a first step to help healthcare providers systematically identify\nopportunities to provide appropriate care and address disparities. Progress on\nusing NLP methods for this task has been hindered by the lack of high-quality\npublicly available labeled data, largely due to the privacy and regulatory\nconstraints on the use of real patients' information. This paper introduces a\nnew dataset, SDOH-NLI, that is based on publicly available notes and which we\nrelease publicly. We formulate SDOH extraction as a natural language inference\n(NLI) task, and provide binary textual entailment labels obtained from human\nraters for a cross product of a set of social history snippets as premises and\nSDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in\nthat our premises and hypotheses are obtained independently. 
We evaluate both\n\"off-the-shelf\" entailment models as well as models fine-tuned on our data, and\nhighlight the ways in which our dataset appears more challenging than commonly\nused NLI datasets.\n","authors":["Adam D. Lelkes","Eric Loreaux","Tal Schuster","Ming-Jun Chen","Alvin Rajkomar"],"pdf_url":"https://arxiv.org/pdf/2310.18431v1.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.09680v3","updated":"2023-10-27T18:56:51Z","published":"2023-10-14T23:16:05Z","title":"Improved Contextual Recognition In Automatic Speech Recognition Systems\n By Semantic Lattice Rescoring","summary":" Automatic Speech Recognition (ASR) has witnessed a profound research\ninterest. Recent breakthroughs have given ASR systems different prospects such\nas faithfully transcribing spoken language, which is a pivotal advancement in\nbuilding conversational agents. However, there is still an imminent challenge\nof accurately discerning context-dependent words and phrases. In this work, we\npropose a novel approach for enhancing contextual recognition within ASR\nsystems via semantic lattice processing leveraging the power of deep learning\nmodels in accurately delivering spot-on transcriptions across a wide variety of\nvocabularies and speaking styles. Our solution consists of using Hidden Markov\nModels and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks\n(DNN) models integrating both language and acoustic modeling for better\naccuracy. We infused our network with the use of a transformer-based model to\nproperly rescore the word lattice achieving remarkable capabilities with a\npalpable reduction in Word Error Rate (WER). 
We demonstrate the effectiveness\nof our proposed framework on the LibriSpeech dataset with empirical analyses.\n","authors":["Ankitha Sudarshan","Vinay Samuel","Parth Patwa","Ibtihel Amara","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2310.09680v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18417v1","updated":"2023-10-27T18:17:29Z","published":"2023-10-27T18:17:29Z","title":"Teacher Perception of Automatically Extracted Grammar Concepts for L2\n Language Learning","summary":" One of the challenges in language teaching is how best to organize rules\nregarding syntax, semantics, or phonology in a meaningful manner. This not only\nrequires content creators to have pedagogical skills, but also have that\nlanguage's deep understanding. While comprehensive materials to develop such\ncurricula are available in English and some broadly spoken languages, for many\nother languages, teachers need to manually create them in response to their\nstudents' needs. This is challenging because i) it requires that such experts\nbe accessible and have the necessary resources, and ii) describing all the\nintricacies of a language is time-consuming and prone to omission. In this\nwork, we aim to facilitate this process by automatically discovering and\nvisualizing grammar descriptions. We extract descriptions from a natural text\ncorpus that answer questions about morphosyntax (learning of word order,\nagreement, case marking, or word formation) and semantics (learning of\nvocabulary). We apply this method for teaching two Indian languages, Kannada\nand Marathi, which, unlike English, do not have well-developed resources for\nsecond language learning. 
To assess the perceived utility of the extracted\nmaterial, we enlist the help of language educators from schools in North\nAmerica to perform a manual evaluation, who find the materials have potential\nto be used for their lesson preparation and learner evaluation.\n","authors":["Aditi Chaudhary","Arun Sampath","Ashwin Sheshadri","Antonios Anastasopoulos","Graham Neubig"],"pdf_url":"https://arxiv.org/pdf/2310.18417v1.pdf","comment":"Accepted at EMNLP Findings 2023. arXiv admin note: substantial text\n overlap with arXiv:2206.05154"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.18297v1","updated":"2023-10-27T17:35:01Z","published":"2023-10-27T17:35:01Z","title":"Image Clustering Conditioned on Text Criteria","summary":" Classical clustering methods do not provide users with direct control of the\nclustering results, and the clustering results may not be consistent with the\nrelevant criterion that a user has in mind. In this work, we present a new\nmethodology for performing image clustering based on user-specified text\ncriteria by leveraging modern vision-language models and large language models.\nWe call our method Image Clustering Conditioned on Text Criteria (IC$|$TC), and\nit represents a different paradigm of image clustering. IC$|$TC requires a\nminimal and practical degree of human intervention and grants the user\nsignificant control over the clustering results in return. Our experiments show\nthat IC$|$TC can effectively cluster images with various criteria, such as\nhuman action, physical location, or the person's mood, while significantly\noutperforming baselines.\n","authors":["Sehyun Kwon","Jaeseung Park","Minkyu Kim","Jaewoong Cho","Ernest K. 
Ryu","Kangwook Lee"],"pdf_url":"https://arxiv.org/pdf/2310.18297v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18293v1","updated":"2023-10-27T17:29:55Z","published":"2023-10-27T17:29:55Z","title":"Always Clear Days: Degradation Type and Severity Aware All-In-One\n Adverse Weather Removal","summary":" All-in-one adverse weather removal is an emerging topic on image restoration,\nwhich aims to restore multiple weather degradation in an unified model, and the\nchallenging are twofold. First, discovering and handling the property of\nmulti-domain in target distribution formed by multiple weather conditions.\nSecond, design efficient and effective operations for different degradation\ntypes. To address this problem, most prior works focus on the multi-domain\ncaused by weather type. Inspired by inter\\&intra-domain adaptation literature,\nwe observed that not only weather type but also weather severity introduce\nmulti-domain within each weather type domain, which is ignored by previous\nmethods, and further limit their performance. To this end, we proposed a\ndegradation type and severity aware model, called \\textbf{UtilityIR}, for blind\nall-in-one bad weather image restoration. To extract weather information from\nsingle image, we proposed a novel Marginal Quality Ranking Loss (MQRL) and\nutilized Contrastive Loss (CL) to guide weather severity and type extraction,\nand leverage a bag of novel techniques such as Multi-Head Cross Attention\n(MHCA) and Local-Global Adaptive Instance Normalization (LG-AdaIN) to\nefficiently restore spatial varying weather degradation. The proposed method\ncan significantly outperform the SOTA methods subjectively and objectively on\ndifferent weather restoration tasks with a large margin, and enjoy less model\nparameters. Proposed method even can restore \\textbf{unseen} domain combined\nmultiple degradation images, and modulating restoration level. 
Implementation\ncode will be available at\n{https://github.com/fordevoted/UtilityIR}{\\textit{this repository}}\n","authors":["Yu-Wei Chen","Soo-Chang Pei"],"pdf_url":"https://arxiv.org/pdf/2310.18293v1.pdf","comment":"12 pages, 12 figures"},{"id":"http://arxiv.org/abs/2212.10015v3","updated":"2023-10-27T17:24:04Z","published":"2022-12-20T06:03:51Z","title":"Benchmarking Spatial Relationships in Text-to-Image Generation","summary":" Spatial understanding is a fundamental aspect of computer vision and integral\nfor human-level reasoning about images, making it an important component for\ngrounded language understanding. While recent text-to-image synthesis (T2I)\nmodels have shown unprecedented improvements in photorealism, it is unclear\nwhether they have reliable spatial understanding capabilities. We investigate\nthe ability of T2I models to generate correct spatial relationships among\nobjects and present VISOR, an evaluation metric that captures how accurately\nthe spatial relationship described in text is generated in the image. To\nbenchmark existing models, we introduce a dataset, $\\mathrm{SR}_{2D}$, that\ncontains sentences describing two or more objects and the spatial relationships\nbetween them. We construct an automated evaluation pipeline to recognize\nobjects and their spatial relationships, and employ it in a large-scale\nevaluation of T2I models. Our experiments reveal a surprising finding that,\nalthough state-of-the-art T2I models exhibit high image quality, they are\nseverely limited in their ability to generate multiple objects or the specified\nspatial relations between them. Our analyses demonstrate several biases and\nartifacts of T2I models such as the difficulty with generating multiple\nobjects, a bias towards generating the first object mentioned, spatially\ninconsistent outputs for equivalent relationships, and a correlation between\nobject co-occurrence and spatial understanding capabilities. 
We conduct a human\nstudy that shows the alignment between VISOR and human judgement about spatial\nunderstanding. We offer the $\\mathrm{SR}_{2D}$ dataset and the VISOR metric to\nthe community in support of T2I reasoning research.\n","authors":["Tejas Gokhale","Hamid Palangi","Besmira Nushi","Vibhav Vineet","Eric Horvitz","Ece Kamar","Chitta Baral","Yezhou Yang"],"pdf_url":"https://arxiv.org/pdf/2212.10015v3.pdf","comment":"preprint; Code and Data at https://github.com/microsoft/VISOR and\n https://huggingface.co/datasets/tgokhale/sr2d_visor"},{"id":"http://arxiv.org/abs/2310.18285v1","updated":"2023-10-27T17:22:09Z","published":"2023-10-27T17:22:09Z","title":"Heterogeneous Federated Learning with Group-Aware Prompt Tuning","summary":" Transformers have achieved remarkable success in various machine-learning\ntasks, prompting their widespread adoption. In this paper, we explore their\napplication in the context of federated learning (FL), with a particular focus\non heterogeneous scenarios where individual clients possess diverse local\ndatasets. To meet the computational and communication demands of FL, we\nleverage pre-trained Transformers and use an efficient prompt-tuning strategy.\nOur strategy introduces the concept of learning both shared and group prompts,\nenabling the acquisition of universal knowledge and group-specific knowledge\nsimultaneously. Additionally, a prompt selection module assigns personalized\ngroup prompts to each input, aligning the global model with the data\ndistribution of each client. This approach allows us to train a single global\nmodel that can automatically adapt to various local client data distributions\nwithout requiring local fine-tuning. In this way, our proposed method\neffectively bridges the gap between global and personalized local models in\nFederated Learning and surpasses alternative approaches that lack the\ncapability to adapt to previously unseen clients. 
The effectiveness of our\napproach is rigorously validated through extensive experimentation and ablation\nstudies.\n","authors":["Wenlong Deng","Christos Thrampoulidis","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2310.18285v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17097v2","updated":"2023-10-27T17:13:00Z","published":"2023-10-26T01:40:28Z","title":"Navigating Data Heterogeneity in Federated Learning A Semi-Supervised\n Approach for Object Detection","summary":" Federated Learning (FL) has emerged as a potent framework for training models\nacross distributed data sources while maintaining data privacy. Nevertheless,\nit faces challenges with limited high-quality labels and non-IID client data,\nparticularly in applications like autonomous driving. To address these hurdles,\nwe navigate the uncharted waters of Semi-Supervised Federated Object Detection\n(SSFOD). We present a pioneering SSFOD framework, designed for scenarios where\nlabeled data reside only at the server while clients possess unlabeled data.\nNotably, our method represents the inaugural implementation of SSFOD for\nclients with 0% labeled non-IID data, a stark contrast to previous studies that\nmaintain some subset of labels at each client. We propose FedSTO, a two-stage\nstrategy encompassing Selective Training followed by Orthogonally enhanced\nfull-parameter training, to effectively address data shift (e.g. weather\nconditions) between server and clients. Our contributions include selectively\nrefining the backbone of the detector to avert overfitting, orthogonality\nregularization to boost representation divergence, and local EMA-driven pseudo\nlabel assignment to yield high-quality pseudo labels. Extensive validation on\nprominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M)\nattests to the efficacy of our approach, demonstrating state-of-the-art\nresults. 
Remarkably, FedSTO, using just 20-30% of labels, performs nearly as\nwell as fully-supervised centralized training methods.\n","authors":["Taehyeon Kim","Eric Lin","Junu Lee","Christian Lau","Vaikkunth Mugunthan"],"pdf_url":"https://arxiv.org/pdf/2310.17097v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18279v1","updated":"2023-10-27T17:11:07Z","published":"2023-10-27T17:11:07Z","title":"FOUND: Foot Optimization with Uncertain Normals for Surface Deformation\n Using Synthetic Data","summary":" Surface reconstruction from multi-view images is a challenging task, with\nsolutions often requiring a large number of sampled images with high overlap.\nWe seek to develop a method for few-view reconstruction, for the case of the\nhuman foot. To solve this task, we must extract rich geometric cues from RGB\nimages, before carefully fusing them into a final 3D object. Our FOUND approach\ntackles this, with 4 main contributions: (i) SynFoot, a synthetic dataset of\n50,000 photorealistic foot images, paired with ground truth surface normals and\nkeypoints; (ii) an uncertainty-aware surface normal predictor trained on our\nsynthetic dataset; (iii) an optimization scheme for fitting a generative foot\nmodel to a series of images; and (iv) a benchmark dataset of calibrated images\nand high resolution ground truth geometry. We show that our normal predictor\noutperforms all off-the-shelf equivalents significantly on real images, and our\noptimization scheme outperforms state-of-the-art photogrammetry pipelines,\nespecially for a few-view setting. 
We release our synthetic dataset and\nbaseline 3D scans to the research community.\n","authors":["Oliver Boyne","Gwangbin Bae","James Charles","Roberto Cipolla"],"pdf_url":"https://arxiv.org/pdf/2310.18279v1.pdf","comment":"14 pages, 15 figures"},{"id":"http://arxiv.org/abs/2310.18274v1","updated":"2023-10-27T16:59:51Z","published":"2023-10-27T16:59:51Z","title":"LipSim: A Provably Robust Perceptual Similarity Metric","summary":" Recent years have seen growing interest in developing and applying perceptual\nsimilarity metrics. Research has shown the superiority of perceptual metrics\nover pixel-wise metrics in aligning with human perception and serving as a\nproxy for the human visual system. On the other hand, as perceptual metrics\nrely on neural networks, there is a growing concern regarding their resilience,\ngiven the established vulnerability of neural networks to adversarial attacks.\nIt is indeed logical to infer that perceptual metrics may inherit both the\nstrengths and shortcomings of neural networks. In this work, we demonstrate the\nvulnerability of state-of-the-art perceptual similarity metrics based on an\nensemble of ViT-based feature extractors to adversarial attacks. We then\npropose a framework to train a robust perceptual similarity metric called\nLipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging\n1-Lipschitz neural networks as the backbone, LipSim provides guarded areas\naround each data point and certificates for all perturbations within an\n$\\ell_2$ ball. Finally, a comprehensive set of experiments shows the\nperformance of LipSim in terms of natural and certified scores and on the image\nretrieval application. 
The code is available at\nhttps://github.com/SaraGhazanfari/LipSim.\n","authors":["Sara Ghazanfari","Alexandre Araujo","Prashanth Krishnamurthy","Farshad Khorrami","Siddharth Garg"],"pdf_url":"https://arxiv.org/pdf/2310.18274v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18268v1","updated":"2023-10-27T16:56:28Z","published":"2023-10-27T16:56:28Z","title":"PlantPlotGAN: A Physics-Informed Generative Adversarial Network for\n Plant Disease Prediction","summary":" Monitoring plantations is crucial for crop management and producing healthy\nharvests. Unmanned Aerial Vehicles (UAVs) have been used to collect\nmultispectral images that aid in this monitoring. However, given the number of\nhectares to be monitored and the limitations of flight, plant disease signals\nbecome visually clear only in the later stages of plant growth and only if the\ndisease has spread throughout a significant portion of the plantation. This\nlimited amount of relevant data hampers the prediction models, as the\nalgorithms struggle to generalize patterns with unbalanced or unrealistic\naugmented datasets effectively. To address this issue, we propose PlantPlotGAN,\na physics-informed generative model capable of creating synthetic multispectral\nplot images with realistic vegetation indices. These indices served as a proxy\nfor disease detection and were used to evaluate if our model could help\nincrease the accuracy of prediction models. The results demonstrate that the\nsynthetic imagery generated from PlantPlotGAN outperforms state-of-the-art\nmethods regarding the Fr\\'echet inception distance. Moreover, prediction models\nachieve higher accuracy metrics when trained with synthetic and original\nimagery for earlier plant disease detection compared to the training processes\nbased solely on real imagery.\n","authors":["Felipe A. 
Lopes","Vasit Sagan","Flavio Esposito"],"pdf_url":"https://arxiv.org/pdf/2310.18268v1.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV), 2024"},{"id":"http://arxiv.org/abs/2303.11403v4","updated":"2023-10-27T16:38:40Z","published":"2023-03-20T19:20:34Z","title":"eP-ALM: Efficient Perceptual Augmentation of Language Models","summary":" Large Language Models (LLMs) have so far impressed the world, with\nunprecedented capabilities that emerge in models at large scales. On the vision\nside, transformer models (i.e., ViT) are following the same trend, achieving\nthe best performance on challenging benchmarks. With the abundance of such\nunimodal models, a natural question arises; do we need also to follow this\ntrend to tackle multimodal tasks? In this work, we propose to rather direct\neffort to efficient adaptations of existing models, and propose to augment\nLanguage Models with perception. Existing approaches for adapting pretrained\nmodels for vision-language tasks still rely on several key components that\nhinder their efficiency. In particular, they still train a large number of\nparameters, rely on large multimodal pretraining, use encoders (e.g., CLIP)\ntrained on huge image-text datasets, and add significant inference overhead. In\naddition, most of these approaches have focused on Zero-Shot and In Context\nLearning, with little to no effort on direct finetuning. We investigate the\nminimal computational effort needed to adapt unimodal models for multimodal\ntasks and propose a new challenging setup, alongside different approaches, that\nefficiently adapts unimodal pretrained models. We show that by freezing more\nthan 99% of total parameters, training only one linear projection layer, and\nprepending only one trainable token, our approach (dubbed eP-ALM) significantly\noutperforms other baselines on VQA and Captioning across Image, Video, and\nAudio modalities, following the proposed setup. 
The code is available here:\nhttps://github.com/mshukor/eP-ALM.\n","authors":["Mustafa Shukor","Corentin Dancette","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2303.11403v4.pdf","comment":"Accepted at ICCV 2023. Project page:\n https://mshukor.github.io/eP-ALM.github.io/"},{"id":"http://arxiv.org/abs/2310.18251v1","updated":"2023-10-27T16:37:36Z","published":"2023-10-27T16:37:36Z","title":"A Self-Supervised Approach to Land Cover Segmentation","summary":" Land use/land cover change (LULC) maps are integral resources in earth\nscience and agricultural research. Due to the nature of such maps, the creation\nof LULC maps is often constrained by the time and human resources necessary to\naccurately annotate satellite imagery and remote sensing data. While computer\nvision models that perform semantic segmentation to create detailed labels from\nsuch data are not uncommon, little research has been done on self-supervised and\nunsupervised approaches to labelling LULC maps without the use of ground-truth\nmasks. Here, we demonstrate a self-supervised method of land cover segmentation\nthat has no need for high-quality ground truth labels. The proposed deep\nlearning employs a frozen pre-trained ViT backbone transferred from DINO in a\nSTEGO architecture and is fine-tuned using a custom dataset consisting of very\nhigh resolution (VHR) satellite imagery. 
After only 10 epochs of fine-tuning,\nan accuracy of roughly 52% was observed across 5 samples, signifying the\nfeasibility of self-supervised models for the automated labelling of VHR LULC\nmaps.\n","authors":["Charles Moore","Dakota Hester"],"pdf_url":"https://arxiv.org/pdf/2310.18251v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18237v1","updated":"2023-10-27T16:21:17Z","published":"2023-10-27T16:21:17Z","title":"Generative AI Model for Artistic Style Transfer Using Convolutional\n Neural Networks","summary":" Artistic style transfer, a captivating application of generative artificial\nintelligence, involves fusing the content of one image with the artistic style\nof another to create unique visual compositions. This paper presents a\ncomprehensive overview of a novel technique for style transfer using\nConvolutional Neural Networks (CNNs). By leveraging deep image representations\nlearned by CNNs, we demonstrate how to separate and manipulate image content\nand style, enabling the synthesis of high-quality images that combine content\nand style in a harmonious manner. We describe the methodology, including\ncontent and style representations, loss computation, and optimization, and\nshowcase experimental results highlighting the effectiveness and versatility of\nthe approach across different styles and content\n","authors":["Jonayet Miah","Duc M Cao","Md Abu Sayed","Md. Sabbirul Haque"],"pdf_url":"https://arxiv.org/pdf/2310.18237v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18236v1","updated":"2023-10-27T16:20:34Z","published":"2023-10-27T16:20:34Z","title":"How Re-sampling Helps for Long-Tail Learning?","summary":" Long-tail learning has received significant attention in recent years due to\nthe challenge it poses with extremely imbalanced datasets. 
In these datasets,\nonly a few classes (known as the head classes) have an adequate number of\ntraining samples, while the rest of the classes (known as the tail classes) are\ninfrequent in the training data. Re-sampling is a classical and widely used\napproach for addressing class imbalance issues. Unfortunately, recent studies\nclaim that re-sampling brings negligible performance improvements in modern\nlong-tail learning tasks. This paper aims to investigate this phenomenon\nsystematically. Our research shows that re-sampling can considerably improve\ngeneralization when the training images do not contain semantically irrelevant\ncontexts. In other scenarios, however, it can learn unexpected spurious\ncorrelations between irrelevant contexts and target labels. We design\nexperiments on two homogeneous datasets, one containing irrelevant context and\nthe other not, to confirm our findings. To prevent the learning of spurious\ncorrelations, we propose a new context shift augmentation module that generates\ndiverse training images for the tail class by maintaining a context bank\nextracted from the head-class images. Experiments demonstrate that our proposed\nmodule can boost the generalization and outperform other approaches, including\nclass-balanced re-sampling, decoupled classifier re-training, and data\naugmentation methods. The source code is available at\nhttps://www.lamda.nju.edu.cn/code_CSA.ashx.\n","authors":["Jiang-Xin Shi","Tong Wei","Yuke Xiang","Yu-Feng Li"],"pdf_url":"https://arxiv.org/pdf/2310.18236v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18235v1","updated":"2023-10-27T16:20:10Z","published":"2023-10-27T16:20:10Z","title":"Davidsonian Scene Graph: Improving Reliability in Fine-grained\n Evaluation for Text-Image Generation","summary":" Evaluating text-to-image models is notoriously difficult. 
A strong recent\napproach for assessing text-image faithfulness is based on QG/A (question\ngeneration and answering), which uses pre-trained foundational models to\nautomatically generate a set of questions and answers from the prompt, and\noutput images are scored based on whether these answers extracted with a visual\nquestion answering model are consistent with the prompt-based answers. This\nkind of evaluation is naturally dependent on the quality of the underlying QG\nand QA models. We identify and address several reliability challenges in\nexisting QG/A work: (a) QG questions should respect the prompt (avoiding\nhallucinations, duplications, and omissions) and (b) VQA answers should be\nconsistent (not asserting that there is no motorcycle in an image while also\nclaiming the motorcycle is blue). We address these issues with Davidsonian\nScene Graph (DSG), an empirically grounded evaluation framework inspired by\nformal semantics. DSG is an automatic, graph-based QG/A that is modularly\nimplemented to be adaptable to any QG/A module. DSG produces atomic and unique\nquestions organized in dependency graphs, which (i) ensure appropriate semantic\ncoverage and (ii) sidestep inconsistent answers. With extensive experimentation\nand human evaluation on a range of model configurations (LLM, VQA, and T2I), we\nempirically demonstrate that DSG addresses the challenges noted above. Finally,\nwe present DSG-1k, an open-sourced evaluation benchmark that includes 1,060\nprompts, covering a wide range of fine-grained semantic categories with a\nbalanced distribution. 
We will release the DSG-1k prompts and the corresponding\nDSG questions.\n","authors":["Jaemin Cho","Yushi Hu","Roopal Garg","Peter Anderson","Ranjay Krishna","Jason Baldridge","Mohit Bansal","Jordi Pont-Tuset","Su Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18235v1.pdf","comment":"Project website: https://google.github.io/DSG"},{"id":"http://arxiv.org/abs/2310.18234v1","updated":"2023-10-27T16:19:26Z","published":"2023-10-27T16:19:26Z","title":"Edge AI-Based Vein Detector for Efficient Venipuncture in the\n Antecubital Fossa","summary":" Assessing the condition and visibility of veins is a crucial step before\nobtaining intravenous access in the antecubital fossa, which is a common\nprocedure to draw blood or administer intravenous therapies (IV therapies).\nEven though medical practitioners are highly skilled at intravenous\ncannulation, they usually struggle to perform the procedure in patients with\nlow visible veins due to fluid retention, age, overweight, dark skin tone, or\ndiabetes. Recently, several investigations proposed combining Near Infrared\n(NIR) imaging and deep learning (DL) techniques for forearm vein segmentation.\nAlthough they have demonstrated compelling results, their use has been rather\nlimited owing to the portability and precision requirements to perform\nvenipuncture. In this paper, we aim to contribute to bridging this gap using\nthree strategies. First, we introduce a new NIR-based forearm vein segmentation\ndataset of 2,016 labelled images collected from 1,008 subjects with low visible\nveins. Second, we propose a modified U-Net architecture that locates veins\nspecifically in the antecubital fossa region of the examined patient. Finally,\na compressed version of the proposed architecture was deployed inside a\nbespoke, portable vein finder device after testing four common embedded\nmicrocomputers and four common quantization modalities. 
Experimental results\nshowed that the model compressed with Dynamic Range Quantization and deployed\non a Raspberry Pi 4B card produced the best execution time and precision\nbalance, with 5.14 FPS and 0.957 of latency and Intersection over Union (IoU),\nrespectively. These results show promising performance inside a\nresource-restricted low-cost device.\n","authors":["Edwin Salcedo","Patricia Peñaloza"],"pdf_url":"https://arxiv.org/pdf/2310.18234v1.pdf","comment":"Accepted for publication in MICAI 2023, Part II, LNCS 14392"},{"id":"http://arxiv.org/abs/2310.18222v1","updated":"2023-10-27T15:51:33Z","published":"2023-10-27T15:51:33Z","title":"TBDLNet: a network for classifying multidrug-resistant and\n drug-sensitive tuberculosis","summary":" This paper proposes applying a novel deep-learning model, TBDLNet, to\nrecognize CT images to classify multidrug-resistant and drug-sensitive\ntuberculosis automatically. The pre-trained ResNet50 is selected to extract\nfeatures. Three randomized neural networks are used to alleviate the\noverfitting problem. The ensemble of three RNNs is applied to boost the\nrobustness via majority voting. The proposed model is evaluated by five-fold\ncross-validation. Five indexes are selected in this paper, which are accuracy,\nsensitivity, precision, F1-score, and specificity. The TBDLNet achieves 0.9822\naccuracy, 0.9815 specificity, 0.9823 precision, 0.9829 sensitivity, and 0.9826\nF1-score, respectively. The TBDLNet is suitable for classifying\nmultidrug-resistant tuberculosis and drug-sensitive tuberculosis. 
It can detect\nmultidrug-resistant pulmonary tuberculosis as early as possible, which helps to\nadjust the treatment plan in time and improve the treatment effect.\n","authors":["Ziquan Zhu","Jing Tao","Shuihua Wang","Xin Zhang","Yudong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.18222v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06947v4","updated":"2023-10-27T15:16:53Z","published":"2023-07-13T17:59:33Z","title":"Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action\n Recognition","summary":" Recent video recognition models utilize Transformer models for long-range\nspatio-temporal context modeling. Video transformer designs are based on\nself-attention that can model global context at a high computational cost. In\ncomparison, convolutional designs for videos offer an efficient alternative but\nlack long-range dependency modeling. Towards achieving the best of both\ndesigns, this work proposes Video-FocalNet, an effective and efficient\narchitecture for video recognition that models both local and global contexts.\nVideo-FocalNet is based on a spatio-temporal focal modulation architecture that\nreverses the interaction and aggregation steps of self-attention for better\nefficiency. Further, the aggregation step and the interaction step are both\nimplemented using efficient convolution and element-wise multiplication\noperations that are computationally less expensive than their self-attention\ncounterparts on video representations. We extensively explore the design space\nof focal modulation-based spatio-temporal context modeling and demonstrate our\nparallel spatial and temporal encoding design to be the optimal choice.\nVideo-FocalNets perform favorably well against the state-of-the-art\ntransformer-based models for video recognition on five large-scale datasets\n(Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower\ncomputational cost. 
Our code/models are released at\nhttps://github.com/TalalWasim/Video-FocalNets.\n","authors":["Syed Talal Wasim","Muhammad Uzair Khattak","Muzammal Naseer","Salman Khan","Mubarak Shah","Fahad Shahbaz Khan"],"pdf_url":"https://arxiv.org/pdf/2307.06947v4.pdf","comment":"Accepted to ICCV-2023. Camera-Ready version. Project page:\n https://TalalWasim.github.io/Video-FocalNets/"},{"id":"http://arxiv.org/abs/2307.11957v2","updated":"2023-10-27T15:16:39Z","published":"2023-07-22T01:56:58Z","title":"High-performance real-world optical computing trained by in situ\n model-free optimization","summary":" Optical computing systems can provide high-speed and low-energy data\nprocessing but face deficiencies in computationally demanding training and\nsimulation-to-reality gap. We propose a model-free solution for lightweight in\nsitu optimization of optical computing systems based on the score gradient\nestimation algorithm. This approach treats the system as a black box and\nback-propagates loss directly to the optical weights' probabilistic\ndistributions, hence circumventing the need for computation-heavy and biased\nsystem simulation. We demonstrate a superior classification accuracy on the\nMNIST and FMNIST datasets through experiments on a single-layer diffractive\noptical computing system. Furthermore, we show its potential for image-free and\nhigh-speed cell analysis. 
The inherent simplicity of our proposed method,\ncombined with its low demand for computational resources, expedites the\ntransition of optical computing from laboratory demonstrations to real-world\napplications.\n","authors":["Guangyuan Zhao","Xin Shu"],"pdf_url":"https://arxiv.org/pdf/2307.11957v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14218v4","updated":"2023-10-27T15:08:31Z","published":"2023-05-23T16:34:09Z","title":"DUBLIN -- Document Understanding By Language-Image Network","summary":" Visual document understanding is a complex task that involves analyzing both\nthe text and the visual elements in document images. Existing models often rely\non manual feature engineering or domain-specific pipelines, which limit their\ngeneralization ability across different document types and languages. In this\npaper, we propose DUBLIN, which is pretrained on web pages using three novel\nobjectives: Masked Document Text Generation Task, Bounding Box Task, and\nRendered Question Answering Task, that leverage both the spatial and semantic\ninformation in the document images. Our model achieves competitive or\nstate-of-the-art results on several benchmarks, such as Web-Based Structural\nReading Comprehension, Document Visual Question Answering, Key Information\nExtraction, Diagram Understanding, and Table Question Answering. In particular,\nwe show that DUBLIN is the first pixel-based model to achieve an EM of 77.75\nand F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms\nthe current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and\nAI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve\ncompetitive performance on RVL-CDIP document classification. 
Moreover, we\ncreate new baselines for text-based datasets by rendering them as document\nimages to promote research in this direction.\n","authors":["Kriti Aggarwal","Aditi Khandelwal","Kumar Tanmay","Owais Mohammed Khan","Qiang Liu","Monojit Choudhury","Hardik Hansrajbhai Chauhan","Subhojit Som","Vishrav Chaudhary","Saurabh Tiwary"],"pdf_url":"https://arxiv.org/pdf/2305.14218v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18192v1","updated":"2023-10-27T15:06:01Z","published":"2023-10-27T15:06:01Z","title":"Artifact-Robust Graph-Based Learning in Digital Pathology","summary":" Whole slide images~(WSIs) are digitized images of tissues placed in glass\nslides using advanced scanners. The digital processing of WSIs is challenging\nas they are gigapixel images and stored in multi-resolution format. A common\nchallenge with WSIs is that perturbations/artifacts are inevitable during\nstoring the glass slides and digitizing them. These perturbations include\nmotion, which often arises from slide movement during placement, and changes in\nhue and brightness due to variations in staining chemicals and the quality of\ndigitizing scanners. In this work, a novel robust learning approach to account\nfor these artifacts is presented. Due to the size and resolution of WSIs and to\naccount for neighborhood information, graph-based methods are called for. We\nuse graph convolutional network~(GCN) to extract features from the graph\nrepresenting WSI. Through a denoiser {and pooling layer}, the effects of\nperturbations in WSIs are controlled and the output is followed by a\ntransformer for the classification of different grades of prostate cancer. To\ncompare the efficacy of the proposed approach, the model without denoiser is\ntrained and tested with WSIs without any perturbation and then different\nperturbations are introduced in WSIs and passed through the network with the\ndenoiser. 
The accuracy and kappa scores of the proposed model with prostate\ncancer dataset compared with non-robust algorithms show significant improvement\nin cancer diagnosis.\n","authors":["Saba Heidari Gheshlaghi","Milan Aryal","Nasim Yahyasoltani","Masoud Ganji"],"pdf_url":"https://arxiv.org/pdf/2310.18192v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.08478v3","updated":"2023-10-27T15:03:18Z","published":"2023-02-16T18:35:39Z","title":"Kernelized Back-Projection Networks for Blind Super Resolution","summary":" Since non-blind Super Resolution (SR) fails to super-resolve Low-Resolution\n(LR) images degraded by arbitrary degradations, SR with the degradation model\nis required. However, this paper reveals that non-blind SR that is trained\nsimply with various blur kernels exhibits comparable performance as those with\nthe degradation model for blind SR. This result motivates us to revisit\nhigh-performance non-blind SR and extend it to blind SR with blur kernels. This\npaper proposes two SR networks by integrating kernel estimation and SR branches\nin an iterative end-to-end manner. In the first model, which is called the\nKernel Conditioned Back-Projection Network (KCBPN), the low-dimensional kernel\nrepresentations are estimated for conditioning the SR branch. In our second\nmodel, the Kernelized BackProjection Network (KBPN), a raw kernel is estimated\nand directly employed for modeling the image degradation. The estimated kernel\nis employed not only for back-propagating its residual but also for\nforward-propagating the residual to iterative stages. This forward-propagation\nencourages these stages to learn a variety of different features in different\nstages by focusing on pixels with large residuals in each stage. Experimental\nresults validate the effectiveness of our proposed networks for kernel\nestimation and SR. 
We will release the code for this work.\n","authors":["Tomoki Yoshida","Yuki Kondo","Takahiro Maeda","Kazutoshi Akita","Norimichi Ukita"],"pdf_url":"https://arxiv.org/pdf/2302.08478v3.pdf","comment":"The first two authors contributed equally to this work"},{"id":"http://arxiv.org/abs/2310.05351v3","updated":"2023-10-27T14:35:14Z","published":"2023-10-09T02:27:04Z","title":"Generalized Neural Collapse for a Large Number of Classes","summary":" Neural collapse provides an elegant mathematical characterization of learned\nlast layer representations (a.k.a. features) and classifier weights in deep\nclassification models. Such results not only provide insights but also motivate\nnew techniques for improving practical deep models. However, most of the\nexisting empirical and theoretical studies in neural collapse focus on the case\nthat the number of classes is small relative to the dimension of the feature\nspace. This paper extends neural collapse to cases where the number of classes\nis much larger than the dimension of feature space, which broadly occur for\nlanguage models, retrieval systems, and face recognition applications. We show\nthat the features and classifier exhibit a generalized neural collapse\nphenomenon, where the minimum one-vs-rest margin is maximized. We provide an\nempirical study to verify the occurrence of generalized neural collapse in\npractical deep neural networks. 
Moreover, we provide theoretical study to show\nthat the generalized neural collapse provably occurs under unconstrained\nfeature model with spherical constraint, under certain technical conditions on\nfeature dimension and number of classes.\n","authors":["Jiachen Jiang","Jinxin Zhou","Peng Wang","Qing Qu","Dustin Mixon","Chong You","Zhihui Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.05351v3.pdf","comment":"32 pages, 12 figures"},{"id":"http://arxiv.org/abs/2310.17075v2","updated":"2023-10-27T14:35:04Z","published":"2023-10-26T00:36:03Z","title":"HyperFields: Towards Zero-Shot Generation of NeRFs from Text","summary":" We introduce HyperFields, a method for generating text-conditioned Neural\nRadiance Fields (NeRFs) with a single forward pass and (optionally) some\nfine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns\na smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF\ndistillation training, which distills scenes encoded in individual NeRFs into\none dynamic hypernetwork. These techniques enable a single network to fit over\na hundred unique scenes. We further demonstrate that HyperFields learns a more\ngeneral map between text and NeRFs, and consequently is capable of predicting\nnovel in-distribution and out-of-distribution scenes -- either zero-shot or\nwith a few finetuning steps. Finetuning HyperFields benefits from accelerated\nconvergence thanks to the learned general map, and is capable of synthesizing\nnovel scenes 5 to 10 times faster than existing neural optimization-based\nmethods. 
Our ablation experiments show that both the dynamic architecture and\nNeRF distillation are critical to the expressivity of HyperFields.\n","authors":["Sudarshan Babu","Richard Liu","Avery Zhou","Michael Maire","Greg Shakhnarovich","Rana Hanocka"],"pdf_url":"https://arxiv.org/pdf/2310.17075v2.pdf","comment":"Project page: https://threedle.github.io/hyperfields/"},{"id":"http://arxiv.org/abs/2212.06096v3","updated":"2023-10-27T14:31:32Z","published":"2022-12-12T18:10:33Z","title":"Implicit Convolutional Kernels for Steerable CNNs","summary":" Steerable convolutional neural networks (CNNs) provide a general framework\nfor building neural networks equivariant to translations and transformations of\nan origin-preserving group $G$, such as reflections and rotations. They rely on\nstandard convolutions with $G$-steerable kernels obtained by analytically\nsolving the group-specific equivariance constraint imposed onto the kernel\nspace. As the solution is tailored to a particular group $G$, implementing a\nkernel basis does not generalize to other symmetry transformations,\ncomplicating the development of general group equivariant models. We propose\nusing implicit neural representation via multi-layer perceptrons (MLPs) to\nparameterize $G$-steerable kernels. The resulting framework offers a simple and\nflexible way to implement Steerable CNNs and generalizes to any group $G$ for\nwhich a $G$-equivariant MLP can be built. 
We prove the effectiveness of our\nmethod on multiple tasks, including N-body simulations, point cloud\nclassification and molecular property prediction.\n","authors":["Maksim Zhdanov","Nico Hoffmann","Gabriele Cesa"],"pdf_url":"https://arxiv.org/pdf/2212.06096v3.pdf","comment":"Accepted to 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2305.19079v2","updated":"2023-10-27T14:18:02Z","published":"2023-05-30T14:42:04Z","title":"Analyzing the Sample Complexity of Self-Supervised Image Reconstruction\n Methods","summary":" Supervised training of deep neural networks on pairs of clean image and noisy\nmeasurement achieves state-of-the-art performance for many image reconstruction\ntasks, but such training pairs are difficult to collect. Self-supervised\nmethods enable training based on noisy measurements only, without clean images.\nIn this work, we investigate the cost of self-supervised training in terms of\nsample complexity for a class of self-supervised methods that enable the\ncomputation of unbiased estimates of gradients of the supervised loss,\nincluding noise2noise methods. We analytically show that a model trained with\nsuch self-supervised training is as good as the same model trained in a\nsupervised fashion, but self-supervised training requires more examples than\nsupervised training. 
We then study self-supervised denoising and accelerated\nMRI empirically and characterize the cost of self-supervised training in terms\nof the number of additional samples required, and find that the performance gap\nbetween self-supervised and supervised training vanishes as a function of the\ntraining examples, at a problem-dependent rate, as predicted by our theory.\n","authors":["Tobit Klug","Dogukan Atik","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2305.19079v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.01859v2","updated":"2023-10-27T13:54:32Z","published":"2023-06-02T18:27:26Z","title":"Spatially Resolved Gene Expression Prediction from H&E Histology Images\n via Bi-modal Contrastive Learning","summary":" Histology imaging is an important tool in medical diagnosis and research,\nenabling the examination of tissue structure and composition at the microscopic\nlevel. Understanding the underlying molecular mechanisms of tissue architecture\nis critical in uncovering disease mechanisms and developing effective\ntreatments. Gene expression profiling provides insight into the molecular\nprocesses underlying tissue architecture, but the process can be time-consuming\nand expensive. We present BLEEP (Bi-modaL Embedding for Expression Prediction),\na bi-modal embedding framework capable of generating spatially resolved gene\nexpression profiles of whole-slide Hematoxylin and eosin (H&E) stained\nhistology images. BLEEP uses contrastive learning to construct a\nlow-dimensional joint embedding space from a reference dataset using paired\nimage and expression profiles at micrometer resolution. With this approach, the\ngene expression of any query image patch can be imputed using the expression\nprofiles from the reference dataset. 
We demonstrate BLEEP's effectiveness in\ngene expression prediction by benchmarking its performance on a human liver\ntissue dataset captured using the 10x Visium platform, where it achieves\nsignificant improvements over existing methods. Our results demonstrate the\npotential of BLEEP to provide insights into the molecular mechanisms underlying\ntissue architecture, with important implications in diagnosis and research of\nvarious diseases. The proposed approach can significantly reduce the time and\ncost associated with gene expression profiling, opening up new avenues for\nhigh-throughput analysis of histology images for both research and clinical\napplications.\n","authors":["Ronald Xie","Kuan Pang","Sai W. Chung","Catia T. Perciani","Sonya A. MacParland","Bo Wang","Gary D. Bader"],"pdf_url":"https://arxiv.org/pdf/2306.01859v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.13389v5","updated":"2023-10-27T13:52:56Z","published":"2022-10-24T16:43:00Z","title":"A Regularized Conditional GAN for Posterior Sampling in Image Recovery\n Problems","summary":" In image recovery problems, one seeks to infer an image from distorted,\nincomplete, and/or noise-corrupted measurements. Such problems arise in\nmagnetic resonance imaging (MRI), computed tomography, deblurring,\nsuper-resolution, inpainting, phase retrieval, image-to-image translation, and\nother applications. Given a training set of signal/measurement pairs, we seek\nto do more than just produce one good image estimate. Rather, we aim to rapidly\nand accurately sample from the posterior distribution. To do this, we propose a\nregularized conditional Wasserstein GAN that generates dozens of high-quality\nposterior samples per second. Our regularization comprises an $\\ell_1$ penalty\nand an adaptively weighted standard-deviation reward. 
Using quantitative\nevaluation metrics like conditional Fréchet inception distance, we\ndemonstrate that our method produces state-of-the-art posterior samples in both\nmulticoil MRI and large-scale inpainting applications. The code for our model\ncan be found here: https://github.com/matt-bendel/rcGAN\n","authors":["Matthew Bendel","Rizwan Ahmad","Philip Schniter"],"pdf_url":"https://arxiv.org/pdf/2210.13389v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18142v1","updated":"2023-10-27T13:47:09Z","published":"2023-10-27T13:47:09Z","title":"Semi-Supervised Panoptic Narrative Grounding","summary":" Despite considerable progress, the advancement of Panoptic Narrative\nGrounding (PNG) remains hindered by costly annotations. In this paper, we\nintroduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG)\nlearning scheme, capitalizing on a smaller set of labeled image-text pairs and\na larger set of unlabeled pairs to achieve competitive performance. Unlike\nvisual segmentation tasks, PNG involves one pixel belonging to multiple\nopen-ended nouns. As a result, existing multi-class based semi-supervised\nsegmentation frameworks cannot be directly applied to this task. To address\nthis challenge, we first develop a novel SS-PNG Network (SS-PNG-NW) tailored to\nthe SS-PNG setting. We thoroughly investigate strategies such as Burn-In and\ndata augmentation to determine the optimal generic configuration for the\nSS-PNG-NW. Additionally, to tackle the issue of imbalanced pseudo-label\nquality, we propose a Quality-Based Loss Adjustment (QLA) approach to adjust\nthe semi-supervised objective, resulting in an enhanced SS-PNG-NW+. Employing\nour proposed QLA, we improve BCE Loss and Dice loss at pixel and mask levels,\nrespectively. We conduct extensive experiments on PNG datasets, with our\nSS-PNG-NW+ demonstrating promising results comparable to fully-supervised\nmodels across all data ratios. 
Remarkably, our SS-PNG-NW+ outperforms\nfully-supervised models with only 30% and 50% supervision data, exceeding their\nperformance by 0.8% and 1.1% respectively. This highlights the effectiveness of\nour proposed SS-PNG-NW+ in overcoming the challenges posed by limited\nannotations and enhancing the applicability of PNG tasks. The source code is\navailable at https://github.com/nini0919/SSPNG.\n","authors":["Danni Yang","Jiayi Ji","Xiaoshuai Sun","Haowei Wang","Yinan Li","Yiwei Ma","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2310.18142v1.pdf","comment":"ACM MM 2023"},{"id":"http://arxiv.org/abs/2310.18141v1","updated":"2023-10-27T13:45:30Z","published":"2023-10-27T13:45:30Z","title":"Unsupervised Representation Learning for Diverse Deformable Shape\n Collections","summary":" We introduce a novel learning-based method for encoding and manipulating 3D\nsurface meshes. Our method is specifically designed to create an interpretable\nembedding space for deformable shape collections. Unlike previous 3D mesh\nautoencoders that require meshes to be in a 1-to-1 correspondence, our approach\nis trained on diverse meshes in an unsupervised manner. Central to our method\nis a spectral pooling technique that establishes a universal latent space,\nbreaking free from traditional constraints of mesh connectivity and shape\ncategories. The entire process consists of two stages. In the first stage, we\nemploy the functional map paradigm to extract point-to-point (p2p) maps between\na collection of shapes in an unsupervised manner. 
These p2p maps are then\nutilized to construct a common latent space, which ensures straightforward\ninterpretation and independence from mesh connectivity and shape category.\nThrough extensive experiments, we demonstrate that our method achieves\nexcellent reconstructions and produces more realistic and smoother\ninterpolations than baseline approaches.\n","authors":["Sara Hahner","Souhaib Attaiki","Jochen Garcke","Maks Ovsjanikov"],"pdf_url":"https://arxiv.org/pdf/2310.18141v1.pdf","comment":"Accepted at International Conference on 3D Vision 2024"},{"id":"http://arxiv.org/abs/2309.11139v3","updated":"2023-10-27T13:45:02Z","published":"2023-09-20T08:34:38Z","title":"More complex encoder is not all you need","summary":" U-Net and its variants have been widely used in medical image segmentation.\nHowever, most current U-Net variants confine their improvement strategies to\nbuilding more complex encoder, while leaving the decoder unchanged or adopting\na simple symmetric structure. These approaches overlook the true functionality\nof the decoder: receiving low-resolution feature maps from the encoder and\nrestoring feature map resolution and lost information through upsampling. As a\nresult, the decoder, especially its upsampling component, plays a crucial role\nin enhancing segmentation outcomes. However, in 3D medical image segmentation,\nthe commonly used transposed convolution can result in visual artifacts. This\nissue stems from the absence of direct relationship between adjacent pixels in\nthe output feature map. Furthermore, plain encoder has already possessed\nsufficient feature extraction capability because downsampling operation leads\nto the gradual expansion of the receptive field, but the loss of information\nduring downsampling process is unignorable. 
To address the gap in relevant\nresearch, we extend our focus beyond the encoder and introduce neU-Net (i.e.,\nnot complex encoder U-Net), which incorporates a novel Sub-pixel Convolution\nfor upsampling to construct a powerful decoder. Additionally, we introduce\nmulti-scale wavelet inputs module on the encoder side to provide additional\ninformation. Our model design achieves excellent results, surpassing other\nstate-of-the-art methods on both the Synapse and ACDC datasets.\n","authors":["Weibin Yang","Longwei Xu","Pengwei Wang","Dehua Geng","Yusong Li","Mingyuan Xu","Zhiqi Dong"],"pdf_url":"https://arxiv.org/pdf/2309.11139v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18131v1","updated":"2023-10-27T13:23:38Z","published":"2023-10-27T13:23:38Z","title":"End-to-end Video Gaze Estimation via Capturing Head-face-eye\n Spatial-temporal Interaction Context","summary":" In this letter, we propose a new method, Multi-Clue Gaze (MCGaze), to\nfacilitate video gaze estimation via capturing spatial-temporal interaction\ncontext among head, face, and eye in an end-to-end learning way, which has not\nbeen well concerned yet. The main advantage of MCGaze is that the tasks of clue\nlocalization of head, face, and eye can be solved jointly for gaze estimation\nin a one-step way, with joint optimization to seek optimal performance. During\nthis, spatial-temporal context exchange happens among the clues on the head,\nface, and eye. Accordingly, the final gazes obtained by fusing features from\nvarious queries can be aware of global clues from heads and faces, and local\nclues from eyes simultaneously, which essentially leverages performance.\nMeanwhile, the one-step running way also ensures high running efficiency.\nExperiments on the challenging Gaze360 dataset verify the superiority of our\nproposition. 
The source code will be released at\nhttps://github.com/zgchen33/MCGaze.\n","authors":["Yiran Guan","Zhuoguang Chen","Wenzheng Zeng","Zhiguo Cao","Yang Xiao"],"pdf_url":"https://arxiv.org/pdf/2310.18131v1.pdf","comment":"5 pages, 3 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.18116v1","updated":"2023-10-27T13:02:12Z","published":"2023-10-27T13:02:12Z","title":"Direct Unsupervised Denoising","summary":" Traditional supervised denoisers are trained using pairs of noisy input and\nclean target images. They learn to predict a central tendency of the posterior\ndistribution over possible clean images. When, e.g., trained with the popular\nquadratic loss function, the network's output will correspond to the minimum\nmean square error (MMSE) estimate. Unsupervised denoisers based on Variational\nAutoEncoders (VAEs) have succeeded in achieving state-of-the-art results while\nrequiring only unpaired noisy data as training input. In contrast to the\ntraditional supervised approach, unsupervised denoisers do not directly produce\na single prediction, such as the MMSE estimate, but allow us to draw samples\nfrom the posterior distribution of clean solutions corresponding to the noisy\ninput. To approximate the MMSE estimate during inference, unsupervised methods\nhave to create and draw a large number of samples - a computationally expensive\nprocess - rendering the approach inapplicable in many situations. Here, we\npresent an alternative approach that trains a deterministic network alongside\nthe VAE to directly predict a central tendency. 
Our method achieves results\nthat surpass the results achieved by the unsupervised method at a fraction of\nthe computational cost.\n","authors":["Benjamin Salmon","Alexander Krull"],"pdf_url":"https://arxiv.org/pdf/2310.18116v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14800v4","updated":"2023-10-27T12:54:24Z","published":"2023-05-24T06:52:47Z","title":"Exploring Diverse In-Context Configurations for Image Captioning","summary":" After discovering that Language Models (LMs) can be good in-context few-shot\nlearners, numerous strategies have been proposed to optimize in-context\nsequence configurations. Recently, researchers in Vision-Language (VL) domains\nalso develop their few-shot learners, while they only use the simplest way,\ni.e., random sampling, to configure in-context image-text pairs. In order to\nexplore the effects of varying configurations on VL in-context learning, we\ndevised four strategies for image selection and four for caption assignment to\nconfigure in-context image-text pairs for image captioning. Here Image\nCaptioning is used as the case study since it can be seen as the\nvisually-conditioned LM. Our comprehensive experiments yield two\ncounter-intuitive but valuable insights, highlighting the distinct\ncharacteristics of VL in-context learning due to multi-modal synergy, as\ncompared to the NLP case. Furthermore, in our exploration of optimal\ncombination strategies, we observed an average performance enhancement of 20.7\nof CIDEr scores compared to the baseline. 
The code is given in\nhttps://github.com/yongliang-wu/ExploreCfg.\n","authors":["Xu Yang","Yongliang Wu","Mingzhuo Yang","Haokun Chen","Xin Geng"],"pdf_url":"https://arxiv.org/pdf/2305.14800v4.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2310.18112v1","updated":"2023-10-27T12:52:34Z","published":"2023-10-27T12:52:34Z","title":"er.autopilot 1.0: The Full Autonomous Stack for Oval Racing at High\n Speeds","summary":" The Indy Autonomous Challenge (IAC) brought together for the first time in\nhistory nine autonomous racing teams competing at unprecedented speed and in\nhead-to-head scenario, using independently developed software on open-wheel\nracecars. This paper presents the complete software architecture used by team\nTII EuroRacing (TII-ER), covering all the modules needed to avoid static\nobstacles, perform active overtakes and reach speeds above 75 m/s (270 km/h).\nIn addition to the most common modules related to perception, planning, and\ncontrol, we discuss the approaches used for vehicle dynamics modelling,\nsimulation, telemetry, and safety. 
Overall results and the performance of each\nmodule are described, as well as the lessons learned during the first two\nevents of the competition on oval tracks, where the team placed respectively\nsecond and third.\n","authors":["Ayoub Raji","Danilo Caporale","Francesco Gatti","Andrea Giove","Micaela Verucchi","Davide Malatesta","Nicola Musiu","Alessandro Toschi","Silviu Roberto Popitanu","Fabio Bagni","Massimiliano Bosi","Alexander Liniger","Marko Bertogna","Daniele Morra","Francesco Amerotti","Luca Bartoli","Federico Martello","Riccardo Porta"],"pdf_url":"https://arxiv.org/pdf/2310.18112v1.pdf","comment":"Preprint: Accepted to Field Robotics \"Opportunities and Challenges\n with Autonomous Racing\" Special Issue"},{"id":"http://arxiv.org/abs/2310.18104v1","updated":"2023-10-27T12:42:17Z","published":"2023-10-27T12:42:17Z","title":"Classifier-head Informed Feature Masking and Prototype-based Logit\n Smoothing for Out-of-Distribution Detection","summary":" Out-of-distribution (OOD) detection is essential when deploying neural\nnetworks in the real world. One main challenge is that neural networks often\nmake overconfident predictions on OOD data. In this study, we propose an\neffective post-hoc OOD detection method based on a new feature masking strategy\nand a novel logit smoothing strategy. Feature masking determines the important\nfeatures at the penultimate layer for each in-distribution (ID) class based on\nthe weights of the ID class in the classifier head and masks the rest features.\nLogit smoothing computes the cosine similarity between the feature vector of\nthe test sample and the prototype of the predicted ID class at the penultimate\nlayer and uses the similarity as an adaptive temperature factor on the logit to\nalleviate the network's overconfidence prediction for OOD data. With these\nstrategies, we can reduce feature activation of OOD data and enlarge the gap in\nOOD score between ID and OOD data. 
Extensive experiments on multiple standard\nOOD detection benchmarks demonstrate the effectiveness of our method and its\ncompatibility with existing methods, with new state-of-the-art performance\nachieved from our method. The source code will be released publicly.\n","authors":["Zhuohao Sun","Yiqiao Qiu","Zhijun Tan","Weishi Zheng","Ruixuan Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18104v1.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.18087v1","updated":"2023-10-27T12:12:06Z","published":"2023-10-27T12:12:06Z","title":"A Chebyshev Confidence Guided Source-Free Domain Adaptation Framework\n for Medical Image Segmentation","summary":" Source-free domain adaptation (SFDA) aims to adapt models trained on a\nlabeled source domain to an unlabeled target domain without the access to\nsource data. In medical imaging scenarios, the practical significance of SFDA\nmethods has been emphasized due to privacy concerns. Recent State-of-the-art\nSFDA methods primarily rely on self-training based on pseudo-labels (PLs).\nUnfortunately, PLs suffer from accuracy deterioration caused by domain shift,\nand thus limit the effectiveness of the adaptation process. To address this\nissue, we propose a Chebyshev confidence guided SFDA framework to accurately\nassess the reliability of PLs and generate self-improving PLs for\nself-training. The Chebyshev confidence is estimated by calculating probability\nlower bound of the PL confidence, given the prediction and the corresponding\nuncertainty. Leveraging the Chebyshev confidence, we introduce two\nconfidence-guided denoising methods: direct denoising and prototypical\ndenoising. Additionally, we propose a novel teacher-student joint training\nscheme (TJTS) that incorporates a confidence weighting module to improve PLs\niteratively. The TJTS, in collaboration with the denoising methods, effectively\nprevents the propagation of noise and enhances the accuracy of PLs. 
Extensive\nexperiments in diverse domain scenarios validate the effectiveness of our\nproposed framework and establish its superiority over state-of-the-art SFDA\nmethods. Our paper contributes to the field of SFDA by providing a novel\napproach for precisely estimating the reliability of pseudo-labels and a\nframework for obtaining high-quality PLs, resulting in improved adaptation\nperformance.\n","authors":["Jiesi Hu","Yanwu Yang","Xutao Guo","Jinghua Wang","Ting Ma"],"pdf_url":"https://arxiv.org/pdf/2310.18087v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2102.01161v3","updated":"2023-10-27T12:10:36Z","published":"2021-02-01T20:58:45Z","title":"Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes","summary":" Most learning methods for 3D data (point clouds, meshes) suffer significant\nperformance drops when the data is not carefully aligned to a canonical\norientation. Aligning real world 3D data collected from different sources is\nnon-trivial and requires manual intervention. In this paper, we propose the\nAdjoint Rigid Transform (ART) Network, a neural module which can be integrated\nwith a variety of 3D networks to significantly boost their performance. ART\nlearns to rotate input shapes to a learned canonical orientation, which is\ncrucial for a lot of tasks such as shape reconstruction, interpolation,\nnon-rigid registration, and latent disentanglement. ART achieves this with\nself-supervision and a rotation equivariance constraint on predicted rotations.\nThe remarkable result is that with only self-supervision, ART facilitates\nlearning a unique canonical orientation for both rigid and nonrigid shapes,\nwhich leads to a notable boost in performance of aforementioned tasks. 
We will\nrelease our code and pre-trained models for further research.\n","authors":["Keyang Zhou","Bharat Lal Bhatnagar","Bernt Schiele","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2102.01161v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.07982v3","updated":"2023-10-27T11:32:26Z","published":"2022-05-16T20:41:45Z","title":"TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion\n Refinement","summary":" We present TOCH, a method for refining incorrect 3D hand-object interaction\nsequences using a data prior. Existing hand trackers, especially those that\nrely on very few cameras, often produce visually unrealistic results with\nhand-object intersection or missing contacts. Although correcting such errors\nrequires reasoning about temporal aspects of interaction, most previous works\nfocus on static grasps and contacts. The core of our method are TOCH fields, a\nnovel spatio-temporal representation for modeling correspondences between hands\nand objects during interaction. TOCH fields are a point-wise, object-centric\nrepresentation, which encode the hand position relative to the object.\nLeveraging this novel representation, we learn a latent manifold of plausible\nTOCH fields with a temporal denoising auto-encoder. Experiments demonstrate\nthat TOCH outperforms state-of-the-art 3D hand-object interaction models, which\nare limited to static grasps and contacts. More importantly, our method\nproduces smooth interactions even before and after contact. 
Using a single\ntrained TOCH model, we quantitatively and qualitatively demonstrate its\nusefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D\nhand-object reconstruction methods and transferring grasps across objects.\n","authors":["Keyang Zhou","Bharat Lal Bhatnagar","Jan Eric Lenssen","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2205.07982v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18049v1","updated":"2023-10-27T10:52:50Z","published":"2023-10-27T10:52:50Z","title":"Text Augmented Spatial-aware Zero-shot Referring Image Segmentation","summary":" In this paper, we study a challenging task of zero-shot referring image\nsegmentation. This task aims to identify the instance mask that is most related\nto a referring expression without training on pixel-level annotations. Previous\nresearch takes advantage of pre-trained cross-modal models, e.g., CLIP, to\nalign instance-level masks with referring expressions. Yet, CLIP only considers the global-level\nalignment of image-text pairs, neglecting fine-grained matching between the\nreferring sentence and local image regions. To address this challenge, we\nintroduce a Text Augmented Spatial-aware (TAS) zero-shot referring image\nsegmentation framework that is training-free and robust to various visual\nencoders. TAS incorporates a mask proposal network for instance-level mask\nextraction, a text-augmented visual-text matching score for mining the\nimage-text correlation, and a spatial rectifier for mask post-processing.\nNotably, the text-augmented visual-text matching score leverages a $P$-score\nand an $N$-score in addition to the typical visual-text matching score. 
The\n$P$-score is utilized to close the visual-text domain gap through a surrogate\ncaptioning model, where the score is computed between the surrogate\nmodel-generated texts and the referring expression. The $N$-score considers the\nfine-grained alignment of region-text pairs via negative phrase mining,\nencouraging the masked image to be repelled from the mined distracting phrases.\nExtensive experiments are conducted on various datasets, including RefCOCO,\nRefCOCO+, and RefCOCOg. The proposed method clearly outperforms\nstate-of-the-art zero-shot referring image segmentation methods.\n","authors":["Yucheng Suo","Linchao Zhu","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.18049v1.pdf","comment":"Findings of EMNLP2023"},{"id":"http://arxiv.org/abs/2310.18046v1","updated":"2023-10-27T10:44:50Z","published":"2023-10-27T10:44:50Z","title":"ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model\n for Visual Question Answering in Vietnamese","summary":" In recent years, Visual Question Answering (VQA) has gained significant\nattention for its diverse applications, including intelligent car assistance,\naiding visually impaired individuals, and document image information retrieval\nusing natural language queries. VQA requires effective integration of\ninformation from questions and images to generate accurate answers. Neural\nmodels for VQA have made remarkable progress on large-scale datasets, with a\nprimary focus on resource-rich languages like English. To address this, we\nintroduce the ViCLEVR dataset, a pioneering collection for evaluating various\nvisual reasoning capabilities in Vietnamese while mitigating biases. The\ndataset comprises over 26,000 images and 30,000 question-answer pairs (QAs),\neach question annotated to specify the type of reasoning involved. Leveraging\nthis dataset, we conduct a comprehensive analysis of contemporary visual\nreasoning systems, offering valuable insights into their strengths and\nlimitations. 
Furthermore, we present PhoVIT, a comprehensive multimodal fusion\nthat identifies objects in images based on questions. The architecture\neffectively employs transformers to enable simultaneous reasoning over textual\nand visual data, merging both modalities at an early model stage. The\nexperimental findings demonstrate that our proposed model achieves\nstate-of-the-art performance across four evaluation metrics. The accompanying\ncode and dataset have been made publicly accessible at\n\\url{https://github.com/kvt0012/ViCLEVR}. This provision seeks to stimulate\nadvancements within the research community, fostering the development of more\nmultimodal fusion algorithms, specifically tailored to address the nuances of\nlow-resource languages, exemplified by Vietnamese.\n","authors":["Khiem Vinh Tran","Hao Phu Phan","Kiet Van Nguyen","Ngan Luu Thuy Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.18046v1.pdf","comment":"A pre-print version and submitted to journal"},{"id":"http://arxiv.org/abs/2310.15169v2","updated":"2023-10-27T10:23:04Z","published":"2023-10-23T17:59:58Z","title":"FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling","summary":" With the availability of large-scale video datasets and the advances of\ndiffusion models, text-driven video generation has achieved substantial\nprogress. However, existing video generation models are typically trained on a\nlimited number of frames, resulting in the inability to generate high-fidelity\nlong videos during inference. Furthermore, these models only support\nsingle-text conditions, whereas real-life scenarios often require multi-text\nconditions as the video content changes over time. To tackle these challenges,\nthis study explores the potential of extending the text-driven capability to\ngenerate longer videos conditioned on multiple texts. 1) We first analyze the\nimpact of initial noise in video diffusion models. 
Then building upon the\nobservation of noise, we propose FreeNoise, a tuning-free and time-efficient\nparadigm to enhance the generative capabilities of pretrained video diffusion\nmodels while preserving content consistency. Specifically, instead of\ninitializing noises for all frames, we reschedule a sequence of noises for\nlong-range correlation and perform temporal attention over them by window-based\nfunction. 2) Additionally, we design a novel motion injection method to support\nthe generation of videos conditioned on multiple text prompts. Extensive\nexperiments validate the superiority of our paradigm in extending the\ngenerative capabilities of video diffusion models. It is noteworthy that\ncompared with the previous best-performing method which brought about 255%\nextra time cost, our method incurs only negligible time cost of approximately\n17%. Generated video samples are available at our website:\nhttp://haonanqiu.com/projects/FreeNoise.html.\n","authors":["Haonan Qiu","Menghan Xia","Yong Zhang","Yingqing He","Xintao Wang","Ying Shan","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15169v2.pdf","comment":"Project Page: http://haonanqiu.com/projects/FreeNoise.html Code Repo:\n https://github.com/arthur-qiu/LongerCrafter"},{"id":"http://arxiv.org/abs/2305.16379v2","updated":"2023-10-27T10:13:50Z","published":"2023-05-25T15:46:20Z","title":"Learning Better with Less: Effective Augmentation for Sample-Efficient\n Visual Reinforcement Learning","summary":" Data augmentation (DA) is a crucial technique for enhancing the sample\nefficiency of visual reinforcement learning (RL) algorithms. Notably, employing\nsimple observation transformations alone can yield outstanding performance\nwithout extra auxiliary representation tasks or pre-trained encoders. However,\nit remains unclear which attributes of DA account for its effectiveness in\nachieving sample-efficient visual RL. 
To investigate this issue and further\nexplore the potential of DA, this work conducts comprehensive experiments to\nassess the impact of DA's attributes on its efficacy and provides the following\ninsights and improvements: (1) For individual DA operations, we reveal that\nboth ample spatial diversity and slight hardness are indispensable. Building on\nthis finding, we introduce Random PadResize (Rand PR), a new DA operation that\noffers abundant spatial diversity with minimal hardness. (2) For multi-type DA\nfusion schemes, the increased DA hardness and unstable data distribution result\nin the current fusion schemes being unable to achieve higher sample efficiency\nthan their corresponding individual operations. Taking the non-stationary\nnature of RL into account, we propose a RL-tailored multi-type DA fusion scheme\ncalled Cycling Augmentation (CycAug), which performs periodic cycles of\ndifferent DA operations to increase type diversity while maintaining data\ndistribution consistency. Extensive evaluations on the DeepMind Control suite\nand CARLA driving simulator demonstrate that our methods achieve superior\nsample efficiency compared with the prior state-of-the-art methods.\n","authors":["Guozheng Ma","Linrui Zhang","Haoyu Wang","Lu Li","Zilin Wang","Zhen Wang","Li Shen","Xueqian Wang","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2305.16379v2.pdf","comment":"NeurIPS 2023 poster"},{"id":"http://arxiv.org/abs/2306.00777v2","updated":"2023-10-27T09:59:45Z","published":"2023-06-01T15:08:15Z","title":"Object pop-up: Can we infer 3D objects and their poses from human\n interactions alone?","summary":" The intimate entanglement between objects affordances and human poses is of\nlarge interest, among others, for behavioural sciences, cognitive psychology,\nand Computer Vision communities. 
In recent years, the latter has developed\nseveral object-centric approaches: starting from items, learning pipelines\nsynthesizing human poses and dynamics in a realistic way, satisfying both\ngeometrical and functional expectations. However, the inverse perspective is\nsignificantly less explored: Can we infer 3D objects and their poses from human\ninteractions alone? Our investigation follows this direction, showing that a\ngeneric 3D human point cloud is enough to pop up an unobserved object, even\nwhen the user is just imitating a functionality (e.g., looking through a\nbinocular) without involving a tangible counterpart. We validate our method\nqualitatively and quantitatively, with synthetic data and sequences acquired\nfor the task, showing applicability for XR/VR. The code is available at\nhttps://github.com/ptrvilya/object-popup.\n","authors":["Ilya A. Petrov","Riccardo Marin","Julian Chibane","Gerard Pons-Moll"],"pdf_url":"https://arxiv.org/pdf/2306.00777v2.pdf","comment":"Accepted at CVPR'23"},{"id":"http://arxiv.org/abs/2304.02860v2","updated":"2023-10-27T09:45:18Z","published":"2023-04-06T04:39:23Z","title":"Towards an Effective and Efficient Transformer for Rain-by-snow Weather\n Removal","summary":" Rain-by-snow weather removal is a specialized task in weather-degraded image\nrestoration aiming to eliminate coexisting rain streaks and snow particles. In\nthis paper, we propose RSFormer, an efficient and effective Transformer that\naddresses this challenge. Initially, we explore the proximity of convolution\nnetworks (ConvNets) and vision Transformers (ViTs) in hierarchical\narchitectures and experimentally find they perform approximately at intra-stage\nfeature learning. On this basis, we utilize a Transformer-like convolution\nblock (TCB) that replaces the computationally expensive self-attention while\npreserving attention characteristics for adapting to input content. 
We also\ndemonstrate that cross-stage progression is critical for performance\nimprovement, and propose a global-local self-attention sampling mechanism\n(GLASM) that down-/up-samples features while capturing both global and local\ndependencies. Finally, we synthesize two novel rain-by-snow datasets,\nRSCityScape and RS100K, to evaluate our proposed RSFormer. Extensive\nexperiments verify that RSFormer achieves the best trade-off between\nperformance and time-consumption compared to other restoration methods. For\ninstance, it outperforms Restormer with a 1.53% reduction in the number of\nparameters and a 15.6% reduction in inference time. Datasets, source code and\npre-trained models are available at \\url{https://github.com/chdwyb/RSFormer}.\n","authors":["Tao Gao","Yuanbo Wen","Kaihao Zhang","Peng Cheng","Ting Chen"],"pdf_url":"https://arxiv.org/pdf/2304.02860v2.pdf","comment":"code is available at \\url{https://github.com/chdwyb/RSFormer}"},{"id":"http://arxiv.org/abs/2309.02898v2","updated":"2023-10-27T09:24:20Z","published":"2023-09-06T10:41:30Z","title":"A Unified Framework for Discovering Discrete Symmetries","summary":" We consider the problem of learning a function respecting a symmetry from\namong a class of symmetries. We develop a unified framework that enables\nsymmetry discovery across a broad range of subgroups including locally\nsymmetric, dihedral and cyclic subgroups. At the core of the framework is a\nnovel architecture composed of linear, matrix-valued and non-linear functions\nthat expresses functions invariant to these subgroups in a principled manner.\nThe structure of the architecture enables us to leverage multi-armed bandit\nalgorithms and gradient descent to efficiently optimize over the linear and the\nnon-linear functions, respectively, and to infer the symmetry that is\nultimately learnt. We also discuss the necessity of the matrix-valued functions\nin the architecture. 
Experiments on image-digit sum and polynomial regression\ntasks demonstrate the effectiveness of our approach.\n","authors":["Pavan Karjol","Rohan Kashyap","Aditya Gopalan","Prathosh A. P"],"pdf_url":"https://arxiv.org/pdf/2309.02898v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.04904v3","updated":"2023-10-27T09:13:38Z","published":"2023-08-09T12:04:36Z","title":"StableVQA: A Deep No-Reference Quality Assessment Model for Video\n Stability","summary":" Video shakiness is an unpleasant distortion of User Generated Content (UGC)\nvideos, which is usually caused by the unstable hold of cameras. In recent\nyears, many video stabilization algorithms have been proposed, yet no specific\nand accurate metric enables comprehensively evaluating the stability of videos.\nIndeed, most existing quality assessment models evaluate video quality as a\nwhole without specifically taking the subjective experience of video stability\ninto consideration. Therefore, these models cannot measure the video stability\nexplicitly and precisely when severe shakes are present. In addition, there is\nno large-scale video database in public that includes various degrees of shaky\nvideos with the corresponding subjective scores available, which hinders the\ndevelopment of Video Quality Assessment for Stability (VQA-S). To this end, we\nbuild a new database named StableDB that contains 1,952 diversely-shaky UGC\nvideos, where each video has a Mean Opinion Score (MOS) on the degree of video\nstability rated by 34 subjects. Moreover, we elaborately design a novel VQA-S\nmodel named StableVQA, which consists of three feature extractors to acquire\nthe optical flow, semantic, and blur features respectively, and a regression\nlayer to predict the final stability score. Extensive experiments demonstrate\nthat the StableVQA achieves a higher correlation with subjective opinions than\nthe existing VQA-S models and generic VQA models. 
The database and codes are\navailable at https://github.com/QMME/StableVQA.\n","authors":["Tengchuan Kou","Xiaohong Liu","Wei Sun","Jun Jia","Xiongkuo Min","Guangtao Zhai","Ning Liu"],"pdf_url":"https://arxiv.org/pdf/2308.04904v3.pdf","comment":"Accepted by ACM MM'23"},{"id":"http://arxiv.org/abs/2310.17519v2","updated":"2023-10-27T09:11:32Z","published":"2023-10-26T16:13:00Z","title":"FLARE: Fast Learning of Animatable and Relightable Mesh Avatars","summary":" Our goal is to efficiently learn personalized animatable 3D head avatars from\nvideos that are geometrically accurate, realistic, relightable, and compatible\nwith current rendering systems. While 3D meshes enable efficient processing and\nare highly portable, they lack realism in terms of shape and appearance. Neural\nrepresentations, on the other hand, are realistic but lack compatibility and\nare slow to train and render. Our key insight is that it is possible to\nefficiently learn high-fidelity 3D mesh representations via differentiable\nrendering by exploiting highly-optimized methods from traditional computer\ngraphics and approximating some of the components with neural networks. To that\nend, we introduce FLARE, a technique that enables the creation of animatable\nand relightable mesh avatars from a single monocular video. First, we learn a\ncanonical geometry using a mesh representation, enabling efficient\ndifferentiable rasterization and straightforward animation via learned\nblendshapes and linear blend skinning weights. Second, we follow\nphysically-based rendering and factor observed colors into intrinsic albedo,\nroughness, and a neural representation of the illumination, allowing the\nlearned avatars to be relit in novel scenes. Since our input videos are\ncaptured on a single device with a narrow field of view, modeling the\nsurrounding environment light is non-trivial. 
Based on the split-sum\napproximation for modeling specular reflections, we address this by\napproximating the pre-filtered environment map with a multi-layer perceptron\n(MLP) modulated by the surface roughness, eliminating the need to explicitly\nmodel the light. We demonstrate that our mesh-based avatar formulation,\ncombined with learned deformation, material, and lighting MLPs, produces\navatars with high-quality geometry and appearance, while also being efficient\nto train and render compared to existing approaches.\n","authors":["Shrisha Bharadwaj","Yufeng Zheng","Otmar Hilliges","Michael J. Black","Victoria Fernandez-Abrevaya"],"pdf_url":"https://arxiv.org/pdf/2310.17519v2.pdf","comment":"15 pages, Accepted: ACM Transactions on Graphics (Proceedings of\n SIGGRAPH Asia), 2023"},{"id":"http://arxiv.org/abs/2310.17994v1","updated":"2023-10-27T09:06:43Z","published":"2023-10-27T09:06:43Z","title":"ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image","summary":" We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view\nsynthesis for in-the-wild scenes. While existing methods are designed for\nsingle objects with masked backgrounds, we propose new techniques to address\nchallenges introduced by in-the-wild multi-object scenes with complex\nbackgrounds. Specifically, we train a generative prior on a mixture of data\nsources that capture object-centric, indoor, and outdoor scenes. To address\nissues from data mixture such as depth-scale ambiguity, we propose a novel\ncamera conditioning parameterization and normalization scheme. Further, we\nobserve that Score Distillation Sampling (SDS) tends to truncate the\ndistribution of complex backgrounds during distillation of 360-degree scenes,\nand propose \"SDS anchoring\" to improve the diversity of synthesized novel\nviews. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset\nin the zero-shot setting, even outperforming methods specifically trained on\nDTU. 
We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark\nfor single-image novel view synthesis, and demonstrate strong performance in\nthis setting. Our code and data are at http://kylesargent.github.io/zeronvs/\n","authors":["Kyle Sargent","Zizhang Li","Tanmay Shah","Charles Herrmann","Hong-Xing Yu","Yunzhi Zhang","Eric Ryan Chan","Dmitry Lagun","Li Fei-Fei","Deqing Sun","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2310.17994v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2310.17594v2","updated":"2023-10-27T08:40:15Z","published":"2023-10-26T17:13:48Z","title":"SPA: A Graph Spectral Alignment Perspective for Domain Adaptation","summary":" Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to\nextend the in-domain model to the distinctive target domains where the data\ndistributions differ. Most prior works focus on capturing the inter-domain\ntransferability but largely overlook rich intra-domain structures, which\nempirically results in even worse discriminability. In this work, we introduce\na novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The\ncore of our method is briefly condensed as follows: (i)-by casting the DA\nproblem to graph primitives, SPA composes a coarse graph alignment mechanism\nwith a novel spectral regularizer towards aligning the domain graphs in\neigenspaces; (ii)-we further develop a fine-grained message propagation module\n-- upon a novel neighbor-aware self-training mechanism -- in order for enhanced\ndiscriminability in the target domain. On standardized benchmarks, the\nextensive experiments of SPA demonstrate that its performance has surpassed the\nexisting cutting-edge DA methods. Coupled with dense model analysis, we\nconclude that our approach indeed possesses superior efficacy, robustness,\ndiscriminability, and transferability. 
Code and data are available at:\nhttps://github.com/CrownX/SPA.\n","authors":["Zhiqing Xiao","Haobo Wang","Ying Jin","Lei Feng","Gang Chen","Fei Huang","Junbo Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.17594v2.pdf","comment":"NeurIPS 2023 camera ready"},{"id":"http://arxiv.org/abs/2310.17974v1","updated":"2023-10-27T08:38:59Z","published":"2023-10-27T08:38:59Z","title":"FaultSeg Swin-UNETR: Transformer-Based Self-Supervised Pretraining Model\n for Fault Recognition","summary":" This paper introduces an approach to enhance seismic fault recognition\nthrough self-supervised pretraining. Seismic fault interpretation holds great\nsignificance in the fields of geophysics and geology. However, conventional\nmethods for seismic fault recognition encounter various issues, including\ndependence on data quality and quantity, as well as susceptibility to\ninterpreter subjectivity. Currently, automated fault recognition methods\nproposed based on small synthetic datasets experience performance degradation\nwhen applied to actual seismic data. To address these challenges, we have\nintroduced the concept of self-supervised learning, utilizing a substantial\namount of relatively easily obtainable unlabeled seismic data for pretraining.\nSpecifically, we have employed the Swin Transformer model as the core network\nand employed the SimMIM pretraining task to capture unique features related to\ndiscontinuities in seismic data. During the fine-tuning phase, inspired by edge\ndetection techniques, we have also refined the structure of the Swin-UNETR\nmodel, enabling multiscale decoding and fusion for more effective fault\ndetection. 
Experimental results demonstrate that our proposed method attains\nstate-of-the-art performance on the Thebe dataset, as measured by the OIS and\nODS metrics.\n","authors":["Zeren Zhang","Ran Chen","Jinwen Ma"],"pdf_url":"https://arxiv.org/pdf/2310.17974v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17577v2","updated":"2023-10-27T08:26:49Z","published":"2023-10-26T17:01:52Z","title":"Global Structure-Aware Diffusion Process for Low-Light Image Enhancement","summary":" This paper studies a diffusion-based framework to address the low-light image\nenhancement problem. To harness the capabilities of diffusion models, we delve\ninto this intricate process and advocate for the regularization of its inherent\nODE-trajectory. To be specific, inspired by the recent research that low\ncurvature ODE-trajectory results in a stable and effective diffusion process,\nwe formulate a curvature regularization term anchored in the intrinsic\nnon-local structures of image data, i.e., global structure-aware\nregularization, which gradually facilitates the preservation of complicated\ndetails and the augmentation of contrast during the diffusion process. This\nincorporation mitigates the adverse effects of noise and artifacts resulting\nfrom the diffusion process, leading to a more precise and flexible enhancement.\nTo additionally promote learning in challenging regions, we introduce an\nuncertainty-guided regularization technique, which wisely relaxes constraints\non the most extreme regions of the image. Experimental evaluations reveal that\nthe proposed diffusion-based framework, complemented by rank-informed\nregularization, attains distinguished performance in low-light enhancement. The\noutcomes indicate substantial advancements in image quality, noise suppression,\nand contrast amplification in comparison with state-of-the-art methods. 
We\nbelieve this innovative approach will stimulate further exploration and\nadvancement in low-light image processing, with potential implications for\nother applications of diffusion models. The code is publicly available at\nhttps://github.com/jinnh/GSAD.\n","authors":["Jinhui Hou","Zhiyu Zhu","Junhui Hou","Hui Liu","Huanqiang Zeng","Hui Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.17577v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17956v1","updated":"2023-10-27T08:05:21Z","published":"2023-10-27T08:05:21Z","title":"Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General\n Healthcare","summary":" Large Language Models (LLMs) have introduced a new era of proficiency in\ncomprehending complex healthcare and biomedical topics. However, there is a\nnoticeable lack of models in languages other than English and models that can\ninterpret multi-modal input, which is crucial for global healthcare\naccessibility. In response, this study introduces Qilin-Med-VL, the first\nChinese large vision-language model designed to integrate the analysis of\ntextual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer\n(ViT) with a foundational LLM. It undergoes a thorough two-stage curriculum\ntraining process that includes feature alignment and instruction tuning. This\nmethod enhances the model's ability to generate medical captions and answer\ncomplex medical queries. We also release ChiMed-VL, a dataset consisting of\nmore than 1M image-text pairs. 
This dataset has been carefully curated to\nenable detailed and comprehensive interpretation of medical data using various\ntypes of images.\n","authors":["Junling Liu","Ziming Wang","Qichen Ye","Dading Chong","Peilin Zhou","Yining Hua"],"pdf_url":"https://arxiv.org/pdf/2310.17956v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17954v1","updated":"2023-10-27T08:03:12Z","published":"2023-10-27T08:03:12Z","title":"Multivessel Coronary Artery Segmentation and Stenosis Localisation using\n Ensemble Learning","summary":" Coronary angiography analysis is a common clinical task performed by\ncardiologists to diagnose coronary artery disease (CAD) through an assessment\nof atherosclerotic plaque's accumulation. This study introduces an end-to-end\nmachine learning solution developed as part of our solution for the MICCAI 2023\nAutomatic Region-based Coronary Artery Disease diagnostics using x-ray\nangiography imagEs (ARCADE) challenge, which aims to benchmark solutions for\nmultivessel coronary artery segmentation and potential stenotic lesion\nlocalisation from X-ray coronary angiograms. We adopted a robust baseline model\ntraining strategy to progressively improve performance, comprising five\nsuccessive stages of binary class pretraining, multivessel segmentation,\nfine-tuning using class frequency weighted dataloaders, fine-tuning using\nF1-based curriculum learning strategy (F1-CLS), and finally multi-target\nangiogram view classifier-based collective adaptation. Unlike many other\nmedical imaging procedures, this task exhibits a notable degree of\ninterobserver variability. Our ensemble model combines the outputs from six baseline models\nusing the weighted ensembling approach, which our analysis shows\ndoubles the predictive accuracy of the proposed solution. The final prediction\nwas further refined, targeting the correction of misclassified blobs. 
Our\nsolution achieved a mean F1 score of $37.69\\%$ for coronary artery\nsegmentation, and $39.41\\%$ for stenosis localisation, positioning our team in\nthe 5th position on both leaderboards. This work demonstrates the potential of\nautomated tools to aid CAD diagnosis, guide interventions, and improve the\naccuracy of stent injections in clinical settings.\n","authors":["Muhammad Bilal","Dinis Martinho","Reiner Sim","Adnan Qayyum","Hunaid Vohra","Massimo Caputo","Taofeek Akinosho","Sofiat Abioye","Zaheer Khan","Waleed Niaz","Junaid Qadir"],"pdf_url":"https://arxiv.org/pdf/2310.17954v1.pdf","comment":"Submission report for ARCADE challenge hosted at MICCAI2023"},{"id":"http://arxiv.org/abs/2310.17952v1","updated":"2023-10-27T07:57:24Z","published":"2023-10-27T07:57:24Z","title":"Shape-centered Representation Learning for Visible-Infrared Person\n Re-identification","summary":" Current Visible-Infrared Person Re-Identification (VI-ReID) methods\nprioritize extracting distinguishing appearance features, ignoring the natural\nresistance of body shape against modality changes. Initially, we gauged the\ndiscriminative potential of shapes by a straightforward concatenation of shape\nand appearance features. However, two unresolved issues persist in the\nutilization of shape features. One pertains to the dependence on auxiliary\nmodels for shape feature extraction in the inference phase, along with the\nerrors in generated infrared shapes due to the intrinsic modality disparity.\nThe other issue involves the inadequately explored correlation between shape\nand appearance features. To tackle the aforementioned challenges, we propose\nthe Shape-centered Representation Learning framework (ScRL), which focuses on\nlearning shape features and appearance features associated with shapes.\nSpecifically, we devise the Shape Feature Propagation (SFP), facilitating\ndirect extraction of shape features from original images with minimal\ncomplexity costs during inference. 
To restitute inaccuracies in infrared body\nshapes at the feature level, we present the Infrared Shape Restitution (ISR).\nFurthermore, to acquire appearance features related to shape, we design the\nAppearance Feature Enhancement (AFE), which accentuates identity-related\nfeatures while suppressing identity-unrelated features guided by shape\nfeatures. Extensive experiments are conducted to validate the effectiveness of\nthe proposed ScRL. Achieving remarkable results, the Rank-1 (mAP) accuracy\nattains 76.1%, 71.2%, 92.4% (72.6%, 52.9%, 86.7%) on the SYSU-MM01, HITSZ-VCM,\nRegDB datasets respectively, outperforming existing state-of-the-art methods.\n","authors":["Shuang Li","Jiaxu Leng","Ji Gan","Mengjingcheng Mo","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2310.17952v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17951v1","updated":"2023-10-27T07:48:36Z","published":"2023-10-27T07:48:36Z","title":"Understanding Parameter Saliency via Extreme Value Theory","summary":" Deep neural networks are being increasingly implemented throughout society in\nrecent years. It is useful to identify which parameters trigger\nmisclassification in diagnosing undesirable model behaviors. The concept of\nparameter saliency is proposed and used to diagnose convolutional neural\nnetworks (CNNs) by ranking convolution filters that may have caused\nmisclassification on the basis of parameter saliency. It is also shown that\nfine-tuning the top ranking salient filters has efficiently corrected\nmisidentification on ImageNet. However, there is still a knowledge gap in terms\nof understanding why parameter saliency ranking can find the filters inducing\nmisidentification. In this work, we attempt to bridge the gap by analyzing\nparameter saliency ranking from a statistical viewpoint, namely, extreme value\ntheory. We first show that the existing work implicitly assumes that the\ngradient norm computed for each filter follows a normal distribution. 
Then, we\nclarify the relationship between parameter saliency and the score based on the\npeaks-over-threshold (POT) method, which is often used to model extreme values.\nFinally, we reformulate parameter saliency in terms of the POT method, where\nthis reformulation is regarded as statistical anomaly detection and does not\nrequire the implicit assumptions of the existing parameter-saliency\nformulation. Our experimental results demonstrate that our reformulation can\ndetect malicious filters as well. Furthermore, we show that the existing\nparameter saliency method exhibits a bias against the depth of layers in deep\nneural networks. In particular, this bias has the potential to inhibit the\ndiscovery of filters that cause misidentification in situations where domain\nshift occurs. In contrast, parameter saliency based on POT shows less of this\nbias.\n","authors":["Shuo Wang","Issei Sato"],"pdf_url":"https://arxiv.org/pdf/2310.17951v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17949v1","updated":"2023-10-27T07:44:25Z","published":"2023-10-27T07:44:25Z","title":"Instance Segmentation under Occlusions via Location-aware Copy-Paste\n Data Augmentation","summary":" Occlusion is a long-standing problem in computer vision, particularly in\ninstance segmentation. ACM MMSports 2023 DeepSportRadar has introduced a\ndataset that focuses on segmenting human subjects within a basketball context\nand a specialized evaluation metric for occlusion scenarios. Given the modest\nsize of the dataset and the highly deformable nature of the objects to be\nsegmented, this challenge demands the application of robust data augmentation\ntechniques and wisely-chosen deep learning architectures. Our work (ranked 1st\nin the competition) first proposes a novel data augmentation technique, capable\nof generating more training samples with wider distribution. 
Then, we adopt a\nnew architecture - Hybrid Task Cascade (HTC) framework with CBNetV2 as backbone\nand MaskIoU head to improve segmentation performance. Furthermore, we employ a\nStochastic Weight Averaging (SWA) training strategy to improve the model's\ngeneralization. As a result, we achieve a remarkable occlusion score (OM) of\n0.533 on the challenge dataset, securing the top-1 position on the leaderboard.\nSource code is available at\nhttps://github.com/nguyendinhson-kaist/MMSports23-Seg-AutoID.\n","authors":["Son Nguyen","Mikel Lainsa","Hung Dao","Daeyoung Kim","Giang Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.17949v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17942v1","updated":"2023-10-27T07:36:36Z","published":"2023-10-27T07:36:36Z","title":"Diversifying Spatial-Temporal Perception for Video Domain Generalization","summary":" Video domain generalization aims to learn generalizable video classification\nmodels for unseen target domains by training in a source domain. A critical\nchallenge of video domain generalization is to defend against the heavy\nreliance on domain-specific cues extracted from the source domain when\nrecognizing target videos. To this end, we propose to perceive diverse\nspatial-temporal cues in videos, aiming to discover potential domain-invariant\ncues in addition to domain-specific cues. We contribute a novel model named\nSpatial-Temporal Diversification Network (STDN), which improves the diversity\nfrom both space and time dimensions of video data. First, our STDN proposes to\ndiscover various types of spatial cues within individual frames by spatial\ngrouping. Then, our STDN proposes to explicitly model spatial-temporal\ndependencies between video contents at multiple space-time scales by\nspatial-temporal relation modeling. 
Extensive experiments on three benchmarks\nof different types demonstrate the effectiveness and versatility of our\napproach.\n","authors":["Kun-Yu Lin","Jia-Run Du","Yipeng Gao","Jiaming Zhou","Wei-Shi Zheng"],"pdf_url":"https://arxiv.org/pdf/2310.17942v1.pdf","comment":"Accepted to NeurIPS 2023. Code is available at\n https://github.com/KunyuLin/STDN/"},{"id":"http://arxiv.org/abs/2310.12692v2","updated":"2023-10-27T07:28:19Z","published":"2023-10-19T12:39:59Z","title":"Representation Learning via Consistent Assignment of Views over Random\n Partitions","summary":" We present Consistent Assignment of Views over Random Partitions (CARP), a\nself-supervised clustering method for representation learning of visual\nfeatures. CARP learns prototypes in an end-to-end online fashion using gradient\ndescent without additional non-differentiable modules to solve the cluster\nassignment problem. CARP optimizes a new pretext task based on random\npartitions of prototypes that regularizes the model and enforces consistency\nbetween views' assignments. Additionally, our method improves training\nstability and prevents collapsed solutions in joint-embedding training. Through\nan extensive evaluation, we demonstrate that CARP's representations are\nsuitable for learning downstream tasks. We evaluate CARP's representations\ncapabilities in 17 datasets across many standard protocols, including linear\nevaluation, few-shot classification, k-NN, k-means, image retrieval, and copy\ndetection. We compare CARP performance to 11 existing self-supervised methods.\nWe extensively ablate our method and demonstrate that our proposed random\npartition pretext task improves the quality of the learned representations by\ndevising multiple random classification tasks. 
In transfer learning tasks, CARP\nachieves the best performance on average against many SSL methods trained for a\nlonger time.\n","authors":["Thalles Silva","Adín Ramírez Rivera"],"pdf_url":"https://arxiv.org/pdf/2310.12692v2.pdf","comment":"To appear in NeurIPS 2023. Code available at\n https://github.com/sthalles/carp"},{"id":"http://arxiv.org/abs/2309.12095v2","updated":"2023-10-27T07:00:04Z","published":"2023-09-21T14:10:47Z","title":"Bayesian sparsification for deep neural networks with Bayesian model\n reduction","summary":" Deep learning's immense capabilities are often constrained by the complexity\nof its models, leading to an increasing demand for effective sparsification\ntechniques. Bayesian sparsification for deep learning emerges as a crucial\napproach, facilitating the design of models that are both computationally\nefficient and competitive in terms of performance across various deep learning\napplications. The state-of-the-art -- in Bayesian sparsification of deep neural\nnetworks -- combines structural shrinkage priors on model weights with an\napproximate inference scheme based on stochastic variational inference.\nHowever, model inversion of the full generative model is exceptionally\ncomputationally demanding, especially when compared to standard deep learning\nof point estimates. In this context, we advocate for the use of Bayesian model\nreduction (BMR) as a more efficient alternative for pruning of model weights.\nAs a generalization of the Savage-Dickey ratio, BMR allows a post-hoc\nelimination of redundant model weights based on the posterior estimates under a\nstraightforward (non-hierarchical) generative model. Our comparative study\nhighlights the advantages of the BMR method relative to established approaches\nbased on hierarchical horseshoe priors over model weights. 
We illustrate the\npotential of BMR across various deep learning architectures, from classical\nnetworks like LeNet to modern frameworks such as Vision Transformers and\nMLP-Mixers.\n","authors":["Dimitrije Marković","Karl J. Friston","Stefan J. Kiebel"],"pdf_url":"https://arxiv.org/pdf/2309.12095v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.16898v2","updated":"2023-10-27T06:43:55Z","published":"2023-10-25T18:00:26Z","title":"MCUFormer: Deploying Vision Tranformers on Microcontrollers with Limited\n Memory","summary":" Due to the high price and heavy energy consumption of GPUs, deploying deep\nmodels on IoT devices such as microcontrollers makes significant contributions\nfor ecological AI. Conventional methods successfully enable convolutional\nneural network inference of high resolution images on microcontrollers, while\nthe framework for vision transformers that achieve the state-of-the-art\nperformance in many vision applications still remains unexplored. In this\npaper, we propose a hardware-algorithm co-optimizations method called MCUFormer\nto deploy vision transformers on microcontrollers with extremely limited\nmemory, where we jointly design transformer architecture and construct the\ninference operator library to fit the memory resource constraint. More\nspecifically, we generalize the one-shot network architecture search (NAS) to\ndiscover the optimal architecture with highest task performance given the\nmemory budget from the microcontrollers, where we enlarge the existing search\nspace of vision transformers by considering the low-rank decomposition\ndimensions and patch resolution for memory reduction. For the construction of\nthe inference operator library of vision transformers, we schedule the memory\nbuffer during inference through operator integration, patch embedding\ndecomposition, and token overwriting, allowing the memory buffer to be fully\nutilized to adapt to the forward pass of the vision transformer. 
Experimental\nresults demonstrate that our MCUFormer achieves 73.62\\% top-1 accuracy on\nImageNet for image classification with 320KB memory on STM32F746\nmicrocontroller. Code is available at https://github.com/liangyn22/MCUFormer.\n","authors":["Yinan Liang","Ziwei Wang","Xiuwei Xu","Yansong Tang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2310.16898v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.04917v2","updated":"2023-10-27T06:42:31Z","published":"2023-09-10T02:31:50Z","title":"Text-driven Editing of 3D Scenes without Retraining","summary":" Numerous diffusion models have recently been applied to image synthesis and\nediting. However, editing 3D scenes is still in its early stages. It poses\nvarious challenges, such as the requirement to design specific methods for\ndifferent editing types, retraining new models for various 3D scenes, and the\nabsence of convenient human interaction during editing. To tackle these issues,\nwe introduce a text-driven editing method, termed DN2N, which allows for the\ndirect acquisition of a NeRF model with universal editing capabilities,\neliminating the requirement for retraining. Our method employs off-the-shelf\ntext-based editing models of 2D images to modify the 3D scene images, followed\nby a filtering process to discard poorly edited images that disrupt 3D\nconsistency. We then consider the remaining inconsistency as a problem of\nremoving noise perturbation, which can be solved by generating training data\nwith similar perturbation characteristics for training. We further propose\ncross-view regularization terms to help the generalized NeRF model mitigate\nthese perturbations. Our text-driven method allows users to edit a 3D scene\nwith their desired description, which is more friendly, intuitive, and\npractical than prior works. 
Empirical results show that our method achieves\nmultiple editing types, including but not limited to appearance editing,\nweather transition, material changing, and style transfer. Most importantly,\nour method generalizes well with editing abilities shared among a set of model\nparameters without requiring a customized editing model for some specific\nscenes, thus inferring novel views with editing effects directly from user\ninput. The project website is available at https://sk-fun.fun/DN2N\n","authors":["Shuangkang Fang","Yufeng Wang","Yi Yang","Yi-Hsuan Tsai","Wenrui Ding","Shuchang Zhou","Ming-Hsuan Yang"],"pdf_url":"https://arxiv.org/pdf/2309.04917v2.pdf","comment":"Project Website: https://sk-fun.fun/DN2N"},{"id":"http://arxiv.org/abs/2310.17914v1","updated":"2023-10-27T06:15:30Z","published":"2023-10-27T06:15:30Z","title":"3D-Aware Visual Question Answering about Parts, Poses and Occlusions","summary":" Despite rapid progress in Visual question answering (VQA), existing datasets\nand models mainly focus on testing reasoning in 2D. However, it is important\nthat VQA models also understand the 3D structure of visual scenes, for example\nto support tasks like navigation or manipulation. This includes an\nunderstanding of the 3D object pose, their parts and occlusions. In this work,\nwe introduce the task of 3D-aware VQA, which focuses on challenging questions\nthat require a compositional reasoning over the 3D structure of visual scenes.\nWe address 3D-aware VQA from both the dataset and the model perspective. First,\nwe introduce Super-CLEVR-3D, a compositional reasoning dataset that contains\nquestions about object parts, their 3D poses, and occlusions. Second, we\npropose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas:\nprobabilistic neural symbolic program execution for reasoning and deep neural\nnetworks with 3D generative representations of objects for robust visual\nrecognition. 
Our experimental results show our model PO3D-VQA outperforms\nexisting methods significantly, but we still observe a significant performance\ngap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an\nimportant open research area.\n","authors":["Xingrui Wang","Wufei Ma","Zhuowan Li","Adam Kortylewski","Alan Yuille"],"pdf_url":"https://arxiv.org/pdf/2310.17914v1.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2310.17910v1","updated":"2023-10-27T05:59:12Z","published":"2023-10-27T05:59:12Z","title":"DocStormer: Revitalizing Multi-Degraded Colored Document Images to\n Pristine PDF","summary":" For capturing colored document images, e.g. posters and magazines, it is\ncommon that multiple degradations such as shadows, wrinkles, etc., are\nsimultaneously introduced due to external factors. Restoring multi-degraded\ncolored document images is a great challenge, yet overlooked, as most existing\nalgorithms focus on enhancing color-ignored document images via binarization.\nThus, we propose DocStormer, a novel algorithm designed to restore\nmulti-degraded colored documents to their potential pristine PDF. The\ncontributions are: firstly, we propose a \"Perceive-then-Restore\" paradigm with\na reinforced transformer block, which more effectively encodes and utilizes the\ndistribution of degradations. Secondly, we are the first to utilize GAN and\npristine PDF magazine images to narrow the distribution gap between the\nenhanced results and PDF images, in pursuit of less degradation and better\nvisual quality. Thirdly, we propose a non-parametric strategy, PFILI, which\nenables a smaller training scale and larger testing resolutions with acceptable\ndetail trade-off, while saving memory and inference time. Fourthly, we are the\nfirst to propose a novel Multi-Degraded Colored Document image Enhancing\ndataset, named MD-CDE, for both training and evaluation. 
Experimental results\nshow that the DocStormer exhibits superior performance, capable of revitalizing\nmulti-degraded colored documents into their potential pristine digital\nversions, which fills the current academic gap from the perspective of method,\ndata, and task.\n","authors":["Chaowei Liu","Jichun Li","Yihua Teng","Chaoqun Wang","Nuo Xu","Jihao Wu","Dandan Tu"],"pdf_url":"https://arxiv.org/pdf/2310.17910v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13720v7","updated":"2023-10-27T05:19:32Z","published":"2023-06-23T18:08:00Z","title":"Decoupled Diffusion Models: Image to Zero and Zero to Noise","summary":" Recent diffusion probabilistic models (DPMs) have shown remarkable abilities\nof generated content, however, they often suffer from complex forward\nprocesses, resulting in inefficient solutions for the reversed process and\nprolonged sampling times. In this paper, we aim to address the aforementioned\nchallenges by focusing on the diffusion process itself that we propose to\ndecouple the intricate diffusion process into two comparatively simpler process\nto improve the generative efficacy and speed. In particular, we present a novel\ndiffusion paradigm named DDM (Decoupled Diffusion Models) based on the Ito\ndiffusion process, in which the image distribution is approximated by an\nexplicit transition probability while the noise path is controlled by the\nstandard Wiener process. We find that decoupling the diffusion process reduces\nthe learning difficulty and the explicit transition probability improves the\ngenerative speed significantly. We prove a new training objective for DPM,\nwhich enables the model to learn to predict the noise and image components\nseparately. Moreover, given the novel forward diffusion equation, we derive the\nreverse denoising formula of DDM that naturally supports fewer steps of\ngeneration without ordinary differential equation (ODE) based accelerators. 
Our\nexperiments demonstrate that DDM outperforms previous DPMs by a large margin in\nfewer function evaluations setting and gets comparable performances in long\nfunction evaluations setting. We also show that our framework can be applied to\nimage-conditioned generation and high-resolution image synthesis, and that it\ncan generate high-quality images with only 10 function evaluations.\n","authors":["Yuhang Huang","Zheng Qin","Xinwang Liu","Kai Xu"],"pdf_url":"https://arxiv.org/pdf/2306.13720v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17255v2","updated":"2023-10-27T04:34:33Z","published":"2023-10-26T09:11:55Z","title":"Generalizing to Unseen Domains in Diabetic Retinopathy Classification","summary":" Diabetic retinopathy (DR) is caused by long-standing diabetes and is among\nthe fifth leading cause for visual impairments. The process of early diagnosis\nand treatments could be helpful in curing the disease, however, the detection\nprocedure is rather challenging and mostly tedious. Therefore, automated\ndiabetic retinopathy classification using deep learning techniques has gained\ninterest in the medical imaging community. Akin to several other real-world\napplications of deep learning, the typical assumption of i.i.d data is also\nviolated in DR classification that relies on deep learning. Therefore,\ndeveloping DR classification methods robust to unseen distributions is of great\nvalue. In this paper, we study the problem of generalizing a model to unseen\ndistributions or domains (a.k.a domain generalization) in DR classification. To\nthis end, we propose a simple and effective domain generalization (DG) approach\nthat achieves self-distillation in vision transformers (ViT) via a novel\nprediction softening mechanism. This prediction softening is an adaptive convex\ncombination one-hot labels with the model's own knowledge. 
We perform extensive\nexperiments on challenging open-source DR classification datasets under both\nmulti-source and single-source DG settings with three different ViT backbones\nto establish the efficacy and applicability of our approach against competing\nmethods. For the first time, we report the performance of several\nstate-of-the-art DG methods on open-source DR classification datasets after\nconducting thorough experiments. Finally, our method is also capable of\ndelivering improved calibration performance than other methods, showing its\nsuitability for safety-critical applications, including healthcare. We hope\nthat our contributions would investigate more DG research across the medical\nimaging community.\n","authors":["Chamuditha Jayanga Galappaththige","Gayal Kuruppu","Muhammad Haris Khan"],"pdf_url":"https://arxiv.org/pdf/2310.17255v2.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2310.17887v1","updated":"2023-10-27T04:30:18Z","published":"2023-10-27T04:30:18Z","title":"Impressions: Understanding Visual Semiotics and Aesthetic Impact","summary":" Is aesthetic impact different from beauty? Is visual salience a reflection of\nits capacity for effective communication? We present Impressions, a novel\ndataset through which to investigate the semiotics of images, and how specific\nvisual features and design choices can elicit specific emotions, thoughts and\nbeliefs. We posit that the impactfulness of an image extends beyond formal\ndefinitions of aesthetics, to its success as a communicative act, where style\ncontributes as much to meaning formation as the subject matter. However, prior\nimage captioning datasets are not designed to empower state-of-the-art\narchitectures to model potential human impressions or interpretations of\nimages. 
To fill this gap, we design an annotation task heavily inspired by\nimage analysis techniques in the Visual Arts to collect 1,440 image-caption\npairs and 4,320 unique annotations exploring impact, pragmatic image\ndescription, impressions, and aesthetic design choices. We show that existing\nmultimodal image captioning and conditional generation models struggle to\nsimulate plausible human responses to images. However, this dataset\nsignificantly improves their ability to model impressions and aesthetic\nevaluations of images through fine-tuning and few-shot adaptation.\n","authors":["Julia Kruk","Caleb Ziems","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.17887v1.pdf","comment":"To be published in EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17880v1","updated":"2023-10-27T03:52:08Z","published":"2023-10-27T03:52:08Z","title":"Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D\n Scene Representations","summary":" Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations,\ncapable of high quality novel view synthesis of complex scenes. While NeRFs\nhave been applied to graphics, vision, and robotics, problems with slow\nrendering speed and characteristic visual artifacts prevent adoption in many\nuse cases. In this work, we investigate combining an autoencoder (AE) with a\nNeRF, in which latent features (instead of colours) are rendered and then\nconvolutionally decoded. The resulting latent-space NeRF can produce novel\nviews with higher quality than standard colour-space NeRFs, as the AE can\ncorrect certain visual artifacts, while rendering over three times faster. Our\nwork is orthogonal to other techniques for improving NeRF efficiency. Further,\nwe can control the tradeoff between efficiency and image quality by shrinking\nthe AE architecture, achieving over 13 times faster rendering with only a small\ndrop in performance. 
We hope that our approach can form the basis of an\nefficient, yet high-fidelity, 3D scene representation for downstream tasks,\nespecially when retaining differentiability is useful, as in many robotics\nscenarios requiring continual learning.\n","authors":["Tristan Aumentado-Armstrong","Ashkan Mirzaei","Marcus A. Brubaker","Jonathan Kelly","Alex Levinshtein","Konstantinos G. Derpanis","Igor Gilitschenski"],"pdf_url":"https://arxiv.org/pdf/2310.17880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17875v1","updated":"2023-10-27T03:32:05Z","published":"2023-10-27T03:32:05Z","title":"Siamese-DETR for Generic Multi-Object Tracking","summary":" The ability to detect and track the dynamic objects in different scenes is\nfundamental to real-world applications, e.g., autonomous driving and robot\nnavigation. However, traditional Multi-Object Tracking (MOT) is limited to\ntracking objects belonging to the pre-defined closed-set categories. Recently,\nOpen-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) are proposed to track\ninterested objects beyond pre-defined categories with the given text prompt and\ntemplate image. However, the expensive well pre-trained (vision-)language model\nand fine-grained category annotations are required to train OVMOT models. In\nthis paper, we focus on GMOT and propose a simple but effective method,\nSiamese-DETR, for GMOT. Only the commonly used detection datasets (e.g., COCO)\nare required for training. Different from existing GMOT methods, which train a\nSingle Object Tracking (SOT) based detector to detect interested objects and\nthen apply a data association based MOT tracker to get the trajectories, we\nleverage the inherent object queries in DETR variants. 
Specifically: 1) The\nmulti-scale object queries are designed based on the given template image,\nwhich are effective for detecting different scales of objects with the same\ncategory as the template image; 2) A dynamic matching training strategy is\nintroduced to train Siamese-DETR on commonly used detection datasets, which\ntakes full advantage of provided annotations; 3) The online tracking pipeline\nis simplified through a tracking-by-query manner by incorporating the tracked\nboxes in previous frame as additional query boxes. The complex data association\nis replaced with the much simpler Non-Maximum Suppression (NMS). Extensive\nexperimental results show that Siamese-DETR surpasses existing MOT methods on\nGMOT-40 dataset by a large margin.\n","authors":["Qiankun Liu","Yichen Li","Yuqi Jiang","Ying Fu"],"pdf_url":"https://arxiv.org/pdf/2310.17875v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17874v1","updated":"2023-10-27T03:29:25Z","published":"2023-10-27T03:29:25Z","title":"SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation","summary":" Unsupervised semantic segmentation is a challenging task that segments images\ninto semantic groups without manual annotation. Prior works have primarily\nfocused on leveraging prior knowledge of semantic consistency or priori\nconcepts from self-supervised learning methods, which often overlook the\ncoherence property of image segments. In this paper, we demonstrate that the\nsmoothness prior, asserting that close features in a metric space share the\nsame semantics, can significantly simplify segmentation by casting unsupervised\nsemantic segmentation as an energy minimization problem. Under this paradigm,\nwe propose a novel approach called SmooSeg that harnesses self-supervised\nlearning methods to model the closeness relationships among observations as\nsmoothness signals. 
To effectively discover coherent semantic segments, we\nintroduce a novel smoothness loss that promotes piecewise smoothness within\nsegments while preserving discontinuities across different segments.\nAdditionally, to further enhance segmentation quality, we design an asymmetric\nteacher-student style predictor that generates smoothly updated pseudo labels,\nfacilitating an optimal fit between observations and labeling outputs. Thanks\nto the rich supervision cues of the smoothness prior, our SmooSeg significantly\noutperforms STEGO in terms of pixel accuracy on three datasets: COCOStuff\n(+14.9%), Cityscapes (+13.0%), and Potsdam-3 (+5.7%).\n","authors":["Mengcheng Lan","Xinjiang Wang","Yiping Ke","Jiaxing Xu","Litong Feng","Wayne Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17874v1.pdf","comment":"Accepted by NeurIPS 2023. Code available:\n https://github.com/mc-lan/SmooSeg"},{"id":"http://arxiv.org/abs/2305.18499v2","updated":"2023-10-27T03:28:48Z","published":"2023-05-29T14:29:12Z","title":"Pre-training Contextualized World Models with In-the-wild Videos for\n Reinforcement Learning","summary":" Unsupervised pre-training methods utilizing large and diverse datasets have\nachieved tremendous success across a range of domains. Recent work has\ninvestigated such unsupervised pre-training methods for model-based\nreinforcement learning (MBRL) but is limited to domain-specific or simulated\ndata. In this paper, we study the problem of pre-training world models with\nabundant in-the-wild videos for efficient learning of downstream visual control\ntasks. However, in-the-wild videos are complicated with various contextual\nfactors, such as intricate backgrounds and textured appearance, which precludes\na world model from extracting shared world knowledge to generalize better. 
To\ntackle this issue, we introduce Contextualized World Models (ContextWM) that\nexplicitly separate context and dynamics modeling to overcome the complexity\nand diversity of in-the-wild videos and facilitate knowledge transfer between\ndistinct scenes. Specifically, a contextualized extension of the latent\ndynamics model is elaborately realized by incorporating a context encoder to\nretain contextual information and empower the image decoder, which encourages\nthe latent dynamics model to concentrate on essential temporal variations. Our\nexperiments show that in-the-wild video pre-training equipped with ContextWM\ncan significantly improve the sample efficiency of MBRL in various domains,\nincluding robotic manipulation, locomotion, and autonomous driving. Code is\navailable at this repository: https://github.com/thuml/ContextWM.\n","authors":["Jialong Wu","Haoyu Ma","Chaoyi Deng","Mingsheng Long"],"pdf_url":"https://arxiv.org/pdf/2305.18499v2.pdf","comment":"NeurIPS 2023. Code is available at https://github.com/thuml/ContextWM"},{"id":"http://arxiv.org/abs/2308.04808v2","updated":"2023-10-27T03:20:54Z","published":"2023-08-09T09:02:47Z","title":"Joint-Relation Transformer for Multi-Person Motion Prediction","summary":" Multi-person motion prediction is a challenging problem due to the dependency\nof motion on both individual past movements and interactions with other people.\nTransformer-based methods have shown promising results on this task, but they\nmiss the explicit relation representation between joints, such as skeleton\nstructure and pairwise distance, which is crucial for accurate interaction\nmodeling. In this paper, we propose the Joint-Relation Transformer, which\nutilizes relation information to enhance interaction modeling and improve\nfuture motion prediction. Our relation information contains the relative\ndistance and the intra-/inter-person physical constraints. 
To fuse relation and\njoint information, we design a novel joint-relation fusion layer with\nrelation-aware attention to update both features. Additionally, we supervise\nthe relation information by forecasting future distance. Experiments show that\nour method achieves a 13.4% improvement of 900ms VIM on 3DPW-SoMoF/RC and\n17.8%/12.0% improvement of 3s MPJPE on CMU-Mpcap/MuPoTS-3D dataset.\n","authors":["Qingyao Xu","Weibo Mao","Jingze Gong","Chenxin Xu","Siheng Chen","Weidi Xie","Ya Zhang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2308.04808v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17975v2","updated":"2023-10-27T03:13:45Z","published":"2023-05-29T09:33:43Z","title":"Jigsaw: Learning to Assemble Multiple Fractured Objects","summary":" Automated assembly of 3D fractures is essential in orthopedics, archaeology,\nand our daily life. This paper presents Jigsaw, a novel framework for\nassembling physically broken 3D objects from multiple pieces. Our approach\nleverages hierarchical features of global and local geometry to match and align\nthe fracture surfaces. Our framework consists of four components: (1) front-end\npoint feature extractor with attention layers, (2) surface segmentation to\nseparate fracture and original parts, (3) multi-parts matching to find\ncorrespondences among fracture surface points, and (4) robust global alignment\nto recover the global poses of the pieces. We show how to jointly learn\nsegmentation and matching and seamlessly integrate feature matching and\nrigidity constraints. We evaluate Jigsaw on the Breaking Bad dataset and\nachieve superior performance compared to state-of-the-art methods. Our method\nalso generalizes well to diverse fracture modes, objects, and unseen instances.\nTo the best of our knowledge, this is the first learning-based method designed\nspecifically for 3D fracture assembly over multiple pieces. 
Our code is\navailable at https://jiaxin-lu.github.io/Jigsaw/.\n","authors":["Jiaxin Lu","Yifan Sun","Qixing Huang"],"pdf_url":"https://arxiv.org/pdf/2305.17975v2.pdf","comment":"18 pages, 9 figures, NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17869v1","updated":"2023-10-27T03:07:05Z","published":"2023-10-27T03:07:05Z","title":"Grid Jigsaw Representation with CLIP: A New Perspective on Image\n Clustering","summary":" Unsupervised representation learning for image clustering is essential in\ncomputer vision. Although the advancement of visual models has improved image\nclustering with efficient visual representations, challenges still remain.\nFirstly, these features often lack the ability to represent the internal\nstructure of images, hindering the accurate clustering of visually similar\nimages. Secondly, the existing features tend to lack finer-grained semantic\nlabels, limiting the ability to capture nuanced differences and similarities\nbetween images.\n In this paper, we first introduce Jigsaw based strategy method for image\nclustering called Grid Jigsaw Representation (GJR) with systematic exposition\nfrom pixel to feature in discrepancy against human and computer. We emphasize\nthat this algorithm, which mimics human jigsaw puzzle, can effectively improve\nthe model to distinguish the spatial feature between different samples and\nenhance the clustering ability. GJR modules are appended to a variety of deep\nconvolutional networks and tested with significant improvements on a wide range\nof benchmark datasets including CIFAR-10, CIFAR-100/20, STL-10, ImageNet-10 and\nImageNetDog-15.\n On the other hand, convergence efficiency is always an important challenge\nfor unsupervised image clustering. Recently, pretrained representation learning\nhas made great progress and released models can extract mature visual\nrepresentations. 
It is obvious that use the pretrained model as feature\nextractor can speed up the convergence of clustering where our aim is to\nprovide new perspective in image clustering with reasonable resource\napplication and provide new baseline. Further, we innovate pretrain-based Grid\nJigsaw Representation (pGJR) with improvement by GJR. The experiment results\nshow the effectiveness on the clustering task with respect to the ACC, NMI and\nARI three metrics and super fast convergence speed.\n","authors":["Zijie Song","Zhenzhen Hu","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2310.17869v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12868v2","updated":"2023-10-27T02:34:05Z","published":"2023-07-24T15:06:42Z","title":"Understanding the Latent Space of Diffusion Models through the Lens of\n Riemannian Geometry","summary":" Despite the success of diffusion models (DMs), we still lack a thorough\nunderstanding of their latent space. To understand the latent space\n$\\mathbf{x}_t \\in \\mathcal{X}$, we analyze them from a geometrical perspective.\nOur approach involves deriving the local latent basis within $\\mathcal{X}$ by\nleveraging the pullback metric associated with their encoding feature maps.\nRemarkably, our discovered local latent basis enables image editing\ncapabilities by moving $\\mathbf{x}_t$, the latent space of DMs, along the basis\nvector at specific timesteps. We further analyze how the geometric structure of\nDMs evolves over diffusion timesteps and differs across different text\nconditions. This confirms the known phenomenon of coarse-to-fine generation, as\nwell as reveals novel insights such as the discrepancy between $\\mathbf{x}_t$\nacross timesteps, the effect of dataset complexity, and the time-varying\ninfluence of text prompts. 
To the best of our knowledge, this paper is the\nfirst to present image editing through $\\mathbf{x}$-space traversal, editing\nonly once at specific timestep $t$ without any additional training, and\nproviding thorough analyses of the latent structure of DMs. The code to\nreproduce our experiments can be found at\nhttps://github.com/enkeejunior1/Diffusion-Pullback.\n","authors":["Yong-Hyun Park","Mingi Kwon","Jaewoong Choi","Junghyo Jo","Youngjung Uh"],"pdf_url":"https://arxiv.org/pdf/2307.12868v2.pdf","comment":"This paper has been accepted for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17842v1","updated":"2023-10-27T01:46:37Z","published":"2023-10-27T01:46:37Z","title":"What You See Is What You Detect: Towards better Object Densification in\n 3D detection","summary":" Recent works have demonstrated the importance of object completion in 3D\nPerception from Lidar signal. Several methods have been proposed in which\nmodules were used to densify the point clouds produced by laser scanners,\nleading to better recall and more accurate results. Pursuing in that direction,\nwe present, in this work, a counter-intuitive perspective: the widely-used\nfull-shape completion approach actually leads to a higher error-upper bound\nespecially for far away objects and small objects like pedestrians. Based on\nthis observation, we introduce a visible part completion method that requires\nonly 11.3\\% of the prediction points that previous methods generate. To recover\nthe dense representation, we propose a mesh-deformation-based method to augment\nthe point set associated with visible foreground objects. Considering that our\napproach focuses only on the visible part of the foreground objects to achieve\naccurate 3D detection, we named our method What You See Is What You Detect\n(WYSIWYD). 
Our proposed method is thus a detector-independent model that\nconsists of 2 parts: an Intra-Frustum Segmentation Transformer (IFST) and a\nMesh Depth Completion Network(MDCNet) that predicts the foreground depth from\nmesh deformation. This way, our model does not require the time-consuming\nfull-depth completion task used by most pseudo-lidar-based methods. Our\nexperimental evaluation shows that our approach can provide up to 12.2\\%\nperformance improvements over most of the public baseline models on the KITTI\nand NuScenes dataset bringing the state-of-the-art to a new level. The codes\nwill be available at\nhttps://github.com/Orbis36/WYSIWYD\n","authors":["Tianran Liu","Zeping Zhang Morteza Mousa Pasandi","Robert Laganiere"],"pdf_url":"https://arxiv.org/pdf/2310.17842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15328v2","updated":"2023-10-27T01:44:27Z","published":"2023-05-24T16:42:17Z","title":"Visual Programming for Text-to-Image Generation and Evaluation","summary":" As large language models have demonstrated impressive performance in many\ndomains, recent works have adopted language models (LMs) as controllers of\nvisual modules for vision-and-language tasks. While existing work focuses on\nequipping LMs with visual understanding, we propose two novel\ninterpretable/explainable visual programming frameworks for text-to-image (T2I)\ngeneration and evaluation. First, we introduce VPGen, an interpretable\nstep-by-step T2I generation framework that decomposes T2I generation into three\nsteps: object/count generation, layout generation, and image generation. We\nemploy an LM to handle the first two steps (object/count generation and layout\ngeneration), by finetuning it on text-layout pairs. Our step-by-step T2I\ngeneration framework provides stronger spatial control than end-to-end models,\nthe dominant approach for this task. 
Furthermore, we leverage the world\nknowledge of pretrained LMs, overcoming the limitation of previous\nlayout-guided T2I works that can only handle predefined object classes. We\ndemonstrate that our VPGen has improved control in counts/spatial\nrelations/scales of objects than state-of-the-art T2I generation models.\nSecond, we introduce VPEval, an interpretable and explainable evaluation\nframework for T2I generation based on visual programming. Unlike previous T2I\nevaluations with a single scoring model that is accurate in some skills but\nunreliable in others, VPEval produces evaluation programs that invoke a set of\nvisual modules that are experts in different skills, and also provides\nvisual+textual explanations of the evaluation results. Our analysis shows that\nVPEval provides a more human-correlated evaluation for skill-specific and\nopen-ended prompts than widely used single model-based evaluation. We hope that\nour work encourages future progress on interpretable/explainable generation and\nevaluation for T2I models.\n","authors":["Jaemin Cho","Abhay Zala","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2305.15328v2.pdf","comment":"NeurIPS 2023; Project website: https://vp-t2i.github.io"},{"id":"http://arxiv.org/abs/2310.17835v1","updated":"2023-10-27T01:17:48Z","published":"2023-10-27T01:17:48Z","title":"One Style is All you Need to Generate a Video","summary":" In this paper, we propose a style-based conditional video generative model.\nWe introduce a novel temporal generator based on a set of learned sinusoidal\nbases. 
Our method learns dynamic representations of various actions that are\nindependent of image content and can be transferred between different actors.\nBeyond the significant enhancement of video quality compared to prevalent\nmethods, we demonstrate that the disentangled dynamic and content permit their\nindependent manipulation, as well as temporal GAN-inversion to retrieve and\ntransfer a video motion from one content or identity to another without further\npreprocessing such as landmark points.\n","authors":["Sandeep Manandhar","Auguste Genovesio"],"pdf_url":"https://arxiv.org/pdf/2310.17835v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.01708v2","updated":"2023-10-27T01:09:31Z","published":"2023-06-02T17:31:32Z","title":"TIES-Merging: Resolving Interference When Merging Models","summary":" Transfer learning - i.e., further fine-tuning a pre-trained model on a\ndownstream task - can confer significant advantages, including improved\ndownstream performance, faster convergence, and better sample efficiency. These\nadvantages have led to a proliferation of task-specific fine-tuned models,\nwhich typically can only perform a single task and do not benefit from one\nanother. Recently, model merging techniques have emerged as a solution to\ncombine multiple task-specific models into a single multitask model without\nperforming additional training. However, existing merging methods often ignore\nthe interference between parameters of different models, resulting in large\nperformance drops when merging multiple models. In this paper, we demonstrate\nthat prior merging techniques inadvertently lose valuable information due to\ntwo major sources of interference: (a) interference due to redundant parameter\nvalues and (b) disagreement on the sign of a given parameter's values across\nmodels. 
To address this, we propose our method, TRIM, ELECT SIGN & MERGE\n(TIES-Merging), which introduces three novel steps when merging models: (1)\nresetting parameters that only changed a small amount during fine-tuning, (2)\nresolving sign conflicts, and (3) merging only the parameters that are in\nalignment with the final agreed-upon sign. We find that TIES-Merging\noutperforms several existing methods in diverse settings covering a range of\nmodalities, domains, number of tasks, model sizes, architectures, and\nfine-tuning settings. We further analyze the impact of different types of\ninterference on model parameters, and highlight the importance of resolving\nsign interference. Our code is available at\nhttps://github.com/prateeky2806/ties-merging\n","authors":["Prateek Yadav","Derek Tam","Leshem Choshen","Colin Raffel","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2306.01708v2.pdf","comment":"Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables"},{"id":"http://arxiv.org/abs/2308.00806v2","updated":"2023-10-27T00:58:35Z","published":"2023-08-01T19:44:31Z","title":"Addressing Uncertainty in Imbalanced Histopathology Image Classification\n of HER2 Breast Cancer: An interpretable Ensemble Approach with Threshold\n Filtered Single Instance Evaluation (SIE)","summary":" Breast Cancer (BC) is among women's most lethal health concerns. Early\ndiagnosis can alleviate the mortality rate by helping patients make efficient\ntreatment decisions. Human Epidermal Growth Factor Receptor (HER2) has become\none the most lethal subtype of BC. According to the College of American\nPathologists American Society of Clinical Oncology (CAP/ASCO), the severity\nlevel of HER2 expression can be classified between 0 and 3+ range. HER2 can be\ndetected effectively from immunohistochemical (IHC) and, hematoxylin & eosin\n(HE) images of different classes such as 0, 1+, 2+, and 3+. 
An ensemble\napproach integrated with threshold filtered single instance evaluation (SIE)\ntechnique has been proposed in this study to diagnose BC from the\nmulti-categorical expression of HER2 subtypes. Initially, DenseNet201 and\nXception have been ensembled into a single classifier as feature extractors\nwith an effective combination of global average pooling, dropout layer, dense\nlayer with a swish activation function, and l2 regularizer, batch\nnormalization, etc. After that, extracted features has been processed through\nsingle instance evaluation (SIE) to determine different confidence levels and\nadjust decision boundary among the imbalanced classes. This study has been\nconducted on the BC immunohistochemical (BCI) dataset, which is classified by\npathologists into four stages of HER2 BC. This proposed approach known as\nDenseNet201-Xception-SIE with a threshold value of 0.7 surpassed all other\nexisting state-of-art models with an accuracy of 97.12%, precision of 97.15%,\nand recall of 97.68% on H&E data and, accuracy of 97.56%, precision of 97.57%,\nand recall of 98.00% on IHC data respectively, maintaining momentous\nimprovement. Finally, Grad-CAM and Guided Grad-CAM have been employed in this\nstudy to interpret, how TL-based model works on the histopathology dataset and\nmake decisions from the data.\n","authors":["Md Sakib Hossain Shovon","M. F. Mridha","Khan Md Hasib","Sultan Alfarhood","Mejdl Safran","Dunren Che"],"pdf_url":"https://arxiv.org/pdf/2308.00806v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07280v2","updated":"2023-10-27T00:36:03Z","published":"2023-06-12T17:59:23Z","title":"Controlling Text-to-Image Diffusion by Orthogonal Finetuning","summary":" Large text-to-image diffusion models have impressive capabilities in\ngenerating photorealistic images from text prompts. How to effectively guide or\ncontrol these powerful models to perform different downstream tasks becomes an\nimportant open problem. 
To tackle this challenge, we introduce a principled\nfinetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image\ndiffusion models to downstream tasks. Unlike existing methods, OFT can provably\npreserve hyperspherical energy which characterizes the pairwise neuron\nrelationship on the unit hypersphere. We find that this property is crucial for\npreserving the semantic generation ability of text-to-image diffusion models.\nTo improve finetuning stability, we further propose Constrained Orthogonal\nFinetuning (COFT) which imposes an additional radius constraint to the\nhypersphere. Specifically, we consider two important finetuning text-to-image\ntasks: subject-driven generation where the goal is to generate subject-specific\nimages given a few images of a subject and a text prompt, and controllable\ngeneration where the goal is to enable the model to take in additional control\nsignals. We empirically show that our OFT framework outperforms existing\nmethods in generation quality and convergence speed.\n","authors":["Zeju Qiu","Weiyang Liu","Haiwen Feng","Yuxuan Xue","Yao Feng","Zhen Liu","Dan Zhang","Adrian Weller","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2306.07280v2.pdf","comment":"NeurIPS 2023 (43 pages, 34 figures, project page:\n https://oft.wyliu.com/)"},{"id":"http://arxiv.org/abs/2304.12483v3","updated":"2023-10-27T00:29:01Z","published":"2023-04-24T22:47:52Z","title":"Towards Realistic Generative 3D Face Models","summary":" In recent years, there has been significant progress in 2D generative face\nmodels fueled by applications such as animation, synthetic data generation, and\ndigital avatars. However, due to the absence of 3D information, these 2D models\noften struggle to accurately disentangle facial attributes like pose,\nexpression, and illumination, limiting their editing capabilities. 
To address\nthis limitation, this paper proposes a 3D controllable generative face model to\nproduce high-quality albedo and precise 3D shape leveraging existing 2D\ngenerative models. By combining 2D face generative models with semantic face\nmanipulation, this method enables editing of detailed 3D rendered faces. The\nproposed framework utilizes an alternating descent optimization approach over\nshape and albedo. Differentiable rendering is used to train high-quality shapes\nand albedo without 3D supervision. Moreover, this approach outperforms the\nstate-of-the-art (SOTA) methods in the well-known NoW benchmark for shape\nreconstruction. It also outperforms the SOTA reconstruction models in\nrecovering rendered faces' identities across novel poses by an average of 10%.\nAdditionally, the paper demonstrates direct control of expressions in 3D faces\nby exploiting latent space leading to text-based editing of 3D faces.\n","authors":["Aashish Rai","Hiresh Gupta","Ayush Pandey","Francisco Vicente Carrasco","Shingo Jason Takagi","Amaury Aubel","Daeil Kim","Aayush Prakash","Fernando de la Torre"],"pdf_url":"https://arxiv.org/pdf/2304.12483v3.pdf","comment":"Preprint"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2309.14208v2","updated":"2023-10-27T17:59:33Z","published":"2023-09-25T15:11:52Z","title":"Framework based on complex networks to model and mine patient pathways","summary":" The automatic discovery of a model to represent the history of encounters of\na group of patients with the healthcare system -- the so-called \"pathway of\npatients\" -- is a new field of research that supports clinical and\norganisational decisions to improve the quality and efficiency of the treatment\nprovided. The pathways of patients with chronic conditions tend to vary\nsignificantly from one person to another, have repetitive tasks, and demand the\nanalysis of multiple perspectives (interventions, diagnoses, medical\nspecialities, among others) influencing the results. 
Therefore, modelling and\nmining those pathways is still a challenging task. In this work, we propose a\nframework comprising: (i) a pathway model based on a multi-aspect graph, (ii) a\nnovel dissimilarity measurement to compare pathways taking the elapsed time\ninto account, and (iii) a mining method based on traditional centrality\nmeasures to discover the most relevant steps of the pathways. We evaluated the\nframework using the study cases of pregnancy and diabetes, which revealed its\nusefulness in finding clusters of similar pathways, representing them in an\neasy-to-interpret way, and highlighting the most significant patterns according\nto multiple perspectives.\n","authors":["Caroline de Oliveira Costa Souza Rosa","Márcia Ito","Alex Borges Vieira","Klaus Wehmuth","Antônio Tadeu Azevedo Gomes"],"pdf_url":"https://arxiv.org/pdf/2309.14208v2.pdf","comment":"35 pages, 11 figures, 2 appendices"},{"id":"http://arxiv.org/abs/2305.14815v2","updated":"2023-10-27T15:43:53Z","published":"2023-05-24T07:09:56Z","title":"Machine Reading Comprehension using Case-based Reasoning","summary":" We present an accurate and interpretable method for answer extraction in\nmachine reading comprehension that is reminiscent of case-based reasoning (CBR)\nfrom classical AI. Our method (CBR-MRC) builds upon the hypothesis that\ncontextualized answers to similar questions share semantic similarities with\neach other. Given a test question, CBR-MRC first retrieves a set of similar\ncases from a non-parametric memory and then predicts an answer by selecting the\nspan in the test context that is most similar to the contextualized\nrepresentations of answers in the retrieved cases. The semi-parametric nature\nof our approach allows it to attribute a prediction to the specific set of\nevidence cases, making it a desirable choice for building reliable and\ndebuggable QA systems. 
We show that CBR-MRC provides high accuracy comparable\nwith large reader models and outperforms baselines by 11.5 and 8.4 EM on\nNaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability\nof CBR-MRC in identifying not just the correct answer tokens but also the span\nwith the most relevant supporting evidence. Lastly, we observe that contexts\nfor certain question types show higher lexical diversity than others and find\nthat CBR-MRC is robust to these variations while performance using\nfully-parametric methods drops.\n","authors":["Dung Thai","Dhruv Agarwal","Mudit Chaudhary","Rajarshi Das","Manzil Zaheer","Jay-Yoon Lee","Hannaneh Hajishirzi","Andrew McCallum"],"pdf_url":"https://arxiv.org/pdf/2305.14815v2.pdf","comment":"9 pages, 2 figures"},{"id":"http://arxiv.org/abs/2304.09542v2","updated":"2023-10-27T12:11:16Z","published":"2023-04-19T10:16:03Z","title":"Is ChatGPT Good at Search? Investigating Large Language Models as\n Re-Ranking Agents","summary":" Large Language Models (LLMs) have demonstrated remarkable zero-shot\ngeneralization across various language-related tasks, including search engines.\nHowever, existing work utilizes the generative ability of LLMs for Information\nRetrieval (IR) rather than direct passage ranking. The discrepancy between the\npre-training objectives of LLMs and the ranking objective poses another\nchallenge. In this paper, we first investigate generative LLMs such as ChatGPT\nand GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal\nthat properly instructed LLMs can deliver competitive, even superior results to\nstate-of-the-art supervised methods on popular IR benchmarks. Furthermore, to\naddress concerns about data contamination of LLMs, we collect a new test set\ncalled NovelEval, based on the latest knowledge and aiming to verify the\nmodel's ability to rank unknown knowledge. 
Finally, to improve efficiency in\nreal-world applications, we delve into the potential for distilling the ranking\ncapabilities of ChatGPT into small specialized models using a permutation\ndistillation scheme. Our evaluation results turn out that a distilled 440M\nmodel outperforms a 3B supervised model on the BEIR benchmark. The code to\nreproduce our results is available at www.github.com/sunnweiwei/RankGPT.\n","authors":["Weiwei Sun","Lingyong Yan","Xinyu Ma","Shuaiqiang Wang","Pengjie Ren","Zhumin Chen","Dawei Yin","Zhaochun Ren"],"pdf_url":"https://arxiv.org/pdf/2304.09542v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18004v1","updated":"2023-10-27T09:24:38Z","published":"2023-10-27T09:24:38Z","title":"Text2Bundle: Towards Personalized Query-based Bundle Generation","summary":" Bundle generation aims to provide a bundle of items for the user, and has\nbeen widely studied and applied on online service platforms. Existing bundle\ngeneration methods mainly utilized user's preference from historical\ninteractions in common recommendation paradigm, and ignored the potential\ntextual query which is user's current explicit intention. There can be a\nscenario in which a user proactively queries a bundle with some natural\nlanguage description, the system should be able to generate a bundle that\nexactly matches the user's intention through the user's query and preferences.\nIn this work, we define this user-friendly scenario as Query-based Bundle\nGeneration task and propose a novel framework Text2Bundle that leverages both\nthe user's short-term interests from the query and the user's long-term\npreferences from the historical interactions. 
Our framework consists of three\nmodules: (1) a query interest extractor that mines the user's fine-grained\ninterests from the query; (2) a unified state encoder that learns the current\nbundle context state and the user's preferences based on historical interaction\nand current query; and (3) a bundle generator that generates personalized and\ncomplementary bundles using a reinforcement learning with specifically designed\nrewards. We conduct extensive experiments on three real-world datasets and\ndemonstrate the effectiveness of our framework compared with several\nstate-of-the-art methods.\n","authors":["Shixuan Zhu","Chuan Cui","JunTong Hu","Qi Shen","Yu Ji","Zhihua Wei"],"pdf_url":"https://arxiv.org/pdf/2310.18004v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.10149v3","updated":"2023-10-27T09:11:50Z","published":"2023-04-20T08:16:07Z","title":"Is ChatGPT a Good Recommender? A Preliminary Study","summary":" Recommendation systems have witnessed significant advancements and have been\nwidely used over the past decades. However, most traditional recommendation\nmethods are task-specific and therefore lack efficient generalization ability.\nRecently, the emergence of ChatGPT has significantly advanced NLP tasks by\nenhancing the capabilities of conversational models. Nonetheless, the\napplication of ChatGPT in the recommendation domain has not been thoroughly\ninvestigated. In this paper, we employ ChatGPT as a general-purpose\nrecommendation model to explore its potential for transferring extensive\nlinguistic and world knowledge acquired from large-scale corpora to\nrecommendation scenarios. Specifically, we design a set of prompts and evaluate\nChatGPT's performance on five recommendation scenarios. Unlike traditional\nrecommendation methods, we do not fine-tune ChatGPT during the entire\nevaluation process, relying only on the prompts themselves to convert\nrecommendation tasks into natural language tasks. 
Further, we explore the use\nof few-shot prompting to inject interaction information that contains user\npotential interest to help ChatGPT better understand user needs and interests.\nComprehensive experimental results on Amazon Beauty dataset show that ChatGPT\nhas achieved promising results in certain tasks and is capable of reaching the\nbaseline level in others. We conduct human evaluations on two\nexplainability-oriented tasks to more accurately evaluate the quality of\ncontents generated by different models. And the human evaluations show ChatGPT\ncan truly understand the provided information and generate clearer and more\nreasonable results. We hope that our study can inspire researchers to further\nexplore the potential of language models like ChatGPT to improve recommendation\nperformance and contribute to the advancement of the recommendation systems\nfield.\n","authors":["Junling Liu","Chao Liu","Peilin Zhou","Renjie Lv","Kang Zhou","Yan Zhang"],"pdf_url":"https://arxiv.org/pdf/2304.10149v3.pdf","comment":"Accepted by CIKM 2023 GenRec Workshop"},{"id":"http://arxiv.org/abs/2310.17922v1","updated":"2023-10-27T06:36:31Z","published":"2023-10-27T06:36:31Z","title":"Chain-of-Choice Hierarchical Policy Learning for Conversational\n Recommendation","summary":" Conversational Recommender Systems (CRS) illuminate user preferences via\nmulti-round interactive dialogues, ultimately navigating towards precise and\nsatisfactory recommendations. However, contemporary CRS are limited to\ninquiring binary or multi-choice questions based on a single attribute type\n(e.g., color) per round, which causes excessive rounds of interaction and\ndiminishes the user's experience. 
To address this, we propose a more realistic\nand efficient conversational recommendation problem setting, called\nMulti-Type-Attribute Multi-round Conversational Recommendation (MTAMCR), which\nenables CRS to inquire about multi-choice questions covering multiple types of\nattributes in each round, thereby improving interactive efficiency. Moreover,\nby formulating MTAMCR as a hierarchical reinforcement learning task, we propose\na Chain-of-Choice Hierarchical Policy Learning (CoCHPL) framework to enhance\nboth the questioning efficiency and recommendation effectiveness in MTAMCR.\nSpecifically, a long-term policy over options (i.e., ask or recommend)\ndetermines the action type, while two short-term intra-option policies\nsequentially generate the chain of attributes or items through multi-step\nreasoning and selection, optimizing the diversity and interdependence of\nquestioning attributes. Finally, extensive experiments on four benchmarks\ndemonstrate the superior performance of CoCHPL over prevailing state-of-the-art\nmethods.\n","authors":["Wei Fan","Weijia Zhang","Weiqi Wang","Yangqiu Song","Hao Liu"],"pdf_url":"https://arxiv.org/pdf/2310.17922v1.pdf","comment":"Release with source code"},{"id":"http://arxiv.org/abs/2307.04090v2","updated":"2023-10-27T04:27:41Z","published":"2023-07-09T04:19:19Z","title":"DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge\n Graphs","summary":" Recent work within the Argument Mining community has shown the applicability\nof Natural Language Processing systems for solving problems found within\ncompetitive debate. One of the most important tasks within competitive debate\nis for debaters to create high quality debate cases. We show that effective\ndebate cases can be constructed using constrained shortest path traversals on\nArgumentative Semantic Knowledge Graphs. 
We study this potential in the context\nof a type of American Competitive Debate, called Policy Debate, which already\nhas a large scale dataset targeting it called DebateSum. We significantly\nimprove upon DebateSum by introducing 53180 new examples, as well as further\nuseful metadata for every example, to the dataset. We leverage the txtai\nsemantic search and knowledge graph toolchain to produce and contribute 9\nsemantic knowledge graphs built on this dataset. We create a unique method for\nevaluating which knowledge graphs are better in the context of producing policy\ndebate cases. A demo which automatically generates debate cases, along with all\nother code and the Knowledge Graphs, are open-sourced and made available to the\npublic here: https://huggingface.co/spaces/Hellisotherpeople/DebateKG\n","authors":["Allen Roush","David Mezzetti"],"pdf_url":"https://arxiv.org/pdf/2307.04090v2.pdf","comment":"8 pages, Accepted to The 4th New Frontiers in Summarization Workshop\n (EMNLP 2023), System Demonstration paper"},{"id":"http://arxiv.org/abs/2310.17870v1","updated":"2023-10-27T03:14:50Z","published":"2023-10-27T03:14:50Z","title":"Ranking with Slot Constraints","summary":" We introduce the problem of ranking with slot constraints, which can be used\nto model a wide range of application problems -- from college admission with\nlimited slots for different majors, to composing a stratified cohort of\neligible participants in a medical trial. We show that the conventional\nProbability Ranking Principle (PRP) can be highly sub-optimal for\nslot-constrained ranking problems, and we devise a new ranking algorithm,\ncalled MatchRank. The goal of MatchRank is to produce rankings that maximize\nthe number of filled slots if candidates are evaluated by a human decision\nmaker in the order of the ranking. 
In this way, MatchRank generalizes the PRP,\nand it subsumes the PRP as a special case when there are no slot constraints.\nOur theoretical analysis shows that MatchRank has a strong approximation\nguarantee without any independence assumptions between slots or candidates.\nFurthermore, we show how MatchRank can be implemented efficiently. Beyond the\ntheoretical guarantees, empirical evaluations show that MatchRank can provide\nsubstantial improvements over a range of synthetic and real-world tasks.\n","authors":["Wentao Guo","Andrew Wang","Bradon Thymes","Thorsten Joachims"],"pdf_url":"https://arxiv.org/pdf/2310.17870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.14263v2","updated":"2023-10-27T01:40:41Z","published":"2023-08-28T02:38:17Z","title":"Cross-Modal Retrieval: A Systematic Review of Methods and Future\n Directions","summary":" With the exponential surge in diverse multi-modal data, traditional uni-modal\nretrieval methods struggle to meet the needs of users demanding access to data\nfrom various modalities. To address this, cross-modal retrieval has emerged,\nenabling interaction across modalities, facilitating semantic matching, and\nleveraging complementarity and consistency between different modal data.\nAlthough prior literature undertook a review of the cross-modal retrieval\nfield, it exhibits numerous deficiencies pertaining to timeliness, taxonomy,\nand comprehensiveness. This paper conducts a comprehensive review of\ncross-modal retrieval's evolution, spanning from shallow statistical analysis\ntechniques to vision-language pre-training models. Commencing with a\ncomprehensive taxonomy grounded in machine learning paradigms, mechanisms, and\nmodels, the paper then delves deeply into the principles and architectures\nunderpinning existing cross-modal retrieval methods. Furthermore, it offers an\noverview of widely used benchmarks, metrics, and performances. 
Lastly, the\npaper probes the prospects and challenges that confront contemporary\ncross-modal retrieval, while engaging in a discourse on potential directions\nfor further progress in the field. To facilitate the research on cross-modal\nretrieval, we develop an open-source code repository at\nhttps://github.com/BMC-SDNU/Cross-Modal-Retrieval.\n","authors":["Fengling Li","Lei Zhu","Tianshi Wang","Jingjing Li","Zheng Zhang","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2308.14263v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2205.07087v3","updated":"2023-10-27T18:13:49Z","published":"2022-05-14T16:00:21Z","title":"Pattern reconstruction with restricted Boltzmann machines","summary":" Restricted Boltzmann machines are energy models made of a visible and a\nhidden layer. We identify an effective energy function describing the\nzero-temperature landscape on the visible units and depending only on the tail\nbehaviour of the hidden layer prior distribution. Studying the location of the\nlocal minima of such an energy function, we show that the ability of a\nrestricted Boltzmann machine to reconstruct a random pattern depends indeed\nonly on the tail of the hidden prior distribution. We find that hidden priors\nwith strictly super-Gaussian tails give only a logarithmic loss in pattern\nretrieval, while an efficient retrieval is much harder with hidden units with\nstrictly sub-Gaussian tails; if the hidden prior has Gaussian tails, the\nretrieval capability is determined by the number of hidden units (as in the\nHopfield model).\n","authors":["Giuseppe Genovese"],"pdf_url":"https://arxiv.org/pdf/2205.07087v3.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.18313v1","updated":"2023-10-27T17:59:51Z","published":"2023-10-27T17:59:51Z","title":"FP8-LM: Training FP8 Large Language Models","summary":" In this paper, we explore FP8 low-bit data formats for efficient training of\nlarge language models (LLMs). 
Our key insight is that most variables, such as\ngradients and optimizer states, in LLM training can employ low-precision data\nformats without compromising model accuracy and requiring no changes to\nhyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision\nframework for training LLMs. This framework offers three levels of FP8\nutilization to streamline mixed-precision and distributed parallel training for\nLLMs. It gradually incorporates 8-bit gradients, optimizer states, and\ndistributed learning in an incremental manner. Experiment results show that,\nduring the training of GPT-175B model on H100 GPU platform, our FP8\nmixed-precision training framework not only achieved a remarkable 42% reduction\nin real memory usage but also ran 64% faster than the widely adopted BF16\nframework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer\nEngine by 17%. This largely reduces the training costs for large foundation\nmodels. Furthermore, our FP8 mixed-precision training methodology is generic.\nIt can be seamlessly applied to other tasks such as LLM instruction tuning and\nreinforcement learning with human feedback, offering savings in fine-tuning\nexpenses. 
Our FP8 low-precision training framework is open-sourced at\nhttps://github.com/Azure/MS-AMP (aka.ms/MS.AMP).\n","authors":["Houwen Peng","Kan Wu","Yixuan Wei","Guoshuai Zhao","Yuxiang Yang","Ze Liu","Yifan Xiong","Ziyue Yang","Bolin Ni","Jingcheng Hu","Ruihang Li","Miaosen Zhang","Chen Li","Jia Ning","Ruizhe Wang","Zheng Zhang","Shuguang Liu","Joe Chau","Han Hu","Peng Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.18313v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14208v2","updated":"2023-10-27T17:59:33Z","published":"2023-09-25T15:11:52Z","title":"Framework based on complex networks to model and mine patient pathways","summary":" The automatic discovery of a model to represent the history of encounters of\na group of patients with the healthcare system -- the so-called \"pathway of\npatients\" -- is a new field of research that supports clinical and\norganisational decisions to improve the quality and efficiency of the treatment\nprovided. The pathways of patients with chronic conditions tend to vary\nsignificantly from one person to another, have repetitive tasks, and demand the\nanalysis of multiple perspectives (interventions, diagnoses, medical\nspecialities, among others) influencing the results. Therefore, modelling and\nmining those pathways is still a challenging task. In this work, we propose a\nframework comprising: (i) a pathway model based on a multi-aspect graph, (ii) a\nnovel dissimilarity measurement to compare pathways taking the elapsed time\ninto account, and (iii) a mining method based on traditional centrality\nmeasures to discover the most relevant steps of the pathways. 
We evaluated the\nframework using the study cases of pregnancy and diabetes, which revealed its\nusefulness in finding clusters of similar pathways, representing them in an\neasy-to-interpret way, and highlighting the most significant patterns according\nto multiple perspectives.\n","authors":["Caroline de Oliveira Costa Souza Rosa","Márcia Ito","Alex Borges Vieira","Klaus Wehmuth","Antônio Tadeu Azevedo Gomes"],"pdf_url":"https://arxiv.org/pdf/2309.14208v2.pdf","comment":"35 pages, 11 figures, 2 appendices"},{"id":"http://arxiv.org/abs/2310.17651v2","updated":"2023-10-27T17:59:29Z","published":"2023-10-26T17:59:32Z","title":"High-Dimensional Prediction for Sequential Decision Making","summary":" We study the problem of making predictions of an adversarially chosen\nhigh-dimensional state that are unbiased subject to an arbitrary collection of\nconditioning events, with the goal of tailoring these events to downstream\ndecision makers. We give efficient algorithms for solving this problem, as well\nas a number of applications that stem from choosing an appropriate set of\nconditioning events.\n For example, we can efficiently make predictions targeted at polynomially\nmany decision makers, giving each of them optimal swap regret if they\nbest-respond to our predictions. We generalize this to online combinatorial\noptimization, where the decision makers have a very large action space, to give\nthe first algorithms offering polynomially many decision makers no regret on\npolynomially many subsequences that may depend on their actions and the\ncontext. 
We apply these results to get efficient no-subsequence-regret\nalgorithms in extensive-form games (EFGs), yielding a new family of regret\nguarantees for EFGs that generalizes some existing EFG regret notions, e.g.\nregret to informed causal deviations, and is generally incomparable to other\nknown such notions.\n Next, we develop a novel transparent alternative to conformal prediction for\nbuilding valid online adversarial multiclass prediction sets. We produce class\nscores that downstream algorithms can use for producing valid-coverage\nprediction sets, as if these scores were the true conditional class\nprobabilities. We show this implies strong conditional validity guarantees\nincluding set-size-conditional and multigroup-fair coverage for polynomially\nmany downstream prediction sets. Moreover, our class scores can be guaranteed\nto have improved $L_2$ loss, cross-entropy loss, and generally any Bregman\nloss, compared to any collection of benchmark models, yielding a\nhigh-dimensional real-valued version of omniprediction.\n","authors":["Georgy Noarov","Ramya Ramalingam","Aaron Roth","Stephan Xie"],"pdf_url":"https://arxiv.org/pdf/2310.17651v2.pdf","comment":"Added references, Arxiv abstract edited"},{"id":"http://arxiv.org/abs/2310.18308v1","updated":"2023-10-27T17:55:32Z","published":"2023-10-27T17:55:32Z","title":"Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models","summary":" Generalist robot manipulators need to learn a wide variety of manipulation\nskills across diverse environments. Current robot training pipelines rely on\nhumans to provide kinesthetic demonstrations or to program simulation\nenvironments and to code up reward functions for reinforcement learning. Such\nhuman involvement is an important bottleneck towards scaling up robot learning\nacross diverse tasks and environments. 
We propose Generation to Simulation\n(Gen2Sim), a method for scaling up robot skill learning in simulation by\nautomating generation of 3D assets, task descriptions, task decompositions and\nreward functions using large pre-trained generative models of language and\nvision. We generate 3D assets for simulation by lifting open-world 2D\nobject-centric images to 3D using image diffusion models and querying LLMs to\ndetermine plausible physics parameters. Given URDF files of generated and\nhuman-developed assets, we chain-of-thought prompt LLMs to map these to\nrelevant task descriptions, temporal decompositions, and corresponding python\nreward functions for reinforcement learning. We show Gen2Sim succeeds in\nlearning policies for diverse long horizon tasks, where reinforcement learning\nwith non temporally decomposed reward functions fails. Gen2Sim provides a\nviable path for scaling up reinforcement learning for robot manipulators in\nsimulation, both by diversifying and expanding task and environment\ndevelopment, and by facilitating the discovery of reinforcement-learned\nbehaviors through temporal task decomposition in RL. Our work contributes\nhundreds of simulated assets, tasks and demonstrations, taking a step towards\nfully autonomous robotic manipulation skill acquisition in simulation.\n","authors":["Pushkal Katara","Zhou Xian","Katerina Fragkiadaki"],"pdf_url":"https://arxiv.org/pdf/2310.18308v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18306v1","updated":"2023-10-27T17:55:17Z","published":"2023-10-27T17:55:17Z","title":"Supervised and Penalized Baseline Correction","summary":" Spectroscopic measurements can show distorted spectra shapes arising from a\nmixture of absorbing and scattering contributions. These distortions (or\nbaselines) often manifest themselves as non-constant offsets or low-frequency\noscillations. As a result, these baselines can adversely affect analytical and\nquantitative results. 
Baseline correction is an umbrella term where one applies\npre-processing methods to obtain baseline spectra (the unwanted distortions)\nand then removes the distortions by differencing. However, current\nstate-of-the-art baseline correction methods do not utilize analyte concentrations even if\nthey are available, or even if they contribute significantly to the observed\nspectral variability. We examine a class of state-of-the-art methods (penalized\nbaseline correction) and modify them such that they can accommodate a priori\nanalyte concentration such that prediction can be enhanced. Performance will be\nassessed on two near infra-red data sets across both classical penalized baseline\ncorrection methods (without analyte information) and modified penalized\nbaseline correction methods (leveraging analyte information).\n","authors":["Erik Andries Ramin Nikzad-Langerodi"],"pdf_url":"https://arxiv.org/pdf/2310.18306v1.pdf","comment":"27 pages; 8 figures with a total of 18 subfigures; 2 tables"},{"id":"http://arxiv.org/abs/2310.18304v1","updated":"2023-10-27T17:53:53Z","published":"2023-10-27T17:53:53Z","title":"A Stability Principle for Learning under Non-Stationarity","summary":" We develop a versatile framework for statistical learning in non-stationary\nenvironments. In each time period, our approach applies a stability principle\nto select a look-back window that maximizes the utilization of historical data\nwhile keeping the cumulative bias within an acceptable range relative to the\nstochastic error. Our theory showcases the adaptability of this approach to\nunknown non-stationarity. The regret bound is minimax optimal up to logarithmic\nfactors when the population losses are strongly convex, or Lipschitz only. 
At\nthe heart of our analysis lie two novel components: a measure of similarity\nbetween functions and a segmentation technique for dividing the non-stationary\ndata sequence into quasi-stationary pieces.\n","authors":["Chengpiao Huang","Kaizheng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18304v1.pdf","comment":"47 pages, 1 figure"},{"id":"http://arxiv.org/abs/2310.16779v3","updated":"2023-10-27T17:51:17Z","published":"2023-10-25T17:11:21Z","title":"Multi-scale Diffusion Denoised Smoothing","summary":" Along with recent diffusion models, randomized smoothing has become one of a\nfew tangible approaches that offers adversarial robustness to models at scale,\ne.g., those of large pre-trained models. Specifically, one can perform\nrandomized smoothing on any classifier via a simple \"denoise-and-classify\"\npipeline, so-called denoised smoothing, given that an accurate denoiser is\navailable - such as diffusion model. In this paper, we present scalable methods\nto address the current trade-off between certified robustness and accuracy in\ndenoised smoothing. Our key idea is to \"selectively\" apply smoothing among\nmultiple noise scales, coined multi-scale smoothing, which can be efficiently\nimplemented with a single diffusion model. This approach also suggests a new\nobjective to compare the collective robustness of multi-scale smoothed\nclassifiers, and questions which representation of diffusion model would\nmaximize the objective. To address this, we propose to further fine-tune\ndiffusion model (a) to perform consistent denoising whenever the original image\nis recoverable, but (b) to generate rather diverse outputs otherwise. 
Our\nexperiments show that the proposed multi-scale smoothing scheme combined with\ndiffusion fine-tuning enables strong certified robustness available with high\nnoise level while maintaining its accuracy close to non-smoothed classifiers.\n","authors":["Jongheon Jeong","Jinwoo Shin"],"pdf_url":"https://arxiv.org/pdf/2310.16779v3.pdf","comment":"Published as a conference paper at NeurIPS 2023; Code is available at\n https://github.com/jh-jeong/smoothing-multiscale"},{"id":"http://arxiv.org/abs/2310.13548v3","updated":"2023-10-27T17:45:26Z","published":"2023-10-20T14:46:48Z","title":"Towards Understanding Sycophancy in Language Models","summary":" Human feedback is commonly utilized to finetune AI assistants. But human\nfeedback may also encourage model responses that match user beliefs over\ntruthful ones, a behaviour known as sycophancy. We investigate the prevalence\nof sycophancy in models whose finetuning procedure made use of human feedback,\nand the potential role of human preference judgments in such behavior. We first\ndemonstrate that five state-of-the-art AI assistants consistently exhibit\nsycophancy across four varied free-form text-generation tasks. To understand if\nhuman preferences drive this broadly observed behavior, we analyze existing\nhuman preference data. We find that when a response matches a user's views, it\nis more likely to be preferred. Moreover, both humans and preference models\n(PMs) prefer convincingly-written sycophantic responses over correct ones a\nnon-negligible fraction of the time. Optimizing model outputs against PMs also\nsometimes sacrifices truthfulness in favor of sycophancy. Overall, our results\nindicate that sycophancy is a general behavior of state-of-the-art AI\nassistants, likely driven in part by human preference judgments favoring\nsycophantic responses.\n","authors":["Mrinank Sharma","Meg Tong","Tomasz Korbak","David Duvenaud","Amanda Askell","Samuel R. 
Bowman","Newton Cheng","Esin Durmus","Zac Hatfield-Dodds","Scott R. Johnston","Shauna Kravec","Timothy Maxwell","Sam McCandlish","Kamal Ndousse","Oliver Rausch","Nicholas Schiefer","Da Yan","Miranda Zhang","Ethan Perez"],"pdf_url":"https://arxiv.org/pdf/2310.13548v3.pdf","comment":"32 pages, 20 figures"},{"id":"http://arxiv.org/abs/2310.18291v1","updated":"2023-10-27T17:29:07Z","published":"2023-10-27T17:29:07Z","title":"Addressing GAN Training Instabilities via Tunable Classification Losses","summary":" Generative adversarial networks (GANs), modeled as a zero-sum game between a\ngenerator (G) and a discriminator (D), allow generating synthetic data with\nformal guarantees. Noting that D is a classifier, we begin by reformulating the\nGAN value function using class probability estimation (CPE) losses. We prove a\ntwo-way correspondence between CPE loss GANs and $f$-GANs which minimize\n$f$-divergences. We also show that all symmetric $f$-divergences are equivalent\nin convergence. In the finite sample and model capacity setting, we define and\nobtain bounds on estimation and generalization errors. We specialize these\nresults to $\\alpha$-GANs, defined using $\\alpha$-loss, a tunable CPE loss\nfamily parametrized by $\\alpha\\in(0,\\infty]$. We next introduce a class of\ndual-objective GANs to address training instabilities of GANs by modeling each\nplayer's objective using $\\alpha$-loss to obtain $(\\alpha_D,\\alpha_G)$-GANs. We\nshow that the resulting non-zero sum game simplifies to minimizing an\n$f$-divergence under appropriate conditions on $(\\alpha_D,\\alpha_G)$.\nGeneralizing this dual-objective formulation using CPE losses, we define and\nobtain upper bounds on an appropriately defined estimation error. 
Finally, we\nhighlight the value of tuning $(\\alpha_D,\\alpha_G)$ in alleviating training\ninstabilities for the synthetic 2D Gaussian mixture ring as well as the large\npublicly available Celeb-A and LSUN Classroom image datasets.\n","authors":["Monica Welfert","Gowtham R. Kurri","Kyle Otstot","Lalitha Sankar"],"pdf_url":"https://arxiv.org/pdf/2310.18291v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2302.14320"},{"id":"http://arxiv.org/abs/2306.14884v2","updated":"2023-10-27T17:28:50Z","published":"2023-06-26T17:53:05Z","title":"Learning to Modulate pre-trained Models in RL","summary":" Reinforcement Learning (RL) has been successful in various domains like\nrobotics, game playing, and simulation. While RL agents have shown impressive\ncapabilities in their specific tasks, they insufficiently adapt to new tasks.\nIn supervised learning, this adaptation problem is addressed by large-scale\npre-training followed by fine-tuning to new down-stream tasks. Recently,\npre-training on multiple tasks has been gaining traction in RL. However,\nfine-tuning a pre-trained model often suffers from catastrophic forgetting.\nThat is, the performance on the pre-training tasks deteriorates when\nfine-tuning on new tasks. To investigate the catastrophic forgetting\nphenomenon, we first jointly pre-train a model on datasets from two benchmark\nsuites, namely Meta-World and DMControl. Then, we evaluate and compare a\nvariety of fine-tuning methods prevalent in natural language processing, both\nin terms of performance on new tasks, and how well performance on pre-training\ntasks is retained. Our study shows that with most fine-tuning approaches, the\nperformance on pre-training tasks deteriorates significantly. Therefore, we\npropose a novel method, Learning-to-Modulate (L2M), that avoids the degradation\nof learned skills by modulating the information flow of the frozen pre-trained\nmodel via a learnable modulation pool. 
Our method achieves state-of-the-art\nperformance on the Continual-World benchmark, while retaining performance on\nthe pre-training tasks. Finally, to aid future research in this area, we\nrelease a dataset encompassing 50 Meta-World and 16 DMControl tasks.\n","authors":["Thomas Schmied","Markus Hofmarcher","Fabian Paischer","Razvan Pascanu","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2306.14884v2.pdf","comment":"10 pages (+ references and appendix), Code:\n https://github.com/ml-jku/L2M"},{"id":"http://arxiv.org/abs/2310.18288v1","updated":"2023-10-27T17:25:12Z","published":"2023-10-27T17:25:12Z","title":"Sustainable Concrete via Bayesian Optimization","summary":" Eight percent of global carbon dioxide emissions can be attributed to the\nproduction of cement, the main component of concrete, which is also the\ndominant source of CO2 emissions in the construction of data centers. The\ndiscovery of lower-carbon concrete formulae is therefore of high significance\nfor sustainability. However, experimenting with new concrete formulae is time\nconsuming and labor intensive, as one usually has to wait to record the\nconcrete's 28-day compressive strength, a quantity whose measurement can by its\ndefinition not be accelerated. This provides an opportunity for experimental\ndesign methodology like Bayesian Optimization (BO) to accelerate the search for\nstrong and sustainable concrete formulae. Herein, we 1) propose modeling steps\nthat make concrete strength amenable to be predicted accurately by a Gaussian\nprocess model with relatively few measurements, 2) formulate the search for\nsustainable concrete as a multi-objective optimization problem, and 3) leverage\nthe proposed model to carry out multi-objective BO with real-world strength\nmeasurements of the algorithmically proposed mixes. 
Our experimental results\nshow improved trade-offs between the mixtures' global warming potential (GWP)\nand their associated compressive strengths, compared to mixes based on current\nindustry practices.\n","authors":["Sebastian Ament","Andrew Witte","Nishant Garg","Julius Kusuma"],"pdf_url":"https://arxiv.org/pdf/2310.18288v1.pdf","comment":"NeurIPS 2023 Workshop on Adaptive Experimental Design and Active\n Learning in the Real World"},{"id":"http://arxiv.org/abs/2310.18286v1","updated":"2023-10-27T17:22:45Z","published":"2023-10-27T17:22:45Z","title":"Optimal Transport for Treatment Effect Estimation","summary":" Estimating conditional average treatment effect from observational data is\nhighly challenging due to the existence of treatment selection bias. Prevalent\nmethods mitigate this issue by aligning distributions of different treatment\ngroups in the latent space. However, there are two critical problems that these\nmethods fail to address: (1) mini-batch sampling effects (MSE), which causes\nmisalignment in non-ideal mini-batches with outcome imbalance and outliers; (2)\nunobserved confounder effects (UCE), which results in inaccurate discrepancy\ncalculation due to the neglect of unobserved confounders. To tackle these\nproblems, we propose a principled approach named Entire Space CounterFactual\nRegression (ESCFR), which is a new take on optimal transport in the context of\ncausality. Specifically, based on the framework of stochastic optimal\ntransport, we propose a relaxed mass-preserving regularizer to address the MSE\nissue and design a proximal factual outcome regularizer to handle the UCE\nissue. 
Extensive experiments demonstrate that our proposed ESCFR can\nsuccessfully tackle the treatment selection bias and achieve significantly\nbetter performance than state-of-the-art methods.\n","authors":["Hao Wang","Zhichao Chen","Jiajun Fan","Haoxuan Li","Tianqiao Liu","Weiming Liu","Quanyu Dai","Yichao Wang","Zhenhua Dong","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2310.18286v1.pdf","comment":"Accepted as NeurIPS 2023 Poster"},{"id":"http://arxiv.org/abs/2310.18285v1","updated":"2023-10-27T17:22:09Z","published":"2023-10-27T17:22:09Z","title":"Heterogeneous Federated Learning with Group-Aware Prompt Tuning","summary":" Transformers have achieved remarkable success in various machine-learning\ntasks, prompting their widespread adoption. In this paper, we explore their\napplication in the context of federated learning (FL), with a particular focus\non heterogeneous scenarios where individual clients possess diverse local\ndatasets. To meet the computational and communication demands of FL, we\nleverage pre-trained Transformers and use an efficient prompt-tuning strategy.\nOur strategy introduces the concept of learning both shared and group prompts,\nenabling the acquisition of universal knowledge and group-specific knowledge\nsimultaneously. Additionally, a prompt selection module assigns personalized\ngroup prompts to each input, aligning the global model with the data\ndistribution of each client. This approach allows us to train a single global\nmodel that can automatically adapt to various local client data distributions\nwithout requiring local fine-tuning. In this way, our proposed method\neffectively bridges the gap between global and personalized local models in\nFederated Learning and surpasses alternative approaches that lack the\ncapability to adapt to previously unseen clients. 
The effectiveness of our\napproach is rigorously validated through extensive experimentation and ablation\nstudies.\n","authors":["Wenlong Deng","Christos Thrampoulidis","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2310.18285v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18370v3","updated":"2023-10-27T17:21:37Z","published":"2023-05-27T22:28:25Z","title":"Explainable Brain Age Prediction using coVariance Neural Networks","summary":" In computational neuroscience, there has been an increased interest in\ndeveloping machine learning algorithms that leverage brain imaging data to\nprovide estimates of \"brain age\" for an individual. Importantly, the\ndiscordance between brain age and chronological age (referred to as \"brain age\ngap\") can capture accelerated aging due to adverse health conditions and\ntherefore, can reflect increased vulnerability towards neurological disease or\ncognitive impairments. However, widespread adoption of brain age for clinical\ndecision support has been hindered due to lack of transparency and\nmethodological justifications in most existing brain age prediction algorithms.\nIn this paper, we leverage coVariance neural networks (VNN) to propose an\nexplanation-driven and anatomically interpretable framework for brain age\nprediction using cortical thickness features. Specifically, our brain age\nprediction framework extends beyond the coarse metric of brain age gap in\nAlzheimer's disease (AD) and we make two important observations: (i) VNNs can\nassign anatomical interpretability to elevated brain age gap in AD by\nidentifying contributing brain regions, (ii) the interpretability offered by\nVNNs is contingent on their ability to exploit specific eigenvectors of the\nanatomical covariance matrix. 
Together, these observations facilitate an\nexplainable and anatomically interpretable perspective to the task of brain age\nprediction.\n","authors":["Saurabh Sihag","Gonzalo Mateos","Corey McMillan","Alejandro Ribeiro"],"pdf_url":"https://arxiv.org/pdf/2305.18370v3.pdf","comment":"Camera ready version for NeurIPS 2023. arXiv admin note: substantial\n text overlap with arXiv:2305.01807"},{"id":"http://arxiv.org/abs/2305.12029v2","updated":"2023-10-27T17:01:50Z","published":"2023-05-19T22:50:02Z","title":"MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational\n Transcript Cleanup","summary":" Current disfluency detection models focus on individual utterances each from\na single speaker. However, numerous discontinuity phenomena in spoken\nconversational transcripts occur across multiple turns, hampering human\nreadability and the performance of downstream NLP tasks. This study addresses\nthese phenomena by proposing an innovative Multi-Turn Cleanup task for spoken\nconversational transcripts and collecting a new dataset, MultiTurnCleanup1. We\ndesign a data labeling schema to collect the high-quality dataset and provide\nextensive data analysis. Furthermore, we leverage two modeling approaches for\nexperimental evaluation as benchmarks for future research.\n","authors":["Hua Shen","Vicky Zayats","Johann C. Rocholl","Daniel D. Walker","Dirk Padfield"],"pdf_url":"https://arxiv.org/pdf/2305.12029v2.pdf","comment":"EMNLP 2023 main conference. Dataset:\n https://github.com/huashen218/MultiTurnCleanup"},{"id":"http://arxiv.org/abs/2310.18274v1","updated":"2023-10-27T16:59:51Z","published":"2023-10-27T16:59:51Z","title":"LipSim: A Provably Robust Perceptual Similarity Metric","summary":" Recent years have seen growing interest in developing and applying perceptual\nsimilarity metrics. Research has shown the superiority of perceptual metrics\nover pixel-wise metrics in aligning with human perception and serving as a\nproxy for the human visual system. 
On the other hand, as perceptual metrics\nrely on neural networks, there is a growing concern regarding their resilience,\ngiven the established vulnerability of neural networks to adversarial attacks.\nIt is indeed logical to infer that perceptual metrics may inherit both the\nstrengths and shortcomings of neural networks. In this work, we demonstrate the\nvulnerability of state-of-the-art perceptual similarity metrics based on an\nensemble of ViT-based feature extractors to adversarial attacks. We then\npropose a framework to train a robust perceptual similarity metric called\nLipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging\n1-Lipschitz neural networks as the backbone, LipSim provides guarded areas\naround each data point and certificates for all perturbations within an\n$\\ell_2$ ball. Finally, a comprehensive set of experiments shows the\nperformance of LipSim in terms of natural and certified scores and on the image\nretrieval application. The code is available at\nhttps://github.com/SaraGhazanfari/LipSim.\n","authors":["Sara Ghazanfari","Alexandre Araujo","Prashanth Krishnamurthy","Farshad Khorrami","Siddharth Garg"],"pdf_url":"https://arxiv.org/pdf/2310.18274v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18268v1","updated":"2023-10-27T16:56:28Z","published":"2023-10-27T16:56:28Z","title":"PlantPlotGAN: A Physics-Informed Generative Adversarial Network for\n Plant Disease Prediction","summary":" Monitoring plantations is crucial for crop management and producing healthy\nharvests. Unmanned Aerial Vehicles (UAVs) have been used to collect\nmultispectral images that aid in this monitoring. However, given the number of\nhectares to be monitored and the limitations of flight, plant disease signals\nbecome visually clear only in the later stages of plant growth and only if the\ndisease has spread throughout a significant portion of the plantation. 
This\nlimited amount of relevant data hampers the prediction models, as the\nalgorithms struggle to generalize patterns with unbalanced or unrealistic\naugmented datasets effectively. To address this issue, we propose PlantPlotGAN,\na physics-informed generative model capable of creating synthetic multispectral\nplot images with realistic vegetation indices. These indices served as a proxy\nfor disease detection and were used to evaluate if our model could help\nincrease the accuracy of prediction models. The results demonstrate that the\nsynthetic imagery generated from PlantPlotGAN outperforms state-of-the-art\nmethods regarding the Fr\\'echet inception distance. Moreover, prediction models\nachieve higher accuracy metrics when trained with synthetic and original\nimagery for earlier plant disease detection compared to the training processes\nbased solely on real imagery.\n","authors":["Felipe A. Lopes","Vasit Sagan","Flavio Esposito"],"pdf_url":"https://arxiv.org/pdf/2310.18268v1.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV), 2024"},{"id":"http://arxiv.org/abs/2310.18265v1","updated":"2023-10-27T16:54:29Z","published":"2023-10-27T16:54:29Z","title":"Structured Semidefinite Programming for Recovering Structured\n Preconditioners","summary":" We develop a general framework for finding approximately-optimal\npreconditioners for solving linear systems. Leveraging this framework we obtain\nimproved runtimes for fundamental preconditioning and linear system solving\nproblems including the following. We give an algorithm which, given positive\ndefinite $\\mathbf{K} \\in \\mathbb{R}^{d \\times d}$ with\n$\\mathrm{nnz}(\\mathbf{K})$ nonzero entries, computes an $\\epsilon$-optimal\ndiagonal preconditioner in time $\\widetilde{O}(\\mathrm{nnz}(\\mathbf{K}) \\cdot\n\\mathrm{poly}(\\kappa^\\star,\\epsilon^{-1}))$, where $\\kappa^\\star$ is the\noptimal condition number of the rescaled matrix. 
We give an algorithm which,\ngiven $\\mathbf{M} \\in \\mathbb{R}^{d \\times d}$ that is either the pseudoinverse\nof a graph Laplacian matrix or a constant spectral approximation of one, solves\nlinear systems in $\\mathbf{M}$ in $\\widetilde{O}(d^2)$ time. Our diagonal\npreconditioning results improve state-of-the-art runtimes of $\\Omega(d^{3.5})$\nattained by general-purpose semidefinite programming, and our solvers improve\nstate-of-the-art runtimes of $\\Omega(d^{\\omega})$ where $\\omega > 2.3$ is the\ncurrent matrix multiplication constant. We attain our results via new\nalgorithms for a class of semidefinite programs (SDPs) we call\nmatrix-dictionary approximation SDPs, which we leverage to solve an associated\nproblem we call matrix-dictionary recovery.\n","authors":["Arun Jambulapati","Jerry Li","Christopher Musco","Kirankumar Shiragur","Aaron Sidford","Kevin Tian"],"pdf_url":"https://arxiv.org/pdf/2310.18265v1.pdf","comment":"Merge of arXiv:1812.06295 and arXiv:2008.01722"},{"id":"http://arxiv.org/abs/2206.00439v6","updated":"2023-10-27T16:53:54Z","published":"2022-06-01T12:22:56Z","title":"Algorithmic Foundations of Empirical X-risk Minimization","summary":" This manuscript introduces a new optimization framework for machine learning\nand AI, named {\\bf empirical X-risk minimization (EXM)}. X-risk is a term\nintroduced to represent a family of compositional measures or objectives, in\nwhich each data point is compared with a large number of items explicitly or\nimplicitly for defining a risk function. It includes surrogate objectives of\nmany widely used measures and non-decomposable losses, e.g., AUROC, AUPRC,\npartial AUROC, NDCG, MAP, precision/recall at top $K$ positions, precision at a\ncertain recall level, listwise losses, p-norm push, top push, global\ncontrastive losses, etc. 
While these non-decomposable objectives and their\noptimization algorithms have been studied in the literature of machine\nlearning, computer vision, information retrieval, etc., optimizing these\nobjectives has encountered some unique challenges for deep learning. In this\npaper, we present recent rigorous efforts for EXM with a focus on its\nalgorithmic foundations and its applications. We introduce a class of\nalgorithmic techniques for solving EXM with smooth non-convex objectives. We\nformulate EXM into three special families of non-convex optimization problems\nbelonging to non-convex compositional optimization, non-convex min-max\noptimization and non-convex bilevel optimization, respectively. For each family\nof problems, we present some strong baseline algorithms and their complexities,\nwhich will motivate further research for improving the existing results.\nDiscussions about the presented results and future studies are given at the\nend. Efficient algorithms for optimizing a variety of X-risks are implemented\nin the LibAUC library at \\url{www.libauc.org}.\n","authors":["Tianbao Yang"],"pdf_url":"https://arxiv.org/pdf/2206.00439v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18264v1","updated":"2023-10-27T16:51:41Z","published":"2023-10-27T16:51:41Z","title":"Learning to Search Feasible and Infeasible Regions of Routing Problems\n with Flexible Neural k-Opt","summary":" In this paper, we present Neural k-Opt (NeuOpt), a novel learning-to-search\n(L2S) solver for routing problems. It learns to perform flexible k-opt\nexchanges based on a tailored action factorization method and a customized\nrecurrent dual-stream decoder. 
As a pioneering work to circumvent the pure\nfeasibility masking scheme and enable the autonomous exploration of both\nfeasible and infeasible regions, we then propose the Guided Infeasible Region\nExploration (GIRE) scheme, which supplements the NeuOpt policy network with\nfeasibility-related features and leverages reward shaping to steer\nreinforcement learning more effectively. Additionally, we equip NeuOpt with\nDynamic Data Augmentation (D2A) for more diverse searches during inference.\nExtensive experiments on the Traveling Salesman Problem (TSP) and Capacitated\nVehicle Routing Problem (CVRP) demonstrate that our NeuOpt not only\nsignificantly outstrips existing (masking-based) L2S solvers, but also\nshowcases superiority over the learning-to-construct (L2C) and\nlearning-to-predict (L2P) solvers. Notably, we offer fresh perspectives on how\nneural solvers can handle VRP constraints. Our code is available:\nhttps://github.com/yining043/NeuOpt.\n","authors":["Yining Ma","Zhiguang Cao","Yeow Meng Chee"],"pdf_url":"https://arxiv.org/pdf/2310.18264v1.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.13934v4","updated":"2023-10-27T16:47:13Z","published":"2023-02-27T16:34:21Z","title":"Statistical Learning under Heterogeneous Distribution Shift","summary":" This paper studies the prediction of a target $\\mathbf{z}$ from a pair of\nrandom variables $(\\mathbf{x},\\mathbf{y})$, where the ground-truth predictor is\nadditive $\\mathbb{E}[\\mathbf{z} \\mid \\mathbf{x},\\mathbf{y}] =\nf_\\star(\\mathbf{x}) +g_{\\star}(\\mathbf{y})$. We study the performance of\nempirical risk minimization (ERM) over functions $f+g$, $f \\in F$ and $g \\in\nG$, fit on a given training distribution, but evaluated on a test distribution\nwhich exhibits covariate shift. 
We show that, when the class $F$ is \"simpler\"\nthan $G$ (measured, e.g., in terms of its metric entropy), our predictor is\nmore resilient to heterogeneous covariate shifts in which the shift in\n$\\mathbf{x}$ is much greater than that in $\\mathbf{y}$. Our analysis proceeds\nby demonstrating that ERM behaves qualitatively similarly to orthogonal machine\nlearning: the rate at which ERM recovers the $f$-component of the predictor has\nonly a lower-order dependence on the complexity of the class $G$, adjusted for\npartial non-identifiability introduced by the additive structure. These\nresults rely on a novel H\\\"older style inequality for the Dudley integral which\nmay be of independent interest. Moreover, we corroborate our theoretical\nfindings with experiments demonstrating improved resilience to shifts in\n\"simpler\" features across numerous domains.\n","authors":["Max Simchowitz","Anurag Ajay","Pulkit Agrawal","Akshay Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2302.13934v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.11403v4","updated":"2023-10-27T16:38:40Z","published":"2023-03-20T19:20:34Z","title":"eP-ALM: Efficient Perceptual Augmentation of Language Models","summary":" Large Language Models (LLMs) have so far impressed the world, with\nunprecedented capabilities that emerge in models at large scales. On the vision\nside, transformer models (i.e., ViT) are following the same trend, achieving\nthe best performance on challenging benchmarks. With the abundance of such\nunimodal models, a natural question arises; do we need also to follow this\ntrend to tackle multimodal tasks? In this work, we propose to rather direct\neffort to efficient adaptations of existing models, and propose to augment\nLanguage Models with perception. Existing approaches for adapting pretrained\nmodels for vision-language tasks still rely on several key components that\nhinder their efficiency. 
In particular, they still train a large number of\nparameters, rely on large multimodal pretraining, use encoders (e.g., CLIP)\ntrained on huge image-text datasets, and add significant inference overhead. In\naddition, most of these approaches have focused on Zero-Shot and In Context\nLearning, with little to no effort on direct finetuning. We investigate the\nminimal computational effort needed to adapt unimodal models for multimodal\ntasks and propose a new challenging setup, alongside different approaches, that\nefficiently adapts unimodal pretrained models. We show that by freezing more\nthan 99% of total parameters, training only one linear projection layer, and\nprepending only one trainable token, our approach (dubbed eP-ALM) significantly\noutperforms other baselines on VQA and Captioning across Image, Video, and\nAudio modalities, following the proposed setup. The code is available here:\nhttps://github.com/mshukor/eP-ALM.\n","authors":["Mustafa Shukor","Corentin Dancette","Matthieu Cord"],"pdf_url":"https://arxiv.org/pdf/2303.11403v4.pdf","comment":"Accepted at ICCV 2023. Project page:\n https://mshukor.github.io/eP-ALM.github.io/"},{"id":"http://arxiv.org/abs/2305.11165v2","updated":"2023-10-27T16:34:44Z","published":"2023-05-18T17:55:52Z","title":"The noise level in linear regression with dependent data","summary":" We derive upper bounds for random design linear regression with dependent\n($\\beta$-mixing) data absent any realizability assumptions. In contrast to the\nstrictly realizable martingale noise regime, no sharp instance-optimal\nnon-asymptotics are available in the literature. Up to constant factors, our\nanalysis correctly recovers the variance term predicted by the Central Limit\nTheorem -- the noise level of the problem -- and thus exhibits graceful\ndegradation as we introduce misspecification. 
Past a burn-in, our result is\nsharp in the moderate deviations regime, and in particular does not inflate the\nleading order term by mixing time factors.\n","authors":["Ingvar Ziemann","Stephen Tu","George J. Pappas","Nikolai Matni"],"pdf_url":"https://arxiv.org/pdf/2305.11165v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18247v1","updated":"2023-10-27T16:34:00Z","published":"2023-10-27T16:34:00Z","title":"Guided Data Augmentation for Offline Reinforcement Learning and\n Imitation Learning","summary":" Learning from demonstration (LfD) is a popular technique that uses expert\ndemonstrations to learn robot control policies. However, the difficulty in\nacquiring expert-quality demonstrations limits the applicability of LfD\nmethods: real-world data collection is often costly, and the quality of the\ndemonstrations depends greatly on the demonstrator's abilities and safety\nconcerns. A number of works have leveraged data augmentation (DA) to\ninexpensively generate additional demonstration data, but most DA works\ngenerate augmented data in a random fashion and ultimately produce highly\nsuboptimal data. In this work, we propose Guided Data Augmentation (GuDA), a\nhuman-guided DA framework that generates expert-quality augmented data. The key\ninsight of GuDA is that while it may be difficult to demonstrate the sequence\nof actions required to produce expert data, a user can often easily identify\nwhen an augmented trajectory segment represents task progress. Thus, the user\ncan impose a series of simple rules on the DA process to automatically generate\naugmented samples that approximate expert behavior. To extract a policy from\nGuDA, we use off-the-shelf offline reinforcement learning and behavior cloning\nalgorithms. We evaluate GuDA on a physical robot soccer task as well as\nsimulated D4RL navigation tasks, a simulated autonomous driving task, and a\nsimulated soccer task. 
Empirically, we find that GuDA enables learning from a\nsmall set of potentially suboptimal demonstrations and substantially\noutperforms a DA strategy that samples augmented data randomly.\n","authors":["Nicholas E. Corrado","Yuxiao Qu","John U. Balis","Adam Labiosa","Josiah P. Hanna"],"pdf_url":"https://arxiv.org/pdf/2310.18247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13725v2","updated":"2023-10-27T16:30:51Z","published":"2023-10-20T04:18:47Z","title":"Enhancing drug and cell line representations via contrastive learning\n for improved anti-cancer drug prioritization","summary":" Due to cancer's complex nature and variable response to therapy, precision\noncology informed by omics sequence analysis has become the current standard of\ncare. However, the amount of data produced for each patient makes it difficult\nto quickly identify the best treatment regimen. Moreover, limited data\navailability has hindered computational methods' abilities to learn patterns\nassociated with effective drug-cell line pairs. In this work, we propose the\nuse of contrastive learning to improve learned drug and cell line\nrepresentations by preserving relationship structures associated with drug\nmechanism of action and cell line cancer types. In addition to achieving\nenhanced performance relative to a state-of-the-art method, we find that\nclassifiers using our learned representations exhibit a more balanced reliance\non drug- and cell line-derived features when making predictions. This\nfacilitates more personalized drug prioritizations that are informed by signals\nrelated to drug resistance.\n","authors":["Patrick J. 
Lawrence","Xia Ning"],"pdf_url":"https://arxiv.org/pdf/2310.13725v2.pdf","comment":"60 pages, 4 figures, 4 tables, 11 supplementary tables, 1\n supplementary note, submitted to Nature Communications"},{"id":"http://arxiv.org/abs/2110.14053v5","updated":"2023-10-27T16:30:44Z","published":"2021-10-26T22:08:22Z","title":"NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks","summary":" Propositional satisfiability (SAT) is an NP-complete problem that impacts\nmany research fields, such as planning, verification, and security. Mainstream\nmodern SAT solvers are based on the Conflict-Driven Clause Learning (CDCL)\nalgorithm. Recent work aimed to enhance CDCL SAT solvers using Graph Neural\nNetworks (GNNs). However, so far this approach either has not made solving more\neffective, or required substantial GPU resources for frequent online model\ninferences. Aiming to make GNN improvements practical, this paper proposes an\napproach called NeuroBack, which builds on two insights: (1) predicting phases\n(i.e., values) of variables appearing in the majority (or even all) of the\nsatisfying assignments are essential for CDCL SAT solving, and (2) it is\nsufficient to query the neural model only once for the predictions before the\nSAT solving starts. Once trained, the offline model inference allows NeuroBack\nto execute exclusively on the CPU, removing its reliance on GPU resources. To\ntrain NeuroBack, a new dataset called DataBack containing 120,286 data samples\nis created. Finally, NeuroBack is implemented as an enhancement to a\nstate-of-the-art SAT solver called Kissat. As a result, it allowed Kissat to\nsolve 5.2% more problems on the recent SAT competition problem set,\nSATCOMP-2022. 
NeuroBack therefore shows how machine learning can be harnessed\nto improve SAT solving in an effective and practical manner.\n","authors":["Wenxi Wang","Yang Hu","Mohit Tiwari","Sarfraz Khurshid","Kenneth McMillan","Risto Miikkulainen"],"pdf_url":"https://arxiv.org/pdf/2110.14053v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15211v2","updated":"2023-10-27T16:29:44Z","published":"2023-10-23T17:24:11Z","title":"Modeling Path Importance for Effective Alzheimer's Disease Drug\n Repurposing","summary":" Recently, drug repurposing has emerged as an effective and resource-efficient\nparadigm for AD drug discovery. Among various methods for drug repurposing,\nnetwork-based methods have shown promising results as they are capable of\nleveraging complex networks that integrate multiple interaction types, such as\nprotein-protein interactions, to more effectively identify candidate drugs.\nHowever, existing approaches typically assume paths of the same length in the\nnetwork have equal importance in identifying the therapeutic effect of drugs.\nOther domains have found that same length paths do not necessarily have the\nsame importance. Thus, relying on this assumption may be deleterious to drug\nrepurposing attempts. In this work, we propose MPI (Modeling Path Importance),\na novel network-based method for AD drug repurposing. MPI is unique in that it\nprioritizes important paths via learned node embeddings, which can effectively\ncapture a network's rich structural information. Thus, leveraging learned\nembeddings allows MPI to effectively differentiate the importance among paths.\nWe evaluate MPI against a commonly used baseline method that identifies anti-AD\ndrug candidates primarily based on the shortest paths between drugs and AD in\nthe network. We observe that among the top-50 ranked drugs, MPI prioritizes\n20.0% more drugs with anti-AD evidence compared to the baseline. 
Finally, Cox\nproportional-hazard models produced from insurance claims data aid us in\nidentifying the use of etodolac, nicotine, and BBB-crossing ACE-INHs as having\na reduced risk of AD, suggesting such drugs may be viable candidates for\nrepurposing and should be explored further in future studies.\n","authors":["Shunian Xiang","Patrick J. Lawrence","Bo Peng","ChienWei Chiang","Dokyoon Kim","Li Shen","Xia Ning"],"pdf_url":"https://arxiv.org/pdf/2310.15211v2.pdf","comment":"16 pages, 3 figures, 2 tables, 1 supplementary figure, 5\n supplementary tables, Preprint of an article accepted for publication in\n Pacific Symposium on Biocomputing \\copyright 2023 World Scientific Publishing\n Co., Singapore, http://psb.stanford.edu/"},{"id":"http://arxiv.org/abs/2310.09727v2","updated":"2023-10-27T16:28:19Z","published":"2023-10-15T04:10:44Z","title":"Provably Fast Convergence of Independent Natural Policy Gradient for\n Markov Potential Games","summary":" This work studies an independent natural policy gradient (NPG) algorithm for\nthe multi-agent reinforcement learning problem in Markov potential games. It is\nshown that, under mild technical assumptions and the introduction of the\n\\textit{suboptimality gap}, the independent NPG method with an oracle providing\nexact policy evaluation asymptotically reaches an $\\epsilon$-Nash Equilibrium\n(NE) within $\\mathcal{O}(1/\\epsilon)$ iterations. This improves upon the\nprevious best result of $\\mathcal{O}(1/\\epsilon^2)$ iterations and is of the\nsame order, $\\mathcal{O}(1/\\epsilon)$, that is achievable for the single-agent\ncase. Empirical results for a synthetic potential game and a congestion game\nare presented to verify the theoretical bounds.\n","authors":["Youbang Sun","Tao Liu","Ruida Zhou","P. R. 
Kumar","Shahin Shahrampour"],"pdf_url":"https://arxiv.org/pdf/2310.09727v2.pdf","comment":"Will appear in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18241v1","updated":"2023-10-27T16:26:14Z","published":"2023-10-27T16:26:14Z","title":"$α$-Mutual Information: A Tunable Privacy Measure for Privacy\n Protection in Data Sharing","summary":" This paper adopts Arimoto's $\\alpha$-Mutual Information as a tunable privacy\nmeasure, in a privacy-preserving data release setting that aims to prevent\ndisclosing private data to adversaries. By fine-tuning the privacy metric, we\ndemonstrate that our approach yields superior models that effectively thwart\nattackers across various performance dimensions. We formulate a general\ndistortion-based mechanism that manipulates the original data to offer privacy\nprotection. The distortion metrics are determined according to the data\nstructure of a specific experiment. We confront the problem expressed in the\nformulation by employing a general adversarial deep learning framework that\nconsists of a releaser and an adversary, trained with opposite goals. This\nstudy conducts empirical experiments on images and time-series data to verify\nthe functionality of $\\alpha$-Mutual Information. We evaluate the\nprivacy-utility trade-off of customized models and compare them to mutual\ninformation as the baseline measure. 
Finally, we analyze the consequence of an\nattacker's access to side information about private data and witness that\nadapting the privacy measure results in a more refined model than the\nstate-of-the-art in terms of resiliency against side information.\n","authors":["MirHamed Jafarzadeh Asl","Mohammadhadi Shateri","Fabrice Labeau"],"pdf_url":"https://arxiv.org/pdf/2310.18241v1.pdf","comment":"2023 22nd IEEE International Conference on Machine Learning and\n Applications (ICMLA)"},{"id":"http://arxiv.org/abs/2310.07747v2","updated":"2023-10-27T16:23:43Z","published":"2023-10-11T17:20:32Z","title":"Accountability in Offline Reinforcement Learning: Explaining Decisions\n with a Corpus of Examples","summary":" Learning controllers with offline data in decision-making systems is an\nessential area of research due to its potential to reduce the risk of\napplications in real-world systems. However, in responsibility-sensitive\nsettings such as healthcare, decision accountability is of paramount\nimportance, yet has not been adequately addressed by the literature. This paper\nintroduces the Accountable Offline Controller (AOC) that employs the offline\ndataset as the Decision Corpus and performs accountable control based on a\ntailored selection of examples, referred to as the Corpus Subset. AOC operates\neffectively in low-data scenarios, can be extended to the strictly offline\nimitation setting, and displays qualities of both conservation and\nadaptability. 
We assess AOC's performance in both simulated and real-world\nhealthcare scenarios, emphasizing its capability to manage offline control\ntasks with high levels of performance while maintaining accountability.\n","authors":["Hao Sun","Alihan Hüyük","Daniel Jarrett","Mihaela van der Schaar"],"pdf_url":"https://arxiv.org/pdf/2310.07747v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18236v1","updated":"2023-10-27T16:20:34Z","published":"2023-10-27T16:20:34Z","title":"How Re-sampling Helps for Long-Tail Learning?","summary":" Long-tail learning has received significant attention in recent years due to\nthe challenge it poses with extremely imbalanced datasets. In these datasets,\nonly a few classes (known as the head classes) have an adequate number of\ntraining samples, while the rest of the classes (known as the tail classes) are\ninfrequent in the training data. Re-sampling is a classical and widely used\napproach for addressing class imbalance issues. Unfortunately, recent studies\nclaim that re-sampling brings negligible performance improvements in modern\nlong-tail learning tasks. This paper aims to investigate this phenomenon\nsystematically. Our research shows that re-sampling can considerably improve\ngeneralization when the training images do not contain semantically irrelevant\ncontexts. In other scenarios, however, it can learn unexpected spurious\ncorrelations between irrelevant contexts and target labels. We design\nexperiments on two homogeneous datasets, one containing irrelevant context and\nthe other not, to confirm our findings. To prevent the learning of spurious\ncorrelations, we propose a new context shift augmentation module that generates\ndiverse training images for the tail class by maintaining a context bank\nextracted from the head-class images. 
Experiments demonstrate that our proposed\nmodule can boost the generalization and outperform other approaches, including\nclass-balanced re-sampling, decoupled classifier re-training, and data\naugmentation methods. The source code is available at\nhttps://www.lamda.nju.edu.cn/code_CSA.ashx.\n","authors":["Jiang-Xin Shi","Tong Wei","Yuke Xiang","Yu-Feng Li"],"pdf_url":"https://arxiv.org/pdf/2310.18236v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18235v1","updated":"2023-10-27T16:20:10Z","published":"2023-10-27T16:20:10Z","title":"Davidsonian Scene Graph: Improving Reliability in Fine-grained\n Evaluation for Text-Image Generation","summary":" Evaluating text-to-image models is notoriously difficult. A strong recent\napproach for assessing text-image faithfulness is based on QG/A (question\ngeneration and answering), which uses pre-trained foundational models to\nautomatically generate a set of questions and answers from the prompt, and\noutput images are scored based on whether these answers extracted with a visual\nquestion answering model are consistent with the prompt-based answers. This\nkind of evaluation is naturally dependent on the quality of the underlying QG\nand QA models. We identify and address several reliability challenges in\nexisting QG/A work: (a) QG questions should respect the prompt (avoiding\nhallucinations, duplications, and omissions) and (b) VQA answers should be\nconsistent (not asserting that there is no motorcycle in an image while also\nclaiming the motorcycle is blue). We address these issues with Davidsonian\nScene Graph (DSG), an empirically grounded evaluation framework inspired by\nformal semantics. DSG is an automatic, graph-based QG/A that is modularly\nimplemented to be adaptable to any QG/A module. DSG produces atomic and unique\nquestions organized in dependency graphs, which (i) ensure appropriate semantic\ncoverage and (ii) sidestep inconsistent answers. 
With extensive experimentation\nand human evaluation on a range of model configurations (LLM, VQA, and T2I), we\nempirically demonstrate that DSG addresses the challenges noted above. Finally,\nwe present DSG-1k, an open-sourced evaluation benchmark that includes 1,060\nprompts, covering a wide range of fine-grained semantic categories with a\nbalanced distribution. We will release the DSG-1k prompts and the corresponding\nDSG questions.\n","authors":["Jaemin Cho","Yushi Hu","Roopal Garg","Peter Anderson","Ranjay Krishna","Jason Baldridge","Mohit Bansal","Jordi Pont-Tuset","Su Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18235v1.pdf","comment":"Project website: https://google.github.io/DSG"},{"id":"http://arxiv.org/abs/2310.18230v1","updated":"2023-10-27T16:09:39Z","published":"2023-10-27T16:09:39Z","title":"Deep Transformed Gaussian Processes","summary":" Transformed Gaussian Processes (TGPs) are stochastic processes specified by\ntransforming samples from the joint distribution from a prior process\n(typically a GP) using an invertible transformation; increasing the flexibility\nof the base process.\n Furthermore, they achieve competitive results compared with Deep Gaussian\nProcesses (DGPs), which are another generalization constructed by a\nhierarchical concatenation of GPs. In this work, we propose a generalization of\nTGPs named Deep Transformed Gaussian Processes (DTGPs), which follows the trend\nof concatenating layers of stochastic processes. More precisely, we obtain a\nmulti-layer model in which each layer is a TGP. This generalization implies an\nincrement of flexibility with respect to both TGPs and DGPs. Exact inference in\nsuch a model is intractable. However, we show that one can use variational\ninference to approximate the required computations yielding a straightforward\nextension of the popular DSVI inference algorithm Salimbeni et al (2017). 
The\nexperiments conducted evaluate the proposed novel DTGPs in multiple regression\ndatasets, achieving good scalability and performance.\n","authors":["Sáez-Maldonado Francisco Javier","Maroñas Juan","Hernández-Lobato Daniel"],"pdf_url":"https://arxiv.org/pdf/2310.18230v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.11835v2","updated":"2023-10-27T16:06:07Z","published":"2023-06-20T18:45:24Z","title":"Topological Parallax: A Geometric Specification for Deep Perception\n Models","summary":" For safety and robustness of AI systems, we introduce topological parallax as\na theoretical and computational tool that compares a trained model to a\nreference dataset to determine whether they have similar multiscale geometric\nstructure. Our proofs and examples show that this geometric similarity between\ndataset and model is essential to trustworthy interpolation and perturbation,\nand we conjecture that this new concept will add value to the current debate\nregarding the unclear relationship between overfitting and generalization in\napplications of deep-learning. In typical DNN applications, an explicit\ngeometric description of the model is impossible, but parallax can estimate\ntopological features (components, cycles, voids, etc.) in the model by\nexamining the effect on the Rips complex of geodesic distortions using the\nreference dataset. Thus, parallax indicates whether the model shares similar\nmultiscale geometric features with the dataset. Parallax presents theoretically\nvia topological data analysis [TDA] as a bi-filtered persistence module, and\nthe key properties of this module are stable under perturbation of the\nreference dataset.\n","authors":["Abraham D. Smith","Michael J. Catanzaro","Gabrielle Angeloro","Nirav Patel","Paul Bendich"],"pdf_url":"https://arxiv.org/pdf/2306.11835v2.pdf","comment":"18 pages, 6 figures. 
Preprint submitted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.01424v2","updated":"2023-10-27T16:05:08Z","published":"2023-06-02T10:30:37Z","title":"Partial Counterfactual Identification of Continuous Outcomes with a\n Curvature Sensitivity Model","summary":" Counterfactual inference aims to answer retrospective \"what if\" questions and\nthus belongs to the most fine-grained type of inference in Pearl's causality\nladder. Existing methods for counterfactual inference with continuous outcomes\naim at point identification and thus make strong and unnatural assumptions\nabout the underlying structural causal model. In this paper, we relax these\nassumptions and aim at partial counterfactual identification of continuous\noutcomes, i.e., when the counterfactual query resides in an ignorance interval\nwith informative bounds. We prove that, in general, the ignorance interval of\nthe counterfactual queries has non-informative bounds, already when functions\nof structural causal models are continuously differentiable. As a remedy, we\npropose a novel sensitivity model called Curvature Sensitivity Model. This\nallows us to obtain informative bounds by bounding the curvature of level sets\nof the functions. We further show that existing point counterfactual\nidentification methods are special cases of our Curvature Sensitivity Model\nwhen the bound of the curvature is set to zero. We then propose an\nimplementation of our Curvature Sensitivity Model in the form of a novel deep\ngenerative model, which we call Augmented Pseudo-Invertible Decoder. Our\nimplementation employs (i) residual normalizing flows with (ii) variational\naugmentations. We empirically demonstrate the effectiveness of our Augmented\nPseudo-Invertible Decoder. 
To the best of our knowledge, ours is the first\npartial identification model for Markovian structural causal models with\ncontinuous outcomes.\n","authors":["Valentyn Melnychuk","Dennis Frauen","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2306.01424v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.15626v2","updated":"2023-10-27T16:00:20Z","published":"2023-06-27T17:05:32Z","title":"LeanDojo: Theorem Proving with Retrieval-Augmented Language Models","summary":" Large language models (LLMs) have shown promise in proving formal theorems\nusing proof assistants such as Lean. However, existing methods are difficult to\nreproduce or build on, due to private code, data, and large compute\nrequirements. This has created substantial barriers to research on machine\nlearning methods for theorem proving. This paper removes these barriers by\nintroducing LeanDojo: an open-source Lean playground consisting of toolkits,\ndata, models, and benchmarks. LeanDojo extracts data from Lean and enables\ninteraction with the proof environment programmatically. It contains\nfine-grained annotations of premises in proofs, providing valuable data for\npremise selection: a key bottleneck in theorem proving. Using this data, we\ndevelop ReProver (Retrieval-Augmented Prover): an LLM-based prover augmented\nwith retrieval for selecting premises from a vast math library. It is\ninexpensive and needs only one GPU week of training. Our retriever leverages\nLeanDojo's program analysis capability to identify accessible premises and hard\nnegative examples, which makes retrieval much more effective. Furthermore, we\nconstruct a new benchmark consisting of 98,734 theorems and proofs extracted\nfrom Lean's math library. It features challenging data split requiring the\nprover to generalize to theorems relying on novel premises that are never used\nin training. 
We use this benchmark for training and evaluation, and\nexperimental results demonstrate the effectiveness of ReProver over\nnon-retrieval baselines and GPT-4. We thus provide the first set of open-source\nLLM-based theorem provers without any proprietary datasets and release it under\na permissive MIT license to facilitate further research.\n","authors":["Kaiyu Yang","Aidan M. Swope","Alex Gu","Rahul Chalamala","Peiyang Song","Shixing Yu","Saad Godil","Ryan Prenger","Anima Anandkumar"],"pdf_url":"https://arxiv.org/pdf/2306.15626v2.pdf","comment":"Accepted to NeurIPS 2023 (Datasets and Benchmarks Track) as an oral\n presentation. Data, code, and models available at https://leandojo.org/"},{"id":"http://arxiv.org/abs/2309.00591v2","updated":"2023-10-27T16:00:11Z","published":"2023-09-01T17:12:43Z","title":"Fast and Regret Optimal Best Arm Identification: Fundamental Limits and\n Low-Complexity Algorithms","summary":" This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual\nobjectives: (i) quick identification and commitment to the optimal arm, and\n(ii) reward maximization throughout a sequence of $T$ consecutive rounds.\nThough each objective has been individually well-studied, i.e., best arm\nidentification for (i) and regret minimization for (ii), the simultaneous\nrealization of both objectives remains an open problem, despite its practical\nimportance. This paper introduces \\emph{Regret Optimal Best Arm Identification}\n(ROBAI) which aims to achieve these dual objectives. To solve ROBAI with both\npre-determined stopping time and adaptive stopping time requirements, we\npresent an algorithm called EOCP and its variants respectively, which not only\nachieve asymptotic optimal regret in both Gaussian and general bandits, but\nalso commit to the optimal arm in $\\mathcal{O}(\\log T)$ rounds with\npre-determined stopping time and $\\mathcal{O}(\\log^2 T)$ rounds with adaptive\nstopping time. 
We further characterize lower bounds on the commitment time\n(equivalent to the sample complexity) of ROBAI, showing that EOCP and its\nvariants are sample optimal with pre-determined stopping time, and almost\nsample optimal with adaptive stopping time. Numerical results confirm our\ntheoretical analysis and reveal an interesting \"over-exploration\" phenomenon\ncarried by classic UCB algorithms, such that EOCP has smaller regret even\nthough it stops exploration much earlier than UCB, i.e., $\\mathcal{O}(\\log T)$\nversus $\\mathcal{O}(T)$, which suggests over-exploration is unnecessary and\npotentially harmful to system performance.\n","authors":["Qining Zhang","Lei Ying"],"pdf_url":"https://arxiv.org/pdf/2309.00591v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2208.04627v3","updated":"2023-10-27T15:58:19Z","published":"2022-08-09T09:44:45Z","title":"Causal Effect Identification in Uncertain Causal Networks","summary":" Causal identification is at the core of the causal inference literature,\nwhere complete algorithms have been proposed to identify causal queries of\ninterest. The validity of these algorithms hinges on the restrictive assumption\nof having access to a correctly specified causal structure. In this work, we\nstudy the setting where a probabilistic model of the causal structure is\navailable. Specifically, the edges in a causal graph exist with uncertainties\nwhich may, for example, represent degree of belief from domain experts.\nAlternatively, the uncertainty about an edge may reflect the confidence of a\nparticular statistical test. The question that naturally arises in this setting\nis: Given such a probabilistic graph and a specific causal effect of interest,\nwhat is the subgraph which has the highest plausibility and for which the\ncausal effect is identifiable? We show that answering this question reduces to\nsolving an NP-complete combinatorial optimization problem which we call the\nedge ID problem. 
We propose efficient algorithms to approximate this problem\nand evaluate them against both real-world networks and randomly generated\ngraphs.\n","authors":["Sina Akbari","Fateme Jamshidi","Ehsan Mokhtarian","Matthew J. Vowels","Jalal Etesami","Negar Kiyavash"],"pdf_url":"https://arxiv.org/pdf/2208.04627v3.pdf","comment":"27 pages, 9 figures, NeurIPS 2023 conference, causal identification,\n causal discovery, probabilistic models"},{"id":"http://arxiv.org/abs/2310.18222v1","updated":"2023-10-27T15:51:33Z","published":"2023-10-27T15:51:33Z","title":"TBDLNet: a network for classifying multidrug-resistant and\n drug-sensitive tuberculosis","summary":" This paper proposes applying a novel deep-learning model, TBDLNet, to\nrecognize CT images to classify multidrug-resistant and drug-sensitive\ntuberculosis automatically. The pre-trained ResNet50 is selected to extract\nfeatures. Three randomized neural networks are used to alleviate the\noverfitting problem. The ensemble of three RNNs is applied to boost the\nrobustness via majority voting. The proposed model is evaluated by five-fold\ncross-validation. Five indexes are selected in this paper, which are accuracy,\nsensitivity, precision, F1-score, and specificity. The TBDLNet achieves 0.9822\naccuracy, 0.9815 specificity, 0.9823 precision, 0.9829 sensitivity, and 0.9826\nF1-score, respectively. The TBDLNet is suitable for classifying\nmultidrug-resistant tuberculosis and drug-sensitive tuberculosis. 
It can detect\nmultidrug-resistant pulmonary tuberculosis as early as possible, which helps to\nadjust the treatment plan in time and improve the treatment effect.\n","authors":["Ziquan Zhu","Jing Tao","Shuihua Wang","Xin Zhang","Yudong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.18222v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18215v1","updated":"2023-10-27T15:42:04Z","published":"2023-10-27T15:42:04Z","title":"One Model Fits All: Cross-Region Taxi-Demand Forecasting","summary":" The growing demand for ride-hailing services has led to an increasing need\nfor accurate taxi demand prediction. Existing systems are limited to specific\nregions, lacking generalizability to unseen areas. This paper presents a novel\ntaxi demand forecasting system that leverages a graph neural network to capture\nspatial dependencies and patterns in urban environments. Additionally, the\nproposed system employs a region-neutral approach, enabling it to train a model\nthat can be applied to any region, including unseen regions. To achieve this,\nthe framework incorporates the power of Variational Autoencoder to disentangle\nthe input features into region-specific and region-neutral components. 
The\nregion-neutral features facilitate cross-region taxi demand predictions,\nallowing the model to generalize well across different urban areas.\nExperimental results demonstrate the effectiveness of the proposed system in\naccurately forecasting taxi demand, even in previously unobserved regions, thus\nshowcasing its potential for optimizing taxi services and improving\ntransportation efficiency on a broader scale.\n","authors":["Ren Ozeki","Haruki Yonekura","Aidana Baimbetova","Hamada Rizk","Hirozumi Yamaguchi"],"pdf_url":"https://arxiv.org/pdf/2310.18215v1.pdf","comment":"Accepted to The 31st ACM International Conference on Advances in\n Geographic Information Systems(SIGSPATIAL '23) as a short paper in the\n Research, Systems and Industrial Experience Papers track"},{"id":"http://arxiv.org/abs/2207.07051v2","updated":"2023-10-27T15:35:45Z","published":"2022-07-14T16:51:09Z","title":"Language models show human-like content effects on reasoning tasks","summary":" Abstract reasoning is a key ability for an intelligent system. Large language\nmodels (LMs) achieve above-chance performance on abstract reasoning tasks, but\nexhibit many imperfections. However, human abstract reasoning is also\nimperfect. For example, human reasoning is affected by our real-world knowledge\nand beliefs, and shows notable \"content effects\"; humans reason more reliably\nwhen the semantic content of a problem supports the correct logical inferences.\nThese content-entangled reasoning patterns play a central role in debates about\nthe fundamental nature of human intelligence. Here, we investigate whether\nlanguage models $\\unicode{x2014}$ whose prior expectations capture some aspects\nof human knowledge $\\unicode{x2014}$ similarly mix content into their answers\nto logical problems. We explored this question across three logical reasoning\ntasks: natural language inference, judging the logical validity of syllogisms,\nand the Wason selection task. 
We evaluate state of the art large language\nmodels, as well as humans, and find that the language models reflect many of\nthe same patterns observed in humans across these tasks $\\unicode{x2014}$ like\nhumans, models answer more accurately when the semantic content of a task\nsupports the logical inferences. These parallels are reflected both in answer\npatterns, and in lower-level features like the relationship between model\nanswer distributions and human response times. Our findings have implications\nfor understanding both these cognitive effects in humans, and the factors that\ncontribute to language model performance.\n","authors":["Ishita Dasgupta","Andrew K. Lampinen","Stephanie C. Y. Chan","Hannah R. Sheahan Antonia Creswell","Dharshan Kumaran","James L. McClelland","Felix Hill"],"pdf_url":"https://arxiv.org/pdf/2207.07051v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18212v1","updated":"2023-10-27T15:34:08Z","published":"2023-10-27T15:34:08Z","title":"Robustness of Algorithms for Causal Structure Learning to Hyperparameter\n Choice","summary":" Hyperparameters play a critical role in machine learning. Hyperparameter\ntuning can make the difference between state-of-the-art and poor prediction\nperformance for any algorithm, but it is particularly challenging for structure\nlearning due to its unsupervised nature. As a result, hyperparameter tuning is\noften neglected in favour of using the default values provided by a particular\nimplementation of an algorithm. While there have been numerous studies on\nperformance evaluation of causal discovery algorithms, how hyperparameters\naffect individual algorithms, as well as the choice of the best algorithm for a\nspecific problem, has not been studied in depth before. This work addresses\nthis gap by investigating the influence of hyperparameters on causal structure\nlearning tasks. 
Specifically, we perform an empirical evaluation of\nhyperparameter selection for some seminal learning algorithms on datasets of\nvarying levels of complexity. We find that, while the choice of algorithm\nremains crucial to obtaining state-of-the-art performance, hyperparameter\nselection in ensemble settings strongly influences the choice of algorithm, in\nthat a poor choice of hyperparameters can lead to analysts using algorithms\nwhich do not give state-of-the-art performance for their data.\n","authors":["Damian Machlanski","Spyridon Samothrakis","Paul Clarke"],"pdf_url":"https://arxiv.org/pdf/2310.18212v1.pdf","comment":"26 pages, 16 figures"},{"id":"http://arxiv.org/abs/2310.18209v1","updated":"2023-10-27T15:31:42Z","published":"2023-10-27T15:31:42Z","title":"Alignment and Outer Shell Isotropy for Hyperbolic Graph Contrastive\n Learning","summary":" Learning good self-supervised graph representations that are beneficial to\ndownstream tasks is challenging. Among a variety of methods, contrastive\nlearning enjoys competitive performance. The embeddings of contrastive learning\nare arranged on a hypersphere that enables the Cosine distance measurement in\nthe Euclidean space. However, the underlying structure of many domains such as\ngraphs exhibits highly non-Euclidean latent geometry. To this end, we propose a\nnovel contrastive learning framework to learn high-quality graph embedding.\nSpecifically, we design the alignment metric that effectively captures the\nhierarchical data-invariant information, as well as we propose a substitute of\nuniformity metric to prevent the so-called dimensional collapse. We show that\nin the hyperbolic space one has to address the leaf- and height-level\nuniformity which are related to properties of trees, whereas in the ambient\nspace of the hyperbolic manifold, these notions translate into imposing an\nisotropic ring density towards boundaries of Poincar\\'e ball. 
This ring density\ncan be easily imposed by promoting the isotropic feature distribution on the\ntangent space of manifold. In the experiments, we demonstrate the efficacy of\nour proposed method across different hyperbolic graph embedding techniques in\nboth supervised and self-supervised learning settings.\n","authors":["Yifei Zhang","Hao Zhu","Jiahong Liu","Piotr Koniusz","Irwin King"],"pdf_url":"https://arxiv.org/pdf/2310.18209v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18208v1","updated":"2023-10-27T15:31:22Z","published":"2023-10-27T15:31:22Z","title":"ArcheType: A Novel Framework for Open-Source Column Type Annotation\n using Large Language Models","summary":" Existing deep-learning approaches to semantic column type annotation (CTA)\nhave important shortcomings: they rely on semantic types which are fixed at\ntraining time; require a large number of training samples per type and incur\nlarge run-time inference costs; and their performance can degrade when\nevaluated on novel datasets, even when types remain constant. Large language\nmodels have exhibited strong zero-shot classification performance on a wide\nrange of tasks and in this paper we explore their use for CTA. We introduce\nArcheType, a simple, practical method for context sampling, prompt\nserialization, model querying, and label remapping, which enables large\nlanguage models to solve column type annotation problems in a fully zero-shot\nmanner. We ablate each component of our method separately, and establish that\nimprovements to context sampling and label remapping provide the most\nconsistent gains. 
ArcheType establishes new state-of-the-art performance on\nboth zero-shot and fine-tuned CTA, including three new domain-specific\nbenchmarks, which we release, along with the code to reproduce our results at\nhttps://github.com/penfever/ArcheType.\n","authors":["Benjamin Feuer","Yurong Liu","Chinmay Hegde","Juliana Freire"],"pdf_url":"https://arxiv.org/pdf/2310.18208v1.pdf","comment":"17 pages, 8 figures"},{"id":"http://arxiv.org/abs/2010.10294v2","updated":"2023-10-27T15:26:02Z","published":"2020-10-19T15:13:07Z","title":"Adaptive Webpage Fingerprinting from TLS Traces","summary":" In webpage fingerprinting, an on-path adversary infers the specific webpage\nloaded by a victim user by analysing the patterns in the encrypted TLS traffic\nexchanged between the user's browser and the website's servers. This work\nstudies modern webpage fingerprinting adversaries against the TLS protocol;\naiming to shed light on their capabilities and inform potential defences.\nDespite the importance of this research area (the majority of global Internet\nusers rely on standard web browsing with TLS) and the potential real-life\nimpact, most past works have focused on attacks specific to anonymity networks\n(e.g., Tor). We introduce a TLS-specific model that: 1) scales to an\nunprecedented number of target webpages, 2) can accurately classify thousands\nof classes it never encountered during training, and 3) has low operational\ncosts even in scenarios of frequent page updates. 
Based on these findings, we\nthen discuss TLS-specific countermeasures and evaluate the effectiveness of the\nexisting padding capabilities provided by TLS 1.3.\n","authors":["Vasilios Mavroudis","Jamie Hayes"],"pdf_url":"https://arxiv.org/pdf/2010.10294v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08252v2","updated":"2023-10-27T15:19:41Z","published":"2023-10-12T11:55:17Z","title":"MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with\n Reinforcement Learning","summary":" Recently, Meta-Black-Box Optimization with Reinforcement Learning\n(MetaBBO-RL) has showcased the power of leveraging RL at the meta-level to\nmitigate manual fine-tuning of low-level black-box optimizers. However, this\nfield is hindered by the lack of a unified benchmark. To fill this gap, we\nintroduce MetaBox, the first benchmark platform expressly tailored for\ndeveloping and evaluating MetaBBO-RL methods. MetaBox offers a flexible\nalgorithmic template that allows users to effortlessly implement their unique\ndesigns within the platform. Moreover, it provides a broad spectrum of over 300\nproblem instances, collected from synthetic to realistic scenarios, and an\nextensive library of 19 baseline methods, including both traditional black-box\noptimizers and recent MetaBBO-RL methods. Besides, MetaBox introduces three\nstandardized performance metrics, enabling a more thorough assessment of the\nmethods. In a bid to illustrate the utility of MetaBox for facilitating\nrigorous evaluation and in-depth analysis, we carry out a wide-ranging\nbenchmarking study on existing MetaBBO-RL methods. 
Our MetaBox is open-source\nand accessible at: https://github.com/GMC-DRL/MetaBox.\n","authors":["Zeyuan Ma","Hongshu Guo","Jiacheng Chen","Zhenrui Li","Guojun Peng","Yue-Jiao Gong","Yining Ma","Zhiguang Cao"],"pdf_url":"https://arxiv.org/pdf/2310.08252v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.06836v2","updated":"2023-10-27T15:19:11Z","published":"2023-06-12T02:56:09Z","title":"Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function\n Approximation: Minimax Optimal and Instance-Dependent Regret Bounds","summary":" While numerous works have focused on devising efficient algorithms for\nreinforcement learning (RL) with uniformly bounded rewards, it remains an open\nquestion whether sample or time-efficient algorithms for RL with large\nstate-action space exist when the rewards are \\emph{heavy-tailed}, i.e., with\nonly finite $(1+\\epsilon)$-th moments for some $\\epsilon\\in(0,1]$. In this\nwork, we address the challenge of such rewards in RL with linear function\napproximation. We first design an algorithm, \\textsc{Heavy-OFUL}, for\nheavy-tailed linear bandits, achieving an \\emph{instance-dependent} $T$-round\nregret of $\\tilde{O}\\big(d T^{\\frac{1-\\epsilon}{2(1+\\epsilon)}}\n\\sqrt{\\sum_{t=1}^T \\nu_t^2} + d T^{\\frac{1-\\epsilon}{2(1+\\epsilon)}}\\big)$, the\n\\emph{first} of this kind. Here, $d$ is the feature dimension, and\n$\\nu_t^{1+\\epsilon}$ is the $(1+\\epsilon)$-th central moment of the reward at\nthe $t$-th round. We further show the above bound is minimax optimal when\napplied to the worst-case instances in stochastic and deterministic linear\nbandits. We then extend this algorithm to the RL settings with linear function\napproximation. Our algorithm, termed as \\textsc{Heavy-LSVI-UCB}, achieves the\n\\emph{first} computationally efficient \\emph{instance-dependent} $K$-episode\nregret of $\\tilde{O}(d \\sqrt{H \\mathcal{U}^*} K^\\frac{1}{1+\\epsilon} + d\n\\sqrt{H \\mathcal{V}^* K})$. 
Here, $H$ is length of the episode, and\n$\\mathcal{U}^*, \\mathcal{V}^*$ are instance-dependent quantities scaling with\nthe central moment of reward and value functions, respectively. We also provide\na matching minimax lower bound $\\Omega(d H K^{\\frac{1}{1+\\epsilon}} + d\n\\sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst\ncase. Our result is achieved via a novel robust self-normalized concentration\ninequality that may be of independent interest in handling heavy-tailed noise\nin general online regression problems.\n","authors":["Jiayi Huang","Han Zhong","Liwei Wang","Lin F. Yang"],"pdf_url":"https://arxiv.org/pdf/2306.06836v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2308.07817v2","updated":"2023-10-27T15:18:28Z","published":"2023-08-15T14:50:12Z","title":"Quantifying the Cost of Learning in Queueing Systems","summary":" Queueing systems are widely applicable stochastic models with use cases in\ncommunication networks, healthcare, service systems, etc. Although their\noptimal control has been extensively studied, most existing approaches assume\nperfect knowledge of the system parameters. Of course, this assumption rarely\nholds in practice where there is parameter uncertainty, thus motivating a\nrecent line of work on bandit learning for queueing systems. This nascent\nstream of research focuses on the asymptotic performance of the proposed\nalgorithms.\n In this paper, we argue that an asymptotic metric, which focuses on\nlate-stage performance, is insufficient to capture the intrinsic statistical\ncomplexity of learning in queueing systems which typically occurs in the early\nstage. Instead, we propose the Cost of Learning in Queueing (CLQ), a new metric\nthat quantifies the maximum increase in time-averaged queue length caused by\nparameter uncertainty. We characterize the CLQ of a single queue multi-server\nsystem, and then extend these results to multi-queue multi-server systems and\nnetworks of queues. 
In establishing our results, we propose a unified analysis\nframework for CLQ that bridges Lyapunov and bandit analysis, provides\nguarantees for a wide range of algorithms, and could be of independent\ninterest.\n","authors":["Daniel Freund","Thodoris Lykouris","Wentao Weng"],"pdf_url":"https://arxiv.org/pdf/2308.07817v2.pdf","comment":"A condensed version of this work was accepted for presentation at the\n Conference on Neural Information Processing Systems (NeurIPS 2023). Compared\n to the first version of the paper, the current version expands the comparison\n with related work"},{"id":"http://arxiv.org/abs/2307.11957v2","updated":"2023-10-27T15:16:39Z","published":"2023-07-22T01:56:58Z","title":"High-performance real-world optical computing trained by in situ\n model-free optimization","summary":" Optical computing systems can provide high-speed and low-energy data\nprocessing but face deficiencies in computationally demanding training and\nsimulation-to-reality gap. We propose a model-free solution for lightweight in\nsitu optimization of optical computing systems based on the score gradient\nestimation algorithm. This approach treats the system as a black box and\nback-propagates loss directly to the optical weights' probabilistic\ndistributions, hence circumventing the need for computation-heavy and biased\nsystem simulation. We demonstrate a superior classification accuracy on the\nMNIST and FMNIST datasets through experiments on a single-layer diffractive\noptical computing system. Furthermore, we show its potential for image-free and\nhigh-speed cell analysis. 
The inherent simplicity of our proposed method,\ncombined with its low demand for computational resources, expedites the\ntransition of optical computing from laboratory demonstrations to real-world\napplications.\n","authors":["Guangyuan Zhao","Xin Shu"],"pdf_url":"https://arxiv.org/pdf/2307.11957v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18191v1","updated":"2023-10-27T15:04:00Z","published":"2023-10-27T15:04:00Z","title":"Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO's\n 4000 TPU Months","summary":" We analyze VeLO (versatile learned optimizer), the largest scale attempt to\ntrain a general purpose \"foundational\" optimizer to date. VeLO was trained on\nthousands of machine learning tasks using over 4000 TPU months with the goal of\nproducing an optimizer capable of generalizing to new problems while being\nhyperparameter free, and outperforming industry standards such as Adam. We\nindependently evaluate VeLO on the MLCommons optimizer benchmark suite. We find\nthat, contrary to initial claims: (1) VeLO has a critical hyperparameter that\nneeds problem-specific tuning, (2) VeLO does not necessarily outperform\ncompetitors in quality of solution found, and (3) VeLO is not faster than\ncompeting optimizers at reducing the training loss. These observations call\ninto question VeLO's generality and the value of the investment in training it.\n","authors":["Fady Rezk","Antreas Antoniou","Henry Gouk","Timothy Hospedales"],"pdf_url":"https://arxiv.org/pdf/2310.18191v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18186v1","updated":"2023-10-27T14:59:44Z","published":"2023-10-27T14:59:44Z","title":"Model-free Posterior Sampling via Learning Rate Randomization","summary":" In this paper, we introduce Randomized Q-learning (RandQL), a novel\nrandomized model-free algorithm for regret minimization in episodic Markov\nDecision Processes (MDPs). 
To the best of our knowledge, RandQL is the first\ntractable model-free posterior sampling-based algorithm. We analyze the\nperformance of RandQL in both tabular and non-tabular metric space settings. In\ntabular MDPs, RandQL achieves a regret bound of order\n$\\widetilde{\\mathcal{O}}(\\sqrt{H^{5}SAT})$, where $H$ is the planning horizon,\n$S$ is the number of states, $A$ is the number of actions, and $T$ is the\nnumber of episodes. For a metric state-action space, RandQL enjoys a regret\nbound of order $\\widetilde{\\mathcal{O}}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where\n$d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic\nexploration without using bonuses, relying instead on a novel idea of learning\nrate randomization. Our empirical study shows that RandQL outperforms existing\napproaches on baseline exploration environments.\n","authors":["Daniil Tiapkin","Denis Belomestny","Daniele Calandriello","Eric Moulines","Remi Munos","Alexey Naumov","Pierre Perrault","Michal Valko","Pierre Menard"],"pdf_url":"https://arxiv.org/pdf/2310.18186v1.pdf","comment":"NeurIPS-2023"},{"id":"http://arxiv.org/abs/2305.19742v2","updated":"2023-10-27T14:48:59Z","published":"2023-05-31T11:08:43Z","title":"Reliable Off-Policy Learning for Dosage Combinations","summary":" Decision-making in personalized medicine such as cancer therapy or critical\ncare must often make choices for dosage combinations, i.e., multiple continuous\ntreatments. Existing work for this task has modeled the effect of multiple\ntreatments independently, while estimating the joint effect has received little\nattention but comes with non-trivial challenges. In this paper, we propose a\nnovel method for reliable off-policy learning for dosage combinations. Our\nmethod proceeds along three steps: (1) We develop a tailored neural network\nthat estimates the individualized dose-response function while accounting for\nthe joint effect of multiple dependent dosages. 
(2) We estimate the generalized\npropensity score using conditional normalizing flows in order to detect regions\nwith limited overlap in the shared covariate-treatment space. (3) We present a\ngradient-based learning algorithm to find the optimal, individualized dosage\ncombinations. Here, we ensure reliable estimation of the policy value by\navoiding regions with limited overlap. We finally perform an extensive\nevaluation of our method to show its effectiveness. To the best of our\nknowledge, ours is the first work to provide a method for reliable off-policy\nlearning for optimal dosage combinations.\n","authors":["Jonas Schweisthal","Dennis Frauen","Valentyn Melnychuk","Stefan Feuerriegel"],"pdf_url":"https://arxiv.org/pdf/2305.19742v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.05351v3","updated":"2023-10-27T14:35:14Z","published":"2023-10-09T02:27:04Z","title":"Generalized Neural Collapse for a Large Number of Classes","summary":" Neural collapse provides an elegant mathematical characterization of learned\nlast layer representations (a.k.a. features) and classifier weights in deep\nclassification models. Such results not only provide insights but also motivate\nnew techniques for improving practical deep models. However, most of the\nexisting empirical and theoretical studies in neural collapse focus on the case\nthat the number of classes is small relative to the dimension of the feature\nspace. This paper extends neural collapse to cases where the number of classes\nare much larger than the dimension of feature space, which broadly occur for\nlanguage models, retrieval systems, and face recognition applications. We show\nthat the features and classifier exhibit a generalized neural collapse\nphenomenon, where the minimum one-vs-rest margins is maximized. We provide\nempirical study to verify the occurrence of generalized neural collapse in\npractical deep neural networks. 
Moreover, we provide theoretical study to show\nthat the generalized neural collapse provably occurs under unconstrained\nfeature model with spherical constraint, under certain technical conditions on\nfeature dimension and number of classes.\n","authors":["Jiachen Jiang","Jinxin Zhou","Peng Wang","Qing Qu","Dustin Mixon","Chong You","Zhihui Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.05351v3.pdf","comment":"32 pages, 12 figures"},{"id":"http://arxiv.org/abs/2307.02318v2","updated":"2023-10-27T14:31:51Z","published":"2023-07-05T14:20:20Z","title":"Deep Contract Design via Discontinuous Networks","summary":" Contract design involves a principal who establishes contractual agreements\nabout payments for outcomes that arise from the actions of an agent. In this\npaper, we initiate the study of deep learning for the automated design of\noptimal contracts. We introduce a novel representation: the Discontinuous ReLU\n(DeLU) network, which models the principal's utility as a discontinuous\npiecewise affine function of the design of a contract where each piece\ncorresponds to the agent taking a particular action. DeLU networks implicitly\nlearn closed-form expressions for the incentive compatibility constraints of\nthe agent and the utility maximization objective of the principal, and support\nparallel inference on each piece through linear programming or interior-point\nmethods that solve for optimal contracts. We provide empirical results that\ndemonstrate success in approximating the principal's utility function with a\nsmall number of training samples and scaling to find approximately optimal\ncontracts on problems with a large number of actions and outcomes.\n","authors":["Tonghan Wang","Paul Dütting","Dmitry Ivanov","Inbal Talgam-Cohen","David C. 
Parkes"],"pdf_url":"https://arxiv.org/pdf/2307.02318v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.06096v3","updated":"2023-10-27T14:31:32Z","published":"2022-12-12T18:10:33Z","title":"Implicit Convolutional Kernels for Steerable CNNs","summary":" Steerable convolutional neural networks (CNNs) provide a general framework\nfor building neural networks equivariant to translations and transformations of\nan origin-preserving group $G$, such as reflections and rotations. They rely on\nstandard convolutions with $G$-steerable kernels obtained by analytically\nsolving the group-specific equivariance constraint imposed onto the kernel\nspace. As the solution is tailored to a particular group $G$, implementing a\nkernel basis does not generalize to other symmetry transformations,\ncomplicating the development of general group equivariant models. We propose\nusing implicit neural representation via multi-layer perceptrons (MLPs) to\nparameterize $G$-steerable kernels. The resulting framework offers a simple and\nflexible way to implement Steerable CNNs and generalizes to any group $G$ for\nwhich a $G$-equivariant MLP can be built. We prove the effectiveness of our\nmethod on multiple tasks, including N-body simulations, point cloud\nclassification and molecular property prediction.\n","authors":["Maksim Zhdanov","Nico Hoffmann","Gabriele Cesa"],"pdf_url":"https://arxiv.org/pdf/2212.06096v3.pdf","comment":"Accepted to 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.17550v2","updated":"2023-10-27T14:31:25Z","published":"2023-10-26T16:45:34Z","title":"Human-Guided Complexity-Controlled Abstractions","summary":" Neural networks often learn task-specific latent representations that fail to\ngeneralize to novel settings or tasks. Conversely, humans learn discrete\nrepresentations (i.e., concepts or words) at a variety of abstraction levels\n(e.g., \"bird\" vs. 
\"sparrow\") and deploy the appropriate abstraction based on\ntask. Inspired by this, we train neural models to generate a spectrum of\ndiscrete representations, and control the complexity of the representations\n(roughly, how many bits are allocated for encoding inputs) by tuning the\nentropy of the distribution over representations. In finetuning experiments,\nusing only a small number of labeled examples for a new task, we show that (1)\ntuning the representation to a task-appropriate complexity level supports the\nhighest finetuning performance, and (2) in a human-participant study, users\nwere able to identify the appropriate complexity level for a downstream task\nusing visualizations of discrete representations. Our results indicate a\npromising direction for rapid model finetuning by leveraging human insight.\n","authors":["Andi Peng","Mycal Tucker","Eoin Kenny","Noga Zaslavsky","Pulkit Agrawal","Julie Shah"],"pdf_url":"https://arxiv.org/pdf/2310.17550v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.11551v2","updated":"2023-10-27T14:29:06Z","published":"2023-06-20T14:12:29Z","title":"IMP-MARL: a Suite of Environments for Large-scale Infrastructure\n Management Planning via MARL","summary":" We introduce IMP-MARL, an open-source suite of multi-agent reinforcement\nlearning (MARL) environments for large-scale Infrastructure Management Planning\n(IMP), offering a platform for benchmarking the scalability of cooperative MARL\nmethods in real-world engineering applications. In IMP, a multi-component\nengineering system is subject to a risk of failure due to its components'\ndamage condition. Specifically, each agent plans inspections and repairs for a\nspecific system component, aiming to minimise maintenance costs while\ncooperating to minimise system failure risk. 
With IMP-MARL, we release several\nenvironments including one related to offshore wind structural systems, in an\neffort to meet today's needs to improve management strategies to support\nsustainable and reliable energy systems. Supported by IMP practical engineering\nenvironments featuring up to 100 agents, we conduct a benchmark campaign, where\nthe scalability and performance of state-of-the-art cooperative MARL methods\nare compared against expert-based heuristic policies. The results reveal that\ncentralised training with decentralised execution methods scale better with the\nnumber of agents than fully centralised or decentralised RL approaches, while\nalso outperforming expert-based heuristic policies in most IMP environments.\nBased on our findings, we additionally outline remaining cooperation and\nscalability challenges that future MARL methods should still address. Through\nIMP-MARL, we encourage the implementation of new environments and the further\ndevelopment of MARL methods.\n","authors":["Pascal Leroy","Pablo G. Morato","Jonathan Pisane","Athanasios Kolios","Damien Ernst"],"pdf_url":"https://arxiv.org/pdf/2306.11551v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18168v1","updated":"2023-10-27T14:27:43Z","published":"2023-10-27T14:27:43Z","title":"Personas as a Way to Model Truthfulness in Language Models","summary":" Large Language Models are trained on vast amounts of text from the internet,\nwhich contains both factual and misleading information about the world. Can\nlanguage models discern truth from falsehood in this contradicting data?\nExpanding on the view that LLMs can model different agents producing the\ncorpora, we hypothesize that they can cluster truthful text by modeling a\ntruthful persona: a group of agents that are likely to produce truthful text\nand share similar features. For example, trustworthy sources like Wikipedia and\nScience usually use formal writing styles and make consistent claims. 
By\nmodeling this persona, LLMs can generalize truthfulness beyond the specific\ncontexts in which each agent generated the training text. For example, the\nmodel can infer that the agent \"Wikipedia\" will behave truthfully on topics\nthat were only generated by \"Science\" because they share a persona. We first\nshow evidence for the persona hypothesis via two observations: (1) we can probe\nwhether a model's answer will be truthful before it is generated; (2)\nfinetuning a model on a set of facts improves its truthfulness on unseen\ntopics. Next, using arithmetics as a synthetic environment, we show that\nlanguage models can separate true and false statements, and generalize\ntruthfulness across agents; but only if agents in the training data share a\ntruthful generative process that enables the creation of a truthful persona.\nOverall, our findings suggest that models can exploit hierarchical structures\nin the data to learn abstract concepts like truthfulness.\n","authors":["Nitish Joishi","Javier Rando","Abulhair Saparov","Najoung Kim","He He"],"pdf_url":"https://arxiv.org/pdf/2310.18168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.10088v2","updated":"2023-10-27T14:24:31Z","published":"2023-07-19T15:57:24Z","title":"Android in the Wild: A Large-Scale Dataset for Android Device Control","summary":" There is a growing interest in device-control systems that can interpret\nhuman natural language instructions and execute them on a digital device by\ndirectly controlling its user interface. We present a dataset for\ndevice-control research, Android in the Wild (AITW), which is orders of\nmagnitude larger than current datasets. The dataset contains human\ndemonstrations of device interactions, including the screens and actions, and\ncorresponding natural language instructions. It consists of 715k episodes\nspanning 30k unique instructions, four versions of Android (v10-13), and eight\ndevice types (Pixel 2 XL to Pixel 6) with varying screen resolutions. 
It\ncontains multi-step tasks that require semantic understanding of language and\nvisual context. This dataset poses a new challenge: actions available through\nthe user interface must be inferred from their visual appearance. And, instead\nof simple UI element-based actions, the action space consists of precise\ngestures (e.g., horizontal scrolls to operate carousel widgets). We organize\nour dataset to encourage robustness analysis of device-control systems, i.e.,\nhow well a system performs in the presence of new task descriptions, new\napplications, or new platform versions. We develop two agents and report\nperformance across the dataset. The dataset is available at\nhttps://github.com/google-research/google-research/tree/master/android_in_the_wild.\n","authors":["Christopher Rawles","Alice Li","Daniel Rodriguez","Oriana Riva","Timothy Lillicrap"],"pdf_url":"https://arxiv.org/pdf/2307.10088v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03792v2","updated":"2023-10-27T14:20:20Z","published":"2023-06-06T15:39:54Z","title":"FAMO: Fast Adaptive Multitask Optimization","summary":" One of the grand enduring goals of AI is to create generalist agents that can\nlearn multiple different tasks from diverse data via multitask learning (MTL).\nHowever, in practice, applying gradient descent (GD) on the average loss across\nall tasks may yield poor multitask performance due to severe under-optimization\nof certain tasks. Previous approaches that manipulate task gradients for a more\nbalanced loss decrease require storing and computing all task gradients\n($\\mathcal{O}(k)$ space and time where $k$ is the number of tasks), limiting\ntheir use in large-scale scenarios. In this work, we introduce Fast Adaptive\nMultitask Optimization FAMO, a dynamic weighting method that decreases task\nlosses in a balanced way using $\\mathcal{O}(1)$ space and time. We conduct an\nextensive set of experiments covering multi-task supervised and reinforcement\nlearning problems. 
Our results indicate that FAMO achieves comparable or\nsuperior performance to state-of-the-art gradient manipulation techniques while\noffering significant improvements in space and computational efficiency. Code\nis available at \\url{https://github.com/Cranial-XIX/FAMO}.\n","authors":["Bo Liu","Yihao Feng","Peter Stone","Qiang Liu"],"pdf_url":"https://arxiv.org/pdf/2306.03792v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18165v1","updated":"2023-10-27T14:17:35Z","published":"2023-10-27T14:17:35Z","title":"Enhancing Enterprise Network Security: Comparing Machine-Level and\n Process-Level Analysis for Dynamic Malware Detection","summary":" Analysing malware is important to understand how malicious software works and\nto develop appropriate detection and prevention methods. Dynamic analysis can\novercome evasion techniques commonly used to bypass static analysis and provide\ninsights into malware runtime activities. Much research on dynamic analysis\nfocused on investigating machine-level information (e.g., CPU, memory, network\nusage) to identify whether a machine is running malicious activities. A\nmalicious machine does not necessarily mean all running processes on the\nmachine are also malicious. If we can isolate the malicious process instead of\nisolating the whole machine, we could kill the malicious process, and the\nmachine can keep doing its job. Another challenge dynamic malware detection\nresearch faces is that the samples are executed in one machine without any\nbackground applications running. It is unrealistic as a computer typically runs\nmany benign (background) applications when a malware incident happens. Our\nexperiment with machine-level data shows that the existence of background\napplications decreases previous state-of-the-art accuracy by about 20.12% on\naverage. We also proposed a process-level Recurrent Neural Network (RNN)-based\ndetection model. 
Our proposed model performs better than the machine-level\ndetection model; 0.049 increase in detection rate and a false-positive rate\nbelow 0.1.\n","authors":["Baskoro Adi Pratomo","Toby Jackson","Pete Burnap","Andrew Hood","Eirini Anthi"],"pdf_url":"https://arxiv.org/pdf/2310.18165v1.pdf","comment":"Dataset link: https://github.com/bazz-066/cerberus-trace"},{"id":"http://arxiv.org/abs/2310.18162v1","updated":"2023-10-27T14:12:56Z","published":"2023-10-27T14:12:56Z","title":"Proportional Fairness in Clustering: A Social Choice Perspective","summary":" We study the proportional clustering problem of Chen et al. [ICML'19] and\nrelate it to the area of multiwinner voting in computational social choice. We\nshow that any clustering satisfying a weak proportionality notion of Brill and\nPeters [EC'23] simultaneously obtains the best known approximations to the\nproportional fairness notion of Chen et al. [ICML'19], but also to individual\nfairness [Jung et al., FORC'20] and the \"core\" [Li et al. ICML'21]. In fact, we\nshow that any approximation to proportional fairness is also an approximation\nto individual fairness and vice versa. Finally, we also study stronger notions\nof proportional representation, in which deviations do not only happen to\nsingle, but multiple candidate centers, and show that stronger proportionality\nnotions of Brill and Peters [EC'23] imply approximations to these stronger\nguarantees.\n","authors":["Leon Kellerhals","Jannik Peters"],"pdf_url":"https://arxiv.org/pdf/2310.18162v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.10225v2","updated":"2023-10-27T14:10:23Z","published":"2023-06-17T01:24:11Z","title":"Genes in Intelligent Agents","summary":" The genes in nature give the lives on earth the current biological\nintelligence through transmission and accumulation over billions of years.\nInspired by the biological intelligence, artificial intelligence (AI) has\ndevoted to building the machine intelligence. 
Although it has achieved thriving\nsuccesses, the machine intelligence still lags far behind the biological\nintelligence. The reason may lie in that animals are born with some\nintelligence encoded in their genes, but machines lack such intelligence and\nlearn from scratch. Inspired by the genes of animals, we define the ``genes''\nof machines named as the ``learngenes'' and propose the Genetic Reinforcement\nLearning (GRL). GRL is a computational framework that simulates the evolution\nof organisms in reinforcement learning (RL) and leverages the learngenes to\nlearn and evolve the intelligence agents. Leveraging GRL, we first show that\nthe learngenes take the form of the fragments of the agents' neural networks\nand can be inherited across generations. Second, we validate that the\nlearngenes can transfer ancestral experience to the agents and bring them\ninstincts and strong learning abilities. Third, we justify the Lamarckian\ninheritance of the intelligent agents and the continuous evolution of the\nlearngenes. Overall, the learngenes have taken the machine intelligence one\nmore step toward the biological intelligence.\n","authors":["Fu Feng","Jing Wang","Xu Yang","Xin Geng"],"pdf_url":"https://arxiv.org/pdf/2306.10225v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13786v2","updated":"2023-10-27T14:09:43Z","published":"2023-10-20T19:32:54Z","title":"Fundamental Limits of Membership Inference Attacks on Machine Learning\n Models","summary":" Membership inference attacks (MIA) can reveal whether a particular data point\nwas part of the training dataset, potentially exposing sensitive information\nabout individuals. This article explores the fundamental statistical\nlimitations associated with MIAs on machine learning models. More precisely, we\nfirst derive the statistical quantity that governs the effectiveness and\nsuccess of such attacks. Then, we investigate several situations for which we\nprovide bounds on this quantity of interest. 
This allows us to infer the\naccuracy of potential attacks as a function of the number of samples and other\nstructural parameters of learning models, which in some cases can be directly\nestimated from the dataset.\n","authors":["Eric Aubinais","Elisabeth Gassiat","Pablo Piantanida"],"pdf_url":"https://arxiv.org/pdf/2310.13786v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.04810v3","updated":"2023-10-27T14:05:02Z","published":"2023-09-09T14:29:22Z","title":"Neural Latent Geometry Search: Product Manifold Inference via\n Gromov-Hausdorff-Informed Bayesian Optimization","summary":" Recent research indicates that the performance of machine learning models can\nbe improved by aligning the geometry of the latent space with the underlying\ndata structure. Rather than relying solely on Euclidean space, researchers have\nproposed using hyperbolic and spherical spaces with constant curvature, or\ncombinations thereof, to better model the latent space and enhance model\nperformance. However, little attention has been given to the problem of\nautomatically identifying the optimal latent geometry for the downstream task.\nWe mathematically define this novel formulation and coin it as neural latent\ngeometry search (NLGS). More specifically, we introduce an initial attempt to\nsearch for a latent geometry composed of a product of constant curvature model\nspaces with a small number of query evaluations, under some simplifying\nassumptions. To accomplish this, we propose a novel notion of distance between\ncandidate latent geometries based on the Gromov-Hausdorff distance from metric\ngeometry. In order to compute the Gromov-Hausdorff distance, we introduce a\nmapping function that enables the comparison of different manifolds by\nembedding them in a common high-dimensional ambient space. We then design a\ngraph search space based on the notion of smoothness between latent geometries\nand employ the calculated distances as an additional inductive bias. 
Finally,\nwe use Bayesian optimization to search for the optimal latent geometry in a\nquery-efficient manner. This is a general method which can be applied to search\nfor the optimal latent geometry for a variety of models and downstream tasks.\nWe perform experiments on synthetic and real-world datasets to identify the\noptimal latent geometry for multiple machine learning problems.\n","authors":["Haitz Saez de Ocariz Borde","Alvaro Arroyo","Ismael Morales","Ingmar Posner","Xiaowen Dong"],"pdf_url":"https://arxiv.org/pdf/2309.04810v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08445v2","updated":"2023-10-27T14:04:36Z","published":"2023-06-14T11:37:12Z","title":"Deep Gaussian Markov Random Fields for Graph-Structured Dynamical\n Systems","summary":" Probabilistic inference in high-dimensional state-space models is\ncomputationally challenging. For many spatiotemporal systems, however, prior\nknowledge about the dependency structure of state variables is available. We\nleverage this structure to develop a computationally efficient approach to\nstate estimation and learning in graph-structured state-space models with\n(partially) unknown dynamics and limited historical data. Building on recent\nmethods that combine ideas from deep learning with principled inference in\nGaussian Markov random fields (GMRF), we reformulate graph-structured\nstate-space models as Deep GMRFs defined by simple spatial and temporal graph\nlayers. This results in a flexible spatiotemporal prior that can be learned\nefficiently from a single time sequence via variational inference. Under linear\nGaussian assumptions, we retain a closed-form posterior, which can be sampled\nefficiently using the conjugate gradient method, scaling favourably compared to\nclassical Kalman filter based approaches\n","authors":["Fiona Lippert","Bart Kranstauber","E. 
Emiel van Loon","Patrick Forré"],"pdf_url":"https://arxiv.org/pdf/2306.08445v2.pdf","comment":"NeurIPS 2023; camera-ready version"},{"id":"http://arxiv.org/abs/2310.18152v1","updated":"2023-10-27T14:00:04Z","published":"2023-10-27T14:00:04Z","title":"Disentangled Representation Learning with Large Language Models for\n Text-Attributed Graphs","summary":" Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs\nsuch as citation networks, e-commerce networks and social networks has\nattracted considerable attention in the web community. Recently, large language\nmodels (LLMs) have demonstrated exceptional capabilities across a wide range of\ntasks. However, the existing works focus on harnessing the potential of LLMs\nsolely relying on prompts to convey graph structure information to LLMs, thus\nsuffering from insufficient understanding of the complex structural\nrelationships within TAGs. To address this problem, in this paper we present\nthe Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the\nreasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model\nincorporates graph structure information through tailored disentangled graph\nneural network (GNN) layers, enabling LLMs to capture the intricate\nrelationships hidden in text-attributed graphs from multiple structural\nfactors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing\ncomputational costs and allowing much more flexibility in combining with\ndifferent LLM models. Experimental evaluations demonstrate the effectiveness of\nthe proposed DGTL model on achieving superior or comparable performance over\nstate-of-the-art baselines. 
Additionally, we also demonstrate that our DGTL\nmodel can offer natural language explanations for predictions, thereby\nsignificantly enhancing model interpretability.\n","authors":["Yijian Qin","Xin Wang","Ziwei Zhang","Wenwu Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.18152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18144v1","updated":"2023-10-27T13:51:18Z","published":"2023-10-27T13:51:18Z","title":"Improving Intrinsic Exploration by Creating Stationary Objectives","summary":" Exploration bonuses in reinforcement learning guide long-horizon exploration\nby defining custom intrinsic objectives. Count-based methods use the frequency\nof state visits to derive an exploration bonus. In this paper, we identify that\nany intrinsic reward function derived from count-based methods is\nnon-stationary and hence induces a difficult objective to optimize for the\nagent. The key contribution of our work lies in transforming the original\nnon-stationary rewards into stationary rewards through an augmented state\nrepresentation. For this purpose, we introduce the Stationary Objectives For\nExploration (SOFE) framework. SOFE requires identifying sufficient statistics\nfor different exploration bonuses and finding an efficient encoding of these\nstatistics to use as input to a deep network. SOFE is based on proposing state\naugmentations that expand the state space but hold the promise of simplifying\nthe optimization of the agent's objective. 
Our experiments show that SOFE\nimproves the agents' performance in challenging exploration problems, including\nsparse-reward tasks, pixel-based observations, 3D navigation, and procedurally\ngenerated environments.\n","authors":["Roger Creus Castanyer","Joshua Romoff","Glen Berseth"],"pdf_url":"https://arxiv.org/pdf/2310.18144v1.pdf","comment":"Under Review at ICLR 2024"},{"id":"http://arxiv.org/abs/2310.18141v1","updated":"2023-10-27T13:45:30Z","published":"2023-10-27T13:45:30Z","title":"Unsupervised Representation Learning for Diverse Deformable Shape\n Collections","summary":" We introduce a novel learning-based method for encoding and manipulating 3D\nsurface meshes. Our method is specifically designed to create an interpretable\nembedding space for deformable shape collections. Unlike previous 3D mesh\nautoencoders that require meshes to be in a 1-to-1 correspondence, our approach\nis trained on diverse meshes in an unsupervised manner. Central to our method\nis a spectral pooling technique that establishes a universal latent space,\nbreaking free from traditional constraints of mesh connectivity and shape\ncategories. The entire process consists of two stages. In the first stage, we\nemploy the functional map paradigm to extract point-to-point (p2p) maps between\na collection of shapes in an unsupervised manner. 
These p2p maps are then\nutilized to construct a common latent space, which ensures straightforward\ninterpretation and independence from mesh connectivity and shape category.\nThrough extensive experiments, we demonstrate that our method achieves\nexcellent reconstructions and produces more realistic and smoother\ninterpolations than baseline approaches.\n","authors":["Sara Hahner","Souhaib Attaiki","Jochen Garcke","Maks Ovsjanikov"],"pdf_url":"https://arxiv.org/pdf/2310.18141v1.pdf","comment":"Accepted at International Conference on 3D Vision 2024"},{"id":"http://arxiv.org/abs/2307.03672v2","updated":"2023-10-27T13:37:43Z","published":"2023-07-07T15:42:35Z","title":"Simulation-free Schrödinger bridges via score and flow matching","summary":" We present simulation-free score and flow matching ([SF]$^2$M), a\nsimulation-free objective for inferring stochastic dynamics given unpaired\nsamples drawn from arbitrary source and target distributions. Our method\ngeneralizes both the score-matching loss used in the training of diffusion\nmodels and the recently proposed flow matching loss used in the training of\ncontinuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic\ngenerative modeling as a Schr\\\"odinger bridge problem. It relies on static\nentropy-regularized optimal transport, or a minibatch approximation, to\nefficiently learn the SB without simulating the learned stochastic process. We\nfind that [SF]$^2$M is more efficient and gives more accurate solutions to the\nSB problem than simulation-based methods from prior work. Finally, we apply\n[SF]$^2$M to the problem of learning cell dynamics from snapshot data. 
Notably,\n[SF]$^2$M is the first method to accurately model cell dynamics in high\ndimensions and can recover known gene regulatory networks from simulated data.\n","authors":["Alexander Tong","Nikolay Malkin","Kilian Fatras","Lazar Atanackovic","Yanlei Zhang","Guillaume Huguet","Guy Wolf","Yoshua Bengio"],"pdf_url":"https://arxiv.org/pdf/2307.03672v2.pdf","comment":"A version of this paper appeared in the New Frontiers in Learning,\n Control, and Dynamical Systems workshop at ICML 2023. Code:\n https://github.com/atong01/conditional-flow-matching"},{"id":"http://arxiv.org/abs/2310.18127v1","updated":"2023-10-27T13:19:19Z","published":"2023-10-27T13:19:19Z","title":"Ask more, know better: Reinforce-Learned Prompt Questions for Decision\n Making with Large Language Models","summary":" Large language models (LLMs) demonstrate their promise in tackling\ncomplicated practical challenges by combining action-based policies with chain\nof thought (CoT) reasoning. Having high-quality prompts on hand, however, is\nvital to the framework's effectiveness. Currently, these prompts are\nhandcrafted utilizing extensive human labor, resulting in CoT policies that\nfrequently fail to generalize. Human intervention is also required in order to\ndevelop grounding functions that ensure low-level controllers appropriately\nprocess CoT reasoning. In this paper, we take the first step towards a fully\nintegrated end-to-end framework for task-solving in real settings employing\ncomplicated reasoning. To that purpose, we offer a new leader-follower bilevel\nframework capable of learning to ask relevant questions (prompts) and\nsubsequently undertaking reasoning to guide the learning of actions to be\nperformed in an environment. A good prompt should make introspective revisions\nbased on historical findings, leading the CoT to consider the anticipated\ngoals. 
A prompt-generator policy has its own aim in our system, allowing it to\nadapt to the action policy and automatically root the CoT process towards\noutputs that lead to decisive, high-performing actions. Meanwhile, the action\npolicy is learning how to use the CoT outputs to take specific actions. Our\nempirical data reveal that our system outperforms leading methods in agent\nlearning benchmarks such as Overcooked and FourRoom.\n","authors":["Xue Yan","Yan Song","Xinyu Cui","Filippos Christianos","Haifeng Zhang","David Henry Mguni","Jun Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18123v1","updated":"2023-10-27T13:09:56Z","published":"2023-10-27T13:09:56Z","title":"Sample Complexity Bounds for Score-Matching: Causal Discovery and\n Generative Modeling","summary":" This paper provides statistical sample complexity bounds for score-matching\nand its applications in causal discovery. We demonstrate that accurate\nestimation of the score function is achievable by training a standard deep ReLU\nneural network using stochastic gradient descent. We establish bounds on the\nerror rate of recovering causal relationships using the score-matching-based\ncausal discovery method of Rolland et al. [2022], assuming a sufficiently good\nestimation of the score function. 
Finally, we analyze the upper bound of\nscore-matching estimation within the score-based generative modeling, which has\nbeen applied for causal discovery but is also of independent interest within\nthe domain of generative models.\n","authors":["Zhenyu Zhu","Francesco Locatello","Volkan Cevher"],"pdf_url":"https://arxiv.org/pdf/2310.18123v1.pdf","comment":"Accepted in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18118v1","updated":"2023-10-27T13:04:53Z","published":"2023-10-27T13:04:53Z","title":"A Global Multi-Unit Calibration as a Method for Large Scale IoT\n Particulate Matter Monitoring Systems Deployments","summary":" Scalable and effective calibration is a fundamental requirement for Low Cost\nAir Quality Monitoring Systems and will enable accurate and pervasive\nmonitoring in cities. Suffering from environmental interferences and\nfabrication variance, these devices need to encompass sensors specific and\ncomplex calibration processes for reaching a sufficient accuracy to be deployed\nas indicative measurement devices in Air Quality (AQ) monitoring networks.\nConcept and sensor drift often force calibration process to be frequently\nrepeated. These issues lead to unbearable calibration costs which denies their\nmassive deployment when accuracy is a concern. In this work, We propose a zero\ntransfer samples, global calibration methodology as a technological enabler for\nIoT AQ multisensory devices which relies on low cost Particulate Matter (PM)\nsensors. This methodology is based on field recorded responses from a limited\nnumber of IoT AQ multisensors units and machine learning concepts and can be\nuniversally applied to all units of the same type. A multi season test campaign\nshown that, when applied to different sensors, this methodology performances\nmatch those of state of the art methodology which requires to derive different\ncalibration parameters for each different unit. 
If confirmed, these results\nshow that, when properly derived, a global calibration law can be exploited for\na large number of networked devices with dramatic cost reduction eventually\nallowing massive deployment of accurate IoT AQ monitoring devices. Furthermore,\nthis calibration model could be easily embedded on board of the device or\nimplemented on the edge allowing immediate access to accurate readings for\npersonal exposure monitor applications as well as reducing long range data\ntransfer needs.\n","authors":["Saverio De Vito","Gerardo D Elia","Sergio Ferlito","Girolamo Di Francia","Milos Davidovic","Duska Kleut","Danka Stojanovic","Milena Jovasevic Stojanovic"],"pdf_url":"https://arxiv.org/pdf/2310.18118v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.04858v2","updated":"2023-10-27T12:58:06Z","published":"2022-12-09T13:56:42Z","title":"Implicit variance regularization in non-contrastive SSL","summary":" Non-contrastive SSL methods like BYOL and SimSiam rely on asymmetric\npredictor networks to avoid representational collapse without negative samples.\nYet, how predictor networks facilitate stable learning is not fully understood.\nWhile previous theoretical analyses assumed Euclidean losses, most practical\nimplementations rely on cosine similarity. To gain further theoretical insight\ninto non-contrastive SSL, we analytically study learning dynamics in\nconjunction with Euclidean and cosine similarity in the eigenspace of\nclosed-form linear predictor networks. We show that both avoid collapse through\nimplicit variance regularization albeit through different dynamical mechanisms.\nMoreover, we find that the eigenvalues act as effective learning rate\nmultipliers and propose a family of isotropic loss functions (IsoLoss) that\nequalize convergence rates across eigenmodes. 
Empirically, IsoLoss speeds up\nthe initial learning dynamics and increases robustness, thereby allowing us to\ndispense with the EMA target network typically used with non-contrastive\nmethods. Our analysis sheds light on the variance regularization mechanisms of\nnon-contrastive SSL and lays the theoretical grounds for crafting novel loss\nfunctions that shape the learning dynamics of the predictor's spectrum.\n","authors":["Manu Srinath Halvagal","Axel Laborieux","Friedemann Zenke"],"pdf_url":"https://arxiv.org/pdf/2212.04858v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18108v1","updated":"2023-10-27T12:48:30Z","published":"2023-10-27T12:48:30Z","title":"Transductive conformal inference with adaptive scores","summary":" Conformal inference is a fundamental and versatile tool that provides\ndistribution-free guarantees for many machine learning tasks. We consider the\ntransductive setting, where decisions are made on a test sample of $m$ new\npoints, giving rise to $m$ conformal $p$-values. {While classical results only\nconcern their marginal distribution, we show that their joint distribution\nfollows a P\\'olya urn model, and establish a concentration inequality for their\nempirical distribution function.} The results hold for arbitrary exchangeable\nscores, including {\\it adaptive} ones that can use the covariates of the\ntest+calibration samples at training stage for increased accuracy. 
We\ndemonstrate the usefulness of these theoretical results through uniform,\nin-probability guarantees for two machine learning tasks of current interest:\ninterval prediction for transductive transfer learning and novelty detection\nbased on two-class classification.\n","authors":["Ulysse Gazin","Gilles Blanchard","Etienne Roquain"],"pdf_url":"https://arxiv.org/pdf/2310.18108v1.pdf","comment":"27 pages, 6 Figures"},{"id":"http://arxiv.org/abs/2305.17161v2","updated":"2023-10-27T12:37:54Z","published":"2023-05-26T18:00:01Z","title":"Flow Matching for Scalable Simulation-Based Inference","summary":" Neural posterior estimation methods based on discrete normalizing flows have\nbecome established tools for simulation-based inference (SBI), but scaling them\nto high-dimensional problems can be challenging. Building on recent advances in\ngenerative modeling, we here present flow matching posterior estimation (FMPE),\na technique for SBI using continuous normalizing flows. Like diffusion models,\nand in contrast to discrete flows, flow matching allows for unconstrained\narchitectures, providing enhanced flexibility for complex data modalities. Flow\nmatching, therefore, enables exact density evaluation, fast training, and\nseamless scalability to large architectures--making it ideal for SBI. We show\nthat FMPE achieves competitive performance on an established SBI benchmark, and\nthen demonstrate its improved scalability on a challenging scientific problem:\nfor gravitational-wave inference, FMPE outperforms methods based on comparable\ndiscrete flows, reducing training time by 30% with substantially improved\naccuracy. Our work underscores the potential of FMPE to enhance performance in\nchallenging inference scenarios, thereby paving the way for more advanced\napplications to scientific problems.\n","authors":["Maximilian Dax","Jonas Wildberger","Simon Buchholz","Stephen R. Green","Jakob H. 
Macke","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2305.17161v2.pdf","comment":"NeurIPS 2023. Code available at\n https://github.com/dingo-gw/flow-matching-posterior-estimation"},{"id":"http://arxiv.org/abs/2310.18091v1","updated":"2023-10-27T12:24:08Z","published":"2023-10-27T12:24:08Z","title":"Adversarial Anomaly Detection using Gaussian Priors and Nonlinear\n Anomaly Scores","summary":" Anomaly detection in imbalanced datasets is a frequent and crucial problem,\nespecially in the medical domain where retrieving and labeling irregularities\nis often expensive. By combining the generative stability of a\n$\\beta$-variational autoencoder (VAE) with the discriminative strengths of\ngenerative adversarial networks (GANs), we propose a novel model,\n$\\beta$-VAEGAN. We investigate methods for composing anomaly scores based on\nthe discriminative and reconstructive capabilities of our model. Existing work\nfocuses on linear combinations of these components to determine if data is\nanomalous. We advance existing work by training a kernelized support vector\nmachine (SVM) on the respective error components to also consider nonlinear\nrelationships. This improves anomaly detection performance, while allowing\nfaster optimization. Lastly, we use the deviations from the Gaussian prior of\n$\\beta$-VAEGAN to form a novel anomaly score component. In comparison to\nstate-of-the-art work, we improve the $F_1$ score during anomaly detection from\n0.85 to 0.92 on the widely used MITBIH Arrhythmia Database.\n","authors":["Fiete Lüer","Tobias Weber","Maxim Dolgich","Christian Böhm"],"pdf_url":"https://arxiv.org/pdf/2310.18091v1.pdf","comment":"accepted at AI4TS @ ICDMW 2023"},{"id":"http://arxiv.org/abs/2310.17526v2","updated":"2023-10-27T12:14:27Z","published":"2023-10-26T16:18:30Z","title":"Can large language models replace humans in the systematic review\n process? 
Evaluating GPT-4's efficacy in screening and extracting data from\n peer-reviewed and grey literature in multiple languages","summary":" Systematic reviews are vital for guiding practice, research, and policy, yet\nthey are often slow and labour-intensive. Large language models (LLMs) could\noffer a way to speed up and automate systematic reviews, but their performance\nin such tasks has not been comprehensively evaluated against humans, and no\nstudy has tested GPT-4, the biggest LLM so far. This pre-registered study\nevaluates GPT-4's capability in title/abstract screening, full-text review, and\ndata extraction across various literature types and languages using a\n'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human\nperformance in most tasks, results were skewed by chance agreement and dataset\nimbalance. After adjusting for these, there was a moderate level of performance\nfor data extraction, and - barring studies that used highly reliable prompts -\nscreening performance levelled at none to moderate for different stages and\nlanguages. When screening full-text literature using highly reliable prompts,\nGPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key\nstudies using highly reliable prompts improved its performance even more. 
Our\nfindings indicate that, currently, substantial caution should be used if LLMs\nare being used to conduct systematic reviews, but suggest that, for certain\nsystematic review tasks delivered under reliable prompts, LLMs can rival human\nperformance.\n","authors":["Qusai Khraisha","Sophie Put","Johanna Kappenberg","Azza Warraitch","Kristin Hadfield"],"pdf_url":"https://arxiv.org/pdf/2310.17526v2.pdf","comment":"9 pages, 2 figures, 1 table"},{"id":"http://arxiv.org/abs/1912.13490v4","updated":"2023-10-27T12:08:01Z","published":"2019-12-31T18:45:33Z","title":"A Neurocomputational Account of Flexible Goal-directed Cognition and\n Consciousness: The Goal-Aligning Representation Internal Manipulation Theory\n (GARIM)","summary":" Goal-directed manipulation of representations is a key element of human\nflexible behaviour, while consciousness is often related to several aspects of\nhigher-order cognition and human flexibility. Currently these two phenomena are\nonly partially integrated (e.g., see Neurorepresentationalism) and this (a)\nlimits our understanding of neuro-computational processes that lead conscious\nstates to produce flexible goal-directed behaviours, (b) prevents a\ncomputational formalisation of conscious goal-directed manipulations of\nrepresentations occurring in the brain, and (c) inhibits the exploitation of\nthis knowledge for modelling and technological purposes. Addressing these\nissues, here we extend our `three-component theory of flexible cognition' by\nproposing the `Goal-Aligning Representations Internal Manipulation' (GARIM)\ntheory of conscious and flexible goal-directed cognition. The central idea of\nthe theory is that conscious states support the active manipulation of\ngoal-relevant internal representations (e.g., of world states, objects, and\naction sequences) to make them more aligned with the pursued goals. 
This leads\nto the generation of the knowledge which is necessary to face novel\nsituations/goals, thus increasing the flexibility of goal-directed behaviours.\nThe GARIM theory integrates key aspects of the main theories of consciousness\ninto the functional neuro-computational framework of goal-directed behaviour.\nMoreover, it takes into account the subjective sensation of agency that\naccompanies conscious goal-directed processes (`GARIM agency'). The proposal\nhas also implications for experimental studies on consciousness and clinical\naspects of conscious goal-directed behaviour. Finally, the GARIM theory benefit\ntechnological fields such as autonomous robotics and machine learning (e.g.,\nthe manipulation process may describe the operations performed by systems based\non transformers).\n","authors":["Giovanni Granato","Gianluca Baldassarre"],"pdf_url":"https://arxiv.org/pdf/1912.13490v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18080v1","updated":"2023-10-27T12:01:16Z","published":"2023-10-27T12:01:16Z","title":"Unveiling the Potential of Probabilistic Embeddings in Self-Supervised\n Learning","summary":" In recent years, self-supervised learning has played a pivotal role in\nadvancing machine learning by allowing models to acquire meaningful\nrepresentations from unlabeled data. An intriguing research avenue involves\ndeveloping self-supervised models within an information-theoretic framework,\nbut many studies often deviate from the stochasticity assumptions made when\nderiving their objectives. To gain deeper insights into this issue, we propose\nto explicitly model the representation with stochastic embeddings and assess\ntheir effects on performance, information compression and potential for\nout-of-distribution detection. 
From an information-theoretic perspective, we\nseek to investigate the impact of probabilistic modeling on the information\nbottleneck, shedding light on a trade-off between compression and preservation\nof information in both representation and loss space. Emphasizing the\nimportance of distinguishing between these two spaces, we demonstrate how\nconstraining one can affect the other, potentially leading to performance\ndegradation. Moreover, our findings suggest that introducing an additional\nbottleneck in the loss space can significantly enhance the ability to detect\nout-of-distribution examples, only leveraging either representation features or\nthe variance of their underlying distribution.\n","authors":["Denis Janiak","Jakub Binkowski","Piotr Bielak","Tomasz Kajdanowicz"],"pdf_url":"https://arxiv.org/pdf/2310.18080v1.pdf","comment":"Under review by AISTATS 2024"},{"id":"http://arxiv.org/abs/2310.18078v1","updated":"2023-10-27T11:56:43Z","published":"2023-10-27T11:56:43Z","title":"Lipschitz and Hölder Continuity in Reproducing Kernel Hilbert Spaces","summary":" Reproducing kernel Hilbert spaces (RKHSs) are very important function spaces,\nplaying an important role in machine learning, statistics, numerical analysis\nand pure mathematics. Since Lipschitz and H\\\"older continuity are important\nregularity properties, with many applications in interpolation, approximation\nand optimization problems, in this work we investigate these continuity notion\nin RKHSs. We provide several sufficient conditions as well as an in depth\ninvestigation of reproducing kernels inducing prescribed Lipschitz or H\\\"older\ncontinuity. 
Apart from new results, we also collect related known results from\nthe literature, making the present work also a convenient reference on this\ntopic.\n","authors":["Christian Fiedler"],"pdf_url":"https://arxiv.org/pdf/2310.18078v1.pdf","comment":"Preprint, under review"},{"id":"http://arxiv.org/abs/2309.16318v2","updated":"2023-10-27T11:51:52Z","published":"2023-09-28T10:15:30Z","title":"DeepPCR: Parallelizing Sequential Operations in Neural Networks","summary":" Parallelization techniques have become ubiquitous for accelerating inference\nand training of deep neural networks. Despite this, several operations are\nstill performed in a sequential manner. For instance, the forward and backward\npasses are executed layer-by-layer, and the output of diffusion models is\nproduced by applying a sequence of denoising steps. This sequential approach\nresults in a computational cost proportional to the number of steps involved,\npresenting a potential bottleneck as the number of steps increases. In this\nwork, we introduce DeepPCR, a novel algorithm which parallelizes typically\nsequential operations in order to speed up inference and training of neural\nnetworks. DeepPCR is based on interpreting a sequence of $L$ steps as the\nsolution of a specific system of equations, which we recover using the Parallel\nCyclic Reduction algorithm. This reduces the complexity of computing the\nsequential operations from $\\mathcal{O}(L)$ to $\\mathcal{O}(\\log_2L)$, thus\nyielding a speedup for large $L$. To verify the theoretical lower complexity of\nthe algorithm, and to identify regimes for speedup, we test the effectiveness\nof DeepPCR in parallelizing the forward and backward pass in multi-layer\nperceptrons, and reach speedups of up to $30\\times$ for the forward and\n$200\\times$ for the backward pass. 
We additionally showcase the flexibility of\nDeepPCR by parallelizing training of ResNets with as many as 1024 layers, and\ngeneration in diffusion models, enabling up to $7\\times$ faster training and\n$11\\times$ faster generation, respectively, when compared to the sequential\napproach.\n","authors":["Federico Danieli","Miguel Sarabia","Xavier Suau","Pau Rodríguez","Luca Zappella"],"pdf_url":"https://arxiv.org/pdf/2309.16318v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18074v1","updated":"2023-10-27T11:42:56Z","published":"2023-10-27T11:42:56Z","title":"On kernel-based statistical learning in the mean field limit","summary":" In many applications of machine learning, a large number of variables are\nconsidered. Motivated by machine learning of interacting particle systems, we\nconsider the situation when the number of input variables goes to infinity.\nFirst, we continue the recent investigation of the mean field limit of kernels\nand their reproducing kernel Hilbert spaces, completing the existing theory.\nNext, we provide results relevant for approximation with such kernels in the\nmean field limit, including a representer theorem. Finally, we use these\nkernels in the context of statistical learning in the mean field limit,\nfocusing on Support Vector Machines. In particular, we show mean field\nconvergence of empirical and infinite-sample solutions as well as the\nconvergence of the corresponding risks. On the one hand, our results establish\nrigorous mean field limits in the context of kernel methods, providing new\ntheoretical tools and insights for large-scale problems. 
On the other hand, our\nsetting corresponds to a new form of limit of learning problems, which seems to\nhave not been investigated yet in the statistical learning theory literature.\n","authors":["Christian Fiedler","Michael Herty","Sebastian Trimpe"],"pdf_url":"https://arxiv.org/pdf/2310.18074v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2304.02247v2","updated":"2023-10-27T11:35:04Z","published":"2023-04-05T06:35:41Z","title":"Disentangling Structure and Style: Political Bias Detection in News by\n Inducing Document Hierarchy","summary":" We address an important gap in detecting political bias in news articles.\nPrevious works that perform document classification can be influenced by the\nwriting style of each news outlet, leading to overfitting and limited\ngeneralizability. Our approach overcomes this limitation by considering both\nthe sentence-level semantics and the document-level rhetorical structure,\nresulting in a more robust and style-agnostic approach to detecting political\nbias in news articles. We introduce a novel multi-head hierarchical attention\nmodel that effectively encodes the structure of long documents through a\ndiverse ensemble of attention heads. While journalism follows a formalized\nrhetorical structure, the writing style may vary by news outlet. We demonstrate\nthat our method overcomes this domain dependency and outperforms previous\napproaches for robustness and accuracy. Further analysis and human evaluation\ndemonstrate the ability of our model to capture common discourse structures in\njournalism. 
Our code is available at:\nhttps://github.com/xfactlab/emnlp2023-Document-Hierarchy\n","authors":["Jiwoo Hong","Yejin Cho","Jaemin Jung","Jiyoung Han","James Thorne"],"pdf_url":"https://arxiv.org/pdf/2304.02247v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18063v1","updated":"2023-10-27T11:26:27Z","published":"2023-10-27T11:26:27Z","title":"\"Honey, Tell Me What's Wrong\", Global Explanation of Textual\n Discriminative Models through Cooperative Generation","summary":" The ubiquity of complex machine learning has raised the importance of\nmodel-agnostic explanation algorithms. These methods create artificial\ninstances by slightly perturbing real instances, capturing shifts in model\ndecisions. However, such methods rely on initial data and only provide\nexplanations of the decision for these. To tackle these problems, we propose\nTherapy, the first global and model-agnostic explanation method adapted to text\nwhich requires no input dataset. Therapy generates texts following the\ndistribution learned by a classifier through cooperative generation. Because it\ndoes not rely on initial samples, it allows to generate explanations even when\ndata is absent (e.g., for confidentiality reasons). Moreover, conversely to\nexisting methods that combine multiple local explanations into a global one,\nTherapy offers a global overview of the model behavior on the input space. Our\nexperiments show that although using no input data to generate samples, Therapy\nprovides insightful information about features used by the classifier that is\ncompetitive with the ones from methods relying on input samples and outperforms\nthem when input samples are not specific to the studied model.\n","authors":["Antoine Chaffin","Julien Delaunay"],"pdf_url":"https://arxiv.org/pdf/2310.18063v1.pdf","comment":"8 pages plus references and 2 pages of appendices. 
7 figures and 2\n tables"},{"id":"http://arxiv.org/abs/2306.09312v2","updated":"2023-10-27T10:34:20Z","published":"2023-06-15T17:47:31Z","title":"Semantic HELM: A Human-Readable Memory for Reinforcement Learning","summary":" Reinforcement learning agents deployed in the real world often have to cope\nwith partially observable environments. Therefore, most agents employ memory\nmechanisms to approximate the state of the environment. Recently, there have\nbeen impressive success stories in mastering partially observable environments,\nmostly in the realm of computer games like Dota 2, StarCraft II, or MineCraft.\nHowever, existing methods lack interpretability in the sense that it is not\ncomprehensible for humans what the agent stores in its memory. In this regard,\nwe propose a novel memory mechanism that represents past events in human\nlanguage. Our method uses CLIP to associate visual inputs with language tokens.\nThen we feed these tokens to a pretrained language model that serves the agent\nas memory and provides it with a coherent and human-readable representation of\nthe past. We train our memory mechanism on a set of partially observable\nenvironments and find that it excels on tasks that require a memory component,\nwhile mostly attaining performance on-par with strong baselines on tasks that\ndo not. On a challenging continuous recognition task, where memorizing the past\nis crucial, our memory mechanism converges two orders of magnitude faster than\nprior methods. Since our memory mechanism is human-readable, we can peek at an\nagent's memory and check whether crucial pieces of information have been\nstored. 
This significantly enhances troubleshooting and paves the way toward\nmore interpretable agents.\n","authors":["Fabian Paischer","Thomas Adler","Markus Hofmarcher","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2306.09312v2.pdf","comment":"To appear at NeurIPS 2023, 10 pages (+ references and appendix),\n Code: https://github.com/ml-jku/helm"},{"id":"http://arxiv.org/abs/2310.16655v2","updated":"2023-10-27T10:14:16Z","published":"2023-10-25T14:09:53Z","title":"Towards Control-Centric Representations in Reinforcement Learning from\n Images","summary":" Image-based Reinforcement Learning is a practical yet challenging task. A\nmajor hurdle lies in extracting control-centric representations while\ndisregarding irrelevant information. While approaches that follow the\nbisimulation principle exhibit the potential in learning state representations\nto address this issue, they still grapple with the limited expressive capacity\nof latent dynamics and the inadaptability to sparse reward environments. To\naddress these limitations, we introduce ReBis, which aims to capture\ncontrol-centric information by integrating reward-free control information\nalongside reward-specific knowledge. ReBis utilizes a transformer architecture\nto implicitly model the dynamics and incorporates block-wise masking to\neliminate spatiotemporal redundancy. Moreover, ReBis combines\nbisimulation-based loss with asymmetric reconstruction loss to prevent feature\ncollapse in environments with sparse rewards. 
Empirical studies on two large\nbenchmarks, including Atari games and DeepMind Control Suite, demonstrate that\nReBis has superior performance compared to existing methods, proving its\neffectiveness.\n","authors":["Chen Liu","Hongyu Zang","Xin Li","Yong Heng","Yifei Wang","Zhen Fang","Yisen Wang","Mingzhong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16655v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16379v2","updated":"2023-10-27T10:13:50Z","published":"2023-05-25T15:46:20Z","title":"Learning Better with Less: Effective Augmentation for Sample-Efficient\n Visual Reinforcement Learning","summary":" Data augmentation (DA) is a crucial technique for enhancing the sample\nefficiency of visual reinforcement learning (RL) algorithms. Notably, employing\nsimple observation transformations alone can yield outstanding performance\nwithout extra auxiliary representation tasks or pre-trained encoders. However,\nit remains unclear which attributes of DA account for its effectiveness in\nachieving sample-efficient visual RL. To investigate this issue and further\nexplore the potential of DA, this work conducts comprehensive experiments to\nassess the impact of DA's attributes on its efficacy and provides the following\ninsights and improvements: (1) For individual DA operations, we reveal that\nboth ample spatial diversity and slight hardness are indispensable. Building on\nthis finding, we introduce Random PadResize (Rand PR), a new DA operation that\noffers abundant spatial diversity with minimal hardness. (2) For multi-type DA\nfusion schemes, the increased DA hardness and unstable data distribution result\nin the current fusion schemes being unable to achieve higher sample efficiency\nthan their corresponding individual operations. 
Taking the non-stationary\nnature of RL into account, we propose a RL-tailored multi-type DA fusion scheme\ncalled Cycling Augmentation (CycAug), which performs periodic cycles of\ndifferent DA operations to increase type diversity while maintaining data\ndistribution consistency. Extensive evaluations on the DeepMind Control suite\nand CARLA driving simulator demonstrate that our methods achieve superior\nsample efficiency compared with the prior state-of-the-art methods.\n","authors":["Guozheng Ma","Linrui Zhang","Haoyu Wang","Lu Li","Zilin Wang","Zhen Wang","Li Shen","Xueqian Wang","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2305.16379v2.pdf","comment":"NeurIPS 2023 poster"},{"id":"http://arxiv.org/abs/2310.01972v2","updated":"2023-10-27T09:52:12Z","published":"2023-10-03T11:28:54Z","title":"Epidemic Learning: Boosting Decentralized Learning with Randomized\n Communication","summary":" We present Epidemic Learning (EL), a simple yet powerful decentralized\nlearning (DL) algorithm that leverages changing communication topologies to\nachieve faster model convergence compared to conventional DL approaches. At\neach round of EL, each node sends its model updates to a random sample of $s$\nother nodes (in a system of $n$ nodes). We provide an extensive theoretical\nanalysis of EL, demonstrating that its changing topology culminates in superior\nconvergence properties compared to the state-of-the-art (static and dynamic)\ntopologies. Considering smooth non-convex loss functions, the number of\ntransient iterations for EL, i.e., the rounds required to achieve asymptotic\nlinear speedup, is in $O(n^3/s^2)$ which outperforms the best-known bound\n$O(n^3)$ by a factor of $s^2$, indicating the benefit of randomized\ncommunication for DL. We empirically evaluate EL in a 96-node network and\ncompare its performance with state-of-the-art DL approaches. 
Our results\nillustrate that EL converges up to $ 1.7\\times$ quicker than baseline DL\nalgorithms and attains $2.2 $\\% higher accuracy for the same communication\nvolume.\n","authors":["Martijn de Vos","Sadegh Farhadkhani","Rachid Guerraoui","Anne-Marie Kermarrec","Rafael Pires","Rishi Sharma"],"pdf_url":"https://arxiv.org/pdf/2310.01972v2.pdf","comment":"Accepted paper at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2303.11205v2","updated":"2023-10-27T09:37:44Z","published":"2023-02-11T14:03:09Z","title":"Entropy-dissipation Informed Neural Network for McKean-Vlasov Type PDEs","summary":" We extend the concept of self-consistency for the Fokker-Planck equation\n(FPE) to the more general McKean-Vlasov equation (MVE). While FPE describes the\nmacroscopic behavior of particles under drift and diffusion, MVE accounts for\nthe additional inter-particle interactions, which are often highly singular in\nphysical systems. Two important examples considered in this paper are the MVE\nwith Coulomb interactions and the vorticity formulation of the 2D Navier-Stokes\nequation. We show that a generalized self-consistency potential controls the\nKL-divergence between a hypothesis solution to the ground truth, through\nentropy dissipation. Built on this result, we propose to solve the MVEs by\nminimizing this potential function, while utilizing the neural networks for\nfunction approximation. We validate the empirical performance of our approach\nby comparing with state-of-the-art NN-based PDE solvers on several example\nproblems.\n","authors":["Zebang Shen","Zhenfu Wang"],"pdf_url":"https://arxiv.org/pdf/2303.11205v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.04406v2","updated":"2023-10-27T09:30:19Z","published":"2023-06-07T13:04:34Z","title":"Generalized Teacher Forcing for Learning Chaotic Dynamics","summary":" Chaotic dynamical systems (DS) are ubiquitous in nature and society. 
Often we\nare interested in reconstructing such systems from observed time series for\nprediction or mechanistic insight, where by reconstruction we mean learning\ngeometrical and invariant temporal properties of the system in question (like\nattractors). However, training reconstruction algorithms like recurrent neural\nnetworks (RNNs) on such systems by gradient-descent based techniques faces\nsevere challenges. This is mainly due to exploding gradients caused by the\nexponential divergence of trajectories in chaotic systems. Moreover, for\n(scientific) interpretability we wish to have as low dimensional\nreconstructions as possible, preferably in a model which is mathematically\ntractable. Here we report that a surprisingly simple modification of teacher\nforcing leads to provably strictly all-time bounded gradients in training on\nchaotic systems, and, when paired with a simple architectural rearrangement of\na tractable RNN design, piecewise-linear RNNs (PLRNNs), allows for faithful\nreconstruction in spaces of at most the dimensionality of the observed system.\nWe show on several DS that with these amendments we can reconstruct DS better\nthan current SOTA algorithms, in much lower dimensions. Performance differences\nwere particularly compelling on real world data with which most other methods\nseverely struggled. 
This work thus led to a simple yet powerful DS\nreconstruction algorithm which is highly interpretable at the same time.\n","authors":["Florian Hess","Zahra Monfared","Manuel Brenner","Daniel Durstewitz"],"pdf_url":"https://arxiv.org/pdf/2306.04406v2.pdf","comment":"Published in the Proceedings of the 40th International Conference on\n Machine Learning (ICML 2023)"},{"id":"http://arxiv.org/abs/2310.17273v2","updated":"2023-10-27T09:24:34Z","published":"2023-10-26T09:50:31Z","title":"Looping in the Human: Collaborative and Explainable Bayesian\n Optimization","summary":" Like many optimizers, Bayesian optimization often falls short of gaining user\ntrust due to opacity. While attempts have been made to develop human-centric\noptimizers, they typically assume user knowledge is well-specified and\nerror-free, employing users mainly as supervisors of the optimization process.\nWe relax these assumptions and propose a more balanced human-AI partnership\nwith our Collaborative and Explainable Bayesian Optimization (CoExBO)\nframework. Instead of explicitly requiring a user to provide a knowledge model,\nCoExBO employs preference learning to seamlessly integrate human insights into\nthe optimization, resulting in algorithmic suggestions that resonate with user\npreference. CoExBO explains its candidate selection every iteration to foster\ntrust, empowering users with a clearer grasp of the optimization. Furthermore,\nCoExBO offers a no-harm guarantee, allowing users to make mistakes; even with\nextreme adversarial interventions, the algorithm converges asymptotically to a\nvanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI\nteaming experiments in lithium-ion battery design, highlighting substantial\nimprovements over conventional methods.\n","authors":["Masaki Adachi","Brady Planden","David A. Howey","Krikamol Maundet","Michael A. 
Osborne","Siu Lun Chau"],"pdf_url":"https://arxiv.org/pdf/2310.17273v2.pdf","comment":"22 pages, 9 figures"},{"id":"http://arxiv.org/abs/2309.02898v2","updated":"2023-10-27T09:24:20Z","published":"2023-09-06T10:41:30Z","title":"A Unified Framework for Discovering Discrete Symmetries","summary":" We consider the problem of learning a function respecting a symmetry from\namong a class of symmetries. We develop a unified framework that enables\nsymmetry discovery across a broad range of subgroups including locally\nsymmetric, dihedral and cyclic subgroups. At the core of the framework is a\nnovel architecture composed of linear, matrix-valued and non-linear functions\nthat expresses functions invariant to these subgroups in a principled manner.\nThe structure of the architecture enables us to leverage multi-armed bandit\nalgorithms and gradient descent to efficiently optimize over the linear and the\nnon-linear functions, respectively, and to infer the symmetry that is\nultimately learnt. We also discuss the necessity of the matrix-valued functions\nin the architecture. Experiments on image-digit sum and polynomial regression\ntasks demonstrate the effectiveness of our approach.\n","authors":["Pavan Karjol","Rohan Kashyap","Aditya Gopalan","Prathosh A. P"],"pdf_url":"https://arxiv.org/pdf/2309.02898v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18001v1","updated":"2023-10-27T09:17:15Z","published":"2023-10-27T09:17:15Z","title":"DP-SGD with weight clipping","summary":" Recently, due to the popularity of deep neural networks and other methods\nwhose training typically relies on the optimization of an objective function,\nand due to concerns for data privacy, there is a lot of interest in\ndifferentially private gradient descent methods. To achieve differential\nprivacy guarantees with a minimum amount of noise, it is important to be able\nto bound precisely the sensitivity of the information which the participants\nwill observe. 
In this study, we present a novel approach that mitigates the\nbias arising from traditional gradient clipping. By leveraging public\ninformation concerning the current global model and its location within the\nsearch domain, we can achieve improved gradient bounds, leading to enhanced\nsensitivity determinations and refined noise level adjustments. We extend the\nstate of the art algorithms, present improved differential privacy guarantees\nrequiring less noise and present an empirical evaluation.\n","authors":["Antoine Barczewski","Jan Ramon"],"pdf_url":"https://arxiv.org/pdf/2310.18001v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17998v1","updated":"2023-10-27T09:16:58Z","published":"2023-10-27T09:16:58Z","title":"Closing the Gap Between the Upper Bound and the Lower Bound of Adam's\n Iteration Complexity","summary":" Recently, Arjevani et al. [1] established a lower bound of iteration\ncomplexity for the first-order optimization under an $L$-smooth condition and a\nbounded noise variance assumption. However, a thorough review of existing\nliterature on Adam's convergence reveals a noticeable gap: none of them meet\nthe above lower bound. In this paper, we close the gap by deriving a new\nconvergence guarantee of Adam, with only an $L$-smooth condition and a bounded\nnoise variance assumption. Our results remain valid across a broad spectrum of\nhyperparameters. Especially with properly chosen hyperparameters, we derive an\nupper bound of the iteration complexity of Adam and show that it meets the\nlower bound for first-order optimizers. To the best of our knowledge, this is\nthe first to establish such a tight upper bound for Adam's convergence. 
Our\nproof utilizes novel techniques to handle the entanglement between momentum and\nadaptive learning rate and to convert the first-order term in the Descent Lemma\nto the gradient norm, which may be of independent interest.\n","authors":["Bohan Wang","Jingwen Fu","Huishuai Zhang","Nanning Zheng","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2310.17998v1.pdf","comment":"NeurIPS 2023 Accept"},{"id":"http://arxiv.org/abs/2310.17972v1","updated":"2023-10-27T08:37:10Z","published":"2023-10-27T08:37:10Z","title":"CEFL: Carbon-Efficient Federated Learning","summary":" Federated Learning (FL) distributes machine learning (ML) training across\nmany edge devices to reduce data transfer overhead and protect data privacy.\nSince FL model training may span millions of devices and is thus\nresource-intensive, prior work has focused on improving its resource efficiency\nto optimize time-to-accuracy. However, prior work generally treats all\nresources the same, while, in practice, they may incur widely different costs,\nwhich instead motivates optimizing cost-to-accuracy. To address the problem, we\ndesign CEFL, which uses adaptive cost-aware client selection policies to\noptimize an arbitrary cost metric when training FL models. Our policies extend\nand combine prior work on utility-based client selection and critical learning\nperiods by making them cost-aware. 
We demonstrate CEFL by designing\ncarbon-efficient FL, where energy's carbon-intensity is the cost, and show that\nit i) reduces carbon emissions by 93% and reduces training time by 50%\ncompared to random client selection and ii) reduces carbon emissions by 80%,\nwhile only increasing training time by 38%, compared to a state-of-the-art\napproach that optimizes training time.\n","authors":["Talha Mehboob","Noman Bashir","Jesus Omana Iglesias","Michael Zink","David Irwin"],"pdf_url":"https://arxiv.org/pdf/2310.17972v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17966v1","updated":"2023-10-27T08:30:54Z","published":"2023-10-27T08:30:54Z","title":"Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online\n Reinforcement Learning","summary":" Offline-to-online reinforcement learning (RL) is a training paradigm that\ncombines pre-training on a pre-collected dataset with fine-tuning in an online\nenvironment. However, the incorporation of online fine-tuning can intensify the\nwell-known distributional shift problem. Existing solutions tackle this problem\nby imposing a policy constraint on the policy improvement objective in both\noffline and online learning. They typically advocate a single balance between\npolicy improvement and constraints across diverse data collections. This\none-size-fits-all manner may not optimally leverage each collected sample due\nto the significant variation in data quality across different states. To this\nend, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective\nframework that empowers existing algorithms to determine state-adaptive\nimprovement-constraint balances. FamO2O utilizes a universal model to train a\nfamily of policies with different improvement/constraint intensities, and a\nbalance model to select a suitable policy for each state. Theoretically, we\nprove that state-adaptive balances are necessary for achieving a higher policy\nperformance upper bound. 
Empirically, extensive experiments show that FamO2O\noffers a statistically significant improvement over various existing methods,\nachieving state-of-the-art performance on the D4RL benchmark. Codes are\navailable at https://github.com/LeapLabTHU/FamO2O.\n","authors":["Shenzhi Wang","Qisen Yang","Jiawei Gao","Matthieu Gaetan Lin","Hao Chen","Liwei Wu","Ning Jia","Shiji Song","Gao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.17966v1.pdf","comment":"NeurIPS 2023 spotlight. 24 pages, 13 figures"},{"id":"http://arxiv.org/abs/2301.12842v3","updated":"2023-10-27T08:14:48Z","published":"2023-01-30T12:51:13Z","title":"Direct Preference-based Policy Optimization without Reward Modeling","summary":" Preference-based reinforcement learning (PbRL) is an approach that enables RL\nagents to learn from preference, which is particularly useful when formulating\na reward function is challenging. Existing PbRL methods generally involve a\ntwo-step procedure: they first learn a reward model based on given preference\ndata and then employ off-the-shelf reinforcement learning algorithms using the\nlearned reward model. However, obtaining an accurate reward model solely from\npreference information, especially when the preference is from human teachers,\ncan be difficult. Instead, we propose a PbRL algorithm that directly learns\nfrom preference without requiring any reward modeling. To achieve this, we\nadopt a contrastive learning framework to design a novel policy scoring metric\nthat assigns a high score to policies that align with the given preferences. We\napply our algorithm to offline RL tasks with actual human preference labels and\nshow that our algorithm outperforms or is on par with the existing PbRL\nmethods. Notably, on high-dimensional control tasks, our algorithm surpasses\noffline RL methods that learn with ground-truth reward information. 
Finally, we\nshow that our algorithm can be successfully applied to fine-tune large language\nmodels.\n","authors":["Gaon An","Junhyeok Lee","Xingdong Zuo","Norio Kosaka","Kyung-Min Kim","Hyun Oh Song"],"pdf_url":"https://arxiv.org/pdf/2301.12842v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2306.04375v2","updated":"2023-10-27T08:08:56Z","published":"2023-06-07T12:17:17Z","title":"Learning via Wasserstein-Based High Probability Generalisation Bounds","summary":" Minimising upper bounds on the population risk or the generalisation gap has\nbeen widely used in structural risk minimisation (SRM) -- this is in particular\nat the core of PAC-Bayesian learning. Despite its successes and unfailing surge\nof interest in recent years, a limitation of the PAC-Bayesian framework is that\nmost bounds involve a Kullback-Leibler (KL) divergence term (or its\nvariations), which might exhibit erratic behavior and fail to capture the\nunderlying geometric structure of the learning problem -- hence restricting its\nuse in practical applications. As a remedy, recent studies have attempted to\nreplace the KL divergence in the PAC-Bayesian bounds with the Wasserstein\ndistance. Even though these bounds alleviated the aforementioned issues to a\ncertain extent, they either hold in expectation, are for bounded losses, or are\nnontrivial to minimize in an SRM framework. In this work, we contribute to this\nline of research and prove novel Wasserstein distance-based PAC-Bayesian\ngeneralisation bounds for both batch learning with independent and identically\ndistributed (i.i.d.) data, and online learning with potentially non-i.i.d.\ndata. Contrary to previous art, our bounds are stronger in the sense that (i)\nthey hold with high probability, (ii) they apply to unbounded (potentially\nheavy-tailed) losses, and (iii) they lead to optimizable training objectives\nthat can be used in SRM. 
As a result we derive novel Wasserstein-based\nPAC-Bayesian learning algorithms and we illustrate their empirical advantage on\na variety of experiments.\n","authors":["Paul Viallard","Maxime Haddouche","Umut Şimşekli","Benjamin Guedj"],"pdf_url":"https://arxiv.org/pdf/2306.04375v2.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.00993v2","updated":"2023-10-27T07:59:07Z","published":"2023-02-02T10:21:49Z","title":"Unpaired Multi-Domain Causal Representation Learning","summary":" The goal of causal representation learning is to find a representation of\ndata that consists of causally related latent variables. We consider a setup\nwhere one has access to data from multiple domains that potentially share a\ncausal representation. Crucially, observations in different domains are assumed\nto be unpaired, that is, we only observe the marginal distribution in each\ndomain but not their joint distribution. In this paper, we give sufficient\nconditions for identifiability of the joint distribution and the shared causal\ngraph in a linear setup. Identifiability holds if we can uniquely recover the\njoint distribution and the shared causal representation from the marginal\ndistributions in each domain. We transform our identifiability results into a\npractical method to recover the shared latent causal graph.\n","authors":["Nils Sturma","Chandler Squires","Mathias Drton","Caroline Uhler"],"pdf_url":"https://arxiv.org/pdf/2302.00993v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16844v3","updated":"2023-10-27T07:58:17Z","published":"2023-06-29T10:34:23Z","title":"Macro Placement by Wire-Mask-Guided Black-Box Optimization","summary":" The development of very large-scale integration (VLSI) technology has posed\nnew challenges for electronic design automation (EDA) techniques in chip\nfloorplanning. 
During this process, macro placement is an important subproblem,\nwhich tries to determine the positions of all macros with the aim of minimizing\nhalf-perimeter wirelength (HPWL) and avoiding overlapping. Previous methods\ninclude packing-based, analytical and reinforcement learning methods. In this\npaper, we propose a new black-box optimization (BBO) framework (called\nWireMask-BBO) for macro placement, by using a wire-mask-guided greedy procedure\nfor objective evaluation. Equipped with different BBO algorithms, WireMask-BBO\nempirically achieves significant improvements over previous methods, i.e.,\nachieves significantly shorter HPWL by using much less time. Furthermore, it\ncan fine-tune existing placements by treating them as initial solutions, which\ncan bring up to 50% improvement in HPWL. WireMask-BBO has the potential to\nsignificantly improve the quality and efficiency of chip floorplanning, which\nmakes it appealing to researchers and practitioners in EDA and will also\npromote the application of BBO. Our code is available at\nhttps://github.com/lamda-bbo/WireMask-BBO.\n","authors":["Yunqi Shi","Ke Xue","Lei Song","Chao Qian"],"pdf_url":"https://arxiv.org/pdf/2306.16844v3.pdf","comment":"Update NeurIPS'23 camera ready version"},{"id":"http://arxiv.org/abs/2310.17945v1","updated":"2023-10-27T07:40:45Z","published":"2023-10-27T07:40:45Z","title":"A Comprehensive and Reliable Feature Attribution Method: Double-sided\n Remove and Reconstruct (DoRaR)","summary":" The limited transparency of the inner decision-making mechanism in deep\nneural networks (DNN) and other machine learning (ML) models has hindered their\napplication in several domains. In order to tackle this issue, feature\nattribution methods have been developed to identify the crucial features that\nheavily influence decisions made by these black box models. However, many\nfeature attribution methods have inherent downsides. 
For example, one category\nof feature attribution methods suffers from the artifacts problem, which feeds\nout-of-distribution masked inputs directly through the classifier that was\noriginally trained on natural data points. Another category of feature\nattribution method finds explanations by using jointly trained feature\nselectors and predictors. While avoiding the artifacts problem, this new\ncategory suffers from the Encoding Prediction in the Explanation (EPITE)\nproblem, in which the predictor's decisions rely not on the features, but on\nthe masks that selects those features. As a result, the credibility of\nattribution results is undermined by these downsides. In this research, we\nintroduce the Double-sided Remove and Reconstruct (DoRaR) feature attribution\nmethod based on several improvement methods that addresses these issues. By\nconducting thorough testing on MNIST, CIFAR10 and our own synthetic dataset, we\ndemonstrate that the DoRaR feature attribution method can effectively bypass\nthe above issues and can aid in training a feature selector that outperforms\nother state-of-the-art feature attribution methods. Our code is available at\nhttps://github.com/dxq21/DoRaR.\n","authors":["Dong Qin","George Amariucai","Daji Qiao","Yong Guan","Shen Fu"],"pdf_url":"https://arxiv.org/pdf/2310.17945v1.pdf","comment":"16 pages, 22 figures"},{"id":"http://arxiv.org/abs/2310.17944v1","updated":"2023-10-27T07:39:54Z","published":"2023-10-27T07:39:54Z","title":"Trustworthy Edge Machine Learning: A Survey","summary":" The convergence of Edge Computing (EC) and Machine Learning (ML), known as\nEdge Machine Learning (EML), has become a highly regarded research area by\nutilizing distributed network resources to perform joint training and inference\nin a cooperative manner. 
However, EML faces various challenges due to resource\nconstraints, heterogeneous network environments, and diverse service\nrequirements of different applications, which together affect the\ntrustworthiness of EML in the eyes of its stakeholders. This survey provides a\ncomprehensive summary of definitions, attributes, frameworks, techniques, and\nsolutions for trustworthy EML. Specifically, we first emphasize the importance\nof trustworthy EML within the context of Sixth-Generation (6G) networks. We\nthen discuss the necessity of trustworthiness from the perspective of\nchallenges encountered during deployment and real-world application scenarios.\nSubsequently, we provide a preliminary definition of trustworthy EML and\nexplore its key attributes. Following this, we introduce fundamental frameworks\nand enabling technologies for trustworthy EML systems, and provide an in-depth\nliterature review of the latest solutions to enhance trustworthiness of EML.\nFinally, we discuss corresponding research challenges and open issues.\n","authors":["Xiaojie Wang","Beibei Wang","Yu Wu","Zhaolong Ning","Song Guo","Fei Richard Yu"],"pdf_url":"https://arxiv.org/pdf/2310.17944v1.pdf","comment":"27 pages, 7 figures, 10 tables"},{"id":"http://arxiv.org/abs/2310.12692v2","updated":"2023-10-27T07:28:19Z","published":"2023-10-19T12:39:59Z","title":"Representation Learning via Consistent Assignment of Views over Random\n Partitions","summary":" We present Consistent Assignment of Views over Random Partitions (CARP), a\nself-supervised clustering method for representation learning of visual\nfeatures. CARP learns prototypes in an end-to-end online fashion using gradient\ndescent without additional non-differentiable modules to solve the cluster\nassignment problem. CARP optimizes a new pretext task based on random\npartitions of prototypes that regularizes the model and enforces consistency\nbetween views' assignments. 
Additionally, our method improves training\nstability and prevents collapsed solutions in joint-embedding training. Through\nan extensive evaluation, we demonstrate that CARP's representations are\nsuitable for learning downstream tasks. We evaluate CARP's representations\ncapabilities in 17 datasets across many standard protocols, including linear\nevaluation, few-shot classification, k-NN, k-means, image retrieval, and copy\ndetection. We compare CARP performance to 11 existing self-supervised methods.\nWe extensively ablate our method and demonstrate that our proposed random\npartition pretext task improves the quality of the learned representations by\ndevising multiple random classification tasks. In transfer learning tasks, CARP\nachieves the best performance on average against many SSL methods trained for a\nlonger time.\n","authors":["Thalles Silva","Adín Ramírez Rivera"],"pdf_url":"https://arxiv.org/pdf/2310.12692v2.pdf","comment":"To appear in NeurIPS 2023. Code available at\n https://github.com/sthalles/carp"},{"id":"http://arxiv.org/abs/2310.17936v1","updated":"2023-10-27T07:21:37Z","published":"2023-10-27T07:21:37Z","title":"Transformers as Graph-to-Graph Models","summary":" We argue that Transformers are essentially graph-to-graph models, with\nsequences just being a special case. Attention weights are functionally\nequivalent to graph edges. Our Graph-to-Graph Transformer architecture makes\nthis ability explicit, by inputting graph edges into the attention weight\ncomputations and predicting graph edges with attention-like functions, thereby\nintegrating explicit graphs into the latent graphs learned by pretrained\nTransformers. Adding iterative graph refinement provides a joint embedding of\ninput, output, and latent graphs, allowing non-autoregressive graph prediction\nto optimise the complete graph without any bespoke pipeline or decoding\nstrategy. 
Empirical results show that this architecture achieves\nstate-of-the-art accuracies for modelling a variety of linguistic structures,\nintegrating very effectively with the latent linguistic representations learned\nby pretraining.\n","authors":["James Henderson","Alireza Mohammadshahi","Andrei C. Coman","Lesly Miculicich"],"pdf_url":"https://arxiv.org/pdf/2310.17936v1.pdf","comment":"Accepted to Big Picture workshop at EMNLP 2023"},{"id":"http://arxiv.org/abs/2205.15802v2","updated":"2023-10-27T07:10:35Z","published":"2022-05-31T14:04:38Z","title":"AdaTask: Adaptive Multitask Online Learning","summary":" We introduce and analyze AdaTask, a multitask online learning algorithm that\nadapts to the unknown structure of the tasks. When the $N$ tasks are\nstochastically activated, we show that the regret of AdaTask is better, by a\nfactor that can be as large as $\\sqrt{N}$, than the regret achieved by running\n$N$ independent algorithms, one for each task. AdaTask can be seen as a\ncomparator-adaptive version of Follow-the-Regularized-Leader with a Mahalanobis\nnorm potential. Through a variational formulation of this potential, our\nanalysis reveals how AdaTask jointly learns the tasks and their structure.\nExperiments supporting our findings are presented.\n","authors":["Pierre Laforgue","Andrea Della Vecchia","Nicolò Cesa-Bianchi","Lorenzo Rosasco"],"pdf_url":"https://arxiv.org/pdf/2205.15802v2.pdf","comment":"The proof of Theorem 3 is wrong: in the display equation below\n Equation (22), bottom of page 15, the gradient of $\\phi_{t+1}$ is missing a\n factor $1/(\\alpha\\eta_t)$"},{"id":"http://arxiv.org/abs/2310.16546v2","updated":"2023-10-27T07:10:09Z","published":"2023-10-25T10:53:04Z","title":"Pitfall of Optimism: Distributional Reinforcement Learning by\n Randomizing Risk Criterion","summary":" Distributional reinforcement learning algorithms have attempted to utilize\nestimated uncertainty for exploration, such as optimism in the face of\nuncertainty. 
However, using the estimated variance for optimistic exploration\nmay cause biased data collection and hinder convergence or performance. In this\npaper, we present a novel distributional reinforcement learning algorithm that\nselects actions by randomizing risk criterion to avoid one-sided tendency on\nrisk. We provide a perturbed distributional Bellman optimality operator by\ndistorting the risk measure and prove the convergence and optimality of the\nproposed method with the weaker contraction property. Our theoretical results\nsupport that the proposed method does not fall into biased exploration and is\nguaranteed to converge to an optimal return. Finally, we empirically show that\nour method outperforms other existing distribution-based algorithms in various\nenvironments including Atari 55 games.\n","authors":["Taehyun Cho","Seungyub Han","Heesoo Lee","Kyungjae Lee","Jungwoo Lee"],"pdf_url":"https://arxiv.org/pdf/2310.16546v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.12095v2","updated":"2023-10-27T07:00:04Z","published":"2023-09-21T14:10:47Z","title":"Bayesian sparsification for deep neural networks with Bayesian model\n reduction","summary":" Deep learning's immense capabilities are often constrained by the complexity\nof its models, leading to an increasing demand for effective sparsification\ntechniques. Bayesian sparsification for deep learning emerges as a crucial\napproach, facilitating the design of models that are both computationally\nefficient and competitive in terms of performance across various deep learning\napplications. The state-of-the-art -- in Bayesian sparsification of deep neural\nnetworks -- combines structural shrinkage priors on model weights with an\napproximate inference scheme based on stochastic variational inference.\nHowever, model inversion of the full generative model is exceptionally\ncomputationally demanding, especially when compared to standard deep learning\nof point estimates. 
In this context, we advocate for the use of Bayesian model\nreduction (BMR) as a more efficient alternative for pruning of model weights.\nAs a generalization of the Savage-Dickey ratio, BMR allows a post-hoc\nelimination of redundant model weights based on the posterior estimates under a\nstraightforward (non-hierarchical) generative model. Our comparative study\nhighlights the advantages of the BMR method relative to established approaches\nbased on hierarchical horseshoe priors over model weights. We illustrate the\npotential of BMR across various deep learning architectures, from classical\nnetworks like LeNet to modern frameworks such as Vision Transformers and\nMLP-Mixers.\n","authors":["Dimitrije Marković","Karl J. Friston","Stefan J. Kiebel"],"pdf_url":"https://arxiv.org/pdf/2309.12095v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.11294v3","updated":"2023-10-27T06:25:17Z","published":"2023-02-22T11:26:50Z","title":"Distributional Learning of Variational AutoEncoder: Application to\n Synthetic Data Generation","summary":" The Gaussianity assumption has been consistently criticized as a main\nlimitation of the Variational Autoencoder (VAE) despite its efficiency in\ncomputational modeling. In this paper, we propose a new approach that expands\nthe model capacity (i.e., expressive power of distributional family) without\nsacrificing the computational advantages of the VAE framework. Our VAE model's\ndecoder is composed of an infinite mixture of asymmetric Laplace distribution,\nwhich possesses general distribution fitting capabilities for continuous\nvariables. Our model is represented by a special form of a nonparametric\nM-estimator for estimating general quantile functions, and we theoretically\nestablish the relevance between the proposed model and quantile estimation. 
We\napply the proposed model to synthetic data generation, and particularly, our\nmodel demonstrates superiority in easily adjusting the level of data privacy.\n","authors":["Seunghwan An","Jong-June Jeon"],"pdf_url":"https://arxiv.org/pdf/2302.11294v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.01507v3","updated":"2023-10-27T06:24:08Z","published":"2023-09-04T10:27:17Z","title":"Memory Efficient Optimizers with 4-bit States","summary":" Optimizer states are a major source of memory consumption for training neural\nnetworks, limiting the maximum trainable model within given memory budget.\nCompressing the optimizer states from 32-bit floating points to lower bitwidth\nis promising to reduce the training memory footprint, while the current lowest\nachievable bitwidth is 8-bit. In this work, we push optimizer states bitwidth\ndown to 4-bit through a detailed empirical analysis of first and second\nmoments. Specifically, we find that moments have complicated outlier patterns,\nthat current block-wise quantization cannot accurately approximate. We use a\nsmaller block size and propose to utilize both row-wise and column-wise\ninformation for better quantization. We further identify a zero point problem\nof quantizing the second moment, and solve this problem with a linear quantizer\nthat excludes the zero point. Our 4-bit optimizers are evaluated on a wide\nvariety of benchmarks including natural language understanding, machine\ntranslation, image classification, and instruction tuning. 
On all the tasks our\noptimizers can achieve comparable accuracy with their full-precision\ncounterparts, while enjoying better memory efficiency.\n","authors":["Bingrui Li","Jianfei Chen","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2309.01507v3.pdf","comment":"v3: camera ready revisions for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17915v1","updated":"2023-10-27T06:15:33Z","published":"2023-10-27T06:15:33Z","title":"Lifting the Veil: Unlocking the Power of Depth in Q-learning","summary":" With the help of massive data and rich computational resources, deep\nQ-learning has been widely used in operations research and management science\nand has contributed to great success in numerous applications, including\nrecommender systems, supply chains, games, and robotic manipulation. However,\nthe success of deep Q-learning lacks solid theoretical verification and\ninterpretability. The aim of this paper is to theoretically verify the power of\ndepth in deep Q-learning. Within the framework of statistical learning theory,\nwe rigorously prove that deep Q-learning outperforms its traditional version by\ndemonstrating its good generalization error bound. Our results reveal that the\nmain reason for the success of deep Q-learning is the excellent performance of\ndeep neural networks (deep nets) in capturing the special properties of rewards\nnamely, spatial sparseness and piecewise constancy, rather than their large\ncapacities. In this paper, we make fundamental contributions to the field of\nreinforcement learning by answering to the following three questions: Why does\ndeep Q-learning perform so well? When does deep Q-learning perform better than\ntraditional Q-learning? How many samples are required to achieve a specific\nprediction accuracy for deep Q-learning? 
Our theoretical assertions are\nverified by applying deep Q-learning in the well-known beer game in supply\nchain management and a simulated recommender system.\n","authors":["Shao-Bo Lin","Tao Li","Shaojie Tang","Yao Wang","Ding-Xuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.17915v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2212.07056v3","updated":"2023-10-27T06:14:44Z","published":"2022-12-14T06:55:32Z","title":"On the Probability of Necessity and Sufficiency of Explaining Graph\n Neural Networks: A Lower Bound Optimization Approach","summary":" The explainability of Graph Neural Networks (GNNs) is critical to various GNN\napplications, yet it remains a significant challenge. A convincing explanation\nshould be both necessary and sufficient simultaneously. However, existing GNN\nexplaining approaches focus on only one of the two aspects, necessity or\nsufficiency, or a heuristic trade-off between the two. Theoretically, the\nProbability of Necessity and Sufficiency (PNS) holds the potential to identify\nthe most necessary and sufficient explanation since it can mathematically\nquantify the necessity and sufficiency of an explanation. Nevertheless, the\ndifficulty of obtaining PNS due to non-monotonicity and the challenge of\ncounterfactual estimation limit its wide use. To address the\nnon-identifiability of PNS, we resort to a lower bound of PNS that can be\noptimized via counterfactual estimation, and propose a framework of Necessary\nand Sufficient Explanation for GNN (NSEG) via optimizing that lower bound.\nSpecifically, we depict the GNN as a structural causal model (SCM), and\nestimate the probability of counterfactual via the intervention under the SCM.\nAdditionally, we leverage continuous masks with a sampling strategy to optimize\nthe lower bound to enhance the scalability. 
Empirical results demonstrate that\nNSEG outperforms state-of-the-art methods, consistently generating the most\nnecessary and sufficient explanations.\n","authors":["Ruichu Cai","Yuxuan Zhu","Xuexin Chen","Yuan Fang","Min Wu","Jie Qiao","Zhifeng Hao"],"pdf_url":"https://arxiv.org/pdf/2212.07056v3.pdf","comment":"36 pages, 9 figures"},{"id":"http://arxiv.org/abs/2301.00457v2","updated":"2023-10-27T06:10:41Z","published":"2023-01-01T18:51:29Z","title":"ReSQueing Parallel and Private Stochastic Convex Optimization","summary":" We introduce a new tool for stochastic convex optimization (SCO): a\nReweighted Stochastic Query (ReSQue) estimator for the gradient of a function\nconvolved with a (Gaussian) probability density. Combining ReSQue with recent\nadvances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop\nalgorithms achieving state-of-the-art complexities for SCO in parallel and\nprivate settings. For a SCO objective constrained to the unit ball in\n$\\mathbb{R}^d$, we obtain the following results (up to polylogarithmic\nfactors). We give a parallel algorithm obtaining optimization error\n$\\epsilon_{\\text{opt}}$ with $d^{1/3}\\epsilon_{\\text{opt}}^{-2/3}$ gradient\noracle query depth and $d^{1/3}\\epsilon_{\\text{opt}}^{-2/3} +\n\\epsilon_{\\text{opt}}^{-2}$ gradient queries in total, assuming access to a\nbounded-variance stochastic gradient estimator. For $\\epsilon_{\\text{opt}} \\in\n[d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of\n[BJLLS19] while maintaining the optimal total work of stochastic gradient\ndescent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19,\nBFGT20, AFKT21, KLL21] established that if $n \\gtrsim d\n\\epsilon_{\\text{dp}}^{-2}$, $(\\epsilon_{\\text{dp}}, \\delta)$-differential\nprivacy is attained at no asymptotic cost to the SCO utility. However, these\nprior works all required a superlinear number of gradient queries. 
We close\nthis gap for sufficiently large $n \\gtrsim d^2 \\epsilon_{\\text{dp}}^{-3}$, by\nusing ReSQue to design an algorithm with near-linear gradient query complexity\nin this regime.\n","authors":["Yair Carmon","Arun Jambulapati","Yujia Jin","Yin Tat Lee","Daogao Liu","Aaron Sidford","Kevin Tian"],"pdf_url":"https://arxiv.org/pdf/2301.00457v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04220v5","updated":"2023-10-27T05:49:10Z","published":"2023-06-07T07:51:05Z","title":"Look Beneath the Surface: Exploiting Fundamental Symmetry for\n Sample-Efficient Offline RL","summary":" Offline reinforcement learning (RL) offers an appealing approach to\nreal-world tasks by learning policies from pre-collected datasets without\ninteracting with the environment. However, the performance of existing offline\nRL algorithms heavily depends on the scale and state-action space coverage of\ndatasets. Real-world data collection is often expensive and uncontrollable,\nleading to small and narrowly covered datasets and posing significant\nchallenges for practical deployments of offline RL. In this paper, we provide a\nnew insight that leveraging the fundamental symmetry of system dynamics can\nsubstantially enhance offline RL performance under small datasets.\nSpecifically, we propose a Time-reversal symmetry (T-symmetry) enforced\nDynamics Model (TDM), which establishes consistency between a pair of forward\nand reverse latent dynamics. TDM provides both well-behaved representations for\nsmall datasets and a new reliability measure for OOD samples based on\ncompliance with the T-symmetry. These can be readily used to construct a new\noffline RL algorithm (TSRL) with less conservative policy constraints and a\nreliable latent space data augmentation procedure. 
Based on extensive\nexperiments, we find TSRL achieves great performance on small benchmark\ndatasets with as few as 1% of the original samples, which significantly\noutperforms the recent offline RL algorithms in terms of data efficiency and\ngeneralizability. Code is available at: https://github.com/pcheng2/TSRL\n","authors":["Peng Cheng","Xianyuan Zhan","Zhihao Wu","Wenjia Zhang","Shoucheng Song","Han Wang","Youfang Lin","Li Jiang"],"pdf_url":"https://arxiv.org/pdf/2306.04220v5.pdf","comment":"Accepted in NeurIPS 2023; The first two authors contributed equally"},{"id":"http://arxiv.org/abs/2310.08282v2","updated":"2023-10-27T05:25:49Z","published":"2023-10-12T12:39:08Z","title":"Data driven modeling of self-similar dynamics","summary":" Multiscale modeling of complex systems is crucial for understanding their\nintricacies. Data-driven multiscale modeling has emerged as a promising\napproach to tackle challenges associated with complex systems. On the other\nhand, self-similarity is prevalent in complex systems, hinting that large-scale\ncomplex systems can be modeled at a reduced cost. In this paper, we introduce a\nmultiscale neural network framework that incorporates self-similarity as prior\nknowledge, facilitating the modeling of self-similar dynamical systems. For\ndeterministic dynamics, our framework can discern whether the dynamics are\nself-similar. For uncertain dynamics, it can compare and determine which\nparameter set is closer to self-similarity. The framework allows us to extract\nscale-invariant kernels from the dynamics for modeling at any scale. 
Moreover,\nour method can identify the power law exponents in self-similar systems.\nPreliminary tests on the Ising model yielded critical exponents consistent with\ntheoretical expectations, providing valuable insights for addressing critical\nphase transitions in non-equilibrium systems.\n","authors":["Ru-yi Tao","Ning-ning Tao","Yi-zhuang You","Jiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08282v2.pdf","comment":"11 pages,5 figures,1 table"},{"id":"http://arxiv.org/abs/2310.17901v1","updated":"2023-10-27T05:25:02Z","published":"2023-10-27T05:25:02Z","title":"Improving the Knowledge Gradient Algorithm","summary":" The knowledge gradient (KG) algorithm is a popular policy for the best arm\nidentification (BAI) problem. It is built on the simple idea of always choosing\nthe measurement that yields the greatest expected one-step improvement in the\nestimate of the best mean of the arms. In this research, we show that this\npolicy has limitations, causing the algorithm not asymptotically optimal. We\nnext provide a remedy for it, by following the manner of one-step look ahead of\nKG, but instead choosing the measurement that yields the greatest one-step\nimprovement in the probability of selecting the best arm. The new policy is\ncalled improved knowledge gradient (iKG). iKG can be shown to be asymptotically\noptimal. In addition, we show that compared to KG, it is easier to extend iKG\nto variant problems of BAI, with the $\\epsilon$-good arm identification and\nfeasible arm identification as two examples. 
The superior performances of iKG\non these problems are further demonstrated using numerical examples.\n","authors":["Yang Le","Gao Siyang","Ho Chin Pang"],"pdf_url":"https://arxiv.org/pdf/2310.17901v1.pdf","comment":"32 pages, 42 figures"},{"id":"http://arxiv.org/abs/2306.09850v3","updated":"2023-10-27T05:08:13Z","published":"2023-06-16T13:47:04Z","title":"Practical Sharpness-Aware Minimization Cannot Converge All the Way to\n Optima","summary":" Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step\nbased on the gradient at a perturbation $y_t = x_t + \\rho \\frac{\\nabla\nf(x_t)}{\\lVert \\nabla f(x_t) \\rVert}$ of the current point $x_t$. Existing\nstudies prove convergence of SAM for smooth functions, but they do so by\nassuming decaying perturbation size $\\rho$ and/or no gradient normalization in\n$y_t$, which is detached from practice. To address this gap, we study\ndeterministic/stochastic versions of SAM with practical configurations (i.e.,\nconstant $\\rho$ and gradient normalization in $y_t$) and explore their\nconvergence properties on smooth functions with (non)convexity assumptions.\nPerhaps surprisingly, in many scenarios, we find out that SAM has limited\ncapability to converge to global minima or stationary points. For smooth\nstrongly convex functions, we show that while deterministic SAM enjoys tight\nglobal convergence rates of $\\tilde \\Theta(\\frac{1}{T^2})$, the convergence\nbound of stochastic SAM suffers an inevitable additive term $O(\\rho^2)$,\nindicating convergence only up to neighborhoods of optima. In fact, such\n$O(\\rho^2)$ factors arise for stochastic SAM in all the settings we consider,\nand also for deterministic SAM in nonconvex cases; importantly, we prove by\nexamples that such terms are unavoidable. Our results highlight vastly\ndifferent characteristics of SAM with vs. 
without decaying perturbation size or\ngradient normalization, and suggest that the intuitions gained from one version\nmay not apply to the other.\n","authors":["Dongkuk Si","Chulhee Yun"],"pdf_url":"https://arxiv.org/pdf/2306.09850v3.pdf","comment":"39 pages. v3 NeurIPS 2023 camera ready version"},{"id":"http://arxiv.org/abs/2310.17890v1","updated":"2023-10-27T04:42:59Z","published":"2023-10-27T04:42:59Z","title":"Submodel Partitioning in Hierarchical Federated Learning: Algorithm\n Design and Convergence Analysis","summary":" Hierarchical federated learning (HFL) has demonstrated promising scalability\nadvantages over the traditional \"star-topology\" architecture-based federated\nlearning (FL). However, HFL still imposes significant computation,\ncommunication, and storage burdens on the edge, especially when training a\nlarge-scale model over resource-constrained Internet of Things (IoT) devices.\nIn this paper, we propose hierarchical independent submodel training (HIST), a\nnew FL methodology that aims to address these issues in hierarchical settings.\nThe key idea behind HIST is a hierarchical version of model partitioning, where\nwe partition the global model into disjoint submodels in each round, and\ndistribute them across different cells, so that each cell is responsible for\ntraining only one partition of the full model. This enables each client to save\ncomputation/storage costs while alleviating the communication loads throughout\nthe hierarchy. We characterize the convergence behavior of HIST for non-convex\nloss functions under mild assumptions, showing the impact of several attributes\n(e.g., number of cells, local and global aggregation frequency) on the\nperformance-efficiency tradeoff. Finally, through numerical experiments, we\nverify that HIST is able to save communication costs by a wide margin while\nachieving the same target testing accuracy.\n","authors":["Wenzhi Fang","Dong-Jun Han","Christopher G. 
Brinton"],"pdf_url":"https://arxiv.org/pdf/2310.17890v1.pdf","comment":"14 pages, 4 figures"},{"id":"http://arxiv.org/abs/2306.00006v3","updated":"2023-10-27T04:35:02Z","published":"2023-05-29T08:39:16Z","title":"Truncated Affinity Maximization: One-class Homophily Modeling for Graph\n Anomaly Detection","summary":" One prevalent property we find empirically in real-world graph anomaly\ndetection (GAD) datasets is a one-class homophily, i.e., normal nodes tend to\nhave strong connection/affinity with each other, while the homophily in\nabnormal nodes is significantly weaker than normal nodes. However, this\nanomaly-discriminative property is ignored by existing GAD methods that are\ntypically built using a conventional anomaly detection objective, such as data\nreconstruction. In this work, we explore this property to introduce a novel\nunsupervised anomaly scoring measure for GAD -- local node affinity -- that\nassigns a larger anomaly score to nodes that are less affiliated with their\nneighbors, with the affinity defined as similarity on node\nattributes/representations. We further propose Truncated Affinity Maximization\n(TAM) that learns tailored node representations for our anomaly measure by\nmaximizing the local affinity of nodes to their neighbors. Optimizing on the\noriginal graph structure can be biased by non-homophily edges (i.e., edges\nconnecting normal and abnormal nodes). Thus, TAM is instead optimized on\ntruncated graphs where non-homophily edges are removed iteratively to mitigate\nthis bias. The learned representations result in significantly stronger local\naffinity for normal nodes than abnormal nodes. Extensive empirical results on\nsix real-world GAD datasets show that TAM substantially outperforms seven\ncompeting models, achieving over 10% increase in AUROC/AUPRC compared to the\nbest contenders on challenging datasets. 
Our code will be made available at\nhttps://github.com/mala-lab/TAM-master/.\n","authors":["Hezhe Qiao","Guansong Pang"],"pdf_url":"https://arxiv.org/pdf/2306.00006v3.pdf","comment":"19 pages, 9 figures"},{"id":"http://arxiv.org/abs/2310.17887v1","updated":"2023-10-27T04:30:18Z","published":"2023-10-27T04:30:18Z","title":"Impressions: Understanding Visual Semiotics and Aesthetic Impact","summary":" Is aesthetic impact different from beauty? Is visual salience a reflection of\nits capacity for effective communication? We present Impressions, a novel\ndataset through which to investigate the semiotics of images, and how specific\nvisual features and design choices can elicit specific emotions, thoughts and\nbeliefs. We posit that the impactfulness of an image extends beyond formal\ndefinitions of aesthetics, to its success as a communicative act, where style\ncontributes as much to meaning formation as the subject matter. However, prior\nimage captioning datasets are not designed to empower state-of-the-art\narchitectures to model potential human impressions or interpretations of\nimages. To fill this gap, we design an annotation task heavily inspired by\nimage analysis techniques in the Visual Arts to collect 1,440 image-caption\npairs and 4,320 unique annotations exploring impact, pragmatic image\ndescription, impressions, and aesthetic design choices. We show that existing\nmultimodal image captioning and conditional generation models struggle to\nsimulate plausible human responses to images. 
However, this dataset\nsignificantly improves their ability to model impressions and aesthetic\nevaluations of images through fine-tuning and few-shot adaptation.\n","authors":["Julia Kruk","Caleb Ziems","Diyi Yang"],"pdf_url":"https://arxiv.org/pdf/2310.17887v1.pdf","comment":"To be published in EMNLP 2023"},{"id":"http://arxiv.org/abs/2307.04090v2","updated":"2023-10-27T04:27:41Z","published":"2023-07-09T04:19:19Z","title":"DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge\n Graphs","summary":" Recent work within the Argument Mining community has shown the applicability\nof Natural Language Processing systems for solving problems found within\ncompetitive debate. One of the most important tasks within competitive debate\nis for debaters to create high quality debate cases. We show that effective\ndebate cases can be constructed using constrained shortest path traversals on\nArgumentative Semantic Knowledge Graphs. We study this potential in the context\nof a type of American Competitive Debate, called Policy Debate, which already\nhas a large scale dataset targeting it called DebateSum. We significantly\nimprove upon DebateSum by introducing 53180 new examples, as well as further\nuseful metadata for every example, to the dataset. We leverage the txtai\nsemantic search and knowledge graph toolchain to produce and contribute 9\nsemantic knowledge graphs built on this dataset. We create a unique method for\nevaluating which knowledge graphs are better in the context of producing policy\ndebate cases. 
A demo which automatically generates debate cases, along with all\nother code and the Knowledge Graphs, are open-sourced and made available to the\npublic here: https://huggingface.co/spaces/Hellisotherpeople/DebateKG\n","authors":["Allen Roush","David Mezzetti"],"pdf_url":"https://arxiv.org/pdf/2307.04090v2.pdf","comment":"8 pages, Accepted to The 4th New Frontiers in Summarization Workshop\n (EMNLP 2023), System Demonstration paper"},{"id":"http://arxiv.org/abs/2308.08643v3","updated":"2023-10-27T04:14:43Z","published":"2023-08-16T19:36:01Z","title":"Towards Personalized Federated Learning via Heterogeneous Model\n Reassembly","summary":" This paper focuses on addressing the practical yet challenging problem of\nmodel heterogeneity in federated learning, where clients possess models with\ndifferent network structures. To track this problem, we propose a novel\nframework called pFedHR, which leverages heterogeneous model reassembly to\nachieve personalized federated learning. In particular, we approach the problem\nof heterogeneous model personalization as a model-matching optimization task on\nthe server side. Moreover, pFedHR automatically and dynamically generates\ninformative and diverse personalized candidates with minimal human\nintervention. Furthermore, our proposed heterogeneous model reassembly\ntechnique mitigates the adverse impact introduced by using public data with\ndifferent distributions from the client data to a certain extent. Experimental\nresults demonstrate that pFedHR outperforms baselines on three datasets under\nboth IID and Non-IID settings. 
Additionally, pFedHR effectively reduces the\nadverse impact of using different public data and dynamically generates diverse\npersonalized models in an automated manner.\n","authors":["Jiaqi Wang","Xingyi Yang","Suhan Cui","Liwei Che","Lingjuan Lyu","Dongkuan Xu","Fenglong Ma"],"pdf_url":"https://arxiv.org/pdf/2308.08643v3.pdf","comment":"This paper has been accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17882v1","updated":"2023-10-27T04:11:13Z","published":"2023-10-27T04:11:13Z","title":"Machine Learning Infused Distributed Optimization for Coordinating\n Virtual Power Plant Assets","summary":" Amid the increasing interest in the deployment of Distributed Energy\nResources (DERs), the Virtual Power Plant (VPP) has emerged as a pivotal tool\nfor aggregating diverse DERs and facilitating their participation in wholesale\nenergy markets. These VPP deployments have been fueled by the Federal Energy\nRegulatory Commission's Order 2222, which makes DERs and VPPs competitive\nacross market segments. However, the diversity and decentralized nature of DERs\npresent significant challenges to the scalable coordination of VPP assets. To\naddress efficiency and speed bottlenecks, this paper presents a novel machine\nlearning-assisted distributed optimization to coordinate VPP assets. Our\nmethod, named LOOP-MAC(Learning to Optimize the Optimization Process for\nMulti-agent Coordination), adopts a multi-agent coordination perspective where\neach VPP agent manages multiple DERs and utilizes neural network approximators\nto expedite the solution search. The LOOP-MAC method employs a gauge map to\nguarantee strict compliance with local constraints, effectively reducing the\nneed for additional post-processing steps. Our results highlight the advantages\nof LOOP-MAC, showcasing accelerated solution times per iteration and\nsignificantly reduced convergence times. 
The LOOP-MAC method outperforms\nconventional centralized and distributed optimization methods in optimization\ntasks that require repetitive and sequential execution.\n","authors":["Meiyi Li","Javad Mohammadi"],"pdf_url":"https://arxiv.org/pdf/2310.17882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12298v2","updated":"2023-10-27T03:59:42Z","published":"2023-10-18T19:58:54Z","title":"Jorge: Approximate Preconditioning for GPU-efficient Second-order\n Optimization","summary":" Despite their better convergence properties compared to first-order\noptimizers, second-order optimizers for deep learning have been less popular\ndue to their significant computational costs. The primary efficiency bottleneck\nin such optimizers is matrix inverse calculations in the preconditioning step,\nwhich are expensive to compute on GPUs. In this paper, we introduce Jorge, a\nsecond-order optimizer that promises the best of both worlds -- rapid\nconvergence benefits of second-order methods, and high computational efficiency\ntypical of first-order methods. We address the primary computational bottleneck\nof computing matrix inverses by completely eliminating them using an\napproximation of the preconditioner computation. This makes Jorge extremely\nefficient on GPUs in terms of wall-clock time. Further, we describe an approach\nto determine Jorge's hyperparameters directly from a well-tuned SGD baseline,\nthereby significantly minimizing tuning efforts. 
Our empirical evaluations\ndemonstrate the distinct advantages of using Jorge, outperforming\nstate-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple\ndeep learning models, both in terms of sample efficiency and wall-clock time.\n","authors":["Siddharth Singh","Zachary Sating","Abhinav Bhatele"],"pdf_url":"https://arxiv.org/pdf/2310.12298v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17878v1","updated":"2023-10-27T03:40:37Z","published":"2023-10-27T03:40:37Z","title":"A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing\n Time","summary":" We address the problem of designing a sublinear-time spectral clustering\noracle for graphs that exhibit strong clusterability. Such graphs contain $k$\nlatent clusters, each characterized by a large inner conductance (at least\n$\\varphi$) and a small outer conductance (at most $\\varepsilon$). Our aim is to\npreprocess the graph to enable clustering membership queries, with the key\nrequirement that both preprocessing and query answering should be performed in\nsublinear time, and the resulting partition should be consistent with a\n$k$-partition that is close to the ground-truth clustering. Previous oracles\nhave relied on either a $\\textrm{poly}(k)\\log n$ gap between inner and outer\nconductances or exponential (in $k/\\varepsilon$) preprocessing time. Our\nalgorithm relaxes these assumptions, albeit at the cost of a slightly higher\nmisclassification ratio. We also show that our clustering oracle is robust\nagainst a few random edge deletions. 
To validate our theoretical bounds, we\nconducted experiments on synthetic networks.\n","authors":["Ranran Shen","Pan Peng"],"pdf_url":"https://arxiv.org/pdf/2310.17878v1.pdf","comment":"To appear at NeurIPS'23"},{"id":"http://arxiv.org/abs/2310.17877v1","updated":"2023-10-27T03:39:51Z","published":"2023-10-27T03:39:51Z","title":"ASPIRO: Any-shot Structured Parsing-error-Induced ReprOmpting for\n Consistent Data-to-Text Generation","summary":" We present ASPIRO, an approach for structured data verbalisation into short\ntemplate sentences in zero to few-shot settings. Unlike previous methods, our\napproach prompts large language models (LLMs) to directly produce\nentity-agnostic templates, rather than relying on LLMs to faithfully copy the\ngiven example entities, or validating/crafting the templates manually. We\nincorporate LLM re-prompting, triggered by algorithmic parsing checks, as well\nas the PARENT metric induced consistency validation to identify and rectify\ntemplate generation problems in real-time. ASPIRO, compared to direct LLM\noutput, averages 66\\% parsing error rate reduction in generated verbalisations\nof RDF triples on the DART dataset. 
Our best 5-shot text-davinci-003 setup,\nscoring BLEU of 50.62, METEOR of 45.16, BLEURT of 0.82, NUBIA of 0.87, and\nPARENT of 0.8962 on the Rel2Text dataset, competes effectively with recent\nfine-tuned pre-trained language models.\n","authors":["Martin Vejvar","Yasutaka Fujimoto"],"pdf_url":"https://arxiv.org/pdf/2310.17877v1.pdf","comment":"Accepted to Findings of EMNLP2023, code available at\n https://github.com/vejvarm/ASPIRO"},{"id":"http://arxiv.org/abs/2309.07867v2","updated":"2023-10-27T03:37:04Z","published":"2023-09-14T17:14:26Z","title":"Beta Diffusion","summary":" We introduce beta diffusion, a novel generative modeling method that\nintegrates demasking and denoising to generate data within bounded ranges.\nUsing scaled and shifted beta distributions, beta diffusion utilizes\nmultiplicative transitions over time to create both forward and reverse\ndiffusion processes, maintaining beta distributions in both the forward\nmarginals and the reverse conditionals, given the data at any point in time.\nUnlike traditional diffusion-based generative models relying on additive\nGaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is\nmultiplicative and optimized with KL-divergence upper bounds (KLUBs) derived\nfrom the convexity of the KL divergence. We demonstrate that the proposed KLUBs\nare more effective for optimizing beta diffusion compared to negative ELBOs,\nwhich can also be derived as the KLUBs of the same KL divergence with its two\narguments swapped. 
The loss function of beta diffusion, expressed in terms of\nBregman divergence, further supports the efficacy of KLUBs for optimization.\nExperimental results on both synthetic data and natural images demonstrate the\nunique capabilities of beta diffusion in generative modeling of range-bounded\ndata and validate the effectiveness of KLUBs in optimizing diffusion models,\nthereby making them valuable additions to the family of diffusion-based\ngenerative models and the optimization techniques used to train them.\n","authors":["Mingyuan Zhou","Tianqi Chen","Zhendong Wang","Huangjie Zheng"],"pdf_url":"https://arxiv.org/pdf/2309.07867v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.18499v2","updated":"2023-10-27T03:28:48Z","published":"2023-05-29T14:29:12Z","title":"Pre-training Contextualized World Models with In-the-wild Videos for\n Reinforcement Learning","summary":" Unsupervised pre-training methods utilizing large and diverse datasets have\nachieved tremendous success across a range of domains. Recent work has\ninvestigated such unsupervised pre-training methods for model-based\nreinforcement learning (MBRL) but is limited to domain-specific or simulated\ndata. In this paper, we study the problem of pre-training world models with\nabundant in-the-wild videos for efficient learning of downstream visual control\ntasks. However, in-the-wild videos are complicated with various contextual\nfactors, such as intricate backgrounds and textured appearance, which precludes\na world model from extracting shared world knowledge to generalize better. To\ntackle this issue, we introduce Contextualized World Models (ContextWM) that\nexplicitly separate context and dynamics modeling to overcome the complexity\nand diversity of in-the-wild videos and facilitate knowledge transfer between\ndistinct scenes. 
Specifically, a contextualized extension of the latent\ndynamics model is elaborately realized by incorporating a context encoder to\nretain contextual information and empower the image decoder, which encourages\nthe latent dynamics model to concentrate on essential temporal variations. Our\nexperiments show that in-the-wild video pre-training equipped with ContextWM\ncan significantly improve the sample efficiency of MBRL in various domains,\nincluding robotic manipulation, locomotion, and autonomous driving. Code is\navailable at this repository: https://github.com/thuml/ContextWM.\n","authors":["Jialong Wu","Haoyu Ma","Chaoyi Deng","Mingsheng Long"],"pdf_url":"https://arxiv.org/pdf/2305.18499v2.pdf","comment":"NeurIPS 2023. Code is available at https://github.com/thuml/ContextWM"},{"id":"http://arxiv.org/abs/2310.17870v1","updated":"2023-10-27T03:14:50Z","published":"2023-10-27T03:14:50Z","title":"Ranking with Slot Constraints","summary":" We introduce the problem of ranking with slot constraints, which can be used\nto model a wide range of application problems -- from college admission with\nlimited slots for different majors, to composing a stratified cohort of\neligible participants in a medical trial. We show that the conventional\nProbability Ranking Principle (PRP) can be highly sub-optimal for\nslot-constrained ranking problems, and we devise a new ranking algorithm,\ncalled MatchRank. The goal of MatchRank is to produce rankings that maximize\nthe number of filled slots if candidates are evaluated by a human decision\nmaker in the order of the ranking. In this way, MatchRank generalizes the PRP,\nand it subsumes the PRP as a special case when there are no slot constraints.\nOur theoretical analysis shows that MatchRank has a strong approximation\nguarantee without any independence assumptions between slots or candidates.\nFurthermore, we show how MatchRank can be implemented efficiently. 
Beyond the\ntheoretical guarantees, empirical evaluations show that MatchRank can provide\nsubstantial improvements over a range of synthetic and real-world tasks.\n","authors":["Wentao Guo","Andrew Wang","Bradon Thymes","Thorsten Joachims"],"pdf_url":"https://arxiv.org/pdf/2310.17870v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17867v1","updated":"2023-10-27T03:05:11Z","published":"2023-10-27T03:05:11Z","title":"Reproducibility in Multiple Instance Learning: A Case For Algorithmic\n Unit Tests","summary":" Multiple Instance Learning (MIL) is a sub-domain of classification problems\nwith positive and negative labels and a \"bag\" of inputs, where the label is\npositive if and only if a positive element is contained within the bag, and\notherwise is negative. Training in this context requires associating the\nbag-wide label to instance-level information, and implicitly contains a causal\nassumption and asymmetry to the task (i.e., you can't swap the labels without\nchanging the semantics). MIL problems occur in healthcare (one malignant cell\nindicates cancer), cyber security (one malicious executable makes an infected\ncomputer), and many other tasks. In this work, we examine five of the most\nprominent deep-MIL models and find that none of them respects the standard MIL\nassumption. They are able to learn anti-correlated instances, i.e., defaulting\nto \"positive\" labels until seeing a negative counter-example, which should not\nbe possible for a correct MIL model. We suspect that enhancements and other\nworks derived from these models will share the same issue. In any context in\nwhich these models are being used, this creates the potential for learning\nincorrect models, which creates risk of operational failure. We identify and\ndemonstrate this problem via a proposed \"algorithmic unit test\", where we\ncreate synthetic datasets that can be solved by a MIL respecting model, and\nwhich clearly reveal learning that violates MIL assumptions. 
The five evaluated\nmethods each fail one or more of these tests. This provides a model-agnostic\nway to identify violations of modeling assumptions, which we hope will be\nuseful for future development and evaluation of MIL models.\n","authors":["Edward Raff","James Holt"],"pdf_url":"https://arxiv.org/pdf/2310.17867v1.pdf","comment":"To appear in the 37th Conference on Neural Information Processing\n Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.15970v3","updated":"2023-10-27T02:54:29Z","published":"2023-10-24T16:10:58Z","title":"Accented Speech Recognition With Accent-specific Codebooks","summary":" Speech accents pose a significant challenge to state-of-the-art automatic\nspeech recognition (ASR) systems. Degradation in performance across\nunderrepresented accents is a severe deterrent to the inclusive adoption of\nASR. In this work, we propose a novel accent adaptation approach for end-to-end\nASR systems using cross-attention with a trainable set of codebooks. These\nlearnable codebooks capture accent-specific information and are integrated\nwithin the ASR encoder layers. The model is trained on accented English speech,\nwhile the test data also contained accents which were not seen during training.\nOn the Mozilla Common Voice multi-accented dataset, we show that our proposed\napproach yields significant performance gains not only on the seen English\naccents (up to $37\\%$ relative improvement in word error rate) but also on the\nunseen accents (up to $5\\%$ relative improvement in WER). Further, we\nillustrate benefits for a zero-shot transfer setup on the L2Artic dataset. 
We\nalso compare the performance with other approaches based on accent adversarial\ntraining.\n","authors":["Darshan Prabhu","Preethi Jyothi","Sriram Ganapathy","Vinit Unni"],"pdf_url":"https://arxiv.org/pdf/2310.15970v3.pdf","comment":"Accepted to EMNLP 2023 Main Conference (Long Paper)"},{"id":"http://arxiv.org/abs/2301.12534v3","updated":"2023-10-27T02:42:25Z","published":"2023-01-29T20:39:21Z","title":"Vicarious Offense and Noise Audit of Offensive Speech Classifiers:\n Unifying Human and Machine Disagreement on What is Offensive","summary":" Offensive speech detection is a key component of content moderation. However,\nwhat is offensive can be highly subjective. This paper investigates how machine\nand human moderators disagree on what is offensive when it comes to real-world\nsocial web political discourse. We show that (1) there is extensive\ndisagreement among the moderators (humans and machines); and (2) human and\nlarge-language-model classifiers are unable to predict how other human raters\nwill respond, based on their political leanings. For (1), we conduct a noise\naudit at an unprecedented scale that combines both machine and human responses.\nFor (2), we introduce a first-of-its-kind dataset of vicarious offense. Our\nnoise audit reveals that moderation outcomes vary wildly across different\nmachine moderators. Our experiments with human moderators suggest that\npolitical leanings combined with sensitive issues affect both first-person and\nvicarious offense. The dataset is available through\nhttps://github.com/Homan-Lab/voiced.\n","authors":["Tharindu Cyril Weerasooriya","Sujan Dutta","Tharindu Ranasinghe","Marcos Zampieri","Christopher M. Homan","Ashiqur R. 
KhudaBukhsh"],"pdf_url":"https://arxiv.org/pdf/2301.12534v3.pdf","comment":"Accepted to appear at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.17513v2","updated":"2023-10-27T02:36:44Z","published":"2023-10-26T16:08:33Z","title":"The Expressive Power of Low-Rank Adaptation","summary":" Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that\nleverages low-rank adaptation of weight matrices, has emerged as a prevalent\ntechnique for fine-tuning pre-trained models such as large language models and\ndiffusion models. Despite its huge success in practice, the theoretical\nunderpinnings of LoRA have largely remained unexplored. This paper takes the\nfirst step to bridge this gap by theoretically analyzing the expressive power\nof LoRA. We prove that, for fully connected neural networks, LoRA can adapt any\nmodel $f$ to accurately represent any smaller target model $\\overline{f}$ if\nLoRA-rank $\\geq(\\text{width of }f) \\times \\frac{\\text{depth of\n}\\overline{f}}{\\text{depth of }f}$. We also quantify the approximation error\nwhen LoRA-rank is lower than the threshold. For Transformer networks, we show\nany model can be adapted to a target model of the same size with\nrank-$(\\frac{\\text{embedding size}}{2})$ LoRA adapters.\n","authors":["Yuchen Zeng","Kangwook Lee"],"pdf_url":"https://arxiv.org/pdf/2310.17513v2.pdf","comment":"40 pages, 5 figures"},{"id":"http://arxiv.org/abs/2307.01616v2","updated":"2023-10-27T02:13:47Z","published":"2023-07-04T10:08:25Z","title":"SageFormer: Series-Aware Framework for Long-term Multivariate Time\n Series Forecasting","summary":" In the burgeoning ecosystem of Internet of Things, multivariate time series\n(MTS) data has become ubiquitous, highlighting the fundamental role of time\nseries forecasting across numerous applications. The crucial challenge of\nlong-term MTS forecasting requires adept models capable of capturing both\nintra- and inter-series dependencies. 
Recent advancements in deep learning,\nnotably Transformers, have shown promise. However, many prevailing methods\neither marginalize inter-series dependencies or overlook them entirely. To\nbridge this gap, this paper introduces a novel series-aware framework,\nexplicitly designed to emphasize the significance of such dependencies. At the\nheart of this framework lies our specific implementation: the SageFormer. As a\nSeries-aware Graph-enhanced Transformer model, SageFormer proficiently discerns\nand models the intricate relationships between series using graph structures.\nBeyond capturing diverse temporal patterns, it also curtails redundant\ninformation across series. Notably, the series-aware framework seamlessly\nintegrates with existing Transformer-based models, enriching their ability to\ncomprehend inter-series relationships. Extensive experiments on real-world and\nsynthetic datasets validate the superior performance of SageFormer against\ncontemporary state-of-the-art approaches.\n","authors":["Zhenwei Zhang","Linghang Meng","Yuantao Gu"],"pdf_url":"https://arxiv.org/pdf/2307.01616v2.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2306.13229v2","updated":"2023-10-27T02:07:14Z","published":"2023-06-22T22:21:53Z","title":"TACO: Temporal Latent Action-Driven Contrastive Loss for Visual\n Reinforcement Learning","summary":" Despite recent progress in reinforcement learning (RL) from raw pixel data,\nsample inefficiency continues to present a substantial obstacle. Prior works\nhave attempted to address this challenge by creating self-supervised auxiliary\ntasks, aiming to enrich the agent's learned representations with\ncontrol-relevant information for future state prediction. 
However, these\nobjectives are often insufficient to learn representations that can represent\nthe optimal policy or value function, and they often consider tasks with small,\nabstract discrete action spaces and thus overlook the importance of action\nrepresentation learning in continuous control. In this paper, we introduce\nTACO: Temporal Action-driven Contrastive Learning, a simple yet powerful\ntemporal contrastive learning approach that facilitates the concurrent\nacquisition of latent state and action representations for agents. TACO\nsimultaneously learns a state and an action representation by optimizing the\nmutual information between representations of current states paired with action\nsequences and representations of the corresponding future states.\nTheoretically, TACO can be shown to learn state and action representations that\nencompass sufficient information for control, thereby improving sample\nefficiency. For online RL, TACO achieves 40% performance boost after one\nmillion environment interaction steps on average across nine challenging visual\ncontinuous control tasks from Deepmind Control Suite. 
In addition, we show that\nTACO can also serve as a plug-and-play module adding to existing offline visual\nRL methods to establish the new state-of-the-art performance for offline visual\nRL across offline datasets with varying quality.\n","authors":["Ruijie Zheng","Xiyao Wang","Yanchao Sun","Shuang Ma","Jieyu Zhao","Huazhe Xu","Hal Daumé III","Furong Huang"],"pdf_url":"https://arxiv.org/pdf/2306.13229v2.pdf","comment":"Accepted at 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.17852v1","updated":"2023-10-27T02:04:31Z","published":"2023-10-27T02:04:31Z","title":"Function Space Bayesian Pseudocoreset for Bayesian Neural Networks","summary":" A Bayesian pseudocoreset is a compact synthetic dataset summarizing essential\ninformation of a large-scale dataset and thus can be used as a proxy dataset\nfor scalable Bayesian inference. Typically, a Bayesian pseudocoreset is\nconstructed by minimizing a divergence measure between the posterior\nconditioning on the pseudocoreset and the posterior conditioning on the full\ndataset. However, evaluating the divergence can be challenging, particularly\nfor the models like deep neural networks having high-dimensional parameters. In\nthis paper, we propose a novel Bayesian pseudocoreset construction method that\noperates on a function space. Unlike previous methods, which construct and\nmatch the coreset and full data posteriors in the space of model parameters\n(weights), our method constructs variational approximations to the coreset\nposterior on a function space and matches it to the full data posterior in the\nfunction space. By working directly on the function space, our method could\nbypass several challenges that may arise when working on a weight space,\nincluding limited scalability and multi-modality issue. 
Through various\nexperiments, we demonstrate that the Bayesian pseudocoresets constructed from\nour method enjoy enhanced uncertainty quantification and better robustness\nacross various model architectures.\n","authors":["Balhae Kim","Hyungi Lee","Juho Lee"],"pdf_url":"https://arxiv.org/pdf/2310.17852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10082v2","updated":"2023-10-27T01:59:04Z","published":"2023-10-16T05:26:03Z","title":"A simple uniformly optimal method without line search for convex\n optimization","summary":" Line search (or backtracking) procedures have been widely employed in\nfirst-order methods for solving convex optimization problems, especially those\nwith unknown problem parameters (e.g., Lipschitz constant). In this paper, we\nshow that line search is superfluous in attaining the optimal rate of\nconvergence for solving a convex optimization problem whose parameters are not\ngiven a priori. In particular, we present a novel accelerated gradient descent\ntype algorithm called auto-conditioned fast gradient method (AC-FGM) that can\nachieve an optimal $\\mathcal{O}(1/k^2)$ rate of convergence for smooth convex\noptimization without requiring the estimate of a global Lipschitz constant or\nthe employment of line search procedures. 
We then extend AC-FGM to solve convex\noptimization problems with H\\\"{o}lder continuous gradients and show that it\nautomatically achieves the optimal rates of convergence uniformly for all\nproblem classes with the desired accuracy of the solution as the only input.\nFinally, we report some encouraging numerical results that demonstrate the\nadvantages of AC-FGM over the previously developed parameter-free methods for\nconvex optimization.\n","authors":["Tianjiao Li","Guanghui Lan"],"pdf_url":"https://arxiv.org/pdf/2310.10082v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17848v1","updated":"2023-10-27T01:57:27Z","published":"2023-10-27T01:57:27Z","title":"Boosting Data Analytics With Synthetic Volume Expansion","summary":" Synthetic data generation, a cornerstone of Generative Artificial\nIntelligence, signifies a paradigm shift in data science by addressing data\nscarcity and privacy while enabling unprecedented performance. As synthetic\ndata gains prominence, questions arise concerning the accuracy of statistical\nmethods when applied to synthetic data compared to raw data. In this article,\nwe introduce the Synthetic Data Generation for Analytics framework. This\nframework employs statistical methods on high-fidelity synthetic data generated\nby advanced models such as tabular diffusion and Generative Pre-trained\nTransformer models. These models, trained on raw data, are further enhanced\nwith insights from pertinent studies. A significant discovery within this\nframework is the generational effect: the error of a statistical method on\nsynthetic data initially diminishes with added synthetic data but may\neventually increase or plateau. 
This phenomenon, rooted in the complexities of\nreplicating raw data distributions, highlights a \"reflection point\"--an optimal\nthreshold in the size of synthetic data determined by specific error metrics.\nThrough three illustrative case studies-sentiment analysis of texts, predictive\nmodeling of structured data, and inference in tabular data--we demonstrate the\neffectiveness of this framework over traditional ones. We underline its\npotential to amplify various statistical methods, including gradient boosting\nfor prediction and hypothesis testing, thereby underscoring the transformative\npotential of synthetic data generation in data science.\n","authors":["Xiaotong Shen","Yifei Liu","Rex Shen"],"pdf_url":"https://arxiv.org/pdf/2310.17848v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08623v2","updated":"2023-10-27T01:51:48Z","published":"2023-07-14T05:41:22Z","title":"HYTREL: Hypergraph-enhanced Tabular Data Representation Learning","summary":" Language models pretrained on large collections of tabular data have\ndemonstrated their effectiveness in several downstream tasks. However, many of\nthese models do not take into account the row/column permutation invariances,\nhierarchical structure, etc. that exist in tabular data. To alleviate these\nlimitations, we propose HYTREL, a tabular language model, that captures the\npermutation invariances and three more structural properties of tabular data by\nusing hypergraphs - where the table cells make up the nodes and the cells\noccurring jointly together in each row, column, and the entire table are used\nto form three different types of hyperedges. We show that HYTREL is maximally\ninvariant under certain conditions for tabular data, i.e., two tables obtain\nthe same representations via HYTREL iff the two tables are identical up to\npermutations. 
Our empirical results demonstrate that HYTREL consistently\noutperforms other competitive baselines on four downstream tasks with minimal\npretraining, illustrating the advantages of incorporating the inductive biases\nassociated with tabular data into the representations. Finally, our qualitative\nanalyses showcase that HYTREL can assimilate the table structures to generate\nrobust representations for the cells, rows, columns, and the entire table.\n","authors":["Pei Chen","Soumajyoti Sarkar","Leonard Lausen","Balasubramaniam Srinivasan","Sheng Zha","Ruihong Huang","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2307.08623v2.pdf","comment":"NeurIPS 2023 (spotlight)"},{"id":"http://arxiv.org/abs/2310.17843v1","updated":"2023-10-27T01:49:13Z","published":"2023-10-27T01:49:13Z","title":"A Data-Centric Online Market for Machine Learning: From Discovery to\n Pricing","summary":" Data fuels machine learning (ML) - rich and high-quality training data is\nessential to the success of ML. However, to transform ML from the race among a\nfew large corporations to an accessible technology that serves numerous normal\nusers' data analysis requests, there still exist important challenges. One gap\nwe observed is that many ML users can benefit from new data that other data\nowners possess, whereas these data owners sit on piles of data without knowing\nwho can benefit from it. This gap creates the opportunity for building an\nonline market that can automatically connect supply with demand. 
While online\nmatching markets are prevalent (e.g., ride-hailing systems), designing a\ndata-centric market for ML exhibits many unprecedented challenges.\n This paper develops new techniques to tackle two core challenges in designing\nsuch a market: (a) to efficiently match demand with supply, we design an\nalgorithm to automatically discover useful data for any ML task from a pool of\nthousands of datasets, achieving high-quality matching between ML models and\ndata; (b) to encourage market participation of ML users without much ML\nexpertise, we design a new pricing mechanism for selling data-augmented ML\nmodels. Furthermore, our market is designed to be API-compatible with existing\nonline ML markets like Vertex AI and Sagemaker, making it easy to use while\nproviding better results due to joint data and model search. We envision that\nthe synergy of our data and model discovery algorithm and pricing mechanism\nwill be an important step towards building a new data-centric online market\nthat serves ML users effectively.\n","authors":["Minbiao Han","Jonathan Light","Steven Xia","Sainyam Galhotra","Raul Castro Fernandez","Haifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.17843v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.15328v2","updated":"2023-10-27T01:44:27Z","published":"2023-05-24T16:42:17Z","title":"Visual Programming for Text-to-Image Generation and Evaluation","summary":" As large language models have demonstrated impressive performance in many\ndomains, recent works have adopted language models (LMs) as controllers of\nvisual modules for vision-and-language tasks. While existing work focuses on\nequipping LMs with visual understanding, we propose two novel\ninterpretable/explainable visual programming frameworks for text-to-image (T2I)\ngeneration and evaluation. 
First, we introduce VPGen, an interpretable\nstep-by-step T2I generation framework that decomposes T2I generation into three\nsteps: object/count generation, layout generation, and image generation. We\nemploy an LM to handle the first two steps (object/count generation and layout\ngeneration), by finetuning it on text-layout pairs. Our step-by-step T2I\ngeneration framework provides stronger spatial control than end-to-end models,\nthe dominant approach for this task. Furthermore, we leverage the world\nknowledge of pretrained LMs, overcoming the limitation of previous\nlayout-guided T2I works that can only handle predefined object classes. We\ndemonstrate that our VPGen has improved control in counts/spatial\nrelations/scales of objects than state-of-the-art T2I generation models.\nSecond, we introduce VPEval, an interpretable and explainable evaluation\nframework for T2I generation based on visual programming. Unlike previous T2I\nevaluations with a single scoring model that is accurate in some skills but\nunreliable in others, VPEval produces evaluation programs that invoke a set of\nvisual modules that are experts in different skills, and also provides\nvisual+textual explanations of the evaluation results. Our analysis shows that\nVPEval provides a more human-correlated evaluation for skill-specific and\nopen-ended prompts than widely used single model-based evaluation. 
We hope that\nour work encourages future progress on interpretable/explainable generation and\nevaluation for T2I models.\n","authors":["Jaemin Cho","Abhay Zala","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2305.15328v2.pdf","comment":"NeurIPS 2023; Project website: https://vp-t2i.github.io"},{"id":"http://arxiv.org/abs/2310.17836v1","updated":"2023-10-27T01:29:41Z","published":"2023-10-27T01:29:41Z","title":"Positional Encoding-based Resident Identification in Multi-resident\n Smart Homes","summary":" We propose a novel resident identification framework to identify residents in\na multi-occupant smart environment. The proposed framework employs a feature\nextraction model based on the concepts of positional encoding. The feature\nextraction model considers the locations of homes as a graph. We design a novel\nalgorithm to build such graphs from layout maps of smart environments. The\nNode2Vec algorithm is used to transform the graph into high-dimensional node\nembeddings. A Long Short-Term Memory (LSTM) model is introduced to predict the\nidentities of residents using temporal sequences of sensor events with the node\nembeddings. Extensive experiments show that our proposed scheme effectively\nidentifies residents in a multi-occupant environment. Evaluation results on two\nreal-world datasets demonstrate that our proposed approach achieves 94.5% and\n87.9% accuracy, respectively.\n","authors":["Zhiyi Song","Dipankar Chaki","Abdallah Lakhdari","Athman Bouguettaya"],"pdf_url":"https://arxiv.org/pdf/2310.17836v1.pdf","comment":"27 pages, 11 figures, 2 tables"},{"id":"http://arxiv.org/abs/2310.16314v2","updated":"2023-10-27T01:22:52Z","published":"2023-10-25T02:41:50Z","title":"Understanding Code Semantics: An Evaluation of Transformer Models in\n Summarization","summary":" This paper delves into the intricacies of code summarization using advanced\ntransformer-based language models. 
Through empirical studies, we evaluate the\nefficacy of code summarization by altering function and variable names to\nexplore whether models truly understand code semantics or merely rely on\ntextual cues. We have also introduced adversaries like dead code and commented\ncode across three programming languages (Python, Javascript, and Java) to\nfurther scrutinize the model's understanding. Ultimately, our research aims to\noffer valuable insights into the inner workings of transformer-based LMs,\nenhancing their ability to understand code and contributing to more efficient\nsoftware development practices and maintenance workflows.\n","authors":["Debanjan Mondal","Abhilasha Lodha","Ankita Sahoo","Beena Kumari"],"pdf_url":"https://arxiv.org/pdf/2310.16314v2.pdf","comment":"Accepted at GenBench, EMNLP 2023. All authors are co-first authors\n and have equal contributions"},{"id":"http://arxiv.org/abs/2306.01708v2","updated":"2023-10-27T01:09:31Z","published":"2023-06-02T17:31:32Z","title":"TIES-Merging: Resolving Interference When Merging Models","summary":" Transfer learning - i.e., further fine-tuning a pre-trained model on a\ndownstream task - can confer significant advantages, including improved\ndownstream performance, faster convergence, and better sample efficiency. These\nadvantages have led to a proliferation of task-specific fine-tuned models,\nwhich typically can only perform a single task and do not benefit from one\nanother. Recently, model merging techniques have emerged as a solution to\ncombine multiple task-specific models into a single multitask model without\nperforming additional training. However, existing merging methods often ignore\nthe interference between parameters of different models, resulting in large\nperformance drops when merging multiple models. 
In this paper, we demonstrate\nthat prior merging techniques inadvertently lose valuable information due to\ntwo major sources of interference: (a) interference due to redundant parameter\nvalues and (b) disagreement on the sign of a given parameter's values across\nmodels. To address this, we propose our method, TRIM, ELECT SIGN & MERGE\n(TIES-Merging), which introduces three novel steps when merging models: (1)\nresetting parameters that only changed a small amount during fine-tuning, (2)\nresolving sign conflicts, and (3) merging only the parameters that are in\nalignment with the final agreed-upon sign. We find that TIES-Merging\noutperforms several existing methods in diverse settings covering a range of\nmodalities, domains, number of tasks, model sizes, architectures, and\nfine-tuning settings. We further analyze the impact of different types of\ninterference on model parameters, and highlight the importance of resolving\nsign interference. Our code is available at\nhttps://github.com/prateeky2806/ties-merging\n","authors":["Prateek Yadav","Derek Tam","Leshem Choshen","Colin Raffel","Mohit Bansal"],"pdf_url":"https://arxiv.org/pdf/2306.01708v2.pdf","comment":"Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables"},{"id":"http://arxiv.org/abs/2310.17341v2","updated":"2023-10-27T01:07:25Z","published":"2023-10-26T12:15:56Z","title":"De-novo Chemical Reaction Generation by Means of Temporarily\n Convolutional Neural Networks","summary":" We present here a combination of two networks, Recurrent Neural Networks\n(RNN) and Temporarily Convolutional Neural Networks (TCN) in de novo reaction\ngeneration using the novel Reaction Smiles-like representation of reactions\n(CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks\nare known for their autoregressive properties and are frequently used in\nlanguage modelling with direct application to SMILES generation. 
The relatively\nnovel TCNs possess similar properties with wide receptive field while obeying\nthe causality required for natural language processing (NLP). The combination\nof both latent representations expressed through TCN and RNN results in an\noverall better performance compared to RNN alone. Additionally, it is shown\nthat different fine-tuning protocols have a profound impact on generative scope\nof the model when applied on a dataset of interest via transfer learning.\n","authors":["Andrei Buin","Hung Yi Chiang","S. Andrew Gadsden","Faraz A. Alderson"],"pdf_url":"https://arxiv.org/pdf/2310.17341v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.06614v4","updated":"2023-10-27T01:02:22Z","published":"2023-03-12T09:10:45Z","title":"Synthetic Experience Replay","summary":" A key theme in the past decade has been that when large neural networks and\nlarge datasets combine they can produce remarkable results. In deep\nreinforcement learning (RL), this paradigm is commonly made possible through\nexperience replay, whereby a dataset of past experiences is used to train a\npolicy or value function. However, unlike in supervised or self-supervised\nlearning, an RL agent has to collect its own data, which is often limited.\nThus, it is challenging to reap the benefits of deep learning, and even small\nneural networks can overfit at the start of training. In this work, we leverage\nthe tremendous recent progress in generative modeling and propose Synthetic\nExperience Replay (SynthER), a diffusion-based approach to flexibly upsample an\nagent's collected experience. We show that SynthER is an effective method for\ntraining RL agents across offline and online settings, in both proprioceptive\nand pixel-based environments. 
In offline settings, we observe drastic\nimprovements when upsampling small offline datasets and see that additional\nsynthetic data also allows us to effectively train larger networks.\nFurthermore, SynthER enables online agents to train with a much higher\nupdate-to-data ratio than before, leading to a significant increase in sample\nefficiency, without any algorithmic changes. We believe that synthetic training\ndata could open the door to realizing the full potential of deep learning for\nreplay-based RL algorithms from limited data. Finally, we open-source our code\nat https://github.com/conglu1997/SynthER.\n","authors":["Cong Lu","Philip J. Ball","Yee Whye Teh","Jack Parker-Holder"],"pdf_url":"https://arxiv.org/pdf/2303.06614v4.pdf","comment":"Published at NeurIPS, 2023"},{"id":"http://arxiv.org/abs/2310.17829v1","updated":"2023-10-27T00:41:55Z","published":"2023-10-27T00:41:55Z","title":"Hybrid Optical Turbulence Models Using Machine Learning and Local\n Measurements","summary":" Accurate prediction of atmospheric optical turbulence in localized\nenvironments is essential for estimating the performance of free-space optical\nsystems. Macro-meteorological models developed to predict turbulent effects in\none environment may fail when applied in new environments. However, existing\nmacro-meteorological models are expected to offer some predictive power.\nBuilding a new model from locally-measured macro-meteorology and scintillometer\nreadings can require significant time and resources, as well as a large number\nof observations. These challenges motivate the development of a\nmachine-learning informed hybrid model framework. By combining some baseline\nmacro-meteorological model with local observations, hybrid models were trained\nto improve upon the predictive power of each baseline model. 
Comparisons\nbetween the performance of the hybrid models, the selected baseline\nmacro-meteorological models, and machine-learning models trained only on local\nobservations highlight potential use cases for the hybrid model framework when\nlocal data is expensive to collect. Both the hybrid and data-only models were\ntrained using the Gradient Boosted Decision Tree (GBDT) architecture with a\nvariable number of in-situ meteorological observations. The hybrid and\ndata-only models were found to outperform three baseline macro-meteorological\nmodels, even for low numbers of observations, in some cases as little as one\nday. For the first baseline macro-meteorological model investigated, the hybrid\nmodel achieves an estimated 29% reduction in mean absolute error (MAE) using\nonly one days-equivalent of observation, growing to 41% after only two days,\nand 68% after 180 days-equivalent training data. The number of days-equivalent\ntraining data required is potentially indicative of the seasonal variation in\nthe local microclimate and its propagation environment.\n","authors":["Christopher Jellen","Charles Nelson","John Burkhardt","Cody Brownell"],"pdf_url":"https://arxiv.org/pdf/2310.17829v1.pdf","comment":"15 pages, 8 figures"},{"id":"http://arxiv.org/abs/2306.07280v2","updated":"2023-10-27T00:36:03Z","published":"2023-06-12T17:59:23Z","title":"Controlling Text-to-Image Diffusion by Orthogonal Finetuning","summary":" Large text-to-image diffusion models have impressive capabilities in\ngenerating photorealistic images from text prompts. How to effectively guide or\ncontrol these powerful models to perform different downstream tasks becomes an\nimportant open problem. To tackle this challenge, we introduce a principled\nfinetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image\ndiffusion models to downstream tasks. 
Unlike existing methods, OFT can provably\npreserve hyperspherical energy which characterizes the pairwise neuron\nrelationship on the unit hypersphere. We find that this property is crucial for\npreserving the semantic generation ability of text-to-image diffusion models.\nTo improve finetuning stability, we further propose Constrained Orthogonal\nFinetuning (COFT) which imposes an additional radius constraint to the\nhypersphere. Specifically, we consider two important finetuning text-to-image\ntasks: subject-driven generation where the goal is to generate subject-specific\nimages given a few images of a subject and a text prompt, and controllable\ngeneration where the goal is to enable the model to take in additional control\nsignals. We empirically show that our OFT framework outperforms existing\nmethods in generation quality and convergence speed.\n","authors":["Zeju Qiu","Weiyang Liu","Haiwen Feng","Yuxuan Xue","Yao Feng","Zhen Liu","Dan Zhang","Adrian Weller","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2306.07280v2.pdf","comment":"NeurIPS 2023 (43 pages, 34 figures, project page:\n https://oft.wyliu.com/)"},{"id":"http://arxiv.org/abs/2302.10359v3","updated":"2023-10-27T00:14:27Z","published":"2023-02-20T23:29:43Z","title":"Replicable Clustering","summary":" We design replicable algorithms in the context of statistical clustering\nunder the recently introduced notion of replicability from Impagliazzo et al.\n[2022]. According to this definition, a clustering algorithm is replicable if,\nwith high probability, its output induces the exact same partition of the\nsample space after two executions on different inputs drawn from the same\ndistribution, when its internal randomness is shared across the executions. We\npropose such algorithms for the statistical $k$-medians, statistical $k$-means,\nand statistical $k$-centers problems by utilizing approximation routines for\ntheir combinatorial counterparts in a black-box manner. 
In particular, we\ndemonstrate a replicable $O(1)$-approximation algorithm for statistical\nEuclidean $k$-medians ($k$-means) with $\\operatorname{poly}(d)$ sample\ncomplexity. We also describe an $O(1)$-approximation algorithm with an\nadditional $O(1)$-additive error for statistical Euclidean $k$-centers, albeit\nwith $\\exp(d)$ sample complexity. In addition, we provide experiments on\nsynthetic distributions in 2D using the $k$-means++ implementation from sklearn\nas a black-box that validate our theoretical results.\n","authors":["Hossein Esfandiari","Amin Karbasi","Vahab Mirrokni","Grigoris Velegkas","Felix Zhou"],"pdf_url":"https://arxiv.org/pdf/2302.10359v3.pdf","comment":"to be published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.08631v3","updated":"2023-10-27T00:04:33Z","published":"2023-02-17T00:06:42Z","title":"Practical Contextual Bandits with Feedback Graphs","summary":" While contextual bandit has a mature theory, effectively leveraging different\nfeedback patterns to enhance the pace of learning remains unclear. Bandits with\nfeedback graphs, which interpolates between the full information and bandit\nregimes, provides a promising framework to mitigate the statistical complexity\nof learning. In this paper, we propose and analyze an approach to contextual\nbandits with feedback graphs based upon reduction to regression. The resulting\nalgorithms are computationally practical and achieve established minimax rates,\nthereby reducing the statistical complexity in real-world applications.\n","authors":["Mengxiao Zhang","Yuheng Zhang","Olga Vrousgou","Haipeng Luo","Paul Mineiro"],"pdf_url":"https://arxiv.org/pdf/2302.08631v3.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2308.05037v2","updated":"2023-10-27T15:34:06Z","published":"2023-08-09T16:09:44Z","title":"Separate Anything You Describe","summary":" Language-queried audio source separation (LASS) is a new paradigm for\ncomputational auditory scene analysis (CASA). 
LASS aims to separate a target\nsound from an audio mixture given a natural language query, which provides a\nnatural and scalable interface for digital audio applications. Recent works on\nLASS, despite attaining promising separation performance on specific sources\n(e.g., musical instruments, limited classes of audio events), are unable to\nseparate audio concepts in the open domain. In this work, we introduce\nAudioSep, a foundation model for open-domain audio source separation with\nnatural language queries. We train AudioSep on large-scale multimodal datasets\nand extensively evaluate its capabilities on numerous tasks including audio\nevent separation, musical instrument separation, and speech enhancement.\nAudioSep demonstrates strong separation performance and impressive zero-shot\ngeneralization ability using audio captions or text labels as queries,\nsubstantially outperforming previous audio-queried and language-queried sound\nseparation models. For reproducibility of this work, we will release the source\ncode, evaluation benchmark and pre-trained model at:\nhttps://github.com/Audio-AGI/AudioSep.\n","authors":["Xubo Liu","Qiuqiang Kong","Yan Zhao","Haohe Liu","Yi Yuan","Yuzhuo Liu","Rui Xia","Yuxuan Wang","Mark D. Plumbley","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2308.05037v2.pdf","comment":"Code, benchmark and pre-trained models:\n https://github.com/Audio-AGI/AudioSep"},{"id":"http://arxiv.org/abs/2310.18099v1","updated":"2023-10-27T12:34:50Z","published":"2023-10-27T12:34:50Z","title":"Enabling Acoustic Audience Feedback in Large Virtual Events","summary":" The COVID-19 pandemic shifted many events in our daily lives into the virtual\ndomain. While virtual conference systems provide an alternative to physical\nmeetings, larger events require a muted audience to avoid an accumulation of\nbackground noise and distorted audio. However, performing artists strongly rely\non the feedback of their audience. 
We propose a concept for a virtual audience\nframework which supports all participants with the ambience of a real audience.\nAudience feedback is collected locally, allowing users to express enthusiasm or\ndiscontent by selecting means such as clapping, whistling, booing, and\nlaughter. This feedback is sent as abstract information to a virtual audience\nserver. We broadcast the combined virtual audience feedback information to all\nparticipants, which can be synthesized as a single acoustic feedback by the\nclient. The synthesis can be done by turning the collective audience feedback\ninto a prompt that is fed to state-of-the-art models such as AudioGen. This\nway, each user hears a single acoustic feedback sound of the entire virtual\nevent, without requiring to unmute or risk hearing distorted, unsynchronized\nfeedback.\n","authors":["Tamay Aykut","Markus Hofbauer","Christopher Kuhn","Eckehard Steinbach","Bernd Girod"],"pdf_url":"https://arxiv.org/pdf/2310.18099v1.pdf","comment":"4 pages, 2 figures"},{"id":"http://arxiv.org/abs/2308.14263v2","updated":"2023-10-27T01:40:41Z","published":"2023-08-28T02:38:17Z","title":"Cross-Modal Retrieval: A Systematic Review of Methods and Future\n Directions","summary":" With the exponential surge in diverse multi-modal data, traditional uni-modal\nretrieval methods struggle to meet the needs of users demanding access to data\nfrom various modalities. To address this, cross-modal retrieval has emerged,\nenabling interaction across modalities, facilitating semantic matching, and\nleveraging complementarity and consistency between different modal data.\nAlthough prior literature undertook a review of the cross-modal retrieval\nfield, it exhibits numerous deficiencies pertaining to timeliness, taxonomy,\nand comprehensiveness. This paper conducts a comprehensive review of\ncross-modal retrieval's evolution, spanning from shallow statistical analysis\ntechniques to vision-language pre-training models. 
Commencing with a\ncomprehensive taxonomy grounded in machine learning paradigms, mechanisms, and\nmodels, the paper then delves deeply into the principles and architectures\nunderpinning existing cross-modal retrieval methods. Furthermore, it offers an\noverview of widely used benchmarks, metrics, and performances. Lastly, the\npaper probes the prospects and challenges that confront contemporary\ncross-modal retrieval, while engaging in a discourse on potential directions\nfor further progress in the field. To facilitate the research on cross-modal\nretrieval, we develop an open-source code repository at\nhttps://github.com/BMC-SDNU/Cross-Modal-Retrieval.\n","authors":["Fengling Li","Lei Zhu","Tianshi Wang","Jingjing Li","Zheng Zhang","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2308.14263v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18461v1","updated":"2023-10-27T20:14:00Z","published":"2023-10-27T20:14:00Z","title":"Improved Lossless Coding for Storage and Transmission of Multichannel\n Immersive Audio","summary":" In this paper, techniques for improving multichannel lossless coding are\nexamined. A method is proposed for the simultaneous coding of two or more\ndifferent renderings (mixes) of the same content. The signal model uses both\npast samples of the upmix, and the current time samples of downmix samples to\npredict the upmix. Model parameters are optimized via a general linear solver,\nand the prediction residual is Rice coded. Additionally, the use of an SVD\nprojection prior to residual coding is proposed. A comparison is made against\nvarious baselines, including FLAC. 
The proposed methods show improved\ncompression ratios for the storage and transmission of immersive audio.\n","authors":["Toni Hirvonen","Mahmoud Namazi"],"pdf_url":"https://arxiv.org/pdf/2310.18461v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18377v1","updated":"2023-10-27T00:44:40Z","published":"2023-10-27T00:44:40Z","title":"Large-scale Foundation Models and Generative AI for BigData Neuroscience","summary":" Recent advances in machine learning have made revolutionary breakthroughs in\ncomputer games, image and natural language understanding, and scientific\ndiscovery. Foundation models and large-scale language models (LLMs) have\nrecently achieved human-like intelligence thanks to BigData. With the help of\nself-supervised learning (SSL) and transfer learning, these models may\npotentially reshape the landscapes of neuroscience research and make a\nsignificant impact on the future. Here we present a mini-review on recent\nadvances in foundation models and generative AI models as well as their\napplications in neuroscience, including natural language and speech, semantic\nmemory, brain-machine interfaces (BMIs), and data augmentation. We argue that\nthis paradigm-shift framework will open new avenues for many neuroscience\nresearch directions and discuss the accompanying challenges and opportunities.\n","authors":["Ran Wang","Zhe Sage Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18377v1.pdf","comment":null}]},"2023-10-30T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.17976v2","updated":"2023-10-30T03:13:15Z","published":"2023-10-27T08:42:18Z","title":"Does Role-Playing Chatbots Capture the Character Personalities?\n Assessing Personality Traits for Role-Playing Chatbots","summary":" The emergence of large-scale pretrained language models has revolutionized\nthe capabilities of new AI application, especially in the realm of crafting\nchatbots with distinct personas. 
Given the \"stimulus-response\" nature of\nchatbots, this paper unveils an innovative open-ended interview-style approach\nfor personality assessment on role-playing chatbots, which offers a richer\ncomprehension of their intrinsic personalities. We conduct personality\nassessments on 32 role-playing chatbots created by the ChatHaruhi library,\nacross both the Big Five and MBTI dimensions, and measure their alignment with\nhuman perception. Evaluation results underscore that modern role-playing\nchatbots based on LLMs can effectively portray personality traits of\ncorresponding characters, with an alignment rate of 82.8% compared with\nhuman-perceived personalities. Besides, we also suggest potential strategies\nfor shaping chatbots' personalities. Hence, this paper serves as a cornerstone\nstudy for role-playing chatbots that intersects computational linguistics and\npsychology. Our resources are available at\nhttps://github.com/LC1332/Chat-Haruhi-Suzumiya\n","authors":["Xintao Wang","Quan Tu","Yaying Fei","Ziang Leng","Cheng Li"],"pdf_url":"https://arxiv.org/pdf/2310.17976v2.pdf","comment":"A Personality Traits Test Over ChatHaruhi"},{"id":"http://arxiv.org/abs/2310.19240v1","updated":"2023-10-30T03:11:30Z","published":"2023-10-30T03:11:30Z","title":"M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context\n Evaluation Benchmark for Large Language Models","summary":" Managing long sequences has become an important and necessary feature for\nlarge language models (LLMs). However, it is still an open question of how to\ncomprehensively and systematically evaluate the long-sequence capability of\nLLMs. One of the reasons is that conventional and widely-used benchmarks mainly\nconsist of short sequences. In this paper, we propose M4LE, a Multi-ability,\nMulti-range, Multi-task, Multi-domain benchmark for Long-context Evaluation.\nM4LE is based on a diverse NLP task pool comprising 36 NLP datasets, 11 task\ntypes and 12 domains. 
To alleviate the scarcity of tasks with naturally long\nsequences and incorporate multiple-ability assessment, we propose an automatic\napproach (but with negligible human annotations) to convert short-sequence\ntasks into a unified long-sequence scenario where LLMs have to identify single\nor multiple relevant spans in long contexts based on explicit or semantic\nhints. Specifically, the scenario includes five different types of abilities:\n(1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span;\n(4) semantic multiple-span; and (5) global context understanding. The resulting\nsamples in M4LE are evenly distributed from 1k to 8k input length. We conducted\na systematic evaluation on 11 well-established LLMs, especially those optimized\nfor long-sequence inputs. Our results reveal that: 1) Current LLMs struggle to\nunderstand long context, particularly when tasks require multiple-span\nattention. 2) Semantic retrieval task is more difficult for competent LLMs. 3)\nModels fine-tuned on longer text with position interpolation have comparable\nperformance to those using Neural Tangent Kernel (NTK) aware scaling methods\nwithout fine-tuning. 
We make our benchmark publicly available to encourage\nfuture research in this challenging area.\n","authors":["Wai-Chung Kwan","Xingshan Zeng","Yufei Wang","Yusen Sun","Liangyou Li","Lifeng Shang","Qun Liu","Kam-Fai Wong"],"pdf_url":"https://arxiv.org/pdf/2310.19240v1.pdf","comment":"Code and data are available at https://github.com/KwanWaiChung/M4LE"},{"id":"http://arxiv.org/abs/2310.12821v3","updated":"2023-10-30T03:04:07Z","published":"2023-10-19T15:17:34Z","title":"GestureGPT: Zero-shot Interactive Gesture Understanding and Grounding\n with Large Language Model Agents","summary":" Current gesture recognition systems primarily focus on identifying gestures\nwithin a predefined set, leaving a gap in connecting these gestures to\ninteractive GUI elements or system functions (e.g., linking a 'thumb-up'\ngesture to a 'like' button). We introduce GestureGPT, a novel zero-shot gesture\nunderstanding and grounding framework leveraging large language models (LLMs).\nGesture descriptions are formulated based on hand landmark coordinates from\ngesture videos and fed into our dual-agent dialogue system. A gesture agent\ndeciphers these descriptions and queries about the interaction context (e.g.,\ninterface, history, gaze data), which a context agent organizes and provides.\nFollowing iterative exchanges, the gesture agent discerns user intent,\ngrounding it to an interactive function. We validated the gesture description\nmodule using public first-view and third-view gesture datasets and tested the\nwhole system in two real-world settings: video streaming and smart home IoT\ncontrol. 
The highest zero-shot Top-5 grounding accuracies are 80.11% for video\nstreaming and 90.78% for smart home tasks, showing potential of the new gesture\nunderstanding paradigm.\n","authors":["Xin Zeng","Xiaoyu Wang","Tengxiang Zhang","Chun Yu","Shengdong Zhao","Yiqiang Chen"],"pdf_url":"https://arxiv.org/pdf/2310.12821v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17488v2","updated":"2023-10-30T02:50:17Z","published":"2023-10-26T15:44:57Z","title":"LightLM: A Lightweight Deep and Narrow Language Model for Generative\n Recommendation","summary":" This paper presents LightLM, a lightweight Transformer-based language model\nfor generative recommendation. While Transformer-based generative modeling has\ngained importance in various AI sub-fields such as NLP and vision, generative\nrecommendation is still in its infancy due to its unique demand on personalized\ngenerative modeling. Existing works on generative recommendation often use\nNLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are\nheavy-weight and are not specifically designed for recommendation tasks.\nLightLM tackles the issue by introducing a light-weight deep and narrow\nTransformer architecture, which is specifically tailored for direct generation\nof recommendation items. This structure is especially apt for straightforward\ngenerative recommendation and stems from the observation that language model\ndoes not have to be too wide for this task, as the input predominantly consists\nof short tokens that are well-suited for the model's capacity. We also show\nthat our devised user and item ID indexing methods, i.e., Spectral\nCollaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables\nthe deep and narrow Transformer architecture to outperform large-scale language\nmodels for recommendation. Besides, to address the hallucination problem of\ngenerating items as output, we propose the constrained generation process for\ngenerative recommenders. 
Experiments on real-world datasets show that LightLM\noutperforms various competitive baselines in terms of both recommendation\naccuracy and efficiency. The code can be found at\nhttps://github.com/dongyuanjushi/LightLM.\n","authors":["Kai Mei","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11053v2","updated":"2023-10-30T02:30:35Z","published":"2023-10-17T07:42:40Z","title":"Denevil: Towards Deciphering and Navigating the Ethical Values of Large\n Language Models via Instruction Learning","summary":" Large Language Models (LLMs) have made unprecedented breakthroughs, yet their\nincreasing integration into everyday life might raise societal risks due to\ngenerated unethical content. Despite extensive study on specific issues like\nbias, the intrinsic values of LLMs remain largely unexplored from a moral\nphilosophy perspective. This work delves into ethical values utilizing Moral\nFoundation Theory. Moving beyond conventional discriminative evaluations with\npoor reliability, we propose DeNEVIL, a novel prompt generation algorithm\ntailored to dynamically exploit LLMs' value vulnerabilities and elicit the\nviolation of ethics in a generative manner, revealing their underlying value\ninclinations. On such a basis, we construct MoralPrompt, a high-quality dataset\ncomprising 2,397 prompts covering 500+ value principles, and then benchmark the\nintrinsic values across a spectrum of LLMs. We discovered that most models are\nessentially misaligned, necessitating further ethical value alignment. In\nresponse, we develop VILMO, an in-context alignment method that substantially\nenhances the value compliance of LLM outputs by learning to generate\nappropriate value instructions, outperforming existing competitors. 
Our methods\nare suitable for black-box and open-source models, offering a promising initial\nstep in studying the ethical values of LLMs.\n","authors":["Shitong Duan","Xiaoyuan Yi","Peng Zhang","Tun Lu","Xing Xie","Ning Gu"],"pdf_url":"https://arxiv.org/pdf/2310.11053v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19233v1","updated":"2023-10-30T02:25:21Z","published":"2023-10-30T02:25:21Z","title":"Building Real-World Meeting Summarization Systems using Large Language\n Models: A Practical Perspective","summary":" This paper studies how to effectively build meeting summarization systems for\nreal-world usage using large language models (LLMs). For this purpose, we\nconduct an extensive evaluation and comparison of various closed-source and\nopen-source LLMs, namely, GPT-4, GPT- 3.5, PaLM-2, and LLaMA-2. Our findings\nreveal that most closed-source LLMs are generally better in terms of\nperformance. However, much smaller open-source models like LLaMA- 2 (7B and\n13B) could still achieve performance comparable to the large closed-source\nmodels even in zero-shot scenarios. Considering the privacy concerns of\nclosed-source models for only being accessible via API, alongside the high cost\nassociated with using fine-tuned versions of the closed-source models, the\nopensource models that can achieve competitive performance are more\nadvantageous for industrial use. Balancing performance with associated costs\nand privacy concerns, the LLaMA-2-7B model looks more promising for industrial\nusage. 
In sum, this paper offers practical insights on using LLMs for\nreal-world business meeting summarization, shedding light on the trade-offs\nbetween performance and cost.\n","authors":["Md Tahmid Rahman Laskar","Xue-Yong Fu","Cheng Chen","Shashi Bhushan TN"],"pdf_url":"https://arxiv.org/pdf/2310.19233v1.pdf","comment":"EMNLP 2023 Industry Track"},{"id":"http://arxiv.org/abs/2310.19232v1","updated":"2023-10-30T02:20:44Z","published":"2023-10-30T02:20:44Z","title":"Adapter Pruning using Tropical Characterization","summary":" Adapters are widely popular parameter-efficient transfer learning approaches\nin natural language processing that insert trainable modules in between layers\nof a pre-trained language model. Apart from several heuristics, however, there\nhas been a lack of studies analyzing the optimal number of adapter parameters\nneeded for downstream applications. In this paper, we propose an adapter\npruning approach by studying the tropical characteristics of trainable modules.\nWe cast it as an optimization problem that aims to prune parameters from the\nadapter layers without changing the orientation of underlying tropical\nhypersurfaces. 
Our experiments on five NLP datasets show that tropical geometry\ntends to identify more relevant parameters to prune when compared with the\nmagnitude-based baseline, while a combined approach works best across the\ntasks.\n","authors":["Rishabh Bhardwaj","Tushar Vaidya","Soujanya Poria"],"pdf_url":"https://arxiv.org/pdf/2310.19232v1.pdf","comment":"Accepted at EMNLP 2023, Findings"},{"id":"http://arxiv.org/abs/2310.18075v2","updated":"2023-10-30T02:16:55Z","published":"2023-10-27T11:43:46Z","title":"DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking","summary":" Inspired by the dual-process theory of human cognition, we introduce DUMA, a\nnovel conversational agent framework that embodies a dual-mind mechanism\nthrough the utilization of two generative Large Language Models (LLMs)\ndedicated to fast and slow thinking respectively. The fast thinking model\nserves as the primary interface for external interactions and initial response\ngeneration, evaluating the necessity for engaging the slow thinking model based\non the complexity of the complete response. When invoked, the slow thinking\nmodel takes over the conversation, engaging in meticulous planning, reasoning,\nand tool utilization to provide a well-analyzed response. This dual-mind\nconfiguration allows for a seamless transition between intuitive responses and\ndeliberate problem-solving processes based on the situation. We have\nconstructed a conversational agent to handle online inquiries in the real\nestate industry. 
The experiment proves that our method balances effectiveness\nand efficiency, and has a significant improvement compared to the baseline.\n","authors":["Xiaoyu Tian","Liangyu Chen","Na Liu","Yaxuan Liu","Wei Zou","Kaijiang Chen","Ming Cui"],"pdf_url":"https://arxiv.org/pdf/2310.18075v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07929v2","updated":"2023-10-30T01:52:11Z","published":"2023-06-09T08:08:18Z","title":"Large Language Models Are Semi-Parametric Reinforcement Learning Agents","summary":" Inspired by the insights in cognitive science with respect to human memory\nand reasoning mechanism, a novel evolvable LLM-based (Large Language Model)\nagent framework is proposed as REMEMBERER. By equipping the LLM with a\nlong-term experience memory, REMEMBERER is capable of exploiting the\nexperiences from the past episodes even for different task goals, which excels\nan LLM-based agent with fixed exemplars or equipped with a transient working\nmemory. We further introduce Reinforcement Learning with Experience Memory\n(RLEM) to update the memory. Thus, the whole system can learn from the\nexperiences of both success and failure, and evolve its capability without\nfine-tuning the parameters of the LLM. In this way, the proposed REMEMBERER\nconstitutes a semi-parametric RL agent. Extensive experiments are conducted on\ntwo RL task sets to evaluate the proposed framework. 
The average results with\ndifferent initialization and training sets exceed the prior SOTA by 4% and 2%\nfor the success rate on two task sets and demonstrate the superiority and\nrobustness of REMEMBERER.\n","authors":["Danyang Zhang","Lu Chen","Situo Zhang","Hongshen Xu","Zihan Zhao","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2306.07929v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14257v3","updated":"2023-10-30T00:55:53Z","published":"2023-05-23T17:10:39Z","title":"Hierarchical Prompting Assists Large Language Model on Web Navigation","summary":" Large language models (LLMs) struggle on processing complicated observations\nin interactive decision making tasks. To alleviate this issue, we propose a\nsimple hierarchical prompting approach. Diverging from previous prompting\napproaches that always put the full observation (e.g. a web page) to the\nprompt, we propose to first construct an action-aware observation which is more\ncondensed and relevant with a dedicated SUMMARIZER prompt. The ACTOR prompt\nthen predicts the next action based on the summarized observation. While our\nmethod has broad applicability, we particularly demonstrate its efficacy in the\ncomplex domain of web navigation where a full observation often contains\nredundant and irrelevant information. Our approach outperforms the previous\nstate-of-the-art prompting mechanics by 6.2% on task success rate,\ndemonstrating its potential on interactive decision making tasks with long\nobservation traces.\n","authors":["Abishek Sridhar","Robert Lo","Frank F. 
Xu","Hao Zhu","Shuyan Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.14257v3.pdf","comment":"EMNLP 2023 Findings; Natural Language Reasoning and Structured\n Explanations Workshop at ACL 2023"},{"id":"http://arxiv.org/abs/2310.19212v1","updated":"2023-10-30T00:46:03Z","published":"2023-10-30T00:46:03Z","title":"EHRTutor: Enhancing Patient Understanding of Discharge Instructions","summary":" Large language models have shown success as a tutor in education in various\nfields. Educating patients about their clinical visits plays a pivotal role in\npatients' adherence to their treatment plans post-discharge. This paper\npresents EHRTutor, an innovative multi-component framework leveraging the Large\nLanguage Model (LLM) for patient education through conversational\nquestion-answering. EHRTutor first formulates questions pertaining to the\nelectronic health record discharge instructions. It then educates the patient\nthrough conversation by administering each question as a test. Finally, it\ngenerates a summary at the end of the conversation. Evaluation results using\nLLMs and domain experts have shown a clear preference for EHRTutor over the\nbaseline. Moreover, EHRTutor also offers a framework for generating synthetic\npatient education dialogues that can be used for future in-house system\ntraining.\n","authors":["Zihao Zhang","Zonghai Yao","Huixue Zhou","Feiyun ouyang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2310.19212v1.pdf","comment":"To appear in NeurIPS'23 Workshop on Generative AI for Education\n (GAIED)"},{"id":"http://arxiv.org/abs/2310.19208v1","updated":"2023-10-30T00:30:34Z","published":"2023-10-30T00:30:34Z","title":"LitCab: Lightweight Calibration of Language Models on Outputs of Varied\n Lengths","summary":" A model is considered well-calibrated when its probability estimate aligns\nwith the actual likelihood of the output being correct. 
Calibrating language\nmodels (LMs) is crucial, as it plays a vital role in detecting and mitigating\nhallucinations, a common issue of LMs, as well as building more trustworthy\nmodels. Yet, popular neural model calibration techniques are not well-suited\nfor LMs due to their lack of flexibility in discerning answer correctness and\ntheir high computational costs. For instance, post-processing methods like\ntemperature scaling are often unable to reorder the candidate generations.\nMoreover, training-based methods require finetuning the entire model, which is\nimpractical due to the increasing sizes of modern LMs. In this paper, we\npresent LitCab, a lightweight calibration mechanism consisting of a single\nlinear layer taking the input text representation and manipulateing the LM\noutput logits. LitCab improves model calibration by only adding < 2% of the\noriginal model parameters. For evaluation, we construct CaT, a benchmark\nconsisting of 7 text generation tasks, covering responses ranging from short\nphrases to paragraphs. We test LitCab with Llama2-7B, where it improves\ncalibration across all tasks, by reducing the average ECE score by 20%. We\nfurther conduct a comprehensive evaluation with 7 popular open-sourced LMs from\nGPT and LLaMA families, yielding the following key findings: (1) Larger models\nwithin the same family exhibit better calibration on tasks with short\ngeneration tasks, but not necessarily for longer ones. (2) GPT-family models\nshow superior calibration compared to LLaMA, Llama2 and Vicuna models despite\nhaving much fewer parameters. 
(3) Finetuning pretrained model (e.g., LLaMA)\nwith samples of limited purpose (e.g., conversations) may lead to worse\ncalibration, highlighting the importance of finetuning setups for calibrating\nLMs.\n","authors":["Xin Liu","Muhammad Khalifa","Lu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19208v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18235v2","updated":"2023-10-30T16:00:49Z","published":"2023-10-27T16:20:10Z","title":"Davidsonian Scene Graph: Improving Reliability in Fine-grained\n Evaluation for Text-to-Image Generation","summary":" Evaluating text-to-image models is notoriously difficult. A strong recent\napproach for assessing text-image faithfulness is based on QG/A (question\ngeneration and answering), which uses pre-trained foundational models to\nautomatically generate a set of questions and answers from the prompt, and\noutput images are scored based on whether these answers extracted with a visual\nquestion answering model are consistent with the prompt-based answers. This\nkind of evaluation is naturally dependent on the quality of the underlying QG\nand QA models. We identify and address several reliability challenges in\nexisting QG/A work: (a) QG questions should respect the prompt (avoiding\nhallucinations, duplications, and omissions) and (b) VQA answers should be\nconsistent (not asserting that there is no motorcycle in an image while also\nclaiming the motorcycle is blue). We address these issues with Davidsonian\nScene Graph (DSG), an empirically grounded evaluation framework inspired by\nformal semantics. DSG is an automatic, graph-based QG/A that is modularly\nimplemented to be adaptable to any QG/A module. DSG produces atomic and unique\nquestions organized in dependency graphs, which (i) ensure appropriate semantic\ncoverage and (ii) sidestep inconsistent answers. 
With extensive experimentation\nand human evaluation on a range of model configurations (LLM, VQA, and T2I), we\nempirically demonstrate that DSG addresses the challenges noted above. Finally,\nwe present DSG-1k, an open-sourced evaluation benchmark that includes 1,060\nprompts, covering a wide range of fine-grained semantic categories with a\nbalanced distribution. We release the DSG-1k prompts and the corresponding DSG\nquestions.\n","authors":["Jaemin Cho","Yushi Hu","Roopal Garg","Peter Anderson","Ranjay Krishna","Jason Baldridge","Mohit Bansal","Jordi Pont-Tuset","Su Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18235v2.pdf","comment":"Project website: https://google.github.io/dsg"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2309.10711v2","updated":"2023-10-30T02:37:42Z","published":"2023-09-19T16:00:09Z","title":"Latent Space Energy-based Model for Fine-grained Open Set Recognition","summary":" Fine-grained open-set recognition (FineOSR) aims to recognize images\nbelonging to classes with subtle appearance differences while rejecting images\nof unknown classes. A recent trend in OSR shows the benefit of generative\nmodels to discriminative unknown detection. As a type of generative model,\nenergy-based models (EBM) are the potential for hybrid modeling of generative\nand discriminative tasks. However, most existing EBMs suffer from density\nestimation in high-dimensional space, which is critical to recognizing images\nfrom fine-grained classes. In this paper, we explore the low-dimensional latent\nspace with energy-based prior distribution for OSR in a fine-grained visual\nworld. Specifically, based on the latent space EBM, we propose an\nattribute-aware information bottleneck (AIB), a residual attribute feature\naggregation (RAFA) module, and an uncertainty-based virtual outlier synthesis\n(UVOS) module to improve the expressivity, granularity, and density of the\nsamples in fine-grained classes, respectively. 
Our method is flexible to take\nadvantage of recent vision transformers for powerful visual classification and\ngeneration. The method is validated on both fine-grained and general visual\nclassification datasets while preserving the capability of generating\nphoto-realistic fake images with high resolution.\n","authors":["Wentao Bao","Qi Yu","Yu Kong"],"pdf_url":"https://arxiv.org/pdf/2309.10711v2.pdf","comment":"Add ack"},{"id":"http://arxiv.org/abs/2310.19231v1","updated":"2023-10-30T02:19:16Z","published":"2023-10-30T02:19:16Z","title":"There Are No Data Like More Data- Datasets for Deep Learning in Earth\n Observation","summary":" Carefully curated and annotated datasets are the foundation of machine\nlearning, with particularly data-hungry deep neural networks forming the core\nof what is often called Artificial Intelligence (AI). Due to the massive\nsuccess of deep learning applied to Earth Observation (EO) problems, the focus\nof the community has been largely on the development of ever-more sophisticated\ndeep neural network architectures and training strategies largely ignoring the\noverall importance of datasets. For that purpose, numerous task-specific\ndatasets have been created that were largely ignored by previously published\nreview articles on AI for Earth observation. With this article, we want to\nchange the perspective and put machine learning datasets dedicated to Earth\nobservation data and applications into the spotlight. Based on a review of the\nhistorical developments, currently available resources are described and a\nperspective for future developments is formed. 
We hope to contribute to an\nunderstanding that the nature of our data is what distinguishes the Earth\nobservation community from many other communities that apply deep learning\ntechniques to image data, and that a detailed understanding of EO data\npeculiarities is among the core competencies of our discipline.\n","authors":["Michael Schmitt","Seyed Ali Ahmadi","Yonghao Xu","Gulsen Taskin","Ujjwal Verma","Francescopaolo Sica","Ronny Hansch"],"pdf_url":"https://arxiv.org/pdf/2310.19231v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19224v1","updated":"2023-10-30T02:03:28Z","published":"2023-10-30T02:03:28Z","title":"CHAMMI: A benchmark for channel-adaptive models in microscopy imaging","summary":" Most neural networks assume that input images have a fixed number of channels\n(three for RGB images). However, there are many settings where the number of\nchannels may vary, such as microscopy images where the number of channels\nchanges depending on instruments and experimental goals. Yet, there has not\nbeen a systemic attempt to create and evaluate neural networks that are\ninvariant to the number and type of channels. As a result, trained models\nremain specific to individual studies and are hardly reusable for other\nmicroscopy settings. In this paper, we present a benchmark for investigating\nchannel-adaptive models in microscopy imaging, which consists of 1) a dataset\nof varied-channel single-cell images, and 2) a biologically relevant evaluation\nframework. In addition, we adapted several existing techniques to create\nchannel-adaptive models and compared their performance on this benchmark to\nfixed-channel, baseline models. 
We find that channel-adaptive models can\ngeneralize better to out-of-domain tasks and can be computationally efficient.\nWe contribute a curated dataset (https://doi.org/10.5281/zenodo.7988357) and an\nevaluation API (https://github.com/broadinstitute/MorphEm.git) to facilitate\nobjective comparisons in future research and applications.\n","authors":["Zitong Chen","Chau Pham","Siqi Wang","Michael Doron","Nikita Moshkov","Bryan A. Plummer","Juan C. Caicedo"],"pdf_url":"https://arxiv.org/pdf/2310.19224v1.pdf","comment":"Accepted at NeurIPS Track on Datasets and Benchmarks, 2023"},{"id":"http://arxiv.org/abs/2310.19223v1","updated":"2023-10-30T02:01:49Z","published":"2023-10-30T02:01:49Z","title":"Modular Anti-noise Deep Learning Network for Robotic Grasp Detection\n Based on RGB Images","summary":" While traditional methods relies on depth sensors, the current trend leans\ntowards utilizing cost-effective RGB images, despite their absence of depth\ncues. This paper introduces an interesting approach to detect grasping pose\nfrom a single RGB image. To this end, we propose a modular learning network\naugmented with grasp detection and semantic segmentation, tailored for robots\nequipped with parallel-plate grippers. Our network not only identifies\ngraspable objects but also fuses prior grasp analyses with semantic\nsegmentation, thereby boosting grasp detection precision. Significantly, our\ndesign exhibits resilience, adeptly handling blurred and noisy visuals. Key\ncontributions encompass a trainable network for grasp detection from RGB\nimages, a modular design facilitating feasible grasp implementation, and an\narchitecture robust against common image distortions. 
We demonstrate the\nfeasibility and accuracy of our proposed approach through practical experiments\nand evaluations.\n","authors":["Zhaocong Li"],"pdf_url":"https://arxiv.org/pdf/2310.19223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.17952v2","updated":"2023-10-30T01:37:18Z","published":"2023-10-27T07:57:24Z","title":"Shape-centered Representation Learning for Visible-Infrared Person\n Re-identification","summary":" Current Visible-Infrared Person Re-Identification (VI-ReID) methods\nprioritize extracting distinguishing appearance features, ignoring the natural\nresistance of body shape against modality changes. Initially, we gauged the\ndiscriminative potential of shapes by a straightforward concatenation of shape\nand appearance features. However, two unresolved issues persist in the\nutilization of shape features. One pertains to the dependence on auxiliary\nmodels for shape feature extraction in the inference phase, along with the\nerrors in generated infrared shapes due to the intrinsic modality disparity.\nThe other issue involves the inadequately explored correlation between shape\nand appearance features. To tackle the aforementioned challenges, we propose\nthe Shape-centered Representation Learning framework (ScRL), which focuses on\nlearning shape features and appearance features associated with shapes.\nSpecifically, we devise the Shape Feature Propagation (SFP), facilitating\ndirect extraction of shape features from original images with minimal\ncomplexity costs during inference. To restitute inaccuracies in infrared body\nshapes at the feature level, we present the Infrared Shape Restitution (ISR).\nFurthermore, to acquire appearance features related to shape, we design the\nAppearance Feature Enhancement (AFE), which accentuates identity-related\nfeatures while suppressing identity-unrelated features guided by shape\nfeatures. Extensive experiments are conducted to validate the effectiveness of\nthe proposed ScRL. 
Achieving remarkable results, the Rank-1 (mAP) accuracy\nattains 76.1%, 71.2%, 92.4% (72.6%, 52.9%, 86.7%) on the SYSU-MM01, HITSZ-VCM,\nRegDB datasets respectively, outperforming existing state-of-the-art methods.\n","authors":["Shuang Li","Jiaxu Leng","Ji Gan","Mengjingcheng Mo","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2310.17952v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19210v1","updated":"2023-10-30T00:32:47Z","published":"2023-10-30T00:32:47Z","title":"Generalized Category Discovery with Clustering Assignment Consistency","summary":" Generalized category discovery (GCD) is a recently proposed open-world task.\nGiven a set of images consisting of labeled and unlabeled instances, the goal\nof GCD is to automatically cluster the unlabeled samples using information\ntransferred from the labeled dataset. The unlabeled dataset comprises both\nknown and novel classes. The main challenge is that unlabeled novel class\nsamples and unlabeled known class samples are mixed together in the unlabeled\ndataset. To address the GCD without knowing the class number of unlabeled\ndataset, we propose a co-training-based framework that encourages clustering\nconsistency. Specifically, we first introduce weak and strong augmentation\ntransformations to generate two sufficiently different views for the same\nsample. Then, based on the co-training assumption, we propose a consistency\nrepresentation learning strategy, which encourages consistency between\nfeature-prototype similarity and clustering assignment. Finally, we use the\ndiscriminative embeddings learned from the semi-supervised representation\nlearning process to construct an original sparse network and use a community\ndetection method to obtain the clustering results and the number of categories\nsimultaneously. Extensive experiments show that our method achieves\nstate-of-the-art performance on three generic benchmarks and three fine-grained\nvisual recognition datasets. 
Especially in the ImageNet-100 data set, our\nmethod significantly exceeds the best baseline by 15.5\\% and 7.0\\% on the\n\\texttt{Novel} and \\texttt{All} classes, respectively.\n","authors":["Xiangli Yang","Xinglin Pan","Irwin King","Zenglin Xu"],"pdf_url":"https://arxiv.org/pdf/2310.19210v1.pdf","comment":"ICONIP 2023,This paper has been nominated for ICONIP2023 Best Paper\n Award"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.17488v2","updated":"2023-10-30T02:50:17Z","published":"2023-10-26T15:44:57Z","title":"LightLM: A Lightweight Deep and Narrow Language Model for Generative\n Recommendation","summary":" This paper presents LightLM, a lightweight Transformer-based language model\nfor generative recommendation. While Transformer-based generative modeling has\ngained importance in various AI sub-fields such as NLP and vision, generative\nrecommendation is still in its infancy due to its unique demand on personalized\ngenerative modeling. Existing works on generative recommendation often use\nNLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are\nheavy-weight and are not specifically designed for recommendation tasks.\nLightLM tackles the issue by introducing a light-weight deep and narrow\nTransformer architecture, which is specifically tailored for direct generation\nof recommendation items. This structure is especially apt for straightforward\ngenerative recommendation and stems from the observation that language model\ndoes not have to be too wide for this task, as the input predominantly consists\nof short tokens that are well-suited for the model's capacity. We also show\nthat our devised user and item ID indexing methods, i.e., Spectral\nCollaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enables\nthe deep and narrow Transformer architecture to outperform large-scale language\nmodels for recommendation. 
Besides, to address the hallucination problem of\ngenerating items as output, we propose the constrained generation process for\ngenerative recommenders. Experiments on real-world datasets show that LightLM\noutperforms various competitive baselines in terms of both recommendation\naccuracy and efficiency. The code can be found at\nhttps://github.com/dongyuanjushi/LightLM.\n","authors":["Kai Mei","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.17488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15342v2","updated":"2023-10-30T15:01:26Z","published":"2023-10-23T20:15:30Z","title":"Towards Hybrid-grained Feature Interaction Selection for Deep Sparse\n Network","summary":" Deep sparse networks are widely investigated as a neural network architecture\nfor prediction tasks with high-dimensional sparse features, with which feature\ninteraction selection is a critical component. While previous methods primarily\nfocus on how to search feature interaction in a coarse-grained space, less\nattention has been given to a finer granularity. In this work, we introduce a\nhybrid-grained feature interaction selection approach that targets both feature\nfield and feature value for deep sparse networks. To explore such expansive\nspace, we propose a decomposed space which is calculated on the fly. We then\ndevelop a selection algorithm called OptFeature, which efficiently selects the\nfeature interaction from both the feature field and the feature value\nsimultaneously. Results from experiments on three large real-world benchmark\ndatasets demonstrate that OptFeature performs well in terms of accuracy and\nefficiency. 
Additional studies support the feasibility of our method.\n","authors":["Fuyuan Lyu","Xing Tang","Dugang Liu","Chen Ma","Weihong Luo","Liang Chen","Xiuqiang He","Xue Liu"],"pdf_url":"https://arxiv.org/pdf/2310.15342v2.pdf","comment":"NeurIPS 2023 poster"},{"id":"http://arxiv.org/abs/2310.12455v2","updated":"2023-10-30T11:52:47Z","published":"2023-10-19T04:16:48Z","title":"Auto Search Indexer for End-to-End Document Retrieval","summary":" Generative retrieval, which is a new advanced paradigm for document\nretrieval, has recently attracted research interests, since it encodes all\ndocuments into the model and directly generates the retrieved documents.\nHowever, its power is still underutilized since it heavily relies on the\n\"preprocessed\" document identifiers (docids), thus limiting its retrieval\nperformance and ability to retrieve new documents. In this paper, we propose a\nnovel fully end-to-end retrieval paradigm. It can not only end-to-end learn the\nbest docids for existing and new documents automatically via a semantic\nindexing module, but also perform end-to-end document retrieval via an\nencoder-decoder-based generative model, namely Auto Search Indexer (ASI).\nBesides, we design a reparameterization mechanism to combine the above two\nmodules into a joint optimization framework. 
Extensive experimental results\ndemonstrate the superiority of our model over advanced baselines on both public\nand industrial datasets and also verify the ability to deal with new documents.\n","authors":["Tianchi Yang","Minghui Song","Zihan Zhang","Haizhen Huang","Weiwei Deng","Feng Sun","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.12455v2.pdf","comment":"EMNLP 2023"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.07194v2","updated":"2023-10-30T02:58:46Z","published":"2023-10-11T05:05:40Z","title":"Boosting Learning for LDPC Codes to Improve the Error-Floor Performance","summary":" Low-density parity-check (LDPC) codes have been successfully commercialized\nin communication systems due to their strong error correction capabilities and\nsimple decoding process. However, the error-floor phenomenon of LDPC codes, in\nwhich the error rate stops decreasing rapidly at a certain level, presents\nchallenges for achieving extremely low error rates and deploying LDPC codes in\nscenarios demanding ultra-high reliability. In this work, we propose training\nmethods for neural min-sum (NMS) decoders to eliminate the error-floor effect.\nFirst, by leveraging the boosting learning technique of ensemble networks, we\ndivide the decoding network into two neural decoders and train the post decoder\nto be specialized for uncorrected words that the first decoder fails to\ncorrect. Secondly, to address the vanishing gradient issue in training, we\nintroduce a block-wise training schedule that locally trains a block of weights\nwhile retraining the preceding block. Lastly, we show that assigning different\nweights to unsatisfied check nodes effectively lowers the error-floor with a\nminimal number of weights. By applying these training methods to standard LDPC\ncodes, we achieve the best error-floor performance compared to other decoding\nmethods. 
The proposed NMS decoder, optimized solely through novel training\nmethods without additional modules, can be integrated into existing LDPC\ndecoders without incurring extra hardware costs. The source code is available\nat https://github.com/ghy1228/LDPC_Error_Floor .\n","authors":["Hee-Youl Kwak","Dae-Young Yun","Yongjune Kim","Sang-Hyo Kim","Jong-Seon No"],"pdf_url":"https://arxiv.org/pdf/2310.07194v2.pdf","comment":"17 pages, 10 figures"},{"id":"http://arxiv.org/abs/2310.19231v1","updated":"2023-10-30T02:19:16Z","published":"2023-10-30T02:19:16Z","title":"There Are No Data Like More Data- Datasets for Deep Learning in Earth\n Observation","summary":" Carefully curated and annotated datasets are the foundation of machine\nlearning, with particularly data-hungry deep neural networks forming the core\nof what is often called Artificial Intelligence (AI). Due to the massive\nsuccess of deep learning applied to Earth Observation (EO) problems, the focus\nof the community has been largely on the development of ever-more sophisticated\ndeep neural network architectures and training strategies largely ignoring the\noverall importance of datasets. For that purpose, numerous task-specific\ndatasets have been created that were largely ignored by previously published\nreview articles on AI for Earth observation. With this article, we want to\nchange the perspective and put machine learning datasets dedicated to Earth\nobservation data and applications into the spotlight. Based on a review of the\nhistorical developments, currently available resources are described and a\nperspective for future developments is formed. 
We hope to contribute to an\nunderstanding that the nature of our data is what distinguishes the Earth\nobservation community from many other communities that apply deep learning\ntechniques to image data, and that a detailed understanding of EO data\npeculiarities is among the core competencies of our discipline.\n","authors":["Michael Schmitt","Seyed Ali Ahmadi","Yonghao Xu","Gulsen Taskin","Ujjwal Verma","Francescopaolo Sica","Ronny Hansch"],"pdf_url":"https://arxiv.org/pdf/2310.19231v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19225v1","updated":"2023-10-30T02:04:20Z","published":"2023-10-30T02:04:20Z","title":"Stochastic Configuration Machines: FPGA Implementation","summary":" Neural networks for industrial applications generally have additional\nconstraints such as response speed, memory size and power usage. Randomized\nlearners can address some of these issues. However, hardware solutions can\nprovide better resource reduction whilst maintaining the model's performance.\nStochastic configuration networks (SCNs) are a prime choice in industrial\napplications due to their merits and feasibility for data modelling. Stochastic\nConfiguration Machines (SCMs) extend this to focus on reducing the memory\nconstraints by limiting the randomized weights to a binary value with a scalar\nfor each node and using a mechanism model to improve the learning performance\nand result interpretability. This paper aims to implement SCM models on a field\nprogrammable gate array (FPGA) and introduce binary-coded inputs to the\nalgorithm. Results are reported for two benchmark and two industrial datasets,\nincluding SCM with single-layer and deep architectures.\n","authors":["Matthew J. 
Felicetti","Dianhui Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19225v1.pdf","comment":"19 pages, 9 figures, 8 tables"},{"id":"http://arxiv.org/abs/2310.19222v1","updated":"2023-10-30T02:01:48Z","published":"2023-10-30T02:01:48Z","title":"Maximum Knowledge Orthogonality Reconstruction with Gradients in\n Federated Learning","summary":" Federated learning (FL) aims at keeping client data local to preserve\nprivacy. Instead of gathering the data itself, the server only collects\naggregated gradient updates from clients. Following the popularity of FL, there\nhas been considerable amount of work, revealing the vulnerability of FL\napproaches by reconstructing the input data from gradient updates. Yet, most\nexisting works assume an FL setting with unrealistically small batch size, and\nhave poor image quality when the batch size is large. Other works modify the\nneural network architectures or parameters to the point of being suspicious,\nand thus, can be detected by clients. Moreover, most of them can only\nreconstruct one sample input from a large batch. To address these limitations,\nwe propose a novel and completely analytical approach, referred to as the\nmaximum knowledge orthogonality reconstruction (MKOR), to reconstruct clients'\ninput data. Our proposed method reconstructs a mathematically proven high\nquality image from large batches. MKOR only requires the server to send\nsecretly modified parameters to clients and can efficiently and inconspicuously\nreconstruct the input images from clients' gradient updates. We evaluate MKOR's\nperformance on the MNIST, CIFAR-100, and ImageNet dataset and compare it with\nthe state-of-the-art works. The results show that MKOR outperforms the existing\napproaches, and draws attention to a pressing need for further research on the\nprivacy protection of FL so that comprehensive defense approaches can be\ndeveloped.\n","authors":["Feng Wang","Senem Velipasalar","M. 
Cenk Gursoy"],"pdf_url":"https://arxiv.org/pdf/2310.19222v1.pdf","comment":"Accepted in IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV) 2024"},{"id":"http://arxiv.org/abs/2310.19220v1","updated":"2023-10-30T01:53:37Z","published":"2023-10-30T01:53:37Z","title":"From Stream to Pool: Dynamic Pricing Beyond i.i.d. Arrivals","summary":" The dynamic pricing problem has been extensively studied under the\n\\textbf{stream} model: A stream of customers arrives sequentially, each with an\nindependently and identically distributed valuation. However, this formulation\nis not entirely reflective of the real world. In many scenarios, high-valuation\ncustomers tend to make purchases earlier and leave the market, leading to a\n\\emph{shift} in the valuation distribution. Thus motivated, we consider a model\nwhere a \\textbf{pool} of $n$ non-strategic unit-demand customers interact\nrepeatedly with the seller. Each customer monitors the price intermittently\naccording to an independent Poisson process and makes a purchase if the\nobserved price is lower than her \\emph{private} valuation, whereupon she leaves\nthe market permanently. We present a minimax \\emph{optimal} algorithm that\nefficiently computes a non-adaptive policy which guarantees a $1/k$ fraction of\nthe optimal revenue, given any set of $k$ prices. Moreover, we present an\nadaptive \\emph{learn-then-earn} policy based on a novel \\emph{debiasing}\napproach, and prove an $\\tilde O(kn^{3/4})$ regret bound. 
We further improve\nthe bound to $\\tilde O(k^{3/4} n^{3/4})$ using martingale concentration\ninequalities.\n","authors":["Titing Cui","Su Jia","Thomas Lavastida"],"pdf_url":"https://arxiv.org/pdf/2310.19220v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19218v1","updated":"2023-10-30T01:34:33Z","published":"2023-10-30T01:34:33Z","title":"A Survey of Federated Unlearning: A Taxonomy, Challenges and Future\n Directions","summary":" With the development of trustworthy Federated Learning (FL), the requirement\nof implementing right to be forgotten gives rise to the area of Federated\nUnlearning (FU). Comparing to machine unlearning, a major challenge of FU lies\nin the decentralized and privacy-preserving nature of FL, in which clients\njointly train a global model without sharing their raw data, making it\nsubstantially more intricate to selectively unlearn specific information. In\nthat regard, many efforts have been made to tackle the challenges of FU and\nhave achieved significant progress. In this paper, we present a comprehensive\nsurvey of FU. Specially, we provide the existing algorithms, objectives,\nevaluation metrics, and identify some challenges of FU. By reviewing and\ncomparing some studies, we summarize them into a taxonomy for various schemes,\npotential applications and future directions.\n","authors":["Jiaxi Yang","Yang Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.19218v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.12388v3","updated":"2023-10-30T01:16:22Z","published":"2023-07-23T17:35:49Z","title":"Uncertainty-aware Grounded Action Transformation towards Sim-to-Real\n Transfer for Traffic Signal Control","summary":" Traffic signal control (TSC) is a complex and important task that affects the\ndaily lives of millions of people. 
Reinforcement Learning (RL) has shown\npromising results in optimizing traffic signal control, but current RL-based\nTSC methods are mainly trained in simulation and suffer from the performance\ngap between simulation and the real world. In this paper, we propose a\nsimulation-to-real-world (sim-to-real) transfer approach called UGAT, which\ntransfers a learned policy trained from a simulated environment to a real-world\nenvironment by dynamically transforming actions in the simulation with\nuncertainty to mitigate the domain gap of transition dynamics. We evaluate our\nmethod on a simulated traffic environment and show that it significantly\nimproves the performance of the transferred RL policy in the real world.\n","authors":["Longchao Da","Hao Mei","Romir Sharma","Hua Wei"],"pdf_url":"https://arxiv.org/pdf/2307.12388v3.pdf","comment":"6 pages, 3 figures. This paper is accepted by IEEE-CDC 2023"},{"id":"http://arxiv.org/abs/2310.19215v1","updated":"2023-10-30T01:01:15Z","published":"2023-10-30T01:01:15Z","title":"On the accuracy and efficiency of group-wise clipping in differentially\n private optimization","summary":" Recent advances have substantially improved the accuracy, memory cost, and\ntraining speed of differentially private (DP) deep learning, especially on\nlarge vision and language models with millions to billions of parameters. In\nthis work, we thoroughly study the per-sample gradient clipping style, a key\ncomponent in DP optimization. We show that different clipping styles have the\nsame time complexity but instantiate an accuracy-memory trade-off: while the\nall-layer clipping (of coarse granularity) is the most prevalent and usually\ngives the best accuracy, it incurs heavier memory cost compared to other\ngroup-wise clipping, such as the layer-wise clipping (of finer granularity). We\nformalize this trade-off through our convergence theory and complexity\nanalysis. 
Importantly, we demonstrate that the accuracy gap between group-wise\nclipping and all-layer clipping becomes smaller for larger models, while the\nmemory advantage of the group-wise clipping remains. Consequently, the\ngroup-wise clipping allows DP optimization of large models to achieve high\naccuracy and low peak memory simultaneously.\n","authors":["Zhiqi Bu","Ruixuan Liu","Yu-Xiang Wang","Sheng Zha","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2310.19215v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14257v3","updated":"2023-10-30T00:55:53Z","published":"2023-05-23T17:10:39Z","title":"Hierarchical Prompting Assists Large Language Model on Web Navigation","summary":" Large language models (LLMs) struggle on processing complicated observations\nin interactive decision making tasks. To alleviate this issue, we propose a\nsimple hierarchical prompting approach. Diverging from previous prompting\napproaches that always put the full observation (e.g. a web page) to the\nprompt, we propose to first construct an action-aware observation which is more\ncondensed and relevant with a dedicated SUMMARIZER prompt. The ACTOR prompt\nthen predicts the next action based on the summarized observation. While our\nmethod has broad applicability, we particularly demonstrate its efficacy in the\ncomplex domain of web navigation where a full observation often contains\nredundant and irrelevant information. Our approach outperforms the previous\nstate-of-the-art prompting mechanics by 6.2% on task success rate,\ndemonstrating its potential on interactive decision making tasks with long\nobservation traces.\n","authors":["Abishek Sridhar","Robert Lo","Frank F. 
Xu","Hao Zhu","Shuyan Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.14257v3.pdf","comment":"EMNLP 2023 Findings; Natural Language Reasoning and Structured\n Explanations Workshop at ACL 2023"},{"id":"http://arxiv.org/abs/2310.19214v1","updated":"2023-10-30T00:52:17Z","published":"2023-10-30T00:52:17Z","title":"Factor Fitting, Rank Allocation, and Partitioning in Multilevel Low Rank\n Matrices","summary":" We consider multilevel low rank (MLR) matrices, defined as a row and column\npermutation of a sum of matrices, each one a block diagonal refinement of the\nprevious one, with all blocks low rank given in factored form. MLR matrices\nextend low rank matrices but share many of their properties, such as the total\nstorage required and complexity of matrix-vector multiplication. We address\nthree problems that arise in fitting a given matrix by an MLR matrix in the\nFrobenius norm. The first problem is factor fitting, where we adjust the\nfactors of the MLR matrix. The second is rank allocation, where we choose the\nranks of the blocks in each level, subject to the total rank having a given\nvalue, which preserves the total storage needed for the MLR matrix. The final\nproblem is to choose the hierarchical partition of rows and columns, along with\nthe ranks and factors. This paper is accompanied by an open source package that\nimplements the proposed methods.\n","authors":["Tetiana Parshakova","Trevor Hastie","Eric Darve","Stephen Boyd"],"pdf_url":"https://arxiv.org/pdf/2310.19214v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19211v1","updated":"2023-10-30T00:45:05Z","published":"2023-10-30T00:45:05Z","title":"Investigative Pattern Detection Framework for Counterterrorism","summary":" Law-enforcement investigations aimed at preventing attacks by violent\nextremists have become increasingly important for public safety. The problem is\nexacerbated by the massive data volumes that need to be scanned to identify\ncomplex behaviors of extremists and groups. 
Automated tools are required to\nextract information to respond queries from analysts, continually scan new\ninformation, integrate them with past events, and then alert about emerging\nthreats. We address challenges in investigative pattern detection and develop\nan Investigative Pattern Detection Framework for Counterterrorism (INSPECT).\nThe framework integrates numerous computing tools that include machine learning\ntechniques to identify behavioral indicators and graph pattern matching\ntechniques to detect risk profiles/groups. INSPECT also automates multiple\ntasks for large-scale mining of detailed forensic biographies, forming\nknowledge networks, and querying for behavioral indicators and radicalization\ntrajectories. INSPECT targets human-in-the-loop mode of investigative search\nand has been validated and evaluated using an evolving dataset on domestic\njihadism.\n","authors":["Shashika R. Muramudalige","Benjamin W. K. Hung","Rosanne Libretti","Jytte Klausen","Anura P. Jayasumana"],"pdf_url":"https://arxiv.org/pdf/2310.19211v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2306.00183v2","updated":"2023-10-30T00:43:36Z","published":"2023-05-31T21:00:50Z","title":"Diffused Redundancy in Pre-trained Representations","summary":" Representations learned by pre-training a neural network on a large dataset\nare increasingly used successfully to perform a variety of downstream tasks. In\nthis work, we take a closer look at how features are encoded in such\npre-trained representations. We find that learned representations in a given\nlayer exhibit a degree of diffuse redundancy, ie, any randomly chosen subset of\nneurons in the layer that is larger than a threshold size shares a large degree\nof similarity with the full layer and is able to perform similarly as the whole\nlayer on a variety of downstream tasks. 
For example, a linear probe trained on\n$20\\%$ of randomly picked neurons from the penultimate layer of a ResNet50\npre-trained on ImageNet1k achieves an accuracy within $5\\%$ of a linear probe\ntrained on the full layer of neurons for downstream CIFAR10 classification. We\nconduct experiments on different neural architectures (including CNNs and\nTransformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a\nvariety of downstream tasks taken from the VTAB benchmark. We find that the\nloss and dataset used during pre-training largely govern the degree of diffuse\nredundancy and the \"critical mass\" of neurons needed often depends on the\ndownstream task, suggesting that there is a task-inherent\nredundancy-performance Pareto frontier. Our findings shed light on the nature\nof representations learned by pre-trained deep neural networks and suggest that\nentire layers might not be necessary to perform many downstream tasks. We\ninvestigate the potential for exploiting this redundancy to achieve efficient\ngeneralization for downstream tasks and also draw caution to certain possible\nunintended consequences. Our code is available at\n\\url{https://github.com/nvedant07/diffused-redundancy}.\n","authors":["Vedant Nanda","Till Speicher","John P. Dickerson","Soheil Feizi","Krishna P. Gummadi","Adrian Weller"],"pdf_url":"https://arxiv.org/pdf/2306.00183v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.01690v3","updated":"2023-10-30T00:00:09Z","published":"2023-10-02T23:09:59Z","title":"Forecasting Tropical Cyclones with Cascaded Diffusion Models","summary":" As cyclones become more intense due to climate change, the rise of AI-based\nmodelling provides a more affordable and accessible approach compared to\ntraditional methods based on mathematical models. 
This work leverages diffusion\nmodels to forecast cyclone trajectories and precipitation patterns by\nintegrating satellite imaging, remote sensing, and atmospheric data, employing\na cascaded approach that incorporates forecasting, super-resolution, and\nprecipitation modelling, with training on a dataset of 51 cyclones from six\nmajor basins. Experiments demonstrate that the final forecasts from the\ncascaded models show accurate predictions up to a 36-hour rollout, with SSIM\nand PSNR values exceeding 0.5 and 20 dB, respectively, for all three tasks.\nThis work also highlights the promising efficiency of AI methods such as\ndiffusion models for high-performance needs, such as cyclone forecasting, while\nremaining computationally affordable, making them ideal for highly vulnerable\nregions with critical forecasting needs and financial limitations. Code\naccessible at \\url{https://github.com/nathzi1505/forecast-diffmodels}.\n","authors":["Pritthijit Nath","Pancham Shukla","César Quilodrán-Casas"],"pdf_url":"https://arxiv.org/pdf/2310.01690v3.pdf","comment":"6 pages, 3 figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.17796v2","updated":"2023-10-30T17:30:47Z","published":"2023-10-26T21:57:21Z","title":"ControlLLM: Augment Language Models with Tools by Searching on Graphs","summary":" We present ControlLLM, a novel framework that enables large language models\n(LLMs) to utilize multi-modal tools for solving complex real-world tasks.\nDespite the remarkable performance of LLMs, they still struggle with tool\ninvocation due to ambiguous user prompts, inaccurate tool selection and\nparameterization, and inefficient tool scheduling. 
To overcome these\nchallenges, our framework comprises three key components: (1) a \\textit{task\ndecomposer} that breaks down a complex task into clear subtasks with\nwell-defined inputs and outputs; (2) a \\textit{Thoughts-on-Graph (ToG)\nparadigm} that searches the optimal solution path on a pre-built tool graph,\nwhich specifies the parameter and dependency relations among different tools;\nand (3) an \\textit{execution engine with a rich toolbox} that interprets the\nsolution path and runs the tools efficiently on different computational\ndevices. We evaluate our framework on diverse tasks involving image, audio, and\nvideo processing, demonstrating its superior accuracy, efficiency, and\nversatility compared to existing methods. The code is at\nhttps://github.com/OpenGVLab/ControlLLM .\n","authors":["Zhaoyang Liu","Zeqiang Lai","Zhangwei Gao","Erfei Cui","Zhiheng Li","Xizhou Zhu","Lewei Lu","Qifeng Chen","Yu Qiao","Jifeng Dai","Wenhai Wang"],"pdf_url":"https://arxiv.org/pdf/2310.17796v2.pdf","comment":"22 pages, 9 figures, 10 tables"}]},"2023-10-29T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.19181v1","updated":"2023-10-29T22:52:40Z","published":"2023-10-29T22:52:40Z","title":"From Chatbots to PhishBots? -- Preventing Phishing scams created using\n ChatGPT, Google Bard and Claude","summary":" The advanced capabilities of Large Language Models (LLMs) have made them\ninvaluable across various applications, from conversational agents and content\ncreation to data analysis, research, and innovation. However, their\neffectiveness and accessibility also render them susceptible to abuse for\ngenerating malicious content, including phishing attacks. This study explores\nthe potential of using four popular commercially available LLMs - ChatGPT (GPT\n3.5 Turbo), GPT 4, Claude and Bard to generate functional phishing attacks\nusing a series of malicious prompts. 
We discover that these LLMs can generate\nboth phishing emails and websites that can convincingly imitate well-known\nbrands, and also deploy a range of evasive tactics for the latter to elude\ndetection mechanisms employed by anti-phishing systems. Notably, these attacks\ncan be generated using unmodified, or \"vanilla,\" versions of these LLMs,\nwithout requiring any prior adversarial exploits such as jailbreaking. As a\ncountermeasure, we build a BERT based automated detection tool that can be used\nfor the early detection of malicious prompts to prevent LLMs from generating\nphishing content attaining an accuracy of 97\\% for phishing website prompts,\nand 94\\% for phishing email prompts.\n","authors":["Sayak Saha Roy","Poojitha Thota","Krishna Vamsi Naragam","Shirin Nilizadeh"],"pdf_url":"https://arxiv.org/pdf/2310.19181v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19177v1","updated":"2023-10-29T22:37:54Z","published":"2023-10-29T22:37:54Z","title":"Robustifying Language Models with Test-Time Adaptation","summary":" Large-scale language models achieved state-of-the-art performance over a\nnumber of language tasks. However, they fail on adversarial language examples,\nwhich are sentences optimized to fool the language models but with similar\nsemantic meanings for humans. While prior work focuses on making the language\nmodel robust at training time, retraining for robustness is often unrealistic\nfor large-scale foundation models. Instead, we propose to make the language\nmodels robust at test time. By dynamically adapting the input sentence with\npredictions from masked words, we show that we can reverse many language\nadversarial attacks. 
Since our approach does not require any training, it works\nfor novel tasks at test time and can adapt to novel adversarial corruptions.\nVisualizations and empirical results on two popular sentence classification\ndatasets demonstrate that our method can repair adversarial language attacks\nover 65% o\n","authors":["Noah Thomas McDermott","Junfeng Yang","Chengzhi Mao"],"pdf_url":"https://arxiv.org/pdf/2310.19177v1.pdf","comment":"8 Pages 2 Figures Submitted to ICLR Workshop"},{"id":"http://arxiv.org/abs/2306.03819v3","updated":"2023-10-29T21:41:46Z","published":"2023-06-06T16:07:24Z","title":"LEACE: Perfect linear concept erasure in closed form","summary":" Concept erasure aims to remove specified features from a representation. It\ncan improve fairness (e.g. preventing a classifier from using gender or race)\nand interpretability (e.g. removing a concept to observe changes in model\nbehavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form\nmethod which provably prevents all linear classifiers from detecting a concept\nwhile changing the representation as little as possible, as measured by a broad\nclass of norms. We apply LEACE to large language models with a novel procedure\ncalled \"concept scrubbing,\" which erases target concept information from every\nlayer in the network. We demonstrate our method on two tasks: measuring the\nreliance of language models on part-of-speech information, and reducing gender\nbias in BERT embeddings. 
Code is available at\nhttps://github.com/EleutherAI/concept-erasure.\n","authors":["Nora Belrose","David Schneider-Joseph","Shauli Ravfogel","Ryan Cotterell","Edward Raff","Stella Biderman"],"pdf_url":"https://arxiv.org/pdf/2306.03819v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14591v2","updated":"2023-10-29T21:37:07Z","published":"2023-05-24T00:10:15Z","title":"ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers","summary":" Large language models (LLMs) excel at implementing code from functionality\ndescriptions but struggle with algorithmic problems that require not only\nimplementation but also identification of the suitable algorithm. Moreover,\nLLM-generated programs lack guaranteed correctness and require human\nverification. To address these challenges, we propose ALGO, a framework that\nsynthesizes Algorithmic programs with LLM-Generated Oracles to guide the\ngeneration and verify their correctness. ALGO first generates a reference\noracle by prompting an LLM to exhaustively enumerate all the combinations of\nrelevant variables. This oracle is then utilized to guide an arbitrary search\nstrategy in exploring the algorithm space and to verify the synthesized\nalgorithms. Our study shows that the LLM-generated oracles are correct for 88%\nof the cases. With the oracles as verifiers, ALGO can be integrated with any\nexisting code generation model in a model-agnostic manner to enhance its\nperformance. Experiments show that when equipped with ALGO, we achieve an 8x\nbetter one-submission pass rate over the Codex model and a 2.6x better\none-submission pass rate over CodeT, the current state-of-the-art model on\nCodeContests. We can also get 1.3x better pass rate over the ChatGPT Code\nInterpreter on unseen problems. 
The problem set we used for testing, the\nprompts we used, the verifier and solution programs, and the test cases\ngenerated by ALGO are available at https://github.com/zkx06111/ALGO.\n","authors":["Kexun Zhang","Danqing Wang","Jingtao Xia","William Yang Wang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2305.14591v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2304.04675v3","updated":"2023-10-29T21:23:02Z","published":"2023-04-10T15:51:30Z","title":"Multilingual Machine Translation with Large Language Models: Empirical\n Results and Analysis","summary":" Large language models (LLMs) have demonstrated remarkable potential in\nhandling multilingual machine translation (MMT). In this paper, we\nsystematically investigate the advantages and challenges of LLMs for MMT by\nanswering two questions: 1) How well do LLMs perform in translating massive\nlanguages? 2) Which factors affect LLMs' performance in translation? We\nthoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our\nempirical results show that translation capabilities of LLMs are continually\nimproving. GPT-4 has beat the strong supervised baseline NLLB in 40.91% of\ntranslation directions but still faces a large gap towards the commercial\ntranslation system, especially on low-resource languages. Through further\nanalysis, we discover that LLMs exhibit new working patterns when used for MMT.\nFirst, instruction semantics can surprisingly be ignored when given in-context\nexemplars. Second, cross-lingual exemplars can provide better task guidance for\nlow-resource translation than exemplars in the same language pairs. 
Third, LLM\ncan acquire translation ability in a resource-efficient way and generate\nmoderate translation even on zero-resource languages.\n","authors":["Wenhao Zhu","Hongyi Liu","Qingxiu Dong","Jingjing Xu","Shujian Huang","Lingpeng Kong","Jiajun Chen","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2304.04675v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19156v1","updated":"2023-10-29T21:13:31Z","published":"2023-10-29T21:13:31Z","title":"Poisoning Retrieval Corpora by Injecting Adversarial Passages","summary":" Dense retrievers have achieved state-of-the-art performance in various\ninformation retrieval tasks, but to what extent can they be safely deployed in\nreal-world applications? In this work, we propose a novel attack for dense\nretrieval systems in which a malicious user generates a small number of\nadversarial passages by perturbing discrete tokens to maximize similarity with\na provided set of training queries. When these adversarial passages are\ninserted into a large retrieval corpus, we show that this attack is highly\neffective in fooling these systems to retrieve them for queries that were not\nseen by the attacker. More surprisingly, these adversarial passages can\ndirectly generalize to out-of-domain queries and corpora with a high success\nattack rate -- for instance, we find that 50 generated passages optimized on\nNatural Questions can mislead >94% of questions posed in financial documents or\nonline forums. We also benchmark and compare a range of state-of-the-art dense\nretrievers, both unsupervised and supervised. Although different systems\nexhibit varying levels of vulnerability, we show they can all be successfully\nattacked by injecting up to 500 passages, a small fraction compared to a\nretrieval corpus of millions of passages.\n","authors":["Zexuan Zhong","Ziqing Huang","Alexander Wettig","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2310.19156v1.pdf","comment":"EMNLP 2023. 
Our code is available at\n https://github.com/princeton-nlp/corpus-poisoning"},{"id":"http://arxiv.org/abs/2308.06828v3","updated":"2023-10-29T21:07:54Z","published":"2023-08-13T18:14:10Z","title":"An Ensemble Approach to Question Classification: Integrating Electra\n Transformer, GloVe, and LSTM","summary":" Natural Language Processing (NLP) has emerged as a crucial technology for\nunderstanding and generating human language, playing an essential role in tasks\nsuch as machine translation, sentiment analysis, and more pertinently, question\nclassification. As a subfield within NLP, question classification focuses on\ndetermining the type of information being sought, a fundamental step for\ndownstream applications like question answering systems. This study presents an\ninnovative ensemble approach for question classification, combining the\nstrengths of Electra, GloVe, and LSTM models. Rigorously tested on the\nwell-regarded TREC dataset, the model demonstrates how the integration of these\ndisparate technologies can lead to superior results. Electra brings in its\ntransformer-based capabilities for complex language understanding, GloVe offers\nglobal vector representations for capturing word-level semantics, and LSTM\ncontributes its sequence learning abilities to model long-term dependencies. By\nfusing these elements strategically, our ensemble model delivers a robust and\nefficient solution for the complex task of question classification. 
Through\nrigorous comparisons with well-known models like BERT, RoBERTa, and DistilBERT,\nthe ensemble approach verifies its effectiveness by attaining an 80% accuracy\nscore on the test dataset.\n","authors":["Sanad Aburass","Osama Dorgham","Maha Abu Rumman"],"pdf_url":"https://arxiv.org/pdf/2308.06828v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19152v1","updated":"2023-10-29T21:06:34Z","published":"2023-10-29T21:06:34Z","title":"BERT Lost Patience Won't Be Robust to Adversarial Slowdown","summary":" In this paper, we systematically evaluate the robustness of multi-exit\nlanguage models against adversarial slowdown. To audit their robustness, we\ndesign a slowdown attack that generates natural adversarial text bypassing\nearly-exit points. We use the resulting WAFFLE attack as a vehicle to conduct a\ncomprehensive evaluation of three multi-exit mechanisms with the GLUE benchmark\nagainst adversarial slowdown. We then show our attack significantly reduces the\ncomputational savings provided by the three methods in both white-box and\nblack-box settings. The more complex a mechanism is, the more vulnerable it is\nto adversarial slowdown. We also perform a linguistic analysis of the perturbed\ntext inputs, identifying common perturbation patterns that our attack\ngenerates, and comparing them with standard adversarial text attacks. Moreover,\nwe show that adversarial training is ineffective in defeating our slowdown\nattack, but input sanitization with a conversational model, e.g., ChatGPT, can\nremove perturbations effectively. This result suggests that future work is\nneeded for developing efficient yet robust multi-exit models. 
Our code is\navailable at: https://github.com/ztcoalson/WAFFLE\n","authors":["Zachary Coalson","Gabriel Ritter","Rakesh Bobba","Sanghyun Hong"],"pdf_url":"https://arxiv.org/pdf/2310.19152v1.pdf","comment":"Accepted to NeurIPS 2023 [Poster]"},{"id":"http://arxiv.org/abs/2304.01969v2","updated":"2023-10-29T21:03:54Z","published":"2023-04-04T17:26:11Z","title":"MEGClass: Extremely Weakly Supervised Text Classification via\n Mutually-Enhancing Text Granularities","summary":" Text classification is essential for organizing unstructured text.\nTraditional methods rely on human annotations or, more recently, a set of class\nseed words for supervision, which can be costly, particularly for specialized\nor emerging domains. To address this, using class surface names alone as\nextremely weak supervision has been proposed. However, existing approaches\ntreat different levels of text granularity (documents, sentences, or words)\nindependently, disregarding inter-granularity class disagreements and the\ncontext identifiable exclusively through joint extraction. In order to tackle\nthese issues, we introduce MEGClass, an extremely weakly-supervised text\nclassification method that leverages Mutually-Enhancing Text Granularities.\nMEGClass utilizes coarse- and fine-grained context signals obtained by jointly\nconsidering a document's most class-indicative words and sentences. This\napproach enables the learning of a contextualized document representation that\ncaptures the most discriminative class indicators. By preserving the\nheterogeneity of potential classes, MEGClass can select the most informative\nclass-indicative documents as iterative feedback to enhance the initial\nword-based class representations and ultimately fine-tune a pre-trained text\nclassifier. 
Extensive experiments on seven benchmark datasets demonstrate that\nMEGClass outperforms other weakly and extremely weakly supervised methods.\n","authors":["Priyanka Kargupta","Tanay Komarlu","Susik Yoon","Xuan Wang","Jiawei Han"],"pdf_url":"https://arxiv.org/pdf/2304.01969v2.pdf","comment":"Code: https://github.com/pkargupta/MEGClass/"},{"id":"http://arxiv.org/abs/2310.19145v1","updated":"2023-10-29T20:39:11Z","published":"2023-10-29T20:39:11Z","title":"Learning to Follow Object-Centric Image Editing Instructions Faithfully","summary":" Natural language instructions are a powerful interface for editing the\noutputs of text-to-image diffusion models. However, several challenges need to\nbe addressed: 1) underspecification (the need to model the implicit meaning of\ninstructions) 2) grounding (the need to localize where the edit has to be\nperformed), 3) faithfulness (the need to preserve the elements of the image not\naffected by the edit instruction). Current approaches focusing on image editing\nwith natural language instructions rely on automatically generated paired data,\nwhich, as shown in our investigation, is noisy and sometimes nonsensical,\nexacerbating the above issues. Building on recent advances in segmentation,\nChain-of-Thought prompting, and visual question answering, we significantly\nimprove the quality of the paired data. In addition, we enhance the supervision\nsignal by highlighting parts of the image that need to be changed by the\ninstruction. The model fine-tuned on the improved data is capable of performing\nfine-grained object-centric edits better than state-of-the-art baselines,\nmitigating the problems outlined above, as shown by automatic and human\nevaluations. 
Moreover, our model is capable of generalizing to domains unseen\nduring training, such as visual metaphors.\n","authors":["Tuhin Chakrabarty","Kanishk Singh","Arkadiy Saakyan","Smaranda Muresan"],"pdf_url":"https://arxiv.org/pdf/2310.19145v1.pdf","comment":"Findings of EMNLP 2023 (Long paper)"},{"id":"http://arxiv.org/abs/2305.14795v2","updated":"2023-10-29T20:28:17Z","published":"2023-05-24T06:48:41Z","title":"MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop\n Questions","summary":" The information stored in large language models (LLMs) falls out of date\nquickly, and retraining from scratch is often not an option. This has recently\ngiven rise to a range of techniques for injecting new facts through updating\nmodel weights. Current evaluation paradigms are extremely limited, mainly\nvalidating the recall of edited facts, but changing one fact should cause\nrippling changes to the model's related beliefs. If we edit the UK Prime\nMinister to now be Rishi Sunak, then we should get a different answer to Who is\nmarried to the British Prime Minister? In this work, we present a benchmark,\nMQuAKE (Multi-hop Question Answering for Knowledge Editing), comprising\nmulti-hop questions that assess whether edited models correctly answer\nquestions where the answer should change as an entailed consequence of edited\nfacts. While we find that current knowledge-editing approaches can recall\nedited facts accurately, they fail catastrophically on the constructed\nmulti-hop questions. We thus propose a simple memory-based approach, MeLLo,\nwhich stores all edited facts externally while prompting the language model\niteratively to generate answers that are consistent with the edited facts.\nWhile MQuAKE remains challenging, we show that MeLLo scales well with LLMs (up\nto 175B) and outperforms previous model editors by a large margin.\n","authors":["Zexuan Zhong","Zhengxuan Wu","Christopher D. 
Manning","Christopher Potts","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2305.14795v2.pdf","comment":"EMNLP 2023. Our code and datasets are available at\n https://github.com/princeton-nlp/MQuAKE"},{"id":"http://arxiv.org/abs/2310.19130v1","updated":"2023-10-29T19:39:03Z","published":"2023-10-29T19:39:03Z","title":"Women Wearing Lipstick: Measuring the Bias Between an Object and Its\n Related Gender","summary":" In this paper, we investigate the impact of objects on gender bias in image\ncaptioning systems. Our results show that only gender-specific objects have a\nstrong gender bias (e.g., women-lipstick). In addition, we propose a visual\nsemantic-based gender score that measures the degree of bias and can be used as\na plug-in for any image captioning system. Our experiments demonstrate the\nutility of the gender score, since we observe that our score can measure the\nbias relation between a caption and its related gender; therefore, our score\ncan be used as an additional metric to the existing Object Gender Co-Occ\napproach. Code and data are publicly available at\n\\url{https://github.com/ahmedssabir/GenderScore}.\n","authors":["Ahmed Sabir","Lluís Padró"],"pdf_url":"https://arxiv.org/pdf/2310.19130v1.pdf","comment":"EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2310.19127v1","updated":"2023-10-29T19:28:22Z","published":"2023-10-29T19:28:22Z","title":"Unified Representation for Non-compositional and Compositional\n Expressions","summary":" Accurate processing of non-compositional language relies on generating good\nrepresentations for such expressions. In this work, we study the representation\nof language non-compositionality by proposing a language model, PIER, that\nbuilds on BART and can create semantically meaningful and contextually\nappropriate representations for English potentially idiomatic expressions\n(PIEs). PIEs are characterized by their non-compositionality and contextual\nambiguity in their literal and idiomatic interpretations. 
Via intrinsic\nevaluation on embedding quality and extrinsic evaluation on PIE processing and\nNLU tasks, we show that representations generated by PIER result in 33% higher\nhomogeneity score for embedding clustering than BART, whereas 3.12% and 3.29%\ngains in accuracy and sequence accuracy for PIE sense classification and span\ndetection compared to the state-of-the-art IE representation model, GIEA. These\ngains are achieved without sacrificing PIER's performance on NLU tasks (+/- 1%\naccuracy) compared to BART.\n","authors":["Ziheng Zeng","Suma Bhat"],"pdf_url":"https://arxiv.org/pdf/2310.19127v1.pdf","comment":"This work is accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2310.19123v1","updated":"2023-10-29T19:20:38Z","published":"2023-10-29T19:20:38Z","title":"Three Dogmas, a Puzzle and its Solution","summary":" Modern Logics, as formulated notably by Frege, Russell and Tarski involved\nbasic assumptions about Natural Languages in general and Indo-European\nLanguages in particular, which are contested by Linguists. Based upon those\nassumptions, formal Languages were designed to overcome what Logicians claimed\nto be 'defects' of Natural Language. In this paper we show that those\nassumptions contradict basic principles of Arabic. More specifically: The\nLogicians ideas, that within Natural Language words refer to objects,\n'ToBe'-constructions represent identity statements, Indefinite Descriptions\nmust be replaced by existential quantifiers to form meaningful Sentences and\nSymbols can have no interpretation-independent meanings, are all falsified\nusing undisputed principles of Arabic. The here presented falsification serves\ntwo purposes. First, it is used as a factual basis for the rejection of\napproaches adopting Semantic axioms of Mathematical Logics as Models for\nmeaning of Arabic Syntax. Second, it shows a way to approach the important\ncomputational problem: Satisfiability (SAT). 
The described way is based upon\nthe realization that parsing Arabic utilizes the existence of\n'meaning-particles' within Syntax to efficiently recognize words, phrases and\nSentences. Similar meaning-particles are shown to exist in 3CNF formulas,\nwhich, when properly handled within the machinery of 3SAT-Solvers, enable\nstructural conditions to be imposed on formulas, sufficient alone to guarantee\nthe efficient production of non-exponentially sized Free Binary Decision\nDiagrams (FBDDs). We show, why known exponential Lower Bounds on sizes of FBDDs\ndo not contradict our results and reveal practical evidence, obtained for\nmultiplication circuits, supporting our claims.\n","authors":["Elnaserledinellah Mahmood Abdelwahab"],"pdf_url":"https://arxiv.org/pdf/2310.19123v1.pdf","comment":"99 pages"},{"id":"http://arxiv.org/abs/2310.19106v1","updated":"2023-10-29T18:43:19Z","published":"2023-10-29T18:43:19Z","title":"PACuna: Automated Fine-Tuning of Language Models for Particle\n Accelerators","summary":" Navigating the landscape of particle accelerators has become increasingly\nchallenging with recent surges in contributions. These intricate devices\nchallenge comprehension, even within individual facilities. To address this, we\nintroduce PACuna, a fine-tuned language model refined through publicly\navailable accelerator resources like conferences, pre-prints, and books. We\nautomated data collection and question generation to minimize expert\ninvolvement and make the data publicly available. PACuna demonstrates\nproficiency in addressing intricate accelerator questions, validated by\nexperts. 
Our approach shows adapting language models to scientific domains by\nfine-tuning technical texts and auto-generated corpora capturing the latest\ndevelopments can further produce pre-trained models to answer some intricate\nquestions that commercially available assistants cannot and can serve as\nintelligent assistants for individual facilities.\n","authors":["Antonin Sulc","Raimund Kammering","Annika Eichler","Tim Wilksen"],"pdf_url":"https://arxiv.org/pdf/2310.19106v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19089v1","updated":"2023-10-29T17:27:18Z","published":"2023-10-29T17:27:18Z","title":"Pushdown Layers: Encoding Recursive Structure in Transformer Language\n Models","summary":" Recursion is a prominent feature of human language, and fundamentally\nchallenging for self-attention due to the lack of an explicit recursive-state\ntracking mechanism. Consequently, Transformer language models poorly capture\nlong-tail recursive structure and exhibit sample-inefficient syntactic\ngeneralization. This work introduces Pushdown Layers, a new self-attention\nlayer that models recursive state via a stack tape that tracks estimated depths\nof every token in an incremental parse of the observed prefix. Transformer LMs\nwith Pushdown Layers are syntactic language models that autoregressively and\nsynchronously update this stack tape as they predict new tokens, in turn using\nthe stack tape to softly modulate attention over tokens -- for instance,\nlearning to \"skip\" over closed constituents. When trained on a corpus of\nstrings annotated with silver constituency parses, Transformers equipped with\nPushdown Layers achieve dramatically better and 3-5x more sample-efficient\nsyntactic generalization, while maintaining similar perplexities. Pushdown\nLayers are a drop-in replacement for standard self-attention. 
We illustrate\nthis by finetuning GPT2-medium with Pushdown Layers on an automatically parsed\nWikiText-103, leading to improvements on several GLUE text classification\ntasks.\n","authors":["Shikhar Murty","Pratyusha Sharma","Jacob Andreas","Christopher D. Manning"],"pdf_url":"https://arxiv.org/pdf/2310.19089v1.pdf","comment":"Accepted at EMNLP 2023 (Long Papers)"},{"id":"http://arxiv.org/abs/2310.19084v1","updated":"2023-10-29T17:16:40Z","published":"2023-10-29T17:16:40Z","title":"Roles of Scaling and Instruction Tuning in Language Perception: Model\n vs. Human Attention","summary":" Recent large language models (LLMs) have revealed strong abilities to\nunderstand natural language. Since most of them share the same basic structure,\ni.e. the transformer block, possible contributors to their success in the\ntraining process are scaling and instruction tuning. However, how these factors\naffect the models' language perception is unclear. This work compares the\nself-attention of several existing LLMs (LLaMA, Alpaca and Vicuna) in different\nsizes (7B, 13B, 30B, 65B), together with eye saccade, an aspect of human\nreading attention, to assess the effect of scaling and instruction tuning on\nlanguage perception. Results show that scaling enhances the human resemblance\nand improves the effective attention by reducing the trivial pattern reliance,\nwhile instruction tuning does not. However, instruction tuning significantly\nenhances the models' sensitivity to instructions. We also find that current\nLLMs are consistently closer to non-native than native speakers in attention,\nsuggesting a sub-optimal language perception of all models. 
Our code and data\nused in the analysis is available on GitHub.\n","authors":["Changjiang Gao","Shujian Huang","Jixing Li","Jiajun Chen"],"pdf_url":"https://arxiv.org/pdf/2310.19084v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13060v2","updated":"2023-10-29T17:14:06Z","published":"2023-04-25T18:00:08Z","title":"Injecting structural hints: Using language models to study inductive\n biases in language learning","summary":" Both humans and large language models are able to learn language without\nexplicit structural supervision. What inductive biases make this learning\npossible? We address this fundamental cognitive question by leveraging\ntransformer language models: we inject inductive bias into language models by\npretraining on formally-structured data, and then evaluate the biased learners'\nability to learn typologically-diverse natural languages. Our experimental\nsetup creates a testbed for hypotheses about inductive bias in human language\nlearning. We investigate the effect of injecting models with three types of\ninductive bias: 1) recursive, hierarchical processing, 2) crossing token-token\nrelationships that can't be modeled by context-free grammars, and 3) a Zipfian\npower-law vocabulary distribution. We show that non-context-free relationships\nform the best inductive biases. 
Our study leverages the capabilities of\ntransformer models to run controlled language learning experiments that are not\npossible to run on humans, and surfaces hypotheses about the structures that\nfacilitate language learning in both humans and machines.\n","authors":["Isabel Papadimitriou","Dan Jurafsky"],"pdf_url":"https://arxiv.org/pdf/2304.13060v2.pdf","comment":"Findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.11778v2","updated":"2023-10-29T17:07:54Z","published":"2023-10-18T08:16:29Z","title":"Language Agents for Detecting Implicit Stereotypes in Text-to-image\n Models at Scale","summary":" The recent surge in the research of diffusion models has accelerated the\nadoption of text-to-image models in various Artificial Intelligence Generated\nContent (AIGC) commercial products. While these exceptional AIGC products are\ngaining increasing recognition and sparking enthusiasm among consumers, the\nquestions regarding whether, when, and how these models might unintentionally\nreinforce existing societal stereotypes remain largely unaddressed. Motivated\nby recent advancements in language agents, here we introduce a novel agent\narchitecture tailored for stereotype detection in text-to-image models. This\nversatile agent architecture is capable of accommodating free-form detection\ntasks and can autonomously invoke various tools to facilitate the entire\nprocess, from generating corresponding instructions and images, to detecting\nstereotypes. We build the stereotype-relevant benchmark based on multiple\nopen-text datasets, and apply this architecture to commercial products and\npopular open source text-to-image models. We find that these models often\ndisplay serious stereotypes when it comes to certain prompts about personal\ncharacteristics, social cultural context and crime-related aspects. 
In summary,\nthese empirical findings underscore the pervasive existence of stereotypes\nacross social dimensions, including gender, race, and religion, which not only\nvalidate the effectiveness of our proposed approach, but also emphasize the\ncritical necessity of addressing potential ethical risks in the burgeoning\nrealm of AIGC. As AIGC continues its rapid expansion trajectory, with new\nmodels and plugins emerging daily in staggering numbers, the challenge lies in\nthe timely detection and mitigation of potential biases within these models.\n","authors":["Qichao Wang","Tian Bian","Yian Yin","Tingyang Xu","Hong Cheng","Helen M. Meng","Zibin Zheng","Liang Chen","Bingzhe Wu"],"pdf_url":"https://arxiv.org/pdf/2310.11778v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19060v1","updated":"2023-10-29T16:25:32Z","published":"2023-10-29T16:25:32Z","title":"TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language\n Understanding","summary":" Large-scale video-language pre-training has made remarkable strides in\nadvancing video-language understanding tasks. However, the heavy computational\nburden of video encoding remains a formidable efficiency bottleneck,\nparticularly for long-form videos. These videos contain massive visual tokens\ndue to their inherent 3D properties and spatiotemporal redundancy, making it\nchallenging to capture complex temporal and spatial relationships. To tackle\nthis issue, we propose an efficient method called TEmporal-Spatial Token\nAggregation (TESTA). TESTA condenses video semantics by adaptively aggregating\nsimilar frames, as well as similar patches within each frame. TESTA can reduce\nthe number of visual tokens by 75% and thus accelerate video encoding. Building\nupon TESTA, we introduce a pre-trained video-language model equipped with a\ndivided space-time token aggregation module in each video encoder block. We\nevaluate our model on five datasets for paragraph-to-video retrieval and\nlong-form VideoQA tasks. 
Experimental results show that TESTA improves\ncomputing efficiency by 1.7 times, and achieves significant performance gains\nfrom its scalability in processing longer input frames, e.g., +13.7 R@1 on\nQuerYD and +6.5 R@1 on Condensed Movie.\n","authors":["Shuhuai Ren","Sishuo Chen","Shicheng Li","Xu Sun","Lu Hou"],"pdf_url":"https://arxiv.org/pdf/2310.19060v1.pdf","comment":"16 pages, 9 figures, code is available at\n https://github.com/RenShuhuai-Andy/TESTA"},{"id":"http://arxiv.org/abs/2310.19055v1","updated":"2023-10-29T16:02:46Z","published":"2023-10-29T16:02:46Z","title":"A Survey on Recent Named Entity Recognition and Relation Classification\n Methods with Focus on Few-Shot Learning Approaches","summary":" Named entity recognition and relation classification are key stages for\nextracting information from unstructured text. Several natural language\nprocessing applications utilize the two tasks, such as information retrieval,\nknowledge graph construction and completion, question answering and other\ndomain-specific applications, such as biomedical data mining. We present a\nsurvey of recent approaches in the two tasks with focus on few-shot learning\napproaches. Our work compares the main approaches followed in the two\nparadigms. Additionally, we report the latest metric scores in the two tasks\nwith a structured analysis that considers the results in the few-shot learning\nscope.\n","authors":["Sakher Alqaaidi","Elika Bozorgi"],"pdf_url":"https://arxiv.org/pdf/2310.19055v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.08166v2","updated":"2023-10-29T15:39:51Z","published":"2023-10-12T09:39:17Z","title":"Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task\n Instruction Tuning","summary":" Recent advancements enlarge the capabilities of large language models (LLMs)\nin zero-shot image-to-text generation and understanding by integrating\nmulti-modal inputs. 
However, such success is typically limited to English\nscenarios due to the lack of large-scale and high-quality non-English\nmulti-modal resources, making it extremely difficult to establish competitive\ncounterparts in other languages. In this paper, we introduce the Ziya-Visual\nseries, a set of bilingual large-scale vision-language models (LVLMs) designed\nto incorporate visual semantics into LLM for multi-modal dialogue. Composed of\nZiya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying\nTransformer from BLIP-2, further exploring the assistance of optimization\nschemes such as instruction tuning, multi-stage training and low-rank\nadaptation module for visual-language alignment. In addition, we stimulate the\nunderstanding ability of GPT-4 in multi-modal scenarios, translating our\ngathered English image-text datasets into Chinese and generating\ninstruction-response through the in-context learning method. The experiment\nresults demonstrate that compared to the existing LVLMs, Ziya-Visual achieves\ncompetitive performance across a wide range of English-only tasks including\nzero-shot image-text retrieval, image captioning, and visual question\nanswering. The evaluation leaderboard accessed by GPT-4 also indicates that our\nmodels possess satisfactory image-text understanding and generation\ncapabilities in Chinese multi-modal scenario dialogues. Code, demo and models\nare available at\n~\\url{https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1}.\n","authors":["Junyu Lu","Dixiang Zhang","Xiaojun Wu","Xinyu Gao","Ruyi Gan","Jiaxing Zhang","Yan Song","Pingjian Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.08166v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12450v2","updated":"2023-10-29T14:58:54Z","published":"2023-10-19T04:08:10Z","title":"A Read-and-Select Framework for Zero-shot Entity Linking","summary":" Zero-shot entity linking (EL) aims at aligning entity mentions to unseen\nentities to challenge the generalization ability. 
Previous methods largely\nfocus on the candidate retrieval stage and ignore the essential candidate\nranking stage, which disambiguates among entities and makes the final linking\nprediction. In this paper, we propose a read-and-select (ReS) framework by\nmodeling the main components of entity disambiguation, i.e., mention-entity\nmatching and cross-entity comparison. First, for each candidate, the reading\nmodule leverages mention context to output mention-aware entity\nrepresentations, enabling mention-entity matching. Then, in the selecting\nmodule, we frame the choice of candidates as a sequence labeling problem, and\nall candidate representations are fused together to enable cross-entity\ncomparison. Our method achieves the state-of-the-art performance on the\nestablished zero-shot EL dataset ZESHEL with a 2.55% micro-average accuracy\ngain, with no need for laborious multi-phase pre-training used in most of the\nprevious work, showing the effectiveness of both mention-entity and\ncross-entity interaction.\n","authors":["Zhenran Xu","Yulin Chen","Baotian Hu","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.12450v2.pdf","comment":"EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2307.04657v2","updated":"2023-10-29T14:53:56Z","published":"2023-07-10T15:56:17Z","title":"BeaverTails: Towards Improved Safety Alignment of LLM via a\n Human-Preference Dataset","summary":" In this paper, we introduce the \\textsc{BeaverTails} dataset, aimed at\nfostering research on safety alignment in large language models (LLMs). This\ndataset uniquely separates annotations of helpfulness and harmlessness for\nquestion-answering pairs, thus offering distinct perspectives on these crucial\nattributes. In total, we have gathered safety meta-labels for 30,207\nquestion-answer (QA) pairs and 30,144 pairs of expert comparison data for both\nthe helpfulness and harmlessness metrics. 
In total, we have gathered safety\nmeta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert\ncomparison data for both the helpfulness and harmlessness metrics. We further\nshowcase applications of BeaverTails in content moderation and reinforcement\nlearning with human feedback (RLHF), emphasizing its potential for practical\nsafety measures in LLMs. We believe this dataset provides vital resources for\nthe community, contributing towards the safe development and deployment of\nLLMs. Our project page is available at the following URL:\nhttps://sites.google.com/view/pku-beavertails.\n Warning: this paper contains example data that may be offensive or harmful.\n","authors":["Jiaming Ji","Mickel Liu","Juntao Dai","Xuehai Pan","Chi Zhang","Ce Bian","Chi Zhang","Ruiyang Sun","Yizhou Wang","Yaodong Yang"],"pdf_url":"https://arxiv.org/pdf/2307.04657v2.pdf","comment":"NeurIPS Datasets and Benchmarks 2023"},{"id":"http://arxiv.org/abs/2305.14250v2","updated":"2023-10-29T14:51:48Z","published":"2023-05-23T17:04:25Z","title":"Language Models with Rationality","summary":" While large language models (LLMs) are proficient at question-answering (QA),\nit is not always clear how (or even if) an answer follows from their latent\n\"beliefs\". This lack of interpretability is a growing impediment to widespread\nuse of LLMs. To address this, our goals are to make model beliefs and their\ninferential relationships explicit, and to resolve inconsistencies that may\nexist, so that answers are supported by interpretable chains of reasoning drawn\nfrom a consistent network of beliefs. Our approach, which we call REFLEX, is to\nadd a rational, self-reflecting layer on top of the LLM. First, given a\nquestion, we construct a belief graph using a backward-chaining process to\nmaterialize relevant model beliefs (including beliefs about answer candidates)\nand their inferential relationships. 
Second, we identify and minimize\ncontradictions in that graph using a formal constraint reasoner. We find that\nREFLEX significantly improves consistency (by 8%-11% absolute) without harming\noverall answer accuracy, resulting in answers supported by faithful chains of\nreasoning drawn from a more consistent belief system. This suggests a new style\nof system architecture in which an LLM extended with a rational layer can\nprovide an interpretable window into system beliefs, add a systematic reasoning\ncapability, and repair latent inconsistencies present in the LLM.\n","authors":["Nora Kassner","Oyvind Tafjord","Ashish Sabharwal","Kyle Richardson","Hinrich Schuetze","Peter Clark"],"pdf_url":"https://arxiv.org/pdf/2305.14250v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19034v1","updated":"2023-10-29T14:46:11Z","published":"2023-10-29T14:46:11Z","title":"ArBanking77: Intent Detection Neural Model and a New Dataset in Modern\n and Dialectical Arabic","summary":" This paper presents the ArBanking77, a large Arabic dataset for intent\ndetection in the banking domain. Our dataset was arabized and localized from\nthe original English Banking77 dataset, which consists of 13,083 queries to\nArBanking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA)\nand Palestinian dialect, with each query classified into one of the 77 classes\n(intents). Furthermore, we present a neural model, based on AraBERT, fine-tuned\non ArBanking77, which achieved an F1-score of 0.9209 and 0.8995 on MSA and\nPalestinian dialect, respectively. We performed extensive experimentation in\nwhich we simulated low-resource settings, where the model is trained on a\nsubset of the data and augmented with noisy queries to simulate colloquial\nterms, mistakes and misspellings found in real NLP systems, especially live\nchat queries. 
The data and the models are publicly available at\nhttps://sina.birzeit.edu/arbanking77.\n","authors":["Mustafa Jarrar","Ahmet Birim","Mohammed Khalilia","Mustafa Erden","Sana Ghanem"],"pdf_url":"https://arxiv.org/pdf/2310.19034v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19029v1","updated":"2023-10-29T14:36:37Z","published":"2023-10-29T14:36:37Z","title":"SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks","summary":" SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens,\nwhich are all sense-annotated. The corpus is annotated using two different\nsense inventories simultaneously (Modern and Ghani). SALMA novelty lies in how\ntokens and senses are associated. Instead of linking a token to only one\nintended sense, SALMA links a token to multiple senses and provides a score to\neach sense. A smart web-based annotation tool was developed to support scoring\nmultiple senses against a given word. In addition to sense annotations, we also\nannotated the corpus using six types of named entities. The quality of our\nannotations was assessed using various metrics (Kappa, Linear Weighted Kappa,\nQuadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error),\nwhich show very high inter-annotator agreement. To establish a Word Sense\nDisambiguation baseline using our SALMA corpus, we developed an end-to-end Word\nSense Disambiguation system using Target Sense Verification. We used this\nsystem to evaluate three Target Sense Verification models available in the\nliterature. Our best model achieved an accuracy with 84.2% using Modern and\n78.7% using Ghani. 
The full corpus and the annotation tool are open-source and\npublicly available at https://sina.birzeit.edu/salma/.\n","authors":["Mustafa Jarrar","Sanad Malaysha","Tymaa Hammouda","Mohammed Khalilia"],"pdf_url":"https://arxiv.org/pdf/2310.19029v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.13005v2","updated":"2023-10-29T14:35:27Z","published":"2022-07-26T16:09:07Z","title":"Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark","summary":" Modern Entity Linking (EL) systems entrench a popularity bias, yet there is\nno dataset focusing on tail and emerging entities in languages other than\nEnglish. We present Hansel, a new benchmark in Chinese that fills the vacancy\nof non-English few-shot and zero-shot EL challenges. The test set of Hansel is\nhuman annotated and reviewed, created with a novel method for collecting\nzero-shot EL datasets. It covers 10K diverse documents in news, social media\nposts and other web articles, with Wikidata as its target Knowledge Base. We\ndemonstrate that the existing state-of-the-art EL system performs poorly on\nHansel (R@1 of 36.6% on Few-Shot). We then establish a strong baseline that\nscores a R@1 of 46.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We\nalso show that our baseline achieves competitive results on TAC-KBP2015 Chinese\nEntity Linking task.\n","authors":["Zhenran Xu","Zifei Shan","Yuxin Li","Baotian Hu","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2207.13005v2.pdf","comment":"WSDM 2023"},{"id":"http://arxiv.org/abs/2303.04132v2","updated":"2023-10-29T14:24:46Z","published":"2023-03-07T18:48:55Z","title":"Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and\n the Case of Information Extraction","summary":" Large language models (LLMs) have great potential for synthetic data\ngeneration. 
This work shows that useful data can be synthetically generated\neven for tasks that cannot be solved directly by LLMs: for problems with\nstructured outputs, it is possible to prompt an LLM to perform the task in the\nreverse direction, by generating plausible input text for a target output\nstructure. Leveraging this asymmetry in task difficulty makes it possible to\nproduce large-scale, high-quality data for complex tasks. We demonstrate the\neffectiveness of this approach on closed information extraction, where\ncollecting ground-truth data is challenging, and no satisfactory dataset exists\nto date. We synthetically generate a dataset of 1.8M data points, establish its\nsuperior quality compared to existing datasets in a human evaluation, and use\nit to finetune small models (220M and 770M parameters), termed SynthIE, that\noutperform the prior state of the art (with equal model size) by a substantial\nmargin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data,\nand models are available at https://github.com/epfl-dlab/SynthIE.\n","authors":["Martin Josifoski","Marija Sakota","Maxime Peyrard","Robert West"],"pdf_url":"https://arxiv.org/pdf/2303.04132v2.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.19019v1","updated":"2023-10-29T14:16:54Z","published":"2023-10-29T14:16:54Z","title":"TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language\n Modeling Likewise","summary":" Large Language Models (LLMs) exhibit impressive reasoning and data\naugmentation capabilities in various NLP tasks. However, what about small\nmodels? In this work, we propose TeacherLM-7.1B, capable of annotating relevant\nfundamentals, chain of thought, and common mistakes for most NLP samples, which\nmakes annotation more than just an answer, thus allowing other models to learn\n\"why\" instead of just \"what\". The TeacherLM-7.1B model achieved a zero-shot\nscore of 52.3 on MMLU, surpassing most models with over 100B parameters. 
Even\nmore remarkable is its data augmentation ability. Based on TeacherLM-7.1B, we\naugmented 58 NLP datasets and taught various student models with different\nparameters from OPT and BLOOM series in a multi-task setting. The experimental\nresults indicate that the data augmentation provided by TeacherLM has brought\nsignificant benefits. We will release the TeacherLM series of models and\naugmented datasets as open-source.\n","authors":["Nan He","Hanyu Lai","Chenyang Zhao","Zirui Cheng","Junting Pan","Ruoyu Qin","Ruofan Lu","Rui Lu","Yunchen Zhang","Gangming Zhao","Zhaohui Hou","Zhiyuan Huang","Shaoqing Lu","Ding Liang","Mingjie Zhan"],"pdf_url":"https://arxiv.org/pdf/2310.19019v1.pdf","comment":"5 figures, 15 pages"},{"id":"http://arxiv.org/abs/2305.06908v4","updated":"2023-10-29T14:12:08Z","published":"2023-05-11T15:51:46Z","title":"CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency\n Model","summary":" Denoising diffusion probabilistic models (DDPMs) have shown promising\nperformance for speech synthesis. However, a large number of iterative steps\nare required to achieve high sample quality, which restricts the inference\nspeed. Maintaining sample quality while increasing sampling speed has become a\nchallenging task. In this paper, we propose a \"Co\"nsistency \"Mo\"del-based\n\"Speech\" synthesis method, CoMoSpeech, which achieve speech synthesis through a\nsingle diffusion sampling step while achieving high audio quality. The\nconsistency constraint is applied to distill a consistency model from a\nwell-designed diffusion-based teacher model, which ultimately yields superior\nperformances in the distilled CoMoSpeech. Our experiments show that by\ngenerating audio recordings by a single sampling step, the CoMoSpeech achieves\nan inference speed more than 150 times faster than real-time on a single NVIDIA\nA100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based\nspeech synthesis truly practical. 
Meanwhile, objective and subjective\nevaluations on text-to-speech and singing voice synthesis show that the\nproposed teacher models yield the best audio quality, and the one-step sampling\nbased CoMoSpeech achieves the best inference speed with better or comparable\naudio quality to other conventional multi-step diffusion model baselines. Audio\nsamples are available at https://comospeech.github.io/.\n","authors":["Zhen Ye","Wei Xue","Xu Tan","Jie Chen","Qifeng Liu","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2305.06908v4.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2302.10850v2","updated":"2023-10-29T13:05:52Z","published":"2023-02-21T18:02:20Z","title":"Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management","summary":" Reinforcement learning (RL) has shown great promise for developing dialogue\nmanagement (DM) agents that are non-myopic, conduct rich conversations, and\nmaximize overall user satisfaction. Despite recent developments in RL and\nlanguage models (LMs), using RL to power conversational chatbots remains\nchallenging, in part because RL requires online exploration to learn\neffectively, whereas collecting novel human-bot interactions can be expensive\nand unsafe. This issue is exacerbated by the combinatorial action spaces facing\nthese algorithms, as most LM agents generate responses at the word level. We\ndevelop a variety of RL algorithms, specialized to dialogue planning, that\nleverage recent Mixture-of-Expert Language Models (MoE-LMs) -- models that\ncapture diverse semantics, generate utterances reflecting different intents,\nand are amenable for multi-turn DM. By exploiting MoE-LM structure, our methods\nsignificantly reduce the size of the action space and improve the efficacy of\nRL-based DM. 
We evaluate our methods in open-domain dialogue to demonstrate\ntheir effectiveness w.r.t.\\ the diversity of intent in generated utterances and\noverall DM performance.\n","authors":["Dhawal Gupta","Yinlam Chow","Aza Tulepbergenov","Mohammad Ghavamzadeh","Craig Boutilier"],"pdf_url":"https://arxiv.org/pdf/2302.10850v2.pdf","comment":"Thirty-seventh Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2305.16934v2","updated":"2023-10-29T12:32:19Z","published":"2023-05-26T13:49:44Z","title":"On Evaluating Adversarial Robustness of Large Vision-Language Models","summary":" Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented\nperformance in response generation, especially with visual inputs, enabling\nmore creative and adaptable interaction than large language models such as\nChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since\nadversaries may successfully evade the entire system by subtly manipulating the\nmost vulnerable modality (e.g., vision). To this end, we propose evaluating the\nrobustness of open-source large VLMs in the most realistic and high-risk\nsetting, where adversaries have only black-box system access and seek to\ndeceive the model into returning the targeted responses. In particular, we\nfirst craft targeted adversarial examples against pretrained models such as\nCLIP and BLIP, and then transfer these adversarial examples to other VLMs such\nas MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we\nobserve that black-box queries on these VLMs can further improve the\neffectiveness of targeted evasion, resulting in a surprisingly high success\nrate for generating targeted responses. Our findings provide a quantitative\nunderstanding regarding the adversarial vulnerability of large VLMs and call\nfor a more thorough examination of their potential security flaws before\ndeployment in practice. 
Code is at https://github.com/yunqing-me/AttackVLM.\n","authors":["Yunqing Zhao","Tianyu Pang","Chao Du","Xiao Yang","Chongxuan Li","Ngai-Man Cheung","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2305.16934v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18992v1","updated":"2023-10-29T12:27:18Z","published":"2023-10-29T12:27:18Z","title":"Bipartite Graph Pre-training for Unsupervised Extractive Summarization\n with Graph Convolutional Auto-Encoders","summary":" Pre-trained sentence representations are crucial for identifying significant\nsentences in unsupervised document extractive summarization. However, the\ntraditional two-step paradigm of pre-training and sentence-ranking, creates a\ngap due to differing optimization objectives. To address this issue, we argue\nthat utilizing pre-trained embeddings derived from a process specifically\ndesigned to optimize cohesive and distinctive sentence representations helps\nrank significant sentences. To do so, we propose a novel graph pre-training\nauto-encoder to obtain sentence embeddings by explicitly modelling\nintra-sentential distinctive features and inter-sentential cohesive features\nthrough sentence-word bipartite graphs. These pre-trained sentence\nrepresentations are then utilized in a graph-based ranking algorithm for\nunsupervised summarization. Our method produces predominant performance for\nunsupervised summarization frameworks by providing summary-worthy sentence\nrepresentations. 
It surpasses heavy BERT- or RoBERTa-based sentence\nrepresentations in downstream tasks.\n","authors":["Qianren Mao","Shaobo Zhao","Jiarui Li","Xiaolei Gu","Shizhu He","Bo Li","Jianxin Li"],"pdf_url":"https://arxiv.org/pdf/2310.18992v1.pdf","comment":"Accepted by the 2023 Conference on Empirical Methods in Natural\n Language Processing (EMNLP 2023)"},{"id":"http://arxiv.org/abs/2305.01323v3","updated":"2023-10-29T11:02:47Z","published":"2023-05-02T11:08:27Z","title":"Turning Flowchart into Dialog: Augmenting Flowchart-grounded\n Troubleshooting Dialogs via Synthetic Data Generation","summary":" Flowchart-grounded troubleshooting dialogue (FTD) systems, which follow the\ninstructions of a flowchart to diagnose users' problems in specific domains\n(e.g., vehicle, laptop), have been gaining research interest in recent years.\nHowever, collecting sufficient dialogues that are naturally grounded on\nflowcharts is costly, thus FTD systems are impeded by scarce training data. To\nmitigate the data sparsity issue, we propose a plan-based synthetic data\ngeneration (PlanSDG) approach that generates diverse synthetic dialog data at\nscale by transforming concise flowchart into dialogues. Specifically, its\ngenerative model employs a variational-base framework with a hierarchical\nplanning strategy that includes global and local latent planning variables.\nExperiments on the FloDial dataset show that synthetic dialogue produced by\nPlanSDG improves the performance of downstream tasks, including flowchart path\nretrieval and response generation, in particular on the Out-of-Flowchart\nsettings. 
In addition, further analysis demonstrates the quality of synthetic\ndata generated by PlanSDG in paths that are covered by current sample dialogues\nand paths that are not covered.\n","authors":["Haolan Zhan","Sameen Maruf","Lizhen Qu","Yufei Wang","Ingrid Zukerman","Gholamreza Haffari"],"pdf_url":"https://arxiv.org/pdf/2305.01323v3.pdf","comment":"Accepted by ALTA 2023"},{"id":"http://arxiv.org/abs/2310.18974v1","updated":"2023-10-29T10:47:23Z","published":"2023-10-29T10:47:23Z","title":"EtiCor: Corpus for Analyzing LLMs for Etiquettes","summary":" Etiquettes are an essential ingredient of day-to-day interactions among\npeople. Moreover, etiquettes are region-specific, and etiquettes in one region\nmight contradict those in other regions. In this paper, we propose EtiCor, an\nEtiquettes Corpus, having texts about social norms from five different regions\nacross the globe. The corpus provides a test bed for evaluating LLMs for\nknowledge and understanding of region-specific etiquettes. Additionally, we\npropose the task of Etiquette Sensitivity. We experiment with state-of-the-art\nLLMs (Delphi, Falcon40B, and GPT-3.5). Initial results indicate that LLMs\nmostly fail to understand etiquettes from regions of the non-Western world.\n","authors":["Ashutosh Dwivedi","Pradhyumna Lavania","Ashutosh Modi"],"pdf_url":"https://arxiv.org/pdf/2310.18974v1.pdf","comment":"Accepted at EMNLP 2023, Main Conference"},{"id":"http://arxiv.org/abs/2310.18964v1","updated":"2023-10-29T10:07:32Z","published":"2023-10-29T10:07:32Z","title":"LLMs and Finetuning: Benchmarking cross-domain performance for hate\n speech detection","summary":" This paper compares different pre-trained and fine-tuned large language\nmodels (LLMs) for hate speech detection. Our research underscores challenges in\nLLMs' cross-domain validity and overfitting risks. Through evaluations, we\nhighlight the need for fine-tuned models that grasp the nuances of hate speech\nthrough greater label heterogeneity. 
We conclude with a vision for the future\nof hate speech detection, emphasizing cross-domain generalizability and\nappropriate benchmarking practices.\n","authors":["Ahmad Nasir","Aadish Sharma","Kokil Jaidka"],"pdf_url":"https://arxiv.org/pdf/2310.18964v1.pdf","comment":"9 pages, 3 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.18956v1","updated":"2023-10-29T09:56:17Z","published":"2023-10-29T09:56:17Z","title":"End-to-End Autoregressive Retrieval via Bootstrapping for Smart Reply\n Systems","summary":" Reply suggestion systems represent a staple component of many instant\nmessaging and email systems. However, the requirement to produce sets of\nreplies, rather than individual replies, makes the task poorly suited for\nout-of-the-box retrieval architectures, which only consider individual\nmessage-reply similarity. As a result, these systems often rely on additional\npost-processing modules to diversify the outputs. However, these approaches are\nultimately bottlenecked by the performance of the initial retriever, which in\npractice struggles to present a sufficiently diverse range of options to the\ndownstream diversification module, leading to the suggestions being less\nrelevant to the user. In this paper, we consider a novel approach that\nradically simplifies this pipeline through an autoregressive text-to-text\nretrieval model, that learns the smart reply task end-to-end from a dataset of\n(message, reply set) pairs obtained via bootstrapping. Empirical results show\nthis method consistently outperforms a range of state-of-the-art baselines\nacross three datasets, corresponding to a 5.1%-17.9% improvement in relevance,\nand a 0.5%-63.1% improvement in diversity compared to the best baseline\napproach. 
We make our code publicly available.\n","authors":["Benjamin Towle","Ke Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.18956v1.pdf","comment":"FINDINGS-EMNLP 2023"},{"id":"http://arxiv.org/abs/2212.09114v2","updated":"2023-10-29T09:32:07Z","published":"2022-12-18T15:57:46Z","title":"CAPSTONE: Curriculum Sampling for Dense Retrieval with Document\n Expansion","summary":" The dual-encoder has become the de facto architecture for dense retrieval.\nTypically, it computes the latent representations of the query and document\nindependently, thus failing to fully capture the interactions between the query\nand document. To alleviate this, recent research has focused on obtaining\nquery-informed document representations. During training, it expands the\ndocument with a real query, but during inference, it replaces the real query\nwith a generated one. This inconsistency between training and inference causes\nthe dense retrieval model to prioritize query information while disregarding\nthe document when computing the document representation. Consequently, it\nperforms even worse than the vanilla dense retrieval model because its\nperformance heavily relies on the relevance between the generated queries and\nthe real query.In this paper, we propose a curriculum sampling strategy that\nutilizes pseudo queries during training and progressively enhances the\nrelevance between the generated query and the real query. By doing so, the\nretrieval model learns to extend its attention from the document alone to both\nthe document and query, resulting in high-quality query-informed document\nrepresentations. 
Experimental results on both in-domain and out-of-domain\ndatasets demonstrate that our approach outperforms previous dense retrieval\nmodels.\n","authors":["Xingwei He","Yeyun Gong","A-Long Jin","Hang Zhang","Anlei Dong","Jian Jiao","Siu Ming Yiu","Nan Duan"],"pdf_url":"https://arxiv.org/pdf/2212.09114v2.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.04205v2","updated":"2023-10-29T09:25:47Z","published":"2023-10-06T12:44:04Z","title":"Keyword Augmented Retrieval: Novel framework for Information Retrieval\n integrated with speech interface","summary":" Retrieving answers in a quick and low cost manner without hallucinations from\na combination of structured and unstructured data using Language models is a\nmajor hurdle. This is what prevents employment of Language models in knowledge\nretrieval automation. This becomes accentuated when one wants to integrate a\nspeech interface on top of a text based knowledge retrieval system. Besides,\nfor commercial search and chat-bot applications, complete reliance on\ncommercial large language models (LLMs) like GPT 3.5 etc. can be very costly.\nIn the present study, the authors have addressed the aforementioned problem by\nfirst developing a keyword based search framework which augments discovery of\nthe context from the document to be provided to the LLM. The keywords in turn\nare generated by a relatively smaller LLM and cached for comparison with\nkeywords generated by the same smaller LLM against the query raised. This\nsignificantly reduces time and cost to find the context within documents. Once\nthe context is set, a larger LLM uses that to provide answers based on a prompt\ntailored for Q\\&A. This research work demonstrates that use of keywords in\ncontext identification reduces the overall inference time and cost of\ninformation retrieval. 
Given this reduction in inference time and cost with the\nkeyword augmented retrieval framework, a speech based interface for user input\nand response readout was integrated. This allowed a seamless interaction with\nthe language model.\n","authors":["Anupam Purwar","Rahul Sundar"],"pdf_url":"https://arxiv.org/pdf/2310.04205v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.00269v5","updated":"2023-10-29T09:25:17Z","published":"2021-10-01T08:51:58Z","title":"A Survey of Knowledge Enhanced Pre-trained Models","summary":" Pre-trained language models learn informative word representations on a\nlarge-scale text corpus through self-supervised learning, which has achieved\npromising performance in fields of natural language processing (NLP) after\nfine-tuning. These models, however, suffer from poor robustness and lack of\ninterpretability. We refer to pre-trained language models with knowledge\ninjection as knowledge-enhanced pre-trained language models (KEPLMs). These\nmodels demonstrate deep understanding and logical reasoning and introduce\ninterpretability. In this survey, we provide a comprehensive overview of KEPLMs\nin NLP. We first discuss the advancements in pre-trained language models and\nknowledge representation learning. Then we systematically categorize existing\nKEPLMs from three different perspectives. Finally, we outline some potential\ndirections of KEPLMs for future research.\n","authors":["Jian Yang","Xinyu Hu","Gang Xiao","Yulong Shen"],"pdf_url":"https://arxiv.org/pdf/2110.00269v5.pdf","comment":"32 pages, 15 figures"},{"id":"http://arxiv.org/abs/2310.18944v1","updated":"2023-10-29T09:09:10Z","published":"2023-10-29T09:09:10Z","title":"S2F-NER: Exploring Sequence-to-Forest Generation for Complex Entity\n Recognition","summary":" Named Entity Recognition (NER) remains challenging due to the complex\nentities, like nested, overlapping, and discontinuous entities. 
Existing\napproaches, such as sequence-to-sequence (Seq2Seq) generation and span-based\nclassification, have shown impressive performance on various NER subtasks, but\nthey are difficult to scale to datasets with longer input text because of\neither exposure bias issue or inefficient computation. In this paper, we\npropose a novel Sequence-to-Forest generation paradigm, S2F-NER, which can\ndirectly extract entities in sentence via a Forest decoder that decodes multiple\nentities in parallel rather than sequentially. Specifically, our model generates\neach path of each tree in forest autoregressively, where the maximum depth of\neach tree is three (which is the shortest feasible length for complex NER and\nis far smaller than the decoding length of Seq2Seq). Based on this novel\nparadigm, our model can elegantly mitigate the exposure bias problem and keep\nthe simplicity of Seq2Seq. Experimental results show that our model\nsignificantly outperforms the baselines on three discontinuous NER datasets and\non two nested NER datasets, especially for discontinuous entity recognition.\n","authors":["Yongxiu Xu","Heyan Huang","Yue Hu"],"pdf_url":"https://arxiv.org/pdf/2310.18944v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10156v3","updated":"2023-10-29T08:57:47Z","published":"2023-05-17T12:19:11Z","title":"Personality Understanding of Fictional Characters during Book Reading","summary":" Comprehending characters' personalities is a crucial aspect of story reading.\nAs readers engage with a story, their understanding of a character evolves\nbased on new events and information; and multiple fine-grained aspects of\npersonalities can be perceived. This leads to a natural problem of situated and\nfine-grained personality understanding. The problem has not been studied in the\nNLP field, primarily due to the lack of appropriate datasets mimicking the\nprocess of book reading. We present the first labeled dataset PersoNet for this\nproblem. 
Our novel annotation strategy involves annotating user notes from\nonline reading apps as a proxy for the original books. Experiments and human\nstudies indicate that our dataset construction is both efficient and accurate;\nand our task heavily relies on long-term context to achieve accurate\npredictions for both machines and humans. The dataset is available at\nhttps://github.com/Gorov/personet_acl23.\n","authors":["Mo Yu","Jiangnan Li","Shunyu Yao","Wenjie Pang","Xiaochen Zhou","Zhou Xiao","Fandong Meng","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2305.10156v3.pdf","comment":"Accepted at ACL 2023"},{"id":"http://arxiv.org/abs/2310.18930v1","updated":"2023-10-29T07:43:34Z","published":"2023-10-29T07:43:34Z","title":"Retrofitting Light-weight Language Models for Emotions using Supervised\n Contrastive Learning","summary":" We present a novel retrofitting method to induce emotion aspects into\npre-trained language models (PLMs) such as BERT and RoBERTa. Our method updates\npre-trained network weights using contrastive learning so that the text\nfragments exhibiting similar emotions are encoded nearby in the representation\nspace, and the fragments with different emotion content are pushed apart. While\ndoing so, it also ensures that the linguistic knowledge already present in PLMs\nis not inadvertently perturbed. The language models retrofitted by our method,\ni.e., BERTEmo and RoBERTaEmo, produce emotion-aware text representations, as\nevaluated through different clustering and retrieval metrics. For the\ndownstream tasks on sentiment analysis and sarcasm detection, they perform\nbetter than their pre-trained counterparts (about 1% improvement in F1-score)\nand other existing approaches. 
Additionally, a more significant boost in\nperformance is observed for the retrofitted models over pre-trained ones in\nfew-shot learning setting.\n","authors":["Sapan Shah","Sreedhar Reddy","Pushpak Bhattacharyya"],"pdf_url":"https://arxiv.org/pdf/2310.18930v1.pdf","comment":"EMNLP 2023 Camera Ready Version"},{"id":"http://arxiv.org/abs/2310.15080v2","updated":"2023-10-29T07:17:45Z","published":"2023-10-23T16:37:59Z","title":"Federated Learning of Large Language Models with Parameter-Efficient\n Prompt Tuning and Adaptive Optimization","summary":" Federated learning (FL) is a promising paradigm to enable collaborative model\ntraining with decentralized data. However, the training process of Large\nLanguage Models (LLMs) generally incurs the update of significant parameters,\nwhich limits the applicability of FL techniques to tackle the LLMs in real\nscenarios. Prompt tuning can significantly reduce the number of parameters to\nupdate, but it either incurs performance degradation or low training\nefficiency. The straightforward utilization of prompt tuning in the FL often\nraises non-trivial communication costs and dramatically degrades performance.\nIn addition, the decentralized data is generally non-Independent and\nIdentically Distributed (non-IID), which brings client drift problems and thus\npoor performance. This paper proposes a Parameter-efficient prompt Tuning\napproach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and\neffective FL of LLMs. First, an efficient partial prompt tuning approach is\nproposed to improve performance and efficiency simultaneously. Second, a novel\nadaptive optimization method is developed to address the client drift problems\non both the device and server sides to enhance performance further. 
Extensive\nexperiments based on 10 datasets demonstrate the superb performance (up to\n60.8\\% in terms of accuracy) and efficiency (up to 97.59\\% in terms of training\ntime) of FedPepTAO compared with 9 baseline approaches. Our code is available\nat https://github.com/llm-eff/FedPepTAO.\n","authors":["Tianshi Che","Ji Liu","Yang Zhou","Jiaxiang Ren","Jiwen Zhou","Victor S. Sheng","Huaiyu Dai","Dejing Dou"],"pdf_url":"https://arxiv.org/pdf/2310.15080v2.pdf","comment":"18 pages, accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18913v1","updated":"2023-10-29T05:50:03Z","published":"2023-10-29T05:50:03Z","title":"Debiasing Algorithm through Model Adaptation","summary":" Large language models are becoming the go-to solution for various language\ntasks. However, with growing capacity, models are prone to rely on spurious\ncorrelations stemming from biases and stereotypes present in the training data.\nThis work proposes a novel method for detecting and mitigating gender bias in\nlanguage models. We perform causal analysis to identify problematic model\ncomponents and discover that mid-upper feed-forward layers are most prone to\nconvey biases. Based on the analysis results, we adapt the model by multiplying\nthese layers by a linear projection. Our titular method, DAMA, significantly\ndecreases bias as measured by diverse metrics while maintaining the model's\nperformance on downstream tasks. 
We release code for our method and models,\nwhich retain LLaMA's state-of-the-art performance while being significantly\nless biased.\n","authors":["Tomasz Limisiewicz","David Mareček","Tomáš Musil"],"pdf_url":"https://arxiv.org/pdf/2310.18913v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18912v1","updated":"2023-10-29T05:48:04Z","published":"2023-10-29T05:48:04Z","title":"Sentence Bag Graph Formulation for Biomedical Distant Supervision\n Relation Extraction","summary":" We introduce a novel graph-based framework for alleviating key challenges in\ndistantly-supervised relation extraction and demonstrate its effectiveness in\nthe challenging and important domain of biomedical data. Specifically, we\npropose a graph view of sentence bags referring to an entity pair, which\nenables message-passing based aggregation of information related to the entity\npair over the sentence bag. The proposed framework alleviates the common\nproblem of noisy labeling in distantly supervised relation extraction and also\neffectively incorporates inter-dependencies between sentences within a bag.\nExtensive experiments on two large-scale biomedical relation datasets and the\nwidely utilized NYT dataset demonstrate that our proposed framework\nsignificantly outperforms the state-of-the-art methods for biomedical distant\nsupervision relation extraction while also providing excellent performance for\nrelation extraction in the general text mining domain.\n","authors":["Hao Zhang","Yang Liu","Xiaoyan Liu","Tianming Liang","Gaurav Sharma","Liang Xue","Maozu Guo"],"pdf_url":"https://arxiv.org/pdf/2310.18912v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18906v1","updated":"2023-10-29T05:28:44Z","published":"2023-10-29T05:28:44Z","title":"Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text\n Detection","summary":" This paper reports our submission under the team name `SynthDetectives' to\nthe ALTA 2023 Shared Task. 
We use a stacking ensemble of Transformers for the\ntask of AI-generated text detection. Our approach is novel in terms of its\nchoice of models in that we use accessible and lightweight models in the\nensemble. We show that ensembling the models results in an improved accuracy in\ncomparison with using them individually. Our approach achieves an accuracy\nscore of 0.9555 on the official test data provided by the shared task\norganisers.\n","authors":["Duke Nguyen","Khaing Myat Noe Naing","Aditya Joshi"],"pdf_url":"https://arxiv.org/pdf/2310.18906v1.pdf","comment":"This is an ALTA 2023 Shared Task Paper"},{"id":"http://arxiv.org/abs/2309.12269v3","updated":"2023-10-29T04:47:57Z","published":"2023-09-21T17:24:40Z","title":"The Cambridge Law Corpus: A Corpus for Legal AI Research","summary":" We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research.\nIt consists of over 250 000 court cases from the UK. Most cases are from the\n21st century, but the corpus includes cases as old as the 16th century. This\npaper presents the first release of the corpus, containing the raw text and\nmeta-data. Together with the corpus, we provide annotations on case outcomes\nfor 638 cases, done by legal experts. Using our annotated data, we have trained\nand evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to\nprovide benchmarks. We include an extensive legal and ethical discussion to\naddress the potentially sensitive nature of this material. 
As a consequence,\nthe corpus will only be released for research purposes under certain\nrestrictions.\n","authors":["Andreas Östling","Holli Sargeant","Huiyuan Xie","Ludwig Bull","Alexander Terenin","Leif Jonsson","Måns Magnusson","Felix Steffek"],"pdf_url":"https://arxiv.org/pdf/2309.12269v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.07550v3","updated":"2023-10-29T04:39:25Z","published":"2022-05-20T07:32:57Z","title":"Evaluating and Inducing Personality in Pre-trained Language Models","summary":" Standardized and quantified evaluation of machine behaviors is a crux of\nunderstanding LLMs. In this study, we draw inspiration from psychometric\nstudies by leveraging human personality theory as a tool for studying machine\nbehaviors. Originating as a philosophical quest for human behaviors, the study\nof personality delves into how individuals differ in thinking, feeling, and\nbehaving. Toward building and understanding human-like social machines, we are\nmotivated to ask: Can we assess machine behaviors by leveraging human\npsychometric tests in a principled and quantitative manner? If so, can we\ninduce a specific personality in LLMs? To answer these questions, we introduce\nthe Machine Personality Inventory (MPI) tool for studying machine behaviors;\nMPI follows standardized personality tests, built upon the Big Five Personality\nFactors (Big Five) theory and personality assessment inventories. By\nsystematically evaluating LLMs with MPI, we provide the first piece of evidence\ndemonstrating the efficacy of MPI in studying LLMs behaviors. 
We further devise\na Personality Prompting (P^2) method to induce LLMs with specific personalities\nin a controllable way, capable of producing diverse and verifiable behaviors.\nWe hope this work sheds light on future studies by adopting personality as the\nessential indicator for various downstream tasks, and could further motivate\nresearch into equally intriguing human-like machine behaviors.\n","authors":["Guangyuan Jiang","Manjie Xu","Song-Chun Zhu","Wenjuan Han","Chi Zhang","Yixin Zhu"],"pdf_url":"https://arxiv.org/pdf/2206.07550v3.pdf","comment":"Accepted at NeurIPS 2023 (Spotlight)"},{"id":"http://arxiv.org/abs/2310.18877v1","updated":"2023-10-29T02:27:56Z","published":"2023-10-29T02:27:56Z","title":"Pre-trained Speech Processing Models Contain Human-Like Biases that\n Propagate to Speech Emotion Recognition","summary":" Previous work has established that a person's demographics and speech style\naffect how well speech processing models perform for them. But where does this\nbias come from? In this work, we present the Speech Embedding Association Test\n(SpEAT), a method for detecting bias in one type of model used for many speech\ntasks: pre-trained models. The SpEAT is inspired by word embedding association\ntests in natural language processing, which quantify intrinsic bias in a\nmodel's representations of different concepts, such as race or valence\n(something's pleasantness or unpleasantness) and capture the extent to which a\nmodel trained on large-scale socio-cultural data has learned human-like biases.\nUsing the SpEAT, we test for six types of bias in 16 English speech models\n(including 4 models also trained on multilingual data), which come from the\nwav2vec 2.0, HuBERT, WavLM, and Whisper model families. We find that 14 or more\nmodels reveal positive valence (pleasantness) associations with abled people\nover disabled people, with European-Americans over African-Americans, with\nfemales over males, with U.S. accented speakers over non-U.S. 
accented\nspeakers, and with younger people over older people. Beyond establishing that\npre-trained speech models contain these biases, we also show that they can have\nreal world effects. We compare biases found in pre-trained models to biases in\ndownstream models adapted to the task of Speech Emotion Recognition (SER) and\nfind that in 66 of the 96 tests performed (69%), the group that is more\nassociated with positive valence as indicated by the SpEAT also tends to be\npredicted as speaking with higher valence by the downstream model. Our work\nprovides evidence that, like text and image-based models, pre-trained speech\nbased-models frequently learn human-like biases. Our work also shows that bias\nfound in pre-trained models can propagate to the downstream task of SER.\n","authors":["Isaac Slaughter","Craig Greenberg","Reva Schwartz","Aylin Caliskan"],"pdf_url":"https://arxiv.org/pdf/2310.18877v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18867v1","updated":"2023-10-29T01:45:30Z","published":"2023-10-29T01:45:30Z","title":"Prompt-Engineering and Transformer-based Question Generation and\n Evaluation","summary":" Question generation has numerous applications in the educational context.\nQuestion generation can prove helpful for students when reviewing content and\ntesting themselves. Furthermore, a question generation model can aid teachers\nby lessening the burden of creating assessments and other practice material.\nThis paper aims to find the best method to generate questions from textual data\nthrough a transformer model and prompt engineering. In this research, we\nfinetuned a pretrained distilBERT model on the SQuAD question answering dataset\nto generate questions. In addition to training a transformer model, prompt\nengineering was applied to generate questions effectively using the LLaMA\nmodel. 
The generated questions were compared against the baseline questions in\nthe SQuAD dataset to evaluate the effectiveness of four different prompts. All\nfour prompts demonstrated over 60% similarity on average. Of the\nprompt-generated questions, 30% achieved a high similarity score greater than\n70%.\n","authors":["Rubaba Amyeen"],"pdf_url":"https://arxiv.org/pdf/2310.18867v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18865v1","updated":"2023-10-29T01:38:36Z","published":"2023-10-29T01:38:36Z","title":"MUST: A Multilingual Student-Teacher Learning approach for low-resource\n speech recognition","summary":" Student-teacher learning or knowledge distillation (KD) has been previously\nused to address data scarcity issue for training of speech recognition (ASR)\nsystems. However, a limitation of KD training is that the student model classes\nmust be a proper or improper subset of the teacher model classes. It prevents\ndistillation from even acoustically similar languages if the character sets are\nnot same. In this work, the aforementioned limitation is addressed by proposing\na MUltilingual Student-Teacher (MUST) learning which exploits a posteriors\nmapping approach. A pre-trained mapping model is used to map posteriors from a\nteacher language to the student language ASR. These mapped posteriors are used\nas soft labels for KD learning. Various teacher ensemble schemes are\nexperimented to train an ASR model for low-resource languages. 
A model trained\nwith MUST learning reduces relative character error rate (CER) up to 9.5% in\ncomparison with a baseline monolingual ASR.\n","authors":["Muhammad Umar Farooq","Rehan Ahmad","Thomas Hain"],"pdf_url":"https://arxiv.org/pdf/2310.18865v1.pdf","comment":"Accepted for IEEE ASRU 2023"},{"id":"http://arxiv.org/abs/2310.18862v1","updated":"2023-10-29T01:21:36Z","published":"2023-10-29T01:21:36Z","title":"Counterfactually Probing Language Identity in Multilingual Models","summary":" Techniques in causal analysis of language models illuminate how linguistic\ninformation is organized in LLMs. We use one such technique, AlterRep, a method\nof counterfactual probing, to explore the internal structure of multilingual\nmodels (mBERT and XLM-R). We train a linear classifier on a binary language\nidentity task, to classify tokens between Language X and Language Y. Applying a\ncounterfactual probing procedure, we use the classifier weights to project the\nembeddings into the null space and push the resulting embeddings either in the\ndirection of Language X or Language Y. Then we evaluate on a masked language\nmodeling task. We find that, given a template in Language X, pushing towards\nLanguage Y systematically increases the probability of Language Y words, above\nand beyond a third-party control language. But it does not specifically push\nthe model towards translation-equivalent words in Language Y. Pushing towards\nLanguage X (the same direction as the template) has a minimal effect, but\nsomewhat degrades these models. Overall, we take these results as further\nevidence of the rich structure of massive multilingual language models, which\ninclude both a language-specific and language-general component. 
And we show\nthat counterfactual probing can be fruitfully applied to multilingual models.\n","authors":["Anirudh Srinivasan","Venkata S Govindarajan","Kyle Mahowald"],"pdf_url":"https://arxiv.org/pdf/2310.18862v1.pdf","comment":"12 pages, 5 figures, MRL Workshop @ EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.15777v2","updated":"2023-10-29T01:17:53Z","published":"2023-10-24T12:22:34Z","title":"MindLLM: Pre-training Lightweight Large Language Model from Scratch,\n Evaluations and Domain Applications","summary":" Large Language Models (LLMs) have demonstrated remarkable performance across\nvarious natural language tasks, marking significant strides towards general\nartificial intelligence. While general artificial intelligence is leveraged by\ndeveloping increasingly large-scale models, there could be another branch to\ndevelop lightweight custom models that better serve certain domains, taking\ninto account the high cost of training and deploying LLMs and the scarcity of\nresources. In this paper, we present MindLLM, a novel series of bilingual\nlightweight large language models, trained from scratch, alleviating such\nburdens by offering models with 1.3 billion and 3 billion parameters. A\nthorough account of experiences accrued during large model development is\ngiven, covering every step of the process, including data construction, model\narchitecture, evaluation, and applications. Such insights are hopefully\nvaluable for fellow academics and developers. MindLLM consistently matches or\nsurpasses the performance of other open-source larger models on some public\nbenchmarks. 
We also introduce an innovative instruction tuning framework\ntailored for smaller models to enhance their capabilities efficiently.\nMoreover, we explore the application of MindLLM in specific vertical domains\nsuch as law and finance, underscoring the agility and adaptability of our\nlightweight models.\n","authors":["Yizhe Yang","Huashan Sun","Jiawei Li","Runheng Liu","Yinghao Li","Yuhang Liu","Heyan Huang","Yang Gao"],"pdf_url":"https://arxiv.org/pdf/2310.15777v2.pdf","comment":"Working in progress"},{"id":"http://arxiv.org/abs/2203.03897v3","updated":"2023-10-29T00:01:40Z","published":"2022-03-08T07:34:52Z","title":"Geodesic Multi-Modal Mixup for Robust Fine-Tuning","summary":" Pre-trained multi-modal models, such as CLIP, provide transferable embeddings\nand show promising results in diverse applications. However, the analysis of\nlearned multi-modal embeddings is relatively unexplored, and the embedding\ntransferability can be improved. In this work, we observe that CLIP holds\nseparated embedding subspaces for two different modalities, and then we\ninvestigate it through the lens of uniformity-alignment to measure the quality\nof learned representation. Both theoretically and empirically, we show that\nCLIP retains poor uniformity and alignment even after fine-tuning. Such a lack\nof alignment and uniformity might restrict the transferability and robustness\nof embeddings. To this end, we devise a new fine-tuning method for robust\nrepresentation equipping better alignment and uniformity. First, we propose a\nGeodesic Multi-Modal Mixup that mixes the embeddings of image and text to\ngenerate hard negative samples on the hypersphere. Then, we fine-tune the model\non hard negatives as well as original negatives and positives with contrastive\nloss. Based on the theoretical analysis about hardness guarantee and limiting\nbehavior, we justify the use of our method. 
Extensive experiments on retrieval,\ncalibration, few- or zero-shot classification (under distribution shift),\nembedding arithmetic, and image captioning further show that our method\nprovides transferable representations, enabling robust model adaptation on\ndiverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup\n","authors":["Changdae Oh","Junhyuk So","Hoyoon Byun","YongTaek Lim","Minchul Shin","Jong-June Jeon","Kyungwoo Song"],"pdf_url":"https://arxiv.org/pdf/2203.03897v3.pdf","comment":"To appear at NeurIPS 2023"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2306.06805v2","updated":"2023-10-29T23:13:29Z","published":"2023-06-11T23:33:59Z","title":"Unlocking Feature Visualization for Deeper Networks with MAgnitude\n Constrained Optimization","summary":" Feature visualization has gained substantial popularity, particularly after\nthe influential work by Olah et al. in 2017, which established it as a crucial\ntool for explainability. However, its widespread adoption has been limited due\nto a reliance on tricks to generate interpretable images, and corresponding\nchallenges in scaling it to deeper neural networks. Here, we describe MACO, a\nsimple approach to address these shortcomings. The main idea is to generate\nimages by optimizing the phase spectrum while keeping the magnitude constant to\nensure that generated explanations lie in the space of natural images. Our\napproach yields significantly better results (both qualitatively and\nquantitatively) and unlocks efficient and interpretable feature visualizations\nfor large state-of-the-art neural networks. We also show that our approach\nexhibits an attribution mechanism allowing us to augment feature visualizations\nwith spatial importance. 
We validate our method on a novel benchmark for\ncomparing feature visualization methods, and release its visualizations for all\nclasses of the ImageNet dataset on https://serre-lab.github.io/Lens/.\n Overall, our approach unlocks, for the first time, feature visualizations for\nlarge, state-of-the-art deep neural networks without resorting to any\nparametric prior image model.\n","authors":["Thomas Fel","Thibaut Boissin","Victor Boutin","Agustin Picard","Paul Novello","Julien Colin","Drew Linsley","Tom Rousseau","Rémi Cadène","Laurent Gardes","Thomas Serre"],"pdf_url":"https://arxiv.org/pdf/2306.06805v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19188v1","updated":"2023-10-29T23:08:19Z","published":"2023-10-29T23:08:19Z","title":"3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets","summary":" We present 3DMiner -- a pipeline for mining 3D shapes from challenging\nlarge-scale unannotated image datasets. Unlike other unsupervised 3D\nreconstruction methods, we assume that, within a large-enough dataset, there\nmust exist images of objects with similar shapes but varying backgrounds,\ntextures, and viewpoints. Our approach leverages the recent advances in\nlearning self-supervised image representations to cluster images with\ngeometrically similar shapes and find common image correspondences between\nthem. We then exploit these correspondences to obtain rough camera estimates as\ninitialization for bundle-adjustment. Finally, for every image cluster, we\napply a progressive bundle-adjusting reconstruction method to learn a neural\noccupancy field representing the underlying shape. We show that this procedure\nis robust to several types of errors introduced in previous steps (e.g., wrong\ncamera poses, images containing dissimilar shapes, etc.), allowing us to obtain\nshape and pose annotations for images in-the-wild. 
When using images from Pix3D\nchairs, our method is capable of producing significantly better results than\nstate-of-the-art unsupervised 3D reconstruction techniques, both quantitatively\nand qualitatively. Furthermore, we show how 3DMiner can be applied to\nin-the-wild data by reconstructing shapes present in images from the LAION-5B\ndataset. Project Page: https://ttchengab.github.io/3dminerOfficial\n","authors":["Ta-Ying Cheng","Matheus Gadelha","Soren Pirk","Thibault Groueix","Radomir Mech","Andrew Markham","Niki Trigoni"],"pdf_url":"https://arxiv.org/pdf/2310.19188v1.pdf","comment":"In ICCV 2023"},{"id":"http://arxiv.org/abs/2308.10814v2","updated":"2023-10-29T23:00:05Z","published":"2023-08-21T16:03:35Z","title":"Jumping through Local Minima: Quantization in the Loss Landscape of\n Vision Transformers","summary":" Quantization scale and bit-width are the most important parameters when\nconsidering how to quantize a neural network. Prior work focuses on optimizing\nquantization scales in a global manner through gradient methods (gradient\ndescent \\& Hessian analysis). Yet, when applying perturbations to quantization\nscales, we observe a very jagged, highly non-smooth test loss landscape. In\nfact, small perturbations in quantization scale can greatly affect accuracy,\nyielding a $0.5-0.8\\%$ accuracy boost in 4-bit quantized vision transformers\n(ViTs). In this regime, gradient methods break down, since they cannot reliably\nreach local minima. In our work, dubbed Evol-Q, we use evolutionary search to\neffectively traverse the non-smooth landscape. Additionally, we propose using\nan infoNCE loss, which not only helps combat overfitting on the small\ncalibration dataset ($1,000$ images) but also makes traversing such a highly\nnon-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully\nquantized ViT-Base by $10.30\\%$, $0.78\\%$, and $0.15\\%$ for $3$-bit, $4$-bit,\nand $8$-bit weight quantization levels. 
Extensive experiments on a variety of\nCNN and ViT architectures further demonstrate its robustness in extreme\nquantization scenarios. Our code is available at\nhttps://github.com/enyac-group/evol-q\n","authors":["Natalia Frumkin","Dibakar Gope","Diana Marculescu"],"pdf_url":"https://arxiv.org/pdf/2308.10814v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2211.09643"},{"id":"http://arxiv.org/abs/2310.19182v1","updated":"2023-10-29T22:52:43Z","published":"2023-10-29T22:52:43Z","title":"Fast Trainable Projection for Robust Fine-Tuning","summary":" Robust fine-tuning aims to achieve competitive in-distribution (ID)\nperformance while maintaining the out-of-distribution (OOD) robustness of a\npre-trained model when transferring it to a downstream task. Recently,\nprojected gradient descent has been successfully used in robust fine-tuning by\nconstraining the deviation from the initialization of the fine-tuned model\nexplicitly through projection. However, algorithmically, two limitations\nprevent this method from being adopted more widely, scalability and efficiency.\nIn this paper, we propose a new projection-based fine-tuning algorithm, Fast\nTrainable Projection (FTP) for computationally efficient learning of per-layer\nprojection constraints, resulting in an average $35\\%$ speedup on our\nbenchmarks compared to prior works. FTP can be combined with existing\noptimizers such as AdamW, and be used in a plug-and-play fashion. Finally, we\nshow that FTP is a special instance of hyper-optimizers that tune the\nhyper-parameters of optimizers in a learnable manner through nested\ndifferentiation. Empirically, we show superior robustness on OOD datasets,\nincluding domain shifts and natural corruptions, across four different vision\ntasks with five different pre-trained models. 
Additionally, we demonstrate that\nFTP is broadly applicable and beneficial to other learning scenarios such as\nlow-label and continual learning settings thanks to its easy adaptability. The\ncode will be available at https://github.com/GT-RIPL/FTP.git.\n","authors":["Junjiao Tian","Yen-Cheng Liu","James Seale Smith","Zsolt Kira"],"pdf_url":"https://arxiv.org/pdf/2310.19182v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.16289v2","updated":"2023-10-29T22:52:04Z","published":"2023-05-25T17:43:05Z","title":"Diversify Your Vision Datasets with Automatic Diffusion-Based\n Augmentation","summary":" Many fine-grained classification tasks, like rare animal identification, have\nlimited training data and consequently classifiers trained on these datasets\noften fail to generalize to variations in the domain like changes in weather or\nlocation. As such, we explore how natural language descriptions of the domains\nseen in training data can be used with large vision models trained on diverse\npretraining datasets to generate useful variations of the training data. We\nintroduce ALIA (Automated Language-guided Image Augmentation), a method which\nutilizes large vision and language models to automatically generate natural\nlanguage descriptions of a dataset's domains and augment the training data via\nlanguage-guided image editing. To maintain data integrity, a model trained on\nthe original dataset filters out minimal image edits and those which corrupt\nclass-relevant information. The resulting dataset is visually consistent with\nthe original training data and offers significantly enhanced diversity. We show\nthat ALIA is able to surpass traditional data augmentation and text-to-image\ngenerated data on fine-grained classification tasks, including cases of domain\ngeneralization and contextual bias. Code is available at\nhttps://github.com/lisadunlap/ALIA.\n","authors":["Lisa Dunlap","Alyssa Umino","Han Zhang","Jiezhi Yang","Joseph E. 
Gonzalez","Trevor Darrell"],"pdf_url":"https://arxiv.org/pdf/2305.16289v2.pdf","comment":"Update: replaced Planes dataset with Waterbirds & updated results\n after bug fix"},{"id":"http://arxiv.org/abs/2310.19180v1","updated":"2023-10-29T22:51:49Z","published":"2023-10-29T22:51:49Z","title":"JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music\n Generation","summary":" With rapid advances in generative artificial intelligence, the text-to-music\nsynthesis task has emerged as a promising direction for music generation from\nscratch. However, finer-grained control over multi-track generation remains an\nopen challenge. Existing models exhibit strong raw generation capability but\nlack the flexibility to compose separate tracks and combine them in a\ncontrollable manner, differing from typical workflows of human composers. To\naddress this issue, we propose JEN-1 Composer, a unified framework to\nefficiently model marginal, conditional, and joint distributions over\nmulti-track music via a single model. JEN-1 Composer framework exhibits the\ncapacity to seamlessly incorporate any diffusion-based music generation system,\ne.g., Jen-1, enhancing its capacity for versatile multi-track music\ngeneration. We introduce a curriculum training strategy aimed at incrementally\ninstructing the model in the transition from single-track generation to the\nflexible generation of multi-track combinations. During the inference, users\nhave the ability to iteratively produce and choose music tracks that meet their\npreferences, subsequently creating an entire musical composition incrementally\nfollowing the proposed Human-AI co-composition workflow. Quantitative and\nqualitative assessments demonstrate state-of-the-art performance in\ncontrollable and high-fidelity multi-track music synthesis. The proposed JEN-1\nComposer represents a significant advance toward interactive AI-facilitated\nmusic creation and composition. 
Demos will be available at\nhttps://jenmusic.ai/audio-demos.\n","authors":["Yao Yao","Peike Li","Boyu Chen","Alex Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19180v1.pdf","comment":"Preprints"},{"id":"http://arxiv.org/abs/2307.05397v2","updated":"2023-10-29T22:41:02Z","published":"2023-07-11T15:57:51Z","title":"On the Vulnerability of DeepFake Detectors to Attacks Generated by\n Denoising Diffusion Models","summary":" The detection of malicious deepfakes is a constantly evolving problem that\nrequires continuous monitoring of detectors to ensure they can detect image\nmanipulations generated by the latest emerging models. In this paper, we\ninvestigate the vulnerability of single-image deepfake detectors to black-box\nattacks created by the newest generation of generative methods, namely\nDenoising Diffusion Models (DDMs). Our experiments are run on FaceForensics++,\na widely used deepfake benchmark consisting of manipulated images generated\nwith various techniques for face identity swapping and face reenactment.\nAttacks are crafted through guided reconstruction of existing deepfakes with a\nproposed DDM approach for face restoration. Our findings indicate that\nemploying just a single denoising diffusion step in the reconstruction process\nof a deepfake can significantly reduce the likelihood of detection, all without\nintroducing any perceptible image modifications. 
While training detectors using\nattack examples demonstrated some effectiveness, it was observed that\ndiscriminators trained on fully diffusion-based deepfakes exhibited limited\ngeneralizability when presented with our attacks.\n","authors":["Marija Ivanovska","Vitomir Štruc"],"pdf_url":"https://arxiv.org/pdf/2307.05397v2.pdf","comment":"Submitted for review"},{"id":"http://arxiv.org/abs/2211.14646v3","updated":"2023-10-29T22:11:33Z","published":"2022-11-26T19:31:49Z","title":"Towards Improved Input Masking for Convolutional Neural Networks","summary":" The ability to remove features from the input of machine learning models is\nvery important to understand and interpret model predictions. However, this is\nnon-trivial for vision models since masking out parts of the input image\ntypically causes large distribution shifts. This is because the baseline color\nused for masking (typically grey or black) is out of distribution. Furthermore,\nthe shape of the mask itself can contain unwanted signals which can be used by\nthe model for its predictions. Recently, there has been some progress in\nmitigating this issue (called missingness bias) in image masking for vision\ntransformers. In this work, we propose a new masking method for CNNs we call\nlayer masking in which the missingness bias caused by masking is reduced to a\nlarge extent. Intuitively, layer masking applies a mask to intermediate\nactivation maps so that the model only processes the unmasked input. We show\nthat our method (i) is able to eliminate or minimize the influence of the mask\nshape or color on the output of the model, and (ii) is much better than\nreplacing the masked region by black or grey for input perturbation based\ninterpretability techniques like LIME. Thus, layer masking is much less\naffected by missingness bias than other masking strategies. 
We also demonstrate\nhow the shape of the mask may leak information about the class, thus affecting\nestimates of model reliance on class-relevant features derived from input\nmasking. Furthermore, we discuss the role of data augmentation techniques for\ntackling this problem, and argue that they are not sufficient for preventing\nmodel reliance on mask shape. The code for this project is publicly available\nat https://github.com/SriramB-98/layer_masking\n","authors":["Sriram Balasubramanian","Soheil Feizi"],"pdf_url":"https://arxiv.org/pdf/2211.14646v3.pdf","comment":"29 pages, 19 figures. Accepted at ICCV 2023"},{"id":"http://arxiv.org/abs/2310.19168v1","updated":"2023-10-29T22:08:00Z","published":"2023-10-29T22:08:00Z","title":"BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species\n Classification and Mapping","summary":" We propose a metadata-aware self-supervised learning (SSL) framework useful\nfor fine-grained classification and ecological mapping of bird species around\nthe world. Our framework unifies two SSL strategies: Contrastive Learning (CL)\nand Masked Image Modeling (MIM), while also enriching the embedding space with\nmetadata available with ground-level imagery of birds. We separately train\nuni-modal and cross-modal ViT on a novel cross-view global bird species dataset\ncontaining ground-level imagery, metadata (location, time), and corresponding\nsatellite imagery. We demonstrate that our models learn fine-grained and\ngeographically conditioned features of birds, by evaluating on two downstream\ntasks: fine-grained visual classification (FGVC) and cross-modal retrieval.\nPre-trained models learned using our framework achieve SotA performance on FGVC\nof iNAT-2021 birds and in transfer learning settings for CUB-200-2011 and\nNABirds datasets. Moreover, the impressive cross-modal retrieval performance of\nour model enables the creation of species distribution maps across any\ngeographic region. 
The dataset and source code will be released at\nhttps://github.com/mvrl/BirdSAT.\n","authors":["Srikumar Sastry","Subash Khanal","Aayush Dhakal","Di Huang","Nathan Jacobs"],"pdf_url":"https://arxiv.org/pdf/2310.19168v1.pdf","comment":"Accepted at WACV 2024"},{"id":"http://arxiv.org/abs/2310.19145v1","updated":"2023-10-29T20:39:11Z","published":"2023-10-29T20:39:11Z","title":"Learning to Follow Object-Centric Image Editing Instructions Faithfully","summary":" Natural language instructions are a powerful interface for editing the\noutputs of text-to-image diffusion models. However, several challenges need to\nbe addressed: 1) underspecification (the need to model the implicit meaning of\ninstructions), 2) grounding (the need to localize where the edit has to be\nperformed), 3) faithfulness (the need to preserve the elements of the image not\naffected by the edit instruction). Current approaches focusing on image editing\nwith natural language instructions rely on automatically generated paired data,\nwhich, as shown in our investigation, is noisy and sometimes nonsensical,\nexacerbating the above issues. Building on recent advances in segmentation,\nChain-of-Thought prompting, and visual question answering, we significantly\nimprove the quality of the paired data. In addition, we enhance the supervision\nsignal by highlighting parts of the image that need to be changed by the\ninstruction. The model fine-tuned on the improved data is capable of performing\nfine-grained object-centric edits better than state-of-the-art baselines,\nmitigating the problems outlined above, as shown by automatic and human\nevaluations. 
Moreover, our model is capable of generalizing to domains unseen\nduring training, such as visual metaphors.\n","authors":["Tuhin Chakrabarty","Kanishk Singh","Arkadiy Saakyan","Smaranda Muresan"],"pdf_url":"https://arxiv.org/pdf/2310.19145v1.pdf","comment":"Findings of EMNLP 2023 (Long paper)"},{"id":"http://arxiv.org/abs/2211.15788v3","updated":"2023-10-29T20:24:10Z","published":"2022-11-28T21:53:05Z","title":"A Visual Active Search Framework for Geospatial Exploration","summary":" Many problems can be viewed as forms of geospatial search aided by aerial\nimagery, with examples ranging from detecting poaching activity to human\ntrafficking. We model this class of problems in a visual active search (VAS)\nframework, which has three key inputs: (1) an image of the entire search area,\nwhich is subdivided into regions, (2) a local search function, which determines\nwhether a previously unseen object class is present in a given region, and (3)\na fixed search budget, which limits the number of times the local search\nfunction can be evaluated. The goal is to maximize the number of objects found\nwithin the search budget. We propose a reinforcement learning approach for VAS\nthat learns a meta-search policy from a collection of fully annotated search\ntasks. This meta-search policy is then used to dynamically search for a novel\ntarget-object class, leveraging the outcome of any previous queries to\ndetermine where to query next. Through extensive experiments on several\nlarge-scale satellite imagery datasets, we show that the proposed approach\nsignificantly outperforms several strong baselines. We also propose novel\ndomain adaptation techniques that improve the policy at decision time when\nthere is a significant domain gap with the training data. 
Code is publicly\navailable.\n","authors":["Anindya Sarkar","Michael Lanier","Scott Alfeld","Jiarui Feng","Roman Garnett","Nathan Jacobs","Yevgeniy Vorobeychik"],"pdf_url":"https://arxiv.org/pdf/2211.15788v3.pdf","comment":"Accepted to WACV 2024, 24 pages, 18 figures, Code is available at:\n https://github.com/anindyasarkarIITH/VAS"},{"id":"http://arxiv.org/abs/2301.11990v3","updated":"2023-10-29T19:45:09Z","published":"2023-01-27T21:03:19Z","title":"Alignment with human representations supports robust few-shot learning","summary":" Should we care whether AI systems have representations of the world that are\nsimilar to those of humans? We provide an information-theoretic analysis that\nsuggests that there should be a U-shaped relationship between the degree of\nrepresentational alignment with humans and performance on few-shot learning\ntasks. We confirm this prediction empirically, finding such a relationship in\nan analysis of the performance of 491 computer vision models. We also show that\nhighly-aligned models are more robust to both natural adversarial attacks and\ndomain shifts. Our results suggest that human-alignment is often a sufficient,\nbut not necessary, condition for models to make effective use of limited data,\nbe robust, and generalize well.\n","authors":["Ilia Sucholutsky","Thomas L. Griffiths"],"pdf_url":"https://arxiv.org/pdf/2301.11990v3.pdf","comment":"Spotlight at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19130v1","updated":"2023-10-29T19:39:03Z","published":"2023-10-29T19:39:03Z","title":"Women Wearing Lipstick: Measuring the Bias Between an Object and Its\n Related Gender","summary":" In this paper, we investigate the impact of objects on gender bias in image\ncaptioning systems. Our results show that only gender-specific objects have a\nstrong gender bias (e.g., women-lipstick). 
In addition, we propose a visual\nsemantic-based gender score that measures the degree of bias and can be used as\na plug-in for any image captioning system. Our experiments demonstrate the\nutility of the gender score, since we observe that our score can measure the\nbias relation between a caption and its related gender; therefore, our score\ncan be used as an additional metric to the existing Object Gender Co-Occ\napproach. Code and data are publicly available at\nhttps://github.com/ahmedssabir/GenderScore.\n","authors":["Ahmed Sabir","Lluís Padró"],"pdf_url":"https://arxiv.org/pdf/2310.19130v1.pdf","comment":"EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2208.03392v5","updated":"2023-10-29T19:32:21Z","published":"2022-08-05T21:41:15Z","title":"Federated Learning for Medical Applications: A Taxonomy, Current Trends,\n Challenges, and Future Research Directions","summary":" With the advent of the IoT, AI, ML, and DL algorithms, the landscape of\ndata-driven medical applications has emerged as a promising avenue for\ndesigning robust and scalable diagnostic and prognostic models from medical\ndata. This has gained a lot of attention from both academia and industry,\nleading to significant improvements in healthcare quality. However, the\nadoption of AI-driven medical applications still faces tough challenges,\nincluding meeting security, privacy, and quality of service (QoS) standards.\nRecent developments in federated learning (FL) have made it possible to train complex\nmachine-learned models in a distributed manner and have become an active\nresearch domain, particularly processing the medical data at the edge of the\nnetwork in a decentralized way to preserve privacy and address security\nconcerns. To this end, in this paper, we explore the present and future of FL\ntechnology in medical applications where data sharing is a significant\nchallenge. 
We delve into the current research trends and their outcomes,\nunravelling the complexities of designing reliable and scalable FL models.\nOur paper outlines the fundamental statistical issues in FL, tackles\ndevice-related problems, addresses security challenges, and navigates the\ncomplexity of privacy concerns, all while highlighting its transformative\npotential in the medical field. Our study primarily focuses on medical\napplications of FL, particularly in the context of global cancer\ndiagnosis. We highlight the potential of FL to enable computer-aided diagnosis\ntools that address this challenge with greater effectiveness than traditional\ndata-driven methods. We hope that this comprehensive review will serve as a\ncheckpoint for the field, summarizing the current state-of-the-art and\nidentifying open problems and future research directions.\n","authors":["Ashish Rauniyar","Desta Haileselassie Hagos","Debesh Jha","Jan Erik Håkegård","Ulas Bagci","Danda B. Rawat","Vladimir Vlassov"],"pdf_url":"https://arxiv.org/pdf/2208.03392v5.pdf","comment":"Accepted at IEEE Internet of Things Journal"},{"id":"http://arxiv.org/abs/2305.12577v3","updated":"2023-10-29T19:27:38Z","published":"2023-05-21T21:54:31Z","title":"Guided Motion Diffusion for Controllable Human Motion Synthesis","summary":" Denoising diffusion models have shown great promise in human motion synthesis\nconditioned on natural language descriptions. However, integrating spatial\nconstraints, such as pre-defined motion trajectories and obstacles, remains a\nchallenge despite being essential for bridging the gap between isolated human\nmotion and its surrounding environment. To address this issue, we propose\nGuided Motion Diffusion (GMD), a method that incorporates spatial constraints\ninto the motion generation process. Specifically, we propose an effective\nfeature projection scheme that manipulates motion representation to enhance the\ncoherency between spatial information and local poses. 
Together with a new\nimputation formulation, the generated motion can reliably conform to spatial\nconstraints such as global motion trajectories. Furthermore, given sparse\nspatial constraints (e.g. sparse keyframes), we introduce a new dense guidance\napproach to turn a sparse signal, which is susceptible to being ignored during\nthe reverse steps, into denser signals to guide the generated motion to the\ngiven constraints. Our extensive experiments justify the development of GMD,\nwhich achieves a significant improvement over state-of-the-art methods in\ntext-based motion generation while allowing control of the synthesized motions\nwith spatial constraints.\n","authors":["Korrawe Karunratanakul","Konpat Preechakul","Supasorn Suwajanakorn","Siyu Tang"],"pdf_url":"https://arxiv.org/pdf/2305.12577v3.pdf","comment":"ICCV23. Project page: https://korrawe.github.io/gmd-project/"},{"id":"http://arxiv.org/abs/2306.10608v3","updated":"2023-10-29T19:24:47Z","published":"2023-06-18T17:55:02Z","title":"STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced\n Audio-Visual Diarization","summary":" This report introduces our novel method named STHG for the Audio-Visual\nDiarization task of the Ego4D Challenge 2023. Our key innovation is that we\nmodel all the speakers in a video using a single, unified heterogeneous graph\nlearning framework. Unlike previous approaches that require a separate\ncomponent solely for the camera wearer, STHG can jointly detect the speech\nactivities of all people including the camera wearer. Our final method obtains\n61.1% DER on the test set of Ego4D, which significantly outperforms all the\nbaselines as well as last year's winner. Our submission achieved 1st place in\nthe Ego4D Challenge 2023. 
We additionally demonstrate that applying the\noff-the-shelf speech recognition system to the diarized speech segments by STHG\nproduces a competitive performance on the Speech Transcription task of this\nchallenge.\n","authors":["Kyle Min"],"pdf_url":"https://arxiv.org/pdf/2306.10608v3.pdf","comment":"Validation report for the Ego4D challenge at CVPR 2023"},{"id":"http://arxiv.org/abs/2210.07764v3","updated":"2023-10-29T19:22:11Z","published":"2022-10-14T12:54:03Z","title":"Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual\n Diarization","summary":" This report describes our approach for the Audio-Visual Diarization (AVD)\ntask of the Ego4D Challenge 2022. Specifically, we present multiple technical\nimprovements over the official baselines. First, we improve the detection\nperformance of the camera wearer's voice activity by modifying the training\nscheme of its model. Second, we discover that an off-the-shelf voice activity\ndetection model can effectively remove false positives when it is applied\nsolely to the camera wearer's voice activities. Lastly, we show that better\nactive speaker detection leads to a better AVD outcome. Our final method\nobtains 65.9% DER on the test set of Ego4D, which significantly outperforms all\nthe baselines. Our submission achieved 1st place in the Ego4D Challenge 2022.\n","authors":["Kyle Min"],"pdf_url":"https://arxiv.org/pdf/2210.07764v3.pdf","comment":"Validation report for the Ego4D challenge at ECCV 2022"},{"id":"http://arxiv.org/abs/2310.19119v1","updated":"2023-10-29T19:10:52Z","published":"2023-10-29T19:10:52Z","title":"Out-of-distribution Object Detection through Bayesian Uncertainty\n Estimation","summary":" The superior performance of object detectors is often established under the\ncondition that the test samples are in the same distribution as the training\ndata. 
However, in many practical applications, out-of-distribution (OOD)\ninstances are inevitable and usually lead to uncertainty in the results. In\nthis paper, we propose a novel, intuitive, and scalable probabilistic object\ndetection method for OOD detection. Unlike other uncertainty-modeling methods\nthat either require huge computational costs to infer the weight distributions\nor rely on model training through synthetic outlier data, our method is able to\ndistinguish between in-distribution (ID) data and OOD data via weight parameter\nsampling from proposed Gaussian distributions based on pre-trained networks. We\ndemonstrate that our Bayesian object detector can achieve satisfactory OOD\nidentification performance by reducing the FPR95 score by up to 8.19% and\nincreasing the AUROC score by up to 13.94% when trained on BDD100k and VOC\ndatasets as the ID datasets and evaluated on COCO2017 dataset as the OOD\ndataset.\n","authors":["Tianhao Zhang","Shenglin Wang","Nidhal Bouaynaya","Radu Calinescu","Lyudmila Mihaylova"],"pdf_url":"https://arxiv.org/pdf/2310.19119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19113v1","updated":"2023-10-29T19:01:20Z","published":"2023-10-29T19:01:20Z","title":"Dynamic V2X Autonomous Perception from Road-to-Vehicle Vision","summary":" Vehicle-to-everything (V2X) perception is an innovative technology that\nenhances vehicle perception accuracy, thereby elevating the security and\nreliability of autonomous systems. However, existing V2X perception methods\nfocus on static scenes from mainly vehicle-based vision, which is constrained\nby sensor capabilities and communication loads. To adapt V2X perception models\nto dynamic scenes, we propose to build V2X perception from road-to-vehicle\nvision and present the Adaptive Road-to-Vehicle Perception (AR2VP) method. In\nAR2VP, we leverage roadside units to offer stable, wide-range sensing\ncapabilities and serve as communication hubs. 
AR2VP is devised to tackle both\nintra-scene and inter-scene changes. For the former, we construct a dynamic\nperception representing module, which efficiently integrates vehicle\nperceptions, enabling vehicles to capture a more comprehensive range of dynamic\nfactors within the scene. Moreover, we introduce a road-to-vehicle perception\ncompensating module, aimed at preserving the maximized roadside unit perception\ninformation in the presence of intra-scene changes. For inter-scene changes, we\nimplement an experience replay mechanism leveraging the roadside unit's storage\ncapacity to retain a subset of historical scene data, maintaining model\nrobustness in response to inter-scene shifts. We conduct perception experiments\non 3D object detection and segmentation, and the results show that AR2VP excels\nin both performance-bandwidth trade-offs and adaptability within dynamic\nenvironments.\n","authors":["Jiayao Tan","Fan Lyu","Linyan Li","Fuyuan Hu","Tingliang Feng","Fenglei Xu","Rui Yao"],"pdf_url":"https://arxiv.org/pdf/2310.19113v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19112v1","updated":"2023-10-29T18:57:15Z","published":"2023-10-29T18:57:15Z","title":"Efficient IoT Inference via Context-Awareness","summary":" While existing strategies for optimizing deep learning-based classification\nmodels on low-power platforms assume the models are trained on all classes of\ninterest, this paper posits that adopting context-awareness, i.e., focusing\nsolely on the likely classes in the current context, can substantially enhance\nperformance in resource-constrained environments. We propose a new paradigm,\nCACTUS, for scalable and efficient context-aware classification where a\nmicro-classifier recognizes a small set of classes relevant to the current\ncontext and, when context change happens, rapidly switches to another suitable\nmicro-classifier. 
CACTUS has several innovations including optimizing the\ntraining cost of context-aware classifiers, enabling on-the-fly context-aware\nswitching between classifiers, and selecting the best context-aware classifiers\ngiven limited resources. We show that CACTUS achieves significant benefits in\naccuracy, latency, and compute budget across a range of datasets and IoT\nplatforms.\n","authors":["Mohammad Mehdi Rastikerdar","Jin Huang","Shiwei Fang","Hui Guan","Deepak Ganesan"],"pdf_url":"https://arxiv.org/pdf/2310.19112v1.pdf","comment":"12 pages, 10 figures"},{"id":"http://arxiv.org/abs/2310.19109v1","updated":"2023-10-29T18:46:33Z","published":"2023-10-29T18:46:33Z","title":"Dynamic Task and Weight Prioritization Curriculum Learning for\n Multimodal Imagery","summary":" This paper explores post-disaster analytics using multimodal deep learning\nmodels trained with curriculum learning method. Studying post-disaster\nanalytics is important as it plays a crucial role in mitigating the impact of\ndisasters by providing timely and accurate insights into the extent of damage\nand the allocation of resources. We propose a curriculum learning strategy to\nenhance the performance of multimodal deep learning models. Curriculum learning\nemulates the progressive learning sequence in human education by training deep\nlearning models on increasingly complex data. Our primary objective is to\ndevelop a curriculum-trained multimodal deep learning model, with a particular\nfocus on visual question answering (VQA) capable of jointly processing image\nand text data, in conjunction with semantic segmentation for disaster analytics\nusing the\nFloodNet\\footnote{https://github.com/BinaLab/FloodNet-Challenge-EARTHVISION2021}\ndataset. To achieve this, U-Net model is used for semantic segmentation and\nimage encoding. A custom built text classifier is used for visual question\nanswering. Existing curriculum learning methods rely on manually defined\ndifficulty functions. 
We introduce a novel curriculum learning approach termed\nDynamic Task and Weight Prioritization (DATWEP), which leverages a\ngradient-based method to automatically decide task difficulty during curriculum\nlearning training, thereby eliminating the need for explicit difficulty\ncomputation. The integration of DATWEP into our multimodal model shows\nimprovement on VQA performance. Source code is available at\nhttps://github.com/fualsan/DATWEP.\n","authors":["Huseyin Fuat Alsan","Taner Arsan"],"pdf_url":"https://arxiv.org/pdf/2310.19109v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.03329v2","updated":"2023-10-29T18:28:10Z","published":"2023-09-06T19:19:12Z","title":"MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary\n Polyp Segmentation","summary":" Efficient polyp segmentation in healthcare plays a critical role in enabling\nearly diagnosis of colorectal cancer. However, the segmentation of polyps\npresents numerous challenges, including the intricate distribution of\nbackgrounds, variations in polyp sizes and shapes, and indistinct boundaries.\nDefining the boundary between the foreground (i.e. polyp itself) and the\nbackground (surrounding tissue) is difficult. To mitigate these challenges, we\npropose Multi-Scale Edge-Guided Attention Network (MEGANet) tailored\nspecifically for polyp segmentation within colonoscopy images. This network\ndraws inspiration from the fusion of a classical edge detection technique with\nan attention mechanism. By combining these techniques, MEGANet effectively\npreserves high-frequency information, notably edges and boundaries, which tend\nto erode as neural networks deepen. 
MEGANet is designed as an end-to-end\nframework, encompassing three key modules: an encoder, which is responsible for\ncapturing and abstracting the features from the input image, a decoder, which\nfocuses on salient features, and the Edge-Guided Attention module (EGA) that\nemploys the Laplacian Operator to accentuate polyp boundaries. Extensive\nexperiments, both qualitative and quantitative, on five benchmark datasets,\ndemonstrate that our MEGANet outperforms other existing SOTA methods under six\nevaluation metrics. Our code is available at\n\\url{https://github.com/UARK-AICV/MEGANet}.\n","authors":["Nhat-Tan Bui","Dinh-Hieu Hoang","Quang-Thuc Nguyen","Minh-Triet Tran","Ngan Le"],"pdf_url":"https://arxiv.org/pdf/2309.03329v2.pdf","comment":"Accepted at the IEEE/CVF Winter Conference on Applications of\n Computer Vision (WACV 2024)"},{"id":"http://arxiv.org/abs/2207.03444v2","updated":"2023-10-29T18:12:25Z","published":"2022-07-07T17:20:15Z","title":"Fairness and Bias in Robot Learning","summary":" Machine learning has significantly enhanced the abilities of robots, enabling\nthem to perform a wide range of tasks in human environments and adapt to our\nuncertain real world. Recent works in various machine learning domains have\nhighlighted the importance of accounting for fairness to ensure that these\nalgorithms do not reproduce human biases and consequently lead to\ndiscriminatory outcomes. With robot learning systems increasingly performing\nmore and more tasks in our everyday lives, it is crucial to understand the\ninfluence of such biases to prevent unintended behavior toward certain groups\nof people. In this work, we present the first survey on fairness in robot\nlearning from an interdisciplinary perspective spanning technical, ethical, and\nlegal challenges. We propose a taxonomy for sources of bias and the resulting\ntypes of discrimination due to them. 
Using examples from different robot\nlearning domains, we examine scenarios of unfair outcomes and strategies to\nmitigate them. We present early advances in the field by covering different\nfairness definitions, ethical and legal considerations, and methods for fair\nrobot learning. With this work, we aim to pave the road for groundbreaking\ndevelopments in fair robot learning.\n","authors":["Laura Londoño","Juana Valeria Hurtado","Nora Hertz","Philipp Kellmeyer","Silja Voeneky","Abhinav Valada"],"pdf_url":"https://arxiv.org/pdf/2207.03444v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.06773v2","updated":"2023-10-29T17:22:56Z","published":"2023-05-11T12:54:10Z","title":"Towards a Better Understanding of the Computer Vision Research Community\n in Africa","summary":" Computer vision is a broad field of study that encompasses different tasks\n(e.g., object detection). Although computer vision is relevant to the African\ncommunities in various applications, yet computer vision research is\nunder-explored in the continent and constructs only 0.06% of top-tier\npublications in the last ten years. In this paper, our goal is to have a better\nunderstanding of the computer vision research conducted in Africa and provide\npointers on whether there is equity in research or not. We do this through an\nempirical analysis of the African computer vision publications that are Scopus\nindexed, where we collect around 63,000 publications over the period 2012-2022.\nWe first study the opportunities available for African institutions to publish\nin top-tier computer vision venues. We show that African publishing trends in\ntop-tier venues over the years do not exhibit consistent growth, unlike other\ncontinents such as North America or Asia. Moreover, we study all computer\nvision publications beyond top-tier venues in different African regions to find\nthat mainly Northern and Southern Africa are publishing in computer vision with\n68.5% and 15.9% of publications, resp. 
Nonetheless, we highlight that both\nEastern and Western Africa are exhibiting a promising increase with the last\ntwo years closing the gap with Southern Africa. Additionally, we study the\ncollaboration patterns in these publications to find that most of these exhibit\ninternational collaborations rather than African ones. We also show that most\nof these publications include an African author that is a key contributor as\nthe first or last author. Finally, we present the most recurring keywords in\ncomputer vision publications per African region.\n","authors":["Abdul-Hakeem Omotayo","Mai Gamal","Eman Ehab","Gbetondji Dovonon","Zainab Akinjobi","Ismaila Lukman","Houcemeddine Turki","Mahmod Abdien","Idriss Tondji","Abigail Oppong","Yvan Pimi","Karim Gamal"," Ro'ya-CV4Africa","Mennatullah Siam"],"pdf_url":"https://arxiv.org/pdf/2305.06773v2.pdf","comment":"Published in EAAMO'23 under ACM License. This work is part of our\n African computer vision grassroots research in Ro'ya - CV4Africa,\n https://ro-ya-cv4africa.github.io/homepage/"},{"id":"http://arxiv.org/abs/2310.19080v1","updated":"2023-10-29T17:03:12Z","published":"2023-10-29T17:03:12Z","title":"Reward Finetuning for Faster and More Accurate Unsupervised Object\n Discovery","summary":" Recent advances in machine learning have shown that Reinforcement Learning\nfrom Human Feedback (RLHF) can improve machine learning models and align them\nwith human preferences. Although very successful for Large Language Models\n(LLMs), these advancements have not had a comparable impact in research for\nautonomous vehicles -- where alignment with human expectations can be\nimperative. In this paper, we propose to adapt similar RL-based methods to\nunsupervised object discovery, i.e. learning to detect objects from LiDAR\npoints without any training labels. Instead of labels, we use simple heuristics\nto mimic human feedback. 
More explicitly, we combine multiple heuristics into a\nsimple reward function that positively correlates its score with bounding box\naccuracy, i.e., boxes containing objects are scored higher than those without.\nWe start from the detector's own predictions to explore the space and reinforce\nboxes with high rewards through gradient updates. Empirically, we demonstrate\nthat our approach is not only more accurate, but also orders of magnitude\nfaster to train compared to prior works on object discovery.\n","authors":["Katie Z Luo","Zhenzhen Liu","Xiangyu Chen","Yurong You","Sagie Benaim","Cheng Perng Phoo","Mark Campbell","Wen Sun","Bharath Hariharan","Kilian Q. Weinberger"],"pdf_url":"https://arxiv.org/pdf/2310.19080v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19075v1","updated":"2023-10-29T16:58:31Z","published":"2023-10-29T16:58:31Z","title":"Bespoke Solvers for Generative Flow Models","summary":" Diffusion or flow-based models are powerful generative paradigms that are\nnotoriously hard to sample as samples are defined as solutions to\nhigh-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs)\nwhich require a large Number of Function Evaluations (NFE) to approximate well.\nExisting methods to alleviate the costly sampling process include model\ndistillation and designing dedicated ODE solvers. However, distillation is\ncostly to train and sometimes can deteriorate quality, while dedicated solvers\nstill require relatively large NFE to produce high quality samples. In this\npaper we introduce \"Bespoke solvers\", a novel framework for constructing custom\nODE solvers tailored to the ODE of a given pre-trained flow model. Our approach\noptimizes an order consistent and parameter-efficient solver (e.g., with 80\nlearnable parameters), is trained for roughly 1% of the GPU time required for\ntraining the pre-trained model, and significantly improves approximation and\ngeneration quality compared to dedicated solvers. 
For example, a Bespoke solver\nfor a CIFAR10 model produces samples with Fr\\'echet Inception Distance (FID) of\n2.73 with 10 NFE, and gets to 1% of the Ground Truth (GT) FID (2.59) for this\nmodel with only 20 NFE. On the more challenging ImageNet-64$\\times$64, Bespoke\nsamples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20\nNFE.\n","authors":["Neta Shaul","Juan Perez","Ricky T. Q. Chen","Ali Thabet","Albert Pumarola","Yaron Lipman"],"pdf_url":"https://arxiv.org/pdf/2310.19075v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19070v1","updated":"2023-10-29T16:49:45Z","published":"2023-10-29T16:49:45Z","title":"Myriad: Large Multimodal Model by Applying Vision Experts for Industrial\n Anomaly Detection","summary":" Existing industrial anomaly detection (IAD) methods predict anomaly scores\nfor both anomaly detection and localization. However, they struggle to perform\na multi-turn dialog and detailed descriptions for anomaly regions, e.g., color,\nshape, and categories of industrial anomalies. Recently, large multimodal\n(i.e., vision and language) models (LMMs) have shown eminent perception\nabilities on multiple vision tasks such as image captioning, visual\nunderstanding, visual reasoning, etc., making it a competitive potential choice\nfor more comprehensible anomaly detection. However, the knowledge about anomaly\ndetection is absent in existing general LMMs, while training a specific LMM for\nanomaly detection requires a tremendous amount of annotated data and massive\ncomputation resources. In this paper, we propose a novel large multi-modal\nmodel by applying vision experts for industrial anomaly detection (dubbed\nMyriad), which leads to definite anomaly detection and high-quality anomaly\ndescription. Specifically, we adopt MiniGPT-4 as the base LMM and design an\nExpert Perception module to embed the prior knowledge from vision experts as\ntokens which are intelligible to Large Language Models (LLMs). 
To compensate\nfor the errors and confusions of vision experts, we introduce a domain adapter\nto bridge the visual representation gaps between generic and industrial images.\nFurthermore, we propose a Vision Expert Instructor, which enables the Q-Former\nto generate IAD domain vision-language tokens according to vision expert prior.\nExtensive experiments on MVTec-AD and VisA benchmarks demonstrate that our\nproposed method not only performs favorably against state-of-the-art methods\nunder the 1-class and few-shot settings, but also provides definite anomaly\nprediction along with detailed descriptions in IAD domain.\n","authors":["Yuanze Li","Haolin Wang","Shihao Yuan","Ming Liu","Yiwen Guo","Chen Xu","Guangming Shi","Wangmeng Zuo"],"pdf_url":"https://arxiv.org/pdf/2310.19070v1.pdf","comment":"8 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.19061v1","updated":"2023-10-29T16:26:28Z","published":"2023-10-29T16:26:28Z","title":"Multimodal ChatGPT for Medical Applications: an Experimental Study of\n GPT-4V","summary":" In this paper, we critically evaluate the capabilities of the\nstate-of-the-art multimodal large language model, i.e., GPT-4 with Vision\n(GPT-4V), on the Visual Question Answering (VQA) task. Our experiments thoroughly\nassess GPT-4V's proficiency in answering questions paired with images using\nboth pathology and radiology datasets from 11 modalities (e.g. Microscopy,\nDermoscopy, X-ray, CT, etc.) and fifteen objects of interest (brain, liver,\nlung, etc.). Our datasets encompass a comprehensive range of medical inquiries,\nincluding sixteen distinct question types. Throughout our evaluations, we\ndevised textual prompts for GPT-4V, directing it to synergize visual and\ntextual information. The experiments with accuracy score conclude that the\ncurrent version of GPT-4V is not recommended for real-world diagnostics due to\nits unreliable and suboptimal accuracy in responding to diagnostic medical\nquestions. 
In addition, we delineate seven unique facets of GPT-4V's behavior\nin medical VQA, highlighting its constraints within this complex arena. The\ncomplete details of our evaluation cases are accessible at\nhttps://github.com/ZhilingYan/GPT4V-Medical-Report.\n","authors":["Zhiling Yan","Kai Zhang","Rong Zhou","Lifang He","Xiang Li","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2310.19061v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.08036v3","updated":"2023-10-29T16:25:32Z","published":"2023-09-14T21:54:23Z","title":"BEA: Revisiting anchor-based object detection DNN using Budding Ensemble\n Architecture","summary":" This paper introduces the Budding Ensemble Architecture (BEA), a novel\nreduced ensemble architecture for anchor-based object detection models. Object\ndetection models are crucial in vision-based tasks, particularly in autonomous\nsystems. They should provide precise bounding box detections while also\ncalibrating their predicted confidence scores, leading to higher-quality\nuncertainty estimates. However, current models may make erroneous decisions due\nto false positives receiving high scores or true positives being discarded due\nto low scores. BEA aims to address these issues. The proposed loss functions in\nBEA improve the confidence score calibration and lower the uncertainty error,\nwhich results in a better distinction of true and false positives and,\neventually, higher accuracy of the object detection models. Both Base-YOLOv3\nand SSD models were enhanced using the BEA method and its proposed loss\nfunctions. The BEA on Base-YOLOv3 trained on the KITTI dataset results in a 6%\nand 3.7% increase in mAP and AP50, respectively. Utilizing a well-balanced\nuncertainty estimation threshold to discard samples in real-time even leads to\na 9.6% higher AP50 than its base model. This is attributed to a 40% increase in\nthe area under the AP50-based retention curve used to measure the quality of\ncalibration of confidence scores. 
Furthermore, BEA-YOLOV3 trained on KITTI\nprovides superior out-of-distribution detection on Citypersons, BDD100K, and\nCOCO datasets compared to the ensembles and vanilla models of YOLOv3 and\nGaussian-YOLOv3.\n","authors":["Syed Sha Qutub","Neslihan Kose","Rafael Rosales","Michael Paulitsch","Korbinian Hagn","Florian Geissler","Yang Peng","Gereon Hinz","Alois Knoll"],"pdf_url":"https://arxiv.org/pdf/2309.08036v3.pdf","comment":"14 pages, 5 pages supplementary material. Accepted at BMVC-2023"},{"id":"http://arxiv.org/abs/2310.19060v1","updated":"2023-10-29T16:25:32Z","published":"2023-10-29T16:25:32Z","title":"TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language\n Understanding","summary":" Large-scale video-language pre-training has made remarkable strides in\nadvancing video-language understanding tasks. However, the heavy computational\nburden of video encoding remains a formidable efficiency bottleneck,\nparticularly for long-form videos. These videos contain massive visual tokens\ndue to their inherent 3D properties and spatiotemporal redundancy, making it\nchallenging to capture complex temporal and spatial relationships. To tackle\nthis issue, we propose an efficient method called TEmporal-Spatial Token\nAggregation (TESTA). TESTA condenses video semantics by adaptively aggregating\nsimilar frames, as well as similar patches within each frame. TESTA can reduce\nthe number of visual tokens by 75% and thus accelerate video encoding. Building\nupon TESTA, we introduce a pre-trained video-language model equipped with a\ndivided space-time token aggregation module in each video encoder block. We\nevaluate our model on five datasets for paragraph-to-video retrieval and\nlong-form VideoQA tasks. 
Experimental results show that TESTA improves\ncomputing efficiency by 1.7 times, and achieves significant performance gains\nfrom its scalability in processing longer input frames, e.g., +13.7 R@1 on\nQuerYD and +6.5 R@1 on Condensed Movie.\n","authors":["Shuhuai Ren","Sishuo Chen","Shicheng Li","Xu Sun","Lu Hou"],"pdf_url":"https://arxiv.org/pdf/2310.19060v1.pdf","comment":"16 pages, 9 figures, code is available at\n https://github.com/RenShuhuai-Andy/TESTA"},{"id":"http://arxiv.org/abs/2305.16322v3","updated":"2023-10-29T15:59:24Z","published":"2023-05-25T17:59:58Z","title":"Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models","summary":" Text-to-Image diffusion models have made tremendous progress over the past\ntwo years, enabling the generation of highly realistic images based on\nopen-domain text descriptions. However, despite their success, text\ndescriptions often struggle to adequately convey detailed controls, even when\ncomposed of long and complex texts. Moreover, recent studies have also shown\nthat these models face challenges in understanding such complex texts and\ngenerating the corresponding images. Therefore, there is a growing need to\nenable more control modes beyond text description. In this paper, we introduce\nUni-ControlNet, a unified framework that allows for the simultaneous\nutilization of different local controls (e.g., edge maps, depth map,\nsegmentation masks) and global controls (e.g., CLIP image embeddings) in a\nflexible and composable manner within one single model. Unlike existing\nmethods, Uni-ControlNet only requires the fine-tuning of two additional\nadapters upon frozen pre-trained text-to-image diffusion models, eliminating\nthe huge cost of training from scratch. Moreover, thanks to some dedicated\nadapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2)\nof adapters, regardless of the number of local or global controls used. 
This\nnot only reduces the fine-tuning costs and model size, making it more suitable\nfor real-world deployment, but also facilitates composability of different\nconditions. Through both quantitative and qualitative comparisons,\nUni-ControlNet demonstrates its superiority over existing methods in terms of\ncontrollability, generation quality and composability. Code is available at\n\\url{https://github.com/ShihaoZhaoZSH/Uni-ControlNet}.\n","authors":["Shihao Zhao","Dongdong Chen","Yen-Chun Chen","Jianmin Bao","Shaozhe Hao","Lu Yuan","Kwan-Yee K. Wong"],"pdf_url":"https://arxiv.org/pdf/2305.16322v3.pdf","comment":"Camera Ready, Code is available at\n https://github.com/ShihaoZhaoZSH/Uni-ControlNet"},{"id":"http://arxiv.org/abs/2303.17531v2","updated":"2023-10-29T15:59:00Z","published":"2023-03-30T16:53:07Z","title":"Asymmetric Image Retrieval with Cross Model Compatible Ensembles","summary":" The asymmetrical retrieval setting is a well suited solution for resource\nconstrained applications such as face recognition and image retrieval. In this\nsetting, a large model is used for indexing the gallery while a lightweight\nmodel is used for querying. The key principle in such systems is ensuring that\nboth models share the same embedding space. Most methods in this domain are\nbased on knowledge distillation. While useful, they suffer from several\ndrawbacks: they are upper-bounded by the performance of the single best model\nfound and cannot be extended to use an ensemble of models in a straightforward\nmanner. In this paper we present an approach that does not rely on knowledge\ndistillation, rather it utilizes embedding transformation models. This allows\nthe use of N independently trained and diverse gallery models (e.g., trained on\ndifferent datasets or having a different architecture) and a single query\nmodel. As a result, we improve the overall accuracy beyond that of any single\nmodel while maintaining a low computational budget for querying. 
Additionally,\nwe propose a gallery image rejection method that utilizes the diversity between\nmultiple transformed embeddings to estimate the uncertainty of gallery images.\n","authors":["Ori Linial","Alon Shoshan","Nadav Bhonker","Elad Hirsch","Lior Zamir","Igor Kviatkovsky","Gerard Medioni"],"pdf_url":"https://arxiv.org/pdf/2303.17531v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15161v2","updated":"2023-10-29T15:45:38Z","published":"2023-10-23T17:57:36Z","title":"SAM-Med3D","summary":" Although the Segment Anything Model (SAM) has demonstrated impressive\nperformance in 2D natural image segmentation, its application to 3D volumetric\nmedical images reveals significant shortcomings, namely suboptimal performance\nand unstable prediction, necessitating an excessive number of prompt points to\nattain the desired outcomes. These issues can hardly be addressed by\nfine-tuning SAM on medical data because the original 2D structure of SAM\nneglects 3D spatial information. In this paper, we introduce SAM-Med3D, the\nmost comprehensive study to modify SAM for 3D medical images. Our approach is\ncharacterized by its comprehensiveness in two primary aspects: firstly, by\ncomprehensively reformulating SAM to a thorough 3D architecture trained on a\ncomprehensively processed large-scale volumetric medical dataset; and secondly,\nby providing a comprehensive evaluation of its performance. Specifically, we\ntrain SAM-Med3D with over 131K 3D masks and 247 categories. Our SAM-Med3D\nexcels at capturing 3D spatial information, exhibiting competitive performance\nwith significantly fewer prompt points than the top-performing fine-tuned SAM\nin the medical domain. We then evaluate its capabilities across 15 datasets and\nanalyze it from multiple perspectives, including anatomical structures,\nmodalities, targets, and generalization abilities. 
Our approach, compared with\nSAM, showcases pronouncedly enhanced efficiency and broad segmentation\ncapabilities for 3D volumetric medical images. Our code is released at\nhttps://github.com/uni-medical/SAM-Med3D.\n","authors":["Haoyu Wang","Sizheng Guo","Jin Ye","Zhongying Deng","Junlong Cheng","Tianbin Li","Jianpin Chen","Yanzhou Su","Ziyan Huang","Yiqing Shen","Bin Fu","Shaoting Zhang","Junjun He","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2310.15161v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19038v1","updated":"2023-10-29T15:05:39Z","published":"2023-10-29T15:05:39Z","title":"Boosting Decision-Based Black-Box Adversarial Attack with Gradient\n Priors","summary":" Decision-based methods have shown to be effective in black-box adversarial\nattacks, as they can obtain satisfactory performance and only require to access\nthe final model prediction. Gradient estimation is a critical step in black-box\nadversarial attacks, as it will directly affect the query efficiency. Recent\nworks have attempted to utilize gradient priors to facilitate score-based\nmethods to obtain better results. 
However, these gradient priors still suffer\nfrom the edge gradient discrepancy issue and the successive iteration gradient\ndirection issue, thus are difficult to simply extend to decision-based methods.\nIn this paper, we propose a novel Decision-based Black-box Attack framework\nwith Gradient Priors (DBA-GP), which seamlessly integrates the data-dependent\ngradient prior and time-dependent prior into the gradient estimation procedure.\nFirst, by leveraging the joint bilateral filter to deal with each random\nperturbation, DBA-GP can guarantee that the generated perturbations in edge\nlocations are hardly smoothed, i.e., alleviating the edge gradient discrepancy,\nthus remaining the characteristics of the original image as much as possible.\nSecond, by utilizing a new gradient updating strategy to automatically adjust\nthe successive iteration gradient direction, DBA-GP can accelerate the\nconvergence speed, thus improving the query efficiency. Extensive experiments\nhave demonstrated that the proposed method outperforms other strong baselines\nsignificantly.\n","authors":["Han Liu","Xingshuo Huang","Xiaotong Zhang","Qimai Li","Fenglong Ma","Wei Wang","Hongyang Chen","Hong Yu","Xianchao Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.19038v1.pdf","comment":"Accepted by IJCAI 2023"},{"id":"http://arxiv.org/abs/2307.07288v2","updated":"2023-10-29T14:48:41Z","published":"2023-07-14T11:59:47Z","title":"Implicit Neural Feature Fusion Function for Multispectral and\n Hyperspectral Image Fusion","summary":" Multispectral and Hyperspectral Image Fusion (MHIF) is a practical task that\naims to fuse a high-resolution multispectral image (HR-MSI) and a\nlow-resolution hyperspectral image (LR-HSI) of the same scene to obtain a\nhigh-resolution hyperspectral image (HR-HSI). Benefiting from powerful\ninductive bias capability, CNN-based methods have achieved great success in the\nMHIF task. 
However, they lack certain interpretability and require convolution\nstructures be stacked to enhance performance. Recently, Implicit Neural\nRepresentation (INR) has achieved good performance and interpretability in 2D\ntasks due to its ability to locally interpolate samples and utilize multimodal\ncontent such as pixels and coordinates. Although INR-based approaches show\npromise, they require extra construction of high-frequency information\n(e.g., positional encoding). In this paper, inspired by previous work of\nMHIF task, we realize that HR-MSI could serve as a high-frequency detail\nauxiliary input, leading us to propose a novel INR-based hyperspectral fusion\nfunction named Implicit Neural Feature Fusion Function (INF). As an elaborate\nstructure, it solves the MHIF task and addresses deficiencies in the INR-based\napproaches. Specifically, our INF designs a Dual High-Frequency Fusion (DHFF)\nstructure that obtains high-frequency information twice from HR-MSI and LR-HSI,\nthen subtly fuses them with coordinate information. Moreover, the proposed INF\nincorporates a parameter-free method named INR with cosine similarity (INR-CS)\nthat uses cosine similarity to generate local weights through feature vectors.\nBased on INF, we construct an Implicit Neural Fusion Network (INFN) that\nachieves state-of-the-art performance for MHIF tasks of two public datasets,\ni.e., CAVE and Harvard. 
The code will soon be made available on GitHub.\n","authors":["ShangQi Deng","RuoCheng Wu","Liang-Jian Deng","Ran Ran","Gemine Vivone"],"pdf_url":"https://arxiv.org/pdf/2307.07288v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19024v1","updated":"2023-10-29T14:30:01Z","published":"2023-10-29T14:30:01Z","title":"FPGAN-Control: A Controllable Fingerprint Generator for Training with\n Synthetic Data","summary":" Training fingerprint recognition models using synthetic data has recently\ngained increased attention in the biometric community as it alleviates the\ndependency on sensitive personal data. Existing approaches for fingerprint\ngeneration are limited in their ability to generate diverse impressions of the\nsame finger, a key property for providing effective data for training\nrecognition models. To address this gap, we present FPGAN-Control, an identity\npreserving image generation framework which enables control over the\nfingerprint's image appearance (e.g., fingerprint type, acquisition device,\npressure level) of generated fingerprints. We introduce a novel appearance loss\nthat encourages disentanglement between the fingerprint's identity and\nappearance properties. In our experiments, we used the publicly available NIST\nSD302 (N2N) dataset for training the FPGAN-Control model. We demonstrate the\nmerits of FPGAN-Control, both quantitatively and qualitatively, in terms of\nidentity preservation level, degree of appearance control, and low\nsynthetic-to-real domain gap. Finally, training recognition models using only\nsynthetic datasets generated by FPGAN-Control lead to recognition accuracies\nthat are on par or even surpass models trained using real data. 
To the best of\nour knowledge, this is the first work to demonstrate this.\n","authors":["Alon Shoshan","Nadav Bhonker","Emanuel Ben Baruch","Ori Nizan","Igor Kviatkovsky","Joshua Engelsma","Manoj Aggarwal","Gerard Medioni"],"pdf_url":"https://arxiv.org/pdf/2310.19024v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.01479v3","updated":"2023-10-29T14:20:43Z","published":"2023-09-04T09:34:33Z","title":"Parameter and Computation Efficient Transfer Learning for\n Vision-Language Pre-trained Models","summary":" With ever increasing parameters and computation, vision-language pre-trained\n(VLP) models exhibit prohibitive expenditure in downstream task adaption.\nRecent endeavors mainly focus on parameter efficient transfer learning (PETL)\nfor VLP models by only updating a small number of parameters. However,\nexcessive computational overhead still plagues the application of VLPs. In this\npaper, we aim at parameter and computation efficient transfer learning (PCETL)\nfor VLP models. In particular, PCETL not only needs to limit the number of\ntrainable parameters in VLP models, but also to reduce the computational\nredundancy during inference, thus enabling a more efficient transfer. To\napproach this target, we propose a novel dynamic architecture skipping (DAS)\napproach towards effective PCETL. Instead of directly optimizing the intrinsic\narchitectures of VLP models, DAS first observes the significances of their\nmodules to downstream tasks via a reinforcement learning (RL) based process,\nand then skips the redundant ones with lightweight networks, i.e., adapters,\naccording to the obtained rewards. In this case, the VLP model can well\nmaintain the scale of trainable parameters while speeding up its inference on\ndownstream tasks. To validate DAS, we apply it to two representative VLP\nmodels, namely ViLT and METER, and conduct extensive experiments on a bunch of\nVL tasks. 
The experimental results not only show the great advantages of DAS in\nreducing computational complexity, e.g. -11.97% FLOPs of METER on VQA2.0, but\nalso confirm its competitiveness against existing PETL methods in terms of\nparameter scale and performance. Our source code is given in our appendix.\n","authors":["Qiong Wu","Wei Yu","Yiyi Zhou","Shubin Huang","Xiaoshuai Sun","Rongrong Ji"],"pdf_url":"https://arxiv.org/pdf/2309.01479v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.07056v3","updated":"2023-10-29T14:06:56Z","published":"2023-04-14T11:26:09Z","title":"Perceptual Quality Assessment of Face Video Compression: A Benchmark and\n An Effective Method","summary":" Recent years have witnessed an exponential increase in the demand for face\nvideo compression, and the success of artificial intelligence has expanded the\nboundaries beyond traditional hybrid video coding. Generative coding approaches\nhave been identified as promising alternatives with reasonable perceptual\nrate-distortion trade-offs, leveraging the statistical priors of face videos.\nHowever, the great diversity of distortion types in spatial and temporal\ndomains, ranging from the traditional hybrid coding frameworks to generative\nmodels, present grand challenges in compressed face video quality assessment\n(VQA). In this paper, we introduce the large-scale Compressed Face Video\nQuality Assessment (CFVQA) database, which is the first attempt to\nsystematically understand the perceptual quality and diversified compression\ndistortions in face videos. The database contains 3,240 compressed face video\nclips in multiple compression levels, which are derived from 135 source videos\nwith diversified content using six representative video codecs, including two\ntraditional methods based on hybrid coding frameworks, two end-to-end methods,\nand two generative methods. 
In addition, a FAce VideO IntegeRity (FAVOR) index\nfor face video compression was developed to measure the perceptual quality,\nconsidering the distinct content characteristics and temporal priors of the\nface videos. Experimental results exhibit its superior performance on the\nproposed CFVQA dataset. The benchmark is now made publicly available at:\nhttps://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment.\n","authors":["Yixuan Li","Bolin Chen","Baoliang Chen","Meng Wang","Shiqi Wang","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2304.07056v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13631v2","updated":"2023-10-29T14:04:25Z","published":"2023-06-23T17:36:44Z","title":"OpenMask3D: Open-Vocabulary 3D Instance Segmentation","summary":" We introduce the task of open-vocabulary 3D instance segmentation. Current\napproaches for 3D instance segmentation can typically only recognize object\ncategories from a pre-defined closed set of classes that are annotated in the\ntraining datasets. This results in important limitations for real-world\napplications where one might need to perform tasks guided by novel,\nopen-vocabulary queries related to a wide variety of objects. Recently,\nopen-vocabulary 3D scene understanding methods have emerged to address this\nproblem by learning queryable features for each point in the scene. While such\na representation can be directly employed to perform semantic segmentation,\nexisting methods cannot separate multiple object instances. In this work, we\naddress this limitation, and propose OpenMask3D, which is a zero-shot approach\nfor open-vocabulary 3D instance segmentation. Guided by predicted\nclass-agnostic 3D instance masks, our model aggregates per-mask features via\nmulti-view fusion of CLIP-based image embeddings. Experiments and ablation\nstudies on ScanNet200 and Replica show that OpenMask3D outperforms other\nopen-vocabulary methods, especially on the long-tail distribution. 
Qualitative\nexperiments further showcase OpenMask3D's ability to segment object properties\nbased on free-form queries describing geometry, affordances, and materials.\n","authors":["Ayça Takmaz","Elisabetta Fedele","Robert W. Sumner","Marc Pollefeys","Federico Tombari","Francis Engelmann"],"pdf_url":"https://arxiv.org/pdf/2306.13631v2.pdf","comment":"NeurIPS 2023. Project page: https://openmask3d.github.io/"},{"id":"http://arxiv.org/abs/2310.19011v1","updated":"2023-10-29T13:58:57Z","published":"2023-10-29T13:58:57Z","title":"Efficient Test-Time Adaptation for Super-Resolution with Second-Order\n Degradation and Reconstruction","summary":" Image super-resolution (SR) aims to learn a mapping from low-resolution (LR)\nto high-resolution (HR) using paired HR-LR training images. Conventional SR\nmethods typically gather the paired training data by synthesizing LR images\nfrom HR images using a predetermined degradation model, e.g., Bicubic\ndown-sampling. However, the realistic degradation type of test images may\nmismatch with the training-time degradation type due to the dynamic changes of\nthe real-world scenarios, resulting in inferior-quality SR images. To address\nthis, existing methods attempt to estimate the degradation model and train an\nimage-specific model, which, however, is quite time-consuming and impracticable\nto handle rapidly changing domain shifts. Moreover, these methods largely\nconcentrate on the estimation of one degradation type (e.g., blur degradation),\noverlooking other degradation types like noise and JPEG in real-world test-time\nscenarios, thus limiting their practicality. To tackle these problems, we\npresent an efficient test-time adaptation framework for SR, named SRTTA, which\nis able to quickly adapt SR models to test domains with different/unknown\ndegradation types. 
Specifically, we design a second-order degradation scheme to\nconstruct paired data based on the degradation type of the test image, which is\npredicted by a pre-trained degradation classifier. Then, we adapt the SR model\nby implementing feature-level reconstruction learning from the initial test\nimage to its second-order degraded counterparts, which helps the SR model\ngenerate plausible HR images. Extensive experiments are conducted on newly\nsynthesized corrupted DIV2K datasets with 8 different degradations and several\nreal-world datasets, demonstrating that our SRTTA framework achieves an\nimpressive improvement over existing methods with satisfying speed. The source\ncode is available at https://github.com/DengZeshuai/SRTTA.\n","authors":["Zeshuai Deng","Zhuokun Chen","Shuaicheng Niu","Thomas H. Li","Bohan Zhuang","Mingkui Tan"],"pdf_url":"https://arxiv.org/pdf/2310.19011v1.pdf","comment":"Accepted by 37th Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2209.15210v4","updated":"2023-10-29T13:47:13Z","published":"2022-09-30T03:40:10Z","title":"Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation","summary":" Most existing methods for unsupervised domain adaptation (UDA) rely on a\nshared network to extract domain-invariant features. However, when facing\nmultiple source domains, optimizing such a network involves updating the\nparameters of the entire network, making it both computationally expensive and\nchallenging, particularly when coupled with min-max objectives. Inspired by\nrecent advances in prompt learning that adapts high-capacity models for\ndownstream tasks in a computationally economic way, we introduce Multi-Prompt\nAlignment (MPA), a simple yet efficient framework for multi-source UDA. Given a\nsource and target domain pair, MPA first trains an individual prompt to\nminimize the domain gap through a contrastive loss. 
Then, MPA denoises the\nlearned prompts through an auto-encoding process and aligns them by maximizing\nthe agreement of all the reconstructed prompts. Moreover, we show that the\nresulting subspace acquired from the auto-encoding process can easily\ngeneralize to a streamlined set of target domains, making our method more\nefficient for practical usage. Extensive experiments show that MPA achieves\nstate-of-the-art results on three popular datasets with an impressive average\naccuracy of 54.1% on DomainNet.\n","authors":["Haoran Chen","Zuxuan Wu","Xintong Han","Yu-Gang Jiang"],"pdf_url":"https://arxiv.org/pdf/2209.15210v4.pdf","comment":"NeurIPS 2023 camera-ready version"},{"id":"http://arxiv.org/abs/2310.19001v1","updated":"2023-10-29T13:18:00Z","published":"2023-10-29T13:18:00Z","title":"Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic\n Segmentation","summary":" This paper studies the problem of weakly open-vocabulary semantic\nsegmentation (WOVSS), which learns to segment objects of arbitrary classes\nusing mere image-text pairs. Existing works turn to enhance the vanilla vision\ntransformer by introducing explicit grouping recognition, i.e., employing\nseveral group tokens/centroids to cluster the image tokens and perform the\ngroup-text alignment. Nevertheless, these methods suffer from a granularity\ninconsistency regarding the usage of group tokens, which are aligned in the\nall-to-one vs. one-to-one manners during the training and inference phases,\nrespectively. We argue that this discrepancy arises from the lack of elaborate\nsupervision for each group token. To bridge this granularity gap, this paper\nexplores explicit supervision for the group tokens from the prototypical\nknowledge. To this end, this paper proposes the non-learnable prototypical\nregularization (NPR) where non-learnable prototypes are estimated from source\nfeatures to serve as supervision and enable contrastive matching of the group\ntokens. 
This regularization encourages the group tokens to segment objects with\nless redundancy and capture more comprehensive semantic regions, leading to\nincreased compactness and richness. Based on NPR, we propose the prototypical\nguidance segmentation network (PGSeg) that incorporates multi-modal\nregularization by leveraging prototypical sources from both images and texts at\ndifferent levels, progressively enhancing the segmentation capability with\ndiverse prototypical patterns. Experimental results show that our proposed\nmethod achieves state-of-the-art performance on several benchmark datasets. The\nsource code is available at https://github.com/Ferenas/PGSeg.\n","authors":["Fei Zhang","Tianfei Zhou","Boyang Li","Hao He","Chaofan Ma","Tianjiao Zhang","Jiangchao Yao","Ya Zhang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19001v1.pdf","comment":"14 pages, Accept in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18999v1","updated":"2023-10-29T12:55:53Z","published":"2023-10-29T12:55:53Z","title":"DynPoint: Dynamic Neural Point For View Synthesis","summary":" The introduction of neural radiance fields has greatly improved the\neffectiveness of view synthesis for monocular videos. However, existing\nalgorithms face difficulties when dealing with uncontrolled or lengthy\nscenarios, and require extensive training time specific to each new scenario.\nTo tackle these limitations, we propose DynPoint, an algorithm designed to\nfacilitate the rapid synthesis of novel views for unconstrained monocular\nvideos. Rather than encoding the entirety of the scenario information into a\nlatent representation, DynPoint concentrates on predicting the explicit 3D\ncorrespondence between neighboring frames to realize information aggregation.\nSpecifically, this correspondence prediction is achieved through the estimation\nof consistent depth and scene flow information across frames. 
Subsequently, the\nacquired correspondence is utilized to aggregate information from multiple\nreference frames to a target frame, by constructing hierarchical neural point\nclouds. The resulting framework enables swift and accurate view synthesis for\ndesired views of target frames. The experimental results obtained demonstrate\nthe considerable acceleration of training time achieved - typically an order of\nmagnitude - by our proposed method while yielding comparable outcomes compared\nto prior approaches. Furthermore, our method exhibits strong robustness in\nhandling long-duration videos without learning a canonical representation of\nvideo content.\n","authors":["Kaichen Zhou","Jia-Xing Zhong","Sangyun Shin","Kai Lu","Yiyuan Yang","Andrew Markham","Niki Trigoni"],"pdf_url":"https://arxiv.org/pdf/2310.18999v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16934v2","updated":"2023-10-29T12:32:19Z","published":"2023-05-26T13:49:44Z","title":"On Evaluating Adversarial Robustness of Large Vision-Language Models","summary":" Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented\nperformance in response generation, especially with visual inputs, enabling\nmore creative and adaptable interaction than large language models such as\nChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since\nadversaries may successfully evade the entire system by subtly manipulating the\nmost vulnerable modality (e.g., vision). To this end, we propose evaluating the\nrobustness of open-source large VLMs in the most realistic and high-risk\nsetting, where adversaries have only black-box system access and seek to\ndeceive the model into returning the targeted responses. In particular, we\nfirst craft targeted adversarial examples against pretrained models such as\nCLIP and BLIP, and then transfer these adversarial examples to other VLMs such\nas MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. 
In addition, we\nobserve that black-box queries on these VLMs can further improve the\neffectiveness of targeted evasion, resulting in a surprisingly high success\nrate for generating targeted responses. Our findings provide a quantitative\nunderstanding regarding the adversarial vulnerability of large VLMs and call\nfor a more thorough examination of their potential security flaws before\ndeployment in practice. Code is at https://github.com/yunqing-me/AttackVLM.\n","authors":["Yunqing Zhao","Tianyu Pang","Chao Du","Xiao Yang","Chongxuan Li","Ngai-Man Cheung","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2305.16934v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2210.09943v2","updated":"2023-10-29T12:20:22Z","published":"2022-10-18T15:46:05Z","title":"Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face\n Recognition","summary":" Face recognition systems are widely deployed in safety-critical applications,\nincluding law enforcement, yet they exhibit bias across a range of\nsocio-demographic dimensions, such as gender and race. Conventional wisdom\ndictates that model biases arise from biased training data. As a consequence,\nprevious works on bias mitigation largely focused on pre-processing the\ntraining data, adding penalties to prevent bias from affecting the model during\ntraining, or post-processing predictions to debias them, yet these approaches\nhave shown limited success on hard problems such as face recognition. In our\nwork, we discover that biases are actually inherent to neural network\narchitectures themselves. Following this reframing, we conduct the first neural\narchitecture search for fairness, jointly with a search for hyperparameters.\nOur search outputs a suite of models which Pareto-dominate all other\nhigh-performance architectures and existing bias mitigation methods in terms of\naccuracy and fairness, often by large margins, on the two most widely used\ndatasets for face identification, CelebA and VGGFace2. 
Furthermore, these\nmodels generalize to other datasets and sensitive attributes. We release our\ncode, models and raw data files at https://github.com/dooleys/FR-NAS.\n","authors":["Rhea Sanjay Sukthanker","Samuel Dooley","John P. Dickerson","Colin White","Frank Hutter","Micah Goldblum"],"pdf_url":"https://arxiv.org/pdf/2210.09943v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10688v3","updated":"2023-10-29T12:14:29Z","published":"2023-02-21T14:14:40Z","title":"On Calibrating Diffusion Probabilistic Models","summary":" Recently, diffusion probabilistic models (DPMs) have achieved promising\nresults in diverse generative tasks. A typical DPM framework includes a forward\nprocess that gradually diffuses the data distribution and a reverse process\nthat recovers the data distribution from time-dependent data scores. In this\nwork, we observe that the stochastic reverse process of data scores is a\nmartingale, from which concentration bounds and the optional stopping theorem\nfor data scores can be derived. Then, we discover a simple way for calibrating\nan arbitrary pretrained DPM, with which the score matching loss can be reduced\nand the lower bounds of model likelihood can consequently be increased. We\nprovide general calibration guidelines under various model parametrizations.\nOur calibration method is performed only once and the resulting models can be\nused repeatedly for sampling. We conduct experiments on multiple datasets to\nempirically validate our proposal. 
Our code is at\nhttps://github.com/thudzj/Calibrated-DPMs.\n","authors":["Tianyu Pang","Cheng Lu","Chao Du","Min Lin","Shuicheng Yan","Zhijie Deng"],"pdf_url":"https://arxiv.org/pdf/2302.10688v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18986v1","updated":"2023-10-29T11:59:12Z","published":"2023-10-29T11:59:12Z","title":"Controllable Group Choreography using Contrastive Diffusion","summary":" Music-driven group choreography poses a considerable challenge but holds\nsignificant potential for a wide range of industrial applications. The ability\nto generate synchronized and visually appealing group dance motions that are\naligned with music opens up opportunities in many fields such as entertainment,\nadvertising, and virtual performances. However, most of the recent works are\nnot able to generate high-fidelity long-term motions, or fail to enable\ncontrollable experience. In this work, we aim to address the demand for\nhigh-quality and customizable group dance generation by effectively governing\nthe consistency and diversity of group choreographies. In particular, we\nutilize a diffusion-based generative approach to enable the synthesis of\nflexible number of dancers and long-term group dances, while ensuring coherence\nto the input music. Ultimately, we introduce a Group Contrastive Diffusion\n(GCD) strategy to enhance the connection between dancers and their group,\npresenting the ability to control the consistency or diversity level of the\nsynthesized group animation via the classifier-guidance sampling technique.\nThrough intensive experiments and evaluation, we demonstrate the effectiveness\nof our approach in producing visually captivating and consistent group dance\nmotions. 
The experimental results show the capability of our method to achieve\nthe desired levels of consistency and diversity, while maintaining the overall\nquality of the generated group choreography.\n","authors":["Nhat Le","Tuong Do","Khoa Do","Hien Nguyen","Erman Tjiputra","Quang D. Tran","Anh Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.18986v1.pdf","comment":"Accepted in ACM Transactions on Graphics"},{"id":"http://arxiv.org/abs/2310.16809v2","updated":"2023-10-29T10:59:21Z","published":"2023-10-25T17:38:55Z","title":"Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and\n In-depth Evaluation","summary":" This paper presents a comprehensive evaluation of the Optical Character\nRecognition (OCR) capabilities of the recently released GPT-4V(ision), a Large\nMultimodal Model (LMM). We assess the model's performance across a range of OCR\ntasks, including scene text recognition, handwritten text recognition,\nhandwritten mathematical expression recognition, table structure recognition,\nand information extraction from visually-rich document. The evaluation reveals\nthat GPT-4V performs well in recognizing and understanding Latin contents, but\nstruggles with multilingual scenarios and complex tasks. Specifically, it\nshowed limitations when dealing with non-Latin languages and complex tasks such\nas handwriting mathematical expression recognition, table structure\nrecognition, and end-to-end semantic entity recognition and pair extraction\nfrom document image. Based on these observations, we affirm the necessity and\ncontinued research value of specialized OCR models. In general, despite its\nversatility in handling diverse OCR tasks, GPT-4V does not outperform existing\nstate-of-the-art OCR models. How to fully utilize pre-trained general-purpose\nLMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study\noffers a critical reference for future research in OCR with LMMs. 
Evaluation\npipeline and results are available at\nhttps://github.com/SCUT-DLVCLab/GPT-4V_OCR.\n","authors":["Yongxin Shi","Dezhi Peng","Wenhui Liao","Zening Lin","Xinhong Chen","Chongyu Liu","Yuyi Zhang","Lianwen Jin"],"pdf_url":"https://arxiv.org/pdf/2310.16809v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18975v1","updated":"2023-10-29T10:48:44Z","published":"2023-10-29T10:48:44Z","title":"Blacksmith: Fast Adversarial Training of Vision Transformers via a\n Mixture of Single-step and Multi-step Methods","summary":" Despite the remarkable success achieved by deep learning algorithms in\nvarious domains, such as computer vision, they remain vulnerable to adversarial\nperturbations. Adversarial Training (AT) stands out as one of the most\neffective solutions to address this issue; however, single-step AT can lead to\nCatastrophic Overfitting (CO). This scenario occurs when the adversarially\ntrained network suddenly loses robustness against multi-step attacks like\nProjected Gradient Descent (PGD). Although several approaches have been\nproposed to address this problem in Convolutional Neural Networks (CNNs), we\nfound out that they do not perform well when applied to Vision Transformers\n(ViTs). In this paper, we propose Blacksmith, a novel training strategy to\novercome the CO problem, specifically in ViTs. Our approach utilizes either of\nPGD-2 or Fast Gradient Sign Method (FGSM) randomly in a mini-batch during the\nadversarial training of the neural network. This will increase the diversity of\nour training attacks, which could potentially mitigate the CO issue. To manage\nthe increased training time resulting from this combination, we craft the PGD-2\nattack based on only the first half of the layers, while FGSM is applied\nend-to-end. 
Through our experiments, we demonstrate that our novel method\neffectively prevents CO, achieves PGD-2 level performance, and outperforms\nother existing techniques including N-FGSM, which is the state-of-the-art\nmethod in fast training for CNNs.\n","authors":["Mahdi Salmani","Alireza Dehghanpour Farashah","Mohammad Azizmalayeri","Mahdi Amiri","Navid Eslami","Mohammad Taghi Manzuri","Mohammad Hossein Rohban"],"pdf_url":"https://arxiv.org/pdf/2310.18975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18969v1","updated":"2023-10-29T10:25:23Z","published":"2023-10-29T10:25:23Z","title":"Analyzing Vision Transformers for Image Classification in Class\n Embedding Space","summary":" Despite the growing use of transformer models in computer vision, a\nmechanistic understanding of these networks is still needed. This work\nintroduces a method to reverse-engineer Vision Transformers trained to solve\nimage classification tasks. Inspired by previous research in NLP, we\ndemonstrate how the inner representations at any level of the hierarchy can be\nprojected onto the learned class embedding space to uncover how these networks\nbuild categorical representations for their predictions. We use our framework\nto show how image tokens develop class-specific representations that depend on\nattention mechanisms and contextual information, and give insights on how\nself-attention and MLP layers differentially contribute to this categorical\ncomposition. We additionally demonstrate that this method (1) can be used to\ndetermine the parts of an image that would be important for detecting the class\nof interest, and (2) exhibits significant advantages over traditional linear\nprobing approaches. Taken together, our results position our proposed framework\nas a powerful tool for mechanistic interpretability and explainability\nresearch.\n","authors":["Martina G. 
Vilas","Timothy Schaumlöffel","Gemma Roig"],"pdf_url":"https://arxiv.org/pdf/2310.18969v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18966v1","updated":"2023-10-29T10:15:33Z","published":"2023-10-29T10:15:33Z","title":"Spacecraft Autonomous Decision-Planning for Collision Avoidance: a\n Reinforcement Learning Approach","summary":" The space environment around the Earth is becoming increasingly populated by\nboth active spacecraft and space debris. To avoid potential collision events,\nsignificant improvements in Space Situational Awareness (SSA) activities and\nCollision Avoidance (CA) technologies are allowing the tracking and maneuvering\nof spacecraft with increasing accuracy and reliability. However, these\nprocedures still largely involve a high level of human intervention to make the\nnecessary decisions. For an increasingly complex space environment, this\ndecision-making strategy is not likely to be sustainable. Therefore, it is\nimportant to successfully introduce higher levels of automation for key Space\nTraffic Management (STM) processes to ensure the level of reliability needed\nfor navigating a large number of spacecraft. These processes range from\ncollision risk detection to the identification of the appropriate action to\ntake and the execution of avoidance maneuvers. This work proposes an\nimplementation of autonomous CA decision-making capabilities on spacecraft\nbased on Reinforcement Learning (RL) techniques. A novel methodology based on a\nPartially Observable Markov Decision Process (POMDP) framework is developed to\ntrain the Artificial Intelligence (AI) system on board the spacecraft,\nconsidering epistemic and aleatory uncertainties. The proposed framework\nconsiders imperfect monitoring information about the status of the debris in\norbit and allows the AI system to effectively learn stochastic policies to\nperform accurate Collision Avoidance Maneuvers (CAMs). 
The objective is to\nsuccessfully delegate the decision-making process for autonomously implementing\na CAM to the spacecraft without human intervention. This approach would allow\nfor a faster response in the decision-making process and for highly\ndecentralized operations.\n","authors":["Nicolas Bourriez","Adrien Loizeau","Adam F. Abdin"],"pdf_url":"https://arxiv.org/pdf/2310.18966v1.pdf","comment":"Preprint accepted in the 74th International Astronautical Congress\n (IAC) - Baku, Azerbaijan, 2-6 October 2023"},{"id":"http://arxiv.org/abs/2306.09244v2","updated":"2023-10-29T10:07:43Z","published":"2023-06-15T16:26:20Z","title":"Text Promptable Surgical Instrument Segmentation with Vision-Language\n Models","summary":" In this paper, we propose a novel text promptable surgical instrument\nsegmentation approach to overcome challenges associated with diversity and\ndifferentiation of surgical instruments in minimally invasive surgeries. We\nredefine the task as text promptable, thereby enabling a more nuanced\ncomprehension of surgical instruments and adaptability to new instrument types.\nInspired by recent advancements in vision-language models, we leverage\npretrained image and text encoders as our model backbone and design a text\npromptable mask decoder consisting of attention- and convolution-based\nprompting schemes for surgical instrument segmentation prediction. Our model\nleverages multiple text prompts for each surgical instrument through a new\nmixture of prompts mechanism, resulting in enhanced segmentation performance.\nAdditionally, we introduce a hard instrument area reinforcement module to\nimprove image feature comprehension and segmentation precision. Extensive\nexperiments on several surgical instrument segmentation datasets demonstrate\nour model's superior performance and promising generalization capability. 
To\nour knowledge, this is the first implementation of a promptable approach to\nsurgical instrument segmentation, offering significant potential for practical\napplication in the field of robotic-assisted surgery.\n","authors":["Zijian Zhou","Oluwatosin Alabi","Meng Wei","Tom Vercauteren","Miaojing Shi"],"pdf_url":"https://arxiv.org/pdf/2306.09244v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18961v1","updated":"2023-10-29T10:03:49Z","published":"2023-10-29T10:03:49Z","title":"AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly\n Detection","summary":" Zero-shot anomaly detection (ZSAD) requires detection models trained using\nauxiliary data to detect anomalies without any training sample in a target\ndataset. It is a crucial task when training data is not accessible due to\nvarious concerns, \\eg, data privacy, yet it is challenging since the models\nneed to generalize to anomalies across different domains where the appearance\nof foreground objects, abnormal regions, and background features, such as\ndefects/tumors on different products/organs, can vary significantly. Recently\nlarge pre-trained vision-language models (VLMs), such as CLIP, have\ndemonstrated strong zero-shot recognition ability in various vision tasks,\nincluding anomaly detection. However, their ZSAD performance is weak since the\nVLMs focus more on modeling the class semantics of the foreground objects\nrather than the abnormality/normality in the images. In this paper we introduce\na novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across\ndifferent domains. The key insight of AnomalyCLIP is to learn object-agnostic\ntext prompts that capture generic normality and abnormality in an image\nregardless of its foreground objects. This allows our model to focus on the\nabnormal image regions rather than the object semantics, enabling generalized\nnormality and abnormality recognition on diverse types of objects. 
Large-scale\nexperiments on 17 real-world anomaly detection datasets show that AnomalyCLIP\nachieves superior zero-shot performance of detecting and segmenting anomalies\nin datasets of highly diverse class semantics from various defect inspection\nand medical imaging domains. Code will be made available at\nhttps://github.com/zqhang/AnomalyCLIP.\n","authors":["Qihang Zhou","Guansong Pang","Yu Tian","Shibo He","Jiming Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18961v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18954v1","updated":"2023-10-29T09:55:28Z","published":"2023-10-29T09:55:28Z","title":"Mask Propagation for Efficient Video Semantic Segmentation","summary":" Video Semantic Segmentation (VSS) involves assigning a semantic label to each\npixel in a video sequence. Prior work in this field has demonstrated promising\nresults by extending image semantic segmentation models to exploit temporal\nrelationships across video frames; however, these approaches often incur\nsignificant computational costs. In this paper, we propose an efficient mask\npropagation framework for VSS, called MPVSS. Our approach first employs a\nstrong query-based image segmentor on sparse key frames to generate accurate\nbinary masks and class predictions. We then design a flow estimation module\nutilizing the learned queries to generate a set of segment-aware flow maps,\neach associated with a mask prediction from the key frame. Finally, the\nmask-flow pairs are warped to serve as the mask predictions for the non-key\nframes. By reusing predictions from key frames, we circumvent the need to\nprocess a large volume of video frames individually with resource-intensive\nsegmentors, alleviating temporal redundancy and significantly reducing\ncomputational costs. Extensive experiments on VSPW and Cityscapes demonstrate\nthat our mask propagation framework achieves SOTA accuracy and efficiency\ntrade-offs. 
For instance, our best model with Swin-L backbone outperforms the\nSOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW\ndataset. Moreover, our framework reduces up to 4x FLOPs compared to the\nper-frame Mask2Former baseline with only up to 2% mIoU degradation on the\nCityscapes validation set. Code is available at\nhttps://github.com/ziplab/MPVSS.\n","authors":["Yuetian Weng","Mingfei Han","Haoyu He","Mingjie Li","Lina Yao","Xiaojun Chang","Bohan Zhuang"],"pdf_url":"https://arxiv.org/pdf/2310.18954v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18953v1","updated":"2023-10-29T09:54:03Z","published":"2023-10-29T09:54:03Z","title":"TIC-TAC: A Framework To Learn And Evaluate Your Covariance","summary":" We study the problem of unsupervised heteroscedastic covariance estimation,\nwhere the goal is to learn the multivariate target distribution $\\mathcal{N}(y,\n\\Sigma_y | x )$ given an observation $x$. This problem is particularly\nchallenging as $\\Sigma_{y}$ varies for different samples (heteroscedastic) and\nno annotation for the covariance is available (unsupervised). Typically,\nstate-of-the-art methods predict the mean $f_{\\theta}(x)$ and covariance\n$\\textrm{Cov}(f_{\\theta}(x))$ of the target distribution through two neural\nnetworks trained using the negative log-likelihood. This raises two questions:\n(1) Does the predicted covariance truly capture the randomness of the predicted\nmean? (2) In the absence of ground-truth annotation, how can we quantify the\nperformance of covariance estimation? We address (1) by deriving TIC: Taylor\nInduced Covariance, which captures the randomness of the multivariate\n$f_{\\theta}(x)$ by incorporating its gradient and curvature around $x$ through\nthe second order Taylor polynomial. Furthermore, we tackle (2) by introducing\nTAC: Task Agnostic Correlations, a metric which leverages conditioning of the\nnormal distribution to evaluate the covariance. 
We verify the effectiveness of\nTIC through multiple experiments spanning synthetic (univariate, multivariate)\nand real-world datasets (UCI Regression, LSP, and MPII Human Pose Estimation).\nOur experiments show that TIC outperforms state-of-the-art in accurately\nlearning the covariance, as quantified through TAC.\n","authors":["Megh Shukla","Mathieu Salzmann","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2310.18953v1.pdf","comment":"12 pages, 4 figures. Please feel free to provide feedback!"},{"id":"http://arxiv.org/abs/2310.18949v1","updated":"2023-10-29T09:32:33Z","published":"2023-10-29T09:32:33Z","title":"Customize StyleGAN with One Hand Sketch","summary":" Generating images from human sketches typically requires dedicated networks\ntrained from scratch. In contrast, the emergence of the pre-trained\nVision-Language models (e.g., CLIP) has propelled generative applications based\non controlling the output imagery of existing StyleGAN models with text inputs\nor reference images. Parallelly, our work proposes a framework to control\nStyleGAN imagery with a single user sketch. In particular, we learn a\nconditional distribution in the latent space of a pre-trained StyleGAN model\nvia energy-based learning and propose two novel energy functions leveraging\nCLIP for cross-domain semantic supervision. Once trained, our model can\ngenerate multi-modal images semantically aligned with the input sketch.\nQuantitative evaluations on synthesized datasets have shown that our approach\nimproves significantly from previous methods in the one-shot regime. The\nsuperiority of our method is further underscored when experimenting with a wide\nrange of human sketches of diverse styles and poses. 
Surprisingly, our models\noutperform the previous baseline regarding both the range of sketch inputs and\nimage qualities despite operating with a stricter setting: with no extra\ntraining data and single sketch input.\n","authors":["Shaocong Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.18949v1.pdf","comment":"preprint"},{"id":"http://arxiv.org/abs/2310.18946v1","updated":"2023-10-29T09:09:32Z","published":"2023-10-29T09:09:32Z","title":"Video Frame Interpolation with Many-to-many Splatting and Spatial\n Selective Refinement","summary":" In this work, we first propose a fully differentiable Many-to-Many (M2M)\nsplatting framework to interpolate frames efficiently. Given a frame pair, we\nestimate multiple bidirectional flows to directly forward warp the pixels to\nthe desired time step before fusing overlapping pixels. In doing so, each\nsource pixel renders multiple target pixels and each target pixel can be\nsynthesized from a larger area of visual context, establishing a many-to-many\nsplatting scheme with robustness to undesirable artifacts. For each input frame\npair, M2M has a minuscule computational overhead when interpolating an\narbitrary number of in-between frames, hence achieving fast multi-frame\ninterpolation. However, directly warping and fusing pixels in the intensity\ndomain is sensitive to the quality of motion estimation and may suffer from\nless effective representation capacity. To improve interpolation accuracy, we\nfurther extend an M2M++ framework by introducing a flexible Spatial Selective\nRefinement (SSR) component, which allows for trading computational efficiency\nfor interpolation quality and vice versa. 
Instead of refining the entire\ninterpolated frame, SSR only processes difficult regions selected under the\nguidance of an estimated error map, thereby avoiding redundant computation.\nEvaluation on multiple benchmark datasets shows that our method is able to\nimprove the efficiency while maintaining competitive video interpolation\nquality, and it can be adjusted to use more or less compute as needed.\n","authors":["Ping Hu","Simon Niklaus","Lu Zhang","Stan Sclaroff","Kate Saenko"],"pdf_url":"https://arxiv.org/pdf/2310.18946v1.pdf","comment":"T-PAMI. arXiv admin note: substantial text overlap with\n arXiv:2204.03513"},{"id":"http://arxiv.org/abs/2211.13251v2","updated":"2023-10-29T09:04:55Z","published":"2022-11-23T19:02:50Z","title":"CGOF++: Controllable 3D Face Synthesis with Conditional Generative\n Occupancy Fields","summary":" Capitalizing on the recent advances in image generation models, existing\ncontrollable face image synthesis methods are able to generate high-fidelity\nimages with some levels of controllability, e.g., controlling the shapes,\nexpressions, textures, and poses of the generated face images. However,\nprevious methods focus on controllable 2D image generative models, which are\nprone to producing inconsistent face images under large expression and pose\nchanges. In this paper, we propose a new NeRF-based conditional 3D face\nsynthesis framework, which enables 3D controllability over the generated face\nimages by imposing explicit 3D conditions from 3D face priors. At its core is a\nconditional Generative Occupancy Field (cGOF++) that effectively enforces the\nshape of the generated face to conform to a given 3D Morphable Model (3DMM)\nmesh, built on top of EG3D [1], a recent tri-plane-based generative model. To\nachieve accurate control over fine-grained 3D face shapes of the synthesized\nimages, we additionally incorporate a 3D landmark loss as well as a volume\nwarping loss into our synthesis framework. 
Experiments validate the\neffectiveness of the proposed method, which is able to generate high-fidelity\nface images and shows more precise 3D controllability than state-of-the-art\n2D-based controllable face synthesis methods.\n","authors":["Keqiang Sun","Shangzhe Wu","Ning Zhang","Zhaoyang Huang","Quan Wang","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2211.13251v2.pdf","comment":"Accepted to IEEE Transactions on Pattern Analysis and Machine\n Intelligence (TPAMI). This article is an extension of the NeurIPS'22 paper\n arXiv:2206.08361"},{"id":"http://arxiv.org/abs/2310.18936v1","updated":"2023-10-29T08:50:27Z","published":"2023-10-29T08:50:27Z","title":"Adversarial Examples Are Not Real Features","summary":" The existence of adversarial examples has been a mystery for years and\nattracted much interest. A well-known theory by \\citet{ilyas2019adversarial}\nexplains adversarial vulnerability from a data perspective by showing that one\ncan extract non-robust features from adversarial examples and these features\nalone are useful for classification. However, the explanation remains quite\ncounter-intuitive since non-robust features are mostly noise features to\nhumans. In this paper, we re-examine the theory from a larger context by\nincorporating multiple learning paradigms. Notably, we find that contrary to\ntheir good usefulness under supervised learning, non-robust features attain\npoor usefulness when transferred to other self-supervised learning paradigms,\nsuch as contrastive learning, masked image modeling, and diffusion models. It\nreveals that non-robust features are not really as useful as robust or natural\nfeatures that enjoy good transferability between these paradigms. Meanwhile,\nfor robustness, we also show that naturally trained encoders from robust\nfeatures are largely non-robust under AutoAttack. 
Our cross-paradigm\nexamination suggests that the non-robust features are not really useful but\nmore like paradigm-wise shortcuts, and robust features alone might be\ninsufficient to attain reliable model robustness. Code is available at\n\\url{https://github.com/PKU-ML/AdvNotRealFeatures}.\n","authors":["Ang Li","Yifei Wang","Yiwen Guo","Yisen Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18936v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18933v1","updated":"2023-10-29T08:03:45Z","published":"2023-10-29T08:03:45Z","title":"Label Poisoning is All You Need","summary":" In a backdoor attack, an adversary injects corrupted data into a model's\ntraining dataset in order to gain control over its predictions on images with a\nspecific attacker-defined trigger. A typical corrupted training example\nrequires altering both the image, by applying the trigger, and the label.\nModels trained on clean images, therefore, were considered safe from backdoor\nattacks. However, in some common machine learning scenarios, the training\nlabels are provided by potentially malicious third-parties. This includes\ncrowd-sourced annotation and knowledge distillation. We, hence, investigate a\nfundamental question: can we launch a successful backdoor attack by only\ncorrupting labels? We introduce a novel approach to design label-only backdoor\nattacks, which we call FLIP, and demonstrate its strengths on three datasets\n(CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32,\nResNet-18, VGG-19, and Vision Transformer). With only 2% of CIFAR-10 labels\ncorrupted, FLIP achieves a near-perfect attack success rate of 99.4% while\nsuffering only a 1.8% drop in the clean test accuracy. Our approach builds upon\nthe recent advances in trajectory matching, originally introduced for dataset\ndistillation.\n","authors":["Rishi D. 
Jha","Jonathan Hayase","Sewoong Oh"],"pdf_url":"https://arxiv.org/pdf/2310.18933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18928v1","updated":"2023-10-29T07:38:33Z","published":"2023-10-29T07:38:33Z","title":"A transfer learning approach with convolutional neural network for Face\n Mask Detection","summary":" Due to the epidemic of the coronavirus (Covid-19) and its rapid spread around\nthe world, the world has faced an enormous crisis. To prevent the spread of the\ncoronavirus, the World Health Organization (WHO) has introduced the use of\nmasks and keeping social distance as the best preventive method. So, developing\nan automatic monitoring system for detecting facemasks in some crowded places\nis essential. To do this, we propose a mask recognition system based on\ntransfer learning and Inception v3 architecture. In the proposed method, two\ndatasets are used simultaneously for training including the Simulated Mask Face\nDataset (SMFD) and MaskedFace-Net (MFN) This paper tries to increase the\naccuracy of the proposed system by optimally setting hyper-parameters and\naccurately designing the fully connected layers. The main advantage of the\nproposed method is that in addition to masked and unmasked faces, it can also\ndetect cases of incorrect use of mask. Therefore, the proposed method\nclassifies the input face images into three categories. 
Experimental results\nshow the high accuracy and efficiency of the proposed method; so, this method\nhas achieved an accuracy of 99.47% and 99.33% in training and test data\nrespectively\n","authors":["Abolfazl Younesi","Reza Afrouzian","Yousef Seyfari"],"pdf_url":"https://arxiv.org/pdf/2310.18928v1.pdf","comment":"9 pages, in Persian language, 8 figures"},{"id":"http://arxiv.org/abs/2308.16573v2","updated":"2023-10-29T07:37:59Z","published":"2023-08-31T09:13:34Z","title":"Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for\n Semi-Supervised Medical Image Segmentation","summary":" Though supervised learning gains impressive success, the acquisition of\nindispensable large-scale labeled datasets are often impractical in biomedical\nimaging partially due to expensive costs and lengthy annotations done by\nexperienced radiologists. Semi-supervised learning has been shown to be an\neffective way to address this limitation by leveraging useful information from\nunlabeled datasets. In this paper, we present a new semi-supervised learning\nmethod referred to as Dual-Decoder Consistency via Pseudo-Labels Guided Data\nAugmentation (DCPA) for medical image segmentation. We devise a consistency\nregularization to improve the semi-supervised learning. Specifically, to\npromote consistent representations during the training process, we use\ndifferent decoders for student and teachers networks while maintain the same\nencoder. Moreover, to learn from unlabeled data, we create pseudo-labels\ngenerated by the teacher networks and augment the training data with the\npseudo-labels. The two techniques contribute to the improved performance of the\nproposed method. We evaluate the performance of the proposed method on three\nrepresentative medical image segmentation datasets. Extensive comparisons to\nthe state-of-the-art medical image segmentation methods were carried out under\ntypical scenarios with 10% and 20% labeled data. 
Experimental outcomes\ndemonstrate that our method consistently outperforms state-of-the-art\nsemi-supervised medical image segmentation methods over the three\nsemi-supervised settings. Furthermore, to explore the performance of proposed\nmethod under extreme condition, we conduct experiments with only 5% labeled\ndata. The results further verify the superior performance of the proposed\nmethod. Source code is publicly online at https://github.com/BinYCn/DCPA.git.\n","authors":["Yuanbin Chen","Tao Wang","Hui Tang","Longxuan Zhao","Ruige Zong","Shun Chen","Tao Tan","Xinlin Zhang","Tong Tong"],"pdf_url":"https://arxiv.org/pdf/2308.16573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18926v1","updated":"2023-10-29T07:36:11Z","published":"2023-10-29T07:36:11Z","title":"CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved\n Self-Supervised Video Hashing","summary":" Compressing videos into binary codes can improve retrieval speed and reduce\nstorage overhead. However, learning accurate hash codes for video retrieval can\nbe challenging due to high local redundancy and complex global dependencies\nbetween video frames, especially in the absence of labels. Existing\nself-supervised video hashing methods have been effective in designing\nexpressive temporal encoders, but have not fully utilized the temporal dynamics\nand spatial appearance of videos due to less challenging and unreliable\nlearning tasks. To address these challenges, we begin by utilizing the\ncontrastive learning task to capture global spatio-temporal information of\nvideos for hashing. With the aid of our designed augmentation strategies, which\nfocus on spatial and temporal variations to create positive pairs, the learning\nframework can generate hash codes that are invariant to motion, scale, and\nviewpoint. 
Furthermore, we incorporate two collaborative learning tasks, i.e.,\nframe order verification and scene change regularization, to capture local\nspatio-temporal details within video frames, thereby enhancing the perception\nof temporal structure and the modeling of spatio-temporal relationships. Our\nproposed Contrastive Hashing with Global-Local Spatio-temporal Information\n(CHAIN) outperforms state-of-the-art self-supervised video hashing methods on\nfour video benchmark datasets. Our codes will be released.\n","authors":["Rukai Wei","Yu Liu","Jingkuan Song","Heng Cui","Yanzhao Xie","Ke Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.18926v1.pdf","comment":"12 pages, 8 figures, accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2309.05375v2","updated":"2023-10-29T07:08:04Z","published":"2023-09-11T10:54:22Z","title":"Toward a Deeper Understanding: RetNet Viewed through Convolution","summary":" The success of Vision Transformer (ViT) has been widely reported on a wide\nrange of image recognition tasks. ViT can learn global dependencies superior to\nCNN, yet CNN's inherent locality can substitute for expensive training\nresources. Recently, the outstanding performance of RetNet in the field of\nlanguage modeling has garnered attention, surpassing that of the Transformer\nwith explicit local modeling, shifting researchers' focus towards Transformers\nin the CV field. This paper investigates the effectiveness of RetNet from a CNN\nperspective and presents a variant of RetNet tailored to the visual domain.\nSimilar to RetNet we improves ViT's local modeling by applying a weight mask on\nthe original self-attention matrix. A straightforward way to locally adapt the\nself-attention matrix can be realized by an element-wise learnable weight mask\n(ELM), for which our preliminary results show promising results. 
However, the\nelement-wise simple learnable weight mask not only induces a non-trivial\nadditional parameter overhead but also increases the optimization complexity.\nTo this end, this work proposes a novel Gaussian mixture mask (GMM) in which\none mask only has two learnable parameters and it can be conveniently used in\nany ViT variants whose attention mechanism allows the use of masks.\nExperimental results on multiple small datasets demonstrate that the\neffectiveness of our proposed Gaussian mask for boosting ViTs for free (almost\nzero additional parameter or computation cost). Our code can be publicly\navailable at https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention.\n","authors":["Chenghao Li","Chaoning Zhang"],"pdf_url":"https://arxiv.org/pdf/2309.05375v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18921v1","updated":"2023-10-29T06:43:01Z","published":"2023-10-29T06:43:01Z","title":"QWID: Quantized Weed Identification Deep neural network","summary":" In this paper, we present an efficient solution for weed classification in\nagriculture. We focus on optimizing model performance at inference while\nrespecting the constraints of the agricultural domain. We propose a Quantized\nDeep Neural Network model that classifies a dataset of 9 weed classes using\n8-bit integer (int8) quantization, a departure from standard 32-bit floating\npoint (fp32) models. Recognizing the hardware resource limitations in\nagriculture, our model balances model size, inference time, and accuracy,\naligning with practical requirements. We evaluate the approach on ResNet-50 and\nInceptionV3 architectures, comparing their performance against their int8\nquantized versions. Transfer learning and fine-tuning are applied using the\nDeepWeeds dataset. The results show staggering model size and inference time\nreductions while maintaining accuracy in real-world production scenarios like\nDesktop, Mobile and Raspberry Pi. 
Our work sheds light on a promising direction\nfor efficient AI in agriculture, holding potential for broader applications.\n Code: https://github.com/parikshit14/QNN-for-weed\n","authors":["Parikshit Singh Rathore"],"pdf_url":"https://arxiv.org/pdf/2310.18921v1.pdf","comment":"6 pages, 6 figures, 4 tables"},{"id":"http://arxiv.org/abs/2310.18920v1","updated":"2023-10-29T06:36:27Z","published":"2023-10-29T06:36:27Z","title":"Improving Multi-Person Pose Tracking with A Confidence Network","summary":" Human pose estimation and tracking are fundamental tasks for understanding\nhuman behaviors in videos. Existing top-down framework-based methods usually\nperform three-stage tasks: human detection, pose estimation and tracking.\nAlthough promising results have been achieved, these methods rely heavily on\nhigh-performance detectors and may fail to track persons who are occluded or\nmiss-detected. To overcome these problems, in this paper, we develop a novel\nkeypoint confidence network and a tracking pipeline to improve human detection\nand pose estimation in top-down approaches. Specifically, the keypoint\nconfidence network is designed to determine whether each keypoint is occluded,\nand it is incorporated into the pose estimation module. In the tracking\npipeline, we propose the Bbox-revision module to reduce missing detection and\nthe ID-retrieve module to correct lost trajectories, improving the performance\nof the detection stage. Experimental results show that our approach is\nuniversal in human detection and pose estimation, achieving state-of-the-art\nperformance on both PoseTrack 2017 and 2018 datasets.\n","authors":["Zehua Fu","Wenhang Zuo","Zhenghui Hu","Qingjie Liu","Yunhong Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18920v1.pdf","comment":"Accepted by IEEE Transactions on Multimedia. 
11 pages, 5 figures"},{"id":"http://arxiv.org/abs/2306.05178v3","updated":"2023-10-29T06:11:24Z","published":"2023-06-08T13:18:23Z","title":"SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions","summary":" The remarkable capabilities of pretrained image diffusion models have been\nutilized not only for generating fixed-size images but also for creating\npanoramas. However, naive stitching of multiple images often results in visible\nseams. Recent techniques have attempted to address this issue by performing\njoint diffusions in multiple windows and averaging latent features in\noverlapping regions. However, these approaches, which focus on seamless montage\ngeneration, often yield incoherent outputs by blending different scenes within\na single image. To overcome this limitation, we propose SyncDiffusion, a\nplug-and-play module that synchronizes multiple diffusions through gradient\ndescent from a perceptual similarity loss. Specifically, we compute the\ngradient of the perceptual loss using the predicted denoised images at each\ndenoising step, providing meaningful guidance for achieving coherent montages.\nOur experimental results demonstrate that our method produces significantly\nmore coherent outputs compared to previous methods (66.35% vs. 33.65% in our\nuser study) while still maintaining fidelity (as assessed by GIQA) and\ncompatibility with the input prompt (as measured by CLIP score). We further\ndemonstrate the versatility of our method across three plug-and-play\napplications: layout-guided image generation, conditional image generation and\n360-degree panorama generation. Our project page is at\nhttps://syncdiffusion.github.io.\n","authors":["Yuseung Lee","Kunho Kim","Hyunjin Kim","Minhyuk Sung"],"pdf_url":"https://arxiv.org/pdf/2306.05178v3.pdf","comment":"Accepted to NeurIPS 2023. 
Project page:\n https://syncdiffusion.github.io"},{"id":"http://arxiv.org/abs/2310.18917v1","updated":"2023-10-29T06:10:46Z","published":"2023-10-29T06:10:46Z","title":"TiV-NeRF: Tracking and Mapping via Time-Varying Representation with\n Dynamic Neural Radiance Fields","summary":" Previous attempts to integrate Neural Radiance Fields (NeRF) into\nSimultaneous Localization and Mapping (SLAM) framework either rely on the\nassumption of static scenes or treat dynamic objects as outliers. However, most\nof real-world scenarios is dynamic. In this paper, we propose a time-varying\nrepresentation to track and reconstruct the dynamic scenes. Our system\nsimultaneously maintains two processes, tracking process and mapping process.\nFor tracking process, the entire input images are uniformly sampled and\ntraining of the RGB images are self-supervised. For mapping process, we\nleverage know masks to differentiate dynamic objects and static backgrounds,\nand we apply distinct sampling strategies for two types of areas. The\nparameters optimization for both processes are made up by two stages, the first\nstage associates time with 3D positions to convert the deformation field to the\ncanonical field. And the second associates time with 3D positions in canonical\nfield to obtain colors and Signed Distance Function (SDF). Besides, We propose\na novel keyframe selection strategy based on the overlapping rate. 
We evaluate\nour approach on two publicly available synthetic datasets and validate that our\nmethod is more effective compared to current state-of-the-art dynamic mapping\nmethods.\n","authors":["Chengyao Duan","Zhiliu Yang"],"pdf_url":"https://arxiv.org/pdf/2310.18917v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18910v1","updated":"2023-10-29T05:31:43Z","published":"2023-10-29T05:31:43Z","title":"InstanT: Semi-supervised Learning with Instance-dependent Thresholds","summary":" Semi-supervised learning (SSL) has been a fundamental challenge in machine\nlearning for decades. The primary family of SSL algorithms, known as\npseudo-labeling, involves assigning pseudo-labels to confident unlabeled\ninstances and incorporating them into the training set. Therefore, the\nselection criteria of confident instances are crucial to the success of SSL.\nRecently, there has been growing interest in the development of SSL methods\nthat use dynamic or adaptive thresholds. Yet, these methods typically apply the\nsame threshold to all samples, or use class-dependent thresholds for instances\nbelonging to a certain class, while neglecting instance-level information. In\nthis paper, we propose the study of instance-dependent thresholds, which has\nthe highest degree of freedom compared with existing methods. 
Specifically, we\ndevise a novel instance-dependent threshold function for all unlabeled\ninstances by utilizing their instance-level ambiguity and the\ninstance-dependent error rates of pseudo-labels, so instances that are more\nlikely to have incorrect pseudo-labels will have higher thresholds.\nFurthermore, we demonstrate that our instance-dependent threshold function\nprovides a bounded probabilistic guarantee for the correctness of the\npseudo-labels it assigns.\n","authors":["Muyang Li","Runze Wu","Haoyu Liu","Jun Yu","Xun Yang","Bo Han","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.18910v1.pdf","comment":"Accepted as poster for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18904v1","updated":"2023-10-29T05:20:54Z","published":"2023-10-29T05:20:54Z","title":"Identifiable Contrastive Learning with Automatic Feature Importance\n Discovery","summary":" Existing contrastive learning methods rely on pairwise sample contrast\n$z_x^\\top z_{x'}$ to learn data representations, but the learned features often\nlack clear interpretability from a human perspective. Theoretically, it lacks\nfeature identifiability and different initialization may lead to totally\ndifferent features. In this paper, we study a new method named tri-factor\ncontrastive learning (triCL) that involves a 3-factor contrast in the form of\n$z_x^\\top S z_{x'}$, where $S=\\text{diag}(s_1,\\dots,s_k)$ is a learnable\ndiagonal matrix that automatically captures the importance of each feature. We\nshow that by this simple extension, triCL can not only obtain identifiable\nfeatures that eliminate randomness but also obtain more interpretable features\nthat are ordered according to the importance matrix $S$. We show that features\nwith high importance have nice interpretability by capturing common classwise\nfeatures, and obtain superior performance when evaluated for image retrieval\nusing a few features. 
The proposed triCL objective is general and can be\napplied to different contrastive learning methods like SimCLR and CLIP. We\nbelieve that it is a better alternative to existing 2-factor contrastive\nlearning by improving its identifiability and interpretability with minimal\noverhead. Code is available at\nhttps://github.com/PKU-ML/Tri-factor-Contrastive-Learning.\n","authors":["Qi Zhang","Yifei Wang","Yisen Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18904v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.09203v2","updated":"2023-10-29T05:04:32Z","published":"2022-06-18T13:32:41Z","title":"Interactive Visual Reasoning under Uncertainty","summary":" One of the fundamental cognitive abilities of humans is to quickly resolve\nuncertainty by generating hypotheses and testing them via active trials.\nEncountering a novel phenomenon accompanied by ambiguous cause-effect\nrelationships, humans make hypotheses against data, conduct inferences from\nobservation, test their theory via experimentation, and correct the proposition\nif inconsistency arises. These iterative processes persist until the underlying\nmechanism becomes clear. In this work, we devise the IVRE (pronounced as\n\"ivory\") environment for evaluating artificial agents' reasoning ability under\nuncertainty. IVRE is an interactive environment featuring rich scenarios\ncentered around Blicket detection. Agents in IVRE are placed into environments\nwith various ambiguous action-effect pairs and asked to determine each object's\nrole. They are encouraged to propose effective and efficient experiments to\nvalidate their hypotheses based on observations and actively gather new\ninformation. The game ends when all uncertainties are resolved or the maximum\nnumber of trials is consumed. By evaluating modern artificial agents in IVRE,\nwe notice a clear failure of today's learning methods compared to humans. 
Such\ninefficacy in interactive reasoning ability under uncertainty calls for future\nresearch in building human-like intelligence.\n","authors":["Manjie Xu","Guangyuan Jiang","Wei Liang","Chi Zhang","Yixin Zhu"],"pdf_url":"https://arxiv.org/pdf/2206.09203v2.pdf","comment":"Accepted at NeurIPS 2023 (Datasets and Benchmarks)"},{"id":"http://arxiv.org/abs/2301.12549v3","updated":"2023-10-29T04:43:45Z","published":"2023-01-29T21:40:04Z","title":"Unlocking Deterministic Robustness Certification on ImageNet","summary":" Despite the promise of Lipschitz-based methods for provably-robust deep\nlearning with deterministic guarantees, current state-of-the-art results are\nlimited to feed-forward Convolutional Networks (ConvNets) on low-dimensional\ndata, such as CIFAR-10. This paper investigates strategies for expanding\ncertifiably robust training to larger, deeper models. A key challenge in\ncertifying deep networks is efficient calculation of the Lipschitz bound for\nresidual blocks found in ResNet and ViT architectures. We show that fast ways\nof bounding the Lipschitz constant for conventional ResNets are loose, and show\nhow to address this by designing a new residual block, leading to the\n\\emph{Linear ResNet} (LiResNet) architecture. We then introduce \\emph{Efficient\nMargin MAximization} (EMMA), a loss function that stabilizes robust training by\nsimultaneously penalizing worst-case adversarial examples from \\emph{all}\nclasses. 
Together, these contributions yield new \\emph{state-of-the-art} robust\naccuracy on CIFAR-10/100 and Tiny-ImageNet under $\\ell_2$ perturbations.\nMoreover, for the first time, we are able to scale up fast deterministic\nrobustness guarantees to ImageNet, demonstrating that this approach to robust\nlearning can be applied to real-world applications.\n We release our code on Github: \\url{https://github.com/klasleino/gloro}.\n","authors":["Kai Hu","Andy Zou","Zifan Wang","Klas Leino","Matt Fredrikson"],"pdf_url":"https://arxiv.org/pdf/2301.12549v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18899v1","updated":"2023-10-29T04:43:30Z","published":"2023-10-29T04:43:30Z","title":"Multi-task deep learning for large-scale building detail extraction from\n high-resolution satellite imagery","summary":" Understanding urban dynamics and promoting sustainable development requires\ncomprehensive insights about buildings. While geospatial artificial\nintelligence has advanced the extraction of such details from Earth\nobservational data, existing methods often suffer from computational\ninefficiencies and inconsistencies when compiling unified building-related\ndatasets for practical applications. To bridge this gap, we introduce the\nMulti-task Building Refiner (MT-BR), an adaptable neural network tailored for\nsimultaneous extraction of spatial and attributional building details from\nhigh-resolution satellite imagery, exemplified by building rooftops, urban\nfunctional types, and roof architectural types. Notably, MT-BR can be\nfine-tuned to incorporate additional building details, extending its\napplicability. For large-scale applications, we devise a novel spatial sampling\nscheme that strategically selects limited but representative image samples.\nThis process optimizes both the spatial distribution of samples and the urban\nenvironmental characteristics they contain, thus enhancing extraction\neffectiveness while curtailing data preparation expenditures. 
We further\nenhance MT-BR's predictive performance and generalization capabilities through\nthe integration of advanced augmentation techniques. Our quantitative results\nhighlight the efficacy of the proposed methods. Specifically, networks trained\nwith datasets curated via our sampling method demonstrate improved predictive\naccuracy relative to those using alternative sampling approaches, with no\nalterations to network architecture. Moreover, MT-BR consistently outperforms\nother state-of-the-art methods in extracting building details across various\nmetrics. The real-world practicality is also demonstrated in an application\nacross Shanghai, generating a unified dataset that encompasses both the spatial\nand attributional details of buildings.\n","authors":["Zhen Qian","Min Chen","Zhuo Sun","Fan Zhang","Qingsong Xu","Jinzhao Guo","Zhiwei Xie","Zhixin Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.18899v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.14194v2","updated":"2023-10-29T04:23:27Z","published":"2023-10-22T05:50:20Z","title":"Distractor-aware Event-based Tracking","summary":" Event cameras, or dynamic vision sensors, have recently achieved success from\nfundamental vision tasks to high-level vision researches. Due to its ability to\nasynchronously capture light intensity changes, event camera has an inherent\nadvantage to capture moving objects in challenging scenarios including objects\nunder low light, high dynamic range, or fast moving objects. Thus event camera\nare natural for visual object tracking. However, the current event-based\ntrackers derived from RGB trackers simply modify the input images to event\nframes and still follow conventional tracking pipeline that mainly focus on\nobject texture for target distinction. As a result, the trackers may not be\nrobust dealing with challenging scenarios such as moving cameras and cluttered\nforeground. 
In this paper, we propose a distractor-aware event-based tracker\nthat introduces transformer modules into Siamese network architecture (named\nDANet). Specifically, our model is mainly composed of a motion-aware network\nand a target-aware network, which simultaneously exploits both motion cues and\nobject contours from event data, so as to discover motion objects and identify\nthe target object by removing dynamic distractors. Our DANet can be trained in\nan end-to-end manner without any post-processing and can run at over 80 FPS on\na single V100. We conduct comprehensive experiments on two large event tracking\ndatasets to validate the proposed model. We demonstrate that our tracker has\nsuperior performance against the state-of-the-art trackers in terms of both\naccuracy and efficiency.\n","authors":["Yingkai Fu","Meng Li","Wenxi Liu","Yuanchen Wang","Jiqing Zhang","Baocai Yin","Xiaopeng Wei","Xin Yang"],"pdf_url":"https://arxiv.org/pdf/2310.14194v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.08763v2","updated":"2023-10-29T04:16:11Z","published":"2023-07-17T18:19:36Z","title":"Video-Mined Task Graphs for Keystep Recognition in Instructional Videos","summary":" Procedural activity understanding requires perceiving human actions in terms\nof a broader task, where multiple keysteps are performed in sequence across a\nlong video to reach a final goal state -- such as the steps of a recipe or a\nDIY fix-it task. Prior work largely treats keystep recognition in isolation of\nthis broader structure, or else rigidly confines keysteps to align with a\npredefined sequential script. We propose discovering a task graph automatically\nfrom how-to videos to represent probabilistically how people tend to execute\nkeysteps, and then leverage this graph to regularize keystep recognition in\nnovel videos. 
On multiple datasets of real-world instructional videos, we show\nthe impact: more reliable zero-shot keystep localization and improved video\nrepresentation learning, exceeding the state of the art.\n","authors":["Kumar Ashutosh","Santhosh Kumar Ramakrishnan","Triantafyllos Afouras","Kristen Grauman"],"pdf_url":"https://arxiv.org/pdf/2307.08763v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.16999v2","updated":"2023-10-29T04:09:48Z","published":"2023-10-25T20:55:07Z","title":"Trust, but Verify: Robust Image Segmentation using Deep Learning","summary":" We describe a method for verifying the output of a deep neural network for\nmedical image segmentation that is robust to several classes of random as well\nas worst-case perturbations i.e. adversarial attacks. This method is based on a\ngeneral approach recently developed by the authors called \"Trust, but Verify\"\nwherein an auxiliary verification network produces predictions about certain\nmasked features in the input image using the segmentation as an input. A\nwell-designed auxiliary network will produce high-quality predictions when the\ninput segmentations are accurate, but will produce low-quality predictions when\nthe segmentations are incorrect. Checking the predictions of such a network\nwith the original image allows us to detect bad segmentations. However, to\nensure the verification method is truly robust, we need a method for checking\nthe quality of the predictions that does not itself rely on a black-box neural\nnetwork. Indeed, we show that previous methods for segmentation evaluation that\ndo use deep neural regression networks are vulnerable to false negatives i.e.\ncan inaccurately label bad segmentations as good. 
We describe the design of a\nverification network that avoids such vulnerability and present results to\ndemonstrate its robustness compared to previous methods.\n","authors":["Fahim Ahmed Zaman","Xiaodong Wu","Weiyu Xu","Milan Sonka","Raghuraman Mudumbai"],"pdf_url":"https://arxiv.org/pdf/2310.16999v2.pdf","comment":"5 Pages, 8 Figures, conference"},{"id":"http://arxiv.org/abs/2310.18894v1","updated":"2023-10-29T04:07:52Z","published":"2023-10-29T04:07:52Z","title":"Emergence of Shape Bias in Convolutional Neural Networks through\n Activation Sparsity","summary":" Current deep-learning models for object recognition are known to be heavily\nbiased toward texture. In contrast, human visual systems are known to be biased\ntoward shape and structure. What could be the design principles in human visual\nsystems that led to this difference? How could we introduce more shape bias\ninto the deep learning models? In this paper, we report that sparse coding, a\nubiquitous principle in the brain, can in itself introduce shape bias into the\nnetwork. We found that enforcing the sparse coding constraint using a\nnon-differential Top-K operation can lead to the emergence of structural\nencoding in neurons in convolutional neural networks, resulting in a smooth\ndecomposition of objects into parts and subparts and endowing the networks with\nshape bias. We demonstrated this emergence of shape bias and its functional\nbenefits for different network structures with various datasets. For object\nrecognition convolutional neural networks, the shape bias leads to greater\nrobustness against style and pattern change distraction. For the image\nsynthesis generative adversary networks, the emerged shape bias leads to more\ncoherent and decomposable structures in the synthesized images. Ablation\nstudies suggest that sparse codes tend to encode structures, whereas the more\ndistributed codes tend to favor texture. 
Our code is host at the github\nrepository: \\url{https://github.com/Crazy-Jack/nips2023_shape_vs_texture}\n","authors":["Tianqin Li","Ziqi Wen","Yangfan Li","Tai Sing Lee"],"pdf_url":"https://arxiv.org/pdf/2310.18894v1.pdf","comment":"Published as NeurIPS 2023 (Oral)"},{"id":"http://arxiv.org/abs/2310.14736v2","updated":"2023-10-29T03:57:18Z","published":"2023-10-23T09:16:04Z","title":"SAMCLR: Contrastive pre-training on complex scenes using SAM for view\n sampling","summary":" In Computer Vision, self-supervised contrastive learning enforces similar\nrepresentations between different views of the same image. The pre-training is\nmost often performed on image classification datasets, like ImageNet, where\nimages mainly contain a single class of objects. However, when dealing with\ncomplex scenes with multiple items, it becomes very unlikely for several views\nof the same image to represent the same object category. In this setting, we\npropose SAMCLR, an add-on to SimCLR which uses SAM to segment the image into\nsemantic regions, then sample the two views from the same region. Preliminary\nresults show empirically that when pre-training on Cityscapes and ADE20K, then\nevaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs\nat least on par with, and most often significantly outperforms not only SimCLR,\nbut also DINO and MoCo.\n","authors":["Benjamin Missaoui","Chongbin Yuan"],"pdf_url":"https://arxiv.org/pdf/2310.14736v2.pdf","comment":"Accepted at NeurIPS 2023 Workshop on SSL"},{"id":"http://arxiv.org/abs/2310.18890v1","updated":"2023-10-29T03:35:34Z","published":"2023-10-29T03:35:34Z","title":"Towards Generalized Multi-stage Clustering: Multi-view Self-distillation","summary":" Existing multi-stage clustering methods independently learn the salient\nfeatures from multiple views and then perform the clustering task.\nParticularly, multi-view clustering (MVC) has attracted a lot of attention in\nmulti-view or multi-modal scenarios. 
MVC aims at exploring common semantics and\npseudo-labels from multiple views and clustering in a self-supervised manner.\nHowever, limited by noisy data and inadequate feature learning, such a\nclustering paradigm generates overconfident pseudo-labels that mis-guide the\nmodel to produce inaccurate predictions. Therefore, it is desirable to have a\nmethod that can correct this pseudo-label mistraction in multi-stage clustering\nto avoid the bias accumulation. To alleviate the effect of overconfident\npseudo-labels and improve the generalization ability of the model, this paper\nproposes a novel multi-stage deep MVC framework where multi-view\nself-distillation (DistilMVC) is introduced to distill dark knowledge of label\ndistribution. Specifically, in the feature subspace at different hierarchies,\nwe explore the common semantics of multiple views through contrastive learning\nand obtain pseudo-labels by maximizing the mutual information between views.\nAdditionally, a teacher network is responsible for distilling pseudo-labels\ninto dark knowledge, supervising the student network and improving its\npredictive capabilities to enhance the robustness. Extensive experiments on\nreal-world multi-view datasets show that our method has better clustering\nperformance than state-of-the-art methods.\n","authors":["Jiatai Wang","Zhiwei Xu","Xin Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18890v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18887v1","updated":"2023-10-29T03:24:16Z","published":"2023-10-29T03:24:16Z","title":"Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes","summary":" Unsupervised monocular depth estimation techniques have demonstrated\nencouraging results but typically assume that the scene is static. These\ntechniques suffer when trained on dynamical scenes, where apparent object\nmotion can equally be explained by hypothesizing the object's independent\nmotion, or by altering its depth. 
This ambiguity causes depth estimators to\npredict erroneous depth for moving objects. To resolve this issue, we introduce\nDynamo-Depth, an unifying approach that disambiguates dynamical motion by\njointly learning monocular depth, 3D independent flow field, and motion\nsegmentation from unlabeled monocular videos. Specifically, we offer our key\ninsight that a good initial estimation of motion segmentation is sufficient for\njointly learning depth and independent motion despite the fundamental\nunderlying ambiguity. Our proposed method achieves state-of-the-art performance\non monocular depth estimation on Waymo Open and nuScenes Dataset with\nsignificant improvement in the depth of moving objects. Code and additional\nresults are available at https://dynamo-depth.github.io.\n","authors":["Yihong Sun","Bharath Hariharan"],"pdf_url":"https://arxiv.org/pdf/2310.18887v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18882v1","updated":"2023-10-29T03:07:30Z","published":"2023-10-29T03:07:30Z","title":"Differentiable Learning of Generalized Structured Matrices for Efficient\n Deep Neural Networks","summary":" This paper investigates efficient deep neural networks (DNNs) to replace\ndense unstructured weight matrices with structured ones that possess desired\nproperties. The challenge arises because the optimal weight matrix structure in\npopular neural network models is obscure in most cases and may vary from layer\nto layer even in the same network. Prior structured matrices proposed for\nefficient DNNs were mostly hand-crafted without a generalized framework to\nsystematically learn them. To address this issue, we propose a generalized and\ndifferentiable framework to learn efficient structures of weight matrices by\ngradient descent. We first define a new class of structured matrices that\ncovers a wide range of structured matrices in the literature by adjusting the\nstructural parameters. 
Then, the frequency-domain differentiable\nparameterization scheme based on the Gaussian-Dirichlet kernel is adopted to\nlearn the structural parameters by proximal gradient descent. Finally, we\nintroduce an effective initialization method for the proposed scheme. Our\nmethod learns efficient DNNs with structured matrices, achieving lower\ncomplexity and/or higher performance than prior approaches that employ\nlow-rank, block-sparse, or block-low-rank matrices.\n","authors":["Changwoo Lee","Hun-Seok Kim"],"pdf_url":"https://arxiv.org/pdf/2310.18882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18874v1","updated":"2023-10-29T02:22:38Z","published":"2023-10-29T02:22:38Z","title":"HDMNet: A Hierarchical Matching Network with Double Attention for\n Large-scale Outdoor LiDAR Point Cloud Registration","summary":" Outdoor LiDAR point clouds are typically large-scale and complexly\ndistributed. To achieve efficient and accurate registration, emphasizing the\nsimilarity among local regions and prioritizing global local-to-local matching\nis of utmost importance, subsequent to which accuracy can be enhanced through\ncost-effective fine registration. In this paper, a novel hierarchical neural\nnetwork with double attention named HDMNet is proposed for large-scale outdoor\nLiDAR point cloud registration. Specifically, A novel feature consistency\nenhanced double-soft matching network is introduced to achieve two-stage\nmatching with high flexibility while enlarging the receptive field with high\nefficiency in a patch-to patch manner, which significantly improves the\nregistration performance. Moreover, in order to further utilize the sparse\nmatching information from deeper layer, we develop a novel trainable embedding\nmask to incorporate the confidence scores of correspondences obtained from pose\nestimation of deeper layer, eliminating additional computations. 
The\nhigh-confidence keypoints in the sparser point cloud of the deeper layer\ncorrespond to a high-confidence spatial neighborhood region in shallower layer,\nwhich will receive more attention, while the features of non-key regions will\nbe masked. Extensive experiments are conducted on two large-scale outdoor LiDAR\npoint cloud datasets to demonstrate the high accuracy and efficiency of the\nproposed HDMNet.\n","authors":["Weiyi Xue","Fan Lu","Guang Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18874v1.pdf","comment":"Accepted by WACV2024"},{"id":"http://arxiv.org/abs/2203.03897v3","updated":"2023-10-29T00:01:40Z","published":"2022-03-08T07:34:52Z","title":"Geodesic Multi-Modal Mixup for Robust Fine-Tuning","summary":" Pre-trained multi-modal models, such as CLIP, provide transferable embeddings\nand show promising results in diverse applications. However, the analysis of\nlearned multi-modal embeddings is relatively unexplored, and the embedding\ntransferability can be improved. In this work, we observe that CLIP holds\nseparated embedding subspaces for two different modalities, and then we\ninvestigate it through the lens of uniformity-alignment to measure the quality\nof learned representation. Both theoretically and empirically, we show that\nCLIP retains poor uniformity and alignment even after fine-tuning. Such a lack\nof alignment and uniformity might restrict the transferability and robustness\nof embeddings. To this end, we devise a new fine-tuning method for robust\nrepresentation equipping better alignment and uniformity. First, we propose a\nGeodesic Multi-Modal Mixup that mixes the embeddings of image and text to\ngenerate hard negative samples on the hypersphere. Then, we fine-tune the model\non hard negatives as well as original negatives and positives with contrastive\nloss. Based on the theoretical analysis about hardness guarantee and limiting\nbehavior, we justify the use of our method. 
Extensive experiments on retrieval,\ncalibration, few- or zero-shot classification (under distribution shift),\nembedding arithmetic, and image captioning further show that our method\nprovides transferable representations, enabling robust model adaptation on\ndiverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup\n","authors":["Changdae Oh","Junhyuk So","Hoyoon Byun","YongTaek Lim","Minchul Shin","Jong-June Jeon","Kyungwoo Song"],"pdf_url":"https://arxiv.org/pdf/2203.03897v3.pdf","comment":"To appear at NeurIPS 2023"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.19156v1","updated":"2023-10-29T21:13:31Z","published":"2023-10-29T21:13:31Z","title":"Poisoning Retrieval Corpora by Injecting Adversarial Passages","summary":" Dense retrievers have achieved state-of-the-art performance in various\ninformation retrieval tasks, but to what extent can they be safely deployed in\nreal-world applications? In this work, we propose a novel attack for dense\nretrieval systems in which a malicious user generates a small number of\nadversarial passages by perturbing discrete tokens to maximize similarity with\na provided set of training queries. When these adversarial passages are\ninserted into a large retrieval corpus, we show that this attack is highly\neffective in fooling these systems to retrieve them for queries that were not\nseen by the attacker. More surprisingly, these adversarial passages can\ndirectly generalize to out-of-domain queries and corpora with a high success\nattack rate -- for instance, we find that 50 generated passages optimized on\nNatural Questions can mislead >94% of questions posed in financial documents or\nonline forums. We also benchmark and compare a range of state-of-the-art dense\nretrievers, both unsupervised and supervised. 
Although different systems\nexhibit varying levels of vulnerability, we show they can all be successfully\nattacked by injecting up to 500 passages, a small fraction compared to a\nretrieval corpus of millions of passages.\n","authors":["Zexuan Zhong","Ziqing Huang","Alexander Wettig","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2310.19156v1.pdf","comment":"EMNLP 2023. Our code is available at\n https://github.com/princeton-nlp/corpus-poisoning"},{"id":"http://arxiv.org/abs/2310.19056v1","updated":"2023-10-29T16:04:10Z","published":"2023-10-29T16:04:10Z","title":"MILL: Mutual Verification with Large Language Models for Zero-Shot Query\n Expansion","summary":" Query expansion is a commonly-used technique in many search systems to better\nrepresent users' information needs with additional query terms. Existing\nstudies for this task usually propose to expand a query with retrieved or\ngenerated contextual documents. However, both types of methods have clear\nlimitations. For retrieval-based methods, the documents retrieved with the\noriginal query might not be accurate enough to reveal the search intent,\nespecially when the query is brief or ambiguous. For generation-based methods,\nexisting models can hardly be trained or aligned on a particular corpus, due to\nthe lack of corpus-specific labeled data. In this paper, we propose a novel\nLarge Language Model (LLM) based mutual verification framework for query\nexpansion, which alleviates the aforementioned limitations. Specifically, we\nfirst design a query-query-document generation pipeline, which can effectively\nleverage the contextual knowledge encoded in LLMs to generate sub-queries and\ncorresponding documents from multiple perspectives. 
Next, we employ a mutual\nverification method for both generated and retrieved contextual documents,\nwhere 1) retrieved documents are filtered with the external contextual\nknowledge in generated documents, and 2) generated documents are filtered with\nthe corpus-specific knowledge in retrieved documents. Overall, the proposed\nmethod allows retrieved and generated documents to complement each other to\nfinalize a better query expansion. We conduct extensive experiments on three\ninformation retrieval datasets, i.e., TREC-DL-2020, TREC-COVID, and MSMARCO.\nThe results demonstrate that our method outperforms other baselines\nsignificantly.\n","authors":["Pengyue Jia","Yiding Liu","Xiangyu Zhao","Xiaopeng Li","Changying Hao","Shuaiqiang Wang","Dawei Yin"],"pdf_url":"https://arxiv.org/pdf/2310.19056v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.13388v2","updated":"2023-10-29T10:48:51Z","published":"2023-10-20T09:56:22Z","title":"Music Augmentation and Denoising For Peak-Based Audio Fingerprinting","summary":" Audio fingerprinting is a well-established solution for song identification\nfrom short recording excerpts. Popular methods rely on the extraction of sparse\nrepresentations, generally spectral peaks, and have proven to be accurate,\nfast, and scalable to large collections. However, real-world applications of\naudio identification often happen in noisy environments, which can cause these\nsystems to fail. In this work, we tackle this problem by introducing and\nreleasing a new audio augmentation pipeline that adds noise to music snippets\nin a realistic way, by stochastically mimicking real-world scenarios. 
We then\npropose and release a deep learning model that removes noisy components from\nspectrograms in order to improve peak-based fingerprinting systems' accuracy.\nWe show that the addition of our model improves the identification performance\nof commonly used audio fingerprinting systems, even under noisy conditions.\n","authors":["Kamil Akesbi","Dorian Desblancs","Benjamin Martin"],"pdf_url":"https://arxiv.org/pdf/2310.13388v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18951v1","updated":"2023-10-29T09:41:21Z","published":"2023-10-29T09:41:21Z","title":"A Multimodal Ecological Civilization Pattern Recommendation Method Based\n on Large Language Models and Knowledge Graph","summary":" The Ecological Civilization Pattern Recommendation System (ECPRS) aims to\nrecommend suitable ecological civilization patterns for target regions,\npromoting sustainable development and reducing regional disparities. However,\nthe current representative recommendation methods are not suitable for\nrecommending ecological civilization patterns in a geographical context. There\nare two reasons for this. Firstly, regions have spatial heterogeneity, and the\n(ECPRS)needs to consider factors like climate, topography, vegetation, etc., to\nrecommend civilization patterns adapted to specific ecological environments,\nensuring the feasibility and practicality of the recommendations. Secondly, the\nabstract features of the ecological civilization patterns in the real world\nhave not been fully utilized., resulting in poor richness in their embedding\nrepresentations and consequently, lower performance of the recommendation\nsystem. 
Considering these limitations, we propose the ECPR-MML method.\nInitially, based on the novel method UGPIG, we construct a knowledge graph to\nextract regional representations incorporating spatial heterogeneity features.\nFollowing that, inspired by the significant progress made by Large Language\nModels (LLMs) in the field of Natural Language Processing (NLP), we employ\nLarge LLMs to generate multimodal features for ecological civilization patterns\nin the form of text and images. We extract and integrate these multimodal\nfeatures to obtain semantically rich representations of ecological\ncivilization. Through extensive experiments, we validate the performance of our\nECPR-MML model. Our results show that F1@5 is 2.11% higher compared to\nstate-of-the-art models, 2.02% higher than NGCF, and 1.16% higher than UGPIG.\nFurthermore, multimodal data can indeed enhance recommendation performance.\nHowever, the data generated by LLM is not as effective as real data to a\ncertain extent.\n","authors":["Zhihang Yu","Shu Wang","Yunqiang Zhu","Zhiqiang Zou"],"pdf_url":"https://arxiv.org/pdf/2310.18951v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.04205v2","updated":"2023-10-29T09:25:47Z","published":"2023-10-06T12:44:04Z","title":"Keyword Augmented Retrieval: Novel framework for Information Retrieval\n integrated with speech interface","summary":" Retrieving answers in a quick and low cost manner without hallucinations from\na combination of structured and unstructured data using Language models is a\nmajor hurdle. This is what prevents employment of Language models in knowledge\nretrieval automation. This becomes accentuated when one wants to integrate a\nspeech interface on top of a text based knowledge retrieval system. Besides,\nfor commercial search and chat-bot applications, complete reliance on\ncommercial large language models (LLMs) like GPT 3.5 etc. 
can be very costly.\nIn the present study, the authors have addressed the aforementioned problem by\nfirst developing a keyword based search framework which augments discovery of\nthe context from the document to be provided to the LLM. The keywords in turn\nare generated by a relatively smaller LLM and cached for comparison with\nkeywords generated by the same smaller LLM against the query raised. This\nsignificantly reduces time and cost to find the context within documents. Once\nthe context is set, a larger LLM uses that to provide answers based on a prompt\ntailored for Q\\&A. This research work demonstrates that use of keywords in\ncontext identification reduces the overall inference time and cost of\ninformation retrieval. Given this reduction in inference time and cost with the\nkeyword augmented retrieval framework, a speech based interface for user input\nand response readout was integrated. This allowed a seamless interaction with\nthe language model.\n","authors":["Anupam Purwar","Rahul Sundar"],"pdf_url":"https://arxiv.org/pdf/2310.04205v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18863v1","updated":"2023-10-29T01:21:45Z","published":"2023-10-29T01:21:45Z","title":"The diminishing state of shared reality on US television news","summary":" The potential for a large, diverse population to coexist peacefully is\nthought to depend on the existence of a ``shared reality:'' a public sphere in\nwhich participants are exposed to similar facts about similar topics. A\ngeneration ago, broadcast television news was widely considered to serve this\nfunction; however, since the rise of cable news in the 1990s, critics and\nscholars have worried that the corresponding fragmentation and segregation of\naudiences along partisan lines has caused this shared reality to be lost. 
Here\nwe examine this concern using a unique combination of data sets tracking the\nproduction (since 2012) and consumption (since 2016) of television news content\non the three largest cable and broadcast networks respectively. With regard to\nproduction, we find strong evidence for the ``loss of shared reality\nhypothesis:'' while broadcast continues to cover similar topics with similar\nlanguage, cable news networks have become increasingly distinct, both from\nbroadcast news and each other, diverging both in terms of content and language.\nWith regard to consumption, we find more mixed evidence: while broadcast news\nhas indeed declined in popularity, it remains the dominant source of news for\nroughly 50\\% more Americans than does cable; moreover, its decline, while\nsomewhat attributable to cable, appears driven more by a shift away from news\nconsumption altogether than a growth in cable consumption. We conclude that\nshared reality on US television news is indeed diminishing, but is more robust\nthan previously thought and is declining for somewhat different reasons.\n","authors":["Homa Hosseinmardi","Samuel Wolken","David M. Rothschild","Duncan J. Watts"],"pdf_url":"https://arxiv.org/pdf/2310.18863v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2203.03897v3","updated":"2023-10-29T00:01:40Z","published":"2022-03-08T07:34:52Z","title":"Geodesic Multi-Modal Mixup for Robust Fine-Tuning","summary":" Pre-trained multi-modal models, such as CLIP, provide transferable embeddings\nand show promising results in diverse applications. However, the analysis of\nlearned multi-modal embeddings is relatively unexplored, and the embedding\ntransferability can be improved. In this work, we observe that CLIP holds\nseparated embedding subspaces for two different modalities, and then we\ninvestigate it through the lens of uniformity-alignment to measure the quality\nof learned representation. 
Both theoretically and empirically, we show that\nCLIP retains poor uniformity and alignment even after fine-tuning. Such a lack\nof alignment and uniformity might restrict the transferability and robustness\nof embeddings. To this end, we devise a new fine-tuning method for robust\nrepresentation equipping better alignment and uniformity. First, we propose a\nGeodesic Multi-Modal Mixup that mixes the embeddings of image and text to\ngenerate hard negative samples on the hypersphere. Then, we fine-tune the model\non hard negatives as well as original negatives and positives with contrastive\nloss. Based on the theoretical analysis about hardness guarantee and limiting\nbehavior, we justify the use of our method. Extensive experiments on retrieval,\ncalibration, few- or zero-shot classification (under distribution shift),\nembedding arithmetic, and image captioning further show that our method\nprovides transferable representations, enabling robust model adaptation on\ndiverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup\n","authors":["Changdae Oh","Junhyuk So","Hoyoon Byun","YongTaek Lim","Minchul Shin","Jong-June Jeon","Kyungwoo Song"],"pdf_url":"https://arxiv.org/pdf/2203.03897v3.pdf","comment":"To appear at NeurIPS 2023"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2310.19202v1","updated":"2023-10-29T23:53:37Z","published":"2023-10-29T23:53:37Z","title":"Improved Motor Imagery Classification Using Adaptive Spatial Filters\n Based on Particle Swarm Optimization Algorithm","summary":" As a typical self-paced brain-computer interface (BCI) system, the motor\nimagery (MI) BCI has been widely applied in fields such as robot control,\nstroke rehabilitation, and assistance for patients with stroke or spinal cord\ninjury. Many studies have focused on the traditional spatial filters obtained\nthrough the common spatial pattern (CSP) method. However, the CSP method can\nonly obtain fixed spatial filters for specific input signals. 
Besides, the CSP\nmethod only focuses on the variance difference of two types of\nelectroencephalogram (EEG) signals, so the decoding ability of EEG signals is\nlimited. To obtain more effective spatial filters for better extraction of\nspatial features that can improve MI-EEG classification, this paper proposes\nan adaptive spatial filter solving method based on the particle swarm optimization\n(PSO) algorithm. A training and testing framework based on filter bank and\nspatial filters (FBCSP-ASP) is designed for MI EEG signal classification.\nComparative experiments are conducted on two public datasets (2a and 2b) from\nBCI competition IV, which show the outstanding average recognition accuracy of\nFBCSP-ASP. The proposed method has achieved significant performance improvement\non MI-BCI. The classification accuracy of the proposed method has reached\n74.61% and 81.19% on datasets 2a and 2b, respectively. Compared with the\nbaseline algorithm (FBCSP), the proposed algorithm improves accuracy by 11.44% and 7.11% on\nthe two datasets, respectively. Furthermore, the analysis based on mutual\ninformation, t-SNE and Shapley values further proves that ASP features have\nexcellent decoding ability for MI-EEG signals, and explains the improvement of\nclassification performance by the introduction of ASP features.\n","authors":["Xiong Xiong","Ying Wang","Tianyuan Song","Jinguo Huang","Guixia Kang"],"pdf_url":"https://arxiv.org/pdf/2310.19202v1.pdf","comment":"25 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.19198v1","updated":"2023-10-29T23:37:47Z","published":"2023-10-29T23:37:47Z","title":"Enhancing Motor Imagery Decoding in Brain Computer Interfaces using\n Riemann Tangent Space Mapping and Cross Frequency Coupling","summary":" Objective: Motor Imagery (MI) serves as a crucial experimental paradigm\nwithin the realm of Brain Computer Interfaces (BCIs), aiming to decode motor\nintentions from electroencephalogram (EEG) signals. 
Method: Drawing inspiration\nfrom Riemannian geometry and Cross-Frequency Coupling (CFC), this paper\nintroduces a novel approach termed Riemann Tangent Space Mapping using\nDichotomous Filter Bank with Convolutional Neural Network (DFBRTS) to enhance\nthe representation quality and decoding capability pertaining to MI features.\nDFBRTS first initiates the process by meticulously filtering EEG signals\nthrough a Dichotomous Filter Bank, structured in the fashion of a complete\nbinary tree. Subsequently, it employs Riemann Tangent Space Mapping to extract\nsalient EEG signal features within each sub-band. Finally, a lightweight\nconvolutional neural network is employed for further feature extraction and\nclassification, operating under the joint supervision of cross-entropy and\ncenter loss. To validate the efficacy, extensive experiments were conducted\nusing DFBRTS on two well-established benchmark datasets: the BCI competition IV\n2a (BCIC-IV-2a) dataset and the OpenBMI dataset. The performance of DFBRTS was\nbenchmarked against several state-of-the-art MI decoding methods, alongside\nother Riemannian geometry-based MI decoding approaches. Results: DFBRTS\nsignificantly outperforms other MI decoding algorithms on both datasets,\nachieving a remarkable classification accuracy of 78.16% for four-class and\n71.58% for two-class hold-out classification, as compared to the existing\nbenchmarks.\n","authors":["Xiong Xiong","Li Su","Jinguo Huang","Guixia Kang"],"pdf_url":"https://arxiv.org/pdf/2310.19198v1.pdf","comment":"22 pages, 7 figures"},{"id":"http://arxiv.org/abs/2310.19192v1","updated":"2023-10-29T23:12:56Z","published":"2023-10-29T23:12:56Z","title":"Conformal Normalization in Recurrent Neural Network of Grid Cells","summary":" Grid cells in the entorhinal cortex of the mammalian brain exhibit striking\nhexagon firing patterns in their response maps as the animal (e.g., a rat)\nnavigates in a 2D open environment. 
The responses of the population of grid\ncells collectively form a vector in a high-dimensional neural activity space,\nand this vector represents the self-position of the agent in the 2D physical\nspace. As the agent moves, the vector is transformed by a recurrent neural\nnetwork that takes the velocity of the agent as input. In this paper, we\npropose a simple and general conformal normalization of the input velocity for\nthe recurrent neural network, so that the local displacement of the position\nvector in the high-dimensional neural space is proportional to the local\ndisplacement of the agent in the 2D physical space, regardless of the direction\nof the input velocity. Our numerical experiments on the minimally simple linear\nand non-linear recurrent networks show that conformal normalization leads to\nthe emergence of the hexagon grid patterns. Furthermore, we derive a new\ntheoretical understanding that connects conformal normalization to the\nemergence of hexagon grid patterns in navigation tasks.\n","authors":["Dehong Xu","Ruiqi Gao","Wen-Hao Zhang","Xue-Xin Wei","Ying Nian Wu"],"pdf_url":"https://arxiv.org/pdf/2310.19192v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19177v1","updated":"2023-10-29T22:37:54Z","published":"2023-10-29T22:37:54Z","title":"Robustifying Language Models with Test-Time Adaptation","summary":" Large-scale language models achieved state-of-the-art performance over a\nnumber of language tasks. However, they fail on adversarial language examples,\nwhich are sentences optimized to fool the language models but with similar\nsemantic meanings for humans. While prior work focuses on making the language\nmodel robust at training time, retraining for robustness is often unrealistic\nfor large-scale foundation models. Instead, we propose to make the language\nmodels robust at test time. By dynamically adapting the input sentence with\npredictions from masked words, we show that we can reverse many language\nadversarial attacks. 
Since our approach does not require any training, it works\nfor novel tasks at test time and can adapt to novel adversarial corruptions.\nVisualizations and empirical results on two popular sentence classification\ndatasets demonstrate that our method can repair adversarial language attacks\nover 65% o\n","authors":["Noah Thomas McDermott","Junfeng Yang","Chengzhi Mao"],"pdf_url":"https://arxiv.org/pdf/2310.19177v1.pdf","comment":"8 Pages 2 Figures Submitted to ICLR Workshop"},{"id":"http://arxiv.org/abs/2310.19174v1","updated":"2023-10-29T22:31:20Z","published":"2023-10-29T22:31:20Z","title":"Predicting recovery following stroke: deep learning, multimodal data and\n feature selection using explainable AI","summary":" Machine learning offers great potential for automated prediction of\npost-stroke symptoms and their response to rehabilitation. Major challenges for\nthis endeavour include the very high dimensionality of neuroimaging data, the\nrelatively small size of the datasets available for learning, and how to\neffectively combine neuroimaging and tabular data (e.g. demographic information\nand clinical characteristics). This paper evaluates several solutions based on\ntwo strategies. The first is to use 2D images that summarise MRI scans. The\nsecond is to select key features that improve classification accuracy.\nAdditionally, we introduce the novel approach of training a convolutional\nneural network (CNN) on images that combine regions-of-interest extracted from\nMRIs, with symbolic representations of tabular data. We evaluate a series of\nCNN architectures (both 2D and a 3D) that are trained on different\nrepresentations of MRI and tabular data, to predict whether a composite measure\nof post-stroke spoken picture description ability is in the aphasic or\nnon-aphasic range. MRI and tabular data were acquired from 758 English speaking\nstroke survivors who participated in the PLORAS study. 
The classification\naccuracy for a baseline logistic regression was 0.678 for lesion size alone,\nrising to 0.757 and 0.813 when initial symptom severity and recovery time were\nsuccessively added. The highest classification accuracy, 0.854, was observed when\n8 regions-of-interest were extracted from each MRI scan and combined with lesion\nsize, initial severity and recovery time in a 2D Residual Neural Network. Our\nfindings demonstrate how imaging and tabular data can be combined for high\npost-stroke classification accuracy, even when the dataset is small in machine\nlearning terms. We conclude by proposing how the current models could be\nimproved to achieve even higher levels of accuracy using images from hospital\nscanners.\n","authors":["Adam White","Margarita Saranti","Artur d'Avila Garcez","Thomas M. H. Hope","Cathy J. Price","Howard Bowman"],"pdf_url":"https://arxiv.org/pdf/2310.19174v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07304v2","updated":"2023-10-29T22:28:21Z","published":"2023-06-11T23:28:02Z","title":"A Holistic Approach to Unifying Automatic Concept Extraction and Concept\n Importance Estimation","summary":" In recent years, concept-based approaches have emerged as some of the most\npromising explainability methods to help us interpret the decisions of\nArtificial Neural Networks (ANNs). These methods seek to discover intelligible\nvisual 'concepts' buried within the complex patterns of ANN activations in two\nkey steps: (1) concept extraction followed by (2) importance estimation. While\nthese two steps are shared across methods, they all differ in their specific\nimplementations. Here, we introduce a unifying theoretical framework that\ncomprehensively defines and clarifies these two steps. 
This framework offers\nseveral advantages as it allows us: (i) to propose new evaluation metrics for\ncomparing different concept extraction approaches; (ii) to leverage modern\nattribution methods and evaluation metrics to extend and systematically\nevaluate state-of-the-art concept-based approaches and importance estimation\ntechniques; (iii) to derive theoretical guarantees regarding the optimality of\nsuch methods. We further leverage our framework to try to tackle a crucial\nquestion in explainability: how to efficiently identify clusters of data points\nthat are classified based on a similar shared strategy. To illustrate these\nfindings and to highlight the main strategies of a model, we introduce a visual\nrepresentation called the strategic cluster graph. Finally, we present\nhttps://serre-lab.github.io/Lens, a dedicated website that offers a complete\ncompilation of these visualizations for all classes of the ImageNet dataset.\n","authors":["Thomas Fel","Victor Boutin","Mazda Moayeri","Rémi Cadène","Louis Bethune","Léo andéol","Mathieu Chalvidal","Thomas Serre"],"pdf_url":"https://arxiv.org/pdf/2306.07304v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.08943v2","updated":"2023-10-29T22:11:39Z","published":"2023-06-15T08:33:52Z","title":"Neural Fields with Hard Constraints of Arbitrary Differential Order","summary":" While deep learning techniques have become extremely popular for solving a\nbroad range of optimization problems, methods to enforce hard constraints\nduring optimization, particularly on deep neural networks, remain\nunderdeveloped. Inspired by the rich literature on meshless interpolation and\nits extension to spectral collocation methods in scientific computing, we\ndevelop a series of approaches for enforcing hard constraints on neural fields,\nwhich we refer to as Constrained Neural Fields (CNF). 
The constraints can be\nspecified as a linear operator applied to the neural field and its derivatives.\nWe also design specific model representations and training strategies for\nproblems where standard models may encounter difficulties, such as conditioning\nof the system, memory consumption, and capacity of the network when being\nconstrained. Our approaches are demonstrated in a wide range of real-world\napplications. Additionally, we develop a framework that enables highly\nefficient model and constraint specification, which can be readily applied to\nany downstream task where hard constraints need to be explicitly satisfied\nduring optimization.\n","authors":["Fangcheng Zhong","Kyle Fogarty","Param Hanji","Tianhao Wu","Alejandro Sztrajman","Andrew Spielberg","Andrea Tagliasacchi","Petra Bosilj","Cengiz Oztireli"],"pdf_url":"https://arxiv.org/pdf/2306.08943v2.pdf","comment":"37th Conference on Neural Information Processing Systems (NeurIPS\n 2023)"},{"id":"http://arxiv.org/abs/2310.19167v1","updated":"2023-10-29T21:59:33Z","published":"2023-10-29T21:59:33Z","title":"Rare Event Probability Learning by Normalizing Flows","summary":" A rare event is defined by a low probability of occurrence. Accurate\nestimation of such small probabilities is of utmost importance across diverse\ndomains. Conventional Monte Carlo methods are inefficient, demanding an\nexorbitant number of samples to achieve reliable estimates. Inspired by the\nexact sampling capabilities of normalizing flows, we revisit this challenge and\npropose normalizing flow assisted importance sampling, termed NOFIS. NOFIS\nfirst learns a sequence of proposal distributions associated with predefined\nnested subset events by minimizing KL divergence losses. Next, it estimates the\nrare event probability by utilizing importance sampling in conjunction with the\nlast proposal. 
The efficacy of our NOFIS method is substantiated through\ncomprehensive qualitative visualizations, affirming the optimality of the\nlearned proposal distribution, as well as a series of quantitative experiments\nencompassing $10$ distinct test cases, which highlight NOFIS's superiority over\nbaseline approaches.\n","authors":["Zhenggqi Gao","Dinghuai Zhang","Luca Daniel","Duane S. Boning"],"pdf_url":"https://arxiv.org/pdf/2310.19167v1.pdf","comment":"16 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2310.19166v1","updated":"2023-10-29T21:56:22Z","published":"2023-10-29T21:56:22Z","title":"The Power of Explainability in Forecast-Informed Deep Learning Models\n for Flood Mitigation","summary":" Floods can cause horrific harm to life and property. However, they can be\nmitigated or even avoided by the effective use of hydraulic structures such as\ndams, gates, and pumps. By pre-releasing water via these structures in advance\nof extreme weather events, water levels are sufficiently lowered to prevent\nfloods. In this work, we propose FIDLAR, a Forecast Informed Deep Learning\nArchitecture, achieving flood management in watersheds with hydraulic\nstructures in an optimal manner by balancing out flood mitigation and\nunnecessary wastage of water via pre-releases. We perform experiments with\nFIDLAR using data from the South Florida Water Management District, which\nmanages a coastal area that is highly prone to frequent storms and floods.\nResults show that FIDLAR performs better than the current state-of-the-art with\nseveral orders of magnitude speedup and with provably better pre-release\nschedules. The dramatic speedups make it possible for FIDLAR to be used for\nreal-time flood management. 
The main contribution of this paper is the\neffective use of tools for model explainability, allowing us to understand the\ncontribution of the various environmental factors towards its decisions.\n","authors":["Jimeng Shi","Vitalii Stebliankin","Giri Narasimhan"],"pdf_url":"https://arxiv.org/pdf/2310.19166v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06253v2","updated":"2023-10-29T21:48:34Z","published":"2023-06-09T20:52:16Z","title":"Decision Stacks: Flexible Reinforcement Learning via Modular Generative\n Models","summary":" Reinforcement learning presents an attractive paradigm to reason about\nseveral distinct aspects of sequential decision making, such as specifying\ncomplex goals, planning future observations and actions, and critiquing their\nutilities. However, the combined integration of these capabilities poses\ncompeting algorithmic challenges in retaining maximal expressivity while\nallowing for flexibility in modeling choices for efficient learning and\ninference. We present Decision Stacks, a generative framework that decomposes\ngoal-conditioned policy agents into 3 generative modules. These modules\nsimulate the temporal evolution of observations, rewards, and actions via\nindependent generative models that can be learned in parallel via teacher\nforcing. Our framework guarantees both expressivity and flexibility in\ndesigning individual modules to account for key factors such as architectural\nbias, optimization objective and dynamics, transferrability across domains, and\ninference speed. 
Our empirical results demonstrate the effectiveness of\nDecision Stacks for offline policy optimization for several MDP and POMDP\nenvironments, outperforming existing methods and enabling flexible generative\ndecision making.\n","authors":["Siyan Zhao","Aditya Grover"],"pdf_url":"https://arxiv.org/pdf/2306.06253v2.pdf","comment":"published at NeurIPS 2023, project page:\n https://siyan-zhao.github.io/decision-stacks/"},{"id":"http://arxiv.org/abs/2310.19163v1","updated":"2023-10-29T21:47:24Z","published":"2023-10-29T21:47:24Z","title":"RAIFLE: Reconstruction Attacks on Interaction-based Federated Learning\n with Active Data Manipulation","summary":" Federated learning (FL) has recently emerged as a privacy-preserving approach\nfor machine learning in domains that rely on user interactions, particularly\nrecommender systems (RS) and online learning to rank (OLTR). While there has\nbeen substantial research on the privacy of traditional FL, little attention\nhas been paid to studying the privacy properties of these interaction-based FL\n(IFL) systems. In this work, we show that IFL can introduce unique challenges\nconcerning user privacy, particularly when the central server has knowledge and\ncontrol over the items that users interact with. Specifically, we demonstrate\nthe threat of reconstructing user interactions by presenting RAIFLE, a general\noptimization-based reconstruction attack framework customized for IFL. RAIFLE\nemploys Active Data Manipulation (ADM), a novel attack technique unique to IFL,\nwhere the server actively manipulates the training features of the items to\ninduce adversarial behaviors in the local FL updates. We show that RAIFLE is\nmore impactful than existing FL privacy attacks in the IFL context, and\ndescribe how it can undermine privacy defenses like secure aggregation and\nprivate information retrieval. 
Based on our findings, we propose and discuss\ncountermeasure guidelines to mitigate our attack in the context of federated\nRS/OLTR specifically and IFL more broadly.\n","authors":["Dzung Pham","Shreyas Kulkarni","Amir Houmansadr"],"pdf_url":"https://arxiv.org/pdf/2310.19163v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.03819v3","updated":"2023-10-29T21:41:46Z","published":"2023-06-06T16:07:24Z","title":"LEACE: Perfect linear concept erasure in closed form","summary":" Concept erasure aims to remove specified features from a representation. It\ncan improve fairness (e.g. preventing a classifier from using gender or race)\nand interpretability (e.g. removing a concept to observe changes in model\nbehavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form\nmethod which provably prevents all linear classifiers from detecting a concept\nwhile changing the representation as little as possible, as measured by a broad\nclass of norms. We apply LEACE to large language models with a novel procedure\ncalled \"concept scrubbing,\" which erases target concept information from every\nlayer in the network. We demonstrate our method on two tasks: measuring the\nreliance of language models on part-of-speech information, and reducing gender\nbias in BERT embeddings. Code is available at\nhttps://github.com/EleutherAI/concept-erasure.\n","authors":["Nora Belrose","David Schneider-Joseph","Shauli Ravfogel","Ryan Cotterell","Edward Raff","Stella Biderman"],"pdf_url":"https://arxiv.org/pdf/2306.03819v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.10121v2","updated":"2023-10-29T21:31:53Z","published":"2023-10-16T06:57:24Z","title":"From Continuous Dynamics to Graph Neural Networks: Neural Diffusion and\n Beyond","summary":" Graph neural networks (GNNs) have demonstrated significant promise in\nmodelling relational data and have been widely applied in various fields of\ninterest. 
The key mechanism behind GNNs is so-called message passing, where\ninformation is iteratively aggregated to central nodes from their\nneighbourhood. Such a scheme has been found to be intrinsically linked to a\nphysical process known as heat diffusion, where the propagation of GNNs\nnaturally corresponds to the evolution of heat density. Analogizing the process\nof message passing to the heat dynamics allows one to fundamentally understand the\npower and pitfalls of GNNs and consequently informs better model design.\nRecently, a plethora of works has emerged that propose GNNs inspired by\nthe continuous dynamics formulation, in an attempt to mitigate the known\nlimitations of GNNs, such as oversmoothing and oversquashing. In this survey,\nwe provide the first systematic and comprehensive review of studies that\nleverage the continuous perspective of GNNs. To this end, we introduce\nfoundational ingredients for adapting continuous dynamics to GNNs, along with a\ngeneral framework for the design of graph neural dynamics. We then review and\ncategorize existing works based on their driving mechanisms and underlying\ndynamics. We also summarize how the limitations of classic GNNs can be\naddressed under the continuous framework. We conclude by identifying multiple\nopen research directions.\n","authors":["Andi Han","Dai Shi","Lequan Lin","Junbin Gao"],"pdf_url":"https://arxiv.org/pdf/2310.10121v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19159v1","updated":"2023-10-29T21:19:08Z","published":"2023-10-29T21:19:08Z","title":"Transfer Learning in Transformer-Based Demand Forecasting For Home\n Energy Management System","summary":" Increasingly, homeowners opt for photovoltaic (PV) systems and/or battery\nstorage to minimize their energy bills and maximize renewable energy usage.\nThis has spurred the development of advanced control algorithms that maximally\nachieve those goals. 
However, a common challenge faced while developing such\ncontrollers is the unavailability of accurate forecasts of household power\nconsumption, especially for shorter time resolutions (15 minutes) and in a\ndata-efficient manner. In this paper, we analyze how transfer learning can help\nby exploiting data from multiple households to improve a single house's load\nforecasting. Specifically, we train an advanced forecasting model (a temporal\nfusion transformer) using data from multiple different households, and then\nfinetune this global model on a new household with limited data (i.e. only a\nfew days). The obtained models are used for forecasting power consumption of\nthe household for the next 24 hours~(day-ahead) at a time resolution of 15\nminutes, with the intention of using these forecasts in advanced controllers\nsuch as Model Predictive Control. We show the benefit of this transfer learning\nsetup versus solely using the individual new household's data, both in terms of\n(i) forecasting accuracy ($\\sim$15\\% MAE reduction) and (ii) control\nperformance ($\\sim$2\\% energy cost reduction), using real-world household data.\n","authors":["Gargya Gokhale","Jonas Van Gompel","Bert Claessens","Chris Develder"],"pdf_url":"https://arxiv.org/pdf/2310.19159v1.pdf","comment":"7 pages, 2 figures, workshop article at BALANCES, BuildSys'23"},{"id":"http://arxiv.org/abs/2310.19155v1","updated":"2023-10-29T21:10:38Z","published":"2023-10-29T21:10:38Z","title":"Real-World Implementation of Reinforcement Learning Based Energy\n Coordination for a Cluster of Households","summary":" Given its substantial contribution of 40\\% to global power consumption, the\nbuilt environment has received increasing attention to serve as a source of\nflexibility to assist the modern power grid. In that respect, previous research\nmainly focused on energy management of individual buildings. 
In contrast, in\nthis paper, we focus on aggregated control of a set of residential buildings,\nto provide grid supporting services, that eventually should include ancillary\nservices. In particular, we present a real-life pilot study that studies the\neffectiveness of reinforcement-learning (RL) in coordinating the power\nconsumption of 8 residential buildings to jointly track a target power signal.\nOur RL approach relies solely on observed data from individual households and\ndoes not require any explicit building models or simulators, making it\npractical to implement and easy to scale. We show the feasibility of our\nproposed RL-based coordination strategy in a real-world setting. In a 4-week\ncase study, we demonstrate a hierarchical control system, relying on an\nRL-based ranking system to select which households to activate flex assets\nfrom, and a real-time PI control-based power dispatch mechanism to control the\nselected assets. Our results demonstrate satisfactory power tracking, and the\neffectiveness of the RL-based ranks which are learnt in a purely data-driven\nmanner.\n","authors":["Gargya Gokhale","Niels Tiben","Marie-Sophie Verwee","Manu Lahariya","Bert Claessens","Chris Develder"],"pdf_url":"https://arxiv.org/pdf/2310.19155v1.pdf","comment":"8 pages, 2 figures, workshop article accepted at RLEM'23\n (BuildSys'23)"},{"id":"http://arxiv.org/abs/2308.06828v3","updated":"2023-10-29T21:07:54Z","published":"2023-08-13T18:14:10Z","title":"An Ensemble Approach to Question Classification: Integrating Electra\n Transformer, GloVe, and LSTM","summary":" Natural Language Processing (NLP) has emerged as a crucial technology for\nunderstanding and generating human language, playing an essential role in tasks\nsuch as machine translation, sentiment analysis, and more pertinently, question\nclassification. 
As a subfield within NLP, question classification focuses on\ndetermining the type of information being sought, a fundamental step for\ndownstream applications like question answering systems. This study presents an\ninnovative ensemble approach for question classification, combining the\nstrengths of Electra, GloVe, and LSTM models. Rigorously tested on the\nwell-regarded TREC dataset, the model demonstrates how the integration of these\ndisparate technologies can lead to superior results. Electra brings in its\ntransformer-based capabilities for complex language understanding, GloVe offers\nglobal vector representations for capturing word-level semantics, and LSTM\ncontributes its sequence learning abilities to model long-term dependencies. By\nfusing these elements strategically, our ensemble model delivers a robust and\nefficient solution for the complex task of question classification. Through\nrigorous comparisons with well-known models like BERT, RoBERTa, and DistilBERT,\nthe ensemble approach verifies its effectiveness by attaining an 80% accuracy\nscore on the test dataset.\n","authors":["Sanad Aburass","Osama Dorgham","Maha Abu Rumman"],"pdf_url":"https://arxiv.org/pdf/2308.06828v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19152v1","updated":"2023-10-29T21:06:34Z","published":"2023-10-29T21:06:34Z","title":"BERT Lost Patience Won't Be Robust to Adversarial Slowdown","summary":" In this paper, we systematically evaluate the robustness of multi-exit\nlanguage models against adversarial slowdown. To audit their robustness, we\ndesign a slowdown attack that generates natural adversarial text bypassing\nearly-exit points. We use the resulting WAFFLE attack as a vehicle to conduct a\ncomprehensive evaluation of three multi-exit mechanisms with the GLUE benchmark\nagainst adversarial slowdown. We then show our attack significantly reduces the\ncomputational savings provided by the three methods in both white-box and\nblack-box settings. 
The more complex a mechanism is, the more vulnerable it is\nto adversarial slowdown. We also perform a linguistic analysis of the perturbed\ntext inputs, identifying common perturbation patterns that our attack\ngenerates, and comparing them with standard adversarial text attacks. Moreover,\nwe show that adversarial training is ineffective in defeating our slowdown\nattack, but input sanitization with a conversational model, e.g., ChatGPT, can\nremove perturbations effectively. This result suggests that future work is\nneeded for developing efficient yet robust multi-exit models. Our code is\navailable at: https://github.com/ztcoalson/WAFFLE\n","authors":["Zachary Coalson","Gabriel Ritter","Rakesh Bobba","Sanghyun Hong"],"pdf_url":"https://arxiv.org/pdf/2310.19152v1.pdf","comment":"Accepted to NeurIPS 2023 [Poster]"},{"id":"http://arxiv.org/abs/2309.04272v2","updated":"2023-10-29T21:02:23Z","published":"2023-09-08T11:47:31Z","title":"Learning in Zero-Sum Linear Quadratic Games with Last-Iterate\n Convergence","summary":" Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and\ncan be used (i)~as a dynamic game formulation for risk-sensitive or robust\ncontrol and (ii)~as a benchmark setting for multi-agent reinforcement learning\nwith two competing agents in continuous state-control spaces. In contrast to\nthe well-studied single-agent linear quadratic regulator problem, zero-sum LQ\ngames entail solving a challenging nonconvex-nonconcave min-max problem with an\nobjective function that lacks coercivity. Recently, Zhang et al. showed that\nan~$\\epsilon$-Nash equilibrium (NE) of finite horizon zero-sum LQ games can be\nlearned via nested model-free Natural Policy Gradient (NPG) algorithms with\npoly$(1/\\epsilon)$ sample complexity. In this work, we propose a simpler nested\nZeroth-Order (ZO) algorithm improving sample complexity by several orders of\nmagnitude and guaranteeing convergence of the last iterate. 
Our main results\nare two-fold: (i) in the deterministic setting, we establish the first global\nlast-iterate linear convergence result for the nested algorithm that seeks NE\nof zero-sum LQ games; (ii) in the model-free setting, we establish\na~$\\widetilde{\\mathcal{O}}(\\epsilon^{-2})$ sample complexity using a\nsingle-point ZO estimator. For our last-iterate convergence results, our\nanalysis leverages the Implicit Regularization (IR) property and a new gradient\ndomination condition for the primal function. Our key improvements in the\nsample complexity rely on a more sample-efficient nested algorithm design and a\nfiner control of the ZO natural gradient estimation error utilizing the\nstructure endowed by the finite-horizon setting.\n","authors":["Jiduan Wu","Anas Barakat","Ilyas Fatkhullin","Niao He"],"pdf_url":"https://arxiv.org/pdf/2309.04272v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19142v1","updated":"2023-10-29T20:32:21Z","published":"2023-10-29T20:32:21Z","title":"MAG-GNN: Reinforcement Learning Boosted Graph Neural Network","summary":" While Graph Neural Networks (GNNs) recently became powerful tools in graph\nlearning tasks, considerable efforts have been spent on improving GNNs'\nstructural encoding ability. A particular line of work proposed subgraph GNNs\nthat use subgraph information to improve GNNs' expressivity and achieved great\nsuccess. However, such effectivity sacrifices the efficiency of GNNs by\nenumerating all possible subgraphs. In this paper, we analyze the necessity of\ncomplete subgraph enumeration and show that a model can achieve a comparable\nlevel of expressivity by considering a small subset of the subgraphs. We then\nformulate the identification of the optimal subset as a combinatorial\noptimization problem and propose Magnetic Graph Neural Network (MAG-GNN), a\nreinforcement learning (RL) boosted GNN, to solve the problem. 
Starting with a\ncandidate subgraph set, MAG-GNN employs an RL agent to iteratively update the\nsubgraphs to locate the most expressive set for prediction. This reduces the\nexponential complexity of subgraph enumeration to the constant complexity of a\nsubgraph search algorithm while keeping good expressivity. We conduct extensive\nexperiments on many datasets, showing that MAG-GNN achieves competitive\nperformance to state-of-the-art methods and even outperforms many subgraph\nGNNs. We also demonstrate that MAG-GNN effectively reduces the running time of\nsubgraph GNNs.\n","authors":["Lecheng Kong","Jiarui Feng","Hao Liu","Dacheng Tao","Yixin Chen","Muhan Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.19142v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19137v1","updated":"2023-10-29T19:59:55Z","published":"2023-10-29T19:59:55Z","title":"Automaton Distillation: Neuro-Symbolic Transfer Learning for Deep\n Reinforcement Learning","summary":" Reinforcement learning (RL) is a powerful tool for finding optimal policies\nin sequential decision processes. However, deep RL methods suffer from two\nweaknesses: collecting the amount of agent experience required for practical RL\nproblems is prohibitively expensive, and the learned policies exhibit poor\ngeneralization on tasks outside of the training distribution. To mitigate these\nissues, we introduce automaton distillation, a form of neuro-symbolic transfer\nlearning in which Q-value estimates from a teacher are distilled into a\nlow-dimensional representation in the form of an automaton. We then propose two\nmethods for generating Q-value estimates: static transfer, which reasons over\nan abstract Markov Decision Process constructed based on prior knowledge, and\ndynamic transfer, where symbolic information is extracted from a teacher Deep\nQ-Network (DQN). 
The resulting Q-value estimates from either method are used to\nbootstrap learning in the target environment via a modified DQN loss function.\nWe list several failure modes of existing automaton-based transfer methods and\ndemonstrate that both static and dynamic automaton distillation decrease the\ntime required to find optimal policies for various decision tasks.\n","authors":["Suraj Singireddy","Andre Beckus","George Atia","Sumit Jha","Alvaro Velasquez"],"pdf_url":"https://arxiv.org/pdf/2310.19137v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.11990v3","updated":"2023-10-29T19:45:09Z","published":"2023-01-27T21:03:19Z","title":"Alignment with human representations supports robust few-shot learning","summary":" Should we care whether AI systems have representations of the world that are\nsimilar to those of humans? We provide an information-theoretic analysis that\nsuggests that there should be a U-shaped relationship between the degree of\nrepresentational alignment with humans and performance on few-shot learning\ntasks. We confirm this prediction empirically, finding such a relationship in\nan analysis of the performance of 491 computer vision models. We also show that\nhighly-aligned models are more robust to both natural adversarial attacks and\ndomain shifts. Our results suggest that human-alignment is often a sufficient,\nbut not necessary, condition for models to make effective use of limited data,\nbe robust, and generalize well.\n","authors":["Ilia Sucholutsky","Thomas L. 
Griffiths"],"pdf_url":"https://arxiv.org/pdf/2301.11990v3.pdf","comment":"Spotlight at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.16215v2","updated":"2023-10-29T19:44:57Z","published":"2023-05-25T16:22:22Z","title":"Koopman Kernel Regression","summary":" Many machine learning approaches for decision making, such as reinforcement\nlearning, rely on simulators or predictive models to forecast the\ntime-evolution of quantities of interest, e.g., the state of an agent or the\nreward of a policy. Forecasts of such complex phenomena are commonly described\nby highly nonlinear dynamical systems, making their use in optimization-based\ndecision-making challenging. Koopman operator theory offers a beneficial\nparadigm for addressing this problem by characterizing forecasts via linear\ntime-invariant (LTI) ODEs -- turning multi-step forecasting into sparse matrix\nmultiplications. Though there exists a variety of learning approaches, they\nusually lack crucial learning-theoretic guarantees, making the behavior of the\nobtained models with increasing data and dimensionality unclear. We address the\naforementioned by deriving a novel reproducing kernel Hilbert space (RKHS) over\ntrajectories that solely spans transformations into LTI dynamical systems. The\nresulting Koopman Kernel Regression (KKR) framework enables the use of\nstatistical learning tools from function approximation for novel convergence\nresults and generalization error bounds under weaker assumptions than existing\nwork. 
Our experiments demonstrate superior forecasting performance compared to\nKoopman operator and sequential data predictors in RKHS.\n","authors":["Petar Bevanda","Max Beier","Armin Lederer","Stefan Sosnowski","Eyke Hüllermeier","Sandra Hirche"],"pdf_url":"https://arxiv.org/pdf/2305.16215v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16284v2","updated":"2023-10-29T19:36:37Z","published":"2023-05-25T17:40:43Z","title":"DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent\n Method","summary":" This paper proposes a new easy-to-implement parameter-free gradient-based\noptimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is\nefficient -- matching the convergence rate of optimally tuned gradient descent\nin convex optimization up to a logarithmic factor without tuning any\nparameters, and universal -- automatically adapting to both smooth and\nnonsmooth problems. While popular algorithms following the AdaGrad framework\ncompute a running average of the squared gradients to use for normalization,\nDoWG maintains a new distance-based weighted version of the running average,\nwhich is crucial to achieve the desired properties. To complement our theory,\nwe also show empirically that DoWG trains at the edge of stability, and\nvalidate its effectiveness on practical machine learning tasks.\n","authors":["Ahmed Khaled","Konstantin Mishchenko","Chi Jin"],"pdf_url":"https://arxiv.org/pdf/2305.16284v2.pdf","comment":"22 pages, 1 table, 4 figures"},{"id":"http://arxiv.org/abs/2208.03392v5","updated":"2023-10-29T19:32:21Z","published":"2022-08-05T21:41:15Z","title":"Federated Learning for Medical Applications: A Taxonomy, Current Trends,\n Challenges, and Future Research Directions","summary":" With the advent of the IoT, AI, ML, and DL algorithms, the landscape of\ndata-driven medical applications has emerged as a promising avenue for\ndesigning robust and scalable diagnostic and prognostic models from medical\ndata. 
This has gained a lot of attention from both academia and industry,\nleading to significant improvements in healthcare quality. However, the\nadoption of AI-driven medical applications still faces tough challenges,\nincluding meeting security, privacy, and quality of service (QoS) standards.\nRecent developments in federated learning (FL) have made it possible to train complex\nmachine-learned models in a distributed manner and have become an active\nresearch domain, particularly processing the medical data at the edge of the\nnetwork in a decentralized way to preserve privacy and address security\nconcerns. To this end, in this paper, we explore the present and future of FL\ntechnology in medical applications where data sharing is a significant\nchallenge. We delve into the current research trends and their outcomes,\nunravelling the complexities of designing reliable and scalable FL models.\nOur paper outlines the fundamental statistical issues in FL, tackles\ndevice-related problems, addresses security challenges, and navigates the\ncomplexity of privacy concerns, all while highlighting its transformative\npotential in the medical field. Our study primarily focuses on medical\napplications of FL, particularly in the context of global cancer\ndiagnosis. We highlight the potential of FL to enable computer-aided diagnosis\ntools that address this challenge with greater effectiveness than traditional\ndata-driven methods. We hope that this comprehensive review will serve as a\ncheckpoint for the field, summarizing the current state-of-the-art and\nidentifying open problems and future research directions.\n","authors":["Ashish Rauniyar","Desta Haileselassie Hagos","Debesh Jha","Jan Erik Håkegård","Ulas Bagci","Danda B. 
Rawat","Vladimir Vlassov"],"pdf_url":"https://arxiv.org/pdf/2208.03392v5.pdf","comment":"Accepted at IEEE Internet of Things Journal"},{"id":"http://arxiv.org/abs/2310.19126v1","updated":"2023-10-29T19:25:48Z","published":"2023-10-29T19:25:48Z","title":"Worst-case Performance of Popular Approximate Nearest Neighbor Search\n Implementations: Guarantees and Limitations","summary":" Graph-based approaches to nearest neighbor search are popular and powerful\ntools for handling large datasets in practice, but they have limited\ntheoretical guarantees. We study the worst-case performance of recent\ngraph-based approximate nearest neighbor search algorithms, such as HNSW, NSG\nand DiskANN. For DiskANN, we show that its \"slow preprocessing\" version\nprovably supports approximate nearest neighbor search query with constant\napproximation ratio and poly-logarithmic query time, on data sets with bounded\n\"intrinsic\" dimension. For the other data structure variants studied, including\nDiskANN with \"fast preprocessing\", HNSW and NSG, we present a family of\ninstances on which the empirical query time required to achieve a \"reasonable\"\naccuracy is linear in instance size. For example, for DiskANN, we show that the\nquery procedure can take at least $0.1 n$ steps on instances of size $n$ before\nit encounters any of the $5$ nearest neighbors of the query.\n","authors":["Piotr Indyk","Haike Xu"],"pdf_url":"https://arxiv.org/pdf/2310.19126v1.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19124v1","updated":"2023-10-29T19:21:33Z","published":"2023-10-29T19:21:33Z","title":"Software engineering for deep learning applications: usage of SWEng and\n MLops tools in GitHub repositories","summary":" The rising popularity of deep learning (DL) methods and techniques has\ninvigorated interest in the topic of SE4DL, the application of software\nengineering (SE) practices on deep learning software. 
Despite the novel\nengineering challenges brought on by the data-driven and non-deterministic\nparadigm of DL software, little work has been invested into developing\nAI-targeted SE tools. On the other hand, tools tackling more general\nengineering issues in DL are actively used and referred to under the umbrella\nterm of ``MLOps tools''. Furthermore, the available literature supports the\nutility of conventional SE tooling in DL software development. Building upon\nprevious MSR research on tool usage in open-source software works, we identify\nconventional and MLOps tools adopted in popular applied DL projects that use\nPython as the main programming language. About 70% of the GitHub repositories\nmined contained at least one conventional SE tool. Software configuration\nmanagement tools are the most adopted, while the opposite applies to\nmaintenance tools. Substantially fewer MLOps tools were in use, with only 9\ntools out of a sample of 80 used in at least one repository. The majority of\nthem were open-source rather than proprietary. One of these tools, TensorBoard,\nwas found to be adopted in about half of the repositories in our study.\nConsequently, the use of conventional SE tooling demonstrates its relevance to\nDL software. 
Further research is recommended on the adoption of MLOps tooling\nby open-source projects, focusing on the relevance of particular tool types,\nthe development of required tools, as well as ways to promote the use of\nalready available tools.\n","authors":["Evangelia Panourgia","Theodoros Plessas","Diomidis Spinellis"],"pdf_url":"https://arxiv.org/pdf/2310.19124v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19112v1","updated":"2023-10-29T18:57:15Z","published":"2023-10-29T18:57:15Z","title":"Efficient IoT Inference via Context-Awareness","summary":" While existing strategies for optimizing deep learning-based classification\nmodels on low-power platforms assume the models are trained on all classes of\ninterest, this paper posits that adopting context-awareness i.e. focusing\nsolely on the likely classes in the current context, can substantially enhance\nperformance in resource-constrained environments. We propose a new paradigm,\nCACTUS, for scalable and efficient context-aware classification where a\nmicro-classifier recognizes a small set of classes relevant to the current\ncontext and, when context change happens, rapidly switches to another suitable\nmicro-classifier. CACTUS has several innovations including optimizing the\ntraining cost of context-aware classifiers, enabling on-the-fly context-aware\nswitching between classifiers, and selecting the best context-aware classifiers\ngiven limited resources. 
We show that CACTUS achieves significant benefits in\naccuracy, latency, and compute budget across a range of datasets and IoT\nplatforms.\n","authors":["Mohammad Mehdi Rastikerdar","Jin Huang","Shiwei Fang","Hui Guan","Deepak Ganesan"],"pdf_url":"https://arxiv.org/pdf/2310.19112v1.pdf","comment":"12 pages, 10 figures"},{"id":"http://arxiv.org/abs/2210.04317v2","updated":"2023-10-29T18:52:49Z","published":"2022-10-09T18:57:08Z","title":"A Spectral Approach to Item Response Theory","summary":" The Rasch model is one of the most fundamental models in \\emph{item response\ntheory} and has wide-ranging applications from education testing to\nrecommendation systems. In a universe with $n$ users and $m$ items, the Rasch\nmodel assumes that the binary response $X_{li} \\in \\{0,1\\}$ of a user $l$ with\nparameter $\\theta^*_l$ to an item $i$ with parameter $\\beta^*_i$ (e.g., a user\nlikes a movie, a student correctly solves a problem) is distributed as\n$\\Pr(X_{li}=1) = 1/(1 + \\exp{-(\\theta^*_l - \\beta^*_i)})$. In this paper, we\npropose a \\emph{new item estimation} algorithm for this celebrated model (i.e.,\nto estimate $\\beta^*$). The core of our algorithm is the computation of the\nstationary distribution of a Markov chain defined on an item-item graph. We\ncomplement our algorithmic contributions with finite-sample error guarantees,\nthe first of their kind in the literature, showing that our algorithm is\nconsistent and enjoys favorable optimality properties. We discuss practical\nmodifications to accelerate and robustify the algorithm that practitioners can\nadopt. 
Experiments on synthetic and real-life datasets, ranging from small\neducation testing datasets to large recommendation systems datasets show that\nour algorithm is scalable, accurate, and competitive with the most commonly\nused methods in the literature.\n","authors":["Duc Nguyen","Anderson Zhang"],"pdf_url":"https://arxiv.org/pdf/2210.04317v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19103v1","updated":"2023-10-29T18:35:05Z","published":"2023-10-29T18:35:05Z","title":"Proving Linear Mode Connectivity of Neural Networks via Optimal\n Transport","summary":" The energy landscape of high-dimensional non-convex optimization problems is\ncrucial to understanding the effectiveness of modern deep neural network\narchitectures. Recent works have experimentally shown that two different\nsolutions found after two runs of a stochastic training are often connected by\nvery simple continuous paths (e.g., linear) modulo a permutation of the\nweights. In this paper, we provide a framework theoretically explaining this\nempirical observation. Based on convergence rates in Wasserstein distance of\nempirical measures, we show that, with high probability, two wide enough\ntwo-layer neural networks trained with stochastic gradient descent are linearly\nconnected. Additionally, we express upper and lower bounds on the width of each\nlayer of two deep neural networks with independent neuron weights to be\nlinearly connected. 
Finally, we empirically demonstrate the validity of our\napproach by showing how the dimension of the support of the weight distribution\nof neurons, which dictates Wasserstein convergence rates is correlated with\nlinear mode connectivity.\n","authors":["Damien Ferbach","Baptiste Goujaud","Gauthier Gidel","Aymeric Dieuleveut"],"pdf_url":"https://arxiv.org/pdf/2310.19103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.02652v2","updated":"2023-10-29T18:35:01Z","published":"2023-06-05T07:38:13Z","title":"Towards Anytime Classification in Early-Exit Architectures by Enforcing\n Conditional Monotonicity","summary":" Modern predictive models are often deployed to environments in which\ncomputational budgets are dynamic. Anytime algorithms are well-suited to such\nenvironments as, at any point during computation, they can output a prediction\nwhose quality is a function of computation time. Early-exit neural networks\nhave garnered attention in the context of anytime computation due to their\ncapability to provide intermediate predictions at various stages throughout the\nnetwork. However, we demonstrate that current early-exit networks are not\ndirectly applicable to anytime settings, as the quality of predictions for\nindividual data points is not guaranteed to improve with longer computation. To\naddress this shortcoming, we propose an elegant post-hoc modification, based on\nthe Product-of-Experts, that encourages an early-exit network to become\ngradually confident. This gives our deep models the property of conditional\nmonotonicity in the prediction quality -- an essential stepping stone towards\ntruly anytime predictive modeling using early-exit architectures. 
Our empirical\nresults on standard image-classification tasks demonstrate that such behaviors\ncan be achieved while preserving competitive accuracy on average.\n","authors":["Metod Jazbec","James Urquhart Allingham","Dan Zhang","Eric Nalisnick"],"pdf_url":"https://arxiv.org/pdf/2306.02652v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19102v1","updated":"2023-10-29T18:33:05Z","published":"2023-10-29T18:33:05Z","title":"Atom: Low-bit Quantization for Efficient and Accurate LLM Serving","summary":" The growing demand for Large Language Models (LLMs) in applications such as\ncontent generation, intelligent chatbots, and sentiment analysis poses\nconsiderable challenges for LLM service providers. To efficiently use GPU\nresources and boost throughput, batching multiple requests has emerged as a\npopular paradigm; to further speed up batching, LLM quantization techniques\nreduce memory consumption and increase computing capacity. However, prevalent\nquantization schemes (e.g., 8-bit weight-activation quantization) cannot fully\nleverage the capabilities of modern GPUs, such as 4-bit integer operators,\nresulting in sub-optimal performance.\n To maximize LLMs' serving throughput, we introduce Atom, a low-bit\nquantization method that achieves high throughput improvements with negligible\naccuracy loss. Atom significantly boosts serving throughput by using low-bit\noperators and considerably reduces memory consumption via low-bit quantization.\nIt attains high accuracy by applying a novel mixed-precision and fine-grained\nquantization process. We evaluate Atom on 4-bit weight-activation quantization\nsetups in the serving context. 
Atom improves end-to-end throughput by up to\n$7.73\\times$ compared to the FP16 and by $2.53\\times$ compared to INT8\nquantization, while maintaining the same latency target.\n","authors":["Yilong Zhao","Chien-Yu Lin","Kan Zhu","Zihao Ye","Lequn Chen","Size Zheng","Luis Ceze","Arvind Krishnamurthy","Tianqi Chen","Baris Kasikci"],"pdf_url":"https://arxiv.org/pdf/2310.19102v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.07116v2","updated":"2023-10-29T18:19:41Z","published":"2023-05-11T20:07:10Z","title":"Energy cost and machine learning accuracy impact of k-anonymisation and\n synthetic data techniques","summary":" To address increasing societal concerns regarding privacy and climate, the EU\nadopted the General Data Protection Regulation (GDPR) and committed to the\nGreen Deal. Considerable research studied the energy efficiency of software and\nthe accuracy of machine learning models trained on anonymised data sets. Recent\nwork began exploring the impact of privacy-enhancing techniques (PET) on both\nthe energy consumption and accuracy of the machine learning models, focusing on\nk-anonymity. As synthetic data is becoming an increasingly popular PET, this\npaper analyses the energy consumption and accuracy of two phases: a) applying\nprivacy-enhancing techniques to the concerned data set, b) training the models\non the concerned privacy-enhanced data set. We use two privacy-enhancing\ntechniques: k-anonymisation (using generalisation and suppression) and\nsynthetic data, and three machine-learning models. Each model is trained on\neach privacy-enhanced data set. Our results show that models trained on\nk-anonymised data consume less energy than models trained on the original data,\nwith a similar performance regarding accuracy. 
Models trained on synthetic data\nhave a similar energy consumption and a similar to lower accuracy compared to\nmodels trained on the original data.\n","authors":["Pepijn de Reus","Ana Oprescu","Koen van Elsen"],"pdf_url":"https://arxiv.org/pdf/2305.07116v2.pdf","comment":"Published in the proceedings (Pages: 57-65) of The International\n Conference on Information and Communications Technology for Sustainability\n (ICT4S) 2023 in Rennes, France. 9 pages, 4 figures, 5 tables"},{"id":"http://arxiv.org/abs/2306.10168v3","updated":"2023-10-29T18:13:46Z","published":"2023-06-16T20:11:38Z","title":"Beyond Geometry: Comparing the Temporal Structure of Computation in\n Neural Circuits with Dynamical Similarity Analysis","summary":" How can we tell whether two neural networks utilize the same internal\nprocesses for a particular computation? This question is pertinent for multiple\nsubfields of neuroscience and machine learning, including neuroAI, mechanistic\ninterpretability, and brain-machine interfaces. Standard approaches for\ncomparing neural networks focus on the spatial geometry of latent states. Yet\nin recurrent networks, computations are implemented at the level of dynamics,\nand two networks performing the same computation with equivalent dynamics need\nnot exhibit the same geometry. To bridge this gap, we introduce a novel\nsimilarity metric that compares two systems at the level of their dynamics,\ncalled Dynamical Similarity Analysis (DSA). Our method incorporates two\ncomponents: Using recent advances in data-driven dynamical systems theory, we\nlearn a high-dimensional linear system that accurately captures core features\nof the original nonlinear dynamics. Next, we compare different systems passed\nthrough this embedding using a novel extension of Procrustes Analysis that\naccounts for how vector fields change under orthogonal transformation. 
In four\ncase studies, we demonstrate that our method disentangles conjugate and\nnon-conjugate recurrent neural networks (RNNs), while geometric methods fall\nshort. We additionally show that our method can distinguish learning rules in\nan unsupervised manner. Our method opens the door to comparative analyses of\nthe essential temporal structure of computation in neural circuits.\n","authors":["Mitchell Ostrow","Adam Eisen","Leo Kozachkov","Ila Fiete"],"pdf_url":"https://arxiv.org/pdf/2306.10168v3.pdf","comment":"22 pages, 9 figures"},{"id":"http://arxiv.org/abs/2207.03444v2","updated":"2023-10-29T18:12:25Z","published":"2022-07-07T17:20:15Z","title":"Fairness and Bias in Robot Learning","summary":" Machine learning has significantly enhanced the abilities of robots, enabling\nthem to perform a wide range of tasks in human environments and adapt to our\nuncertain real world. Recent works in various machine learning domains have\nhighlighted the importance of accounting for fairness to ensure that these\nalgorithms do not reproduce human biases and consequently lead to\ndiscriminatory outcomes. With robot learning systems increasingly performing\nmore and more tasks in our everyday lives, it is crucial to understand the\ninfluence of such biases to prevent unintended behavior toward certain groups\nof people. In this work, we present the first survey on fairness in robot\nlearning from an interdisciplinary perspective spanning technical, ethical, and\nlegal challenges. We propose a taxonomy for sources of bias and the resulting\ntypes of discrimination due to them. Using examples from different robot\nlearning domains, we examine scenarios of unfair outcomes and strategies to\nmitigate them. We present early advances in the field by covering different\nfairness definitions, ethical and legal considerations, and methods for fair\nrobot learning. 
With this work, we aim to pave the road for groundbreaking\ndevelopments in fair robot learning.\n","authors":["Laura Londoño","Juana Valeria Hurtado","Nora Hertz","Philipp Kellmeyer","Silja Voeneky","Abhinav Valada"],"pdf_url":"https://arxiv.org/pdf/2207.03444v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19091v1","updated":"2023-10-29T17:44:48Z","published":"2023-10-29T17:44:48Z","title":"Bridging the Gap: Towards an Expanded Toolkit for ML-Supported\n Decision-Making in the Public Sector","summary":" Machine Learning (ML) systems are becoming instrumental in the public sector,\nwith applications spanning areas like criminal justice, social welfare,\nfinancial fraud detection, and public health. While these systems offer great\npotential benefits to institutional decision-making processes, such as improved\nefficiency and reliability, they still face the challenge of aligning intricate\nand nuanced policy objectives with the precise formalization requirements\nnecessitated by ML models. In this paper, we aim to bridge the gap between ML\nand public sector decision-making by presenting a comprehensive overview of key\ntechnical challenges where disjunctions between policy goals and ML models\ncommonly arise. We concentrate on pivotal points of the ML pipeline that\nconnect the model to its operational environment, delving into the significance\nof representative training data and highlighting the importance of a model\nsetup that facilitates effective decision-making. 
Additionally, we link these\nchallenges with emerging methodological advancements, encompassing causal ML,\ndomain adaptation, uncertainty quantification, and multi-objective\noptimization, illustrating the path forward for harmonizing ML and public\nsector objectives.\n","authors":["Unai Fischer Abaigar","Christoph Kern","Noam Barda","Frauke Kreuter"],"pdf_url":"https://arxiv.org/pdf/2310.19091v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.18399v2","updated":"2023-10-29T17:42:41Z","published":"2023-05-28T14:45:11Z","title":"On the impact of activation and normalization in obtaining isometric\n embeddings at initialization","summary":" In this paper, we explore the structure of the penultimate Gram matrix in\ndeep neural networks, which contains the pairwise inner products of outputs\ncorresponding to a batch of inputs. In several architectures it has been\nobserved that this Gram matrix becomes degenerate with depth at initialization,\nwhich dramatically slows training. Normalization layers, such as batch or layer\nnormalization, play a pivotal role in preventing the rank collapse issue.\nDespite promising advances, the existing theoretical results do not extend to\nlayer normalization, which is widely used in transformers, and can not\nquantitatively characterize the role of non-linear activations. To bridge this\ngap, we prove that layer normalization, in conjunction with activation layers,\nbiases the Gram matrix of a multilayer perceptron towards the identity matrix\nat an exponential rate with depth at initialization. 
We quantify this rate\nusing the Hermite expansion of the activation function.\n","authors":["Amir Joudaki","Hadi Daneshmand","Francis Bach"],"pdf_url":"https://arxiv.org/pdf/2305.18399v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.17570v2","updated":"2023-10-29T17:40:24Z","published":"2023-05-27T20:14:11Z","title":"Auditing Fairness by Betting","summary":" We provide practical, efficient, and nonparametric methods for auditing the\nfairness of deployed classification and regression models. Whereas previous\nwork relies on a fixed-sample size, our methods are sequential and allow for\nthe continuous monitoring of incoming data, making them highly amenable to\ntracking the fairness of real-world systems. We also allow the data to be\ncollected by a probabilistic policy as opposed to sampled uniformly from the\npopulation. This enables auditing to be conducted on data gathered for another\npurpose. Moreover, this policy may change over time and different policies may\nbe used on different subpopulations. Finally, our methods can handle\ndistribution shift resulting from either changes to the model or changes in the\nunderlying population. Our approach is based on recent progress in\nanytime-valid inference and game-theoretic statistics -- the \"testing by betting\"\nframework in particular. These connections ensure that our methods are\ninterpretable, fast, and easy to implement. We demonstrate the efficacy of our\napproach on three benchmark fairness datasets.\n","authors":["Ben Chugg","Santiago Cortes-Gomez","Bryan Wilder","Aaditya Ramdas"],"pdf_url":"https://arxiv.org/pdf/2305.17570v2.pdf","comment":"Accepted to NeurIPS 2023. 
29 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.16105v2","updated":"2023-10-29T17:34:15Z","published":"2023-10-24T18:15:25Z","title":"Locally Differentially Private Gradient Tracking for Distributed Online\n Learning over Directed Graphs","summary":" Distributed online learning has been proven extremely effective in solving\nlarge-scale machine learning problems over streaming data. However, information\nsharing between learners in distributed learning also raises concerns about the\npotential leakage of individual learners' sensitive data. To mitigate this\nrisk, differential privacy, which is widely regarded as the \"gold standard\" for\nprivacy protection, has been widely employed in many existing results on\ndistributed online learning. However, these results often face a fundamental\ntradeoff between learning accuracy and privacy. In this paper, we propose a\nlocally differentially private gradient tracking based distributed online\nlearning algorithm that successfully circumvents this tradeoff. We prove that\nthe proposed algorithm converges in mean square to the exact optimal solution\nwhile ensuring rigorous local differential privacy, with the cumulative privacy\nbudget guaranteed to be finite even when the number of iterations tends to\ninfinity. The algorithm is applicable even when the communication graph among\nlearners is directed. To the best of our knowledge, this is the first result\nthat simultaneously ensures learning accuracy and rigorous local differential\nprivacy in distributed online learning over directed graphs. We evaluate our\nalgorithm's performance by using multiple benchmark machine-learning\napplications, including logistic regression of the \"Mushrooms\" dataset and\nCNN-based image classification of the \"MNIST\" and \"CIFAR-10\" datasets,\nrespectively. 
The experimental results confirm that the proposed algorithm\noutperforms existing counterparts in both training and testing accuracies.\n","authors":["Ziqin Chen","Yongqiang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16105v2.pdf","comment":"21 pages, 4 figures"},{"id":"http://arxiv.org/abs/2304.03271v3","updated":"2023-10-29T17:30:08Z","published":"2023-04-06T17:55:27Z","title":"Making AI Less \"Thirsty\": Uncovering and Addressing the Secret Water\n Footprint of AI Models","summary":" The growing carbon footprint of artificial intelligence (AI) models,\nespecially large ones such as GPT-3, has been undergoing public scrutiny.\nUnfortunately, however, the equally important and enormous water (withdrawal\nand consumption) footprint of AI models has remained under the radar. For\nexample, training GPT-3 in Microsoft's state-of-the-art U.S. data centers can\ndirectly evaporate 700,000 liters of clean freshwater, but such information has\nbeen kept a secret. More critically, the global AI demand may be accountable\nfor 4.2 -- 6.6 billion cubic meters of water withdrawal in 2027, which is more\nthan the total annual water withdrawal of 4 -- 6 Denmark or half of the United\nKingdom. This is very concerning, as freshwater scarcity has become one of the\nmost pressing challenges shared by all of us in the wake of the rapidly growing\npopulation, depleting water resources, and aging water infrastructures. To\nrespond to the global water challenges, AI models can, and also must, take\nsocial responsibility and lead by example by addressing their own water\nfootprint. In this paper, we provide a principled methodology to estimate the\nwater footprint of AI models, and also discuss the unique spatial-temporal\ndiversities of AI models' runtime water efficiency. Finally, we highlight the\nnecessity of holistically addressing water footprint along with carbon\nfootprint to enable truly sustainable AI.\n","authors":["Pengfei Li","Jianyi Yang","Mohammad A. 
Islam","Shaolei Ren"],"pdf_url":"https://arxiv.org/pdf/2304.03271v3.pdf","comment":"New updates include discussion on water withdrawal and water\n consumption, scope definition for water, and new estimates of GPT-3's water\n footprint based on Microsoft's new WUE and PUE data. Source codes available\n at: https://github.com/Ren-Research/Making-AI-Less-Thirsty"},{"id":"http://arxiv.org/abs/2306.05304v2","updated":"2023-10-29T17:16:12Z","published":"2023-06-08T15:50:35Z","title":"Bayesian Optimisation of Functions on Graphs","summary":" The increasing availability of graph-structured data motivates the task of\noptimising over functions defined on the node set of graphs. Traditional graph\nsearch algorithms can be applied in this case, but they may be\nsample-inefficient and do not make use of information about the function\nvalues; on the other hand, Bayesian optimisation is a class of promising\nblack-box solvers with superior sample efficiency, but it has scarcely\nbeen applied to such novel setups. To fill this gap, we propose a novel\nBayesian optimisation framework that optimises over functions defined on\ngeneric, large-scale and potentially unknown graphs. Through the learning of\nsuitable kernels on graphs, our framework has the advantage of adapting to the\nbehaviour of the target function. The local modelling approach further\nguarantees the efficiency of our method. Extensive experiments on both\nsynthetic and real-world graphs demonstrate the effectiveness of the proposed\noptimisation framework.\n","authors":["Xingchen Wan","Pierre Osselin","Henry Kenlay","Binxin Ru","Michael A. Osborne","Xiaowen Dong"],"pdf_url":"https://arxiv.org/pdf/2306.05304v2.pdf","comment":"NeurIPS 2023. 
11 pages, 11 figures, 1 table (29 pages, 31 figures, 1\n table including references and appendices)"},{"id":"http://arxiv.org/abs/2309.00073v2","updated":"2023-10-29T17:03:31Z","published":"2023-08-18T16:21:15Z","title":"Diffusion Variational Autoencoder for Tackling Stochasticity in\n Multi-Step Regression Stock Price Prediction","summary":" Multi-step stock price prediction over a long-term horizon is crucial for\nforecasting its volatility, allowing financial institutions to price and hedge\nderivatives, and banks to quantify the risk in their trading books.\nAdditionally, most financial regulators also require a liquidity horizon of\nseveral days for institutional investors to exit their risky assets, in order\nto not materially affect market prices. However, the task of multi-step stock\nprice prediction is challenging, given the highly stochastic nature of stock\ndata. Current solutions to tackle this problem are mostly designed for\nsingle-step, classification-based predictions, and are limited to low\nrepresentation expressiveness. The problem also gets progressively harder with\nthe introduction of the target price sequence, which also contains stochastic\nnoise and reduces generalizability at test-time. To tackle these issues, we\ncombine a deep hierarchical variational-autoencoder (VAE) and diffusion\nprobabilistic techniques to do seq2seq stock prediction through a stochastic\ngenerative process. The hierarchical VAE allows us to learn the complex and\nlow-level latent variables for stock prediction, while the diffusion\nprobabilistic model trains the predictor to handle stock price stochasticity by\nprogressively adding random noise to the stock data. Our Diffusion-VAE (D-Va)\nmodel is shown to outperform state-of-the-art solutions in terms of its\nprediction accuracy and variance. More importantly, the multi-step outputs can\nalso allow us to form a stock portfolio over the prediction length. 
We\ndemonstrate the effectiveness of our model outputs in the portfolio investment\ntask through the Sharpe ratio metric and highlight the importance of dealing\nwith different types of prediction uncertainties.\n","authors":["Kelvin J. L. Koa","Yunshan Ma","Ritchie Ng","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2309.00073v2.pdf","comment":"CIKM 2023"},{"id":"http://arxiv.org/abs/2310.19075v1","updated":"2023-10-29T16:58:31Z","published":"2023-10-29T16:58:31Z","title":"Bespoke Solvers for Generative Flow Models","summary":" Diffusion or flow-based models are powerful generative paradigms that are\nnotoriously hard to sample as samples are defined as solutions to\nhigh-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs)\nwhich require a large Number of Function Evaluations (NFE) to approximate well.\nExisting methods to alleviate the costly sampling process include model\ndistillation and designing dedicated ODE solvers. However, distillation is\ncostly to train and sometimes can deteriorate quality, while dedicated solvers\nstill require relatively large NFE to produce high quality samples. In this\npaper we introduce \"Bespoke solvers\", a novel framework for constructing custom\nODE solvers tailored to the ODE of a given pre-trained flow model. Our approach\noptimizes an order consistent and parameter-efficient solver (e.g., with 80\nlearnable parameters), is trained for roughly 1% of the GPU time required for\ntraining the pre-trained model, and significantly improves approximation and\ngeneration quality compared to dedicated solvers. For example, a Bespoke solver\nfor a CIFAR10 model produces samples with Fr\\'echet Inception Distance (FID) of\n2.73 with 10 NFE, and gets to 1% of the Ground Truth (GT) FID (2.59) for this\nmodel with only 20 NFE. On the more challenging ImageNet-64$\\times$64, Bespoke\nsamples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20\nNFE.\n","authors":["Neta Shaul","Juan Perez","Ricky T. Q. 
Chen","Ali Thabet","Albert Pumarola","Yaron Lipman"],"pdf_url":"https://arxiv.org/pdf/2310.19075v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19069v1","updated":"2023-10-29T16:46:50Z","published":"2023-10-29T16:46:50Z","title":"Efficient Cluster Selection for Personalized Federated Learning: A\n Multi-Armed Bandit Approach","summary":" Federated learning (FL) offers a decentralized training approach for machine\nlearning models, prioritizing data privacy. However, the inherent heterogeneity\nin FL networks, arising from variations in data distribution, size, and device\ncapabilities, poses challenges in user federation. Recognizing this,\nPersonalized Federated Learning (PFL) emphasizes tailoring learning processes\nto individual data profiles. In this paper, we address the complexity of\nclustering users in PFL, especially in dynamic networks, by introducing a\ndynamic Upper Confidence Bound (dUCB) algorithm inspired by the multi-armed\nbandit (MAB) approach. The dUCB algorithm ensures that new users can\neffectively find the best cluster for their data distribution by balancing\nexploration and exploitation. The performance of our algorithm is evaluated in\nvarious cases, showing its effectiveness in handling dynamic federated learning\nscenarios.\n","authors":["Zhou Ni","Morteza Hashemi"],"pdf_url":"https://arxiv.org/pdf/2310.19069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19068v1","updated":"2023-10-29T16:46:26Z","published":"2023-10-29T16:46:26Z","title":"Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile\n Streaming","summary":" Sketching algorithms have recently proven to be a powerful approach both for\ndesigning low-space streaming algorithms as well as fast polynomial time\napproximation schemes (PTAS). In this work, we develop new techniques to extend\nthe applicability of sketching-based approaches to the sparse dictionary\nlearning and the Euclidean $k$-means clustering problems. 
In particular, we\ninitiate the study of the challenging setting where the dictionary/clustering\nassignment for each of the $n$ input points must be output, which has\nsurprisingly received little attention in prior work. On the fast algorithms\nfront, we obtain a new approach for designing PTAS's for the $k$-means\nclustering problem, which generalizes to the first PTAS for the sparse\ndictionary learning problem. On the streaming algorithms front, we obtain new\nupper bounds and lower bounds for dictionary learning and $k$-means clustering.\nIn particular, given a design matrix $\\mathbf A\\in\\mathbb R^{n\\times d}$ in a\nturnstile stream, we show an $\\tilde O(nr/\\epsilon^2 + dk/\\epsilon)$ space\nupper bound for $r$-sparse dictionary learning of size $k$, an $\\tilde\nO(n/\\epsilon^2 + dk/\\epsilon)$ space upper bound for $k$-means clustering, as\nwell as an $\\tilde O(n)$ space upper bound for $k$-means clustering on random\norder row insertion streams with a natural \"bounded sensitivity\" assumption. On\nthe lower bounds side, we obtain a general $\\tilde\\Omega(n/\\epsilon +\ndk/\\epsilon)$ lower bound for $k$-means clustering, as well as an\n$\\tilde\\Omega(n/\\epsilon^2)$ lower bound for algorithms which can estimate the\ncost of a single fixed set of candidate centers.\n","authors":["Gregory Dexter","Petros Drineas","David P. Woodruff","Taisuke Yasuda"],"pdf_url":"https://arxiv.org/pdf/2310.19068v1.pdf","comment":"To appear in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19066v1","updated":"2023-10-29T16:46:05Z","published":"2023-10-29T16:46:05Z","title":"Gauge-optimal approximate learning for small data classification\n problems","summary":" Small data learning problems are characterized by a significant discrepancy\nbetween the limited amount of response variable observations and the large\nfeature space dimension. 
In this setting, the common learning tools struggle to\ndistinguish the features important for the classification task from those that\nbear no relevant information, and cannot derive an appropriate learning rule\nthat discriminates between different classes. As a potential solution\nto this problem, here we exploit the idea of reducing and rotating the feature\nspace in a lower-dimensional gauge and propose the Gauge-Optimal Approximate\nLearning (GOAL) algorithm, which provides an analytically tractable joint\nsolution to the dimension reduction, feature segmentation and classification\nproblems for small data learning problems. We prove that the optimal solution\nof the GOAL algorithm consists in piecewise-linear functions in the Euclidean\nspace, and that it can be approximated through a monotonically convergent\nalgorithm which presents -- under the assumption of a discrete segmentation of\nthe feature space -- a closed-form solution for each optimization substep and\nan overall linear iteration cost scaling. 
The GOAL algorithm has been compared\nto other state-of-the-art machine learning (ML) tools on both synthetic data\nand challenging real-world applications from climate science and bioinformatics\n(i.e., prediction of the El Nino Southern Oscillation and inference of\nepigenetically-induced gene-activity networks from limited experimental data).\nThe experimental results show that the proposed algorithm outperforms the\nreported best competitors for these problems both in learning performance and\ncomputational cost.\n","authors":["Edoardo Vecchi","Davide Bassetti","Fabio Graziato","Lukas Pospisil","Illia Horenko"],"pdf_url":"https://arxiv.org/pdf/2310.19066v1.pdf","comment":"47 pages, 4 figures"},{"id":"http://arxiv.org/abs/2310.19065v1","updated":"2023-10-29T16:45:20Z","published":"2023-10-29T16:45:20Z","title":"Evaluating LLP Methods: Challenges and Approaches","summary":" Learning from Label Proportions (LLP) is an established machine learning\nproblem with numerous real-world applications. In this setting, data items are\ngrouped into bags, and the goal is to learn individual item labels, knowing\nonly the features of the data and the proportions of labels in each bag.\nAlthough LLP is a well-established problem, it has several unusual aspects that\ncreate challenges for benchmarking learning methods. Fundamental complications\narise because of the existence of different LLP variants, i.e., dependence\nstructures that can exist between items, labels, and bags. Accordingly, the\nfirst algorithmic challenge is the generation of variant-specific datasets\ncapturing the diversity of dependence structures and bag characteristics. The\nsecond methodological challenge is model selection, i.e., hyperparameter\ntuning; due to the nature of LLP, model selection cannot easily use the\nstandard machine learning paradigm. The final benchmarking challenge consists\nof properly evaluating LLP solution methods across various LLP variants. 
We\nnote that there is very little consideration of these issues in prior work, and\nthere are no general solutions for these challenges proposed to date. To\naddress these challenges, we develop methods capable of generating LLP datasets\nmeeting the requirements of different variants. We use these methods to\ngenerate a collection of datasets encompassing the spectrum of LLP problem\ncharacteristics, which can be used in future evaluation studies. Additionally,\nwe develop guidelines for benchmarking LLP algorithms, including the model\nselection and evaluation steps. Finally, we illustrate the new methods and\nguidelines by performing an extensive benchmark of a set of well-known LLP\nalgorithms. We show that choosing the best algorithm depends critically on the\nLLP variant and model selection method, demonstrating the need for our proposed\napproach.\n","authors":["Gabriel Franco","Giovanni Comarela","Mark Crovella"],"pdf_url":"https://arxiv.org/pdf/2310.19065v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19064v1","updated":"2023-10-29T16:37:51Z","published":"2023-10-29T16:37:51Z","title":"Revisiting the Learnability of Apple Tasting","summary":" In online binary classification under \\textit{apple tasting} feedback, the\nlearner only observes the true label if it predicts \"1\". First studied by\n\\cite{helmbold2000apple}, we revisit this classical partial-feedback setting\nand study online learnability from a combinatorial perspective. We show that\nthe Littlestone dimension continues to provide a tight quantitative\ncharacterization of apple tasting in the agnostic setting, closing an open\nquestion posed by \\cite{helmbold2000apple}. In addition, we give a new\ncombinatorial parameter, called the Effective width, that tightly quantifies\nthe minimax expected mistakes in the realizable setting. As a corollary, we use\nthe Effective width to establish a \\textit{trichotomy} of the minimax expected\nnumber of mistakes in the realizable setting. 
In particular, we show that in\nthe realizable setting, the expected number of mistakes for any learner under\napple tasting feedback can only be $\\Theta(1), \\Theta(\\sqrt{T})$, or\n$\\Theta(T)$.\n","authors":["Vinod Raman","Unique Subedi","Ananth Raman","Ambuj Tewari"],"pdf_url":"https://arxiv.org/pdf/2310.19064v1.pdf","comment":"18 pages"},{"id":"http://arxiv.org/abs/2310.19063v1","updated":"2023-10-29T16:37:14Z","published":"2023-10-29T16:37:14Z","title":"Feature Aggregation in Joint Sound Classification and Localization\n Neural Networks","summary":" This study addresses the application of deep learning techniques in joint\nsound signal classification and localization networks. Current state-of-the-art\nsound source localization deep learning networks lack feature aggregation\nwithin their architecture. Feature aggregation enhances model performance by\nenabling the consolidation of information from different feature scales,\nthereby improving feature robustness and invariance. This is particularly\nimportant in SSL networks, which must differentiate direct and indirect\nacoustic signals. To address this gap, we adapt feature aggregation techniques\nfrom computer vision neural networks to signal detection neural networks.\nAdditionally, we propose the Scale Encoding Network (SEN) for feature\naggregation to encode features from various scales, compressing the network for\nmore computationally efficient aggregation. To evaluate the efficacy of feature\naggregation in SSL networks, we integrated the following computer vision\nfeature aggregation sub-architectures into a SSL control architecture: Path\nAggregation Network (PANet), Weighted Bi-directional Feature Pyramid Network\n(BiFPN), and SEN. These sub-architectures were evaluated using two metrics for\nsignal classification and two metrics for direction-of-arrival regression.\nPANet and BiFPN are established aggregators in computer vision models, while\nthe proposed SEN is a more compact aggregator. 
The results suggest that models\nincorporating feature aggregations outperformed the control model, the Sound\nEvent Localization and Detection network (SELDnet), in both sound signal\nclassification and localization. The feature aggregation techniques enhance the\nperformance of sound detection neural networks, particularly in\ndirection-of-arrival regression.\n","authors":["Brendan Healy","Patrick McNamee","Zahra Nili Ahmadabadi"],"pdf_url":"https://arxiv.org/pdf/2310.19063v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.20065v2","updated":"2023-10-29T16:30:51Z","published":"2023-05-31T17:40:43Z","title":"Latent Exploration for Reinforcement Learning","summary":" In Reinforcement Learning, agents learn policies by exploring and interacting\nwith the environment. Due to the curse of dimensionality, learning policies\nthat map high-dimensional sensory input to motor output is particularly\nchallenging. During training, state of the art methods (SAC, PPO, etc.) explore\nthe environment by perturbing the actuation with independent Gaussian noise.\nWhile this unstructured exploration has proven successful in numerous tasks, it\ncan be suboptimal for overactuated systems. When multiple actuators, such as\nmotors or muscles, drive behavior, uncorrelated perturbations risk diminishing\neach other's effect, or modifying the behavior in a task-irrelevant way. While\nsolutions to introduce time correlation across action perturbations exist,\nintroducing correlation across actuators has been largely ignored. Here, we\npropose LATent TIme-Correlated Exploration (Lattice), a method to inject\ntemporally-correlated noise into the latent state of the policy network, which\ncan be seamlessly integrated with on- and off-policy algorithms. 
We demonstrate\nthat the noisy actions generated by perturbing the network's activations can be\nmodeled as a multivariate Gaussian distribution with a full covariance matrix.\nIn the PyBullet locomotion tasks, Lattice-SAC achieves state of the art\nresults, and reaches 18% higher reward than unstructured exploration in the\nHumanoid environment. In the musculoskeletal control environments of MyoSuite,\nLattice-PPO achieves higher reward in most reaching and object manipulation\ntasks, while also finding more energy-efficient policies with reductions of\n20-60%. Overall, we demonstrate the effectiveness of structured action noise in\ntime and actuator space for complex motor control tasks. The code is available\nat: https://github.com/amathislab/lattice.\n","authors":["Alberto Silvio Chiappa","Alessandro Marin Vargas","Ann Zixiang Huang","Alexander Mathis"],"pdf_url":"https://arxiv.org/pdf/2305.20065v2.pdf","comment":"Code available at https://github.com/amathislab/lattice"},{"id":"http://arxiv.org/abs/2310.19059v1","updated":"2023-10-29T16:24:53Z","published":"2023-10-29T16:24:53Z","title":"Escaping Saddle Points in Heterogeneous Federated Learning via\n Distributed SGD with Communication Compression","summary":" We consider the problem of finding second-order stationary points of\nheterogeneous federated learning (FL). Previous works in FL mostly focus on\nfirst-order convergence guarantees, which do not rule out the scenario of\nunstable saddle points. Meanwhile, it is a key bottleneck of FL to achieve\ncommunication efficiency without compromising the learning accuracy, especially\nwhen local data are highly heterogeneous across different clients. Given this,\nwe propose a novel algorithm Power-EF that only communicates compressed\ninformation via a novel error-feedback scheme. To our knowledge, Power-EF is\nthe first distributed and compressed SGD algorithm that provably escapes saddle\npoints in heterogeneous FL without any data homogeneity assumptions. 
In\nparticular, Power-EF improves to second-order stationary points after visiting\nfirst-order (possibly saddle) points, using additional gradient queries and\ncommunication rounds only of almost the same order required by first-order\nconvergence, and the convergence rate exhibits a linear speedup in terms of the\nnumber of workers. Our theory improves/recovers previous results, while\nextending to much more tolerant settings on the local data. Numerical\nexperiments are provided to complement the theory.\n","authors":["Sijin Chen","Zhize Li","Yuejie Chi"],"pdf_url":"https://arxiv.org/pdf/2310.19059v1.pdf","comment":"27 pages"},{"id":"http://arxiv.org/abs/2308.06672v2","updated":"2023-10-29T16:22:20Z","published":"2023-08-13T03:26:01Z","title":"A practical PINN framework for multi-scale problems with multi-magnitude\n loss terms","summary":" For multi-scale problems, the conventional physics-informed neural networks\n(PINNs) face some challenges in obtaining available predictions. In this paper,\nbased on PINNs, we propose a practical deep learning framework for multi-scale\nproblems by reconstructing the loss function and associating it with special\nneural network architectures. New PINN methods derived from the improved PINN\nframework differ from the conventional PINN method mainly in two aspects.\nFirst, the new methods use a novel loss function by modifying the standard loss\nfunction through a (grouping) regularization strategy. The regularization\nstrategy implements a different power operation on each loss term so that all\nloss terms composing the loss function are of approximately the same order of\nmagnitude, which makes all loss terms be optimized synchronously during the\noptimization process. 
Second, for the multi-frequency or high-frequency\nproblems, in addition to using the modified loss function, new methods upgrade\nthe neural network architecture from the common fully-connected neural network\nto special network architectures such as the Fourier feature architecture, and\nthe integrated architecture developed by us. The combination of the above two\ntechniques leads to a significant improvement in the computational accuracy of\nmulti-scale problems. Several challenging numerical examples demonstrate the\neffectiveness of the proposed methods. The proposed methods not only\nsignificantly outperform the conventional PINN method in terms of computational\nefficiency and computational accuracy, but also compare favorably with the\nstate-of-the-art methods in the recent literature. The improved PINN framework\nfacilitates better application of PINNs to multi-scale problems.\n","authors":["Yong Wang","Yanzhong Yao","Jiawei Guo","Zhiming Gao"],"pdf_url":"https://arxiv.org/pdf/2308.06672v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03848v2","updated":"2023-10-29T16:21:05Z","published":"2023-07-07T21:39:25Z","title":"Optimal Learners for Realizable Regression: PAC Learning and Online\n Learning","summary":" In this work, we aim to characterize the statistical complexity of realizable\nregression both in the PAC learning setting and the online learning setting.\nPrevious work had established the sufficiency of finiteness of the fat\nshattering dimension for PAC learnability and the necessity of finiteness of\nthe scaled Natarajan dimension, but little progress had been made towards a\nmore complete characterization since the work of Simon (SICOMP '97). To this\nend, we first introduce a minimax instance optimal learner for realizable\nregression and propose a novel dimension that both qualitatively and\nquantitatively characterizes which classes of real-valued predictors are\nlearnable. 
We then identify a combinatorial dimension related to the Graph\ndimension that characterizes ERM learnability in the realizable setting.\nFinally, we establish a necessary condition for learnability based on a\ncombinatorial dimension related to the DS dimension, and conjecture that it may\nalso be sufficient in this context. Additionally, in the context of online\nlearning we provide a dimension that characterizes the minimax instance optimal\ncumulative loss up to a constant factor and design an optimal online learner\nfor realizable regression, thus resolving an open question raised by Daskalakis\nand Golowich in STOC '22.\n","authors":["Idan Attias","Steve Hanneke","Alkis Kalavasis","Amin Karbasi","Grigoris Velegkas"],"pdf_url":"https://arxiv.org/pdf/2307.03848v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19054v1","updated":"2023-10-29T16:01:03Z","published":"2023-10-29T16:01:03Z","title":"Object-centric architectures enable efficient causal representation\n learning","summary":" Causal representation learning has shown a variety of settings in which we\ncan disentangle latent variables with identifiability guarantees (up to some\nreasonable equivalence class). Common to all of these approaches is the\nassumption that (1) the latent variables are represented as $d$-dimensional\nvectors, and (2) that the observations are the output of some injective\ngenerative function of these latent variables. While these assumptions appear\nbenign, we show that when the observations are of multiple objects, the\ngenerative function is no longer injective and disentanglement fails in\npractice. We can address this failure by combining recent developments in\nobject-centric learning and causal representation learning. By modifying the\nSlot Attention architecture arXiv:2006.15055, we develop an object-centric\narchitecture that leverages weak supervision from sparse perturbations to\ndisentangle each object's properties. 
This approach is more data-efficient in\nthe sense that it requires significantly fewer perturbations than a comparable\napproach that encodes to a Euclidean space and we show that this approach\nsuccessfully disentangles the properties of a set of objects in a series of\nsimple image-based disentanglement experiments.\n","authors":["Amin Mansouri","Jason Hartford","Yan Zhang","Yoshua Bengio"],"pdf_url":"https://arxiv.org/pdf/2310.19054v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19053v1","updated":"2023-10-29T15:57:42Z","published":"2023-10-29T15:57:42Z","title":"Datasets and Benchmarks for Nanophotonic Structure and Parametric Design\n Simulations","summary":" Nanophotonic structures have versatile applications including solar cells,\nanti-reflective coatings, electromagnetic interference shielding, optical\nfilters, and light emitting diodes. To design and understand these nanophotonic\nstructures, electrodynamic simulations are essential. These simulations enable\nus to model electromagnetic fields over time and calculate optical properties.\nIn this work, we introduce frameworks and benchmarks to evaluate nanophotonic\nstructures in the context of parametric structure design problems. The\nbenchmarks are instrumental in assessing the performance of optimization\nalgorithms and identifying an optimal structure based on target optical\nproperties. Moreover, we explore the impact of varying grid sizes in\nelectrodynamic simulations, shedding light on how evaluation fidelity can be\nstrategically leveraged in enhancing structure designs.\n","authors":["Jungtaek Kim","Mingxuan Li","Oliver Hinder","Paul W. Leu"],"pdf_url":"https://arxiv.org/pdf/2310.19053v1.pdf","comment":"31 pages, 31 figures, 4 tables. 
Accepted at the 37th Conference on\n Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks\n Track"},{"id":"http://arxiv.org/abs/2310.19043v1","updated":"2023-10-29T15:13:36Z","published":"2023-10-29T15:13:36Z","title":"Differentially Private Permutation Tests: Applications to Kernel Methods","summary":" Recent years have witnessed growing concerns about the privacy of sensitive\ndata. In response to these concerns, differential privacy has emerged as a\nrigorous framework for privacy protection, gaining widespread recognition in\nboth academic and industrial circles. While substantial progress has been made\nin private data analysis, existing methods often suffer from impracticality or\na significant loss of statistical efficiency. This paper aims to alleviate\nthese concerns in the context of hypothesis testing by introducing\ndifferentially private permutation tests. The proposed framework extends\nclassical non-private permutation tests to private settings, maintaining both\nfinite-sample validity and differential privacy in a rigorous manner. The power\nof the proposed test depends on the choice of a test statistic, and we\nestablish general conditions for consistency and non-asymptotic uniform power.\nTo demonstrate the utility and practicality of our framework, we focus on\nreproducing kernel-based test statistics and introduce differentially private\nkernel tests for two-sample and independence testing: dpMMD and dpHSIC. The\nproposed kernel tests are straightforward to implement, applicable to various\ntypes of data, and attain minimax optimal power across different privacy\nregimes. Our empirical evaluations further highlight their competitive power\nunder various synthetic and real-world scenarios, emphasizing their practical\nvalue. 
The code is publicly available to facilitate the implementation of our\nframework.\n","authors":["Ilmun Kim","Antonin Schrab"],"pdf_url":"https://arxiv.org/pdf/2310.19043v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19041v1","updated":"2023-10-29T15:08:35Z","published":"2023-10-29T15:08:35Z","title":"On Linear Separation Capacity of Self-Supervised Representation Learning","summary":" Recent advances in self-supervised learning have highlighted the efficacy of\ndata augmentation in learning data representation from unlabeled data. Training\na linear model atop these enhanced representations can yield an adept\nclassifier. Despite the remarkable empirical performance, the underlying\nmechanisms that enable data augmentation to unravel nonlinear data structures\ninto linearly separable representations remain elusive. This paper seeks to\nbridge this gap by investigating under what conditions learned representations\ncan linearly separate manifolds when data is drawn from a multi-manifold model.\nOur investigation reveals that data augmentation offers additional information\nbeyond observed data and can thus improve the information-theoretic optimal\nrate of linear separation capacity. In particular, we show that self-supervised\nlearning can linearly separate manifolds with a smaller distance than\nunsupervised learning, underscoring the additional benefits of data\naugmentation. 
Our theoretical analysis further underscores that the performance\nof downstream linear classifiers primarily hinges on the linear separability of\ndata representations rather than the size of the labeled data set, reaffirming\nthe viability of constructing efficient classifiers with limited labeled data\namid an expansive unlabeled data set.\n","authors":["Shulei Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19041v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.06101v3","updated":"2023-10-29T15:08:03Z","published":"2023-06-09T17:59:35Z","title":"Prodigy: An Expeditiously Adaptive Parameter-Free Learner","summary":" We consider the problem of estimating the learning rate in adaptive methods,\nsuch as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to\nprovably estimate the distance to the solution $D$, which is needed to set the\nlearning rate optimally. Our techniques are modifications of the D-Adaptation\nmethod for learning-rate-free learning. Our methods improve upon the\nconvergence rate of D-Adaptation by a factor of $O(\\sqrt{\\log(D/d_0)})$, where\n$d_0$ is the initial estimate of $D$. We test our methods on 12 common\nlogistic-regression benchmark datasets, VGG11 and ResNet-50 training on\nCIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on\nCriteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT\ntransformer training on BookWiki. 
Our experimental results show that our\napproaches consistently outperform D-Adaptation and reach test accuracy values\nclose to that of hand-tuned Adam.\n","authors":["Konstantin Mishchenko","Aaron Defazio"],"pdf_url":"https://arxiv.org/pdf/2306.06101v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.19039v1","updated":"2023-10-29T15:07:08Z","published":"2023-10-29T15:07:08Z","title":"Machine Learning for the identification of phase-transitions in\n interacting agent-based systems","summary":" Deriving closed-form, analytical expressions for reduced-order models, and\njudiciously choosing the closures leading to them, has long been the strategy\nof choice for studying phase- and noise-induced transitions for agent-based\nmodels (ABMs). In this paper, we propose a data-driven framework that pinpoints\nphase transitions for an ABM in its mean-field limit, using a smaller number of\nvariables than traditional closed-form models. To this end, we use the manifold\nlearning algorithm Diffusion Maps to identify a parsimonious set of data-driven\nlatent variables, and show that they are in one-to-one correspondence with the\nexpected theoretical order parameter of the ABM. We then utilize a deep\nlearning framework to obtain a conformal reparametrization of the data-driven\ncoordinates that facilitates, in our example, the identification of a single\nparameter-dependent ODE in these coordinates. We identify this ODE through a\nresidual neural network inspired by a numerical integration scheme (forward\nEuler). We then use the identified ODE -- enabled through an odd symmetry\ntransformation -- to construct the bifurcation diagram exhibiting the phase\ntransition.\n","authors":["Nikolaos Evangelou","Dimitrios G. Giovanis","George A. Kevrekidis","Grigorios A. Pavliotis","Ioannis G. 
Kevrekidis"],"pdf_url":"https://arxiv.org/pdf/2310.19039v1.pdf","comment":"14 pages, 9 Figures"},{"id":"http://arxiv.org/abs/2310.19038v1","updated":"2023-10-29T15:05:39Z","published":"2023-10-29T15:05:39Z","title":"Boosting Decision-Based Black-Box Adversarial Attack with Gradient\n Priors","summary":" Decision-based methods have shown to be effective in black-box adversarial\nattacks, as they can obtain satisfactory performance and only require to access\nthe final model prediction. Gradient estimation is a critical step in black-box\nadversarial attacks, as it will directly affect the query efficiency. Recent\nworks have attempted to utilize gradient priors to facilitate score-based\nmethods to obtain better results. However, these gradient priors still suffer\nfrom the edge gradient discrepancy issue and the successive iteration gradient\ndirection issue, thus are difficult to simply extend to decision-based methods.\nIn this paper, we propose a novel Decision-based Black-box Attack framework\nwith Gradient Priors (DBA-GP), which seamlessly integrates the data-dependent\ngradient prior and time-dependent prior into the gradient estimation procedure.\nFirst, by leveraging the joint bilateral filter to deal with each random\nperturbation, DBA-GP can guarantee that the generated perturbations in edge\nlocations are hardly smoothed, i.e., alleviating the edge gradient discrepancy,\nthus remaining the characteristics of the original image as much as possible.\nSecond, by utilizing a new gradient updating strategy to automatically adjust\nthe successive iteration gradient direction, DBA-GP can accelerate the\nconvergence speed, thus improving the query efficiency. 
Extensive experiments\nhave demonstrated that the proposed method outperforms other strong baselines\nsignificantly.\n","authors":["Han Liu","Xingshuo Huang","Xiaotong Zhang","Qimai Li","Fenglong Ma","Wei Wang","Hongyang Chen","Hong Yu","Xianchao Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.19038v1.pdf","comment":"Accepted by IJCAI 2023"},{"id":"http://arxiv.org/abs/2310.19035v1","updated":"2023-10-29T14:57:37Z","published":"2023-10-29T14:57:37Z","title":"Does Invariant Graph Learning via Environment Augmentation Learn\n Invariance?","summary":" Invariant graph representation learning aims to learn the invariance among\ndata from different environments for out-of-distribution generalization on\ngraphs. As the graph environment partitions are usually expensive to obtain,\naugmenting the environment information has become the de facto approach.\nHowever, the usefulness of the augmented environment information has never been\nverified. In this work, we find that it is fundamentally impossible to learn\ninvariant graph representations via environment augmentation without additional\nassumptions. Therefore, we develop a set of minimal assumptions, including\nvariation sufficiency and variation consistency, for feasible invariant graph\nlearning. We then propose a new framework Graph invAriant Learning Assistant\n(GALA). GALA incorporates an assistant model that needs to be sensitive to\ngraph environment changes or distribution shifts. The correctness of the proxy\npredictions by the assistant model hence can differentiate the variations in\nspurious subgraphs. 
We show that extracting the maximally invariant subgraph to\nthe proxy predictions provably identifies the underlying invariant subgraph for\nsuccessful OOD generalization under the established minimal assumptions.\nExtensive experiments on datasets including DrugOOD with various graph\ndistribution shifts confirm the effectiveness of GALA.\n","authors":["Yongqiang Chen","Yatao Bian","Kaiwen Zhou","Binghui Xie","Bo Han","James Cheng"],"pdf_url":"https://arxiv.org/pdf/2310.19035v1.pdf","comment":"NeurIPS 2023, 34 pages, 35 figures"},{"id":"http://arxiv.org/abs/2305.20054v2","updated":"2023-10-29T14:55:12Z","published":"2023-05-31T17:28:02Z","title":"UNSSOR: Unsupervised Neural Speech Separation by Leveraging\n Over-determined Training Mixtures","summary":" In reverberant conditions with multiple concurrent speakers, each microphone\nacquires a mixture signal of multiple speakers at a different location. In\nover-determined conditions where the microphones out-number speakers, we can\nnarrow down the solutions to speaker images and realize unsupervised speech\nseparation by leveraging each mixture signal as a constraint (i.e., the\nestimated speaker images at a microphone should add up to the mixture).\nEquipped with this insight, we propose UNSSOR, an algorithm for\n$\\textbf{u}$nsupervised $\\textbf{n}$eural $\\textbf{s}$peech\n$\\textbf{s}$eparation by leveraging $\\textbf{o}$ver-determined training\nmixtu$\\textbf{r}$es. At each training step, we feed an input mixture to a deep\nneural network (DNN) to produce an intermediate estimate for each speaker,\nlinearly filter the estimates, and optimize a loss so that, at each microphone,\nthe filtered estimates of all the speakers can add up to the mixture to satisfy\nthe above constraint. We show that this loss can promote unsupervised\nseparation of speakers. The linear filters are computed in each sub-band based\non the mixture and DNN estimates through the forward convolutive prediction\n(FCP) algorithm. 
To address the frequency permutation problem incurred by using\nsub-band FCP, a loss term based on minimizing intra-source magnitude scattering\nis proposed. Although UNSSOR requires over-determined training mixtures, we can\ntrain DNNs to achieve under-determined separation (e.g., unsupervised monaural\nspeech separation). Evaluation results on two-speaker separation in reverberant\nconditions show the effectiveness and potential of UNSSOR.\n","authors":["Zhong-Qiu Wang","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2305.20054v2.pdf","comment":"in Conference on Neural Information Processing Systems (NeurIPS),\n 2023"},{"id":"http://arxiv.org/abs/2308.04603v3","updated":"2023-10-29T14:52:32Z","published":"2023-08-08T22:06:14Z","title":"A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking","summary":" This paper presents a comprehensive survey on deep learning-based image\nwatermarking, a technique that entails the invisible embedding and extraction\nof watermarks within a cover image, aiming to offer a seamless blend of\nrobustness and adaptability. We navigate the complex landscape of this\ninterdisciplinary domain, linking historical foundations, current innovations,\nand prospective developments. Unlike existing literature, our study\nconcentrates exclusively on image watermarking with deep learning, delivering\nan in-depth, yet brief analysis enriched by three fundamental contributions.\nFirst, we introduce a refined categorization, segmenting the field into\nEmbedder-Extractor, Deep Networks as a Feature Transformation, and Hybrid\nMethods. This taxonomy, inspired by the varied roles of deep learning across\nstudies, is designed to infuse clarity, offering readers technical insights and\ndirectional guidance. Second, our exploration dives into representative\nmethodologies, encapsulating the diverse research directions and inherent\nchallenges within each category to provide a consolidated perspective. 
Lastly,\nwe venture beyond established boundaries to outline emerging frontiers,\noffering a detailed insight into prospective research avenues.\n","authors":["Xin Zhong","Arjon Das","Fahad Alrasheedi","Abdullah Tanvir"],"pdf_url":"https://arxiv.org/pdf/2308.04603v3.pdf","comment":"This paper was accepted for publication by the MDPI Applied Sciences\n journal"},{"id":"http://arxiv.org/abs/2310.19025v1","updated":"2023-10-29T14:31:34Z","published":"2023-10-29T14:31:34Z","title":"An Improved Relaxation for Oracle-Efficient Adversarial Contextual\n Bandits","summary":" We present an oracle-efficient relaxation for the adversarial contextual\nbandits problem, where the contexts are sequentially drawn i.i.d from a known\ndistribution and the cost sequence is chosen by an online adversary. Our\nalgorithm has a regret bound of\n$O(T^{\\frac{2}{3}}(K\\log(|\\Pi|))^{\\frac{1}{3}})$ and makes at most $O(K)$ calls\nper round to an offline optimization oracle, where $K$ denotes the number of\nactions, $T$ denotes the number of rounds and $\\Pi$ denotes the set of\npolicies. This is the first result to improve the prior best bound of\n$O((TK)^{\\frac{2}{3}}(\\log(|\\Pi|))^{\\frac{1}{3}})$ as obtained by Syrgkanis et\nal. at NeurIPS 2016, and the first to match the original bound of Langford and\nZhang at NeurIPS 2007 which was obtained for the stochastic case.\n","authors":["Kiarash Banihashem","MohammadTaghi Hajiaghayi","Suho Shin","Max Springer"],"pdf_url":"https://arxiv.org/pdf/2310.19025v1.pdf","comment":"Appears in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19022v1","updated":"2023-10-29T14:25:57Z","published":"2023-10-29T14:25:57Z","title":"Optimization Landscape of Policy Gradient Methods for Discrete-time\n Static Output Feedback","summary":" In recent times, significant advancements have been made in delving into the\noptimization landscape of policy gradient methods for achieving optimal control\nin linear time-invariant (LTI) systems. 
Compared with state-feedback control,\noutput-feedback control is more prevalent since the underlying state of the\nsystem may not be fully observed in many practical settings. This paper\nanalyzes the optimization landscape inherent to policy gradient methods when\napplied to static output feedback (SOF) control in discrete-time LTI systems\nsubject to quadratic cost. We begin by establishing crucial properties of the\nSOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous\nHessian. Despite the absence of convexity, we leverage these properties to\nderive novel findings regarding convergence (and nearly dimension-free rate) to\nstationary points for three policy gradient methods, including the vanilla\npolicy gradient method, the natural policy gradient method, and the\nGauss-Newton method. Moreover, we provide proof that the vanilla policy\ngradient method exhibits linear convergence towards local minima when\ninitialized near such minima. The paper concludes by presenting numerical\nexamples that validate our theoretical findings. These results not only\ncharacterize the performance of gradient descent for optimizing the SOF problem\nbut also provide insights into the effectiveness of general policy gradient\nmethods within the realm of reinforcement learning.\n","authors":["Jingliang Duan","Jie Li","Xuyang Chen","Kai Zhao","Shengbo Eben Li","Lin Zhao"],"pdf_url":"https://arxiv.org/pdf/2310.19022v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.04132v2","updated":"2023-10-29T14:24:46Z","published":"2023-03-07T18:48:55Z","title":"Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and\n the Case of Information Extraction","summary":" Large language models (LLMs) have great potential for synthetic data\ngeneration. 
This work shows that useful data can be synthetically generated\neven for tasks that cannot be solved directly by LLMs: for problems with\nstructured outputs, it is possible to prompt an LLM to perform the task in the\nreverse direction, by generating plausible input text for a target output\nstructure. Leveraging this asymmetry in task difficulty makes it possible to\nproduce large-scale, high-quality data for complex tasks. We demonstrate the\neffectiveness of this approach on closed information extraction, where\ncollecting ground-truth data is challenging, and no satisfactory dataset exists\nto date. We synthetically generate a dataset of 1.8M data points, establish its\nsuperior quality compared to existing datasets in a human evaluation, and use\nit to finetune small models (220M and 770M parameters), termed SynthIE, that\noutperform the prior state of the art (with equal model size) by a substantial\nmargin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data,\nand models are available at https://github.com/epfl-dlab/SynthIE.\n","authors":["Martin Josifoski","Marija Sakota","Maxime Peyrard","Robert West"],"pdf_url":"https://arxiv.org/pdf/2303.04132v2.pdf","comment":"Accepted at EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.06908v4","updated":"2023-10-29T14:12:08Z","published":"2023-05-11T15:51:46Z","title":"CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency\n Model","summary":" Denoising diffusion probabilistic models (DDPMs) have shown promising\nperformance for speech synthesis. However, a large number of iterative steps\nare required to achieve high sample quality, which restricts the inference\nspeed. Maintaining sample quality while increasing sampling speed has become a\nchallenging task. In this paper, we propose a \"Co\"nsistency \"Mo\"del-based\n\"Speech\" synthesis method, CoMoSpeech, which achieve speech synthesis through a\nsingle diffusion sampling step while achieving high audio quality. 
The\nconsistency constraint is applied to distill a consistency model from a\nwell-designed diffusion-based teacher model, which ultimately yields superior\nperformances in the distilled CoMoSpeech. Our experiments show that by\ngenerating audio recordings by a single sampling step, the CoMoSpeech achieves\nan inference speed more than 150 times faster than real-time on a single NVIDIA\nA100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based\nspeech synthesis truly practical. Meanwhile, objective and subjective\nevaluations on text-to-speech and singing voice synthesis show that the\nproposed teacher models yield the best audio quality, and the one-step sampling\nbased CoMoSpeech achieves the best inference speed with better or comparable\naudio quality to other conventional multi-step diffusion model baselines. Audio\nsamples are available at https://comospeech.github.io/.\n","authors":["Zhen Ye","Wei Xue","Xu Tan","Jie Chen","Qifeng Liu","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2305.06908v4.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2304.07056v3","updated":"2023-10-29T14:06:56Z","published":"2023-04-14T11:26:09Z","title":"Perceptual Quality Assessment of Face Video Compression: A Benchmark and\n An Effective Method","summary":" Recent years have witnessed an exponential increase in the demand for face\nvideo compression, and the success of artificial intelligence has expanded the\nboundaries beyond traditional hybrid video coding. Generative coding approaches\nhave been identified as promising alternatives with reasonable perceptual\nrate-distortion trade-offs, leveraging the statistical priors of face videos.\nHowever, the great diversity of distortion types in spatial and temporal\ndomains, ranging from the traditional hybrid coding frameworks to generative\nmodels, present grand challenges in compressed face video quality assessment\n(VQA). 
In this paper, we introduce the large-scale Compressed Face Video\nQuality Assessment (CFVQA) database, which is the first attempt to\nsystematically understand the perceptual quality and diversified compression\ndistortions in face videos. The database contains 3,240 compressed face video\nclips in multiple compression levels, which are derived from 135 source videos\nwith diversified content using six representative video codecs, including two\ntraditional methods based on hybrid coding frameworks, two end-to-end methods,\nand two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index\nfor face video compression was developed to measure the perceptual quality,\nconsidering the distinct content characteristics and temporal priors of the\nface videos. Experimental results exhibit its superior performance on the\nproposed CFVQA dataset. The benchmark is now made publicly available at:\nhttps://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment.\n","authors":["Yixuan Li","Bolin Chen","Baoliang Chen","Meng Wang","Shiqi Wang","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2304.07056v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.01647v2","updated":"2023-10-29T13:46:45Z","published":"2023-10-02T21:21:28Z","title":"Equivariant Adaptation of Large Pretrained Models","summary":" Equivariant networks are specifically designed to ensure consistent behavior\nwith respect to a set of input transformations, leading to higher sample\nefficiency and more accurate and robust predictions. However, redesigning each\ncomponent of prevalent deep neural network architectures to achieve chosen\nequivariance is a difficult problem and can result in a computationally\nexpensive network during both training and inference. 
A recently proposed\nalternative towards equivariance that removes the architectural constraints is\nto use a simple canonicalization network that transforms the input to a\ncanonical form before feeding it to an unconstrained prediction network. We\nshow here that this approach can effectively be used to make a large pretrained\nnetwork equivariant. However, we observe that the produced canonical\norientations can be misaligned with those of the training distribution,\nhindering performance. Using dataset-dependent priors to inform the\ncanonicalization function, we are able to make large pretrained models\nequivariant while maintaining their performance. This significantly improves\nthe robustness of these models to deterministic transformations of the data,\nsuch as rotations. We believe this equivariant adaptation of large pretrained\nmodels can help their domain-specific applications with known symmetry priors.\n","authors":["Arnab Kumar Mondal","Siba Smarak Panigrahi","Sékou-Oumar Kaba","Sai Rajeswar","Siamak Ravanbakhsh"],"pdf_url":"https://arxiv.org/pdf/2310.01647v2.pdf","comment":"17 pages, 6 figures. Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.19007v1","updated":"2023-10-29T13:45:07Z","published":"2023-10-29T13:45:07Z","title":"Behavior Alignment via Reward Function Optimization","summary":" Designing reward functions for efficiently guiding reinforcement learning\n(RL) agents toward specific behaviors is a complex task. This is challenging\nsince it requires the identification of reward structures that are not sparse\nand that avoid inadvertently inducing undesirable behaviors. Naively modifying\nthe reward structure to offer denser and more frequent feedback can lead to\nunintended outcomes and promote behaviors that are not aligned with the\ndesigner's intended goal. Although potential-based reward shaping is often\nsuggested as a remedy, we systematically investigate settings where deploying\nit often significantly impairs performance. 
To address these issues, we\nintroduce a new framework that uses a bi-level objective to learn\n\\emph{behavior alignment reward functions}. These functions integrate auxiliary\nrewards reflecting a designer's heuristics and domain knowledge with the\nenvironment's primary rewards. Our approach automatically determines the most\neffective way to blend these types of feedback, thereby enhancing robustness\nagainst heuristic reward misspecification. Remarkably, it can also adapt an\nagent's policy optimization process to mitigate suboptimalities resulting from\nlimitations and biases inherent in the underlying RL algorithms. We evaluate\nour method's efficacy on a diverse set of tasks, from small-scale experiments\nto high-dimensional control challenges. We investigate heuristic auxiliary\nrewards of varying quality -- some of which are beneficial and others\ndetrimental to the learning process. Our results show that our framework offers\na robust and principled way to integrate designer-specified heuristics. It not\nonly addresses key shortcomings of existing approaches but also consistently\nleads to high-performing solutions, even when given misaligned or\npoorly-specified auxiliary reward functions.\n","authors":["Dhawal Gupta","Yash Chandak","Scott M. Jordan","Philip S. Thomas","Bruno Castro da Silva"],"pdf_url":"https://arxiv.org/pdf/2310.19007v1.pdf","comment":"(Spotlight) Thirty-seventh Conference on Neural Information\n Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2310.19001v1","updated":"2023-10-29T13:18:00Z","published":"2023-10-29T13:18:00Z","title":"Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic\n Segmentation","summary":" This paper studies the problem of weakly open-vocabulary semantic\nsegmentation (WOVSS), which learns to segment objects of arbitrary classes\nusing mere image-text pairs. 
Existing works turn to enhance the vanilla vision\ntransformer by introducing explicit grouping recognition, i.e., employing\nseveral group tokens/centroids to cluster the image tokens and perform the\ngroup-text alignment. Nevertheless, these methods suffer from a granularity\ninconsistency regarding the usage of group tokens, which are aligned in the\nall-to-one v.s. one-to-one manners during the training and inference phases,\nrespectively. We argue that this discrepancy arises from the lack of elaborate\nsupervision for each group token. To bridge this granularity gap, this paper\nexplores explicit supervision for the group tokens from the prototypical\nknowledge. To this end, this paper proposes the non-learnable prototypical\nregularization (NPR) where non-learnable prototypes are estimated from source\nfeatures to serve as supervision and enable contrastive matching of the group\ntokens. This regularization encourages the group tokens to segment objects with\nless redundancy and capture more comprehensive semantic regions, leading to\nincreased compactness and richness. Based on NPR, we propose the prototypical\nguidance segmentation network (PGSeg) that incorporates multi-modal\nregularization by leveraging prototypical sources from both images and texts at\ndifferent levels, progressively enhancing the segmentation capability with\ndiverse prototypical patterns. Experimental results show that our proposed\nmethod achieves state-of-the-art performance on several benchmark datasets. 
The\nsource code is available at https://github.com/Ferenas/PGSeg.\n","authors":["Fei Zhang","Tianfei Zhou","Boyang Li","Hao He","Chaofan Ma","Tianjiao Zhang","Jiangchao Yao","Ya Zhang","Yanfeng Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19001v1.pdf","comment":"14 pages, Accept in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2307.08283v2","updated":"2023-10-29T13:13:00Z","published":"2023-07-17T07:12:29Z","title":"Complexity Matters: Rethinking the Latent Space for Generative Modeling","summary":" In generative modeling, numerous successful approaches leverage a\nlow-dimensional latent space, e.g., Stable Diffusion models the latent space\ninduced by an encoder and generates images through a paired decoder. Although\nthe selection of the latent space is empirically pivotal, determining the\noptimal choice and the process of identifying it remain unclear. In this study,\nwe aim to shed light on this under-explored topic by rethinking the latent\nspace from the perspective of model complexity. Our investigation starts with\nthe classic generative adversarial networks (GANs). Inspired by the GAN\ntraining objective, we propose a novel \"distance\" between the latent and data\ndistributions, whose minimization coincides with that of the generator\ncomplexity. The minimizer of this distance is characterized as the optimal\ndata-dependent latent that most effectively capitalizes on the generator's\ncapacity. Then, we consider parameterizing such a latent distribution by an\nencoder network and propose a two-stage training strategy called Decoupled\nAutoencoder (DAE), where the encoder is only updated in the first stage with an\nauxiliary decoder and then frozen in the second stage while the actual decoder\nis being trained. DAE can improve the latent distribution and as a result,\nimprove the generative performance. 
Our theoretical analyses are corroborated\nby comprehensive experiments on various models such as VQGAN and Diffusion\nTransformer, where our modifications yield significant improvements in sample\nquality with decreased model complexity.\n","authors":["Tianyang Hu","Fei Chen","Haonan Wang","Jiawei Li","Wenjia Wang","Jiacheng Sun","Zhenguo Li"],"pdf_url":"https://arxiv.org/pdf/2307.08283v2.pdf","comment":"Accepted to NeurIPS 2023 (Spotlight)"},{"id":"http://arxiv.org/abs/2302.10850v2","updated":"2023-10-29T13:05:52Z","published":"2023-02-21T18:02:20Z","title":"Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management","summary":" Reinforcement learning (RL) has shown great promise for developing dialogue\nmanagement (DM) agents that are non-myopic, conduct rich conversations, and\nmaximize overall user satisfaction. Despite recent developments in RL and\nlanguage models (LMs), using RL to power conversational chatbots remains\nchallenging, in part because RL requires online exploration to learn\neffectively, whereas collecting novel human-bot interactions can be expensive\nand unsafe. This issue is exacerbated by the combinatorial action spaces facing\nthese algorithms, as most LM agents generate responses at the word level. We\ndevelop a variety of RL algorithms, specialized to dialogue planning, that\nleverage recent Mixture-of-Expert Language Models (MoE-LMs) -- models that\ncapture diverse semantics, generate utterances reflecting different intents,\nand are amenable for multi-turn DM. By exploiting MoE-LM structure, our methods\nsignificantly reduce the size of the action space and improve the efficacy of\nRL-based DM. 
We evaluate our methods in open-domain dialogue to demonstrate\ntheir effectiveness w.r.t.\\ the diversity of intent in generated utterances and\noverall DM performance.\n","authors":["Dhawal Gupta","Yinlam Chow","Aza Tulepbergenov","Mohammad Ghavamzadeh","Craig Boutilier"],"pdf_url":"https://arxiv.org/pdf/2302.10850v2.pdf","comment":"Thirty-seventh Conference on Neural Information Processing Systems\n (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2306.00169v2","updated":"2023-10-29T13:04:11Z","published":"2023-05-31T20:28:13Z","title":"Inconsistency, Instability, and Generalization Gap of Deep Neural\n Network Training","summary":" As deep neural networks are highly expressive, it is important to find\nsolutions with small generalization gap (the difference between the performance\non the training data and unseen data). Focusing on the stochastic nature of\ntraining, we first present a theoretical analysis in which the bound of\ngeneralization gap depends on what we call inconsistency and instability of\nmodel outputs, which can be estimated on unlabeled data. Our empirical study\nbased on this analysis shows that instability and inconsistency are strongly\npredictive of generalization gap in various settings. In particular, our\nfinding indicates that inconsistency is a more reliable indicator of\ngeneralization gap than the sharpness of the loss landscape. 
Furthermore, we\nshow that algorithmic reduction of inconsistency leads to superior performance.\nThe results also provide a theoretical basis for existing methods such as\nco-distillation and ensemble.\n","authors":["Rie Johnson","Tong Zhang"],"pdf_url":"https://arxiv.org/pdf/2306.00169v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.15944v2","updated":"2023-10-29T13:02:48Z","published":"2023-05-25T11:30:27Z","title":"How to Turn Your Knowledge Graph Embeddings into Generative Models","summary":" Some of the most successful knowledge graph embedding (KGE) models for link\nprediction -- CP, RESCAL, TuckER, ComplEx -- can be interpreted as energy-based\nmodels. Under this perspective they are not amenable for exact\nmaximum-likelihood estimation (MLE), sampling and struggle to integrate logical\nconstraints. This work re-interprets the score functions of these KGEs as\ncircuits -- constrained computational graphs allowing efficient\nmarginalisation. Then, we design two recipes to obtain efficient generative\ncircuit models by either restricting their activations to be non-negative or\nsquaring their outputs. Our interpretation comes with little or no loss of\nperformance for link prediction, while the circuits framework unlocks exact\nlearning by MLE, efficient sampling of new triples, and guarantee that logical\nconstraints are satisfied by design. 
Furthermore, our models scale more\ngracefully than the original KGEs on graphs with millions of entities.\n","authors":["Lorenzo Loconte","Nicola Di Mauro","Robert Peharz","Antonio Vergari"],"pdf_url":"https://arxiv.org/pdf/2305.15944v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.11132v2","updated":"2023-10-29T12:44:49Z","published":"2023-05-18T17:26:03Z","title":"Attacks on Online Learners: a Teacher-Student Analysis","summary":" Machine learning models are famously vulnerable to adversarial attacks: small\nad-hoc perturbations of the data that can catastrophically alter the model\npredictions. While a large literature has studied the case of test-time attacks\non pre-trained models, the important case of attacks in an online learning\nsetting has received little attention so far. In this work, we use a\ncontrol-theoretical perspective to study the scenario where an attacker may\nperturb data labels to manipulate the learning dynamics of an online learner.\nWe perform a theoretical analysis of the problem in a teacher-student setup,\nconsidering different attack strategies, and obtaining analytical results for\nthe steady state of simple linear learners. These results enable us to prove\nthat a discontinuous transition in the learner's accuracy occurs when the\nattack strength exceeds a critical threshold. We then study empirically attacks\non learners with complex architectures using real data, confirming the insights\nof our theoretical analysis. 
Our findings show that greedy attacks can be\nextremely efficient, especially when data stream in small batches.\n","authors":["Riccardo Giuseppe Margiotta","Sebastian Goldt","Guido Sanguinetti"],"pdf_url":"https://arxiv.org/pdf/2305.11132v2.pdf","comment":"19 pages, 10 figures"},{"id":"http://arxiv.org/abs/2310.08767v2","updated":"2023-10-29T12:34:55Z","published":"2023-10-12T23:26:44Z","title":"Modeling Fission Gas Release at the Mesoscale using Multiscale DenseNet\n Regression with Attention Mechanism and Inception Blocks","summary":" Mesoscale simulations of fission gas release (FGR) in nuclear fuel provide a\npowerful tool for understanding how microstructure evolution impacts FGR, but\nthey are computationally intensive. In this study, we present an alternate,\ndata-driven approach, using deep learning to predict instantaneous FGR flux\nfrom 2D nuclear fuel microstructure images. Four convolutional neural network\n(CNN) architectures with multiscale regression are trained and evaluated on\nsimulated FGR data generated using a hybrid phase field/cluster dynamics model.\nAll four networks show high predictive power, with $R^{2}$ values above 98%.\nThe best performing network combine a Convolutional Block Attention Module\n(CBAM) and InceptionNet mechanisms to provide superior accuracy (mean absolute\npercentage error of 4.4%), training stability, and robustness on very low\ninstantaneous FGR flux values.\n","authors":["Peter Toma","Md Ali Muntaha","Joel B. Harley","Michael R. 
Tonks"],"pdf_url":"https://arxiv.org/pdf/2310.08767v2.pdf","comment":"Submitted at Journal of Nuclear Materials, 20 pages, 10 figures, 3\n tables"},{"id":"http://arxiv.org/abs/2305.16934v2","updated":"2023-10-29T12:32:19Z","published":"2023-05-26T13:49:44Z","title":"On Evaluating Adversarial Robustness of Large Vision-Language Models","summary":" Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented\nperformance in response generation, especially with visual inputs, enabling\nmore creative and adaptable interaction than large language models such as\nChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since\nadversaries may successfully evade the entire system by subtly manipulating the\nmost vulnerable modality (e.g., vision). To this end, we propose evaluating the\nrobustness of open-source large VLMs in the most realistic and high-risk\nsetting, where adversaries have only black-box system access and seek to\ndeceive the model into returning the targeted responses. In particular, we\nfirst craft targeted adversarial examples against pretrained models such as\nCLIP and BLIP, and then transfer these adversarial examples to other VLMs such\nas MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we\nobserve that black-box queries on these VLMs can further improve the\neffectiveness of targeted evasion, resulting in a surprisingly high success\nrate for generating targeted responses. Our findings provide a quantitative\nunderstanding regarding the adversarial vulnerability of large VLMs and call\nfor a more thorough examination of their potential security flaws before\ndeployment in practice. 
Code is at https://github.com/yunqing-me/AttackVLM.\n","authors":["Yunqing Zhao","Tianyu Pang","Chao Du","Xiao Yang","Chongxuan Li","Ngai-Man Cheung","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2305.16934v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.14078v2","updated":"2023-10-29T12:30:40Z","published":"2023-09-25T12:13:56Z","title":"ODE-based Recurrent Model-free Reinforcement Learning for POMDPs","summary":" Neural ordinary differential equations (ODEs) are widely recognized as the\nstandard for modeling physical mechanisms, which help to perform approximate\ninference in unknown physical or biological environments. In partially\nobservable (PO) environments, how to infer unseen information from raw\nobservations puzzled the agents. By using a recurrent policy with a compact\ncontext, context-based reinforcement learning provides a flexible way to\nextract unobservable information from historical transitions. To help the agent\nextract more dynamics-related information, we present a novel ODE-based\nrecurrent model combines with model-free reinforcement learning (RL) framework\nto solve partially observable Markov decision processes (POMDPs). We\nexperimentally demonstrate the efficacy of our methods across various PO\ncontinuous control and meta-RL tasks. Furthermore, our experiments illustrate\nthat our method is robust against irregular observations, owing to the ability\nof ODEs to model irregularly-sampled time series.\n","authors":["Xuanle Zhao","Duzhen Zhang","Liyuan Han","Tielin Zhang","Bo Xu"],"pdf_url":"https://arxiv.org/pdf/2309.14078v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.06807v2","updated":"2023-10-29T12:30:15Z","published":"2023-05-08T07:52:15Z","title":"Information Design in Multi-Agent Reinforcement Learning","summary":" Reinforcement learning (RL) is inspired by the way human infants and animals\nlearn from the environment. 
The setting is somewhat idealized because, in\nactual tasks, other agents in the environment have their own goals and behave\nadaptively to the ego agent. To thrive in those environments, the agent needs\nto influence other agents so their actions become more helpful and less\nharmful. Research in computational economics distills two ways to influence\nothers directly: by providing tangible goods (mechanism design) and by\nproviding information (information design). This work investigates information\ndesign problems for a group of RL agents. The main challenges are two-fold. One\nis the information provided will immediately affect the transition of the agent\ntrajectories, which introduces additional non-stationarity. The other is the\ninformation can be ignored, so the sender must provide information that the\nreceiver is willing to respect. We formulate the Markov signaling game, and\ndevelop the notions of signaling gradient and the extended obedience\nconstraints that address these challenges. Our algorithm is efficient on\nvarious mixed-motive tasks and provides further insights into computational\neconomics. Our code is publicly available at\nhttps://github.com/YueLin301/InformationDesignMARL.\n","authors":["Yue Lin","Wenhao Li","Hongyuan Zha","Baoxiang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.06807v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.09943v2","updated":"2023-10-29T12:20:22Z","published":"2022-10-18T15:46:05Z","title":"Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face\n Recognition","summary":" Face recognition systems are widely deployed in safety-critical applications,\nincluding law enforcement, yet they exhibit bias across a range of\nsocio-demographic dimensions, such as gender and race. Conventional wisdom\ndictates that model biases arise from biased training data. 
As a consequence,\nprevious works on bias mitigation largely focused on pre-processing the\ntraining data, adding penalties to prevent bias from effecting the model during\ntraining, or post-processing predictions to debias them, yet these approaches\nhave shown limited success on hard problems such as face recognition. In our\nwork, we discover that biases are actually inherent to neural network\narchitectures themselves. Following this reframing, we conduct the first neural\narchitecture search for fairness, jointly with a search for hyperparameters.\nOur search outputs a suite of models which Pareto-dominate all other\nhigh-performance architectures and existing bias mitigation methods in terms of\naccuracy and fairness, often by large margins, on the two most widely used\ndatasets for face identification, CelebA and VGGFace2. Furthermore, these\nmodels generalize to other datasets and sensitive attributes. We release our\ncode, models and raw data files at https://github.com/dooleys/FR-NAS.\n","authors":["Rhea Sanjay Sukthanker","Samuel Dooley","John P. Dickerson","Colin White","Frank Hutter","Micah Goldblum"],"pdf_url":"https://arxiv.org/pdf/2210.09943v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.10688v3","updated":"2023-10-29T12:14:29Z","published":"2023-02-21T14:14:40Z","title":"On Calibrating Diffusion Probabilistic Models","summary":" Recently, diffusion probabilistic models (DPMs) have achieved promising\nresults in diverse generative tasks. A typical DPM framework includes a forward\nprocess that gradually diffuses the data distribution and a reverse process\nthat recovers the data distribution from time-dependent data scores. In this\nwork, we observe that the stochastic reverse process of data scores is a\nmartingale, from which concentration bounds and the optional stopping theorem\nfor data scores can be derived. 
Then, we discover a simple way for calibrating\nan arbitrary pretrained DPM, with which the score matching loss can be reduced\nand the lower bounds of model likelihood can consequently be increased. We\nprovide general calibration guidelines under various model parametrizations.\nOur calibration method is performed only once and the resulting models can be\nused repeatedly for sampling. We conduct experiments on multiple datasets to\nempirically validate our proposal. Our code is at\nhttps://github.com/thudzj/Calibrated-DPMs.\n","authors":["Tianyu Pang","Cheng Lu","Chao Du","Min Lin","Shuicheng Yan","Zhijie Deng"],"pdf_url":"https://arxiv.org/pdf/2302.10688v3.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18988v1","updated":"2023-10-29T12:05:39Z","published":"2023-10-29T12:05:39Z","title":"A U-turn on Double Descent: Rethinking Parameter Counting in Statistical\n Learning","summary":" Conventional statistical wisdom established a well-understood relationship\nbetween model complexity and prediction error, typically presented as a\nU-shaped curve reflecting a transition between under- and overfitting regimes.\nHowever, motivated by the success of overparametrized neural networks, recent\ninfluential work has suggested this theory to be generally incomplete,\nintroducing an additional regime that exhibits a second descent in test error\nas the parameter count p grows past sample size n - a phenomenon dubbed double\ndescent. While most attention has naturally been given to the deep-learning\nsetting, double descent was shown to emerge more generally across non-neural\nmodels: known cases include linear regression, trees, and boosting. In this\nwork, we take a closer look at evidence surrounding these more classical\nstatistical machine learning methods and challenge the claim that observed\ncases of double descent truly extend the limits of a traditional U-shaped\ncomplexity-generalization curve therein. 
We show that once careful\nconsideration is given to what is being plotted on the x-axes of their double\ndescent plots, it becomes apparent that there are implicitly multiple\ncomplexity axes along which the parameter count grows. We demonstrate that the\nsecond descent appears exactly (and only) when and where the transition between\nthese underlying axes occurs, and that its location is thus not inherently tied\nto the interpolation threshold p=n. We then gain further insight by adopting a\nclassical nonparametric statistics perspective. We interpret the investigated\nmethods as smoothers and propose a generalized measure for the effective number\nof parameters they use on unseen examples, using which we find that their\napparent double descent curves indeed fold back into more traditional convex\nshapes - providing a resolution to tensions between double descent and\nstatistical intuition.\n","authors":["Alicia Curth","Alan Jeffares","Mihaela van der Schaar"],"pdf_url":"https://arxiv.org/pdf/2310.18988v1.pdf","comment":"To appear in the Proceedings of the 37th Conference on Neural\n Information Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2309.07080v2","updated":"2023-10-29T11:47:28Z","published":"2023-09-07T22:54:06Z","title":"Bayesian Dynamic DAG Learning: Application in Discovering Dynamic\n Effective Connectome of Brain","summary":" Understanding the complex mechanisms of the brain can be unraveled by\nextracting the Dynamic Effective Connectome (DEC). Recently, score-based\nDirected Acyclic Graph (DAG) discovery methods have shown significant\nimprovements in extracting the causal structure and inferring effective\nconnectivity. However, learning DEC through these methods still faces two main\nchallenges: one with the fundamental impotence of high-dimensional dynamic DAG\ndiscovery methods and the other with the low quality of fMRI data. 
In this\npaper, we introduce Bayesian Dynamic DAG learning with M-matrices Acyclicity\ncharacterization \\textbf{(BDyMA)} method to address the challenges in\ndiscovering DEC. The presented dynamic causal model enables us to discover\nbidirected edges as well. Leveraging an unconstrained framework in the BDyMA\nmethod leads to more accurate results in detecting high-dimensional networks,\nachieving sparser outcomes, making it particularly suitable for extracting DEC.\nAdditionally, the score function of the BDyMA method allows the incorporation\nof prior knowledge into the process of dynamic causal discovery which further\nenhances the accuracy of results. Comprehensive simulations on synthetic data\nand experiments on Human Connectome Project (HCP) data demonstrate that our\nmethod can handle both of the two main challenges, yielding more accurate and\nreliable DEC compared to state-of-the-art and baseline methods. Additionally,\nwe investigate the trustworthiness of DTI data as prior knowledge for DEC\ndiscovery and show the improvements in DEC discovery when the DTI data is\nincorporated into the process.\n","authors":["Abdolmahdi Bagheri","Mohammad Pasande","Kevin Bello","Babak Nadjar Araabi","Alireza Akhondi-Asl"],"pdf_url":"https://arxiv.org/pdf/2309.07080v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.13985v2","updated":"2023-10-29T11:29:07Z","published":"2023-09-25T09:37:19Z","title":"Physics-Driven ML-Based Modelling for Correcting Inverse Estimation","summary":" When deploying machine learning estimators in science and engineering (SAE)\ndomains, it is critical to avoid failed estimations that can have disastrous\nconsequences, e.g., in aero engine design. This work focuses on detecting and\ncorrecting failed state estimations before adopting them in SAE inverse\nproblems, by utilizing simulations and performance metrics guided by physical\nlaws. 
We suggest to flag a machine learning estimation when its physical model\nerror exceeds a feasible threshold, and propose a novel approach, GEESE, to\ncorrect it through optimization, aiming at delivering both low error and high\nefficiency. The key designs of GEESE include (1) a hybrid surrogate error model\nto provide fast error estimations to reduce simulation cost and to enable\ngradient based backpropagation of error feedback, and (2) two generative models\nto approximate the probability distributions of the candidate states for\nsimulating the exploitation and exploration behaviours. All three models are\nconstructed as neural networks. GEESE is tested on three real-world SAE inverse\nproblems and compared to a number of state-of-the-art optimization/search\napproaches. Results show that it fails the least number of times in terms of\nfinding a feasible state correction, and requires physical evaluations less\nfrequently in general.\n","authors":["Ruiyuan Kang","Tingting Mu","Panos Liatsis","Dimitrios C. Kyritsis"],"pdf_url":"https://arxiv.org/pdf/2309.13985v2.pdf","comment":"19 pages, the paper is accepted by Neurips 2023 as a spotlight"},{"id":"http://arxiv.org/abs/2310.13388v2","updated":"2023-10-29T10:48:51Z","published":"2023-10-20T09:56:22Z","title":"Music Augmentation and Denoising For Peak-Based Audio Fingerprinting","summary":" Audio fingerprinting is a well-established solution for song identification\nfrom short recording excerpts. Popular methods rely on the extraction of sparse\nrepresentations, generally spectral peaks, and have proven to be accurate,\nfast, and scalable to large collections. However, real-world applications of\naudio identification often happen in noisy environments, which can cause these\nsystems to fail. In this work, we tackle this problem by introducing and\nreleasing a new audio augmentation pipeline that adds noise to music snippets\nin a realistic way, by stochastically mimicking real-world scenarios. 
We then\npropose and release a deep learning model that removes noisy components from\nspectrograms in order to improve peak-based fingerprinting systems' accuracy.\nWe show that the addition of our model improves the identification performance\nof commonly used audio fingerprinting systems, even under noisy conditions.\n","authors":["Kamil Akesbi","Dorian Desblancs","Benjamin Martin"],"pdf_url":"https://arxiv.org/pdf/2310.13388v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18975v1","updated":"2023-10-29T10:48:44Z","published":"2023-10-29T10:48:44Z","title":"Blacksmith: Fast Adversarial Training of Vision Transformers via a\n Mixture of Single-step and Multi-step Methods","summary":" Despite the remarkable success achieved by deep learning algorithms in\nvarious domains, such as computer vision, they remain vulnerable to adversarial\nperturbations. Adversarial Training (AT) stands out as one of the most\neffective solutions to address this issue; however, single-step AT can lead to\nCatastrophic Overfitting (CO). This scenario occurs when the adversarially\ntrained network suddenly loses robustness against multi-step attacks like\nProjected Gradient Descent (PGD). Although several approaches have been\nproposed to address this problem in Convolutional Neural Networks (CNNs), we\nfound out that they do not perform well when applied to Vision Transformers\n(ViTs). In this paper, we propose Blacksmith, a novel training strategy to\novercome the CO problem, specifically in ViTs. Our approach utilizes either of\nPGD-2 or Fast Gradient Sign Method (FGSM) randomly in a mini-batch during the\nadversarial training of the neural network. This will increase the diversity of\nour training attacks, which could potentially mitigate the CO issue. To manage\nthe increased training time resulting from this combination, we craft the PGD-2\nattack based on only the first half of the layers, while FGSM is applied\nend-to-end. 
Through our experiments, we demonstrate that our novel method\neffectively prevents CO, achieves PGD-2 level performance, and outperforms\nother existing techniques including N-FGSM, which is the state-of-the-art\nmethod in fast training for CNNs.\n","authors":["Mahdi Salmani","Alireza Dehghanpour Farashah","Mohammad Azizmalayeri","Mahdi Amiri","Navid Eslami","Mohammad Taghi Manzuri","Mohammad Hossein Rohban"],"pdf_url":"https://arxiv.org/pdf/2310.18975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18974v1","updated":"2023-10-29T10:47:23Z","published":"2023-10-29T10:47:23Z","title":"EtiCor: Corpus for Analyzing LLMs for Etiquettes","summary":" Etiquettes are an essential ingredient of day-to-day interactions among\npeople. Moreover, etiquettes are region-specific, and etiquettes in one region\nmight contradict those in other regions. In this paper, we propose EtiCor, an\nEtiquettes Corpus, having texts about social norms from five different regions\nacross the globe. The corpus provides a test bed for evaluating LLMs for\nknowledge and understanding of region-specific etiquettes. Additionally, we\npropose the task of Etiquette Sensitivity. We experiment with state-of-the-art\nLLMs (Delphi, Falcon40B, and GPT-3.5). Initial results indicate that LLMs,\nmostly fail to understand etiquettes from regions from non-Western world.\n","authors":["Ashutosh Dwivedi","Pradhyumna Lavania","Ashutosh Modi"],"pdf_url":"https://arxiv.org/pdf/2310.18974v1.pdf","comment":"Accepted at EMNLP 2023, Main Conference"},{"id":"http://arxiv.org/abs/2211.12421v6","updated":"2023-10-29T10:35:05Z","published":"2022-11-11T02:14:28Z","title":"Data-Driven Network Neuroscience: On Data Collection and Benchmark","summary":" This paper presents a comprehensive and quality collection of functional\nhuman brain network data for potential research in the intersection of\nneuroscience, machine learning, and graph analytics. 
Anatomical and functional\nMRI images have been used to understand the functional connectivity of the\nhuman brain and are particularly important in identifying underlying\nneurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism.\nRecently, the study of the brain in the form of brain networks using machine\nlearning and graph analytics has become increasingly popular, especially to\npredict the early onset of these conditions. A brain network, represented as a\ngraph, retains rich structural and positional information that traditional\nexamination methods are unable to capture. However, the lack of publicly\naccessible brain network data prevents researchers from data-driven\nexplorations. One of the main difficulties lies in the complicated\ndomain-specific preprocessing steps and the exhaustive computation required to\nconvert the data from MRI images into brain networks. We bridge this gap by\ncollecting a large amount of MRI images from public databases and a private\nsource, working with domain experts to make sensible design choices, and\npreprocessing the MRI images to produce a collection of brain network datasets.\nThe datasets originate from 6 different sources, cover 4 brain conditions, and\nconsist of a total of 2,702 subjects. We test our graph datasets on 12 machine\nlearning models to provide baselines and validate the data quality on a recent\ngraph analysis model. 
To lower the barrier to entry and promote the research in\nthis interdisciplinary field, we release our brain network data and complete\npreprocessing details including codes at\nhttps://doi.org/10.17608/k6.auckland.21397377 and\nhttps://github.com/brainnetuoa/data_driven_network_neuroscience.\n","authors":["Jiaxing Xu","Yunhan Yang","David Tse Jung Huang","Sophi Shilpa Gururajapathy","Yiping Ke","Miao Qiao","Alan Wang","Haribalan Kumar","Josh McGeown","Eryn Kwon"],"pdf_url":"https://arxiv.org/pdf/2211.12421v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18970v1","updated":"2023-10-29T10:31:59Z","published":"2023-10-29T10:31:59Z","title":"TRIAGE: Characterizing and auditing training data for improved\n regression","summary":" Data quality is crucial for robust machine learning algorithms, with the\nrecent interest in data-centric AI emphasizing the importance of training data\ncharacterization. However, current data characterization methods are largely\nfocused on classification settings, with regression settings largely\nunderstudied. To address this, we introduce TRIAGE, a novel data\ncharacterization framework tailored to regression tasks and compatible with a\nbroad class of regressors. TRIAGE utilizes conformal predictive distributions\nto provide a model-agnostic scoring method, the TRIAGE score. We operationalize\nthe score to analyze individual samples' training dynamics and characterize\nsamples as under-, over-, or well-estimated by the model. We show that TRIAGE's\ncharacterization is consistent and highlight its utility to improve performance\nvia data sculpting/filtering, in multiple regression settings. Additionally,\nbeyond sample level, we show TRIAGE enables new approaches to dataset selection\nand feature acquisition. 
Overall, TRIAGE highlights the value unlocked by data\ncharacterization in real-world regression applications.\n","authors":["Nabeel Seedat","Jonathan Crabbé","Zhaozhi Qian","Mihaela van der Schaar"],"pdf_url":"https://arxiv.org/pdf/2310.18970v1.pdf","comment":"Presented at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18956v1","updated":"2023-10-29T09:56:17Z","published":"2023-10-29T09:56:17Z","title":"End-to-End Autoregressive Retrieval via Bootstrapping for Smart Reply\n Systems","summary":" Reply suggestion systems represent a staple component of many instant\nmessaging and email systems. However, the requirement to produce sets of\nreplies, rather than individual replies, makes the task poorly suited for\nout-of-the-box retrieval architectures, which only consider individual\nmessage-reply similarity. As a result, these systems often rely on additional\npost-processing modules to diversify the outputs. However, these approaches are\nultimately bottlenecked by the performance of the initial retriever, which in\npractice struggles to present a sufficiently diverse range of options to the\ndownstream diversification module, leading to the suggestions being less\nrelevant to the user. In this paper, we consider a novel approach that\nradically simplifies this pipeline through an autoregressive text-to-text\nretrieval model, that learns the smart reply task end-to-end from a dataset of\n(message, reply set) pairs obtained via bootstrapping. Empirical results show\nthis method consistently outperforms a range of state-of-the-art baselines\nacross three datasets, corresponding to a 5.1%-17.9% improvement in relevance,\nand a 0.5%-63.1% improvement in diversity compared to the best baseline\napproach. 
We make our code publicly available.\n","authors":["Benjamin Towle","Ke Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.18956v1.pdf","comment":"FINDINGS-EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18955v1","updated":"2023-10-29T09:55:41Z","published":"2023-10-29T09:55:41Z","title":"Playing in the Dark: No-regret Learning with Adversarial Constraints","summary":" We study a generalization of the classic Online Convex Optimization (OCO)\nframework by considering additional long-term adversarial constraints.\nSpecifically, after an online policy decides its action on a round, in addition\nto a convex cost function, the adversary also reveals a set of $k$ convex\nconstraints. The cost and the constraint functions could change arbitrarily\nwith time, and no information about the future functions is assumed to be\navailable. In this paper, we propose a meta-policy that simultaneously achieves\na sublinear cumulative constraint violation and a sublinear regret. This is\nachieved via a black box reduction of the constrained problem to the standard\nOCO problem for a recursively constructed sequence of surrogate cost functions.\nWe show that optimal performance bounds can be achieved by solving the\nsurrogate problem using any adaptive OCO policy enjoying a standard\ndata-dependent regret bound. A new Lyapunov-based proof technique is presented\nthat reveals a connection between regret and certain sequential inequalities\nthrough a novel decomposition result. 
We conclude the paper by highlighting\napplications to online multi-task learning and network control problems.\n","authors":["Abhishek Sinha","Rahul Vaze"],"pdf_url":"https://arxiv.org/pdf/2310.18955v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18953v1","updated":"2023-10-29T09:54:03Z","published":"2023-10-29T09:54:03Z","title":"TIC-TAC: A Framework To Learn And Evaluate Your Covariance","summary":" We study the problem of unsupervised heteroscedastic covariance estimation,\nwhere the goal is to learn the multivariate target distribution $\\mathcal{N}(y,\n\\Sigma_y | x )$ given an observation $x$. This problem is particularly\nchallenging as $\\Sigma_{y}$ varies for different samples (heteroscedastic) and\nno annotation for the covariance is available (unsupervised). Typically,\nstate-of-the-art methods predict the mean $f_{\\theta}(x)$ and covariance\n$\\textrm{Cov}(f_{\\theta}(x))$ of the target distribution through two neural\nnetworks trained using the negative log-likelihood. This raises two questions:\n(1) Does the predicted covariance truly capture the randomness of the predicted\nmean? (2) In the absence of ground-truth annotation, how can we quantify the\nperformance of covariance estimation? We address (1) by deriving TIC: Taylor\nInduced Covariance, which captures the randomness of the multivariate\n$f_{\\theta}(x)$ by incorporating its gradient and curvature around $x$ through\nthe second order Taylor polynomial. Furthermore, we tackle (2) by introducing\nTAC: Task Agnostic Correlations, a metric which leverages conditioning of the\nnormal distribution to evaluate the covariance. 
We verify the effectiveness of\nTIC through multiple experiments spanning synthetic (univariate, multivariate)\nand real-world datasets (UCI Regression, LSP, and MPII Human Pose Estimation).\nOur experiments show that TIC outperforms state-of-the-art in accurately\nlearning the covariance, as quantified through TAC.\n","authors":["Megh Shukla","Mathieu Salzmann","Alexandre Alahi"],"pdf_url":"https://arxiv.org/pdf/2310.18953v1.pdf","comment":"12 pages, 4 figures. Please feel free to provide feedback!"},{"id":"http://arxiv.org/abs/2305.19185v5","updated":"2023-10-29T09:38:17Z","published":"2023-05-30T16:29:52Z","title":"Compression with Bayesian Implicit Neural Representations","summary":" Many common types of data can be represented as functions that map\ncoordinates to signal values, such as pixel locations to RGB values in the case\nof an image. Based on this view, data can be compressed by overfitting a\ncompact neural network to its functional representation and then encoding the\nnetwork weights. However, most current solutions for this are inefficient, as\nquantization to low-bit precision substantially degrades the reconstruction\nquality. To address this issue, we propose overfitting variational Bayesian\nneural networks to the data and compressing an approximate posterior weight\nsample using relative entropy coding instead of quantizing and entropy coding\nit. This strategy enables direct optimization of the rate-distortion\nperformance by minimizing the $\\beta$-ELBO, and target different\nrate-distortion trade-offs for a given network architecture by adjusting\n$\\beta$. Moreover, we introduce an iterative algorithm for learning prior\nweight distributions and employ a progressive refinement process for the\nvariational posterior that significantly enhances performance. 
Experiments show\nthat our method achieves strong performance on image and audio compression\nwhile retaining simplicity.\n","authors":["Zongyu Guo","Gergely Flamich","Jiajun He","Zhibo Chen","José Miguel Hernández-Lobato"],"pdf_url":"https://arxiv.org/pdf/2305.19185v5.pdf","comment":"Accepted as a Spotlight paper in NeurIPS 2023. Updated camera-ready\n version"},{"id":"http://arxiv.org/abs/2306.06529v2","updated":"2023-10-29T09:30:59Z","published":"2023-06-10T21:55:28Z","title":"Neural Injective Functions for Multisets, Measures and Graphs via a\n Finite Witness Theorem","summary":" Injective multiset functions have a key role in the theoretical study of\nmachine learning on multisets and graphs. Yet, there remains a gap between the\nprovably injective multiset functions considered in theory, which typically\nrely on polynomial moments, and the multiset functions used in practice, which\nrely on $\\textit{neural moments}$ $\\unicode{x2014}$ whose injectivity on\nmultisets has not been studied to date.\n In this paper, we bridge this gap by showing that moments of neural networks\ndo define injective multiset functions, provided that an analytic\nnon-polynomial activation is used. The number of moments required by our theory\nis optimal essentially up to a multiplicative factor of two. To prove this\nresult, we state and prove a $\\textit{finite witness theorem}$, which is of\nindependent interest.\n As a corollary to our main theorem, we derive new approximation results for\nfunctions on multisets and measures, and new separation results for graph\nneural networks. We also provide two negative results: (1) moments of\npiecewise-linear neural networks cannot be injective multiset functions; and\n(2) even when moment-based multiset functions are injective, they can never be\nbi-Lipschitz.\n","authors":["Tal Amir","Steven J. 
Gortler","Ilai Avni","Ravina Ravina","Nadav Dym"],"pdf_url":"https://arxiv.org/pdf/2306.06529v2.pdf","comment":"NeurIPS 2023 camera-ready"},{"id":"http://arxiv.org/abs/2302.09267v3","updated":"2023-10-29T09:26:22Z","published":"2023-02-18T09:24:15Z","title":"Stochastic Approximation Approaches to Group Distributionally Robust\n Optimization","summary":" This paper investigates group distributionally robust optimization (GDRO),\nwith the purpose to learn a model that performs well over $m$ different\ndistributions. First, we formulate GDRO as a stochastic convex-concave\nsaddle-point problem, and demonstrate that stochastic mirror descent (SMD),\nusing $m$ samples in each iteration, achieves an $O(m (\\log m)/\\epsilon^2)$\nsample complexity for finding an $\\epsilon$-optimal solution, which matches the\n$\\Omega(m/\\epsilon^2)$ lower bound up to a logarithmic factor. Then, we make\nuse of techniques from online learning to reduce the number of samples required\nin each round from $m$ to $1$, keeping the same sample complexity.\nSpecifically, we cast GDRO as a two-players game where one player simply\nperforms SMD and the other executes an online algorithm for non-oblivious\nmulti-armed bandits. Next, we consider a more practical scenario where the\nnumber of samples that can be drawn from each distribution is different, and\npropose a novel formulation of weighted GDRO, which allows us to derive\ndistribution-dependent convergence rates. Denote by $n_i$ the sample budget for\nthe $i$-th distribution, and assume $n_1 \\geq n_2 \\geq \\cdots \\geq n_m$. In the\nfirst approach, we incorporate non-uniform sampling into SMD such that the\nsample budget is satisfied in expectation, and prove that the excess risk of\nthe $i$-th distribution decreases at an $O(\\sqrt{n_1 \\log m}/n_i)$ rate. 
In the\nsecond approach, we use mini-batches to meet the budget exactly and also reduce\nthe variance in stochastic gradients, and then leverage stochastic mirror-prox\nalgorithm, which can exploit small variances, to optimize a carefully designed\nweighted GDRO problem. Under appropriate conditions, it attains an $O((\\log\nm)/\\sqrt{n_i})$ convergence rate, which almost matches the optimal\n$O(\\sqrt{1/n_i})$ rate of only learning from the $i$-th distribution with $n_i$\nsamples.\n","authors":["Lijun Zhang","Peng Zhao","Zhen-Hua Zhuang","Tianbao Yang","Zhi-Hua Zhou"],"pdf_url":"https://arxiv.org/pdf/2302.09267v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18948v1","updated":"2023-10-29T09:15:22Z","published":"2023-10-29T09:15:22Z","title":"Building a Safer Maritime Environment Through Multi-Path Long-Term\n Vessel Trajectory Forecasting","summary":" Maritime transport is paramount to global economic growth and environmental\nsustainability. In this regard, the Automatic Identification System (AIS) data\nplays a significant role by offering real-time streaming data on vessel\nmovement, which allows for enhanced traffic surveillance, assisting in vessel\nsafety by avoiding vessel-to-vessel collisions and proactively preventing\nvessel-to-whale ones. This paper tackles an intrinsic problem to trajectory\nforecasting: the effective multi-path long-term vessel trajectory forecasting\non engineered sequences of AIS data. We utilize an encoder-decoder model with\nBidirectional Long Short-Term Memory Networks (Bi-LSTM) to predict the next 12\nhours of vessel trajectories using 1 to 3 hours of AIS data. 
We feed the model\nwith probabilistic features engineered from the AIS data that refer to the\npotential route and destination of each trajectory so that the model,\nleveraging convolutional layers for spatial feature learning and a\nposition-aware attention mechanism that increases the importance of recent\ntimesteps of a sequence during temporal feature learning, forecasts the vessel\ntrajectory taking the potential route and destination into account. The F1\nScore of these features is approximately 85% and 75%, indicating their\nefficiency in supplementing the neural network. We trialed our model in the\nGulf of St. Lawrence, one of the North Atlantic Right Whales (NARW) habitats,\nachieving an R2 score exceeding 98% with varying techniques and features.\nDespite the high R2 score being attributed to well-defined shipping lanes, our\nmodel demonstrates superior complex decision-making during path selection. In\naddition, our model shows enhanced accuracy, with average and median\nforecasting errors of 11km and 6km, respectively. Our study confirms the\npotential of geographical data engineering and trajectory forecasting models\nfor preserving marine life species.\n","authors":["Gabriel Spadon","Jay Kumar","Matthew Smith","Sarah Vela","Romina Gehrmann","Derek Eden","Joshua van Berkel","Amilcar Soares","Ronan Fablet","Ronald Pelot","Stan Matwin"],"pdf_url":"https://arxiv.org/pdf/2310.18948v1.pdf","comment":"44 pages, 13 figures, 6 tables, 27 equations, and 1 algorithm"},{"id":"http://arxiv.org/abs/2301.12466v2","updated":"2023-10-29T09:05:52Z","published":"2023-01-29T15:31:06Z","title":"Kernelized Cumulants: Beyond Kernel Mean Embeddings","summary":" In $\\mathbb R^d$, it is well-known that cumulants provide an alternative to\nmoments that can achieve the same goals with numerous benefits such as lower\nvariance estimators. 
In this paper we extend cumulants to reproducing kernel\nHilbert spaces (RKHS) using tools from tensor algebras and show that they are\ncomputationally tractable by a kernel trick. These kernelized cumulants provide\na new set of all-purpose statistics; the classical maximum mean discrepancy and\nHilbert-Schmidt independence criterion arise as the degree one objects in our\ngeneral construction. We argue both theoretically and empirically (on\nsynthetic, environmental, and traffic data analysis) that going beyond degree\none has several advantages and can be achieved with the same computational\ncomplexity and minimal overhead in our experiments.\n","authors":["Patric Bonnier","Harald Oberhauser","Zoltán Szabó"],"pdf_url":"https://arxiv.org/pdf/2301.12466v2.pdf","comment":"19 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.18940v1","updated":"2023-10-29T09:02:57Z","published":"2023-10-29T09:02:57Z","title":"Language Agents with Reinforcement Learning for Strategic Play in the\n Werewolf Game","summary":" Agents built with large language models (LLMs) have recently achieved great\nadvancements. However, most of the efforts focus on single-agent or cooperative\nsettings, leaving more general multi-agent environments underexplored. We\npropose a new framework powered by reinforcement learning (RL) to develop\nstrategic language agents, i.e., LLM-based agents with strategic thinking\nability, for a popular language game, Werewolf. Werewolf is a social deduction\ngame with hidden roles that involves both cooperation and competition and\nemphasizes deceptive communication and diverse gameplay. Our agent tackles this\ngame by first using LLMs to reason about potential deceptions and generate a\nset of strategically diverse actions. Then an RL policy, which selects an\naction from the candidates, is learned by population-based training to enhance\nthe agents' decision-making ability. 
By combining LLMs with the RL policy, our\nagent produces a variety of emergent strategies, achieves the highest win rate\nagainst other LLM-based agents, and stays robust against adversarial human\nplayers in the Werewolf game.\n","authors":["Zelai Xu","Chao Yu","Fei Fang","Yu Wang","Yi Wu"],"pdf_url":"https://arxiv.org/pdf/2310.18940v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18938v1","updated":"2023-10-29T08:54:26Z","published":"2023-10-29T08:54:26Z","title":"Machine Learning Algorithms to Predict Chess960 Result and Develop\n Opening Themes","summary":" This work focuses on the analysis of Chess 960, also known as Fischer Random\nChess, a variant of traditional chess where the starting positions of the\npieces are randomized. The study aims to predict the game outcome using machine\nlearning techniques and develop an opening theme for each starting position.\nThe first part of the analysis utilizes machine learning models to predict the\ngame result based on certain moves in each position. The methodology involves\nsegregating raw data from .pgn files into usable formats and creating datasets\ncomprising approximately 500 games for each starting position. Three machine\nlearning algorithms -- KNN Clustering, Random Forest, and Gradient Boosted\nTrees -- have been used to predict the game outcome. To establish an opening\ntheme, the board is divided into five regions: center, white kingside, white\nqueenside, black kingside, and black queenside. The data from games played by\ntop engines in all 960 positions is used to track the movement of pieces in the\nopening. By analysing the change in the number of pieces in each region at\nspecific moves, the report predicts the region towards which the game is\ndeveloping. 
These models provide valuable insights into predicting game\noutcomes and understanding the opening theme in Chess 960.\n","authors":["Shreyan Deo","Nishchal Dwivedi"],"pdf_url":"https://arxiv.org/pdf/2310.18938v1.pdf","comment":"16 pages, 6 figures and 3 tables"},{"id":"http://arxiv.org/abs/2310.18937v1","updated":"2023-10-29T08:52:23Z","published":"2023-10-29T08:52:23Z","title":"The Utility of \"Even if...\" Semifactual Explanation to Optimise Positive\n Outcomes","summary":" When users receive either a positive or negative outcome from an automated\nsystem, Explainable AI (XAI) has almost exclusively focused on how to mutate\nnegative outcomes into positive ones by crossing a decision boundary using\ncounterfactuals (e.g., \\textit{\"If you earn 2k more, we will accept your loan\napplication\"}). Here, we instead focus on \\textit{positive} outcomes, and take\nthe novel step of using XAI to optimise them (e.g., \\textit{\"Even if you wish\nto halve your down-payment, we will still accept your loan application\"}).\nExplanations such as these that employ \"even if...\" reasoning, and do not cross\na decision boundary, are known as semifactuals. To instantiate semifactuals in\nthis context, we introduce the concept of \\textit{Gain} (i.e., how much a user\nstands to benefit from the explanation), and consider the first causal\nformalisation of semifactuals. Tests on benchmark datasets show our algorithms\nare better at maximising gain compared to prior work, and that causality is\nimportant in the process. Most importantly, however, a user study supports our\nmain hypothesis by showing people find semifactual explanations more useful\nthan counterfactuals when they receive the positive outcome of a loan\nacceptance.\n","authors":["Eoin M. 
Kenny","Weipeng Huang"],"pdf_url":"https://arxiv.org/pdf/2310.18937v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18936v1","updated":"2023-10-29T08:50:27Z","published":"2023-10-29T08:50:27Z","title":"Adversarial Examples Are Not Real Features","summary":" The existence of adversarial examples has been a mystery for years and\nattracted much interest. A well-known theory by \\citet{ilyas2019adversarial}\nexplains adversarial vulnerability from a data perspective by showing that one\ncan extract non-robust features from adversarial examples and these features\nalone are useful for classification. However, the explanation remains quite\ncounter-intuitive since non-robust features are mostly noise features to\nhumans. In this paper, we re-examine the theory from a larger context by\nincorporating multiple learning paradigms. Notably, we find that contrary to\ntheir good usefulness under supervised learning, non-robust features attain\npoor usefulness when transferred to other self-supervised learning paradigms,\nsuch as contrastive learning, masked image modeling, and diffusion models. It\nreveals that non-robust features are not really as useful as robust or natural\nfeatures that enjoy good transferability between these paradigms. Meanwhile,\nfor robustness, we also show that naturally trained encoders from robust\nfeatures are largely non-robust under AutoAttack. Our cross-paradigm\nexamination suggests that the non-robust features are not really useful but\nmore like paradigm-wise shortcuts, and robust features alone might be\ninsufficient to attain reliable model robustness. 
Code is available at\n\\url{https://github.com/PKU-ML/AdvNotRealFeatures}.\n","authors":["Ang Li","Yifei Wang","Yiwen Guo","Yisen Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18936v1.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18935v1","updated":"2023-10-29T08:47:48Z","published":"2023-10-29T08:47:48Z","title":"Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU\n Networks on Nearly-orthogonal Data","summary":" The implicit bias towards solutions with favorable properties is believed to\nbe a key reason why neural networks trained by gradient-based optimization can\ngeneralize well. While the implicit bias of gradient flow has been widely\nstudied for homogeneous neural networks (including ReLU and leaky ReLU\nnetworks), the implicit bias of gradient descent is currently only understood\nfor smooth neural networks. Therefore, implicit bias in non-smooth neural\nnetworks trained by gradient descent remains an open question. In this paper,\nwe aim to answer this question by studying the implicit bias of gradient\ndescent for training two-layer fully connected (leaky) ReLU neural networks. We\nshow that when the training data are nearly-orthogonal, for leaky ReLU\nactivation function, gradient descent will find a network with a stable rank\nthat converges to $1$, whereas for ReLU activation function, gradient descent\nwill find a neural network with a stable rank that is upper bounded by a\nconstant. Additionally, we show that gradient descent will find a neural\nnetwork such that all the training data points have the same normalized margin\nasymptotically. Experiments on both synthetic and real data back up our\ntheoretical findings.\n","authors":["Yiwen Kou","Zixiang Chen","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2310.18935v1.pdf","comment":"55 pages, 7 figures. 
In NeurIPS 2023"},{"id":"http://arxiv.org/abs/2107.03913v3","updated":"2023-10-29T08:04:16Z","published":"2021-06-21T13:30:43Z","title":"Medical Profile Model: Scientific and Practical Applications in\n Healthcare","summary":" The paper researches the problem of representation learning for electronic\nhealth records. We present the patient histories as temporal sequences of\ndiseases for which embeddings are learned in an unsupervised setup with a\ntransformer-based neural network model. Additionally, the embedding space\nincludes demographic parameters which allow the creation of generalized patient\nprofiles and successful transfer of medical knowledge to other domains. The\ntraining of such a medical profile model has been performed on a dataset of\nmore than one million patients. Detailed model analysis and its comparison with\nthe state-of-the-art method show its clear advantage in the diagnosis\nprediction task. Further, we show two applications based on the developed\nprofile model. First, a novel Harbinger Disease Discovery method that\nreveals disease-associated hypotheses and is potentially beneficial in the\ndesign of epidemiological studies. Second, the patient embeddings extracted\nfrom the profile model applied to the insurance scoring task allow significant\nimprovement in the performance metrics.\n","authors":["Pavel Blinov","Vladimir Kokh"],"pdf_url":"https://arxiv.org/pdf/2107.03913v3.pdf","comment":"8 pages, code available at\n https://github.com/sberbank-ai-lab/mimic.profile, accepted for publication at\n IEEE JBHI"},{"id":"http://arxiv.org/abs/2310.18933v1","updated":"2023-10-29T08:03:45Z","published":"2023-10-29T08:03:45Z","title":"Label Poisoning is All You Need","summary":" In a backdoor attack, an adversary injects corrupted data into a model's\ntraining dataset in order to gain control over its predictions on images with a\nspecific attacker-defined trigger. 
A typical corrupted training example\nrequires altering both the image, by applying the trigger, and the label.\nModels trained on clean images, therefore, were considered safe from backdoor\nattacks. However, in some common machine learning scenarios, the training\nlabels are provided by potentially malicious third-parties. This includes\ncrowd-sourced annotation and knowledge distillation. We, hence, investigate a\nfundamental question: can we launch a successful backdoor attack by only\ncorrupting labels? We introduce a novel approach to design label-only backdoor\nattacks, which we call FLIP, and demonstrate its strengths on three datasets\n(CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32,\nResNet-18, VGG-19, and Vision Transformer). With only 2% of CIFAR-10 labels\ncorrupted, FLIP achieves a near-perfect attack success rate of 99.4% while\nsuffering only a 1.8% drop in the clean test accuracy. Our approach builds upon\nthe recent advances in trajectory matching, originally introduced for dataset\ndistillation.\n","authors":["Rishi D. Jha","Jonathan Hayase","Sewoong Oh"],"pdf_url":"https://arxiv.org/pdf/2310.18933v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.03944v3","updated":"2023-10-29T08:03:43Z","published":"2023-03-07T14:55:05Z","title":"On Momentum-Based Gradient Methods for Bilevel Optimization with\n Nonconvex Lower-Level","summary":" Bilevel optimization is a popular two-level hierarchical optimization, which\nhas been widely applied to many machine learning tasks such as hyperparameter\nlearning, meta learning and continual learning. Although many bilevel\noptimization methods recently have been developed, the bilevel methods are not\nwell studied when the lower-level problem is nonconvex. 
To fill this gap, in\nthe paper, we study a class of nonconvex bilevel optimization problems, where\nboth upper-level and lower-level problems are nonconvex, and the lower-level\nproblem satisfies Polyak-{\\L}ojasiewicz (PL) condition. We propose an efficient\nmomentum-based gradient bilevel method (MGBiO) to solve these deterministic\nproblems. Meanwhile, we propose a class of efficient momentum-based stochastic\ngradient bilevel methods (MSGBiO and VR-MSGBiO) to solve these stochastic\nproblems. Moreover, we provide a useful convergence analysis framework for our\nmethods. Specifically, under some mild conditions, we prove that our MGBiO\nmethod has a sample (or gradient) complexity of $O(\\epsilon^{-2})$ for finding\nan $\\epsilon$-stationary solution of the deterministic bilevel problems (i.e.,\n$\\|\\nabla F(x)\\|\\leq \\epsilon$), which improves the existing best results by a\nfactor of $O(\\epsilon^{-1})$. Meanwhile, we prove that our MSGBiO and VR-MSGBiO\nmethods have sample complexities of $\\tilde{O}(\\epsilon^{-4})$ and\n$\\tilde{O}(\\epsilon^{-3})$, respectively, in finding an $\\epsilon$-stationary\nsolution of the stochastic bilevel problems (i.e., $\\mathbb{E}\\|\\nabla\nF(x)\\|\\leq \\epsilon$), which improves the existing best results by a factor of\n$\\tilde{O}(\\epsilon^{-3})$. 
Extensive experimental results on bilevel PL game\nand hyper-representation learning demonstrate the efficiency of our algorithms.\nThis paper commemorates the mathematician Boris Polyak (1935-2023).\n","authors":["Feihu Huang"],"pdf_url":"https://arxiv.org/pdf/2303.03944v3.pdf","comment":"In new version of our paper, we relaxed some assumptions, updated our\n algorithms and added some numerical experiments"},{"id":"http://arxiv.org/abs/2310.18928v1","updated":"2023-10-29T07:38:33Z","published":"2023-10-29T07:38:33Z","title":"A transfer learning approach with convolutional neural network for Face\n Mask Detection","summary":" Due to the epidemic of the coronavirus (Covid-19) and its rapid spread around\nthe world, the world has faced an enormous crisis. To prevent the spread of the\ncoronavirus, the World Health Organization (WHO) has introduced the use of\nmasks and keeping social distance as the best preventive method. So, developing\nan automatic monitoring system for detecting facemasks in some crowded places\nis essential. To do this, we propose a mask recognition system based on\ntransfer learning and Inception v3 architecture. In the proposed method, two\ndatasets are used simultaneously for training including the Simulated Mask Face\nDataset (SMFD) and MaskedFace-Net (MFN). This paper tries to increase the\naccuracy of the proposed system by optimally setting hyper-parameters and\naccurately designing the fully connected layers. The main advantage of the\nproposed method is that in addition to masked and unmasked faces, it can also\ndetect cases of incorrect use of masks. Therefore, the proposed method\nclassifies the input face images into three categories. 
Experimental results\nshow the high accuracy and efficiency of the proposed method; so, this method\nhas achieved an accuracy of 99.47% and 99.33% in training and test data\nrespectively.\n","authors":["Abolfazl Younesi","Reza Afrouzian","Yousef Seyfari"],"pdf_url":"https://arxiv.org/pdf/2310.18928v1.pdf","comment":"9 pages, in Persian language, 8 figures"},{"id":"http://arxiv.org/abs/2310.18924v1","updated":"2023-10-29T07:32:32Z","published":"2023-10-29T07:32:32Z","title":"Remaining Useful Life Prediction of Lithium-ion Batteries using\n Spatio-temporal Multimodal Attention Networks","summary":" Lithium-ion batteries are widely used in various applications, including\nelectric vehicles and renewable energy storage. The prediction of the remaining\nuseful life (RUL) of batteries is crucial for ensuring reliable and efficient\noperation, as well as reducing maintenance costs. However, determining the life\ncycle of batteries in real-world scenarios is challenging, and existing methods\nhave limitations in predicting the number of cycles iteratively. In addition,\nexisting works often oversimplify the datasets, neglecting important features\nof the batteries such as temperature, internal resistance, and material type.\nTo address these limitations, this paper proposes a two-stage remaining useful\nlife prediction scheme for Lithium-ion batteries using a spatio-temporal\nmultimodal attention network (ST-MAN). The proposed model is designed to\niteratively predict the number of cycles required for the battery to reach the\nend of its useful life, based on available data. The proposed ST-MAN is designed to\ncapture the complex spatio-temporal dependencies in the battery data, including\nthe features that are often neglected in existing works. Experimental results\ndemonstrate that the proposed ST-MAN model outperforms existing CNN and\nLSTM-based methods, achieving state-of-the-art performance in predicting the\nremaining useful life of Li-ion batteries. 
The proposed method has the\npotential to improve the reliability and efficiency of battery operations and\nis applicable in various industries, including automotive and renewable energy.\n","authors":["Sungho Suh","Dhruv Aditya Mittal","Hymalai Bello","Bo Zhou","Mayank Shekhar Jha","Paul Lukowicz"],"pdf_url":"https://arxiv.org/pdf/2310.18924v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15080v2","updated":"2023-10-29T07:17:45Z","published":"2023-10-23T16:37:59Z","title":"Federated Learning of Large Language Models with Parameter-Efficient\n Prompt Tuning and Adaptive Optimization","summary":" Federated learning (FL) is a promising paradigm to enable collaborative model\ntraining with decentralized data. However, the training process of Large\nLanguage Models (LLMs) generally incurs the update of significant parameters,\nwhich limits the applicability of FL techniques to tackle the LLMs in real\nscenarios. Prompt tuning can significantly reduce the number of parameters to\nupdate, but it either incurs performance degradation or low training\nefficiency. The straightforward utilization of prompt tuning in the FL often\nraises non-trivial communication costs and dramatically degrades performance.\nIn addition, the decentralized data is generally non-Independent and\nIdentically Distributed (non-IID), which brings client drift problems and thus\npoor performance. This paper proposes a Parameter-efficient prompt Tuning\napproach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and\neffective FL of LLMs. First, an efficient partial prompt tuning approach is\nproposed to improve performance and efficiency simultaneously. Second, a novel\nadaptive optimization method is developed to address the client drift problems\non both the device and server sides to enhance performance further. 
Extensive\nexperiments based on 10 datasets demonstrate the superb performance (up to\n60.8\\% in terms of accuracy) and efficiency (up to 97.59\\% in terms of training\ntime) of FedPepTAO compared with 9 baseline approaches. Our code is available\nat https://github.com/llm-eff/FedPepTAO.\n","authors":["Tianshi Che","Ji Liu","Yang Zhou","Jiaxiang Ren","Jiwen Zhou","Victor S. Sheng","Huaiyu Dai","Dejing Dou"],"pdf_url":"https://arxiv.org/pdf/2310.15080v2.pdf","comment":"18 pages, accepted by EMNLP 2023"},{"id":"http://arxiv.org/abs/2304.03370v2","updated":"2023-10-29T07:17:33Z","published":"2023-04-06T20:54:03Z","title":"Reliable learning in challenging environments","summary":" The problem of designing learners that provide guarantees that their\npredictions are provably correct is of increasing importance in machine\nlearning. However, learning theoretic guarantees have only been considered in\nvery specific settings. In this work, we consider the design and analysis of\nreliable learners in challenging test-time environments as encountered in\nmodern machine learning problems: namely `adversarial' test-time attacks (in\nseveral variations) and `natural' distribution shifts. In this work, we provide\na reliable learner with provably optimal guarantees in such settings. 
We\ndiscuss computationally feasible implementations of the learner and further\nshow that our algorithm achieves strong positive performance guarantees on\nseveral natural examples: for example, linear separators under log-concave\ndistributions or smooth boundary classifiers under smooth probability\ndistributions.\n","authors":["Maria-Florina Balcan","Steve Hanneke","Rattana Pukdee","Dravyansh Sharma"],"pdf_url":"https://arxiv.org/pdf/2304.03370v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.16424v2","updated":"2023-10-29T07:16:38Z","published":"2023-06-22T10:32:51Z","title":"Realistic Synthetic Financial Transactions for Anti-Money Laundering\n Models","summary":" With the widespread digitization of finance and the increasing popularity of\ncryptocurrencies, the sophistication of fraud schemes devised by cybercriminals\nis growing. Money laundering -- the movement of illicit funds to conceal their\norigins -- can cross bank and national boundaries, producing complex\ntransaction patterns. The UN estimates 2-5\\% of global GDP or \\$0.8 - \\$2.0\ntrillion dollars are laundered globally each year. Unfortunately, real data to\ntrain machine learning models to detect laundering is generally not available,\nand previous synthetic data generators have had significant shortcomings. A\nrealistic, standardized, publicly-available benchmark is needed for comparing\nmodels and for the advancement of the area. To this end, this paper contributes\na synthetic financial transaction dataset generator and a set of synthetically\ngenerated AML (Anti-Money Laundering) datasets. We have calibrated this\nagent-based generator to match real transactions as closely as possible and\nmade the datasets public. We describe the generator in detail and demonstrate\nhow the datasets generated can help compare different Graph Neural Networks in\nterms of their AML abilities. 
In a key way, using synthetic data in these\ncomparisons can be even better than using real data: the ground truth labels\nare complete, whilst many laundering transactions in real data are never\ndetected.\n","authors":["Erik Altman","Jovan Blanuša","Luc von Niederhäusern","Béni Egressy","Andreea Anghel","Kubilay Atasu"],"pdf_url":"https://arxiv.org/pdf/2306.16424v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2204.08581v3","updated":"2023-10-29T07:08:03Z","published":"2022-04-18T22:40:14Z","title":"On Parametric Optimal Execution and Machine Learning Surrogates","summary":" We investigate optimal order execution problems in discrete time with\ninstantaneous price impact and stochastic resilience. First, in the setting of\nlinear transient price impact we derive a closed-form recursion for the optimal\nstrategy, extending the deterministic results from Obizhaeva and Wang (J\nFinancial Markets, 2013). Second, we develop a numerical algorithm based on\ndynamic programming and deep learning for the case of nonlinear transient price\nimpact as proposed by Bouchaud et al. (Quant. Finance, 2004). Specifically, we\nutilize an actor-critic framework that constructs two neural-network (NN)\nsurrogates for the value function and the feedback control. The flexible\nscalability of NN functional approximators enables parametric learning, i.e.,\nincorporating several model or market parameters as part of the input space.\nPrecise calibration of price impact, resilience, etc., is known to be extremely\nchallenging and hence it is critical to understand sensitivity of the execution\npolicy to these parameters. Our NN learner organically scales across multiple\ninput dimensions and is shown to accurately approximate optimal strategies\nacross a wide range of parameter configurations. 
We provide a fully\nreproducible Jupyter Notebook with our NN implementation, which is of\nindependent pedagogical interest, demonstrating the ease of use of NN\nsurrogates in (parametric) stochastic control problems.\n","authors":["Tao Chen","Mike Ludkovski","Moritz Voß"],"pdf_url":"https://arxiv.org/pdf/2204.08581v3.pdf","comment":"33 pages, 8 figures. Github repo at\n https://github.com/moritz-voss/Parametric_Optimal_Execution_ML"},{"id":"http://arxiv.org/abs/2307.08452v2","updated":"2023-10-29T06:29:33Z","published":"2023-07-17T12:47:33Z","title":"SBMLtoODEjax: Efficient Simulation and Optimization of Biological\n Network Models in JAX","summary":" Advances in bioengineering and biomedicine demand a deep understanding of the\ndynamic behavior of biological systems, ranging from protein pathways to\ncomplex cellular processes. Biological networks like gene regulatory networks\nand protein pathways are key drivers of embryogenesis and physiological\nprocesses. Comprehending their diverse behaviors is essential for tackling\ndiseases, including cancer, as well as for engineering novel biological\nconstructs. Despite the availability of extensive mathematical models\nrepresented in Systems Biology Markup Language (SBML), researchers face\nsignificant challenges in exploring the full spectrum of behaviors and\noptimizing interventions to efficiently shape those behaviors. Existing tools\ndesigned for simulation of biological network models are not tailored to\nfacilitate interventions on network dynamics nor to facilitate automated\ndiscovery. Leveraging recent developments in machine learning (ML), this paper\nintroduces SBMLtoODEjax, a lightweight library designed to seamlessly integrate\nSBML models with ML-supported pipelines, powered by JAX. 
SBMLtoODEjax\nfacilitates the reuse and customization of SBML-based models, harnessing JAX's\ncapabilities for efficient parallel simulations and optimization, with the aim\nto accelerate research in biological network analysis.\n","authors":["Mayalen Etcheverry","Michael Levin","Clément Moulin-Frier","Pierre-Yves Oudeyer"],"pdf_url":"https://arxiv.org/pdf/2307.08452v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2209.15217v3","updated":"2023-10-29T06:18:12Z","published":"2022-09-30T04:09:06Z","title":"Hyperbolic VAE via Latent Gaussian Distributions","summary":" We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent\nspace consists of a set of Gaussian distributions. It is known that the set of\nthe univariate Gaussian distributions with the Fisher information metric form a\nhyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed\nwith the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal\ndistribution based on the Kullback-Leibler divergence, a local approximation of\nthe squared Fisher-Rao distance, to define a density over the latent space. In\nexperiments, we demonstrate the efficacy of GM-VAE on two different tasks:\ndensity estimation of image datasets and environment modeling in model-based\nreinforcement learning. GM-VAE outperforms the other variants of hyperbolic-\nand Euclidean-VAEs on density estimation tasks and shows competitive\nperformance in model-based reinforcement learning. 
We observe that our model\nprovides strong numerical stability, addressing a common limitation reported in\nprevious hyperbolic-VAEs.\n","authors":["Seunghyuk Cho","Juyong Lee","Dongwoo Kim"],"pdf_url":"https://arxiv.org/pdf/2209.15217v3.pdf","comment":"20 pages, Thirty-seventh Conference on Neural Information Processing\n System, 2023"},{"id":"http://arxiv.org/abs/2310.18919v1","updated":"2023-10-29T06:12:43Z","published":"2023-10-29T06:12:43Z","title":"Posterior Sampling with Delayed Feedback for Reinforcement Learning with\n Linear Function Approximation","summary":" Recent studies in reinforcement learning (RL) have made significant progress\nby leveraging function approximation to alleviate the sample complexity hurdle\nfor better performance. Despite the success, existing provably efficient\nalgorithms typically rely on the accessibility of immediate feedback upon\ntaking actions. The failure to account for the impact of delay in observations\ncan significantly degrade the performance of real-world systems due to the\nregret blow-up. In this work, we tackle the challenge of delayed feedback in RL\nwith linear function approximation by employing posterior sampling, which has\nbeen shown to empirically outperform the popular UCB algorithms in a wide range\nof regimes. We first introduce Delayed-PSVI, an optimistic value-based\nalgorithm that effectively explores the value function space via noise\nperturbation with posterior sampling. We provide the first analysis for\nposterior sampling algorithms with delayed feedback in RL and show our\nalgorithm achieves $\\widetilde{O}(\\sqrt{d^3H^3 T} + d^2H^2 E[\\tau])$ worst-case\nregret in the presence of unknown stochastic delays. Here $E[\\tau]$ is the\nexpected delay. 
To further improve its computational efficiency and to expand\nits applicability in high-dimensional RL problems, we incorporate a\ngradient-based approximate sampling scheme via Langevin dynamics for\nDelayed-LPSVI, which maintains the same order-optimal regret guarantee with\n$\\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to\ndemonstrate the statistical and computational efficacy of our algorithms.\n","authors":["Nikki Lijing Kuang","Ming Yin","Mengdi Wang","Yu-Xiang Wang","Yi-An Ma"],"pdf_url":"https://arxiv.org/pdf/2310.18919v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18918v1","updated":"2023-10-29T06:11:49Z","published":"2023-10-29T06:11:49Z","title":"Hyperbolic Graph Neural Networks at Scale: A Meta Learning Approach","summary":" The progress in hyperbolic neural networks (HNNs) research is hindered by\ntheir absence of inductive bias mechanisms, which are essential for\ngeneralizing to new tasks and facilitating scalable learning over large\ndatasets. In this paper, we aim to alleviate these issues by learning\ngeneralizable inductive biases from the nodes' local subgraph and transfer them\nfor faster learning over new subgraphs with a disjoint set of nodes, edges, and\nlabels in a few-shot setting. We introduce a novel method, Hyperbolic GRAph\nMeta Learner (H-GRAM), that, for the tasks of node classification and link\nprediction, learns transferable information from a set of support local\nsubgraphs in the form of hyperbolic meta gradients and label hyperbolic\nprotonets to enable faster learning over a query set of new tasks dealing with\ndisjoint subgraphs. Furthermore, we show that an extension of our meta-learning\nframework also mitigates the scalability challenges seen in HNNs faced by\nexisting approaches. Our comparative analysis shows that H-GRAM effectively\nlearns and transfers information in multiple challenging few-shot settings\ncompared to other state-of-the-art baselines. 
Additionally, we demonstrate\nthat, unlike standard HNNs, our approach is able to scale over large graph\ndatasets and improve performance over its Euclidean counterparts.\n","authors":["Nurendra Choudhary","Nikhil Rao","Chandan K. Reddy"],"pdf_url":"https://arxiv.org/pdf/2310.18918v1.pdf","comment":"Accepted to NeurIPS 2023. 14 pages of main paper, 5 pages of\n supplementary"},{"id":"http://arxiv.org/abs/2111.06036v2","updated":"2023-10-29T06:11:29Z","published":"2021-11-11T03:17:28Z","title":"CubeTR: Learning to Solve The Rubiks Cube Using Transformers","summary":" Since its first appearance, transformers have been successfully used in wide\nranging domains from computer vision to natural language processing.\nApplication of transformers in Reinforcement Learning by reformulating it as a\nsequence modelling problem was proposed only recently. Compared to other\ncommonly explored reinforcement learning problems, the Rubiks cube poses a\nunique set of challenges. The Rubiks cube has a single solved state for\nquintillions of possible configurations which leads to extremely sparse\nrewards. The proposed model CubeTR attends to longer sequences of actions and\naddresses the problem of sparse rewards. CubeTR learns how to solve the Rubiks\ncube from arbitrary starting states without any human prior, and after move\nregularisation, the lengths of solutions generated by it are expected to be\nvery close to those given by algorithms used by expert human solvers. 
CubeTR\nprovides insights to the generalisability of learning algorithms to higher\ndimensional cubes and the applicability of transformers in other relevant\nsparse reward scenarios.\n","authors":["Mustafa Ebrahim Chasmai"],"pdf_url":"https://arxiv.org/pdf/2111.06036v2.pdf","comment":"It has untested ideas without supporting experimentation.\n Discontinued work in this direction"},{"id":"http://arxiv.org/abs/2310.18912v1","updated":"2023-10-29T05:48:04Z","published":"2023-10-29T05:48:04Z","title":"Sentence Bag Graph Formulation for Biomedical Distant Supervision\n Relation Extraction","summary":" We introduce a novel graph-based framework for alleviating key challenges in\ndistantly-supervised relation extraction and demonstrate its effectiveness in\nthe challenging and important domain of biomedical data. Specifically, we\npropose a graph view of sentence bags referring to an entity pair, which\nenables message-passing based aggregation of information related to the entity\npair over the sentence bag. 
The proposed framework alleviates the common\nproblem of noisy labeling in distantly supervised relation extraction and also\neffectively incorporates inter-dependencies between sentences within a bag.\nExtensive experiments on two large-scale biomedical relation datasets and the\nwidely utilized NYT dataset demonstrate that our proposed framework\nsignificantly outperforms the state-of-the-art methods for biomedical distant\nsupervision relation extraction while also providing excellent performance for\nrelation extraction in the general text mining domain.\n","authors":["Hao Zhang","Yang Liu","Xiaoyan Liu","Tianming Liang","Gaurav Sharma","Liang Xue","Maozu Guo"],"pdf_url":"https://arxiv.org/pdf/2310.18912v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18910v1","updated":"2023-10-29T05:31:43Z","published":"2023-10-29T05:31:43Z","title":"InstanT: Semi-supervised Learning with Instance-dependent Thresholds","summary":" Semi-supervised learning (SSL) has been a fundamental challenge in machine\nlearning for decades. The primary family of SSL algorithms, known as\npseudo-labeling, involves assigning pseudo-labels to confident unlabeled\ninstances and incorporating them into the training set. Therefore, the\nselection criteria of confident instances are crucial to the success of SSL.\nRecently, there has been growing interest in the development of SSL methods\nthat use dynamic or adaptive thresholds. Yet, these methods typically apply the\nsame threshold to all samples, or use class-dependent thresholds for instances\nbelonging to a certain class, while neglecting instance-level information. In\nthis paper, we propose the study of instance-dependent thresholds, which has\nthe highest degree of freedom compared with existing methods. 
Specifically, we\ndevise a novel instance-dependent threshold function for all unlabeled\ninstances by utilizing their instance-level ambiguity and the\ninstance-dependent error rates of pseudo-labels, so instances that are more\nlikely to have incorrect pseudo-labels will have higher thresholds.\nFurthermore, we demonstrate that our instance-dependent threshold function\nprovides a bounded probabilistic guarantee for the correctness of the\npseudo-labels it assigns.\n","authors":["Muyang Li","Runze Wu","Haoyu Liu","Jun Yu","Xun Yang","Bo Han","Tongliang Liu"],"pdf_url":"https://arxiv.org/pdf/2310.18910v1.pdf","comment":"Accepted as poster for NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18908v1","updated":"2023-10-29T05:29:59Z","published":"2023-10-29T05:29:59Z","title":"Estimating the Rate-Distortion Function by Wasserstein Gradient Descent","summary":" In the theory of lossy compression, the rate-distortion (R-D) function $R(D)$\ndescribes how much a data source can be compressed (in bit-rate) at any given\nlevel of fidelity (distortion). Obtaining $R(D)$ for a given data source\nestablishes the fundamental performance limit for all compression algorithms.\nWe propose a new method to estimate $R(D)$ from the perspective of optimal\ntransport. Unlike the classic Blahut--Arimoto algorithm which fixes the support\nof the reproduction distribution in advance, our Wasserstein gradient descent\nalgorithm learns the support of the optimal reproduction distribution by moving\nparticles. We prove its local convergence and analyze the sample complexity of\nour R-D estimator based on a connection to entropic optimal transport.\nExperimentally, we obtain comparable or tighter bounds than state-of-the-art\nneural network methods on low-rate sources while requiring considerably less\ntuning and computation effort. 
We also highlight a connection to\nmaximum-likelihood deconvolution and introduce a new class of sources that can\nbe used as test cases with known solutions to the R-D problem.\n","authors":["Yibo Yang","Stephan Eckstein","Marcel Nutz","Stephan Mandt"],"pdf_url":"https://arxiv.org/pdf/2310.18908v1.pdf","comment":"Accepted as conference paper at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18907v1","updated":"2023-10-29T05:29:49Z","published":"2023-10-29T05:29:49Z","title":"Topological, or Non-topological? A Deep Learning Based Prediction","summary":" Prediction and discovery of new materials with desired properties are at the\nforefront of quantum science and technology research. A major bottleneck in\nthis field is the computational resources and time complexity related to\nfinding new materials from ab initio calculations. In this work, an effective\nand robust deep learning-based model is proposed by incorporating persistent\nhomology and graph neural network which offers an accuracy of 91.4% and an F1\nscore of 88.5% in classifying topological vs. non-topological materials,\noutperforming the other state-of-the-art classifier models. The incorporation\nof the graph neural network encodes the underlying relation between the atoms\ninto the model based on their own crystalline structures and thus proved to be\nan effective method to represent and process non-euclidean data like molecules\nwith a relatively shallow network. The persistent homology pipeline in the\nsuggested neural network is capable of integrating the atom-specific\ntopological information into the deep learning model, increasing robustness,\nand gain in performance. It is believed that the presented work will be an\nefficacious tool for predicting the topological class and therefore enable the\nhigh-throughput search for novel materials in this field.\n","authors":["Ashiqur Rasul","Md Shafayat Hossain","Ankan Ghosh Dastider","Himaddri Roy","M. Zahid Hasan","Quazi D. M. 
Khosru"],"pdf_url":"https://arxiv.org/pdf/2310.18907v1.pdf","comment":"13 pages, 8 figures"},{"id":"http://arxiv.org/abs/2309.02623v2","updated":"2023-10-29T05:21:37Z","published":"2023-09-05T23:49:46Z","title":"Superclustering by finding statistically significant separable groups of\n optimal gaussian clusters","summary":" The paper presents the algorithm for clustering a dataset by grouping the\noptimal, from the point of view of the BIC criterion, number of Gaussian\nclusters into the optimal, from the point of view of their statistical\nseparability, superclusters.\n The algorithm consists of three stages: representation of the dataset as a\nmixture of Gaussian distributions - clusters, which number is determined based\non the minimum of the BIC criterion; using the Mahalanobis distance, to\nestimate the distances between the clusters and cluster sizes; combining the\nresulting clusters into superclusters using the DBSCAN method by finding its\nhyperparameter (maximum distance) providing maximum value of introduced matrix\nquality criterion at maximum number of superclusters. The matrix quality\ncriterion corresponds to the proportion of statistically significant separated\nsuperclusters among all found superclusters.\n The algorithm has only one hyperparameter - statistical significance level,\nand automatically detects optimal number and shape of superclusters based of\nstatistical hypothesis testing approach. The algorithm demonstrates a good\nresults on test datasets in noise and noiseless situations. An essential\nadvantage of the algorithm is its ability to predict correct supercluster for\nnew data based on already trained clusterer and perform soft (fuzzy)\nclustering. The disadvantages of the algorithm are: its low speed and\nstochastic nature of the final clustering. It requires a sufficiently large\ndataset for clustering, which is typical for many statistical methods.\n","authors":["Oleg I. 
Berngardt"],"pdf_url":"https://arxiv.org/pdf/2309.02623v2.pdf","comment":"25 pages, 6 figures, 1 table"},{"id":"http://arxiv.org/abs/2304.11327v2","updated":"2023-10-29T05:20:53Z","published":"2023-04-22T05:57:00Z","title":"Understanding and Improving Feature Learning for Out-of-Distribution\n Generalization","summary":" A common explanation for the failure of out-of-distribution (OOD)\ngeneralization is that the model trained with empirical risk minimization (ERM)\nlearns spurious features instead of invariant features. However, several recent\nstudies challenged this explanation and found that deep networks may have\nalready learned sufficiently good features for OOD generalization. Despite the\ncontradictions at first glance, we theoretically show that ERM essentially\nlearns both spurious and invariant features, while ERM tends to learn spurious\nfeatures faster if the spurious correlation is stronger. Moreover, when fed the\nERM learned features to the OOD objectives, the invariant feature learning\nquality significantly affects the final OOD performance, as OOD objectives\nrarely learn new features. Therefore, ERM feature learning can be a bottleneck\nto OOD generalization. To alleviate the reliance, we propose Feature Augmented\nTraining (FeAT), to enforce the model to learn richer features ready for OOD\ngeneralization. FeAT iteratively augments the model to learn new features while\nretaining the already learned features. In each round, the retention and\naugmentation operations are performed on different subsets of the training data\nthat capture distinct features. 
Extensive experiments show that FeAT\neffectively learns richer features thus boosting the performance of various OOD\nobjectives.\n","authors":["Yongqiang Chen","Wei Huang","Kaiwen Zhou","Yatao Bian","Bo Han","James Cheng"],"pdf_url":"https://arxiv.org/pdf/2304.11327v2.pdf","comment":"Yongqiang Chen, Wei Huang, and Kaiwen Zhou contributed equally;\n NeurIPS 2023, 55 pages, 64 figures"},{"id":"http://arxiv.org/abs/2206.09203v2","updated":"2023-10-29T05:04:32Z","published":"2022-06-18T13:32:41Z","title":"Interactive Visual Reasoning under Uncertainty","summary":" One of the fundamental cognitive abilities of humans is to quickly resolve\nuncertainty by generating hypotheses and testing them via active trials.\nEncountering a novel phenomenon accompanied by ambiguous cause-effect\nrelationships, humans make hypotheses against data, conduct inferences from\nobservation, test their theory via experimentation, and correct the proposition\nif inconsistency arises. These iterative processes persist until the underlying\nmechanism becomes clear. In this work, we devise the IVRE (pronounced as\n\"ivory\") environment for evaluating artificial agents' reasoning ability under\nuncertainty. IVRE is an interactive environment featuring rich scenarios\ncentered around Blicket detection. Agents in IVRE are placed into environments\nwith various ambiguous action-effect pairs and asked to determine each object's\nrole. They are encouraged to propose effective and efficient experiments to\nvalidate their hypotheses based on observations and actively gather new\ninformation. The game ends when all uncertainties are resolved or the maximum\nnumber of trials is consumed. By evaluating modern artificial agents in IVRE,\nwe notice a clear failure of today's learning methods compared to humans. 
Such\ninefficacy in interactive reasoning ability under uncertainty calls for future\nresearch in building human-like intelligence.\n","authors":["Manjie Xu","Guangyuan Jiang","Wei Liang","Chi Zhang","Yixin Zhu"],"pdf_url":"https://arxiv.org/pdf/2206.09203v2.pdf","comment":"Accepted at NeurIPS 2023 (Datasets and Benchmarks)"},{"id":"http://arxiv.org/abs/2205.13925v4","updated":"2023-10-29T04:57:43Z","published":"2022-05-27T12:08:23Z","title":"DELTA: Diverse Client Sampling for Fasting Federated Learning","summary":" Partial client participation has been widely adopted in Federated Learning\n(FL) to reduce the communication burden efficiently. However, an inadequate\nclient sampling scheme can lead to the selection of unrepresentative subsets,\nresulting in significant variance in model updates and slowed convergence.\nExisting sampling methods are either biased or can be further optimized for\nfaster convergence. In this paper, we present DELTA, an unbiased sampling scheme\ndesigned to alleviate these issues. DELTA characterizes the effects of client\ndiversity and local variance, and samples representative clients with valuable\ninformation for global model updates. In addition, DELTA is a proven optimal\nunbiased sampling scheme that minimizes variance caused by partial client\nparticipation and outperforms other unbiased sampling schemes in terms of\nconvergence. Furthermore, to address full-client gradient dependence, we provide\na practical version of DELTA depending on the available clients' information,\nand also analyze its convergence. 
Our results are validated through experiments\non both synthetic and real-world datasets.\n","authors":["Lin Wang","YongXin Guo","Tao Lin","Xiaoying Tang"],"pdf_url":"https://arxiv.org/pdf/2205.13925v4.pdf","comment":"Accepted by Thirty-seventh Conference on Neural Information\n Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2306.11589v2","updated":"2023-10-29T04:47:19Z","published":"2023-06-20T15:07:37Z","title":"Sampling from Gaussian Process Posteriors using Stochastic Gradient\n Descent","summary":" Gaussian processes are a powerful framework for quantifying uncertainty and\nfor sequential decision-making but are limited by the requirement of solving\nlinear systems. In general, this has a cubic cost in dataset size and is\nsensitive to conditioning. We explore stochastic gradient algorithms as a\ncomputationally efficient method of approximately solving these linear systems:\nwe develop low-variance optimization objectives for sampling from the posterior\nand extend these to inducing points. Counterintuitively, stochastic gradient\ndescent often produces accurate predictions, even in cases where it does not\nconverge quickly to the optimum. We explain this through a spectral\ncharacterization of the implicit bias from non-convergence. We show that\nstochastic gradient descent produces predictive distributions close to the true\nposterior both in regions with sufficient data coverage, and in regions\nsufficiently far away from the data. Experimentally, stochastic gradient\ndescent achieves state-of-the-art performance on sufficiently large-scale or\nill-conditioned regression tasks. 
Its uncertainty estimates match the\nperformance of significantly more expensive baselines on a large-scale\nBayesian~optimization~task.\n","authors":["Jihao Andreas Lin","Javier Antorán","Shreyas Padhy","David Janz","José Miguel Hernández-Lobato","Alexander Terenin"],"pdf_url":"https://arxiv.org/pdf/2306.11589v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.10918v3","updated":"2023-10-29T04:46:50Z","published":"2023-09-19T20:30:58Z","title":"Posterior Contraction Rates for Matérn Gaussian Processes on\n Riemannian Manifolds","summary":" Gaussian processes are used in many machine learning applications that rely\non uncertainty quantification. Recently, computational tools for working with\nthese models in geometric settings, such as when inputs lie on a Riemannian\nmanifold, have been developed. This raises the question: can these intrinsic\nmodels be shown theoretically to lead to better performance, compared to simply\nembedding all relevant quantities into $\\mathbb{R}^d$ and using the restriction\nof an ordinary Euclidean Gaussian process? To study this, we prove optimal\ncontraction rates for intrinsic Mat\\'ern Gaussian processes defined on compact\nRiemannian manifolds. We also prove analogous rates for extrinsic processes\nusing trace and extension theorems between manifold and ambient Sobolev spaces:\nsomewhat surprisingly, the rates obtained turn out to coincide with those of\nthe intrinsic processes, provided that their smoothness parameters are matched\nappropriately. We illustrate these rates empirically on a number of examples,\nwhich, mirroring prior work, show that intrinsic processes can achieve better\nperformance in practice. 
Therefore, our work shows that finer-grained analyses\nare needed to distinguish between different levels of data-efficiency of\ngeometric Gaussian processes, particularly in settings which involve small data\nset sizes and non-asymptotic behavior.\n","authors":["Paul Rosa","Viacheslav Borovitskiy","Alexander Terenin","Judith Rousseau"],"pdf_url":"https://arxiv.org/pdf/2309.10918v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2301.12549v3","updated":"2023-10-29T04:43:45Z","published":"2023-01-29T21:40:04Z","title":"Unlocking Deterministic Robustness Certification on ImageNet","summary":" Despite the promise of Lipschitz-based methods for provably-robust deep\nlearning with deterministic guarantees, current state-of-the-art results are\nlimited to feed-forward Convolutional Networks (ConvNets) on low-dimensional\ndata, such as CIFAR-10. This paper investigates strategies for expanding\ncertifiably robust training to larger, deeper models. A key challenge in\ncertifying deep networks is efficient calculation of the Lipschitz bound for\nresidual blocks found in ResNet and ViT architectures. We show that fast ways\nof bounding the Lipschitz constant for conventional ResNets are loose, and show\nhow to address this by designing a new residual block, leading to the\n\\emph{Linear ResNet} (LiResNet) architecture. We then introduce \\emph{Efficient\nMargin MAximization} (EMMA), a loss function that stabilizes robust training by\nsimultaneously penalizing worst-case adversarial examples from \\emph{all}\nclasses. 
Together, these contributions yield new \\emph{state-of-the-art} robust\naccuracy on CIFAR-10/100 and Tiny-ImageNet under $\\ell_2$ perturbations.\nMoreover, for the first time, we are able to scale up fast deterministic\nrobustness guarantees to ImageNet, demonstrating that this approach to robust\nlearning can be applied to real-world applications.\n We release our code on Github: \\url{https://github.com/klasleino/gloro}.\n","authors":["Kai Hu","Andy Zou","Zifan Wang","Klas Leino","Matt Fredrikson"],"pdf_url":"https://arxiv.org/pdf/2301.12549v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2206.07550v3","updated":"2023-10-29T04:39:25Z","published":"2022-05-20T07:32:57Z","title":"Evaluating and Inducing Personality in Pre-trained Language Models","summary":" Standardized and quantified evaluation of machine behaviors is a crux of\nunderstanding LLMs. In this study, we draw inspiration from psychometric\nstudies by leveraging human personality theory as a tool for studying machine\nbehaviors. Originating as a philosophical quest for human behaviors, the study\nof personality delves into how individuals differ in thinking, feeling, and\nbehaving. Toward building and understanding human-like social machines, we are\nmotivated to ask: Can we assess machine behaviors by leveraging human\npsychometric tests in a principled and quantitative manner? If so, can we\ninduce a specific personality in LLMs? To answer these questions, we introduce\nthe Machine Personality Inventory (MPI) tool for studying machine behaviors;\nMPI follows standardized personality tests, built upon the Big Five Personality\nFactors (Big Five) theory and personality assessment inventories. By\nsystematically evaluating LLMs with MPI, we provide the first piece of evidence\ndemonstrating the efficacy of MPI in studying LLMs behaviors. 
We further devise\na Personality Prompting (P^2) method to induce LLMs with specific personalities\nin a controllable way, capable of producing diverse and verifiable behaviors.\nWe hope this work sheds light on future studies by adopting personality as the\nessential indicator for various downstream tasks, and could further motivate\nresearch into equally intriguing human-like machine behaviors.\n","authors":["Guangyuan Jiang","Manjie Xu","Song-Chun Zhu","Wenjuan Han","Chi Zhang","Yixin Zhu"],"pdf_url":"https://arxiv.org/pdf/2206.07550v3.pdf","comment":"Accepted at NeurIPS 2023 (Spotlight)"},{"id":"http://arxiv.org/abs/2307.01166v3","updated":"2023-10-29T04:34:39Z","published":"2023-07-03T17:18:50Z","title":"Strategic Distribution Shift of Interacting Agents via Coupled Gradient\n Flows","summary":" We propose a novel framework for analyzing the dynamics of distribution shift\nin real-world systems that captures the feedback loop between learning\nalgorithms and the distributions on which they are deployed. Prior work largely\nmodels feedback-induced distribution shift as adversarial or via an overly\nsimplistic distribution-shift structure. In contrast, we propose a coupled\npartial differential equation model that captures fine-grained changes in the\ndistribution over time by accounting for complex dynamics that arise due to\nstrategic responses to algorithmic decision-making, non-local endogenous\npopulation interactions, and other exogenous sources of distribution shift. We\nconsider two common settings in machine learning: cooperative settings with\ninformation asymmetries, and competitive settings where a learner faces\nstrategic users. For both of these settings, when the algorithm retrains via\ngradient descent, we prove asymptotic convergence of the retraining procedure\nto a steady-state, both in finite and in infinite dimensions, obtaining\nexplicit rates in terms of the model parameters. 
To do so we derive new results\non the convergence of coupled PDEs that extends what is known on multi-species\nsystems. Empirically, we show that our approach captures well-documented forms\nof distribution shifts like polarization and disparate impacts that simpler\nmodels cannot capture.\n","authors":["Lauren Conger","Franca Hoffmann","Eric Mazumdar","Lillian Ratliff"],"pdf_url":"https://arxiv.org/pdf/2307.01166v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18897v1","updated":"2023-10-29T04:26:23Z","published":"2023-10-29T04:26:23Z","title":"Learning Subgrid-Scale Models in Discontinuous Galerkin Methods with\n Neural Ordinary Differential Equations for Compressible Navier--Stokes\n Equations","summary":" The growing computing power over the years has enabled simulations to become\nmore complex and accurate. However, high-fidelity simulations, while immensely\nvaluable for scientific discovery and problem solving, come with significant\ncomputational demands. As a result, it is common to run a low-fidelity model\nwith a subgrid-scale model to reduce the computational cost, but selecting the\nappropriate subgrid-scale models and tuning them are challenging. We propose a\nnovel method for learning the subgrid-scale model effects when simulating\npartial differential equations using neural ordinary differential equations in\nthe context of discontinuous Galerkin (DG) spatial discretization. Our approach\nlearns the missing scales of the low-order DG solver at a continuous level and\nhence improves the accuracy of the low-order DG approximations as well as\naccelerates the filtered high-order DG simulations with a certain degree of\nprecision. We demonstrate the performance of our approach through\nmultidimensional Taylor--Green vortex examples at different Reynolds numbers\nand times, which cover laminar, transitional, and turbulent regimes. 
The\nproposed method not only reconstructs the subgrid-scale from the low-order\n(1st-order) approximation but also speeds up the filtered high-order DG\n(6th-order) simulation by two orders of magnitude.\n","authors":["Shinhoo Kang","Emil M. Constantinescu"],"pdf_url":"https://arxiv.org/pdf/2310.18897v1.pdf","comment":"15 figures, 2 tables, 22 pages"},{"id":"http://arxiv.org/abs/2310.16999v2","updated":"2023-10-29T04:09:48Z","published":"2023-10-25T20:55:07Z","title":"Trust, but Verify: Robust Image Segmentation using Deep Learning","summary":" We describe a method for verifying the output of a deep neural network for\nmedical image segmentation that is robust to several classes of random as well\nas worst-case perturbations i.e. adversarial attacks. This method is based on a\ngeneral approach recently developed by the authors called \"Trust, but Verify\"\nwherein an auxiliary verification network produces predictions about certain\nmasked features in the input image using the segmentation as an input. A\nwell-designed auxiliary network will produce high-quality predictions when the\ninput segmentations are accurate, but will produce low-quality predictions when\nthe segmentations are incorrect. Checking the predictions of such a network\nwith the original image allows us to detect bad segmentations. However, to\nensure the verification method is truly robust, we need a method for checking\nthe quality of the predictions that does not itself rely on a black-box neural\nnetwork. Indeed, we show that previous methods for segmentation evaluation that\ndo use deep neural regression networks are vulnerable to false negatives i.e.\ncan inaccurately label bad segmentations as good. 
We describe the design of a\nverification network that avoids such vulnerability and present results to\ndemonstrate its robustness compared to previous methods.\n","authors":["Fahim Ahmed Zaman","Xiaodong Wu","Weiyu Xu","Milan Sonka","Raghuraman Mudumbai"],"pdf_url":"https://arxiv.org/pdf/2310.16999v2.pdf","comment":"5 Pages, 8 Figures, conference"},{"id":"http://arxiv.org/abs/2310.18893v1","updated":"2023-10-29T04:00:33Z","published":"2023-10-29T04:00:33Z","title":"Ever Evolving Evaluator (EV3): Towards Flexible and Reliable\n Meta-Optimization for Knowledge Distillation","summary":" We introduce EV3, a novel meta-optimization framework designed to efficiently\ntrain scalable machine learning models through an intuitive\nexplore-assess-adapt protocol. In each iteration of EV3, we explore various\nmodel parameter updates, assess them using pertinent evaluation methods, and\nadapt the model based on the optimal updates and previous progress history. EV3\noffers substantial flexibility without imposing stringent constraints like\ndifferentiability on the key objectives relevant to the tasks of interest.\nMoreover, this protocol welcomes updates with biased gradients and allows for\nthe use of a diversity of losses and optimizers. Additionally, in scenarios\nwith multiple objectives, it can be used to dynamically prioritize tasks. With\ninspiration drawn from evolutionary algorithms, meta-learning, and neural\narchitecture search, we investigate an application of EV3 to knowledge\ndistillation. 
Our experimental results illustrate EV3's capability to safely\nexplore model spaces, while hinting at its potential applicability across\nnumerous domains due to its inherent flexibility and adaptability.\n","authors":["Li Ding","Masrour Zoghi","Guy Tennenholtz","Maryam Karimzadehgan"],"pdf_url":"https://arxiv.org/pdf/2310.18893v1.pdf","comment":"NeurIPS 2023 Workshop on Adaptive Experimental Design and Active\n Learning in the Real World (RealML-2023)"},{"id":"http://arxiv.org/abs/2310.18888v1","updated":"2023-10-29T03:29:59Z","published":"2023-10-29T03:29:59Z","title":"D2NO: Efficient Handling of Heterogeneous Input Function Spaces with\n Distributed Deep Neural Operators","summary":" Neural operators have been applied in various scientific fields, such as\nsolving parametric partial differential equations, dynamical systems with\ncontrol, and inverse problems. However, challenges arise when dealing with\ninput functions that exhibit heterogeneous properties, requiring multiple\nsensors to handle functions with minimal regularity. To address this issue,\ndiscretization-invariant neural operators have been used, allowing the sampling\nof diverse input functions with different sensor locations. However, existing\nframeworks still require an equal number of sensors for all functions. In our\nstudy, we propose a novel distributed approach to further relax the\ndiscretization requirements and solve the heterogeneous dataset challenges. Our\nmethod involves partitioning the input function space and processing individual\ninput functions using independent and separate neural networks. A centralized\nneural network is used to handle shared information across all output\nfunctions. This distributed methodology reduces the number of gradient descent\nback-propagation steps, improving efficiency while maintaining accuracy. 
We\ndemonstrate that the corresponding neural network is a universal approximator\nof continuous nonlinear operators and present four numerical examples to\nvalidate its performance.\n","authors":["Zecheng Zhang","Christian Moya","Lu Lu","Guang Lin","Hayden Schaeffer"],"pdf_url":"https://arxiv.org/pdf/2310.18888v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18885v1","updated":"2023-10-29T03:20:10Z","published":"2023-10-29T03:20:10Z","title":"A foundational neural operator that continuously learns without\n forgetting","summary":" Machine learning has witnessed substantial growth, leading to the development\nof advanced artificial intelligence models crafted to address a wide range of\nreal-world challenges spanning various domains, such as computer vision,\nnatural language processing, and scientific computing. Nevertheless, the\ncreation of custom models for each new task remains a resource-intensive\nundertaking, demanding considerable computational time and memory resources. In\nthis study, we introduce the concept of the Neural Combinatorial Wavelet Neural\nOperator (NCWNO) as a foundational model for scientific computing. This model\nis specifically designed to excel in learning from a diverse spectrum of\nphysics and continuously adapt to the solution operators associated with\nparametric partial differential equations (PDEs). The NCWNO leverages a gated\nstructure that employs local wavelet experts to acquire shared features across\nmultiple physical systems, complemented by a memory-based ensembling approach\namong these local wavelet experts. This combination enables rapid adaptation to\nnew challenges. The proposed foundational model offers two key advantages: (i)\nit can simultaneously learn solution operators for multiple parametric PDEs,\nand (ii) it can swiftly generalize to new parametric PDEs with minimal\nfine-tuning. 
The proposed NCWNO is the first foundational operator learning\nalgorithm distinguished by its (i) robustness against catastrophic forgetting,\n(ii) the maintenance of positive transfer for new parametric PDEs, and (iii)\nthe facilitation of knowledge transfer across dissimilar tasks. Through an\nextensive set of benchmark examples, we demonstrate that the NCWNO can\noutperform task-specific baseline operator learning frameworks with minimal\nhyperparameter tuning at the prediction stage. We also show that with minimal\nfine-tuning, the NCWNO performs accurate combinatorial learning of new\nparametric PDEs.\n","authors":["Tapas Tripura","Souvik Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2310.18885v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18884v1","updated":"2023-10-29T03:14:20Z","published":"2023-10-29T03:14:20Z","title":"Simple and Asymmetric Graph Contrastive Learning without Augmentations","summary":" Graph Contrastive Learning (GCL) has shown superior performance in\nrepresentation learning in graph-structured data. Despite their success, most\nexisting GCL methods rely on prefabricated graph augmentation and homophily\nassumptions. Thus, they fail to generalize well to heterophilic graphs where\nconnected nodes may have different class labels and dissimilar features. In\nthis paper, we study the problem of conducting contrastive learning on\nhomophilic and heterophilic graphs. We find that we can achieve promising\nperformance simply by considering an asymmetric view of the neighboring nodes.\nThe resulting simple algorithm, Asymmetric Contrastive Learning for Graphs\n(GraphACL), is easy to implement and does not rely on graph augmentations and\nhomophily assumptions. We provide theoretical and empirical evidence that\nGraphACL can capture one-hop local neighborhood information and two-hop\nmonophily similarity, which are both important for modeling heterophilic\ngraphs. 
Experimental results show that the simple GraphACL significantly\noutperforms state-of-the-art graph contrastive learning and self-supervised\nlearning methods on homophilic and heterophilic graphs. The code of GraphACL is\navailable at https://github.com/tengxiao1/GraphACL.\n","authors":["Teng Xiao","Huaisheng Zhu","Zhengyu Chen","Suhang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18884v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18882v1","updated":"2023-10-29T03:07:30Z","published":"2023-10-29T03:07:30Z","title":"Differentiable Learning of Generalized Structured Matrices for Efficient\n Deep Neural Networks","summary":" This paper investigates efficient deep neural networks (DNNs) to replace\ndense unstructured weight matrices with structured ones that possess desired\nproperties. The challenge arises because the optimal weight matrix structure in\npopular neural network models is obscure in most cases and may vary from layer\nto layer even in the same network. Prior structured matrices proposed for\nefficient DNNs were mostly hand-crafted without a generalized framework to\nsystematically learn them. To address this issue, we propose a generalized and\ndifferentiable framework to learn efficient structures of weight matrices by\ngradient descent. We first define a new class of structured matrices that\ncovers a wide range of structured matrices in the literature by adjusting the\nstructural parameters. Then, the frequency-domain differentiable\nparameterization scheme based on the Gaussian-Dirichlet kernel is adopted to\nlearn the structural parameters by proximal gradient descent. Finally, we\nintroduce an effective initialization method for the proposed scheme. 
Our\nmethod learns efficient DNNs with structured matrices, achieving lower\ncomplexity and/or higher performance than prior approaches that employ\nlow-rank, block-sparse, or block-low-rank matrices.\n","authors":["Changwoo Lee","Hun-Seok Kim"],"pdf_url":"https://arxiv.org/pdf/2310.18882v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.04949v2","updated":"2023-10-29T03:03:32Z","published":"2023-06-08T05:44:06Z","title":"Robust Learning with Progressive Data Expansion Against Spurious\n Correlation","summary":" While deep learning models have shown remarkable performance in various\ntasks, they are susceptible to learning non-generalizable spurious features\nrather than the core features that are genuinely correlated to the true label.\nIn this paper, beyond existing analyses of linear models, we theoretically\nexamine the learning process of a two-layer nonlinear convolutional neural\nnetwork in the presence of spurious features. Our analysis suggests that\nimbalanced data groups and easily learnable spurious features can lead to the\ndominance of spurious features during the learning process. In light of this,\nwe propose a new training algorithm called PDE that efficiently enhances the\nmodel's robustness for a better worst-group performance. PDE begins with a\ngroup-balanced subset of training data and progressively expands it to\nfacilitate the learning of the core features. Experiments on synthetic and\nreal-world benchmark datasets confirm the superior performance of our method on\nmodels such as ResNets and Transformers. On average, our method achieves a 2.8%\nimprovement in worst-group accuracy compared with the state-of-the-art method,\nwhile enjoying up to 10x faster training efficiency. Codes are available at\nhttps://github.com/uclaml/PDE.\n","authors":["Yihe Deng","Yu Yang","Baharan Mirzasoleiman","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2306.04949v2.pdf","comment":"22 pages, 7 figures, 11 tables. 
In NeurIPS 2023"}],"Multimedia":[{"id":"http://arxiv.org/abs/2310.19180v1","updated":"2023-10-29T22:51:49Z","published":"2023-10-29T22:51:49Z","title":"JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music\n Generation","summary":" With rapid advances in generative artificial intelligence, the text-to-music\nsynthesis task has emerged as a promising direction for music generation from\nscratch. However, finer-grained control over multi-track generation remains an\nopen challenge. Existing models exhibit strong raw generation capability but\nlack the flexibility to compose separate tracks and combine them in a\ncontrollable manner, differing from typical workflows of human composers. To\naddress this issue, we propose JEN-1 Composer, a unified framework to\nefficiently model marginal, conditional, and joint distributions over\nmulti-track music via a single model. JEN-1 Composer framework exhibits the\ncapacity to seamlessly incorporate any diffusion-based music generation system,\n\\textit{e.g.} Jen-1, enhancing its capacity for versatile multi-track music\ngeneration. We introduce a curriculum training strategy aimed at incrementally\ninstructing the model in the transition from single-track generation to the\nflexible generation of multi-track combinations. During the inference, users\nhave the ability to iteratively produce and choose music tracks that meet their\npreferences, subsequently creating an entire musical composition incrementally\nfollowing the proposed Human-AI co-composition workflow. Quantitative and\nqualitative assessments demonstrate state-of-the-art performance in\ncontrollable and high-fidelity multi-track music synthesis. The proposed JEN-1\nComposer represents a significant advance toward interactive AI-facilitated\nmusic creation and composition. 
Demos will be available at\nhttps://jenmusic.ai/audio-demos.\n","authors":["Yao Yao","Peike Li","Boyu Chen","Alex Wang"],"pdf_url":"https://arxiv.org/pdf/2310.19180v1.pdf","comment":"Preprints"},{"id":"http://arxiv.org/abs/2308.04603v3","updated":"2023-10-29T14:52:32Z","published":"2023-08-08T22:06:14Z","title":"A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking","summary":" This paper presents a comprehensive survey on deep learning-based image\nwatermarking, a technique that entails the invisible embedding and extraction\nof watermarks within a cover image, aiming to offer a seamless blend of\nrobustness and adaptability. We navigate the complex landscape of this\ninterdisciplinary domain, linking historical foundations, current innovations,\nand prospective developments. Unlike existing literature, our study\nconcentrates exclusively on image watermarking with deep learning, delivering\nan in-depth, yet brief analysis enriched by three fundamental contributions.\nFirst, we introduce a refined categorization, segmenting the field into\nEmbedder-Extractor, Deep Networks as a Feature Transformation, and Hybrid\nMethods. This taxonomy, inspired by the varied roles of deep learning across\nstudies, is designed to infuse clarity, offering readers technical insights and\ndirectional guidance. Second, our exploration dives into representative\nmethodologies, encapsulating the diverse research directions and inherent\nchallenges within each category to provide a consolidated perspective. 
Lastly,\nwe venture beyond established boundaries to outline emerging frontiers,\noffering a detailed insight into prospective research avenues.\n","authors":["Xin Zhong","Arjon Das","Fahad Alrasheedi","Abdullah Tanvir"],"pdf_url":"https://arxiv.org/pdf/2308.04603v3.pdf","comment":"This paper was accepted for publication by the MDPI Applied Sciences\n journal"},{"id":"http://arxiv.org/abs/2305.06908v4","updated":"2023-10-29T14:12:08Z","published":"2023-05-11T15:51:46Z","title":"CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency\n Model","summary":" Denoising diffusion probabilistic models (DDPMs) have shown promising\nperformance for speech synthesis. However, a large number of iterative steps\nare required to achieve high sample quality, which restricts the inference\nspeed. Maintaining sample quality while increasing sampling speed has become a\nchallenging task. In this paper, we propose a \"Co\"nsistency \"Mo\"del-based\n\"Speech\" synthesis method, CoMoSpeech, which achieve speech synthesis through a\nsingle diffusion sampling step while achieving high audio quality. The\nconsistency constraint is applied to distill a consistency model from a\nwell-designed diffusion-based teacher model, which ultimately yields superior\nperformances in the distilled CoMoSpeech. Our experiments show that by\ngenerating audio recordings by a single sampling step, the CoMoSpeech achieves\nan inference speed more than 150 times faster than real-time on a single NVIDIA\nA100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based\nspeech synthesis truly practical. Meanwhile, objective and subjective\nevaluations on text-to-speech and singing voice synthesis show that the\nproposed teacher models yield the best audio quality, and the one-step sampling\nbased CoMoSpeech achieves the best inference speed with better or comparable\naudio quality to other conventional multi-step diffusion model baselines. 
Audio\nsamples are available at https://comospeech.github.io/.\n","authors":["Zhen Ye","Wei Xue","Xu Tan","Jie Chen","Qifeng Liu","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2305.06908v4.pdf","comment":"Accepted to ACM MM 2023"},{"id":"http://arxiv.org/abs/2304.07056v3","updated":"2023-10-29T14:06:56Z","published":"2023-04-14T11:26:09Z","title":"Perceptual Quality Assessment of Face Video Compression: A Benchmark and\n An Effective Method","summary":" Recent years have witnessed an exponential increase in the demand for face\nvideo compression, and the success of artificial intelligence has expanded the\nboundaries beyond traditional hybrid video coding. Generative coding approaches\nhave been identified as promising alternatives with reasonable perceptual\nrate-distortion trade-offs, leveraging the statistical priors of face videos.\nHowever, the great diversity of distortion types in spatial and temporal\ndomains, ranging from the traditional hybrid coding frameworks to generative\nmodels, present grand challenges in compressed face video quality assessment\n(VQA). In this paper, we introduce the large-scale Compressed Face Video\nQuality Assessment (CFVQA) database, which is the first attempt to\nsystematically understand the perceptual quality and diversified compression\ndistortions in face videos. The database contains 3,240 compressed face video\nclips in multiple compression levels, which are derived from 135 source videos\nwith diversified content using six representative video codecs, including two\ntraditional methods based on hybrid coding frameworks, two end-to-end methods,\nand two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index\nfor face video compression was developed to measure the perceptual quality,\nconsidering the distinct content characteristics and temporal priors of the\nface videos. Experimental results exhibit its superior performance on the\nproposed CFVQA dataset. 
The benchmark is now made publicly available at:\nhttps://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment.\n","authors":["Yixuan Li","Bolin Chen","Baoliang Chen","Meng Wang","Shiqi Wang","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2304.07056v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.16934v2","updated":"2023-10-29T12:32:19Z","published":"2023-05-26T13:49:44Z","title":"On Evaluating Adversarial Robustness of Large Vision-Language Models","summary":" Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented\nperformance in response generation, especially with visual inputs, enabling\nmore creative and adaptable interaction than large language models such as\nChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since\nadversaries may successfully evade the entire system by subtly manipulating the\nmost vulnerable modality (e.g., vision). To this end, we propose evaluating the\nrobustness of open-source large VLMs in the most realistic and high-risk\nsetting, where adversaries have only black-box system access and seek to\ndeceive the model into returning the targeted responses. In particular, we\nfirst craft targeted adversarial examples against pretrained models such as\nCLIP and BLIP, and then transfer these adversarial examples to other VLMs such\nas MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we\nobserve that black-box queries on these VLMs can further improve the\neffectiveness of targeted evasion, resulting in a surprisingly high success\nrate for generating targeted responses. Our findings provide a quantitative\nunderstanding regarding the adversarial vulnerability of large VLMs and call\nfor a more thorough examination of their potential security flaws before\ndeployment in practice. 
Code is at https://github.com/yunqing-me/AttackVLM.\n","authors":["Yunqing Zhao","Tianyu Pang","Chao Du","Xiao Yang","Chongxuan Li","Ngai-Man Cheung","Min Lin"],"pdf_url":"https://arxiv.org/pdf/2305.16934v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18946v1","updated":"2023-10-29T09:09:32Z","published":"2023-10-29T09:09:32Z","title":"Video Frame Interpolation with Many-to-many Splatting and Spatial\n Selective Refinement","summary":" In this work, we first propose a fully differentiable Many-to-Many (M2M)\nsplatting framework to interpolate frames efficiently. Given a frame pair, we\nestimate multiple bidirectional flows to directly forward warp the pixels to\nthe desired time step before fusing overlapping pixels. In doing so, each\nsource pixel renders multiple target pixels and each target pixel can be\nsynthesized from a larger area of visual context, establishing a many-to-many\nsplatting scheme with robustness to undesirable artifacts. For each input frame\npair, M2M has a minuscule computational overhead when interpolating an\narbitrary number of in-between frames, hence achieving fast multi-frame\ninterpolation. However, directly warping and fusing pixels in the intensity\ndomain is sensitive to the quality of motion estimation and may suffer from\nless effective representation capacity. To improve interpolation accuracy, we\nfurther extend an M2M++ framework by introducing a flexible Spatial Selective\nRefinement (SSR) component, which allows for trading computational efficiency\nfor interpolation quality and vice versa. 
Instead of refining the entire\ninterpolated frame, SSR only processes difficult regions selected under the\nguidance of an estimated error map, thereby avoiding redundant computation.\nEvaluation on multiple benchmark datasets shows that our method is able to\nimprove the efficiency while maintaining competitive video interpolation\nquality, and it can be adjusted to use more or less compute as needed.\n","authors":["Ping Hu","Simon Niklaus","Lu Zhang","Stan Sclaroff","Kate Saenko"],"pdf_url":"https://arxiv.org/pdf/2310.18946v1.pdf","comment":"T-PAMI. arXiv admin note: substantial text overlap with\n arXiv:2204.03513"}]},"2023-10-28T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2310.18830v1","updated":"2023-10-28T22:11:25Z","published":"2023-10-28T22:11:25Z","title":"Translating away Translationese without Parallel Data","summary":" Translated texts exhibit systematic linguistic differences compared to\noriginal texts in the same language, and these differences are referred to as\ntranslationese. Translationese has effects on various cross-lingual natural\nlanguage processing tasks, potentially leading to biased results. In this\npaper, we explore a novel approach to reduce translationese in translated\ntexts: translation-based style transfer. As there are no parallel\nhuman-translated and original data in the same language, we use a\nself-supervised approach that can learn from comparable (rather than parallel)\nmono-lingual original and translated data. However, even this self-supervised\napproach requires some parallel data for validation. We show how we can\neliminate the need for parallel validation data by combining the\nself-supervised loss with an unsupervised loss. This unsupervised loss\nleverages the original language model loss over the style-transferred output\nand a semantic similarity loss between the input and style-transferred output.\nWe evaluate our approach in terms of original vs. 
translationese binary\nclassification in addition to measuring content preservation and target-style\nfluency. The results show that our approach is able to reduce translationese\nclassifier accuracy to a level of a random classifier after style transfer\nwhile adequately preserving the content and fluency in the target original\nstyle.\n","authors":["Rricha Jalota","Koel Dutta Chowdhury","Cristina España-Bonet","Josef van Genabith"],"pdf_url":"https://arxiv.org/pdf/2310.18830v1.pdf","comment":"Accepted at EMNLP 2023, Main Conference"},{"id":"http://arxiv.org/abs/2310.18827v1","updated":"2023-10-28T21:53:23Z","published":"2023-10-28T21:53:23Z","title":"All Things Considered: Detecting Partisan Events from News Media with\n Cross-Article Comparison","summary":" Public opinion is shaped by the information news media provide, and that\ninformation in turn may be shaped by the ideological preferences of media\noutlets. But while much attention has been devoted to media bias via overt\nideological language or topic selection, a more unobtrusive way in which the\nmedia shape opinion is via the strategic inclusion or omission of partisan\nevents that may support one side or the other. We develop a latent\nvariable-based framework to predict the ideology of news articles by comparing\nmultiple articles on the same story and identifying partisan events whose\ninclusion or omission reveals ideology. Our experiments first validate the\nexistence of partisan event selection, and then show that article alignment and\ncross-document comparison detect partisan events and article ideology better\nthan competitive baselines. Our results reveal the high-level form of media\nbias, which is present even among mainstream media with strong norms of\nobjectivity and nonpartisanship. 
Our codebase and dataset are available at\nhttps://github.com/launchnlp/ATC.\n","authors":["Yujian Liu","Xinliang Frederick Zhang","Kaijian Zou","Ruihong Huang","Nick Beauchamp","Lu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18827v1.pdf","comment":"EMNLP'23 Main Conference"},{"id":"http://arxiv.org/abs/2310.18804v1","updated":"2023-10-28T20:09:29Z","published":"2023-10-28T20:09:29Z","title":"Open Visual Knowledge Extraction via Relation-Oriented Multimodality\n Model Prompting","summary":" Images contain rich relational knowledge that can help machines understand\nthe world. Existing methods on visual knowledge extraction often rely on the\npre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation\ntypes), restricting the expressiveness of the extracted knowledge. In this\nwork, we take a first exploration to a new paradigm of open visual knowledge\nextraction. To achieve this, we present OpenVik which consists of an open\nrelational region detector to detect regions potentially containing relational\nknowledge and a visual knowledge generator that generates format-free knowledge\nby prompting the large multimodality model with the detected region of\ninterest. We also explore two data enhancement techniques for diversifying the\ngenerated format-free visual knowledge. Extensive knowledge quality evaluations\nhighlight the correctness and uniqueness of the extracted open visual knowledge\nby OpenVik. 
Moreover, integrating our extracted knowledge across various visual\nreasoning applications shows consistent improvements, indicating the real-world\napplicability of OpenVik.\n","authors":["Hejie Cui","Xinyu Fang","Zihan Zhang","Ran Xu","Xuan Kan","Xin Liu","Yue Yu","Manling Li","Yangqiu Song","Carl Yang"],"pdf_url":"https://arxiv.org/pdf/2310.18804v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.04678v3","updated":"2023-10-28T19:47:47Z","published":"2023-10-07T03:25:06Z","title":"DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based\n Queries","summary":" In scientific research, the ability to effectively retrieve relevant\ndocuments based on complex, multifaceted queries is critical. Existing\nevaluation datasets for this task are limited, primarily due to the high cost\nand effort required to annotate resources that effectively represent complex\nqueries. To address this, we propose a novel task, Scientific DOcument\nRetrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed\nto handle the complex nature of user queries in scientific research. We\ndeveloped a benchmark dataset within the field of computer science, consisting\nof 100 human-authored complex query cases. For each complex query, we assembled\na collection of 100 relevant documents and produced annotated relevance scores\nfor ranking them. Recognizing the significant labor of expert annotation, we\nalso introduce Anno-GPT, a scalable framework for validating the performance of\nLarge Language Models (LLMs) on expert-level dataset annotation tasks. LLM\nannotation of the DORIS-MAE dataset resulted in a 500x reduction in cost,\nwithout compromising quality. Furthermore, due to the multi-tiered structure of\nthese complex queries, the DORIS-MAE dataset can be extended to over 4,000\nsub-query test cases without requiring additional annotation. 
We evaluated 17\nrecent retrieval methods on DORIS-MAE, observing notable performance drops\ncompared to traditional datasets. This highlights the need for better\napproaches to handle complex, multifaceted queries in scientific research. Our\ndataset and codebase are available at\nhttps://github.com/Real-Doris-Mae/Doris-Mae-Dataset.\n","authors":["Jianyou Wang","Kaicheng Wang","Xiaoyue Wang","Prudhviraj Naidu","Leon Bergen","Ramamohan Paturi"],"pdf_url":"https://arxiv.org/pdf/2310.04678v3.pdf","comment":"To appear in NeurIPS 2023 Datasets and Benchmarks Track"},{"id":"http://arxiv.org/abs/2310.18794v1","updated":"2023-10-28T19:42:28Z","published":"2023-10-28T19:42:28Z","title":"Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded\n Dialogue Generation","summary":" Model hallucination has been a crucial interest of research in Natural\nLanguage Generation (NLG). In this work, we propose sequence-level certainty as\na common theme over hallucination in NLG, and explore the correlation between\nsequence-level certainty and the level of hallucination in model responses. We\ncategorize sequence-level certainty into two aspects: probabilistic certainty\nand semantic certainty, and reveal through experiments on Knowledge-Grounded\nDialogue Generation (KGDG) task that both a higher level of probabilistic\ncertainty and a higher level of semantic certainty in model responses are\nsignificantly correlated with a lower level of hallucination. What's more, we\nprovide theoretical proof and analysis to show that semantic certainty is a\ngood estimator of probabilistic certainty, and therefore has the potential as\nan alternative to probability-based certainty estimation in black-box\nscenarios. Based on the observation on the relationship between certainty and\nhallucination, we further propose Certainty-based Response Ranking (CRR), a\ndecoding-time method for mitigating hallucination in NLG. 
Based on our\ncategorization of sequence-level certainty, we propose 2 types of CRR approach:\nProbabilistic CRR (P-CRR) and Semantic CRR (S-CRR). P-CRR ranks individually\nsampled model responses using their arithmetic mean log-probability of the\nentire sequence. S-CRR approaches certainty estimation from meaning-space, and\nranks a number of model response candidates based on their semantic certainty\nlevel, which is estimated by the entailment-based Agreement Score (AS). Through\nextensive experiments across 3 KGDG datasets, 3 decoding methods, and on 4\ndifferent models, we validate the effectiveness of our 2 proposed CRR methods\nto reduce model hallucination.\n","authors":["Yixin Wan","Fanyou Wu","Weijie Xu","Srinivasan H. Sengamedu"],"pdf_url":"https://arxiv.org/pdf/2310.18794v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.09219v2","updated":"2023-10-28T19:15:13Z","published":"2023-10-13T16:12:57Z","title":"\"Kelly is a Warm Person, Joseph is a Role Model\": Gender Biases in\n LLM-Generated Reference Letters","summary":" Large Language Models (LLMs) have recently emerged as an effective tool to\nassist individuals in writing various types of content, including professional\ndocuments such as recommendation letters. Though bringing convenience, this\napplication also introduces unprecedented fairness concerns. Model-generated\nreference letters might be directly used by users in professional scenarios. If\nunderlying biases exist in these model-constructed letters, using them without\nscrutinization could lead to direct societal harms, such as sabotaging\napplication success rates for female applicants. In light of this pressing\nissue, it is imminent and necessary to comprehensively study fairness issues\nand associated harms in this real-world use case. In this paper, we critically\nexamine gender biases in LLM-generated reference letters. 
Drawing inspiration\nfrom social science findings, we design evaluation methods to manifest biases\nthrough 2 dimensions: (1) biases in language style and (2) biases in lexical\ncontent. We further investigate the extent of bias propagation by analyzing the\nhallucination bias of models, a term that we define to be bias exacerbation in\nmodel-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs-\nChatGPT and Alpaca, we reveal significant gender biases in LLM-generated\nrecommendation letters. Our findings not only warn against using LLMs for this\napplication without scrutinization, but also illuminate the importance of\nthoroughly studying hidden biases and harms in LLM-generated professional\ndocuments.\n","authors":["Yixin Wan","George Pu","Jiao Sun","Aparna Garimella","Kai-Wei Chang","Nanyun Peng"],"pdf_url":"https://arxiv.org/pdf/2310.09219v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18783v1","updated":"2023-10-28T18:47:57Z","published":"2023-10-28T18:47:57Z","title":"Are NLP Models Good at Tracing Thoughts: An Overview of Narrative\n Understanding","summary":" Narrative understanding involves capturing the author's cognitive processes,\nproviding insights into their knowledge, intentions, beliefs, and desires.\nAlthough large language models (LLMs) excel in generating grammatically\ncoherent text, their ability to comprehend the author's thoughts remains\nuncertain. This limitation hinders the practical applications of narrative\nunderstanding. In this paper, we conduct a comprehensive survey of narrative\nunderstanding tasks, thoroughly examining their key features, definitions,\ntaxonomy, associated datasets, training objectives, evaluation metrics, and\nlimitations. Furthermore, we explore the potential of expanding the\ncapabilities of modularized LLMs to address novel narrative understanding\ntasks. 
By framing narrative understanding as the retrieval of the author's\nimaginative cues that outline the narrative structure, our study introduces a\nfresh perspective on enhancing narrative comprehension.\n","authors":["Lixing Zhu","Runcong Zhao","Lin Gui","Yulan He"],"pdf_url":"https://arxiv.org/pdf/2310.18783v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.03510v2","updated":"2023-10-28T18:38:47Z","published":"2023-05-02T14:09:02Z","title":"Parameter-Efficient Cross-lingual Transfer of Vision and Language Models\n via Translation-based Alignment","summary":" Pre-trained vision and language models such as CLIP have witnessed remarkable\nsuccess in connecting images and texts with a primary focus on English texts.\nDespite recent efforts to extend CLIP to support other languages, disparities\nin performance among different languages have been observed due to uneven\nresource availability. Additionally, current cross-lingual transfer methods of\nthose pre-trained models would consume excessive resources for a large number\nof languages. Therefore, we propose a new parameter-efficient cross-lingual\ntransfer learning framework that utilizes a translation-based alignment method\nto mitigate multilingual disparities and explores parameter-efficient\nfine-tuning methods for parameter-efficient cross-lingual transfer. Extensive\nexperiments on XTD and Multi30K datasets, covering 11 languages under\nzero-shot, few-shot, and full-dataset learning scenarios, show that our\nframework significantly reduces the multilingual disparities among languages\nand improves cross-lingual transfer results, especially in low-resource\nscenarios, while only keeping and fine-tuning an extremely small number of\nparameters compared to the full model (e.g., Our framework only requires 0.16\\%\nadditional parameters of a full-model for each language in the few-shot\nlearning scenario).
The codes are available at\n\\url{https://github.com/eric-ai-lab/PECTVLM}.\n","authors":["Zhen Zhang","Jialu Wang","Xin Eric Wang"],"pdf_url":"https://arxiv.org/pdf/2305.03510v2.pdf","comment":"Findings of EMNLP"},{"id":"http://arxiv.org/abs/2310.18778v1","updated":"2023-10-28T18:33:24Z","published":"2023-10-28T18:33:24Z","title":"ProMap: Effective Bilingual Lexicon Induction via Language Model\n Prompting","summary":" Bilingual Lexicon Induction (BLI), where words are translated between two\nlanguages, is an important NLP task. While noticeable progress on BLI in rich\nresource languages using static word embeddings has been achieved. The word\ntranslation performance can be further improved by incorporating information\nfrom contextualized word embeddings. In this paper, we introduce ProMap, a\nnovel approach for BLI that leverages the power of prompting pretrained\nmultilingual and multidialectal language models to address these challenges. To\novercome the employment of subword tokens in these models, ProMap relies on an\neffective padded prompting of language models with a seed dictionary that\nachieves good performance when used independently. We also demonstrate the\neffectiveness of ProMap in re-ranking results from other BLI methods such as\nwith aligned static word embeddings. When evaluated on both rich-resource and\nlow-resource languages, ProMap consistently achieves state-of-the-art results.\nFurthermore, ProMap enables strong performance in few-shot scenarios (even with\nless than 10 training examples), making it a valuable tool for low-resource\nlanguage translation. Overall, we believe our method offers both exciting and\npromising direction for BLI in general and low-resource languages in\nparticular. 
ProMap code and data are available at\n\\url{https://github.com/4mekki4/promap}.\n","authors":["Abdellah El Mekki","Muhammad Abdul-Mageed","ElMoatez Billah Nagoudi","Ismail Berrada","Ahmed Khoumsi"],"pdf_url":"https://arxiv.org/pdf/2310.18778v1.pdf","comment":"To appear in IJCNLP-AACL 2023"},{"id":"http://arxiv.org/abs/2306.17842v3","updated":"2023-10-28T18:09:46Z","published":"2023-06-30T17:59:07Z","title":"SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen\n LLMs","summary":" In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling\nfrozen LLMs to perform both understanding and generation tasks involving\nnon-linguistic modalities such as images or videos. SPAE converts between raw\npixels and interpretable lexical tokens (or words) extracted from the LLM's\nvocabulary. The resulting tokens capture both the semantic meaning and the\nfine-grained details needed for visual reconstruction, effectively translating\nthe visual content into a language comprehensible to the LLM, and empowering it\nto perform a wide array of multimodal tasks. Our approach is validated through\nin-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set\nof image understanding and generation tasks. Our method marks the first\nsuccessful attempt to enable a frozen LLM to generate image content while\nsurpassing state-of-the-art performance in image understanding tasks, under the\nsame setting, by over 25%.\n","authors":["Lijun Yu","Yong Cheng","Zhiruo Wang","Vivek Kumar","Wolfgang Macherey","Yanping Huang","David A. Ross","Irfan Essa","Yonatan Bisk","Ming-Hsuan Yang","Kevin Murphy","Alexander G. 
Hauptmann","Lu Jiang"],"pdf_url":"https://arxiv.org/pdf/2306.17842v3.pdf","comment":"NeurIPS 2023 spotlight"},{"id":"http://arxiv.org/abs/2306.17194v2","updated":"2023-10-28T18:04:36Z","published":"2023-06-28T17:54:04Z","title":"On the Exploitability of Instruction Tuning","summary":" Instruction tuning is an effective technique to align large language models\n(LLMs) with human intents. In this work, we investigate how an adversary can\nexploit instruction tuning by injecting specific instruction-following examples\ninto the training data that intentionally changes the model's behavior. For\nexample, an adversary can achieve content injection by injecting training\nexamples that mention target content and eliciting such behavior from\ndownstream models. To achieve this goal, we propose \\textit{AutoPoison}, an\nautomated data poisoning pipeline. It naturally and coherently incorporates\nversatile attack goals into poisoned data with the help of an oracle LLM. We\nshowcase two example attacks: content injection and over-refusal attacks, each\naiming to induce a specific exploitable behavior. We quantify and benchmark the\nstrength and the stealthiness of our data poisoning scheme. Our results show\nthat AutoPoison allows an adversary to change a model's behavior by poisoning\nonly a small fraction of data while maintaining a high level of stealthiness in\nthe poisoned examples. We hope our work sheds light on how data quality affects\nthe behavior of instruction-tuned models and raises awareness of the importance\nof data quality for responsible deployments of LLMs. 
Code is available at\n\\url{https://github.com/azshue/AutoPoison}.\n","authors":["Manli Shu","Jiongxiao Wang","Chen Zhu","Jonas Geiping","Chaowei Xiao","Tom Goldstein"],"pdf_url":"https://arxiv.org/pdf/2306.17194v2.pdf","comment":"NeurIPS 2023 camera-ready (21 pages, 10 figures)"},{"id":"http://arxiv.org/abs/2305.10626v3","updated":"2023-10-28T17:55:44Z","published":"2023-05-18T00:35:38Z","title":"Language Models Meet World Models: Embodied Experiences Enhance Language\n Models","summary":" While large language models (LMs) have shown remarkable capabilities across\nnumerous tasks, they often struggle with simple reasoning and planning in\nphysical environments, such as understanding object permanence or planning\nhousehold activities. The limitation arises from the fact that LMs are trained\nonly on written text and miss essential embodied knowledge and skills. In this\npaper, we propose a new paradigm of enhancing LMs by finetuning them with world\nmodels, to gain diverse embodied knowledge while retaining their general\nlanguage capabilities. Our approach deploys an embodied agent in a world model,\nparticularly a simulator of the physical world (VirtualHome), and acquires a\ndiverse set of embodied experiences through both goal-oriented planning and\nrandom exploration. These experiences are then used to finetune LMs to teach\ndiverse abilities of reasoning and acting in the physical world, e.g., planning\nand completing goals, object permanence and tracking, etc. Moreover, it is\ndesirable to preserve the generality of LMs during finetuning, which\nfacilitates generalizing the embodied knowledge across tasks rather than being\ntied to specific simulations. We thus further introduce the classical (EWC) for\nselective weight updates, combined with low-rank adapters (LoRA) for training\nefficiency. Extensive experiments show our approach substantially improves base\nLMs on 18 downstream tasks by 64.28% on average. 
In particular, the small LMs\n(1.3B, 6B, and 13B) enhanced by our approach match or even outperform much\nlarger LMs (e.g., ChatGPT).\n","authors":["Jiannan Xiang","Tianhua Tao","Yi Gu","Tianmin Shu","Zirui Wang","Zichao Yang","Zhiting Hu"],"pdf_url":"https://arxiv.org/pdf/2305.10626v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18768v1","updated":"2023-10-28T17:50:13Z","published":"2023-10-28T17:50:13Z","title":"Crossing the Aisle: Unveiling Partisan and Counter-Partisan Events in\n News Reporting","summary":" News media is expected to uphold unbiased reporting. Yet they may still\naffect public opinion by selectively including or omitting events that support\nor contradict their ideological positions. Prior work in NLP has only studied\nmedia bias via linguistic style and word usage. In this paper, we study to\nwhich degree media balances news reporting and affects consumers through event\ninclusion or omission. We first introduce the task of detecting both partisan\nand counter-partisan events: events that support or oppose the author's\npolitical ideology. To conduct our study, we annotate a high-quality dataset,\nPAC, containing 8,511 (counter-)partisan event annotations in 304 news articles\nfrom ideologically diverse media outlets. We benchmark PAC to highlight the\nchallenges of this task. Our findings highlight both the ways in which the news\nsubtly shapes opinion and the need for large language models that better\nunderstand events within a broader context. 
Our dataset can be found at\nhttps://github.com/launchnlp/Partisan-Event-Dataset.\n","authors":["Kaijian Zou","Xinliang Frederick Zhang","Winston Wu","Nick Beauchamp","Lu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.18768v1.pdf","comment":"EMNLP'23 Findings"},{"id":"http://arxiv.org/abs/2306.02213v2","updated":"2023-10-28T17:30:57Z","published":"2023-06-03T23:34:55Z","title":"Evaluating Emotion Arcs Across Languages: Bridging the Global Divide in\n Sentiment Analysis","summary":" Emotion arcs capture how an individual (or a population) feels over time.\nThey are widely used in industry and research; however, there is little work on\nevaluating the automatically generated arcs. This is because of the difficulty\nof establishing the true (gold) emotion arc. Our work, for the first time,\nsystematically and quantitatively evaluates automatically generated emotion\narcs. We also compare two common ways of generating emotion arcs:\nMachine-Learning (ML) models and Lexicon-Only (LexO) methods. By running\nexperiments on 18 diverse datasets in 9 languages, we show that despite being\nmarkedly poor at instance level emotion classification, LexO methods are highly\naccurate at generating emotion arcs when aggregating information from hundreds\nof instances. We also show, through experiments on six indigenous African\nlanguages, as well as Arabic, and Spanish, that automatic translations of\nEnglish emotion lexicons can be used to generate high-quality emotion arcs in\nless-resource languages. This opens up avenues for work on emotions in\nlanguages from around the world; which is crucial for commerce, public policy,\nand health research in service of speakers often left behind. Code and\nresources: https://github.com/dteodore/EmotionArcs\n","authors":["Daniela Teodorescu","Saif M. Mohammad"],"pdf_url":"https://arxiv.org/pdf/2306.02213v2.pdf","comment":"9 pages, 5 figures. 
arXiv admin note: substantial text overlap with\n arXiv:2210.07381"},{"id":"http://arxiv.org/abs/2305.17588v2","updated":"2023-10-28T17:07:45Z","published":"2023-05-27T22:15:48Z","title":"Diagnosing Transformers: Illuminating Feature Spaces for Clinical\n Decision-Making","summary":" Pre-trained transformers are often fine-tuned to aid clinical decision-making\nusing limited clinical notes. Model interpretability is crucial, especially in\nhigh-stakes domains like medicine, to establish trust and ensure safety, which\nrequires human engagement. We introduce SUFO, a systematic framework that\nenhances interpretability of fine-tuned transformer feature spaces. SUFO\nutilizes a range of analytic and visualization techniques, including Supervised\nprobing, Unsupervised similarity analysis, Feature dynamics, and Outlier\nanalysis to address key questions about model trust and interpretability. We\nconduct a case study investigating the impact of pre-training data where we\nfocus on real-world pathology classification tasks, and validate our findings\non MedNLI. We evaluate five 110M-sized pre-trained transformer models,\ncategorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical\nBioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal\nthat: (1) while PubMedBERT, the domain-specific model, contains valuable\ninformation for fine-tuning, it can overfit to minority classes when class\nimbalances exist. In contrast, mixed-domain models exhibit greater resistance\nto overfitting, suggesting potential improvements in domain-specific model\nrobustness; (2) in-domain pre-training accelerates feature disambiguation\nduring fine-tuning; and (3) feature spaces undergo significant sparsification\nduring this process, enabling clinicians to identify common outlier modes among\nfine-tuned models as demonstrated in this paper. 
These findings showcase the\nutility of SUFO in enhancing trust and safety when using transformers in\nmedicine, and we believe SUFO can aid practitioners in evaluating fine-tuned\nlanguage models for other applications in medicine and in more critical\ndomains.\n","authors":["Aliyah R. Hsu","Yeshwanth Cherapanamjeri","Briton Park","Tristan Naumann","Anobel Y. Odisho","Bin Yu"],"pdf_url":"https://arxiv.org/pdf/2305.17588v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.11166v2","updated":"2023-10-28T16:22:19Z","published":"2023-10-17T11:34:50Z","title":"ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text\n Processing","summary":" English and Chinese, known as resource-rich languages, have witnessed the\nstrong development of transformer-based language models for natural language\nprocessing tasks. Although Vietnam has approximately 100M people speaking\nVietnamese, several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA,\nperformed well on general Vietnamese NLP tasks, including POS tagging and named\nentity recognition. These pre-trained language models are still limited to\nVietnamese social media tasks. In this paper, we present the first monolingual\npre-trained language model for Vietnamese social media texts, ViSoBERT, which\nis pre-trained on a large-scale corpus of high-quality and diverse Vietnamese\nsocial media texts using XLM-R architecture. 
Moreover, we explored our\npre-trained model on five important natural language downstream tasks on\nVietnamese social media texts: emotion recognition, hate speech detection,\nsentiment analysis, spam reviews detection, and hate speech spans detection.\nOur experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses\nthe previous state-of-the-art models on multiple Vietnamese social media tasks.\nOur ViSoBERT model is available only for research purposes.\n","authors":["Quoc-Nam Nguyen","Thang Chau Phan","Duc-Vu Nguyen","Kiet Van Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.11166v2.pdf","comment":"Accepted at EMNLP'2023 Main Conference"},{"id":"http://arxiv.org/abs/2309.17133v2","updated":"2023-10-28T16:03:35Z","published":"2023-09-29T10:54:10Z","title":"Fine-grained Late-interaction Multi-modal Retrieval for Retrieval\n Augmented Visual Question Answering","summary":" Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to\nutilize knowledge from external knowledge bases to answer visually-grounded\nquestions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong\nframework to tackle KB-VQA, first retrieves related documents with Dense\nPassage Retrieval (DPR) and then uses them to answer questions. This paper\nproposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which\nsignificantly improves knowledge retrieval in RA-VQA. FLMR addresses two major\nlimitations in RA-VQA's retriever: (1) the image representations obtained via\nimage-to-text transforms can be incomplete and inaccurate and (2) relevance\nscores between queries and documents are computed with one-dimensional\nembeddings, which can be insensitive to finer-grained relevance. FLMR overcomes\nthese limitations by obtaining image representations that complement those from\nthe image-to-text transforms using a vision model aligned with an existing\ntext-based retriever through a simple alignment network. 
FLMR also encodes\nimages and questions using multi-dimensional embeddings to capture\nfiner-grained relevance between queries and documents. FLMR significantly\nimproves the original RA-VQA retriever's PRRecall@5 by approximately 8\\%.\nFinally, we equipped RA-VQA with two state-of-the-art large\nmulti-modal/language models to achieve $\\sim61\\%$ VQA score in the OK-VQA\ndataset.\n","authors":["Weizhe Lin","Jinghong Chen","Jingbiao Mei","Alexandru Coca","Bill Byrne"],"pdf_url":"https://arxiv.org/pdf/2309.17133v2.pdf","comment":"To appear at NeurIPS 2023. This is the camera-ready version. We fixed\n some numbers and added more experiments to address reviewers' comments"},{"id":"http://arxiv.org/abs/2310.18738v1","updated":"2023-10-28T15:42:47Z","published":"2023-10-28T15:42:47Z","title":"TLM: Token-Level Masking for Transformers","summary":" Structured dropout approaches, such as attention dropout and DropHead, have\nbeen investigated to regularize the multi-head attention mechanism in\nTransformers. In this paper, we propose a new regularization scheme based on\ntoken-level rather than structure-level to reduce overfitting. Specifically, we\ndevise a novel Token-Level Masking (TLM) training strategy for Transformers to\nregularize the connections of self-attention, which consists of two masking\ntechniques that are effective and easy to implement. The underlying idea is to\nmanipulate the connections between tokens in the multi-head attention via\nmasking, where the networks are forced to exploit partial neighbors'\ninformation to produce a meaningful representation. The generality and\neffectiveness of TLM are thoroughly evaluated via extensive experiments on 4\ndiversified NLP tasks across 18 datasets, including natural language\nunderstanding benchmark GLUE, ChineseGLUE, Chinese Grammatical Error\nCorrection, and data-to-text generation. 
The results indicate that TLM can\nconsistently outperform attention dropout and DropHead, e.g., it increases by\n0.5 points relative to DropHead with BERT-large on GLUE. Moreover, TLM can\nestablish a new record on the data-to-text benchmark Rotowire (18.93 BLEU). Our\ncode will be publicly available at https://github.com/Young1993/tlm.\n","authors":["Yangjun Wu","Kebin Fang","Dongxiang Zhang","Han Wang","Hao Zhang","Gang Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18738v1.pdf","comment":"13 pages. Accepted by EMNLP2023 main conference"},{"id":"http://arxiv.org/abs/2310.18729v1","updated":"2023-10-28T15:20:44Z","published":"2023-10-28T15:20:44Z","title":"Using Large Language Models to Support Thematic Analysis in Empirical\n Legal Studies","summary":" Thematic analysis and other variants of inductive coding are widely used\nqualitative analytic methods within empirical legal studies (ELS). We propose a\nnovel framework facilitating effective collaboration of a legal expert with a\nlarge language model (LLM) for generating initial codes (phase 2 of thematic\nanalysis), searching for themes (phase 3), and classifying the data in terms of\nthe themes (to kick-start phase 4). We employed the framework for an analysis\nof a dataset (n=785) of facts descriptions from criminal court opinions\nregarding thefts. The goal of the analysis was to discover classes of typical\nthefts. Our results show that the LLM, namely OpenAI's GPT-4, generated\nreasonable initial codes, and it was capable of improving the quality of the\ncodes based on expert feedback. They also suggest that the model performed well\nin zero-shot classification of facts descriptions in terms of the themes.\nFinally, the themes autonomously discovered by the LLM appear to map fairly\nwell to the themes arrived at by legal experts. 
These findings can be leveraged\nby legal researchers to guide their decisions in integrating LLMs into their\nthematic analyses, as well as other inductive coding projects.\n","authors":["Jakub Drápal","Hannes Westermann","Jaromir Savelka"],"pdf_url":"https://arxiv.org/pdf/2310.18729v1.pdf","comment":"10 pages, 5 figures, 3 tables"},{"id":"http://arxiv.org/abs/2310.14954v2","updated":"2023-10-28T14:38:46Z","published":"2023-10-23T13:55:49Z","title":"Key Frame Mechanism For Efficient Conformer Based End-to-end Speech\n Recognition","summary":" Recently, Conformer as a backbone network for end-to-end automatic speech\nrecognition achieved state-of-the-art performance. The Conformer block\nleverages a self-attention mechanism to capture global information, along with\na convolutional neural network to capture local information, resulting in\nimproved performance. However, the Conformer-based model encounters an issue\nwith the self-attention mechanism, as computational complexity grows\nquadratically with the length of the input sequence. Inspired by previous\nConnectionist Temporal Classification (CTC) guided blank skipping during\ndecoding, we introduce intermediate CTC outputs as guidance into the\ndownsampling procedure of the Conformer encoder. We define the frame with\nnon-blank output as key frame. Specifically, we introduce the key frame-based\nself-attention (KFSA) mechanism, a novel method to reduce the computation of\nthe self-attention mechanism using key frames. The structure of our proposed\napproach comprises two encoders. Following the initial encoder, we introduce an\nintermediate CTC loss function to compute the label frame, enabling us to\nextract the key frames and blank frames for KFSA. 
Furthermore, we introduce the\nkey frame-based downsampling (KFDS) mechanism to operate on high-dimensional\nacoustic features directly and drop the frames corresponding to blank labels,\nwhich results in new acoustic feature sequences as input to the second encoder.\nBy using the proposed method, which achieves comparable or higher performance\nthan vanilla Conformer and other similar work such as Efficient Conformer.\nMeantime, our proposed method can discard more than 60\\% useless frames during\nmodel training and inference, which will accelerate the inference speed\nsignificantly. This work code is available in\n{https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer}\n","authors":["Peng Fan","Changhao Shan","Sining Sun","Qing Yang","Jianwei Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.14954v2.pdf","comment":"This manuscript has been accepted by IEEE Signal Processing Letters\n for publication"},{"id":"http://arxiv.org/abs/2306.10512v2","updated":"2023-10-28T13:02:24Z","published":"2023-06-18T09:54:33Z","title":"Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing\n Perspective","summary":" Large language models (LLMs), like ChatGPT, have shown some human-like\ncognitive abilities. For comparing these abilities of different models, several\nbenchmarks (i.e. sets of standard test questions) from different fields (e.g.,\nLiterature, Biology and Psychology) are often adopted and the test results\nunder traditional metrics such as accuracy, recall and F1, are reported.\nHowever, such way for evaluating LLMs can be inefficient and inaccurate from\nthe cognitive science perspective. Inspired by Computerized Adaptive Testing\n(CAT) used in psychometrics, we propose an adaptive testing framework for LLM\nevaluation. Rather than using a standard test set and simply reporting\naccuracy, this approach dynamically adjusts the characteristics of the test\nquestions, such as difficulty, based on the model's performance. 
This allows\nfor a more accurate estimation of the model's abilities, using fewer questions.\nMore importantly, it allows LLMs to be compared with humans easily, which is\nessential for NLP models that aim for human-level ability. Our diagnostic\nreports have found that ChatGPT often behaves like a ``careless student'',\nprone to slip and occasionally guessing the questions. We conduct a\nfine-grained diagnosis and rank the latest 6 instruction-tuned LLMs from three\naspects of Subject Knowledge, Mathematical Reasoning, and Programming, where\nGPT4 can outperform other models significantly and reach the cognitive ability\nof middle-level students. Different tests for different models using efficient\nadaptive testing -- we believe this has the potential to become a new norm in\nevaluating large language models.\n","authors":["Yan Zhuang","Qi Liu","Yuting Ning","Weizhe Huang","Rui Lv","Zhenya Huang","Guanhao Zhao","Zheng Zhang","Qingyang Mao","Shijin Wang","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2306.10512v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18696v1","updated":"2023-10-28T12:46:40Z","published":"2023-10-28T12:46:40Z","title":"Probing LLMs for Joint Encoding of Linguistic Categories","summary":" Large Language Models (LLMs) exhibit impressive performance on a range of NLP\ntasks, due to the general-purpose linguistic knowledge acquired during\npretraining. Existing model interpretability research (Tenney et al., 2019)\nsuggests that a linguistic hierarchy emerges in the LLM layers, with lower\nlayers better suited to solving syntactic tasks and higher layers employed for\nsemantic processing. Yet, little is known about how encodings of different\nlinguistic phenomena interact within the models and to what extent processing\nof linguistically-related categories relies on the same, shared model\nrepresentations. In this paper, we propose a framework for testing the joint\nencoding of linguistic categories in LLMs. 
Focusing on syntax, we find evidence\nof joint encoding both at the same (related part-of-speech (POS) classes) and\ndifferent (POS classes and related syntactic dependency relations) levels of\nlinguistic hierarchy. Our cross-lingual experiments show that the same patterns\nhold across languages in multilingual LLMs.\n","authors":["Giulio Starace","Konstantinos Papakostas","Rochelle Choenni","Apostolos Panagiotopoulos","Matteo Rosati","Alina Leidinger","Ekaterina Shutova"],"pdf_url":"https://arxiv.org/pdf/2310.18696v1.pdf","comment":"Accepted in EMNLP Findings 2023"},{"id":"http://arxiv.org/abs/2308.13259v2","updated":"2023-10-28T12:19:29Z","published":"2023-08-25T09:23:55Z","title":"Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for\n Knowledge-intensive Question Answering","summary":" Equipped with Chain-of-Thought (CoT), Large language models (LLMs) have shown\nimpressive reasoning ability in various downstream tasks. Even so, suffering\nfrom hallucinations and the inability to access external knowledge, LLMs often\ncome with incorrect or unfaithful intermediate reasoning steps, especially in\nthe context of answering knowledge-intensive tasks such as KBQA. To alleviate\nthis issue, we propose a framework called Knowledge-Driven Chain-of-Thought\n(KD-CoT) to verify and modify reasoning traces in CoT via interaction with\nexternal knowledge, and thus overcome the hallucinations and error propagation.\nConcretely, we formulate the CoT rationale process of LLMs into a structured\nmulti-round QA format. In each round, LLMs interact with a QA system that\nretrieves external knowledge and produce faithful reasoning traces based on\nretrieved precise answers. The structured CoT reasoning of LLMs is facilitated\nby our developed KBQA CoT collection, which serves as in-context learning\ndemonstrations and can also be utilized as feedback augmentation to train a\nrobust retriever. 
Extensive experiments on WebQSP and ComplexWebQuestion\ndatasets demonstrate the effectiveness of proposed KD-CoT in task-solving\nreasoning generation, which outperforms the vanilla CoT ICL with an absolute\nsuccess rate of 8.0% and 5.1%. Furthermore, our proposed feedback-augmented\nretriever outperforms the state-of-the-art baselines for retrieving knowledge,\nachieving significant improvement in Hit and recall performance. Our code and\ndata are released on https://github.com/AdelWang/KD-CoT/tree/main.\n","authors":["Keheng Wang","Feiyu Duan","Sirui Wang","Peiguang Li","Yunsen Xian","Chuantao Yin","Wenge Rong","Zhang Xiong"],"pdf_url":"https://arxiv.org/pdf/2308.13259v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18685v1","updated":"2023-10-28T11:57:51Z","published":"2023-10-28T11:57:51Z","title":"When Reviewers Lock Horn: Finding Disagreement in Scientific Peer\n Reviews","summary":" To this date, the efficacy of the scientific publishing enterprise\nfundamentally rests on the strength of the peer review process. The journal\neditor or the conference chair primarily relies on the expert reviewers'\nassessment, identify points of agreement and disagreement and try to reach a\nconsensus to make a fair and informed decision on whether to accept or reject a\npaper. However, with the escalating number of submissions requiring review,\nespecially in top-tier Artificial Intelligence (AI) conferences, the\neditor/chair, among many other works, invests a significant, sometimes\nstressful effort to mitigate reviewer disagreements. Here in this work, we\nintroduce a novel task of automatically identifying contradictions among\nreviewers on a given article. To this end, we introduce ContraSciView, a\ncomprehensive review-pair contradiction dataset on around 8.5k papers (with\naround 28k review pairs containing nearly 50k review pair comments) from the\nopen review-based ICLR and NeurIPS conferences. 
We further propose a baseline\nmodel that detects contradictory statements from the review pairs. To the best\nof our knowledge, we make the first attempt to identify disagreements among\npeer reviewers automatically. We make our dataset and code public for further\ninvestigations.\n","authors":["Sandeep Kumar","Tirthankar Ghosal","Asif Ekbal"],"pdf_url":"https://arxiv.org/pdf/2310.18685v1.pdf","comment":"12 pages, 5 figures, EMNLP 2023 short"},{"id":"http://arxiv.org/abs/2306.13460v3","updated":"2023-10-28T11:44:48Z","published":"2023-06-23T12:03:07Z","title":"Learning Descriptive Image Captioning via Semipermeable Maximum\n Likelihood Estimation","summary":" Image captioning aims to describe visual content in natural language. As 'a\npicture is worth a thousand words', there could be various correct descriptions\nfor an image. However, with maximum likelihood estimation as the training\nobjective, the captioning model is penalized whenever its prediction mismatches\nwith the label. For instance, when the model predicts a word expressing richer\nsemantics than the label, it will be penalized and optimized to prefer more\nconcise expressions, referred to as conciseness optimization. In contrast,\npredictions that are more concise than labels lead to richness optimization.\nSuch conflicting optimization directions could eventually result in the model\ngenerating general descriptions. In this work, we introduce Semipermeable\nMaxImum Likelihood Estimation (SMILE), which allows richness optimization while\nblocking conciseness optimization, thus encouraging the model to generate\nlonger captions with more details. Extensive experiments on two mainstream\nimage captioning datasets MSCOCO and Flickr30K demonstrate that SMILE\nsignificantly enhances the descriptiveness of generated captions. 
We further\nprovide in-depth investigations to facilitate a better understanding of how\nSMILE works.\n","authors":["Zihao Yue","Anwen Hu","Liang Zhang","Qin Jin"],"pdf_url":"https://arxiv.org/pdf/2306.13460v3.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18679v1","updated":"2023-10-28T11:22:22Z","published":"2023-10-28T11:22:22Z","title":"N-Critics: Self-Refinement of Large Language Models with Ensemble of\n Critics","summary":" We propose a self-correction mechanism for Large Language Models (LLMs) to\nmitigate issues such as toxicity and fact hallucination. This method involves\nrefining model outputs through an ensemble of critics and the model's own\nfeedback. Drawing inspiration from human behavior, we explore whether LLMs can\nemulate the self-correction process observed in humans who often engage in\nself-reflection and seek input from others to refine their understanding of\ncomplex topics. Our approach is model-agnostic and can be applied across\nvarious domains to enhance trustworthiness by addressing fairness, bias, and\nrobustness concerns. We consistently observe performance improvements in LLMs\nfor reducing toxicity and correcting factual errors.\n","authors":["Sajad Mousavi","Ricardo Luna Gutiérrez","Desik Rengarajan","Vineet Gundecha","Ashwin Ramesh Babu","Avisek Naug","Antonio Guillen","Soumyendu Sarkar"],"pdf_url":"https://arxiv.org/pdf/2310.18679v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18662v1","updated":"2023-10-28T10:21:40Z","published":"2023-10-28T10:21:40Z","title":"ASTormer: An AST Structure-aware Transformer Decoder for Text-to-SQL","summary":" Text-to-SQL aims to generate an executable SQL program given the user\nutterance and the corresponding database schema. To ensure the well-formedness\nof output SQLs, one prominent approach adopts a grammar-based recurrent decoder\nto produce the equivalent SQL abstract syntax tree (AST). 
However, previous\nmethods mainly utilize an RNN-series decoder, which 1) is time-consuming and\ninefficient and 2) introduces very few structure priors. In this work, we\npropose an AST structure-aware Transformer decoder (ASTormer) to replace\ntraditional RNN cells. The structural knowledge, such as node types and\npositions in the tree, is seamlessly incorporated into the decoder via both\nabsolute and relative position embeddings. Besides, the proposed framework is\ncompatible with different traversing orders even considering adaptive node\nselection. Extensive experiments on five text-to-SQL benchmarks demonstrate the\neffectiveness and efficiency of our structured decoder compared to competitive\nbaselines.\n","authors":["Ruisheng Cao","Hanchong Zhang","Hongshen Xu","Jieyu Li","Da Ma","Lu Chen","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2310.18662v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18659v1","updated":"2023-10-28T10:05:51Z","published":"2023-10-28T10:05:51Z","title":"From Indeterminacy to Determinacy: Augmenting Logical Reasoning\n Capabilities with Large Language Models","summary":" Recent advances in LLMs have revolutionized the landscape of reasoning tasks.\nTo enhance the capabilities of LLMs to emulate human reasoning, prior works\nfocus on modeling reasoning steps using specific thought structures like\nchains, trees, or graphs. However, LLM-based reasoning continues to encounter\nthree challenges: 1) Selecting appropriate reasoning structures for various\ntasks; 2) Exploiting known conditions sufficiently and efficiently to deduce\nnew insights; 3) Considering the impact of historical reasoning experience. To\naddress these challenges, we propose DetermLR, a novel reasoning framework that\nformulates the reasoning process as a transformational journey from\nindeterminate premises to determinate ones. 
This process is marked by the\nincremental accumulation of determinate premises, making the conclusion\nprogressively closer to clarity. DetermLR includes three essential components:\n1) Premise identification: We categorize premises into two distinct types:\ndeterminate and indeterminate. This empowers LLMs to customize reasoning\nstructures to match the specific task complexities. 2) Premise prioritization\nand exploration: We leverage quantitative measurements to assess the relevance\nof each premise to the target, prioritizing more relevant premises for\nexploring new insights. 3) Iterative process with reasoning memory: We\nintroduce a reasoning memory module to automate storage and extraction of\navailable premises and reasoning paths, preserving historical reasoning details\nfor more accurate premise prioritization. Comprehensive experimental results\nshow that DetermLR outperforms all baselines on four challenging logical\nreasoning tasks: LogiQA, ProofWriter, FOLIO, and LogicalDeduction. DetermLR can\nachieve better reasoning performance while requiring fewer visited states,\nhighlighting its superior efficiency and effectiveness in tackling logical\nreasoning tasks.\n","authors":["Hongda Sun","Weikai Xu","Wei Liu","Jian Luan","Bin Wang","Shuo Shang","Ji-Rong Wen","Rui Yan"],"pdf_url":"https://arxiv.org/pdf/2310.18659v1.pdf","comment":"Code repo: https://github.com/XiaoMi/DetermLR"},{"id":"http://arxiv.org/abs/2305.03598v3","updated":"2023-10-28T09:56:49Z","published":"2023-05-05T15:03:01Z","title":"NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial\n Reports","summary":" How can we interpret and retrieve medical evidence to support clinical\ndecisions? 
Clinical trial reports (CTR) amassed over the years contain\nindispensable information for the development of personalized medicine.\nHowever, it is practically infeasible to manually inspect over 400,000+\nclinical trial reports in order to find the best evidence for experimental\ntreatments. Natural Language Inference (NLI) offers a potential solution to\nthis problem, by allowing the scalable computation of textual entailment.\nHowever, existing NLI models perform poorly on biomedical corpora, and\npreviously published datasets fail to capture the full complexity of inference\nover CTRs. In this work, we present a novel resource to advance research on NLI\nfor reasoning on CTRs. The resource includes two main tasks. Firstly, to\ndetermine the inference relation between a natural language statement, and a\nCTR. Secondly, to retrieve supporting facts to justify the predicted relation.\nWe provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these\ntasks. Baselines on this corpus expose the limitations of existing NLI models,\nwith 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To\nthe best of our knowledge, we are the first to design a task that covers the\ninterpretation of full CTRs. 
To encourage further work on this challenging\ndataset, we make the corpus, competition leaderboard, website and code to\nreplicate the baseline experiments available at:\nhttps://github.com/ai-systems/nli4ct\n","authors":["Maël Jullien","Marco Valentino","Hannah Frost","Paul O'Regan","Donal Landers","André Freitas"],"pdf_url":"https://arxiv.org/pdf/2305.03598v3.pdf","comment":"EMNLP 2023 Camera-ready, 15 pages"},{"id":"http://arxiv.org/abs/2310.18652v1","updated":"2023-10-28T09:42:04Z","published":"2023-10-28T09:42:04Z","title":"EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health\n Records with Chest X-ray Images","summary":" Electronic Health Records (EHRs), which contain patients' medical histories\nin various multi-modal formats, often overlook the potential for joint\nreasoning across imaging and table modalities underexplored in current EHR\nQuestion Answering (QA) systems. In this paper, we introduce EHRXQA, a novel\nmulti-modal question answering dataset combining structured EHRs and chest\nX-ray images. To develop our dataset, we first construct two uni-modal\nresources: 1) The MIMIC- CXR-VQA dataset, our newly created medical visual\nquestion answering (VQA) benchmark, specifically designed to augment the\nimaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of\na previously established table-based EHR QA dataset. By integrating these two\nuni-modal resources, we successfully construct a multi-modal EHR QA dataset\nthat necessitates both uni-modal and cross-modal reasoning. To address the\nunique challenges of multi-modal questions within EHRs, we propose a\nNeuralSQL-based strategy equipped with an external VQA API. This pioneering\nendeavor enhances engagement with multi-modal EHR sources and we believe that\nour dataset can catalyze advances in real-world medical scenarios such as\nclinical decision-making and research. 
EHRXQA is available at\nhttps://github.com/baeseongsu/ehrxqa.\n","authors":["Seongsu Bae","Daeun Kyung","Jaehee Ryu","Eunbyeol Cho","Gyubok Lee","Sunjun Kweon","Jungwoo Oh","Lei Ji","Eric I-Chao Chang","Tackeun Kim","Edward Choi"],"pdf_url":"https://arxiv.org/pdf/2310.18652v1.pdf","comment":"Accepted at NeurIPS 2023 Datasets and Benchmarks Track (10 pages for\n main text, 4 pages for references, 28 pages for supplementary materials)"},{"id":"http://arxiv.org/abs/2305.16960v3","updated":"2023-10-28T09:02:39Z","published":"2023-05-26T14:17:36Z","title":"Training Socially Aligned Language Models on Simulated Social\n Interactions","summary":" Social alignment in AI systems aims to ensure that these models behave\naccording to established societal values. However, unlike humans, who derive\nconsensus on value judgments through social interaction, current language\nmodels (LMs) are trained to rigidly replicate their training corpus in\nisolation, leading to subpar generalization in unfamiliar scenarios and\nvulnerability to adversarial attacks. This work presents a novel training\nparadigm that permits LMs to learn from simulated social interactions. In\ncomparison to existing methodologies, our approach is considerably more\nscalable and efficient, demonstrating superior performance in alignment\nbenchmarks and human evaluations. This paradigm shift in the training of LMs\nbrings us a step closer to developing AI systems that can robustly and\naccurately reflect societal norms and values.\n","authors":["Ruibo Liu","Ruixin Yang","Chenyan Jia","Ge Zhang","Denny Zhou","Andrew M. 
Dai","Diyi Yang","Soroush Vosoughi"],"pdf_url":"https://arxiv.org/pdf/2305.16960v3.pdf","comment":"Code, data, and models can be downloaded via\n https://github.com/agi-templar/Stable-Alignment"},{"id":"http://arxiv.org/abs/2305.20088v2","updated":"2023-10-28T08:46:13Z","published":"2023-05-31T17:59:04Z","title":"Improving CLIP Training with Language Rewrites","summary":" Contrastive Language-Image Pre-training (CLIP) stands as one of the most\neffective and scalable methods for training transferable vision models using\npaired image and text data. CLIP models are trained using contrastive loss,\nwhich typically relies on data augmentations to prevent overfitting and\nshortcuts. However, in the CLIP training paradigm, data augmentations are\nexclusively applied to image inputs, while language inputs remain unchanged\nthroughout the entire training process, limiting the exposure of diverse texts\nto the same image. In this paper, we introduce Language augmented CLIP\n(LaCLIP), a simple yet highly effective approach to enhance CLIP training\nthrough language rewrites. Leveraging the in-context learning capability of\nlarge language models, we rewrite the text descriptions associated with each\nimage. These rewritten texts exhibit diversity in sentence structure and\nvocabulary while preserving the original key concepts and meanings. During\ntraining, LaCLIP randomly selects either the original texts or the rewritten\nversions as text augmentations for each image. Extensive experiments on CC3M,\nCC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with\nlanguage rewrites significantly improves the transfer performance without\ncomputation or memory overhead during training. Specifically for ImageNet\nzero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on\nLAION-400M. 
Code is available at https://github.com/LijieFan/LaCLIP.\n","authors":["Lijie Fan","Dilip Krishnan","Phillip Isola","Dina Katabi","Yonglong Tian"],"pdf_url":"https://arxiv.org/pdf/2305.20088v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.17228v2","updated":"2023-10-28T08:24:06Z","published":"2023-10-26T08:27:36Z","title":"TST$^\\mathrm{R}$: Target Similarity Tuning Meets the Real World","summary":" Target similarity tuning (TST) is a method of selecting relevant examples in\nnatural language (NL) to code generation through large language models (LLMs)\nto improve performance. Its goal is to adapt a sentence embedding model to have\nthe similarity between two NL inputs match the similarity between their\nassociated code outputs. In this paper, we propose different methods to apply\nand improve TST in the real world. First, we replace the sentence transformer\nwith embeddings from a larger model, which reduces sensitivity to the language\ndistribution and thus provides more flexibility in synthetic generation of\nexamples, and we train a tiny model that transforms these embeddings to a space\nwhere embedding similarity matches code similarity, which allows the model to\nremain a black box and only requires a few matrix multiplications at inference\ntime. Second, we show how to efficiently select a smaller number of training\nexamples to train the TST model. 
Third, we introduce a ranking-based evaluation\nfor TST that does not require end-to-end code generation experiments, which can\nbe expensive to perform.\n","authors":["Anirudh Khatry","Sumit Gulwani","Priyanshu Gupta","Vu Le","Ananya Singha","Mukul Singh","Gust Verbruggen"],"pdf_url":"https://arxiv.org/pdf/2310.17228v2.pdf","comment":"Accepted for EMNLP-Findings, 2023"},{"id":"http://arxiv.org/abs/2310.18633v1","updated":"2023-10-28T08:21:16Z","published":"2023-10-28T08:21:16Z","title":"Setting the Trap: Capturing and Defeating Backdoors in Pretrained\n Language Models through Honeypots","summary":" In the field of natural language processing, the prevalent approach involves\nfine-tuning pretrained language models (PLMs) using local samples. Recent\nresearch has exposed the susceptibility of PLMs to backdoor attacks, wherein\nthe adversaries can embed malicious prediction behaviors by manipulating a few\ntraining samples. In this study, our objective is to develop a\nbackdoor-resistant tuning procedure that yields a backdoor-free model, no\nmatter whether the fine-tuning dataset contains poisoned samples. To this end,\nwe propose and integrate a honeypot module into the original PLM, specifically\ndesigned to absorb backdoor information exclusively. Our design is motivated by\nthe observation that lower-layer representations in PLMs carry sufficient\nbackdoor features while carrying minimal information about the original tasks.\nConsequently, we can impose penalties on the information acquired by the\nhoneypot module to inhibit backdoor creation during the fine-tuning process of\nthe stem network. 
Comprehensive experiments conducted on benchmark datasets\nsubstantiate the effectiveness and robustness of our defensive strategy.\nNotably, these results indicate a substantial reduction in the attack success\nrate ranging from 10\\% to 40\\% when compared to prior state-of-the-art methods.\n","authors":["Ruixiang Tang","Jiayi Yuan","Yiming Li","Zirui Liu","Rui Chen","Xia Hu"],"pdf_url":"https://arxiv.org/pdf/2310.18633v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.10519v2","updated":"2023-10-28T07:58:04Z","published":"2023-05-17T18:54:37Z","title":"Statistical Knowledge Assessment for Large Language Models","summary":" Given varying prompts regarding a factoid question, can a large language\nmodel (LLM) reliably generate factually correct answers? Existing LLMs may\ngenerate distinct responses for different prompts. In this paper, we study the\nproblem of quantifying knowledge contained in an LLM regarding a given set of\nfacts. We propose KaRR, a statistical approach to assess factual knowledge for\nLLMs. The main idea is to estimate the ratio of LLM generating text\ncorresponding to the answer entity given diverse prompts of the subject and the\nquerying relation, versus it generating by random chances. Our assessment suite\ncontains a comprehensive set of 994,123 entities and 600 relations, with\n1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes,\nincluding LLaMA, Alpaca, OPT, etc. Experiments show that our results have a\nstrong correlation (0.43 Kendall's $\\tau$) with the results of human assessment\non LLMs. 
Our results reveal that the knowledge in LLMs with the same backbone\narchitecture adheres to the scaling law, while tuning on instruction-following\ndata sometimes compromises the model's capability to generate factually correct\ntext reliably.\n","authors":["Qingxiu Dong","Jingjing Xu","Lingpeng Kong","Zhifang Sui","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2305.10519v2.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18628v1","updated":"2023-10-28T07:54:39Z","published":"2023-10-28T07:54:39Z","title":"Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive\n Learning for Code Generation","summary":" With the rise of powerful closed-sourced LLMs (ChatGPT, GPT-4), there are\nincreasing interests in distilling the capabilies of close-sourced LLMs to\nsmaller open-sourced LLMs. Previous distillation methods usually prompt ChatGPT\nto generate a set of instructions and answers, for the student model to learn.\nHowever, such standard distillation approach neglects the merits and conditions\nof the student model. Inspired by modern teaching principles, we design a\npersonalised distillation process, in which the student attempts to solve a\ntask first, then the teacher provides an adaptive refinement for the student to\nimprove. Instead of feeding the student with teacher's prior, personalised\ndistillation enables personalised learning for the student model, as it only\nlearns on examples it makes mistakes upon and learns to improve its own\nsolution. On code generation, personalised distillation consistently\noutperforms standard distillation with only one third of the data. 
With only\n2.5-3K personalised examples that incur a data-collection cost of 4-6$, we\nboost CodeGen-mono-16B by 7% to achieve 36.4% pass@1 and StarCoder by 12.2% to\nachieve 45.8% pass@1 on HumanEval.\n","authors":["Hailin Chen","Amrita Saha","Steven Hoi","Shafiq Joty"],"pdf_url":"https://arxiv.org/pdf/2310.18628v1.pdf","comment":"Accepted to EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.04556v2","updated":"2023-10-28T07:09:38Z","published":"2023-05-08T08:53:37Z","title":"Non-Autoregressive Math Word Problem Solver with Unified Tree Structure","summary":" Existing MWP solvers employ sequence or binary tree to present the solution\nexpression and decode it from given problem description. However, such\nstructures fail to handle the variants that can be derived via mathematical\nmanipulation, e.g., $(a_1+a_2) * a_3$ and $a_1 * a_3+a_2 * a_3$ can both be\npossible valid solutions for a same problem but formulated as different\nexpression sequences or trees. The multiple solution variants depicting\ndifferent possible solving procedures for the same input problem would raise\ntwo issues: 1) making it hard for the model to learn the mapping function\nbetween the input and output spaces effectively, and 2) wrongly indicating\n\\textit{wrong} when evaluating a valid expression variant. To address these\nissues, we introduce a unified tree structure to present a solution expression,\nwhere the elements are permutable and identical for all the expression\nvariants. We propose a novel non-autoregressive solver, named \\textit{MWP-NAS},\nto parse the problem and deduce the solution expression based on the unified\ntree. For evaluating the possible expression variants, we design a path-based\nmetric to evaluate the partial accuracy of expressions of a unified tree. The\nresults from extensive experiments conducted on Math23K and MAWPS demonstrate\nthe effectiveness of our proposed MWP-NAS. 
The codes and checkpoints are\navailable at: \\url{https://github.com/mengqunhan/MWP-NAS}.\n","authors":["Yi Bin","Mengqun Han","Wenhao Shi","Lei Wang","Yang Yang","See-Kiong Ng","Heng Tao Shen"],"pdf_url":"https://arxiv.org/pdf/2305.04556v2.pdf","comment":"Accepted at EMNLP2023"},{"id":"http://arxiv.org/abs/2310.18619v1","updated":"2023-10-28T07:00:28Z","published":"2023-10-28T07:00:28Z","title":"Dense Retrieval as Indirect Supervision for Large-space Decision Making","summary":" Many discriminative natural language understanding (NLU) tasks have large\nlabel spaces. Learning such a process of large-space decision making is\nparticularly challenging due to the lack of training instances per label and\nthe difficulty of selection among many fine-grained labels. Inspired by dense\nretrieval methods for passage finding in open-domain QA, we propose a\nreformulation of large-space discriminative NLU tasks as a learning-to-retrieve\ntask, leading to a novel solution named Dense Decision Retrieval (DDR ).\nInstead of predicting fine-grained decisions as logits, DDR adopts a\ndual-encoder architecture that learns to predict by retrieving from a decision\nthesaurus. This approach not only leverages rich indirect supervision signals\nfrom easy-to-consume learning resources for dense retrieval, it also leads to\nenhanced prediction generalizability with a semantically meaningful\nrepresentation of the large decision space. 
When evaluated on tasks with\ndecision spaces ranging from hundreds to hundred-thousand scales, DDR\noutperforms strong baselines greatly by 27.54% in P@1 on two extreme\nmulti-label classification tasks, 1.17% in F1 score ultra-fine entity typing,\nand 1.26% in accuracy on three few-shot intent classification tasks on average.\nCode and resources are available at https://github.com/luka-group/DDR\n","authors":["Nan Xu","Fei Wang","Mingtao Dong","Muhao Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18619v1.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2210.04074v3","updated":"2023-10-28T06:37:48Z","published":"2022-10-08T18:00:22Z","title":"Are All Steps Equally Important? Benchmarking Essentiality Detection of\n Events","summary":" Natural language expresses events with varying granularities, where\ncoarse-grained events (goals) can be broken down into finer-grained event\nsequences (steps). A critical yet overlooked aspect of understanding event\nprocesses is recognizing that not all step events hold equal importance toward\nthe completion of a goal. In this paper, we address this gap by examining the\nextent to which current models comprehend the essentiality of step events in\nrelation to a goal event. Cognitive studies suggest that such capability\nenables machines to emulate human commonsense reasoning about preconditions and\nnecessary efforts of everyday tasks. We contribute a high-quality corpus of\n(goal, step) pairs gathered from the community guideline website WikiHow, with\nsteps manually annotated for their essentiality concerning the goal by experts.\nThe high inter-annotator agreement demonstrates that humans possess a\nconsistent understanding of event essentiality. However, after evaluating\nmultiple statistical and largescale pre-trained language models, we find that\nexisting approaches considerably underperform compared to humans. 
This\nobservation highlights the need for further exploration into this critical and\nchallenging task. The dataset and code are available at\nhttp://cogcomp.org/page/publication_view/1023.\n","authors":["Haoyu Wang","Hongming Zhang","Yueguan Wang","Yuqian Deng","Muhao Chen","Dan Roth"],"pdf_url":"https://arxiv.org/pdf/2210.04074v3.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18604v1","updated":"2023-10-28T06:11:18Z","published":"2023-10-28T06:11:18Z","title":"Anaphor Assisted Document-Level Relation Extraction","summary":" Document-level relation extraction (DocRE) involves identifying relations\nbetween entities distributed in multiple sentences within a document. Existing\nmethods focus on building a heterogeneous document graph to model the internal\nstructure of an entity and the external interaction between entities. However,\nthere are two drawbacks in existing methods. On one hand, anaphor plays an\nimportant role in reasoning to identify relations between entities but is\nignored by these methods. On the other hand, these methods achieve\ncross-sentence entity interactions implicitly by utilizing a document or\nsentences as intermediate nodes. Such an approach has difficulties in learning\nfine-grained interactions between entities across different sentences,\nresulting in sub-optimal performance. To address these issues, we propose an\nAnaphor-Assisted (AA) framework for DocRE tasks. 
Experimental results on the\nwidely-used datasets demonstrate that our model achieves a new state-of-the-art\nperformance.\n","authors":["Chonggang Lu","Richong Zhang","Kai Sun","Jaein Kim","Cunwang Zhang","Yongyi Mao"],"pdf_url":"https://arxiv.org/pdf/2310.18604v1.pdf","comment":"Accepted to EMNLP 2023 Main Conference"},{"id":"http://arxiv.org/abs/2310.18600v1","updated":"2023-10-28T05:51:57Z","published":"2023-10-28T05:51:57Z","title":"MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of\n Indian Legal Case Judgments","summary":" Automatic summarization of legal case judgments is a practically important\nproblem that has attracted substantial research efforts in many countries. In\nthe context of the Indian judiciary, there is an additional complexity --\nIndian legal case judgments are mostly written in complex English, but a\nsignificant portion of India's population lacks command of the English\nlanguage. Hence, it is crucial to summarize the legal documents in Indian\nlanguages to ensure equitable access to justice. While prior research primarily\nfocuses on summarizing legal case judgments in their source languages, this\nstudy presents a pioneering effort toward cross-lingual summarization of\nEnglish legal documents into Hindi, the most frequently spoken Indian language.\nWe construct the first high-quality legal corpus comprising of 3,122 case\njudgments from prominent Indian courts in English, along with their summaries\nin both English and Hindi, drafted by legal practitioners. 
We benchmark the\nperformance of several diverse summarization approaches on our corpus and\ndemonstrate the need for further research in cross-lingual summarization in the\nlegal domain.\n","authors":["Debtanu Datta","Shubham Soni","Rajdeep Mukherjee","Saptarshi Ghosh"],"pdf_url":"https://arxiv.org/pdf/2310.18600v1.pdf","comment":"Accepted at EMNLP 2023 (Main Conference)"},{"id":"http://arxiv.org/abs/2310.12860v2","updated":"2023-10-28T05:07:31Z","published":"2023-10-19T16:11:02Z","title":"Probing LLMs for hate speech detection: strengths and vulnerabilities","summary":" Recently efforts have been made by social media platforms as well as\nresearchers to detect hateful or toxic language using large language models.\nHowever, none of these works aim to use explanation, additional context and\nvictim community information in the detection process. We utilise different\nprompt variation, input information and evaluate large language models in zero\nshot setting (without adding any in-context examples). We select three large\nlanguage models (GPT-3.5, text-davinci and Flan-T5) and three datasets -\nHateXplain, implicit hate and ToxicSpans. We find that on average including the\ntarget information in the pipeline improves the model performance substantially\n(~20-30%) over the baseline across the datasets. There is also a considerable\neffect of adding the rationales/explanations into the pipeline (~10-20%) over\nthe baseline across the datasets. In addition, we further provide a typology of\nthe error cases where these large language models fail to (i) classify and (ii)\nexplain the reason for the decisions they take. 
Such vulnerable points\nautomatically constitute 'jailbreak' prompts for these models and industry\nscale safeguard techniques need to be developed to make the models robust\nagainst such prompts.\n","authors":["Sarthak Roy","Ashish Harshavardhan","Animesh Mukherjee","Punyajoy Saha"],"pdf_url":"https://arxiv.org/pdf/2310.12860v2.pdf","comment":"13 pages, 9 figures, 7 tables, accepted to findings of EMNLP 2023"},{"id":"http://arxiv.org/abs/2305.12147v2","updated":"2023-10-28T04:22:58Z","published":"2023-05-20T09:23:09Z","title":"LogiCoT: Logical Chain-of-Thought Instruction-Tuning","summary":" Generative Pre-trained Transformer 4 (GPT-4) demonstrates impressive\nchain-of-thought reasoning ability. Recent work on self-instruction tuning,\nsuch as Alpaca, has focused on enhancing the general proficiency of models.\nThese instructions enable the model to achieve performance comparable to\nGPT-3.5 on general tasks like open-domain text generation and paraphrasing.\nHowever, they fall short of helping the model handle complex reasoning tasks.\nTo bridge the gap, this paper presents LogiCoT, a new instruction-tuning\ndataset for Logical Chain-of-Thought reasoning with GPT-4. We elaborate on the\nprocess of harvesting instructions for prompting GPT-4 to generate\nchain-of-thought rationales. LogiCoT serves as an instruction set for teaching\nmodels of logical reasoning and elicits general reasoning skills.\n","authors":["Hanmeng Liu","Zhiyang Teng","Leyang Cui","Chaoli Zhang","Qiji Zhou","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2305.12147v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.06939v3","updated":"2023-10-28T04:19:41Z","published":"2023-04-14T06:17:46Z","title":"Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with\n Text","summary":" In-context vision and language models like Flamingo support arbitrarily\ninterleaved sequences of images and text as input. 
This format not only enables\nfew-shot learning via interleaving independent supervised (image, text)\nexamples, but also, more complex prompts involving interaction between images,\ne.g., \"What do image A and image B have in common?\" To support this interface,\npretraining occurs over web corpora that similarly contain interleaved\nimages+text. To date, however, large-scale data of this form have not been\npublicly available.\n We release Multimodal C4, an augmentation of the popular text-only C4 corpus\nwith images interleaved. We use a linear assignment algorithm to place images\ninto longer bodies of text using CLIP features, a process that we show\noutperforms alternatives. Multimodal C4 spans everyday topics like cooking,\ntravel, technology, etc. A manual inspection of a random sample of documents\nshows that a vast majority (88%) of images are topically relevant, and that\nlinear assignment frequently selects individual sentences specifically\nwell-aligned with each image (80%). After filtering NSFW images, ads, etc., the\nresulting corpus consists of 101.2M documents with 571M images interleaved in\n43B English tokens.\n","authors":["Wanrong Zhu","Jack Hessel","Anas Awadalla","Samir Yitzhak Gadre","Jesse Dodge","Alex Fang","Youngjae Yu","Ludwig Schmidt","William Yang Wang","Yejin Choi"],"pdf_url":"https://arxiv.org/pdf/2304.06939v3.pdf","comment":"NeurIPS D&B 2023. Project homepage: https://github.com/allenai/mmc4"},{"id":"http://arxiv.org/abs/2310.14633v2","updated":"2023-10-28T04:14:38Z","published":"2023-10-23T07:13:31Z","title":"Extending Input Contexts of Language Models through Training on\n Segmented Sequences","summary":" Effectively training language models on long inputs poses many technical\nchallenges. As a cost consideration, languages models are pretrained on a fixed\nsequence length before being adapted to longer sequences. 
We explore various\nmethods for adapting models to longer inputs by training on segmented sequences\nand an interpolation-based method for extending absolute positional embeddings.\nWe develop a training procedure to extend the input context size of pretrained\nmodels with no architectural changes and no additional memory costs than\ntraining on the original input lengths. By sub-sampling segments from long\ninputs while maintaining their original position the model is able to learn new\npositional interactions. Our method benefits both models trained with absolute\npositional embeddings, by extending their input contexts, as well as popular\nrelative positional embedding methods showing a reduced perplexity on sequences\nlonger than they were trained on. We demonstrate our method can extend input\ncontexts by a factor of 4x while improving perplexity.\n","authors":["Petros Karypis","Julian McAuley","George Karypis"],"pdf_url":"https://arxiv.org/pdf/2310.14633v2.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2305.11317v2","updated":"2023-10-28T04:13:44Z","published":"2023-05-18T21:53:58Z","title":"Collaborative Generative AI: Integrating GPT-k for Efficient Editing in\n Text-to-Image Generation","summary":" The field of text-to-image (T2I) generation has garnered significant\nattention both within the research community and among everyday users. Despite\nthe advancements of T2I models, a common issue encountered by users is the need\nfor repetitive editing of input prompts in order to receive a satisfactory\nimage, which is time-consuming and labor-intensive. Given the demonstrated text\ngeneration power of large-scale language models, such as GPT-k, we investigate\nthe potential of utilizing such models to improve the prompt editing process\nfor T2I generation. 
We conduct a series of experiments to compare the common\nedits made by humans and GPT-k, evaluate the performance of GPT-k in prompting\nT2I, and examine factors that may influence this process. We found that GPT-k\nmodels focus more on inserting modifiers while humans tend to replace words and\nphrases, which includes changes to the subject matter. Experimental results\nshow that GPT-k are more effective in adjusting modifiers rather than\npredicting spontaneous changes in the primary subject matters. Adopting the\nedit suggested by GPT-k models may reduce the percentage of remaining edits by\n20-30%.\n","authors":["Wanrong Zhu","Xinyi Wang","Yujie Lu","Tsu-Jui Fu","Xin Eric Wang","Miguel Eckstein","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.11317v2.pdf","comment":"EMNLP 2023"},{"id":"http://arxiv.org/abs/2310.18581v1","updated":"2023-10-28T04:07:58Z","published":"2023-10-28T04:07:58Z","title":"Accelerating LLM Inference by Enabling Intermediate Layer Decoding","summary":" Large Language Models (LLMs) have achieved remarkable performance across a\nwide variety of natural language tasks; however, their large size makes their\ninference slow and computationally expensive which poses a practical challenge\nfor resource constrained real-world applications. Focusing on this problem, we\npropose to instruction tune LLMs in a way that enables intermediate layer\ndecoding for efficiently generating text, but importantly without compromising\nthe quality of the generation. Specifically, we instruction tune LLMs with\nadditional explicit Losses from the InTermediate layErs (LITE) and show that it\nenables these layers to acquire 'good' generation ability without affecting the\ngeneration ability of the final layer. We perform 'dynamic confidence-based\nearly exiting' at token level from the intermediate layers which improves the\nefficiency of inference while maintaining the generation quality. 
We conduct\ncomprehensive experiments by instruction tuning LLaMA-2 models on the widely\nused Alpaca dataset and holistically evaluate on four different\nhuman-instruction test sets: Vicuna, WizardLM, Koala, and Self-Instruct. We\nshow that 'dynamic early exiting' achieves consistent and considerable cost\nimprovements (37.86% on average) while maintaining the generation quality of\nthe responses. We further conduct a thorough analysis of the results over\nseveral important aspects, such as comparing the semantic similarity of the\noutputs and dissecting the efficiency improvements by comparing the number of\ntokens generated in the output. In summary, our work contributes to improving\nthe efficiency of LLM inference while maintaining the generation quality, a\ncrucial step en route to enabling their widespread adoption.\n","authors":["Neeraj Varshney","Agneet Chatterjee","Mihir Parmar","Chitta Baral"],"pdf_url":"https://arxiv.org/pdf/2310.18581v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.06775v3","updated":"2023-10-28T02:11:31Z","published":"2023-07-06T16:04:46Z","title":"A Novel Site-Agnostic Multimodal Deep Learning Model to Identify\n Pro-Eating Disorder Content on Social Media","summary":" Over the last decade, there has been a vast increase in eating disorder\ndiagnoses and eating disorder-attributed deaths, reaching their zenith during\nthe Covid-19 pandemic. This immense growth derived in part from the stressors\nof the pandemic but also from increased exposure to social media, which is rife\nwith content that promotes eating disorders. This study aimed to create a\nmultimodal deep learning model that can determine if a given social media post\npromotes eating disorders based on a combination of visual and textual data. A\nlabeled dataset of Tweets was collected from Twitter, upon which twelve deep\nlearning models were trained and tested. 
Based on model performance, the most\neffective deep learning model was the multimodal fusion of the RoBERTa natural\nlanguage processing model and the MaxViT image classification model, attaining\naccuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT\nfusion model, deployed to classify an unlabeled dataset of posts from the\nsocial media sites Tumblr and Reddit, generated results akin to those of\nprevious research studies that did not employ artificial intelligence-based\ntechniques, indicating that deep learning models can develop insights congruent\nto those of researchers. Additionally, the model was used to conduct a\ntimeseries analysis of yet unseen Tweets from eight Twitter hashtags,\nuncovering that, since 2014, the relative abundance of content that promotes\neating disorders has decreased drastically within those communities. Despite\nthis reduction, by 2018, content that promotes eating disorders had either\nstopped declining or increased in ampleness anew on these hashtags.\n","authors":["Jonathan Feldman"],"pdf_url":"https://arxiv.org/pdf/2307.06775v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.15694v4","updated":"2023-10-28T02:00:39Z","published":"2023-10-24T10:05:32Z","title":"COPF: Continual Learning Human Preference through Optimal Policy Fitting","summary":" The technique of Reinforcement Learning from Human Feedback (RLHF) is a\ncommonly employed method to improve pre-trained Language Models (LM), enhancing\ntheir ability to conform to human preferences. Nevertheless, the current\nRLHF-based LMs necessitate full retraining each time novel queries or feedback\nare introduced, which becomes a challenging task because human preferences can\nvary between different domains or tasks. 
Retraining LMs poses practical\ndifficulties in many real-world situations due to the significant time and\ncomputational resources required, along with concerns related to data privacy.\nTo address this limitation, we propose a new method called Continual Optimal\nPolicy Fitting (COPF), in which we estimate a series of optimal policies using\nthe Monte Carlo method, and then continually fit the policy sequence with the\nfunction regularization. COPF involves a single learning phase and doesn't\nnecessitate complex reinforcement learning. Importantly, it shares the\ncapability with RLHF to learn from unlabeled data, making it flexible for\ncontinual preference learning. Our experimental results show that COPF\noutperforms strong Continuous learning (CL) baselines when it comes to\nconsistently aligning with human preferences on different tasks and domains.\n","authors":["Han Zhang","Lin Gui","Yuanzhao Zhai","Hui Wang","Yu Lei","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2310.15694v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14702v3","updated":"2023-10-28T01:03:15Z","published":"2023-05-24T04:13:15Z","title":"DecipherPref: Analyzing Influential Factors in Human Preference\n Judgments via GPT-4","summary":" Human preference judgments are pivotal in guiding large language models\n(LLMs) to produce outputs that align with human values. Human evaluations are\nalso used in summarization tasks to compare outputs from various systems,\ncomplementing existing automatic metrics. Despite their significance, however,\nthere has been limited research probing these pairwise or $k$-wise comparisons.\nThe collective impact and relative importance of factors such as output length,\ninformativeness, fluency, and factual consistency are still not well\nunderstood. It is also unclear if there are other hidden factors influencing\nhuman judgments. In this paper, we conduct an in-depth examination of a\ncollection of pairwise human judgments released by OpenAI. 
Utilizing the\nBradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in\nthese human judgments. We find that the most favored factors vary across tasks\nand genres, whereas the least favored factors tend to be consistent, e.g.,\noutputs are too brief, contain excessive off-focus content or hallucinated\nfacts. Our findings have implications on the construction of balanced datasets\nin human preference evaluations, which is a crucial step in shaping the\nbehaviors of future LLMs.\n","authors":["Yebowen Hu","Kaiqiang Song","Sangwoo Cho","Xiaoyang Wang","Hassan Foroosh","Fei Liu"],"pdf_url":"https://arxiv.org/pdf/2305.14702v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18545v1","updated":"2023-10-28T00:27:21Z","published":"2023-10-28T00:27:21Z","title":"Identifying Conspiracy Theories News based on Event Relation Graph","summary":" Conspiracy theories, as a type of misinformation, are narratives that\nexplains an event or situation in an irrational or malicious manner. While most\nprevious work examined conspiracy theory in social media short texts, limited\nattention was put on such misinformation in long news documents. In this paper,\nwe aim to identify whether a news article contains conspiracy theories. We\nobserve that a conspiracy story can be made up by mixing uncorrelated events\ntogether, or by presenting an unusual distribution of relations between events.\nAchieving a contextualized understanding of events in a story is essential for\ndetecting conspiracy theories. Thus, we propose to incorporate an event\nrelation graph for each article, in which events are nodes, and four common\ntypes of event relations, coreference, temporal, causal, and subevent\nrelations, are considered as edges. 
Then, we integrate the event relation graph\ninto conspiracy theory identification in two ways: an event-aware language\nmodel is developed to augment the basic language model with the knowledge of\nevents and event relations via soft labels; further, a heterogeneous graph\nattention network is designed to derive a graph embedding based on hard labels.\nExperiments on a large benchmark dataset show that our approach based on event\nrelation graph improves both precision and recall of conspiracy theory\nidentification, and generalizes well for new unseen media sources.\n","authors":["Yuanyuan Lei","Ruihong Huang"],"pdf_url":"https://arxiv.org/pdf/2310.18545v1.pdf","comment":"Accepted to EMNLP 2023 Findings"},{"id":"http://arxiv.org/abs/2305.14772v2","updated":"2023-10-28T00:18:26Z","published":"2023-05-24T06:23:02Z","title":"A Question Answering Framework for Decontextualizing User-facing\n Snippets from Scientific Documents","summary":" Many real-world applications (e.g., note taking, search) require extracting a\nsentence or paragraph from a document and showing that snippet to a human\noutside of the source document. Yet, users may find snippets difficult to\nunderstand as they lack context from the original document. In this work, we\nuse language models to rewrite snippets from scientific documents to be read on\ntheir own. First, we define the requirements and challenges for this\nuser-facing decontextualization task, such as clarifying where edits occur and\nhandling references to other documents. Second, we propose a framework that\ndecomposes the task into three stages: question generation, question answering,\nand rewriting. Using this framework, we collect gold decontextualizations from\nexperienced scientific article readers. 
We then conduct a range of experiments\nacross state-of-the-art commercial and open-source language models to identify\nhow to best provide missing-but-relevant information to models for our task.\nFinally, we develop QaDecontext, a simple prompting strategy inspired by our\nframework that improves over end-to-end prompting. We conclude with analysis\nthat finds, while rewriting is easy, question generation and answering remain\nchallenging for today's models.\n","authors":["Benjamin Newman","Luca Soldaini","Raymond Fok","Arman Cohan","Kyle Lo"],"pdf_url":"https://arxiv.org/pdf/2305.14772v2.pdf","comment":"19 pages, 2 figures, 8 tables, EMNLP2023"},{"id":"http://arxiv.org/abs/2310.18544v1","updated":"2023-10-28T00:18:19Z","published":"2023-10-28T00:18:19Z","title":"Discourse Structures Guided Fine-grained Propaganda Identification","summary":" Propaganda is a form of deceptive narratives that instigate or mislead the\npublic, usually with a political purpose. In this paper, we aim to identify\npropaganda in political news at two fine-grained levels: sentence-level and\ntoken-level. We observe that propaganda content is more likely to be embedded\nin sentences that attribute causality or assert contrast to nearby sentences,\nas well as seen in opinionated evaluation, speculation and discussions of\nfuture expectation. Hence, we propose to incorporate both local and global\ndiscourse structures for propaganda discovery and construct two teacher models\nfor identifying PDTB-style discourse relations between nearby sentences and\ncommon discourse roles of sentences in a news article respectively. We further\ndevise two methods to incorporate the two types of discourse structures for\npropaganda identification by either using teacher predicted probabilities as\nadditional features or soliciting guidance in a knowledge distillation\nframework. 
Experiments on the benchmark dataset demonstrate that leveraging\nguidance from discourse structures can significantly improve both precision and\nrecall of propaganda content identification.\n","authors":["Yuanyuan Lei","Ruihong Huang"],"pdf_url":"https://arxiv.org/pdf/2310.18544v1.pdf","comment":"Accepted to EMNLP 2023"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2310.18850v1","updated":"2023-10-28T23:46:31Z","published":"2023-10-28T23:46:31Z","title":"Exploring Data Augmentations on Self-/Semi-/Fully- Supervised\n Pre-trained Models","summary":" Data augmentation has become a standard component of vision pre-trained\nmodels to capture the invariance between augmented views. In practice,\naugmentation techniques that mask regions of a sample with zero/mean values or\npatches from other samples are commonly employed in pre-trained models with\nself-/semi-/fully-supervised contrastive losses. However, the underlying\nmechanism behind the effectiveness of these augmentation techniques remains\npoorly explored. To investigate the problems, we conduct an empirical study to\nquantify how data augmentation affects performance. Concretely, we apply 4\ntypes of data augmentations termed with Random Erasing, CutOut, CutMix and\nMixUp to a series of self-/semi-/fully- supervised pre-trained models. We\nreport their performance on vision tasks such as image classification, object\ndetection, instance segmentation, and semantic segmentation. We then explicitly\nevaluate the invariance and diversity of the feature embedding. We observe\nthat: 1) Masking regions of the images decreases the invariance of the learned\nfeature embedding while providing a more considerable diversity. 2) Manual\nannotations do not change the invariance or diversity of the learned feature\nembedding. 
3) The MixUp approach improves the diversity significantly, with\nonly a marginal decrease in terms of the invariance.\n","authors":["Shentong Mo","Zhun Sun","Chao Li"],"pdf_url":"https://arxiv.org/pdf/2310.18850v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18849v1","updated":"2023-10-28T23:38:30Z","published":"2023-10-28T23:38:30Z","title":"Deep Learning-based Compressed Domain Multimedia for Man and Machine: A\n Taxonomy and Application to Point Cloud Classification","summary":" In the current golden age of multimedia, human visualization is no longer the\nsingle main target, with the final consumer often being a machine which\nperforms some processing or computer vision tasks. In both cases, deep learning\nplays a fundamental role in extracting features from the multimedia\nrepresentation data, usually producing a compressed representation referred to\nas latent representation. The increasing development and adoption of deep\nlearning-based solutions in a wide area of multimedia applications have opened\nan exciting new vision where a common compressed multimedia representation is\nused for both man and machine. The main benefits of this vision are two-fold:\ni) improved performance for the computer vision tasks, since the effects of\ncoding artifacts are mitigated; and ii) reduced computational complexity, since\nprior decoding is not required. This paper proposes the first taxonomy for\ndesigning compressed domain computer vision solutions driven by the\narchitecture and weights compatibility with an available spatio-temporal\ncomputer vision processor. 
The potential of the proposed taxonomy is\ndemonstrated for the specific case of point cloud classification by designing\nnovel compressed domain processors using the JPEG Pleno Point Cloud Coding\nstandard under development and adaptations of the PointGrid classifier.\nExperimental results show that the designed compressed domain point cloud\nclassification solutions can significantly outperform the spatial-temporal\ndomain classification benchmarks when applied to the decompressed data,\ncontaining coding artifacts, and even surpass their performance when applied to\nthe original uncompressed data.\n","authors":["Abdelrahman Seleem","André F. R. Guarda","Nuno M. M. Rodrigues","Fernando Pereira"],"pdf_url":"https://arxiv.org/pdf/2310.18849v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.07017v2","updated":"2023-10-28T23:25:10Z","published":"2023-05-11T17:56:09Z","title":"An Inverse Scaling Law for CLIP Training","summary":" CLIP, one of the pioneering foundation models that connect images and text,\nhas enabled many recent breakthroughs in computer vision. However, its\nassociated training cost is prohibitively high, imposing a significant barrier\nto its widespread exploration. In this paper, we present a surprising finding\nthat there exists an inverse scaling law for CLIP training, whereby the larger\nthe image/text encoders used, the shorter the sequence length of image/text\ntokens that can be applied in training. Moreover, we showcase that the strategy\nfor reducing image/text token length plays a crucial role in determining the\nquality of this scaling law.\n As a result of this finding, we are able to successfully train CLIP even with\nlimited computational resources. For example, using 8 A100 GPUs, our CLIP\nmodels achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days,\n67.8% in ~3 days, and 69.3% in ~4 days. 
Our method also works well when scaling\nup -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot\naccuracy, and meanwhile accelerate the training by ~33x compared to its\nOpenCLIP counterpart. By reducing the computation barrier associated with CLIP,\nwe hope to inspire more research in this field, particularly from academics.\nOur code is available at https://github.com/UCSC-VLAA/CLIPA.\n","authors":["Xianhang Li","Zeyu Wang","Cihang Xie"],"pdf_url":"https://arxiv.org/pdf/2305.07017v2.pdf","comment":"NeurIPS 2023 camera-ready"},{"id":"http://arxiv.org/abs/2310.18846v1","updated":"2023-10-28T23:16:49Z","published":"2023-10-28T23:16:49Z","title":"INCODE: Implicit Neural Conditioning with Prior Knowledge Embeddings","summary":" Implicit Neural Representations (INRs) have revolutionized signal\nrepresentation by leveraging neural networks to provide continuous and smooth\nrepresentations of complex data. However, existing INRs face limitations in\ncapturing fine-grained details, handling noise, and adapting to diverse signal\ntypes. To address these challenges, we introduce INCODE, a novel approach that\nenhances the control of the sinusoidal-based activation function in INRs using\ndeep prior knowledge. INCODE comprises a harmonizer network and a composer\nnetwork, where the harmonizer network dynamically adjusts key parameters of the\nactivation function. Through a task-specific pre-trained model, INCODE adapts\nthe task-specific parameters to optimize the representation process. Our\napproach not only excels in representation, but also extends its prowess to\ntackle complex tasks such as audio, image, and 3D shape reconstructions, as\nwell as intricate challenges such as neural radiance fields (NeRFs), and\ninverse problems, including denoising, super-resolution, inpainting, and CT\nreconstruction. 
Through comprehensive experiments, INCODE demonstrates its\nsuperiority in terms of robustness, accuracy, quality, and convergence rate,\nbroadening the scope of signal representation. Please visit the project's\nwebsite for details on the proposed method and access to the code.\n","authors":["Amirhossein Kazerouni","Reza Azad","Alireza Hosseini","Dorit Merhof","Ulas Bagci"],"pdf_url":"https://arxiv.org/pdf/2310.18846v1.pdf","comment":"Accepted at WACV 2024 conference"},{"id":"http://arxiv.org/abs/2310.18840v1","updated":"2023-10-28T22:57:24Z","published":"2023-10-28T22:57:24Z","title":"Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models","summary":" Personalized text-to-image (T2I) synthesis based on diffusion models has\nattracted significant attention in recent research. However, existing methods\nprimarily concentrate on customizing subjects or styles, neglecting the\nexploration of global geometry. In this study, we propose an approach that\nfocuses on the customization of 360-degree panoramas, which inherently possess\nglobal geometric properties, using a T2I diffusion model. To achieve this, we\ncurate a paired image-text dataset specifically designed for the task and\nsubsequently employ it to fine-tune a pre-trained T2I diffusion model with\nLoRA. Nevertheless, the fine-tuned model alone does not ensure the continuity\nbetween the leftmost and rightmost sides of the synthesized images, a crucial\ncharacteristic of 360-degree panoramas. To address this issue, we propose a\nmethod called StitchDiffusion. Specifically, we perform pre-denoising\noperations twice at each time step of the denoising process on the stitch block\nconsisting of the leftmost and rightmost image regions. Furthermore, a global\ncropping is adopted to synthesize seamless 360-degree panoramas. Experimental\nresults demonstrate the effectiveness of our customized model combined with the\nproposed StitchDiffusion in generating high-quality 360-degree panoramic\nimages. 
Moreover, our customized model exhibits exceptional generalization\nability in producing scenes unseen in the fine-tuning dataset. Code is\navailable at https://github.com/littlewhitesea/StitchDiffusion.\n","authors":["Hai Wang","Xiaoyu Xiang","Yuchen Fan","Jing-Hao Xue"],"pdf_url":"https://arxiv.org/pdf/2310.18840v1.pdf","comment":"Accepted by WACV 2024, Project Page:\n https://littlewhitesea.github.io/stitchdiffusion.github.io/"},{"id":"http://arxiv.org/abs/2310.18815v1","updated":"2023-10-28T20:41:41Z","published":"2023-10-28T20:41:41Z","title":"Rethinking Semi-Supervised Federated Learning: How to co-train\n fully-labeled and fully-unlabeled client imaging data","summary":" The most challenging, yet practical, setting of semi-supervised federated\nlearning (SSFL) is where a few clients have fully labeled data whereas the\nother clients have fully unlabeled data. This is particularly common in\nhealthcare settings where collaborating partners (typically hospitals) may have\nimages but not annotations. The bottleneck in this setting is the joint\ntraining of labeled and unlabeled clients as the objective function for each\nclient varies based on the availability of labels. This paper investigates an\nalternative way for effective training with labeled and unlabeled clients in a\nfederated setting. We propose a novel learning scheme specifically designed for\nSSFL which we call Isolated Federated Learning (IsoFed) that circumvents the\nproblem by avoiding simple averaging of supervised and semi-supervised models\ntogether. In particular, our training approach consists of two parts - (a)\nisolated aggregation of labeled and unlabeled client models, and (b) local\nself-supervised pretraining of isolated global models in all clients. We\nevaluate our model performance on medical image datasets of four different\nmodalities publicly available within the biomedical image classification\nbenchmark MedMNIST. 
We further vary the proportion of labeled clients and the\ndegree of heterogeneity to demonstrate the effectiveness of the proposed method\nunder varied experimental settings.\n","authors":["Pramit Saha","Divyanshu Mishra","J. Alison Noble"],"pdf_url":"https://arxiv.org/pdf/2310.18815v1.pdf","comment":"Published in MICCAI 2023 with early acceptance and selected as 1 of\n the top 20 poster highlights under the category: Which work has the potential\n to impact other applications of AI and CV"},{"id":"http://arxiv.org/abs/2310.18812v1","updated":"2023-10-28T20:30:59Z","published":"2023-10-28T20:30:59Z","title":"UniCat: Crafting a Stronger Fusion Baseline for Multimodal\n Re-Identification","summary":" Multimodal Re-Identification (ReID) is a popular retrieval task that aims to\nre-identify objects across diverse data streams, prompting many researchers to\nintegrate multiple modalities into a unified representation. While such fusion\npromises a holistic view, our investigations shed light on potential pitfalls.\nWe uncover that prevailing late-fusion techniques often produce suboptimal\nlatent representations when compared to methods that train modalities in\nisolation. We argue that this effect is largely due to the inadvertent\nrelaxation of the training objectives on individual modalities when using\nfusion, what others have termed modality laziness. We present a nuanced\npoint-of-view that this relaxation can lead to certain modalities failing to\nfully harness available task-relevant information, and yet, offers a protective\nveil to noisy modalities, preventing them from overfitting to task-irrelevant\ndata. Our findings also show that unimodal concatenation (UniCat) and other\nlate-fusion ensembling of unimodal backbones, when paired with best-known\ntraining techniques, exceed the current state-of-the-art performance across\nseveral multimodal ReID benchmarks. 
By unveiling the double-edged sword of\n\"modality laziness\", we motivate future research in balancing local modality\nstrengths with global representations.\n","authors":["Jennifer Crawford","Haoli Yin","Luke McDermott","Daniel Cummings"],"pdf_url":"https://arxiv.org/pdf/2310.18812v1.pdf","comment":"Accepted NeurIPS 2023 UniReps, 9 pages, 4 tables"},{"id":"http://arxiv.org/abs/2310.18807v1","updated":"2023-10-28T20:12:58Z","published":"2023-10-28T20:12:58Z","title":"OC-NMN: Object-centric Compositional Neural Module Network for\n Generative Visual Analogical Reasoning","summary":" A key aspect of human intelligence is the ability to imagine -- composing\nlearned concepts in novel ways -- to make sense of new scenarios. Such capacity\nis not yet attained for machine learning systems. In this work, in the context\nof visual reasoning, we show how modularity can be leveraged to derive a\ncompositional data augmentation framework inspired by imagination. Our method,\ndenoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes\nvisual generative reasoning tasks into a series of primitives applied to\nobjects without using a domain-specific language. We show that our modular\narchitectural choices can be used to generate new training tasks that lead to\nbetter out-of-distribution generalization. We compare our model to existing and\nnew baselines in proposed visual reasoning benchmark that consists of applying\narithmetic operations to MNIST digits.\n","authors":["Rim Assouel","Pau Rodriguez","Perouz Taslakian","David Vazquez","Yoshua Bengio"],"pdf_url":"https://arxiv.org/pdf/2310.18807v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18804v1","updated":"2023-10-28T20:09:29Z","published":"2023-10-28T20:09:29Z","title":"Open Visual Knowledge Extraction via Relation-Oriented Multimodality\n Model Prompting","summary":" Images contain rich relational knowledge that can help machines understand\nthe world. 
Existing methods on visual knowledge extraction often rely on the\npre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation\ntypes), restricting the expressiveness of the extracted knowledge. In this\nwork, we take a first exploration to a new paradigm of open visual knowledge\nextraction. To achieve this, we present OpenVik which consists of an open\nrelational region detector to detect regions potentially containing relational\nknowledge and a visual knowledge generator that generates format-free knowledge\nby prompting the large multimodality model with the detected region of\ninterest. We also explore two data enhancement techniques for diversifying the\ngenerated format-free visual knowledge. Extensive knowledge quality evaluations\nhighlight the correctness and uniqueness of the extracted open visual knowledge\nby OpenVik. Moreover, integrating our extracted knowledge across various visual\nreasoning applications shows consistent improvements, indicating the real-world\napplicability of OpenVik.\n","authors":["Hejie Cui","Xinyu Fang","Zihan Zhang","Ran Xu","Xuan Kan","Xin Liu","Yue Yu","Manling Li","Yangqiu Song","Carl Yang"],"pdf_url":"https://arxiv.org/pdf/2310.18804v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18795v1","updated":"2023-10-28T19:49:43Z","published":"2023-10-28T19:49:43Z","title":"A Review on the Applications of Machine Learning for Tinnitus Diagnosis\n Using EEG Signals","summary":" Tinnitus is a prevalent hearing disorder that can be caused by various\nfactors such as age, hearing loss, exposure to loud noises, ear infections or\ntumors, certain medications, head or neck injuries, and psychological\nconditions like anxiety and depression. While not every patient requires\nmedical attention, about 20% of sufferers seek clinical intervention. Early\ndiagnosis is crucial for effective treatment. New developments have been made\nin tinnitus detection to aid in early detection of this illness. 
Over the past\nfew years, there has been a notable growth in the usage of\nelectroencephalography (EEG) to study variations in oscillatory brain activity\nrelated to tinnitus. However, the results obtained from numerous studies vary\ngreatly, leading to conflicting conclusions. Currently, clinicians rely solely\non their expertise to identify individuals with tinnitus. Researchers in this\nfield have incorporated various data modalities and machine-learning techniques\nto aid clinicians in identifying tinnitus characteristics and classifying\npeople with tinnitus. The purpose of writing this article is to review articles\nthat focus on using machine learning (ML) to identify or predict tinnitus\npatients using EEG signals as input data. We have evaluated 11 articles\npublished between 2016 and 2023 using a systematic literature review (SLR)\nmethod. This article arranges perfect summaries of all the research reviewed\nand compares the significant aspects of each. Additionally, we performed\nstatistical analyses to gain a deeper comprehension of the most recent research\nin this area. Almost all of the reviewed articles followed a five-step\nprocedure to achieve the goal of tinnitus. Disclosure. 
Finally, we discuss the\nopen affairs and challenges in this method of tinnitus recognition or\nprediction and suggest future directions for research.\n","authors":["Farzaneh Ramezani","Hamidreza Bolhasani"],"pdf_url":"https://arxiv.org/pdf/2310.18795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.05994v2","updated":"2023-10-28T19:37:02Z","published":"2023-09-12T06:49:56Z","title":"ATTA: Anomaly-aware Test-Time Adaptation for Out-of-Distribution\n Detection in Segmentation","summary":" Recent advancements in dense out-of-distribution (OOD) detection have\nprimarily focused on scenarios where the training and testing datasets share a\nsimilar domain, with the assumption that no domain shift exists between them.\nHowever, in real-world situations, domain shift often exits and significantly\naffects the accuracy of existing out-of-distribution (OOD) detection models. In\nthis work, we propose a dual-level OOD detection framework to handle domain\nshift and semantic shift jointly. The first level distinguishes whether domain\nshift exists in the image by leveraging global low-level features, while the\nsecond level identifies pixels with semantic shift by utilizing dense\nhigh-level feature maps. 
In this way, we can selectively adapt the model to\nunseen domains as well as enhance model's capacity in detecting novel classes.\nWe validate the efficacy of our proposed method on several OOD segmentation\nbenchmarks, including those with significant domain shifts and those without,\nobserving consistent performance improvements across various baseline models.\nCode is available at\n${\\href{https://github.com/gaozhitong/ATTA}{https://github.com/gaozhitong/ATTA}}$.\n","authors":["Zhitong Gao","Shipeng Yan","Xuming He"],"pdf_url":"https://arxiv.org/pdf/2309.05994v2.pdf","comment":"Published in NeurIPS 2023"},{"id":"http://arxiv.org/abs/2302.01381v2","updated":"2023-10-28T19:26:10Z","published":"2023-02-02T19:28:41Z","title":"Effective Robustness against Natural Distribution Shifts for Models with\n Different Training Data","summary":" \"Effective robustness\" measures the extra out-of-distribution (OOD)\nrobustness beyond what can be predicted from the in-distribution (ID)\nperformance. Existing effective robustness evaluations typically use a single\ntest set such as ImageNet to evaluate the ID accuracy. This becomes problematic\nwhen evaluating models trained on different data distributions, e.g., comparing\nmodels trained on ImageNet vs. zero-shot language-image pre-trained models\ntrained on LAION. In this paper, we propose a new evaluation metric to evaluate\nand compare the effective robustness of models trained on different data. To do\nthis, we control for the accuracy on multiple ID test sets that cover the\ntraining distributions for all the evaluated models. Our new evaluation metric\nprovides a better estimate of effective robustness when there are models with\ndifferent training data. It may also explain the surprising effective\nrobustness gains of zero-shot CLIP-like models exhibited in prior works that\nused ImageNet as the only ID test set, while the gains diminish under our new\nevaluation. 
Additional artifacts including interactive visualizations are\nprovided at https://shizhouxing.github.io/effective-robustness.\n","authors":["Zhouxing Shi","Nicholas Carlini","Ananth Balashankar","Ludwig Schmidt","Cho-Jui Hsieh","Alex Beutel","Yao Qin"],"pdf_url":"https://arxiv.org/pdf/2302.01381v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18788v1","updated":"2023-10-28T19:25:01Z","published":"2023-10-28T19:25:01Z","title":"PrObeD: Proactive Object Detection Wrapper","summary":" Previous research in $2D$ object detection focuses on various tasks,\nincluding detecting objects in generic and camouflaged images. These works are\nregarded as passive works for object detection as they take the input image as\nis. However, convergence to global minima is not guaranteed to be optimal in\nneural networks; therefore, we argue that the trained weights in the object\ndetector are not optimal. To rectify this problem, we propose a wrapper based\non proactive schemes, PrObeD, which enhances the performance of these object\ndetectors by learning a signal. PrObeD consists of an encoder-decoder\narchitecture, where the encoder network generates an image-dependent signal\ntermed templates to encrypt the input images, and the decoder recovers this\ntemplate from the encrypted images. We propose that learning the optimum\ntemplate results in an object detector with an improved detection performance.\nThe template acts as a mask to the input images to highlight semantics useful\nfor the object detector. Finetuning the object detector with these encrypted\nimages enhances the detection performance for both generic and camouflaged. Our\nexperiments on MS-COCO, CAMO, COD$10$K, and NC$4$K datasets show improvement\nover different detectors after applying PrObeD. 
Our models/codes are available\nat https://github.com/vishal3477/Proactive-Object-Detection.\n","authors":["Vishal Asnani","Abhinav Kumar","Suya You","Xiaoming Liu"],"pdf_url":"https://arxiv.org/pdf/2310.18788v1.pdf","comment":"Accepted at Neurips 2023"},{"id":"http://arxiv.org/abs/2206.12809v2","updated":"2023-10-28T19:12:15Z","published":"2022-06-26T07:15:48Z","title":"A Comparison of AIS, X-Band Marine Radar Systems and Camera Surveillance\n Systems in the Collection of Tracking Data","summary":" Maritime traffic has increased in recent years, especially in terms of\nseaborne trade. To ensure safety, security, and protection of the marine\nenvironment, several systems have been deployed. To overcome some of their\ninconveniences, the collected data is typically fused. The fused data is used\nfor various purposes, one of our interest is target tracking. The most relevant\nsystems in that context are AIS and X-band marine radar. Many works consider\nthat visual data provided by camera surveillance systems enable additional\nadvantages. Therefore, many tracking algorithms using visual data (images) have\nbeen developed. Yet, there is little emphasis on the reasons making the\nintegration of camera systems important. Thus, our main aim in this paper is to\nanalyze the aforementioned surveillance systems for target tracking and\nconclude some of the maritime security improvements resulted from the\nintegration of cameras to the overall maritime surveillance system.\n","authors":["Yassir Zardoua","Abdelali Astito","Mohammed Boulaala"],"pdf_url":"https://arxiv.org/pdf/2206.12809v2.pdf","comment":"The journal that published this paper is no longer online. We\n discovered it was a predatory journal. Withdrawing this paper will allow us\n to publish it elsewhere. 
We enhanced this paper and will provide its full\n text as a replacement"},{"id":"http://arxiv.org/abs/2306.17842v3","updated":"2023-10-28T18:09:46Z","published":"2023-06-30T17:59:07Z","title":"SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen\n LLMs","summary":" In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling\nfrozen LLMs to perform both understanding and generation tasks involving\nnon-linguistic modalities such as images or videos. SPAE converts between raw\npixels and interpretable lexical tokens (or words) extracted from the LLM's\nvocabulary. The resulting tokens capture both the semantic meaning and the\nfine-grained details needed for visual reconstruction, effectively translating\nthe visual content into a language comprehensible to the LLM, and empowering it\nto perform a wide array of multimodal tasks. Our approach is validated through\nin-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set\nof image understanding and generation tasks. Our method marks the first\nsuccessful attempt to enable a frozen LLM to generate image content while\nsurpassing state-of-the-art performance in image understanding tasks, under the\nsame setting, by over 25%.\n","authors":["Lijun Yu","Yong Cheng","Zhiruo Wang","Vivek Kumar","Wolfgang Macherey","Yanping Huang","David A. Ross","Irfan Essa","Yonatan Bisk","Ming-Hsuan Yang","Kevin Murphy","Alexander G. Hauptmann","Lu Jiang"],"pdf_url":"https://arxiv.org/pdf/2306.17842v3.pdf","comment":"NeurIPS 2023 spotlight"},{"id":"http://arxiv.org/abs/2310.18773v1","updated":"2023-10-28T18:05:32Z","published":"2023-10-28T18:05:32Z","title":"CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale\n Point Cloud Data","summary":" City-scale 3D point cloud is a promising way to express detailed and\ncomplicated outdoor structures. 
It encompasses both the appearance and geometry\nfeatures of segmented city components, including cars, streets, and buildings,\nthat can be utilized for attractive applications such as user-interactive\nnavigation of autonomous vehicles and drones. However, compared to the\nextensive text annotations available for images and indoor scenes, the scarcity\nof text annotations for outdoor scenes poses a significant challenge for\nachieving these applications. To tackle this problem, we introduce the\nCityRefer dataset for city-level visual grounding. The dataset consists of 35k\nnatural language descriptions of 3D objects appearing in SensatUrban city\nscenes and 5k landmarks labels synchronizing with OpenStreetMap. To ensure the\nquality and accuracy of the dataset, all descriptions and labels in the\nCityRefer dataset are manually verified. We also have developed a baseline\nsystem that can learn encoded language descriptions, 3D object instances, and\ngeographical information about the city's landmarks to perform visual grounding\non the CityRefer dataset. To the best of our knowledge, the CityRefer dataset\nis the largest city-level visual grounding dataset for localizing specific 3D\nobjects.\n","authors":["Taiki Miyanishi","Fumiya Kitamori","Shuhei Kurita","Jungdae Lee","Motoaki Kawanabe","Nakamasa Inoue"],"pdf_url":"https://arxiv.org/pdf/2310.18773v1.pdf","comment":"NeurIPS D&B 2023. The first two authors are equally contributed"},{"id":"http://arxiv.org/abs/2303.15149v2","updated":"2023-10-28T17:58:15Z","published":"2023-03-27T12:33:23Z","title":"What Can Human Sketches Do for Object Detection?","summary":" Sketches are highly expressive, inherently capturing subjective and\nfine-grained visual cues. The exploration of such innate properties of human\nsketches has, however, been limited to that of image retrieval. In this paper,\nfor the first time, we cultivate the expressiveness of sketches but for the\nfundamental vision task of object detection. 
The end result is a sketch-enabled\nobject detection framework that detects based on what \\textit{you} sketch --\n\\textit{that} ``zebra'' (e.g., one that is eating the grass) in a herd of\nzebras (instance-aware detection), and only the \\textit{part} (e.g., ``head\" of\na ``zebra\") that you desire (part-aware detection). We further dictate that our\nmodel works without (i) knowing which category to expect at testing (zero-shot)\nand (ii) not requiring additional bounding boxes (as per fully supervised) and\nclass labels (as per weakly supervised). Instead of devising a model from the\nground up, we show an intuitive synergy between foundation models (e.g., CLIP)\nand existing sketch models build for sketch-based image retrieval (SBIR), which\ncan already elegantly solve the task -- CLIP to provide model generalisation,\nand SBIR to bridge the (sketch$\\rightarrow$photo) gap. In particular, we first\nperform independent prompting on both sketch and photo branches of an SBIR\nmodel to build highly generalisable sketch and photo encoders on the back of\nthe generalisation ability of CLIP. We then devise a training paradigm to adapt\nthe learned encoders for object detection, such that the region embeddings of\ndetected boxes are aligned with the sketch and photo embeddings from SBIR.\nEvaluating our framework on standard object detection datasets like PASCAL-VOC\nand MS-COCO outperforms both supervised (SOD) and weakly-supervised object\ndetectors (WSOD) on zero-shot setups. Project Page:\n\\url{https://pinakinathc.github.io/sketch-detect}\n","authors":["Pinaki Nath Chowdhury","Ayan Kumar Bhunia","Aneeshan Sain","Subhadeep Koley","Tao Xiang","Yi-Zhe Song"],"pdf_url":"https://arxiv.org/pdf/2303.15149v2.pdf","comment":"Best Paper Finalist (Top 12 Best Papers). Presented in special\n single-track plenary sessions to all attendees in Computer Vision and Pattern\n Recognition (CVPR), 2023. Updated an error in Fig.3 (from Softmax to Cross\n Entropy). 
Thanks to the community for pointing it out"},{"id":"http://arxiv.org/abs/2310.16831v2","updated":"2023-10-28T16:50:41Z","published":"2023-10-25T17:59:01Z","title":"PERF: Panoramic Neural Radiance Field from a Single Panorama","summary":" Neural Radiance Field (NeRF) has achieved substantial progress in novel view\nsynthesis given multi-view images. Recently, some works have attempted to train\na NeRF from a single image with 3D priors. They mainly focus on a limited field\nof view with a few occlusions, which greatly limits their scalability to\nreal-world 360-degree panoramic scenarios with large-size occlusions. In this\npaper, we present PERF, a 360-degree novel view synthesis framework that trains\na panoramic neural radiance field from a single panorama. Notably, PERF allows\n3D roaming in a complex scene without expensive and tedious image collection.\nTo achieve this goal, we propose a novel collaborative RGBD inpainting method\nand a progressive inpainting-and-erasing method to lift up a 360-degree 2D\nscene to a 3D scene. Specifically, we first predict a panoramic depth map as\ninitialization given a single panorama and reconstruct visible 3D regions with\nvolume rendering. Then we introduce a collaborative RGBD inpainting approach\ninto a NeRF for completing RGB images and depth maps from random views, which\nis derived from an RGB Stable Diffusion model and a monocular depth estimator.\nFinally, we introduce an inpainting-and-erasing strategy to avoid inconsistent\ngeometry between a newly-sampled view and reference views. The two components\nare integrated into the learning of NeRFs in a unified optimization framework\nand achieve promising results. Extensive experiments on Replica and a new\ndataset PERF-in-the-wild demonstrate the superiority of our PERF over\nstate-of-the-art methods. Our PERF can be widely used for real-world\napplications, such as panorama-to-3D, text-to-3D, and 3D scene stylization\napplications. 
Project page and code are available at\nhttps://perf-project.github.io/ and https://github.com/perf-project/PeRF.\n","authors":["Guangcong Wang","Peng Wang","Zhaoxi Chen","Wenping Wang","Chen Change Loy","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2310.16831v2.pdf","comment":"Project Page: https://perf-project.github.io/ , Code:\n https://github.com/perf-project/PeRF"},{"id":"http://arxiv.org/abs/2303.01870v2","updated":"2023-10-28T16:27:56Z","published":"2023-03-03T11:53:01Z","title":"Revisiting Adversarial Training for ImageNet: Architectures, Training\n and Generalization across Threat Models","summary":" While adversarial training has been extensively studied for ResNet\narchitectures and low resolution datasets like CIFAR, much less is known for\nImageNet. Given the recent debate about whether transformers are more robust\nthan convnets, we revisit adversarial training on ImageNet comparing ViTs and\nConvNeXts. Extensive experiments show that minor changes in architecture, most\nnotably replacing PatchStem with ConvStem, and training scheme have a\nsignificant impact on the achieved robustness. These changes not only increase\nrobustness in the seen $\\ell_\\infty$-threat model, but even more so improve\ngeneralization to unseen $\\ell_1/\\ell_2$-attacks. 
Our modified ConvNeXt,\nConvNeXt + ConvStem, yields the most robust $\\ell_\\infty$-models across\ndifferent ranges of model parameters and FLOPs, while our ViT + ConvStem yields\nthe best generalization to unseen threat models.\n","authors":["Naman D Singh","Francesco Croce","Matthias Hein"],"pdf_url":"https://arxiv.org/pdf/2303.01870v2.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.17133v2","updated":"2023-10-28T16:03:35Z","published":"2023-09-29T10:54:10Z","title":"Fine-grained Late-interaction Multi-modal Retrieval for Retrieval\n Augmented Visual Question Answering","summary":" Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to\nutilize knowledge from external knowledge bases to answer visually-grounded\nquestions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong\nframework to tackle KB-VQA, first retrieves related documents with Dense\nPassage Retrieval (DPR) and then uses them to answer questions. This paper\nproposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which\nsignificantly improves knowledge retrieval in RA-VQA. FLMR addresses two major\nlimitations in RA-VQA's retriever: (1) the image representations obtained via\nimage-to-text transforms can be incomplete and inaccurate and (2) relevance\nscores between queries and documents are computed with one-dimensional\nembeddings, which can be insensitive to finer-grained relevance. FLMR overcomes\nthese limitations by obtaining image representations that complement those from\nthe image-to-text transforms using a vision model aligned with an existing\ntext-based retriever through a simple alignment network. FLMR also encodes\nimages and questions using multi-dimensional embeddings to capture\nfiner-grained relevance between queries and documents. 
FLMR significantly\nimproves the original RA-VQA retriever's PRRecall@5 by approximately 8\\%.\nFinally, we equipped RA-VQA with two state-of-the-art large\nmulti-modal/language models to achieve $\\sim61\\%$ VQA score in the OK-VQA\ndataset.\n","authors":["Weizhe Lin","Jinghong Chen","Jingbiao Mei","Alexandru Coca","Bill Byrne"],"pdf_url":"https://arxiv.org/pdf/2309.17133v2.pdf","comment":"To appear at NeurIPS 2023. This is the camera-ready version. We fixed\n some numbers and added more experiments to address reviewers' comments"},{"id":"http://arxiv.org/abs/2309.10987v2","updated":"2023-10-28T15:51:44Z","published":"2023-09-20T01:04:57Z","title":"SpikingNeRF: Making Bio-inspired Neural Networks See through the Real\n World","summary":" Spiking neural networks (SNNs) have been thriving on numerous tasks to\nleverage their promising energy efficiency and exploit their potentialities as\nbiologically plausible intelligence. Meanwhile, the Neural Radiance Fields\n(NeRF) render high-quality 3D scenes with massive energy consumption, but few\nworks delve into the energy-saving solution with a bio-inspired approach. In\nthis paper, we propose SpikingNeRF, which aligns the radiance ray with the\ntemporal dimension of SNN, to naturally accommodate the SNN to the\nreconstruction of Radiance Fields. Thus, the computation turns into a\nspike-based, multiplication-free manner, reducing the energy consumption. In\nSpikingNeRF, each sampled point on the ray is matched onto a particular time\nstep, and represented in a hybrid manner where the voxel grids are maintained\nas well. Based on the voxel grids, sampled points are determined whether to be\nmasked for better training and inference. However, this operation also incurs\nirregular temporal length. 
We propose the temporal padding strategy to tackle\nthe masked samples to maintain regular temporal length, i.e., regular tensors,\nand the temporal condensing strategy to form a denser data structure for\nhardware-friendly computation. Extensive experiments on various datasets\ndemonstrate that our method reduces the 70.79\\% energy consumption on average\nand obtains comparable synthesis quality with the ANN baseline.\n","authors":["Xingting Yao","Qinghao Hu","Tielong Liu","Zitao Mo","Zeyu Zhu","Zhengyang Zhuge","Jian Cheng"],"pdf_url":"https://arxiv.org/pdf/2309.10987v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18737v1","updated":"2023-10-28T15:42:07Z","published":"2023-10-28T15:42:07Z","title":"Pre-training with Random Orthogonal Projection Image Modeling","summary":" Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual\npre-training without the use of labels. MIM applies random crops to input\nimages, processes them with an encoder, and then recovers the masked inputs\nwith a decoder, which encourages the network to capture and learn structural\ninformation about objects and scenes. The intermediate feature representations\nobtained from MIM are suitable for fine-tuning on downstream tasks. In this\npaper, we propose an Image Modeling framework based on random orthogonal\nprojection instead of binary masking as in MIM. Our proposed Random Orthogonal\nProjection Image Modeling (ROPIM) reduces spatially-wise token information\nunder guaranteed bound on the noise variance and can be considered as masking\nentire spatial image area under locally varying masking degrees. Since ROPIM\nuses a random subspace for the projection that realizes the masking step, the\nreadily available complement of the subspace can be used during unmasking to\npromote recovery of removed information. In this paper, we show that using\nrandom orthogonal projection leads to superior performance compared to\ncrop-based masking. 
We demonstrate state-of-the-art results on several popular\nbenchmarks.\n","authors":["Maryam Haghighat","Peyman Moghadam","Shaheer Mohamed","Piotr Koniusz"],"pdf_url":"https://arxiv.org/pdf/2310.18737v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18728v1","updated":"2023-10-28T15:14:43Z","published":"2023-10-28T15:14:43Z","title":"Online Multi-view Anomaly Detection with Disentangled Product-of-Experts\n Modeling","summary":" Multi-view or even multi-modal data is appealing yet challenging for\nreal-world applications. Detecting anomalies in multi-view data is a prominent\nrecent research topic. However, most of the existing methods 1) are only\nsuitable for two views or type-specific anomalies, 2) suffer from the issue of\nfusion disentanglement, and 3) do not support online detection after model\ndeployment. To address these challenges, our main ideas in this paper are\nthree-fold: multi-view learning, disentangled representation learning, and\ngenerative model. To this end, we propose dPoE, a novel multi-view variational\nautoencoder model that involves (1) a Product-of-Experts (PoE) layer in\ntackling multi-view data, (2) a Total Correction (TC) discriminator in\ndisentangling view-common and view-specific representations, and (3) a joint\nloss function in wrapping up all components. In addition, we devise theoretical\ninformation bounds to control both view-common and view-specific\nrepresentations. 
Extensive experiments on six real-world datasets demonstrate\nthat the proposed dPoE outperforms baselines markedly.\n","authors":["Hao Wang","Zhi-Qi Cheng","Jingdong Sun","Xin Yang","Xiao Wu","Hongyang Chen","Yan Yang"],"pdf_url":"https://arxiv.org/pdf/2310.18728v1.pdf","comment":"Accepted by ACM Multimedia 2023, 10 pages, 5 tables, and 3 figures"},{"id":"http://arxiv.org/abs/2310.18709v1","updated":"2023-10-28T13:37:52Z","published":"2023-10-28T13:37:52Z","title":"Audio-Visual Instance Segmentation","summary":" In this paper, we propose a new multi-modal task, namely audio-visual\ninstance segmentation (AVIS), in which the goal is to identify, segment, and\ntrack individual sounding object instances in audible videos, simultaneously.\nTo our knowledge, it is the first time that instance segmentation has been\nextended into the audio-visual domain. To better facilitate this research, we\nconstruct the first audio-visual instance segmentation benchmark (AVISeg).\nSpecifically, AVISeg consists of 1,258 videos with an average duration of 62.6\nseconds from YouTube and public audio-visual datasets, where 117 videos have\nbeen annotated by using an interactive semi-automatic labeling tool based on\nthe Segment Anything Model (SAM). In addition, we present a simple baseline\nmodel for the AVIS task. Our new model introduces an audio branch and a\ncross-modal fusion module to Mask2Former to locate all sounding objects.\nFinally, we evaluate the proposed method using two backbones on AVISeg. 
We\nbelieve that AVIS will inspire the community towards a more comprehensive\nmulti-modal understanding.\n","authors":["Ruohao Guo","Yaru Chen","Yanyu Qi","Wenzhen Yue","Dantong Niu","Xianghua Ying"],"pdf_url":"https://arxiv.org/pdf/2310.18709v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.13092v2","updated":"2023-10-28T13:21:49Z","published":"2023-06-22T17:59:58Z","title":"Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale\n From A New Perspective","summary":" We present a new dataset condensation framework termed Squeeze, Recover and\nRelabel (SRe$^2$L) that decouples the bilevel optimization of model and\nsynthetic data during training, to handle varying scales of datasets, model\narchitectures and image resolutions for efficient dataset condensation. The\nproposed method demonstrates flexibility across diverse dataset scales and\nexhibits multiple advantages in terms of arbitrary resolutions of synthesized\nimages, low training cost and memory consumption with high-resolution\nsynthesis, and the ability to scale up to arbitrary evaluation network\narchitectures. Extensive experiments are conducted on Tiny-ImageNet and full\nImageNet-1K datasets. Under 50 IPC, our approach achieves the highest 42.5% and\n60.8% validation accuracy on Tiny-ImageNet and ImageNet-1K, outperforming all\nprevious state-of-the-art methods by margins of 14.5% and 32.9%, respectively.\nOur approach also surpasses MTT in terms of speed by approximately 52$\\times$\n(ConvNet-4) and 16$\\times$ (ResNet-18) faster with less memory consumption of\n11.6$\\times$ and 6.4$\\times$ during data synthesis. Our code and condensed\ndatasets of 50, 200 IPC with 4K recovery budget are available at\nhttps://github.com/VILA-Lab/SRe2L.\n","authors":["Zeyuan Yin","Eric Xing","Zhiqiang Shen"],"pdf_url":"https://arxiv.org/pdf/2306.13092v2.pdf","comment":"NeurIPS 2023 spotlight. 
Code at https://github.com/VILA-Lab/SRe2L"},{"id":"http://arxiv.org/abs/2305.08381v3","updated":"2023-10-28T13:17:38Z","published":"2023-05-15T06:40:56Z","title":"Parameter-efficient Tuning of Large-scale Multimodal Foundation Model","summary":" Driven by the progress of large-scale pre-training, parameter-efficient\ntransfer learning has gained immense popularity across different subfields of\nArtificial Intelligence. The core is to adapt the model to downstream tasks\nwith only a small set of parameters. Recently, researchers have leveraged such\nproven techniques in multimodal tasks and achieve promising results. However,\ntwo critical issues remain unresolved: how to further reduce the complexity\nwith lightweight design and how to boost alignment between modalities under\nextremely low parameters. In this paper, we propose A graceful prompt framework\nfor cross-modal transfer (Aurora) to overcome these challenges. Considering the\nredundancy in existing architectures, we first utilize the mode approximation\nto generate 0.1M trainable parameters to implement the multimodal prompt\ntuning, which explores the low intrinsic dimension with only 0.04% parameters\nof the pre-trained model. Then, for better modality alignment, we propose the\nInformative Context Enhancement and Gated Query Transformation module under\nextremely few parameters scenes. A thorough evaluation on six cross-modal\nbenchmarks shows that it not only outperforms the state-of-the-art but even\noutperforms the full fine-tuning approach. 
Our code is available at:\nhttps://github.com/WillDreamer/Aurora.\n","authors":["Haixin Wang","Xinlong Yang","Jianlong Chang","Dian Jin","Jinan Sun","Shikun Zhang","Xiao Luo","Qi Tian"],"pdf_url":"https://arxiv.org/pdf/2305.08381v3.pdf","comment":"Accepted by NeurIPS2023"},{"id":"http://arxiv.org/abs/2310.18698v1","updated":"2023-10-28T12:49:33Z","published":"2023-10-28T12:49:33Z","title":"Triplet Attention Transformer for Spatiotemporal Predictive Learning","summary":" Spatiotemporal predictive learning offers a self-supervised learning paradigm\nthat enables models to learn both spatial and temporal patterns by predicting\nfuture sequences based on historical sequences. Mainstream methods are\ndominated by recurrent units, yet they are limited by their lack of\nparallelization and often underperform in real-world scenarios. To improve\nprediction quality while maintaining computational efficiency, we propose an\ninnovative triplet attention transformer designed to capture both inter-frame\ndynamics and intra-frame static features. Specifically, the model incorporates\nthe Triplet Attention Module (TAM), which replaces traditional recurrent units\nby exploring self-attention mechanisms in temporal, spatial, and channel\ndimensions. In this configuration: (i) temporal tokens contain abstract\nrepresentations of inter-frame, facilitating the capture of inherent temporal\ndependencies; (ii) spatial and channel attention combine to refine the\nintra-frame representation by performing fine-grained interactions across\nspatial and channel dimensions. Alternating temporal, spatial, and\nchannel-level attention allows our approach to learn more complex short- and\nlong-range spatiotemporal dependencies. 
Extensive experiments demonstrate\nperformance surpassing existing recurrent-based and recurrent-free methods,\nachieving state-of-the-art under multi-scenario examination including moving\nobject trajectory prediction, traffic flow prediction, driving scene\nprediction, and human motion capture.\n","authors":["Xuesong Nie","Xi Chen","Haoyuan Jin","Zhihang Zhu","Yunfeng Yan","Donglian Qi"],"pdf_url":"https://arxiv.org/pdf/2310.18698v1.pdf","comment":"Accepted to WACV 2024"},{"id":"http://arxiv.org/abs/2106.14186v3","updated":"2023-10-28T12:17:10Z","published":"2021-06-27T10:22:33Z","title":"An XAI Approach to Deep Learning Models in the Detection of DCIS","summary":" The results showed that XAI could indeed be used as a proof of concept to\nbegin discussions on the implementation of assistive AI systems within the\nclinical community.\n","authors":["Michele La Ferla","Matthew Montebello","Dylan Seychell"],"pdf_url":"https://arxiv.org/pdf/2106.14186v3.pdf","comment":"12 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.18689v1","updated":"2023-10-28T12:08:12Z","published":"2023-10-28T12:08:12Z","title":"Foundational Models in Medical Imaging: A Comprehensive Survey and\n Future Vision","summary":" Foundation models, large-scale, pre-trained deep-learning models adapted to a\nwide range of downstream tasks have gained significant interest lately in\nvarious deep-learning problems undergoing a paradigm shift with the rise of\nthese models. Trained on large-scale dataset to bridge the gap between\ndifferent modalities, foundation models facilitate contextual reasoning,\ngeneralization, and prompt capabilities at test time. The predictions of these\nmodels can be adjusted for new tasks by augmenting the model input with\ntask-specific hints called prompts without requiring extensive labeled data and\nretraining. Capitalizing on the advances in computer vision, medical imaging\nhas also marked a growing interest in these models. 
To assist researchers in\nnavigating this direction, this survey intends to provide a comprehensive\noverview of foundation models in the domain of medical imaging. Specifically,\nwe initiate our exploration by providing an exposition of the fundamental\nconcepts forming the basis of foundation models. Subsequently, we offer a\nmethodical taxonomy of foundation models within the medical domain, proposing a\nclassification system primarily structured around training strategies, while\nalso incorporating additional facets such as application domains, imaging\nmodalities, specific organs of interest, and the algorithms integral to these\nmodels. Furthermore, we emphasize the practical use case of some selected\napproaches and then discuss the opportunities, applications, and future\ndirections of these large-scale pre-trained models, for analyzing medical\nimages. In the same vein, we address the prevailing challenges and research\npathways associated with foundational models in medical imaging. These\nencompass the areas of interpretability, data management, computational\nrequirements, and the nuanced issue of contextual comprehension.\n","authors":["Bobby Azad","Reza Azad","Sania Eskandari","Afshin Bozorgpour","Amirhossein Kazerouni","Islem Rekik","Dorit Merhof"],"pdf_url":"https://arxiv.org/pdf/2310.18689v1.pdf","comment":"The paper is currently in the process of being prepared for\n submission to MIA"},{"id":"http://arxiv.org/abs/2305.08420v2","updated":"2023-10-28T12:00:08Z","published":"2023-05-15T08:01:05Z","title":"RelaMiX: Exploring Few-Shot Adaptation in Video-based Action Recognition","summary":" Domain adaptation is essential for activity recognition to ensure accurate\nand robust performance across diverse environments, sensor types, and data\nsources. Unsupervised domain adaptation methods have been extensively studied,\nyet, they require large-scale unlabeled data from the target domain. 
In this\nwork, we address Few-Shot Domain Adaptation for video-based Activity\nRecognition (FSDA-AR), which leverages a very small amount of labeled target\nvideos to achieve effective adaptation. This setting is attractive and\npromising for applications, as it requires recording and labeling only a few,\nor even a single example per class in the target domain, which often includes\nactivities that are rare yet crucial to recognize. We construct FSDA-AR\nbenchmarks using five established datasets considering diverse domain types:\nUCF101, HMDB51, EPIC-KITCHEN, Sims4Action, and ToyotaSmartHome. Our results\ndemonstrate that FSDA-AR performs comparably to unsupervised domain adaptation\nwith significantly fewer (yet labeled) target domain samples. We further\npropose a novel approach, RelaMiX, to better leverage the few labeled target\ndomain samples as knowledge guidance. RelaMiX encompasses a temporal relational\nattention network with relation dropout, alongside a cross-domain information\nalignment mechanism. Furthermore, it integrates a mechanism for mixing features\nwithin a latent space by using the few-shot target domain samples. The proposed\nRelaMiX solution achieves state-of-the-art performance on all datasets within\nthe FSDA-AR benchmark. To encourage future research of few-shot domain\nadaptation for video-based activity recognition, our benchmarks and source code\nare made publicly available at https://github.com/KPeng9510/RelaMiX.\n","authors":["Kunyu Peng","Di Wen","David Schneider","Jiaming Zhang","Kailun Yang","M. 
Saquib Sarfraz","Rainer Stiefelhagen","Alina Roitberg"],"pdf_url":"https://arxiv.org/pdf/2305.08420v2.pdf","comment":"Benchmarks and source code are made publicly available at\n https://github.com/KPeng9510/RelaMiX"},{"id":"http://arxiv.org/abs/2307.00716v2","updated":"2023-10-28T11:46:07Z","published":"2023-07-03T02:39:08Z","title":"JourneyDB: A Benchmark for Generative Image Understanding","summary":" While recent advancements in vision-language models have had a transformative\nimpact on multi-modal comprehension, the extent to which these models possess\nthe ability to comprehend generated images remains uncertain. Synthetic images,\nin comparison to real data, encompass a higher level of diversity in terms of\nboth content and style, thereby presenting significant challenges for the\nmodels to fully grasp. In light of this challenge, we introduce a comprehensive\ndataset, referred to as JourneyDB, that caters to the domain of generative\nimages within the context of multi-modal visual understanding. Our meticulously\ncurated dataset comprises 4 million distinct and high-quality generated images,\neach paired with the corresponding text prompts that were employed in their\ncreation. Furthermore, we additionally introduce an external subset with\nresults of another 22 text-to-image generative models, which makes JourneyDB a\ncomprehensive benchmark for evaluating the comprehension of generated images.\nOn our dataset, we have devised four benchmarks to assess the performance of\ngenerated image comprehension in relation to both content and style\ninterpretation. These benchmarks encompass prompt inversion, style retrieval,\nimage captioning, and visual question answering. Lastly, we evaluate the\nperformance of state-of-the-art multi-modal models when applied to the\nJourneyDB dataset, providing a comprehensive analysis of their strengths and\nlimitations in comprehending generated content. 
We anticipate that the proposed\ndataset and benchmarks will facilitate further research in the field of\ngenerative content understanding. The dataset is publicly available at\nhttps://journeydb.github.io.\n","authors":["Keqiang Sun","Junting Pan","Yuying Ge","Hao Li","Haodong Duan","Xiaoshi Wu","Renrui Zhang","Aojun Zhou","Zipeng Qin","Yi Wang","Jifeng Dai","Yu Qiao","Limin Wang","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2307.00716v2.pdf","comment":"Accepted to the Thirty-seventh Conference on Neural Information\n Processing Systems (NeurIPS 2023)"},{"id":"http://arxiv.org/abs/2301.13721v3","updated":"2023-10-28T11:21:47Z","published":"2023-01-31T15:58:32Z","title":"DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models","summary":" Targeting to understand the underlying explainable factors behind\nobservations and modeling the conditional generation process on these factors,\nwe connect disentangled representation learning to Diffusion Probabilistic\nModels (DPMs) to take advantage of the remarkable modeling ability of DPMs. We\npropose a new task, disentanglement of (DPMs): given a pre-trained DPM, without\nany annotations of the factors, the task is to automatically discover the\ninherent factors behind the observations and disentangle the gradient fields of\nDPM into sub-gradient fields, each conditioned on the representation of each\ndiscovered factor. With disentangled DPMs, those inherent factors can be\nautomatically discovered, explicitly represented, and clearly injected into the\ndiffusion process via the sub-gradient fields. To tackle this task, we devise\nan unsupervised approach named DisDiff, achieving disentangled representation\nlearning in the framework of DPMs. 
Extensive experiments on synthetic and\nreal-world datasets demonstrate the effectiveness of DisDiff.\n","authors":["Tao Yang","Yuwang Wang","Yan Lv","Nanning Zheng"],"pdf_url":"https://arxiv.org/pdf/2301.13721v3.pdf","comment":"Accepted by NeurIPS 2023"},{"id":"http://arxiv.org/abs/2309.00613v2","updated":"2023-10-28T11:16:53Z","published":"2023-09-01T17:59:29Z","title":"Iterative Multi-granular Image Editing using Diffusion Models","summary":" Recent advances in text-guided image synthesis have dramatically changed how\ncreative professionals generate artistic and aesthetically pleasing visual\nassets. To fully support such creative endeavors, the process should possess\nthe ability to: 1) iteratively edit the generations and 2) control the spatial\nreach of desired changes (global, local or anything in between). We formalize\nthis pragmatic problem setting as Iterative Multi-granular Editing. While there\nhas been substantial progress with diffusion-based models for image synthesis\nand editing, they are all one shot (i.e., no iterative editing capabilities)\nand do not naturally yield multi-granular control (i.e., covering the full\nspectrum of local-to-global edits). To overcome these drawbacks, we propose\nEMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent\niteration strategy, which re-purposes a pre-trained diffusion model to\nfacilitate iterative editing. This is complemented by a gradient control\noperation for multi-granular control. We introduce a new benchmark dataset to\nevaluate our newly proposed setting. We conduct exhaustive quantitative and\nqualitative evaluation against recent state-of-the-art approaches adapted to\nour task, to bring out the mettle of EMILIE. 
We hope our work would attract\nattention to this newly identified, pragmatic problem setting.\n","authors":["K J Joseph","Prateksha Udhayanan","Tripti Shukla","Aishwarya Agarwal","Srikrishna Karanam","Koustava Goswami","Balaji Vasan Srinivasan"],"pdf_url":"https://arxiv.org/pdf/2309.00613v2.pdf","comment":"Accepted to IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV) 2024"},{"id":"http://arxiv.org/abs/2310.18676v1","updated":"2023-10-28T11:15:37Z","published":"2023-10-28T11:15:37Z","title":"Efficient Object Detection in Optical Remote Sensing Imagery via\n Attention-based Feature Distillation","summary":" Efficient object detection methods have recently received great attention in\nremote sensing. Although deep convolutional networks often have excellent\ndetection accuracy, their deployment on resource-limited edge devices is\ndifficult. Knowledge distillation (KD) is a strategy for addressing this issue\nsince it makes models lightweight while maintaining accuracy. However, existing\nKD methods for object detection have encountered two constraints. First, they\ndiscard potentially important background information and only distill nearby\nforeground regions. Second, they only rely on the global context, which limits\nthe student detector's ability to acquire local information from the teacher\ndetector. To address the aforementioned challenges, we propose Attention-based\nFeature Distillation (AFD), a new KD approach that distills both local and\nglobal information from the teacher detector. To enhance local distillation, we\nintroduce a multi-instance attention mechanism that effectively distinguishes\nbetween background and foreground elements. This approach prompts the student\ndetector to focus on the pertinent channels and pixels, as identified by the\nteacher detector. 
Local distillation lacks global information, thus attention\nglobal distillation is proposed to reconstruct the relationship between various\npixels and pass it from teacher to student detector. The performance of AFD is\nevaluated on two public aerial image benchmarks, and the evaluation results\ndemonstrate that AFD in object detection can attain the performance of other\nstate-of-the-art models while being efficient.\n","authors":["Pourya Shamsolmoali","Jocelyn Chanussot","Huiyu Zhou","Yue Lu"],"pdf_url":"https://arxiv.org/pdf/2310.18676v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.13876v2","updated":"2023-10-28T10:47:56Z","published":"2023-05-23T09:52:49Z","title":"Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans","summary":" We present a novel task for cross-dataset visual grounding in 3D scenes\n(Cross3DVG), which overcomes limitations of existing 3D visual grounding\nmodels, specifically their restricted 3D resources and consequent tendencies of\noverfitting a specific 3D dataset. We created RIORefer, a large-scale 3D visual\ngrounding dataset, to facilitate Cross3DVG. It includes more than 63k diverse\ndescriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan, with\nhuman annotations. After training the Cross3DVG model using the source 3D\nvisual grounding dataset, we evaluate it without target labels using the target\ndataset with, e.g., different sensors, 3D reconstruction methods, and language\nannotators. Comprehensive experiments are conducted using established visual\ngrounding models and with CLIP-based multi-view 2D and 3D integration designed\nto bridge gaps among 3D datasets. For Cross3DVG tasks, (i) cross-dataset 3D\nvisual grounding exhibits significantly worse performance than learning and\nevaluation with a single dataset because of the 3D data and language variants\nacross datasets. 
Moreover, (ii) better object detector and localization modules\nand fusing 3D data and multi-view CLIP-based image features can alleviate this\nlower performance. Our Cross3DVG task can provide a benchmark for developing\nrobust 3D visual grounding models to handle diverse 3D scenes while leveraging\ndeep language understanding.\n","authors":["Taiki Miyanishi","Daichi Azuma","Shuhei Kurita","Motoki Kawanabe"],"pdf_url":"https://arxiv.org/pdf/2305.13876v2.pdf","comment":"3DV2024"},{"id":"http://arxiv.org/abs/2310.18660v1","updated":"2023-10-28T10:19:55Z","published":"2023-10-28T10:19:55Z","title":"Foundation Models for Generalist Geospatial Artificial Intelligence","summary":" Significant progress in the development of highly adaptable and reusable\nArtificial Intelligence (AI) models is expected to have a significant impact on\nEarth science and remote sensing. Foundation models are pre-trained on large\nunlabeled datasets through self-supervision, and then fine-tuned for various\ndownstream tasks with small labeled datasets. This paper introduces a\nfirst-of-a-kind framework for the efficient pre-training and fine-tuning of\nfoundational models on extensive geospatial data. We have utilized this\nframework to create Prithvi, a transformer-based geospatial foundational model\npre-trained on more than 1TB of multispectral satellite imagery from the\nHarmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the\nefficacy of our framework in successfully fine-tuning Prithvi to a range of\nEarth observation tasks that have not been tackled by previous work on\nfoundation models involving multi-temporal cloud gap imputation, flood mapping,\nwildfire scar segmentation, and multi-temporal crop segmentation. Our\nexperiments show that the pre-trained model accelerates the fine-tuning process\ncompared to leveraging randomly initialized weights. 
In addition, pre-trained\nPrithvi compares well against the state-of-the-art, e.g., outperforming a\nconditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%)\nin the structural similarity index. Finally, due to the limited availability of\nlabeled data in the field of Earth observation, we gradually reduce the\nquantity of available labeled data for refining the model to evaluate data\nefficiency and demonstrate that data can be decreased significantly without\naffecting the model's accuracy. The pre-trained 100 million parameter model and\ncorresponding fine-tuning workflows have been released publicly as open source\ncontributions to the global Earth sciences community through Hugging Face.\n","authors":["Johannes Jakubik","Sujit Roy","C. E. Phillips","Paolo Fraccaro","Denys Godwin","Bianca Zadrozny","Daniela Szwarcman","Carlos Gomes","Gabby Nyirjesy","Blair Edwards","Daiki Kimura","Naomi Simumba","Linsong Chu","S. Karthik Mukkavilli","Devyani Lambhate","Kamal Das","Ranjini Bangalore","Dario Oliveira","Michal Muszynski","Kumar Ankur","Muthukumaran Ramasubramanian","Iksha Gurung","Sam Khallaghi"," Hanxi"," Li","Michael Cecil","Maryam Ahmadi","Fatemeh Kordi","Hamed Alemohammad","Manil Maskey","Raghu Ganti","Kommy Weldemariam","Rahul Ramachandran"],"pdf_url":"https://arxiv.org/pdf/2310.18660v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.13949v2","updated":"2023-10-28T10:17:19Z","published":"2023-04-27T04:07:29Z","title":"UCF: Uncovering Common Features for Generalizable Deepfake Detection","summary":" Deepfake detection remains a challenging task due to the difficulty of\ngeneralizing to new types of forgeries. This problem primarily stems from the\noverfitting of existing detection methods to forgery-irrelevant features and\nmethod-specific patterns. The latter has been rarely studied and not well\naddressed by previous works. 
This paper presents a novel approach to address\nthe two types of overfitting issues by uncovering common forgery features.\nSpecifically, we first propose a disentanglement framework that decomposes\nimage information into three distinct components: forgery-irrelevant,\nmethod-specific forgery, and common forgery features. To ensure the decoupling\nof method-specific and common forgery features, a multi-task learning strategy\nis employed, including a multi-class classification that predicts the category\nof the forgery method and a binary classification that distinguishes the real\nfrom the fake. Additionally, a conditional decoder is designed to utilize\nforgery features as a condition along with forgery-irrelevant features to\ngenerate reconstructed images. Furthermore, a contrastive regularization\ntechnique is proposed to encourage the disentanglement of the common and\nspecific forgery features. Ultimately, we only utilize the common forgery\nfeatures for the purpose of generalizable deepfake detection. Extensive\nevaluations demonstrate that our framework can perform superior generalization\nthan current state-of-the-art methods.\n","authors":["Zhiyuan Yan","Yong Zhang","Yanbo Fan","Baoyuan Wu"],"pdf_url":"https://arxiv.org/pdf/2304.13949v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.01426v2","updated":"2023-10-28T10:05:52Z","published":"2023-07-04T01:34:41Z","title":"DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection","summary":" A critical yet frequently overlooked challenge in the field of deepfake\ndetection is the lack of a standardized, unified, comprehensive benchmark. This\nissue leads to unfair performance comparisons and potentially misleading\nresults. Specifically, there is a lack of uniformity in data processing\npipelines, resulting in inconsistent data inputs for detection models.\nAdditionally, there are noticeable differences in experimental settings, and\nevaluation strategies and metrics lack standardization. 
To fill this gap, we\npresent the first comprehensive benchmark for deepfake detection, called\nDeepfakeBench, which offers three key contributions: 1) a unified data\nmanagement system to ensure consistent input across all detectors, 2) an\nintegrated framework for state-of-the-art methods implementation, and 3)\nstandardized evaluation metrics and protocols to promote transparency and\nreproducibility. Featuring an extensible, modular-based codebase, DeepfakeBench\ncontains 15 state-of-the-art detection methods, 9 deepfake datasets, a series\nof deepfake detection evaluation protocols and analysis tools, as well as\ncomprehensive evaluations. Moreover, we provide new insights based on extensive\nanalysis of these evaluations from various perspectives (e.g., data\naugmentations, backbones). We hope that our efforts could facilitate future\nresearch and foster innovation in this increasingly critical domain. All codes,\nevaluations, and analyses of our benchmark are publicly available at\nhttps://github.com/SCLBD/DeepfakeBench.\n","authors":["Zhiyuan Yan","Yong Zhang","Xinhang Yuan","Siwei Lyu","Baoyuan Wu"],"pdf_url":"https://arxiv.org/pdf/2307.01426v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18656v1","updated":"2023-10-28T09:57:28Z","published":"2023-10-28T09:57:28Z","title":"Med-DANet V2: A Flexible Dynamic Architecture for Efficient Medical\n Volumetric Segmentation","summary":" Recent works have shown that the computational efficiency of 3D medical image\n(e.g. CT and MRI) segmentation can be impressively improved by dynamic\ninference based on slice-wise complexity. As a pioneering work, a dynamic\narchitecture network for medical volumetric segmentation (i.e. Med-DANet) has\nachieved a favorable accuracy and efficiency trade-off by dynamically selecting\na suitable 2D candidate model from the pre-defined model bank for different\nslices. 
However, the issues of incomplete data analysis, high training costs,\nand the two-stage pipeline in Med-DANet require further improvement. To this\nend, this paper further explores a unified formulation of the dynamic inference\nframework from the perspective of both the data itself and the model structure.\nFor each slice of the input volume, our proposed method dynamically selects an\nimportant foreground region for segmentation based on the policy generated by\nour Decision Network and Crop Position Network. Besides, we propose to insert a\nstage-wise quantization selector to the employed segmentation model (e.g.\nU-Net) for dynamic architecture adapting. Extensive experiments on BraTS 2019\nand 2020 show that our method achieves comparable or better performance than\nprevious state-of-the-art methods with much less model complexity. Compared\nwith previous methods Med-DANet and TransBTS with dynamic and static\narchitecture respectively, our framework improves the model efficiency by up to\nnearly 4.1 and 17.3 times with comparable segmentation results on BraTS 2019.\n","authors":["Haoran Shen","Yifu Zhang","Wenxuan Wang","Chen Chen","Jing Liu","Shanshan Song","Jiangyun Li"],"pdf_url":"https://arxiv.org/pdf/2310.18656v1.pdf","comment":"Accepted by WACV 2024"},{"id":"http://arxiv.org/abs/2310.18653v1","updated":"2023-10-28T09:43:13Z","published":"2023-10-28T09:43:13Z","title":"Feature Guided Masked Autoencoder for Self-supervised Learning in Remote\n Sensing","summary":" Self-supervised learning guided by masked image modelling, such as Masked\nAutoEncoder (MAE), has attracted wide attention for pretraining vision\ntransformers in remote sensing. However, MAE tends to excessively focus on\npixel details, thereby limiting the model's capacity for semantic\nunderstanding, in particular for noisy SAR images. In this paper, we explore\nspectral and spatial remote sensing image features as improved\nMAE-reconstruction targets. 
We first conduct a study on reconstructing various\nimage features, all performing comparably well or better than raw pixels. Based\non such observations, we propose Feature Guided Masked Autoencoder (FG-MAE):\nreconstructing a combination of Histograms of Oriented Gradients (HOG) and\nNormalized Difference Indices (NDI) for multispectral images, and\nreconstructing HOG for SAR images. Experimental results on three downstream\ntasks illustrate the effectiveness of FG-MAE with a particular boost for SAR\nimagery. Furthermore, we demonstrate the well-inherited scalability of FG-MAE\nand release a first series of pretrained vision transformers for medium\nresolution SAR and multispectral images.\n","authors":["Yi Wang","Hugo Hernández Hernández","Conrad M Albrecht","Xiao Xiang Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.18653v1.pdf","comment":"13 pages, 8 figures"},{"id":"http://arxiv.org/abs/2310.18652v1","updated":"2023-10-28T09:42:04Z","published":"2023-10-28T09:42:04Z","title":"EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health\n Records with Chest X-ray Images","summary":" Electronic Health Records (EHRs), which contain patients' medical histories\nin various multi-modal formats, often overlook the potential for joint\nreasoning across imaging and table modalities underexplored in current EHR\nQuestion Answering (QA) systems. In this paper, we introduce EHRXQA, a novel\nmulti-modal question answering dataset combining structured EHRs and chest\nX-ray images. To develop our dataset, we first construct two uni-modal\nresources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual\nquestion answering (VQA) benchmark, specifically designed to augment the\nimaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of\na previously established table-based EHR QA dataset. 
By integrating these two\nuni-modal resources, we successfully construct a multi-modal EHR QA dataset\nthat necessitates both uni-modal and cross-modal reasoning. To address the\nunique challenges of multi-modal questions within EHRs, we propose a\nNeuralSQL-based strategy equipped with an external VQA API. This pioneering\nendeavor enhances engagement with multi-modal EHR sources and we believe that\nour dataset can catalyze advances in real-world medical scenarios such as\nclinical decision-making and research. EHRXQA is available at\nhttps://github.com/baeseongsu/ehrxqa.\n","authors":["Seongsu Bae","Daeun Kyung","Jaehee Ryu","Eunbyeol Cho","Gyubok Lee","Sunjun Kweon","Jungwoo Oh","Lei Ji","Eric I-Chao Chang","Tackeun Kim","Edward Choi"],"pdf_url":"https://arxiv.org/pdf/2310.18652v1.pdf","comment":"Accepted at NeurIPS 2023 Datasets and Benchmarks Track (10 pages for\n main text, 4 pages for references, 28 pages for supplementary materials)"},{"id":"http://arxiv.org/abs/2310.18651v1","updated":"2023-10-28T09:35:30Z","published":"2023-10-28T09:35:30Z","title":"Local-Global Self-Supervised Visual Representation Learning","summary":" Self-supervised representation learning methods mainly focus on image-level\ninstance discrimination. This study explores the potential benefits of\nincorporating patch-level discrimination into existing methods to enhance the\nquality of learned representations by simultaneously looking at local and\nglobal visual features. Towards this idea, we present a straightforward yet\neffective patch-matching algorithm that can find the corresponding patches\nacross the augmented views of an image. The augmented views are subsequently\nfed into a self-supervised learning framework employing Vision Transformer\n(ViT) as its backbone. The result is the generation of both image-level and\npatch-level representations. 
Leveraging the proposed patch-matching algorithm,\nthe model minimizes the representation distance between not only the CLS tokens\nbut also the corresponding patches. As a result, the model gains a more\ncomprehensive understanding of both the entirety of the image as well as its\nfiner details. We pretrain the proposed method on small, medium, and\nlarge-scale datasets. It is shown that our approach could outperform\nstate-of-the-art image-level representation learning methods on both image\nclassification and downstream tasks. Keywords: Self-Supervised Learning; Visual\nRepresentations; Local-Global Representation Learning; Patch-Wise\nRepresentation Learning; Vision Transformer (ViT)\n","authors":["Ali Javidani","Mohammad Amin Sadeghi","Babak Nadjar Araabi"],"pdf_url":"https://arxiv.org/pdf/2310.18651v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18642v1","updated":"2023-10-28T08:58:20Z","published":"2023-10-28T08:58:20Z","title":"One-shot Localization and Segmentation of Medical Images with Foundation\n Models","summary":" Recent advances in Vision Transformers (ViT) and Stable Diffusion (SD) models\nwith their ability to capture rich semantic features of the image have been\nused for image correspondence tasks on natural images. In this paper, we\nexamine the ability of a variety of pre-trained ViT (DINO, DINOv2, SAM, CLIP)\nand SD models, trained exclusively on natural images, for solving the\ncorrespondence problems on medical images. 
While many works have made a case\nfor in-domain training, we show that the models trained on natural images can\noffer good performance on medical images across different modalities\n(CT,MR,Ultrasound) sourced from various manufacturers, over multiple anatomical\nregions (brain, thorax, abdomen, extremities), and on wide variety of tasks.\nFurther, we leverage the correspondence with respect to a template image to\nprompt a Segment Anything (SAM) model to arrive at single shot segmentation,\nachieving dice range of 62%-90% across tasks, using just one image as\nreference. We also show that our single-shot method outperforms the recently\nproposed few-shot segmentation method - UniverSeg (Dice range 47%-80%) on most\nof the semantic segmentation tasks(six out of seven) across medical imaging\nmodalities.\n","authors":["Deepa Anand","Gurunath Reddy M","Vanika Singhal","Dattesh D. Shanbhag","Shriram KS","Uday Patil","Chitresh Bhushan","Kavitha Manickam","Dawei Gui","Rakesh Mullick","Avinash Gopal","Parminder Bhatia","Taha Kass-Hout"],"pdf_url":"https://arxiv.org/pdf/2310.18642v1.pdf","comment":"Accepted at NeurIPS 2023 R0-FoMo Workshop"},{"id":"http://arxiv.org/abs/2310.18640v1","updated":"2023-10-28T08:49:16Z","published":"2023-10-28T08:49:16Z","title":"Switching Temporary Teachers for Semi-Supervised Semantic Segmentation","summary":" The teacher-student framework, prevalent in semi-supervised semantic\nsegmentation, mainly employs the exponential moving average (EMA) to update a\nsingle teacher's weights based on the student's. However, EMA updates raise a\nproblem in that the weights of the teacher and student are getting coupled,\ncausing a potential performance bottleneck. Furthermore, this problem may\nbecome more severe when training with more complicated labels such as\nsegmentation masks but with few annotated data. 
This paper introduces Dual\nTeacher, a simple yet effective approach that employs dual temporary teachers\naiming to alleviate the coupling problem for the student. The temporary\nteachers work in shifts and are progressively improved, consistently preventing\nthe teacher and student from becoming excessively close. Specifically, the\ntemporary teachers periodically take turns generating pseudo-labels to train a\nstudent model and maintain the distinct characteristics of the student model\nfor each epoch. Consequently, Dual Teacher achieves competitive performance on\nthe PASCAL VOC, Cityscapes, and ADE20K benchmarks with remarkably shorter\ntraining times than state-of-the-art methods. Moreover, we demonstrate that our\napproach is model-agnostic and compatible with both CNN- and Transformer-based\nmodels. Code is available at https://github.com/naver-ai/dual-teacher.\n","authors":["Jaemin Na","Jung-Woo Ha","Hyung Jin Chang","Dongyoon Han","Wonjun Hwang"],"pdf_url":"https://arxiv.org/pdf/2310.18640v1.pdf","comment":"NeurIPS-2023"},{"id":"http://arxiv.org/abs/2310.18639v1","updated":"2023-10-28T08:48:44Z","published":"2023-10-28T08:48:44Z","title":"Towards Plastic and Stable Exemplar-Free Incremental Learning: A\n Dual-Learner Framework with Cumulative Parameter Averaging","summary":" The dilemma between plasticity and stability presents a significant challenge\nin Incremental Learning (IL), especially in the exemplar-free scenario where\naccessing old-task samples is strictly prohibited during the learning of a new\ntask. A straightforward solution to this issue is learning and storing an\nindependent model for each task, known as Single Task Learning (STL). Despite\nthe linear growth in model storage with the number of tasks in STL, we\nempirically discover that averaging these model parameters can potentially\npreserve knowledge across all tasks. 
Inspired by this observation, we propose a\nDual-Learner framework with Cumulative Parameter Averaging (DLCPA). DLCPA\nemploys a dual-learner design: a plastic learner focused on acquiring new-task\nknowledge and a stable learner responsible for accumulating all learned\nknowledge. The knowledge from the plastic learner is transferred to the stable\nlearner via cumulative parameter averaging. Additionally, several task-specific\nclassifiers work in cooperation with the stable learner to yield the final\nprediction. Specifically, when learning a new task, these modules are updated\nin a cyclic manner: i) the plastic learner is initially optimized using a\nself-supervised loss besides the supervised loss to enhance the feature\nextraction robustness; ii) the stable learner is then updated with respect to\nthe plastic learner in a cumulative parameter averaging manner to maintain its\ntask-wise generalization; iii) the task-specific classifier is accordingly\noptimized to align with the stable learner. Experimental results on CIFAR-100\nand Tiny-ImageNet show that DLCPA outperforms several state-of-the-art\nexemplar-free baselines in both Task-IL and Class-IL settings.\n","authors":["Wenju Sun","Qingyong Li","Wen Wang","Yangli-ao Geng"],"pdf_url":"https://arxiv.org/pdf/2310.18639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.20088v2","updated":"2023-10-28T08:46:13Z","published":"2023-05-31T17:59:04Z","title":"Improving CLIP Training with Language Rewrites","summary":" Contrastive Language-Image Pre-training (CLIP) stands as one of the most\neffective and scalable methods for training transferable vision models using\npaired image and text data. CLIP models are trained using contrastive loss,\nwhich typically relies on data augmentations to prevent overfitting and\nshortcuts. 
However, in the CLIP training paradigm, data augmentations are\nexclusively applied to image inputs, while language inputs remain unchanged\nthroughout the entire training process, limiting the exposure of diverse texts\nto the same image. In this paper, we introduce Language augmented CLIP\n(LaCLIP), a simple yet highly effective approach to enhance CLIP training\nthrough language rewrites. Leveraging the in-context learning capability of\nlarge language models, we rewrite the text descriptions associated with each\nimage. These rewritten texts exhibit diversity in sentence structure and\nvocabulary while preserving the original key concepts and meanings. During\ntraining, LaCLIP randomly selects either the original texts or the rewritten\nversions as text augmentations for each image. Extensive experiments on CC3M,\nCC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with\nlanguage rewrites significantly improves the transfer performance without\ncomputation or memory overhead during training. Specifically for ImageNet\nzero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on\nLAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.\n","authors":["Lijie Fan","Dilip Krishnan","Phillip Isola","Dina Katabi","Yonglong Tian"],"pdf_url":"https://arxiv.org/pdf/2305.20088v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.18636v1","updated":"2023-10-28T08:45:51Z","published":"2023-10-28T08:45:51Z","title":"Electrical Impedance Tomography: A Fair Comparative Study on Deep\n Learning and Analytic-based Approaches","summary":" Electrical Impedance Tomography (EIT) is a powerful imaging technique with\ndiverse applications, e.g., medical diagnosis, industrial monitoring, and\nenvironmental studies. 
The EIT inverse problem is about inferring the internal\nconductivity distribution of an object from measurements taken on its boundary.\nIt is severely ill-posed, necessitating advanced computational methods for\naccurate image reconstructions. Recent years have witnessed significant\nprogress, driven by innovations in analytic-based approaches and deep learning.\nThis review explores techniques for solving the EIT inverse problem, focusing\non the interplay between contemporary deep learning-based strategies and\nclassical analytic-based methods. Four state-of-the-art deep learning\nalgorithms are rigorously examined, harnessing the representational\ncapabilities of deep neural networks to reconstruct intricate conductivity\ndistributions. In parallel, two analytic-based methods, rooted in mathematical\nformulations and regularisation techniques, are dissected for their strengths\nand limitations. These methodologies are evaluated through various numerical\nexperiments, encompassing diverse scenarios that reflect real-world\ncomplexities. A suite of performance metrics is employed to assess the efficacy\nof these methods. These metrics collectively provide a nuanced understanding of\nthe methods' ability to capture essential features and delineate complex\nconductivity patterns. One novel feature of the study is the incorporation of\nvariable conductivity scenarios, introducing a level of heterogeneity that\nmimics textured inclusions. This departure from uniform conductivity\nassumptions mimics realistic scenarios where tissues or materials exhibit\nspatially varying electrical properties. 
Exploring how each method responds to\nsuch variable conductivity scenarios opens avenues for understanding their\nrobustness and adaptability.\n","authors":["Derick Nganyu Tanyu","Jianfeng Ning","Andreas Hauptmann","Bangti Jin","Peter Maass"],"pdf_url":"https://arxiv.org/pdf/2310.18636v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.14113v2","updated":"2023-10-28T08:28:35Z","published":"2023-08-27T14:07:57Z","title":"Semantic-aware Consistency Network for Cloth-changing Person\n Re-Identification","summary":" Cloth-changing Person Re-Identification (CC-ReID) is a challenging task that\naims to retrieve the target person across multiple surveillance cameras when\nclothing changes might happen. Despite recent progress in CC-ReID, existing\napproaches are still hindered by the interference of clothing variations since\nthey lack effective constraints to keep the model consistently focused on\nclothing-irrelevant regions. To address this issue, we present a Semantic-aware\nConsistency Network (SCNet) to learn identity-related semantic features by\nproposing effective consistency constraints. Specifically, we generate the\nblack-clothing image by erasing pixels in the clothing area, which explicitly\nmitigates the interference from clothing variations. In addition, to fully\nexploit the fine-grained identity information, a head-enhanced attention module\nis introduced, which learns soft attention maps by utilizing the proposed\npart-based matching loss to highlight head information. We further design a\nsemantic consistency loss to facilitate the learning of high-level\nidentity-related semantic features, forcing the model to focus on semantically\nconsistent cloth-irrelevant regions. 
By using the consistency constraint, our\nmodel does not require any extra auxiliary segmentation module to generate the\nblack-clothing image or locate the head region during the inference stage.\nExtensive experiments on four cloth-changing person Re-ID datasets (LTCC, PRCC,\nVc-Clothes, and DeepChange) demonstrate that our proposed SCNet makes\nsignificant improvements over prior state-of-the-art approaches. Our code is\navailable at: https://github.com/Gpn-star/SCNet.\n","authors":["Peini Guo","Hong Liu","Jianbing Wu","Guoquan Wang","Tao Wang"],"pdf_url":"https://arxiv.org/pdf/2308.14113v2.pdf","comment":"Accepted by ACM MM 2023"},{"id":"http://arxiv.org/abs/2307.11965v3","updated":"2023-10-28T07:55:08Z","published":"2023-07-22T02:50:07Z","title":"An Intelligent Remote Sensing Image Quality Inspection System","summary":" Due to the inevitable presence of quality problems, quality inspection of\nremote sensing images is indeed an indispensable step between the acquisition\nand the application of them. However, traditional manual inspection suffers\nfrom low efficiency. Hence, we propose a novel deep learning-based two-step\nintelligent system consisting of multiple advanced computer vision models,\nwhich first performs image classification by SwinV2 and then accordingly adopts\nthe most appropriate method, such as semantic segmentation by Segformer, to\nlocalize the quality problems. 
Results demonstrate that the proposed method\nexhibits excellent performance and efficiency, surpassing traditional methods.\nFurthermore, we conduct an initial exploration of applying multimodal models to\nremote sensing image quality inspection.\n","authors":["Yijiong Yu","Tao Wang","Kang Ran","Chang Li","Hao Wu"],"pdf_url":"https://arxiv.org/pdf/2307.11965v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.14183v3","updated":"2023-10-28T07:48:50Z","published":"2023-09-25T14:46:01Z","title":"Species196: A One-Million Semi-supervised Dataset for Fine-grained\n Species Recognition","summary":" The development of foundation vision models has pushed the general visual\nrecognition to a high level, but cannot well address the fine-grained\nrecognition in specialized domain such as invasive species classification.\nIdentifying and managing invasive species has strong social and ecological\nvalue. Currently, most invasive species datasets are limited in scale and cover\na narrow range of species, which restricts the development of deep-learning\nbased invasion biometrics systems. To fill the gap of this area, we introduced\nSpecies196, a large-scale semi-supervised dataset of 196-category invasive\nspecies. It collects over 19K images with expert-level accurate annotations\nSpecies196-L, and 1.2M unlabeled images of invasive species Species196-U. The\ndataset provides four experimental settings for benchmarking the existing\nmodels and algorithms, namely, supervised learning, semi-supervised learning,\nself-supervised pretraining and zero-shot inference ability of large\nmulti-modal models. To facilitate future research on these four learning\nparadigms, we conduct an empirical study of the representative methods on the\nintroduced dataset. 
The dataset is publicly available at\nhttps://species-dataset.github.io/.\n","authors":["Wei He","Kai Han","Ying Nie","Chengcheng Wang","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2309.14183v3.pdf","comment":"Accepted by NeurIPS 2023 Track Datasets and Benchmarks"},{"id":"http://arxiv.org/abs/2211.14485v2","updated":"2023-10-28T07:42:44Z","published":"2022-11-26T05:16:39Z","title":"FastHuman: Reconstructing High-Quality Clothed Human in Minutes","summary":" We propose an approach for optimizing high-quality clothed human body shapes\nin minutes, using multi-view posed images. While traditional neural rendering\nmethods struggle to disentangle geometry and appearance using only rendering\nloss, and are computationally intensive, our method uses a mesh-based patch\nwarping technique to ensure multi-view photometric consistency, and sphere\nharmonics (SH) illumination to refine geometric details efficiently. We employ\noriented point clouds' shape representation and SH shading, which significantly\nreduces optimization and rendering times compared to implicit methods. Our\napproach has demonstrated promising results on both synthetic and real-world\ndatasets, making it an effective solution for rapidly generating high-quality\nhuman body shapes. Project page\n\\href{https://l1346792580123.github.io/nccsfs/}{https://l1346792580123.github.io/nccsfs/}\n","authors":["Lixiang Lin","Songyou Peng","Qijun Gan","Jianke Zhu"],"pdf_url":"https://arxiv.org/pdf/2211.14485v2.pdf","comment":"International Conference on 3D Vision, 3DV 2024"},{"id":"http://arxiv.org/abs/2310.18626v1","updated":"2023-10-28T07:40:42Z","published":"2023-10-28T07:40:42Z","title":"Benchmark Generation Framework with Customizable Distortions for Image\n Classifier Robustness","summary":" We present a novel framework for generating adversarial benchmarks to\nevaluate the robustness of image classification models. 
Our framework allows\nusers to customize the types of distortions to be optimally applied to images,\nwhich helps address the specific distortions relevant to their deployment. The\nbenchmark can generate datasets at various distortion levels to assess the\nrobustness of different image classifiers. Our results show that the\nadversarial samples generated by our framework with any of the image\nclassification models, like ResNet-50, Inception-V3, and VGG-16, are effective\nand transferable to other models causing them to fail. These failures happen\neven when these models are adversarially retrained using state-of-the-art\ntechniques, demonstrating the generalizability of our adversarial samples. We\nachieve competitive performance in terms of net $L_2$ distortion compared to\nstate-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, we\ndemonstrate our framework achieves such results with simple distortions like\nGaussian noise without introducing unnatural artifacts or color bleeds. This is\nmade possible by a model-based reinforcement learning (RL) agent and a\ntechnique that reduces a deep tree search of the image for model sensitivity to\nperturbations, to a one-level analysis and action. 
The flexibility of choosing\ndistortions and setting classification probability thresholds for multiple\nclasses makes our framework suitable for algorithmic audits.\n","authors":["Soumyendu Sarkar","Ashwin Ramesh Babu","Sajad Mousavi","Zachariah Carmichael","Vineet Gundecha","Sahand Ghorbanpour","Ricardo Luna","Gutierrez Antonio Guillen","Avisek Naug"],"pdf_url":"https://arxiv.org/pdf/2310.18626v1.pdf","comment":"Accepted at IEEE/CVF Winter Conference on Applications of Computer\n Vision (WACV 2024)"},{"id":"http://arxiv.org/abs/2310.18620v1","updated":"2023-10-28T07:12:09Z","published":"2023-10-28T07:12:09Z","title":"ODM3D: Alleviating Foreground Sparsity for Enhanced Semi-Supervised\n Monocular 3D Object Detection","summary":" Monocular 3D object detection (M3OD) is a significant yet inherently\nchallenging task in autonomous driving due to absence of implicit depth cues in\na single RGB image. In this paper, we strive to boost currently underperforming\nmonocular 3D object detectors by leveraging an abundance of unlabelled data via\nsemi-supervised learning. Our proposed ODM3D framework entails cross-modal\nknowledge distillation at various levels to inject LiDAR-domain knowledge into\na monocular detector during training. By identifying foreground sparsity as the\nmain culprit behind existing methods' suboptimal training, we exploit the\nprecise localisation information embedded in LiDAR points to enable more\nforeground-attentive and efficient distillation via the proposed BEV occupancy\nguidance mask, leading to notably improved knowledge transfer and M3OD\nperformance. Besides, motivated by insights into why existing cross-modal\nGT-sampling techniques fail on our task at hand, we further design a novel\ncross-modal object-wise data augmentation strategy for effective RGB-LiDAR\njoint learning. 
Our method ranks 1st in both KITTI validation and test\nbenchmarks, significantly surpassing all existing monocular methods, supervised\nor semi-supervised, on both BEV and 3D detection metrics.\n","authors":["Weijia Zhang","Dongnan Liu","Chao Ma","Weidong Cai"],"pdf_url":"https://arxiv.org/pdf/2310.18620v1.pdf","comment":"Accepted by WACV 2024"},{"id":"http://arxiv.org/abs/2310.13268v3","updated":"2023-10-28T07:03:14Z","published":"2023-10-20T04:23:12Z","title":"DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model\n Statistics","summary":" Diffusion probabilistic models (DPMs) have exhibited excellent performance\nfor high-fidelity image generation while suffering from inefficient sampling.\nRecent works accelerate the sampling procedure by proposing fast ODE solvers\nthat leverage the specific ODE form of DPMs. However, they highly rely on\nspecific parameterization during inference (such as noise/data prediction),\nwhich might not be the optimal choice. In this work, we propose a novel\nformulation towards the optimal parameterization during sampling that minimizes\nthe first-order discretization error of the ODE solution. Based on such\nformulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by\nintroducing several coefficients efficiently computed on the pretrained model,\nwhich we call empirical model statistics. We further incorporate multistep\nmethods and a predictor-corrector framework, and propose some techniques for\nimproving sample quality at small numbers of function evaluations (NFE) or\nlarge guidance scales. Experiments show that DPM-Solver-v3 achieves\nconsistently better or comparable performance in both unconditional and\nconditional sampling with both pixel-space and latent-space DPMs, especially in\n5$\\sim$10 NFEs. 
We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on\nunconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable\nDiffusion, bringing a speed-up of 15%$\\sim$30% compared to previous\nstate-of-the-art training-free methods. Code is available at\nhttps://github.com/thu-ml/DPM-Solver-v3.\n","authors":["Kaiwen Zheng","Cheng Lu","Jianfei Chen","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2310.13268v3.pdf","comment":"Accepted at NeurIPS 2023"},{"id":"http://arxiv.org/abs/2305.15393v2","updated":"2023-10-28T06:56:32Z","published":"2023-05-24T17:56:16Z","title":"LayoutGPT: Compositional Visual Planning and Generation with Large\n Language Models","summary":" Attaining a high degree of user controllability in visual generation often\nrequires intricate, fine-grained inputs like layouts. However, such inputs\nimpose a substantial burden on users when compared to simple text inputs. To\naddress the issue, we study how Large Language Models (LLMs) can serve as\nvisual planners by generating layouts from text conditions, and thus\ncollaborate with visual generative models. We propose LayoutGPT, a method to\ncompose in-context visual demonstrations in style sheet language to enhance the\nvisual planning skills of LLMs. LayoutGPT can generate plausible layouts in\nmultiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also\nshows superior performance in converting challenging language concepts like\nnumerical and spatial relations to layout arrangements for faithful\ntext-to-image generation. When combined with a downstream image generation\nmodel, LayoutGPT outperforms text-to-image models/systems by 20-40% and\nachieves comparable performance as human users in designing visual layouts for\nnumerical and spatial correctness. 
Lastly, LayoutGPT achieves comparable\nperformance to supervised methods in 3D indoor scene synthesis, demonstrating\nits effectiveness and potential in multiple visual domains.\n","authors":["Weixi Feng","Wanrong Zhu","Tsu-jui Fu","Varun Jampani","Arjun Akula","Xuehai He","Sugato Basu","Xin Eric Wang","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2305.15393v2.pdf","comment":"NeurIPS 2023"},{"id":"http://arxiv.org/abs/2310.11448v3","updated":"2023-10-28T06:41:48Z","published":"2023-10-17T17:57:38Z","title":"4K4D: Real-Time 4D View Synthesis at 4K Resolution","summary":" This paper targets high-fidelity and real-time view synthesis of dynamic 3D\nscenes at 4K resolution. Recently, some methods on dynamic view synthesis have\nshown impressive rendering quality. However, their speed is still limited when\nrendering high-resolution images. To overcome this problem, we propose 4K4D, a\n4D point cloud representation that supports hardware rasterization and enables\nunprecedented rendering speed. Our representation is built on a 4D feature grid\nso that the points are naturally regularized and can be robustly optimized. In\naddition, we design a novel hybrid appearance model that significantly boosts\nthe rendering quality while preserving efficiency. Moreover, we develop a\ndifferentiable depth peeling algorithm to effectively learn the proposed model\nfrom RGB videos. Experiments show that our representation can be rendered at\nover 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the\nENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x\nfaster than previous methods and achieves the state-of-the-art rendering\nquality. 
Our project page is available at https://zju3dv.github.io/4k4d/.\n","authors":["Zhen Xu","Sida Peng","Haotong Lin","Guangzhao He","Jiaming Sun","Yujun Shen","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2310.11448v3.pdf","comment":"Project Page: https://zju3dv.github.io/4k4d"},{"id":"http://arxiv.org/abs/2310.10942v2","updated":"2023-10-28T06:04:29Z","published":"2023-10-17T02:38:09Z","title":"UNK-VQA: A Dataset and A Probe into Multi-modal Large Models' Abstention\n Ability","summary":" Teaching Visual Question Answering (VQA) models to refrain from answering\nunanswerable questions is necessary for building a trustworthy AI system.\nExisting studies, though have explored various aspects of VQA but somewhat\nignored this particular attribute. This paper aims to bridge the research gap\nby contributing a comprehensive dataset, called UNK-VQA. The dataset is\nspecifically designed to address the challenge of questions that models do not\nknow. To this end, we first augment the existing data via deliberate\nperturbations on either the image or question. In specific, we carefully ensure\nthat the question-image semantics remain close to the original unperturbed\ndistribution. By this means, the identification of unanswerable questions\nbecomes challenging, setting our dataset apart from others that involve mere\nimage replacement. We then extensively evaluate the zero- and few-shot\nperformance of several emerging multi-modal large models and discover their\nsignificant limitations when applied to our dataset. Additionally, we also\npropose a straightforward method to tackle these unanswerable questions. This\ndataset, we believe, will serve as a valuable benchmark for enhancing the\nabstention capability of VQA models, thereby leading to increased\ntrustworthiness of AI systems. 
We have made the\n\\href{https://github.com/guoyang9/UNK-VQA}{dataset} available to facilitate\nfurther exploration in this area.\n","authors":["Yanyang Guo","Fangkai Jiao","Zhiqi Shen","Liqiang Nie","Mohan Kankanhalli"],"pdf_url":"https://arxiv.org/pdf/2310.10942v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.10440v3","updated":"2023-10-28T05:51:52Z","published":"2023-04-20T16:31:22Z","title":"OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping","summary":" Accurately depicting the complex traffic scene is a vital component for\nautonomous vehicles to execute correct judgments. However, existing benchmarks\ntend to oversimplify the scene by solely focusing on lane perception tasks.\nObserving that human drivers rely on both lanes and traffic signals to operate\ntheir vehicles safely, we present OpenLane-V2, the first dataset on topology\nreasoning for traffic scene structure. The objective of the presented dataset\nis to advance research in understanding the structure of road scenes by\nexamining the relationship between perceived entities, such as traffic elements\nand lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000\nannotated road scenes that describe traffic elements and their correlation to\nthe lanes. It comprises three primary sub-tasks, including the 3D lane\ndetection inherited from OpenLane, accompanied by corresponding metrics to\nevaluate the model's performance. 
We evaluate various state-of-the-art methods,\nand present their quantitative and qualitative results on OpenLane-V2 to\nindicate future avenues for investigating topology reasoning in traffic scenes.\n","authors":["Huijie Wang","Tianyu Li","Yang Li","Li Chen","Chonghao Sima","Zhenbo Liu","Bangjun Wang","Peijin Jia","Yuting Wang","Shengyin Jiang","Feng Wen","Hang Xu","Ping Luo","Junchi Yan","Wei Zhang","Hongyang Li"],"pdf_url":"https://arxiv.org/pdf/2304.10440v3.pdf","comment":"Accepted by NeurIPS 2023 Track on Datasets and Benchmarks |\n OpenLane-V2 Dataset: https://github.com/OpenDriveLab/OpenLane-V2"},{"id":"http://arxiv.org/abs/2310.18598v1","updated":"2023-10-28T05:23:55Z","published":"2023-10-28T05:23:55Z","title":"Domain Generalisation via Risk Distribution Matching","summary":" We propose a novel approach for domain generalisation (DG) leveraging risk\ndistributions to characterise domains, thereby achieving domain invariance. In\nour findings, risk distributions effectively highlight differences between\ntraining domains and reveal their inherent complexities. In testing, we may\nobserve similar, or potentially intensifying in magnitude, divergences between\nrisk distributions. Hence, we propose a compelling proposition: Minimising the\ndivergences between risk distributions across training domains leads to robust\ninvariance for DG. The key rationale behind this concept is that a model,\ntrained on domain-invariant or stable features, may consistently produce\nsimilar risk distributions across various domains. Building upon this idea, we\npropose Risk Distribution Matching (RDM). Using the maximum mean discrepancy\n(MMD) distance, RDM aims to minimise the variance of risk distributions across\ntraining domains. However, when the number of domains increases, the direct\noptimisation of variance leads to linear growth in MMD computations, resulting\nin inefficiency. 
Instead, we propose an approximation that requires only one\nMMD computation, by aligning just two distributions: that of the worst-case\ndomain and the aggregated distribution from all domains. Notably, this method\nempirically outperforms optimising distributional variance while being\ncomputationally more efficient. Unlike conventional DG matching algorithms, RDM\nstands out for its enhanced efficacy by concentrating on scalar risk\ndistributions, sidestepping the pitfalls of high-dimensional challenges seen in\nfeature or gradient matching. Our extensive experiments on standard benchmark\ndatasets demonstrate that RDM shows superior generalisation capability over\nstate-of-the-art DG methods.\n","authors":["Toan Nguyen","Kien Do","Bao Duong","Thin Nguyen"],"pdf_url":"https://arxiv.org/pdf/2310.18598v1.pdf","comment":"Accepted at 2024 IEEE/CVF Winter Conference on Applications of\n Computer Vision (WACV 2024)"},{"id":"http://arxiv.org/abs/2310.18589v1","updated":"2023-10-28T04:54:48Z","published":"2023-10-28T04:54:48Z","title":"This Looks Like Those: Illuminating Prototypical Concepts Using Multiple\n Visualizations","summary":" We present ProtoConcepts, a method for interpretable image classification\ncombining deep learning and case-based reasoning using prototypical parts.\nExisting work in prototype-based image classification uses a ``this looks like\nthat'' reasoning process, which dissects a test image by finding prototypical\nparts and combining evidence from these prototypes to make a final\nclassification. However, all of the existing prototypical part-based image\nclassifiers provide only one-to-one comparisons, where a single training image\npatch serves as a prototype to compare with a part of our test image. With\nthese single-image comparisons, it can often be difficult to identify the\nunderlying concept being compared (e.g., ``is it comparing the color or the\nshape?''). 
Our proposed method modifies the architecture of prototype-based\nnetworks to instead learn prototypical concepts which are visualized using\nmultiple image patches. Having multiple visualizations of the same prototype\nallows us to more easily identify the concept captured by that prototype (e.g.,\n``the test image and the related training patches are all the same shade of\nblue''), and allows our model to create richer, more interpretable visual\nexplanations. Our experiments show that our ``this looks like those'' reasoning\nprocess can be applied as a modification to a wide range of existing\nprototypical image classification networks while achieving comparable accuracy\non benchmark datasets.\n","authors":["Chiyu Ma","Brandon Zhao","Chaofan Chen","Cynthia Rudin"],"pdf_url":"https://arxiv.org/pdf/2310.18589v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18585v1","updated":"2023-10-28T04:30:16Z","published":"2023-10-28T04:30:16Z","title":"Visual Explanations via Iterated Integrated Attributions","summary":" We introduce Iterated Integrated Attributions (IIA) - a generic method for\nexplaining the predictions of vision models. IIA employs iterative integration\nacross the input image, the internal representations generated by the model,\nand their gradients, yielding precise and focused explanation maps. We\ndemonstrate the effectiveness of IIA through comprehensive evaluations across\nvarious tasks, datasets, and network architectures. 
Our results showcase that\nIIA produces accurate explanation maps, outperforming other state-of-the-art\nexplanation techniques.\n","authors":["Oren Barkan","Yehonatan Elisha","Yuval Asher","Amit Eshel","Noam Koenigstein"],"pdf_url":"https://arxiv.org/pdf/2310.18585v1.pdf","comment":"ICCV 2023"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.04678v3","updated":"2023-10-28T19:47:47Z","published":"2023-10-07T03:25:06Z","title":"DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based\n Queries","summary":" In scientific research, the ability to effectively retrieve relevant\ndocuments based on complex, multifaceted queries is critical. Existing\nevaluation datasets for this task are limited, primarily due to the high cost\nand effort required to annotate resources that effectively represent complex\nqueries. To address this, we propose a novel task, Scientific DOcument\nRetrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed\nto handle the complex nature of user queries in scientific research. We\ndeveloped a benchmark dataset within the field of computer science, consisting\nof 100 human-authored complex query cases. For each complex query, we assembled\na collection of 100 relevant documents and produced annotated relevance scores\nfor ranking them. Recognizing the significant labor of expert annotation, we\nalso introduce Anno-GPT, a scalable framework for validating the performance of\nLarge Language Models (LLMs) on expert-level dataset annotation tasks. LLM\nannotation of the DORIS-MAE dataset resulted in a 500x reduction in cost,\nwithout compromising quality. Furthermore, due to the multi-tiered structure of\nthese complex queries, the DORIS-MAE dataset can be extended to over 4,000\nsub-query test cases without requiring additional annotation. We evaluated 17\nrecent retrieval methods on DORIS-MAE, observing notable performance drops\ncompared to traditional datasets. 
This highlights the need for better\napproaches to handle complex, multifaceted queries in scientific research. Our\ndataset and codebase are available at\nhttps://github.com/Real-Doris-Mae/Doris-Mae-Dataset.\n","authors":["Jianyou Wang","Kaicheng Wang","Xiaoyue Wang","Prudhviraj Naidu","Leon Bergen","Ramamohan Paturi"],"pdf_url":"https://arxiv.org/pdf/2310.04678v3.pdf","comment":"To appear in NeurIPS 2023 Datasets and Benchmarks Track"},{"id":"http://arxiv.org/abs/2310.18770v1","updated":"2023-10-28T17:54:26Z","published":"2023-10-28T17:54:26Z","title":"Leveraging Multimodal Features and Item-level User Feedback for Bundle\n Construction","summary":" Automatic bundle construction is a crucial prerequisite step in various\nbundle-aware online services. Previous approaches are mostly designed to model\nthe bundling strategy of existing bundles. However, it is hard to acquire\nlarge-scale well-curated bundle dataset, especially for those platforms that\nhave not offered bundle services before. Even for platforms with mature bundle\nservices, there are still many items that are included in few or even zero\nbundles, which give rise to sparsity and cold-start challenges in the bundle\nconstruction models. To tackle these issues, we target at leveraging multimodal\nfeatures, item-level user feedback signals, and the bundle composition\ninformation, to achieve a comprehensive formulation of bundle construction.\nNevertheless, such formulation poses two new technical challenges: 1) how to\nlearn effective representations by optimally unifying multiple features, and 2)\nhow to address the problems of modality missing, noise, and sparsity problems\ninduced by the incomplete query bundles. In this work, to address these\ntechnical challenges, we propose a Contrastive Learning-enhanced Hierarchical\nEncoder method (CLHE). 
Specifically, we use self-attention modules to combine\nthe multimodal and multi-item features, and then leverage both item- and\nbundle-level contrastive learning to enhance the representation learning, thus\nto counter the modality missing, noise, and sparsity problems. Extensive\nexperiments on four datasets in two application domains demonstrate that our\nmethod outperforms a list of SOTA methods. The code and dataset are available\nat https://github.com/Xiaohao-Liu/CLHE.\n","authors":["Yunshan Ma","Xiaohao Liu","Yinwei Wei","Zhulin Tao","Xiang Wang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2310.18770v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18700v1","updated":"2023-10-28T12:57:39Z","published":"2023-10-28T12:57:39Z","title":"Empowering Collaborative Filtering with Principled Adversarial\n Contrastive Loss","summary":" Contrastive Learning (CL) has achieved impressive performance in\nself-supervised learning tasks, showing superior generalization ability.\nInspired by the success, adopting CL into collaborative filtering (CF) is\nprevailing in semi-supervised top-K recommendations. The basic idea is to\nroutinely conduct heuristic-based data augmentation and apply contrastive\nlosses (e.g., InfoNCE) on the augmented views. Yet, some CF-tailored challenges\nmake this adoption suboptimal, such as the issue of out-of-distribution, the\nrisk of false negatives, and the nature of top-K evaluation. They necessitate\nthe CL-based CF scheme to focus more on mining hard negatives and\ndistinguishing false negatives from the vast unlabeled user-item interactions,\nfor informative contrast signals. Worse still, there is limited understanding\nof contrastive loss in CF methods, especially w.r.t. its generalization\nability. 
To bridge the gap, we delve into the reasons underpinning the success\nof contrastive loss in CF, and propose a principled Adversarial InfoNCE loss\n(AdvInfoNCE), which is a variant of InfoNCE, specially tailored for CF methods.\nAdvInfoNCE adaptively explores and assigns hardness to each negative instance\nin an adversarial fashion and further utilizes a fine-grained hardness-aware\nranking criterion to empower the recommender's generalization ability. Training\nCF models with AdvInfoNCE, we validate the effectiveness of AdvInfoNCE on both\nsynthetic and real-world benchmark datasets, thus showing its generalization\nability to mitigate out-of-distribution problems. Given the theoretical\nguarantees and empirical superiority of AdvInfoNCE over most contrastive loss\nfunctions, we advocate its adoption as a standard loss in recommender systems,\nparticularly for the out-of-distribution tasks. Codes are available at\nhttps://github.com/LehengTHU/AdvInfoNCE.\n","authors":["An Zhang","Leheng Sheng","Zhibo Cai","Xiang Wang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2310.18700v1.pdf","comment":"Accepted to NeurIPS 2023"},{"id":"http://arxiv.org/abs/2304.07041v5","updated":"2023-10-28T08:56:32Z","published":"2023-04-14T10:29:18Z","title":"A Diffusion model for POI recommendation","summary":" Next Point-of-Interest (POI) recommendation is a critical task in\nlocation-based services that aim to provide personalized suggestions for the\nuser's next destination. Previous works on POI recommendation have focused\non modeling the user's spatial preference. However, existing works that\nleverage spatial information are only based on the aggregation of users'\npreviously visited positions, which discourages the model from recommending POIs\nin novel areas. This trait of position-based methods will harm the model's\nperformance in many situations. Additionally, incorporating sequential\ninformation into the user's spatial preference remains a challenge. 
In this\npaper, we propose Diff-POI: a Diffusion-based model that samples the user's\nspatial preference for the next POI recommendation. Inspired by the wide\napplication of diffusion algorithm in sampling from distributions, Diff-POI\nencodes the user's visiting sequence and spatial character with two\ntailor-designed graph encoding modules, followed by a diffusion-based sampling\nstrategy to explore the user's spatial visiting trends. We leverage the\ndiffusion process and its reversed form to sample from the posterior\ndistribution and optimized the corresponding score function. We design a joint\ntraining and inference framework to optimize and evaluate the proposed\nDiff-POI. Extensive experiments on four real-world POI recommendation datasets\ndemonstrate the superiority of our Diff-POI over state-of-the-art baseline\nmethods. Further ablation and parameter studies on Diff-POI reveal the\nfunctionality and effectiveness of the proposed diffusion-based sampling\nstrategy for addressing the limitations of existing methods.\n","authors":["Yifang Qin","Hongjun Wu","Wei Ju","Xiao Luo","Ming Zhang"],"pdf_url":"https://arxiv.org/pdf/2304.07041v5.pdf","comment":"Accepted by ACM Transactions on Information Systems (TOIS 2023)"},{"id":"http://arxiv.org/abs/2310.18619v1","updated":"2023-10-28T07:00:28Z","published":"2023-10-28T07:00:28Z","title":"Dense Retrieval as Indirect Supervision for Large-space Decision Making","summary":" Many discriminative natural language understanding (NLU) tasks have large\nlabel spaces. Learning such a process of large-space decision making is\nparticularly challenging due to the lack of training instances per label and\nthe difficulty of selection among many fine-grained labels. 
Inspired by dense\nretrieval methods for passage finding in open-domain QA, we propose a\nreformulation of large-space discriminative NLU tasks as a learning-to-retrieve\ntask, leading to a novel solution named Dense Decision Retrieval (DDR ).\nInstead of predicting fine-grained decisions as logits, DDR adopts a\ndual-encoder architecture that learns to predict by retrieving from a decision\nthesaurus. This approach not only leverages rich indirect supervision signals\nfrom easy-to-consume learning resources for dense retrieval, it also leads to\nenhanced prediction generalizability with a semantically meaningful\nrepresentation of the large decision space. When evaluated on tasks with\ndecision spaces ranging from hundreds to hundred-thousand scales, DDR\noutperforms strong baselines greatly by 27.54% in P@1 on two extreme\nmulti-label classification tasks, 1.17% in F1 score ultra-fine entity typing,\nand 1.26% in accuracy on three few-shot intent classification tasks on average.\nCode and resources are available at https://github.com/luka-group/DDR\n","authors":["Nan Xu","Fei Wang","Mingtao Dong","Muhao Chen"],"pdf_url":"https://arxiv.org/pdf/2310.18619v1.pdf","comment":"EMNLP 2023 (Findings)"},{"id":"http://arxiv.org/abs/2310.18608v1","updated":"2023-10-28T06:31:06Z","published":"2023-10-28T06:31:06Z","title":"Embedding in Recommender Systems: A Survey","summary":" Recommender systems have become an essential component of many online\nplatforms, providing personalized recommendations to users. A crucial aspect is\nembedding techniques that coverts the high-dimensional discrete features, such\nas user and item IDs, into low-dimensional continuous vectors and can enhance\nthe recommendation performance. Applying embedding techniques captures complex\nentity relationships and has spurred substantial research. In this survey, we\nprovide an overview of the recent literature on embedding techniques in\nrecommender systems. 
This survey covers embedding methods like collaborative\nfiltering, self-supervised learning, and graph-based techniques. Collaborative\nfiltering generates embeddings capturing user-item preferences, excelling in\nsparse data. Self-supervised methods leverage contrastive or generative\nlearning for various tasks. Graph-based techniques like node2vec exploit\ncomplex relationships in network-rich environments. Addressing the scalability\nchallenges inherent to embedding methods, our survey delves into innovative\ndirections within the field of recommendation systems. These directions aim to\nenhance performance and reduce computational complexity, paving the way for\nimproved recommender systems. Among these innovative approaches, we will\nintroduce Auto Machine Learning (AutoML), hash techniques, and quantization\ntechniques in this survey. We discuss various architectures and techniques and\nhighlight the challenges and future directions in these aspects. This survey\naims to provide a comprehensive overview of the state-of-the-art in this\nrapidly evolving field and serve as a useful resource for researchers and\npractitioners working in the area of recommender systems.\n","authors":["Xiangyu Zhao","Maolin Wang","Xinjian Zhao","Jiansheng Li","Shucheng Zhou","Dawei Yin","Qing Li","Jiliang Tang","Ruocheng Guo"],"pdf_url":"https://arxiv.org/pdf/2310.18608v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2306.07528v3","updated":"2023-10-28T06:12:37Z","published":"2023-06-13T03:46:22Z","title":"Unified Off-Policy Learning to Rank: a Reinforcement Learning\n Perspective","summary":" Off-policy Learning to Rank (LTR) aims to optimize a ranker from data\ncollected by a deployed logging policy. However, existing off-policy learning\nto rank methods often make strong assumptions about how users generate the\nclick data, i.e., the click model, and hence need to tailor their methods\nspecifically under different click models. 
In this paper, we unified the\nranking process under general stochastic click models as a Markov Decision\nProcess (MDP), and the optimal ranking could be learned with offline\nreinforcement learning (RL) directly. Building upon this, we leverage offline\nRL techniques for off-policy LTR and propose the Click Model-Agnostic Unified\nOff-policy Learning to Rank (CUOLR) method, which could be easily applied to a\nwide range of click models. Through a dedicated formulation of the MDP, we show\nthat offline RL algorithms can adapt to various click models without complex\ndebiasing techniques and prior knowledge of the model. Results on various\nlarge-scale datasets demonstrate that CUOLR consistently outperforms the\nstate-of-the-art off-policy learning to rank algorithms while maintaining\nconsistency and robustness under different click models.\n","authors":["Zeyu Zhang","Yi Su","Hui Yuan","Yiran Wu","Rishab Balasubramanian","Qingyun Wu","Huazheng Wang","Mengdi Wang"],"pdf_url":"https://arxiv.org/pdf/2306.07528v3.pdf","comment":"accepted by Neruips 2023"},{"id":"http://arxiv.org/abs/2310.00402v2","updated":"2023-10-28T04:40:29Z","published":"2023-09-30T14:55:44Z","title":"DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph\n Index using Query-sensitivity Entry Vertex","summary":" Given a vector dataset $\\mathcal{X}$ and a query vector $\\vec{x}_q$,\ngraph-based Approximate Nearest Neighbor Search (ANNS) aims to build a graph\nindex $G$ and approximately return vectors with minimum distances to\n$\\vec{x}_q$ by searching over $G$. The main drawback of graph-based ANNS is\nthat a graph index would be too large to fit into the memory especially for a\nlarge-scale $\\mathcal{X}$. To solve this, a Product Quantization (PQ)-based\nhybrid method called DiskANN is proposed to store a low-dimensional PQ index in\nmemory and retain a graph index in SSD, thus reducing memory overhead while\nensuring a high search accuracy. 
However, it suffers from two I/O issues that\nsignificantly affect the overall efficiency: (1) long routing path from an\nentry vertex to the query's neighborhood that results in large number of I/O\nrequests and (2) redundant I/O requests during the routing process. We propose\nan optimized DiskANN++ to overcome above issues. Specifically, for the first\nissue, we present a query-sensitive entry vertex selection strategy to replace\nDiskANN's static graph-central entry vertex by a dynamically determined entry\nvertex that is close to the query. For the second I/O issue, we present an\nisomorphic mapping on DiskANN's graph index to optimize the SSD layout and\npropose an asynchronously optimized Pagesearch based on the optimized SSD\nlayout as an alternative to DiskANN's beamsearch. Comprehensive experimental\nstudies on eight real-world datasets demonstrate our DiskANN++'s superiority on\nefficiency. We achieve a notable 1.5 X to 2.2 X improvement on QPS compared to\nDiskANN, given the same accuracy constraint.\n","authors":["Jiongkang Ni","Xiaoliang Xu","Yuxiang Wang","Can Li","Jiajie Yao","Shihai Xiao","Xuecang Zhang"],"pdf_url":"https://arxiv.org/pdf/2310.00402v2.pdf","comment":"14 pages including references, 9 figures"}],"Multimedia":[{"id":"http://arxiv.org/abs/2306.17842v3","updated":"2023-10-28T18:09:46Z","published":"2023-06-30T17:59:07Z","title":"SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen\n LLMs","summary":" In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling\nfrozen LLMs to perform both understanding and generation tasks involving\nnon-linguistic modalities such as images or videos. SPAE converts between raw\npixels and interpretable lexical tokens (or words) extracted from the LLM's\nvocabulary. 
The resulting tokens capture both the semantic meaning and the\nfine-grained details needed for visual reconstruction, effectively translating\nthe visual content into a language comprehensible to the LLM, and empowering it\nto perform a wide array of multimodal tasks. Our approach is validated through\nin-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set\nof image understanding and generation tasks. Our method marks the first\nsuccessful attempt to enable a frozen LLM to generate image content while\nsurpassing state-of-the-art performance in image understanding tasks, under the\nsame setting, by over 25%.\n","authors":["Lijun Yu","Yong Cheng","Zhiruo Wang","Vivek Kumar","Wolfgang Macherey","Yanping Huang","David A. Ross","Irfan Essa","Yonatan Bisk","Ming-Hsuan Yang","Kevin Murphy","Alexander G. Hauptmann","Lu Jiang"],"pdf_url":"https://arxiv.org/pdf/2306.17842v3.pdf","comment":"NeurIPS 2023 spotlight"},{"id":"http://arxiv.org/abs/2310.18770v1","updated":"2023-10-28T17:54:26Z","published":"2023-10-28T17:54:26Z","title":"Leveraging Multimodal Features and Item-level User Feedback for Bundle\n Construction","summary":" Automatic bundle construction is a crucial prerequisite step in various\nbundle-aware online services. Previous approaches are mostly designed to model\nthe bundling strategy of existing bundles. However, it is hard to acquire\nlarge-scale well-curated bundle dataset, especially for those platforms that\nhave not offered bundle services before. Even for platforms with mature bundle\nservices, there are still many items that are included in few or even zero\nbundles, which give rise to sparsity and cold-start challenges in the bundle\nconstruction models. 
To tackle these issues, we target at leveraging multimodal\nfeatures, item-level user feedback signals, and the bundle composition\ninformation, to achieve a comprehensive formulation of bundle construction.\nNevertheless, such formulation poses two new technical challenges: 1) how to\nlearn effective representations by optimally unifying multiple features, and 2)\nhow to address the problems of modality missing, noise, and sparsity problems\ninduced by the incomplete query bundles. In this work, to address these\ntechnical challenges, we propose a Contrastive Learning-enhanced Hierarchical\nEncoder method (CLHE). Specifically, we use self-attention modules to combine\nthe multimodal and multi-item features, and then leverage both item- and\nbundle-level contrastive learning to enhance the representation learning, thus\nto counter the modality missing, noise, and sparsity problems. Extensive\nexperiments on four datasets in two application domains demonstrate that our\nmethod outperforms a list of SOTA methods. The code and dataset are available\nat https://github.com/Xiaohao-Liu/CLHE.\n","authors":["Yunshan Ma","Xiaohao Liu","Yinwei Wei","Zhulin Tao","Xiang Wang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2310.18770v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18728v1","updated":"2023-10-28T15:14:43Z","published":"2023-10-28T15:14:43Z","title":"Online Multi-view Anomaly Detection with Disentangled Product-of-Experts\n Modeling","summary":" Multi-view or even multi-modal data is appealing yet challenging for\nreal-world applications. Detecting anomalies in multi-view data is a prominent\nrecent research topic. However, most of the existing methods 1) are only\nsuitable for two views or type-specific anomalies, 2) suffer from the issue of\nfusion disentanglement, and 3) do not support online detection after model\ndeployment. 
To address these challenges, our main ideas in this paper are\nthree-fold: multi-view learning, disentangled representation learning, and\ngenerative model. To this end, we propose dPoE, a novel multi-view variational\nautoencoder model that involves (1) a Product-of-Experts (PoE) layer in\ntackling multi-view data, (2) a Total Correction (TC) discriminator in\ndisentangling view-common and view-specific representations, and (3) a joint\nloss function in wrapping up all components. In addition, we devise theoretical\ninformation bounds to control both view-common and view-specific\nrepresentations. Extensive experiments on six real-world datasets demonstrate\nthat the proposed dPoE outperforms baselines markedly.\n","authors":["Hao Wang","Zhi-Qi Cheng","Jingdong Sun","Xin Yang","Xiao Wu","Hongyang Chen","Yan Yang"],"pdf_url":"https://arxiv.org/pdf/2310.18728v1.pdf","comment":"Accepted by ACM Multimedia 2023, 10 pages, 5 tables, and 3 figures"},{"id":"http://arxiv.org/abs/2310.18709v1","updated":"2023-10-28T13:37:52Z","published":"2023-10-28T13:37:52Z","title":"Audio-Visual Instance Segmentation","summary":" In this paper, we propose a new multi-modal task, namely audio-visual\ninstance segmentation (AVIS), in which the goal is to identify, segment, and\ntrack individual sounding object instances in audible videos, simultaneously.\nTo our knowledge, it is the first time that instance segmentation has been\nextended into the audio-visual domain. To better facilitate this research, we\nconstruct the first audio-visual instance segmentation benchmark (AVISeg).\nSpecifically, AVISeg consists of 1,258 videos with an average duration of 62.6\nseconds from YouTube and public audio-visual datasets, where 117 videos have\nbeen annotated by using an interactive semi-automatic labeling tool based on\nthe Segment Anything Model (SAM). In addition, we present a simple baseline\nmodel for the AVIS task. 
Our new model introduces an audio branch and a\ncross-modal fusion module to Mask2Former to locate all sounding objects.\nFinally, we evaluate the proposed method using two backbones on AVISeg. We\nbelieve that AVIS will inspire the community towards a more comprehensive\nmulti-modal understanding.\n","authors":["Ruohao Guo","Yaru Chen","Yanyu Qi","Wenzhen Yue","Dantong Niu","Xianghua Ying"],"pdf_url":"https://arxiv.org/pdf/2310.18709v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18609v1","updated":"2023-10-28T06:36:53Z","published":"2023-10-28T06:36:53Z","title":"Deep3DSketch+: Obtaining Customized 3D Model by Single Free-Hand Sketch\n through Deep Learning","summary":" As 3D models become critical in today's manufacturing and product design,\nconventional 3D modeling approaches based on Computer-Aided Design (CAD) are\nlabor-intensive, time-consuming, and have high demands on the creators. This\nwork aims to introduce an alternative approach to 3D modeling by utilizing\nfree-hand sketches to obtain desired 3D models. We introduce Deep3DSketch+,\nwhich is a deep-learning algorithm that takes the input of a single free-hand\nsketch and produces a complete and high-fidelity model that matches the sketch\ninput. The neural network has view- and structural-awareness enabled by a Shape\nDiscriminator (SD) and a Stroke Enhancement Module (SEM), which overcomes the\nlimitations of sparsity and ambiguity of the sketches. The network design also\nbrings high robustness to partial sketch input in industrial applications.Our\napproach has undergone extensive experiments, demonstrating its\nstate-of-the-art (SOTA) performance on both synthetic and real-world datasets.\nThese results validate the effectiveness and superiority of our method compared\nto existing techniques. We have demonstrated the conversion of free-hand\nsketches into physical 3D objects using additive manufacturing. 
We believe that\nour approach has the potential to accelerate product design and democratize\ncustomized manufacturing.\n","authors":["Ying Zang","Chenglong Fu","Tianrun Chen","Yuanqi Hu","Qingshan Liu","Wenjun Hu"],"pdf_url":"https://arxiv.org/pdf/2310.18609v1.pdf","comment":null}]}} \ No newline at end of file diff --git a/favicon.ico b/favicon.ico new file mode 100644 index 00000000..7f5166c7 Binary files /dev/null and b/favicon.ico differ diff --git a/index.css b/index.css new file mode 100644 index 00000000..9ded9d94 --- /dev/null +++ b/index.css @@ -0,0 +1,355 @@ +:root { + /* Palette: Nord (https://www.nordtheme.com)*/ + --nord00: #2e3440; + --nord01: #3b4252; + --nord02: #434c5e; + --nord03: #4c566a; + --nord04: #d8dee9; + --nord05: #e5e9f0; + --nord06: #eceff4; + --nord07: #8fbcbb; + --nord08: #88c0d0; + --nord09: #81a1c1; + --nord0A: #5e81ac; + --nord0B: #bf616a; + --nord0C: #d08770; + --nord0D: #ebcb8b; + --nord0E: #a3be8c; + --nord0F: #b48ead; + + + /* Typograph */ + --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue", + sans-serif; + --font-size-scaler: 62.5%; + --font-size-m: 1.6rem; + --font-size-s: 1.4rem; + + /* Components */ + --body-color: var(--nord06); + --body-bg: var(--nord00); + + --header-title: var(--nord06); + --header-container: var(--nord00); + --header-title-preffix: var(--nord0F); + + --chip-font: var(--nord08); + --chip-color: var(--nord0B); + + --icons: var(--nord06); + --icons-hover: var(--nord0F); + + --day-container: var(--nord01); + --date: var(--nord09); + + --summary: var(--nord0E); + --summary-hover: var(--nord0F); + + --details-open: var(--nord02); + --details-content: var(--nord05); + --details-a: var(--nord07); + --details-a-hover: var(--nord0F); + + --highlight-title: var(--nord0B); + --highlight-author: var(--nord0B); + + --article-summary-hover-color: var(--nord0D); + --article-summary-color: var(--nord04); + + --article-title-color: 
var(--nord05); + --article-title-hover-color: var(--nord0E); + + --accordion-content-rail-color: var(--nord01); + --accordion-content-hover-rail-color: var(--nord0D); + --accordion-title-marker-color: var(--nord01); + --accordion-title-hover-marker-color: var(--nord0E); + + --footer-color: var(--nord04); + --footer-link-hover-color: var(--nord0D); +} + +[data-theme="light"] { + /* Theme design */ + + --color-primary: var(--nord07); + --color-primary-second: var(--nord00); + --color-info: var(--nord0A); + --color-success: var(--nord0E); + --color-warning: var(--nord0C); + --color-danger: var(--nord0B); + + --color-text: var(--nord00); + --color-hover: var(--nord0D); + --color-shadow: var(--nord03); + + --color-primary-h: var(--nord09); + --color-primary-s: var(--nord08); + --color-primary-l: var(--nord07); + + --color-contrast-higher-h: var(--nord01); + --color-contrast-higher-l: var(--nord02); + --color-contrast-higher-s: var(--nord03); + + --color-content: white; + + --background: var(--nord06); + --background-content: var(--nord05); + --background-color: var(--nord04); + + /* Components */ + + --chip-font: var(--nord06); + --chip-color: var(--nord09); + + --body-color: var(--background-color); + --body-bg: var(--background); + + --header-title: var(--color-shadow); + --header-container: var(--background); + --header-title-preffix: var(--color-primary-h); + + --icons: var(--color-shadow); + --icons-hover: var(--color-hover); + + --day-container: var(--background-content); + --date: var(--color-primary-l); + + --summary: var(--color-info); + --summary-hover: var(--color-success); + + --details-open: var(--color-content); + --details-content: var(--color-text); + --details-a: var(--color-primary-h); + --details-a-hover: var(--color-hover); + + --highlight-title: var(--color-danger); + --highlight-author: var(--color-warning); + + --article-summary-color: var(--color-text); + --article-summary-hover-color: var(--color-primary-s); + + --article-title-color: 
var(--color-primary); + --article-title-hover-color: var(--color-success); + + --accordion-content-rail-color: var(--color-warning); + --accordion-content-hover-rail-color: var(--color-warning); + --accordion-title-marker-color: var(--color-success); + --accordion-title-hover-marker-color: var(--color-success); + + --footer-color: var(--color-text); + --footer-link-hover-color: var(--color-hover); +} + +html { + font-size: var(--font-size-scaler); +} + +body { + background-color: var(--body-bg); + font-family: var(--font-family-default); + color: var(--body-color); + margin: 0; + padding-top: 16px; + display: grid; +} + +.header-container { + width: 90%; + max-width: 1200px; + background: var(--header-container); + margin: 0 auto; +} + +.header-title { + font-size: 32px; + font-weight: bold; + color: var(--header-title); + margin: 0; + padding-bottom: 14px; +} + +.header-title-preffix { + color: var(--header-title-preffix); +} + +.icons { + color: var(--icons); + padding-bottom: 16px; +} + +.icons a { + color: var(--icons); + text-decoration: none; +} + +.icons a:hover { + color: var(--icons-hover); +} + +.day-container { + padding: 16px 16px 16px 16px; + background: var(--day-container); + width: 90%; + max-width: 1200px; + margin: 0 auto; + margin-bottom: 8px; + border-radius: 10px; +} + +.date { + font-size: 24px; + font-weight: 700; + margin: 0; + color: var(--date); +} + +p { + margin: 0; +} + +summary { + font-weight: 600; + color: var(--summary); +} + +summary:hover { + text-decoration: underline; + cursor: pointer; + color: var(--summary-hover); +} + +details { + --border-color: transparent; + + padding: 2px 4px; + font-size: 20px; + border: 1px solid var(--border-color); + border-radius: 4px; +} + +details[open] { + background-color: var(--details-open); + margin-bottom: 8px; +} + +.details-content { + padding: 12px 3px; + gap: 16px; + color: var(--details-content); +} + +details a { + color: var(--details-a); +} + +details a:hover { + color: 
var(--details-a-hover); +} + +footer { + margin: 0 auto; + color: var(--footer-color); + font-size: var(--font-size-s); + display: flex; + padding: 0 16px; + justify-content: space-between; +} + +.description { + margin: 0 auto; + color: var(--footer-color); + font-size: var(--font-size-s); + display: flex; + padding: 0 16px; + text-align: center; +} + +.highlight-author { + color: var(--highlight-author); + font-weight: bold; +} + +.highlight-title { + color: var(--highlight-title); + font-weight: bold; +} + +.channel-description { + text-align: center; + font-size: var(--font-size-scaler); +} + +.article-summary-link { + color: var(--article-summary-color); + font-size: var(--font-size-s); + text-decoration: none; +} + +.article-summary-link:hover { + color: var(--article-summary-hover-color); + --accordion-content-rail-color: var(--accordion-content-hover-rail-color); +} + +.article-summary-box-outer { + display: block; + padding: 4px 8px 8px 4px; +} + +.article-summary-box-inner { + padding-left: 8px; + border-left: 1px solid var(--accordion-content-rail-color); + font-size: var(--font-size-m); +} + +.article-expander { + padding: 10px 4px; + border-radius: 4px; +} + +.article-authors { + font-size: var(--font-size-m); + padding: 0.25em 1em; +} + +.article-authors a { + text-decoration: none; +} + +.article-expander-title { + font-size: var(--font-size-m); + font-weight: 600; +} + +.article-expander-title:hover { + cursor: pointer; +} + +.article-expander-title::marker { + color: var(--accordion-title-marker-color); +} + +.article-expander-title:hover::marker { + color: var(--accordion-title-hover-marker-color); +} + +/* for switcher */ +.theme-switch { + display: inline-block; + position: relative; +} + +.theme-switch input { + display: none; +} + +/* chip */ +.chip { + font-size: 90%; + align-items: center; + color: var(--chip-font); + background: var(--chip-color); + border-radius: 5rem; + display: inline-flex; + padding: .2rem .4rem; + vertical-align: 
middle; +} \ No newline at end of file diff --git a/index.html b/index.html new file mode 100644 index 00000000..1bc06ea2 --- /dev/null +++ b/index.html @@ -0,0 +1,102574 @@ + + + + + MyArxiv + + + + + + + + + + + + + + + +
+
+
+
+ MyArxiv +
+
+ +
+ +
+
+
+ +
+
+ +
+
+
+ + Computation and Language 13 + +
+
+
+ + ☆ M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context + Evaluation Benchmark for Large Language Models + + +
+ Managing long sequences has become an important and necessary feature for +large language models (LLMs). However, it is still an open question of how to +comprehensively and systematically evaluate the long-sequence capability of +LLMs. One of the reasons is that conventional and widely-used benchmarks mainly +consist of short sequences. In this paper, we propose M4LE, a Multi-ability, +Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. +M4LE is based on a diverse NLP task pool comprising 36 NLP datasets, 11 task +types and 12 domains. To alleviate the scarcity of tasks with naturally long +sequences and incorporate multiple-ability assessment, we propose an automatic +approach (but with negligible human annotations) to convert short-sequence +tasks into a unified long-sequence scenario where LLMs have to identify single +or multiple relevant spans in long contexts based on explicit or semantic +hints. Specifically, the scenario includes five different types of abilities: +(1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span; +(4) semantic multiple-span; and (5) global context understanding. The resulting +samples in M4LE are evenly distributed from 1k to 8k input length. We conducted +a systematic evaluation on 11 well-established LLMs, especially those optimized +for long-sequence inputs. Our results reveal that: 1) Current LLMs struggle to +understand long context, particularly when tasks require multiple-span +attention. 2) Semantic retrieval task is more difficult for competent LLMs. 3) +Models fine-tuned on longer text with position interpolation have comparable +performance to those using Neural Tangent Kernel (NTK) aware scaling methods +without fine-tuning. We make our benchmark publicly available to encourage +future research in this challenging area. + +
+
+ comment: Code and data are available at https://github.com/KwanWaiChung/M4LE +
+
+
+
+
+ + ☆ Building Real-World Meeting Summarization Systems using Large Language + Models: A Practical Perspective EMNLP 2023 + + +
+ This paper studies how to effectively build meeting summarization systems for +real-world usage using large language models (LLMs). For this purpose, we +conduct an extensive evaluation and comparison of various closed-source and +open-source LLMs, namely, GPT-4, GPT-3.5, PaLM-2, and LLaMA-2. Our findings +reveal that most closed-source LLMs are generally better in terms of +performance. However, much smaller open-source models like LLaMA-2 (7B and +13B) could still achieve performance comparable to the large closed-source +models even in zero-shot scenarios. Considering the privacy concerns of +closed-source models for only being accessible via API, alongside the high cost +associated with using fine-tuned versions of the closed-source models, the +open-source models that can achieve competitive performance are more +advantageous for industrial use. Balancing performance with associated costs +and privacy concerns, the LLaMA-2-7B model looks more promising for industrial +usage. In sum, this paper offers practical insights on using LLMs for +real-world business meeting summarization, shedding light on the trade-offs +between performance and cost. + +
+
+ comment: EMNLP 2023 Industry Track +
+
+
+
+
+ + ☆ Adapter Pruning using Tropical Characterization EMNLP 2023 + + +
+ Adapters are widely popular parameter-efficient transfer learning approaches +in natural language processing that insert trainable modules in between layers +of a pre-trained language model. Apart from several heuristics, however, there +has been a lack of studies analyzing the optimal number of adapter parameters +needed for downstream applications. In this paper, we propose an adapter +pruning approach by studying the tropical characteristics of trainable modules. +We cast it as an optimization problem that aims to prune parameters from the +adapter layers without changing the orientation of underlying tropical +hypersurfaces. Our experiments on five NLP datasets show that tropical geometry +tends to identify more relevant parameters to prune when compared with the +magnitude-based baseline, while a combined approach works best across the +tasks. + +
+
+ comment: Accepted at EMNLP 2023, Findings +
+
+
+
+
+ + ☆ EHRTutor: Enhancing Patient Understanding of Discharge Instructions NeurIPS'23 + + +
+ Large language models have shown success as a tutor in education in various +fields. Educating patients about their clinical visits plays a pivotal role in +patients' adherence to their treatment plans post-discharge. This paper +presents EHRTutor, an innovative multi-component framework leveraging the Large +Language Model (LLM) for patient education through conversational +question-answering. EHRTutor first formulates questions pertaining to the +electronic health record discharge instructions. It then educates the patient +through conversation by administering each question as a test. Finally, it +generates a summary at the end of the conversation. Evaluation results using +LLMs and domain experts have shown a clear preference for EHRTutor over the +baseline. Moreover, EHRTutor also offers a framework for generating synthetic +patient education dialogues that can be used for future in-house system +training. + +
+
+ comment: To appear in NeurIPS'23 Workshop on Generative AI for Education + (GAIED) +
+
+
+
+
+ + ☆ LitCab: Lightweight Calibration of Language Models on Outputs of Varied + Lengths + + +
+ A model is considered well-calibrated when its probability estimate aligns +with the actual likelihood of the output being correct. Calibrating language +models (LMs) is crucial, as it plays a vital role in detecting and mitigating +hallucinations, a common issue of LMs, as well as building more trustworthy +models. Yet, popular neural model calibration techniques are not well-suited +for LMs due to their lack of flexibility in discerning answer correctness and +their high computational costs. For instance, post-processing methods like +temperature scaling are often unable to reorder the candidate generations. +Moreover, training-based methods require finetuning the entire model, which is +impractical due to the increasing sizes of modern LMs. In this paper, we +present LitCab, a lightweight calibration mechanism consisting of a single +linear layer taking the input text representation and manipulating the LM +output logits. LitCab improves model calibration by only adding < 2% of the +original model parameters. For evaluation, we construct CaT, a benchmark +consisting of 7 text generation tasks, covering responses ranging from short +phrases to paragraphs. We test LitCab with Llama2-7B, where it improves +calibration across all tasks, by reducing the average ECE score by 20%. We +further conduct a comprehensive evaluation with 7 popular open-sourced LMs from +GPT and LLaMA families, yielding the following key findings: (1) Larger models +within the same family exhibit better calibration on tasks with short +generations, but not necessarily for longer ones. (2) GPT-family models +show superior calibration compared to LLaMA, Llama2 and Vicuna models despite +having far fewer parameters. (3) Finetuning a pretrained model (e.g., LLaMA) +with samples of limited purpose (e.g., conversations) may lead to worse +calibration, highlighting the importance of finetuning setups for calibrating +LMs. + +
+
+
+
+
+ + ♻ ☆ Does Role-Playing Chatbots Capture the Character Personalities? + Assessing Personality Traits for Role-Playing Chatbots + + +
+ The emergence of large-scale pretrained language models has revolutionized +the capabilities of new AI applications, especially in the realm of crafting +chatbots with distinct personas. Given the "stimulus-response" nature of +chatbots, this paper unveils an innovative open-ended interview-style approach +for personality assessment on role-playing chatbots, which offers a richer +comprehension of their intrinsic personalities. We conduct personality +assessments on 32 role-playing chatbots created by the ChatHaruhi library, +across both the Big Five and MBTI dimensions, and measure their alignment with +human perception. Evaluation results underscore that modern role-playing +chatbots based on LLMs can effectively portray personality traits of +corresponding characters, with an alignment rate of 82.8% compared with +human-perceived personalities. In addition, we suggest potential strategies +for shaping chatbots' personalities. Hence, this paper serves as a cornerstone +study for role-playing chatbots that intersects computational linguistics and +psychology. Our resources are available at +https://github.com/LC1332/Chat-Haruhi-Suzumiya + +
+
+ comment: A Personality Traits Test Over ChatHaruhi +
+
+
+
+
+ + ♻ ☆ GestureGPT: Zero-shot Interactive Gesture Understanding and Grounding + with Large Language Model Agents + + +
+ Current gesture recognition systems primarily focus on identifying gestures +within a predefined set, leaving a gap in connecting these gestures to +interactive GUI elements or system functions (e.g., linking a 'thumb-up' +gesture to a 'like' button). We introduce GestureGPT, a novel zero-shot gesture +understanding and grounding framework leveraging large language models (LLMs). +Gesture descriptions are formulated based on hand landmark coordinates from +gesture videos and fed into our dual-agent dialogue system. A gesture agent +deciphers these descriptions and queries about the interaction context (e.g., +interface, history, gaze data), which a context agent organizes and provides. +Following iterative exchanges, the gesture agent discerns user intent, +grounding it to an interactive function. We validated the gesture description +module using public first-view and third-view gesture datasets and tested the +whole system in two real-world settings: video streaming and smart home IoT +control. The highest zero-shot Top-5 grounding accuracies are 80.11% for video +streaming and 90.78% for smart home tasks, showing potential of the new gesture +understanding paradigm. + +
+
+
+
+
+ + ♻ ☆ LightLM: A Lightweight Deep and Narrow Language Model for Generative + Recommendation + + +
+ This paper presents LightLM, a lightweight Transformer-based language model +for generative recommendation. While Transformer-based generative modeling has +gained importance in various AI sub-fields such as NLP and vision, generative +recommendation is still in its infancy due to its unique demand on personalized +generative modeling. Existing works on generative recommendation often use +NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are +heavy-weight and are not specifically designed for recommendation tasks. +LightLM tackles the issue by introducing a light-weight deep and narrow +Transformer architecture, which is specifically tailored for direct generation +of recommendation items. This structure is especially apt for straightforward +generative recommendation and stems from the observation that a language model +does not have to be too wide for this task, as the input predominantly consists +of short tokens that are well-suited for the model's capacity. We also show +that our devised user and item ID indexing methods, i.e., Spectral +Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enable +the deep and narrow Transformer architecture to outperform large-scale language +models for recommendation. Besides, to address the hallucination problem of +generating items as output, we propose the constrained generation process for +generative recommenders. Experiments on real-world datasets show that LightLM +outperforms various competitive baselines in terms of both recommendation +accuracy and efficiency. The code can be found at +https://github.com/dongyuanjushi/LightLM. + +
+
+
+
+
+ + ♻ ☆ Denevil: Towards Deciphering and Navigating the Ethical Values of Large + Language Models via Instruction Learning + + +
+ Large Language Models (LLMs) have made unprecedented breakthroughs, yet their +increasing integration into everyday life might raise societal risks due to +generated unethical content. Despite extensive study on specific issues like +bias, the intrinsic values of LLMs remain largely unexplored from a moral +philosophy perspective. This work delves into ethical values utilizing Moral +Foundation Theory. Moving beyond conventional discriminative evaluations with +poor reliability, we propose DeNEVIL, a novel prompt generation algorithm +tailored to dynamically exploit LLMs' value vulnerabilities and elicit the +violation of ethics in a generative manner, revealing their underlying value +inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset +comprising 2,397 prompts covering 500+ value principles, and then benchmark the +intrinsic values across a spectrum of LLMs. We discovered that most models are +essentially misaligned, necessitating further ethical value alignment. In +response, we develop VILMO, an in-context alignment method that substantially +enhances the value compliance of LLM outputs by learning to generate +appropriate value instructions, outperforming existing competitors. Our methods +are suitable for black-box and open-source models, offering a promising initial +step in studying the ethical values of LLMs. + +
+
+
+
+
+ + ♻ ☆ DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking + + +
+ Inspired by the dual-process theory of human cognition, we introduce DUMA, a +novel conversational agent framework that embodies a dual-mind mechanism +through the utilization of two generative Large Language Models (LLMs) +dedicated to fast and slow thinking respectively. The fast thinking model +serves as the primary interface for external interactions and initial response +generation, evaluating the necessity for engaging the slow thinking model based +on the complexity of the complete response. When invoked, the slow thinking +model takes over the conversation, engaging in meticulous planning, reasoning, +and tool utilization to provide a well-analyzed response. This dual-mind +configuration allows for a seamless transition between intuitive responses and +deliberate problem-solving processes based on the situation. We have +constructed a conversational agent to handle online inquiries in the real +estate industry. The experiment proves that our method balances effectiveness +and efficiency, and has a significant improvement compared to the baseline. + +
+
+
+
+
+ + ♻ ☆ Large Language Models Are Semi-Parametric Reinforcement Learning Agents + + +
+ Inspired by the insights in cognitive science with respect to human memory +and reasoning mechanism, a novel evolvable LLM-based (Large Language Model) +agent framework is proposed as REMEMBERER. By equipping the LLM with a +long-term experience memory, REMEMBERER is capable of exploiting the +experiences from the past episodes even for different task goals, which excels +an LLM-based agent with fixed exemplars or equipped with a transient working +memory. We further introduce Reinforcement Learning with Experience Memory +(RLEM) to update the memory. Thus, the whole system can learn from the +experiences of both success and failure, and evolve its capability without +fine-tuning the parameters of the LLM. In this way, the proposed REMEMBERER +constitutes a semi-parametric RL agent. Extensive experiments are conducted on +two RL task sets to evaluate the proposed framework. The average results with +different initialization and training sets exceed the prior SOTA by 4% and 2% +for the success rate on two task sets and demonstrate the superiority and +robustness of REMEMBERER. + +
+
+
+
+
+ + ♻ ☆ Hierarchical Prompting Assists Large Language Model on Web Navigation EMNLP 2023 + + +
+ Large language models (LLMs) struggle on processing complicated observations +in interactive decision making tasks. To alleviate this issue, we propose a +simple hierarchical prompting approach. Diverging from previous prompting +approaches that always put the full observation (e.g. a web page) to the +prompt, we propose to first construct an action-aware observation which is more +condensed and relevant with a dedicated SUMMARIZER prompt. The ACTOR prompt +then predicts the next action based on the summarized observation. While our +method has broad applicability, we particularly demonstrate its efficacy in the +complex domain of web navigation where a full observation often contains +redundant and irrelevant information. Our approach outperforms the previous +state-of-the-art prompting mechanics by 6.2% on task success rate, +demonstrating its potential on interactive decision making tasks with long +observation traces. + +
+
+ comment: EMNLP 2023 Findings; Natural Language Reasoning and Structured + Explanations Workshop at ACL 2023 +
+
+
+
+
+ + ♻ ☆ Davidsonian Scene Graph: Improving Reliability in Fine-grained + Evaluation for Text-to-Image Generation + + +
+ Evaluating text-to-image models is notoriously difficult. A strong recent +approach for assessing text-image faithfulness is based on QG/A (question +generation and answering), which uses pre-trained foundational models to +automatically generate a set of questions and answers from the prompt, and +output images are scored based on whether these answers extracted with a visual +question answering model are consistent with the prompt-based answers. This +kind of evaluation is naturally dependent on the quality of the underlying QG +and QA models. We identify and address several reliability challenges in +existing QG/A work: (a) QG questions should respect the prompt (avoiding +hallucinations, duplications, and omissions) and (b) VQA answers should be +consistent (not asserting that there is no motorcycle in an image while also +claiming the motorcycle is blue). We address these issues with Davidsonian +Scene Graph (DSG), an empirically grounded evaluation framework inspired by +formal semantics. DSG is an automatic, graph-based QG/A that is modularly +implemented to be adaptable to any QG/A module. DSG produces atomic and unique +questions organized in dependency graphs, which (i) ensure appropriate semantic +coverage and (ii) sidestep inconsistent answers. With extensive experimentation +and human evaluation on a range of model configurations (LLM, VQA, and T2I), we +empirically demonstrate that DSG addresses the challenges noted above. Finally, +we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 +prompts, covering a wide range of fine-grained semantic categories with a +balanced distribution. We release the DSG-1k prompts and the corresponding DSG +questions. + +
+
+ comment: Project website: https://google.github.io/dsg +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 6 + +
+
+
+ + ☆ There Are No Data Like More Data- Datasets for Deep Learning in Earth + Observation + + +
+ Carefully curated and annotated datasets are the foundation of machine +learning, with particularly data-hungry deep neural networks forming the core +of what is often called Artificial Intelligence (AI). Due to the massive +success of deep learning applied to Earth Observation (EO) problems, the focus +of the community has been largely on the development of ever-more sophisticated +deep neural network architectures and training strategies largely ignoring the +overall importance of datasets. For that purpose, numerous task-specific +datasets have been created that were largely ignored by previously published +review articles on AI for Earth observation. With this article, we want to +change the perspective and put machine learning datasets dedicated to Earth +observation data and applications into the spotlight. Based on a review of the +historical developments, currently available resources are described and a +perspective for future developments is formed. We hope to contribute to an +understanding that the nature of our data is what distinguishes the Earth +observation community from many other communities that apply deep learning +techniques to image data, and that a detailed understanding of EO data +peculiarities is among the core competencies of our discipline. + +
+
+
+
+
+ + ☆ CHAMMI: A benchmark for channel-adaptive models in microscopy imaging NeurIPS + + +
+ Most neural networks assume that input images have a fixed number of channels +(three for RGB images). However, there are many settings where the number of +channels may vary, such as microscopy images where the number of channels +changes depending on instruments and experimental goals. Yet, there has not +been a systemic attempt to create and evaluate neural networks that are +invariant to the number and type of channels. As a result, trained models +remain specific to individual studies and are hardly reusable for other +microscopy settings. In this paper, we present a benchmark for investigating +channel-adaptive models in microscopy imaging, which consists of 1) a dataset +of varied-channel single-cell images, and 2) a biologically relevant evaluation +framework. In addition, we adapted several existing techniques to create +channel-adaptive models and compared their performance on this benchmark to +fixed-channel, baseline models. We find that channel-adaptive models can +generalize better to out-of-domain tasks and can be computationally efficient. +We contribute a curated dataset (https://doi.org/10.5281/zenodo.7988357) and an +evaluation API (https://github.com/broadinstitute/MorphEm.git) to facilitate +objective comparisons in future research and applications. + +
+
+ comment: Accepted at NeurIPS Track on Datasets and Benchmarks, 2023 +
+
+
+
+
+ + ☆ Modular Anti-noise Deep Learning Network for Robotic Grasp Detection + Based on RGB Images + + +
+ While traditional methods rely on depth sensors, the current trend leans +towards utilizing cost-effective RGB images, despite their absence of depth +cues. This paper introduces an interesting approach to detect grasping pose +from a single RGB image. To this end, we propose a modular learning network +augmented with grasp detection and semantic segmentation, tailored for robots +equipped with parallel-plate grippers. Our network not only identifies +graspable objects but also fuses prior grasp analyses with semantic +segmentation, thereby boosting grasp detection precision. Significantly, our +design exhibits resilience, adeptly handling blurred and noisy visuals. Key +contributions encompass a trainable network for grasp detection from RGB +images, a modular design facilitating feasible grasp implementation, and an +architecture robust against common image distortions. We demonstrate the +feasibility and accuracy of our proposed approach through practical experiments +and evaluations. + +
+
+
+
+
+ + ☆ Generalized Category Discovery with Clustering Assignment Consistency ICONIP 2023 + + +
+ Generalized category discovery (GCD) is a recently proposed open-world task. +Given a set of images consisting of labeled and unlabeled instances, the goal +of GCD is to automatically cluster the unlabeled samples using information +transferred from the labeled dataset. The unlabeled dataset comprises both +known and novel classes. The main challenge is that unlabeled novel class +samples and unlabeled known class samples are mixed together in the unlabeled +dataset. To address GCD without knowing the number of classes in the unlabeled +dataset, we propose a co-training-based framework that encourages clustering +consistency. Specifically, we first introduce weak and strong augmentation +transformations to generate two sufficiently different views for the same +sample. Then, based on the co-training assumption, we propose a consistency +representation learning strategy, which encourages consistency between +feature-prototype similarity and clustering assignment. Finally, we use the +discriminative embeddings learned from the semi-supervised representation +learning process to construct an original sparse network and use a community +detection method to obtain the clustering results and the number of categories +simultaneously. Extensive experiments show that our method achieves +state-of-the-art performance on three generic benchmarks and three fine-grained +visual recognition datasets. In particular, on the ImageNet-100 dataset, our +method significantly exceeds the best baseline by 15.5% and 7.0% on the +Novel and All classes, respectively. + +
+
+ comment: ICONIP 2023. This paper has been nominated for the ICONIP 2023 Best Paper + Award +
+
+
+
+
+ + ♻ ☆ Latent Space Energy-based Model for Fine-grained Open Set Recognition + + +
+ Fine-grained open-set recognition (FineOSR) aims to recognize images +belonging to classes with subtle appearance differences while rejecting images +of unknown classes. A recent trend in OSR shows the benefit of generative +models to discriminative unknown detection. As a type of generative model, +energy-based models (EBMs) have potential for hybrid modeling of generative +and discriminative tasks. However, most existing EBMs suffer from density +estimation in high-dimensional space, which is critical to recognizing images +from fine-grained classes. In this paper, we explore the low-dimensional latent +space with energy-based prior distribution for OSR in a fine-grained visual +world. Specifically, based on the latent space EBM, we propose an +attribute-aware information bottleneck (AIB), a residual attribute feature +aggregation (RAFA) module, and an uncertainty-based virtual outlier synthesis +(UVOS) module to improve the expressivity, granularity, and density of the +samples in fine-grained classes, respectively. Our method is flexible enough to take +advantage of recent vision transformers for powerful visual classification and +generation. The method is validated on both fine-grained and general visual +classification datasets while preserving the capability of generating +photo-realistic fake images with high resolution. + +
+
+ comment: Add ack +
+
+
+
+
+ + ♻ ☆ Shape-centered Representation Learning for Visible-Infrared Person + Re-identification + + +
+ Current Visible-Infrared Person Re-Identification (VI-ReID) methods +prioritize extracting distinguishing appearance features, ignoring the natural +resistance of body shape against modality changes. Initially, we gauged the +discriminative potential of shapes by a straightforward concatenation of shape +and appearance features. However, two unresolved issues persist in the +utilization of shape features. One pertains to the dependence on auxiliary +models for shape feature extraction in the inference phase, along with the +errors in generated infrared shapes due to the intrinsic modality disparity. +The other issue involves the inadequately explored correlation between shape +and appearance features. To tackle the aforementioned challenges, we propose +the Shape-centered Representation Learning framework (ScRL), which focuses on +learning shape features and appearance features associated with shapes. +Specifically, we devise the Shape Feature Propagation (SFP), facilitating +direct extraction of shape features from original images with minimal +complexity costs during inference. To restitute inaccuracies in infrared body +shapes at the feature level, we present the Infrared Shape Restitution (ISR). +Furthermore, to acquire appearance features related to shape, we design the +Appearance Feature Enhancement (AFE), which accentuates identity-related +features while suppressing identity-unrelated features guided by shape +features. Extensive experiments are conducted to validate the effectiveness of +the proposed ScRL. Achieving remarkable results, the Rank-1 (mAP) accuracy +attains 76.1%, 71.2%, 92.4% (72.6%, 52.9%, 86.7%) on the SYSU-MM01, HITSZ-VCM, +RegDB datasets respectively, outperforming existing state-of-the-art methods. + +
+
+
+
+
+
+
+
+ + Information Retrieval 3 + +
+
+
+ + ♻ ☆ LightLM: A Lightweight Deep and Narrow Language Model for Generative + Recommendation + + +
+ This paper presents LightLM, a lightweight Transformer-based language model +for generative recommendation. While Transformer-based generative modeling has +gained importance in various AI sub-fields such as NLP and vision, generative +recommendation is still in its infancy due to its unique demand on personalized +generative modeling. Existing works on generative recommendation often use +NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are +heavy-weight and are not specifically designed for recommendation tasks. +LightLM tackles the issue by introducing a light-weight deep and narrow +Transformer architecture, which is specifically tailored for direct generation +of recommendation items. This structure is especially apt for straightforward +generative recommendation and stems from the observation that a language model +does not have to be too wide for this task, as the input predominantly consists +of short tokens that are well-suited for the model's capacity. We also show +that our devised user and item ID indexing methods, i.e., Spectral +Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enable +the deep and narrow Transformer architecture to outperform large-scale language +models for recommendation. Besides, to address the hallucination problem of +generating items as output, we propose the constrained generation process for +generative recommenders. Experiments on real-world datasets show that LightLM +outperforms various competitive baselines in terms of both recommendation +accuracy and efficiency. The code can be found at +https://github.com/dongyuanjushi/LightLM. + +
+
+
+
+
+ + ♻ ☆ Towards Hybrid-grained Feature Interaction Selection for Deep Sparse + Network NeurIPS 2023 + + +
+ Deep sparse networks are widely investigated as a neural network architecture +for prediction tasks with high-dimensional sparse features, with which feature +interaction selection is a critical component. While previous methods primarily +focus on how to search feature interaction in a coarse-grained space, less +attention has been given to a finer granularity. In this work, we introduce a +hybrid-grained feature interaction selection approach that targets both feature +field and feature value for deep sparse networks. To explore such expansive +space, we propose a decomposed space which is calculated on the fly. We then +develop a selection algorithm called OptFeature, which efficiently selects the +feature interaction from both the feature field and the feature value +simultaneously. Results from experiments on three large real-world benchmark +datasets demonstrate that OptFeature performs well in terms of accuracy and +efficiency. Additional studies support the feasibility of our method. + +
+
+ comment: NeurIPS 2023 poster +
+
+
+
+
+ + ♻ ☆ Auto Search Indexer for End-to-End Document Retrieval EMNLP 2023 + + +
+ Generative retrieval, which is a new advanced paradigm for document +retrieval, has recently attracted research interests, since it encodes all +documents into the model and directly generates the retrieved documents. +However, its power is still underutilized since it heavily relies on the +"preprocessed" document identifiers (docids), thus limiting its retrieval +performance and ability to retrieve new documents. In this paper, we propose a +novel fully end-to-end retrieval paradigm. It can not only end-to-end learn the +best docids for existing and new documents automatically via a semantic +indexing module, but also perform end-to-end document retrieval via an +encoder-decoder-based generative model, namely Auto Search Indexer (ASI). +Besides, we design a reparameterization mechanism to combine the above two +modules into a joint optimization framework. Extensive experimental results +demonstrate the superiority of our model over advanced baselines on both public +and industrial datasets and also verify the ability to deal with new documents. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+
+
+
+ + Machine Learning 13 + +
+
+
+ + ☆ There Are No Data Like More Data- Datasets for Deep Learning in Earth + Observation + + +
+ Carefully curated and annotated datasets are the foundation of machine +learning, with particularly data-hungry deep neural networks forming the core +of what is often called Artificial Intelligence (AI). Due to the massive +success of deep learning applied to Earth Observation (EO) problems, the focus +of the community has been largely on the development of ever-more sophisticated +deep neural network architectures and training strategies largely ignoring the +overall importance of datasets. For that purpose, numerous task-specific +datasets have been created that were largely ignored by previously published +review articles on AI for Earth observation. With this article, we want to +change the perspective and put machine learning datasets dedicated to Earth +observation data and applications into the spotlight. Based on a review of the +historical developments, currently available resources are described and a +perspective for future developments is formed. We hope to contribute to an +understanding that the nature of our data is what distinguishes the Earth +observation community from many other communities that apply deep learning +techniques to image data, and that a detailed understanding of EO data +peculiarities is among the core competencies of our discipline. + +
+
+
+
+
+ + ☆ Stochastic Configuration Machines: FPGA Implementation + + +
+ Neural networks for industrial applications generally have additional +constraints such as response speed, memory size and power usage. Randomized +learners can address some of these issues. However, hardware solutions can +provide better resource reduction whilst maintaining the model's performance. +Stochastic configuration networks (SCNs) are a prime choice in industrial +applications due to their merits and feasibility for data modelling. Stochastic +Configuration Machines (SCMs) extend this to focus on reducing the memory +constraints by limiting the randomized weights to a binary value with a scalar +for each node and using a mechanism model to improve the learning performance +and result interpretability. This paper aims to implement SCM models on a field +programmable gate array (FPGA) and introduce binary-coded inputs to the +algorithm. Results are reported for two benchmark and two industrial datasets, +including SCM with single-layer and deep architectures. + +
+
+ comment: 19 pages, 9 figures, 8 tables +
+
+
+
+
+ + ☆ Maximum Knowledge Orthogonality Reconstruction with Gradients in + Federated Learning WACV + + +
+ Federated learning (FL) aims at keeping client data local to preserve +privacy. Instead of gathering the data itself, the server only collects +aggregated gradient updates from clients. Following the popularity of FL, there +has been a considerable amount of work revealing the vulnerability of FL +approaches by reconstructing the input data from gradient updates. Yet, most +existing works assume an FL setting with unrealistically small batch size, and +have poor image quality when the batch size is large. Other works modify the +neural network architectures or parameters to the point of being suspicious, +and thus, can be detected by clients. Moreover, most of them can only +reconstruct one sample input from a large batch. To address these limitations, +we propose a novel and completely analytical approach, referred to as the +maximum knowledge orthogonality reconstruction (MKOR), to reconstruct clients' +input data. Our proposed method reconstructs a mathematically proven high +quality image from large batches. MKOR only requires the server to send +secretly modified parameters to clients and can efficiently and inconspicuously +reconstruct the input images from clients' gradient updates. We evaluate MKOR's +performance on the MNIST, CIFAR-100, and ImageNet datasets and compare it with +the state-of-the-art works. The results show that MKOR outperforms the existing +approaches, and draws attention to a pressing need for further research on the +privacy protection of FL so that comprehensive defense approaches can be +developed. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV) 2024 +
+
+
+
+
+ + ☆ From Stream to Pool: Dynamic Pricing Beyond i.i.d. Arrivals + + +
+ The dynamic pricing problem has been extensively studied under the +stream model: A stream of customers arrives sequentially, each with an +independently and identically distributed valuation. However, this formulation +is not entirely reflective of the real world. In many scenarios, high-valuation +customers tend to make purchases earlier and leave the market, leading to a +shift in the valuation distribution. Thus motivated, we consider a model +where a pool of $n$ non-strategic unit-demand customers interact +repeatedly with the seller. Each customer monitors the price intermittently +according to an independent Poisson process and makes a purchase if the +observed price is lower than her private valuation, whereupon she leaves +the market permanently. We present a minimax optimal algorithm that +efficiently computes a non-adaptive policy which guarantees a $1/k$ fraction of +the optimal revenue, given any set of $k$ prices. Moreover, we present an +adaptive learn-then-earn policy based on a novel debiasing +approach, and prove an $\tilde O(kn^{3/4})$ regret bound. We further improve +the bound to $\tilde O(k^{3/4} n^{3/4})$ using martingale concentration +inequalities. + +
+
+
+
+
+ + ☆ A Survey of Federated Unlearning: A Taxonomy, Challenges and Future + Directions + + +
+ With the development of trustworthy Federated Learning (FL), the requirement +of implementing the right to be forgotten gives rise to the area of Federated +Unlearning (FU). Compared to machine unlearning, a major challenge of FU lies +in the decentralized and privacy-preserving nature of FL, in which clients +jointly train a global model without sharing their raw data, making it +substantially more intricate to selectively unlearn specific information. In +that regard, many efforts have been made to tackle the challenges of FU and +have achieved significant progress. In this paper, we present a comprehensive +survey of FU. Specifically, we review the existing algorithms, objectives, and +evaluation metrics, and identify some challenges of FU. By reviewing and +comparing some studies, we summarize them into a taxonomy for various schemes, +potential applications and future directions. + +
+
+
+
+
+ + ☆ On the accuracy and efficiency of group-wise clipping in differentially + private optimization + + +
+ Recent advances have substantially improved the accuracy, memory cost, and +training speed of differentially private (DP) deep learning, especially on +large vision and language models with millions to billions of parameters. In +this work, we thoroughly study the per-sample gradient clipping style, a key +component in DP optimization. We show that different clipping styles have the +same time complexity but instantiate an accuracy-memory trade-off: while the +all-layer clipping (of coarse granularity) is the most prevalent and usually +gives the best accuracy, it incurs heavier memory cost compared to other +group-wise clipping, such as the layer-wise clipping (of finer granularity). We +formalize this trade-off through our convergence theory and complexity +analysis. Importantly, we demonstrate that the accuracy gap between group-wise +clipping and all-layer clipping becomes smaller for larger models, while the +memory advantage of the group-wise clipping remains. Consequently, the +group-wise clipping allows DP optimization of large models to achieve high +accuracy and low peak memory simultaneously. + +
+
+
+
+
+ + ☆ Factor Fitting, Rank Allocation, and Partitioning in Multilevel Low Rank + Matrices + + +
+ We consider multilevel low rank (MLR) matrices, defined as a row and column +permutation of a sum of matrices, each one a block diagonal refinement of the +previous one, with all blocks low rank given in factored form. MLR matrices +extend low rank matrices but share many of their properties, such as the total +storage required and complexity of matrix-vector multiplication. We address +three problems that arise in fitting a given matrix by an MLR matrix in the +Frobenius norm. The first problem is factor fitting, where we adjust the +factors of the MLR matrix. The second is rank allocation, where we choose the +ranks of the blocks in each level, subject to the total rank having a given +value, which preserves the total storage needed for the MLR matrix. The final +problem is to choose the hierarchical partition of rows and columns, along with +the ranks and factors. This paper is accompanied by an open source package that +implements the proposed methods. + +
+
+
+
+
+ + ☆ Investigative Pattern Detection Framework for Counterterrorism + + +
+ Law-enforcement investigations aimed at preventing attacks by violent +extremists have become increasingly important for public safety. The problem is +exacerbated by the massive data volumes that need to be scanned to identify +complex behaviors of extremists and groups. Automated tools are required to +extract information to respond to queries from analysts, continually scan new +information, integrate it with past events, and then alert about emerging +threats. We address challenges in investigative pattern detection and develop +an Investigative Pattern Detection Framework for Counterterrorism (INSPECT). +The framework integrates numerous computing tools that include machine learning +techniques to identify behavioral indicators and graph pattern matching +techniques to detect risk profiles/groups. INSPECT also automates multiple +tasks for large-scale mining of detailed forensic biographies, forming +knowledge networks, and querying for behavioral indicators and radicalization +trajectories. INSPECT targets a human-in-the-loop mode of investigative search +and has been validated and evaluated using an evolving dataset on domestic +jihadism. + +
+
+ comment: 9 pages, 4 figures +
+
+
+
+
+ + ♻ ☆ Boosting Learning for LDPC Codes to Improve the Error-Floor Performance + + +
+ Low-density parity-check (LDPC) codes have been successfully commercialized +in communication systems due to their strong error correction capabilities and +simple decoding process. However, the error-floor phenomenon of LDPC codes, in +which the error rate stops decreasing rapidly at a certain level, presents +challenges for achieving extremely low error rates and deploying LDPC codes in +scenarios demanding ultra-high reliability. In this work, we propose training +methods for neural min-sum (NMS) decoders to eliminate the error-floor effect. +First, by leveraging the boosting learning technique of ensemble networks, we +divide the decoding network into two neural decoders and train the post decoder +to be specialized for uncorrected words that the first decoder fails to +correct. Secondly, to address the vanishing gradient issue in training, we +introduce a block-wise training schedule that locally trains a block of weights +while retraining the preceding block. Lastly, we show that assigning different +weights to unsatisfied check nodes effectively lowers the error-floor with a +minimal number of weights. By applying these training methods to standard LDPC +codes, we achieve the best error-floor performance compared to other decoding +methods. The proposed NMS decoder, optimized solely through novel training +methods without additional modules, can be integrated into existing LDPC +decoders without incurring extra hardware costs. The source code is available +at https://github.com/ghy1228/LDPC_Error_Floor . + +
+
+ comment: 17 pages, 10 figures +
+
+
+
+
+ + ♻ ☆ Uncertainty-aware Grounded Action Transformation towards Sim-to-Real + Transfer for Traffic Signal Control + + +
+ Traffic signal control (TSC) is a complex and important task that affects the +daily lives of millions of people. Reinforcement Learning (RL) has shown +promising results in optimizing traffic signal control, but current RL-based +TSC methods are mainly trained in simulation and suffer from the performance +gap between simulation and the real world. In this paper, we propose a +simulation-to-real-world (sim-to-real) transfer approach called UGAT, which +transfers a learned policy trained from a simulated environment to a real-world +environment by dynamically transforming actions in the simulation with +uncertainty to mitigate the domain gap of transition dynamics. We evaluate our +method on a simulated traffic environment and show that it significantly +improves the performance of the transferred RL policy in the real world. + +
+
+ comment: 6 pages, 3 figures. This paper is accepted by IEEE-CDC 2023 +
+
+
+
+
+ + ♻ ☆ Hierarchical Prompting Assists Large Language Model on Web Navigation EMNLP 2023 + + +
+ Large language models (LLMs) struggle on processing complicated observations +in interactive decision making tasks. To alleviate this issue, we propose a +simple hierarchical prompting approach. Diverging from previous prompting +approaches that always put the full observation (e.g. a web page) to the +prompt, we propose to first construct an action-aware observation which is more +condensed and relevant with a dedicated SUMMARIZER prompt. The ACTOR prompt +then predicts the next action based on the summarized observation. While our +method has broad applicability, we particularly demonstrate its efficacy in the +complex domain of web navigation where a full observation often contains +redundant and irrelevant information. Our approach outperforms the previous +state-of-the-art prompting mechanics by 6.2% on task success rate, +demonstrating its potential on interactive decision making tasks with long +observation traces. + +
+
+ comment: EMNLP 2023 Findings; Natural Language Reasoning and Structured + Explanations Workshop at ACL 2023 +
+
+
+
+
+ + ♻ ☆ Diffused Redundancy in Pre-trained Representations NeurIPS 2023 + + +
+ Representations learned by pre-training a neural network on a large dataset +are increasingly used successfully to perform a variety of downstream tasks. In +this work, we take a closer look at how features are encoded in such +pre-trained representations. We find that learned representations in a given +layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of +neurons in the layer that is larger than a threshold size shares a large degree +of similarity with the full layer and is able to perform similarly as the whole +layer on a variety of downstream tasks. For example, a linear probe trained on +$20\%$ of randomly picked neurons from the penultimate layer of a ResNet50 +pre-trained on ImageNet1k achieves an accuracy within $5\%$ of a linear probe +trained on the full layer of neurons for downstream CIFAR10 classification. We +conduct experiments on different neural architectures (including CNNs and +Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a +variety of downstream tasks taken from the VTAB benchmark. We find that the +loss and dataset used during pre-training largely govern the degree of diffuse +redundancy and the "critical mass" of neurons needed often depends on the +downstream task, suggesting that there is a task-inherent +redundancy-performance Pareto frontier. Our findings shed light on the nature +of representations learned by pre-trained deep neural networks and suggest that +entire layers might not be necessary to perform many downstream tasks. We +investigate the potential for exploiting this redundancy to achieve efficient +generalization for downstream tasks and also draw caution to certain possible +unintended consequences. Our code is available at +\url{https://github.com/nvedant07/diffused-redundancy}. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Forecasting Tropical Cyclones with Cascaded Diffusion Models + + +
+ As cyclones become more intense due to climate change, the rise of AI-based +modelling provides a more affordable and accessible approach compared to +traditional methods based on mathematical models. This work leverages diffusion +models to forecast cyclone trajectories and precipitation patterns by +integrating satellite imaging, remote sensing, and atmospheric data, employing +a cascaded approach that incorporates forecasting, super-resolution, and +precipitation modelling, with training on a dataset of 51 cyclones from six +major basins. Experiments demonstrate that the final forecasts from the +cascaded models show accurate predictions up to a 36-hour rollout, with SSIM +and PSNR values exceeding 0.5 and 20 dB, respectively, for all three tasks. +This work also highlights the promising efficiency of AI methods such as +diffusion models for high-performance needs, such as cyclone forecasting, while +remaining computationally affordable, making them ideal for highly vulnerable +regions with critical forecasting needs and financial limitations. Code +accessible at \url{https://github.com/nathzi1505/forecast-diffmodels}. + +
+
+ comment: 6 pages, 3 figures +
+
+
+
+
+
+
+
+ + Multimedia 1 + +
+
+
+ + ♻ ☆ ControlLLM: Augment Language Models with Tools by Searching on Graphs + + +
+ We present ControlLLM, a novel framework that enables large language models +(LLMs) to utilize multi-modal tools for solving complex real-world tasks. +Despite the remarkable performance of LLMs, they still struggle with tool +invocation due to ambiguous user prompts, inaccurate tool selection and +parameterization, and inefficient tool scheduling. To overcome these +challenges, our framework comprises three key components: (1) a \textit{task +decomposer} that breaks down a complex task into clear subtasks with +well-defined inputs and outputs; (2) a \textit{Thoughts-on-Graph (ToG) +paradigm} that searches the optimal solution path on a pre-built tool graph, +which specifies the parameter and dependency relations among different tools; +and (3) an \textit{execution engine with a rich toolbox} that interprets the +solution path and runs the tools efficiently on different computational +devices. We evaluate our framework on diverse tasks involving image, audio, and +video processing, demonstrating its superior accuracy, efficiency, and +versatility compared to existing methods. The code is at +https://github.com/OpenGVLab/ControlLLM . + +
+
+ comment: 22 pages, 9 figures, 10 tables +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 56 + +
+
+
+ + ☆ From Chatbots to PhishBots? -- Preventing Phishing scams created using + ChatGPT, Google Bard and Claude + + +
+ The advanced capabilities of Large Language Models (LLMs) have made them +invaluable across various applications, from conversational agents and content +creation to data analysis, research, and innovation. However, their +effectiveness and accessibility also render them susceptible to abuse for +generating malicious content, including phishing attacks. This study explores +the potential of using four popular commercially available LLMs - ChatGPT (GPT +3.5 Turbo), GPT 4, Claude and Bard to generate functional phishing attacks +using a series of malicious prompts. We discover that these LLMs can generate +both phishing emails and websites that can convincingly imitate well-known +brands, and also deploy a range of evasive tactics for the latter to elude +detection mechanisms employed by anti-phishing systems. Notably, these attacks +can be generated using unmodified, or "vanilla," versions of these LLMs, +without requiring any prior adversarial exploits such as jailbreaking. As a +countermeasure, we build a BERT based automated detection tool that can be used +for the early detection of malicious prompts to prevent LLMs from generating +phishing content attaining an accuracy of 97\% for phishing website prompts, +and 94\% for phishing email prompts. + +
+
+
+
+
+ + ☆ Robustifying Language Models with Test-Time Adaptation ICLR + + +
+ Large-scale language models achieved state-of-the-art performance over a +number of language tasks. However, they fail on adversarial language examples, +which are sentences optimized to fool the language models but with similar +semantic meanings for humans. While prior work focuses on making the language +model robust at training time, retraining for robustness is often unrealistic +for large-scale foundation models. Instead, we propose to make the language +models robust at test time. By dynamically adapting the input sentence with +predictions from masked words, we show that we can reverse many language +adversarial attacks. Since our approach does not require any training, it works +for novel tasks at test time and can adapt to novel adversarial corruptions. +Visualizations and empirical results on two popular sentence classification +datasets demonstrate that our method can repair adversarial language attacks +over 65% o + +
+
+ comment: 8 Pages 2 Figures Submitted to ICLR Workshop +
+
+
+
+
+ + ☆ Poisoning Retrieval Corpora by Injecting Adversarial Passages EMNLP 2023 + + +
+ Dense retrievers have achieved state-of-the-art performance in various +information retrieval tasks, but to what extent can they be safely deployed in +real-world applications? In this work, we propose a novel attack for dense +retrieval systems in which a malicious user generates a small number of +adversarial passages by perturbing discrete tokens to maximize similarity with +a provided set of training queries. When these adversarial passages are +inserted into a large retrieval corpus, we show that this attack is highly +effective in fooling these systems to retrieve them for queries that were not +seen by the attacker. More surprisingly, these adversarial passages can +directly generalize to out-of-domain queries and corpora with a high success +attack rate -- for instance, we find that 50 generated passages optimized on +Natural Questions can mislead >94% of questions posed in financial documents or +online forums. We also benchmark and compare a range of state-of-the-art dense +retrievers, both unsupervised and supervised. Although different systems +exhibit varying levels of vulnerability, we show they can all be successfully +attacked by injecting up to 500 passages, a small fraction compared to a +retrieval corpus of millions of passages. + +
+
+ comment: EMNLP 2023. Our code is available at + https://github.com/princeton-nlp/corpus-poisoning +
+
+
+
+
+ + ☆ BERT Lost Patience Won't Be Robust to Adversarial Slowdown NeurIPS 2023 + + +
+ In this paper, we systematically evaluate the robustness of multi-exit +language models against adversarial slowdown. To audit their robustness, we +design a slowdown attack that generates natural adversarial text bypassing +early-exit points. We use the resulting WAFFLE attack as a vehicle to conduct a +comprehensive evaluation of three multi-exit mechanisms with the GLUE benchmark +against adversarial slowdown. We then show our attack significantly reduces the +computational savings provided by the three methods in both white-box and +black-box settings. The more complex a mechanism is, the more vulnerable it is +to adversarial slowdown. We also perform a linguistic analysis of the perturbed +text inputs, identifying common perturbation patterns that our attack +generates, and comparing them with standard adversarial text attacks. Moreover, +we show that adversarial training is ineffective in defeating our slowdown +attack, but input sanitization with a conversational model, e.g., ChatGPT, can +remove perturbations effectively. This result suggests that future work is +needed for developing efficient yet robust multi-exit models. Our code is +available at: https://github.com/ztcoalson/WAFFLE + +
+
+ comment: Accepted to NeurIPS 2023 [Poster] +
+
+
+
+
+ + ☆ Learning to Follow Object-Centric Image Editing Instructions Faithfully EMNLP 2023 + + +
+ Natural language instructions are a powerful interface for editing the +outputs of text-to-image diffusion models. However, several challenges need to +be addressed: 1) underspecification (the need to model the implicit meaning of +instructions) 2) grounding (the need to localize where the edit has to be +performed), 3) faithfulness (the need to preserve the elements of the image not +affected by the edit instruction). Current approaches focusing on image editing +with natural language instructions rely on automatically generated paired data, +which, as shown in our investigation, is noisy and sometimes nonsensical, +exacerbating the above issues. Building on recent advances in segmentation, +Chain-of-Thought prompting, and visual question answering, we significantly +improve the quality of the paired data. In addition, we enhance the supervision +signal by highlighting parts of the image that need to be changed by the +instruction. The model fine-tuned on the improved data is capable of performing +fine-grained object-centric edits better than state-of-the-art baselines, +mitigating the problems outlined above, as shown by automatic and human +evaluations. Moreover, our model is capable of generalizing to domains unseen +during training, such as visual metaphors. + +
+
+ comment: Findings of EMNLP 2023 (Long paper) +
+
+
+
+
+ + ☆ Women Wearing Lipstick: Measuring the Bias Between an Object and Its + Related Gender EMNLP + + +
+ In this paper, we investigate the impact of objects on gender bias in image +captioning systems. Our results show that only gender-specific objects have a +strong gender bias (e.g., women-lipstick). In addition, we propose a visual +semantic-based gender score that measures the degree of bias and can be used as +a plug-in for any image captioning system. Our experiments demonstrate the +utility of the gender score, since we observe that our score can measure the +bias relation between a caption and its related gender; therefore, our score +can be used as an additional metric to the existing Object Gender Co-Occ +approach. Code and data are publicly available at +\url{https://github.com/ahmedssabir/GenderScore}. + +
+
+ comment: EMNLP Findings 2023 +
+
+
+
+
+ + ☆ Unified Representation for Non-compositional and Compositional + Expressions EMNLP 2023 + + +
+ Accurate processing of non-compositional language relies on generating good +representations for such expressions. In this work, we study the representation +of language non-compositionality by proposing a language model, PIER, that +builds on BART and can create semantically meaningful and contextually +appropriate representations for English potentially idiomatic expressions +(PIEs). PIEs are characterized by their non-compositionality and contextual +ambiguity in their literal and idiomatic interpretations. Via intrinsic +evaluation on embedding quality and extrinsic evaluation on PIE processing and +NLU tasks, we show that representations generated by PIER result in 33% higher +homogeneity score for embedding clustering than BART, whereas 3.12% and 3.29% +gains in accuracy and sequence accuracy for PIE sense classification and span +detection compared to the state-of-the-art IE representation model, GIEA. These +gains are achieved without sacrificing PIER's performance on NLU tasks (+/- 1% +accuracy) compared to BART. + +
+
+ comment: This work is accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Three Dogmas, a Puzzle and its Solution + + +
+ Modern Logics, as formulated notably by Frege, Russell and Tarski involved +basic assumptions about Natural Languages in general and Indo-European +Languages in particular, which are contested by Linguists. Based upon those +assumptions, formal Languages were designed to overcome what Logicians claimed +to be 'defects' of Natural Language. In this paper we show that those +assumptions contradict basic principles of Arabic. More specifically: The +Logicians ideas, that within Natural Language words refer to objects, +'ToBe'-constructions represent identity statements, Indefinite Descriptions +must be replaced by existential quantifiers to form meaningful Sentences and +Symbols can have no interpretation-independent meanings, are all falsified +using undisputed principles of Arabic. The here presented falsification serves +two purposes. First, it is used as a factual basis for the rejection of +approaches adopting Semantic axioms of Mathematical Logics as Models for +meaning of Arabic Syntax. Second, it shows a way to approach the important +computational problem: Satisfiability (SAT). The described way is based upon +the realization that parsing Arabic utilizes the existence of +'meaning-particles' within Syntax to efficiently recognize words, phrases and +Sentences. Similar meaning-particles are shown to exist in 3CNF formulas, +which, when properly handled within the machinery of 3SAT-Solvers, enable +structural conditions to be imposed on formulas, sufficient alone to guarantee +the efficient production of non-exponentially sized Free Binary Decision +Diagrams (FBDDs). We show, why known exponential Lower Bounds on sizes of FBDDs +do not contradict our results and reveal practical evidence, obtained for +multiplication circuits, supporting our claims. + +
+
+ comment: 99 pages +
+
+
+
+
+ + ☆ PACuna: Automated Fine-Tuning of Language Models for Particle + Accelerators + + +
+ Navigating the landscape of particle accelerators has become increasingly +challenging with recent surges in contributions. These intricate devices +challenge comprehension, even within individual facilities. To address this, we +introduce PACuna, a fine-tuned language model refined through publicly +available accelerator resources like conferences, pre-prints, and books. We +automated data collection and question generation to minimize expert +involvement and make the data publicly available. PACuna demonstrates +proficiency in addressing intricate accelerator questions, validated by +experts. Our approach shows adapting language models to scientific domains by +fine-tuning technical texts and auto-generated corpora capturing the latest +developments can further produce pre-trained models to answer some intricate +questions that commercially available assistants cannot and can serve as +intelligent assistants for individual facilities. + +
+
+
+
+
+ + ☆ Pushdown Layers: Encoding Recursive Structure in Transformer Language + Models EMNLP 2023 + + +
+ Recursion is a prominent feature of human language, and fundamentally +challenging for self-attention due to the lack of an explicit recursive-state +tracking mechanism. Consequently, Transformer language models poorly capture +long-tail recursive structure and exhibit sample-inefficient syntactic +generalization. This work introduces Pushdown Layers, a new self-attention +layer that models recursive state via a stack tape that tracks estimated depths +of every token in an incremental parse of the observed prefix. Transformer LMs +with Pushdown Layers are syntactic language models that autoregressively and +synchronously update this stack tape as they predict new tokens, in turn using +the stack tape to softly modulate attention over tokens -- for instance, +learning to "skip" over closed constituents. When trained on a corpus of +strings annotated with silver constituency parses, Transformers equipped with +Pushdown Layers achieve dramatically better and 3-5x more sample-efficient +syntactic generalization, while maintaining similar perplexities. Pushdown +Layers are a drop-in replacement for standard self-attention. We illustrate +this by finetuning GPT2-medium with Pushdown Layers on an automatically parsed +WikiText-103, leading to improvements on several GLUE text classification +tasks. + +
+
+ comment: Accepted at EMNLP 2023 (Long Papers) +
+
+
+
+
+ + ☆ Roles of Scaling and Instruction Tuning in Language Perception: Model + vs. Human Attention + + +
+ Recent large language models (LLMs) have revealed strong abilities to +understand natural language. Since most of them share the same basic structure, +i.e. the transformer block, possible contributors to their success in the +training process are scaling and instruction tuning. However, how these factors +affect the models' language perception is unclear. This work compares the +self-attention of several existing LLMs (LLaMA, Alpaca and Vicuna) in different +sizes (7B, 13B, 30B, 65B), together with eye saccade, an aspect of human +reading attention, to assess the effect of scaling and instruction tuning on +language perception. Results show that scaling enhances the human resemblance +and improves the effective attention by reducing the trivial pattern reliance, +while instruction tuning does not. However, instruction tuning significantly +enhances the models' sensitivity to instructions. We also find that current +LLMs are consistently closer to non-native than native speakers in attention, +suggesting a sub-optimal language perception of all models. Our code and data +used in the analysis is available on GitHub. + +
+
+
+
+
+ + ☆ TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language + Understanding + + +
+ Large-scale video-language pre-training has made remarkable strides in +advancing video-language understanding tasks. However, the heavy computational +burden of video encoding remains a formidable efficiency bottleneck, +particularly for long-form videos. These videos contain massive visual tokens +due to their inherent 3D properties and spatiotemporal redundancy, making it +challenging to capture complex temporal and spatial relationships. To tackle +this issue, we propose an efficient method called TEmporal-Spatial Token +Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating +similar frames, as well as similar patches within each frame. TESTA can reduce +the number of visual tokens by 75% and thus accelerate video encoding. Building +upon TESTA, we introduce a pre-trained video-language model equipped with a +divided space-time token aggregation module in each video encoder block. We +evaluate our model on five datasets for paragraph-to-video retrieval and +long-form VideoQA tasks. Experimental results show that TESTA improves +computing efficiency by 1.7 times, and achieves significant performance gains +from its scalability in processing longer input frames, e.g., +13.7 R@1 on +QuerYD and +6.5 R@1 on Condensed Movie. + +
+
+ comment: 16 pages, 9 figures, code is available at + https://github.com/RenShuhuai-Andy/TESTA +
+
+
+
+
+ + ☆ A Survey on Recent Named Entity Recognition and Relation Classification + Methods with Focus on Few-Shot Learning Approaches + + +
+ Named entity recognition and relation classification are key stages for +extracting information from unstructured text. Several natural language +processing applications utilize the two tasks, such as information retrieval, +knowledge graph construction and completion, question answering and other +domain-specific applications, such as biomedical data mining. We present a +survey of recent approaches in the two tasks with focus on few-shot learning +approaches. Our work compares the main approaches followed in the two +paradigms. Additionally, we report the latest metric scores in the two tasks +with a structured analysis that considers the results in the few-shot learning +scope. + +
+
+
+
+
+ + ☆ ArBanking77: Intent Detection Neural Model and a New Dataset in Modern + and Dialectical Arabic + + +
+ This paper presents the ArBanking77, a large Arabic dataset for intent +detection in the banking domain. Our dataset was arabized and localized from +the original English Banking77 dataset, which consists of 13,083 queries to +ArBanking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA) +and Palestinian dialect, with each query classified into one of the 77 classes +(intents). Furthermore, we present a neural model, based on AraBERT, fine-tuned +on ArBanking77, which achieved an F1-score of 0.9209 and 0.8995 on MSA and +Palestinian dialect, respectively. We performed extensive experimentation in +which we simulated low-resource settings, where the model is trained on a +subset of the data and augmented with noisy queries to simulate colloquial +terms, mistakes and misspellings found in real NLP systems, especially live +chat queries. The data and the models are publicly available at +https://sina.birzeit.edu/arbanking77. + +
+
+
+
+
+ + ☆ SALMA: Arabic Sense-Annotated Corpus and WSD Benchmarks + + +
+ SALMA, the first Arabic sense-annotated corpus, consists of ~34K tokens, +which are all sense-annotated. The corpus is annotated using two different +sense inventories simultaneously (Modern and Ghani). SALMA's novelty lies in how +tokens and senses are associated. Instead of linking a token to only one +intended sense, SALMA links a token to multiple senses and provides a score to +each sense. A smart web-based annotation tool was developed to support scoring +multiple senses against a given word. In addition to sense annotations, we also +annotated the corpus using six types of named entities. The quality of our +annotations was assessed using various metrics (Kappa, Linear Weighted Kappa, +Quadratic Weighted Kappa, Mean Average Error, and Root Mean Square Error), +which show very high inter-annotator agreement. To establish a Word Sense +Disambiguation baseline using our SALMA corpus, we developed an end-to-end Word +Sense Disambiguation system using Target Sense Verification. We used this +system to evaluate three Target Sense Verification models available in the +literature. Our best model achieved an accuracy of 84.2% using Modern and +78.7% using Ghani. The full corpus and the annotation tool are open-source and +publicly available at https://sina.birzeit.edu/salma/. + +
+
+
+
+
+ + ☆ TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language + Modeling Likewise + + +
+ Large Language Models (LLMs) exhibit impressive reasoning and data +augmentation capabilities in various NLP tasks. However, what about small +models? In this work, we propose TeacherLM-7.1B, capable of annotating relevant +fundamentals, chain of thought, and common mistakes for most NLP samples, which +makes annotation more than just an answer, thus allowing other models to learn +"why" instead of just "what". The TeacherLM-7.1B model achieved a zero-shot +score of 52.3 on MMLU, surpassing most models with over 100B parameters. Even +more remarkable is its data augmentation ability. Based on TeacherLM-7.1B, we +augmented 58 NLP datasets and taught various student models with different +parameters from OPT and BLOOM series in a multi-task setting. The experimental +results indicate that the data augmentation provided by TeacherLM has brought +significant benefits. We will release the TeacherLM series of models and +augmented datasets as open-source. + +
+
+ comment: 5 figures, 15 pages +
+
+
+
+
+ + ☆ Bipartite Graph Pre-training for Unsupervised Extractive Summarization + with Graph Convolutional Auto-Encoders EMNLP 2023 + + +
+ Pre-trained sentence representations are crucial for identifying significant +sentences in unsupervised document extractive summarization. However, the +traditional two-step paradigm of pre-training and sentence-ranking creates a +gap due to differing optimization objectives. To address this issue, we argue +that utilizing pre-trained embeddings derived from a process specifically +designed to optimize cohesive and distinctive sentence representations helps +rank significant sentences. To do so, we propose a novel graph pre-training +auto-encoder to obtain sentence embeddings by explicitly modelling +intra-sentential distinctive features and inter-sentential cohesive features +through sentence-word bipartite graphs. These pre-trained sentence +representations are then utilized in a graph-based ranking algorithm for +unsupervised summarization. Our method produces predominant performance for +unsupervised summarization frameworks by providing summary-worthy sentence +representations. It surpasses heavy BERT- or RoBERTa-based sentence +representations in downstream tasks. + +
+
+ comment: Accepted by the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP 2023) +
+
+
+
+
+ + ☆ EtiCor: Corpus for Analyzing LLMs for Etiquettes EMNLP 2023 + + +
+ Etiquettes are an essential ingredient of day-to-day interactions among +people. Moreover, etiquettes are region-specific, and etiquettes in one region +might contradict those in other regions. In this paper, we propose EtiCor, an +Etiquettes Corpus, having texts about social norms from five different regions +across the globe. The corpus provides a test bed for evaluating LLMs for +knowledge and understanding of region-specific etiquettes. Additionally, we +propose the task of Etiquette Sensitivity. We experiment with state-of-the-art +LLMs (Delphi, Falcon40B, and GPT-3.5). Initial results indicate that LLMs +mostly fail to understand etiquettes from regions in the non-Western world. + +
+
+ comment: Accepted at EMNLP 2023, Main Conference +
+
+
+
+
+ + ☆ LLMs and Finetuning: Benchmarking cross-domain performance for hate + speech detection + + +
+ This paper compares different pre-trained and fine-tuned large language +models (LLMs) for hate speech detection. Our research underscores challenges in +LLMs' cross-domain validity and overfitting risks. Through evaluations, we +highlight the need for fine-tuned models that grasp the nuances of hate speech +through greater label heterogeneity. We conclude with a vision for the future +of hate speech detection, emphasizing cross-domain generalizability and +appropriate benchmarking practices. + +
+
+ comment: 9 pages, 3 figures, 4 tables +
+
+
+
+
+ + ☆ End-to-End Autoregressive Retrieval via Bootstrapping for Smart Reply + Systems EMNLP 2023 + + +
+ Reply suggestion systems represent a staple component of many instant +messaging and email systems. However, the requirement to produce sets of +replies, rather than individual replies, makes the task poorly suited for +out-of-the-box retrieval architectures, which only consider individual +message-reply similarity. As a result, these systems often rely on additional +post-processing modules to diversify the outputs. However, these approaches are +ultimately bottlenecked by the performance of the initial retriever, which in +practice struggles to present a sufficiently diverse range of options to the +downstream diversification module, leading to the suggestions being less +relevant to the user. In this paper, we consider a novel approach that +radically simplifies this pipeline through an autoregressive text-to-text +retrieval model that learns the smart reply task end-to-end from a dataset of +(message, reply set) pairs obtained via bootstrapping. Empirical results show +this method consistently outperforms a range of state-of-the-art baselines +across three datasets, corresponding to a 5.1%-17.9% improvement in relevance, +and a 0.5%-63.1% improvement in diversity compared to the best baseline +approach. We make our code publicly available. + +
+
+ comment: FINDINGS-EMNLP 2023 +
+
+
+
+
+ + ☆ S2F-NER: Exploring Sequence-to-Forest Generation for Complex Entity + Recognition + + +
+ Named Entity Recognition (NER) remains challenging due to complex +entities, such as nested, overlapping, and discontinuous entities. Existing +approaches, such as sequence-to-sequence (Seq2Seq) generation and span-based +classification, have shown impressive performance on various NER subtasks, but +they are difficult to scale to datasets with longer input text because of +either the exposure bias issue or inefficient computation. In this paper, we +propose a novel Sequence-to-Forest generation paradigm, S2F-NER, which can +directly extract entities in a sentence via a Forest decoder that decodes multiple +entities in parallel rather than sequentially. Specifically, our model generates +each path of each tree in the forest autoregressively, where the maximum depth of +each tree is three (which is the shortest feasible length for complex NER and +is far smaller than the decoding length of Seq2Seq). Based on this novel +paradigm, our model can elegantly mitigate the exposure bias problem and keep +the simplicity of Seq2Seq. Experimental results show that our model +significantly outperforms the baselines on three discontinuous NER datasets and +on two nested NER datasets, especially for discontinuous entity recognition. + +
+
+
+
+
+ + ☆ Retrofitting Light-weight Language Models for Emotions using Supervised + Contrastive Learning EMNLP 2023 + + +
+ We present a novel retrofitting method to induce emotion aspects into +pre-trained language models (PLMs) such as BERT and RoBERTa. Our method updates +pre-trained network weights using contrastive learning so that the text +fragments exhibiting similar emotions are encoded nearby in the representation +space, and the fragments with different emotion content are pushed apart. While +doing so, it also ensures that the linguistic knowledge already present in PLMs +is not inadvertently perturbed. The language models retrofitted by our method, +i.e., BERTEmo and RoBERTaEmo, produce emotion-aware text representations, as +evaluated through different clustering and retrieval metrics. For the +downstream tasks on sentiment analysis and sarcasm detection, they perform +better than their pre-trained counterparts (about 1% improvement in F1-score) +and other existing approaches. Additionally, a more significant boost in +performance is observed for the retrofitted models over pre-trained ones in +few-shot learning setting. + +
+
+ comment: EMNLP 2023 Camera Ready Version +
+
+
+
+
+ + ☆ Debiasing Algorithm through Model Adaptation + + +
+ Large language models are becoming the go-to solution for various language +tasks. However, with growing capacity, models are prone to rely on spurious +correlations stemming from biases and stereotypes present in the training data. +This work proposes a novel method for detecting and mitigating gender bias in +language models. We perform causal analysis to identify problematic model +components and discover that mid-upper feed-forward layers are most prone to +convey biases. Based on the analysis results, we adapt the model by multiplying +these layers by a linear projection. Our titular method, DAMA, significantly +decreases bias as measured by diverse metrics while maintaining the model's +performance on downstream tasks. We release code for our method and models, +which retain LLaMA's state-of-the-art performance while being significantly +less biased. + +
+
+
+
+
+ + ☆ Sentence Bag Graph Formulation for Biomedical Distant Supervision + Relation Extraction + + +
+ We introduce a novel graph-based framework for alleviating key challenges in +distantly-supervised relation extraction and demonstrate its effectiveness in +the challenging and important domain of biomedical data. Specifically, we +propose a graph view of sentence bags referring to an entity pair, which +enables message-passing based aggregation of information related to the entity +pair over the sentence bag. The proposed framework alleviates the common +problem of noisy labeling in distantly supervised relation extraction and also +effectively incorporates inter-dependencies between sentences within a bag. +Extensive experiments on two large-scale biomedical relation datasets and the +widely utilized NYT dataset demonstrate that our proposed framework +significantly outperforms the state-of-the-art methods for biomedical distant +supervision relation extraction while also providing excellent performance for +relation extraction in the general text mining domain. + +
+
+
+
+
+ + ☆ Stacking the Odds: Transformer-Based Ensemble for AI-Generated Text + Detection ALT + + +
+ This paper reports our submission under the team name 'SynthDetectives' to +the ALTA 2023 Shared Task. We use a stacking ensemble of Transformers for the +task of AI-generated text detection. Our approach is novel in terms of its +choice of models in that we use accessible and lightweight models in the +ensemble. We show that ensembling the models results in an improved accuracy in +comparison with using them individually. Our approach achieves an accuracy +score of 0.9555 on the official test data provided by the shared task +organisers. + +
+
+ comment: This is an ALTA 2023 Shared Task Paper +
+
+
+
+
+ + ☆ Pre-trained Speech Processing Models Contain Human-Like Biases that + Propagate to Speech Emotion Recognition + + +
+ Previous work has established that a person's demographics and speech style +affect how well speech processing models perform for them. But where does this +bias come from? In this work, we present the Speech Embedding Association Test +(SpEAT), a method for detecting bias in one type of model used for many speech +tasks: pre-trained models. The SpEAT is inspired by word embedding association +tests in natural language processing, which quantify intrinsic bias in a +model's representations of different concepts, such as race or valence +(something's pleasantness or unpleasantness) and capture the extent to which a +model trained on large-scale socio-cultural data has learned human-like biases. +Using the SpEAT, we test for six types of bias in 16 English speech models +(including 4 models also trained on multilingual data), which come from the +wav2vec 2.0, HuBERT, WavLM, and Whisper model families. We find that 14 or more +models reveal positive valence (pleasantness) associations with abled people +over disabled people, with European-Americans over African-Americans, with +females over males, with U.S. accented speakers over non-U.S. accented +speakers, and with younger people over older people. Beyond establishing that +pre-trained speech models contain these biases, we also show that they can have +real world effects. We compare biases found in pre-trained models to biases in +downstream models adapted to the task of Speech Emotion Recognition (SER) and +find that in 66 of the 96 tests performed (69%), the group that is more +associated with positive valence as indicated by the SpEAT also tends to be +predicted as speaking with higher valence by the downstream model. Our work +provides evidence that, like text and image-based models, pre-trained speech +based-models frequently learn human-like biases. Our work also shows that bias +found in pre-trained models can propagate to the downstream task of SER. + +
+
+
+
+
+ + ☆ Prompt-Engineering and Transformer-based Question Generation and + Evaluation + + +
+ Question generation has numerous applications in the educational context. +Question generation can prove helpful for students when reviewing content and +testing themselves. Furthermore, a question generation model can aid teachers +by lessening the burden of creating assessments and other practice material. +This paper aims to find the best method to generate questions from textual data +through a transformer model and prompt engineering. In this research, we +finetuned a pretrained distilBERT model on the SQuAD question answering dataset +to generate questions. In addition to training a transformer model, prompt +engineering was applied to generate questions effectively using the LLaMA +model. The generated questions were compared against the baseline questions in +the SQuAD dataset to evaluate the effectiveness of four different prompts. All +four prompts demonstrated over 60% similarity on average. Of the +prompt-generated questions, 30% achieved a high similarity score greater than +70%. + +
+
+
+
+
+ + ☆ MUST: A Multilingual Student-Teacher Learning approach for low-resource + speech recognition + + +
+ Student-teacher learning or knowledge distillation (KD) has been previously +used to address the data scarcity issue in training speech recognition (ASR) +systems. However, a limitation of KD training is that the student model classes +must be a proper or improper subset of the teacher model classes. It prevents +distillation from even acoustically similar languages if the character sets are +not the same. In this work, the aforementioned limitation is addressed by proposing +a MUltilingual Student-Teacher (MUST) learning which exploits a posteriors +mapping approach. A pre-trained mapping model is used to map posteriors from a +teacher language to the student language ASR. These mapped posteriors are used +as soft labels for KD learning. Various teacher ensemble schemes are +explored to train an ASR model for low-resource languages. A model trained +with MUST learning reduces relative character error rate (CER) by up to 9.5% in +comparison with a baseline monolingual ASR. + +
+
+ comment: Accepted for IEEE ASRU 2023 +
+
+
+
+
+ + ☆ Counterfactually Probing Language Identity in Multilingual Models EMNLP 2023 + + +
+ Techniques in causal analysis of language models illuminate how linguistic +information is organized in LLMs. We use one such technique, AlterRep, a method +of counterfactual probing, to explore the internal structure of multilingual +models (mBERT and XLM-R). We train a linear classifier on a binary language +identity task, to classify tokens between Language X and Language Y. Applying a +counterfactual probing procedure, we use the classifier weights to project the +embeddings into the null space and push the resulting embeddings either in the +direction of Language X or Language Y. Then we evaluate on a masked language +modeling task. We find that, given a template in Language X, pushing towards +Language Y systematically increases the probability of Language Y words, above +and beyond a third-party control language. But it does not specifically push +the model towards translation-equivalent words in Language Y. Pushing towards +Language X (the same direction as the template) has a minimal effect, but +somewhat degrades these models. Overall, we take these results as further +evidence of the rich structure of massive multilingual language models, which +include both a language-specific and language-general component. And we show +that counterfactual probing can be fruitfully applied to multilingual models. + +
+
+ comment: 12 pages, 5 figures, MRL Workshop @ EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ LEACE: Perfect linear concept erasure in closed form + + +
+ Concept erasure aims to remove specified features from a representation. It +can improve fairness (e.g. preventing a classifier from using gender or race) +and interpretability (e.g. removing a concept to observe changes in model +behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form +method which provably prevents all linear classifiers from detecting a concept +while changing the representation as little as possible, as measured by a broad +class of norms. We apply LEACE to large language models with a novel procedure +called "concept scrubbing," which erases target concept information from every +layer in the network. We demonstrate our method on two tasks: measuring the +reliance of language models on part-of-speech information, and reducing gender +bias in BERT embeddings. Code is available at +https://github.com/EleutherAI/concept-erasure. + +
+
+
+
+
+ + ♻ ☆ ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers NeurIPS 2023 + + +
+ Large language models (LLMs) excel at implementing code from functionality +descriptions but struggle with algorithmic problems that require not only +implementation but also identification of the suitable algorithm. Moreover, +LLM-generated programs lack guaranteed correctness and require human +verification. To address these challenges, we propose ALGO, a framework that +synthesizes Algorithmic programs with LLM-Generated Oracles to guide the +generation and verify their correctness. ALGO first generates a reference +oracle by prompting an LLM to exhaustively enumerate all the combinations of +relevant variables. This oracle is then utilized to guide an arbitrary search +strategy in exploring the algorithm space and to verify the synthesized +algorithms. Our study shows that the LLM-generated oracles are correct for 88% +of the cases. With the oracles as verifiers, ALGO can be integrated with any +existing code generation model in a model-agnostic manner to enhance its +performance. Experiments show that when equipped with ALGO, we achieve an 8x +better one-submission pass rate over the Codex model and a 2.6x better +one-submission pass rate over CodeT, the current state-of-the-art model on +CodeContests. We can also get 1.3x better pass rate over the ChatGPT Code +Interpreter on unseen problems. The problem set we used for testing, the +prompts we used, the verifier and solution programs, and the test cases +generated by ALGO are available at https://github.com/zkx06111/ALGO. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Multilingual Machine Translation with Large Language Models: Empirical + Results and Analysis + + +
+ Large language models (LLMs) have demonstrated remarkable potential in +handling multilingual machine translation (MMT). In this paper, we +systematically investigate the advantages and challenges of LLMs for MMT by +answering two questions: 1) How well do LLMs perform in translating massive +languages? 2) Which factors affect LLMs' performance in translation? We +thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our +empirical results show that translation capabilities of LLMs are continually +improving. GPT-4 has beaten the strong supervised baseline NLLB in 40.91% of +translation directions but still faces a large gap towards the commercial +translation system, especially on low-resource languages. Through further +analysis, we discover that LLMs exhibit new working patterns when used for MMT. +First, instruction semantics can surprisingly be ignored when given in-context +exemplars. Second, cross-lingual exemplars can provide better task guidance for +low-resource translation than exemplars in the same language pairs. Third, LLMs +can acquire translation ability in a resource-efficient way and generate +moderate translation even on zero-resource languages. + +
+
+
+
+
+ + ♻ ☆ An Ensemble Approach to Question Classification: Integrating Electra + Transformer, GloVe, and LSTM + + +
+ Natural Language Processing (NLP) has emerged as a crucial technology for +understanding and generating human language, playing an essential role in tasks +such as machine translation, sentiment analysis, and more pertinently, question +classification. As a subfield within NLP, question classification focuses on +determining the type of information being sought, a fundamental step for +downstream applications like question answering systems. This study presents an +innovative ensemble approach for question classification, combining the +strengths of Electra, GloVe, and LSTM models. Rigorously tested on the +well-regarded TREC dataset, the model demonstrates how the integration of these +disparate technologies can lead to superior results. Electra brings in its +transformer-based capabilities for complex language understanding, GloVe offers +global vector representations for capturing word-level semantics, and LSTM +contributes its sequence learning abilities to model long-term dependencies. By +fusing these elements strategically, our ensemble model delivers a robust and +efficient solution for the complex task of question classification. Through +rigorous comparisons with well-known models like BERT, RoBERTa, and DistilBERT, +the ensemble approach verifies its effectiveness by attaining an 80% accuracy +score on the test dataset. + +
+
+
+
+
+ + ♻ ☆ MEGClass: Extremely Weakly Supervised Text Classification via + Mutually-Enhancing Text Granularities + + +
+ Text classification is essential for organizing unstructured text. +Traditional methods rely on human annotations or, more recently, a set of class +seed words for supervision, which can be costly, particularly for specialized +or emerging domains. To address this, using class surface names alone as +extremely weak supervision has been proposed. However, existing approaches +treat different levels of text granularity (documents, sentences, or words) +independently, disregarding inter-granularity class disagreements and the +context identifiable exclusively through joint extraction. In order to tackle +these issues, we introduce MEGClass, an extremely weakly-supervised text +classification method that leverages Mutually-Enhancing Text Granularities. +MEGClass utilizes coarse- and fine-grained context signals obtained by jointly +considering a document's most class-indicative words and sentences. This +approach enables the learning of a contextualized document representation that +captures the most discriminative class indicators. By preserving the +heterogeneity of potential classes, MEGClass can select the most informative +class-indicative documents as iterative feedback to enhance the initial +word-based class representations and ultimately fine-tune a pre-trained text +classifier. Extensive experiments on seven benchmark datasets demonstrate that +MEGClass outperforms other weakly and extremely weakly supervised methods. + +
+
+ comment: Code: https://github.com/pkargupta/MEGClass/ +
+
+
+
+
+ + ♻ ☆ MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop + Questions EMNLP 2023 + + +
+ The information stored in large language models (LLMs) falls out of date +quickly, and retraining from scratch is often not an option. This has recently +given rise to a range of techniques for injecting new facts through updating +model weights. Current evaluation paradigms are extremely limited, mainly +validating the recall of edited facts, but changing one fact should cause +rippling changes to the model's related beliefs. If we edit the UK Prime +Minister to now be Rishi Sunak, then we should get a different answer to Who is +married to the British Prime Minister? In this work, we present a benchmark, +MQuAKE (Multi-hop Question Answering for Knowledge Editing), comprising +multi-hop questions that assess whether edited models correctly answer +questions where the answer should change as an entailed consequence of edited +facts. While we find that current knowledge-editing approaches can recall +edited facts accurately, they fail catastrophically on the constructed +multi-hop questions. We thus propose a simple memory-based approach, MeLLo, +which stores all edited facts externally while prompting the language model +iteratively to generate answers that are consistent with the edited facts. +While MQuAKE remains challenging, we show that MeLLo scales well with LLMs (up +to 175B) and outperforms previous model editors by a large margin. + +
+
+ comment: EMNLP 2023. Our code and datasets are available at + https://github.com/princeton-nlp/MQuAKE +
+
+
+
+
+ + ♻ ☆ Injecting structural hints: Using language models to study inductive + biases in language learning EMNLP 2023 + + +
+ Both humans and large language models are able to learn language without +explicit structural supervision. What inductive biases make this learning +possible? We address this fundamental cognitive question by leveraging +transformer language models: we inject inductive bias into language models by +pretraining on formally-structured data, and then evaluate the biased learners' +ability to learn typologically-diverse natural languages. Our experimental +setup creates a testbed for hypotheses about inductive bias in human language +learning. We investigate the effect of injecting models with three types of +inductive bias: 1) recursive, hierarchical processing, 2) crossing token-token +relationships that can't be modeled by context-free grammars, and 3) a Zipfian +power-law vocabulary distribution. We show that non-context-free relationships +form the best inductive biases. Our study leverages the capabilities of +transformer models to run controlled language learning experiments that are not +possible to run on humans, and surfaces hypotheses about the structures that +facilitate language learning in both humans and machines. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Language Agents for Detecting Implicit Stereotypes in Text-to-image + Models at Scale + + +
+ The recent surge in the research of diffusion models has accelerated the +adoption of text-to-image models in various Artificial Intelligence Generated +Content (AIGC) commercial products. While these exceptional AIGC products are +gaining increasing recognition and sparking enthusiasm among consumers, the +questions regarding whether, when, and how these models might unintentionally +reinforce existing societal stereotypes remain largely unaddressed. Motivated +by recent advancements in language agents, here we introduce a novel agent +architecture tailored for stereotype detection in text-to-image models. This +versatile agent architecture is capable of accommodating free-form detection +tasks and can autonomously invoke various tools to facilitate the entire +process, from generating corresponding instructions and images, to detecting +stereotypes. We build the stereotype-relevant benchmark based on multiple +open-text datasets, and apply this architecture to commercial products and +popular open source text-to-image models. We find that these models often +display serious stereotypes when it comes to certain prompts about personal +characteristics, social cultural context and crime-related aspects. In summary, +these empirical findings underscore the pervasive existence of stereotypes +across social dimensions, including gender, race, and religion, which not only +validate the effectiveness of our proposed approach, but also emphasize the +critical necessity of addressing potential ethical risks in the burgeoning +realm of AIGC. As AIGC continues its rapid expansion trajectory, with new +models and plugins emerging daily in staggering numbers, the challenge lies in +the timely detection and mitigation of potential biases within these models. + +
+
+
+
+
+ + ♻ ☆ Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task + Instruction Tuning + + +
+ Recent advancements enlarge the capabilities of large language models (LLMs) +in zero-shot image-to-text generation and understanding by integrating +multi-modal inputs. However, such success is typically limited to English +scenarios due to the lack of large-scale and high-quality non-English +multi-modal resources, making it extremely difficult to establish competitive +counterparts in other languages. In this paper, we introduce the Ziya-Visual +series, a set of bilingual large-scale vision-language models (LVLMs) designed +to incorporate visual semantics into LLM for multi-modal dialogue. Composed of +Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying +Transformer from BLIP-2, further exploring the assistance of optimization +schemes such as instruction tuning, multi-stage training and low-rank +adaptation module for visual-language alignment. In addition, we stimulate the +understanding ability of GPT-4 in multi-modal scenarios, translating our +gathered English image-text datasets into Chinese and generating +instruction-response through the in-context learning method. The experiment +results demonstrate that compared to the existing LVLMs, Ziya-Visual achieves +competitive performance across a wide range of English-only tasks including +zero-shot image-text retrieval, image captioning, and visual question +answering. The evaluation leaderboard accessed by GPT-4 also indicates that our +models possess satisfactory image-text understanding and generation +capabilities in Chinese multi-modal scenario dialogues. Code, demo and models +are available at +https://huggingface.co/IDEA-CCNL/Ziya-BLIP2-14B-Visual-v1. + +
+
+
+
+
+ + ♻ ☆ A Read-and-Select Framework for Zero-shot Entity Linking EMNLP 2023 + + +
+ Zero-shot entity linking (EL) aims at aligning entity mentions to unseen +entities to challenge the generalization ability. Previous methods largely +focus on the candidate retrieval stage and ignore the essential candidate +ranking stage, which disambiguates among entities and makes the final linking +prediction. In this paper, we propose a read-and-select (ReS) framework by +modeling the main components of entity disambiguation, i.e., mention-entity +matching and cross-entity comparison. First, for each candidate, the reading +module leverages mention context to output mention-aware entity +representations, enabling mention-entity matching. Then, in the selecting +module, we frame the choice of candidates as a sequence labeling problem, and +all candidate representations are fused together to enable cross-entity +comparison. Our method achieves the state-of-the-art performance on the +established zero-shot EL dataset ZESHEL with a 2.55% micro-average accuracy +gain, with no need for laborious multi-phase pre-training used in most of the +previous work, showing the effectiveness of both mention-entity and +cross-entity interaction. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ BeaverTails: Towards Improved Safety Alignment of LLM via a + Human-Preference Dataset NeurIPS + + +
+ In this paper, we introduce the BeaverTails dataset, aimed at +fostering research on safety alignment in large language models (LLMs). This +dataset uniquely separates annotations of helpfulness and harmlessness for +question-answering pairs, thus offering distinct perspectives on these crucial +attributes. In total, we have gathered safety meta-labels for 30,207 +question-answer (QA) pairs and 30,144 pairs of expert comparison data for both +the helpfulness and harmlessness metrics. In total, we have gathered safety +meta-labels for 333,963 question-answer (QA) pairs and 361,903 pairs of expert +comparison data for both the helpfulness and harmlessness metrics. We further +showcase applications of BeaverTails in content moderation and reinforcement +learning with human feedback (RLHF), emphasizing its potential for practical +safety measures in LLMs. We believe this dataset provides vital resources for +the community, contributing towards the safe development and deployment of +LLMs. Our project page is available at the following URL: +https://sites.google.com/view/pku-beavertails. + Warning: this paper contains example data that may be offensive or harmful. + +
+
+ comment: NeurIPS Datasets and Benchmarks 2023 +
+
+
+
+
+ + ♻ ☆ Language Models with Rationality + + +
+ While large language models (LLMs) are proficient at question-answering (QA), +it is not always clear how (or even if) an answer follows from their latent +"beliefs". This lack of interpretability is a growing impediment to widespread +use of LLMs. To address this, our goals are to make model beliefs and their +inferential relationships explicit, and to resolve inconsistencies that may +exist, so that answers are supported by interpretable chains of reasoning drawn +from a consistent network of beliefs. Our approach, which we call REFLEX, is to +add a rational, self-reflecting layer on top of the LLM. First, given a +question, we construct a belief graph using a backward-chaining process to +materialize relevant model beliefs (including beliefs about answer candidates) +and their inferential relationships. Second, we identify and minimize +contradictions in that graph using a formal constraint reasoner. We find that +REFLEX significantly improves consistency (by 8%-11% absolute) without harming +overall answer accuracy, resulting in answers supported by faithful chains of +reasoning drawn from a more consistent belief system. This suggests a new style +of system architecture in which an LLM extended with a rational layer can +provide an interpretable window into system beliefs, add a systematic reasoning +capability, and repair latent inconsistencies present in the LLM. + +
+
+
+
+
+ + ♻ ☆ Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark WSDM 2023 + + +
+ Modern Entity Linking (EL) systems entrench a popularity bias, yet there is +no dataset focusing on tail and emerging entities in languages other than +English. We present Hansel, a new benchmark in Chinese that fills the vacancy +of non-English few-shot and zero-shot EL challenges. The test set of Hansel is +human annotated and reviewed, created with a novel method for collecting +zero-shot EL datasets. It covers 10K diverse documents in news, social media +posts and other web articles, with Wikidata as its target Knowledge Base. We +demonstrate that the existing state-of-the-art EL system performs poorly on +Hansel (R@1 of 36.6% on Few-Shot). We then establish a strong baseline that +scores a R@1 of 46.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We +also show that our baseline achieves competitive results on TAC-KBP2015 Chinese +Entity Linking task. + +
+
+ comment: WSDM 2023 +
+
+
+
+
+ + ♻ ☆ Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and + the Case of Information Extraction EMNLP 2023 + + +
+ Large language models (LLMs) have great potential for synthetic data +generation. This work shows that useful data can be synthetically generated +even for tasks that cannot be solved directly by LLMs: for problems with +structured outputs, it is possible to prompt an LLM to perform the task in the +reverse direction, by generating plausible input text for a target output +structure. Leveraging this asymmetry in task difficulty makes it possible to +produce large-scale, high-quality data for complex tasks. We demonstrate the +effectiveness of this approach on closed information extraction, where +collecting ground-truth data is challenging, and no satisfactory dataset exists +to date. We synthetically generate a dataset of 1.8M data points, establish its +superior quality compared to existing datasets in a human evaluation, and use +it to finetune small models (220M and 770M parameters), termed SynthIE, that +outperform the prior state of the art (with equal model size) by a substantial +margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, +and models are available at https://github.com/epfl-dlab/SynthIE. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency + Model ACM MM 2023 + + +
+ Denoising diffusion probabilistic models (DDPMs) have shown promising +performance for speech synthesis. However, a large number of iterative steps +are required to achieve high sample quality, which restricts the inference +speed. Maintaining sample quality while increasing sampling speed has become a +challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based +"Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a +single diffusion sampling step while achieving high audio quality. The +consistency constraint is applied to distill a consistency model from a +well-designed diffusion-based teacher model, which ultimately yields superior +performance in the distilled CoMoSpeech. Our experiments show that by +generating audio recordings by a single sampling step, the CoMoSpeech achieves +an inference speed more than 150 times faster than real-time on a single NVIDIA +A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based +speech synthesis truly practical. Meanwhile, objective and subjective +evaluations on text-to-speech and singing voice synthesis show that the +proposed teacher models yield the best audio quality, and the one-step sampling +based CoMoSpeech achieves the best inference speed with better or comparable +audio quality to other conventional multi-step diffusion model baselines. Audio +samples are available at https://comospeech.github.io/. + +
+
+ comment: Accepted to ACM MM 2023 +
+
+
+
+
+ + ♻ ☆ Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management NeurIPS 2023 + + +
+ Reinforcement learning (RL) has shown great promise for developing dialogue +management (DM) agents that are non-myopic, conduct rich conversations, and +maximize overall user satisfaction. Despite recent developments in RL and +language models (LMs), using RL to power conversational chatbots remains +challenging, in part because RL requires online exploration to learn +effectively, whereas collecting novel human-bot interactions can be expensive +and unsafe. This issue is exacerbated by the combinatorial action spaces facing +these algorithms, as most LM agents generate responses at the word level. We +develop a variety of RL algorithms, specialized to dialogue planning, that +leverage recent Mixture-of-Expert Language Models (MoE-LMs) -- models that +capture diverse semantics, generate utterances reflecting different intents, +and are amenable for multi-turn DM. By exploiting MoE-LM structure, our methods +significantly reduce the size of the action space and improve the efficacy of +RL-based DM. We evaluate our methods in open-domain dialogue to demonstrate +their effectiveness w.r.t. the diversity of intent in generated utterances and +overall DM performance. + +
+
+ comment: Thirty-seventh Conference on Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2023 + + +
+ Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented +performance in response generation, especially with visual inputs, enabling +more creative and adaptable interaction than large language models such as +ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since +adversaries may successfully evade the entire system by subtly manipulating the +most vulnerable modality (e.g., vision). To this end, we propose evaluating the +robustness of open-source large VLMs in the most realistic and high-risk +setting, where adversaries have only black-box system access and seek to +deceive the model into returning the targeted responses. In particular, we +first craft targeted adversarial examples against pretrained models such as +CLIP and BLIP, and then transfer these adversarial examples to other VLMs such +as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we +observe that black-box queries on these VLMs can further improve the +effectiveness of targeted evasion, resulting in a surprisingly high success +rate for generating targeted responses. Our findings provide a quantitative +understanding regarding the adversarial vulnerability of large VLMs and call +for a more thorough examination of their potential security flaws before +deployment in practice. Code is at https://github.com/yunqing-me/AttackVLM. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Turning Flowchart into Dialog: Augmenting Flowchart-grounded + Troubleshooting Dialogs via Synthetic Data Generation ALT + + +
+ Flowchart-grounded troubleshooting dialogue (FTD) systems, which follow the +instructions of a flowchart to diagnose users' problems in specific domains +(e.g., vehicle, laptop), have been gaining research interest in recent years. +However, collecting sufficient dialogues that are naturally grounded on +flowcharts is costly, thus FTD systems are impeded by scarce training data. To +mitigate the data sparsity issue, we propose a plan-based synthetic data +generation (PlanSDG) approach that generates diverse synthetic dialog data at +scale by transforming concise flowcharts into dialogues. Specifically, its +generative model employs a variational-based framework with a hierarchical +planning strategy that includes global and local latent planning variables. +Experiments on the FloDial dataset show that synthetic dialogue produced by +PlanSDG improves the performance of downstream tasks, including flowchart path +retrieval and response generation, in particular on the Out-of-Flowchart +settings. In addition, further analysis demonstrates the quality of synthetic +data generated by PlanSDG in paths that are covered by current sample dialogues +and paths that are not covered. + +
+
+ comment: Accepted by ALTA 2023 +
+
+
+
+
+ + ♻ ☆ CAPSTONE: Curriculum Sampling for Dense Retrieval with Document + Expansion EMNLP 2023 + + +
+ The dual-encoder has become the de facto architecture for dense retrieval. +Typically, it computes the latent representations of the query and document +independently, thus failing to fully capture the interactions between the query +and document. To alleviate this, recent research has focused on obtaining +query-informed document representations. During training, it expands the +document with a real query, but during inference, it replaces the real query +with a generated one. This inconsistency between training and inference causes +the dense retrieval model to prioritize query information while disregarding +the document when computing the document representation. Consequently, it +performs even worse than the vanilla dense retrieval model because its +performance heavily relies on the relevance between the generated queries and +the real query. In this paper, we propose a curriculum sampling strategy that +utilizes pseudo queries during training and progressively enhances the +relevance between the generated query and the real query. By doing so, the +retrieval model learns to extend its attention from the document alone to both +the document and query, resulting in high-quality query-informed document +representations. Experimental results on both in-domain and out-of-domain +datasets demonstrate that our approach outperforms previous dense retrieval +models. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Keyword Augmented Retrieval: Novel framework for Information Retrieval + integrated with speech interface + + +
+ Retrieving answers in a quick and low cost manner without hallucinations from +a combination of structured and unstructured data using Language models is a +major hurdle. This is what prevents employment of Language models in knowledge +retrieval automation. This becomes accentuated when one wants to integrate a +speech interface on top of a text based knowledge retrieval system. Besides, +for commercial search and chat-bot applications, complete reliance on +commercial large language models (LLMs) like GPT 3.5 etc. can be very costly. +In the present study, the authors have addressed the aforementioned problem by +first developing a keyword based search framework which augments discovery of +the context from the document to be provided to the LLM. The keywords in turn +are generated by a relatively smaller LLM and cached for comparison with +keywords generated by the same smaller LLM against the query raised. This +significantly reduces time and cost to find the context within documents. Once +the context is set, a larger LLM uses that to provide answers based on a prompt +tailored for Q&A. This research work demonstrates that use of keywords in +context identification reduces the overall inference time and cost of +information retrieval. Given this reduction in inference time and cost with the +keyword augmented retrieval framework, a speech based interface for user input +and response readout was integrated. This allowed a seamless interaction with +the language model. + +
+
+
+
+
+ + ♻ ☆ A Survey of Knowledge Enhanced Pre-trained Models + + +
+ Pre-trained language models learn informative word representations on a +large-scale text corpus through self-supervised learning, which has achieved +promising performance in fields of natural language processing (NLP) after +fine-tuning. These models, however, suffer from poor robustness and lack of +interpretability. We refer to pre-trained language models with knowledge +injection as knowledge-enhanced pre-trained language models (KEPLMs). These +models demonstrate deep understanding and logical reasoning and introduce +interpretability. In this survey, we provide a comprehensive overview of KEPLMs +in NLP. We first discuss the advancements in pre-trained language models and +knowledge representation learning. Then we systematically categorize existing +KEPLMs from three different perspectives. Finally, we outline some potential +directions of KEPLMs for future research. + +
+
+ comment: 32 pages, 15 figures +
+
+
+
+
+ + ♻ ☆ Personality Understanding of Fictional Characters during Book Reading ACL 2023 + + +
+ Comprehending characters' personalities is a crucial aspect of story reading. +As readers engage with a story, their understanding of a character evolves +based on new events and information; and multiple fine-grained aspects of +personalities can be perceived. This leads to a natural problem of situated and +fine-grained personality understanding. The problem has not been studied in the +NLP field, primarily due to the lack of appropriate datasets mimicking the +process of book reading. We present the first labeled dataset PersoNet for this +problem. Our novel annotation strategy involves annotating user notes from +online reading apps as a proxy for the original books. Experiments and human +studies indicate that our dataset construction is both efficient and accurate; +and our task heavily relies on long-term context to achieve accurate +predictions for both machines and humans. The dataset is available at +https://github.com/Gorov/personet_acl23. + +
+
+ comment: Accepted at ACL 2023 +
+
+
+
+
+ + ♻ ☆ Federated Learning of Large Language Models with Parameter-Efficient + Prompt Tuning and Adaptive Optimization EMNLP 2023 + + +
+ Federated learning (FL) is a promising paradigm to enable collaborative model +training with decentralized data. However, the training process of Large +Language Models (LLMs) generally incurs the update of significant parameters, +which limits the applicability of FL techniques to tackle the LLMs in real +scenarios. Prompt tuning can significantly reduce the number of parameters to +update, but it either incurs performance degradation or low training +efficiency. The straightforward utilization of prompt tuning in the FL often +raises non-trivial communication costs and dramatically degrades performance. +In addition, the decentralized data is generally non-Independent and +Identically Distributed (non-IID), which brings client drift problems and thus +poor performance. This paper proposes a Parameter-efficient prompt Tuning +approach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and +effective FL of LLMs. First, an efficient partial prompt tuning approach is +proposed to improve performance and efficiency simultaneously. Second, a novel +adaptive optimization method is developed to address the client drift problems +on both the device and server sides to enhance performance further. Extensive +experiments based on 10 datasets demonstrate the superb performance (up to +60.8% in terms of accuracy) and efficiency (up to 97.59% in terms of training +time) of FedPepTAO compared with 9 baseline approaches. Our code is available +at https://github.com/llm-eff/FedPepTAO. + +
+
+ comment: 18 pages, accepted by EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ The Cambridge Law Corpus: A Corpus for Legal AI Research + + +
+ We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. +It consists of over 250 000 court cases from the UK. Most cases are from the +21st century, but the corpus includes cases as old as the 16th century. This +paper presents the first release of the corpus, containing the raw text and +meta-data. Together with the corpus, we provide annotations on case outcomes +for 638 cases, done by legal experts. Using our annotated data, we have trained +and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to +provide benchmarks. We include an extensive legal and ethical discussion to +address the potentially sensitive nature of this material. As a consequence, +the corpus will only be released for research purposes under certain +restrictions. + +
+
+
+
+
+ + ♻ ☆ Evaluating and Inducing Personality in Pre-trained Language Models NeurIPS 2023 + + +
+ Standardized and quantified evaluation of machine behaviors is a crux of +understanding LLMs. In this study, we draw inspiration from psychometric +studies by leveraging human personality theory as a tool for studying machine +behaviors. Originating as a philosophical quest for human behaviors, the study +of personality delves into how individuals differ in thinking, feeling, and +behaving. Toward building and understanding human-like social machines, we are +motivated to ask: Can we assess machine behaviors by leveraging human +psychometric tests in a principled and quantitative manner? If so, can we +induce a specific personality in LLMs? To answer these questions, we introduce +the Machine Personality Inventory (MPI) tool for studying machine behaviors; +MPI follows standardized personality tests, built upon the Big Five Personality +Factors (Big Five) theory and personality assessment inventories. By +systematically evaluating LLMs with MPI, we provide the first piece of evidence +demonstrating the efficacy of MPI in studying LLMs behaviors. We further devise +a Personality Prompting (P^2) method to induce LLMs with specific personalities +in a controllable way, capable of producing diverse and verifiable behaviors. +We hope this work sheds light on future studies by adopting personality as the +essential indicator for various downstream tasks, and could further motivate +research into equally intriguing human-like machine behaviors. + +
+
+ comment: Accepted at NeurIPS 2023 (Spotlight) +
+
+
+
+
+ + ♻ ☆ MindLLM: Pre-training Lightweight Large Language Model from Scratch, + Evaluations and Domain Applications + + +
+ Large Language Models (LLMs) have demonstrated remarkable performance across +various natural language tasks, marking significant strides towards general +artificial intelligence. While general artificial intelligence is leveraged by +developing increasingly large-scale models, there could be another branch to +develop lightweight custom models that better serve certain domains, taking +into account the high cost of training and deploying LLMs and the scarcity of +resources. In this paper, we present MindLLM, a novel series of bilingual +lightweight large language models, trained from scratch, alleviating such +burdens by offering models with 1.3 billion and 3 billion parameters. A +thorough account of experiences accrued during large model development is +given, covering every step of the process, including data construction, model +architecture, evaluation, and applications. Such insights are hopefully +valuable for fellow academics and developers. MindLLM consistently matches or +surpasses the performance of other open-source larger models on some public +benchmarks. We also introduce an innovative instruction tuning framework +tailored for smaller models to enhance their capabilities efficiently. +Moreover, we explore the application of MindLLM in specific vertical domains +such as law and finance, underscoring the agility and adaptability of our +lightweight models. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ♻ ☆ Geodesic Multi-Modal Mixup for Robust Fine-Tuning NeurIPS 2023 + + +
+ Pre-trained multi-modal models, such as CLIP, provide transferable embeddings +and show promising results in diverse applications. However, the analysis of +learned multi-modal embeddings is relatively unexplored, and the embedding +transferability can be improved. In this work, we observe that CLIP holds +separated embedding subspaces for two different modalities, and then we +investigate it through the lens of uniformity-alignment to measure the quality +of learned representation. Both theoretically and empirically, we show that +CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack +of alignment and uniformity might restrict the transferability and robustness +of embeddings. To this end, we devise a new fine-tuning method for robust +representation equipping better alignment and uniformity. First, we propose a +Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to +generate hard negative samples on the hypersphere. Then, we fine-tune the model +on hard negatives as well as original negatives and positives with contrastive +loss. Based on the theoretical analysis about hardness guarantee and limiting +behavior, we justify the use of our method. Extensive experiments on retrieval, +calibration, few- or zero-shot classification (under distribution shift), +embedding arithmetic, and image captioning further show that our method +provides transferable representations, enabling robust model adaptation on +diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup + +
+
+ comment: To appear at NeurIPS 2023 +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 83 + +
+
+
+ + ☆ 3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets ICCV 2023 + + +
+ We present 3DMiner -- a pipeline for mining 3D shapes from challenging +large-scale unannotated image datasets. Unlike other unsupervised 3D +reconstruction methods, we assume that, within a large-enough dataset, there +must exist images of objects with similar shapes but varying backgrounds, +textures, and viewpoints. Our approach leverages the recent advances in +learning self-supervised image representations to cluster images with +geometrically similar shapes and find common image correspondences between +them. We then exploit these correspondences to obtain rough camera estimates as +initialization for bundle-adjustment. Finally, for every image cluster, we +apply a progressive bundle-adjusting reconstruction method to learn a neural +occupancy field representing the underlying shape. We show that this procedure +is robust to several types of errors introduced in previous steps (e.g., wrong +camera poses, images containing dissimilar shapes, etc.), allowing us to obtain +shape and pose annotations for images in-the-wild. When using images from Pix3D +chairs, our method is capable of producing significantly better results than +state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively +and qualitatively. Furthermore, we show how 3DMiner can be applied to +in-the-wild data by reconstructing shapes present in images from the LAION-5B +dataset. Project Page: https://ttchengab.github.io/3dminerOfficial + +
+
+ comment: In ICCV 2023 +
+
+
+
+
+ + ☆ Fast Trainable Projection for Robust Fine-Tuning NeurIPS 2023 + + +
+ Robust fine-tuning aims to achieve competitive in-distribution (ID) +performance while maintaining the out-of-distribution (OOD) robustness of a +pre-trained model when transferring it to a downstream task. Recently, +projected gradient descent has been successfully used in robust fine-tuning by +constraining the deviation from the initialization of the fine-tuned model +explicitly through projection. However, algorithmically, two limitations +prevent this method from being adopted more widely: scalability and efficiency. +In this paper, we propose a new projection-based fine-tuning algorithm, Fast +Trainable Projection (FTP) for computationally efficient learning of per-layer +projection constraints, resulting in an average 35% speedup on our +benchmarks compared to prior works. FTP can be combined with existing +optimizers such as AdamW, and be used in a plug-and-play fashion. Finally, we +show that FTP is a special instance of hyper-optimizers that tune the +hyper-parameters of optimizers in a learnable manner through nested +differentiation. Empirically, we show superior robustness on OOD datasets, +including domain shifts and natural corruptions, across four different vision +tasks with five different pre-trained models. Additionally, we demonstrate that +FTP is broadly applicable and beneficial to other learning scenarios such as +low-label and continual learning settings thanks to its easy adaptability. The +code will be available at https://github.com/GT-RIPL/FTP.git. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music + Generation + + +
+ With rapid advances in generative artificial intelligence, the text-to-music +synthesis task has emerged as a promising direction for music generation from +scratch. However, finer-grained control over multi-track generation remains an +open challenge. Existing models exhibit strong raw generation capability but +lack the flexibility to compose separate tracks and combine them in a +controllable manner, differing from typical workflows of human composers. To +address this issue, we propose JEN-1 Composer, a unified framework to +efficiently model marginal, conditional, and joint distributions over +multi-track music via a single model. JEN-1 Composer framework exhibits the +capacity to seamlessly incorporate any diffusion-based music generation system, +e.g., Jen-1, enhancing its capacity for versatile multi-track music +generation. We introduce a curriculum training strategy aimed at incrementally +instructing the model in the transition from single-track generation to the +flexible generation of multi-track combinations. During the inference, users +have the ability to iteratively produce and choose music tracks that meet their +preferences, subsequently creating an entire musical composition incrementally +following the proposed Human-AI co-composition workflow. Quantitative and +qualitative assessments demonstrate state-of-the-art performance in +controllable and high-fidelity multi-track music synthesis. The proposed JEN-1 +Composer represents a significant advance toward interactive AI-facilitated +music creation and composition. Demos will be available at +https://jenmusic.ai/audio-demos. + +
+
+ comment: Preprints +
+
+
+
+
+ + ☆ BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species + Classification and Mapping WACV 2024 + + +
+ We propose a metadata-aware self-supervised learning (SSL) framework useful +for fine-grained classification and ecological mapping of bird species around +the world. Our framework unifies two SSL strategies: Contrastive Learning (CL) +and Masked Image Modeling (MIM), while also enriching the embedding space with +metadata available with ground-level imagery of birds. We separately train +uni-modal and cross-modal ViT on a novel cross-view global bird species dataset +containing ground-level imagery, metadata (location, time), and corresponding +satellite imagery. We demonstrate that our models learn fine-grained and +geographically conditioned features of birds, by evaluating on two downstream +tasks: fine-grained visual classification (FGVC) and cross-modal retrieval. +Pre-trained models learned using our framework achieve SotA performance on FGVC +of iNAT-2021 birds and in transfer learning settings for CUB-200-2011 and +NABirds datasets. Moreover, the impressive cross-modal retrieval performance of +our model enables the creation of species distribution maps across any +geographic region. The dataset and source code will be released at +https://github.com/mvrl/BirdSAT. + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ☆ Learning to Follow Object-Centric Image Editing Instructions Faithfully EMNLP 2023 + + +
+ Natural language instructions are a powerful interface for editing the +outputs of text-to-image diffusion models. However, several challenges need to +be addressed: 1) underspecification (the need to model the implicit meaning of +instructions) 2) grounding (the need to localize where the edit has to be +performed), 3) faithfulness (the need to preserve the elements of the image not +affected by the edit instruction). Current approaches focusing on image editing +with natural language instructions rely on automatically generated paired data, +which, as shown in our investigation, is noisy and sometimes nonsensical, +exacerbating the above issues. Building on recent advances in segmentation, +Chain-of-Thought prompting, and visual question answering, we significantly +improve the quality of the paired data. In addition, we enhance the supervision +signal by highlighting parts of the image that need to be changed by the +instruction. The model fine-tuned on the improved data is capable of performing +fine-grained object-centric edits better than state-of-the-art baselines, +mitigating the problems outlined above, as shown by automatic and human +evaluations. Moreover, our model is capable of generalizing to domains unseen +during training, such as visual metaphors. + +
+
+ comment: Findings of EMNLP 2023 (Long paper) +
+
+
+
+
+ + ☆ Women Wearing Lipstick: Measuring the Bias Between an Object and Its + Related Gender EMNLP + + +
+ In this paper, we investigate the impact of objects on gender bias in image +captioning systems. Our results show that only gender-specific objects have a +strong gender bias (e.g., women-lipstick). In addition, we propose a visual +semantic-based gender score that measures the degree of bias and can be used as +a plug-in for any image captioning system. Our experiments demonstrate the +utility of the gender score, since we observe that our score can measure the +bias relation between a caption and its related gender; therefore, our score +can be used as an additional metric to the existing Object Gender Co-Occ +approach. Code and data are publicly available at +https://github.com/ahmedssabir/GenderScore. + +
+
+ comment: EMNLP Findings 2023 +
+
+
+
+
+ + ☆ Out-of-distribution Object Detection through Bayesian Uncertainty + Estimation + + +
+ The superior performance of object detectors is often established under the +condition that the test samples are in the same distribution as the training +data. However, in many practical applications, out-of-distribution (OOD) +instances are inevitable and usually lead to uncertainty in the results. In +this paper, we propose a novel, intuitive, and scalable probabilistic object +detection method for OOD detection. Unlike other uncertainty-modeling methods +that either require huge computational costs to infer the weight distributions +or rely on model training through synthetic outlier data, our method is able to +distinguish between in-distribution (ID) data and OOD data via weight parameter +sampling from proposed Gaussian distributions based on pre-trained networks. We +demonstrate that our Bayesian object detector can achieve satisfactory OOD +identification performance by reducing the FPR95 score by up to 8.19% and +increasing the AUROC score by up to 13.94% when trained on BDD100k and VOC +datasets as the ID datasets and evaluated on COCO2017 dataset as the OOD +dataset. + +
+
+
+
+
+ + ☆ Dynamic V2X Autonomous Perception from Road-to-Vehicle Vision + + +
+ Vehicle-to-everything (V2X) perception is an innovative technology that +enhances vehicle perception accuracy, thereby elevating the security and +reliability of autonomous systems. However, existing V2X perception methods +focus on static scenes from mainly vehicle-based vision, which is constrained +by sensor capabilities and communication loads. To adapt V2X perception models +to dynamic scenes, we propose to build V2X perception from road-to-vehicle +vision and present the Adaptive Road-to-Vehicle Perception (AR2VP) method. In +AR2VP, we leverage roadside units to offer stable, wide-range sensing +capabilities and serve as communication hubs. AR2VP is devised to tackle both +intra-scene and inter-scene changes. For the former, we construct a dynamic +perception representing module, which efficiently integrates vehicle +perceptions, enabling vehicles to capture a more comprehensive range of dynamic +factors within the scene. Moreover, we introduce a road-to-vehicle perception +compensating module, aimed at preserving the maximized roadside unit perception +information in the presence of intra-scene changes. For inter-scene changes, we +implement an experience replay mechanism leveraging the roadside unit's storage +capacity to retain a subset of historical scene data, maintaining model +robustness in response to inter-scene shifts. We conduct perception experiments +on 3D object detection and segmentation, and the results show that AR2VP excels +in both performance-bandwidth trade-offs and adaptability within dynamic +environments. + +
+
+
+
+
+ + ☆ Efficient IoT Inference via Context-Awareness + + +
+ While existing strategies for optimizing deep learning-based classification +models on low-power platforms assume the models are trained on all classes of +interest, this paper posits that adopting context-awareness i.e. focusing +solely on the likely classes in the current context, can substantially enhance +performance in resource-constrained environments. We propose a new paradigm, +CACTUS, for scalable and efficient context-aware classification where a +micro-classifier recognizes a small set of classes relevant to the current +context and, when context change happens, rapidly switches to another suitable +micro-classifier. CACTUS has several innovations including optimizing the +training cost of context-aware classifiers, enabling on-the-fly context-aware +switching between classifiers, and selecting the best context-aware classifiers +given limited resources. We show that CACTUS achieves significant benefits in +accuracy, latency, and compute budget across a range of datasets and IoT +platforms. + +
+
+ comment: 12 pages, 10 figures +
+
+
+
+
+ + ☆ Dynamic Task and Weight Prioritization Curriculum Learning for + Multimodal Imagery + + +
+ This paper explores post-disaster analytics using multimodal deep learning +models trained with a curriculum learning method. Studying post-disaster +analytics is important as it plays a crucial role in mitigating the impact of +disasters by providing timely and accurate insights into the extent of damage +and the allocation of resources. We propose a curriculum learning strategy to +enhance the performance of multimodal deep learning models. Curriculum learning +emulates the progressive learning sequence in human education by training deep +learning models on increasingly complex data. Our primary objective is to +develop a curriculum-trained multimodal deep learning model, with a particular +focus on visual question answering (VQA) capable of jointly processing image +and text data, in conjunction with semantic segmentation for disaster analytics +using the FloodNet dataset +(https://github.com/BinaLab/FloodNet-Challenge-EARTHVISION2021). +To achieve this, a U-Net model is used for semantic segmentation and +image encoding. A custom-built text classifier is used for visual question +answering. Existing curriculum learning methods rely on manually defined +difficulty functions. We introduce a novel curriculum learning approach termed +Dynamic Task and Weight Prioritization (DATWEP), which leverages a +gradient-based method to automatically decide task difficulty during curriculum +learning training, thereby eliminating the need for explicit difficulty +computation. The integration of DATWEP into our multimodal model shows +improvement on VQA performance. Source code is available at +https://github.com/fualsan/DATWEP. + +
+
+
+
+
+ + ☆ Reward Finetuning for Faster and More Accurate Unsupervised Object + Discovery + + +
+ Recent advances in machine learning have shown that Reinforcement Learning +from Human Feedback (RLHF) can improve machine learning models and align them +with human preferences. Although very successful for Large Language Models +(LLMs), these advancements have not had a comparable impact in research for +autonomous vehicles -- where alignment with human expectations can be +imperative. In this paper, we propose to adapt similar RL-based methods to +unsupervised object discovery, i.e., learning to detect objects from LiDAR +points without any training labels. Instead of labels, we use simple heuristics +to mimic human feedback. More explicitly, we combine multiple heuristics into a +simple reward function that positively correlates its score with bounding box +accuracy, i.e., boxes containing objects are scored higher than those without. +We start from the detector's own predictions to explore the space and reinforce +boxes with high rewards through gradient updates. Empirically, we demonstrate +that our approach is not only more accurate, but also orders of magnitude +faster to train compared to prior works on object discovery. + +
+
+
+
+
+ + ☆ Bespoke Solvers for Generative Flow Models + + +
+ Diffusion or flow-based models are powerful generative paradigms that are +notoriously hard to sample as samples are defined as solutions to +high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs) +which require a large Number of Function Evaluations (NFE) to approximate well. +Existing methods to alleviate the costly sampling process include model +distillation and designing dedicated ODE solvers. However, distillation is +costly to train and sometimes can deteriorate quality, while dedicated solvers +still require relatively large NFE to produce high quality samples. In this +paper we introduce "Bespoke solvers", a novel framework for constructing custom +ODE solvers tailored to the ODE of a given pre-trained flow model. Our approach +optimizes an order consistent and parameter-efficient solver (e.g., with 80 +learnable parameters), is trained for roughly 1% of the GPU time required for +training the pre-trained model, and significantly improves approximation and +generation quality compared to dedicated solvers. For example, a Bespoke solver +for a CIFAR10 model produces samples with Fréchet Inception Distance (FID) of +2.73 with 10 NFE, and gets to 1% of the Ground Truth (GT) FID (2.59) for this +model with only 20 NFE. On the more challenging ImageNet-64$\times$64, Bespoke +samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 +NFE. + +
+
+
+
+
+ + ☆ Myriad: Large Multimodal Model by Applying Vision Experts for Industrial + Anomaly Detection + + +
+ Existing industrial anomaly detection (IAD) methods predict anomaly scores +for both anomaly detection and localization. However, they struggle to perform +a multi-turn dialog and detailed descriptions for anomaly regions, e.g., color, +shape, and categories of industrial anomalies. Recently, large multimodal +(i.e., vision and language) models (LMMs) have shown eminent perception +abilities on multiple vision tasks such as image captioning, visual +understanding, visual reasoning, etc., making them a competitive choice +for more comprehensible anomaly detection. However, the knowledge about anomaly +detection is absent in existing general LMMs, while training a specific LMM for +anomaly detection requires a tremendous amount of annotated data and massive +computation resources. In this paper, we propose a novel large multi-modal +model by applying vision experts for industrial anomaly detection (dubbed +Myriad), which leads to definite anomaly detection and high-quality anomaly +description. Specifically, we adopt MiniGPT-4 as the base LMM and design an +Expert Perception module to embed the prior knowledge from vision experts as +tokens which are intelligible to Large Language Models (LLMs). To compensate +for the errors and confusions of vision experts, we introduce a domain adapter +to bridge the visual representation gaps between generic and industrial images. +Furthermore, we propose a Vision Expert Instructor, which enables the Q-Former +to generate IAD domain vision-language tokens according to vision expert prior. +Extensive experiments on MVTec-AD and VisA benchmarks demonstrate that our +proposed method not only performs favorably against state-of-the-art methods +under the 1-class and few-shot settings, but also provides definite anomaly +prediction along with detailed descriptions in IAD domain. + +
+
+ comment: 8 pages, 7 figures +
+
+
+
+
+ + ☆ Multimodal ChatGPT for Medical Applications: an Experimental Study of + GPT-4V + + +
+ In this paper, we critically evaluate the capabilities of the +state-of-the-art multimodal large language model, i.e., GPT-4 with Vision +(GPT-4V), on the Visual Question Answering (VQA) task. Our experiments thoroughly +assess GPT-4V's proficiency in answering questions paired with images using +both pathology and radiology datasets from 11 modalities (e.g., Microscopy, +Dermoscopy, X-ray, CT, etc.) and fifteen objects of interest (brain, liver, +lung, etc.). Our datasets encompass a comprehensive range of medical inquiries, +including sixteen distinct question types. Throughout our evaluations, we +devised textual prompts for GPT-4V, directing it to synergize visual and +textual information. The experiments with accuracy score conclude that the +current version of GPT-4V is not recommended for real-world diagnostics due to +its unreliable and suboptimal accuracy in responding to diagnostic medical +questions. In addition, we delineate seven unique facets of GPT-4V's behavior +in medical VQA, highlighting its constraints within this complex arena. The +complete details of our evaluation cases are accessible at +https://github.com/ZhilingYan/GPT4V-Medical-Report. + +
+
+
+
+
+ + ☆ TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language + Understanding + + +
+ Large-scale video-language pre-training has made remarkable strides in +advancing video-language understanding tasks. However, the heavy computational +burden of video encoding remains a formidable efficiency bottleneck, +particularly for long-form videos. These videos contain massive visual tokens +due to their inherent 3D properties and spatiotemporal redundancy, making it +challenging to capture complex temporal and spatial relationships. To tackle +this issue, we propose an efficient method called TEmporal-Spatial Token +Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating +similar frames, as well as similar patches within each frame. TESTA can reduce +the number of visual tokens by 75% and thus accelerate video encoding. Building +upon TESTA, we introduce a pre-trained video-language model equipped with a +divided space-time token aggregation module in each video encoder block. We +evaluate our model on five datasets for paragraph-to-video retrieval and +long-form VideoQA tasks. Experimental results show that TESTA improves +computing efficiency by 1.7 times, and achieves significant performance gains +from its scalability in processing longer input frames, e.g., +13.7 R@1 on +QuerYD and +6.5 R@1 on Condensed Movie. + +
+
+ comment: 16 pages, 9 figures, code is available at + https://github.com/RenShuhuai-Andy/TESTA +
+
+
+
+
+ + ☆ Boosting Decision-Based Black-Box Adversarial Attack with Gradient + Priors IJCAI 2023 + + +
+ Decision-based methods have been shown to be effective in black-box adversarial +attacks, as they can obtain satisfactory performance and only require access to +the final model prediction. Gradient estimation is a critical step in black-box +adversarial attacks, as it will directly affect the query efficiency. Recent +works have attempted to utilize gradient priors to facilitate score-based +methods to obtain better results. However, these gradient priors still suffer +from the edge gradient discrepancy issue and the successive iteration gradient +direction issue, and are thus difficult to simply extend to decision-based methods. +In this paper, we propose a novel Decision-based Black-box Attack framework +with Gradient Priors (DBA-GP), which seamlessly integrates the data-dependent +gradient prior and time-dependent prior into the gradient estimation procedure. +First, by leveraging the joint bilateral filter to deal with each random +perturbation, DBA-GP can guarantee that the generated perturbations in edge +locations are hardly smoothed, i.e., alleviating the edge gradient discrepancy, +thus retaining the characteristics of the original image as much as possible. +Second, by utilizing a new gradient updating strategy to automatically adjust +the successive iteration gradient direction, DBA-GP can accelerate the +convergence speed, thus improving the query efficiency. Extensive experiments +have demonstrated that the proposed method outperforms other strong baselines +significantly. + +
+
+ comment: Accepted by IJCAI 2023 +
+
+
+
+
+ + ☆ FPGAN-Control: A Controllable Fingerprint Generator for Training with + Synthetic Data + + +
+ Training fingerprint recognition models using synthetic data has recently +gained increased attention in the biometric community as it alleviates the +dependency on sensitive personal data. Existing approaches for fingerprint +generation are limited in their ability to generate diverse impressions of the +same finger, a key property for providing effective data for training +recognition models. To address this gap, we present FPGAN-Control, an identity +preserving image generation framework which enables control over the +fingerprint's image appearance (e.g., fingerprint type, acquisition device, +pressure level) of generated fingerprints. We introduce a novel appearance loss +that encourages disentanglement between the fingerprint's identity and +appearance properties. In our experiments, we used the publicly available NIST +SD302 (N2N) dataset for training the FPGAN-Control model. We demonstrate the +merits of FPGAN-Control, both quantitatively and qualitatively, in terms of +identity preservation level, degree of appearance control, and low +synthetic-to-real domain gap. Finally, training recognition models using only +synthetic datasets generated by FPGAN-Control lead to recognition accuracies +that are on par or even surpass models trained using real data. To the best of +our knowledge, this is the first work to demonstrate this. + +
+
+
+
+
+ + ☆ Efficient Test-Time Adaptation for Super-Resolution with Second-Order + Degradation and Reconstruction NeurIPS 2023 + + +
+ Image super-resolution (SR) aims to learn a mapping from low-resolution (LR) +to high-resolution (HR) using paired HR-LR training images. Conventional SR +methods typically gather the paired training data by synthesizing LR images +from HR images using a predetermined degradation model, e.g., Bicubic +down-sampling. However, the realistic degradation type of test images may +mismatch with the training-time degradation type due to the dynamic changes of +the real-world scenarios, resulting in inferior-quality SR images. To address +this, existing methods attempt to estimate the degradation model and train an +image-specific model, which, however, is quite time-consuming and impracticable +to handle rapidly changing domain shifts. Moreover, these methods largely +concentrate on the estimation of one degradation type (e.g., blur degradation), +overlooking other degradation types like noise and JPEG in real-world test-time +scenarios, thus limiting their practicality. To tackle these problems, we +present an efficient test-time adaptation framework for SR, named SRTTA, which +is able to quickly adapt SR models to test domains with different/unknown +degradation types. Specifically, we design a second-order degradation scheme to +construct paired data based on the degradation type of the test image, which is +predicted by a pre-trained degradation classifier. Then, we adapt the SR model +by implementing feature-level reconstruction learning from the initial test +image to its second-order degraded counterparts, which helps the SR model +generate plausible HR images. Extensive experiments are conducted on newly +synthesized corrupted DIV2K datasets with 8 different degradations and several +real-world datasets, demonstrating that our SRTTA framework achieves an +impressive improvement over existing methods with satisfying speed. The source +code is available at https://github.com/DengZeshuai/SRTTA. + +
+
+ comment: Accepted by 37th Conference on Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ☆ Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic + Segmentation NeurIPS 2023 + + +
+ This paper studies the problem of weakly open-vocabulary semantic +segmentation (WOVSS), which learns to segment objects of arbitrary classes +using mere image-text pairs. Existing works enhance the vanilla vision +transformer by introducing explicit grouping recognition, i.e., employing +several group tokens/centroids to cluster the image tokens and perform the +group-text alignment. Nevertheless, these methods suffer from a granularity +inconsistency regarding the usage of group tokens, which are aligned in +all-to-one vs. one-to-one manners during the training and inference phases, +respectively. We argue that this discrepancy arises from the lack of elaborate +supervision for each group token. To bridge this granularity gap, this paper +explores explicit supervision for the group tokens from the prototypical +knowledge. To this end, this paper proposes the non-learnable prototypical +regularization (NPR) where non-learnable prototypes are estimated from source +features to serve as supervision and enable contrastive matching of the group +tokens. This regularization encourages the group tokens to segment objects with +less redundancy and capture more comprehensive semantic regions, leading to +increased compactness and richness. Based on NPR, we propose the prototypical +guidance segmentation network (PGSeg) that incorporates multi-modal +regularization by leveraging prototypical sources from both images and texts at +different levels, progressively enhancing the segmentation capability with +diverse prototypical patterns. Experimental results show that our proposed +method achieves state-of-the-art performance on several benchmark datasets. The +source code is available at https://github.com/Ferenas/PGSeg. + +
+
+ comment: 14 pages, Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ DynPoint: Dynamic Neural Point For View Synthesis + + +
+ The introduction of neural radiance fields has greatly improved the +effectiveness of view synthesis for monocular videos. However, existing +algorithms face difficulties when dealing with uncontrolled or lengthy +scenarios, and require extensive training time specific to each new scenario. +To tackle these limitations, we propose DynPoint, an algorithm designed to +facilitate the rapid synthesis of novel views for unconstrained monocular +videos. Rather than encoding the entirety of the scenario information into a +latent representation, DynPoint concentrates on predicting the explicit 3D +correspondence between neighboring frames to realize information aggregation. +Specifically, this correspondence prediction is achieved through the estimation +of consistent depth and scene flow information across frames. Subsequently, the +acquired correspondence is utilized to aggregate information from multiple +reference frames to a target frame, by constructing hierarchical neural point +clouds. The resulting framework enables swift and accurate view synthesis for +desired views of target frames. The experimental results obtained demonstrate +the considerable acceleration of training time achieved - typically an order of +magnitude - by our proposed method while yielding comparable outcomes compared +to prior approaches. Furthermore, our method exhibits strong robustness in +handling long-duration videos without learning a canonical representation of +video content. + +
+
+
+
+
+ + ☆ Controllable Group Choreography using Contrastive Diffusion + + +
+ Music-driven group choreography poses a considerable challenge but holds +significant potential for a wide range of industrial applications. The ability +to generate synchronized and visually appealing group dance motions that are +aligned with music opens up opportunities in many fields such as entertainment, +advertising, and virtual performances. However, most of the recent works are +not able to generate high-fidelity long-term motions, or fail to enable +controllable experience. In this work, we aim to address the demand for +high-quality and customizable group dance generation by effectively governing +the consistency and diversity of group choreographies. In particular, we +utilize a diffusion-based generative approach to enable the synthesis of +flexible number of dancers and long-term group dances, while ensuring coherence +to the input music. Ultimately, we introduce a Group Contrastive Diffusion +(GCD) strategy to enhance the connection between dancers and their group, +presenting the ability to control the consistency or diversity level of the +synthesized group animation via the classifier-guidance sampling technique. +Through intensive experiments and evaluation, we demonstrate the effectiveness +of our approach in producing visually captivating and consistent group dance +motions. The experimental results show the capability of our method to achieve +the desired levels of consistency and diversity, while maintaining the overall +quality of the generated group choreography. + +
+
+ comment: Accepted in ACM Transactions on Graphics +
+
+
+
+
+ + ☆ Blacksmith: Fast Adversarial Training of Vision Transformers via a + Mixture of Single-step and Multi-step Methods + + +
+ Despite the remarkable success achieved by deep learning algorithms in +various domains, such as computer vision, they remain vulnerable to adversarial +perturbations. Adversarial Training (AT) stands out as one of the most +effective solutions to address this issue; however, single-step AT can lead to +Catastrophic Overfitting (CO). This scenario occurs when the adversarially +trained network suddenly loses robustness against multi-step attacks like +Projected Gradient Descent (PGD). Although several approaches have been +proposed to address this problem in Convolutional Neural Networks (CNNs), we +found out that they do not perform well when applied to Vision Transformers +(ViTs). In this paper, we propose Blacksmith, a novel training strategy to +overcome the CO problem, specifically in ViTs. Our approach utilizes either of +PGD-2 or Fast Gradient Sign Method (FGSM) randomly in a mini-batch during the +adversarial training of the neural network. This will increase the diversity of +our training attacks, which could potentially mitigate the CO issue. To manage +the increased training time resulting from this combination, we craft the PGD-2 +attack based on only the first half of the layers, while FGSM is applied +end-to-end. Through our experiments, we demonstrate that our novel method +effectively prevents CO, achieves PGD-2 level performance, and outperforms +other existing techniques including N-FGSM, which is the state-of-the-art +method in fast training for CNNs. + +
+
+
+
+
+ + ☆ Analyzing Vision Transformers for Image Classification in Class + Embedding Space NeurIPS 2023 + + +
+ Despite the growing use of transformer models in computer vision, a +mechanistic understanding of these networks is still needed. This work +introduces a method to reverse-engineer Vision Transformers trained to solve +image classification tasks. Inspired by previous research in NLP, we +demonstrate how the inner representations at any level of the hierarchy can be +projected onto the learned class embedding space to uncover how these networks +build categorical representations for their predictions. We use our framework +to show how image tokens develop class-specific representations that depend on +attention mechanisms and contextual information, and give insights on how +self-attention and MLP layers differentially contribute to this categorical +composition. We additionally demonstrate that this method (1) can be used to +determine the parts of an image that would be important for detecting the class +of interest, and (2) exhibits significant advantages over traditional linear +probing approaches. Taken together, our results position our proposed framework +as a powerful tool for mechanistic interpretability and explainability +research. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Spacecraft Autonomous Decision-Planning for Collision Avoidance: a + Reinforcement Learning Approach + + +
+ The space environment around the Earth is becoming increasingly populated by +both active spacecraft and space debris. To avoid potential collision events, +significant improvements in Space Situational Awareness (SSA) activities and +Collision Avoidance (CA) technologies are allowing the tracking and maneuvering +of spacecraft with increasing accuracy and reliability. However, these +procedures still largely involve a high level of human intervention to make the +necessary decisions. For an increasingly complex space environment, this +decision-making strategy is not likely to be sustainable. Therefore, it is +important to successfully introduce higher levels of automation for key Space +Traffic Management (STM) processes to ensure the level of reliability needed +for navigating a large number of spacecraft. These processes range from +collision risk detection to the identification of the appropriate action to +take and the execution of avoidance maneuvers. This work proposes an +implementation of autonomous CA decision-making capabilities on spacecraft +based on Reinforcement Learning (RL) techniques. A novel methodology based on a +Partially Observable Markov Decision Process (POMDP) framework is developed to +train the Artificial Intelligence (AI) system on board the spacecraft, +considering epistemic and aleatory uncertainties. The proposed framework +considers imperfect monitoring information about the status of the debris in +orbit and allows the AI system to effectively learn stochastic policies to +perform accurate Collision Avoidance Maneuvers (CAMs). The objective is to +successfully delegate the decision-making process for autonomously implementing +a CAM to the spacecraft without human intervention. This approach would allow +for a faster response in the decision-making process and for highly +decentralized operations. + +
+
+ comment: Preprint accepted in the 74th International Astronautical Congress + (IAC) - Baku, Azerbaijan, 2-6 October 2023 +
+
+
+
+
+ + ☆ AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly + Detection + + +
+ Zero-shot anomaly detection (ZSAD) requires detection models trained using +auxiliary data to detect anomalies without any training sample in a target +dataset. It is a crucial task when training data is not accessible due to +various concerns, e.g., data privacy, yet it is challenging since the models +need to generalize to anomalies across different domains where the appearance +of foreground objects, abnormal regions, and background features, such as +defects/tumors on different products/organs, can vary significantly. Recently, +large pre-trained vision-language models (VLMs), such as CLIP, have +demonstrated strong zero-shot recognition ability in various vision tasks, +including anomaly detection. However, their ZSAD performance is weak since the +VLMs focus more on modeling the class semantics of the foreground objects +rather than the abnormality/normality in the images. In this paper we introduce +a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across +different domains. The key insight of AnomalyCLIP is to learn object-agnostic +text prompts that capture generic normality and abnormality in an image +regardless of its foreground objects. This allows our model to focus on the +abnormal image regions rather than the object semantics, enabling generalized +normality and abnormality recognition on diverse types of objects. Large-scale +experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP +achieves superior zero-shot performance of detecting and segmenting anomalies +in datasets of highly diverse class semantics from various defect inspection +and medical imaging domains. Code will be made available at +https://github.com/zqhang/AnomalyCLIP. + +
+
+
+
+
+ + ☆ Mask Propagation for Efficient Video Semantic Segmentation NeurIPS 2023 + + +
+ Video Semantic Segmentation (VSS) involves assigning a semantic label to each +pixel in a video sequence. Prior work in this field has demonstrated promising +results by extending image semantic segmentation models to exploit temporal +relationships across video frames; however, these approaches often incur +significant computational costs. In this paper, we propose an efficient mask +propagation framework for VSS, called MPVSS. Our approach first employs a +strong query-based image segmentor on sparse key frames to generate accurate +binary masks and class predictions. We then design a flow estimation module +utilizing the learned queries to generate a set of segment-aware flow maps, +each associated with a mask prediction from the key frame. Finally, the +mask-flow pairs are warped to serve as the mask predictions for the non-key +frames. By reusing predictions from key frames, we circumvent the need to +process a large volume of video frames individually with resource-intensive +segmentors, alleviating temporal redundancy and significantly reducing +computational costs. Extensive experiments on VSPW and Cityscapes demonstrate +that our mask propagation framework achieves SOTA accuracy and efficiency +trade-offs. For instance, our best model with Swin-L backbone outperforms the +SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW +dataset. Moreover, our framework reduces up to 4x FLOPs compared to the +per-frame Mask2Former baseline with only up to 2% mIoU degradation on the +Cityscapes validation set. Code is available at +https://github.com/ziplab/MPVSS. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ TIC-TAC: A Framework To Learn And Evaluate Your Covariance + + +
+ We study the problem of unsupervised heteroscedastic covariance estimation, +where the goal is to learn the multivariate target distribution $\mathcal{N}(y, +\Sigma_y | x )$ given an observation $x$. This problem is particularly +challenging as $\Sigma_{y}$ varies for different samples (heteroscedastic) and +no annotation for the covariance is available (unsupervised). Typically, +state-of-the-art methods predict the mean $f_{\theta}(x)$ and covariance +$\textrm{Cov}(f_{\theta}(x))$ of the target distribution through two neural +networks trained using the negative log-likelihood. This raises two questions: +(1) Does the predicted covariance truly capture the randomness of the predicted +mean? (2) In the absence of ground-truth annotation, how can we quantify the +performance of covariance estimation? We address (1) by deriving TIC: Taylor +Induced Covariance, which captures the randomness of the multivariate +$f_{\theta}(x)$ by incorporating its gradient and curvature around $x$ through +the second order Taylor polynomial. Furthermore, we tackle (2) by introducing +TAC: Task Agnostic Correlations, a metric which leverages conditioning of the +normal distribution to evaluate the covariance. We verify the effectiveness of +TIC through multiple experiments spanning synthetic (univariate, multivariate) +and real-world datasets (UCI Regression, LSP, and MPII Human Pose Estimation). +Our experiments show that TIC outperforms state-of-the-art in accurately +learning the covariance, as quantified through TAC. + +
+
+ comment: 12 pages, 4 figures. Please feel free to provide feedback! +
+
+
+
+
+ + ☆ Customize StyleGAN with One Hand Sketch + + +
+ Generating images from human sketches typically requires dedicated networks +trained from scratch. In contrast, the emergence of the pre-trained +Vision-Language models (e.g., CLIP) has propelled generative applications based +on controlling the output imagery of existing StyleGAN models with text inputs +or reference images. Parallelly, our work proposes a framework to control +StyleGAN imagery with a single user sketch. In particular, we learn a +conditional distribution in the latent space of a pre-trained StyleGAN model +via energy-based learning and propose two novel energy functions leveraging +CLIP for cross-domain semantic supervision. Once trained, our model can +generate multi-modal images semantically aligned with the input sketch. +Quantitative evaluations on synthesized datasets have shown that our approach +improves significantly from previous methods in the one-shot regime. The +superiority of our method is further underscored when experimenting with a wide +range of human sketches of diverse styles and poses. Surprisingly, our models +outperform the previous baseline regarding both the range of sketch inputs and +image qualities despite operating with a stricter setting: with no extra +training data and single sketch input. + +
+
+ comment: preprint +
+
+
+
+
+ + ☆ Video Frame Interpolation with Many-to-many Splatting and Spatial + Selective Refinement + + +
+ In this work, we first propose a fully differentiable Many-to-Many (M2M) +splatting framework to interpolate frames efficiently. Given a frame pair, we +estimate multiple bidirectional flows to directly forward warp the pixels to +the desired time step before fusing overlapping pixels. In doing so, each +source pixel renders multiple target pixels and each target pixel can be +synthesized from a larger area of visual context, establishing a many-to-many +splatting scheme with robustness to undesirable artifacts. For each input frame +pair, M2M has a minuscule computational overhead when interpolating an +arbitrary number of in-between frames, hence achieving fast multi-frame +interpolation. However, directly warping and fusing pixels in the intensity +domain is sensitive to the quality of motion estimation and may suffer from +less effective representation capacity. To improve interpolation accuracy, we +further extend an M2M++ framework by introducing a flexible Spatial Selective +Refinement (SSR) component, which allows for trading computational efficiency +for interpolation quality and vice versa. Instead of refining the entire +interpolated frame, SSR only processes difficult regions selected under the +guidance of an estimated error map, thereby avoiding redundant computation. +Evaluation on multiple benchmark datasets shows that our method is able to +improve the efficiency while maintaining competitive video interpolation +quality, and it can be adjusted to use more or less compute as needed. + +
+
+ comment: T-PAMI. arXiv admin note: substantial text overlap with + arXiv:2204.03513 +
+
+
+
+
+ + ☆ Adversarial Examples Are Not Real Features NeurIPS 2023 + + +
+ The existence of adversarial examples has been a mystery for years and +attracted much interest. A well-known theory by Ilyas et al. (2019) +explains adversarial vulnerability from a data perspective by showing that one +can extract non-robust features from adversarial examples and these features +alone are useful for classification. However, the explanation remains quite +counter-intuitive since non-robust features are mostly noise features to +humans. In this paper, we re-examine the theory from a larger context by +incorporating multiple learning paradigms. Notably, we find that contrary to +their good usefulness under supervised learning, non-robust features attain +poor usefulness when transferred to other self-supervised learning paradigms, +such as contrastive learning, masked image modeling, and diffusion models. It +reveals that non-robust features are not really as useful as robust or natural +features that enjoy good transferability between these paradigms. Meanwhile, +for robustness, we also show that naturally trained encoders from robust +features are largely non-robust under AutoAttack. Our cross-paradigm +examination suggests that the non-robust features are not really useful but +more like paradigm-wise shortcuts, and robust features alone might be +insufficient to attain reliable model robustness. Code is available at +https://github.com/PKU-ML/AdvNotRealFeatures. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Label Poisoning is All You Need + + +
+ In a backdoor attack, an adversary injects corrupted data into a model's +training dataset in order to gain control over its predictions on images with a +specific attacker-defined trigger. A typical corrupted training example +requires altering both the image, by applying the trigger, and the label. +Models trained on clean images, therefore, were considered safe from backdoor +attacks. However, in some common machine learning scenarios, the training +labels are provided by potentially malicious third-parties. This includes +crowd-sourced annotation and knowledge distillation. We, hence, investigate a +fundamental question: can we launch a successful backdoor attack by only +corrupting labels? We introduce a novel approach to design label-only backdoor +attacks, which we call FLIP, and demonstrate its strengths on three datasets +(CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32, +ResNet-18, VGG-19, and Vision Transformer). With only 2% of CIFAR-10 labels +corrupted, FLIP achieves a near-perfect attack success rate of 99.4% while +suffering only a 1.8% drop in the clean test accuracy. Our approach builds upon +the recent advances in trajectory matching, originally introduced for dataset +distillation. + +
+
+
+
+
+ + ☆ A transfer learning approach with convolutional neural network for Face + Mask Detection + + +
+ Due to the epidemic of the coronavirus (Covid-19) and its rapid spread around +the world, the world has faced an enormous crisis. To prevent the spread of the +coronavirus, the World Health Organization (WHO) has recommended the use of +masks and social distancing as the best preventive measures. Developing +an automatic monitoring system for detecting face masks in crowded places +is therefore essential. To do this, we propose a mask recognition system based on +transfer learning and the Inception v3 architecture. In the proposed method, two +datasets are used simultaneously for training: the Simulated Mask Face +Dataset (SMFD) and MaskedFace-Net (MFN). This paper tries to increase the +accuracy of the proposed system by optimally setting hyper-parameters and +accurately designing the fully connected layers. The main advantage of the +proposed method is that in addition to masked and unmasked faces, it can also +detect cases of incorrect mask use. Therefore, the proposed method +classifies the input face images into three categories. Experimental results +show the high accuracy and efficiency of the proposed method, which +achieves accuracies of 99.47% and 99.33% on the training and test data, +respectively. + +
+
+ comment: 9 pages, in Persian language, 8 figures +
+
+
+
+
+ + ☆ CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved + Self-Supervised Video Hashing ACM MM 2023 + + +
+ Compressing videos into binary codes can improve retrieval speed and reduce +storage overhead. However, learning accurate hash codes for video retrieval can +be challenging due to high local redundancy and complex global dependencies +between video frames, especially in the absence of labels. Existing +self-supervised video hashing methods have been effective in designing +expressive temporal encoders, but have not fully utilized the temporal dynamics +and spatial appearance of videos due to less challenging and unreliable +learning tasks. To address these challenges, we begin by utilizing the +contrastive learning task to capture global spatio-temporal information of +videos for hashing. With the aid of our designed augmentation strategies, which +focus on spatial and temporal variations to create positive pairs, the learning +framework can generate hash codes that are invariant to motion, scale, and +viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e., +frame order verification and scene change regularization, to capture local +spatio-temporal details within video frames, thereby enhancing the perception +of temporal structure and the modeling of spatio-temporal relationships. Our +proposed Contrastive Hashing with Global-Local Spatio-temporal Information +(CHAIN) outperforms state-of-the-art self-supervised video hashing methods on +four video benchmark datasets. Our codes will be released. + +
+
+ comment: 12 pages, 8 figures, accepted by ACM MM 2023 +
+
+
+
+
+ + ☆ QWID: Quantized Weed Identification Deep neural network + + +
+ In this paper, we present an efficient solution for weed classification in +agriculture. We focus on optimizing model performance at inference while +respecting the constraints of the agricultural domain. We propose a Quantized +Deep Neural Network model that classifies a dataset of 9 weed classes using +8-bit integer (int8) quantization, a departure from standard 32-bit floating +point (fp32) models. Recognizing the hardware resource limitations in +agriculture, our model balances model size, inference time, and accuracy, +aligning with practical requirements. We evaluate the approach on ResNet-50 and +InceptionV3 architectures, comparing their performance against their int8 +quantized versions. Transfer learning and fine-tuning are applied using the +DeepWeeds dataset. The results show staggering model size and inference time +reductions while maintaining accuracy in real-world production scenarios like +Desktop, Mobile and Raspberry Pi. Our work sheds light on a promising direction +for efficient AI in agriculture, holding potential for broader applications. + Code: https://github.com/parikshit14/QNN-for-weed + +
+
+ comment: 6 pages, 6 figures, 4 tables +
+
+
+
+
+ + ☆ Improving Multi-Person Pose Tracking with A Confidence Network + + +
+ Human pose estimation and tracking are fundamental tasks for understanding +human behaviors in videos. Existing top-down framework-based methods usually +perform three-stage tasks: human detection, pose estimation and tracking. +Although promising results have been achieved, these methods rely heavily on +high-performance detectors and may fail to track persons who are occluded or +missed by the detector. To overcome these problems, in this paper, we develop a novel +keypoint confidence network and a tracking pipeline to improve human detection +and pose estimation in top-down approaches. Specifically, the keypoint +confidence network is designed to determine whether each keypoint is occluded, +and it is incorporated into the pose estimation module. In the tracking +pipeline, we propose the Bbox-revision module to reduce missed detections and +the ID-retrieve module to correct lost trajectories, improving the performance +of the detection stage. Experimental results show that our approach is +universal in human detection and pose estimation, achieving state-of-the-art +performance on both PoseTrack 2017 and 2018 datasets. + +
+
+ comment: Accepted by IEEE Transactions on Multimedia. 11 pages, 5 figures +
+
+
+
+
+ + ☆ TiV-NeRF: Tracking and Mapping via Time-Varying Representation with + Dynamic Neural Radiance Fields + + +
+ Previous attempts to integrate Neural Radiance Fields (NeRF) into the +Simultaneous Localization and Mapping (SLAM) framework either rely on the +assumption of static scenes or treat dynamic objects as outliers. However, most +real-world scenarios are dynamic. In this paper, we propose a time-varying +representation to track and reconstruct dynamic scenes. Our system +simultaneously maintains two processes: a tracking process and a mapping process. +For the tracking process, the input images are uniformly sampled in their entirety and +training on the RGB images is self-supervised. For the mapping process, we +leverage known masks to differentiate dynamic objects from static backgrounds, +and we apply distinct sampling strategies to the two types of areas. The +parameter optimization for both processes consists of two stages. The first +stage associates time with 3D positions to convert the deformation field to the +canonical field. The second associates time with 3D positions in the canonical +field to obtain colors and Signed Distance Function (SDF) values. In addition, we propose +a novel keyframe selection strategy based on the overlapping rate. We evaluate +our approach on two publicly available synthetic datasets and validate that our +method is more effective than current state-of-the-art dynamic mapping +methods. + +
+
+
+
+
+ + ☆ InstanT: Semi-supervised Learning with Instance-dependent Thresholds NeurIPS 2023 + + +
+ Semi-supervised learning (SSL) has been a fundamental challenge in machine +learning for decades. The primary family of SSL algorithms, known as +pseudo-labeling, involves assigning pseudo-labels to confident unlabeled +instances and incorporating them into the training set. Therefore, the +selection criteria of confident instances are crucial to the success of SSL. +Recently, there has been growing interest in the development of SSL methods +that use dynamic or adaptive thresholds. Yet, these methods typically apply the +same threshold to all samples, or use class-dependent thresholds for instances +belonging to a certain class, while neglecting instance-level information. In +this paper, we propose the study of instance-dependent thresholds, which has +the highest degree of freedom compared with existing methods. Specifically, we +devise a novel instance-dependent threshold function for all unlabeled +instances by utilizing their instance-level ambiguity and the +instance-dependent error rates of pseudo-labels, so instances that are more +likely to have incorrect pseudo-labels will have higher thresholds. +Furthermore, we demonstrate that our instance-dependent threshold function +provides a bounded probabilistic guarantee for the correctness of the +pseudo-labels it assigns. + +
+
+ comment: Accepted as poster for NeurIPS 2023 +
+
+
+
+
+ + ☆ Identifiable Contrastive Learning with Automatic Feature Importance + Discovery + + +
+ Existing contrastive learning methods rely on pairwise sample contrast +$z_x^\top z_{x'}$ to learn data representations, but the learned features often +lack clear interpretability from a human perspective. Theoretically, it lacks +feature identifiability and different initialization may lead to totally +different features. In this paper, we study a new method named tri-factor +contrastive learning (triCL) that involves a 3-factor contrast in the form of +$z_x^\top S z_{x'}$, where $S=\text{diag}(s_1,\dots,s_k)$ is a learnable +diagonal matrix that automatically captures the importance of each feature. We +show that by this simple extension, triCL can not only obtain identifiable +features that eliminate randomness but also obtain more interpretable features +that are ordered according to the importance matrix $S$. We show that features +with high importance have nice interpretability by capturing common classwise +features, and obtain superior performance when evaluated for image retrieval +using a few features. The proposed triCL objective is general and can be +applied to different contrastive learning methods like SimCLR and CLIP. We +believe that it is a better alternative to existing 2-factor contrastive +learning by improving its identifiability and interpretability with minimal +overhead. Code is available at +https://github.com/PKU-ML/Tri-factor-Contrastive-Learning. + +
+
+
+
+
+ + ☆ Multi-task deep learning for large-scale building detail extraction from + high-resolution satellite imagery + + +
+ Understanding urban dynamics and promoting sustainable development requires +comprehensive insights about buildings. While geospatial artificial +intelligence has advanced the extraction of such details from Earth +observational data, existing methods often suffer from computational +inefficiencies and inconsistencies when compiling unified building-related +datasets for practical applications. To bridge this gap, we introduce the +Multi-task Building Refiner (MT-BR), an adaptable neural network tailored for +simultaneous extraction of spatial and attributional building details from +high-resolution satellite imagery, exemplified by building rooftops, urban +functional types, and roof architectural types. Notably, MT-BR can be +fine-tuned to incorporate additional building details, extending its +applicability. For large-scale applications, we devise a novel spatial sampling +scheme that strategically selects limited but representative image samples. +This process optimizes both the spatial distribution of samples and the urban +environmental characteristics they contain, thus enhancing extraction +effectiveness while curtailing data preparation expenditures. We further +enhance MT-BR's predictive performance and generalization capabilities through +the integration of advanced augmentation techniques. Our quantitative results +highlight the efficacy of the proposed methods. Specifically, networks trained +with datasets curated via our sampling method demonstrate improved predictive +accuracy relative to those using alternative sampling approaches, with no +alterations to network architecture. Moreover, MT-BR consistently outperforms +other state-of-the-art methods in extracting building details across various +metrics. The real-world practicality is also demonstrated in an application +across Shanghai, generating a unified dataset that encompasses both the spatial +and attributional details of buildings. + +
+
+
+
+
+ + ☆ Emergence of Shape Bias in Convolutional Neural Networks through + Activation Sparsity NeurIPS 2023 + + +
+ Current deep-learning models for object recognition are known to be heavily +biased toward texture. In contrast, human visual systems are known to be biased +toward shape and structure. What could be the design principles in human visual +systems that led to this difference? How could we introduce more shape bias +into the deep learning models? In this paper, we report that sparse coding, a +ubiquitous principle in the brain, can in itself introduce shape bias into the +network. We found that enforcing the sparse coding constraint using a +non-differentiable Top-K operation can lead to the emergence of structural +encoding in neurons in convolutional neural networks, resulting in a smooth +decomposition of objects into parts and subparts and endowing the networks with +shape bias. We demonstrated this emergence of shape bias and its functional +benefits for different network structures with various datasets. For object +recognition convolutional neural networks, the shape bias leads to greater +robustness against style and pattern change distraction. For the image +synthesis generative adversarial networks, the emerged shape bias leads to more +coherent and decomposable structures in the synthesized images. Ablation +studies suggest that sparse codes tend to encode structures, whereas the more +distributed codes tend to favor texture. Our code is hosted in the GitHub +repository: https://github.com/Crazy-Jack/nips2023_shape_vs_texture + +
+
+ comment: Published as NeurIPS 2023 (Oral) +
+
+
+
+
+ + ☆ Towards Generalized Multi-stage Clustering: Multi-view Self-distillation + + +
+ Existing multi-stage clustering methods independently learn the salient +features from multiple views and then perform the clustering task. +Particularly, multi-view clustering (MVC) has attracted a lot of attention in +multi-view or multi-modal scenarios. MVC aims at exploring common semantics and +pseudo-labels from multiple views and clustering in a self-supervised manner. +However, limited by noisy data and inadequate feature learning, such a +clustering paradigm generates overconfident pseudo-labels that misguide the +model to produce inaccurate predictions. Therefore, it is desirable to have a +method that can correct these misleading pseudo-labels in multi-stage clustering +to avoid the bias accumulation. To alleviate the effect of overconfident +pseudo-labels and improve the generalization ability of the model, this paper +proposes a novel multi-stage deep MVC framework where multi-view +self-distillation (DistilMVC) is introduced to distill dark knowledge of label +distribution. Specifically, in the feature subspace at different hierarchies, +we explore the common semantics of multiple views through contrastive learning +and obtain pseudo-labels by maximizing the mutual information between views. +Additionally, a teacher network is responsible for distilling pseudo-labels +into dark knowledge, supervising the student network and improving its +predictive capabilities to enhance the robustness. Extensive experiments on +real-world multi-view datasets show that our method has better clustering +performance than state-of-the-art methods. + +
+
+
+
+
+ + ☆ Dynamo-Depth: Fixing Unsupervised Depth Estimation for Dynamical Scenes NeurIPS 2023 + + +
+ Unsupervised monocular depth estimation techniques have demonstrated +encouraging results but typically assume that the scene is static. These +techniques suffer when trained on dynamical scenes, where apparent object +motion can equally be explained by hypothesizing the object's independent +motion, or by altering its depth. This ambiguity causes depth estimators to +predict erroneous depth for moving objects. To resolve this issue, we introduce +Dynamo-Depth, a unifying approach that disambiguates dynamical motion by +jointly learning monocular depth, 3D independent flow field, and motion +segmentation from unlabeled monocular videos. Specifically, we offer our key +insight that a good initial estimation of motion segmentation is sufficient for +jointly learning depth and independent motion despite the fundamental +underlying ambiguity. Our proposed method achieves state-of-the-art performance +on monocular depth estimation on Waymo Open and nuScenes Dataset with +significant improvement in the depth of moving objects. Code and additional +results are available at https://dynamo-depth.github.io. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Differentiable Learning of Generalized Structured Matrices for Efficient + Deep Neural Networks + + +
+ This paper investigates efficient deep neural networks (DNNs) to replace +dense unstructured weight matrices with structured ones that possess desired +properties. The challenge arises because the optimal weight matrix structure in +popular neural network models is obscure in most cases and may vary from layer +to layer even in the same network. Prior structured matrices proposed for +efficient DNNs were mostly hand-crafted without a generalized framework to +systematically learn them. To address this issue, we propose a generalized and +differentiable framework to learn efficient structures of weight matrices by +gradient descent. We first define a new class of structured matrices that +covers a wide range of structured matrices in the literature by adjusting the +structural parameters. Then, the frequency-domain differentiable +parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to +learn the structural parameters by proximal gradient descent. Finally, we +introduce an effective initialization method for the proposed scheme. Our +method learns efficient DNNs with structured matrices, achieving lower +complexity and/or higher performance than prior approaches that employ +low-rank, block-sparse, or block-low-rank matrices. + +
+
+
+
+
+ + ☆ HDMNet: A Hierarchical Matching Network with Double Attention for + Large-scale Outdoor LiDAR Point Cloud Registration WACV2024 + + +
+ Outdoor LiDAR point clouds are typically large-scale and complexly +distributed. To achieve efficient and accurate registration, emphasizing the +similarity among local regions and prioritizing global local-to-local matching +is of utmost importance, subsequent to which accuracy can be enhanced through +cost-effective fine registration. In this paper, a novel hierarchical neural +network with double attention named HDMNet is proposed for large-scale outdoor +LiDAR point cloud registration. Specifically, a novel feature consistency +enhanced double-soft matching network is introduced to achieve two-stage +matching with high flexibility while enlarging the receptive field with high +efficiency in a patch-to-patch manner, which significantly improves the +registration performance. Moreover, in order to further utilize the sparse +matching information from the deeper layer, we develop a novel trainable embedding +mask to incorporate the confidence scores of correspondences obtained from pose +estimation of the deeper layer, eliminating additional computations. The +high-confidence keypoints in the sparser point cloud of the deeper layer +correspond to a high-confidence spatial neighborhood region in the shallower layer, +which will receive more attention, while the features of non-key regions will +be masked. Extensive experiments are conducted on two large-scale outdoor LiDAR +point cloud datasets to demonstrate the high accuracy and efficiency of the +proposed HDMNet. + +
+
+ comment: Accepted by WACV2024 +
+
+
+
+
+ + ♻ ☆ Unlocking Feature Visualization for Deeper Networks with MAgnitude + Constrained Optimization + + +
+ Feature visualization has gained substantial popularity, particularly after +the influential work by Olah et al. in 2017, which established it as a crucial +tool for explainability. However, its widespread adoption has been limited due +to a reliance on tricks to generate interpretable images, and corresponding +challenges in scaling it to deeper neural networks. Here, we describe MACO, a +simple approach to address these shortcomings. The main idea is to generate +images by optimizing the phase spectrum while keeping the magnitude constant to +ensure that generated explanations lie in the space of natural images. Our +approach yields significantly better results (both qualitatively and +quantitatively) and unlocks efficient and interpretable feature visualizations +for large state-of-the-art neural networks. We also show that our approach +exhibits an attribution mechanism allowing us to augment feature visualizations +with spatial importance. We validate our method on a novel benchmark for +comparing feature visualization methods, and release its visualizations for all +classes of the ImageNet dataset on https://serre-lab.github.io/Lens/. + Overall, our approach unlocks, for the first time, feature visualizations for +large, state-of-the-art deep neural networks without resorting to any +parametric prior image model. + +
+
+
+
+
+ + ♻ ☆ Jumping through Local Minima: Quantization in the Loss Landscape of + Vision Transformers + + +
+ Quantization scale and bit-width are the most important parameters when +considering how to quantize a neural network. Prior work focuses on optimizing +quantization scales in a global manner through gradient methods (gradient +descent and Hessian analysis). Yet, when applying perturbations to quantization +scales, we observe a very jagged, highly non-smooth test loss landscape. In +fact, small perturbations in quantization scale can greatly affect accuracy, +yielding a $0.5-0.8\%$ accuracy boost in 4-bit quantized vision transformers +(ViTs). In this regime, gradient methods break down, since they cannot reliably +reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to +effectively traverse the non-smooth landscape. Additionally, we propose using +an infoNCE loss, which not only helps combat overfitting on the small +calibration dataset ($1,000$ images) but also makes traversing such a highly +non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully +quantized ViT-Base by $10.30\%$, $0.78\%$, and $0.15\%$ for $3$-bit, $4$-bit, +and $8$-bit weight quantization levels. Extensive experiments on a variety of +CNN and ViT architectures further demonstrate its robustness in extreme +quantization scenarios. Our code is available at +https://github.com/enyac-group/evol-q + +
+
+ comment: arXiv admin note: text overlap with arXiv:2211.09643 +
+
+
+
+
+ + ♻ ☆ Diversify Your Vision Datasets with Automatic Diffusion-Based + Augmentation + + +
+ Many fine-grained classification tasks, like rare animal identification, have +limited training data and consequently classifiers trained on these datasets +often fail to generalize to variations in the domain like changes in weather or +location. As such, we explore how natural language descriptions of the domains +seen in training data can be used with large vision models trained on diverse +pretraining datasets to generate useful variations of the training data. We +introduce ALIA (Automated Language-guided Image Augmentation), a method which +utilizes large vision and language models to automatically generate natural +language descriptions of a dataset's domains and augment the training data via +language-guided image editing. To maintain data integrity, a model trained on +the original dataset filters out minimal image edits and those which corrupt +class-relevant information. The resulting dataset is visually consistent with +the original training data and offers significantly enhanced diversity. We show +that ALIA surpasses traditional data augmentation and text-to-image +generated data on fine-grained classification tasks, including cases of domain +generalization and contextual bias. Code is available at +https://github.com/lisadunlap/ALIA. + +
+
+ comment: Update: replaced Planes dataset with Waterbirds & updated results + after bug fix +
+
+
+
+
+ + ♻ ☆ On the Vulnerability of DeepFake Detectors to Attacks Generated by + Denoising Diffusion Models + + +
+ The detection of malicious deepfakes is a constantly evolving problem that +requires continuous monitoring of detectors to ensure they can detect image +manipulations generated by the latest emerging models. In this paper, we +investigate the vulnerability of single-image deepfake detectors to black-box +attacks created by the newest generation of generative methods, namely +Denoising Diffusion Models (DDMs). Our experiments are run on FaceForensics++, +a widely used deepfake benchmark consisting of manipulated images generated +with various techniques for face identity swapping and face reenactment. +Attacks are crafted through guided reconstruction of existing deepfakes with a +proposed DDM approach for face restoration. Our findings indicate that +employing just a single denoising diffusion step in the reconstruction process +of a deepfake can significantly reduce the likelihood of detection, all without +introducing any perceptible image modifications. While training detectors using +attack examples demonstrated some effectiveness, it was observed that +discriminators trained on fully diffusion-based deepfakes exhibited limited +generalizability when presented with our attacks. + +
+
+ comment: Submitted for review +
+
+
+
+
+ + ♻ ☆ Towards Improved Input Masking for Convolutional Neural Networks ICCV 2023 + + +
+ The ability to remove features from the input of machine learning models is +very important to understand and interpret model predictions. However, this is +non-trivial for vision models since masking out parts of the input image +typically causes large distribution shifts. This is because the baseline color +used for masking (typically grey or black) is out of distribution. Furthermore, +the shape of the mask itself can contain unwanted signals which can be used by +the model for its predictions. Recently, there has been some progress in +mitigating this issue (called missingness bias) in image masking for vision +transformers. In this work, we propose a new masking method for CNNs we call +layer masking in which the missingness bias caused by masking is reduced to a +large extent. Intuitively, layer masking applies a mask to intermediate +activation maps so that the model only processes the unmasked input. We show +that our method (i) is able to eliminate or minimize the influence of the mask +shape or color on the output of the model, and (ii) is much better than +replacing the masked region by black or grey for input perturbation based +interpretability techniques like LIME. Thus, layer masking is much less +affected by missingness bias than other masking strategies. We also demonstrate +how the shape of the mask may leak information about the class, thus affecting +estimates of model reliance on class-relevant features derived from input +masking. Furthermore, we discuss the role of data augmentation techniques for +tackling this problem, and argue that they are not sufficient for preventing +model reliance on mask shape. The code for this project is publicly available +at https://github.com/SriramB-98/layer_masking + +
+
+ comment: 29 pages, 19 figures. Accepted at ICCV 2023 +
+
+
+
+
+ + ♻ ☆ A Visual Active Search Framework for Geospatial Exploration WACV 2024 + + +
+ Many problems can be viewed as forms of geospatial search aided by aerial +imagery, with examples ranging from detecting poaching activity to human +trafficking. We model this class of problems in a visual active search (VAS) +framework, which has three key inputs: (1) an image of the entire search area, +which is subdivided into regions, (2) a local search function, which determines +whether a previously unseen object class is present in a given region, and (3) +a fixed search budget, which limits the number of times the local search +function can be evaluated. The goal is to maximize the number of objects found +within the search budget. We propose a reinforcement learning approach for VAS +that learns a meta-search policy from a collection of fully annotated search +tasks. This meta-search policy is then used to dynamically search for a novel +target-object class, leveraging the outcome of any previous queries to +determine where to query next. Through extensive experiments on several +large-scale satellite imagery datasets, we show that the proposed approach +significantly outperforms several strong baselines. We also propose novel +domain adaptation techniques that improve the policy at decision time when +there is a significant domain gap with the training data. Code is publicly +available. + +
+
+ comment: Accepted to WACV 2024, 24 pages, 18 figures, Code is available at: + https://github.com/anindyasarkarIITH/VAS +
+
+
+
+
+ + ♻ ☆ Alignment with human representations supports robust few-shot learning NeurIPS 2023 + + +
+ Should we care whether AI systems have representations of the world that are +similar to those of humans? We provide an information-theoretic analysis that +suggests that there should be a U-shaped relationship between the degree of +representational alignment with humans and performance on few-shot learning +tasks. We confirm this prediction empirically, finding such a relationship in +an analysis of the performance of 491 computer vision models. We also show that +highly-aligned models are more robust to both natural adversarial attacks and +domain shifts. Our results suggest that human-alignment is often a sufficient, +but not necessary, condition for models to make effective use of limited data, +be robust, and generalize well. + +
+
+ comment: Spotlight at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Federated Learning for Medical Applications: A Taxonomy, Current Trends, + Challenges, and Future Research Directions + + +
+ With the advent of the IoT, AI, ML, and DL algorithms, the landscape of +data-driven medical applications has emerged as a promising avenue for +designing robust and scalable diagnostic and prognostic models from medical +data. This has gained a lot of attention from both academia and industry, +leading to significant improvements in healthcare quality. However, the +adoption of AI-driven medical applications still faces tough challenges, +including meeting security, privacy, and quality of service (QoS) standards. +Recent developments in federated learning (FL) have made it possible to train complex +machine-learned models in a distributed manner and have become an active +research domain, particularly processing the medical data at the edge of the +network in a decentralized way to preserve privacy and address security +concerns. To this end, in this paper, we explore the present and future of FL +technology in medical applications where data sharing is a significant +challenge. We delve into the current research trends and their outcomes, +unravelling the complexities of designing reliable and scalable FL models. +Our paper outlines the fundamental statistical issues in FL, tackles +device-related problems, addresses security challenges, and navigates the +complexity of privacy concerns, all while highlighting its transformative +potential in the medical field. Our study primarily focuses on medical +applications of FL, particularly in the context of global cancer +diagnosis. We highlight the potential of FL to enable computer-aided diagnosis +tools that address this challenge with greater effectiveness than traditional +data-driven methods. We hope that this comprehensive review will serve as a +checkpoint for the field, summarizing the current state-of-the-art and +identifying open problems and future research directions. + +
+
+ comment: Accepted at IEEE Internet of Things Journal +
+
+
+
+
+ + ♻ ☆ Guided Motion Diffusion for Controllable Human Motion Synthesis ICCV23 + + +
+ Denoising diffusion models have shown great promise in human motion synthesis +conditioned on natural language descriptions. However, integrating spatial +constraints, such as pre-defined motion trajectories and obstacles, remains a +challenge despite being essential for bridging the gap between isolated human +motion and its surrounding environment. To address this issue, we propose +Guided Motion Diffusion (GMD), a method that incorporates spatial constraints +into the motion generation process. Specifically, we propose an effective +feature projection scheme that manipulates motion representation to enhance the +coherency between spatial information and local poses. Together with a new +imputation formulation, the generated motion can reliably conform to spatial +constraints such as global motion trajectories. Furthermore, given sparse +spatial constraints (e.g. sparse keyframes), we introduce a new dense guidance +approach to turn a sparse signal, which is susceptible to being ignored during +the reverse steps, into denser signals to guide the generated motion to the +given constraints. Our extensive experiments justify the development of GMD, +which achieves a significant improvement over state-of-the-art methods in +text-based motion generation while allowing control of the synthesized motions +with spatial constraints. + +
+
+ comment: ICCV23. Project page: https://korrawe.github.io/gmd-project/ +
+
+
+
+
+ + ♻ ☆ STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced + Audio-Visual Diarization CVPR 2023 + + +
+ This report introduces our novel method named STHG for the Audio-Visual +Diarization task of the Ego4D Challenge 2023. Our key innovation is that we +model all the speakers in a video using a single, unified heterogeneous graph +learning framework. Unlike previous approaches that require a separate +component solely for the camera wearer, STHG can jointly detect the speech +activities of all people including the camera wearer. Our final method obtains +61.1% DER on the test set of Ego4D, which significantly outperforms all the +baselines as well as last year's winner. Our submission achieved 1st place in +the Ego4D Challenge 2023. We additionally demonstrate that applying the +off-the-shelf speech recognition system to the diarized speech segments by STHG +produces a competitive performance on the Speech Transcription task of this +challenge. + +
+
+ comment: Validation report for the Ego4D challenge at CVPR 2023 +
+
+
+
+
+ + ♻ ☆ Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual + Diarization ECCV 2022 + + +
+ This report describes our approach for the Audio-Visual Diarization (AVD) +task of the Ego4D Challenge 2022. Specifically, we present multiple technical +improvements over the official baselines. First, we improve the detection +performance of the camera wearer's voice activity by modifying the training +scheme of its model. Second, we discover that an off-the-shelf voice activity +detection model can effectively remove false positives when it is applied +solely to the camera wearer's voice activities. Lastly, we show that better +active speaker detection leads to a better AVD outcome. Our final method +obtains 65.9% DER on the test set of Ego4D, which significantly outperforms all +the baselines. Our submission achieved 1st place in the Ego4D Challenge 2022. + +
+
+ comment: Validation report for the Ego4D challenge at ECCV 2022 +
+
+
+
+
+ + ♻ ☆ MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary + Polyp Segmentation WACV 2024 + + +
+ Efficient polyp segmentation in healthcare plays a critical role in enabling +early diagnosis of colorectal cancer. However, the segmentation of polyps +presents numerous challenges, including the intricate distribution of +backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. +Defining the boundary between the foreground (i.e. polyp itself) and the +background (surrounding tissue) is difficult. To mitigate these challenges, we +propose Multi-Scale Edge-Guided Attention Network (MEGANet) tailored +specifically for polyp segmentation within colonoscopy images. This network +draws inspiration from the fusion of a classical edge detection technique with +an attention mechanism. By combining these techniques, MEGANet effectively +preserves high-frequency information, notably edges and boundaries, which tend +to erode as neural networks deepen. MEGANet is designed as an end-to-end +framework, encompassing three key modules: an encoder, which is responsible for +capturing and abstracting the features from the input image, a decoder, which +focuses on salient features, and the Edge-Guided Attention module (EGA) that +employs the Laplacian Operator to accentuate polyp boundaries. Extensive +experiments, both qualitative and quantitative, on five benchmark datasets, +demonstrate that our MEGANet outperforms other existing SOTA methods under six +evaluation metrics. Our code is available at +https://github.com/UARK-AICV/MEGANet. + +
+
+ comment: Accepted at the IEEE/CVF Winter Conference on Applications of + Computer Vision (WACV 2024) +
+
+
+
+
+ + ♻ ☆ Fairness and Bias in Robot Learning + + +
+ Machine learning has significantly enhanced the abilities of robots, enabling +them to perform a wide range of tasks in human environments and adapt to our +uncertain real world. Recent works in various machine learning domains have +highlighted the importance of accounting for fairness to ensure that these +algorithms do not reproduce human biases and consequently lead to +discriminatory outcomes. With robot learning systems increasingly performing +more and more tasks in our everyday lives, it is crucial to understand the +influence of such biases to prevent unintended behavior toward certain groups +of people. In this work, we present the first survey on fairness in robot +learning from an interdisciplinary perspective spanning technical, ethical, and +legal challenges. We propose a taxonomy for sources of bias and the resulting +types of discrimination due to them. Using examples from different robot +learning domains, we examine scenarios of unfair outcomes and strategies to +mitigate them. We present early advances in the field by covering different +fairness definitions, ethical and legal considerations, and methods for fair +robot learning. With this work, we aim to pave the road for groundbreaking +developments in fair robot learning. + +
+
+
+
+
+ + ♻ ☆ Towards a Better Understanding of the Computer Vision Research Community + in Africa + + +
+ Computer vision is a broad field of study that encompasses different tasks +(e.g., object detection). Although computer vision is relevant to African +communities in various applications, computer vision research is +under-explored on the continent and constitutes only 0.06% of top-tier +publications in the last ten years. In this paper, our goal is to have a better +understanding of the computer vision research conducted in Africa and provide +pointers on whether there is equity in research or not. We do this through an +empirical analysis of the African computer vision publications that are Scopus +indexed, where we collect around 63,000 publications over the period 2012-2022. +We first study the opportunities available for African institutions to publish +in top-tier computer vision venues. We show that African publishing trends in +top-tier venues over the years do not exhibit consistent growth, unlike other +continents such as North America or Asia. Moreover, we study all computer +vision publications beyond top-tier venues in different African regions to find +that mainly Northern and Southern Africa are publishing in computer vision with +68.5% and 15.9% of publications, respectively. Nonetheless, we highlight that both +Eastern and Western Africa are exhibiting a promising increase with the last +two years closing the gap with Southern Africa. Additionally, we study the +collaboration patterns in these publications to find that most of these exhibit +international collaborations rather than African ones. We also show that most +of these publications include an African author that is a key contributor as +the first or last author. Finally, we present the most recurring keywords in +computer vision publications per African region. + +
+
+ comment: Published in EAAMO'23 under ACM License. This work is part of our + African computer vision grassroots research in Ro'ya - CV4Africa, + https://ro-ya-cv4africa.github.io/homepage/ +
+
+
+
+
+ + ♻ ☆ BEA: Revisiting anchor-based object detection DNN using Budding Ensemble + Architecture BMVC-2023 + + +
+ This paper introduces the Budding Ensemble Architecture (BEA), a novel +reduced ensemble architecture for anchor-based object detection models. Object +detection models are crucial in vision-based tasks, particularly in autonomous +systems. They should provide precise bounding box detections while also +calibrating their predicted confidence scores, leading to higher-quality +uncertainty estimates. However, current models may make erroneous decisions due +to false positives receiving high scores or true positives being discarded due +to low scores. BEA aims to address these issues. The proposed loss functions in +BEA improve the confidence score calibration and lower the uncertainty error, +which results in a better distinction of true and false positives and, +eventually, higher accuracy of the object detection models. Both Base-YOLOv3 +and SSD models were enhanced using the BEA method and its proposed loss +functions. The BEA on Base-YOLOv3 trained on the KITTI dataset results in a 6% +and 3.7% increase in mAP and AP50, respectively. Utilizing a well-balanced +uncertainty estimation threshold to discard samples in real-time even leads to +a 9.6% higher AP50 than its base model. This is attributed to a 40% increase in +the area under the AP50-based retention curve used to measure the quality of +calibration of confidence scores. Furthermore, BEA-YOLOV3 trained on KITTI +provides superior out-of-distribution detection on Citypersons, BDD100K, and +COCO datasets compared to the ensembles and vanilla models of YOLOv3 and +Gaussian-YOLOv3. + +
+
+ comment: 14 pages, 5 pages supplementary material. Accepted at BMVC-2023 +
+
+
+
+
+ + ♻ ☆ Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models + + +
+ Text-to-Image diffusion models have made tremendous progress over the past +two years, enabling the generation of highly realistic images based on +open-domain text descriptions. However, despite their success, text +descriptions often struggle to adequately convey detailed controls, even when +composed of long and complex texts. Moreover, recent studies have also shown +that these models face challenges in understanding such complex texts and +generating the corresponding images. Therefore, there is a growing need to +enable more control modes beyond text description. In this paper, we introduce +Uni-ControlNet, a unified framework that allows for the simultaneous +utilization of different local controls (e.g., edge maps, depth maps, +segmentation masks) and global controls (e.g., CLIP image embeddings) in a +flexible and composable manner within one single model. Unlike existing +methods, Uni-ControlNet only requires the fine-tuning of two additional +adapters upon frozen pre-trained text-to-image diffusion models, eliminating +the huge cost of training from scratch. Moreover, thanks to some dedicated +adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) +of adapters, regardless of the number of local or global controls used. This +not only reduces the fine-tuning costs and model size, making it more suitable +for real-world deployment, but also facilitates composability of different +conditions. Through both quantitative and qualitative comparisons, +Uni-ControlNet demonstrates its superiority over existing methods in terms of +controllability, generation quality and composability. Code is available at +https://github.com/ShihaoZhaoZSH/Uni-ControlNet. + +
+
+ comment: Camera Ready, Code is available at + https://github.com/ShihaoZhaoZSH/Uni-ControlNet +
+
+
+
+
+ + ♻ ☆ Asymmetric Image Retrieval with Cross Model Compatible Ensembles + + +
+ The asymmetrical retrieval setting is a well suited solution for resource +constrained applications such as face recognition and image retrieval. In this +setting, a large model is used for indexing the gallery while a lightweight +model is used for querying. The key principle in such systems is ensuring that +both models share the same embedding space. Most methods in this domain are +based on knowledge distillation. While useful, they suffer from several +drawbacks: they are upper-bounded by the performance of the single best model +found and cannot be extended to use an ensemble of models in a straightforward +manner. In this paper we present an approach that does not rely on knowledge +distillation, rather it utilizes embedding transformation models. This allows +the use of N independently trained and diverse gallery models (e.g., trained on +different datasets or having a different architecture) and a single query +model. As a result, we improve the overall accuracy beyond that of any single +model while maintaining a low computational budget for querying. Additionally, +we propose a gallery image rejection method that utilizes the diversity between +multiple transformed embeddings to estimate the uncertainty of gallery images. + +
+
+
+
+
+ + ♻ ☆ SAM-Med3D + + +
+ Although the Segment Anything Model (SAM) has demonstrated impressive +performance in 2D natural image segmentation, its application to 3D volumetric +medical images reveals significant shortcomings, namely suboptimal performance +and unstable prediction, necessitating an excessive number of prompt points to +attain the desired outcomes. These issues can hardly be addressed by +fine-tuning SAM on medical data because the original 2D structure of SAM +neglects 3D spatial information. In this paper, we introduce SAM-Med3D, the +most comprehensive study to modify SAM for 3D medical images. Our approach is +characterized by its comprehensiveness in two primary aspects: firstly, by +comprehensively reformulating SAM to a thorough 3D architecture trained on a +comprehensively processed large-scale volumetric medical dataset; and secondly, +by providing a comprehensive evaluation of its performance. Specifically, we +train SAM-Med3D with over 131K 3D masks and 247 categories. Our SAM-Med3D +excels at capturing 3D spatial information, exhibiting competitive performance +with significantly fewer prompt points than the top-performing fine-tuned SAM +in the medical domain. We then evaluate its capabilities across 15 datasets and +analyze it from multiple perspectives, including anatomical structures, +modalities, targets, and generalization abilities. Our approach, compared with +SAM, showcases pronouncedly enhanced efficiency and broad segmentation +capabilities for 3D volumetric medical images. Our code is released at +https://github.com/uni-medical/SAM-Med3D. + +
+
+
+
+
+ + ♻ ☆ Implicit Neural Feature Fusion Function for Multispectral and + Hyperspectral Image Fusion + + +
+ Multispectral and Hyperspectral Image Fusion (MHIF) is a practical task that +aims to fuse a high-resolution multispectral image (HR-MSI) and a +low-resolution hyperspectral image (LR-HSI) of the same scene to obtain a +high-resolution hyperspectral image (HR-HSI). Benefiting from powerful +inductive bias capability, CNN-based methods have achieved great success in the +MHIF task. However, they lack certain interpretability and require convolution +structures be stacked to enhance performance. Recently, Implicit Neural +Representation (INR) has achieved good performance and interpretability in 2D +tasks due to its ability to locally interpolate samples and utilize multimodal +content such as pixels and coordinates. Although INR-based approaches show +promise, they require extra construction of high-frequency information +(e.g., positional encoding). In this paper, inspired by previous work of +MHIF task, we realize that HR-MSI could serve as a high-frequency detail +auxiliary input, leading us to propose a novel INR-based hyperspectral fusion +function named Implicit Neural Feature Fusion Function (INF). As an elaborate +structure, it solves the MHIF task and addresses deficiencies in the INR-based +approaches. Specifically, our INF designs a Dual High-Frequency Fusion (DHFF) +structure that obtains high-frequency information twice from HR-MSI and LR-HSI, +then subtly fuses them with coordinate information. Moreover, the proposed INF +incorporates a parameter-free method named INR with cosine similarity (INR-CS) +that uses cosine similarity to generate local weights through feature vectors. +Based on INF, we construct an Implicit Neural Fusion Network (INFN) that +achieves state-of-the-art performance for MHIF tasks of two public datasets, +i.e., CAVE and Harvard. The code will soon be made available on GitHub. + +
+
+
+
+
+ + ♻ ☆ Parameter and Computation Efficient Transfer Learning for + Vision-Language Pre-trained Models + + +
+ With ever increasing parameters and computation, vision-language pre-trained +(VLP) models exhibit prohibitive expenditure in downstream task adaptation. +Recent endeavors mainly focus on parameter efficient transfer learning (PETL) +for VLP models by only updating a small number of parameters. However, +excessive computational overhead still plagues the application of VLPs. In this +paper, we aim at parameter and computation efficient transfer learning (PCETL) +for VLP models. In particular, PCETL not only needs to limit the number of +trainable parameters in VLP models, but also to reduce the computational +redundancy during inference, thus enabling a more efficient transfer. To +approach this target, we propose a novel dynamic architecture skipping (DAS) +approach towards effective PCETL. Instead of directly optimizing the intrinsic +architectures of VLP models, DAS first observes the significances of their +modules to downstream tasks via a reinforcement learning (RL) based process, +and then skips the redundant ones with lightweight networks, i.e., adapters, +according to the obtained rewards. In this case, the VLP model can well +maintain the scale of trainable parameters while speeding up its inference on +downstream tasks. To validate DAS, we apply it to two representative VLP +models, namely ViLT and METER, and conduct extensive experiments on a range of +VL tasks. The experimental results not only show the great advantages of DAS in +reducing computational complexity, e.g. -11.97% FLOPs of METER on VQA2.0, but +also confirm its competitiveness against existing PETL methods in terms of +parameter scale and performance. Our source code is given in our appendix. + +
+
+
+
+
+ + ♻ ☆ Perceptual Quality Assessment of Face Video Compression: A Benchmark and + An Effective Method + + +
+ Recent years have witnessed an exponential increase in the demand for face +video compression, and the success of artificial intelligence has expanded the +boundaries beyond traditional hybrid video coding. Generative coding approaches +have been identified as promising alternatives with reasonable perceptual +rate-distortion trade-offs, leveraging the statistical priors of face videos. +However, the great diversity of distortion types in spatial and temporal +domains, ranging from the traditional hybrid coding frameworks to generative +models, present grand challenges in compressed face video quality assessment +(VQA). In this paper, we introduce the large-scale Compressed Face Video +Quality Assessment (CFVQA) database, which is the first attempt to +systematically understand the perceptual quality and diversified compression +distortions in face videos. The database contains 3,240 compressed face video +clips in multiple compression levels, which are derived from 135 source videos +with diversified content using six representative video codecs, including two +traditional methods based on hybrid coding frameworks, two end-to-end methods, +and two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index +for face video compression was developed to measure the perceptual quality, +considering the distinct content characteristics and temporal priors of the +face videos. Experimental results exhibit its superior performance on the +proposed CFVQA dataset. The benchmark is now made publicly available at: +https://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment. + +
+
+
+
+
+ + ♻ ☆ OpenMask3D: Open-Vocabulary 3D Instance Segmentation NeurIPS 2023 + + +
+ We introduce the task of open-vocabulary 3D instance segmentation. Current +approaches for 3D instance segmentation can typically only recognize object +categories from a pre-defined closed set of classes that are annotated in the +training datasets. This results in important limitations for real-world +applications where one might need to perform tasks guided by novel, +open-vocabulary queries related to a wide variety of objects. Recently, +open-vocabulary 3D scene understanding methods have emerged to address this +problem by learning queryable features for each point in the scene. While such +a representation can be directly employed to perform semantic segmentation, +existing methods cannot separate multiple object instances. In this work, we +address this limitation, and propose OpenMask3D, which is a zero-shot approach +for open-vocabulary 3D instance segmentation. Guided by predicted +class-agnostic 3D instance masks, our model aggregates per-mask features via +multi-view fusion of CLIP-based image embeddings. Experiments and ablation +studies on ScanNet200 and Replica show that OpenMask3D outperforms other +open-vocabulary methods, especially on the long-tail distribution. Qualitative +experiments further showcase OpenMask3D's ability to segment object properties +based on free-form queries describing geometry, affordances, and materials. + +
+
+ comment: NeurIPS 2023. Project page: https://openmask3d.github.io/ +
+
+
+
+
+ + ♻ ☆ Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation NeurIPS 2023 + + +
+ Most existing methods for unsupervised domain adaptation (UDA) rely on a +shared network to extract domain-invariant features. However, when facing +multiple source domains, optimizing such a network involves updating the +parameters of the entire network, making it both computationally expensive and +challenging, particularly when coupled with min-max objectives. Inspired by +recent advances in prompt learning that adapts high-capacity models for +downstream tasks in a computationally economic way, we introduce Multi-Prompt +Alignment (MPA), a simple yet efficient framework for multi-source UDA. Given a +source and target domain pair, MPA first trains an individual prompt to +minimize the domain gap through a contrastive loss. Then, MPA denoises the +learned prompts through an auto-encoding process and aligns them by maximizing +the agreement of all the reconstructed prompts. Moreover, we show that the +resulting subspace acquired from the auto-encoding process can easily +generalize to a streamlined set of target domains, making our method more +efficient for practical usage. Extensive experiments show that MPA achieves +state-of-the-art results on three popular datasets with an impressive average +accuracy of 54.1% on DomainNet. + +
+
+ comment: NeurIPS 2023 camera-ready version +
+
+
+
+
+ + ♻ ☆ On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2023 + + +
+ Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented +performance in response generation, especially with visual inputs, enabling +more creative and adaptable interaction than large language models such as +ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since +adversaries may successfully evade the entire system by subtly manipulating the +most vulnerable modality (e.g., vision). To this end, we propose evaluating the +robustness of open-source large VLMs in the most realistic and high-risk +setting, where adversaries have only black-box system access and seek to +deceive the model into returning the targeted responses. In particular, we +first craft targeted adversarial examples against pretrained models such as +CLIP and BLIP, and then transfer these adversarial examples to other VLMs such +as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we +observe that black-box queries on these VLMs can further improve the +effectiveness of targeted evasion, resulting in a surprisingly high success +rate for generating targeted responses. Our findings provide a quantitative +understanding regarding the adversarial vulnerability of large VLMs and call +for a more thorough examination of their potential security flaws before +deployment in practice. Code is at https://github.com/yunqing-me/AttackVLM. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face + Recognition + + +
+ Face recognition systems are widely deployed in safety-critical applications, +including law enforcement, yet they exhibit bias across a range of +socio-demographic dimensions, such as gender and race. Conventional wisdom +dictates that model biases arise from biased training data. As a consequence, +previous works on bias mitigation largely focused on pre-processing the +training data, adding penalties to prevent bias from affecting the model during +training, or post-processing predictions to debias them, yet these approaches +have shown limited success on hard problems such as face recognition. In our +work, we discover that biases are actually inherent to neural network +architectures themselves. Following this reframing, we conduct the first neural +architecture search for fairness, jointly with a search for hyperparameters. +Our search outputs a suite of models which Pareto-dominate all other +high-performance architectures and existing bias mitigation methods in terms of +accuracy and fairness, often by large margins, on the two most widely used +datasets for face identification, CelebA and VGGFace2. Furthermore, these +models generalize to other datasets and sensitive attributes. We release our +code, models and raw data files at https://github.com/dooleys/FR-NAS. + +
+
+
+
+
+ + ♻ ☆ On Calibrating Diffusion Probabilistic Models NeurIPS 2023 + + +
+ Recently, diffusion probabilistic models (DPMs) have achieved promising +results in diverse generative tasks. A typical DPM framework includes a forward +process that gradually diffuses the data distribution and a reverse process +that recovers the data distribution from time-dependent data scores. In this +work, we observe that the stochastic reverse process of data scores is a +martingale, from which concentration bounds and the optional stopping theorem +for data scores can be derived. Then, we discover a simple way for calibrating +an arbitrary pretrained DPM, with which the score matching loss can be reduced +and the lower bounds of model likelihood can consequently be increased. We +provide general calibration guidelines under various model parametrizations. +Our calibration method is performed only once and the resulting models can be +used repeatedly for sampling. We conduct experiments on multiple datasets to +empirically validate our proposal. Our code is at +https://github.com/thudzj/Calibrated-DPMs. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and + In-depth Evaluation + + +
+ This paper presents a comprehensive evaluation of the Optical Character +Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large +Multimodal Model (LMM). We assess the model's performance across a range of OCR +tasks, including scene text recognition, handwritten text recognition, +handwritten mathematical expression recognition, table structure recognition, +and information extraction from visually rich documents. The evaluation reveals +that GPT-4V performs well in recognizing and understanding Latin contents, but +struggles with multilingual scenarios and complex tasks. Specifically, it +showed limitations when dealing with non-Latin languages and complex tasks such +as handwritten mathematical expression recognition, table structure +recognition, and end-to-end semantic entity recognition and pair extraction +from document images. Based on these observations, we affirm the necessity and +continued research value of specialized OCR models. In general, despite its +versatility in handling diverse OCR tasks, GPT-4V does not outperform existing +state-of-the-art OCR models. How to fully utilize pre-trained general-purpose +LMMs such as GPT-4V for OCR downstream tasks remains an open problem. The study +offers a critical reference for future research in OCR with LMMs. Evaluation +pipeline and results are available at +https://github.com/SCUT-DLVCLab/GPT-4V_OCR. + +
+
+
+
+
+ + ♻ ☆ Text Promptable Surgical Instrument Segmentation with Vision-Language + Models NeurIPS 2023 + + +
+ In this paper, we propose a novel text promptable surgical instrument +segmentation approach to overcome challenges associated with diversity and +differentiation of surgical instruments in minimally invasive surgeries. We +redefine the task as text promptable, thereby enabling a more nuanced +comprehension of surgical instruments and adaptability to new instrument types. +Inspired by recent advancements in vision-language models, we leverage +pretrained image and text encoders as our model backbone and design a text +promptable mask decoder consisting of attention- and convolution-based +prompting schemes for surgical instrument segmentation prediction. Our model +leverages multiple text prompts for each surgical instrument through a new +mixture of prompts mechanism, resulting in enhanced segmentation performance. +Additionally, we introduce a hard instrument area reinforcement module to +improve image feature comprehension and segmentation precision. Extensive +experiments on several surgical instrument segmentation datasets demonstrate +our model's superior performance and promising generalization capability. To +our knowledge, this is the first implementation of a promptable approach to +surgical instrument segmentation, offering significant potential for practical +application in the field of robotic-assisted surgery. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ CGOF++: Controllable 3D Face Synthesis with Conditional Generative + Occupancy Fields NeurIPS'22 + + +
+ Capitalizing on the recent advances in image generation models, existing +controllable face image synthesis methods are able to generate high-fidelity +images with some levels of controllability, e.g., controlling the shapes, +expressions, textures, and poses of the generated face images. However, +previous methods focus on controllable 2D image generative models, which are +prone to producing inconsistent face images under large expression and pose +changes. In this paper, we propose a new NeRF-based conditional 3D face +synthesis framework, which enables 3D controllability over the generated face +images by imposing explicit 3D conditions from 3D face priors. At its core is a +conditional Generative Occupancy Field (cGOF++) that effectively enforces the +shape of the generated face to conform to a given 3D Morphable Model (3DMM) +mesh, built on top of EG3D [1], a recent tri-plane-based generative model. To +achieve accurate control over fine-grained 3D face shapes of the synthesized +images, we additionally incorporate a 3D landmark loss as well as a volume +warping loss into our synthesis framework. Experiments validate the +effectiveness of the proposed method, which is able to generate high-fidelity +face images and shows more precise 3D controllability than state-of-the-art +2D-based controllable face synthesis methods. + +
+
+ comment: Accepted to IEEE Transactions on Pattern Analysis and Machine + Intelligence (TPAMI). This article is an extension of the NeurIPS'22 paper + arXiv:2206.08361 +
+
+
+
+
+ + ♻ ☆ Dual-Decoder Consistency via Pseudo-Labels Guided Data Augmentation for + Semi-Supervised Medical Image Segmentation + + +
+ Though supervised learning has achieved impressive success, the acquisition of +indispensable large-scale labeled datasets is often impractical in biomedical +imaging, partially due to expensive costs and lengthy annotations done by +experienced radiologists. Semi-supervised learning has been shown to be an +effective way to address this limitation by leveraging useful information from +unlabeled datasets. In this paper, we present a new semi-supervised learning +method referred to as Dual-Decoder Consistency via Pseudo-Labels Guided Data +Augmentation (DCPA) for medical image segmentation. We devise a consistency +regularization to improve the semi-supervised learning. Specifically, to +promote consistent representations during the training process, we use +different decoders for student and teacher networks while maintaining the same +encoder. Moreover, to learn from unlabeled data, we create pseudo-labels +generated by the teacher networks and augment the training data with the +pseudo-labels. The two techniques contribute to the improved performance of the +proposed method. We evaluate the performance of the proposed method on three +representative medical image segmentation datasets. Extensive comparisons to +the state-of-the-art medical image segmentation methods were carried out under +typical scenarios with 10% and 20% labeled data. Experimental outcomes +demonstrate that our method consistently outperforms state-of-the-art +semi-supervised medical image segmentation methods over the three +semi-supervised settings. Furthermore, to explore the performance of the proposed +method under extreme conditions, we conduct experiments with only 5% labeled +data. The results further verify the superior performance of the proposed +method. Source code is publicly available at https://github.com/BinYCn/DCPA.git. + +
+
+
+
+
+ + ♻ ☆ Toward a Deeper Understanding: RetNet Viewed through Convolution + + +
+ The success of Vision Transformer (ViT) has been widely reported on a wide +range of image recognition tasks. ViT can learn global dependencies superior to +CNN, yet CNN's inherent locality can substitute for expensive training +resources. Recently, the outstanding performance of RetNet in the field of +language modeling has garnered attention, surpassing that of the Transformer +with explicit local modeling, shifting researchers' focus towards Transformers +in the CV field. This paper investigates the effectiveness of RetNet from a CNN +perspective and presents a variant of RetNet tailored to the visual domain. +Similar to RetNet, we improve ViT's local modeling by applying a weight mask on +the original self-attention matrix. A straightforward way to locally adapt the +self-attention matrix can be realized by an element-wise learnable weight mask +(ELM), for which our preliminary experiments show promising results. However, the +simple element-wise learnable weight mask not only induces a non-trivial +additional parameter overhead but also increases the optimization complexity. +To this end, this work proposes a novel Gaussian mixture mask (GMM) in which +one mask only has two learnable parameters and it can be conveniently used in +any ViT variants whose attention mechanism allows the use of masks. +Experimental results on multiple small datasets demonstrate the +effectiveness of our proposed Gaussian mask for boosting ViTs for free (almost +zero additional parameter or computation cost). Our code is publicly +available at https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention. + +
+
+
+
+
+ + ♻ ☆ SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions NeurIPS 2023 + + +
+ The remarkable capabilities of pretrained image diffusion models have been +utilized not only for generating fixed-size images but also for creating +panoramas. However, naive stitching of multiple images often results in visible +seams. Recent techniques have attempted to address this issue by performing +joint diffusions in multiple windows and averaging latent features in +overlapping regions. However, these approaches, which focus on seamless montage +generation, often yield incoherent outputs by blending different scenes within +a single image. To overcome this limitation, we propose SyncDiffusion, a +plug-and-play module that synchronizes multiple diffusions through gradient +descent from a perceptual similarity loss. Specifically, we compute the +gradient of the perceptual loss using the predicted denoised images at each +denoising step, providing meaningful guidance for achieving coherent montages. +Our experimental results demonstrate that our method produces significantly +more coherent outputs compared to previous methods (66.35% vs. 33.65% in our +user study) while still maintaining fidelity (as assessed by GIQA) and +compatibility with the input prompt (as measured by CLIP score). We further +demonstrate the versatility of our method across three plug-and-play +applications: layout-guided image generation, conditional image generation and +360-degree panorama generation. Our project page is at +https://syncdiffusion.github.io. + +
+
+ comment: Accepted to NeurIPS 2023. Project page: + https://syncdiffusion.github.io +
+
+
+
+
+ + ♻ ☆ Interactive Visual Reasoning under Uncertainty NeurIPS 2023 + + +
+ One of the fundamental cognitive abilities of humans is to quickly resolve +uncertainty by generating hypotheses and testing them via active trials. +Encountering a novel phenomenon accompanied by ambiguous cause-effect +relationships, humans make hypotheses against data, conduct inferences from +observation, test their theory via experimentation, and correct the proposition +if inconsistency arises. These iterative processes persist until the underlying +mechanism becomes clear. In this work, we devise the IVRE (pronounced as +"ivory") environment for evaluating artificial agents' reasoning ability under +uncertainty. IVRE is an interactive environment featuring rich scenarios +centered around Blicket detection. Agents in IVRE are placed into environments +with various ambiguous action-effect pairs and asked to determine each object's +role. They are encouraged to propose effective and efficient experiments to +validate their hypotheses based on observations and actively gather new +information. The game ends when all uncertainties are resolved or the maximum +number of trials is consumed. By evaluating modern artificial agents in IVRE, +we notice a clear failure of today's learning methods compared to humans. Such +inefficacy in interactive reasoning ability under uncertainty calls for future +research in building human-like intelligence. + +
+
+ comment: Accepted at NeurIPS 2023 (Datasets and Benchmarks) +
+
+
+
+
+ + ♻ ☆ Unlocking Deterministic Robustness Certification on ImageNet + + +
+ Despite the promise of Lipschitz-based methods for provably-robust deep +learning with deterministic guarantees, current state-of-the-art results are +limited to feed-forward Convolutional Networks (ConvNets) on low-dimensional +data, such as CIFAR-10. This paper investigates strategies for expanding +certifiably robust training to larger, deeper models. A key challenge in +certifying deep networks is efficient calculation of the Lipschitz bound for +residual blocks found in ResNet and ViT architectures. We show that fast ways +of bounding the Lipschitz constant for conventional ResNets are loose, and show +how to address this by designing a new residual block, leading to the +Linear ResNet (LiResNet) architecture. We then introduce Efficient +Margin MAximization (EMMA), a loss function that stabilizes robust training by +simultaneously penalizing worst-case adversarial examples from all +classes. Together, these contributions yield new state-of-the-art robust +accuracy on CIFAR-10/100 and Tiny-ImageNet under $\ell_2$ perturbations. +Moreover, for the first time, we are able to scale up fast deterministic +robustness guarantees to ImageNet, demonstrating that this approach to robust +learning can be applied to real-world applications. + We release our code on GitHub: https://github.com/klasleino/gloro. + +
+
+
+
+
+ + ♻ ☆ Distractor-aware Event-based Tracking + + +
+ Event cameras, or dynamic vision sensors, have recently achieved success from +fundamental vision tasks to high-level vision research. Due to its ability to +asynchronously capture light intensity changes, an event camera has an inherent +advantage in capturing moving objects in challenging scenarios including objects +under low light, high dynamic range, or fast moving objects. Thus, event cameras +are a natural fit for visual object tracking. However, the current event-based +trackers derived from RGB trackers simply modify the input images to event +frames and still follow the conventional tracking pipeline that mainly focuses on +object texture for target distinction. As a result, the trackers may not be +robust when dealing with challenging scenarios such as moving cameras and cluttered +foreground. In this paper, we propose a distractor-aware event-based tracker +that introduces transformer modules into Siamese network architecture (named +DANet). Specifically, our model is mainly composed of a motion-aware network +and a target-aware network, which simultaneously exploit both motion cues and +object contours from event data, so as to discover moving objects and identify +the target object by removing dynamic distractors. Our DANet can be trained in +an end-to-end manner without any post-processing and can run at over 80 FPS on +a single V100. We conduct comprehensive experiments on two large event tracking +datasets to validate the proposed model. We demonstrate that our tracker has +superior performance against the state-of-the-art trackers in terms of both +accuracy and efficiency. + +
+
+
+
+
+ + ♻ ☆ Video-Mined Task Graphs for Keystep Recognition in Instructional Videos NeurIPS 2023 + + +
+ Procedural activity understanding requires perceiving human actions in terms +of a broader task, where multiple keysteps are performed in sequence across a +long video to reach a final goal state -- such as the steps of a recipe or a +DIY fix-it task. Prior work largely treats keystep recognition in isolation of +this broader structure, or else rigidly confines keysteps to align with a +predefined sequential script. We propose discovering a task graph automatically +from how-to videos to represent probabilistically how people tend to execute +keysteps, and then leverage this graph to regularize keystep recognition in +novel videos. On multiple datasets of real-world instructional videos, we show +the impact: more reliable zero-shot keystep localization and improved video +representation learning, exceeding the state of the art. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Trust, but Verify: Robust Image Segmentation using Deep Learning + + +
+ We describe a method for verifying the output of a deep neural network for +medical image segmentation that is robust to several classes of random as well +as worst-case perturbations i.e. adversarial attacks. This method is based on a +general approach recently developed by the authors called "Trust, but Verify" +wherein an auxiliary verification network produces predictions about certain +masked features in the input image using the segmentation as an input. A +well-designed auxiliary network will produce high-quality predictions when the +input segmentations are accurate, but will produce low-quality predictions when +the segmentations are incorrect. Checking the predictions of such a network +with the original image allows us to detect bad segmentations. However, to +ensure the verification method is truly robust, we need a method for checking +the quality of the predictions that does not itself rely on a black-box neural +network. Indeed, we show that previous methods for segmentation evaluation that +do use deep neural regression networks are vulnerable to false negatives i.e. +can inaccurately label bad segmentations as good. We describe the design of a +verification network that avoids such vulnerability and present results to +demonstrate its robustness compared to previous methods. + +
+
+ comment: 5 Pages, 8 Figures, conference +
+
+
+
+
+ + ♻ ☆ SAMCLR: Contrastive pre-training on complex scenes using SAM for view + sampling NeurIPS 2023 + + +
+ In Computer Vision, self-supervised contrastive learning enforces similar +representations between different views of the same image. The pre-training is +most often performed on image classification datasets, like ImageNet, where +images mainly contain a single class of objects. However, when dealing with +complex scenes with multiple items, it becomes very unlikely for several views +of the same image to represent the same object category. In this setting, we +propose SAMCLR, an add-on to SimCLR which uses SAM to segment the image into +semantic regions, then samples the two views from the same region. Preliminary +results show empirically that when pre-training on Cityscapes and ADE20K, then +evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs +at least on par with, and most often significantly outperforms, not only SimCLR, +but also DINO and MoCo. + +
+
+ comment: Accepted at NeurIPS 2023 Workshop on SSL +
+
+
+
+
+ + ♻ ☆ Geodesic Multi-Modal Mixup for Robust Fine-Tuning NeurIPS 2023 + + +
+ Pre-trained multi-modal models, such as CLIP, provide transferable embeddings +and show promising results in diverse applications. However, the analysis of +learned multi-modal embeddings is relatively unexplored, and the embedding +transferability can be improved. In this work, we observe that CLIP holds +separated embedding subspaces for two different modalities, and then we +investigate it through the lens of uniformity-alignment to measure the quality +of learned representation. Both theoretically and empirically, we show that +CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack +of alignment and uniformity might restrict the transferability and robustness +of embeddings. To this end, we devise a new fine-tuning method for robust +representation equipping better alignment and uniformity. First, we propose a +Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to +generate hard negative samples on the hypersphere. Then, we fine-tune the model +on hard negatives as well as original negatives and positives with contrastive +loss. Based on the theoretical analysis about hardness guarantee and limiting +behavior, we justify the use of our method. Extensive experiments on retrieval, +calibration, few- or zero-shot classification (under distribution shift), +embedding arithmetic, and image captioning further show that our method +provides transferable representations, enabling robust model adaptation on +diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup + +
+
+ comment: To appear at NeurIPS 2023 +
+
+
+
+
+
+
+
+ + Information Retrieval 7 + +
+
+
+ + ☆ Poisoning Retrieval Corpora by Injecting Adversarial Passages EMNLP 2023 + + +
+ Dense retrievers have achieved state-of-the-art performance in various +information retrieval tasks, but to what extent can they be safely deployed in +real-world applications? In this work, we propose a novel attack for dense +retrieval systems in which a malicious user generates a small number of +adversarial passages by perturbing discrete tokens to maximize similarity with +a provided set of training queries. When these adversarial passages are +inserted into a large retrieval corpus, we show that this attack is highly +effective in fooling these systems into retrieving them for queries that were not +seen by the attacker. More surprisingly, these adversarial passages can +directly generalize to out-of-domain queries and corpora with a high attack +success rate -- for instance, we find that 50 generated passages optimized on +Natural Questions can mislead >94% of questions posed in financial documents or +online forums. We also benchmark and compare a range of state-of-the-art dense +retrievers, both unsupervised and supervised. Although different systems +exhibit varying levels of vulnerability, we show they can all be successfully +attacked by injecting up to 500 passages, a small fraction compared to a +retrieval corpus of millions of passages. + +
+
+ comment: EMNLP 2023. Our code is available at + https://github.com/princeton-nlp/corpus-poisoning +
+
+
+
+
+ + ☆ MILL: Mutual Verification with Large Language Models for Zero-Shot Query + Expansion + + +
+ Query expansion is a commonly-used technique in many search systems to better +represent users' information needs with additional query terms. Existing +studies for this task usually propose to expand a query with retrieved or +generated contextual documents. However, both types of methods have clear +limitations. For retrieval-based methods, the documents retrieved with the +original query might not be accurate enough to reveal the search intent, +especially when the query is brief or ambiguous. For generation-based methods, +existing models can hardly be trained or aligned on a particular corpus, due to +the lack of corpus-specific labeled data. In this paper, we propose a novel +Large Language Model (LLM) based mutual verification framework for query +expansion, which alleviates the aforementioned limitations. Specifically, we +first design a query-query-document generation pipeline, which can effectively +leverage the contextual knowledge encoded in LLMs to generate sub-queries and +corresponding documents from multiple perspectives. Next, we employ a mutual +verification method for both generated and retrieved contextual documents, +where 1) retrieved documents are filtered with the external contextual +knowledge in generated documents, and 2) generated documents are filtered with +the corpus-specific knowledge in retrieved documents. Overall, the proposed +method allows retrieved and generated documents to complement each other to +finalize a better query expansion. We conduct extensive experiments on three +information retrieval datasets, i.e., TREC-DL-2020, TREC-COVID, and MSMARCO. +The results demonstrate that our method outperforms other baselines +significantly. + +
+
+
+
+
+ + ☆ A Multimodal Ecological Civilization Pattern Recommendation Method Based + on Large Language Models and Knowledge Graph + + +
+ The Ecological Civilization Pattern Recommendation System (ECPRS) aims to +recommend suitable ecological civilization patterns for target regions, +promoting sustainable development and reducing regional disparities. However, +the current representative recommendation methods are not suitable for +recommending ecological civilization patterns in a geographical context. There +are two reasons for this. Firstly, regions have spatial heterogeneity, and the +ECPRS needs to consider factors like climate, topography, vegetation, etc., to +recommend civilization patterns adapted to specific ecological environments, +ensuring the feasibility and practicality of the recommendations. Secondly, the +abstract features of the ecological civilization patterns in the real world +have not been fully utilized, resulting in poor richness in their embedding +representations and, consequently, lower performance of the recommendation +system. Considering these limitations, we propose the ECPR-MML method. +Initially, based on the novel method UGPIG, we construct a knowledge graph to +extract regional representations incorporating spatial heterogeneity features. +Following that, inspired by the significant progress made by Large Language +Models (LLMs) in the field of Natural Language Processing (NLP), we employ +LLMs to generate multimodal features for ecological civilization patterns +in the form of text and images. We extract and integrate these multimodal +features to obtain semantically rich representations of ecological +civilization. Through extensive experiments, we validate the performance of our +ECPR-MML model. Our results show that F1@5 is 2.11% higher compared to +state-of-the-art models, 2.02% higher than NGCF, and 1.16% higher than UGPIG. +Furthermore, multimodal data can indeed enhance recommendation performance. +However, the data generated by LLMs is not as effective as real data to a +certain extent. + +
+
+
+
+
+ + ☆ The diminishing state of shared reality on US television news + + +
+ The potential for a large, diverse population to coexist peacefully is +thought to depend on the existence of a "shared reality": a public sphere in +which participants are exposed to similar facts about similar topics. A +generation ago, broadcast television news was widely considered to serve this +function; however, since the rise of cable news in the 1990s, critics and +scholars have worried that the corresponding fragmentation and segregation of +audiences along partisan lines has caused this shared reality to be lost. Here +we examine this concern using a unique combination of data sets tracking the +production (since 2012) and consumption (since 2016) of television news content +on the three largest cable and broadcast networks respectively. With regard to +production, we find strong evidence for the "loss of shared reality +hypothesis": while broadcast continues to cover similar topics with similar +language, cable news networks have become increasingly distinct, both from +broadcast news and each other, diverging both in terms of content and language. +With regard to consumption, we find more mixed evidence: while broadcast news +has indeed declined in popularity, it remains the dominant source of news for +roughly 50% more Americans than does cable; moreover, its decline, while +somewhat attributable to cable, appears driven more by a shift away from news +consumption altogether than a growth in cable consumption. We conclude that +shared reality on US television news is indeed diminishing, but is more robust +than previously thought and is declining for somewhat different reasons. + +
+
+
+
+
+ + ♻ ☆ Music Augmentation and Denoising For Peak-Based Audio Fingerprinting + + +
+ Audio fingerprinting is a well-established solution for song identification +from short recording excerpts. Popular methods rely on the extraction of sparse +representations, generally spectral peaks, and have proven to be accurate, +fast, and scalable to large collections. However, real-world applications of +audio identification often happen in noisy environments, which can cause these +systems to fail. In this work, we tackle this problem by introducing and +releasing a new audio augmentation pipeline that adds noise to music snippets +in a realistic way, by stochastically mimicking real-world scenarios. We then +propose and release a deep learning model that removes noisy components from +spectrograms in order to improve peak-based fingerprinting systems' accuracy. +We show that the addition of our model improves the identification performance +of commonly used audio fingerprinting systems, even under noisy conditions. + +
+
+
+
+
+ + ♻ ☆ Keyword Augmented Retrieval: Novel framework for Information Retrieval + integrated with speech interface + + +
+ Retrieving answers in a quick and low cost manner without hallucinations from +a combination of structured and unstructured data using Language models is a +major hurdle. This is what prevents employment of Language models in knowledge +retrieval automation. This becomes accentuated when one wants to integrate a +speech interface on top of a text based knowledge retrieval system. Besides, +for commercial search and chat-bot applications, complete reliance on +commercial large language models (LLMs) like GPT 3.5 etc. can be very costly. +In the present study, the authors have addressed the aforementioned problem by +first developing a keyword based search framework which augments discovery of +the context from the document to be provided to the LLM. The keywords in turn +are generated by a relatively smaller LLM and cached for comparison with +keywords generated by the same smaller LLM against the query raised. This +significantly reduces time and cost to find the context within documents. Once +the context is set, a larger LLM uses that to provide answers based on a prompt +tailored for Q\&A. This research work demonstrates that use of keywords in +context identification reduces the overall inference time and cost of +information retrieval. Given this reduction in inference time and cost with the +keyword augmented retrieval framework, a speech based interface for user input +and response readout was integrated. This allowed a seamless interaction with +the language model. + +
+
+
+
+
+ + ♻ ☆ Geodesic Multi-Modal Mixup for Robust Fine-Tuning NeurIPS 2023 + + +
+ Pre-trained multi-modal models, such as CLIP, provide transferable embeddings +and show promising results in diverse applications. However, the analysis of +learned multi-modal embeddings is relatively unexplored, and the embedding +transferability can be improved. In this work, we observe that CLIP holds +separated embedding subspaces for two different modalities, and then we +investigate it through the lens of uniformity-alignment to measure the quality +of learned representation. Both theoretically and empirically, we show that +CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack +of alignment and uniformity might restrict the transferability and robustness +of embeddings. To this end, we devise a new fine-tuning method for robust +representation equipping better alignment and uniformity. First, we propose a +Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to +generate hard negative samples on the hypersphere. Then, we fine-tune the model +on hard negatives as well as original negatives and positives with contrastive +loss. Based on the theoretical analysis about hardness guarantee and limiting +behavior, we justify the use of our method. Extensive experiments on retrieval, +calibration, few- or zero-shot classification (under distribution shift), +embedding arithmetic, and image captioning further show that our method +provides transferable representations, enabling robust model adaptation on +diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup + +
+
+ comment: To appear at NeurIPS 2023 +
+
+
+
+
+
+
+
+ + Machine Learning 137 + +
+
+
+ + ☆ Improved Motor Imagery Classification Using Adaptive Spatial Filters + Based on Particle Swarm Optimization Algorithm + + +
+ As a typical self-paced brain-computer interface (BCI) system, the motor +imagery (MI) BCI has been widely applied in fields such as robot control, +stroke rehabilitation, and assistance for patients with stroke or spinal cord +injury. Many studies have focused on the traditional spatial filters obtained +through the common spatial pattern (CSP) method. However, the CSP method can +only obtain fixed spatial filters for specific input signals. Besides, the CSP +method only focuses on the variance difference of two types of +electroencephalogram (EEG) signals, so the decoding ability of EEG signals is +limited. To obtain more effective spatial filters for better extraction of +spatial features that can improve classification of MI-EEG, this paper proposes +an adaptive spatial filter solving method based on the particle swarm optimization +(PSO) algorithm. A training and testing framework based on filter bank and +spatial filters (FBCSP-ASP) is designed for MI EEG signal classification. +Comparative experiments are conducted on two public datasets (2a and 2b) from +BCI competition IV, which show the outstanding average recognition accuracy of +FBCSP-ASP. The proposed method has achieved significant performance improvement +on MI-BCI. The classification accuracy of the proposed method has reached +74.61% and 81.19% on datasets 2a and 2b, respectively. Compared with the +baseline algorithm (FBCSP), the proposed algorithm improves accuracy by 11.44% and 7.11% on +the two datasets, respectively. Furthermore, the analysis based on mutual +information, t-SNE and Shapley values further proves that ASP features have +excellent decoding ability for MI-EEG signals, and explains the improvement of +classification performance by the introduction of ASP features. + +
+
+ comment: 25 pages, 8 figures +
+
+
+
+
+ + ☆ Enhancing Motor Imagery Decoding in Brain Computer Interfaces using + Riemann Tangent Space Mapping and Cross Frequency Coupling + + +
+ Objective: Motor Imagery (MI) serves as a crucial experimental paradigm +within the realm of Brain Computer Interfaces (BCIs), aiming to decode motor +intentions from electroencephalogram (EEG) signals. Method: Drawing inspiration +from Riemannian geometry and Cross-Frequency Coupling (CFC), this paper +introduces a novel approach termed Riemann Tangent Space Mapping using +Dichotomous Filter Bank with Convolutional Neural Network (DFBRTS) to enhance +the representation quality and decoding capability pertaining to MI features. +DFBRTS first filters EEG signals +through a Dichotomous Filter Bank, structured in the fashion of a complete +binary tree. Subsequently, it employs Riemann Tangent Space Mapping to extract +salient EEG signal features within each sub-band. Finally, a lightweight +convolutional neural network is employed for further feature extraction and +classification, operating under the joint supervision of cross-entropy and +center loss. To validate the efficacy, extensive experiments were conducted +using DFBRTS on two well-established benchmark datasets: the BCI competition IV +2a (BCIC-IV-2a) dataset and the OpenBMI dataset. The performance of DFBRTS was +benchmarked against several state-of-the-art MI decoding methods, alongside +other Riemannian geometry-based MI decoding approaches. Results: DFBRTS +significantly outperforms other MI decoding algorithms on both datasets, +achieving a remarkable classification accuracy of 78.16% for four-class and +71.58% for two-class hold-out classification, as compared to the existing +benchmarks. + +
+
+ comment: 22 pages, 7 figures +
+
+
+
+
+ + ☆ Conformal Normalization in Recurrent Neural Network of Grid Cells + + +
+ Grid cells in the entorhinal cortex of the mammalian brain exhibit striking +hexagon firing patterns in their response maps as the animal (e.g., a rat) +navigates in a 2D open environment. The responses of the population of grid +cells collectively form a vector in a high-dimensional neural activity space, +and this vector represents the self-position of the agent in the 2D physical +space. As the agent moves, the vector is transformed by a recurrent neural +network that takes the velocity of the agent as input. In this paper, we +propose a simple and general conformal normalization of the input velocity for +the recurrent neural network, so that the local displacement of the position +vector in the high-dimensional neural space is proportional to the local +displacement of the agent in the 2D physical space, regardless of the direction +of the input velocity. Our numerical experiments on the minimally simple linear +and non-linear recurrent networks show that conformal normalization leads to +the emergence of the hexagon grid patterns. Furthermore, we derive a new +theoretical understanding that connects conformal normalization to the +emergence of hexagon grid patterns in navigation tasks. + +
+
+
+
+
+ + ☆ Robustifying Language Models with Test-Time Adaptation ICLR + + +
+ Large-scale language models achieved state-of-the-art performance over a +number of language tasks. However, they fail on adversarial language examples, +which are sentences optimized to fool the language models but with similar +semantic meanings for humans. While prior work focuses on making the language +model robust at training time, retraining for robustness is often unrealistic +for large-scale foundation models. Instead, we propose to make the language +models robust at test time. By dynamically adapting the input sentence with +predictions from masked words, we show that we can reverse many language +adversarial attacks. Since our approach does not require any training, it works +for novel tasks at test time and can adapt to novel adversarial corruptions. +Visualizations and empirical results on two popular sentence classification +datasets demonstrate that our method can repair adversarial language attacks +over 65% o + +
+
+ comment: 8 Pages 2 Figures Submitted to ICLR Workshop +
+
+
+
+
+ + ☆ Predicting recovery following stroke: deep learning, multimodal data and + feature selection using explainable AI + + +
+ Machine learning offers great potential for automated prediction of +post-stroke symptoms and their response to rehabilitation. Major challenges for +this endeavour include the very high dimensionality of neuroimaging data, the +relatively small size of the datasets available for learning, and how to +effectively combine neuroimaging and tabular data (e.g. demographic information +and clinical characteristics). This paper evaluates several solutions based on +two strategies. The first is to use 2D images that summarise MRI scans. The +second is to select key features that improve classification accuracy. +Additionally, we introduce the novel approach of training a convolutional +neural network (CNN) on images that combine regions-of-interest extracted from +MRIs, with symbolic representations of tabular data. We evaluate a series of +CNN architectures (both 2D and 3D) that are trained on different +representations of MRI and tabular data, to predict whether a composite measure +of post-stroke spoken picture description ability is in the aphasic or +non-aphasic range. MRI and tabular data were acquired from 758 English speaking +stroke survivors who participated in the PLORAS study. The classification +accuracy for a baseline logistic regression was 0.678 for lesion size alone, +rising to 0.757 and 0.813 when initial symptom severity and recovery time were +successively added. The highest classification accuracy, 0.854, was observed when +8 regions-of-interest were extracted from each MRI scan and combined with lesion +size, initial severity and recovery time in a 2D Residual Neural Network. Our +findings demonstrate how imaging and tabular data can be combined for high +post-stroke classification accuracy, even when the dataset is small in machine +learning terms. We conclude by proposing how the current models could be +improved to achieve even higher levels of accuracy using images from hospital +scanners. + +
+
+
+
+
+ + ☆ Rare Event Probability Learning by Normalizing Flows + + +
+ A rare event is defined by a low probability of occurrence. Accurate +estimation of such small probabilities is of utmost importance across diverse +domains. Conventional Monte Carlo methods are inefficient, demanding an +exorbitant number of samples to achieve reliable estimates. Inspired by the +exact sampling capabilities of normalizing flows, we revisit this challenge and +propose normalizing flow assisted importance sampling, termed NOFIS. NOFIS +first learns a sequence of proposal distributions associated with predefined +nested subset events by minimizing KL divergence losses. Next, it estimates the +rare event probability by utilizing importance sampling in conjunction with the +last proposal. The efficacy of our NOFIS method is substantiated through +comprehensive qualitative visualizations, affirming the optimality of the +learned proposal distribution, as well as a series of quantitative experiments +encompassing $10$ distinct test cases, which highlight NOFIS's superiority over +baseline approaches. + +
+
+ comment: 16 pages, 5 figures, 2 tables +
+
+
+
+
+ + ☆ The Power of Explainability in Forecast-Informed Deep Learning Models + for Flood Mitigation + + +
+ Floods can cause horrific harm to life and property. However, they can be +mitigated or even avoided by the effective use of hydraulic structures such as +dams, gates, and pumps. By pre-releasing water via these structures in advance +of extreme weather events, water levels are sufficiently lowered to prevent +floods. In this work, we propose FIDLAR, a Forecast Informed Deep Learning +Architecture, achieving flood management in watersheds with hydraulic +structures in an optimal manner by balancing out flood mitigation and +unnecessary wastage of water via pre-releases. We perform experiments with +FIDLAR using data from the South Florida Water Management District, which +manages a coastal area that is highly prone to frequent storms and floods. +Results show that FIDLAR performs better than the current state-of-the-art with +several orders of magnitude speedup and with provably better pre-release +schedules. The dramatic speedups make it possible for FIDLAR to be used for +real-time flood management. The main contribution of this paper is the +effective use of tools for model explainability, allowing us to understand the +contribution of the various environmental factors towards its decisions. + +
+
+
+
+
+ + ☆ RAIFLE: Reconstruction Attacks on Interaction-based Federated Learning + with Active Data Manipulation + + +
+ Federated learning (FL) has recently emerged as a privacy-preserving approach +for machine learning in domains that rely on user interactions, particularly +recommender systems (RS) and online learning to rank (OLTR). While there has +been substantial research on the privacy of traditional FL, little attention +has been paid to studying the privacy properties of these interaction-based FL +(IFL) systems. In this work, we show that IFL can introduce unique challenges +concerning user privacy, particularly when the central server has knowledge and +control over the items that users interact with. Specifically, we demonstrate +the threat of reconstructing user interactions by presenting RAIFLE, a general +optimization-based reconstruction attack framework customized for IFL. RAIFLE +employs Active Data Manipulation (ADM), a novel attack technique unique to IFL, +where the server actively manipulates the training features of the items to +induce adversarial behaviors in the local FL updates. We show that RAIFLE is +more impactful than existing FL privacy attacks in the IFL context, and +describe how it can undermine privacy defenses like secure aggregation and +private information retrieval. Based on our findings, we propose and discuss +countermeasure guidelines to mitigate our attack in the context of federated +RS/OLTR specifically and IFL more broadly. + +
+
+
+
+
+ + ☆ Transfer Learning in Transformer-Based Demand Forecasting For Home + Energy Management System + + +
+ Increasingly, homeowners opt for photovoltaic (PV) systems and/or battery +storage to minimize their energy bills and maximize renewable energy usage. +This has spurred the development of advanced control algorithms that maximally +achieve those goals. However, a common challenge faced while developing such +controllers is the unavailability of accurate forecasts of household power +consumption, especially for shorter time resolutions (15 minutes) and in a +data-efficient manner. In this paper, we analyze how transfer learning can help +by exploiting data from multiple households to improve a single house's load +forecasting. Specifically, we train an advanced forecasting model (a temporal +fusion transformer) using data from multiple different households, and then +finetune this global model on a new household with limited data (i.e. only a +few days). The obtained models are used for forecasting power consumption of +the household for the next 24 hours (day-ahead) at a time resolution of 15 +minutes, with the intention of using these forecasts in advanced controllers +such as Model Predictive Control. We show the benefit of this transfer learning +setup versus solely using the individual new household's data, both in terms of +(i) forecasting accuracy ($\sim$15% MAE reduction) and (ii) control +performance ($\sim$2% energy cost reduction), using real-world household data. + +
+
+ comment: 7 pages, 2 figures, workshop article at BALANCES, BuildSys'23 +
+
+
+
+
+ + ☆ Real-World Implementation of Reinforcement Learning Based Energy + Coordination for a Cluster of Households + + +
+ Given its substantial contribution of 40% to global power consumption, the +built environment has received increasing attention to serve as a source of +flexibility to assist the modern power grid. In that respect, previous research +mainly focused on energy management of individual buildings. In contrast, in +this paper, we focus on aggregated control of a set of residential buildings, +to provide grid supporting services, that eventually should include ancillary +services. In particular, we present a real-life pilot study that examines the +effectiveness of reinforcement-learning (RL) in coordinating the power +consumption of 8 residential buildings to jointly track a target power signal. +Our RL approach relies solely on observed data from individual households and +does not require any explicit building models or simulators, making it +practical to implement and easy to scale. We show the feasibility of our +proposed RL-based coordination strategy in a real-world setting. In a 4-week +case study, we demonstrate a hierarchical control system, relying on an +RL-based ranking system to select which households to activate flex assets +from, and a real-time PI control-based power dispatch mechanism to control the +selected assets. Our results demonstrate satisfactory power tracking, and the +effectiveness of the RL-based ranks which are learnt in a purely data-driven +manner. + +
+
+ comment: 8 pages, 2 figures, workshop article accepted at RLEM'23 + (BuildSys'23) +
+
+
+
+
+ + ☆ BERT Lost Patience Won't Be Robust to Adversarial Slowdown NeurIPS 2023 + + +
+ In this paper, we systematically evaluate the robustness of multi-exit +language models against adversarial slowdown. To audit their robustness, we +design a slowdown attack that generates natural adversarial text bypassing +early-exit points. We use the resulting WAFFLE attack as a vehicle to conduct a +comprehensive evaluation of three multi-exit mechanisms with the GLUE benchmark +against adversarial slowdown. We then show our attack significantly reduces the +computational savings provided by the three methods in both white-box and +black-box settings. The more complex a mechanism is, the more vulnerable it is +to adversarial slowdown. We also perform a linguistic analysis of the perturbed +text inputs, identifying common perturbation patterns that our attack +generates, and comparing them with standard adversarial text attacks. Moreover, +we show that adversarial training is ineffective in defeating our slowdown +attack, but input sanitization with a conversational model, e.g., ChatGPT, can +remove perturbations effectively. This result suggests that future work is +needed for developing efficient yet robust multi-exit models. Our code is +available at: https://github.com/ztcoalson/WAFFLE + +
+
+ comment: Accepted to NeurIPS 2023 [Poster] +
+
+
+
+
+ + ☆ MAG-GNN: Reinforcement Learning Boosted Graph Neural Network NeurIPS 2023 + + +
+ While Graph Neural Networks (GNNs) recently became powerful tools in graph +learning tasks, considerable efforts have been spent on improving GNNs' +structural encoding ability. A particular line of work proposed subgraph GNNs +that use subgraph information to improve GNNs' expressivity and achieved great +success. However, such effectiveness sacrifices the efficiency of GNNs by +enumerating all possible subgraphs. In this paper, we analyze the necessity of +complete subgraph enumeration and show that a model can achieve a comparable +level of expressivity by considering a small subset of the subgraphs. We then +formulate the identification of the optimal subset as a combinatorial +optimization problem and propose Magnetic Graph Neural Network (MAG-GNN), a +reinforcement learning (RL) boosted GNN, to solve the problem. Starting with a +candidate subgraph set, MAG-GNN employs an RL agent to iteratively update the +subgraphs to locate the most expressive set for prediction. This reduces the +exponential complexity of subgraph enumeration to the constant complexity of a +subgraph search algorithm while keeping good expressivity. We conduct extensive +experiments on many datasets, showing that MAG-GNN achieves +performance competitive with state-of-the-art methods and even outperforms many subgraph +GNNs. We also demonstrate that MAG-GNN effectively reduces the running time of +subgraph GNNs. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Automaton Distillation: Neuro-Symbolic Transfer Learning for Deep + Reinforcement Learning + + +
+ Reinforcement learning (RL) is a powerful tool for finding optimal policies +in sequential decision processes. However, deep RL methods suffer from two +weaknesses: collecting the amount of agent experience required for practical RL +problems is prohibitively expensive, and the learned policies exhibit poor +generalization on tasks outside of the training distribution. To mitigate these +issues, we introduce automaton distillation, a form of neuro-symbolic transfer +learning in which Q-value estimates from a teacher are distilled into a +low-dimensional representation in the form of an automaton. We then propose two +methods for generating Q-value estimates: static transfer, which reasons over +an abstract Markov Decision Process constructed based on prior knowledge, and +dynamic transfer, where symbolic information is extracted from a teacher Deep +Q-Network (DQN). The resulting Q-value estimates from either method are used to +bootstrap learning in the target environment via a modified DQN loss function. +We list several failure modes of existing automaton-based transfer methods and +demonstrate that both static and dynamic automaton distillation decrease the +time required to find optimal policies for various decision tasks. + +
+
+
+
+
+ + ☆ Worst-case Performance of Popular Approximate Nearest Neighbor Search + Implementations: Guarantees and Limitations NeurIPS 2023 + + +
+ Graph-based approaches to nearest neighbor search are popular and powerful +tools for handling large datasets in practice, but they have limited +theoretical guarantees. We study the worst-case performance of recent +graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG +and DiskANN. For DiskANN, we show that its "slow preprocessing" version +provably supports approximate nearest neighbor search query with constant +approximation ratio and poly-logarithmic query time, on data sets with bounded +"intrinsic" dimension. For the other data structure variants studied, including +DiskANN with "fast preprocessing", HNSW and NSG, we present a family of +instances on which the empirical query time required to achieve a "reasonable" +accuracy is linear in instance size. For example, for DiskANN, we show that the +query procedure can take at least $0.1 n$ steps on instances of size $n$ before +it encounters any of the $5$ nearest neighbors of the query. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Software engineering for deep learning applications: usage of SWEng and + MLops tools in GitHub repositories + + +
+ The rising popularity of deep learning (DL) methods and techniques has +invigorated interest in the topic of SE4DL, the application of software +engineering (SE) practices on deep learning software. Despite the novel +engineering challenges brought on by the data-driven and non-deterministic +paradigm of DL software, little work has been invested into developing +AI-targeted SE tools. On the other hand, tools tackling more general +engineering issues in DL are actively used and referred to under the umbrella +term of ``MLOps tools''. Furthermore, the available literature supports the +utility of conventional SE tooling in DL software development. Building upon +previous MSR research on tool usage in open-source software works, we identify +conventional and MLOps tools adopted in popular applied DL projects that use +Python as the main programming language. About 70% of the GitHub repositories +mined contained at least one conventional SE tool. Software configuration +management tools are the most adopted, while the opposite applies to +maintenance tools. Substantially fewer MLOps tools were in use, with only 9 +tools out of a sample of 80 used in at least one repository. The majority of +them were open-source rather than proprietary. One of these tools, TensorBoard, +was found to be adopted in about half of the repositories in our study. +Consequently, the use of conventional SE tooling demonstrates its relevance to +DL software. Further research is recommended on the adoption of MLOps tooling +by open-source projects, focusing on the relevance of particular tool types, +the development of required tools, as well as ways to promote the use of +already available tools. + +
+
+
+
+
+ + ☆ Efficient IoT Inference via Context-Awareness + + +
+ While existing strategies for optimizing deep learning-based classification +models on low-power platforms assume the models are trained on all classes of +interest, this paper posits that adopting context-awareness, i.e., focusing +solely on the likely classes in the current context, can substantially enhance +performance in resource-constrained environments. We propose a new paradigm, +CACTUS, for scalable and efficient context-aware classification where a +micro-classifier recognizes a small set of classes relevant to the current +context and, when the context changes, rapidly switches to another suitable +micro-classifier. CACTUS has several innovations including optimizing the +training cost of context-aware classifiers, enabling on-the-fly context-aware +switching between classifiers, and selecting the best context-aware classifiers +given limited resources. We show that CACTUS achieves significant benefits in +accuracy, latency, and compute budget across a range of datasets and IoT +platforms. + +
+
+ comment: 12 pages, 10 figures +
+
+
+
+
+ + ☆ Proving Linear Mode Connectivity of Neural Networks via Optimal + Transport + + +
+ The energy landscape of high-dimensional non-convex optimization problems is +crucial to understanding the effectiveness of modern deep neural network +architectures. Recent works have experimentally shown that two different +solutions found after two runs of a stochastic training are often connected by +very simple continuous paths (e.g., linear) modulo a permutation of the +weights. In this paper, we provide a framework theoretically explaining this +empirical observation. Based on convergence rates in Wasserstein distance of +empirical measures, we show that, with high probability, two wide enough +two-layer neural networks trained with stochastic gradient descent are linearly +connected. Additionally, we express upper and lower bounds on the width of each +layer of two deep neural networks with independent neuron weights to be +linearly connected. Finally, we empirically demonstrate the validity of our +approach by showing how the dimension of the support of the weight distribution +of neurons, which dictates Wasserstein convergence rates, is correlated with +linear mode connectivity. + +
+
+
+
+
+ + ☆ Atom: Low-bit Quantization for Efficient and Accurate LLM Serving + + +
+ The growing demand for Large Language Models (LLMs) in applications such as +content generation, intelligent chatbots, and sentiment analysis poses +considerable challenges for LLM service providers. To efficiently use GPU +resources and boost throughput, batching multiple requests has emerged as a +popular paradigm; to further speed up batching, LLM quantization techniques +reduce memory consumption and increase computing capacity. However, prevalent +quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully +leverage the capabilities of modern GPUs, such as 4-bit integer operators, +resulting in sub-optimal performance. + To maximize LLMs' serving throughput, we introduce Atom, a low-bit +quantization method that achieves high throughput improvements with negligible +accuracy loss. Atom significantly boosts serving throughput by using low-bit +operators and considerably reduces memory consumption via low-bit quantization. +It attains high accuracy by applying a novel mixed-precision and fine-grained +quantization process. We evaluate Atom on 4-bit weight-activation quantization +setups in the serving context. Atom improves end-to-end throughput by up to +$7.73\times$ compared to the FP16 and by $2.53\times$ compared to INT8 +quantization, while maintaining the same latency target. + +
+
+
+
+
+ + ☆ Bridging the Gap: Towards an Expanded Toolkit for ML-Supported + Decision-Making in the Public Sector + + +
+ Machine Learning (ML) systems are becoming instrumental in the public sector, +with applications spanning areas like criminal justice, social welfare, +financial fraud detection, and public health. While these systems offer great +potential benefits to institutional decision-making processes, such as improved +efficiency and reliability, they still face the challenge of aligning intricate +and nuanced policy objectives with the precise formalization requirements +necessitated by ML models. In this paper, we aim to bridge the gap between ML +and public sector decision-making by presenting a comprehensive overview of key +technical challenges where disjunctions between policy goals and ML models +commonly arise. We concentrate on pivotal points of the ML pipeline that +connect the model to its operational environment, delving into the significance +of representative training data and highlighting the importance of a model +setup that facilitates effective decision-making. Additionally, we link these +challenges with emerging methodological advancements, encompassing causal ML, +domain adaptation, uncertainty quantification, and multi-objective +optimization, illustrating the path forward for harmonizing ML and public +sector objectives. + +
+
+
+
+
+ + ☆ Bespoke Solvers for Generative Flow Models + + +
+ Diffusion or flow-based models are powerful generative paradigms that are +notoriously hard to sample from, as samples are defined as solutions to +high-dimensional Ordinary or Stochastic Differential Equations (ODEs/SDEs) +which require a large Number of Function Evaluations (NFE) to approximate well. +Existing methods to alleviate the costly sampling process include model +distillation and designing dedicated ODE solvers. However, distillation is +costly to train and sometimes can deteriorate quality, while dedicated solvers +still require relatively large NFE to produce high quality samples. In this +paper we introduce "Bespoke solvers", a novel framework for constructing custom +ODE solvers tailored to the ODE of a given pre-trained flow model. Our approach +optimizes an order consistent and parameter-efficient solver (e.g., with 80 +learnable parameters), is trained for roughly 1% of the GPU time required for +training the pre-trained model, and significantly improves approximation and +generation quality compared to dedicated solvers. For example, a Bespoke solver +for a CIFAR10 model produces samples with Fréchet Inception Distance (FID) of +2.73 with 10 NFE, and gets to within 1% of the Ground Truth (GT) FID (2.59) for this +model with only 20 NFE. On the more challenging ImageNet-64$\times$64, Bespoke +samples at 2.2 FID with 10 NFE, and gets within 2% of GT FID (1.71) with 20 +NFE. + +
+
+
+
+
+ + ☆ Efficient Cluster Selection for Personalized Federated Learning: A + Multi-Armed Bandit Approach + + +
+ Federated learning (FL) offers a decentralized training approach for machine +learning models, prioritizing data privacy. However, the inherent heterogeneity +in FL networks, arising from variations in data distribution, size, and device +capabilities, poses challenges in user federation. Recognizing this, +Personalized Federated Learning (PFL) emphasizes tailoring learning processes +to individual data profiles. In this paper, we address the complexity of +clustering users in PFL, especially in dynamic networks, by introducing a +dynamic Upper Confidence Bound (dUCB) algorithm inspired by the multi-armed +bandit (MAB) approach. The dUCB algorithm ensures that new users can +effectively find the best cluster for their data distribution by balancing +exploration and exploitation. The performance of our algorithm is evaluated in +various cases, showing its effectiveness in handling dynamic federated learning +scenarios. + +
+
+
+
+
+ + ☆ Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile + Streaming NeurIPS 2023 + + +
+ Sketching algorithms have recently proven to be a powerful approach both for +designing low-space streaming algorithms as well as fast polynomial time +approximation schemes (PTAS). In this work, we develop new techniques to extend +the applicability of sketching-based approaches to the sparse dictionary +learning and the Euclidean $k$-means clustering problems. In particular, we +initiate the study of the challenging setting where the dictionary/clustering +assignment for each of the $n$ input points must be output, which has +surprisingly received little attention in prior work. On the fast algorithms +front, we obtain a new approach for designing PTAS's for the $k$-means +clustering problem, which generalizes to the first PTAS for the sparse +dictionary learning problem. On the streaming algorithms front, we obtain new +upper bounds and lower bounds for dictionary learning and $k$-means clustering. +In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a +turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space +upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde +O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as +well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random +order row insertion streams with a natural "bounded sensitivity" assumption. On +the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + +dk/\epsilon)$ lower bound for $k$-means clustering, as well as an +$\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the +cost of a single fixed set of candidate centers. + +
+
+ comment: To appear in NeurIPS 2023 +
+
+
+
+
+ + ☆ Gauge-optimal approximate learning for small data classification + problems + + +
+ Small data learning problems are characterized by a significant discrepancy +between the limited amount of response variable observations and the large +feature space dimension. In this setting, the common learning tools struggle to +identify the features important for the classification task from those that +bear no relevant information, and cannot derive an appropriate learning rule +that discriminates between different classes. As a potential solution +to this problem, here we exploit the idea of reducing and rotating the feature +space in a lower-dimensional gauge and propose the Gauge-Optimal Approximate +Learning (GOAL) algorithm, which provides an analytically tractable joint +solution to the dimension reduction, feature segmentation and classification +problems for small data learning problems. We prove that the optimal solution +of the GOAL algorithm consists in piecewise-linear functions in the Euclidean +space, and that it can be approximated through a monotonically convergent +algorithm which presents -- under the assumption of a discrete segmentation of +the feature space -- a closed-form solution for each optimization substep and +an overall linear iteration cost scaling. The GOAL algorithm has been compared +to other state-of-the-art machine learning (ML) tools on both synthetic data +and challenging real-world applications from climate science and bioinformatics +(i.e., prediction of the El Niño Southern Oscillation and inference of +epigenetically-induced gene-activity networks from limited experimental data). +The experimental results show that the proposed algorithm outperforms the +reported best competitors for these problems both in learning performance and +computational cost. + +
+
+ comment: 47 pages, 4 figures +
+
+
+
+
+ + ☆ Evaluating LLP Methods: Challenges and Approaches + + +
+ Learning from Label Proportions (LLP) is an established machine learning +problem with numerous real-world applications. In this setting, data items are +grouped into bags, and the goal is to learn individual item labels, knowing +only the features of the data and the proportions of labels in each bag. +Although LLP is a well-established problem, it has several unusual aspects that +create challenges for benchmarking learning methods. Fundamental complications +arise because of the existence of different LLP variants, i.e., dependence +structures that can exist between items, labels, and bags. Accordingly, the +first algorithmic challenge is the generation of variant-specific datasets +capturing the diversity of dependence structures and bag characteristics. The +second methodological challenge is model selection, i.e., hyperparameter +tuning; due to the nature of LLP, model selection cannot easily use the +standard machine learning paradigm. The final benchmarking challenge consists +of properly evaluating LLP solution methods across various LLP variants. We +note that there is very little consideration of these issues in prior work, and +there are no general solutions for these challenges proposed to date. To +address these challenges, we develop methods capable of generating LLP datasets +meeting the requirements of different variants. We use these methods to +generate a collection of datasets encompassing the spectrum of LLP problem +characteristics, which can be used in future evaluation studies. Additionally, +we develop guidelines for benchmarking LLP algorithms, including the model +selection and evaluation steps. Finally, we illustrate the new methods and +guidelines by performing an extensive benchmark of a set of well-known LLP +algorithms. We show that choosing the best algorithm depends critically on the +LLP variant and model selection method, demonstrating the need for our proposed +approach. + +
+
+
+
+
+ + ☆ Revisiting the Learnability of Apple Tasting + + +
+ In online binary classification under \textit{apple tasting} feedback, the +learner only observes the true label if it predicts "1". We revisit this +classical partial-feedback setting, first studied by \cite{helmbold2000apple}, +and study online learnability from a combinatorial perspective. We show that +the Littlestone dimension continues to provide a tight quantitative +characterization of apple tasting in the agnostic setting, closing an open +question posed by \cite{helmbold2000apple}. In addition, we give a new +combinatorial parameter, called the Effective width, that tightly quantifies +the minimax expected mistakes in the realizable setting. As a corollary, we use +the Effective width to establish a \textit{trichotomy} of the minimax expected +number of mistakes in the realizable setting. In particular, we show that in +the realizable setting, the expected number of mistakes for any learner under +apple tasting feedback can only be $\Theta(1), \Theta(\sqrt{T})$, or +$\Theta(T)$. + +
+
+ comment: 18 pages +
+
+
+
+
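The apple tasting feedback rule described in the abstract above (the true label is revealed only on rounds where the learner predicts "1") can be sketched as follows. This is a minimal illustration of the feedback protocol only, not the paper's algorithm or analysis; the function names, the toy stream, and the two constant predictors are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def apple_tasting_feedback(prediction: int, true_label: int) -> Optional[int]:
    """Reveal the true label only when the learner predicted 1; otherwise None."""
    return true_label if prediction == 1 else None


def run_protocol(
    predict: Callable[[int], int], stream: List[Tuple[int, int]]
) -> Tuple[int, int]:
    """Play the online game and return (mistakes, number of labels observed)."""
    mistakes = 0
    observed = 0
    for x, y in stream:
        y_hat = predict(x)
        if y_hat != y:
            # Mistakes are counted by the environment; the learner only
            # learns about them on rounds where it predicted 1.
            mistakes += 1
        if apple_tasting_feedback(y_hat, y) is not None:
            observed += 1
    return mistakes, observed


# Hypothetical toy stream of (instance, label) pairs.
stream = [(0, 0), (1, 1), (2, 0), (3, 1)]


def always_taste(x: int) -> int:
    """Always predicts 1, so every label is revealed."""
    return 1


def never_taste(x: int) -> int:
    """Always predicts 0, so no label is ever revealed."""
    return 0
```

On this toy stream both constant predictors make the same number of mistakes, but only the one that predicts 1 ever observes a label, which is the asymmetry that separates this setting from full-feedback online classification.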
+ + ☆ Feature Aggregation in Joint Sound Classification and Localization + Neural Networks + + +
+ This study addresses the application of deep learning techniques in joint +sound signal classification and localization networks. Current state-of-the-art +sound source localization deep learning networks lack feature aggregation +within their architecture. Feature aggregation enhances model performance by +enabling the consolidation of information from different feature scales, +thereby improving feature robustness and invariance. This is particularly +important in SSL networks, which must differentiate direct and indirect +acoustic signals. To address this gap, we adapt feature aggregation techniques +from computer vision neural networks to signal detection neural networks. +Additionally, we propose the Scale Encoding Network (SEN) for feature +aggregation to encode features from various scales, compressing the network for +more computationally efficient aggregation. To evaluate the efficacy of feature +aggregation in SSL networks, we integrated the following computer vision +feature aggregation sub-architectures into a SSL control architecture: Path +Aggregation Network (PANet), Weighted Bi-directional Feature Pyramid Network +(BiFPN), and SEN. These sub-architectures were evaluated using two metrics for +signal classification and two metrics for direction-of-arrival regression. +PANet and BiFPN are established aggregators in computer vision models, while +the proposed SEN is a more compact aggregator. The results suggest that models +incorporating feature aggregations outperformed the control model, the Sound +Event Localization and Detection network (SELDnet), in both sound signal +classification and localization. The feature aggregation techniques enhance the +performance of sound detection neural networks, particularly in +direction-of-arrival regression. + +
+
+
+
+
+ + ☆ Escaping Saddle Points in Heterogeneous Federated Learning via + Distributed SGD with Communication Compression + + +
+ We consider the problem of finding second-order stationary points of +heterogeneous federated learning (FL). Previous works in FL mostly focus on +first-order convergence guarantees, which do not rule out the scenario of +unstable saddle points. Meanwhile, it is a key bottleneck of FL to achieve +communication efficiency without compromising the learning accuracy, especially +when local data are highly heterogeneous across different clients. Given this, +we propose a novel algorithm Power-EF that only communicates compressed +information via a novel error-feedback scheme. To our knowledge, Power-EF is +the first distributed and compressed SGD algorithm that provably escapes saddle +points in heterogeneous FL without any data homogeneity assumptions. In +particular, Power-EF improves to second-order stationary points after visiting +first-order (possibly saddle) points, using additional gradient queries and +communication rounds only of almost the same order required by first-order +convergence, and the convergence rate exhibits a linear speedup in terms of the +number of workers. Our theory improves/recovers previous results, while +extending to much more tolerant settings on the local data. Numerical +experiments are provided to complement the theory. + +
+
+ comment: 27 pages +
+
+
+
+
+ + Object-centric architectures enable efficient causal representation + learning + + +
+ Causal representation learning has shown a variety of settings in which we +can disentangle latent variables with identifiability guarantees (up to some +reasonable equivalence class). Common to all of these approaches is the +assumption that (1) the latent variables are represented as $d$-dimensional +vectors, and (2) that the observations are the output of some injective +generative function of these latent variables. While these assumptions appear +benign, we show that when the observations are of multiple objects, the +generative function is no longer injective and disentanglement fails in +practice. We can address this failure by combining recent developments in +object-centric learning and causal representation learning. By modifying the +Slot Attention architecture arXiv:2006.15055, we develop an object-centric +architecture that leverages weak supervision from sparse perturbations to +disentangle each object's properties. This approach is more data-efficient in +the sense that it requires significantly fewer perturbations than a comparable +approach that encodes to a Euclidean space and we show that this approach +successfully disentangles the properties of a set of objects in a series of +simple image-based disentanglement experiments. + +
+
+
+
+
+ + ☆ Datasets and Benchmarks for Nanophotonic Structure and Parametric Design + Simulations NeurIPS 2023 + + +
+ Nanophotonic structures have versatile applications including solar cells, +anti-reflective coatings, electromagnetic interference shielding, optical +filters, and light emitting diodes. To design and understand these nanophotonic +structures, electrodynamic simulations are essential. These simulations enable +us to model electromagnetic fields over time and calculate optical properties. +In this work, we introduce frameworks and benchmarks to evaluate nanophotonic +structures in the context of parametric structure design problems. The +benchmarks are instrumental in assessing the performance of optimization +algorithms and identifying an optimal structure based on target optical +properties. Moreover, we explore the impact of varying grid sizes in +electrodynamic simulations, shedding light on how evaluation fidelity can be +strategically leveraged in enhancing structure designs. + +
+
+ comment: 31 pages, 31 figures, 4 tables. Accepted at the 37th Conference on + Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks + Track +
+
+
+
+
+ + ☆ Differentially Private Permutation Tests: Applications to Kernel Methods + + +
+ Recent years have witnessed growing concerns about the privacy of sensitive +data. In response to these concerns, differential privacy has emerged as a +rigorous framework for privacy protection, gaining widespread recognition in +both academic and industrial circles. While substantial progress has been made +in private data analysis, existing methods often suffer from impracticality or +a significant loss of statistical efficiency. This paper aims to alleviate +these concerns in the context of hypothesis testing by introducing +differentially private permutation tests. The proposed framework extends +classical non-private permutation tests to private settings, maintaining both +finite-sample validity and differential privacy in a rigorous manner. The power +of the proposed test depends on the choice of a test statistic, and we +establish general conditions for consistency and non-asymptotic uniform power. +To demonstrate the utility and practicality of our framework, we focus on +reproducing kernel-based test statistics and introduce differentially private +kernel tests for two-sample and independence testing: dpMMD and dpHSIC. The +proposed kernel tests are straightforward to implement, applicable to various +types of data, and attain minimax optimal power across different privacy +regimes. Our empirical evaluations further highlight their competitive power +under various synthetic and real-world scenarios, emphasizing their practical +value. The code is publicly available to facilitate the implementation of our +framework. + +
+
+
+
+
+ + ☆ On Linear Separation Capacity of Self-Supervised Representation Learning + + +
+ Recent advances in self-supervised learning have highlighted the efficacy of +data augmentation in learning data representation from unlabeled data. Training +a linear model atop these enhanced representations can yield an adept +classifier. Despite the remarkable empirical performance, the underlying +mechanisms that enable data augmentation to unravel nonlinear data structures +into linearly separable representations remain elusive. This paper seeks to +bridge this gap by investigating under what conditions learned representations +can linearly separate manifolds when data is drawn from a multi-manifold model. +Our investigation reveals that data augmentation offers additional information +beyond observed data and can thus improve the information-theoretic optimal +rate of linear separation capacity. In particular, we show that self-supervised +learning can linearly separate manifolds with a smaller distance than +unsupervised learning, underscoring the additional benefits of data +augmentation. Our theoretical analysis further underscores that the performance +of downstream linear classifiers primarily hinges on the linear separability of +data representations rather than the size of the labeled data set, reaffirming +the viability of constructing efficient classifiers with limited labeled data +amid an expansive unlabeled data set. + +
+
+
+
+
+ + ☆ Machine Learning for the identification of phase-transitions in + interacting agent-based systems + + +
+ Deriving closed-form, analytical expressions for reduced-order models, and +judiciously choosing the closures leading to them, has long been the strategy +of choice for studying phase- and noise-induced transitions for agent-based +models (ABMs). In this paper, we propose a data-driven framework that pinpoints +phase transitions for an ABM in its mean-field limit, using a smaller number of +variables than traditional closed-form models. To this end, we use the manifold +learning algorithm Diffusion Maps to identify a parsimonious set of data-driven +latent variables, and show that they are in one-to-one correspondence with the +expected theoretical order parameter of the ABM. We then utilize a deep +learning framework to obtain a conformal reparametrization of the data-driven +coordinates that facilitates, in our example, the identification of a single +parameter-dependent ODE in these coordinates. We identify this ODE through a +residual neural network inspired by a numerical integration scheme (forward +Euler). We then use the identified ODE -- enabled through an odd symmetry +transformation -- to construct the bifurcation diagram exhibiting the phase +transition. + +
+
+ comment: 14 pages, 9 Figures +
+
+
+
+
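The abstract above mentions a residual network inspired by the forward Euler scheme. The connection can be sketched as follows: one Euler step `x + h * f(x, p)` has the same shape as a residual block (an identity term plus a scaled update). This is a minimal sketch under stated assumptions, not the paper's model: the right-hand side `pitchfork` is a hypothetical stand-in (a textbook odd-symmetric normal form) for the learned, parameter-dependent ODE, and all names and step sizes are illustrative.

```python
from typing import Callable


def euler_residual_step(
    f: Callable[[float, float], float], x: float, p: float, h: float
) -> float:
    """One forward Euler step: identity (skip) term plus a scaled update h * f(x, p)."""
    return x + h * f(x, p)


def integrate(
    f: Callable[[float, float], float], x0: float, p: float, h: float, n_steps: int
) -> float:
    """Apply n_steps forward Euler steps starting from x0 and return the final state."""
    x = x0
    for _ in range(n_steps):
        x = euler_residual_step(f, x, p, h)
    return x


def pitchfork(x: float, p: float) -> float:
    """Hypothetical right-hand side dx/dt = p*x - x**3 (odd in x); NOT the paper's ODE."""
    return p * x - x ** 3


# Long-time states for a parameter below and above the transition at p = 0.
x_below = integrate(pitchfork, x0=0.1, p=-1.0, h=0.01, n_steps=5000)
x_above = integrate(pitchfork, x0=0.1, p=1.0, h=0.01, n_steps=5000)
```

Sweeping `p` and recording the long-time state in this way traces one branch of a bifurcation diagram for the stand-in ODE; because the right-hand side is odd in `x`, the mirrored branch follows by symmetry.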
+ + ☆ Boosting Decision-Based Black-Box Adversarial Attack with Gradient + Priors IJCAI 2023 + + +
+ Decision-based methods have been shown to be effective in black-box adversarial +attacks, as they can obtain satisfactory performance and only require access to +the final model prediction. Gradient estimation is a critical step in black-box +adversarial attacks, as it will directly affect the query efficiency. Recent +works have attempted to utilize gradient priors to facilitate score-based +methods to obtain better results. However, these gradient priors still suffer +from the edge gradient discrepancy issue and the successive iteration gradient +direction issue, and are thus difficult to simply extend to decision-based methods. +In this paper, we propose a novel Decision-based Black-box Attack framework +with Gradient Priors (DBA-GP), which seamlessly integrates the data-dependent +gradient prior and time-dependent prior into the gradient estimation procedure. +First, by leveraging the joint bilateral filter to deal with each random +perturbation, DBA-GP can guarantee that the generated perturbations in edge +locations are hardly smoothed, i.e., alleviating the edge gradient discrepancy, +thus retaining the characteristics of the original image as much as possible. +Second, by utilizing a new gradient updating strategy to automatically adjust +the successive iteration gradient direction, DBA-GP can accelerate the +convergence speed, thus improving the query efficiency. Extensive experiments +have demonstrated that the proposed method outperforms other strong baselines +significantly. + +
+
+ comment: Accepted by IJCAI 2023 +
+
+
+
+
+ + ☆ Does Invariant Graph Learning via Environment Augmentation Learn + Invariance? NeurIPS 2023 + + +
+ Invariant graph representation learning aims to learn the invariance among +data from different environments for out-of-distribution generalization on +graphs. As the graph environment partitions are usually expensive to obtain, +augmenting the environment information has become the de facto approach. +However, the usefulness of the augmented environment information has never been +verified. In this work, we find that it is fundamentally impossible to learn +invariant graph representations via environment augmentation without additional +assumptions. Therefore, we develop a set of minimal assumptions, including +variation sufficiency and variation consistency, for feasible invariant graph +learning. We then propose a new framework Graph invAriant Learning Assistant +(GALA). GALA incorporates an assistant model that needs to be sensitive to +graph environment changes or distribution shifts. The correctness of the proxy +predictions by the assistant model hence can differentiate the variations in +spurious subgraphs. We show that extracting the maximally invariant subgraph to +the proxy predictions provably identifies the underlying invariant subgraph for +successful OOD generalization under the established minimal assumptions. +Extensive experiments on datasets including DrugOOD with various graph +distribution shifts confirm the effectiveness of GALA. + +
+
+ comment: NeurIPS 2023, 34 pages, 35 figures +
+
+
+
+
+ + ☆ An Improved Relaxation for Oracle-Efficient Adversarial Contextual + Bandits NeurIPS 2023 + + +
+ We present an oracle-efficient relaxation for the adversarial contextual +bandits problem, where the contexts are sequentially drawn i.i.d from a known +distribution and the cost sequence is chosen by an online adversary. Our +algorithm has a regret bound of +$O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls +per round to an offline optimization oracle, where $K$ denotes the number of +actions, $T$ denotes the number of rounds and $\Pi$ denotes the set of +policies. This is the first result to improve the prior best bound of +$O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$ as obtained by Syrgkanis et +al. at NeurIPS 2016, and the first to match the original bound of Langford and +Zhang at NeurIPS 2007 which was obtained for the stochastic case. + +
+
+ comment: Appears in NeurIPS 2023 +
+
+
+
+
+ + ☆ Optimization Landscape of Policy Gradient Methods for Discrete-time + Static Output Feedback + + +
+ In recent times, significant advancements have been made in delving into the +optimization landscape of policy gradient methods for achieving optimal control +in linear time-invariant (LTI) systems. Compared with state-feedback control, +output-feedback control is more prevalent since the underlying state of the +system may not be fully observed in many practical settings. This paper +analyzes the optimization landscape inherent to policy gradient methods when +applied to static output feedback (SOF) control in discrete-time LTI systems +subject to quadratic cost. We begin by establishing crucial properties of the +SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous +Hessian. Despite the absence of convexity, we leverage these properties to +derive novel findings regarding convergence (and nearly dimension-free rate) to +stationary points for three policy gradient methods, including the vanilla +policy gradient method, the natural policy gradient method, and the +Gauss-Newton method. Moreover, we provide proof that the vanilla policy +gradient method exhibits linear convergence towards local minima when +initialized near such minima. The paper concludes by presenting numerical +examples that validate our theoretical findings. These results not only +characterize the performance of gradient descent for optimizing the SOF problem +but also provide insights into the effectiveness of general policy gradient +methods within the realm of reinforcement learning. + +
+
+
+
+
+ + ☆ Behavior Alignment via Reward Function Optimization NeurIPS 2023 + + +
+ Designing reward functions for efficiently guiding reinforcement learning +(RL) agents toward specific behaviors is a complex task. This is challenging +since it requires the identification of reward structures that are not sparse +and that avoid inadvertently inducing undesirable behaviors. Naively modifying +the reward structure to offer denser and more frequent feedback can lead to +unintended outcomes and promote behaviors that are not aligned with the +designer's intended goal. Although potential-based reward shaping is often +suggested as a remedy, we systematically investigate settings where deploying +it often significantly impairs performance. To address these issues, we +introduce a new framework that uses a bi-level objective to learn +\emph{behavior alignment reward functions}. These functions integrate auxiliary +rewards reflecting a designer's heuristics and domain knowledge with the +environment's primary rewards. Our approach automatically determines the most +effective way to blend these types of feedback, thereby enhancing robustness +against heuristic reward misspecification. Remarkably, it can also adapt an +agent's policy optimization process to mitigate suboptimalities resulting from +limitations and biases inherent in the underlying RL algorithms. We evaluate +our method's efficacy on a diverse set of tasks, from small-scale experiments +to high-dimensional control challenges. We investigate heuristic auxiliary +rewards of varying quality -- some of which are beneficial and others +detrimental to the learning process. Our results show that our framework offers +a robust and principled way to integrate designer-specified heuristics. It not +only addresses key shortcomings of existing approaches but also consistently +leads to high-performing solutions, even when given misaligned or +poorly-specified auxiliary reward functions. + +
+
+ comment: (Spotlight) Thirty-seventh Conference on Neural Information + Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ☆ Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic + Segmentation NeurIPS 2023 + + +
+ This paper studies the problem of weakly open-vocabulary semantic +segmentation (WOVSS), which learns to segment objects of arbitrary classes +using mere image-text pairs. Existing works turn to enhance the vanilla vision +transformer by introducing explicit grouping recognition, i.e., employing +several group tokens/centroids to cluster the image tokens and perform the +group-text alignment. Nevertheless, these methods suffer from a granularity +inconsistency regarding the usage of group tokens, which are aligned in the +all-to-one v.s. one-to-one manners during the training and inference phases, +respectively. We argue that this discrepancy arises from the lack of elaborate +supervision for each group token. To bridge this granularity gap, this paper +explores explicit supervision for the group tokens from the prototypical +knowledge. To this end, this paper proposes the non-learnable prototypical +regularization (NPR) where non-learnable prototypes are estimated from source +features to serve as supervision and enable contrastive matching of the group +tokens. This regularization encourages the group tokens to segment objects with +less redundancy and capture more comprehensive semantic regions, leading to +increased compactness and richness. Based on NPR, we propose the prototypical +guidance segmentation network (PGSeg) that incorporates multi-modal +regularization by leveraging prototypical sources from both images and texts at +different levels, progressively enhancing the segmentation capability with +diverse prototypical patterns. Experimental results show that our proposed +method achieves state-of-the-art performance on several benchmark datasets. The +source code is available at https://github.com/Ferenas/PGSeg. + +
+
+ comment: 14 pages, Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ A U-turn on Double Descent: Rethinking Parameter Counting in Statistical + Learning NeurIPS 2023 + + +
+ Conventional statistical wisdom established a well-understood relationship +between model complexity and prediction error, typically presented as a +U-shaped curve reflecting a transition between under- and overfitting regimes. +However, motivated by the success of overparametrized neural networks, recent +influential work has suggested this theory to be generally incomplete, +introducing an additional regime that exhibits a second descent in test error +as the parameter count p grows past sample size n - a phenomenon dubbed double +descent. While most attention has naturally been given to the deep-learning +setting, double descent was shown to emerge more generally across non-neural +models: known cases include linear regression, trees, and boosting. In this +work, we take a closer look at evidence surrounding these more classical +statistical machine learning methods and challenge the claim that observed +cases of double descent truly extend the limits of a traditional U-shaped +complexity-generalization curve therein. We show that once careful +consideration is given to what is being plotted on the x-axes of their double +descent plots, it becomes apparent that there are implicitly multiple +complexity axes along which the parameter count grows. We demonstrate that the +second descent appears exactly (and only) when and where the transition between +these underlying axes occurs, and that its location is thus not inherently tied +to the interpolation threshold p=n. We then gain further insight by adopting a +classical nonparametric statistics perspective. We interpret the investigated +methods as smoothers and propose a generalized measure for the effective number +of parameters they use on unseen examples, using which we find that their +apparent double descent curves indeed fold back into more traditional convex +shapes - providing a resolution to tensions between double descent and +statistical intuition. + +
+
+ comment: To appear in the Proceedings of the 37th Conference on Neural + Information Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ☆ Blacksmith: Fast Adversarial Training of Vision Transformers via a + Mixture of Single-step and Multi-step Methods + + +
+ Despite the remarkable success achieved by deep learning algorithms in +various domains, such as computer vision, they remain vulnerable to adversarial +perturbations. Adversarial Training (AT) stands out as one of the most +effective solutions to address this issue; however, single-step AT can lead to +Catastrophic Overfitting (CO). This scenario occurs when the adversarially +trained network suddenly loses robustness against multi-step attacks like +Projected Gradient Descent (PGD). Although several approaches have been +proposed to address this problem in Convolutional Neural Networks (CNNs), we +found out that they do not perform well when applied to Vision Transformers +(ViTs). In this paper, we propose Blacksmith, a novel training strategy to +overcome the CO problem, specifically in ViTs. Our approach utilizes either of +PGD-2 or Fast Gradient Sign Method (FGSM) randomly in a mini-batch during the +adversarial training of the neural network. This will increase the diversity of +our training attacks, which could potentially mitigate the CO issue. To manage +the increased training time resulting from this combination, we craft the PGD-2 +attack based on only the first half of the layers, while FGSM is applied +end-to-end. Through our experiments, we demonstrate that our novel method +effectively prevents CO, achieves PGD-2 level performance, and outperforms +other existing techniques including N-FGSM, which is the state-of-the-art +method in fast training for CNNs. + +
+
+
+
+
+ + ☆ EtiCor: Corpus for Analyzing LLMs for Etiquettes EMNLP 2023 + + +
+ Etiquettes are an essential ingredient of day-to-day interactions among +people. Moreover, etiquettes are region-specific, and etiquettes in one region +might contradict those in other regions. In this paper, we propose EtiCor, an +Etiquettes Corpus, having texts about social norms from five different regions +across the globe. The corpus provides a test bed for evaluating LLMs for +knowledge and understanding of region-specific etiquettes. Additionally, we +propose the task of Etiquette Sensitivity. We experiment with state-of-the-art +LLMs (Delphi, Falcon40B, and GPT-3.5). Initial results indicate that LLMs +mostly fail to understand etiquettes from regions of the non-Western world. + +
+
+ comment: Accepted at EMNLP 2023, Main Conference +
+
+
+
+
+ + ☆ TRIAGE: Characterizing and auditing training data for improved + regression NeurIPS 2023 + + +
+ Data quality is crucial for robust machine learning algorithms, with the +recent interest in data-centric AI emphasizing the importance of training data +characterization. However, current data characterization methods are largely +focused on classification settings, with regression settings largely +understudied. To address this, we introduce TRIAGE, a novel data +characterization framework tailored to regression tasks and compatible with a +broad class of regressors. TRIAGE utilizes conformal predictive distributions +to provide a model-agnostic scoring method, the TRIAGE score. We operationalize +the score to analyze individual samples' training dynamics and characterize +samples as under-, over-, or well-estimated by the model. We show that TRIAGE's +characterization is consistent and highlight its utility to improve performance +via data sculpting/filtering, in multiple regression settings. Additionally, +beyond sample level, we show TRIAGE enables new approaches to dataset selection +and feature acquisition. Overall, TRIAGE highlights the value unlocked by data +characterization in real-world regression applications. + +
+
+ comment: Presented at NeurIPS 2023 +
+
+
+
+
+ + ☆ End-to-End Autoregressive Retrieval via Bootstrapping for Smart Reply + Systems EMNLP 2023 + + +
+ Reply suggestion systems represent a staple component of many instant +messaging and email systems. However, the requirement to produce sets of +replies, rather than individual replies, makes the task poorly suited for +out-of-the-box retrieval architectures, which only consider individual +message-reply similarity. As a result, these systems often rely on additional +post-processing modules to diversify the outputs. However, these approaches are +ultimately bottlenecked by the performance of the initial retriever, which in +practice struggles to present a sufficiently diverse range of options to the +downstream diversification module, leading to the suggestions being less +relevant to the user. In this paper, we consider a novel approach that +radically simplifies this pipeline through an autoregressive text-to-text +retrieval model that learns the smart reply task end-to-end from a dataset of +(message, reply set) pairs obtained via bootstrapping. Empirical results show +this method consistently outperforms a range of state-of-the-art baselines +across three datasets, corresponding to a 5.1%-17.9% improvement in relevance, +and a 0.5%-63.1% improvement in diversity compared to the best baseline +approach. We make our code publicly available. + +
+
+ comment: FINDINGS-EMNLP 2023 +
+
+
+
+
+ + ☆ Playing in the Dark: No-regret Learning with Adversarial Constraints + + +
+ We study a generalization of the classic Online Convex Optimization (OCO) +framework by considering additional long-term adversarial constraints. +Specifically, after an online policy decides its action on a round, in addition +to a convex cost function, the adversary also reveals a set of $k$ convex +constraints. The cost and the constraint functions could change arbitrarily +with time, and no information about the future functions is assumed to be +available. In this paper, we propose a meta-policy that simultaneously achieves +a sublinear cumulative constraint violation and a sublinear regret. This is +achieved via a black box reduction of the constrained problem to the standard +OCO problem for a recursively constructed sequence of surrogate cost functions. +We show that optimal performance bounds can be achieved by solving the +surrogate problem using any adaptive OCO policy enjoying a standard +data-dependent regret bound. A new Lyapunov-based proof technique is presented +that reveals a connection between regret and certain sequential inequalities +through a novel decomposition result. We conclude the paper by highlighting +applications to online multi-task learning and network control problems. + +
+
+
+
+
+ + ☆ TIC-TAC: A Framework To Learn And Evaluate Your Covariance + + +
+ We study the problem of unsupervised heteroscedastic covariance estimation, +where the goal is to learn the multivariate target distribution $\mathcal{N}(y, +\Sigma_y | x )$ given an observation $x$. This problem is particularly +challenging as $\Sigma_{y}$ varies for different samples (heteroscedastic) and +no annotation for the covariance is available (unsupervised). Typically, +state-of-the-art methods predict the mean $f_{\theta}(x)$ and covariance +$\textrm{Cov}(f_{\theta}(x))$ of the target distribution through two neural +networks trained using the negative log-likelihood. This raises two questions: +(1) Does the predicted covariance truly capture the randomness of the predicted +mean? (2) In the absence of ground-truth annotation, how can we quantify the +performance of covariance estimation? We address (1) by deriving TIC: Taylor +Induced Covariance, which captures the randomness of the multivariate +$f_{\theta}(x)$ by incorporating its gradient and curvature around $x$ through +the second order Taylor polynomial. Furthermore, we tackle (2) by introducing +TAC: Task Agnostic Correlations, a metric which leverages conditioning of the +normal distribution to evaluate the covariance. We verify the effectiveness of +TIC through multiple experiments spanning synthetic (univariate, multivariate) +and real-world datasets (UCI Regression, LSP, and MPII Human Pose Estimation). +Our experiments show that TIC outperforms state-of-the-art in accurately +learning the covariance, as quantified through TAC. + +
+
+ comment: 12 pages, 4 figures. Please feel free to provide feedback! +
+
+
+
+
+ + ☆ Building a Safer Maritime Environment Through Multi-Path Long-Term + Vessel Trajectory Forecasting + + +
+ Maritime transport is paramount to global economic growth and environmental +sustainability. In this regard, the Automatic Identification System (AIS) data +plays a significant role by offering real-time streaming data on vessel +movement, which allows for enhanced traffic surveillance, assisting in vessel +safety by avoiding vessel-to-vessel collisions and proactively preventing +vessel-to-whale ones. This paper tackles an intrinsic problem to trajectory +forecasting: the effective multi-path long-term vessel trajectory forecasting +on engineered sequences of AIS data. We utilize an encoder-decoder model with +Bidirectional Long Short-Term Memory Networks (Bi-LSTM) to predict the next 12 +hours of vessel trajectories using 1 to 3 hours of AIS data. We feed the model +with probabilistic features engineered from the AIS data that refer to the +potential route and destination of each trajectory so that the model, +leveraging convolutional layers for spatial feature learning and a +position-aware attention mechanism that increases the importance of recent +timesteps of a sequence during temporal feature learning, forecasts the vessel +trajectory taking the potential route and destination into account. The F1 +Score of these features is approximately 85% and 75%, indicating their +efficiency in supplementing the neural network. We trialed our model in the +Gulf of St. Lawrence, one of the North Atlantic Right Whales (NARW) habitats, +achieving an R2 score exceeding 98% with varying techniques and features. +Despite the high R2 score being attributed to well-defined shipping lanes, our +model demonstrates superior complex decision-making during path selection. In +addition, our model shows enhanced accuracy, with average and median +forecasting errors of 11km and 6km, respectively. Our study confirms the +potential of geographical data engineering and trajectory forecasting models +for preserving marine life species. + +
+
+ comment: 44 pages, 13 figures, 6 tables, 27 equations, and 1 algorithm +
+
+
+
+
+ + ☆ Language Agents with Reinforcement Learning for Strategic Play in the + Werewolf Game + + +
+ Agents built with large language models (LLMs) have recently achieved great +advancements. However, most of the efforts focus on single-agent or cooperative +settings, leaving more general multi-agent environments underexplored. We +propose a new framework powered by reinforcement learning (RL) to develop +strategic language agents, i.e., LLM-based agents with strategic thinking +ability, for a popular language game, Werewolf. Werewolf is a social deduction +game with hidden roles that involves both cooperation and competition and +emphasizes deceptive communication and diverse gameplay. Our agent tackles this +game by first using LLMs to reason about potential deceptions and generate a +set of strategically diverse actions. Then an RL policy, which selects an +action from the candidates, is learned by population-based training to enhance +the agents' decision-making ability. By combining LLMs with the RL policy, our +agent produces a variety of emergent strategies, achieves the highest win rate +against other LLM-based agents, and stays robust against adversarial human +players in the Werewolf game. + +
+
+
+
+
+ + ☆ Machine Learning Algorithms to Predict Chess960 Result and Develop + Opening Themes + + +
+ This work focuses on the analysis of Chess 960, also known as Fischer Random +Chess, a variant of traditional chess where the starting positions of the +pieces are randomized. The study aims to predict the game outcome using machine +learning techniques and develop an opening theme for each starting position. +The first part of the analysis utilizes machine learning models to predict the +game result based on certain moves in each position. The methodology involves +segregating raw data from .pgn files into usable formats and creating datasets +comprising approximately 500 games for each starting position. Three machine +learning algorithms -- KNN Clustering, Random Forest, and Gradient Boosted +Trees -- have been used to predict the game outcome. To establish an opening +theme, the board is divided into five regions: center, white kingside, white +queenside, black kingside, and black queenside. The data from games played by +top engines in all 960 positions is used to track the movement of pieces in the +opening. By analysing the change in the number of pieces in each region at +specific moves, the report predicts the region towards which the game is +developing. These models provide valuable insights into predicting game +outcomes and understanding the opening theme in Chess 960. + +
+
+ comment: 16 pages, 6 figures and 3 tables +
+
+
+
+
+ + ☆ The Utility of "Even if..." Semifactual Explanation to Optimise Positive + Outcomes + + +
+ When users receive either a positive or negative outcome from an automated +system, Explainable AI (XAI) has almost exclusively focused on how to mutate +negative outcomes into positive ones by crossing a decision boundary using +counterfactuals (e.g., "If you earn 2k more, we will accept your loan +application"). Here, we instead focus on positive outcomes, and take +the novel step of using XAI to optimise them (e.g., "Even if you wish +to halve your down-payment, we will still accept your loan application"). +Explanations such as these that employ "even if..." reasoning, and do not cross +a decision boundary, are known as semifactuals. To instantiate semifactuals in +this context, we introduce the concept of Gain (i.e., how much a user +stands to benefit from the explanation), and consider the first causal +formalisation of semifactuals. Tests on benchmark datasets show our algorithms +are better at maximising gain compared to prior work, and that causality is +important in the process. Most importantly, however, a user study supports our +main hypothesis by showing people find semifactual explanations more useful +than counterfactuals when they receive the positive outcome of a loan +acceptance. + +
+
+
+
+
+ + ☆ Adversarial Examples Are Not Real Features NeurIPS 2023 + + +
+ The existence of adversarial examples has been a mystery for years and +attracted much interest. A well-known theory by Ilyas et al. (2019) +explains adversarial vulnerability from a data perspective by showing that one +can extract non-robust features from adversarial examples and these features +alone are useful for classification. However, the explanation remains quite +counter-intuitive since non-robust features are mostly noise features to +humans. In this paper, we re-examine the theory from a larger context by +incorporating multiple learning paradigms. Notably, we find that contrary to +their good usefulness under supervised learning, non-robust features attain +poor usefulness when transferred to other self-supervised learning paradigms, +such as contrastive learning, masked image modeling, and diffusion models. This +reveals that non-robust features are not really as useful as robust or natural +features that enjoy good transferability between these paradigms. Meanwhile, +for robustness, we also show that naturally trained encoders from robust +features are largely non-robust under AutoAttack. Our cross-paradigm +examination suggests that the non-robust features are not really useful but +more like paradigm-wise shortcuts, and robust features alone might be +insufficient to attain reliable model robustness. Code is available at +https://github.com/PKU-ML/AdvNotRealFeatures. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU + Networks on Nearly-orthogonal Data NeurIPS 2023 + + +
+ The implicit bias towards solutions with favorable properties is believed to +be a key reason why neural networks trained by gradient-based optimization can +generalize well. While the implicit bias of gradient flow has been widely +studied for homogeneous neural networks (including ReLU and leaky ReLU +networks), the implicit bias of gradient descent is currently only understood +for smooth neural networks. Therefore, implicit bias in non-smooth neural +networks trained by gradient descent remains an open question. In this paper, +we aim to answer this question by studying the implicit bias of gradient +descent for training two-layer fully connected (leaky) ReLU neural networks. We +show that when the training data are nearly-orthogonal, for the leaky ReLU +activation function, gradient descent will find a network with a stable rank +that converges to $1$, whereas for the ReLU activation function, gradient descent +will find a neural network with a stable rank that is upper bounded by a +constant. Additionally, we show that gradient descent will find a neural +network such that all the training data points have the same normalized margin +asymptotically. Experiments on both synthetic and real data back up our +theoretical findings. + +
+
+ comment: 55 pages, 7 figures. In NeurIPS 2023 +
+
+
+
+
+ + ☆ Label Poisoning is All You Need + + +
+ In a backdoor attack, an adversary injects corrupted data into a model's +training dataset in order to gain control over its predictions on images with a +specific attacker-defined trigger. A typical corrupted training example +requires altering both the image, by applying the trigger, and the label. +Models trained on clean images, therefore, were considered safe from backdoor +attacks. However, in some common machine learning scenarios, the training +labels are provided by potentially malicious third-parties. This includes +crowd-sourced annotation and knowledge distillation. We, hence, investigate a +fundamental question: can we launch a successful backdoor attack by only +corrupting labels? We introduce a novel approach to design label-only backdoor +attacks, which we call FLIP, and demonstrate its strengths on three datasets +(CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four architectures (ResNet-32, +ResNet-18, VGG-19, and Vision Transformer). With only 2% of CIFAR-10 labels +corrupted, FLIP achieves a near-perfect attack success rate of 99.4% while +suffering only a 1.8% drop in the clean test accuracy. Our approach builds upon +the recent advances in trajectory matching, originally introduced for dataset +distillation. + +
+
+
+
+
+ + ☆ A transfer learning approach with convolutional neural network for Face + Mask Detection + + +
+ Due to the epidemic of the coronavirus (Covid-19) and its rapid spread around +the world, the world has faced an enormous crisis. To prevent the spread of the +coronavirus, the World Health Organization (WHO) has introduced the use of +masks and keeping social distance as the best preventive method. So, developing +an automatic monitoring system for detecting facemasks in some crowded places +is essential. To do this, we propose a mask recognition system based on +transfer learning and Inception v3 architecture. In the proposed method, two +datasets are used simultaneously for training, including the Simulated Mask Face +Dataset (SMFD) and MaskedFace-Net (MFN). This paper tries to increase the +accuracy of the proposed system by optimally setting hyper-parameters and +accurately designing the fully connected layers. The main advantage of the +proposed method is that in addition to masked and unmasked faces, it can also +detect cases of incorrect use of masks. Therefore, the proposed method +classifies the input face images into three categories. Experimental results +show the high accuracy and efficiency of the proposed method: it +achieves an accuracy of 99.47% and 99.33% on training and test data, +respectively. + +
+
+ comment: 9 pages, in Persian language, 8 figures +
+
+
+
+
+ + ☆ Remaining Useful Life Prediction of Lithium-ion Batteries using + Spatio-temporal Multimodal Attention Networks + + +
+ Lithium-ion batteries are widely used in various applications, including +electric vehicles and renewable energy storage. The prediction of the remaining +useful life (RUL) of batteries is crucial for ensuring reliable and efficient +operation, as well as reducing maintenance costs. However, determining the life +cycle of batteries in real-world scenarios is challenging, and existing methods +have limitations in predicting the number of cycles iteratively. In addition, +existing works often oversimplify the datasets, neglecting important features +of the batteries such as temperature, internal resistance, and material type. +To address these limitations, this paper proposes a two-stage remaining useful +life prediction scheme for Lithium-ion batteries using a spatio-temporal +multimodal attention network (ST-MAN). The proposed model is designed to +iteratively predict the number of cycles required for the battery to reach the +end of its useful life, based on available data. The proposed ST-MAN is designed to +capture the complex spatio-temporal dependencies in the battery data, including +the features that are often neglected in existing works. Experimental results +demonstrate that the proposed ST-MAN model outperforms existing CNN and +LSTM-based methods, achieving state-of-the-art performance in predicting the +remaining useful life of Li-ion batteries. The proposed method has the +potential to improve the reliability and efficiency of battery operations and +is applicable in various industries, including automotive and renewable energy. + +
+
+
+
+
+ + ☆ Posterior Sampling with Delayed Feedback for Reinforcement Learning with + Linear Function Approximation + + +
+ Recent studies in reinforcement learning (RL) have made significant progress +by leveraging function approximation to alleviate the sample complexity hurdle +for better performance. Despite the success, existing provably efficient +algorithms typically rely on the accessibility of immediate feedback upon +taking actions. The failure to account for the impact of delay in observations +can significantly degrade the performance of real-world systems due to the +regret blow-up. In this work, we tackle the challenge of delayed feedback in RL +with linear function approximation by employing posterior sampling, which has +been shown to empirically outperform the popular UCB algorithms in a wide range +of regimes. We first introduce Delayed-PSVI, an optimistic value-based +algorithm that effectively explores the value function space via noise +perturbation with posterior sampling. We provide the first analysis for +posterior sampling algorithms with delayed feedback in RL and show our +algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case +regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the +expected delay. To further improve its computational efficiency and to expand +its applicability in high-dimensional RL problems, we incorporate a +gradient-based approximate sampling scheme via Langevin dynamics for +Delayed-LPSVI, which maintains the same order-optimal regret guarantee with +$\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to +demonstrate the statistical and computational efficacy of our algorithms. + +
+
+
+
+
+ + ☆ Hyperbolic Graph Neural Networks at Scale: A Meta Learning Approach NeurIPS 2023 + + +
+ The progress in hyperbolic neural networks (HNNs) research is hindered by +their absence of inductive bias mechanisms, which are essential for +generalizing to new tasks and facilitating scalable learning over large +datasets. In this paper, we aim to alleviate these issues by learning +generalizable inductive biases from the nodes' local subgraph and transferring them +for faster learning over new subgraphs with a disjoint set of nodes, edges, and +labels in a few-shot setting. We introduce a novel method, Hyperbolic GRAph +Meta Learner (H-GRAM), that, for the tasks of node classification and link +prediction, learns transferable information from a set of support local +subgraphs in the form of hyperbolic meta gradients and label hyperbolic +protonets to enable faster learning over a query set of new tasks dealing with +disjoint subgraphs. Furthermore, we show that an extension of our meta-learning +framework also mitigates the scalability challenges faced by +existing HNN approaches. Our comparative analysis shows that H-GRAM effectively +learns and transfers information in multiple challenging few-shot settings +compared to other state-of-the-art baselines. Additionally, we demonstrate +that, unlike standard HNNs, our approach is able to scale over large graph +datasets and improve performance over its Euclidean counterparts. + +
+
+ comment: Accepted to NeurIPS 2023. 14 pages of main paper, 5 pages of + supplementary +
+
+
+
+
+ + ☆ Sentence Bag Graph Formulation for Biomedical Distant Supervision + Relation Extraction + + +
+ We introduce a novel graph-based framework for alleviating key challenges in +distantly-supervised relation extraction and demonstrate its effectiveness in +the challenging and important domain of biomedical data. Specifically, we +propose a graph view of sentence bags referring to an entity pair, which +enables message-passing based aggregation of information related to the entity +pair over the sentence bag. The proposed framework alleviates the common +problem of noisy labeling in distantly supervised relation extraction and also +effectively incorporates inter-dependencies between sentences within a bag. +Extensive experiments on two large-scale biomedical relation datasets and the +widely utilized NYT dataset demonstrate that our proposed framework +significantly outperforms the state-of-the-art methods for biomedical distant +supervision relation extraction while also providing excellent performance for +relation extraction in the general text mining domain. + +
+
+
+
+
+ + ☆ InstanT: Semi-supervised Learning with Instance-dependent Thresholds NeurIPS 2023 + + +
+ Semi-supervised learning (SSL) has been a fundamental challenge in machine +learning for decades. The primary family of SSL algorithms, known as +pseudo-labeling, involves assigning pseudo-labels to confident unlabeled +instances and incorporating them into the training set. Therefore, the +selection criteria of confident instances are crucial to the success of SSL. +Recently, there has been growing interest in the development of SSL methods +that use dynamic or adaptive thresholds. Yet, these methods typically apply the +same threshold to all samples, or use class-dependent thresholds for instances +belonging to a certain class, while neglecting instance-level information. In +this paper, we propose the study of instance-dependent thresholds, which has +the highest degree of freedom compared with existing methods. Specifically, we +devise a novel instance-dependent threshold function for all unlabeled +instances by utilizing their instance-level ambiguity and the +instance-dependent error rates of pseudo-labels, so instances that are more +likely to have incorrect pseudo-labels will have higher thresholds. +Furthermore, we demonstrate that our instance-dependent threshold function +provides a bounded probabilistic guarantee for the correctness of the +pseudo-labels it assigns. + +
+
+ comment: Accepted as poster for NeurIPS 2023 +
+
+
+
+
+ + ☆ Estimating the Rate-Distortion Function by Wasserstein Gradient Descent NeurIPS 2023 + + +
+ In the theory of lossy compression, the rate-distortion (R-D) function $R(D)$ +describes how much a data source can be compressed (in bit-rate) at any given +level of fidelity (distortion). Obtaining $R(D)$ for a given data source +establishes the fundamental performance limit for all compression algorithms. +We propose a new method to estimate $R(D)$ from the perspective of optimal +transport. Unlike the classic Blahut--Arimoto algorithm which fixes the support +of the reproduction distribution in advance, our Wasserstein gradient descent +algorithm learns the support of the optimal reproduction distribution by moving +particles. We prove its local convergence and analyze the sample complexity of +our R-D estimator based on a connection to entropic optimal transport. +Experimentally, we obtain comparable or tighter bounds than state-of-the-art +neural network methods on low-rate sources while requiring considerably less +tuning and computation effort. We also highlight a connection to +maximum-likelihood deconvolution and introduce a new class of sources that can +be used as test cases with known solutions to the R-D problem. + +
+
+ comment: Accepted as conference paper at NeurIPS 2023 +
+
+
+
+
+ + ☆ Topological, or Non-topological? A Deep Learning Based Prediction + + +
+ Prediction and discovery of new materials with desired properties are at the +forefront of quantum science and technology research. A major bottleneck in +this field is the computational resources and time complexity related to +finding new materials from ab initio calculations. In this work, an effective +and robust deep learning-based model is proposed by incorporating persistent +homology and graph neural network which offers an accuracy of 91.4% and an F1 +score of 88.5% in classifying topological vs. non-topological materials, +outperforming the other state-of-the-art classifier models. The incorporation +of the graph neural network encodes the underlying relation between the atoms +into the model based on their own crystalline structures and thus proved to be +an effective method to represent and process non-euclidean data like molecules +with a relatively shallow network. The persistent homology pipeline in the +suggested neural network is capable of integrating the atom-specific +topological information into the deep learning model, increasing robustness +and improving performance. It is believed that the presented work will be an +efficacious tool for predicting the topological class and therefore enable the +high-throughput search for novel materials in this field. + +
+
+ comment: 13 pages, 8 figures +
+
+
+
+
+ + ☆ Learning Subgrid-Scale Models in Discontinuous Galerkin Methods with + Neural Ordinary Differential Equations for Compressible Navier--Stokes + Equations + + +
+ The growing computing power over the years has enabled simulations to become +more complex and accurate. However, high-fidelity simulations, while immensely +valuable for scientific discovery and problem solving, come with significant +computational demands. As a result, it is common to run a low-fidelity model +with a subgrid-scale model to reduce the computational cost, but selecting the +appropriate subgrid-scale models and tuning them are challenging. We propose a +novel method for learning the subgrid-scale model effects when simulating +partial differential equations using neural ordinary differential equations in +the context of discontinuous Galerkin (DG) spatial discretization. Our approach +learns the missing scales of the low-order DG solver at a continuous level and +hence improves the accuracy of the low-order DG approximations as well as +accelerates the filtered high-order DG simulations with a certain degree of +precision. We demonstrate the performance of our approach through +multidimensional Taylor--Green vortex examples at different Reynolds numbers +and times, which cover laminar, transitional, and turbulent regimes. The +proposed method not only reconstructs the subgrid-scale from the low-order +(1st-order) approximation but also speeds up the filtered high-order DG +(6th-order) simulation by two orders of magnitude. + +
+
+ comment: 15 figures, 2 tables, 22 pages +
+
+
+
+
+ + ☆ Ever Evolving Evaluator (EV3): Towards Flexible and Reliable + Meta-Optimization for Knowledge Distillation NeurIPS 2023 + + +
+ We introduce EV3, a novel meta-optimization framework designed to efficiently +train scalable machine learning models through an intuitive +explore-assess-adapt protocol. In each iteration of EV3, we explore various +model parameter updates, assess them using pertinent evaluation methods, and +adapt the model based on the optimal updates and previous progress history. EV3 +offers substantial flexibility without imposing stringent constraints like +differentiability on the key objectives relevant to the tasks of interest. +Moreover, this protocol welcomes updates with biased gradients and allows for +the use of a diversity of losses and optimizers. Additionally, in scenarios +with multiple objectives, it can be used to dynamically prioritize tasks. With +inspiration drawn from evolutionary algorithms, meta-learning, and neural +architecture search, we investigate an application of EV3 to knowledge +distillation. Our experimental results illustrate EV3's capability to safely +explore model spaces, while hinting at its potential applicability across +numerous domains due to its inherent flexibility and adaptability. + +
+
+ comment: NeurIPS 2023 Workshop on Adaptive Experimental Design and Active + Learning in the Real World (RealML-2023) +
+
+
+
+
+ + ☆ D2NO: Efficient Handling of Heterogeneous Input Function Spaces with + Distributed Deep Neural Operators + + +
+ Neural operators have been applied in various scientific fields, such as +solving parametric partial differential equations, dynamical systems with +control, and inverse problems. However, challenges arise when dealing with +input functions that exhibit heterogeneous properties, requiring multiple +sensors to handle functions with minimal regularity. To address this issue, +discretization-invariant neural operators have been used, allowing the sampling +of diverse input functions with different sensor locations. However, existing +frameworks still require an equal number of sensors for all functions. In our +study, we propose a novel distributed approach to further relax the +discretization requirements and solve the heterogeneous dataset challenges. Our +method involves partitioning the input function space and processing individual +input functions using independent and separate neural networks. A centralized +neural network is used to handle shared information across all output +functions. This distributed methodology reduces the number of gradient descent +back-propagation steps, improving efficiency while maintaining accuracy. We +demonstrate that the corresponding neural network is a universal approximator +of continuous nonlinear operators and present four numerical examples to +validate its performance. + +
+
+
+
+
+ + ☆ A foundational neural operator that continuously learns without + forgetting + + +
+ Machine learning has witnessed substantial growth, leading to the development +of advanced artificial intelligence models crafted to address a wide range of +real-world challenges spanning various domains, such as computer vision, +natural language processing, and scientific computing. Nevertheless, the +creation of custom models for each new task remains a resource-intensive +undertaking, demanding considerable computational time and memory resources. In +this study, we introduce the concept of the Neural Combinatorial Wavelet Neural +Operator (NCWNO) as a foundational model for scientific computing. This model +is specifically designed to excel in learning from a diverse spectrum of +physics and continuously adapt to the solution operators associated with +parametric partial differential equations (PDEs). The NCWNO leverages a gated +structure that employs local wavelet experts to acquire shared features across +multiple physical systems, complemented by a memory-based ensembling approach +among these local wavelet experts. This combination enables rapid adaptation to +new challenges. The proposed foundational model offers two key advantages: (i) +it can simultaneously learn solution operators for multiple parametric PDEs, +and (ii) it can swiftly generalize to new parametric PDEs with minimal +fine-tuning. The proposed NCWNO is the first foundational operator learning +algorithm distinguished by its (i) robustness against catastrophic forgetting, +(ii) the maintenance of positive transfer for new parametric PDEs, and (iii) +the facilitation of knowledge transfer across dissimilar tasks. Through an +extensive set of benchmark examples, we demonstrate that the NCWNO can +outperform task-specific baseline operator learning frameworks with minimal +hyperparameter tuning at the prediction stage. We also show that with minimal +fine-tuning, the NCWNO performs accurate combinatorial learning of new +parametric PDEs. + +
+
+
+
+
+ + ☆ Simple and Asymmetric Graph Contrastive Learning without Augmentations NeurIPS 2023 + + +
+ Graph Contrastive Learning (GCL) has shown superior performance in +representation learning in graph-structured data. Despite their success, most +existing GCL methods rely on prefabricated graph augmentation and homophily +assumptions. Thus, they fail to generalize well to heterophilic graphs where +connected nodes may have different class labels and dissimilar features. In +this paper, we study the problem of conducting contrastive learning on +homophilic and heterophilic graphs. We find that we can achieve promising +performance simply by considering an asymmetric view of the neighboring nodes. +The resulting simple algorithm, Asymmetric Contrastive Learning for Graphs +(GraphACL), is easy to implement and does not rely on graph augmentations and +homophily assumptions. We provide theoretical and empirical evidence that +GraphACL can capture one-hop local neighborhood information and two-hop +monophily similarity, which are both important for modeling heterophilic +graphs. Experimental results show that the simple GraphACL significantly +outperforms state-of-the-art graph contrastive learning and self-supervised +learning methods on homophilic and heterophilic graphs. The code of GraphACL is +available at https://github.com/tengxiao1/GraphACL. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Differentiable Learning of Generalized Structured Matrices for Efficient + Deep Neural Networks + + +
+ This paper investigates efficient deep neural networks (DNNs) to replace +dense unstructured weight matrices with structured ones that possess desired +properties. The challenge arises because the optimal weight matrix structure in +popular neural network models is obscure in most cases and may vary from layer +to layer even in the same network. Prior structured matrices proposed for +efficient DNNs were mostly hand-crafted without a generalized framework to +systematically learn them. To address this issue, we propose a generalized and +differentiable framework to learn efficient structures of weight matrices by +gradient descent. We first define a new class of structured matrices that +covers a wide range of structured matrices in the literature by adjusting the +structural parameters. Then, the frequency-domain differentiable +parameterization scheme based on the Gaussian-Dirichlet kernel is adopted to +learn the structural parameters by proximal gradient descent. Finally, we +introduce an effective initialization method for the proposed scheme. Our +method learns efficient DNNs with structured matrices, achieving lower +complexity and/or higher performance than prior approaches that employ +low-rank, block-sparse, or block-low-rank matrices. + +
+
+
+
+
+ + ♻ ☆ A Holistic Approach to Unifying Automatic Concept Extraction and Concept + Importance Estimation + + +
+ In recent years, concept-based approaches have emerged as some of the most +promising explainability methods to help us interpret the decisions of +Artificial Neural Networks (ANNs). These methods seek to discover intelligible +visual 'concepts' buried within the complex patterns of ANN activations in two +key steps: (1) concept extraction followed by (2) importance estimation. While +these two steps are shared across methods, they all differ in their specific +implementations. Here, we introduce a unifying theoretical framework that +comprehensively defines and clarifies these two steps. This framework offers +several advantages as it allows us: (i) to propose new evaluation metrics for +comparing different concept extraction approaches; (ii) to leverage modern +attribution methods and evaluation metrics to extend and systematically +evaluate state-of-the-art concept-based approaches and importance estimation +techniques; (iii) to derive theoretical guarantees regarding the optimality of +such methods. We further leverage our framework to try to tackle a crucial +question in explainability: how to efficiently identify clusters of data points +that are classified based on a similar shared strategy. To illustrate these +findings and to highlight the main strategies of a model, we introduce a visual +representation called the strategic cluster graph. Finally, we present +https://serre-lab.github.io/Lens, a dedicated website that offers a complete +compilation of these visualizations for all classes of the ImageNet dataset. + +
+
+
+
+
+ + ♻ ☆ Neural Fields with Hard Constraints of Arbitrary Differential Order NeurIPS + 2023 + + +
+ While deep learning techniques have become extremely popular for solving a +broad range of optimization problems, methods to enforce hard constraints +during optimization, particularly on deep neural networks, remain +underdeveloped. Inspired by the rich literature on meshless interpolation and +its extension to spectral collocation methods in scientific computing, we +develop a series of approaches for enforcing hard constraints on neural fields, +which we refer to as Constrained Neural Fields (CNF). The constraints can be +specified as a linear operator applied to the neural field and its derivatives. +We also design specific model representations and training strategies for +problems where standard models may encounter difficulties, such as conditioning +of the system, memory consumption, and capacity of the network when being +constrained. Our approaches are demonstrated in a wide range of real-world +applications. Additionally, we develop a framework that enables highly +efficient model and constraint specification, which can be readily applied to +any downstream task where hard constraints need to be explicitly satisfied +during optimization. + +
+
+ comment: 37th Conference on Neural Information Processing Systems (NeurIPS + 2023) +
+
+
+
+
+ + ♻ ☆ Decision Stacks: Flexible Reinforcement Learning via Modular Generative + Models NeurIPS 2023 + + +
+ Reinforcement learning presents an attractive paradigm to reason about +several distinct aspects of sequential decision making, such as specifying +complex goals, planning future observations and actions, and critiquing their +utilities. However, the combined integration of these capabilities poses +competing algorithmic challenges in retaining maximal expressivity while +allowing for flexibility in modeling choices for efficient learning and +inference. We present Decision Stacks, a generative framework that decomposes +goal-conditioned policy agents into 3 generative modules. These modules +simulate the temporal evolution of observations, rewards, and actions via +independent generative models that can be learned in parallel via teacher +forcing. Our framework guarantees both expressivity and flexibility in +designing individual modules to account for key factors such as architectural +bias, optimization objective and dynamics, transferability across domains, and +inference speed. Our empirical results demonstrate the effectiveness of +Decision Stacks for offline policy optimization for several MDP and POMDP +environments, outperforming existing methods and enabling flexible generative +decision making. + +
+
+ comment: published at NeurIPS 2023, project page: + https://siyan-zhao.github.io/decision-stacks/ +
+
+
+
+
+ + ♻ ☆ LEACE: Perfect linear concept erasure in closed form + + +
+ Concept erasure aims to remove specified features from a representation. It +can improve fairness (e.g. preventing a classifier from using gender or race) +and interpretability (e.g. removing a concept to observe changes in model +behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form +method which provably prevents all linear classifiers from detecting a concept +while changing the representation as little as possible, as measured by a broad +class of norms. We apply LEACE to large language models with a novel procedure +called "concept scrubbing," which erases target concept information from every +layer in the network. We demonstrate our method on two tasks: measuring the +reliance of language models on part-of-speech information, and reducing gender +bias in BERT embeddings. Code is available at +https://github.com/EleutherAI/concept-erasure. + +
+
+
+
+
+ + ♻ ☆ From Continuous Dynamics to Graph Neural Networks: Neural Diffusion and + Beyond + + +
+ Graph neural networks (GNNs) have demonstrated significant promise in +modelling relational data and have been widely applied in various fields of +interest. The key mechanism behind GNNs is the so-called message passing, where +information is iteratively aggregated to central nodes from their +neighbourhood. Such a scheme has been found to be intrinsically linked to a +physical process known as heat diffusion, where the propagation of GNNs +naturally corresponds to the evolution of heat density. Analogizing the process +of message passing to the heat dynamics allows us to fundamentally understand the +power and pitfalls of GNNs and consequently informs better model design. +Recently, a plethora of works has emerged that propose GNNs inspired by +the continuous dynamics formulation, in an attempt to mitigate the known +limitations of GNNs, such as oversmoothing and oversquashing. In this survey, +we provide the first systematic and comprehensive review of studies that +leverage the continuous perspective of GNNs. To this end, we introduce +foundational ingredients for adapting continuous dynamics to GNNs, along with a +general framework for the design of graph neural dynamics. We then review and +categorize existing works based on their driving mechanisms and underlying +dynamics. We also summarize how the limitations of classic GNNs can be +addressed under the continuous framework. We conclude by identifying multiple +open research directions. + +
+
+
+
+
+ + ♻ ☆ An Ensemble Approach to Question Classification: Integrating Electra + Transformer, GloVe, and LSTM + + +
+ Natural Language Processing (NLP) has emerged as a crucial technology for +understanding and generating human language, playing an essential role in tasks +such as machine translation, sentiment analysis, and more pertinently, question +classification. As a subfield within NLP, question classification focuses on +determining the type of information being sought, a fundamental step for +downstream applications like question answering systems. This study presents an +innovative ensemble approach for question classification, combining the +strengths of Electra, GloVe, and LSTM models. Rigorously tested on the +well-regarded TREC dataset, the model demonstrates how the integration of these +disparate technologies can lead to superior results. Electra brings in its +transformer-based capabilities for complex language understanding, GloVe offers +global vector representations for capturing word-level semantics, and LSTM +contributes its sequence learning abilities to model long-term dependencies. By +fusing these elements strategically, our ensemble model delivers a robust and +efficient solution for the complex task of question classification. Through +rigorous comparisons with well-known models like BERT, RoBERTa, and DistilBERT, +the ensemble approach verifies its effectiveness by attaining an 80% accuracy +score on the test dataset. + +
+
+
+
+
+ + ♻ ☆ Learning in Zero-Sum Linear Quadratic Games with Last-Iterate + Convergence + + +
+ Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and +can be used (i) as a dynamic game formulation for risk-sensitive or robust +control and (ii) as a benchmark setting for multi-agent reinforcement learning +with two competing agents in continuous state-control spaces. In contrast to +the well-studied single-agent linear quadratic regulator problem, zero-sum LQ +games entail solving a challenging nonconvex-nonconcave min-max problem with an +objective function that lacks coercivity. Recently, Zhang et al. showed that +an $\epsilon$-Nash equilibrium (NE) of finite horizon zero-sum LQ games can be +learned via nested model-free Natural Policy Gradient (NPG) algorithms with +poly$(1/\epsilon)$ sample complexity. In this work, we propose a simpler nested +Zeroth-Order (ZO) algorithm improving sample complexity by several orders of +magnitude and guaranteeing convergence of the last iterate. Our main results +are two-fold: (i) in the deterministic setting, we establish the first global +last-iterate linear convergence result for the nested algorithm that seeks NE +of zero-sum LQ games; (ii) in the model-free setting, we establish +a $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity using a +single-point ZO estimator. For our last-iterate convergence results, our +analysis leverages the Implicit Regularization (IR) property and a new gradient +domination condition for the primal function. Our key improvements in the +sample complexity rely on a more sample-efficient nested algorithm design and a +finer control of the ZO natural gradient estimation error utilizing the +structure endowed by the finite-horizon setting. + +
+
+
+
+
+ + ♻ ☆ Alignment with human representations supports robust few-shot learning NeurIPS 2023 + + +
+ Should we care whether AI systems have representations of the world that are +similar to those of humans? We provide an information-theoretic analysis that +suggests that there should be a U-shaped relationship between the degree of +representational alignment with humans and performance on few-shot learning +tasks. We confirm this prediction empirically, finding such a relationship in +an analysis of the performance of 491 computer vision models. We also show that +highly-aligned models are more robust to both natural adversarial attacks and +domain shifts. Our results suggest that human-alignment is often a sufficient, +but not necessary, condition for models to make effective use of limited data, +be robust, and generalize well. + +
+
+ comment: Spotlight at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Koopman Kernel Regression + + +
+ Many machine learning approaches for decision making, such as reinforcement +learning, rely on simulators or predictive models to forecast the +time-evolution of quantities of interest, e.g., the state of an agent or the +reward of a policy. Forecasts of such complex phenomena are commonly described +by highly nonlinear dynamical systems, making their use in optimization-based +decision-making challenging. Koopman operator theory offers a beneficial +paradigm for addressing this problem by characterizing forecasts via linear +time-invariant (LTI) ODEs -- turning multi-step forecasting into sparse matrix +multiplications. Though there exists a variety of learning approaches, they +usually lack crucial learning-theoretic guarantees, making the behavior of the +obtained models with increasing data and dimensionality unclear. We address the +aforementioned by deriving a novel reproducing kernel Hilbert space (RKHS) over +trajectories that solely spans transformations into LTI dynamical systems. The +resulting Koopman Kernel Regression (KKR) framework enables the use of +statistical learning tools from function approximation for novel convergence +results and generalization error bounds under weaker assumptions than existing +work. Our experiments demonstrate superior forecasting performance compared to +Koopman operator and sequential data predictors in RKHS. + +
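The remark in the abstract above, that a linear time-invariant description turns multi-step forecasting into matrix multiplications, can be illustrated with a minimal sketch. It assumes a discrete-time system with a diagonal transition matrix, which is a simplification of the LTI ODEs mentioned in the abstract. The function `forecast` and its arguments `eigvals`, `z0` and `readout` are hypothetical names chosen for this example, and the code is not the KKR method itself.

```python
def forecast(eigvals, z0, readout, steps):
    """Forecast a scalar observable under z_{k+1} = diag(eigvals) z_k.

    Hypothetical helper for illustration: with a diagonal (sparse)
    transition matrix, each step is an elementwise multiplication.
    Returns the readout values for k = 0, 1, ..., steps.
    """
    outputs = []
    z = list(z0)
    for _ in range(steps + 1):
        # Linear readout of the current latent state.
        outputs.append(sum(c * zi for c, zi in zip(readout, z)))
        # One step of the linear dynamics.
        z = [lam * zi for lam, zi in zip(eigvals, z)]
    return outputs


if __name__ == "__main__":
    # One decaying mode (0.5) and one constant mode (1.0).
    print(forecast([0.5, 1.0], [1.0, 2.0], [1.0, 1.0], steps=3))
```

Because the dynamics are linear, the k-step forecast depends only on powers of the transition matrix, so no nonlinear simulator has to be rolled out step by step.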
+
+
+
+
+ + ♻ ☆ DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent + Method + + +
+ This paper proposes a new easy-to-implement parameter-free gradient-based +optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is +efficient -- matching the convergence rate of optimally tuned gradient descent +in convex optimization up to a logarithmic factor without tuning any +parameters, and universal -- automatically adapting to both smooth and +nonsmooth problems. While popular algorithms following the AdaGrad framework +compute a running average of the squared gradients to use for normalization, +DoWG maintains a new distance-based weighted version of the running average, +which is crucial to achieve the desired properties. To complement our theory, +we also show empirically that DoWG trains at the edge of stability, and +validate its effectiveness on practical machine learning tasks. + +
+
+ comment: 22 pages, 1 table, 4 figures +
+
+
+
+
+ + ♻ ☆ Federated Learning for Medical Applications: A Taxonomy, Current Trends, + Challenges, and Future Research Directions + + +
+ With the advent of the IoT, AI, ML, and DL algorithms, the landscape of +data-driven medical applications has emerged as a promising avenue for +designing robust and scalable diagnostic and prognostic models from medical +data. This has gained a lot of attention from both academia and industry, +leading to significant improvements in healthcare quality. However, the +adoption of AI-driven medical applications still faces tough challenges, +including meeting security, privacy, and quality of service (QoS) standards. +Recent developments in federated learning (FL) have made it possible to train complex +machine-learned models in a distributed manner and have become an active +research domain, particularly processing the medical data at the edge of the +network in a decentralized way to preserve privacy and address security +concerns. To this end, in this paper, we explore the present and future of FL +technology in medical applications where data sharing is a significant +challenge. We delve into the current research trends and their outcomes, +unravelling the complexities of designing reliable and scalable FL models. +Our paper outlines the fundamental statistical issues in FL, tackles +device-related problems, addresses security challenges, and navigates the +complexity of privacy concerns, all while highlighting its transformative +potential in the medical field. Our study primarily focuses on medical +applications of FL, particularly in the context of global cancer +diagnosis. We highlight the potential of FL to enable computer-aided diagnosis +tools that address this challenge with greater effectiveness than traditional +data-driven methods. We hope that this comprehensive review will serve as a +checkpoint for the field, summarizing the current state-of-the-art and +identifying open problems and future research directions. + +
+
+ comment: Accepted at IEEE Internet of Things Journal +
+
+
+
+
+ + ♻ ☆ A Spectral Approach to Item Response Theory + + +
+ The Rasch model is one of the most fundamental models in \emph{item response +theory} and has wide-ranging applications from education testing to +recommendation systems. In a universe with $n$ users and $m$ items, the Rasch +model assumes that the binary response $X_{li} \in \{0,1\}$ of a user $l$ with +parameter $\theta^*_l$ to an item $i$ with parameter $\beta^*_i$ (e.g., a user +likes a movie, a student correctly solves a problem) is distributed as +$\Pr(X_{li}=1) = 1/(1 + \exp(-(\theta^*_l - \beta^*_i)))$. In this paper, we +propose a \emph{new item estimation} algorithm for this celebrated model (i.e., +to estimate $\beta^*$). The core of our algorithm is the computation of the +stationary distribution of a Markov chain defined on an item-item graph. We +complement our algorithmic contributions with finite-sample error guarantees, +the first of their kind in the literature, showing that our algorithm is +consistent and enjoys favorable optimality properties. We discuss practical +modifications to accelerate and robustify the algorithm that practitioners can +adopt. Experiments on synthetic and real-life datasets, ranging from small +education testing datasets to large recommendation systems datasets, show that +our algorithm is scalable, accurate, and competitive with the most commonly +used methods in the literature. + +
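As a small illustration of the response model quoted in the abstract above, the sketch below evaluates the probability 1/(1 + exp(-(theta - beta))) for a few parameter values. The function name `rasch_prob` and the example values are assumptions made for this illustration; the code only evaluates the stated formula and is not the paper's spectral estimation algorithm.

```python
import math


def rasch_prob(theta: float, beta: float) -> float:
    """Probability of a positive response under the Rasch model.

    theta is the user parameter and beta the item parameter; the
    probability is the logistic function of their difference.
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


if __name__ == "__main__":
    # Hypothetical (theta, beta) pairs chosen only for illustration.
    for theta, beta in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
        print(theta, beta, round(rasch_prob(theta, beta), 4))
```

When the user and item parameters are equal the probability is exactly one half, and swapping the two parameters gives the complementary probability.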
+
+
+
+
+ + ♻ ☆ Towards Anytime Classification in Early-Exit Architectures by Enforcing + Conditional Monotonicity NeurIPS 2023 + + +
+ Modern predictive models are often deployed to environments in which +computational budgets are dynamic. Anytime algorithms are well-suited to such +environments as, at any point during computation, they can output a prediction +whose quality is a function of computation time. Early-exit neural networks +have garnered attention in the context of anytime computation due to their +capability to provide intermediate predictions at various stages throughout the +network. However, we demonstrate that current early-exit networks are not +directly applicable to anytime settings, as the quality of predictions for +individual data points is not guaranteed to improve with longer computation. To +address this shortcoming, we propose an elegant post-hoc modification, based on +the Product-of-Experts, that encourages an early-exit network to become +gradually confident. This gives our deep models the property of conditional +monotonicity in the prediction quality -- an essential stepping stone towards +truly anytime predictive modeling using early-exit architectures. Our empirical +results on standard image-classification tasks demonstrate that such behaviors +can be achieved while preserving competitive accuracy on average. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Energy cost and machine learning accuracy impact of k-anonymisation and + synthetic data techniques + + +
+ To address increasing societal concerns regarding privacy and climate, the EU +adopted the General Data Protection Regulation (GDPR) and committed to the +Green Deal. Considerable research has studied the energy efficiency of software and +the accuracy of machine learning models trained on anonymised data sets. Recent +work began exploring the impact of privacy-enhancing techniques (PET) on both +the energy consumption and accuracy of the machine learning models, focusing on +k-anonymity. As synthetic data is becoming an increasingly popular PET, this +paper analyses the energy consumption and accuracy of two phases: a) applying +privacy-enhancing techniques to the concerned data set, b) training the models +on the concerned privacy-enhanced data set. We use two privacy-enhancing +techniques: k-anonymisation (using generalisation and suppression) and +synthetic data, and three machine-learning models. Each model is trained on +each privacy-enhanced data set. Our results show that models trained on +k-anonymised data consume less energy than models trained on the original data, +with a similar performance regarding accuracy. Models trained on synthetic data +have a similar energy consumption and a similar or lower accuracy compared to +models trained on the original data. + +
+
+ comment: Published in the proceedings (Pages: 57-65) of The International + Conference on Information and Communications Technology for Sustainability + (ICT4S) 2023 in Rennes, France. 9 pages, 4 figures, 5 tables +
+
+
+
+
+ + ♻ ☆ Beyond Geometry: Comparing the Temporal Structure of Computation in + Neural Circuits with Dynamical Similarity Analysis + + +
+ How can we tell whether two neural networks utilize the same internal +processes for a particular computation? This question is pertinent for multiple +subfields of neuroscience and machine learning, including neuroAI, mechanistic +interpretability, and brain-machine interfaces. Standard approaches for +comparing neural networks focus on the spatial geometry of latent states. Yet +in recurrent networks, computations are implemented at the level of dynamics, +and two networks performing the same computation with equivalent dynamics need +not exhibit the same geometry. To bridge this gap, we introduce a novel +similarity metric that compares two systems at the level of their dynamics, +called Dynamical Similarity Analysis (DSA). Our method incorporates two +components: Using recent advances in data-driven dynamical systems theory, we +learn a high-dimensional linear system that accurately captures core features +of the original nonlinear dynamics. Next, we compare different systems passed +through this embedding using a novel extension of Procrustes Analysis that +accounts for how vector fields change under orthogonal transformation. In four +case studies, we demonstrate that our method disentangles conjugate and +non-conjugate recurrent neural networks (RNNs), while geometric methods fall +short. We additionally show that our method can distinguish learning rules in +an unsupervised manner. Our method opens the door to comparative analyses of +the essential temporal structure of computation in neural circuits. + +
+
+ comment: 22 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Fairness and Bias in Robot Learning + + +
+ Machine learning has significantly enhanced the abilities of robots, enabling +them to perform a wide range of tasks in human environments and adapt to our +uncertain real world. Recent works in various machine learning domains have +highlighted the importance of accounting for fairness to ensure that these +algorithms do not reproduce human biases and consequently lead to +discriminatory outcomes. With robot learning systems increasingly performing +more and more tasks in our everyday lives, it is crucial to understand the +influence of such biases to prevent unintended behavior toward certain groups +of people. In this work, we present the first survey on fairness in robot +learning from an interdisciplinary perspective spanning technical, ethical, and +legal challenges. We propose a taxonomy for sources of bias and the resulting +types of discrimination due to them. Using examples from different robot +learning domains, we examine scenarios of unfair outcomes and strategies to +mitigate them. We present early advances in the field by covering different +fairness definitions, ethical and legal considerations, and methods for fair +robot learning. With this work, we aim to pave the road for groundbreaking +developments in fair robot learning. + +
+
+
+
+
+ + ♻ ☆ On the impact of activation and normalization in obtaining isometric + embeddings at initialization + + +
+ In this paper, we explore the structure of the penultimate Gram matrix in +deep neural networks, which contains the pairwise inner products of outputs +corresponding to a batch of inputs. In several architectures it has been +observed that this Gram matrix becomes degenerate with depth at initialization, +which dramatically slows training. Normalization layers, such as batch or layer +normalization, play a pivotal role in preventing the rank collapse issue. +Despite promising advances, the existing theoretical results do not extend to +layer normalization, which is widely used in transformers, and cannot +quantitatively characterize the role of non-linear activations. To bridge this +gap, we prove that layer normalization, in conjunction with activation layers, +biases the Gram matrix of a multilayer perceptron towards the identity matrix +at an exponential rate with depth at initialization. We quantify this rate +using the Hermite expansion of the activation function. + +
+
+
+
+
+ + ♻ ☆ Auditing Fairness by Betting NeurIPS 2023 + + +
+ We provide practical, efficient, and nonparametric methods for auditing the +fairness of deployed classification and regression models. Whereas previous +work relies on a fixed-sample size, our methods are sequential and allow for +the continuous monitoring of incoming data, making them highly amenable to +tracking the fairness of real-world systems. We also allow the data to be +collected by a probabilistic policy as opposed to sampled uniformly from the +population. This enables auditing to be conducted on data gathered for another +purpose. Moreover, this policy may change over time and different policies may +be used on different subpopulations. Finally, our methods can handle +distribution shift resulting from either changes to the model or changes in the +underlying population. Our approach is based on recent progress in +anytime-valid inference and game-theoretic statistics -- the "testing by betting" +framework in particular. These connections ensure that our methods are +interpretable, fast, and easy to implement. We demonstrate the efficacy of our +approach on three benchmark fairness datasets. + +
+
+ comment: Accepted to NeurIPS 2023. 29 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ Locally Differentially Private Gradient Tracking for Distributed Online + Learning over Directed Graphs + + +
+ Distributed online learning has been proven extremely effective in solving +large-scale machine learning problems over streaming data. However, information +sharing between learners in distributed learning also raises concerns about the +potential leakage of individual learners' sensitive data. To mitigate this +risk, differential privacy, which is widely regarded as the "gold standard" for +privacy protection, has been widely employed in many existing results on +distributed online learning. However, these results often face a fundamental +tradeoff between learning accuracy and privacy. In this paper, we propose a +locally differentially private gradient tracking based distributed online +learning algorithm that successfully circumvents this tradeoff. We prove that +the proposed algorithm converges in mean square to the exact optimal solution +while ensuring rigorous local differential privacy, with the cumulative privacy +budget guaranteed to be finite even when the number of iterations tends to +infinity. The algorithm is applicable even when the communication graph among +learners is directed. To the best of our knowledge, this is the first result +that simultaneously ensures learning accuracy and rigorous local differential +privacy in distributed online learning over directed graphs. We evaluate our +algorithm's performance by using multiple benchmark machine-learning +applications, including logistic regression of the "Mushrooms" dataset and +CNN-based image classification of the "MNIST" and "CIFAR-10" datasets, +respectively. The experimental results confirm that the proposed algorithm +outperforms existing counterparts in both training and testing accuracies. + +
+
+ comment: 21 pages, 4 figures +
+
+
+
+
+ + ♻ ☆ Making AI Less "Thirsty": Uncovering and Addressing the Secret Water + Footprint of AI Models + + +
+ The growing carbon footprint of artificial intelligence (AI) models, +especially large ones such as GPT-3, has been undergoing public scrutiny. +Unfortunately, however, the equally important and enormous water (withdrawal +and consumption) footprint of AI models has remained under the radar. For +example, training GPT-3 in Microsoft's state-of-the-art U.S. data centers can +directly evaporate 700,000 liters of clean freshwater, but such information has +been kept a secret. More critically, the global AI demand may be accountable +for 4.2 -- 6.6 billion cubic meters of water withdrawal in 2027, which is more +than the total annual water withdrawal of 4 -- 6 Denmark or half of the United +Kingdom. This is very concerning, as freshwater scarcity has become one of the +most pressing challenges shared by all of us in the wake of the rapidly growing +population, depleting water resources, and aging water infrastructures. To +respond to the global water challenges, AI models can, and also must, take +social responsibility and lead by example by addressing their own water +footprint. In this paper, we provide a principled methodology to estimate the +water footprint of AI models, and also discuss the unique spatial-temporal +diversities of AI models' runtime water efficiency. Finally, we highlight the +necessity of holistically addressing water footprint along with carbon +footprint to enable truly sustainable AI. + +
+
+ comment: New updates include discussion on water withdrawal and water + consumption, scope definition for water, and new estimates of GPT-3's water + footprint based on Microsoft's new WUE and PUE data. Source codes available + at: https://github.com/Ren-Research/Making-AI-Less-Thirsty +
+
+
+
+
+ + ♻ ☆ Bayesian Optimisation of Functions on Graphs NeurIPS 2023 + + +
+ The increasing availability of graph-structured data motivates the task of +optimising over functions defined on the node set of graphs. Traditional graph +search algorithms can be applied in this case, but they may be +sample-inefficient and do not make use of information about the function +values; on the other hand, Bayesian optimisation is a class of promising +black-box solvers with superior sample efficiency, but it has scarcely +been applied to such novel setups. To fill this gap, we propose a novel +Bayesian optimisation framework that optimises over functions defined on +generic, large-scale and potentially unknown graphs. Through the learning of +suitable kernels on graphs, our framework has the advantage of adapting to the +behaviour of the target function. The local modelling approach further +guarantees the efficiency of our method. Extensive experiments on both +synthetic and real-world graphs demonstrate the effectiveness of the proposed +optimisation framework. + +
+
+ comment: NeurIPS 2023. 11 pages, 11 figures, 1 table (29 pages, 31 figures, 1 + table including references and appendices) +
+
+
+
+
+ + ♻ ☆ Diffusion Variational Autoencoder for Tackling Stochasticity in + Multi-Step Regression Stock Price Prediction CIKM 2023 + + +
+ Multi-step stock price prediction over a long-term horizon is crucial for +forecasting its volatility, allowing financial institutions to price and hedge +derivatives, and banks to quantify the risk in their trading books. +Additionally, most financial regulators also require a liquidity horizon of +several days for institutional investors to exit their risky assets, in order +to not materially affect market prices. However, the task of multi-step stock +price prediction is challenging, given the highly stochastic nature of stock +data. Current solutions to tackle this problem are mostly designed for +single-step, classification-based predictions, and are limited to low +representation expressiveness. The problem also gets progressively harder with +the introduction of the target price sequence, which also contains stochastic +noise and reduces generalizability at test-time. To tackle these issues, we +combine a deep hierarchical variational-autoencoder (VAE) and diffusion +probabilistic techniques to do seq2seq stock prediction through a stochastic +generative process. The hierarchical VAE allows us to learn the complex and +low-level latent variables for stock prediction, while the diffusion +probabilistic model trains the predictor to handle stock price stochasticity by +progressively adding random noise to the stock data. Our Diffusion-VAE (D-Va) +model is shown to outperform state-of-the-art solutions in terms of its +prediction accuracy and variance. More importantly, the multi-step outputs can +also allow us to form a stock portfolio over the prediction length. We +demonstrate the effectiveness of our model outputs in the portfolio investment +task through the Sharpe ratio metric and highlight the importance of dealing +with different types of prediction uncertainties. + +
+
+ comment: CIKM 2023 +
+
+
+
+
+ + ♻ ☆ Latent Exploration for Reinforcement Learning + + +
+ In Reinforcement Learning, agents learn policies by exploring and interacting +with the environment. Due to the curse of dimensionality, learning policies +that map high-dimensional sensory input to motor output is particularly +challenging. During training, state-of-the-art methods (SAC, PPO, etc.) explore +the environment by perturbing the actuation with independent Gaussian noise. +While this unstructured exploration has proven successful in numerous tasks, it +can be suboptimal for overactuated systems. When multiple actuators, such as +motors or muscles, drive behavior, uncorrelated perturbations risk diminishing +each other's effect, or modifying the behavior in a task-irrelevant way. While +solutions to introduce time correlation across action perturbations exist, +introducing correlation across actuators has been largely ignored. Here, we +propose LATent TIme-Correlated Exploration (Lattice), a method to inject +temporally-correlated noise into the latent state of the policy network, which +can be seamlessly integrated with on- and off-policy algorithms. We demonstrate +that the noisy actions generated by perturbing the network's activations can be +modeled as a multivariate Gaussian distribution with a full covariance matrix. +In the PyBullet locomotion tasks, Lattice-SAC achieves state-of-the-art +results, and reaches 18% higher reward than unstructured exploration in the +Humanoid environment. In the musculoskeletal control environments of MyoSuite, +Lattice-PPO achieves higher reward in most reaching and object manipulation +tasks, while also finding more energy-efficient policies with reductions of +20-60%. Overall, we demonstrate the effectiveness of structured action noise in +time and actuator space for complex motor control tasks. The code is available +at: https://github.com/amathislab/lattice. + +
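The claim above that perturbing latent activations yields actions with a full covariance matrix can be checked on a toy case. The sketch assumes a purely linear output layer a = W(z + eps) with independent latent noise eps ~ N(0, sigma^2 I), for which the action noise covariance is sigma^2 W W^T. The linear layer, the matrix `W` and the function `action_noise_covariance` are illustrative assumptions; the actual Lattice implementation is in the linked repository and may differ.

```python
def action_noise_covariance(W, sigma):
    """Covariance of W @ eps for eps ~ N(0, sigma^2 I), i.e. sigma^2 * W W^T.

    W is a list of rows (one row per actuator, one column per latent unit).
    Hypothetical helper for illustration only.
    """
    n = len(W)
    return [
        [sigma ** 2 * sum(wi * wj for wi, wj in zip(W[i], W[j])) for j in range(n)]
        for i in range(n)
    ]


# Two actuators reading from three latent units; the third unit is shared.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
cov = action_noise_covariance(W, sigma=0.5)

if __name__ == "__main__":
    for row in cov:
        print(row)
```

The off-diagonal entries are non-zero because both actuators read from the shared latent unit, so independent noise in latent space becomes correlated noise in actuator space.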
+
+ comment: Code available at https://github.com/amathislab/lattice +
+
+
+
+
+ + ♻ ☆ A practical PINN framework for multi-scale problems with multi-magnitude + loss terms + + +
+ For multi-scale problems, the conventional physics-informed neural networks +(PINNs) face some challenges in obtaining available predictions. In this paper, +based on PINNs, we propose a practical deep learning framework for multi-scale +problems by reconstructing the loss function and associating it with special +neural network architectures. New PINN methods derived from the improved PINN +framework differ from the conventional PINN method mainly in two aspects. +First, the new methods use a novel loss function by modifying the standard loss +function through a (grouping) regularization strategy. The regularization +strategy implements a different power operation on each loss term so that all +loss terms composing the loss function are of approximately the same order of +magnitude, which makes all loss terms be optimized synchronously during the +optimization process. Second, for the multi-frequency or high-frequency +problems, in addition to using the modified loss function, new methods upgrade +the neural network architecture from the common fully-connected neural network +to special network architectures such as the Fourier feature architecture, and +the integrated architecture developed by us. The combination of the above two +techniques leads to a significant improvement in the computational accuracy of +multi-scale problems. Several challenging numerical examples demonstrate the +effectiveness of the proposed methods. The proposed methods not only +significantly outperform the conventional PINN method in terms of computational +efficiency and computational accuracy, but also compare favorably with the +state-of-the-art methods in the recent literature. The improved PINN framework +facilitates better application of PINNs to multi-scale problems. + +
+
+
+
+
+ + ♻ ☆ Optimal Learners for Realizable Regression: PAC Learning and Online + Learning + + +
+ In this work, we aim to characterize the statistical complexity of realizable +regression both in the PAC learning setting and the online learning setting. +Previous work had established the sufficiency of finiteness of the fat +shattering dimension for PAC learnability and the necessity of finiteness of +the scaled Natarajan dimension, but little progress had been made towards a +more complete characterization since the work of Simon (SICOMP '97). To this +end, we first introduce a minimax instance optimal learner for realizable +regression and propose a novel dimension that both qualitatively and +quantitatively characterizes which classes of real-valued predictors are +learnable. We then identify a combinatorial dimension related to the Graph +dimension that characterizes ERM learnability in the realizable setting. +Finally, we establish a necessary condition for learnability based on a +combinatorial dimension related to the DS dimension, and conjecture that it may +also be sufficient in this context. Additionally, in the context of online +learning we provide a dimension that characterizes the minimax instance optimal +cumulative loss up to a constant factor and design an optimal online learner +for realizable regression, thus resolving an open question raised by Daskalakis +and Golowich in STOC '22. + +
+
+
+
+
+ + ♻ ☆ Prodigy: An Expeditiously Adaptive Parameter-Free Learner + + +
+ We consider the problem of estimating the learning rate in adaptive methods, +such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to +provably estimate the distance to the solution $D$, which is needed to set the +learning rate optimally. Our techniques are modifications of the D-Adaptation +method for learning-rate-free learning. Our methods improve upon the +convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where +$d_0$ is the initial estimate of $D$. We test our methods on 12 common +logistic-regression benchmark datasets, VGG11 and ResNet-50 training on +CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on +Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT +transformer training on BookWiki. Our experimental results show that our +approaches consistently outperform D-Adaptation and reach test accuracy values +close to that of hand-tuned Adam. + +
+
+
+
+
+ + ♻ ☆ UNSSOR: Unsupervised Neural Speech Separation by Leveraging + Over-determined Training Mixtures NeurIPS + + +
+ In reverberant conditions with multiple concurrent speakers, each microphone +acquires a mixture signal of multiple speakers at a different location. In +over-determined conditions where the microphones out-number speakers, we can +narrow down the solutions to speaker images and realize unsupervised speech +separation by leveraging each mixture signal as a constraint (i.e., the +estimated speaker images at a microphone should add up to the mixture). +Equipped with this insight, we propose UNSSOR, an algorithm for +$\textbf{u}$nsupervised $\textbf{n}$eural $\textbf{s}$peech +$\textbf{s}$eparation by leveraging $\textbf{o}$ver-determined training +mixtu$\textbf{r}$es. At each training step, we feed an input mixture to a deep +neural network (DNN) to produce an intermediate estimate for each speaker, +linearly filter the estimates, and optimize a loss so that, at each microphone, +the filtered estimates of all the speakers can add up to the mixture to satisfy +the above constraint. We show that this loss can promote unsupervised +separation of speakers. The linear filters are computed in each sub-band based +on the mixture and DNN estimates through the forward convolutive prediction +(FCP) algorithm. To address the frequency permutation problem incurred by using +sub-band FCP, a loss term based on minimizing intra-source magnitude scattering +is proposed. Although UNSSOR requires over-determined training mixtures, we can +train DNNs to achieve under-determined separation (e.g., unsupervised monaural +speech separation). Evaluation results on two-speaker separation in reverberant +conditions show the effectiveness and potential of UNSSOR. + +
+
+ comment: in Conference on Neural Information Processing Systems (NeurIPS), + 2023 +
+
+
+
+
+ + ♻ ☆ A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking + + +
+ This paper presents a comprehensive survey on deep learning-based image +watermarking, a technique that entails the invisible embedding and extraction +of watermarks within a cover image, aiming to offer a seamless blend of +robustness and adaptability. We navigate the complex landscape of this +interdisciplinary domain, linking historical foundations, current innovations, +and prospective developments. Unlike existing literature, our study +concentrates exclusively on image watermarking with deep learning, delivering +an in-depth, yet brief analysis enriched by three fundamental contributions. +First, we introduce a refined categorization, segmenting the field into +Embedder-Extractor, Deep Networks as a Feature Transformation, and Hybrid +Methods. This taxonomy, inspired by the varied roles of deep learning across +studies, is designed to infuse clarity, offering readers technical insights and +directional guidance. Second, our exploration dives into representative +methodologies, encapsulating the diverse research directions and inherent +challenges within each category to provide a consolidated perspective. Lastly, +we venture beyond established boundaries to outline emerging frontiers, +offering a detailed insight into prospective research avenues. + +
+
+ comment: This paper was accepted for publication by the MDPI Applied Sciences + journal +
+
+
+
+
+ + ♻ ☆ Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and + the Case of Information Extraction EMNLP 2023 + + +
+ Large language models (LLMs) have great potential for synthetic data +generation. This work shows that useful data can be synthetically generated +even for tasks that cannot be solved directly by LLMs: for problems with +structured outputs, it is possible to prompt an LLM to perform the task in the +reverse direction, by generating plausible input text for a target output +structure. Leveraging this asymmetry in task difficulty makes it possible to +produce large-scale, high-quality data for complex tasks. We demonstrate the +effectiveness of this approach on closed information extraction, where +collecting ground-truth data is challenging, and no satisfactory dataset exists +to date. We synthetically generate a dataset of 1.8M data points, establish its +superior quality compared to existing datasets in a human evaluation, and use +it to finetune small models (220M and 770M parameters), termed SynthIE, that +outperform the prior state of the art (with equal model size) by a substantial +margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, +and models are available at https://github.com/epfl-dlab/SynthIE. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency + Model ACM MM 2023 + + +
+ Denoising diffusion probabilistic models (DDPMs) have shown promising +performance for speech synthesis. However, a large number of iterative steps +are required to achieve high sample quality, which restricts the inference +speed. Maintaining sample quality while increasing sampling speed has become a +challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based +"Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a +single diffusion sampling step while achieving high audio quality. The +consistency constraint is applied to distill a consistency model from a +well-designed diffusion-based teacher model, which ultimately yields superior +performance in the distilled CoMoSpeech. Our experiments show that by +generating audio recordings in a single sampling step, CoMoSpeech achieves +an inference speed more than 150 times faster than real-time on a single NVIDIA +A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based +speech synthesis truly practical. Meanwhile, objective and subjective +evaluations on text-to-speech and singing voice synthesis show that the +proposed teacher models yield the best audio quality, and the one-step sampling +based CoMoSpeech achieves the best inference speed with better or comparable +audio quality to other conventional multi-step diffusion model baselines. Audio +samples are available at https://comospeech.github.io/. + +
+
+ comment: Accepted to ACM MM 2023 +
+
+
+
+
+ + ♻ ☆ Perceptual Quality Assessment of Face Video Compression: A Benchmark and + An Effective Method + + +
+ Recent years have witnessed an exponential increase in the demand for face +video compression, and the success of artificial intelligence has expanded the +boundaries beyond traditional hybrid video coding. Generative coding approaches +have been identified as promising alternatives with reasonable perceptual +rate-distortion trade-offs, leveraging the statistical priors of face videos. +However, the great diversity of distortion types in spatial and temporal +domains, ranging from the traditional hybrid coding frameworks to generative +models, presents grand challenges in compressed face video quality assessment +(VQA). In this paper, we introduce the large-scale Compressed Face Video +Quality Assessment (CFVQA) database, which is the first attempt to +systematically understand the perceptual quality and diversified compression +distortions in face videos. The database contains 3,240 compressed face video +clips in multiple compression levels, which are derived from 135 source videos +with diversified content using six representative video codecs, including two +traditional methods based on hybrid coding frameworks, two end-to-end methods, +and two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index +for face video compression was developed to measure the perceptual quality, +considering the distinct content characteristics and temporal priors of the +face videos. Experimental results exhibit its superior performance on the +proposed CFVQA dataset. The benchmark is now made publicly available at: +https://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment. + +
+
+
+
+
+ + ♻ ☆ Equivariant Adaptation of Large Pretrained Models NeurIPS 2023 + + +
+ Equivariant networks are specifically designed to ensure consistent behavior +with respect to a set of input transformations, leading to higher sample +efficiency and more accurate and robust predictions. However, redesigning each +component of prevalent deep neural network architectures to achieve chosen +equivariance is a difficult problem and can result in a computationally +expensive network during both training and inference. A recently proposed +alternative towards equivariance that removes the architectural constraints is +to use a simple canonicalization network that transforms the input to a +canonical form before feeding it to an unconstrained prediction network. We +show here that this approach can effectively be used to make a large pretrained +network equivariant. However, we observe that the produced canonical +orientations can be misaligned with those of the training distribution, +hindering performance. Using dataset-dependent priors to inform the +canonicalization function, we are able to make large pretrained models +equivariant while maintaining their performance. This significantly improves +the robustness of these models to deterministic transformations of the data, +such as rotations. We believe this equivariant adaptation of large pretrained +models can help their domain-specific applications with known symmetry priors. + +
+
+ comment: 17 pages, 6 figures. Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Complexity Matters: Rethinking the Latent Space for Generative Modeling NeurIPS 2023 + + +
+ In generative modeling, numerous successful approaches leverage a +low-dimensional latent space, e.g., Stable Diffusion models the latent space +induced by an encoder and generates images through a paired decoder. Although +the selection of the latent space is empirically pivotal, determining the +optimal choice and the process of identifying it remain unclear. In this study, +we aim to shed light on this under-explored topic by rethinking the latent +space from the perspective of model complexity. Our investigation starts with +the classic generative adversarial networks (GANs). Inspired by the GAN +training objective, we propose a novel "distance" between the latent and data +distributions, whose minimization coincides with that of the generator +complexity. The minimizer of this distance is characterized as the optimal +data-dependent latent that most effectively capitalizes on the generator's +capacity. Then, we consider parameterizing such a latent distribution by an +encoder network and propose a two-stage training strategy called Decoupled +Autoencoder (DAE), where the encoder is only updated in the first stage with an +auxiliary decoder and then frozen in the second stage while the actual decoder +is being trained. DAE can improve the latent distribution and as a result, +improve the generative performance. Our theoretical analyses are corroborated +by comprehensive experiments on various models such as VQGAN and Diffusion +Transformer, where our modifications yield significant improvements in sample +quality with decreased model complexity. + +
+
+ comment: Accepted to NeurIPS 2023 (Spotlight) +
+
+
+
+
+ + ♻ ☆ Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management NeurIPS 2023 + + +
+ Reinforcement learning (RL) has shown great promise for developing dialogue +management (DM) agents that are non-myopic, conduct rich conversations, and +maximize overall user satisfaction. Despite recent developments in RL and +language models (LMs), using RL to power conversational chatbots remains +challenging, in part because RL requires online exploration to learn +effectively, whereas collecting novel human-bot interactions can be expensive +and unsafe. This issue is exacerbated by the combinatorial action spaces facing +these algorithms, as most LM agents generate responses at the word level. We +develop a variety of RL algorithms, specialized to dialogue planning, that +leverage recent Mixture-of-Expert Language Models (MoE-LMs) -- models that +capture diverse semantics, generate utterances reflecting different intents, +and are amenable to multi-turn DM. By exploiting MoE-LM structure, our methods +significantly reduce the size of the action space and improve the efficacy of +RL-based DM. We evaluate our methods in open-domain dialogue to demonstrate +their effectiveness w.r.t. the diversity of intent in generated utterances and +overall DM performance. + +
+
+ comment: Thirty-seventh Conference on Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Inconsistency, Instability, and Generalization Gap of Deep Neural + Network Training NeurIPS 2023 + + +
+ As deep neural networks are highly expressive, it is important to find +solutions with small generalization gap (the difference between the performance +on the training data and unseen data). Focusing on the stochastic nature of +training, we first present a theoretical analysis in which the bound of +generalization gap depends on what we call inconsistency and instability of +model outputs, which can be estimated on unlabeled data. Our empirical study +based on this analysis shows that instability and inconsistency are strongly +predictive of generalization gap in various settings. In particular, our +finding indicates that inconsistency is a more reliable indicator of +generalization gap than the sharpness of the loss landscape. Furthermore, we +show that algorithmic reduction of inconsistency leads to superior performance. +The results also provide a theoretical basis for existing methods such as +co-distillation and ensemble. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ How to Turn Your Knowledge Graph Embeddings into Generative Models + + +
+ Some of the most successful knowledge graph embedding (KGE) models for link +prediction -- CP, RESCAL, TuckER, ComplEx -- can be interpreted as energy-based +models. Under this perspective they are not amenable to exact +maximum-likelihood estimation (MLE) or sampling, and they struggle to integrate logical +constraints. This work re-interprets the score functions of these KGEs as +circuits -- constrained computational graphs allowing efficient +marginalisation. Then, we design two recipes to obtain efficient generative +circuit models by either restricting their activations to be non-negative or +squaring their outputs. Our interpretation comes with little or no loss of +performance for link prediction, while the circuits framework unlocks exact +learning by MLE, efficient sampling of new triples, and a guarantee that logical +constraints are satisfied by design. Furthermore, our models scale more +gracefully than the original KGEs on graphs with millions of entities. + +
+
+
+
+
+ + ♻ ☆ Attacks on Online Learners: a Teacher-Student Analysis + + +
+ Machine learning models are famously vulnerable to adversarial attacks: small +ad-hoc perturbations of the data that can catastrophically alter the model +predictions. While a large literature has studied the case of test-time attacks +on pre-trained models, the important case of attacks in an online learning +setting has received little attention so far. In this work, we use a +control-theoretical perspective to study the scenario where an attacker may +perturb data labels to manipulate the learning dynamics of an online learner. +We perform a theoretical analysis of the problem in a teacher-student setup, +considering different attack strategies, and obtaining analytical results for +the steady state of simple linear learners. These results enable us to prove +that a discontinuous transition in the learner's accuracy occurs when the +attack strength exceeds a critical threshold. We then study empirically attacks +on learners with complex architectures using real data, confirming the insights +of our theoretical analysis. Our findings show that greedy attacks can be +extremely efficient, especially when data stream in small batches. + +
+
+ comment: 19 pages, 10 figures +
+
+
+
+
+ + ♻ ☆ Modeling Fission Gas Release at the Mesoscale using Multiscale DenseNet + Regression with Attention Mechanism and Inception Blocks + + +
+ Mesoscale simulations of fission gas release (FGR) in nuclear fuel provide a +powerful tool for understanding how microstructure evolution impacts FGR, but +they are computationally intensive. In this study, we present an alternate, +data-driven approach, using deep learning to predict instantaneous FGR flux +from 2D nuclear fuel microstructure images. Four convolutional neural network +(CNN) architectures with multiscale regression are trained and evaluated on +simulated FGR data generated using a hybrid phase field/cluster dynamics model. +All four networks show high predictive power, with $R^{2}$ values above 98%. +The best performing network combines a Convolutional Block Attention Module +(CBAM) and InceptionNet mechanisms to provide superior accuracy (mean absolute +percentage error of 4.4%), training stability, and robustness on very low +instantaneous FGR flux values. + +
+
+ comment: Submitted to Journal of Nuclear Materials, 20 pages, 10 figures, 3 + tables +
+
+
+
+
+ + ♻ ☆ On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2023 + + +
+ Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented +performance in response generation, especially with visual inputs, enabling +more creative and adaptable interaction than large language models such as +ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since +adversaries may successfully evade the entire system by subtly manipulating the +most vulnerable modality (e.g., vision). To this end, we propose evaluating the +robustness of open-source large VLMs in the most realistic and high-risk +setting, where adversaries have only black-box system access and seek to +deceive the model into returning the targeted responses. In particular, we +first craft targeted adversarial examples against pretrained models such as +CLIP and BLIP, and then transfer these adversarial examples to other VLMs such +as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we +observe that black-box queries on these VLMs can further improve the +effectiveness of targeted evasion, resulting in a surprisingly high success +rate for generating targeted responses. Our findings provide a quantitative +understanding regarding the adversarial vulnerability of large VLMs and call +for a more thorough examination of their potential security flaws before +deployment in practice. Code is at https://github.com/yunqing-me/AttackVLM. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ ODE-based Recurrent Model-free Reinforcement Learning for POMDPs NeurIPS 2023 + + +
+ Neural ordinary differential equations (ODEs) are widely recognized as the +standard for modeling physical mechanisms, which help to perform approximate +inference in unknown physical or biological environments. In partially +observable (PO) environments, inferring unseen information from raw +observations is challenging for agents. By using a recurrent policy with a compact +context, context-based reinforcement learning provides a flexible way to +extract unobservable information from historical transitions. To help the agent +extract more dynamics-related information, we present a novel ODE-based +recurrent model combined with a model-free reinforcement learning (RL) framework +to solve partially observable Markov decision processes (POMDPs). We +experimentally demonstrate the efficacy of our methods across various PO +continuous control and meta-RL tasks. Furthermore, our experiments illustrate +that our method is robust against irregular observations, owing to the ability +of ODEs to model irregularly-sampled time series. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Information Design in Multi-Agent Reinforcement Learning + + +
+ Reinforcement learning (RL) is inspired by the way human infants and animals +learn from the environment. The setting is somewhat idealized because, in +actual tasks, other agents in the environment have their own goals and behave +adaptively to the ego agent. To thrive in those environments, the agent needs +to influence other agents so their actions become more helpful and less +harmful. Research in computational economics distills two ways to influence +others directly: by providing tangible goods (mechanism design) and by +providing information (information design). This work investigates information +design problems for a group of RL agents. The main challenges are two-fold. One +is the information provided will immediately affect the transition of the agent +trajectories, which introduces additional non-stationarity. The other is the +information can be ignored, so the sender must provide information that the +receiver is willing to respect. We formulate the Markov signaling game, and +develop the notions of signaling gradient and the extended obedience +constraints that address these challenges. Our algorithm is efficient on +various mixed-motive tasks and provides further insights into computational +economics. Our code is publicly available at +https://github.com/YueLin301/InformationDesignMARL. + +
+
+
+
+
+ + ♻ ☆ Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face + Recognition + + +
+ Face recognition systems are widely deployed in safety-critical applications, +including law enforcement, yet they exhibit bias across a range of +socio-demographic dimensions, such as gender and race. Conventional wisdom +dictates that model biases arise from biased training data. As a consequence, +previous works on bias mitigation largely focused on pre-processing the +training data, adding penalties to prevent bias from affecting the model during +training, or post-processing predictions to debias them, yet these approaches +have shown limited success on hard problems such as face recognition. In our +work, we discover that biases are actually inherent to neural network +architectures themselves. Following this reframing, we conduct the first neural +architecture search for fairness, jointly with a search for hyperparameters. +Our search outputs a suite of models which Pareto-dominate all other +high-performance architectures and existing bias mitigation methods in terms of +accuracy and fairness, often by large margins, on the two most widely used +datasets for face identification, CelebA and VGGFace2. Furthermore, these +models generalize to other datasets and sensitive attributes. We release our +code, models and raw data files at https://github.com/dooleys/FR-NAS. + +
+
+
+
+
+ + ♻ ☆ On Calibrating Diffusion Probabilistic Models NeurIPS 2023 + + +
+ Recently, diffusion probabilistic models (DPMs) have achieved promising +results in diverse generative tasks. A typical DPM framework includes a forward +process that gradually diffuses the data distribution and a reverse process +that recovers the data distribution from time-dependent data scores. In this +work, we observe that the stochastic reverse process of data scores is a +martingale, from which concentration bounds and the optional stopping theorem +for data scores can be derived. Then, we discover a simple way for calibrating +an arbitrary pretrained DPM, with which the score matching loss can be reduced +and the lower bounds of model likelihood can consequently be increased. We +provide general calibration guidelines under various model parametrizations. +Our calibration method is performed only once and the resulting models can be +used repeatedly for sampling. We conduct experiments on multiple datasets to +empirically validate our proposal. Our code is at +https://github.com/thudzj/Calibrated-DPMs. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Bayesian Dynamic DAG Learning: Application in Discovering Dynamic + Effective Connectome of Brain + + +
+ The complex mechanisms of the brain can be unraveled by +extracting the Dynamic Effective Connectome (DEC). Recently, score-based +Directed Acyclic Graph (DAG) discovery methods have shown significant +improvements in extracting the causal structure and inferring effective +connectivity. However, learning DEC through these methods still faces two main +challenges: one with the fundamental impotence of high-dimensional dynamic DAG +discovery methods and the other with the low quality of fMRI data. In this +paper, we introduce the Bayesian Dynamic DAG learning with M-matrices Acyclicity +characterization (BDyMA) method to address the challenges in +discovering DEC. The presented dynamic causal model enables us to discover +bidirected edges as well. Leveraging an unconstrained framework in the BDyMA +method leads to more accurate results in detecting high-dimensional networks, +achieving sparser outcomes, making it particularly suitable for extracting DEC. +Additionally, the score function of the BDyMA method allows the incorporation +of prior knowledge into the process of dynamic causal discovery which further +enhances the accuracy of results. Comprehensive simulations on synthetic data +and experiments on Human Connectome Project (HCP) data demonstrate that our +method can handle both main challenges, yielding more accurate and +reliable DEC compared to state-of-the-art and baseline methods. Additionally, +we investigate the trustworthiness of DTI data as prior knowledge for DEC +discovery and show the improvements in DEC discovery when the DTI data is +incorporated into the process. + +
+
+
+
+
+ + ♻ ☆ Physics-Driven ML-Based Modelling for Correcting Inverse Estimation + + +
+ When deploying machine learning estimators in science and engineering (SAE) +domains, it is critical to avoid failed estimations that can have disastrous +consequences, e.g., in aero engine design. This work focuses on detecting and +correcting failed state estimations before adopting them in SAE inverse +problems, by utilizing simulations and performance metrics guided by physical +laws. We suggest flagging a machine learning estimation when its physical model +error exceeds a feasible threshold, and propose a novel approach, GEESE, to +correct it through optimization, aiming at delivering both low error and high +efficiency. The key designs of GEESE include (1) a hybrid surrogate error model +to provide fast error estimations to reduce simulation cost and to enable +gradient based backpropagation of error feedback, and (2) two generative models +to approximate the probability distributions of the candidate states for +simulating the exploitation and exploration behaviours. All three models are +constructed as neural networks. GEESE is tested on three real-world SAE inverse +problems and compared to a number of state-of-the-art optimization/search +approaches. Results show that it fails the fewest times in terms of +finding a feasible state correction, and requires physical evaluations less +frequently in general. + +
+
+ comment: 19 pages, the paper is accepted by NeurIPS 2023 as a spotlight +
+
+
+
+
+ + ♻ ☆ Music Augmentation and Denoising For Peak-Based Audio Fingerprinting + + +
+ Audio fingerprinting is a well-established solution for song identification +from short recording excerpts. Popular methods rely on the extraction of sparse +representations, generally spectral peaks, and have proven to be accurate, +fast, and scalable to large collections. However, real-world applications of +audio identification often happen in noisy environments, which can cause these +systems to fail. In this work, we tackle this problem by introducing and +releasing a new audio augmentation pipeline that adds noise to music snippets +in a realistic way, by stochastically mimicking real-world scenarios. We then +propose and release a deep learning model that removes noisy components from +spectrograms in order to improve peak-based fingerprinting systems' accuracy. +We show that the addition of our model improves the identification performance +of commonly used audio fingerprinting systems, even under noisy conditions. + +
+
+
+
+
+ + ♻ ☆ Data-Driven Network Neuroscience: On Data Collection and Benchmark + + +
+ This paper presents a comprehensive and quality collection of functional +human brain network data for potential research in the intersection of +neuroscience, machine learning, and graph analytics. Anatomical and functional +MRI images have been used to understand the functional connectivity of the +human brain and are particularly important in identifying underlying +neurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism. +Recently, the study of the brain in the form of brain networks using machine +learning and graph analytics has become increasingly popular, especially to +predict the early onset of these conditions. A brain network, represented as a +graph, retains rich structural and positional information that traditional +examination methods are unable to capture. However, the lack of publicly +accessible brain network data prevents researchers from data-driven +explorations. One of the main difficulties lies in the complicated +domain-specific preprocessing steps and the exhaustive computation required to +convert the data from MRI images into brain networks. We bridge this gap by +collecting a large amount of MRI images from public databases and a private +source, working with domain experts to make sensible design choices, and +preprocessing the MRI images to produce a collection of brain network datasets. +The datasets originate from 6 different sources, cover 4 brain conditions, and +consist of a total of 2,702 subjects. We test our graph datasets on 12 machine +learning models to provide baselines and validate the data quality on a recent +graph analysis model. To lower the barrier to entry and promote the research in +this interdisciplinary field, we release our brain network data and complete +preprocessing details including codes at +https://doi.org/10.17608/k6.auckland.21397377 and +https://github.com/brainnetuoa/data_driven_network_neuroscience. + +
+
+
+
+
+ + ♻ ☆ Compression with Bayesian Implicit Neural Representations NeurIPS 2023 + + +
+ Many common types of data can be represented as functions that map +coordinates to signal values, such as pixel locations to RGB values in the case +of an image. Based on this view, data can be compressed by overfitting a +compact neural network to its functional representation and then encoding the +network weights. However, most current solutions for this are inefficient, as +quantization to low-bit precision substantially degrades the reconstruction +quality. To address this issue, we propose overfitting variational Bayesian +neural networks to the data and compressing an approximate posterior weight +sample using relative entropy coding instead of quantizing and entropy coding +it. This strategy enables direct optimization of the rate-distortion +performance by minimizing the $\beta$-ELBO, and targeting different +rate-distortion trade-offs for a given network architecture by adjusting +$\beta$. Moreover, we introduce an iterative algorithm for learning prior +weight distributions and employ a progressive refinement process for the +variational posterior that significantly enhances performance. Experiments show +that our method achieves strong performance on image and audio compression +while retaining simplicity. + +
+
+ comment: Accepted as a Spotlight paper in NeurIPS 2023. Updated camera-ready + version +
+
+
+
+
+ + ♻ ☆ Neural Injective Functions for Multisets, Measures and Graphs via a + Finite Witness Theorem NeurIPS 2023 + + +
+ Injective multiset functions have a key role in the theoretical study of +machine learning on multisets and graphs. Yet, there remains a gap between the +provably injective multiset functions considered in theory, which typically +rely on polynomial moments, and the multiset functions used in practice, which +rely on $\textit{neural moments}$ $\unicode{x2014}$ whose injectivity on +multisets has not been studied to date. + In this paper, we bridge this gap by showing that moments of neural networks +do define injective multiset functions, provided that an analytic +non-polynomial activation is used. The number of moments required by our theory +is optimal essentially up to a multiplicative factor of two. To prove this +result, we state and prove a $\textit{finite witness theorem}$, which is of +independent interest. + As a corollary to our main theorem, we derive new approximation results for +functions on multisets and measures, and new separation results for graph +neural networks. We also provide two negative results: (1) moments of +piecewise-linear neural networks cannot be injective multiset functions; and +(2) even when moment-based multiset functions are injective, they can never be +bi-Lipschitz. + +
+
+ comment: NeurIPS 2023 camera-ready +
+
+
+
+
+ + ♻ ☆ Stochastic Approximation Approaches to Group Distributionally Robust + Optimization + + +
+ This paper investigates group distributionally robust optimization (GDRO), +with the goal of learning a model that performs well over $m$ different +distributions. First, we formulate GDRO as a stochastic convex-concave +saddle-point problem, and demonstrate that stochastic mirror descent (SMD), +using $m$ samples in each iteration, achieves an $O(m (\log m)/\epsilon^2)$ +sample complexity for finding an $\epsilon$-optimal solution, which matches the +$\Omega(m/\epsilon^2)$ lower bound up to a logarithmic factor. Then, we make +use of techniques from online learning to reduce the number of samples required +in each round from $m$ to $1$, keeping the same sample complexity. +Specifically, we cast GDRO as a two-player game where one player simply +performs SMD and the other executes an online algorithm for non-oblivious +multi-armed bandits. Next, we consider a more practical scenario where the +number of samples that can be drawn from each distribution is different, and +propose a novel formulation of weighted GDRO, which allows us to derive +distribution-dependent convergence rates. Denote by $n_i$ the sample budget for +the $i$-th distribution, and assume $n_1 \geq n_2 \geq \cdots \geq n_m$. In the +first approach, we incorporate non-uniform sampling into SMD such that the +sample budget is satisfied in expectation, and prove that the excess risk of +the $i$-th distribution decreases at an $O(\sqrt{n_1 \log m}/n_i)$ rate. In the +second approach, we use mini-batches to meet the budget exactly and also reduce +the variance in stochastic gradients, and then leverage the stochastic mirror-prox +algorithm, which can exploit small variances, to optimize a carefully designed +weighted GDRO problem. Under appropriate conditions, it attains an $O((\log +m)/\sqrt{n_i})$ convergence rate, which almost matches the optimal +$O(\sqrt{1/n_i})$ rate of only learning from the $i$-th distribution with $n_i$ +samples. + +
+
+
+
+
+ + ♻ ☆ Kernelized Cumulants: Beyond Kernel Mean Embeddings + + +
+ In $\mathbb R^d$, it is well-known that cumulants provide an alternative to +moments that can achieve the same goals with numerous benefits such as lower +variance estimators. In this paper we extend cumulants to reproducing kernel +Hilbert spaces (RKHS) using tools from tensor algebras and show that they are +computationally tractable by a kernel trick. These kernelized cumulants provide +a new set of all-purpose statistics; the classical maximum mean discrepancy and +Hilbert-Schmidt independence criterion arise as the degree one objects in our +general construction. We argue both theoretically and empirically (on +synthetic, environmental, and traffic data analysis) that going beyond degree +one has several advantages and can be achieved with the same computational +complexity and minimal overhead in our experiments. + +
+
+ comment: 19 pages, 8 figures +
+
+
+
+
+ + ♻ ☆ Medical Profile Model: Scientific and Practical Applications in + Healthcare + + +
+ The paper studies the problem of representation learning for electronic +health records. We present the patient histories as temporal sequences of +diseases for which embeddings are learned in an unsupervised setup with a +transformer-based neural network model. Additionally, the embedding space +includes demographic parameters which allow the creation of generalized patient +profiles and successful transfer of medical knowledge to other domains. The +training of such a medical profile model has been performed on a dataset of +more than one million patients. Detailed model analysis and its comparison with +the state-of-the-art method show its clear advantage in the diagnosis +prediction task. Further, we show two applications based on the developed +profile model. First, a novel Harbinger Disease Discovery method that +reveals disease-associated hypotheses and is potentially beneficial in the +design of epidemiological studies. Second, the patient embeddings extracted +from the profile model applied to the insurance scoring task allow significant +improvement in the performance metrics. + +
+
+ comment: 8 pages, code available at + https://github.com/sberbank-ai-lab/mimic.profile, accepted for publication at + IEEE JBHI +
+
+
+
+
+ + ♻ ☆ On Momentum-Based Gradient Methods for Bilevel Optimization with + Nonconvex Lower-Level + + +
+ Bilevel optimization is a popular two-level hierarchical optimization, which +has been widely applied to many machine learning tasks such as hyperparameter +learning, meta learning and continual learning. Although many bilevel +optimization methods have recently been developed, the bilevel methods are not +well studied when the lower-level problem is nonconvex. To fill this gap, in +this paper, we study a class of nonconvex bilevel optimization problems, where +both upper-level and lower-level problems are nonconvex, and the lower-level +problem satisfies the Polyak-Łojasiewicz (PL) condition. We propose an efficient +momentum-based gradient bilevel method (MGBiO) to solve these deterministic +problems. Meanwhile, we propose a class of efficient momentum-based stochastic +gradient bilevel methods (MSGBiO and VR-MSGBiO) to solve these stochastic +problems. Moreover, we provide a useful convergence analysis framework for our +methods. Specifically, under some mild conditions, we prove that our MGBiO +method has a sample (or gradient) complexity of $O(\epsilon^{-2})$ for finding +an $\epsilon$-stationary solution of the deterministic bilevel problems (i.e., +$\|\nabla F(x)\|\leq \epsilon$), which improves the existing best results by a +factor of $O(\epsilon^{-1})$. Meanwhile, we prove that our MSGBiO and VR-MSGBiO +methods have sample complexities of $\tilde{O}(\epsilon^{-4})$ and +$\tilde{O}(\epsilon^{-3})$, respectively, in finding an $\epsilon$-stationary +solution of the stochastic bilevel problems (i.e., $\mathbb{E}\|\nabla +F(x)\|\leq \epsilon$), which improves the existing best results by a factor of +$\tilde{O}(\epsilon^{-3})$. Extensive experimental results on bilevel PL game +and hyper-representation learning demonstrate the efficiency of our algorithms. +This paper commemorates the mathematician Boris Polyak (1935-2023). + +
+
+ comment: In new version of our paper, we relaxed some assumptions, updated our + algorithms and added some numerical experiments +
+
+
+
+
+ + ♻ ☆ Federated Learning of Large Language Models with Parameter-Efficient + Prompt Tuning and Adaptive Optimization EMNLP 2023 + + +
+ Federated learning (FL) is a promising paradigm to enable collaborative model +training with decentralized data. However, the training process of Large +Language Models (LLMs) generally requires updating a large number of parameters, +which limits the applicability of FL techniques to tackle the LLMs in real +scenarios. Prompt tuning can significantly reduce the number of parameters to +update, but it either incurs performance degradation or low training +efficiency. The straightforward utilization of prompt tuning in FL often +raises non-trivial communication costs and dramatically degrades performance. +In addition, the decentralized data is generally non-Independent and +Identically Distributed (non-IID), which brings client drift problems and thus +poor performance. This paper proposes a Parameter-efficient prompt Tuning +approach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and +effective FL of LLMs. First, an efficient partial prompt tuning approach is +proposed to improve performance and efficiency simultaneously. Second, a novel +adaptive optimization method is developed to address the client drift problems +on both the device and server sides to enhance performance further. Extensive +experiments based on 10 datasets demonstrate the superb performance (up to +60.8\% in terms of accuracy) and efficiency (up to 97.59\% in terms of training +time) of FedPepTAO compared with 9 baseline approaches. Our code is available +at https://github.com/llm-eff/FedPepTAO. + +
+
+ comment: 18 pages, accepted by EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Reliable learning in challenging environments + + +
+ The problem of designing learners that provide guarantees that their +predictions are provably correct is of increasing importance in machine +learning. However, learning theoretic guarantees have only been considered in +very specific settings. In this work, we consider the design and analysis of +reliable learners in challenging test-time environments as encountered in +modern machine learning problems: namely `adversarial' test-time attacks (in +several variations) and `natural' distribution shifts. In this work, we provide +a reliable learner with provably optimal guarantees in such settings. We +discuss computationally feasible implementations of the learner and further +show that our algorithm achieves strong positive performance guarantees on +several natural examples: for example, linear separators under log-concave +distributions or smooth boundary classifiers under smooth probability +distributions. + +
+
+
+
+
+ + ♻ ☆ Realistic Synthetic Financial Transactions for Anti-Money Laundering + Models + + +
+ With the widespread digitization of finance and the increasing popularity of +cryptocurrencies, the sophistication of fraud schemes devised by cybercriminals +is growing. Money laundering -- the movement of illicit funds to conceal their +origins -- can cross bank and national boundaries, producing complex +transaction patterns. The UN estimates that 2-5\% of global GDP, or \$0.8 - \$2.0 +trillion, is laundered globally each year. Unfortunately, real data to +train machine learning models to detect laundering is generally not available, +and previous synthetic data generators have had significant shortcomings. A +realistic, standardized, publicly-available benchmark is needed for comparing +models and for the advancement of the area. To this end, this paper contributes +a synthetic financial transaction dataset generator and a set of synthetically +generated AML (Anti-Money Laundering) datasets. We have calibrated this +agent-based generator to match real transactions as closely as possible and +made the datasets public. We describe the generator in detail and demonstrate +how the datasets generated can help compare different Graph Neural Networks in +terms of their AML abilities. In a key way, using synthetic data in these +comparisons can be even better than using real data: the ground truth labels +are complete, whilst many laundering transactions in real data are never +detected. + +
+
+
+
+
+ + ♻ ☆ On Parametric Optimal Execution and Machine Learning Surrogates + + +
+ We investigate optimal order execution problems in discrete time with +instantaneous price impact and stochastic resilience. First, in the setting of +linear transient price impact we derive a closed-form recursion for the optimal +strategy, extending the deterministic results from Obizhaeva and Wang (J +Financial Markets, 2013). Second, we develop a numerical algorithm based on +dynamic programming and deep learning for the case of nonlinear transient price +impact as proposed by Bouchaud et al. (Quant. Finance, 2004). Specifically, we +utilize an actor-critic framework that constructs two neural-network (NN) +surrogates for the value function and the feedback control. The flexible +scalability of NN functional approximators enables parametric learning, i.e., +incorporating several model or market parameters as part of the input space. +Precise calibration of price impact, resilience, etc., is known to be extremely +challenging and hence it is critical to understand sensitivity of the execution +policy to these parameters. Our NN learner organically scales across multiple +input dimensions and is shown to accurately approximate optimal strategies +across a wide range of parameter configurations. We provide a fully +reproducible Jupyter Notebook with our NN implementation, which is of +independent pedagogical interest, demonstrating the ease of use of NN +surrogates in (parametric) stochastic control problems. + +
+
+ comment: 33 pages, 8 figures. Github repo at + https://github.com/moritz-voss/Parametric_Optimal_Execution_ML +
+
+
+
+
+ + ♻ ☆ SBMLtoODEjax: Efficient Simulation and Optimization of Biological + Network Models in JAX + + +
+ Advances in bioengineering and biomedicine demand a deep understanding of the +dynamic behavior of biological systems, ranging from protein pathways to +complex cellular processes. Biological networks like gene regulatory networks +and protein pathways are key drivers of embryogenesis and physiological +processes. Comprehending their diverse behaviors is essential for tackling +diseases, including cancer, as well as for engineering novel biological +constructs. Despite the availability of extensive mathematical models +represented in Systems Biology Markup Language (SBML), researchers face +significant challenges in exploring the full spectrum of behaviors and +optimizing interventions to efficiently shape those behaviors. Existing tools +designed for simulation of biological network models are not tailored to +facilitate interventions on network dynamics nor to facilitate automated +discovery. Leveraging recent developments in machine learning (ML), this paper +introduces SBMLtoODEjax, a lightweight library designed to seamlessly integrate +SBML models with ML-supported pipelines, powered by JAX. SBMLtoODEjax +facilitates the reuse and customization of SBML-based models, harnessing JAX's +capabilities for efficient parallel simulations and optimization, with the aim +to accelerate research in biological network analysis. + +
+
+
+
+
+ + ♻ ☆ Hyperbolic VAE via Latent Gaussian Distributions + + +
+ We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent +space consists of a set of Gaussian distributions. It is known that the set of +the univariate Gaussian distributions with the Fisher information metric forms a +hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed +with the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal +distribution based on the Kullback-Leibler divergence, a local approximation of +the squared Fisher-Rao distance, to define a density over the latent space. In +experiments, we demonstrate the efficacy of GM-VAE on two different tasks: +density estimation of image datasets and environment modeling in model-based +reinforcement learning. GM-VAE outperforms the other variants of hyperbolic- +and Euclidean-VAEs on density estimation tasks and shows competitive +performance in model-based reinforcement learning. We observe that our model +provides strong numerical stability, addressing a common limitation reported in +previous hyperbolic-VAEs. + +
+
+ comment: 20 pages, Thirty-seventh Conference on Neural Information Processing + Systems, 2023 +
+
+
+
+
+ + ♻ ☆ CubeTR: Learning to Solve The Rubiks Cube Using Transformers + + +
+ Since their first appearance, transformers have been successfully used in +wide-ranging domains from computer vision to natural language processing. +Application of transformers in Reinforcement Learning by reformulating it as a +sequence modelling problem was proposed only recently. Compared to other +commonly explored reinforcement learning problems, the Rubiks cube poses a +unique set of challenges. The Rubiks cube has a single solved state for +quintillions of possible configurations which leads to extremely sparse +rewards. The proposed model CubeTR attends to longer sequences of actions and +addresses the problem of sparse rewards. CubeTR learns how to solve the Rubiks +cube from arbitrary starting states without any human prior, and after move +regularisation, the lengths of solutions generated by it are expected to be +very close to those given by algorithms used by expert human solvers. CubeTR +provides insights into the generalisability of learning algorithms to higher +dimensional cubes and the applicability of transformers in other relevant +sparse reward scenarios. + +
+
+ comment: It has untested ideas without supporting experimentation. + Discontinued work in this direction +
+
+
+
+
+ + ♻ ☆ Superclustering by finding statistically significant separable groups of + optimal gaussian clusters + + +
+ The paper presents an algorithm for clustering a dataset by grouping the +optimal, from the point of view of the BIC criterion, number of Gaussian +clusters into the optimal, from the point of view of their statistical +separability, superclusters. + The algorithm consists of three stages: representation of the dataset as a +mixture of Gaussian distributions (clusters), whose number is determined based +on the minimum of the BIC criterion; using the Mahalanobis distance to +estimate the distances between the clusters and cluster sizes; combining the +resulting clusters into superclusters using the DBSCAN method by finding its +hyperparameter (maximum distance) providing the maximum value of the introduced matrix +quality criterion at the maximum number of superclusters. The matrix quality +criterion corresponds to the proportion of statistically significant separated +superclusters among all found superclusters. + The algorithm has only one hyperparameter, the statistical significance level, +and automatically detects the optimal number and shape of superclusters based on +a statistical hypothesis testing approach. The algorithm demonstrates +good results on test datasets in noisy and noiseless situations. An essential +advantage of the algorithm is its ability to predict the correct supercluster for +new data based on an already trained clusterer and perform soft (fuzzy) +clustering. The disadvantages of the algorithm are its low speed and the +stochastic nature of the final clustering. It requires a sufficiently large +dataset for clustering, which is typical for many statistical methods. + +
+
+ comment: 25 pages, 6 figures, 1 table +
+
+
+
+
+ + ♻ ☆ Understanding and Improving Feature Learning for Out-of-Distribution + Generalization NeurIPS 2023 + + +
+ A common explanation for the failure of out-of-distribution (OOD) +generalization is that the model trained with empirical risk minimization (ERM) +learns spurious features instead of invariant features. However, several recent +studies challenged this explanation and found that deep networks may have +already learned sufficiently good features for OOD generalization. Despite the +contradictions at first glance, we theoretically show that ERM essentially +learns both spurious and invariant features, while ERM tends to learn spurious +features faster if the spurious correlation is stronger. Moreover, when the +ERM-learned features are fed to the OOD objectives, the invariant feature learning +quality significantly affects the final OOD performance, as OOD objectives +rarely learn new features. Therefore, ERM feature learning can be a bottleneck +to OOD generalization. To alleviate the reliance, we propose Feature Augmented +Training (FeAT), to enforce the model to learn richer features ready for OOD +generalization. FeAT iteratively augments the model to learn new features while +retaining the already learned features. In each round, the retention and +augmentation operations are performed on different subsets of the training data +that capture distinct features. Extensive experiments show that FeAT +effectively learns richer features thus boosting the performance of various OOD +objectives. + +
+
+ comment: Yongqiang Chen, Wei Huang, and Kaiwen Zhou contributed equally; + NeurIPS 2023, 55 pages, 64 figures +
+
+
+
+
+ + ♻ ☆ Interactive Visual Reasoning under Uncertainty NeurIPS 2023 + + +
+ One of the fundamental cognitive abilities of humans is to quickly resolve +uncertainty by generating hypotheses and testing them via active trials. +Encountering a novel phenomenon accompanied by ambiguous cause-effect +relationships, humans make hypotheses against data, conduct inferences from +observation, test their theory via experimentation, and correct the proposition +if inconsistency arises. These iterative processes persist until the underlying +mechanism becomes clear. In this work, we devise the IVRE (pronounced as +"ivory") environment for evaluating artificial agents' reasoning ability under +uncertainty. IVRE is an interactive environment featuring rich scenarios +centered around Blicket detection. Agents in IVRE are placed into environments +with various ambiguous action-effect pairs and asked to determine each object's +role. They are encouraged to propose effective and efficient experiments to +validate their hypotheses based on observations and actively gather new +information. The game ends when all uncertainties are resolved or the maximum +number of trials is consumed. By evaluating modern artificial agents in IVRE, +we notice a clear failure of today's learning methods compared to humans. Such +inefficacy in interactive reasoning ability under uncertainty calls for future +research in building human-like intelligence. + +
+
+ comment: Accepted at NeurIPS 2023 (Datasets and Benchmarks) +
+
+
+
+
+ + ♻ ☆ DELTA: Diverse Client Sampling for Fasting Federated Learning NeurIPS 2023 + + +
+ Partial client participation has been widely adopted in Federated Learning +(FL) to reduce the communication burden efficiently. However, an inadequate +client sampling scheme can lead to the selection of unrepresentative subsets, +resulting in significant variance in model updates and slowed convergence. +Existing sampling methods are either biased or can be further optimized for +faster convergence. In this paper, we present DELTA, an unbiased sampling scheme +designed to alleviate these issues. DELTA characterizes the effects of client +diversity and local variance, and samples representative clients with valuable +information for global model updates. In addition, DELTA is a proven optimal +unbiased sampling scheme that minimizes variance caused by partial client +participation and outperforms other unbiased sampling schemes in terms of +convergence. Furthermore, to address full-client gradient dependence, we provide +a practical version of DELTA depending on the available clients' information, +and also analyze its convergence. Our results are validated through experiments +on both synthetic and real-world datasets. + +
+
+ comment: Accepted by Thirty-seventh Conference on Neural Information + Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Sampling from Gaussian Process Posteriors using Stochastic Gradient + Descent + + +
+ Gaussian processes are a powerful framework for quantifying uncertainty and +for sequential decision-making but are limited by the requirement of solving +linear systems. In general, this has a cubic cost in dataset size and is +sensitive to conditioning. We explore stochastic gradient algorithms as a +computationally efficient method of approximately solving these linear systems: +we develop low-variance optimization objectives for sampling from the posterior +and extend these to inducing points. Counterintuitively, stochastic gradient +descent often produces accurate predictions, even in cases where it does not +converge quickly to the optimum. We explain this through a spectral +characterization of the implicit bias from non-convergence. We show that +stochastic gradient descent produces predictive distributions close to the true +posterior both in regions with sufficient data coverage, and in regions +sufficiently far away from the data. Experimentally, stochastic gradient +descent achieves state-of-the-art performance on sufficiently large-scale or +ill-conditioned regression tasks. Its uncertainty estimates match the +performance of significantly more expensive baselines on a large-scale +Bayesian optimization task. + +
+
+
+
+
+ + ♻ ☆ Posterior Contraction Rates for Matérn Gaussian Processes on + Riemannian Manifolds + + +
+ Gaussian processes are used in many machine learning applications that rely +on uncertainty quantification. Recently, computational tools for working with +these models in geometric settings, such as when inputs lie on a Riemannian +manifold, have been developed. This raises the question: can these intrinsic +models be shown theoretically to lead to better performance, compared to simply +embedding all relevant quantities into $\mathbb{R}^d$ and using the restriction +of an ordinary Euclidean Gaussian process? To study this, we prove optimal +contraction rates for intrinsic Matérn Gaussian processes defined on compact +Riemannian manifolds. We also prove analogous rates for extrinsic processes +using trace and extension theorems between manifold and ambient Sobolev spaces: +somewhat surprisingly, the rates obtained turn out to coincide with those of +the intrinsic processes, provided that their smoothness parameters are matched +appropriately. We illustrate these rates empirically on a number of examples, +which, mirroring prior work, show that intrinsic processes can achieve better +performance in practice. Therefore, our work shows that finer-grained analyses +are needed to distinguish between different levels of data-efficiency of +geometric Gaussian processes, particularly in settings which involve small data +set sizes and non-asymptotic behavior. + +
+
+
+
+
+ + ♻ ☆ Unlocking Deterministic Robustness Certification on ImageNet + + +
+ Despite the promise of Lipschitz-based methods for provably-robust deep +learning with deterministic guarantees, current state-of-the-art results are +limited to feed-forward Convolutional Networks (ConvNets) on low-dimensional +data, such as CIFAR-10. This paper investigates strategies for expanding +certifiably robust training to larger, deeper models. A key challenge in +certifying deep networks is efficient calculation of the Lipschitz bound for +residual blocks found in ResNet and ViT architectures. We show that fast ways +of bounding the Lipschitz constant for conventional ResNets are loose, and show +how to address this by designing a new residual block, leading to the +Linear ResNet (LiResNet) architecture. We then introduce Efficient +Margin MAximization (EMMA), a loss function that stabilizes robust training by +simultaneously penalizing worst-case adversarial examples from all +classes. Together, these contributions yield new state-of-the-art robust +accuracy on CIFAR-10/100 and Tiny-ImageNet under $\ell_2$ perturbations. +Moreover, for the first time, we are able to scale up fast deterministic +robustness guarantees to ImageNet, demonstrating that this approach to robust +learning can be applied to real-world applications. + We release our code on GitHub: https://github.com/klasleino/gloro. + +
+
+
+
+
+ + ♻ ☆ Evaluating and Inducing Personality in Pre-trained Language Models NeurIPS 2023 + + +
+ Standardized and quantified evaluation of machine behaviors is a crux of +understanding LLMs. In this study, we draw inspiration from psychometric +studies by leveraging human personality theory as a tool for studying machine +behaviors. Originating as a philosophical quest for human behaviors, the study +of personality delves into how individuals differ in thinking, feeling, and +behaving. Toward building and understanding human-like social machines, we are +motivated to ask: Can we assess machine behaviors by leveraging human +psychometric tests in a principled and quantitative manner? If so, can we +induce a specific personality in LLMs? To answer these questions, we introduce +the Machine Personality Inventory (MPI) tool for studying machine behaviors; +MPI follows standardized personality tests, built upon the Big Five Personality +Factors (Big Five) theory and personality assessment inventories. By +systematically evaluating LLMs with MPI, we provide the first piece of evidence +demonstrating the efficacy of MPI in studying LLM behaviors. We further devise +a Personality Prompting (P^2) method to induce LLMs with specific personalities +in a controllable way, capable of producing diverse and verifiable behaviors. +We hope this work sheds light on future studies by adopting personality as the +essential indicator for various downstream tasks, and could further motivate +research into equally intriguing human-like machine behaviors. + +
+
+ comment: Accepted at NeurIPS 2023 (Spotlight) +
+
+
+
+
+ + ♻ ☆ Strategic Distribution Shift of Interacting Agents via Coupled Gradient + Flows + + +
+ We propose a novel framework for analyzing the dynamics of distribution shift +in real-world systems that captures the feedback loop between learning +algorithms and the distributions on which they are deployed. Prior work largely +models feedback-induced distribution shift as adversarial or via an overly +simplistic distribution-shift structure. In contrast, we propose a coupled +partial differential equation model that captures fine-grained changes in the +distribution over time by accounting for complex dynamics that arise due to +strategic responses to algorithmic decision-making, non-local endogenous +population interactions, and other exogenous sources of distribution shift. We +consider two common settings in machine learning: cooperative settings with +information asymmetries, and competitive settings where a learner faces +strategic users. For both of these settings, when the algorithm retrains via +gradient descent, we prove asymptotic convergence of the retraining procedure +to a steady state, both in finite and in infinite dimensions, obtaining +explicit rates in terms of the model parameters. To do so we derive new results +on the convergence of coupled PDEs that extend what is known on multi-species +systems. Empirically, we show that our approach captures well-documented forms +of distribution shifts like polarization and disparate impacts that simpler +models cannot capture. + +
+
+
+
+
+ + ♻ ☆ Trust, but Verify: Robust Image Segmentation using Deep Learning + + +
+ We describe a method for verifying the output of a deep neural network for +medical image segmentation that is robust to several classes of random as well +as worst-case perturbations i.e. adversarial attacks. This method is based on a +general approach recently developed by the authors called "Trust, but Verify" +wherein an auxiliary verification network produces predictions about certain +masked features in the input image using the segmentation as an input. A +well-designed auxiliary network will produce high-quality predictions when the +input segmentations are accurate, but will produce low-quality predictions when +the segmentations are incorrect. Checking the predictions of such a network +with the original image allows us to detect bad segmentations. However, to +ensure the verification method is truly robust, we need a method for checking +the quality of the predictions that does not itself rely on a black-box neural +network. Indeed, we show that previous methods for segmentation evaluation that +do use deep neural regression networks are vulnerable to false negatives i.e. +can inaccurately label bad segmentations as good. We describe the design of a +verification network that avoids such vulnerability and present results to +demonstrate its robustness compared to previous methods. + +
+
+ comment: 5 Pages, 8 Figures, conference +
+
+
+
+
+ + ♻ ☆ Robust Learning with Progressive Data Expansion Against Spurious + Correlation NeurIPS 2023 + + +
+ While deep learning models have shown remarkable performance in various +tasks, they are susceptible to learning non-generalizable spurious features +rather than the core features that are genuinely correlated to the true label. +In this paper, beyond existing analyses of linear models, we theoretically +examine the learning process of a two-layer nonlinear convolutional neural +network in the presence of spurious features. Our analysis suggests that +imbalanced data groups and easily learnable spurious features can lead to the +dominance of spurious features during the learning process. In light of this, +we propose a new training algorithm called PDE that efficiently enhances the +model's robustness for a better worst-group performance. PDE begins with a +group-balanced subset of training data and progressively expands it to +facilitate the learning of the core features. Experiments on synthetic and +real-world benchmark datasets confirm the superior performance of our method on +models such as ResNets and Transformers. On average, our method achieves a 2.8% +improvement in worst-group accuracy compared with the state-of-the-art method, +while enjoying up to 10x faster training efficiency. Codes are available at +https://github.com/uclaml/PDE. + +
+
+ comment: 22 pages, 7 figures, 11 tables. In NeurIPS 2023 +
+
+
+
+
+
+
+
+ + Multimedia 6 + +
+
+
+ + ☆ JEN-1 Composer: A Unified Framework for High-Fidelity Multi-Track Music + Generation + + +
+ With rapid advances in generative artificial intelligence, the text-to-music +synthesis task has emerged as a promising direction for music generation from +scratch. However, finer-grained control over multi-track generation remains an +open challenge. Existing models exhibit strong raw generation capability but +lack the flexibility to compose separate tracks and combine them in a +controllable manner, differing from typical workflows of human composers. To +address this issue, we propose JEN-1 Composer, a unified framework to +efficiently model marginal, conditional, and joint distributions over +multi-track music via a single model. JEN-1 Composer framework exhibits the +capacity to seamlessly incorporate any diffusion-based music generation system, +e.g., Jen-1, enhancing its capacity for versatile multi-track music +generation. We introduce a curriculum training strategy aimed at incrementally +instructing the model in the transition from single-track generation to the +flexible generation of multi-track combinations. During the inference, users +have the ability to iteratively produce and choose music tracks that meet their +preferences, subsequently creating an entire musical composition incrementally +following the proposed Human-AI co-composition workflow. Quantitative and +qualitative assessments demonstrate state-of-the-art performance in +controllable and high-fidelity multi-track music synthesis. The proposed JEN-1 +Composer represents a significant advance toward interactive AI-facilitated +music creation and composition. Demos will be available at +https://jenmusic.ai/audio-demos. + +
+
+ comment: Preprints +
+
+
+
+
+ + ☆ Video Frame Interpolation with Many-to-many Splatting and Spatial + Selective Refinement + + +
+ In this work, we first propose a fully differentiable Many-to-Many (M2M) +splatting framework to interpolate frames efficiently. Given a frame pair, we +estimate multiple bidirectional flows to directly forward warp the pixels to +the desired time step before fusing overlapping pixels. In doing so, each +source pixel renders multiple target pixels and each target pixel can be +synthesized from a larger area of visual context, establishing a many-to-many +splatting scheme with robustness to undesirable artifacts. For each input frame +pair, M2M has a minuscule computational overhead when interpolating an +arbitrary number of in-between frames, hence achieving fast multi-frame +interpolation. However, directly warping and fusing pixels in the intensity +domain is sensitive to the quality of motion estimation and may suffer from +less effective representation capacity. To improve interpolation accuracy, we +further extend an M2M++ framework by introducing a flexible Spatial Selective +Refinement (SSR) component, which allows for trading computational efficiency +for interpolation quality and vice versa. Instead of refining the entire +interpolated frame, SSR only processes difficult regions selected under the +guidance of an estimated error map, thereby avoiding redundant computation. +Evaluation on multiple benchmark datasets shows that our method is able to +improve the efficiency while maintaining competitive video interpolation +quality, and it can be adjusted to use more or less compute as needed. + +
+
+ comment: T-PAMI. arXiv admin note: substantial text overlap with + arXiv:2204.03513 +
+
+
+
+
+ + ♻ ☆ A Brief Yet In-Depth Survey of Deep Learning-Based Image Watermarking + + +
+ This paper presents a comprehensive survey on deep learning-based image +watermarking, a technique that entails the invisible embedding and extraction +of watermarks within a cover image, aiming to offer a seamless blend of +robustness and adaptability. We navigate the complex landscape of this +interdisciplinary domain, linking historical foundations, current innovations, +and prospective developments. Unlike existing literature, our study +concentrates exclusively on image watermarking with deep learning, delivering +an in-depth, yet brief analysis enriched by three fundamental contributions. +First, we introduce a refined categorization, segmenting the field into +Embedder-Extractor, Deep Networks as a Feature Transformation, and Hybrid +Methods. This taxonomy, inspired by the varied roles of deep learning across +studies, is designed to infuse clarity, offering readers technical insights and +directional guidance. Second, our exploration dives into representative +methodologies, encapsulating the diverse research directions and inherent +challenges within each category to provide a consolidated perspective. Lastly, +we venture beyond established boundaries to outline emerging frontiers, +offering a detailed insight into prospective research avenues. + +
+
+ comment: This paper was accepted for publication by the MDPI Applied Sciences + journal +
+
+
+
+
+ + ♻ ☆ CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency + Model ACM MM 2023 + + +
+ Denoising diffusion probabilistic models (DDPMs) have shown promising +performance for speech synthesis. However, a large number of iterative steps +are required to achieve high sample quality, which restricts the inference +speed. Maintaining sample quality while increasing sampling speed has become a +challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based +"Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a +single diffusion sampling step while achieving high audio quality. The +consistency constraint is applied to distill a consistency model from a +well-designed diffusion-based teacher model, which ultimately yields superior +performances in the distilled CoMoSpeech. Our experiments show that by +generating audio recordings by a single sampling step, the CoMoSpeech achieves +an inference speed more than 150 times faster than real-time on a single NVIDIA +A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based +speech synthesis truly practical. Meanwhile, objective and subjective +evaluations on text-to-speech and singing voice synthesis show that the +proposed teacher models yield the best audio quality, and the one-step sampling +based CoMoSpeech achieves the best inference speed with better or comparable +audio quality to other conventional multi-step diffusion model baselines. Audio +samples are available at https://comospeech.github.io/. + +
+
+ comment: Accepted to ACM MM 2023 +
+
+
+
+
+ + ♻ ☆ Perceptual Quality Assessment of Face Video Compression: A Benchmark and + An Effective Method + + +
+ Recent years have witnessed an exponential increase in the demand for face +video compression, and the success of artificial intelligence has expanded the +boundaries beyond traditional hybrid video coding. Generative coding approaches +have been identified as promising alternatives with reasonable perceptual +rate-distortion trade-offs, leveraging the statistical priors of face videos. +However, the great diversity of distortion types in spatial and temporal +domains, ranging from the traditional hybrid coding frameworks to generative +models, presents grand challenges in compressed face video quality assessment +(VQA). In this paper, we introduce the large-scale Compressed Face Video +Quality Assessment (CFVQA) database, which is the first attempt to +systematically understand the perceptual quality and diversified compression +distortions in face videos. The database contains 3,240 compressed face video +clips in multiple compression levels, which are derived from 135 source videos +with diversified content using six representative video codecs, including two +traditional methods based on hybrid coding frameworks, two end-to-end methods, +and two generative methods. In addition, a FAce VideO IntegeRity (FAVOR) index +for face video compression was developed to measure the perceptual quality, +considering the distinct content characteristics and temporal priors of the +face videos. Experimental results exhibit its superior performance on the +proposed CFVQA dataset. The benchmark is now made publicly available at: +https://github.com/Yixuan423/Compressed-Face-Videos-Quality-Assessment. + +
+
+
+
+
+ + ♻ ☆ On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS 2023 + + +
+ Large vision-language models (VLMs) such as GPT-4 have achieved unprecedented +performance in response generation, especially with visual inputs, enabling +more creative and adaptable interaction than large language models such as +ChatGPT. Nonetheless, multimodal generation exacerbates safety concerns, since +adversaries may successfully evade the entire system by subtly manipulating the +most vulnerable modality (e.g., vision). To this end, we propose evaluating the +robustness of open-source large VLMs in the most realistic and high-risk +setting, where adversaries have only black-box system access and seek to +deceive the model into returning the targeted responses. In particular, we +first craft targeted adversarial examples against pretrained models such as +CLIP and BLIP, and then transfer these adversarial examples to other VLMs such +as MiniGPT-4, LLaVA, UniDiffuser, BLIP-2, and Img2Prompt. In addition, we +observe that black-box queries on these VLMs can further improve the +effectiveness of targeted evasion, resulting in a surprisingly high success +rate for generating targeted responses. Our findings provide a quantitative +understanding regarding the adversarial vulnerability of large VLMs and call +for a more thorough examination of their potential security flaws before +deployment in practice. Code is at https://github.com/yunqing-me/AttackVLM. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 53 + +
+
+
+ + ☆ Translating away Translationese without Parallel Data EMNLP 2023 + + +
+ Translated texts exhibit systematic linguistic differences compared to +original texts in the same language, and these differences are referred to as +translationese. Translationese has effects on various cross-lingual natural +language processing tasks, potentially leading to biased results. In this +paper, we explore a novel approach to reduce translationese in translated +texts: translation-based style transfer. As there are no parallel +human-translated and original data in the same language, we use a +self-supervised approach that can learn from comparable (rather than parallel) +mono-lingual original and translated data. However, even this self-supervised +approach requires some parallel data for validation. We show how we can +eliminate the need for parallel validation data by combining the +self-supervised loss with an unsupervised loss. This unsupervised loss +leverages the original language model loss over the style-transferred output +and a semantic similarity loss between the input and style-transferred output. +We evaluate our approach in terms of original vs. translationese binary +classification in addition to measuring content preservation and target-style +fluency. The results show that our approach is able to reduce translationese +classifier accuracy to a level of a random classifier after style transfer +while adequately preserving the content and fluency in the target original +style. + +
+
+ comment: Accepted at EMNLP 2023, Main Conference +
+
+
+
+
+ + ☆ All Things Considered: Detecting Partisan Events from News Media with + Cross-Article Comparison EMNLP'23 + + +
+ Public opinion is shaped by the information news media provide, and that +information in turn may be shaped by the ideological preferences of media +outlets. But while much attention has been devoted to media bias via overt +ideological language or topic selection, a more unobtrusive way in which the +media shape opinion is via the strategic inclusion or omission of partisan +events that may support one side or the other. We develop a latent +variable-based framework to predict the ideology of news articles by comparing +multiple articles on the same story and identifying partisan events whose +inclusion or omission reveals ideology. Our experiments first validate the +existence of partisan event selection, and then show that article alignment and +cross-document comparison detect partisan events and article ideology better +than competitive baselines. Our results reveal the high-level form of media +bias, which is present even among mainstream media with strong norms of +objectivity and nonpartisanship. Our codebase and dataset are available at +https://github.com/launchnlp/ATC. + +
+
+ comment: EMNLP'23 Main Conference +
+
+
+
+
+ + ☆ Open Visual Knowledge Extraction via Relation-Oriented Multimodality + Model Prompting NeurIPS 2023 + + +
+ Images contain rich relational knowledge that can help machines understand +the world. Existing methods on visual knowledge extraction often rely on the +pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation +types), restricting the expressiveness of the extracted knowledge. In this +work, we take a first step toward a new paradigm of open visual knowledge +extraction. To achieve this, we present OpenVik which consists of an open +relational region detector to detect regions potentially containing relational +knowledge and a visual knowledge generator that generates format-free knowledge +by prompting the large multimodality model with the detected region of +interest. We also explore two data enhancement techniques for diversifying the +generated format-free visual knowledge. Extensive knowledge quality evaluations +highlight the correctness and uniqueness of the open visual knowledge extracted +by OpenVik. Moreover, integrating our extracted knowledge across various visual +reasoning applications shows consistent improvements, indicating the real-world +applicability of OpenVik. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded + Dialogue Generation + + +
+ Model hallucination has been a crucial interest of research in Natural +Language Generation (NLG). In this work, we propose sequence-level certainty as +a common theme over hallucination in NLG, and explore the correlation between +sequence-level certainty and the level of hallucination in model responses. We +categorize sequence-level certainty into two aspects: probabilistic certainty +and semantic certainty, and reveal through experiments on Knowledge-Grounded +Dialogue Generation (KGDG) task that both a higher level of probabilistic +certainty and a higher level of semantic certainty in model responses are +significantly correlated with a lower level of hallucination. What's more, we +provide theoretical proof and analysis to show that semantic certainty is a +good estimator of probabilistic certainty, and therefore has the potential as +an alternative to probability-based certainty estimation in black-box +scenarios. Based on the observation on the relationship between certainty and +hallucination, we further propose Certainty-based Response Ranking (CRR), a +decoding-time method for mitigating hallucination in NLG. Based on our +categorization of sequence-level certainty, we propose 2 types of CRR approach: +Probabilistic CRR (P-CRR) and Semantic CRR (S-CRR). P-CRR ranks individually +sampled model responses using their arithmetic mean log-probability of the +entire sequence. S-CRR approaches certainty estimation from meaning-space, and +ranks a number of model response candidates based on their semantic certainty +level, which is estimated by the entailment-based Agreement Score (AS). Through +extensive experiments across 3 KGDG datasets, 3 decoding methods, and on 4 +different models, we validate the effectiveness of our 2 proposed CRR methods +to reduce model hallucination. + +
+
+
+
+
+ + ☆ Are NLP Models Good at Tracing Thoughts: An Overview of Narrative + Understanding + + +
+ Narrative understanding involves capturing the author's cognitive processes, +providing insights into their knowledge, intentions, beliefs, and desires. +Although large language models (LLMs) excel in generating grammatically +coherent text, their ability to comprehend the author's thoughts remains +uncertain. This limitation hinders the practical applications of narrative +understanding. In this paper, we conduct a comprehensive survey of narrative +understanding tasks, thoroughly examining their key features, definitions, +taxonomy, associated datasets, training objectives, evaluation metrics, and +limitations. Furthermore, we explore the potential of expanding the +capabilities of modularized LLMs to address novel narrative understanding +tasks. By framing narrative understanding as the retrieval of the author's +imaginative cues that outline the narrative structure, our study introduces a +fresh perspective on enhancing narrative comprehension. + +
+
+
+
+
+ + ☆ ProMap: Effective Bilingual Lexicon Induction via Language Model + Prompting AACL 2023 + + +
+ Bilingual Lexicon Induction (BLI), where words are translated between two +languages, is an important NLP task. While noticeable progress on BLI in rich +resource languages using static word embeddings has been achieved, the word +translation performance can be further improved by incorporating information +from contextualized word embeddings. In this paper, we introduce ProMap, a +novel approach for BLI that leverages the power of prompting pretrained +multilingual and multidialectal language models to address these challenges. To +overcome the employment of subword tokens in these models, ProMap relies on an +effective padded prompting of language models with a seed dictionary that +achieves good performance when used independently. We also demonstrate the +effectiveness of ProMap in re-ranking results from other BLI methods such as +with aligned static word embeddings. When evaluated on both rich-resource and +low-resource languages, ProMap consistently achieves state-of-the-art results. +Furthermore, ProMap enables strong performance in few-shot scenarios (even with +fewer than 10 training examples), making it a valuable tool for low-resource +language translation. Overall, we believe our method offers an exciting and +promising direction for BLI in general and low-resource languages in +particular. ProMap code and data are available at +https://github.com/4mekki4/promap. + +
+
+ comment: To appear in IJCNLP-AACL 2023 +
+
+
+
+
+ + ☆ Crossing the Aisle: Unveiling Partisan and Counter-Partisan Events in + News Reporting EMNLP'23 + + +
+ News media is expected to uphold unbiased reporting. Yet they may still +affect public opinion by selectively including or omitting events that support +or contradict their ideological positions. Prior work in NLP has only studied +media bias via linguistic style and word usage. In this paper, we study to +which degree media balances news reporting and affects consumers through event +inclusion or omission. We first introduce the task of detecting both partisan +and counter-partisan events: events that support or oppose the author's +political ideology. To conduct our study, we annotate a high-quality dataset, +PAC, containing 8,511 (counter-)partisan event annotations in 304 news articles +from ideologically diverse media outlets. We benchmark PAC to highlight the +challenges of this task. Our findings highlight both the ways in which the news +subtly shapes opinion and the need for large language models that better +understand events within a broader context. Our dataset can be found at +https://github.com/launchnlp/Partisan-Event-Dataset. + +
+
+ comment: EMNLP'23 Findings +
+
+
+
+
+ + ☆ TLM: Token-Level Masking for Transformers EMNLP2023 + + +
+ Structured dropout approaches, such as attention dropout and DropHead, have +been investigated to regularize the multi-head attention mechanism in +Transformers. In this paper, we propose a new regularization scheme based on +token-level rather than structure-level to reduce overfitting. Specifically, we +devise a novel Token-Level Masking (TLM) training strategy for Transformers to +regularize the connections of self-attention, which consists of two masking +techniques that are effective and easy to implement. The underlying idea is to +manipulate the connections between tokens in the multi-head attention via +masking, where the networks are forced to exploit partial neighbors' +information to produce a meaningful representation. The generality and +effectiveness of TLM are thoroughly evaluated via extensive experiments on 4 +diversified NLP tasks across 18 datasets, including natural language +understanding benchmark GLUE, ChineseGLUE, Chinese Grammatical Error +Correction, and data-to-text generation. The results indicate that TLM can +consistently outperform attention dropout and DropHead, e.g., it increases by +0.5 points relative to DropHead with BERT-large on GLUE. Moreover, TLM can +establish a new record on the data-to-text benchmark Rotowire (18.93 BLEU). Our +code will be publicly available at https://github.com/Young1993/tlm. + +
+
+ comment: 13 pages. Accepted by EMNLP2023 main conference +
+
+
+
+
+ + ☆ Using Large Language Models to Support Thematic Analysis in Empirical + Legal Studies + + +
+ Thematic analysis and other variants of inductive coding are widely used +qualitative analytic methods within empirical legal studies (ELS). We propose a +novel framework facilitating effective collaboration of a legal expert with a +large language model (LLM) for generating initial codes (phase 2 of thematic +analysis), searching for themes (phase 3), and classifying the data in terms of +the themes (to kick-start phase 4). We employed the framework for an analysis +of a dataset (n=785) of facts descriptions from criminal court opinions +regarding thefts. The goal of the analysis was to discover classes of typical +thefts. Our results show that the LLM, namely OpenAI's GPT-4, generated +reasonable initial codes, and it was capable of improving the quality of the +codes based on expert feedback. They also suggest that the model performed well +in zero-shot classification of facts descriptions in terms of the themes. +Finally, the themes autonomously discovered by the LLM appear to map fairly +well to the themes arrived at by legal experts. These findings can be leveraged +by legal researchers to guide their decisions in integrating LLMs into their +thematic analyses, as well as other inductive coding projects. + +
+
+ comment: 10 pages, 5 figures, 3 tables +
+
+
+
+
+ + ☆ Probing LLMs for Joint Encoding of Linguistic Categories EMNLP + + +
+ Large Language Models (LLMs) exhibit impressive performance on a range of NLP +tasks, due to the general-purpose linguistic knowledge acquired during +pretraining. Existing model interpretability research (Tenney et al., 2019) +suggests that a linguistic hierarchy emerges in the LLM layers, with lower +layers better suited to solving syntactic tasks and higher layers employed for +semantic processing. Yet, little is known about how encodings of different +linguistic phenomena interact within the models and to what extent processing +of linguistically-related categories relies on the same, shared model +representations. In this paper, we propose a framework for testing the joint +encoding of linguistic categories in LLMs. Focusing on syntax, we find evidence +of joint encoding both at the same (related part-of-speech (POS) classes) and +different (POS classes and related syntactic dependency relations) levels of +linguistic hierarchy. Our cross-lingual experiments show that the same patterns +hold across languages in multilingual LLMs. + +
+
+ comment: Accepted in EMNLP Findings 2023 +
+
+
+
+
+ + ☆ When Reviewers Lock Horn: Finding Disagreement in Scientific Peer + Reviews EMNLP 2023 + + +
+ To date, the efficacy of the scientific publishing enterprise +fundamentally rests on the strength of the peer review process. The journal +editor or the conference chair primarily relies on the expert reviewers' +assessments, identifies points of agreement and disagreement, and tries to reach a +consensus to make a fair and informed decision on whether to accept or reject a +paper. However, with the escalating number of submissions requiring review, +especially in top-tier Artificial Intelligence (AI) conferences, the +editor/chair, among many other tasks, invests a significant, sometimes +stressful effort to mitigate reviewer disagreements. In this work, we +introduce a novel task of automatically identifying contradictions among +reviewers on a given article. To this end, we introduce ContraSciView, a +comprehensive review-pair contradiction dataset on around 8.5k papers (with +around 28k review pairs containing nearly 50k review pair comments) from the +open review-based ICLR and NeurIPS conferences. We further propose a baseline +model that detects contradictory statements from the review pairs. To the best +of our knowledge, we make the first attempt to identify disagreements among +peer reviewers automatically. We make our dataset and code public for further +investigations. + +
+
+ comment: 12 pages, 5 figures, EMNLP 2023 short +
+
+
+
+
+ + ☆ N-Critics: Self-Refinement of Large Language Models with Ensemble of + Critics + + +
+ We propose a self-correction mechanism for Large Language Models (LLMs) to +mitigate issues such as toxicity and fact hallucination. This method involves +refining model outputs through an ensemble of critics and the model's own +feedback. Drawing inspiration from human behavior, we explore whether LLMs can +emulate the self-correction process observed in humans who often engage in +self-reflection and seek input from others to refine their understanding of +complex topics. Our approach is model-agnostic and can be applied across +various domains to enhance trustworthiness by addressing fairness, bias, and +robustness concerns. We consistently observe performance improvements in LLMs +for reducing toxicity and correcting factual errors. + +
+
+
+
+
+ + ☆ ASTormer: An AST Structure-aware Transformer Decoder for Text-to-SQL + + +
+ Text-to-SQL aims to generate an executable SQL program given the user +utterance and the corresponding database schema. To ensure the well-formedness +of output SQLs, one prominent approach adopts a grammar-based recurrent decoder +to produce the equivalent SQL abstract syntax tree (AST). However, previous +methods mainly utilize an RNN-series decoder, which 1) is time-consuming and +inefficient and 2) introduces very few structure priors. In this work, we +propose an AST structure-aware Transformer decoder (ASTormer) to replace +traditional RNN cells. The structural knowledge, such as node types and +positions in the tree, is seamlessly incorporated into the decoder via both +absolute and relative position embeddings. Besides, the proposed framework is +compatible with different traversing orders even considering adaptive node +selection. Extensive experiments on five text-to-SQL benchmarks demonstrate the +effectiveness and efficiency of our structured decoder compared to competitive +baselines. + +
+
+
+
+
+ + ☆ From Indeterminacy to Determinacy: Augmenting Logical Reasoning + Capabilities with Large Language Models + + +
+ Recent advances in LLMs have revolutionized the landscape of reasoning tasks. +To enhance the capabilities of LLMs to emulate human reasoning, prior works +focus on modeling reasoning steps using specific thought structures like +chains, trees, or graphs. However, LLM-based reasoning continues to encounter +three challenges: 1) Selecting appropriate reasoning structures for various +tasks; 2) Exploiting known conditions sufficiently and efficiently to deduce +new insights; 3) Considering the impact of historical reasoning experience. To +address these challenges, we propose DetermLR, a novel reasoning framework that +formulates the reasoning process as a transformational journey from +indeterminate premises to determinate ones. This process is marked by the +incremental accumulation of determinate premises, making the conclusion +progressively closer to clarity. DetermLR includes three essential components: +1) Premise identification: We categorize premises into two distinct types: +determinate and indeterminate. This empowers LLMs to customize reasoning +structures to match the specific task complexities. 2) Premise prioritization +and exploration: We leverage quantitative measurements to assess the relevance +of each premise to the target, prioritizing more relevant premises for +exploring new insights. 3) Iterative process with reasoning memory: We +introduce a reasoning memory module to automate storage and extraction of +available premises and reasoning paths, preserving historical reasoning details +for more accurate premise prioritization. Comprehensive experimental results +show that DetermLR outperforms all baselines on four challenging logical +reasoning tasks: LogiQA, ProofWriter, FOLIO, and LogicalDeduction. DetermLR can +achieve better reasoning performance while requiring fewer visited states, +highlighting its superior efficiency and effectiveness in tackling logical +reasoning tasks. + +
+
+ comment: Code repo: https://github.com/XiaoMi/DetermLR +
+
+
+
+
+ + ☆ EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health + Records with Chest X-ray Images NeurIPS 2023 + + +
+ Electronic Health Records (EHRs) contain patients' medical histories +in various multi-modal formats, yet the potential for joint +reasoning across imaging and table modalities remains underexplored in current EHR +Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel +multi-modal question answering dataset combining structured EHRs and chest +X-ray images. To develop our dataset, we first construct two uni-modal +resources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual +question answering (VQA) benchmark, specifically designed to augment the +imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of +a previously established table-based EHR QA dataset. By integrating these two +uni-modal resources, we successfully construct a multi-modal EHR QA dataset +that necessitates both uni-modal and cross-modal reasoning. To address the +unique challenges of multi-modal questions within EHRs, we propose a +NeuralSQL-based strategy equipped with an external VQA API. This pioneering +endeavor enhances engagement with multi-modal EHR sources and we believe that +our dataset can catalyze advances in real-world medical scenarios such as +clinical decision-making and research. EHRXQA is available at +https://github.com/baeseongsu/ehrxqa. + +
+
+ comment: Accepted at NeurIPS 2023 Datasets and Benchmarks Track (10 pages for + main text, 4 pages for references, 28 pages for supplementary materials) +
+
+
+
+
+ + ☆ Setting the Trap: Capturing and Defeating Backdoors in Pretrained + Language Models through Honeypots + + +
+ In the field of natural language processing, the prevalent approach involves +fine-tuning pretrained language models (PLMs) using local samples. Recent +research has exposed the susceptibility of PLMs to backdoor attacks, wherein +the adversaries can embed malicious prediction behaviors by manipulating a few +training samples. In this study, our objective is to develop a +backdoor-resistant tuning procedure that yields a backdoor-free model, no +matter whether the fine-tuning dataset contains poisoned samples. To this end, +we propose and integrate a honeypot module into the original PLM, specifically +designed to absorb backdoor information exclusively. Our design is motivated by +the observation that lower-layer representations in PLMs carry sufficient +backdoor features while carrying minimal information about the original tasks. +Consequently, we can impose penalties on the information acquired by the +honeypot module to inhibit backdoor creation during the fine-tuning process of +the stem network. Comprehensive experiments conducted on benchmark datasets +substantiate the effectiveness and robustness of our defensive strategy. +Notably, these results indicate a substantial reduction in the attack success +rate ranging from 10% to 40% when compared to prior state-of-the-art methods. + +
+
+
+
+
+ + ☆ Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive + Learning for Code Generation EMNLP 2023 + + +
+ With the rise of powerful closed-sourced LLMs (ChatGPT, GPT-4), there is +increasing interest in distilling the capabilities of closed-sourced LLMs to +smaller open-sourced LLMs. Previous distillation methods usually prompt ChatGPT +to generate a set of instructions and answers, for the student model to learn. +However, such a standard distillation approach neglects the merits and conditions +of the student model. Inspired by modern teaching principles, we design a +personalised distillation process, in which the student attempts to solve a +task first, then the teacher provides an adaptive refinement for the student to +improve. Instead of feeding the student with the teacher's prior, personalised +distillation enables personalised learning for the student model, as it only +learns on examples it makes mistakes upon and learns to improve its own +solution. On code generation, personalised distillation consistently +outperforms standard distillation with only one third of the data. With only +2.5-3K personalised examples that incur a data-collection cost of $4-6, we +boost CodeGen-mono-16B by 7% to achieve 36.4% pass@1 and StarCoder by 12.2% to +achieve 45.8% pass@1 on HumanEval. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Dense Retrieval as Indirect Supervision for Large-space Decision Making EMNLP 2023 + + +
+ Many discriminative natural language understanding (NLU) tasks have large +label spaces. Learning such a process of large-space decision making is +particularly challenging due to the lack of training instances per label and +the difficulty of selection among many fine-grained labels. Inspired by dense +retrieval methods for passage finding in open-domain QA, we propose a +reformulation of large-space discriminative NLU tasks as a learning-to-retrieve +task, leading to a novel solution named Dense Decision Retrieval (DDR). +Instead of predicting fine-grained decisions as logits, DDR adopts a +dual-encoder architecture that learns to predict by retrieving from a decision +thesaurus. This approach not only leverages rich indirect supervision signals +from easy-to-consume learning resources for dense retrieval, it also leads to +enhanced prediction generalizability with a semantically meaningful +representation of the large decision space. When evaluated on tasks with +decision spaces ranging from hundreds to hundred-thousand scales, DDR +outperforms strong baselines greatly by 27.54% in P@1 on two extreme +multi-label classification tasks, 1.17% in F1 score on ultra-fine entity typing, +and 1.26% in accuracy on three few-shot intent classification tasks on average. +Code and resources are available at https://github.com/luka-group/DDR + +
+
+ comment: EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Anaphor Assisted Document-Level Relation Extraction EMNLP 2023 + + +
+ Document-level relation extraction (DocRE) involves identifying relations +between entities distributed in multiple sentences within a document. Existing +methods focus on building a heterogeneous document graph to model the internal +structure of an entity and the external interaction between entities. However, +there are two drawbacks in existing methods. On one hand, anaphor plays an +important role in reasoning to identify relations between entities but is +ignored by these methods. On the other hand, these methods achieve +cross-sentence entity interactions implicitly by utilizing a document or +sentences as intermediate nodes. Such an approach has difficulties in learning +fine-grained interactions between entities across different sentences, +resulting in sub-optimal performance. To address these issues, we propose an +Anaphor-Assisted (AA) framework for DocRE tasks. Experimental results on the +widely-used datasets demonstrate that our model achieves a new state-of-the-art +performance. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference +
+
+
+
+
+ + ☆ MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of + Indian Legal Case Judgments EMNLP 2023 + + +
+ Automatic summarization of legal case judgments is a practically important +problem that has attracted substantial research efforts in many countries. In +the context of the Indian judiciary, there is an additional complexity -- +Indian legal case judgments are mostly written in complex English, but a +significant portion of India's population lacks command of the English +language. Hence, it is crucial to summarize the legal documents in Indian +languages to ensure equitable access to justice. While prior research primarily +focuses on summarizing legal case judgments in their source languages, this +study presents a pioneering effort toward cross-lingual summarization of +English legal documents into Hindi, the most frequently spoken Indian language. +We construct the first high-quality legal corpus comprising 3,122 case +judgments from prominent Indian courts in English, along with their summaries +in both English and Hindi, drafted by legal practitioners. We benchmark the +performance of several diverse summarization approaches on our corpus and +demonstrate the need for further research in cross-lingual summarization in the +legal domain. + +
+
+ comment: Accepted at EMNLP 2023 (Main Conference) +
+
+
+
+
+ + ☆ Accelerating LLM Inference by Enabling Intermediate Layer Decoding + + +
+ Large Language Models (LLMs) have achieved remarkable performance across a +wide variety of natural language tasks; however, their large size makes their +inference slow and computationally expensive which poses a practical challenge +for resource constrained real-world applications. Focusing on this problem, we +propose to instruction tune LLMs in a way that enables intermediate layer +decoding for efficiently generating text, but importantly without compromising +the quality of the generation. Specifically, we instruction tune LLMs with +additional explicit Losses from the InTermediate layErs (LITE) and show that it +enables these layers to acquire 'good' generation ability without affecting the +generation ability of the final layer. We perform 'dynamic confidence-based +early exiting' at token level from the intermediate layers which improves the +efficiency of inference while maintaining the generation quality. We conduct +comprehensive experiments by instruction tuning LLaMA-2 models on the widely +used Alpaca dataset and holistically evaluate on four different +human-instruction test sets: Vicuna, WizardLM, Koala, and Self-Instruct. We +show that 'dynamic early exiting' achieves consistent and considerable cost +improvements (37.86% on average) while maintaining the generation quality of +the responses. We further conduct a thorough analysis of the results over +several important aspects, such as comparing the semantic similarity of the +outputs and dissecting the efficiency improvements by comparing the number of +tokens generated in the output. In summary, our work contributes to improving +the efficiency of LLM inference while maintaining the generation quality, a +crucial step en route to enabling their widespread adoption. + +
+
+
+
+
+ + ☆ Identifying Conspiracy Theories News based on Event Relation Graph EMNLP 2023 + + +
+ Conspiracy theories, as a type of misinformation, are narratives that +explain an event or situation in an irrational or malicious manner. While most +previous work examined conspiracy theories in short social media texts, limited +attention has been paid to such misinformation in long news documents. In this paper, +we aim to identify whether a news article contains conspiracy theories. We +observe that a conspiracy story can be made up by mixing uncorrelated events +together, or by presenting an unusual distribution of relations between events. +Achieving a contextualized understanding of events in a story is essential for +detecting conspiracy theories. Thus, we propose to incorporate an event +relation graph for each article, in which events are nodes, and four common +types of event relations, coreference, temporal, causal, and subevent +relations, are considered as edges. Then, we integrate the event relation graph +into conspiracy theory identification in two ways: an event-aware language +model is developed to augment the basic language model with the knowledge of +events and event relations via soft labels; further, a heterogeneous graph +attention network is designed to derive a graph embedding based on hard labels. +Experiments on a large benchmark dataset show that our approach based on the event +relation graph improves both precision and recall of conspiracy theory +identification, and generalizes well to new unseen media sources. + +
+
+ comment: Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Discourse Structures Guided Fine-grained Propaganda Identification EMNLP 2023 + + +
+ Propaganda is a form of deceptive narrative that instigates or misleads the +public, usually with a political purpose. In this paper, we aim to identify +propaganda in political news at two fine-grained levels: sentence-level and +token-level. We observe that propaganda content is more likely to be embedded +in sentences that attribute causality or assert contrast to nearby sentences, +as well as seen in opinionated evaluation, speculation and discussions of +future expectation. Hence, we propose to incorporate both local and global +discourse structures for propaganda discovery and construct two teacher models +for identifying PDTB-style discourse relations between nearby sentences and +common discourse roles of sentences in a news article respectively. We further +devise two methods to incorporate the two types of discourse structures for +propaganda identification by either using teacher predicted probabilities as +additional features or soliciting guidance in a knowledge distillation +framework. Experiments on the benchmark dataset demonstrate that leveraging +guidance from discourse structures can significantly improve both precision and +recall of propaganda content identification. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based + Queries NeurIPS 2023 + + +
+ In scientific research, the ability to effectively retrieve relevant +documents based on complex, multifaceted queries is critical. Existing +evaluation datasets for this task are limited, primarily due to the high cost +and effort required to annotate resources that effectively represent complex +queries. To address this, we propose a novel task, Scientific DOcument +Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed +to handle the complex nature of user queries in scientific research. We +developed a benchmark dataset within the field of computer science, consisting +of 100 human-authored complex query cases. For each complex query, we assembled +a collection of 100 relevant documents and produced annotated relevance scores +for ranking them. Recognizing the significant labor of expert annotation, we +also introduce Anno-GPT, a scalable framework for validating the performance of +Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM +annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, +without compromising quality. Furthermore, due to the multi-tiered structure of +these complex queries, the DORIS-MAE dataset can be extended to over 4,000 +sub-query test cases without requiring additional annotation. We evaluated 17 +recent retrieval methods on DORIS-MAE, observing notable performance drops +compared to traditional datasets. This highlights the need for better +approaches to handle complex, multifaceted queries in scientific research. Our +dataset and codebase are available at +https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. + +
+
+ comment: To appear in NeurIPS 2023 Datasets and Benchmarks Track +
+
+
+
+
+ + ♻ ☆ "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in + LLM-Generated Reference Letters + + +
+ Large Language Models (LLMs) have recently emerged as an effective tool to +assist individuals in writing various types of content, including professional +documents such as recommendation letters. Though bringing convenience, this +application also introduces unprecedented fairness concerns. Model-generated +reference letters might be directly used by users in professional scenarios. If +underlying biases exist in these model-constructed letters, using them without +scrutiny could lead to direct societal harms, such as sabotaging +application success rates for female applicants. In light of this pressing +issue, it is urgent and necessary to comprehensively study fairness issues +and associated harms in this real-world use case. In this paper, we critically +examine gender biases in LLM-generated reference letters. Drawing inspiration +from social science findings, we design evaluation methods to manifest biases +through 2 dimensions: (1) biases in language style and (2) biases in lexical +content. We further investigate the extent of bias propagation by analyzing the +hallucination bias of models, a term that we define to be bias exacerbation in +model-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs, +ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated +recommendation letters. Our findings not only warn against using LLMs for this +application without scrutiny, but also illuminate the importance of +thoroughly studying hidden biases and harms in LLM-generated professional +documents. + +
+
+
+
+
+ + ♻ ☆ Parameter-Efficient Cross-lingual Transfer of Vision and Language Models + via Translation-based Alignment EMNLP + + +
+ Pre-trained vision and language models such as CLIP have witnessed remarkable +success in connecting images and texts with a primary focus on English texts. +Despite recent efforts to extend CLIP to support other languages, disparities +in performance among different languages have been observed due to uneven +resource availability. Additionally, current cross-lingual transfer methods of +those pre-trained models would consume excessive resources for a large number +of languages. Therefore, we propose a new parameter-efficient cross-lingual +transfer learning framework that utilizes a translation-based alignment method +to mitigate multilingual disparities and explores parameter-efficient +fine-tuning methods for parameter-efficient cross-lingual transfer. Extensive +experiments on XTD and Multi30K datasets, covering 11 languages under +zero-shot, few-shot, and full-dataset learning scenarios, show that our +framework significantly reduces the multilingual disparities among languages +and improves cross-lingual transfer results, especially in low-resource +scenarios, while only keeping and fine-tuning an extremely small number of +parameters compared to the full model (e.g., our framework only requires 0.16% +additional parameters of a full model for each language in the few-shot +learning scenario). The code is available at +https://github.com/eric-ai-lab/PECTVLM. + +
+
+ comment: Findings of EMNLP +
+
+
+
+
+ + ♻ ☆ SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen + LLMs NeurIPS 2023 + + +
+ In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling +frozen LLMs to perform both understanding and generation tasks involving +non-linguistic modalities such as images or videos. SPAE converts between raw +pixels and interpretable lexical tokens (or words) extracted from the LLM's +vocabulary. The resulting tokens capture both the semantic meaning and the +fine-grained details needed for visual reconstruction, effectively translating +the visual content into a language comprehensible to the LLM, and empowering it +to perform a wide array of multimodal tasks. Our approach is validated through +in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set +of image understanding and generation tasks. Our method marks the first +successful attempt to enable a frozen LLM to generate image content while +surpassing state-of-the-art performance in image understanding tasks, under the +same setting, by over 25%. + +
+
+ comment: NeurIPS 2023 spotlight +
+
+
+
+
+ + ♻ ☆ On the Exploitability of Instruction Tuning NeurIPS 2023 + + +
+ Instruction tuning is an effective technique to align large language models +(LLMs) with human intents. In this work, we investigate how an adversary can +exploit instruction tuning by injecting specific instruction-following examples +into the training data that intentionally change the model's behavior. For +example, an adversary can achieve content injection by injecting training +examples that mention target content and eliciting such behavior from +downstream models. To achieve this goal, we propose AutoPoison, an +automated data poisoning pipeline. It naturally and coherently incorporates +versatile attack goals into poisoned data with the help of an oracle LLM. We +showcase two example attacks: content injection and over-refusal attacks, each +aiming to induce a specific exploitable behavior. We quantify and benchmark the +strength and the stealthiness of our data poisoning scheme. Our results show +that AutoPoison allows an adversary to change a model's behavior by poisoning +only a small fraction of data while maintaining a high level of stealthiness in +the poisoned examples. We hope our work sheds light on how data quality affects +the behavior of instruction-tuned models and raises awareness of the importance +of data quality for responsible deployments of LLMs. Code is available at +https://github.com/azshue/AutoPoison. + +
+
+ comment: NeurIPS 2023 camera-ready (21 pages, 10 figures) +
+
+
+
+
+ + ♻ ☆ Language Models Meet World Models: Embodied Experiences Enhance Language + Models + + +
+ While large language models (LMs) have shown remarkable capabilities across +numerous tasks, they often struggle with simple reasoning and planning in +physical environments, such as understanding object permanence or planning +household activities. The limitation arises from the fact that LMs are trained +only on written text and miss essential embodied knowledge and skills. In this +paper, we propose a new paradigm of enhancing LMs by finetuning them with world +models, to gain diverse embodied knowledge while retaining their general +language capabilities. Our approach deploys an embodied agent in a world model, +particularly a simulator of the physical world (VirtualHome), and acquires a +diverse set of embodied experiences through both goal-oriented planning and +random exploration. These experiences are then used to finetune LMs to teach +diverse abilities of reasoning and acting in the physical world, e.g., planning +and completing goals, object permanence and tracking, etc. Moreover, it is +desirable to preserve the generality of LMs during finetuning, which +facilitates generalizing the embodied knowledge across tasks rather than being +tied to specific simulations. We thus further introduce the classical elastic +weight consolidation (EWC) for +selective weight updates, combined with low-rank adapters (LoRA) for training +efficiency. Extensive experiments show our approach substantially improves base +LMs on 18 downstream tasks by 64.28% on average. In particular, the small LMs +(1.3B, 6B, and 13B) enhanced by our approach match or even outperform much +larger LMs (e.g., ChatGPT). + +
+
+
+
+
+ + ♻ ☆ Evaluating Emotion Arcs Across Languages: Bridging the Global Divide in + Sentiment Analysis + + +
+ Emotion arcs capture how an individual (or a population) feels over time. +They are widely used in industry and research; however, there is little work on +evaluating the automatically generated arcs. This is because of the difficulty +of establishing the true (gold) emotion arc. Our work, for the first time, +systematically and quantitatively evaluates automatically generated emotion +arcs. We also compare two common ways of generating emotion arcs: +Machine-Learning (ML) models and Lexicon-Only (LexO) methods. By running +experiments on 18 diverse datasets in 9 languages, we show that despite being +markedly poor at instance level emotion classification, LexO methods are highly +accurate at generating emotion arcs when aggregating information from hundreds +of instances. We also show, through experiments on six indigenous African +languages, as well as Arabic, and Spanish, that automatic translations of +English emotion lexicons can be used to generate high-quality emotion arcs in +less-resource languages. This opens up avenues for work on emotions in +languages from around the world; which is crucial for commerce, public policy, +and health research in service of speakers often left behind. Code and +resources: https://github.com/dteodore/EmotionArcs + +
+
+ comment: 9 pages, 5 figures. arXiv admin note: substantial text overlap with + arXiv:2210.07381 +
+
+
+
+
+ + ♻ ☆ Diagnosing Transformers: Illuminating Feature Spaces for Clinical + Decision-Making + + +
+ Pre-trained transformers are often fine-tuned to aid clinical decision-making +using limited clinical notes. Model interpretability is crucial, especially in +high-stakes domains like medicine, to establish trust and ensure safety, which +requires human engagement. We introduce SUFO, a systematic framework that +enhances interpretability of fine-tuned transformer feature spaces. SUFO +utilizes a range of analytic and visualization techniques, including Supervised +probing, Unsupervised similarity analysis, Feature dynamics, and Outlier +analysis to address key questions about model trust and interpretability. We +conduct a case study investigating the impact of pre-training data where we +focus on real-world pathology classification tasks, and validate our findings +on MedNLI. We evaluate five 110M-sized pre-trained transformer models, +categorized into general-domain (BERT, TNLR), mixed-domain (BioBERT, Clinical +BioBERT), and domain-specific (PubMedBERT) groups. Our SUFO analyses reveal +that: (1) while PubMedBERT, the domain-specific model, contains valuable +information for fine-tuning, it can overfit to minority classes when class +imbalances exist. In contrast, mixed-domain models exhibit greater resistance +to overfitting, suggesting potential improvements in domain-specific model +robustness; (2) in-domain pre-training accelerates feature disambiguation +during fine-tuning; and (3) feature spaces undergo significant sparsification +during this process, enabling clinicians to identify common outlier modes among +fine-tuned models as demonstrated in this paper. These findings showcase the +utility of SUFO in enhancing trust and safety when using transformers in +medicine, and we believe SUFO can aid practitioners in evaluating fine-tuned +language models for other applications in medicine and in more critical +domains. + +
+
+
+
+
+ + ♻ ☆ ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text + Processing EMNLP'2023 + + +
+ English and Chinese, known as resource-rich languages, have witnessed the +strong development of transformer-based language models for natural language +processing tasks. Although Vietnamese is spoken by approximately 100M people +and several pre-trained models, e.g., PhoBERT, ViBERT, and vELECTRA, +perform well on general Vietnamese NLP tasks, including POS tagging and named +entity recognition, these pre-trained language models are still limited on +Vietnamese social media tasks. In this paper, we present the first monolingual +pre-trained language model for Vietnamese social media texts, ViSoBERT, which +is pre-trained on a large-scale corpus of high-quality and diverse Vietnamese +social media texts using the XLM-R architecture. Moreover, we explored our +pre-trained model on five important natural language downstream tasks on +Vietnamese social media texts: emotion recognition, hate speech detection, +sentiment analysis, spam reviews detection, and hate speech spans detection. +Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses +the previous state-of-the-art models on multiple Vietnamese social media tasks. +Our ViSoBERT model is available only for research purposes. + +
+
+ comment: Accepted at EMNLP'2023 Main Conference +
+
+
+
+
+ + ♻ ☆ Fine-grained Late-interaction Multi-modal Retrieval for Retrieval + Augmented Visual Question Answering NeurIPS 2023 + + +
+ Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to +utilize knowledge from external knowledge bases to answer visually-grounded +questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong +framework to tackle KB-VQA, first retrieves related documents with Dense +Passage Retrieval (DPR) and then uses them to answer questions. This paper +proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which +significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major +limitations in RA-VQA's retriever: (1) the image representations obtained via +image-to-text transforms can be incomplete and inaccurate and (2) relevance +scores between queries and documents are computed with one-dimensional +embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes +these limitations by obtaining image representations that complement those from +the image-to-text transforms using a vision model aligned with an existing +text-based retriever through a simple alignment network. FLMR also encodes +images and questions using multi-dimensional embeddings to capture +finer-grained relevance between queries and documents. FLMR significantly +improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. +Finally, we equipped RA-VQA with two state-of-the-art large +multi-modal/language models to achieve a VQA score of approximately 61% on the OK-VQA +dataset. + +
+
+ comment: To appear at NeurIPS 2023. This is the camera-ready version. We fixed + some numbers and added more experiments to address reviewers' comments +
+
+
+
+
+ + ♻ ☆ Key Frame Mechanism For Efficient Conformer Based End-to-end Speech + Recognition + + +
+ Recently, Conformer as a backbone network for end-to-end automatic speech +recognition achieved state-of-the-art performance. The Conformer block +leverages a self-attention mechanism to capture global information, along with +a convolutional neural network to capture local information, resulting in +improved performance. However, the Conformer-based model encounters an issue +with the self-attention mechanism, as computational complexity grows +quadratically with the length of the input sequence. Inspired by previous +Connectionist Temporal Classification (CTC) guided blank skipping during +decoding, we introduce intermediate CTC outputs as guidance into the +downsampling procedure of the Conformer encoder. We define a frame with +non-blank output as a key frame. Specifically, we introduce the key frame-based +self-attention (KFSA) mechanism, a novel method to reduce the computation of +the self-attention mechanism using key frames. The structure of our proposed +approach comprises two encoders. Following the initial encoder, we introduce an +intermediate CTC loss function to compute the label frame, enabling us to +extract the key frames and blank frames for KFSA. Furthermore, we introduce the +key frame-based downsampling (KFDS) mechanism to operate on high-dimensional +acoustic features directly and drop the frames corresponding to blank labels, +which results in new acoustic feature sequences as input to the second encoder. +The proposed method achieves comparable or higher performance +than vanilla Conformer and other similar work such as Efficient Conformer. +Meanwhile, our proposed method can discard more than 60% of useless frames during +model training and inference, which significantly accelerates inference. +The code for this work is available at +https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer + +
+
+ comment: This manuscript has been accepted by IEEE Signal Processing Letters + for publication +
+
+
+
+
+ + ♻ ☆ Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing + Perspective + + +
+ Large language models (LLMs), like ChatGPT, have shown some human-like +cognitive abilities. For comparing these abilities of different models, several +benchmarks (i.e., sets of standard test questions) from different fields (e.g., +Literature, Biology and Psychology) are often adopted and the test results +under traditional metrics such as accuracy, recall and F1, are reported. +However, such a way of evaluating LLMs can be inefficient and inaccurate from +the cognitive science perspective. Inspired by Computerized Adaptive Testing +(CAT) used in psychometrics, we propose an adaptive testing framework for LLM +evaluation. Rather than using a standard test set and simply reporting +accuracy, this approach dynamically adjusts the characteristics of the test +questions, such as difficulty, based on the model's performance. This allows +for a more accurate estimation of the model's abilities, using fewer questions. +More importantly, it allows LLMs to be compared with humans easily, which is +essential for NLP models that aim for human-level ability. Our diagnostic +reports have found that ChatGPT often behaves like a "careless student", +prone to slips and occasionally guessing at the questions. We conduct a +fine-grained diagnosis and rank the latest 6 instruction-tuned LLMs from three +aspects of Subject Knowledge, Mathematical Reasoning, and Programming, where +GPT4 can outperform other models significantly and reach the cognitive ability +of middle-level students. Different tests for different models using efficient +adaptive testing -- we believe this has the potential to become a new norm in +evaluating large language models. + +
+
+
+
+
+ + ♻ ☆ Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for + Knowledge-intensive Question Answering + + +
+ Equipped with Chain-of-Thought (CoT), Large language models (LLMs) have shown +impressive reasoning ability in various downstream tasks. Even so, suffering +from hallucinations and the inability to access external knowledge, LLMs often +come with incorrect or unfaithful intermediate reasoning steps, especially in +the context of answering knowledge-intensive tasks such as KBQA. To alleviate +this issue, we propose a framework called Knowledge-Driven Chain-of-Thought +(KD-CoT) to verify and modify reasoning traces in CoT via interaction with +external knowledge, and thus overcome the hallucinations and error propagation. +Concretely, we formulate the CoT rationale process of LLMs into a structured +multi-round QA format. In each round, LLMs interact with a QA system that +retrieves external knowledge and produce faithful reasoning traces based on +retrieved precise answers. The structured CoT reasoning of LLMs is facilitated +by our developed KBQA CoT collection, which serves as in-context learning +demonstrations and can also be utilized as feedback augmentation to train a +robust retriever. Extensive experiments on WebQSP and ComplexWebQuestion +datasets demonstrate the effectiveness of proposed KD-CoT in task-solving +reasoning generation, which outperforms the vanilla CoT ICL with an absolute +success rate of 8.0% and 5.1%. Furthermore, our proposed feedback-augmented +retriever outperforms the state-of-the-art baselines for retrieving knowledge, +achieving significant improvement in Hit and recall performance. Our code and +data are released on https://github.com/AdelWang/KD-CoT/tree/main. + +
+
+
+
+
+ + ♻ ☆ Learning Descriptive Image Captioning via Semipermeable Maximum + Likelihood Estimation NeurIPS 2023 + + +
+ Image captioning aims to describe visual content in natural language. As 'a +picture is worth a thousand words', there could be various correct descriptions +for an image. However, with maximum likelihood estimation as the training +objective, the captioning model is penalized whenever its prediction mismatches +with the label. For instance, when the model predicts a word expressing richer +semantics than the label, it will be penalized and optimized to prefer more +concise expressions, referred to as conciseness optimization. In contrast, +predictions that are more concise than labels lead to richness optimization. +Such conflicting optimization directions could eventually result in the model +generating general descriptions. In this work, we introduce Semipermeable +MaxImum Likelihood Estimation (SMILE), which allows richness optimization while +blocking conciseness optimization, thus encouraging the model to generate +longer captions with more details. Extensive experiments on two mainstream +image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE +significantly enhances the descriptiveness of generated captions. We further +provide in-depth investigations to facilitate a better understanding of how +SMILE works. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial + Reports EMNLP 2023 + + +
+ How can we interpret and retrieve medical evidence to support clinical +decisions? Clinical trial reports (CTR) amassed over the years contain +indispensable information for the development of personalized medicine. +However, it is practically infeasible to manually inspect more than 400,000 +clinical trial reports in order to find the best evidence for experimental +treatments. Natural Language Inference (NLI) offers a potential solution to +this problem, by allowing the scalable computation of textual entailment. +However, existing NLI models perform poorly on biomedical corpora, and +previously published datasets fail to capture the full complexity of inference +over CTRs. In this work, we present a novel resource to advance research on NLI +for reasoning on CTRs. The resource includes two main tasks: first, +determining the inference relation between a natural language statement and a +CTR; second, retrieving supporting facts to justify the predicted relation. +We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these +tasks. Baselines on this corpus expose the limitations of existing NLI models, +with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To +the best of our knowledge, we are the first to design a task that covers the +interpretation of full CTRs. To encourage further work on this challenging +dataset, we make the corpus, competition leaderboard, website and code to +replicate the baseline experiments available at: +https://github.com/ai-systems/nli4ct + +
+
+ comment: EMNLP 2023 Camera-ready, 15 pages +
+
+
+
+
+ + ♻ ☆ Training Socially Aligned Language Models on Simulated Social + Interactions + + +
+ Social alignment in AI systems aims to ensure that these models behave +according to established societal values. However, unlike humans, who derive +consensus on value judgments through social interaction, current language +models (LMs) are trained to rigidly replicate their training corpus in +isolation, leading to subpar generalization in unfamiliar scenarios and +vulnerability to adversarial attacks. This work presents a novel training +paradigm that permits LMs to learn from simulated social interactions. In +comparison to existing methodologies, our approach is considerably more +scalable and efficient, demonstrating superior performance in alignment +benchmarks and human evaluations. This paradigm shift in the training of LMs +brings us a step closer to developing AI systems that can robustly and +accurately reflect societal norms and values. + +
+
+ comment: Code, data, and models can be downloaded via + https://github.com/agi-templar/Stable-Alignment +
+
+
+
+
+ + ♻ ☆ Improving CLIP Training with Language Rewrites NeurIPS 2023 + + +
+ Contrastive Language-Image Pre-training (CLIP) stands as one of the most +effective and scalable methods for training transferable vision models using +paired image and text data. CLIP models are trained using contrastive loss, +which typically relies on data augmentations to prevent overfitting and +shortcuts. However, in the CLIP training paradigm, data augmentations are +exclusively applied to image inputs, while language inputs remain unchanged +throughout the entire training process, limiting the exposure of diverse texts +to the same image. In this paper, we introduce Language augmented CLIP +(LaCLIP), a simple yet highly effective approach to enhance CLIP training +through language rewrites. Leveraging the in-context learning capability of +large language models, we rewrite the text descriptions associated with each +image. These rewritten texts exhibit diversity in sentence structure and +vocabulary while preserving the original key concepts and meanings. During +training, LaCLIP randomly selects either the original texts or the rewritten +versions as text augmentations for each image. Extensive experiments on CC3M, +CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with +language rewrites significantly improves the transfer performance without +computation or memory overhead during training. Specifically for ImageNet +zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on +LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ TST$^\mathrm{R}$: Target Similarity Tuning Meets the Real World EMNLP + + +
+ Target similarity tuning (TST) is a method of selecting relevant examples in +natural language (NL) to code generation through large language models (LLMs) +to improve performance. Its goal is to adapt a sentence embedding model to have +the similarity between two NL inputs match the similarity between their +associated code outputs. In this paper, we propose different methods to apply +and improve TST in the real world. First, we replace the sentence transformer +with embeddings from a larger model, which reduces sensitivity to the language +distribution and thus provides more flexibility in synthetic generation of +examples, and we train a tiny model that transforms these embeddings to a space +where embedding similarity matches code similarity, which allows the model to +remain a black box and only requires a few matrix multiplications at inference +time. Second, we show how to efficiently select a smaller number of training +examples to train the TST model. Third, we introduce a ranking-based evaluation +for TST that does not require end-to-end code generation experiments, which can +be expensive to perform. + +
+
+ comment: Accepted for EMNLP-Findings, 2023 +
+
+
+
+
+ + ♻ ☆ Statistical Knowledge Assessment for Large Language Models NeurIPS 2023 + + +
+ Given varying prompts regarding a factoid question, can a large language +model (LLM) reliably generate factually correct answers? Existing LLMs may +generate distinct responses for different prompts. In this paper, we study the +problem of quantifying knowledge contained in an LLM regarding a given set of +facts. We propose KaRR, a statistical approach to assess factual knowledge for +LLMs. The main idea is to estimate the ratio of LLM generating text +corresponding to the answer entity given diverse prompts of the subject and the +querying relation, versus it generating by random chances. Our assessment suite +contains a comprehensive set of 994,123 entities and 600 relations, with +1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, +including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a +strong correlation (0.43 Kendall's $\tau$) with the results of human assessment +on LLMs. Our results reveal that the knowledge in LLMs with the same backbone +architecture adheres to the scaling law, while tuning on instruction-following +data sometimes compromises the model's capability to generate factually correct +text reliably. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Non-Autoregressive Math Word Problem Solver with Unified Tree Structure EMNLP2023 + + +
+ Existing MWP solvers employ a sequence or binary tree to represent the solution +expression and decode it from the given problem description. However, such +structures fail to handle the variants that can be derived via mathematical +manipulation, e.g., $(a_1+a_2) * a_3$ and $a_1 * a_3+a_2 * a_3$ can both be +valid solutions for the same problem but are formulated as different +expression sequences or trees. The multiple solution variants depicting +different possible solving procedures for the same input problem would raise +two issues: 1) making it hard for the model to learn the mapping function +between the input and output spaces effectively, and 2) wrongly indicating +"wrong" when evaluating a valid expression variant. To address these +issues, we introduce a unified tree structure to represent a solution expression, +where the elements are permutable and identical for all the expression +variants. We propose a novel non-autoregressive solver, named MWP-NAS, +to parse the problem and deduce the solution expression based on the unified +tree. For evaluating the possible expression variants, we design a path-based +metric to evaluate the partial accuracy of expressions of a unified tree. The +results from extensive experiments conducted on Math23K and MAWPS demonstrate +the effectiveness of our proposed MWP-NAS. The code and checkpoints are +available at: https://github.com/mengqunhan/MWP-NAS. + +
+
+ comment: Accepted at EMNLP2023 +
+
+
+
+
+ + ♻ ☆ Are All Steps Equally Important? Benchmarking Essentiality Detection of + Events EMNLP 2023 + + +
+ Natural language expresses events with varying granularities, where +coarse-grained events (goals) can be broken down into finer-grained event +sequences (steps). A critical yet overlooked aspect of understanding event +processes is recognizing that not all step events hold equal importance toward +the completion of a goal. In this paper, we address this gap by examining the +extent to which current models comprehend the essentiality of step events in +relation to a goal event. Cognitive studies suggest that such capability +enables machines to emulate human commonsense reasoning about preconditions and +necessary efforts of everyday tasks. We contribute a high-quality corpus of +(goal, step) pairs gathered from the community guideline website WikiHow, with +steps manually annotated for their essentiality concerning the goal by experts. +The high inter-annotator agreement demonstrates that humans possess a +consistent understanding of event essentiality. However, after evaluating +multiple statistical and large-scale pre-trained language models, we find that +existing approaches considerably underperform compared to humans. This +observation highlights the need for further exploration into this critical and +challenging task. The dataset and code are available at +http://cogcomp.org/page/publication_view/1023. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Probing LLMs for hate speech detection: strengths and vulnerabilities EMNLP 2023 + + +
+ Recently, efforts have been made by social media platforms as well as +researchers to detect hateful or toxic language using large language models. +However, none of these works aim to use explanation, additional context and +victim community information in the detection process. We use different +prompt variations and input information, and evaluate large language models in a +zero-shot setting (without adding any in-context examples). We select three large +language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - +HateXplain, implicit hate and ToxicSpans. We find that on average including the +target information in the pipeline improves the model performance substantially +(~20-30%) over the baseline across the datasets. There is also a considerable +effect of adding the rationales/explanations into the pipeline (~10-20%) over +the baseline across the datasets. In addition, we further provide a typology of +the error cases where these large language models fail to (i) classify and (ii) +explain the reason for the decisions they take. Such vulnerable points +automatically constitute 'jailbreak' prompts for these models and industry +scale safeguard techniques need to be developed to make the models robust +against such prompts. + +
+
+ comment: 13 pages, 9 figures, 7 tables, accepted to findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ LogiCoT: Logical Chain-of-Thought Instruction-Tuning + + +
+ Generative Pre-trained Transformer 4 (GPT-4) demonstrates impressive +chain-of-thought reasoning ability. Recent work on self-instruction tuning, +such as Alpaca, has focused on enhancing the general proficiency of models. +These instructions enable the model to achieve performance comparable to +GPT-3.5 on general tasks like open-domain text generation and paraphrasing. +However, they fall short of helping the model handle complex reasoning tasks. +To bridge the gap, this paper presents LogiCoT, a new instruction-tuning +dataset for Logical Chain-of-Thought reasoning with GPT-4. We elaborate on the +process of harvesting instructions for prompting GPT-4 to generate +chain-of-thought rationales. LogiCoT serves as an instruction set for teaching +models of logical reasoning and elicits general reasoning skills. + +
+
+
+
+
+ + ♻ ☆ Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with + Text NeurIPS + + +
+ In-context vision and language models like Flamingo support arbitrarily +interleaved sequences of images and text as input. This format not only enables +few-shot learning via interleaving independent supervised (image, text) +examples, but also, more complex prompts involving interaction between images, +e.g., "What do image A and image B have in common?" To support this interface, +pretraining occurs over web corpora that similarly contain interleaved +images+text. To date, however, large-scale data of this form have not been +publicly available. + We release Multimodal C4, an augmentation of the popular text-only C4 corpus +with images interleaved. We use a linear assignment algorithm to place images +into longer bodies of text using CLIP features, a process that we show +outperforms alternatives. Multimodal C4 spans everyday topics like cooking, +travel, technology, etc. A manual inspection of a random sample of documents +shows that a vast majority (88%) of images are topically relevant, and that +linear assignment frequently selects individual sentences specifically +well-aligned with each image (80%). After filtering NSFW images, ads, etc., the +resulting corpus consists of 101.2M documents with 571M images interleaved in +43B English tokens. + +
+
+ comment: NeurIPS D&B 2023. Project homepage: https://github.com/allenai/mmc4 +
+
+
+
+
+ + ♻ ☆ Extending Input Contexts of Language Models through Training on + Segmented Sequences + + +
+ Effectively training language models on long inputs poses many technical +challenges. As a cost consideration, language models are pretrained on a fixed +sequence length before being adapted to longer sequences. We explore various +methods for adapting models to longer inputs by training on segmented sequences +and an interpolation-based method for extending absolute positional embeddings. +We develop a training procedure to extend the input context size of pretrained +models with no architectural changes and no additional memory cost compared with +training on the original input lengths. By sub-sampling segments from long +inputs while maintaining their original position, the model is able to learn new +positional interactions. Our method benefits both models trained with absolute +positional embeddings, by extending their input contexts, as well as popular +relative positional embedding methods showing a reduced perplexity on sequences +longer than they were trained on. We demonstrate our method can extend input +contexts by a factor of 4x while improving perplexity. + +
+
+ comment: 11 pages, 3 figures +
+
+
+
+
+ + ♻ ☆ Collaborative Generative AI: Integrating GPT-k for Efficient Editing in + Text-to-Image Generation EMNLP 2023 + + +
+ The field of text-to-image (T2I) generation has garnered significant +attention both within the research community and among everyday users. Despite +the advancements of T2I models, a common issue encountered by users is the need +for repetitive editing of input prompts in order to receive a satisfactory +image, which is time-consuming and labor-intensive. Given the demonstrated text +generation power of large-scale language models, such as GPT-k, we investigate +the potential of utilizing such models to improve the prompt editing process +for T2I generation. We conduct a series of experiments to compare the common +edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting +T2I, and examine factors that may influence this process. We found that GPT-k +models focus more on inserting modifiers while humans tend to replace words and +phrases, which includes changes to the subject matter. Experimental results +show that GPT-k are more effective in adjusting modifiers rather than +predicting spontaneous changes in the primary subject matters. Adopting the +edit suggested by GPT-k models may reduce the percentage of remaining edits by +20-30%. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ A Novel Site-Agnostic Multimodal Deep Learning Model to Identify + Pro-Eating Disorder Content on Social Media + + +
+ Over the last decade, there has been a vast increase in eating disorder +diagnoses and eating disorder-attributed deaths, reaching their zenith during +the Covid-19 pandemic. This immense growth derived in part from the stressors +of the pandemic but also from increased exposure to social media, which is rife +with content that promotes eating disorders. This study aimed to create a +multimodal deep learning model that can determine if a given social media post +promotes eating disorders based on a combination of visual and textual data. A +labeled dataset of Tweets was collected from Twitter, upon which twelve deep +learning models were trained and tested. Based on model performance, the most +effective deep learning model was the multimodal fusion of the RoBERTa natural +language processing model and the MaxViT image classification model, attaining +accuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT +fusion model, deployed to classify an unlabeled dataset of posts from the +social media sites Tumblr and Reddit, generated results akin to those of +previous research studies that did not employ artificial intelligence-based +techniques, indicating that deep learning models can develop insights congruent +to those of researchers. Additionally, the model was used to conduct a +timeseries analysis of yet unseen Tweets from eight Twitter hashtags, +uncovering that, since 2014, the relative abundance of content that promotes +eating disorders has decreased drastically within those communities. Despite +this reduction, by 2018, content that promotes eating disorders had either +stopped declining or increased in ampleness anew on these hashtags. + +
+
+
+
+
+ + ♻ ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with the +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong Continuous learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ♻ ☆ DecipherPref: Analyzing Influential Factors in Human Preference + Judgments via GPT-4 + + +
+ Human preference judgments are pivotal in guiding large language models +(LLMs) to produce outputs that align with human values. Human evaluations are +also used in summarization tasks to compare outputs from various systems, +complementing existing automatic metrics. Despite their significance, however, +there has been limited research probing these pairwise or $k$-wise comparisons. +The collective impact and relative importance of factors such as output length, +informativeness, fluency, and factual consistency are still not well +understood. It is also unclear if there are other hidden factors influencing +human judgments. In this paper, we conduct an in-depth examination of a +collection of pairwise human judgments released by OpenAI. Utilizing the +Bradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in +these human judgments. We find that the most favored factors vary across tasks +and genres, whereas the least favored factors tend to be consistent, e.g., +outputs are too brief, contain excessive off-focus content or hallucinated +facts. Our findings have implications on the construction of balanced datasets +in human preference evaluations, which is a crucial step in shaping the +behaviors of future LLMs. + +
+
+
+
+
+ + ♻ ☆ A Question Answering Framework for Decontextualizing User-facing + Snippets from Scientific Documents EMNLP2023 + + +
+ Many real-world applications (e.g., note taking, search) require extracting a +sentence or paragraph from a document and showing that snippet to a human +outside of the source document. Yet, users may find snippets difficult to +understand as they lack context from the original document. In this work, we +use language models to rewrite snippets from scientific documents to be read on +their own. First, we define the requirements and challenges for this +user-facing decontextualization task, such as clarifying where edits occur and +handling references to other documents. Second, we propose a framework that +decomposes the task into three stages: question generation, question answering, +and rewriting. Using this framework, we collect gold decontextualizations from +experienced scientific article readers. We then conduct a range of experiments +across state-of-the-art commercial and open-source language models to identify +how to best provide missing-but-relevant information to models for our task. +Finally, we develop QaDecontext, a simple prompting strategy inspired by our +framework that improves over end-to-end prompting. We conclude with analysis +that finds, while rewriting is easy, question generation and answering remain +challenging for today's models. + +
+
+ comment: 19 pages, 2 figures, 8 tables, EMNLP2023 +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 61 + +
+
+
+ + ☆ Exploring Data Augmentations on Self-/Semi-/Fully- Supervised + Pre-trained Models + + +
+ Data augmentation has become a standard component of vision pre-trained +models to capture the invariance between augmented views. In practice, +augmentation techniques that mask regions of a sample with zero/mean values or +patches from other samples are commonly employed in pre-trained models with +self-/semi-/fully-supervised contrastive losses. However, the underlying +mechanism behind the effectiveness of these augmentation techniques remains +poorly explored. To investigate the problems, we conduct an empirical study to +quantify how data augmentation affects performance. Concretely, we apply 4 +types of data augmentations termed with Random Erasing, CutOut, CutMix and +MixUp to a series of self-/semi-/fully- supervised pre-trained models. We +report their performance on vision tasks such as image classification, object +detection, instance segmentation, and semantic segmentation. We then explicitly +evaluate the invariance and diversity of the feature embedding. We observe +that: 1) Masking regions of the images decreases the invariance of the learned +feature embedding while providing a more considerable diversity. 2) Manual +annotations do not change the invariance or diversity of the learned feature +embedding. 3) The MixUp approach improves the diversity significantly, with +only a marginal decrease in terms of the invariance. + +
+
+
+
+
+ + ☆ Deep Learning-based Compressed Domain Multimedia for Man and Machine: A + Taxonomy and Application to Point Cloud Classification + + +
+ In the current golden age of multimedia, human visualization is no longer the +single main target, with the final consumer often being a machine which +performs some processing or computer vision tasks. In both cases, deep learning +plays a fundamental role in extracting features from the multimedia +representation data, usually producing a compressed representation referred to +as latent representation. The increasing development and adoption of deep +learning-based solutions in a wide area of multimedia applications have opened +an exciting new vision where a common compressed multimedia representation is +used for both man and machine. The main benefits of this vision are two-fold: +i) improved performance for the computer vision tasks, since the effects of +coding artifacts are mitigated; and ii) reduced computational complexity, since +prior decoding is not required. This paper proposes the first taxonomy for +designing compressed domain computer vision solutions driven by the +architecture and weights compatibility with an available spatio-temporal +computer vision processor. The potential of the proposed taxonomy is +demonstrated for the specific case of point cloud classification by designing +novel compressed domain processors using the JPEG Pleno Point Cloud Coding +standard under development and adaptations of the PointGrid classifier. +Experimental results show that the designed compressed domain point cloud +classification solutions can significantly outperform the spatial-temporal +domain classification benchmarks when applied to the decompressed data, +containing coding artifacts, and even surpass their performance when applied to +the original uncompressed data. + +
+
+
+
+
+ + ☆ INCODE: Implicit Neural Conditioning with Prior Knowledge Embeddings WACV 2024 + + +
+ Implicit Neural Representations (INRs) have revolutionized signal +representation by leveraging neural networks to provide continuous and smooth +representations of complex data. However, existing INRs face limitations in +capturing fine-grained details, handling noise, and adapting to diverse signal +types. To address these challenges, we introduce INCODE, a novel approach that +enhances the control of the sinusoidal-based activation function in INRs using +deep prior knowledge. INCODE comprises a harmonizer network and a composer +network, where the harmonizer network dynamically adjusts key parameters of the +activation function. Through a task-specific pre-trained model, INCODE adapts +the task-specific parameters to optimize the representation process. Our +approach not only excels in representation, but also extends its prowess to +tackle complex tasks such as audio, image, and 3D shape reconstructions, as +well as intricate challenges such as neural radiance fields (NeRFs), and +inverse problems, including denoising, super-resolution, inpainting, and CT +reconstruction. Through comprehensive experiments, INCODE demonstrates its +superiority in terms of robustness, accuracy, quality, and convergence rate, +broadening the scope of signal representation. Please visit the project's +website for details on the proposed method and access to the code. + +
+
+ comment: Accepted at WACV 2024 conference +
+
+
+
+
+ + ☆ Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models WACV 2024 + + +
+ Personalized text-to-image (T2I) synthesis based on diffusion models has +attracted significant attention in recent research. However, existing methods +primarily concentrate on customizing subjects or styles, neglecting the +exploration of global geometry. In this study, we propose an approach that +focuses on the customization of 360-degree panoramas, which inherently possess +global geometric properties, using a T2I diffusion model. To achieve this, we +curate a paired image-text dataset specifically designed for the task and +subsequently employ it to fine-tune a pre-trained T2I diffusion model with +LoRA. Nevertheless, the fine-tuned model alone does not ensure the continuity +between the leftmost and rightmost sides of the synthesized images, a crucial +characteristic of 360-degree panoramas. To address this issue, we propose a +method called StitchDiffusion. Specifically, we perform pre-denoising +operations twice at each time step of the denoising process on the stitch block +consisting of the leftmost and rightmost image regions. Furthermore, a global +cropping is adopted to synthesize seamless 360-degree panoramas. Experimental +results demonstrate the effectiveness of our customized model combined with the +proposed StitchDiffusion in generating high-quality 360-degree panoramic +images. Moreover, our customized model exhibits exceptional generalization +ability in producing scenes unseen in the fine-tuning dataset. Code is +available at https://github.com/littlewhitesea/StitchDiffusion. + +
+
+ comment: Accepted by WACV 2024, Project Page: + https://littlewhitesea.github.io/stitchdiffusion.github.io/ +
+
+
+
+
+ + ☆ Rethinking Semi-Supervised Federated Learning: How to co-train + fully-labeled and fully-unlabeled client imaging data MICCAI 2023 + + +
+ The most challenging, yet practical, setting of semi-supervised federated +learning (SSFL) is where a few clients have fully labeled data whereas the +other clients have fully unlabeled data. This is particularly common in +healthcare settings where collaborating partners (typically hospitals) may have +images but not annotations. The bottleneck in this setting is the joint +training of labeled and unlabeled clients as the objective function for each +client varies based on the availability of labels. This paper investigates an +alternative way for effective training with labeled and unlabeled clients in a +federated setting. We propose a novel learning scheme specifically designed for +SSFL which we call Isolated Federated Learning (IsoFed) that circumvents the +problem by avoiding simple averaging of supervised and semi-supervised models +together. In particular, our training approach consists of two parts - (a) +isolated aggregation of labeled and unlabeled client models, and (b) local +self-supervised pretraining of isolated global models in all clients. We +evaluate our model performance on medical image datasets of four different +modalities publicly available within the biomedical image classification +benchmark MedMNIST. We further vary the proportion of labeled clients and the +degree of heterogeneity to demonstrate the effectiveness of the proposed method +under varied experimental settings. + +
+
+ comment: Published in MICCAI 2023 with early acceptance and selected as 1 of + the top 20 poster highlights under the category: Which work has the potential + to impact other applications of AI and CV +
+
+
+
+
+ + ☆ UniCat: Crafting a Stronger Fusion Baseline for Multimodal + Re-Identification NeurIPS 2023 + + +
+ Multimodal Re-Identification (ReID) is a popular retrieval task that aims to +re-identify objects across diverse data streams, prompting many researchers to +integrate multiple modalities into a unified representation. While such fusion +promises a holistic view, our investigations shed light on potential pitfalls. +We uncover that prevailing late-fusion techniques often produce suboptimal +latent representations when compared to methods that train modalities in +isolation. We argue that this effect is largely due to the inadvertent +relaxation of the training objectives on individual modalities when using +fusion, what others have termed modality laziness. We present a nuanced +point-of-view that this relaxation can lead to certain modalities failing to +fully harness available task-relevant information, and yet, offers a protective +veil to noisy modalities, preventing them from overfitting to task-irrelevant +data. Our findings also show that unimodal concatenation (UniCat) and other +late-fusion ensembling of unimodal backbones, when paired with best-known +training techniques, exceed the current state-of-the-art performance across +several multimodal ReID benchmarks. By unveiling the double-edged sword of +"modality laziness", we motivate future research in balancing local modality +strengths with global representations. + +
+
+ comment: Accepted NeurIPS 2023 UniReps, 9 pages, 4 tables +
+
+
+
+
+ + OC-NMN: Object-centric Compositional Neural Module Network for + Generative Visual Analogical Reasoning + + +
+ A key aspect of human intelligence is the ability to imagine -- composing +learned concepts in novel ways -- to make sense of new scenarios. This capacity +has not yet been attained by machine learning systems. In this work, in the context +of visual reasoning, we show how modularity can be leveraged to derive a +compositional data augmentation framework inspired by imagination. Our method, +denoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes +visual generative reasoning tasks into a series of primitives applied to +objects without using a domain-specific language. We show that our modular +architectural choices can be used to generate new training tasks that lead to +better out-of-distribution generalization. We compare our model to existing and +new baselines on a proposed visual reasoning benchmark that consists of applying +arithmetic operations to MNIST digits. + +
+
+
+
+
+ + ☆ Open Visual Knowledge Extraction via Relation-Oriented Multimodality + Model Prompting NeurIPS 2023 + + +
+ Images contain rich relational knowledge that can help machines understand +the world. Existing methods on visual knowledge extraction often rely on the +pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation +types), restricting the expressiveness of the extracted knowledge. In this +work, we take a first exploration to a new paradigm of open visual knowledge +extraction. To achieve this, we present OpenVik which consists of an open +relational region detector to detect regions potentially containing relational +knowledge and a visual knowledge generator that generates format-free knowledge +by prompting the large multimodality model with the detected region of +interest. We also explore two data enhancement techniques for diversifying the +generated format-free visual knowledge. Extensive knowledge quality evaluations +highlight the correctness and uniqueness of the extracted open visual knowledge +by OpenVik. Moreover, integrating our extracted knowledge across various visual +reasoning applications shows consistent improvements, indicating the real-world +applicability of OpenVik. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ A Review on the Applications of Machine Learning for Tinnitus Diagnosis + Using EEG Signals + + +
+ Tinnitus is a prevalent hearing disorder that can be caused by various +factors such as age, hearing loss, exposure to loud noises, ear infections or +tumors, certain medications, head or neck injuries, and psychological +conditions like anxiety and depression. While not every patient requires +medical attention, about 20% of sufferers seek clinical intervention. Early +diagnosis is crucial for effective treatment. Recent developments in tinnitus +detection aim to aid early diagnosis of this condition. Over the past +few years, there has been a notable growth in the usage of +electroencephalography (EEG) to study variations in oscillatory brain activity +related to tinnitus. However, the results obtained from numerous studies vary +greatly, leading to conflicting conclusions. Currently, clinicians rely solely +on their expertise to identify individuals with tinnitus. Researchers in this +field have incorporated various data modalities and machine-learning techniques +to aid clinicians in identifying tinnitus characteristics and classifying +people with tinnitus. This article reviews studies +that use machine learning (ML) to identify or predict tinnitus +patients using EEG signals as input data. We have evaluated 11 articles +published between 2016 and 2023 using a systematic literature review (SLR) +method. This article summarizes all the research reviewed +and compares the significant aspects of each. Additionally, we performed +statistical analyses to gain a deeper comprehension of the most recent research +in this area. Almost all of the reviewed articles followed a five-step +procedure to achieve the goal of tinnitus disclosure. Finally, we discuss the +open issues and challenges in this method of tinnitus recognition or +prediction and suggest future directions for research. + +
+
+
+
+
+ + ☆ PrObeD: Proactive Object Detection Wrapper + + +
+ Previous research in 2D object detection focuses on various tasks, +including detecting objects in generic and camouflaged images. These works are +regarded as passive works for object detection as they take the input image as +is. However, convergence to global minima is not guaranteed to be optimal in +neural networks; therefore, we argue that the trained weights in the object +detector are not optimal. To rectify this problem, we propose a wrapper based +on proactive schemes, PrObeD, which enhances the performance of these object +detectors by learning a signal. PrObeD consists of an encoder-decoder +architecture, where the encoder network generates an image-dependent signal +termed a template to encrypt the input images, and the decoder recovers this +template from the encrypted images. We propose that learning the optimum +template results in an object detector with an improved detection performance. +The template acts as a mask to the input images to highlight semantics useful +for the object detector. Finetuning the object detector with these encrypted +images enhances the detection performance for both generic and camouflaged images. Our +experiments on MS-COCO, CAMO, COD10K, and NC4K datasets show improvement +over different detectors after applying PrObeD. Our models/codes are available +at https://github.com/vishal3477/Proactive-Object-Detection. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale + Point Cloud Data NeurIPS + + +
+ City-scale 3D point clouds are a promising way to express detailed and +complicated outdoor structures. They encompass both the appearance and geometry +features of segmented city components, including cars, streets, and buildings, +that can be utilized for attractive applications such as user-interactive +navigation of autonomous vehicles and drones. However, compared to the +extensive text annotations available for images and indoor scenes, the scarcity +of text annotations for outdoor scenes poses a significant challenge for +achieving these applications. To tackle this problem, we introduce the +CityRefer dataset for city-level visual grounding. The dataset consists of 35k +natural language descriptions of 3D objects appearing in SensatUrban city +scenes and 5k landmark labels synchronized with OpenStreetMap. To ensure the +quality and accuracy of the dataset, all descriptions and labels in the +CityRefer dataset are manually verified. We also have developed a baseline +system that can learn encoded language descriptions, 3D object instances, and +geographical information about the city's landmarks to perform visual grounding +on the CityRefer dataset. To the best of our knowledge, the CityRefer dataset +is the largest city-level visual grounding dataset for localizing specific 3D +objects. + +
+
+ comment: NeurIPS D&B 2023. The first two authors contributed equally +
+
+
+
+
+ + ☆ Pre-training with Random Orthogonal Projection Image Modeling + + +
+ Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual +pre-training without the use of labels. MIM applies random crops to input +images, processes them with an encoder, and then recovers the masked inputs +with a decoder, which encourages the network to capture and learn structural +information about objects and scenes. The intermediate feature representations +obtained from MIM are suitable for fine-tuning on downstream tasks. In this +paper, we propose an Image Modeling framework based on random orthogonal +projection instead of binary masking as in MIM. Our proposed Random Orthogonal +Projection Image Modeling (ROPIM) reduces spatially-wise token information +under a guaranteed bound on the noise variance and can be considered as masking +the entire spatial image area under locally varying masking degrees. Since ROPIM +uses a random subspace for the projection that realizes the masking step, the +readily available complement of the subspace can be used during unmasking to +promote recovery of removed information. In this paper, we show that using +random orthogonal projection leads to superior performance compared to +crop-based masking. We demonstrate state-of-the-art results on several popular +benchmarks. + +
+
+
+
+
+ + ☆ Online Multi-view Anomaly Detection with Disentangled Product-of-Experts + Modeling + + +
+ Multi-view or even multi-modal data is appealing yet challenging for +real-world applications. Detecting anomalies in multi-view data is a prominent +recent research topic. However, most of the existing methods 1) are only +suitable for two views or type-specific anomalies, 2) suffer from the issue of +fusion disentanglement, and 3) do not support online detection after model +deployment. To address these challenges, our main ideas in this paper are +three-fold: multi-view learning, disentangled representation learning, and +generative modeling. To this end, we propose dPoE, a novel multi-view variational +autoencoder model that involves (1) a Product-of-Experts (PoE) layer in +tackling multi-view data, (2) a Total Correlation (TC) discriminator in +disentangling view-common and view-specific representations, and (3) a joint +loss function in wrapping up all components. In addition, we devise theoretical +information bounds to control both view-common and view-specific +representations. Extensive experiments on six real-world datasets demonstrate +that the proposed dPoE outperforms baselines markedly. + +
+
+ comment: Accepted by ACM Multimedia 2023, 10 pages, 5 tables, and 3 figures +
+
+
+
+
+ + ☆ Audio-Visual Instance Segmentation + + +
+ In this paper, we propose a new multi-modal task, namely audio-visual +instance segmentation (AVIS), in which the goal is to identify, segment, and +track individual sounding object instances in audible videos, simultaneously. +To our knowledge, it is the first time that instance segmentation has been +extended into the audio-visual domain. To better facilitate this research, we +construct the first audio-visual instance segmentation benchmark (AVISeg). +Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6 +seconds from YouTube and public audio-visual datasets, where 117 videos have +been annotated by using an interactive semi-automatic labeling tool based on +the Segment Anything Model (SAM). In addition, we present a simple baseline +model for the AVIS task. Our new model introduces an audio branch and a +cross-modal fusion module to Mask2Former to locate all sounding objects. +Finally, we evaluate the proposed method using two backbones on AVISeg. We +believe that AVIS will inspire the community towards a more comprehensive +multi-modal understanding. + +
+
+
+
+
+ + ☆ Triplet Attention Transformer for Spatiotemporal Predictive Learning WACV 2024 + + +
+ Spatiotemporal predictive learning offers a self-supervised learning paradigm +that enables models to learn both spatial and temporal patterns by predicting +future sequences based on historical sequences. Mainstream methods are +dominated by recurrent units, yet they are limited by their lack of +parallelization and often underperform in real-world scenarios. To improve +prediction quality while maintaining computational efficiency, we propose an +innovative triplet attention transformer designed to capture both inter-frame +dynamics and intra-frame static features. Specifically, the model incorporates +the Triplet Attention Module (TAM), which replaces traditional recurrent units +by exploring self-attention mechanisms in temporal, spatial, and channel +dimensions. In this configuration: (i) temporal tokens contain abstract +representations of inter-frame, facilitating the capture of inherent temporal +dependencies; (ii) spatial and channel attention combine to refine the +intra-frame representation by performing fine-grained interactions across +spatial and channel dimensions. Alternating temporal, spatial, and +channel-level attention allows our approach to learn more complex short- and +long-range spatiotemporal dependencies. Extensive experiments demonstrate +performance surpassing existing recurrent-based and recurrent-free methods, +achieving state-of-the-art under multi-scenario examination including moving +object trajectory prediction, traffic flow prediction, driving scene +prediction, and human motion capture. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ☆ Foundational Models in Medical Imaging: A Comprehensive Survey and + Future Vision + + +
+ Foundation models, large-scale, pre-trained deep-learning models adapted to a +wide range of downstream tasks, have gained significant interest lately, and +various deep-learning problems are undergoing a paradigm shift with the rise of +these models. Trained on large-scale datasets to bridge the gap between +different modalities, foundation models facilitate contextual reasoning, +generalization, and prompt capabilities at test time. The predictions of these +models can be adjusted for new tasks by augmenting the model input with +task-specific hints called prompts without requiring extensive labeled data and +retraining. Capitalizing on the advances in computer vision, medical imaging +has also marked a growing interest in these models. To assist researchers in +navigating this direction, this survey intends to provide a comprehensive +overview of foundation models in the domain of medical imaging. Specifically, +we initiate our exploration by providing an exposition of the fundamental +concepts forming the basis of foundation models. Subsequently, we offer a +methodical taxonomy of foundation models within the medical domain, proposing a +classification system primarily structured around training strategies, while +also incorporating additional facets such as application domains, imaging +modalities, specific organs of interest, and the algorithms integral to these +models. Furthermore, we emphasize the practical use case of some selected +approaches and then discuss the opportunities, applications, and future +directions of these large-scale pre-trained models for analyzing medical +images. In the same vein, we address the prevailing challenges and research +pathways associated with foundational models in medical imaging. These +encompass the areas of interpretability, data management, computational +requirements, and the nuanced issue of contextual comprehension. + +
+
+ comment: The paper is currently in the process of being prepared for + submission to MIA +
+
+
+
+
+ + ☆ Efficient Object Detection in Optical Remote Sensing Imagery via + Attention-based Feature Distillation + + +
+ Efficient object detection methods have recently received great attention in +remote sensing. Although deep convolutional networks often have excellent +detection accuracy, their deployment on resource-limited edge devices is +difficult. Knowledge distillation (KD) is a strategy for addressing this issue +since it makes models lightweight while maintaining accuracy. However, existing +KD methods for object detection have encountered two constraints. First, they +discard potentially important background information and only distill nearby +foreground regions. Second, they only rely on the global context, which limits +the student detector's ability to acquire local information from the teacher +detector. To address the aforementioned challenges, we propose Attention-based +Feature Distillation (AFD), a new KD approach that distills both local and +global information from the teacher detector. To enhance local distillation, we +introduce a multi-instance attention mechanism that effectively distinguishes +between background and foreground elements. This approach prompts the student +detector to focus on the pertinent channels and pixels, as identified by the +teacher detector. Local distillation lacks global information, thus attention +global distillation is proposed to reconstruct the relationship between various +pixels and pass it from teacher to student detector. The performance of AFD is +evaluated on two public aerial image benchmarks, and the evaluation results +demonstrate that AFD in object detection can attain the performance of other +state-of-the-art models while being efficient. + +
+
+
+
+
+ + ☆ Foundation Models for Generalist Geospatial Artificial Intelligence + + +
+ Significant progress in the development of highly adaptable and reusable +Artificial Intelligence (AI) models is expected to have a significant impact on +Earth science and remote sensing. Foundation models are pre-trained on large +unlabeled datasets through self-supervision, and then fine-tuned for various +downstream tasks with small labeled datasets. This paper introduces a +first-of-a-kind framework for the efficient pre-training and fine-tuning of +foundational models on extensive geospatial data. We have utilized this +framework to create Prithvi, a transformer-based geospatial foundational model +pre-trained on more than 1TB of multispectral satellite imagery from the +Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the +efficacy of our framework in successfully fine-tuning Prithvi to a range of +Earth observation tasks that have not been tackled by previous work on +foundation models involving multi-temporal cloud gap imputation, flood mapping, +wildfire scar segmentation, and multi-temporal crop segmentation. Our +experiments show that the pre-trained model accelerates the fine-tuning process +compared to leveraging randomly initialized weights. In addition, pre-trained +Prithvi compares well against the state-of-the-art, e.g., outperforming a +conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) +in the structural similarity index. Finally, due to the limited availability of +labeled data in the field of Earth observation, we gradually reduce the +quantity of available labeled data for refining the model to evaluate data +efficiency and demonstrate that data can be decreased significantly without +affecting the model's accuracy. The pre-trained 100 million parameter model and +corresponding fine-tuning workflows have been released publicly as open source +contributions to the global Earth sciences community through Hugging Face. + +
+
+
+
+
+ + ☆ Med-DANet V2: A Flexible Dynamic Architecture for Efficient Medical + Volumetric Segmentation WACV 2024 + + +
+ Recent works have shown that the computational efficiency of 3D medical image +(e.g. CT and MRI) segmentation can be impressively improved by dynamic +inference based on slice-wise complexity. As a pioneering work, a dynamic +architecture network for medical volumetric segmentation (i.e. Med-DANet) has +achieved a favorable accuracy and efficiency trade-off by dynamically selecting +a suitable 2D candidate model from the pre-defined model bank for different +slices. However, the issues of incomplete data analysis, high training costs, +and the two-stage pipeline in Med-DANet require further improvement. To this +end, this paper further explores a unified formulation of the dynamic inference +framework from the perspective of both the data itself and the model structure. +For each slice of the input volume, our proposed method dynamically selects an +important foreground region for segmentation based on the policy generated by +our Decision Network and Crop Position Network. Besides, we propose to insert a +stage-wise quantization selector to the employed segmentation model (e.g. +U-Net) for dynamic architecture adapting. Extensive experiments on BraTS 2019 +and 2020 show that our method achieves comparable or better performance than +previous state-of-the-art methods with much less model complexity. Compared +with previous methods Med-DANet and TransBTS with dynamic and static +architecture respectively, our framework improves the model efficiency by up to +nearly 4.1 and 17.3 times with comparable segmentation results on BraTS 2019. + +
+
+ comment: Accepted by WACV 2024 +
+
+
+
+
+ + ☆ Feature Guided Masked Autoencoder for Self-supervised Learning in Remote + Sensing + + +
+ Self-supervised learning guided by masked image modelling, such as Masked +AutoEncoder (MAE), has attracted wide attention for pretraining vision +transformers in remote sensing. However, MAE tends to excessively focus on +pixel details, thereby limiting the model's capacity for semantic +understanding, in particular for noisy SAR images. In this paper, we explore +spectral and spatial remote sensing image features as improved +MAE-reconstruction targets. We first conduct a study on reconstructing various +image features, all performing comparably well or better than raw pixels. Based +on such observations, we propose Feature Guided Masked Autoencoder (FG-MAE): +reconstructing a combination of Histograms of Oriented Gradients (HOG) and +Normalized Difference Indices (NDI) for multispectral images, and +reconstructing HOG for SAR images. Experimental results on three downstream +tasks illustrate the effectiveness of FG-MAE with a particular boost for SAR +imagery. Furthermore, we demonstrate the well-inherited scalability of FG-MAE +and release a first series of pretrained vision transformers for medium +resolution SAR and multispectral images. + +
+
+ comment: 13 pages, 8 figures +
+
+
+
+
+ + ☆ EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health + Records with Chest X-ray Images NeurIPS 2023 + + +
+ Electronic Health Records (EHRs) contain patients' medical histories +in various multi-modal formats, yet the potential for joint +reasoning across imaging and table modalities remains underexplored in current EHR +Question Answering (QA) systems. In this paper, we introduce EHRXQA, a novel +multi-modal question answering dataset combining structured EHRs and chest +X-ray images. To develop our dataset, we first construct two uni-modal +resources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual +question answering (VQA) benchmark, specifically designed to augment the +imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of +a previously established table-based EHR QA dataset. By integrating these two +uni-modal resources, we successfully construct a multi-modal EHR QA dataset +that necessitates both uni-modal and cross-modal reasoning. To address the +unique challenges of multi-modal questions within EHRs, we propose a +NeuralSQL-based strategy equipped with an external VQA API. This pioneering +endeavor enhances engagement with multi-modal EHR sources and we believe that +our dataset can catalyze advances in real-world medical scenarios such as +clinical decision-making and research. EHRXQA is available at +https://github.com/baeseongsu/ehrxqa. + +
+
+ comment: Accepted at NeurIPS 2023 Datasets and Benchmarks Track (10 pages for + main text, 4 pages for references, 28 pages for supplementary materials) +
+
+
+
+
+ + ☆ Local-Global Self-Supervised Visual Representation Learning + + +
+ Self-supervised representation learning methods mainly focus on image-level +instance discrimination. This study explores the potential benefits of +incorporating patch-level discrimination into existing methods to enhance the +quality of learned representations by simultaneously looking at local and +global visual features. Towards this idea, we present a straightforward yet +effective patch-matching algorithm that can find the corresponding patches +across the augmented views of an image. The augmented views are subsequently +fed into a self-supervised learning framework employing Vision Transformer +(ViT) as its backbone. The result is the generation of both image-level and +patch-level representations. Leveraging the proposed patch-matching algorithm, +the model minimizes the representation distance between not only the CLS tokens +but also the corresponding patches. As a result, the model gains a more +comprehensive understanding of both the entirety of the image as well as its +finer details. We pretrain the proposed method on small, medium, and +large-scale datasets. It is shown that our approach could outperform +state-of-the-art image-level representation learning methods on both image +classification and downstream tasks. Keywords: Self-Supervised Learning; Visual +Representations; Local-Global Representation Learning; Patch-Wise +Representation Learning; Vision Transformer (ViT) + +
+
+
+
+
+ + ☆ One-shot Localization and Segmentation of Medical Images with Foundation + Models NeurIPS 2023 + + +
+ Recent advances in Vision Transformers (ViT) and Stable Diffusion (SD) models +with their ability to capture rich semantic features of the image have been +used for image correspondence tasks on natural images. In this paper, we +examine the ability of a variety of pre-trained ViT (DINO, DINOv2, SAM, CLIP) +and SD models, trained exclusively on natural images, for solving the +correspondence problems on medical images. While many works have made a case +for in-domain training, we show that the models trained on natural images can +offer good performance on medical images across different modalities +(CT, MR, Ultrasound) sourced from various manufacturers, over multiple anatomical +regions (brain, thorax, abdomen, extremities), and on a wide variety of tasks. +Further, we leverage the correspondence with respect to a template image to +prompt a Segment Anything (SAM) model to arrive at single-shot segmentation, +achieving a Dice range of 62%-90% across tasks, using just one image as +reference. We also show that our single-shot method outperforms the recently +proposed few-shot segmentation method - UniverSeg (Dice range 47%-80%) on most +of the semantic segmentation tasks (six out of seven) across medical imaging +modalities. + +
+
+ comment: Accepted at NeurIPS 2023 R0-FoMo Workshop +
+
+
+
+
+ + ☆ Switching Temporary Teachers for Semi-Supervised Semantic Segmentation NeurIPS-2023 + + +
+ The teacher-student framework, prevalent in semi-supervised semantic +segmentation, mainly employs the exponential moving average (EMA) to update a +single teacher's weights based on the student's. However, EMA updates raise a +problem in that the weights of the teacher and student are getting coupled, +causing a potential performance bottleneck. Furthermore, this problem may +become more severe when training with more complicated labels such as +segmentation masks but with few annotated data. This paper introduces Dual +Teacher, a simple yet effective approach that employs dual temporary teachers +aiming to alleviate the coupling problem for the student. The temporary +teachers work in shifts and are progressively improved, thereby consistently preventing +the teacher and student from becoming excessively close. Specifically, the +temporary teachers periodically take turns generating pseudo-labels to train a +student model and maintain the distinct characteristics of the student model +for each epoch. Consequently, Dual Teacher achieves competitive performance on +the PASCAL VOC, Cityscapes, and ADE20K benchmarks with remarkably shorter +training times than state-of-the-art methods. Moreover, we demonstrate that our +approach is model-agnostic and compatible with both CNN- and Transformer-based +models. Code is available at https://github.com/naver-ai/dual-teacher. + +
+
+ comment: NeurIPS-2023 +
+
+
+
+
+ + ☆ Towards Plastic and Stable Exemplar-Free Incremental Learning: A + Dual-Learner Framework with Cumulative Parameter Averaging + + +
+ The dilemma between plasticity and stability presents a significant challenge +in Incremental Learning (IL), especially in the exemplar-free scenario where +accessing old-task samples is strictly prohibited during the learning of a new +task. A straightforward solution to this issue is learning and storing an +independent model for each task, known as Single Task Learning (STL). Despite +the linear growth in model storage with the number of tasks in STL, we +empirically discover that averaging these model parameters can potentially +preserve knowledge across all tasks. Inspired by this observation, we propose a +Dual-Learner framework with Cumulative Parameter Averaging (DLCPA). DLCPA +employs a dual-learner design: a plastic learner focused on acquiring new-task +knowledge and a stable learner responsible for accumulating all learned +knowledge. The knowledge from the plastic learner is transferred to the stable +learner via cumulative parameter averaging. Additionally, several task-specific +classifiers work in cooperation with the stable learner to yield the final +prediction. Specifically, when learning a new task, these modules are updated +in a cyclic manner: i) the plastic learner is initially optimized using a +self-supervised loss besides the supervised loss to enhance the feature +extraction robustness; ii) the stable learner is then updated with respect to +the plastic learner in a cumulative parameter averaging manner to maintain its +task-wise generalization; iii) the task-specific classifier is accordingly +optimized to align with the stable learner. Experimental results on CIFAR-100 +and Tiny-ImageNet show that DLCPA outperforms several state-of-the-art +exemplar-free baselines in both Task-IL and Class-IL settings. + +
+
+
+
+
+ + ☆ Electrical Impedance Tomography: A Fair Comparative Study on Deep + Learning and Analytic-based Approaches + + +
+ Electrical Impedance Tomography (EIT) is a powerful imaging technique with +diverse applications, e.g., medical diagnosis, industrial monitoring, and +environmental studies. The EIT inverse problem is about inferring the internal +conductivity distribution of an object from measurements taken on its boundary. +It is severely ill-posed, necessitating advanced computational methods for +accurate image reconstructions. Recent years have witnessed significant +progress, driven by innovations in analytic-based approaches and deep learning. +This review explores techniques for solving the EIT inverse problem, focusing +on the interplay between contemporary deep learning-based strategies and +classical analytic-based methods. Four state-of-the-art deep learning +algorithms are rigorously examined, harnessing the representational +capabilities of deep neural networks to reconstruct intricate conductivity +distributions. In parallel, two analytic-based methods, rooted in mathematical +formulations and regularisation techniques, are dissected for their strengths +and limitations. These methodologies are evaluated through various numerical +experiments, encompassing diverse scenarios that reflect real-world +complexities. A suite of performance metrics is employed to assess the efficacy +of these methods. These metrics collectively provide a nuanced understanding of +the methods' ability to capture essential features and delineate complex +conductivity patterns. One novel feature of the study is the incorporation of +variable conductivity scenarios, introducing a level of heterogeneity that +mimics textured inclusions. This departure from uniform conductivity +assumptions mimics realistic scenarios where tissues or materials exhibit +spatially varying electrical properties. Exploring how each method responds to +such variable conductivity scenarios opens avenues for understanding their +robustness and adaptability. + +
+
+
+
+
+ + ☆ Benchmark Generation Framework with Customizable Distortions for Image + Classifier Robustness WACV 2024 + + +
+ We present a novel framework for generating adversarial benchmarks to +evaluate the robustness of image classification models. Our framework allows +users to customize the types of distortions to be optimally applied to images, +which helps address the specific distortions relevant to their deployment. The +benchmark can generate datasets at various distortion levels to assess the +robustness of different image classifiers. Our results show that the +adversarial samples generated by our framework with any of the image +classification models, like ResNet-50, Inception-V3, and VGG-16, are effective +and transferable to other models causing them to fail. These failures happen +even when these models are adversarially retrained using state-of-the-art +techniques, demonstrating the generalizability of our adversarial samples. We +achieve competitive performance in terms of net $L_2$ distortion compared to +state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, we +demonstrate our framework achieves such results with simple distortions like +Gaussian noise without introducing unnatural artifacts or color bleeds. This is +made possible by a model-based reinforcement learning (RL) agent and a +technique that reduces a deep tree search of the image for model sensitivity to +perturbations, to a one-level analysis and action. The flexibility of choosing +distortions and setting classification probability thresholds for multiple +classes makes our framework suitable for algorithmic audits. + +
+
+ comment: Accepted at IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV 2024) +
+
+
+
+
+ + ☆ ODM3D: Alleviating Foreground Sparsity for Enhanced Semi-Supervised + Monocular 3D Object Detection WACV 2024 + + +
+ Monocular 3D object detection (M3OD) is a significant yet inherently +challenging task in autonomous driving due to absence of implicit depth cues in +a single RGB image. In this paper, we strive to boost currently underperforming +monocular 3D object detectors by leveraging an abundance of unlabelled data via +semi-supervised learning. Our proposed ODM3D framework entails cross-modal +knowledge distillation at various levels to inject LiDAR-domain knowledge into +a monocular detector during training. By identifying foreground sparsity as the +main culprit behind existing methods' suboptimal training, we exploit the +precise localisation information embedded in LiDAR points to enable more +foreground-attentive and efficient distillation via the proposed BEV occupancy +guidance mask, leading to notably improved knowledge transfer and M3OD +performance. Besides, motivated by insights into why existing cross-modal +GT-sampling techniques fail on our task at hand, we further design a novel +cross-modal object-wise data augmentation strategy for effective RGB-LiDAR +joint learning. Our method ranks 1st in both KITTI validation and test +benchmarks, significantly surpassing all existing monocular methods, supervised +or semi-supervised, on both BEV and 3D detection metrics. + +
+
+ comment: Accepted by WACV 2024 +
+
+
+
+
+ + ☆ Domain Generalisation via Risk Distribution Matching WACV 2024 + + +
+ We propose a novel approach for domain generalisation (DG) leveraging risk +distributions to characterise domains, thereby achieving domain invariance. In +our findings, risk distributions effectively highlight differences between +training domains and reveal their inherent complexities. In testing, we may +observe similar, or potentially intensifying in magnitude, divergences between +risk distributions. Hence, we propose a compelling proposition: Minimising the +divergences between risk distributions across training domains leads to robust +invariance for DG. The key rationale behind this concept is that a model, +trained on domain-invariant or stable features, may consistently produce +similar risk distributions across various domains. Building upon this idea, we +propose Risk Distribution Matching (RDM). Using the maximum mean discrepancy +(MMD) distance, RDM aims to minimise the variance of risk distributions across +training domains. However, when the number of domains increases, the direct +optimisation of variance leads to linear growth in MMD computations, resulting +in inefficiency. Instead, we propose an approximation that requires only one +MMD computation, by aligning just two distributions: that of the worst-case +domain and the aggregated distribution from all domains. Notably, this method +empirically outperforms optimising distributional variance while being +computationally more efficient. Unlike conventional DG matching algorithms, RDM +stands out for its enhanced efficacy by concentrating on scalar risk +distributions, sidestepping the pitfalls of high-dimensional challenges seen in +feature or gradient matching. Our extensive experiments on standard benchmark +datasets demonstrate that RDM shows superior generalisation capability over +state-of-the-art DG methods. + +
+
+ comment: Accepted at 2024 IEEE/CVF Winter Conference on Applications of + Computer Vision (WACV 2024) +
+
+
+
+
+ + ☆ This Looks Like Those: Illuminating Prototypical Concepts Using Multiple + Visualizations + + +
+ We present ProtoConcepts, a method for interpretable image classification +combining deep learning and case-based reasoning using prototypical parts. +Existing work in prototype-based image classification uses a ``this looks like +that'' reasoning process, which dissects a test image by finding prototypical +parts and combining evidence from these prototypes to make a final +classification. However, all of the existing prototypical part-based image +classifiers provide only one-to-one comparisons, where a single training image +patch serves as a prototype to compare with a part of our test image. With +these single-image comparisons, it can often be difficult to identify the +underlying concept being compared (e.g., ``is it comparing the color or the +shape?''). Our proposed method modifies the architecture of prototype-based +networks to instead learn prototypical concepts which are visualized using +multiple image patches. Having multiple visualizations of the same prototype +allows us to more easily identify the concept captured by that prototype (e.g., +``the test image and the related training patches are all the same shade of +blue''), and allows our model to create richer, more interpretable visual +explanations. Our experiments show that our ``this looks like those'' reasoning +process can be applied as a modification to a wide range of existing +prototypical image classification networks while achieving comparable accuracy +on benchmark datasets. + +
+
+
+
+
+ + ☆ Visual Explanations via Iterated Integrated Attributions ICCV 2023 + + +
+ We introduce Iterated Integrated Attributions (IIA) - a generic method for +explaining the predictions of vision models. IIA employs iterative integration +across the input image, the internal representations generated by the model, +and their gradients, yielding precise and focused explanation maps. We +demonstrate the effectiveness of IIA through comprehensive evaluations across +various tasks, datasets, and network architectures. Our results showcase that +IIA produces accurate explanation maps, outperforming other state-of-the-art +explanation techniques. + +
+
+ comment: ICCV 2023 +
+
+
+
+
+ + ♻ ☆ An Inverse Scaling Law for CLIP Training NeurIPS 2023 + + +
+ CLIP, one of the pioneering foundation models that connect images and text, +has enabled many recent breakthroughs in computer vision. However, its +associated training cost is prohibitively high, imposing a significant barrier +to its widespread exploration. In this paper, we present a surprising finding +that there exists an inverse scaling law for CLIP training, whereby the larger +the image/text encoders used, the shorter the sequence length of image/text +tokens that can be applied in training. Moreover, we showcase that the strategy +for reducing image/text token length plays a crucial role in determining the +quality of this scaling law. + As a result of this finding, we are able to successfully train CLIP even with +limited computational resources. For example, using 8 A100 GPUs, our CLIP +models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in ~2 days, +67.8% in ~3 days, and 69.3% in ~4 days. Our method also works well when scaling +up -- with G/14, we register a new record of 83.0% ImageNet-1k zero-shot +accuracy, and meanwhile accelerate the training by ~33x compared to its +OpenCLIP counterpart. By reducing the computation barrier associated with CLIP, +we hope to inspire more research in this field, particularly from academics. +Our code is available at https://github.com/UCSC-VLAA/CLIPA. + +
+
+ comment: NeurIPS 2023 camera-ready +
+
+
+
+
+ + ♻ ☆ ATTA: Anomaly-aware Test-Time Adaptation for Out-of-Distribution + Detection in Segmentation NeurIPS 2023 + + +
+ Recent advancements in dense out-of-distribution (OOD) detection have +primarily focused on scenarios where the training and testing datasets share a +similar domain, with the assumption that no domain shift exists between them. +However, in real-world situations, domain shift often exists and significantly +affects the accuracy of existing out-of-distribution (OOD) detection models. In +this work, we propose a dual-level OOD detection framework to handle domain +shift and semantic shift jointly. The first level distinguishes whether domain +shift exists in the image by leveraging global low-level features, while the +second level identifies pixels with semantic shift by utilizing dense +high-level feature maps. In this way, we can selectively adapt the model to +unseen domains as well as enhance the model's capacity to detect novel classes. +We validate the efficacy of our proposed method on several OOD segmentation +benchmarks, including those with significant domain shifts and those without, +observing consistent performance improvements across various baseline models. +Code is available at +${\href{https://github.com/gaozhitong/ATTA}{https://github.com/gaozhitong/ATTA}}$. + +
+
+ comment: Published in NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Effective Robustness against Natural Distribution Shifts for Models with + Different Training Data NeurIPS 2023 + + +
+ "Effective robustness" measures the extra out-of-distribution (OOD) +robustness beyond what can be predicted from the in-distribution (ID) +performance. Existing effective robustness evaluations typically use a single +test set such as ImageNet to evaluate the ID accuracy. This becomes problematic +when evaluating models trained on different data distributions, e.g., comparing +models trained on ImageNet vs. zero-shot language-image pre-trained models +trained on LAION. In this paper, we propose a new evaluation metric to evaluate +and compare the effective robustness of models trained on different data. To do +this, we control for the accuracy on multiple ID test sets that cover the +training distributions for all the evaluated models. Our new evaluation metric +provides a better estimate of effective robustness when there are models with +different training data. It may also explain the surprising effective +robustness gains of zero-shot CLIP-like models exhibited in prior works that +used ImageNet as the only ID test set, while the gains diminish under our new +evaluation. Additional artifacts including interactive visualizations are +provided at https://shizhouxing.github.io/effective-robustness. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ A Comparison of AIS, X-Band Marine Radar Systems and Camera Surveillance + Systems in the Collection of Tracking Data + + +
+ Maritime traffic has increased in recent years, especially in terms of +seaborne trade. To ensure safety, security, and protection of the marine +environment, several systems have been deployed. To overcome some of their +inconveniences, the collected data is typically fused. The fused data is used +for various purposes; the one of interest to us is target tracking. The most relevant +systems in that context are AIS and X-band marine radar. Many works consider +that visual data provided by camera surveillance systems offer additional +advantages. Therefore, many tracking algorithms using visual data (images) have +been developed. Yet, there is little emphasis on the reasons that make the +integration of camera systems important. Thus, our main aim in this paper is to +analyze the aforementioned surveillance systems for target tracking and +identify some of the maritime security improvements that result from the +integration of cameras into the overall maritime surveillance system. + +
+
+ comment: The journal that published this paper is no longer online. We + discovered it was a predatory journal. Withdrawing this paper will allow us + to publish it elsewhere. We enhanced this paper and will provide its full + text as a replacement +
+
+
+
+
+ + ♻ ☆ SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen + LLMs NeurIPS 2023 + + +
+ In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling +frozen LLMs to perform both understanding and generation tasks involving +non-linguistic modalities such as images or videos. SPAE converts between raw +pixels and interpretable lexical tokens (or words) extracted from the LLM's +vocabulary. The resulting tokens capture both the semantic meaning and the +fine-grained details needed for visual reconstruction, effectively translating +the visual content into a language comprehensible to the LLM, and empowering it +to perform a wide array of multimodal tasks. Our approach is validated through +in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set +of image understanding and generation tasks. Our method marks the first +successful attempt to enable a frozen LLM to generate image content while +surpassing state-of-the-art performance in image understanding tasks, under the +same setting, by over 25%. + +
+
+ comment: NeurIPS 2023 spotlight +
+
+
+
+
+ + ♻ ☆ What Can Human Sketches Do for Object Detection? CVPR + + +
+ Sketches are highly expressive, inherently capturing subjective and +fine-grained visual cues. The exploration of such innate properties of human +sketches has, however, been limited to that of image retrieval. In this paper, +for the first time, we cultivate the expressiveness of sketches but for the +fundamental vision task of object detection. The end result is a sketch-enabled +object detection framework that detects based on what \textit{you} sketch -- +\textit{that} ``zebra'' (e.g., one that is eating the grass) in a herd of +zebras (instance-aware detection), and only the \textit{part} (e.g., ``head'' of +a ``zebra'') that you desire (part-aware detection). We further dictate that our +model works without (i) knowing which category to expect at testing (zero-shot) +and (ii) requiring additional bounding boxes (as per fully supervised) and +class labels (as per weakly supervised). Instead of devising a model from the +ground up, we show an intuitive synergy between foundation models (e.g., CLIP) +and existing sketch models built for sketch-based image retrieval (SBIR), which +can already elegantly solve the task -- CLIP to provide model generalisation, +and SBIR to bridge the (sketch$\rightarrow$photo) gap. In particular, we first +perform independent prompting on both sketch and photo branches of an SBIR +model to build highly generalisable sketch and photo encoders on the back of +the generalisation ability of CLIP. We then devise a training paradigm to adapt +the learned encoders for object detection, such that the region embeddings of +detected boxes are aligned with the sketch and photo embeddings from SBIR. +Evaluated on standard object detection datasets like PASCAL-VOC +and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised object +detectors (WSOD) on zero-shot setups. Project Page: +\url{https://pinakinathc.github.io/sketch-detect} + +
+
+ comment: Best Paper Finalist (Top 12 Best Papers). Presented in special + single-track plenary sessions to all attendees in Computer Vision and Pattern + Recognition (CVPR), 2023. Updated an error in Fig.3 (from Softmax to Cross + Entropy). Thanks to the community for pointing it out +
+
+
+
+
+ + ♻ ☆ PERF: Panoramic Neural Radiance Field from a Single Panorama + + +
+ Neural Radiance Field (NeRF) has achieved substantial progress in novel view +synthesis given multi-view images. Recently, some works have attempted to train +a NeRF from a single image with 3D priors. They mainly focus on a limited field +of view with a few occlusions, which greatly limits their scalability to +real-world 360-degree panoramic scenarios with large-size occlusions. In this +paper, we present PERF, a 360-degree novel view synthesis framework that trains +a panoramic neural radiance field from a single panorama. Notably, PERF allows +3D roaming in a complex scene without expensive and tedious image collection. +To achieve this goal, we propose a novel collaborative RGBD inpainting method +and a progressive inpainting-and-erasing method to lift up a 360-degree 2D +scene to a 3D scene. Specifically, we first predict a panoramic depth map as +initialization given a single panorama and reconstruct visible 3D regions with +volume rendering. Then we introduce a collaborative RGBD inpainting approach +into a NeRF for completing RGB images and depth maps from random views, which +is derived from an RGB Stable Diffusion model and a monocular depth estimator. +Finally, we introduce an inpainting-and-erasing strategy to avoid inconsistent +geometry between a newly-sampled view and reference views. The two components +are integrated into the learning of NeRFs in a unified optimization framework +and achieve promising results. Extensive experiments on Replica and a new +dataset PERF-in-the-wild demonstrate the superiority of our PERF over +state-of-the-art methods. Our PERF can be widely used for real-world +applications, such as panorama-to-3D, text-to-3D, and 3D scene stylization +applications. Project page and code are available at +https://perf-project.github.io/ and https://github.com/perf-project/PeRF. + +
+
+ comment: Project Page: https://perf-project.github.io/ , Code: + https://github.com/perf-project/PeRF +
+
+
+
+
+ + ♻ ☆ Revisiting Adversarial Training for ImageNet: Architectures, Training + and Generalization across Threat Models NeurIPS 2023 + + +
+ While adversarial training has been extensively studied for ResNet +architectures and low resolution datasets like CIFAR, much less is known for +ImageNet. Given the recent debate about whether transformers are more robust +than convnets, we revisit adversarial training on ImageNet comparing ViTs and +ConvNeXts. Extensive experiments show that minor changes in architecture, most +notably replacing PatchStem with ConvStem, and training scheme have a +significant impact on the achieved robustness. These changes not only increase +robustness in the seen $\ell_\infty$-threat model, but even more so improve +generalization to unseen $\ell_1/\ell_2$-attacks. Our modified ConvNeXt, +ConvNeXt + ConvStem, yields the most robust $\ell_\infty$-models across +different ranges of model parameters and FLOPs, while our ViT + ConvStem yields +the best generalization to unseen threat models. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Fine-grained Late-interaction Multi-modal Retrieval for Retrieval + Augmented Visual Question Answering NeurIPS 2023 + + +
+ Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to +utilize knowledge from external knowledge bases to answer visually-grounded +questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong +framework to tackle KB-VQA, first retrieves related documents with Dense +Passage Retrieval (DPR) and then uses them to answer questions. This paper +proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which +significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major +limitations in RA-VQA's retriever: (1) the image representations obtained via +image-to-text transforms can be incomplete and inaccurate and (2) relevance +scores between queries and documents are computed with one-dimensional +embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes +these limitations by obtaining image representations that complement those from +the image-to-text transforms using a vision model aligned with an existing +text-based retriever through a simple alignment network. FLMR also encodes +images and questions using multi-dimensional embeddings to capture +finer-grained relevance between queries and documents. FLMR significantly +improves the original RA-VQA retriever's PRRecall@5 by approximately 8\%. +Finally, we equipped RA-VQA with two state-of-the-art large +multi-modal/language models to achieve $\sim61\%$ VQA score in the OK-VQA +dataset. + +
+
+ comment: To appear at NeurIPS 2023. This is the camera-ready version. We fixed + some numbers and added more experiments to address reviewers' comments +
+
+
+
+
+ + ♻ ☆ SpikingNeRF: Making Bio-inspired Neural Networks See through the Real + World + + +
+ Spiking neural networks (SNNs) have been thriving on numerous tasks to +leverage their promising energy efficiency and exploit their potentialities as +biologically plausible intelligence. Meanwhile, the Neural Radiance Fields +(NeRF) render high-quality 3D scenes with massive energy consumption, but few +works delve into the energy-saving solution with a bio-inspired approach. In +this paper, we propose SpikingNeRF, which aligns the radiance ray with the +temporal dimension of SNN, to naturally accommodate the SNN to the +reconstruction of Radiance Fields. Thus, the computation turns into a +spike-based, multiplication-free manner, reducing the energy consumption. In +SpikingNeRF, each sampled point on the ray is matched onto a particular time +step, and represented in a hybrid manner where the voxel grids are maintained +as well. Based on the voxel grids, it is determined whether each sampled point should be +masked for better training and inference. However, this operation also incurs +irregular temporal length. We propose the temporal padding strategy to tackle +the masked samples to maintain regular temporal length, i.e., regular tensors, +and the temporal condensing strategy to form a denser data structure for +hardware-friendly computation. Extensive experiments on various datasets +demonstrate that our method reduces energy consumption by 70.79\% on average +and obtains comparable synthesis quality with the ANN baseline. + +
+
+
+
+
+ + ♻ ☆ Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale + From A New Perspective NeurIPS 2023 + + +
+ We present a new dataset condensation framework termed Squeeze, Recover and +Relabel (SRe$^2$L) that decouples the bilevel optimization of model and +synthetic data during training, to handle varying scales of datasets, model +architectures and image resolutions for efficient dataset condensation. The +proposed method demonstrates flexibility across diverse dataset scales and +exhibits multiple advantages in terms of arbitrary resolutions of synthesized +images, low training cost and memory consumption with high-resolution +synthesis, and the ability to scale up to arbitrary evaluation network +architectures. Extensive experiments are conducted on Tiny-ImageNet and full +ImageNet-1K datasets. Under 50 IPC, our approach achieves the highest 42.5% and +60.8% validation accuracy on Tiny-ImageNet and ImageNet-1K, outperforming all +previous state-of-the-art methods by margins of 14.5% and 32.9%, respectively. +Our approach also surpasses MTT in terms of speed by approximately 52$\times$ +(ConvNet-4) and 16$\times$ (ResNet-18) faster with less memory consumption of +11.6$\times$ and 6.4$\times$ during data synthesis. Our code and condensed +datasets of 50, 200 IPC with 4K recovery budget are available at +https://github.com/VILA-Lab/SRe2L. + +
+
+ comment: NeurIPS 2023 spotlight. Code at https://github.com/VILA-Lab/SRe2L +
+
+
+
+
+ + ♻ ☆ Parameter-efficient Tuning of Large-scale Multimodal Foundation Model NeurIPS2023 + + +
+ Driven by the progress of large-scale pre-training, parameter-efficient +transfer learning has gained immense popularity across different subfields of +Artificial Intelligence. The core is to adapt the model to downstream tasks +with only a small set of parameters. Recently, researchers have leveraged such +proven techniques in multimodal tasks and achieved promising results. However, +two critical issues remain unresolved: how to further reduce the complexity +with lightweight design and how to boost alignment between modalities with +extremely few parameters. In this paper, we propose A graceful prompt framework +for cross-modal transfer (Aurora) to overcome these challenges. Considering the +redundancy in existing architectures, we first utilize the mode approximation +to generate 0.1M trainable parameters to implement the multimodal prompt +tuning, which explores the low intrinsic dimension with only 0.04% parameters +of the pre-trained model. Then, for better modality alignment, we propose the +Informative Context Enhancement and Gated Query Transformation module for +scenarios with extremely few parameters. A thorough evaluation on six cross-modal +benchmarks shows that it not only outperforms the state-of-the-art but even +outperforms the full fine-tuning approach. Our code is available at: +https://github.com/WillDreamer/Aurora. + +
+
+ comment: Accepted by NeurIPS2023 +
+
+
+
+
+ + ♻ ☆ An XAI Approach to Deep Learning Models in the Detection of DCIS + + +
+ The results showed that XAI could indeed be used as a proof of concept to +begin discussions on the implementation of assistive AI systems within the +clinical community. + +
+
+ comment: 12 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ RelaMiX: Exploring Few-Shot Adaptation in Video-based Action Recognition + + +
+ Domain adaptation is essential for activity recognition to ensure accurate +and robust performance across diverse environments, sensor types, and data +sources. Unsupervised domain adaptation methods have been extensively studied, +yet, they require large-scale unlabeled data from the target domain. In this +work, we address Few-Shot Domain Adaptation for video-based Activity +Recognition (FSDA-AR), which leverages a very small amount of labeled target +videos to achieve effective adaptation. This setting is attractive and +promising for applications, as it requires recording and labeling only a few, +or even a single example per class in the target domain, which often includes +activities that are rare yet crucial to recognize. We construct FSDA-AR +benchmarks using five established datasets considering diverse domain types: +UCF101, HMDB51, EPIC-KITCHEN, Sims4Action, and ToyotaSmartHome. Our results +demonstrate that FSDA-AR performs comparably to unsupervised domain adaptation +with significantly fewer (yet labeled) target domain samples. We further +propose a novel approach, RelaMiX, to better leverage the few labeled target +domain samples as knowledge guidance. RelaMiX encompasses a temporal relational +attention network with relation dropout, alongside a cross-domain information +alignment mechanism. Furthermore, it integrates a mechanism for mixing features +within a latent space by using the few-shot target domain samples. The proposed +RelaMiX solution achieves state-of-the-art performance on all datasets within +the FSDA-AR benchmark. To encourage future research of few-shot domain +adaptation for video-based activity recognition, our benchmarks and source code +are made publicly available at https://github.com/KPeng9510/RelaMiX. + +
+
+ comment: Benchmarks and source code are made publicly available at + https://github.com/KPeng9510/RelaMiX +
+
+
+
+
+ + ♻ ☆ JourneyDB: A Benchmark for Generative Image Understanding NeurIPS 2023 + + +
+ While recent advancements in vision-language models have had a transformative +impact on multi-modal comprehension, the extent to which these models possess +the ability to comprehend generated images remains uncertain. Synthetic images, +in comparison to real data, encompass a higher level of diversity in terms of +both content and style, thereby presenting significant challenges for the +models to fully grasp. In light of this challenge, we introduce a comprehensive +dataset, referred to as JourneyDB, that caters to the domain of generative +images within the context of multi-modal visual understanding. Our meticulously +curated dataset comprises 4 million distinct and high-quality generated images, +each paired with the corresponding text prompts that were employed in their +creation. Furthermore, we additionally introduce an external subset with +results of another 22 text-to-image generative models, which makes JourneyDB a +comprehensive benchmark for evaluating the comprehension of generated images. +On our dataset, we have devised four benchmarks to assess the performance of +generated image comprehension in relation to both content and style +interpretation. These benchmarks encompass prompt inversion, style retrieval, +image captioning, and visual question answering. Lastly, we evaluate the +performance of state-of-the-art multi-modal models when applied to the +JourneyDB dataset, providing a comprehensive analysis of their strengths and +limitations in comprehending generated content. We anticipate that the proposed +dataset and benchmarks will facilitate further research in the field of +generative content understanding. The dataset is publicly available at +https://journeydb.github.io. + +
+
+ comment: Accepted to the Thirty-seventh Conference on Neural Information + Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models NeurIPS 2023 + + +
+ Aiming to understand the underlying explainable factors behind +observations and to model the conditional generation process on these factors, +we connect disentangled representation learning to Diffusion Probabilistic +Models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We +propose a new task, disentanglement of DPMs: given a pre-trained DPM, without +any annotations of the factors, the task is to automatically discover the +inherent factors behind the observations and disentangle the gradient fields of +the DPM into sub-gradient fields, each conditioned on the representation of each +discovered factor. With disentangled DPMs, those inherent factors can be +automatically discovered, explicitly represented, and clearly injected into the +diffusion process via the sub-gradient fields. To tackle this task, we devise +an unsupervised approach named DisDiff, achieving disentangled representation +learning in the framework of DPMs. Extensive experiments on synthetic and +real-world datasets demonstrate the effectiveness of DisDiff. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Iterative Multi-granular Image Editing using Diffusion Models WACV + + +
+ Recent advances in text-guided image synthesis have dramatically changed how +creative professionals generate artistic and aesthetically pleasing visual +assets. To fully support such creative endeavors, the process should possess +the ability to: 1) iteratively edit the generations and 2) control the spatial +reach of desired changes (global, local or anything in between). We formalize +this pragmatic problem setting as Iterative Multi-granular Editing. While there +has been substantial progress with diffusion-based models for image synthesis +and editing, they are all one shot (i.e., no iterative editing capabilities) +and do not naturally yield multi-granular control (i.e., covering the full +spectrum of local-to-global edits). To overcome these drawbacks, we propose +EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent +iteration strategy, which re-purposes a pre-trained diffusion model to +facilitate iterative editing. This is complemented by a gradient control +operation for multi-granular control. We introduce a new benchmark dataset to +evaluate our newly proposed setting. We conduct exhaustive quantitative and +qualitative evaluation against recent state-of-the-art approaches adapted to +our task, to bring out the mettle of EMILIE. We hope our work will attract +attention to this newly identified, pragmatic problem setting. + +
+
+ comment: Accepted to IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV) 2024 +
+
+
+
+
+ + ♻ ☆ Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans 3DV2024 + + +
+ We present a novel task for cross-dataset visual grounding in 3D scenes +(Cross3DVG), which overcomes limitations of existing 3D visual grounding +models, specifically their restricted 3D resources and consequent tendencies of +overfitting a specific 3D dataset. We created RIORefer, a large-scale 3D visual +grounding dataset, to facilitate Cross3DVG. It includes more than 63k diverse +descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan, with +human annotations. After training the Cross3DVG model using the source 3D +visual grounding dataset, we evaluate it without target labels using the target +dataset with, e.g., different sensors, 3D reconstruction methods, and language +annotators. Comprehensive experiments are conducted using established visual +grounding models and with CLIP-based multi-view 2D and 3D integration designed +to bridge gaps among 3D datasets. For Cross3DVG tasks, (i) cross-dataset 3D +visual grounding exhibits significantly worse performance than learning and +evaluation with a single dataset because of the 3D data and language variants +across datasets. Moreover, (ii) better object detector and localization modules +and fusing 3D data and multi-view CLIP-based image features can alleviate this +lower performance. Our Cross3DVG task can provide a benchmark for developing +robust 3D visual grounding models to handle diverse 3D scenes while leveraging +deep language understanding. + +
+
+ comment: 3DV2024 +
+
+
+
+
+ + ♻ ☆ UCF: Uncovering Common Features for Generalizable Deepfake Detection + + +
+ Deepfake detection remains a challenging task due to the difficulty of +generalizing to new types of forgeries. This problem primarily stems from the +overfitting of existing detection methods to forgery-irrelevant features and +method-specific patterns. The latter has rarely been studied and is not well +addressed by previous works. This paper presents a novel approach to address +the two types of overfitting issues by uncovering common forgery features. +Specifically, we first propose a disentanglement framework that decomposes +image information into three distinct components: forgery-irrelevant, +method-specific forgery, and common forgery features. To ensure the decoupling +of method-specific and common forgery features, a multi-task learning strategy +is employed, including a multi-class classification that predicts the category +of the forgery method and a binary classification that distinguishes the real +from the fake. Additionally, a conditional decoder is designed to utilize +forgery features as a condition along with forgery-irrelevant features to +generate reconstructed images. Furthermore, a contrastive regularization +technique is proposed to encourage the disentanglement of the common and +specific forgery features. Ultimately, we only utilize the common forgery +features for the purpose of generalizable deepfake detection. Extensive +evaluations demonstrate that our framework generalizes better +than current state-of-the-art methods. + +
+
+
+
+
+ + ♻ ☆ DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection + + +
+ A critical yet frequently overlooked challenge in the field of deepfake +detection is the lack of a standardized, unified, comprehensive benchmark. This +issue leads to unfair performance comparisons and potentially misleading +results. Specifically, there is a lack of uniformity in data processing +pipelines, resulting in inconsistent data inputs for detection models. +Additionally, there are noticeable differences in experimental settings, and +evaluation strategies and metrics lack standardization. To fill this gap, we +present the first comprehensive benchmark for deepfake detection, called +DeepfakeBench, which offers three key contributions: 1) a unified data +management system to ensure consistent input across all detectors, 2) an +integrated framework for state-of-the-art methods implementation, and 3) +standardized evaluation metrics and protocols to promote transparency and +reproducibility. Featuring an extensible, modular-based codebase, DeepfakeBench +contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series +of deepfake detection evaluation protocols and analysis tools, as well as +comprehensive evaluations. Moreover, we provide new insights based on extensive +analysis of these evaluations from various perspectives (e.g., data +augmentations, backbones). We hope that our efforts could facilitate future +research and foster innovation in this increasingly critical domain. All codes, +evaluations, and analyses of our benchmark are publicly available at +https://github.com/SCLBD/DeepfakeBench. + +
+
+
+
+
+ + ♻ ☆ Improving CLIP Training with Language Rewrites NeurIPS 2023 + + +
+ Contrastive Language-Image Pre-training (CLIP) stands as one of the most +effective and scalable methods for training transferable vision models using +paired image and text data. CLIP models are trained using contrastive loss, +which typically relies on data augmentations to prevent overfitting and +shortcuts. However, in the CLIP training paradigm, data augmentations are +exclusively applied to image inputs, while language inputs remain unchanged +throughout the entire training process, limiting the exposure of diverse texts +to the same image. In this paper, we introduce Language augmented CLIP +(LaCLIP), a simple yet highly effective approach to enhance CLIP training +through language rewrites. Leveraging the in-context learning capability of +large language models, we rewrite the text descriptions associated with each +image. These rewritten texts exhibit diversity in sentence structure and +vocabulary while preserving the original key concepts and meanings. During +training, LaCLIP randomly selects either the original texts or the rewritten +versions as text augmentations for each image. Extensive experiments on CC3M, +CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with +language rewrites significantly improves the transfer performance without +computation or memory overhead during training. Specifically for ImageNet +zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on +LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Semantic-aware Consistency Network for Cloth-changing Person + Re-Identification ACM MM 2023 + + +
+ Cloth-changing Person Re-Identification (CC-ReID) is a challenging task that +aims to retrieve the target person across multiple surveillance cameras when +clothing changes might happen. Despite recent progress in CC-ReID, existing +approaches are still hindered by the interference of clothing variations since +they lack effective constraints to keep the model consistently focused on +clothing-irrelevant regions. To address this issue, we present a Semantic-aware +Consistency Network (SCNet) to learn identity-related semantic features by +proposing effective consistency constraints. Specifically, we generate the +black-clothing image by erasing pixels in the clothing area, which explicitly +mitigates the interference from clothing variations. In addition, to fully +exploit the fine-grained identity information, a head-enhanced attention module +is introduced, which learns soft attention maps by utilizing the proposed +part-based matching loss to highlight head information. We further design a +semantic consistency loss to facilitate the learning of high-level +identity-related semantic features, forcing the model to focus on semantically +consistent cloth-irrelevant regions. By using the consistency constraint, our +model does not require any extra auxiliary segmentation module to generate the +black-clothing image or locate the head region during the inference stage. +Extensive experiments on four cloth-changing person Re-ID datasets (LTCC, PRCC, +Vc-Clothes, and DeepChange) demonstrate that our proposed SCNet makes +significant improvements over prior state-of-the-art approaches. Our code is +available at: https://github.com/Gpn-star/SCNet. + +
+
+ comment: Accepted by ACM MM 2023 +
+
+
+
+
+ + ♻ ☆ An Intelligent Remote Sensing Image Quality Inspection System + + +
+ Due to the inevitable presence of quality problems, quality inspection of +remote sensing images is indeed an indispensable step between the acquisition +and the application of them. However, traditional manual inspection suffers +from low efficiency. Hence, we propose a novel deep learning-based two-step +intelligent system consisting of multiple advanced computer vision models, +which first performs image classification by SwinV2 and then accordingly adopts +the most appropriate method, such as semantic segmentation by Segformer, to +localize the quality problems. Results demonstrate that the proposed method +exhibits excellent performance and efficiency, surpassing traditional methods. +Furthermore, we conduct an initial exploration of applying multimodal models to +remote sensing image quality inspection. + +
+
+
+
+
+ + ♻ ☆ Species196: A One-Million Semi-supervised Dataset for Fine-grained + Species Recognition NeurIPS 2023 + + +
+ The development of foundation vision models has pushed general visual +recognition to a high level, but these models cannot adequately address fine-grained +recognition in specialized domains such as invasive species classification. +Identifying and managing invasive species has strong social and ecological +value. Currently, most invasive species datasets are limited in scale and cover +a narrow range of species, which restricts the development of deep-learning +based invasion biometrics systems. To fill this gap, we introduce +Species196, a large-scale semi-supervised dataset of 196-category invasive +species. It collects over 19K images with expert-level accurate annotations +(Species196-L), and 1.2M unlabeled images of invasive species (Species196-U). The +dataset provides four experimental settings for benchmarking the existing +models and algorithms, namely, supervised learning, semi-supervised learning, +self-supervised pretraining and zero-shot inference ability of large +multi-modal models. To facilitate future research on these four learning +paradigms, we conduct an empirical study of the representative methods on the +introduced dataset. The dataset is publicly available at +https://species-dataset.github.io/. + +
+
+ comment: Accepted by NeurIPS 2023 Track Datasets and Benchmarks +
+
+
+
+
+ + ♻ ☆ FastHuman: Reconstructing High-Quality Clothed Human in Minutes 3DV 2024 + + +
+ We propose an approach for optimizing high-quality clothed human body shapes +in minutes, using multi-view posed images. While traditional neural rendering +methods struggle to disentangle geometry and appearance using only rendering +loss, and are computationally intensive, our method uses a mesh-based patch +warping technique to ensure multi-view photometric consistency, and spherical +harmonics (SH) illumination to refine geometric details efficiently. We employ +an oriented point cloud shape representation and SH shading, which significantly +reduce optimization and rendering times compared to implicit methods. Our +approach has demonstrated promising results on both synthetic and real-world +datasets, making it an effective solution for rapidly generating high-quality +human body shapes. Project page: +https://l1346792580123.github.io/nccsfs/ + +
+
+ comment: International Conference on 3D Vision, 3DV 2024 +
+
+
+
+
+ + ♻ ☆ DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model + Statistics NeurIPS 2023 + + +
+ Diffusion probabilistic models (DPMs) have exhibited excellent performance +for high-fidelity image generation while suffering from inefficient sampling. +Recent works accelerate the sampling procedure by proposing fast ODE solvers +that leverage the specific ODE form of DPMs. However, they rely heavily on +specific parameterization during inference (such as noise/data prediction), +which might not be the optimal choice. In this work, we propose a novel +formulation towards the optimal parameterization during sampling that minimizes +the first-order discretization error of the ODE solution. Based on such +formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by +introducing several coefficients efficiently computed on the pretrained model, +which we call empirical model statistics. We further incorporate multistep +methods and a predictor-corrector framework, and propose some techniques for +improving sample quality at small numbers of function evaluations (NFE) or +large guidance scales. Experiments show that DPM-Solver-v3 achieves +consistently better or comparable performance in both unconditional and +conditional sampling with both pixel-space and latent-space DPMs, especially in +5-10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on +unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable +Diffusion, bringing a speed-up of 15%-30% compared to previous +state-of-the-art training-free methods. Code is available at +https://github.com/thu-ml/DPM-Solver-v3. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ LayoutGPT: Compositional Visual Planning and Generation with Large + Language Models NeurIPS 2023 + + +
+ Attaining a high degree of user controllability in visual generation often +requires intricate, fine-grained inputs like layouts. However, such inputs +impose a substantial burden on users when compared to simple text inputs. To +address the issue, we study how Large Language Models (LLMs) can serve as +visual planners by generating layouts from text conditions, and thus +collaborate with visual generative models. We propose LayoutGPT, a method to +compose in-context visual demonstrations in style sheet language to enhance the +visual planning skills of LLMs. LayoutGPT can generate plausible layouts in +multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also +shows superior performance in converting challenging language concepts like +numerical and spatial relations to layout arrangements for faithful +text-to-image generation. When combined with a downstream image generation +model, LayoutGPT outperforms text-to-image models/systems by 20-40% and +achieves performance comparable to human users in designing visual layouts for +numerical and spatial correctness. Lastly, LayoutGPT achieves comparable +performance to supervised methods in 3D indoor scene synthesis, demonstrating +its effectiveness and potential in multiple visual domains. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ 4K4D: Real-Time 4D View Synthesis at 4K Resolution + + +
+ This paper targets high-fidelity and real-time view synthesis of dynamic 3D +scenes at 4K resolution. Recently, some methods on dynamic view synthesis have +shown impressive rendering quality. However, their speed is still limited when +rendering high-resolution images. To overcome this problem, we propose 4K4D, a +4D point cloud representation that supports hardware rasterization and enables +unprecedented rendering speed. Our representation is built on a 4D feature grid +so that the points are naturally regularized and can be robustly optimized. In +addition, we design a novel hybrid appearance model that significantly boosts +the rendering quality while preserving efficiency. Moreover, we develop a +differentiable depth peeling algorithm to effectively learn the proposed model +from RGB videos. Experiments show that our representation can be rendered at +over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the +ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x +faster than previous methods and achieves the state-of-the-art rendering +quality. Our project page is available at https://zju3dv.github.io/4k4d/. + +
+
+ comment: Project Page: https://zju3dv.github.io/4k4d +
+
+
+
+
+ + ♻ ☆ UNK-VQA: A Dataset and A Probe into Multi-modal Large Models' Abstention + Ability + + +
+ Teaching Visual Question Answering (VQA) models to refrain from answering +unanswerable questions is necessary for building a trustworthy AI system. +Existing studies, though they have explored various aspects of VQA, have somewhat +ignored this particular attribute. This paper aims to bridge the research gap +by contributing a comprehensive dataset, called UNK-VQA. The dataset is +specifically designed to address the challenge of questions that models do not +know. To this end, we first augment the existing data via deliberate +perturbations on either the image or question. Specifically, we carefully ensure +that the question-image semantics remain close to the original unperturbed +distribution. By this means, the identification of unanswerable questions +becomes challenging, setting our dataset apart from others that involve mere +image replacement. We then extensively evaluate the zero- and few-shot +performance of several emerging multi-modal large models and discover their +significant limitations when applied to our dataset. Additionally, we +propose a straightforward method to tackle these unanswerable questions. This +dataset, we believe, will serve as a valuable benchmark for enhancing the +abstention capability of VQA models, thereby leading to increased +trustworthiness of AI systems. We have made the dataset available at +https://github.com/guoyang9/UNK-VQA to facilitate +further exploration in this area. + +
+
+
+
+
+ + ♻ ☆ OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping NeurIPS 2023 + + +
+ Accurately depicting the complex traffic scene is a vital component for +autonomous vehicles to execute correct judgments. However, existing benchmarks +tend to oversimplify the scene by solely focusing on lane perception tasks. +Observing that human drivers rely on both lanes and traffic signals to operate +their vehicles safely, we present OpenLane-V2, the first dataset on topology +reasoning for traffic scene structure. The objective of the presented dataset +is to advance research in understanding the structure of road scenes by +examining the relationship between perceived entities, such as traffic elements +and lanes. Leveraging existing datasets, OpenLane-V2 consists of 2,000 +annotated road scenes that describe traffic elements and their correlation to +the lanes. It comprises three primary sub-tasks, including the 3D lane +detection inherited from OpenLane, accompanied by corresponding metrics to +evaluate the model's performance. We evaluate various state-of-the-art methods, +and present their quantitative and qualitative results on OpenLane-V2 to +indicate future avenues for investigating topology reasoning in traffic scenes. + +
+
+ comment: Accepted by NeurIPS 2023 Track on Datasets and Benchmarks | + OpenLane-V2 Dataset: https://github.com/OpenDriveLab/OpenLane-V2 +
+
+
+
+
+
+
+
+ + Information Retrieval 8 + +
+
+
+ + ☆ Leveraging Multimodal Features and Item-level User Feedback for Bundle + Construction + + +
+ Automatic bundle construction is a crucial prerequisite step in various +bundle-aware online services. Previous approaches are mostly designed to model +the bundling strategy of existing bundles. However, it is hard to acquire +large-scale well-curated bundle datasets, especially for those platforms that +have not offered bundle services before. Even for platforms with mature bundle +services, there are still many items that are included in few or even zero +bundles, which gives rise to sparsity and cold-start challenges in the bundle +construction models. To tackle these issues, we aim to leverage multimodal +features, item-level user feedback signals, and the bundle composition +information, to achieve a comprehensive formulation of bundle construction. +Nevertheless, such a formulation poses two new technical challenges: 1) how to +learn effective representations by optimally unifying multiple features, and 2) +how to address the problems of missing modalities, noise, and sparsity +induced by the incomplete query bundles. In this work, to address these +technical challenges, we propose a Contrastive Learning-enhanced Hierarchical +Encoder method (CLHE). Specifically, we use self-attention modules to combine +the multimodal and multi-item features, and then leverage both item- and +bundle-level contrastive learning to enhance the representation learning, thus +countering the missing-modality, noise, and sparsity problems. Extensive +experiments on four datasets in two application domains demonstrate that our +method outperforms a list of SOTA methods. The code and dataset are available +at https://github.com/Xiaohao-Liu/CLHE. + +
+
+
+
+
+ + ☆ Empowering Collaborative Filtering with Principled Adversarial + Contrastive Loss NeurIPS 2023 + + +
+ Contrastive Learning (CL) has achieved impressive performance in +self-supervised learning tasks, showing superior generalization ability. +Inspired by the success, adopting CL into collaborative filtering (CF) is +prevailing in semi-supervised top-K recommendations. The basic idea is to +routinely conduct heuristic-based data augmentation and apply contrastive +losses (e.g., InfoNCE) on the augmented views. Yet, some CF-tailored challenges +make this adoption suboptimal, such as the issue of out-of-distribution, the +risk of false negatives, and the nature of top-K evaluation. They necessitate +the CL-based CF scheme to focus more on mining hard negatives and +distinguishing false negatives from the vast unlabeled user-item interactions, +for informative contrast signals. Worse still, there is limited understanding +of contrastive loss in CF methods, especially w.r.t. its generalization +ability. To bridge the gap, we delve into the reasons underpinning the success +of contrastive loss in CF, and propose a principled Adversarial InfoNCE loss +(AdvInfoNCE), which is a variant of InfoNCE, specially tailored for CF methods. +AdvInfoNCE adaptively explores and assigns hardness to each negative instance +in an adversarial fashion and further utilizes a fine-grained hardness-aware +ranking criterion to empower the recommender's generalization ability. Training +CF models with AdvInfoNCE, we validate the effectiveness of AdvInfoNCE on both +synthetic and real-world benchmark datasets, thus showing its generalization +ability to mitigate out-of-distribution problems. Given the theoretical +guarantees and empirical superiority of AdvInfoNCE over most contrastive loss +functions, we advocate its adoption as a standard loss in recommender systems, +particularly for the out-of-distribution tasks. Codes are available at +https://github.com/LehengTHU/AdvInfoNCE. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Dense Retrieval as Indirect Supervision for Large-space Decision Making EMNLP 2023 + + +
+ Many discriminative natural language understanding (NLU) tasks have large +label spaces. Learning such a process of large-space decision making is +particularly challenging due to the lack of training instances per label and +the difficulty of selection among many fine-grained labels. Inspired by dense +retrieval methods for passage finding in open-domain QA, we propose a +reformulation of large-space discriminative NLU tasks as a learning-to-retrieve +task, leading to a novel solution named Dense Decision Retrieval (DDR). +Instead of predicting fine-grained decisions as logits, DDR adopts a +dual-encoder architecture that learns to predict by retrieving from a decision +thesaurus. This approach not only leverages rich indirect supervision signals +from easy-to-consume learning resources for dense retrieval, but also leads to +enhanced prediction generalizability with a semantically meaningful +representation of the large decision space. When evaluated on tasks with +decision spaces ranging from hundreds to hundred-thousand scales, DDR +substantially outperforms strong baselines, by 27.54% in P@1 on two extreme +multi-label classification tasks, 1.17% in F1 score on ultra-fine entity typing, +and 1.26% in accuracy on three few-shot intent classification tasks on average. +Code and resources are available at https://github.com/luka-group/DDR + +
+
+ comment: EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Embedding in Recommender Systems: A Survey + + +
+ Recommender systems have become an essential component of many online +platforms, providing personalized recommendations to users. A crucial aspect is +embedding techniques that convert high-dimensional discrete features, such +as user and item IDs, into low-dimensional continuous vectors and can enhance +recommendation performance. Applying embedding techniques captures complex +entity relationships and has spurred substantial research. In this survey, we +provide an overview of the recent literature on embedding techniques in +recommender systems. This survey covers embedding methods like collaborative +filtering, self-supervised learning, and graph-based techniques. Collaborative +filtering generates embeddings capturing user-item preferences, excelling in +sparse data. Self-supervised methods leverage contrastive or generative +learning for various tasks. Graph-based techniques like node2vec exploit +complex relationships in network-rich environments. Addressing the scalability +challenges inherent to embedding methods, our survey delves into innovative +directions within the field of recommendation systems. These directions aim to +enhance performance and reduce computational complexity, paving the way for +improved recommender systems. Among these innovative approaches, we +introduce Auto Machine Learning (AutoML), hash techniques, and quantization +techniques in this survey. We discuss various architectures and techniques and +highlight the challenges and future directions in these aspects. This survey +aims to provide a comprehensive overview of the state-of-the-art in this +rapidly evolving field and serve as a useful resource for researchers and +practitioners working in the area of recommender systems. + +
+
+
+
+
+ + ♻ ☆ DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based + Queries NeurIPS 2023 + + +
+ In scientific research, the ability to effectively retrieve relevant +documents based on complex, multifaceted queries is critical. Existing +evaluation datasets for this task are limited, primarily due to the high cost +and effort required to annotate resources that effectively represent complex +queries. To address this, we propose a novel task, Scientific DOcument +Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed +to handle the complex nature of user queries in scientific research. We +developed a benchmark dataset within the field of computer science, consisting +of 100 human-authored complex query cases. For each complex query, we assembled +a collection of 100 relevant documents and produced annotated relevance scores +for ranking them. Recognizing the significant labor of expert annotation, we +also introduce Anno-GPT, a scalable framework for validating the performance of +Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM +annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, +without compromising quality. Furthermore, due to the multi-tiered structure of +these complex queries, the DORIS-MAE dataset can be extended to over 4,000 +sub-query test cases without requiring additional annotation. We evaluated 17 +recent retrieval methods on DORIS-MAE, observing notable performance drops +compared to traditional datasets. This highlights the need for better +approaches to handle complex, multifaceted queries in scientific research. Our +dataset and codebase are available at +https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. + +
+
+ comment: To appear in NeurIPS 2023 Datasets and Benchmarks Track +
+
+
+
+
+ + ♻ ☆ A Diffusion model for POI recommendation + + +
+ Next Point-of-Interest (POI) recommendation is a critical task in +location-based services that aims to provide personalized suggestions for the +user's next destination. Previous works on POI recommendation have focused +on modeling the user's spatial preference. However, existing works that +leverage spatial information are only based on the aggregation of users' +previously visited positions, which discourages the model from recommending POIs +in novel areas. This trait of position-based methods will harm the model's +performance in many situations. Additionally, incorporating sequential +information into the user's spatial preference remains a challenge. In this +paper, we propose Diff-POI: a Diffusion-based model that samples the user's +spatial preference for the next POI recommendation. Inspired by the wide +application of diffusion algorithms in sampling from distributions, Diff-POI +encodes the user's visiting sequence and spatial character with two +tailor-designed graph encoding modules, followed by a diffusion-based sampling +strategy to explore the user's spatial visiting trends. We leverage the +diffusion process and its reversed form to sample from the posterior +distribution and optimize the corresponding score function. We design a joint +training and inference framework to optimize and evaluate the proposed +Diff-POI. Extensive experiments on four real-world POI recommendation datasets +demonstrate the superiority of our Diff-POI over state-of-the-art baseline +methods. Further ablation and parameter studies on Diff-POI reveal the +functionality and effectiveness of the proposed diffusion-based sampling +strategy for addressing the limitations of existing methods. + +
+
+ comment: Accepted by ACM Transactions on Information Systems (TOIS 2023) +
+
+
+
+
+ + ♻ ☆ Unified Off-Policy Learning to Rank: a Reinforcement Learning + Perspective + + +
+ Off-policy Learning to Rank (LTR) aims to optimize a ranker from data +collected by a deployed logging policy. However, existing off-policy learning +to rank methods often make strong assumptions about how users generate the +click data, i.e., the click model, and hence need to tailor their methods +specifically under different click models. In this paper, we unify the +ranking process under general stochastic click models as a Markov Decision +Process (MDP), so that the optimal ranking can be learned with offline +reinforcement learning (RL) directly. Building upon this, we leverage offline +RL techniques for off-policy LTR and propose the Click Model-Agnostic Unified +Off-policy Learning to Rank (CUOLR) method, which can be easily applied to a +wide range of click models. Through a dedicated formulation of the MDP, we show +that offline RL algorithms can adapt to various click models without complex +debiasing techniques or prior knowledge of the model. Results on various +large-scale datasets demonstrate that CUOLR consistently outperforms the +state-of-the-art off-policy learning to rank algorithms while maintaining +consistency and robustness under different click models. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph + Index using Query-sensitivity Entry Vertex + + +
+ Given a vector dataset $\mathcal{X}$ and a query vector $\vec{x}_q$, +graph-based Approximate Nearest Neighbor Search (ANNS) aims to build a graph +index $G$ and approximately return vectors with minimum distances to +$\vec{x}_q$ by searching over $G$. The main drawback of graph-based ANNS is +that a graph index would be too large to fit into memory, especially for a +large-scale $\mathcal{X}$. To solve this, a Product Quantization (PQ)-based +hybrid method called DiskANN is proposed to store a low-dimensional PQ index in +memory and retain a graph index in SSD, thus reducing memory overhead while +ensuring a high search accuracy. However, it suffers from two I/O issues that +significantly affect the overall efficiency: (1) a long routing path from an +entry vertex to the query's neighborhood that results in a large number of I/O +requests and (2) redundant I/O requests during the routing process. We propose +an optimized DiskANN++ to overcome the above issues. Specifically, for the first +issue, we present a query-sensitive entry vertex selection strategy to replace +DiskANN's static graph-central entry vertex by a dynamically determined entry +vertex that is close to the query. For the second I/O issue, we present an +isomorphic mapping on DiskANN's graph index to optimize the SSD layout and +propose an asynchronously optimized Pagesearch based on the optimized SSD +layout as an alternative to DiskANN's beamsearch. Comprehensive experimental +studies on eight real-world datasets demonstrate our DiskANN++'s superiority on +efficiency. We achieve a notable 1.5x to 2.2x improvement on QPS compared to +DiskANN, given the same accuracy constraint. + +
+
+ comment: 14 pages including references, 9 figures +
+
+
+
+
+
+
+
+ + Multimedia 5 + +
+
+
+ + ☆ Leveraging Multimodal Features and Item-level User Feedback for Bundle + Construction + + +
+ Automatic bundle construction is a crucial prerequisite step in various +bundle-aware online services. Previous approaches are mostly designed to model +the bundling strategy of existing bundles. However, it is hard to acquire +a large-scale, well-curated bundle dataset, especially for those platforms that +have not offered bundle services before. Even for platforms with mature bundle +services, there are still many items that are included in few or even zero +bundles, which gives rise to sparsity and cold-start challenges in the bundle +construction models. To tackle these issues, we aim to leverage multimodal +features, item-level user feedback signals, and the bundle composition +information, to achieve a comprehensive formulation of bundle construction. +Nevertheless, such a formulation poses two new technical challenges: 1) how to +learn effective representations by optimally unifying multiple features, and 2) +how to address the missing-modality, noise, and sparsity problems +induced by the incomplete query bundles. In this work, to address these +technical challenges, we propose a Contrastive Learning-enhanced Hierarchical +Encoder method (CLHE). Specifically, we use self-attention modules to combine +the multimodal and multi-item features, and then leverage both item- and +bundle-level contrastive learning to enhance the representation learning, thus +countering the missing-modality, noise, and sparsity problems. Extensive +experiments on four datasets in two application domains demonstrate that our +method outperforms a list of SOTA methods. The code and dataset are available +at https://github.com/Xiaohao-Liu/CLHE. + +
+
+
+
+
+ + ☆ Online Multi-view Anomaly Detection with Disentangled Product-of-Experts + Modeling + + +
+ Multi-view or even multi-modal data is appealing yet challenging for +real-world applications. Detecting anomalies in multi-view data is a prominent +recent research topic. However, most of the existing methods 1) are only +suitable for two views or type-specific anomalies, 2) suffer from the issue of +fusion disentanglement, and 3) do not support online detection after model +deployment. To address these challenges, our main ideas in this paper are +three-fold: multi-view learning, disentangled representation learning, and +generative model. To this end, we propose dPoE, a novel multi-view variational +autoencoder model that involves (1) a Product-of-Experts (PoE) layer in +tackling multi-view data, (2) a Total Correction (TC) discriminator in +disentangling view-common and view-specific representations, and (3) a joint +loss function in wrapping up all components. In addition, we devise theoretical +information bounds to control both view-common and view-specific +representations. Extensive experiments on six real-world datasets demonstrate +that the proposed dPoE outperforms baselines markedly. + +
+
+ comment: Accepted by ACM Multimedia 2023, 10 pages, 5 tables, and 3 figures +
+
+
+
+
+ + ☆ Audio-Visual Instance Segmentation + + +
+ In this paper, we propose a new multi-modal task, namely audio-visual +instance segmentation (AVIS), in which the goal is to identify, segment, and +track individual sounding object instances in audible videos, simultaneously. +To our knowledge, it is the first time that instance segmentation has been +extended into the audio-visual domain. To better facilitate this research, we +construct the first audio-visual instance segmentation benchmark (AVISeg). +Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6 +seconds from YouTube and public audio-visual datasets, where 117 videos have +been annotated by using an interactive semi-automatic labeling tool based on +the Segment Anything Model (SAM). In addition, we present a simple baseline +model for the AVIS task. Our new model introduces an audio branch and a +cross-modal fusion module to Mask2Former to locate all sounding objects. +Finally, we evaluate the proposed method using two backbones on AVISeg. We +believe that AVIS will inspire the community towards a more comprehensive +multi-modal understanding. + +
+
+
+
+
+ + ☆ Deep3DSketch+: Obtaining Customized 3D Model by Single Free-Hand Sketch + through Deep Learning + + +
+ As 3D models become critical in today's manufacturing and product design, +conventional 3D modeling approaches based on Computer-Aided Design (CAD) are +labor-intensive, time-consuming, and have high demands on the creators. This +work aims to introduce an alternative approach to 3D modeling by utilizing +free-hand sketches to obtain desired 3D models. We introduce Deep3DSketch+, +which is a deep-learning algorithm that takes the input of a single free-hand +sketch and produces a complete and high-fidelity model that matches the sketch +input. The neural network has view- and structural-awareness enabled by a Shape +Discriminator (SD) and a Stroke Enhancement Module (SEM), which overcomes the +limitations of sparsity and ambiguity of the sketches. The network design also +brings high robustness to partial sketch input in industrial applications. Our +approach has undergone extensive experiments, demonstrating its +state-of-the-art (SOTA) performance on both synthetic and real-world datasets. +These results validate the effectiveness and superiority of our method compared +to existing techniques. We have demonstrated the conversion of free-hand +sketches into physical 3D objects using additive manufacturing. We believe that +our approach has the potential to accelerate product design and democratize +customized manufacturing. + +
+
+
+
+
+ + ♻ ☆ SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen + LLMs NeurIPS 2023 + + +
+ In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling +frozen LLMs to perform both understanding and generation tasks involving +non-linguistic modalities such as images or videos. SPAE converts between raw +pixels and interpretable lexical tokens (or words) extracted from the LLM's +vocabulary. The resulting tokens capture both the semantic meaning and the +fine-grained details needed for visual reconstruction, effectively translating +the visual content into a language comprehensible to the LLM, and empowering it +to perform a wide array of multimodal tasks. Our approach is validated through +in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set +of image understanding and generation tasks. Our method marks the first +successful attempt to enable a frozen LLM to generate image content while +surpassing state-of-the-art performance in image understanding tasks, under the +same setting, by over 25%. + +
+
+ comment: NeurIPS 2023 spotlight +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 100 + +
+
+
+ + ☆ FP8-LM: Training FP8 Large Language Models + + +
+ In this paper, we explore FP8 low-bit data formats for efficient training of +large language models (LLMs). Our key insight is that most variables, such as +gradients and optimizer states, in LLM training can employ low-precision data +formats without compromising model accuracy and requiring no changes to +hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision +framework for training LLMs. This framework offers three levels of FP8 +utilization to streamline mixed-precision and distributed parallel training for +LLMs. It gradually incorporates 8-bit gradients, optimizer states, and +distributed learning in an incremental manner. Experiment results show that, +during the training of GPT-175B model on H100 GPU platform, our FP8 +mixed-precision training framework not only achieved a remarkable 42% reduction +in real memory usage but also ran 64% faster than the widely adopted BF16 +framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer +Engine by 17%. This largely reduces the training costs for large foundation +models. Furthermore, our FP8 mixed-precision training methodology is generic. +It can be seamlessly applied to other tasks such as LLM instruction tuning and +reinforcement learning with human feedback, offering savings in fine-tuning +expenses. Our FP8 low-precision training framework is open-sourced at +https://github.com/Azure/MS-AMP (aka.ms/MS.AMP). + +
+
+
+
+
+ + ☆ An Approach to Automatically generating Riddles aiding Concept + Attainment + + +
+ One of the primary challenges in online learning environments is to retain +learner engagement. Several different instructional strategies are proposed +both in online and offline environments to enhance learner engagement. The +Concept Attainment Model is one such instructional strategy that focuses on +learners acquiring a deeper understanding of a concept rather than just its +dictionary definition. This is done by searching and listing the properties +used to distinguish examples from non-examples of various concepts. Our work +attempts to apply the Concept Attainment Model to build conceptual riddles for +deployment in online learning environments. The approach involves creating +factual triples from learning resources, classifying them based on their +uniqueness to a concept into `Topic Markers' and `Common', followed by +generating riddles based on the Concept Attainment Model's format and capturing +all possible solutions to those riddles. The results obtained from the human +evaluation of riddles prove encouraging. + +
+
+ comment: 9 pages, 5 figures +
+
+
+
+
+ + ☆ MalFake: A Multimodal Fake News Identification for Malayalam using + Recurrent Neural Networks and VGG-16 + + +
+ The amount of news being consumed online has substantially expanded in recent +years. Fake news has become increasingly common, especially in regional +languages like Malayalam, due to the rapid publication and lack of editorial +standards on some online sites. Fake news may have a terrible effect on +society, causing people to make bad judgments, lose faith in authorities, and +even engage in violent behavior. In the context of India, there +are many regional languages, and fake news is spreading in every language. +Therefore, providing efficient techniques for identifying false information in +regional tongues is crucial. Until now, little to no work has been done in +Malayalam, extracting features from multiple modalities to classify fake news. +Multimodal approaches are more accurate in detecting fake news, as features +from multiple modalities are extracted to build the deep learning +classification model. As far as we know, this is the first piece of work in +Malayalam that uses multimodal deep learning to tackle false information. +Models trained with more than one modality typically outperform models taught +with only one modality. Our study in the Malayalam language utilizing +multimodal deep learning is a significant step toward more effective +misinformation detection and mitigation. + +
+
+
+
+
+ + ☆ Fine-Tuning Language Models Using Formal Methods Feedback + + +
+ Although pre-trained language models encode generic knowledge beneficial for +planning and control, they may fail to generate appropriate control policies +for domain-specific tasks. Existing fine-tuning methods use human feedback to +address this limitation; however, sourcing human feedback is labor-intensive +and costly. We present a fully automated approach to fine-tune pre-trained +language models for applications in autonomous systems, bridging the gap +between generic knowledge and domain-specific requirements while reducing cost. +The method synthesizes automaton-based controllers from pre-trained models +guided by natural language task descriptions. These controllers are verifiable +against independently provided specifications within a world model, which can +be abstract or obtained from a high-fidelity simulator. Controllers with high +compliance with the desired specifications receive higher ranks, guiding the +iterative fine-tuning process. We provide quantitative evidence, primarily in +autonomous driving, to demonstrate the method's effectiveness across multiple +tasks. The results indicate an improvement in percentage of specifications +satisfied by the controller from 60% to 90%. + +
+
+
+
+
+ + ☆ Davidsonian Scene Graph: Improving Reliability in Fine-grained + Evaluation for Text-Image Generation + + +
+ Evaluating text-to-image models is notoriously difficult. A strong recent +approach for assessing text-image faithfulness is based on QG/A (question +generation and answering), which uses pre-trained foundational models to +automatically generate a set of questions and answers from the prompt, and +output images are scored based on whether these answers extracted with a visual +question answering model are consistent with the prompt-based answers. This +kind of evaluation is naturally dependent on the quality of the underlying QG +and QA models. We identify and address several reliability challenges in +existing QG/A work: (a) QG questions should respect the prompt (avoiding +hallucinations, duplications, and omissions) and (b) VQA answers should be +consistent (not asserting that there is no motorcycle in an image while also +claiming the motorcycle is blue). We address these issues with Davidsonian +Scene Graph (DSG), an empirically grounded evaluation framework inspired by +formal semantics. DSG is an automatic, graph-based QG/A that is modularly +implemented to be adaptable to any QG/A module. DSG produces atomic and unique +questions organized in dependency graphs, which (i) ensure appropriate semantic +coverage and (ii) sidestep inconsistent answers. With extensive experimentation +and human evaluation on a range of model configurations (LLM, VQA, and T2I), we +empirically demonstrate that DSG addresses the challenges noted above. Finally, +we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 +prompts, covering a wide range of fine-grained semantic categories with a +balanced distribution. We will release the DSG-1k prompts and the corresponding +DSG questions. + +
+
+ comment: Project website: https://google.github.io/DSG +
+
+
+
+
+ + ☆ Revising with a Backward Glance: Regressions and Skips during Reading as + Cognitive Signals for Revision Policies in Incremental Processing CoNLL 2023 + + +
+ In NLP, incremental processors produce output in instalments, based on +incoming prefixes of the linguistic input. Some tokens trigger revisions, +causing edits to the output hypothesis, but little is known about why models +revise when they revise. A policy that detects the time steps where revisions +should happen can improve efficiency. Still, retrieving a suitable signal to +train a revision policy is an open problem, since it is not naturally available +in datasets. In this work, we investigate the appropriateness of regressions +and skips in human reading eye-tracking data as signals to inform revision +policies in incremental sequence labelling. Using generalised mixed-effects +models, we find that the probability of regressions and skips by humans can +potentially serve as useful predictors for revisions in BiLSTMs and Transformer +models, with consistent results for various languages. + +
+
+ comment: Accepted to CoNLL 2023 +
+
+
+
+
+ + ☆ ArcheType: A Novel Framework for Open-Source Column Type Annotation + using Large Language Models + + +
+ Existing deep-learning approaches to semantic column type annotation (CTA) +have important shortcomings: they rely on semantic types which are fixed at +training time; require a large number of training samples per type and incur +large run-time inference costs; and their performance can degrade when +evaluated on novel datasets, even when types remain constant. Large language +models have exhibited strong zero-shot classification performance on a wide +range of tasks and in this paper we explore their use for CTA. We introduce +ArcheType, a simple, practical method for context sampling, prompt +serialization, model querying, and label remapping, which enables large +language models to solve column type annotation problems in a fully zero-shot +manner. We ablate each component of our method separately, and establish that +improvements to context sampling and label remapping provide the most +consistent gains. ArcheType establishes new state-of-the-art performance on +both zero-shot and fine-tuned CTA, including three new domain-specific +benchmarks, which we release, along with the code to reproduce our results at +https://github.com/penfever/ArcheType. + +
+
+ comment: 17 pages, 8 figures +
+
+
+
+
+ + ☆ INA: An Integrative Approach for Enhancing Negotiation Strategies with + Reward-Based Dialogue System + + +
+ In this paper, we propose a novel negotiation dialogue agent designed for the +online marketplace. Our agent is integrative in nature, i.e., it possesses the +capability to negotiate on price as well as other factors, such as the addition +or removal of items from a deal bundle, thereby offering a more flexible and +comprehensive negotiation experience. We create a new dataset called +Integrative Negotiation Dataset (IND) to enable this functionality. For this +dataset creation, we introduce a new semi-automated data creation method, which +combines defining negotiation intents, actions, and intent-action simulation +between users and the agent to generate potential dialogue flows. Finally, the +prompting of GPT-J, a state-of-the-art language model, is done to generate +dialogues for a given intent, with a human-in-the-loop process for post-editing +and refining minor errors to ensure high data quality. We employ a set of novel +rewards, specifically tailored for the negotiation task to train our +Negotiation Agent, termed the Integrative Negotiation Agent (INA). These +rewards incentivize the chatbot to learn effective negotiation strategies that +can adapt to various contextual requirements and price proposals. By leveraging +the IND, we train our model and conduct experiments to evaluate the +effectiveness of our reward-based dialogue system for negotiation. Our results +demonstrate that the proposed approach and reward system significantly enhance +the agent's negotiation capabilities. The INA successfully engages in +integrative negotiations, displaying the ability to dynamically adjust prices +and negotiate the inclusion or exclusion of items in a bundle deal. + +
+
+
+
+
+ + ☆ Lost in Translation, Found in Spans: Identifying Claims in Multilingual + Social Media EMNLP 2023 + + +
+ Claim span identification (CSI) is an important step in fact-checking +pipelines, aiming to identify text segments that contain a checkworthy claim or +assertion in a social media post. Despite its importance to journalists and +human fact-checkers, it remains a severely understudied problem, and the scarce +research on this topic so far has only focused on English. Here we aim to +bridge this gap by creating a novel dataset, X-CLAIM, consisting of 7K +real-world claims collected from numerous social media platforms in five Indian +languages and English. We report strong baselines with state-of-the-art +encoder-only language models (e.g., XLM-R) and we demonstrate the benefits of +training on multiple languages over alternative cross-lingual transfer methods +such as zero-shot transfer, or training on translated data, from a +high-resource language such as English. We evaluate generative large language +models from the GPT series using prompting methods on the X-CLAIM dataset and +we find that they underperform the smaller encoder-only language models for +low-resource languages. + +
+
+ comment: EMNLP 2023 (main) +
+
+
+
+
+ + ☆ Style Description based Text-to-Speech with Conditional Prosodic Layer + Normalization based Diffusion GAN + + +
+ In this paper, we present a Diffusion GAN based approach (Prosodic Diff-TTS) +that generates high-fidelity speech from a style +description and content text, producing speech samples within only +4 denoising steps. It leverages the novel conditional prosodic layer +normalization to incorporate the style embeddings into the multi-head attention +based phoneme encoder and mel spectrogram decoder based generator architecture +to generate the speech. The style embedding is generated by fine-tuning the +pretrained BERT model on auxiliary tasks such as pitch, speaking speed, +emotion, and gender classification. We demonstrate the efficacy of our proposed +architecture on multi-speaker LibriTTS and PromptSpeech datasets, using +multiple quantitative metrics that measure generated accuracy and MOS. + +
+
+
+
+
+ + ☆ Personas as a Way to Model Truthfulness in Language Models + + +
+ Large Language Models are trained on vast amounts of text from the internet, +which contains both factual and misleading information about the world. Can +language models discern truth from falsehood in this contradicting data? +Expanding on the view that LLMs can model different agents producing the +corpora, we hypothesize that they can cluster truthful text by modeling a +truthful persona: a group of agents that are likely to produce truthful text +and share similar features. For example, trustworthy sources like Wikipedia and +Science usually use formal writing styles and make consistent claims. By +modeling this persona, LLMs can generalize truthfulness beyond the specific +contexts in which each agent generated the training text. For example, the +model can infer that the agent "Wikipedia" will behave truthfully on topics +that were only generated by "Science" because they share a persona. We first +show evidence for the persona hypothesis via two observations: (1) we can probe +whether a model's answer will be truthful before it is generated; (2) +finetuning a model on a set of facts improves its truthfulness on unseen +topics. Next, using arithmetics as a synthetic environment, we show that +language models can separate true and false statements, and generalize +truthfulness across agents; but only if agents in the training data share a +truthful generative process that enables the creation of a truthful persona. +Overall, our findings suggest that models can exploit hierarchical structures +in the data to learn abstract concepts like truthfulness. + +
+
+
+
+
+ + ☆ MPrompt: Exploring Multi-level Prompt Tuning for Machine Reading + Comprehension EMNLP2023 + + +
+ Large language models have achieved superior performance on various +natural language tasks. One major drawback of such approaches is that they are +resource-intensive to fine-tune on new datasets. Soft-prompt tuning presents a +resource-efficient solution to fine-tune the pre-trained language models (PLMs) +while keeping their weights frozen. Existing soft prompt methods mainly focus on +designing the input-independent prompts that steer the model to fit the domain +of the new dataset. Those methods often ignore the fine-grained information +about the task and context of the text. In this paper, we propose a multi-level +prompt tuning (MPrompt) method for machine reading comprehension. It utilizes +prompts at task-specific, domain-specific, and context-specific levels to +enhance the comprehension of input semantics at different granularities. We +also propose an independence constraint to steer each domain-specific prompt to +focus on information within its domain to avoid redundancy. Moreover, we +present a prompt generator that incorporates context-related knowledge in the +prompt generation to enhance contextual relevancy. We conducted extensive +experiments on 12 benchmarks of various QA formats and achieved an average +improvement of 1.94% over the state-of-the-art methods. + +
+
+ comment: 13 pages, 5 figures, accepted by EMNLP2023-Findings +
+
+
+
+
+ + ☆ Elevating Code-mixed Text Handling through Auditory Information of Words EMNLP 2023 + + +
+ With the growing popularity of code-mixed data, there is an increasing need +for better handling of this type of data, which poses a number of challenges, +such as dealing with spelling variations, multiple languages, different +scripts, and a lack of resources. Current language models face difficulty in +effectively handling code-mixed data as they primarily focus on the semantic +representation of words and ignore the auditory phonetic features. This leads +to difficulties in handling spelling variations in code-mixed text. In this +paper, we propose an effective approach for creating language models for +handling code-mixed textual data using auditory information of words from +SOUNDEX. Our approach includes a pre-training step based on +masked-language-modelling, which includes SOUNDEX representations (SAMLM) and a +new method of providing input data to the pre-trained model. Through +experimentation on various code-mixed datasets (of different languages) for +sentiment, offensive and aggression classification tasks, we establish that our +novel language modeling approach (SAMLM) results in improved robustness towards +adversarial attacks on code-mixed classification tasks. Additionally, our SAMLM +based approach also results in better classification results over the popular +baselines for code-mixed tasks. We use the explainability technique, SHAP +(SHapley Additive exPlanations) to explain how the auditory features +incorporated through SAMLM assist the model to handle the code-mixed text +effectively and increase robustness against adversarial attacks. +Source code has been made available at +https://github.com/20118/DefenseWithPhonetics and +https://www.iitp.ac.in/~ai-nlp-ml/resources.html#Phonetics. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Disentangled Representation Learning with Large Language Models for + Text-Attributed Graphs + + +
+ Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs +such as citation networks, e-commerce networks and social networks has +attracted considerable attention in the web community. Recently, large language +models (LLMs) have demonstrated exceptional capabilities across a wide range of +tasks. However, the existing works focus on harnessing the potential of LLMs +solely relying on prompts to convey graph structure information to LLMs, thus +suffering from insufficient understanding of the complex structural +relationships within TAGs. To address this problem, in this paper we present +the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the +reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model +incorporates graph structure information through tailored disentangled graph +neural network (GNN) layers, enabling LLMs to capture the intricate +relationships hidden in text-attributed graphs from multiple structural +factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing +computational costs and allowing much more flexibility in combining with +different LLM models. Experimental evaluations demonstrate the effectiveness of +the proposed DGTL model on achieving superior or comparable performance over +state-of-the-art baselines. Additionally, we also demonstrate that our DGTL +model can offer natural language explanations for predictions, thereby +significantly enhancing model interpretability. + +
+
+
+
+
+ + ☆ DELPHI: Data for Evaluating LLMs' Performance in Handling Controversial + Issues EMNLP + + +
+ Controversy is a reflection of our zeitgeist, and an important aspect of any +discourse. The rise of large language models (LLMs) as conversational systems +has increased public reliance on these systems for answers to their various +questions. Consequently, it is crucial to systematically examine how these +models respond to questions pertaining to ongoing debates. However, few +existing datasets provide human-annotated labels reflecting the +contemporary discussions. To foster research in this area, we propose a novel +construction of a controversial questions dataset, expanding upon the publicly +released Quora Question Pairs Dataset. This dataset presents challenges +concerning knowledge recency, safety, fairness, and bias. We evaluate different +LLMs using a subset of this dataset, illuminating how they handle controversial +issues and the stances they adopt. This research ultimately contributes to our +understanding of LLMs' interaction with controversial issues, paving the way +for improvements in their comprehension and handling of complex societal +debates. + +
+
+ comment: Accepted to EMNLP Industry Track 2023 +
+
+
+
+
+ + ☆ Ask more, know better: Reinforce-Learned Prompt Questions for Decision + Making with Large Language Models + + +
+ Large language models (LLMs) demonstrate their promise in tackling +complicated practical challenges by combining action-based policies with chain +of thought (CoT) reasoning. Having high-quality prompts on hand, however, is +vital to the framework's effectiveness. Currently, these prompts are +handcrafted utilizing extensive human labor, resulting in CoT policies that +frequently fail to generalize. Human intervention is also required in order to +develop grounding functions that ensure low-level controllers appropriately +process CoT reasoning. In this paper, we take the first step towards a fully +integrated end-to-end framework for task-solving in real settings employing +complicated reasoning. To that purpose, we offer a new leader-follower bilevel +framework capable of learning to ask relevant questions (prompts) and +subsequently undertaking reasoning to guide the learning of actions to be +performed in an environment. A good prompt should make introspective revisions +based on historical findings, leading the CoT to consider the anticipated +goals. A prompt-generator policy has its own aim in our system, allowing it to +adapt to the action policy and automatically root the CoT process towards +outputs that lead to decisive, high-performing actions. Meanwhile, the action +policy is learning how to use the CoT outputs to take specific actions. Our +empirical data reveal that our system outperforms leading methods in agent +learning benchmarks such as Overcooked and FourRoom. + +
+
+
+
+
+ + ☆ OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization + + +
+ Opinion summarization sets itself apart from other types of summarization +tasks due to its distinctive focus on aspects and sentiments. Although certain +automated evaluation methods like ROUGE have gained popularity, we have found +them to be unreliable measures for assessing the quality of opinion summaries. +In this paper, we present OpinSummEval, a dataset comprising human judgments +and outputs from 14 opinion summarization models. We further explore the +correlation between 24 automatic metrics and human ratings across four +dimensions. Our findings indicate that metrics based on neural networks +generally outperform non-neural ones. However, even metrics built on powerful +backbones, such as BART and GPT-3/3.5, do not consistently correlate well +across all dimensions, highlighting the need for advancements in automated +evaluation methods for opinion summarization. The code and data are publicly +available at https://github.com/A-Chicharito-S/OpinSummEval/tree/main. + +
+
+ comment: preprint, 19 pages, 4 figures, 10 tables +
+
+
+
+
+ + ☆ Towards a Unified Conversational Recommendation System: Multi-task + Learning via Contextualized Knowledge Distillation EMNLP 2023 + + +
+ In Conversational Recommendation System (CRS), an agent is asked to recommend +a set of items to users within natural language conversations. To address the +need for both conversational capability and personalized recommendations, prior +works have utilized separate recommendation and dialogue modules. However, such +an approach inevitably results in a discrepancy between recommendation results and +generated responses. To bridge the gap, we propose a multi-task learning for a +unified CRS, where a single model jointly learns both tasks via Contextualized +Knowledge Distillation (ConKD). We introduce two versions of ConKD: hard gate +and soft gate. The former selectively gates between two task-specific teachers, +while the latter integrates knowledge from both teachers. Our gates are +computed on-the-fly in a context-specific manner, facilitating flexible +integration of relevant knowledge. Extensive experiments demonstrate that our +single model significantly improves recommendation performance while enhancing +fluency, and achieves comparable results in terms of diversity. + +
+
+ comment: EMNLP 2023 Main Conference +
+
+
+
+
+ + ☆ Mind the Gap: Automated Corpus Creation for Enthymeme Detection and + Reconstruction in Learner Arguments EMNLP 2023 + + +
+ Writing strong arguments can be challenging for learners. It requires to +select and arrange multiple argumentative discourse units (ADUs) in a logical +and coherent way as well as to decide which ADUs to leave implicit, so called +enthymemes. However, when important ADUs are missing, readers might not be able +to follow the reasoning or understand the argument's main point. This paper +introduces two new tasks for learner arguments: to identify gaps in arguments +(enthymeme detection) and to fill such gaps (enthymeme reconstruction). +Approaches to both tasks may help learners improve their argument quality. We +study how corpora for these tasks can be created automatically by deleting ADUs +from an argumentative text that are central to the argument and its quality, +while maintaining the text's naturalness. Based on the ICLEv3 corpus of +argumentative learner essays, we create 40,089 argument instances for enthymeme +detection and reconstruction. Through manual studies, we provide evidence that +the proposed corpus creation process leads to the desired quality reduction, +and results in arguments that are similarly natural to those written by +learners. Finally, first baseline approaches to enthymeme detection and +reconstruction demonstrate the corpus' usefulness. + +
+
+ comment: Accepted to Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Lost in Translation -- Multilingual Misinformation and its Evolution + + +
+ Misinformation and disinformation are growing threats in the digital age, +spreading rapidly across languages and borders. This paper investigates the +prevalence and dynamics of multilingual misinformation through an analysis of +over 250,000 unique fact-checks spanning 95 languages. First, we find that +while the majority of misinformation claims are only fact-checked once, 11.7%, +corresponding to more than 21,000 claims, are checked multiple times. Using +fact-checks as a proxy for the spread of misinformation, we find 33% of +repeated claims cross linguistic boundaries, suggesting that some +misinformation permeates language barriers. However, spreading patterns exhibit +strong homophily, with misinformation more likely to spread within the same +language. To study the evolution of claims over time and mutations across +languages, we represent fact-checks with multilingual sentence embeddings and +cluster semantically similar claims. We analyze the connected components and +shortest paths connecting different versions of a claim finding that claims +gradually drift over time and undergo greater alteration when traversing +languages. Overall, this novel investigation of multilingual misinformation +provides key insights. It quantifies redundant fact-checking efforts, +establishes that some claims diffuse across languages, measures linguistic +homophily, and models the temporal and cross-lingual evolution of claims. The +findings advocate for expanded information sharing between fact-checkers +globally while underscoring the importance of localized verification. + +
+
+
+
+
+ + ☆ Detrimental Contexts in Open-Domain Question Answering EMNLP 2023 + + +
+ For knowledge intensive NLP tasks, it has been widely accepted that accessing +more information is a contributing factor to improvements in the model's +end-to-end performance. However, counter-intuitively, too much context can have +a negative impact on the model when evaluated on common question answering (QA) +datasets. In this paper, we analyze how passages can have a detrimental effect +on retrieve-then-read architectures used in question answering. Our empirical +evidence indicates that the current read architecture does not fully leverage +the retrieved passages and significantly degrades its performance when using +the whole passages compared to utilizing subsets of them. Our findings +demonstrate that model accuracy can be improved by 10% on two popular QA +datasets by filtering out detrimental passages. Additionally, these outcomes +are attained by utilizing existing retrieval methods without further training +or data. We further highlight the challenges associated with identifying the +detrimental passages. First, even with the correct context, the model can make +an incorrect prediction, posing a challenge in determining which passages are +most influential. Second, evaluation typically considers lexical matching, +which is not robust to variations of correct answers. Despite these +limitations, our experimental results underscore the pivotal role of +identifying and removing these detrimental passages for the context-efficient +retrieve-then-read pipeline. Code and data are available at +https://github.com/xfactlab/emnlp2023-damaging-retrieval + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Knowledge Corpus Error in Question Answering EMNLP 2023 + + +
+ Recent works in open-domain question answering (QA) have explored generating +context passages from large language models (LLMs), replacing the traditional +retrieval step in the QA pipeline. However, it is not well understood why +generated passages can be more effective than retrieved ones. This study +revisits the conventional formulation of QA and introduces the concept of +knowledge corpus error. This error arises when the knowledge corpus used for +retrieval is only a subset of the entire string space, potentially excluding +more helpful passages that exist outside the corpus. LLMs may mitigate this +shortcoming by generating passages in a larger space. We come up with an +experiment of paraphrasing human-annotated gold context using LLMs to observe +knowledge corpus error empirically. Our results across three QA benchmarks +reveal an increased performance (10% - 13%) when using paraphrased passage, +indicating a signal for the existence of knowledge corpus error. Our code is +available at https://github.com/xfactlab/emnlp2023-knowledge-corpus-error + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ DUMA: a Dual-Mind Conversational Agent with Fast and Slow Thinking + + +
+ Inspired by the dual-process theory of human cognition, we introduce DUMA, a +novel conversational agent framework that embodies a dual-mind mechanism +through the utilization of two generative Large Language Models (LLMs) +dedicated to fast and slow thinking respectively. The fast thinking model +serves as the primary interface for external interactions and initial response +generation, evaluating the necessity for engaging the slow thinking model based +on the complexity of the complete response. When invoked, the slow thinking +model takes over the conversation, engaging in meticulous planning, reasoning, +and tool utilization to provide a well-analyzed response. This dual-mind +configuration allows for a seamless transition between intuitive responses and +deliberate problem-solving processes based on the situation. We have +constructed a conversational agent to handle online inquiries in the real +estate industry. The experiment proves that our method balances effectiveness +and efficiency, and has a significant improvement compared to the baseline. + +
+
+
+
+
+ + ☆ A Scalable Framework for Table of Contents Extraction from Complex ESG + Annual Reports + + +
+ Table of contents (ToC) extraction centres on structuring documents in a +hierarchical manner. In this paper, we propose a new dataset, ESGDoc, +comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to +2022. These reports pose significant challenges due to their diverse structures +and extensive length. To address these challenges, we propose a new framework +for ToC extraction, consisting of three steps: (1) Constructing an initial tree +of text blocks based on reading order and font sizes; (2) Modelling each tree +node (or text block) independently by considering its contextual information +captured in node-centric subtree; (3) Modifying the original tree by taking +appropriate action on each tree node (Keep, Delete, or Move). This +construction-modelling-modification (CMM) process offers several benefits. It +eliminates the need for pairwise modelling of section headings as in previous +approaches, making document segmentation practically feasible. By incorporating +structured information, each section heading can leverage both local and +long-distance context relevant to itself. Experimental results show that our +approach outperforms the previous state-of-the-art baseline with a fraction of +the running time. Our framework proves its scalability by effectively handling +documents of any length. + +
+
+
+
+
+ + ☆ Multi-grained Evidence Inference for Multi-choice Reading Comprehension + + +
+ Multi-choice Machine Reading Comprehension (MRC) is a major and challenging +task for machines to answer questions according to provided options. Answers in +multi-choice MRC cannot be directly extracted in the given passages, and +essentially require machines capable of reasoning from accurate extracted +evidence. However, the critical evidence may be as simple as just one word or +phrase, while it is hidden in the given redundant, noisy passage with multiple +linguistic hierarchies from phrase, fragment, sentence until the entire +passage. We thus propose a novel general-purpose model enhancement which +integrates multi-grained evidence comprehensively, named Multi-grained evidence +inferencer (Mugen), to make up for the inability. Mugen extracts three +different granularities of evidence: coarse-, middle- and fine-grained +evidence, and integrates evidence with the original passages, achieving +significant and consistent performance improvement on four multi-choice MRC +benchmarks. + +
+
+ comment: Accepted by TASLP 2023, vol. 31, pp. 3896-3907 +
+
+
+
+
+ + ☆ "Honey, Tell Me What's Wrong", Global Explanation of Textual + Discriminative Models through Cooperative Generation + + +
+ The ubiquity of complex machine learning has raised the importance of +model-agnostic explanation algorithms. These methods create artificial +instances by slightly perturbing real instances, capturing shifts in model +decisions. However, such methods rely on initial data and only provide +explanations of the decision for these. To tackle these problems, we propose +Therapy, the first global and model-agnostic explanation method adapted to text +which requires no input dataset. Therapy generates texts following the +distribution learned by a classifier through cooperative generation. Because it +does not rely on initial samples, it allows to generate explanations even when +data is absent (e.g., for confidentiality reasons). Moreover, conversely to +existing methods that combine multiple local explanations into a global one, +Therapy offers a global overview of the model behavior on the input space. Our +experiments show that although using no input data to generate samples, Therapy +provides insightful information about features used by the classifier that is +competitive with the ones from methods relying on input samples and outperforms +them when input samples are not specific to the studied model. + +
+
+ comment: 8 pages plus references and 2 pages of appendices. 7 figures and 2 + tables +
+
+
+
+
+ + ☆ ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model + for Visual Question Answering in Vietnamese + + +
+ In recent years, Visual Question Answering (VQA) has gained significant +attention for its diverse applications, including intelligent car assistance, +aiding visually impaired individuals, and document image information retrieval +using natural language queries. VQA requires effective integration of +information from questions and images to generate accurate answers. Neural +models for VQA have made remarkable progress on large-scale datasets, with a +primary focus on resource-rich languages like English. To address this, we +introduce the ViCLEVR dataset, a pioneering collection for evaluating various +visual reasoning capabilities in Vietnamese while mitigating biases. The +dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), +each question annotated to specify the type of reasoning involved. Leveraging +this dataset, we conduct a comprehensive analysis of contemporary visual +reasoning systems, offering valuable insights into their strengths and +limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion +that identifies objects in images based on questions. The architecture +effectively employs transformers to enable simultaneous reasoning over textual +and visual data, merging both modalities at an early model stage. The +experimental findings demonstrate that our proposed model achieves +state-of-the-art performance across four evaluation metrics. The accompanying +code and dataset have been made publicly accessible at +https://github.com/kvt0012/ViCLEVR. This provision seeks to stimulate +advancements within the research community, fostering the development of more +multimodal fusion algorithms, specifically tailored to address the nuances of +low-resource languages, exemplified by Vietnamese. + +
+
+ comment: A pre-print version and submitted to journal +
+
+
+
+
+ + ☆ On General Language Understanding EMNLP 2023 + + +
+ Natural Language Processing prides itself to be an empirically-minded, if not +outright empiricist field, and yet lately it seems to get itself into +essentialist debates on issues of meaning and measurement ("Do Large Language +Models Understand Language, And If So, How Much?"). This is not by accident: +Here, as everywhere, the evidence underspecifies the understanding. As a +remedy, this paper sketches the outlines of a model of understanding, which can +ground questions of the adequacy of current methods of measurement of model +quality. The paper makes three claims: A) That different language use situation +types have different characteristics, B) That language understanding is a +multifaceted phenomenon, bringing together individualistic and social +processes, and C) That the choice of Understanding Indicator marks the limits +of benchmarking, and the beginnings of considerations of the ethics of NLP use. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Large language models for aspect-based sentiment analysis + + +
+ Large language models (LLMs) offer unprecedented text completion +capabilities. As general models, they can fulfill a wide range of roles, +including those of more specialized models. We assess the performance of GPT-4 +and GPT-3.5 in zero shot, few shot and fine-tuned settings on the aspect-based +sentiment analysis (ABSA) task. Fine-tuned GPT-3.5 achieves a state-of-the-art +F1 score of 83.8 on the joint aspect term extraction and polarity +classification task of the SemEval-2014 Task 4, improving upon InstructABSA +[@scaria_instructabsa_2023] by 5.7%. However, this comes at the price of 1000 +times more model parameters and thus increased inference cost. We discuss the +cost-performance trade-offs of different models, and analyze the typical +errors that they make. Our results also indicate that detailed prompts improve +performance in zero-shot and few-shot settings but are not necessary for +fine-tuned models. This evidence is relevant for practitioners that are faced +with the choice of prompt engineering versus fine-tuning when using LLMs for +ABSA. + +
+
+
+
+
+ + ☆ SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment + Analysis + + +
+ Code-mixing is a well-studied linguistic phenomenon when two or more +languages are mixed in text or speech. Several datasets have been built with +the goal of training computational models for code-mixing. Although it is very +common to observe code-mixing with multiple languages, most datasets available +contain code-mixing between only two languages. In this paper, we introduce +SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data +between three languages: Bangla, English, and Hindi. We carry out a +comprehensive evaluation using SentMix-3L. We show that zero-shot prompting +with GPT-3.5 outperforms all transformer-based models on SentMix-3L. + +
+
+
+
+
+ + ☆ NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination + for each Benchmark EMNLP2024 + + +
+ In this position paper, we argue that the classical evaluation on Natural +Language Processing (NLP) tasks using annotated benchmarks is in trouble. The +worst kind of data contamination happens when a Large Language Model (LLM) is +trained on the test split of a benchmark, and then evaluated in the same +benchmark. The extent of the problem is unknown, as it is not straightforward +to measure. Contamination causes an overestimation of the performance of a +contaminated model in a target benchmark and associated task with respect to +their non-contaminated counterparts. The consequences can be very harmful, with +wrong scientific conclusions being published while other correct ones are +discarded. This position paper defines different levels of data contamination +and argues for a community effort, including the development of automatic and +semi-automatic measures to detect when data from a benchmark was exposed to a +model, and suggestions for flagging papers with conclusions that are +compromised by data contamination. + +
+
+ comment: Accepted at EMNLP2024-Findings +
+
+
+
+
+ + ☆ Does Role-Playing Chatbots Capture the Character Personalities? + Assessing Personality Traits for Role-Playing Chatbots + + +
+ The emergence of large-scale pretrained language models has revolutionized +the capabilities of new AI application, especially in the realm of crafting +chatbots with distinct personas. Given the "stimulus-response" nature of +chatbots, this paper unveils an innovative open-ended interview-style approach +for personality assessment on role-playing chatbots, which offers a richer +comprehension of their intrinsic personalities. We conduct personality +assessments on 32 role-playing chatbots created by the ChatHaruhi library, +across both the Big Five and MBTI dimensions, and measure their alignment with +human perception. Evaluation results underscore that modern role-playing +chatbots based on LLMs can effectively portray personality traits of +corresponding characters, with an alignment rate of 82.8% compared with +human-perceived personalities. Besides, we also suggest potential strategies +for shaping chatbots' personalities. Hence, this paper serves as a cornerstone +study for role-playing chatbots that intersects computational linguistics and +psychology. Our resources are available at +https://github.com/LC1332/Chat-Haruhi-Suzumiya + +
+
+ comment: A Personality Traits Test Over ChatHaruhi +
+
+
+
+
+ + ☆ Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General + Healthcare + + +
+ Large Language Models (LLMs) have introduced a new era of proficiency in +comprehending complex healthcare and biomedical topics. However, there is a +noticeable lack of models in languages other than English and models that can +interpret multi-modal input, which is crucial for global healthcare +accessibility. In response, this study introduces Qilin-Med-VL, the first +Chinese large vision-language model designed to integrate the analysis of +textual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer +(ViT) with a foundational LLM. It undergoes a thorough two-stage curriculum +training process that includes feature alignment and instruction tuning. This +method enhances the model's ability to generate medical captions and answer +complex medical queries. We also release ChiMed-VL, a dataset consisting of +more than 1M image-text pairs. This dataset has been carefully curated to +enable detailed and comprehensive interpretation of medical data using various +types of images. + +
+
+
+
+
+ + ☆ Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed + Languages + + +
+ Recently Whisper has approached human-level robustness and accuracy in +English automatic speech recognition (ASR), while in minor language and mixed +language speech recognition, there remains a compelling need for further +improvement. In this work, we present the impressive results of Whisper-MCE, +our finetuned Whisper model, which was trained using our self-collected +dataset, Mixed Cantonese and English audio dataset (MCE). Meanwhile, +considering word error rate (WER) poses challenges when it comes to evaluating +its effectiveness in minor language and mixed-language contexts, we present a +novel rating mechanism. By comparing our model to the baseline whisper-large-v2 +model, we demonstrate its superior ability to accurately capture the content of +the original audio, achieve higher recognition accuracy, and exhibit faster +recognition speed. Notably, our model outperforms other existing models in the +specific task of recognizing mixed language. + +
+
+
+
+
+ + ☆ Unified Segment-to-Segment Framework for Simultaneous Sequence + Generation NeurIPS 2023 + + +
+ Simultaneous sequence generation is a pivotal task for real-time scenarios, +such as streaming speech recognition, simultaneous machine translation and +simultaneous speech translation, where the target sequence is generated while +receiving the source sequence. The crux of achieving high-quality generation +with low latency lies in identifying the optimal moments for generating, +accomplished by learning a mapping between the source and target sequences. +However, existing methods often rely on task-specific heuristics for different +sequence types, limiting the model's capacity to adaptively learn the +source-target mapping and hindering the exploration of multi-task learning for +various simultaneous tasks. In this paper, we propose a unified +segment-to-segment framework (Seg2Seg) for simultaneous sequence generation, +which learns the mapping in an adaptive and unified manner. During the process +of simultaneous generation, the model alternates between waiting for a source +segment and generating a target segment, making the segment serve as the +natural bridge between the source and target. To accomplish this, Seg2Seg +introduces a latent segment as the pivot between source to target and explores +all potential source-target mappings via the proposed expectation training, +thereby learning the optimal moments for generating. Experiments on multiple +simultaneous generation tasks demonstrate that Seg2Seg achieves +state-of-the-art performance and exhibits better generality across various +tasks. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ Transformers as Graph-to-Graph Models EMNLP 2023 + + +
+ We argue that Transformers are essentially graph-to-graph models, with +sequences just being a special case. Attention weights are functionally +equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes +this ability explicit, by inputting graph edges into the attention weight +computations and predicting graph edges with attention-like functions, thereby +integrating explicit graphs into the latent graphs learned by pretrained +Transformers. Adding iterative graph refinement provides a joint embedding of +input, output, and latent graphs, allowing non-autoregressive graph prediction +to optimise the complete graph without any bespoke pipeline or decoding +strategy. Empirical results show that this architecture achieves +state-of-the-art accuracies for modelling a variety of linguistic structures, +integrating very effectively with the latent linguistic representations learned +by pretraining. + +
+
+ comment: Accepted to Big Picture workshop at EMNLP 2023 +
+
+
+
+
+ + ☆ SOUL: Towards Sentiment and Opinion Understanding of Language EMNLP 2023 + + +
+ Sentiment analysis is a well-established natural language processing task, +with sentiment polarity classification being one of its most popular and +representative tasks. However, despite the success of pre-trained language +models in this area, they often fall short of capturing the broader +complexities of sentiment analysis. To address this issue, we propose a new +task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims +to evaluate sentiment understanding through two subtasks: Review Comprehension +(RC) and Justification Generation (JG). RC seeks to validate statements that +focus on subjective information based on a review text, while JG requires +models to provide explanations for their sentiment predictions. To enable +comprehensive evaluation, we annotate a new dataset comprising 15,028 +statements from 3,638 reviews. Experimental results indicate that SOUL is a +challenging task for both small and large language models, with a performance +gap of up to 27% when compared to human performance. Furthermore, evaluations +conducted with both human experts and GPT-4 highlight the limitations of the +small language model in generating reasoning-based justifications. These +findings underscore the challenging nature of the SOUL task for existing +models, emphasizing the need for further advancements in sentiment analysis to +address its complexities. The new dataset and code are available at +https://github.com/DAMO-NLP-SG/SOUL. + +
+
+ comment: EMNLP 2023 Main Conference, Short Paper +
+
+
+
+
+ + ☆ Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection + Method + + +
+ Large Language Models (LLMs) have shown great potential in Natural Language +Processing (NLP) tasks. However, recent literature reveals that LLMs generate +nonfactual responses intermittently, which impedes the LLMs' reliability for +further utilization. In this paper, we propose a novel self-detection method to +detect which questions an LLM does not know and is thus prone to answer with +nonfactual results. Specifically, we first diversify the textual expressions +for a given question and collect the corresponding answers. Then we examine the +divergencies between the generated answers to identify the questions for which the +model may generate falsehoods. All of the above steps can be accomplished by +prompting the LLMs themselves without referring to any other external +resources. We conduct comprehensive experiments and demonstrate the +effectiveness of our method on recently released LLMs, e.g., Vicuna, ChatGPT, +and GPT-4. + +
+
+
+
+
+ + ☆ 3D-Aware Visual Question Answering about Parts, Poses and Occlusions NeurIPS2023 + + +
+ Despite rapid progress in Visual question answering (VQA), existing datasets +and models mainly focus on testing reasoning in 2D. However, it is important +that VQA models also understand the 3D structure of visual scenes, for example +to support tasks like navigation or manipulation. This includes an +understanding of the 3D object pose, their parts and occlusions. In this work, +we introduce the task of 3D-aware VQA, which focuses on challenging questions +that require a compositional reasoning over the 3D structure of visual scenes. +We address 3D-aware VQA from both the dataset and the model perspective. First, +we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains +questions about object parts, their 3D poses, and occlusions. Second, we +propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: +probabilistic neural symbolic program execution for reasoning and deep neural +networks with 3D generative representations of objects for robust visual +recognition. Our experimental results show our model PO3D-VQA outperforms +existing methods significantly, but we still observe a significant performance +gap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an +important open research area. + +
+
+ comment: Accepted by NeurIPS2023 +
+
+
+
+
+ + ☆ Natural Language Interfaces for Tabular Data Querying and Visualization: + A Survey + + +
+ The emergence of natural language processing has revolutionized the way users +interact with tabular data, enabling a shift from traditional query languages +and manual plotting to more intuitive, language-based interfaces. The rise of +large language models (LLMs) such as ChatGPT and its successors has further +advanced this field, opening new avenues for natural language processing +techniques. This survey presents a comprehensive overview of natural language +interfaces for tabular data querying and visualization, which allow users to +interact with data using natural language queries. We introduce the fundamental +concepts and techniques underlying these interfaces with a particular emphasis +on semantic parsing, the key technology facilitating the translation from +natural language to SQL queries or data visualization commands. We then delve +into the recent advancements in Text-to-SQL and Text-to-Vis problems from the +perspectives of datasets, methodologies, metrics, and system designs. This +includes a deep dive into the influence of LLMs, highlighting their strengths, +limitations, and potential for future improvements. Through this survey, we aim +to provide a roadmap for researchers and practitioners interested in developing +and applying natural language interfaces for data interaction in the era of +large language models. + +
+
+ comment: 20 pages, 4 figures, 5 tables. Submitted to IEEE TKDE +
+
+
+
+
+ + ☆ Can LLMs Keep a Secret? Testing Privacy Implications of Language Models + via Contextual Integrity Theory + + +
+ The interactive use of large language models (LLMs) in AI assistants (at +work, home, etc.) introduces a new set of inference-time privacy risks: LLMs +are fed different types of information from multiple sources in their inputs +and are expected to reason about what to share in their outputs, for what +purpose and with whom, within a given context. In this work, we draw attention +to the highly critical yet overlooked notion of contextual privacy by proposing +ConfAIde, a benchmark designed to identify critical weaknesses in the privacy +reasoning capabilities of instruction-tuned LLMs. Our experiments show that +even the most capable models such as GPT-4 and ChatGPT reveal private +information in contexts that humans would not, 39% and 57% of the time, +respectively. This leakage persists even when we employ privacy-inducing +prompts or chain-of-thought reasoning. Our work underscores the immediate need +to explore novel inference-time privacy-preserving approaches, based on +reasoning and theory of mind. + +
+
+ comment: confaide.github.io +
+
+
+
+
+ + ☆ ASPIRO: Any-shot Structured Parsing-error-Induced ReprOmpting for + Consistent Data-to-Text Generation EMNLP2023 + + +
+ We present ASPIRO, an approach for structured data verbalisation into short +template sentences in zero to few-shot settings. Unlike previous methods, our +approach prompts large language models (LLMs) to directly produce +entity-agnostic templates, rather than relying on LLMs to faithfully copy the +given example entities, or validating/crafting the templates manually. We +incorporate LLM re-prompting, triggered by algorithmic parsing checks, as well +as the PARENT metric induced consistency validation to identify and rectify +template generation problems in real-time. ASPIRO, compared to direct LLM +output, averages 66% parsing error rate reduction in generated verbalisations +of RDF triples on the DART dataset. Our best 5-shot text-davinci-003 setup, +scoring BLEU of 50.62, METEOR of 45.16, BLEURT of 0.82, NUBIA of 0.87, and +PARENT of 0.8962 on the Rel2Text dataset, competes effectively with recent +fine-tuned pre-trained language models. + +
+
+ comment: Accepted to Findings of EMNLP2023, code available at + https://github.com/vejvarm/ASPIRO +
+
+
+
+
+ + ☆ TarGEN: Targeted Data Generation with Large Language Models + + +
+ The rapid advancement of large language models (LLMs) has sparked interest in +data synthesis techniques, aiming to generate diverse and high-quality +synthetic datasets. However, these synthetic datasets often suffer from a lack +of diversity and added noise. In this paper, we present TarGEN, a multi-step +prompting strategy for generating high-quality synthetic datasets utilizing an +LLM. An advantage of TarGEN is its seedless nature; it does not require +specific task instances, broadening its applicability beyond task replication. +We augment TarGEN with a method known as self-correction, empowering LLMs to +rectify inaccurately labeled instances during dataset creation, ensuring +reliable labels. To assess our technique's effectiveness, we emulate 8 tasks +from the SuperGLUE benchmark and finetune various language models, including +encoder-only, encoder-decoder, and decoder-only models on both synthetic and +original training sets. Evaluation on the original test set reveals that models +trained on datasets generated by TarGEN perform approximately 1-2 percentage points +better than those trained on original datasets (82.84% on synthetic vs. 81.12% on +original data using Flan-T5). When incorporating instruction tuning, the performance +increases to 84.54% on synthetic data vs. 81.49% on original data by Flan-T5. A +comprehensive analysis of the synthetic dataset compared to the original +dataset reveals that the synthetic dataset demonstrates similar or higher +levels of dataset complexity and diversity. Furthermore, the synthetic dataset +displays a bias level that aligns closely with the original dataset. Finally, +when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B yields impressive +results on the OpenLLM leaderboard, surpassing the model trained on the +Self-Instruct dataset by 4.14 percentage points. We hope that TarGEN can be helpful for +quality data generation and reducing the human efforts to create complex +benchmarks. + +
+
+ comment: 10 pages, 6 tables, 5 figures, 5 pages references, 17 pages appendix +
+
+
+
+
+ + ☆ From Values to Opinions: Predicting Human Behaviors and Stances Using + Value-Injected Large Language Models EMNLP 2023 + + +
+ Being able to predict people's opinions on issues and behaviors in realistic +scenarios can be helpful in various domains, such as politics and marketing. +However, conducting large-scale surveys like the European Social Survey to +solicit people's opinions on individual issues can incur prohibitive costs. +Leveraging prior research showing influence of core human values on individual +decisions and actions, we propose to use value-injected large language models +(LLM) to predict opinions and behaviors. To this end, we present Value +Injection Method (VIM), a collection of two methods -- argument generation and +question answering -- designed to inject targeted value distributions into LLMs +via fine-tuning. We then conduct a series of experiments on four tasks to test +the effectiveness of VIM and the possibility of using value-injected LLMs to +predict opinions and behaviors of people. We find that LLMs value-injected with +variations of VIM substantially outperform the baselines. Also, the results +suggest that opinions and behaviors can be better predicted using +value-injected LLMs than the baseline approaches. + +
+
+ comment: EMNLP 2023 main paper accepted +
+
+
+
+
+ + ☆ Evaluating Cross-Domain Text-to-SQL Models and Benchmarks EMNLP 2023 + + +
+ Text-to-SQL benchmarks play a crucial role in evaluating the progress made in +the field and the ranking of different models. However, accurately matching a +model-generated SQL query to a reference SQL query in a benchmark fails for +various reasons, such as underspecified natural language queries, inherent +assumptions in both model-generated and reference queries, and the +non-deterministic nature of SQL output under certain conditions. In this paper, +we conduct an extensive study of several prominent cross-domain text-to-SQL +benchmarks and re-evaluate some of the top-performing models within these +benchmarks, by both manually evaluating the SQL queries and rewriting them in +equivalent expressions. Our evaluation reveals that attaining a perfect +performance on these benchmarks is unfeasible due to the multiple +interpretations that can be derived from the provided samples. Furthermore, we +find that the true performance of the models is underestimated and their +relative performance changes after a re-evaluation. Most notably, our +evaluation reveals a surprising discovery: a recent GPT4-based model surpasses +the gold standard reference queries in the Spider benchmark in our human +evaluation. This finding highlights the importance of interpreting benchmark +evaluations cautiously, while also acknowledging the critical role of +additional independent evaluations in driving advancements in the field. + +
+
+ comment: To appear in EMNLP 2023 +
+
+
+
+
+ + ☆ On the Automatic Generation and Simplification of Children's Stories EMNLP 2023 + + +
+ With recent advances in large language models (LLMs), the concept of +automatically generating children's educational materials has become +increasingly realistic. Working toward the goal of age-appropriate simplicity +in generated educational texts, we first examine the ability of several popular +LLMs to generate stories with properly adjusted lexical and readability levels. +We find that, in spite of the growing capabilities of LLMs, they do not yet +possess the ability to limit their vocabulary to levels appropriate for younger +age groups. As a second experiment, we explore the ability of state-of-the-art +lexical simplification models to generalize to the domain of children's stories +and, thus, create an efficient pipeline for their automatic generation. In +order to test these models, we develop a dataset of child-directed lexical +simplification instances, with examples taken from the LLM-generated stories in +our first experiment. We find that, while the strongest-performing current +lexical simplification models do not perform as well on material designed for +children due to their reliance on large language models behind the scenes, some +models that still achieve fairly strong results on general data can mimic or +even improve their performance on children-directed data with proper +fine-tuning, which we conduct using our newly created child-directed +simplification dataset. + +
+
+ comment: Accepted to EMNLP 2023 (main conference) +
+
+
+
+
+ + ☆ Publicly Detectable Watermarking for Language Models + + +
+ We construct the first provable watermarking scheme for language models with +public detectability or verifiability: we use a private key for watermarking +and a public key for watermark detection. Our protocol is the first +watermarking scheme that does not embed a statistical signal in generated text. +Rather, we directly embed a publicly-verifiable cryptographic signature using a +form of rejection sampling. We show that our construction meets strong formal +security guarantees and preserves many desirable properties found in schemes in +the private-key watermarking setting. In particular, our watermarking scheme +retains distortion-freeness and model agnosticity. We implement our scheme and +make empirical measurements over open models in the 7B parameter range. Our +experiments suggest that our watermarking scheme meets our formal claims while +preserving text quality. + +
+
+
+
+
+ + ☆ PeTailor: Improving Large Language Model by Tailored Chunk Scorer in + Biomedical Triple Extraction + + +
+ The automatic extraction of biomedical entities and their interactions from +unstructured data remains a challenging task due to the limited availability of +expert-labeled standard datasets. In this paper, we introduce PETAILOR, a +retrieval-based language framework that is augmented by a tailored chunk scorer. +Unlike previous retrieval-augmented language models (LMs) that retrieve relevant +documents by calculating the similarity between the input sentence and the +candidate document set, PETAILOR segments the sentence into chunks and +retrieves the relevant chunk from our pre-computed chunk-based relational +key-value memory. Moreover, in order to comprehend the specific requirements of +the LM, PETAILOR adapts the tailored chunk scorer to the LM. We also introduce +GM-CIHT, an expert-annotated biomedical triple extraction dataset with more +relation types. This dataset is centered on the non-drug treatment and general +biomedical domains. Additionally, we investigate the efficacy of triple +extraction models trained on general domains when applied to the biomedical +domain. Our experiments reveal that PETAILOR achieves state-of-the-art +performance on GM-CIHT. + +
+
+ comment: this is the first preprint version +
+
+
+
+
+ + ☆ Do Not Harm Protected Groups in Debiasing Language Representation Models + + +
+ Language Representation Models (LRMs) trained with real-world data may +capture and exacerbate undesired bias and cause unfair treatment of people in +various demographic groups. Several techniques have been investigated for +applying interventions to LRMs to remove bias in benchmark evaluations on, for +example, word embeddings. However, the negative side effects of debiasing +interventions are usually not revealed in the downstream tasks. We propose +xGAP-DEBIAS, a set of evaluations for assessing the fairness of debiasing. In +this work, we examine four debiasing techniques on a real-world text +classification task and show that reducing bias comes at the cost of degrading +performance for all demographic groups, including those the debiasing +techniques aim to protect. We advocate that a debiasing technique should have +good downstream performance with the constraint of ensuring no harm to the +protected group. + +
+
+
+
+
+ + ☆ T5 meets Tybalt: Author Attribution in Early Modern English Drama Using + Large Language Models + + +
+ Large language models have shown breakthrough potential in many NLP domains. +Here we consider their use for stylometry, specifically authorship +identification in Early Modern English drama. We find both promising and +concerning results; LLMs are able to accurately predict the author of +surprisingly short passages but are also prone to confidently misattribute +texts to specific authors. A fine-tuned t5-large model outperforms all tested +baselines, including logistic regression, SVM with a linear kernel, and cosine +delta, at attributing small passages. However, we see indications that the +presence of certain authors in the model's pre-training data affects predictive +results in ways that are difficult to assess. + +
+
+ comment: Published in CHR 2023 +
+
+
+
+
+ + ☆ Modeling Legal Reasoning: LM Annotation at the Edge of Human Agreement + + +
+ Generative language models (LMs) are increasingly used for document +class-prediction tasks and promise enormous improvements in cost and +efficiency. Existing research often examines simple classification tasks, but +the capability of LMs to classify on complex or specialized tasks is less well +understood. We consider a highly complex task that is challenging even for +humans: the classification of legal reasoning according to jurisprudential +philosophy. Using a novel dataset of historical United States Supreme Court +opinions annotated by a team of domain experts, we systematically test the +performance of a variety of LMs. We find that generative models perform poorly +when given instructions (i.e. prompts) equal to the instructions presented to +human annotators through our codebook. Our strongest results derive from +fine-tuning models on the annotated dataset; the best performing model is an +in-domain model, LEGAL-BERT. We apply predictions from this fine-tuned model to +study historical trends in jurisprudence, an exercise that both aligns with +prominent qualitative historical accounts and points to areas of possible +refinement in those accounts. Our findings generally sound a note of caution in +the use of generative LMs on complex tasks without fine-tuning and point to the +continued relevance of human annotation-intensive classification methods. + +
+
+
+
+
+ + ☆ Expanding the Set of Pragmatic Considerations in Conversational AI + + +
+ Despite considerable performance improvements, current conversational AI +systems often fail to meet user expectations. We discuss several pragmatic +limitations of current conversational AI systems. We illustrate pragmatic +limitations with examples that are syntactically appropriate, but have clear +pragmatic deficiencies. We label our complaints as "Turing Test Triggers" +(TTTs) as they indicate where current conversational AI systems fall short +compared to human behavior. We develop a taxonomy of pragmatic considerations +intended to identify what pragmatic competencies a conversational AI system +requires and discuss implications for the design and evaluation of +conversational AI systems. + +
+
+ comment: Pre-print version of paper that appeared at Multidisciplinary + Perspectives on COntext-aware embodied Spoken Interactions (MP-COSIN) + workshop at IEEE RO-MAN 2023 +
+
+
+
+
+ + ☆ SDOH-NLI: a Dataset for Inferring Social Determinants of Health from + Clinical Notes EMNLP 2023 + + +
+ Social and behavioral determinants of health (SDOH) play a significant role +in shaping health outcomes, and extracting these determinants from clinical +notes is a first step to help healthcare providers systematically identify +opportunities to provide appropriate care and address disparities. Progress on +using NLP methods for this task has been hindered by the lack of high-quality +publicly available labeled data, largely due to the privacy and regulatory +constraints on the use of real patients' information. This paper introduces a +new dataset, SDOH-NLI, that is based on publicly available notes and which we +release publicly. We formulate SDOH extraction as a natural language inference +(NLI) task, and provide binary textual entailment labels obtained from human +raters for a cross product of a set of social history snippets as premises and +SDOH factors as hypotheses. Our dataset differs from standard NLI benchmarks in +that our premises and hypotheses are obtained independently. We evaluate both +"off-the-shelf" entailment models as well as models fine-tuned on our data, and +highlight the ways in which our dataset appears more challenging than commonly +used NLI datasets. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Teacher Perception of Automatically Extracted Grammar Concepts for L2 + Language Learning EMNLP + + +
+ One of the challenges in language teaching is how best to organize rules +regarding syntax, semantics, or phonology in a meaningful manner. This not only +requires content creators to have pedagogical skills, but also to have a deep +understanding of that language. While comprehensive materials to develop such +curricula are available in English and some broadly spoken languages, for many +other languages, teachers need to manually create them in response to their +students' needs. This is challenging because i) it requires that such experts +be accessible and have the necessary resources, and ii) describing all the +intricacies of a language is time-consuming and prone to omission. In this +work, we aim to facilitate this process by automatically discovering and +visualizing grammar descriptions. We extract descriptions from a natural text +corpus that answer questions about morphosyntax (learning of word order, +agreement, case marking, or word formation) and semantics (learning of +vocabulary). We apply this method for teaching two Indian languages, Kannada +and Marathi, which, unlike English, do not have well-developed resources for +second language learning. To assess the perceived utility of the extracted +material, we enlist the help of language educators from schools in North +America to perform a manual evaluation, who find the materials have potential +to be used for their lesson preparation and learner evaluation. + +
+
+ comment: Accepted at EMNLP Findings 2023. arXiv admin note: substantial text + overlap with arXiv:2206.05154 +
+
+
+
+
+ + ♻ ☆ Towards Understanding Sycophancy in Language Models + + +
+ Human feedback is commonly utilized to finetune AI assistants. But human +feedback may also encourage model responses that match user beliefs over +truthful ones, a behaviour known as sycophancy. We investigate the prevalence +of sycophancy in models whose finetuning procedure made use of human feedback, +and the potential role of human preference judgments in such behavior. We first +demonstrate that five state-of-the-art AI assistants consistently exhibit +sycophancy across four varied free-form text-generation tasks. To understand if +human preferences drive this broadly observed behavior, we analyze existing +human preference data. We find that when a response matches a user's views, it +is more likely to be preferred. Moreover, both humans and preference models +(PMs) prefer convincingly-written sycophantic responses over correct ones a +non-negligible fraction of the time. Optimizing model outputs against PMs also +sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results +indicate that sycophancy is a general behavior of state-of-the-art AI +assistants, likely driven in part by human preference judgments favoring +sycophantic responses. + +
+
+ comment: 32 pages, 20 figures +
+
+
+
+
+ + ♻ ☆ Benchmarking Spatial Relationships in Text-to-Image Generation + + +
+ Spatial understanding is a fundamental aspect of computer vision and integral +for human-level reasoning about images, making it an important component for +grounded language understanding. While recent text-to-image synthesis (T2I) +models have shown unprecedented improvements in photorealism, it is unclear +whether they have reliable spatial understanding capabilities. We investigate +the ability of T2I models to generate correct spatial relationships among +objects and present VISOR, an evaluation metric that captures how accurately +the spatial relationship described in text is generated in the image. To +benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that +contains sentences describing two or more objects and the spatial relationships +between them. We construct an automated evaluation pipeline to recognize +objects and their spatial relationships, and employ it in a large-scale +evaluation of T2I models. Our experiments reveal a surprising finding that, +although state-of-the-art T2I models exhibit high image quality, they are +severely limited in their ability to generate multiple objects or the specified +spatial relations between them. Our analyses demonstrate several biases and +artifacts of T2I models such as the difficulty with generating multiple +objects, a bias towards generating the first object mentioned, spatially +inconsistent outputs for equivalent relationships, and a correlation between +object co-occurrence and spatial understanding capabilities. We conduct a human +study that shows the alignment between VISOR and human judgement about spatial +understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to +the community in support of T2I reasoning research. + +
+
+ comment: preprint; Code and Data at https://github.com/microsoft/VISOR and + https://huggingface.co/datasets/tgokhale/sr2d_visor +
+
+
+
+
+ + ♻ ☆ MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational + Transcript Cleanup EMNLP 2023 + + +
+ Current disfluency detection models focus on individual utterances each from +a single speaker. However, numerous discontinuity phenomena in spoken +conversational transcripts occur across multiple turns, hampering human +readability and the performance of downstream NLP tasks. This study addresses +these phenomena by proposing an innovative Multi-Turn Cleanup task for spoken +conversational transcripts and collecting a new dataset, MultiTurnCleanup. We +design a data labeling schema to collect the high-quality dataset and provide +extensive data analysis. Furthermore, we leverage two modeling approaches for +experimental evaluation as benchmarks for future research. + +
+
+ comment: EMNLP 2023 main conference. Dataset: + https://github.com/huashen218/MultiTurnCleanup +
+
+
+
+
+ + ♻ ☆ eP-ALM: Efficient Perceptual Augmentation of Language Models ICCV 2023 + + +
+ Large Language Models (LLMs) have so far impressed the world, with +unprecedented capabilities that emerge in models at large scales. On the vision +side, transformer models (i.e., ViT) are following the same trend, achieving +the best performance on challenging benchmarks. With the abundance of such +unimodal models, a natural question arises; do we need also to follow this +trend to tackle multimodal tasks? In this work, we propose to rather direct +effort to efficient adaptations of existing models, and propose to augment +Language Models with perception. Existing approaches for adapting pretrained +models for vision-language tasks still rely on several key components that +hinder their efficiency. In particular, they still train a large number of +parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP) +trained on huge image-text datasets, and add significant inference overhead. In +addition, most of these approaches have focused on Zero-Shot and In Context +Learning, with little to no effort on direct finetuning. We investigate the +minimal computational effort needed to adapt unimodal models for multimodal +tasks and propose a new challenging setup, alongside different approaches, that +efficiently adapts unimodal pretrained models. We show that by freezing more +than 99% of total parameters, training only one linear projection layer, and +prepending only one trainable token, our approach (dubbed eP-ALM) significantly +outperforms other baselines on VQA and Captioning across Image, Video, and +Audio modalities, following the proposed setup. The code is available here: +https://github.com/mshukor/eP-ALM. + +
+
+ comment: Accepted at ICCV 2023. Project page: + https://mshukor.github.io/eP-ALM.github.io/ +
+
+
+
+
+ + ♻ ☆ ConvXAI: Delivering Heterogeneous AI Explanations via Conversations to + Support Human-AI Scientific Writing SC + + +
+ Despite a surge of XAI methods, users still struggle to obtain +required AI explanations. Previous research suggests chatbots as dynamic +solutions, but the effective design of conversational XAI agents for practical +human needs remains under-explored. This paper focuses on Conversational XAI +for AI-assisted scientific writing tasks. Drawing from human linguistic +theories and formative studies, we identify four design rationales: +"multifaceted", "controllability", "mix-initiative", "context-aware +drill-down". We incorporate them into an interactive prototype, ConvXAI, which +facilitates heterogeneous AI explanations for scientific writing through +dialogue. In two studies with 21 users, ConvXAI outperforms a GUI-based +baseline on improving human-perceived understanding and writing improvement. +The paper further discusses the practical human usage patterns in interacting +with ConvXAI for scientific co-writing. + +
+
+ comment: CSCW 2023 Demo. ConvXAI system code: + https://github.com/huashen218/convxai.git +
+
+
+
+
+ + ♻ ☆ MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark + for Language Model Evaluation EMNLP 2023 + + +
+ Curated datasets for healthcare are often limited due to the need for human +annotations from experts. In this paper, we present MedEval, a multi-level, +multi-task, and multi-domain medical benchmark to facilitate the development of +language models for healthcare. MedEval is comprehensive and consists of data +from several healthcare systems and spans 35 human body regions from 8 +examination modalities. With 22,779 collected sentences and 21,228 reports, we +provide expert annotations at multiple levels, offering a granular potential +usage of the data and supporting a wide range of tasks. Moreover, we +systematically evaluated 10 generic and domain-specific language models under +zero-shot and finetuning settings, from domain-adapted baselines in healthcare +to general-purpose state-of-the-art large language models (e.g., ChatGPT). Our +evaluations reveal varying effectiveness of the two categories of language +models across different tasks, from which we notice the importance of +instruction tuning for few-shot usage of large language models. Our +investigation paves the way toward benchmarking language models for healthcare +and provides valuable insights into the strengths and limitations of adopting +large language models in medical domains, informing their practical +applications and future advancements. + +
+
+ comment: Accepted to EMNLP 2023. Camera-ready version: added more evaluation + results on LLMs such as GPT4, LLaMa2, and LLaMa2-chat +
+
+
+
+
+ + ♻ ☆ Machine Reading Comprehension using Case-based Reasoning + + +
+ We present an accurate and interpretable method for answer extraction in +machine reading comprehension that is reminiscent of case-based reasoning (CBR) +from classical AI. Our method (CBR-MRC) builds upon the hypothesis that +contextualized answers to similar questions share semantic similarities with +each other. Given a test question, CBR-MRC first retrieves a set of similar +cases from a non-parametric memory and then predicts an answer by selecting the +span in the test context that is most similar to the contextualized +representations of answers in the retrieved cases. The semi-parametric nature +of our approach allows it to attribute a prediction to the specific set of +evidence cases, making it a desirable choice for building reliable and +debuggable QA systems. We show that CBR-MRC provides high accuracy comparable +with large reader models and outperforms baselines by 11.5 and 8.4 EM on +NaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability +of CBR-MRC in identifying not just the correct answer tokens but also the span +with the most relevant supporting evidence. Lastly, we observe that contexts +for certain question types show higher lexical diversity than others and find +that CBR-MRC is robust to these variations while performance using +fully-parametric methods drops. + +
+
+ comment: 9 pages, 2 figures +
+
+
+
+
+ + ♻ ☆ Language models show human-like content effects on reasoning tasks + + +
+ Abstract reasoning is a key ability for an intelligent system. Large language +models (LMs) achieve above-chance performance on abstract reasoning tasks, but +exhibit many imperfections. However, human abstract reasoning is also +imperfect. For example, human reasoning is affected by our real-world knowledge +and beliefs, and shows notable "content effects"; humans reason more reliably +when the semantic content of a problem supports the correct logical inferences. +These content-entangled reasoning patterns play a central role in debates about +the fundamental nature of human intelligence. Here, we investigate whether +language models, whose prior expectations capture some aspects +of human knowledge, similarly mix content into their answers +to logical problems. We explored this question across three logical reasoning +tasks: natural language inference, judging the logical validity of syllogisms, +and the Wason selection task. We evaluate state of the art large language +models, as well as humans, and find that the language models reflect many of +the same patterns observed in humans across these tasks: like +humans, models answer more accurately when the semantic content of a task +supports the logical inferences. These parallels are reflected both in answer +patterns, and in lower-level features like the relationship between model +answer distributions and human response times. Our findings have implications +for understanding both these cognitive effects in humans, and the factors that +contribute to language model performance. + +
+
+
+
+
+ + ♻ ☆ Exploring Chain-of-Thought Style Prompting for Text-to-SQL EMNLP 2023 + + +
+ In-context learning with large language models (LLMs) has recently caught +increasing attention due to its superior few-shot performance on various tasks. +However, its performance on text-to-SQL parsing still has much room for +improvement. In this paper, we hypothesize that a crucial aspect of LLMs to +improve for text-to-SQL parsing is their multi-step reasoning ability. Thus, we +systematically study how to enhance LLMs' reasoning ability through chain of +thought (CoT) style prompting, including the original chain-of-thought +prompting (Wei et al., 2022b) and least-to-most prompting (Zhou et al., 2023). +Our experiments demonstrate that iterative prompting as in Zhou et al. (2023) +may be unnecessary for text-to-SQL parsing, and using detailed reasoning steps +tends to have more error propagation issues. Based on these findings, we +propose a new CoT-style prompting method for text-to-SQL parsing. It brings 5.2 +and 6.5 point absolute gains on the Spider development set and the Spider +Realistic set, respectively, compared to the standard prompting method without +reasoning steps; 2.4 and 1.5 point absolute gains, compared to the +least-to-most prompting method. + +
+
+ comment: EMNLP 2023 main; long paper +
+
+
+
+
+ + ♻ ☆ Minimum Bayes' Risk Decoding for System Combination of Grammatical Error + Correction Systems + + +
+ For sequence-to-sequence tasks it is challenging to combine individual system +outputs. Further, there is also often a mismatch between the decoding criterion +and the one used for assessment. Minimum Bayes' Risk (MBR) decoding can be used +to combine system outputs in a manner that encourages better alignment with the +final assessment criterion. This paper examines MBR decoding for Grammatical +Error Correction (GEC) systems, where performance is usually evaluated in terms +of edits and an associated F-score. Hence, we propose a novel MBR loss function +directly linked to this form of criterion. Furthermore, an approach to expand +the possible set of candidate sentences is described. This builds on a current +max-voting combination scheme, as well as individual edit-level selection. +Experiments on three popular GEC datasets and with state-of-the-art GEC systems +demonstrate the efficacy of the proposed MBR approach. Additionally, the paper +highlights how varying reward metrics within the MBR decoding framework can +provide control over precision, recall, and the F-score in combined GEC +systems. + +
+
+
+
+
+ + ♻ ☆ Android in the Wild: A Large-Scale Dataset for Android Device Control + + +
+ There is a growing interest in device-control systems that can interpret +human natural language instructions and execute them on a digital device by +directly controlling its user interface. We present a dataset for +device-control research, Android in the Wild (AITW), which is orders of +magnitude larger than current datasets. The dataset contains human +demonstrations of device interactions, including the screens and actions, and +corresponding natural language instructions. It consists of 715k episodes +spanning 30k unique instructions, four versions of Android (v10-13), and eight +device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It +contains multi-step tasks that require semantic understanding of language and +visual context. This dataset poses a new challenge: actions available through +the user interface must be inferred from their visual appearance. And, instead +of simple UI element-based actions, the action space consists of precise +gestures (e.g., horizontal scrolls to operate carousel widgets). We organize +our dataset to encourage robustness analysis of device-control systems, i.e., +how well a system performs in the presence of new task descriptions, new +applications, or new platform versions. We develop two agents and report +performance across the dataset. The dataset is available at +https://github.com/google-research/google-research/tree/master/android_in_the_wild. + +
+
+
+
+
+ + ♻ ☆ Unleashing the potential of prompt engineering in Large Language Models: + a comprehensive review + + +
+ This paper delves into the pivotal role of prompt engineering in unleashing +the capabilities of Large Language Models (LLMs). Prompt engineering is the +process of structuring input text for LLMs and is a technique integral to +optimizing the efficacy of LLMs. This survey elucidates foundational principles +of prompt engineering, such as role-prompting, one-shot, and few-shot +prompting, as well as more advanced methodologies such as the chain-of-thought +and tree-of-thoughts prompting. The paper sheds light on how external +assistance in the form of plugins can assist in this task, and reduce machine +hallucination by retrieving external knowledge. We subsequently delineate +prospective directions in prompt engineering research, emphasizing the need for +a deeper understanding of structures and the role of agents in Artificial +Intelligence-Generated Content (AIGC) tools. We discuss how to assess the +efficacy of prompt methods from different perspectives and using different +methods. Finally, we gather information about the application of prompt +engineering in such fields as education and programming, showing its +transformative potential. This comprehensive survey aims to serve as a friendly +guide for anyone venturing through the big world of LLMs and prompt +engineering. + +
+
+
+
+
+ + ♻ ☆ Data Augmentation for Emotion Detection in Small Imbalanced Text Data ICML + + +
+ Emotion recognition in text, the task of identifying emotions such as joy or +anger, is a challenging problem in NLP with many applications. One of the +challenges is the shortage of available datasets that have been annotated with +emotions. Certain existing datasets are small, follow different emotion +taxonomies and display imbalance in their emotion distribution. In this work, +we studied the impact of data augmentation techniques precisely when applied to +small imbalanced datasets, for which current state-of-the-art models (such as +RoBERTa) under-perform. Specifically, we utilized four data augmentation +methods (Easy Data Augmentation EDA, static and contextual Embedding-based, and +ProtAugment) on three datasets that come from different sources and vary in +size, emotion categories and distributions. Our experimental results show that +using the augmented data when training the classifier model leads to +significant improvements. Finally, we conducted two case studies: a) directly +using the popular chat-GPT API to paraphrase text using different prompts, and +b) using external data to augment the training set. Results show the promising +potential of these methods. + +
+
+ comment: To be published in the Proceedings of IEEE International Conference + on Machine Learning Applications IEEE (ICMLA 2023) +
+
+
+
+
+ + ♻ ☆ Open-ended Commonsense Reasoning with Unrestricted Answer Scope EMNLP 2023 + + +
+ Open-ended Commonsense Reasoning is defined as solving a commonsense question +without providing 1) a short list of answer candidates and 2) a pre-defined +answer scope. Conventional ways of formulating the commonsense question into a +question-answering form or utilizing external knowledge to learn +retrieval-based methods are less applicable in the open-ended setting due to an +inherent challenge. Without pre-defining an answer scope or a few candidates, +open-ended commonsense reasoning entails predicting answers by searching over +an extremely large search space. Moreover, most questions require implicit +multi-hop reasoning, which presents even more challenges to our problem. In +this work, we leverage pre-trained language models to iteratively retrieve +reasoning paths on the external knowledge base, which does not require +task-specific supervision. The reasoning paths can help to identify the most +precise answer to the commonsense question. We conduct experiments on two +commonsense benchmark datasets. Compared to other approaches, our proposed +method achieves better performance both quantitatively and qualitatively. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System EMNLP'23 + + +
+ Arabic is a complex language with many varieties and dialects spoken by over +450 million people all around the world. Due to the linguistic diversity and +variations, it is challenging to build a robust and generalized ASR system for +Arabic. In this work, we address this gap by developing and demoing a system, +dubbed VoxArabica, for dialect identification (DID) as well as automatic speech +recognition (ASR) of Arabic. We train a wide range of models such as HuBERT +(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR +tasks. Our DID models are trained to identify 17 different dialects in addition +to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data. +Additionally, for the remaining dialects in ASR, we provide the option to +choose various models such as Whisper and MMS in a zero-shot setting. We +integrate these models into a single web interface with diverse features such +as audio recording, file upload, model selection, and the option to raise flags +for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide +range of audiences concerned with Arabic research. Our system is currently +running at https://cdce-206-12-100-168.ngrok.io/. + +
+
+ comment: Accepted at ArabicNLP conference co-located with EMNLP'23. First + three authors contributed equally +
+
+
+
+
+ + ♻ ☆ Complex Query Answering on Eventuality Knowledge Graph with Implicit + Logical Constraints + + +
+ Querying knowledge graphs (KGs) using deep learning approaches can naturally +leverage the reasoning and generalization ability to learn to infer better +answers. Traditional neural complex query answering (CQA) approaches mostly +work on entity-centric KGs. However, in the real world, we also need to make +logical inferences about events, states, and activities (i.e., eventualities or +situations) to push learning systems from System I to System II, as proposed by +Yoshua Bengio. Querying logically from an EVentuality-centric KG (EVKG) can +naturally provide references to this kind of intuitive and logical inference. +Thus, in this paper, we propose a new framework to leverage neural methods to +answer complex logical queries based on an EVKG, which can satisfy not only +traditional first-order logic constraints but also implicit logical constraints +over eventualities concerning their occurrences and orders. For instance, if we +know that "Food is bad" happens before "PersonX adds soy sauce", then "PersonX +adds soy sauce" is unlikely to be the cause of "Food is bad" due to an implicit +temporal constraint. To facilitate consistent reasoning on EVKGs, we propose +Complex Eventuality Query Answering (CEQA), a more rigorous definition of CQA +that considers the implicit logical constraints governing the temporal order +and occurrence of eventualities. In this manner, we propose to leverage theorem +provers for constructing benchmark datasets to ensure the answers satisfy +implicit logical constraints. We also propose a Memory-Enhanced Query Encoding +(MEQE) approach to significantly improve the performance of state-of-the-art +neural query encoders on the CEQA task. + +
+
+
+
+
+ + ♻ ☆ Can large language models replace humans in the systematic review + process? Evaluating GPT-4's efficacy in screening and extracting data from + peer-reviewed and grey literature in multiple languages + + +
+ Systematic reviews are vital for guiding practice, research, and policy, yet +they are often slow and labour-intensive. Large language models (LLMs) could +offer a way to speed up and automate systematic reviews, but their performance +in such tasks has not been comprehensively evaluated against humans, and no +study has tested GPT-4, the biggest LLM so far. This pre-registered study +evaluates GPT-4's capability in title/abstract screening, full-text review, and +data extraction across various literature types and languages using a +'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human +performance in most tasks, results were skewed by chance agreement and dataset +imbalance. After adjusting for these, there was a moderate level of performance +for data extraction, and - barring studies that used highly reliable prompts - +screening performance levelled at none to moderate for different stages and +languages. When screening full-text literature using highly reliable prompts, +GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key +studies using highly reliable prompts improved its performance even more. Our +findings indicate that, currently, substantial caution should be used if LLMs +are being used to conduct systematic reviews, but suggest that, for certain +systematic review tasks delivered under reliable prompts, LLMs can rival human +performance. + +
+
+ comment: 9 pages, 2 figures, 1 table +
+
+
+
+
+ + ♻ ☆ Is ChatGPT Good at Search? Investigating Large Language Models as + Re-Ranking Agents EMNLP 2023 + + +
+ Large Language Models (LLMs) have demonstrated remarkable zero-shot +generalization across various language-related tasks, including search engines. +However, existing work utilizes the generative ability of LLMs for Information +Retrieval (IR) rather than direct passage ranking. The discrepancy between the +pre-training objectives of LLMs and the ranking objective poses another +challenge. In this paper, we first investigate generative LLMs such as ChatGPT +and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal +that properly instructed LLMs can deliver competitive, even superior results to +state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to +address concerns about data contamination of LLMs, we collect a new test set +called NovelEval, based on the latest knowledge and aiming to verify the +model's ability to rank unknown knowledge. Finally, to improve efficiency in +real-world applications, we delve into the potential for distilling the ranking +capabilities of ChatGPT into small specialized models using a permutation +distillation scheme. Our evaluation results show that a distilled 440M +model outperforms a 3B supervised model on the BEIR benchmark. The code to +reproduce our results is available at www.github.com/sunnweiwei/RankGPT. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Framework-Based Qualitative Analysis of Free Responses of Large Language + Models: Algorithmic Fidelity + + +
+ Today, using Large-scale generative Language Models (LLMs) it is possible to +simulate free responses to interview questions like those traditionally +analyzed using qualitative research methods. Qualitative methodology +encompasses a broad family of techniques involving manual analysis of +open-ended interviews or conversations conducted freely in natural language. +Here we consider whether artificial "silicon participants" generated by LLMs +may be productively studied using qualitative methods aiming to produce +insights that could generalize to real human populations. The key concept in +our analysis is algorithmic fidelity, a term introduced by Argyle et al. (2023) +capturing the degree to which LLM-generated outputs mirror human +sub-populations' beliefs and attitudes. By definition, high algorithmic +fidelity suggests latent beliefs elicited from LLMs may generalize to real +humans, whereas low algorithmic fidelity renders such research invalid. Here we +used an LLM to generate interviews with silicon participants matching specific +demographic characteristics one-for-one with a set of human participants. Using +framework-based qualitative analysis, we showed the key themes obtained from +both human and silicon participants were strikingly similar. However, when we +analyzed the structure and tone of the interviews we found even more striking +differences. We also found evidence of the hyper-accuracy distortion described +by Aher et al. (2023). We conclude that the LLM we tested (GPT-3.5) does not +have sufficient algorithmic fidelity to expect research on it to generalize to +human populations. However, the rapid pace of LLM research makes it plausible +this could change in the future. Thus we stress the need to establish epistemic +norms now around how to assess validity of LLM-based qualitative research, +especially concerning the need to ensure representation of heterogeneous lived +experiences. + +
+
+ comment: 46 pages, 5 tables, 5 figures +
+
+
+
+
+ + ♻ ☆ Disentangling Structure and Style: Political Bias Detection in News by + Inducing Document Hierarchy EMNLP 2023 + + +
+ We address an important gap in detecting political bias in news articles. +Previous works that perform document classification can be influenced by the +writing style of each news outlet, leading to overfitting and limited +generalizability. Our approach overcomes this limitation by considering both +the sentence-level semantics and the document-level rhetorical structure, +resulting in a more robust and style-agnostic approach to detecting political +bias in news articles. We introduce a novel multi-head hierarchical attention +model that effectively encodes the structure of long documents through a +diverse ensemble of attention heads. While journalism follows a formalized +rhetorical structure, the writing style may vary by news outlet. We demonstrate +that our method overcomes this domain dependency and outperforms previous +approaches for robustness and accuracy. Further analysis and human evaluation +demonstrate the ability of our model to capture common discourse structures in +journalism. Our code is available at: +https://github.com/xfactlab/emnlp2023-Document-Hierarchy + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Can Large Language Models Capture Dissenting Human Voices? EMNLP 2023 + + +
+ Large language models (LLMs) have shown impressive achievements in solving a +broad range of tasks. Augmented by instruction fine-tuning, LLMs have also been +shown to generalize in zero-shot settings as well. However, whether LLMs +closely align with the human disagreement distribution has not been +well-studied, especially within the scope of natural language inference (NLI). +In this paper, we evaluate the performance and alignment of LLM distribution +with humans using two different techniques to estimate the multinomial +distribution: Monte Carlo Estimation (MCE) and Log Probability Estimation +(LPE). As a result, we show LLMs exhibit limited ability in solving NLI tasks +and simultaneously fail to capture human disagreement distribution. The +inference and human alignment performances plunge even further on data samples +with high human disagreement levels, raising concerns about their natural +language understanding (NLU) ability and their representativeness to a larger +human population. The source code for the experiments is available at +https://github.com/xfactlab/emnlp2023-LLM-Disagreement + +
+
+ comment: To appear at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Semantic HELM: A Human-Readable Memory for Reinforcement Learning NeurIPS 2023 + + +
+ Reinforcement learning agents deployed in the real world often have to cope +with partially observable environments. Therefore, most agents employ memory +mechanisms to approximate the state of the environment. Recently, there have +been impressive success stories in mastering partially observable environments, +mostly in the realm of computer games like Dota 2, StarCraft II, or MineCraft. +However, existing methods lack interpretability in the sense that it is not +comprehensible for humans what the agent stores in its memory. In this regard, +we propose a novel memory mechanism that represents past events in human +language. Our method uses CLIP to associate visual inputs with language tokens. +Then we feed these tokens to a pretrained language model that serves the agent +as memory and provides it with a coherent and human-readable representation of +the past. We train our memory mechanism on a set of partially observable +environments and find that it excels on tasks that require a memory component, +while mostly attaining performance on-par with strong baselines on tasks that +do not. On a challenging continuous recognition task, where memorizing the past +is crucial, our memory mechanism converges two orders of magnitude faster than +prior methods. Since our memory mechanism is human-readable, we can peek at an +agent's memory and check whether crucial pieces of information have been +stored. This significantly enhances troubleshooting and paves the way toward +more interpretable agents. + +
+
+ comment: To appear at NeurIPS 2023, 10 pages (+ references and appendix), + Code: https://github.com/ml-jku/helm +
+
+
+
+
+ + ♻ ☆ Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as + Conversational Agents EMNLP 2023 + + +
+ Recent work has proposed a methodology for the systematic evaluation of +"Situated Language Understanding Agents" (agents that operate in rich linguistic +and non-linguistic contexts) through testing them in carefully constructed +interactive settings. Other recent work has argued that Large Language Models +(LLMs), if suitably set up, can be understood as (simulators of) such agents. A +connection suggests itself, which this paper explores: Can LLMs be evaluated +meaningfully by exposing them to constrained game-like settings that are built +to challenge specific capabilities? As a proof of concept, this paper +investigates five interaction settings, showing that current chat-optimised +LLMs are, to an extent, capable of following game-play instructions. Both this +capability and the quality of the game play, measured by how well the +objectives of the different games are met, follow the development cycle, with +newer models performing better. The metrics even for the comparatively simple +example games are far from being saturated, suggesting that the proposed +instrument will continue to have diagnostic value. Our general framework for +implementing and evaluating games with LLMs is available at +https://github.com/clp-research/clembench. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Tackling the Matrix Multiplication Micro-kernel Generation with Exo + + +
+ The optimization of the matrix multiplication (or GEMM) has been a need +during the last decades. This operation is considered the flagship of current +linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its +widespread use in a large variety of scientific applications. The GEMM is +usually implemented following the GotoBLAS philosophy, which tiles the GEMM +operands and uses a series of nested loops for performance improvement. These +approaches extract the maximum computational power of the architectures through +small pieces of hardware-oriented, high-performance code called micro-kernel. +However, this approach forces developers to generate, with a non-negligible +effort, a dedicated micro-kernel for each new hardware. + In this work, we present a step-by-step procedure for generating +micro-kernels with the Exo compiler that performs close to (or even better +than) manually developed microkernels written with intrinsic functions or +assembly language. Our solution also improves the portability of the generated +code, since a hardware target is fully specified by a concise library-based +description of its instructions. + +
+
+ comment: 11 pages, 18 figures. Presented at CGO 2024. It includes a software + artifact step-by-step execution +
+
+
+
+
+ + ♻ ☆ PLANNER: Generating Diversified Paragraph via Latent Language Diffusion + Model NeurIPS 2023 + + +
+ Autoregressive models for text sometimes generate repetitive and low-quality +output because errors accumulate during the steps of generation. This issue is +often attributed to exposure bias - the difference between how a model is +trained, and how it is used during inference. Denoising diffusion models +provide an alternative approach in which a model can revisit and revise its +output. However, they can be computationally expensive and prior efforts on +text have led to models that produce less fluent output compared to +autoregressive models, especially for longer text and paragraphs. In this +paper, we propose PLANNER, a model that combines latent semantic diffusion with +autoregressive generation, to generate fluent text while exercising global +control over paragraphs. The model achieves this by combining an autoregressive +"decoding" module with a "planning" module that uses latent diffusion to +generate semantic paragraph embeddings in a coarse-to-fine manner. The proposed +method is evaluated on various conditional generation tasks, and results on +semantic generation, text completion and summarization show its effectiveness +in generating high-quality long-form text in an efficient manner. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ 90% F1 Score in Relational Triple Extraction: Is it Real ? EMNLP 2023 + + +
+ Extracting relational triples from text is a crucial task for constructing +knowledge bases. Recent advancements in joint entity and relation extraction +models have demonstrated remarkable F1 scores ($\ge 90\%$) in accurately +extracting relational triples from free text. However, these models have been +evaluated under restrictive experimental settings and unrealistic datasets. +They overlook sentences with zero triples (zero-cardinality), thereby +simplifying the task. In this paper, we present a benchmark study of +state-of-the-art joint entity and relation extraction models under a more +realistic setting. We include sentences that lack any triples in our +experiments, providing a comprehensive evaluation. Our findings reveal a +significant decline (approximately 10-15\% in one dataset and 6-14\% in another +dataset) in the models' F1 scores within this realistic experimental setup. +Furthermore, we propose a two-step modeling approach that utilizes a simple +BERT-based classifier. This approach leads to overall performance improvement +in these models within the realistic experimental setting. + +
+
+ comment: Accepted in GenBench workshop @ EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ EMO: Earth Mover Distance Optimization for Auto-Regressive Language + Modeling + + +
+ Neural language models are probabilistic models of human text. They are +predominantly trained using maximum likelihood estimation (MLE), which is +equivalent to minimizing the forward cross-entropy between the empirical data +distribution and the model distribution. However, various degeneration +phenomena are still widely observed when decoding from the distributions +learned by such models. We establish that the forward cross-entropy is +suboptimal as a distance metric for aligning human and model distribution due +to its (1) recall-prioritization, (2) negative diversity ignorance, and (3) +train-test mismatch. In this paper, we propose Earth Mover Distance +Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on +the inherent properties of earth mover distance to address the aforementioned +challenges. Due to the high complexity of direct computation, we further +introduce a feasible upper bound for EMO to ease end-to-end training. Upon +extensive evaluation of language models trained using EMO and MLE, we find that +EMO demonstrates a consistently better language modeling performance than MLE +across domains. Moreover, EMO demonstrates noteworthy enhancements in +downstream performance with minimal fine-tuning on merely 25,000 sentences. +This highlights the tremendous potential of EMO as a lightweight calibration +method for enhancing large-scale pre-trained language models. + +
+
+ comment: Update experimental results of instruction-tuning and Github link +
+
+
+
+
+ + ♻ ☆ ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned + Samples in NLP + + +
+ Backdoor attacks have emerged as a prominent threat to natural language +processing (NLP) models, where the presence of specific triggers in the input +can lead poisoned models to misclassify these inputs to predetermined target +classes. Current detection mechanisms are limited by their inability to address +more covert backdoor strategies, such as style-based attacks. In this work, we +propose an innovative test-time poisoned sample detection framework that hinges +on the interpretability of model predictions, grounded in the semantic meaning +of inputs. We contend that triggers (e.g., infrequent words) are not supposed +to fundamentally alter the underlying semantic meanings of poisoned samples as +they want to stay stealthy. Based on this observation, we hypothesize that +while the model's predictions for paraphrased clean samples should remain +stable, predictions for poisoned samples should revert to their true labels +upon the mutations applied to triggers during the paraphrasing process. We +employ ChatGPT, a state-of-the-art large language model, as our paraphraser and +formulate the trigger-removal task as a prompt engineering problem. We adopt +fuzzing, a technique commonly used for unearthing software vulnerabilities, to +discover optimal paraphrase prompts that can effectively eliminate triggers +while concurrently maintaining input semantics. Experiments on 4 types of +backdoor attacks, including the subtle style backdoors, and 4 distinct datasets +demonstrate that our approach surpasses baseline methods, including STRIP, RAP, +and ONION, in precision and recall. + +
+
+
+
+
+ + ♻ ☆ Global Structure Knowledge-Guided Relation Extraction Method for + Visually-Rich Document EMNLP 2023 + + +
+ Visual Relation Extraction (VRE) is a powerful means of discovering +relationships between entities within visually-rich documents. Existing methods +often focus on manipulating entity features to find pairwise relations, yet +neglect the more fundamental structural information that links disparate entity +pairs together. The absence of global structure information may make the model +struggle to learn long-range relations and easily predict conflicting results. +To alleviate such limitations, we propose a GlObal Structure knowledge-guided +relation Extraction (GOSE) framework. GOSE begins by generating preliminary +relation predictions on entity pairs extracted from a scanned image of the +document. Subsequently, global structural knowledge is captured from the +preceding iterative predictions, which are then incorporated into the +representations of the entities. This "generate-capture-incorporate" cycle is +repeated multiple times, allowing entity representations and global structure +knowledge to be mutually reinforced. Extensive experiments validate that GOSE +not only outperforms existing methods in the standard fine-tuning setting but +also reveals superior cross-lingual learning capabilities; it even yields +stronger data-efficient performance in the low-resource setting. The code for +GOSE will be available at https://github.com/chenxn2020/GOSE. + +
+
+ comment: Accepted by EMNLP 2023 (Findings) +
+
+
+
+
+ + ♻ ☆ DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge + Graphs EMNLP 2023 + + +
+ Recent work within the Argument Mining community has shown the applicability +of Natural Language Processing systems for solving problems found within +competitive debate. One of the most important tasks within competitive debate +is for debaters to create high quality debate cases. We show that effective +debate cases can be constructed using constrained shortest path traversals on +Argumentative Semantic Knowledge Graphs. We study this potential in the context +of a type of American Competitive Debate, called Policy Debate, which already +has a large scale dataset targeting it called DebateSum. We significantly +improve upon DebateSum by introducing 53180 new examples, as well as further +useful metadata for every example, to the dataset. We leverage the txtai +semantic search and knowledge graph toolchain to produce and contribute 9 +semantic knowledge graphs built on this dataset. We create a unique method for +evaluating which knowledge graphs are better in the context of producing policy +debate cases. A demo which automatically generates debate cases, along with all +other code and the Knowledge Graphs, are open-sourced and made available to the +public here: https://huggingface.co/spaces/Hellisotherpeople/DebateKG + +
+
+ comment: 8 pages, Accepted to The 4th New Frontiers in Summarization Workshop + (EMNLP 2023), System Demonstration paper +
+
+
+
+
+ + ♻ ☆ How Does Beam Search improve Span-Level Confidence Estimation in + Generative Sequence Labeling? + + +
+ Sequence labeling is a core task in text understanding for IE/IR systems. +Text generation models have increasingly become the go-to solution for such +tasks (e.g., entity extraction and dialog slot filling). While most research +has focused on the labeling accuracy, a key aspect -- of vital practical +importance -- has slipped through the cracks: understanding model confidence. +More specifically, we lack a principled understanding of how to reliably gauge +the confidence of a model in its predictions for each labeled span. This paper +aims to provide some empirical insights on estimating model confidence for +generative sequence labeling. Most notably, we find that simply using the +decoder's output probabilities is not the best way of realizing +well-calibrated confidence estimates. As verified over six public datasets of +different tasks, we show that our proposed approach -- which leverages +statistics from top-$k$ predictions by a beam search -- significantly reduces +calibration errors of the predictions of a generative sequence labeling model. + +
+
+
+
+
+ + ♻ ☆ Accented Speech Recognition With Accent-specific Codebooks EMNLP 2023 + + +
+ Speech accents pose a significant challenge to state-of-the-art automatic +speech recognition (ASR) systems. Degradation in performance across +underrepresented accents is a severe deterrent to the inclusive adoption of +ASR. In this work, we propose a novel accent adaptation approach for end-to-end +ASR systems using cross-attention with a trainable set of codebooks. These +learnable codebooks capture accent-specific information and are integrated +within the ASR encoder layers. The model is trained on accented English speech, +while the test data also contained accents which were not seen during training. +On the Mozilla Common Voice multi-accented dataset, we show that our proposed +approach yields significant performance gains not only on the seen English +accents (up to $37\%$ relative improvement in word error rate) but also on the +unseen accents (up to $5\%$ relative improvement in WER). Further, we +illustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We +also compare the performance with other approaches based on accent adversarial +training. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference (Long Paper) +
+
+
+
+
+ + ♻ ☆ Vicarious Offense and Noise Audit of Offensive Speech Classifiers: + Unifying Human and Machine Disagreement on What is Offensive EMNLP 2023 + + +
+ Offensive speech detection is a key component of content moderation. However, +what is offensive can be highly subjective. This paper investigates how machine +and human moderators disagree on what is offensive when it comes to real-world +social web political discourse. We show that (1) there is extensive +disagreement among the moderators (humans and machines); and (2) human and +large-language-model classifiers are unable to predict how other human raters +will respond, based on their political leanings. For (1), we conduct a noise +audit at an unprecedented scale that combines both machine and human responses. +For (2), we introduce a first-of-its-kind dataset of vicarious offense. Our +noise audit reveals that moderation outcomes vary wildly across different +machine moderators. Our experiments with human moderators suggest that +political leanings combined with sensitive issues affect both first-person and +vicarious offense. The dataset is available through +https://github.com/Homan-Lab/voiced. + +
+
+ comment: Accepted to appear at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ The Expressive Power of Low-Rank Adaptation + + +
+ Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that +leverages low-rank adaptation of weight matrices, has emerged as a prevalent +technique for fine-tuning pre-trained models such as large language models and +diffusion models. Despite its huge success in practice, the theoretical +underpinnings of LoRA have largely remained unexplored. This paper takes the +first step to bridge this gap by theoretically analyzing the expressive power +of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any +model $f$ to accurately represent any smaller target model $\overline{f}$ if +LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of +}\overline{f}}{\text{depth of }f}$. We also quantify the approximation error +when LoRA-rank is lower than the threshold. For Transformer networks, we show +any model can be adapted to a target model of the same size with +rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. + +
+
+ comment: 40 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ FairMonitor: A Four-Stage Automatic Framework for Detecting Stereotypes + and Biases in Large Language Models + + +
+ Detecting stereotypes and biases in Large Language Models (LLMs) can enhance +fairness and reduce adverse impacts on individuals or groups when these LLMs +are applied. However, the majority of existing methods focus on measuring the +model's preference towards sentences containing biases and stereotypes within +datasets, which lacks interpretability and cannot detect implicit biases and +stereotypes in the real world. To address this gap, this paper introduces a +four-stage framework to directly evaluate stereotypes and biases in the +generated content of LLMs, including direct inquiry testing, serial or adapted +story testing, implicit association testing, and unknown situation testing. +Additionally, the paper proposes multi-dimensional evaluation metrics and +explainable zero-shot prompts for automated evaluation. Using the education +sector as a case study, we constructed the Edu-FairMonitor based on the +four-stage framework, which encompasses 12,632 open-ended questions covering +nine sensitive factors and 26 educational scenarios. Experimental results +reveal varying degrees of stereotypes and biases in five LLMs evaluated on +Edu-FairMonitor. Moreover, the results of our proposed automated evaluation +method have shown a high correlation with human annotations. + +
+
+
+
+
+ + ♻ ☆ HYTREL: Hypergraph-enhanced Tabular Data Representation Learning NeurIPS 2023 + + +
+ Language models pretrained on large collections of tabular data have +demonstrated their effectiveness in several downstream tasks. However, many of +these models do not take into account the row/column permutation invariances, +hierarchical structure, etc. that exist in tabular data. To alleviate these +limitations, we propose HYTREL, a tabular language model, that captures the +permutation invariances and three more structural properties of tabular data by +using hypergraphs - where the table cells make up the nodes and the cells +occurring jointly together in each row, column, and the entire table are used +to form three different types of hyperedges. We show that HYTREL is maximally +invariant under certain conditions for tabular data, i.e., two tables obtain +the same representations via HYTREL iff the two tables are identical up to +permutations. Our empirical results demonstrate that HYTREL consistently +outperforms other competitive baselines on four downstream tasks with minimal +pretraining, illustrating the advantages of incorporating the inductive biases +associated with tabular data into the representations. Finally, our qualitative +analyses showcase that HYTREL can assimilate the table structures to generate +robust representations for the cells, rows, columns, and the entire table. + +
+
+ comment: NeurIPS 2023 (spotlight) +
+
+
+
+
+ + ♻ ☆ Visual Programming for Text-to-Image Generation and Evaluation NeurIPS 2023 + + +
+ As large language models have demonstrated impressive performance in many +domains, recent works have adopted language models (LMs) as controllers of +visual modules for vision-and-language tasks. While existing work focuses on +equipping LMs with visual understanding, we propose two novel +interpretable/explainable visual programming frameworks for text-to-image (T2I) +generation and evaluation. First, we introduce VPGen, an interpretable +step-by-step T2I generation framework that decomposes T2I generation into three +steps: object/count generation, layout generation, and image generation. We +employ an LM to handle the first two steps (object/count generation and layout +generation), by finetuning it on text-layout pairs. Our step-by-step T2I +generation framework provides stronger spatial control than end-to-end models, +the dominant approach for this task. Furthermore, we leverage the world +knowledge of pretrained LMs, overcoming the limitation of previous +layout-guided T2I works that can only handle predefined object classes. We +demonstrate that our VPGen has improved control in counts/spatial +relations/scales of objects than state-of-the-art T2I generation models. +Second, we introduce VPEval, an interpretable and explainable evaluation +framework for T2I generation based on visual programming. Unlike previous T2I +evaluations with a single scoring model that is accurate in some skills but +unreliable in others, VPEval produces evaluation programs that invoke a set of +visual modules that are experts in different skills, and also provides +visual+textual explanations of the evaluation results. Our analysis shows that +VPEval provides a more human-correlated evaluation for skill-specific and +open-ended prompts than widely used single model-based evaluation. We hope that +our work encourages future progress on interpretable/explainable generation and +evaluation for T2I models. + +
+
+ comment: NeurIPS 2023; Project website: https://vp-t2i.github.io +
+
+
+
+
+ + ♻ ☆ TIES-Merging: Resolving Interference When Merging Models NeurIPS 2023 + + +
+ Transfer learning - i.e., further fine-tuning a pre-trained model on a +downstream task - can confer significant advantages, including improved +downstream performance, faster convergence, and better sample efficiency. These +advantages have led to a proliferation of task-specific fine-tuned models, +which typically can only perform a single task and do not benefit from one +another. Recently, model merging techniques have emerged as a solution to +combine multiple task-specific models into a single multitask model without +performing additional training. However, existing merging methods often ignore +the interference between parameters of different models, resulting in large +performance drops when merging multiple models. In this paper, we demonstrate +that prior merging techniques inadvertently lose valuable information due to +two major sources of interference: (a) interference due to redundant parameter +values and (b) disagreement on the sign of a given parameter's values across +models. To address this, we propose our method, TRIM, ELECT SIGN & MERGE +(TIES-Merging), which introduces three novel steps when merging models: (1) +resetting parameters that only changed a small amount during fine-tuning, (2) +resolving sign conflicts, and (3) merging only the parameters that are in +alignment with the final agreed-upon sign. We find that TIES-Merging +outperforms several existing methods in diverse settings covering a range of +modalities, domains, number of tasks, model sizes, architectures, and +fine-tuning settings. We further analyze the impact of different types of +interference on model parameters, and highlight the importance of resolving +sign interference. Our code is available at +https://github.com/prateeky2806/ties-merging + +
+
+ comment: Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables +
+
+
+
+
+ + ♻ ☆ Quilt-1M: One Million Image-Text Pairs for Histopathology + + +
+ Recent accelerations in multi-modal applications have been made possible with +the plethora of image and text data available online. However, the scarcity of +analogous data in the medical field, specifically in histopathology, has slowed +comparable progress. To enable similar representation learning for +histopathology, we turn to YouTube, an untapped resource of videos, offering +$1,087$ hours of valuable educational histopathology videos from expert +clinicians. From YouTube, we curate QUILT: a large-scale vision-language +dataset consisting of $802,144$ image and text pairs. QUILT was automatically +curated using a mixture of models, including large language models, handcrafted +algorithms, human knowledge databases, and automatic speech recognition. In +comparison, the most comprehensive datasets curated for histopathology amass +only around $200$K samples. We combine QUILT with datasets from other sources, +including Twitter, research papers, and the internet in general, to create an +even larger dataset: QUILT-1M, with $1$M paired image-text samples, marking it +as the largest vision-language histopathology dataset to date. We demonstrate +the value of QUILT-1M by fine-tuning a pre-trained CLIP model. Our model +outperforms state-of-the-art models on both zero-shot and linear probing tasks +for classifying new histopathology images across $13$ diverse patch-level +datasets of $8$ different sub-pathologies and cross-modal retrieval tasks. + +
+
+
+
+
+ + ♻ ☆ Keep it Neutral: Using Natural Language Inference to Improve Generation + + +
+ We explore incorporating natural language inference (NLI) into the text +generative pipeline by using a pre-trained NLI model to assess whether a +generated sentence entails, contradicts, or is neutral to the prompt and +preceding text. First, we show that the NLI task is predictive of generation +errors made by GPT-3. We use these results to develop an NLI-informed +generation procedure for GPT-J. Then, we evaluate these generations by +obtaining human annotations on error types and overall quality. We find that an +NLI strategy of maximizing entailment improves text generation when the nucleus +sampling randomness parameter value is high, while one which maximizes +contradiction is in fact productive when the parameter value is low. Overall, +though, we demonstrate that an NLI strategy of maximizing the neutral class +provides the highest quality of generated text (significantly better than the +vanilla generations), regardless of parameter value. + +
+
+
+
+
+ + ♻ ☆ MelHuBERT: A simplified HuBERT on Mel spectrograms + + +
+ Self-supervised models have had great success in learning speech +representations that can generalize to various downstream tasks. However, most +self-supervised models require a large amount of compute and multiple GPUs to +train, significantly hampering the development of self-supervised learning. In +an attempt to reduce the computation of training, we revisit the training of +HuBERT, a highly successful self-supervised model. We improve and simplify +several key components, including the loss function, input representation, and +training in multiple stages. Our model, MelHuBERT, is able to achieve favorable +performance on phone recognition, speaker identification, and automatic speech +recognition against HuBERT, while saving 31.2% of the pre-training time, or +equivalently 33.5% MACs per one second speech. The code and pre-trained models +are available in https://github.com/nervjack2/MelHuBERT. + +
+
+ comment: ASRU 2023 +
+
+
+
+
+ + ♻ ☆ SituatedGen: Incorporating Geographical and Temporal Contexts into + Generative Commonsense Reasoning NeurIPS 2023 + + +
+ Recently, commonsense reasoning in text generation has attracted much +attention. Generative commonsense reasoning is the task that requires machines, +given a group of keywords, to compose a single coherent sentence with +commonsense plausibility. While existing datasets targeting generative +commonsense reasoning focus on everyday scenarios, it is unclear how well +machines reason under specific geographical and temporal contexts. We formalize +this challenging task as SituatedGen, where machines with commonsense should +generate a pair of contrastive sentences given a group of keywords including +geographical or temporal entities. We introduce a corresponding English dataset +consisting of 8,268 contrastive sentence pairs, which are built upon several +existing commonsense reasoning benchmarks with minimal manual labor. +Experiments show that state-of-the-art generative language models struggle to +generate sentences with commonsense plausibility and still lag far behind human +performance. Our dataset is publicly available at +https://github.com/yunx-z/situated_gen. + +
+
+ comment: Accepted to NeurIPS 2023 Datasets and Benchmarks Track +
+
+
+
+
+ + ♻ ☆ Instruction Mining: When Data Mining Meets Large Language Model + Finetuning + + +
+ Large language models (LLMs) are initially pretrained for broad capabilities +and then finetuned with instruction-following datasets to improve their +performance in interacting with humans. Despite advances in finetuning, a +standardized guideline for selecting high-quality datasets to optimize this +process remains elusive. In this paper, we first propose InstructMining, an +innovative method designed for automatically selecting premium +instruction-following data for finetuning LLMs. Specifically, InstructMining +utilizes natural language indicators as a measure of data quality, applying +them to evaluate unseen datasets. During experimentation, we discover that +the double descent phenomenon exists in large language model finetuning. Based on +this observation, we further leverage BlendSearch to help find the best subset +among the entire dataset (i.e., 2,532 out of 100,000). Experimental results show +that InstructMining-7B achieves state-of-the-art performance on two of the most +popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard. + +
+
+ comment: 22 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ LANCE: Stress-testing Visual Models by Generating Language-guided + Counterfactual Images NeurIPS 2023 + + +
+ We propose an automated algorithm to stress-test a trained visual model by +generating language-guided counterfactual test images (LANCE). Our method +leverages recent progress in large language modeling and text-based image +editing to augment an IID test set with a suite of diverse, realistic, and +challenging test images without altering model weights. We benchmark the +performance of a diverse set of pre-trained models on our generated data and +observe significant and consistent performance drops. We further analyze model +sensitivity across different types of edits, and demonstrate its applicability +at surfacing previously unknown class-level model biases in ImageNet. Code is +available at https://github.com/virajprabhu/lance. + +
+
+ comment: NeurIPS 2023 camera ready. Project webpage: + https://virajprabhu.github.io/lance-web/ +
+
+
+
+
+ + ♻ ☆ WikiChat: Stopping the Hallucination of Large Language Model Chatbots by + Few-Shot Grounding on Wikipedia EMNLP 2023 + + +
+ This paper presents the first few-shot LLM-based chatbot that almost never +hallucinates and has high conversationality and low latency. WikiChat is +grounded on the English Wikipedia, the largest curated free-text corpus. + WikiChat generates a response from an LLM, retains only the grounded facts, +and combines them with additional information it retrieves from the corpus to +form factual and engaging responses. We distill WikiChat based on GPT-4 into a +7B-parameter LLaMA model with minimal loss of quality, to significantly improve +its latency, cost and privacy, and facilitate research and deployment. + Using a novel hybrid human-and-LLM evaluation methodology, we show that our +best system achieves 97.3% factual accuracy in simulated conversations. It +significantly outperforms all retrieval-based and LLM-based baselines, and by +3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. +Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is +also significantly more informative and engaging, just like an LLM. + WikiChat achieves 97.9% factual accuracy in conversations with human users +about recent topics, 55.0% better than GPT-4, while receiving significantly +higher user ratings and more favorable comments. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Improved Contextual Recognition In Automatic Speech Recognition Systems + By Semantic Lattice Rescoring + + +
+ Automatic Speech Recognition (ASR) has witnessed a profound research +interest. Recent breakthroughs have given ASR systems different prospects such +as faithfully transcribing spoken language, which is a pivotal advancement in +building conversational agents. However, there is still an imminent challenge +of accurately discerning context-dependent words and phrases. In this work, we +propose a novel approach for enhancing contextual recognition within ASR +systems via semantic lattice processing leveraging the power of deep learning +models in accurately delivering spot-on transcriptions across a wide variety of +vocabularies and speaking styles. Our solution consists of using Hidden Markov +Models and Gaussian Mixture Models (HMM-GMM) along with Deep Neural Networks +(DNN) models integrating both language and acoustic modeling for better +accuracy. We infused our network with the use of a transformer-based model to +properly rescore the word lattice achieving remarkable capabilities with a +palpable reduction in Word Error Rate (WER). We demonstrate the effectiveness +of our proposed framework on the LibriSpeech dataset with empirical analyses. + +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 80 + +
+
+
+ + ☆ Image Clustering Conditioned on Text Criteria + + +
+ Classical clustering methods do not provide users with direct control of the +clustering results, and the clustering results may not be consistent with the +relevant criterion that a user has in mind. In this work, we present a new +methodology for performing image clustering based on user-specified text +criteria by leveraging modern vision-language models and large language models. +We call our method Image Clustering Conditioned on Text Criteria (IC$|$TC), and +it represents a different paradigm of image clustering. IC$|$TC requires a +minimal and practical degree of human intervention and grants the user +significant control over the clustering results in return. Our experiments show +that IC$|$TC can effectively cluster images with various criteria, such as +human action, physical location, or the person's mood, while significantly +outperforming baselines. + +
+
+
+
+
+ + ☆ Always Clear Days: Degradation Type and Severity Aware All-In-One + Adverse Weather Removal + + +
+ All-in-one adverse weather removal is an emerging topic in image restoration, +which aims to restore multiple weather degradations in a unified model, and the +challenges are twofold. First, discovering and handling the multi-domain +property of the target distribution formed by multiple weather conditions. +Second, designing efficient and effective operations for different degradation +types. To address this problem, most prior works focus on the multiple domains +caused by weather type. Inspired by the inter- and intra-domain adaptation literature, +we observe that not only weather type but also weather severity introduces +multiple domains within each weather type domain, which is ignored by previous +methods and further limits their performance. To this end, we propose a +degradation type and severity aware model, called UtilityIR, for blind +all-in-one bad weather image restoration. To extract weather information from +a single image, we propose a novel Marginal Quality Ranking Loss (MQRL) and +utilize Contrastive Loss (CL) to guide weather severity and type extraction, +and leverage a bag of novel techniques such as Multi-Head Cross Attention +(MHCA) and Local-Global Adaptive Instance Normalization (LG-AdaIN) to +efficiently restore spatially varying weather degradation. The proposed method +significantly outperforms the SOTA methods subjectively and objectively on +different weather restoration tasks by a large margin, and uses fewer model +parameters. The proposed method can even restore images from unseen domains with +multiple combined degradations, and can modulate the restoration level. Implementation +code will be available at +https://github.com/fordevoted/UtilityIR + +
+
+ comment: 12 pages, 12 figures +
+
+
+
+
+ + ☆ Heterogeneous Federated Learning with Group-Aware Prompt Tuning + + +
+ Transformers have achieved remarkable success in various machine-learning +tasks, prompting their widespread adoption. In this paper, we explore their +application in the context of federated learning (FL), with a particular focus +on heterogeneous scenarios where individual clients possess diverse local +datasets. To meet the computational and communication demands of FL, we +leverage pre-trained Transformers and use an efficient prompt-tuning strategy. +Our strategy introduces the concept of learning both shared and group prompts, +enabling the acquisition of universal knowledge and group-specific knowledge +simultaneously. Additionally, a prompt selection module assigns personalized +group prompts to each input, aligning the global model with the data +distribution of each client. This approach allows us to train a single global +model that can automatically adapt to various local client data distributions +without requiring local fine-tuning. In this way, our proposed method +effectively bridges the gap between global and personalized local models in +Federated Learning and surpasses alternative approaches that lack the +capability to adapt to previously unseen clients. The effectiveness of our +approach is rigorously validated through extensive experimentation and ablation +studies. + +
+
+
+
+
+ + ☆ FOUND: Foot Optimization with Uncertain Normals for Surface Deformation + Using Synthetic Data + + +
+ Surface reconstruction from multi-view images is a challenging task, with +solutions often requiring a large number of sampled images with high overlap. +We seek to develop a method for few-view reconstruction, for the case of the +human foot. To solve this task, we must extract rich geometric cues from RGB +images, before carefully fusing them into a final 3D object. Our FOUND approach +tackles this, with 4 main contributions: (i) SynFoot, a synthetic dataset of +50,000 photorealistic foot images, paired with ground truth surface normals and +keypoints; (ii) an uncertainty-aware surface normal predictor trained on our +synthetic dataset; (iii) an optimization scheme for fitting a generative foot +model to a series of images; and (iv) a benchmark dataset of calibrated images +and high resolution ground truth geometry. We show that our normal predictor +outperforms all off-the-shelf equivalents significantly on real images, and our +optimization scheme outperforms state-of-the-art photogrammetry pipelines, +especially for a few-view setting. We release our synthetic dataset and +baseline 3D scans to the research community. + +
+
+ comment: 14 pages, 15 figures +
+
+
+
+
+ + ☆ LipSim: A Provably Robust Perceptual Similarity Metric + + +
+ Recent years have seen growing interest in developing and applying perceptual +similarity metrics. Research has shown the superiority of perceptual metrics +over pixel-wise metrics in aligning with human perception and serving as a +proxy for the human visual system. On the other hand, as perceptual metrics +rely on neural networks, there is a growing concern regarding their resilience, +given the established vulnerability of neural networks to adversarial attacks. +It is indeed logical to infer that perceptual metrics may inherit both the +strengths and shortcomings of neural networks. In this work, we demonstrate the +vulnerability of state-of-the-art perceptual similarity metrics based on an +ensemble of ViT-based feature extractors to adversarial attacks. We then +propose a framework to train a robust perceptual similarity metric called +LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging +1-Lipschitz neural networks as the backbone, LipSim provides guarded areas +around each data point and certificates for all perturbations within an +$\ell_2$ ball. Finally, a comprehensive set of experiments shows the +performance of LipSim in terms of natural and certified scores and on the image +retrieval application. The code is available at +https://github.com/SaraGhazanfari/LipSim. + +
+
+
+
+
+ + ☆ PlantPlotGAN: A Physics-Informed Generative Adversarial Network for + Plant Disease Prediction WACV + + +
+ Monitoring plantations is crucial for crop management and producing healthy +harvests. Unmanned Aerial Vehicles (UAVs) have been used to collect +multispectral images that aid in this monitoring. However, given the number of +hectares to be monitored and the limitations of flight, plant disease signals +become visually clear only in the later stages of plant growth and only if the +disease has spread throughout a significant portion of the plantation. This +limited amount of relevant data hampers the prediction models, as the +algorithms struggle to generalize patterns with unbalanced or unrealistic +augmented datasets effectively. To address this issue, we propose PlantPlotGAN, +a physics-informed generative model capable of creating synthetic multispectral +plot images with realistic vegetation indices. These indices served as a proxy +for disease detection and were used to evaluate if our model could help +increase the accuracy of prediction models. The results demonstrate that the +synthetic imagery generated from PlantPlotGAN outperforms state-of-the-art +methods regarding the Fr\'echet inception distance. Moreover, prediction models +achieve higher accuracy metrics when trained with synthetic and original +imagery for earlier plant disease detection compared to the training processes +based solely on real imagery. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV), 2024 +
+
+
+
+
+ + ☆ A Self-Supervised Approach to Land Cover Segmentation + + +
+ Land use/land cover change (LULC) maps are integral resources in earth +science and agricultural research. Due to the nature of such maps, the creation +of LULC maps is often constrained by the time and human resources necessary to +accurately annotate satellite imagery and remote sensing data. While computer +vision models that perform semantic segmentation to create detailed labels from +such data are not uncommon, little research has been done on self-supervised and +unsupervised approaches to labelling LULC maps without the use of ground-truth +masks. Here, we demonstrate a self-supervised method of land cover segmentation +that has no need for high-quality ground truth labels. The proposed deep +learning employs a frozen pre-trained ViT backbone transferred from DINO in a +STEGO architecture and is fine-tuned using a custom dataset consisting of very +high resolution (VHR) satellite imagery. After only 10 epochs of fine-tuning, +an accuracy of roughly 52% was observed across 5 samples, signifying the +feasibility of self-supervised models for the automated labelling of VHR LULC +maps. + +
+
+
+
+
+ + ☆ Generative AI Model for Artistic Style Transfer Using Convolutional + Neural Networks + + +
+ Artistic style transfer, a captivating application of generative artificial +intelligence, involves fusing the content of one image with the artistic style +of another to create unique visual compositions. This paper presents a +comprehensive overview of a novel technique for style transfer using +Convolutional Neural Networks (CNNs). By leveraging deep image representations +learned by CNNs, we demonstrate how to separate and manipulate image content +and style, enabling the synthesis of high-quality images that combine content +and style in a harmonious manner. We describe the methodology, including +content and style representations, loss computation, and optimization, and +showcase experimental results highlighting the effectiveness and versatility of +the approach across different styles and content. + +
+
+
+
+
+ + ☆ How Re-sampling Helps for Long-Tail Learning? NeurIPS 2023 + + +
+ Long-tail learning has received significant attention in recent years due to +the challenge it poses with extremely imbalanced datasets. In these datasets, +only a few classes (known as the head classes) have an adequate number of +training samples, while the rest of the classes (known as the tail classes) are +infrequent in the training data. Re-sampling is a classical and widely used +approach for addressing class imbalance issues. Unfortunately, recent studies +claim that re-sampling brings negligible performance improvements in modern +long-tail learning tasks. This paper aims to investigate this phenomenon +systematically. Our research shows that re-sampling can considerably improve +generalization when the training images do not contain semantically irrelevant +contexts. In other scenarios, however, it can learn unexpected spurious +correlations between irrelevant contexts and target labels. We design +experiments on two homogeneous datasets, one containing irrelevant context and +the other not, to confirm our findings. To prevent the learning of spurious +correlations, we propose a new context shift augmentation module that generates +diverse training images for the tail class by maintaining a context bank +extracted from the head-class images. Experiments demonstrate that our proposed +module can boost the generalization and outperform other approaches, including +class-balanced re-sampling, decoupled classifier re-training, and data +augmentation methods. The source code is available at +https://www.lamda.nju.edu.cn/code_CSA.ashx. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Davidsonian Scene Graph: Improving Reliability in Fine-grained + Evaluation for Text-Image Generation + + +
+ Evaluating text-to-image models is notoriously difficult. A strong recent +approach for assessing text-image faithfulness is based on QG/A (question +generation and answering), which uses pre-trained foundational models to +automatically generate a set of questions and answers from the prompt, and +output images are scored based on whether these answers extracted with a visual +question answering model are consistent with the prompt-based answers. This +kind of evaluation is naturally dependent on the quality of the underlying QG +and QA models. We identify and address several reliability challenges in +existing QG/A work: (a) QG questions should respect the prompt (avoiding +hallucinations, duplications, and omissions) and (b) VQA answers should be +consistent (not asserting that there is no motorcycle in an image while also +claiming the motorcycle is blue). We address these issues with Davidsonian +Scene Graph (DSG), an empirically grounded evaluation framework inspired by +formal semantics. DSG is an automatic, graph-based QG/A that is modularly +implemented to be adaptable to any QG/A module. DSG produces atomic and unique +questions organized in dependency graphs, which (i) ensure appropriate semantic +coverage and (ii) sidestep inconsistent answers. With extensive experimentation +and human evaluation on a range of model configurations (LLM, VQA, and T2I), we +empirically demonstrate that DSG addresses the challenges noted above. Finally, +we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 +prompts, covering a wide range of fine-grained semantic categories with a +balanced distribution. We will release the DSG-1k prompts and the corresponding +DSG questions. + +
+
+ comment: Project website: https://google.github.io/DSG +
+
+
+
+
+ + ☆ Edge AI-Based Vein Detector for Efficient Venipuncture in the + Antecubital Fossa + + +
+ Assessing the condition and visibility of veins is a crucial step before +obtaining intravenous access in the antecubital fossa, which is a common +procedure to draw blood or administer intravenous therapies (IV therapies). +Even though medical practitioners are highly skilled at intravenous +cannulation, they usually struggle to perform the procedure in patients with +low visible veins due to fluid retention, age, overweight, dark skin tone, or +diabetes. Recently, several investigations proposed combining Near Infrared +(NIR) imaging and deep learning (DL) techniques for forearm vein segmentation. +Although they have demonstrated compelling results, their use has been rather +limited owing to the portability and precision requirements to perform +venipuncture. In this paper, we aim to contribute to bridging this gap using +three strategies. First, we introduce a new NIR-based forearm vein segmentation +dataset of 2,016 labelled images collected from 1,008 subjects with low visible +veins. Second, we propose a modified U-Net architecture that locates veins +specifically in the antecubital fossa region of the examined patient. Finally, +a compressed version of the proposed architecture was deployed inside a +bespoke, portable vein finder device after testing four common embedded +microcomputers and four common quantization modalities. Experimental results +showed that the model compressed with Dynamic Range Quantization and deployed +on a Raspberry Pi 4B card produced the best execution time and precision +balance, with 5.14 FPS and 0.957 of latency and Intersection over Union (IoU), +respectively. These results show promising performance inside a +resource-restricted low-cost device. + +
+
+ comment: Accepted for publication in MICAI 2023, Part II, LNCS 14392 +
+
+
+
+
+ + ☆ TBDLNet: a network for classifying multidrug-resistant and + drug-sensitive tuberculosis + + +
+ This paper proposes applying a novel deep-learning model, TBDLNet, to +recognize CT images to classify multidrug-resistant and drug-sensitive +tuberculosis automatically. The pre-trained ResNet50 is selected to extract +features. Three randomized neural networks (RNNs) are used to alleviate the +overfitting problem. The ensemble of the three RNNs is applied to boost the +robustness via majority voting. The proposed model is evaluated by five-fold +cross-validation. Five indexes are selected in this paper, which are accuracy, +sensitivity, precision, F1-score, and specificity. The TBDLNet achieves 0.9822 +accuracy, 0.9815 specificity, 0.9823 precision, 0.9829 sensitivity, and 0.9826 +F1-score, respectively. The TBDLNet is suitable for classifying +multidrug-resistant tuberculosis and drug-sensitive tuberculosis. It can detect +multidrug-resistant pulmonary tuberculosis as early as possible, which helps to +adjust the treatment plan in time and improve the treatment effect. + +
+
+
+
+
+ + ☆ Artifact-Robust Graph-Based Learning in Digital Pathology + + +
+ Whole slide images (WSIs) are digitized images of tissues placed in glass +slides using advanced scanners. The digital processing of WSIs is challenging +as they are gigapixel images and stored in multi-resolution format. A common +challenge with WSIs is that perturbations/artifacts are inevitable while +storing the glass slides and digitizing them. These perturbations include +motion, which often arises from slide movement during placement, and changes in +hue and brightness due to variations in staining chemicals and the quality of +digitizing scanners. In this work, a novel robust learning approach to account +for these artifacts is presented. Due to the size and resolution of WSIs and to +account for neighborhood information, graph-based methods are called for. We +use a graph convolutional network (GCN) to extract features from the graph +representing the WSI. Through a denoiser and pooling layer, the effects of +perturbations in WSIs are controlled and the output is followed by a +transformer for the classification of different grades of prostate cancer. To +compare the efficacy of the proposed approach, the model without denoiser is +trained and tested with WSIs without any perturbation and then different +perturbations are introduced in WSIs and passed through the network with the +denoiser. The accuracy and kappa scores of the proposed model with prostate +cancer dataset compared with non-robust algorithms show significant improvement +in cancer diagnosis. + +
+
+
+
+
+ + ☆ Semi-Supervised Panoptic Narrative Grounding ACM MM 2023 + + +
+ Despite considerable progress, the advancement of Panoptic Narrative +Grounding (PNG) remains hindered by costly annotations. In this paper, we +introduce a novel Semi-Supervised Panoptic Narrative Grounding (SS-PNG) +learning scheme, capitalizing on a smaller set of labeled image-text pairs and +a larger set of unlabeled pairs to achieve competitive performance. Unlike +visual segmentation tasks, PNG involves one pixel belonging to multiple +open-ended nouns. As a result, existing multi-class based semi-supervised +segmentation frameworks cannot be directly applied to this task. To address +this challenge, we first develop a novel SS-PNG Network (SS-PNG-NW) tailored to +the SS-PNG setting. We thoroughly investigate strategies such as Burn-In and +data augmentation to determine the optimal generic configuration for the +SS-PNG-NW. Additionally, to tackle the issue of imbalanced pseudo-label +quality, we propose a Quality-Based Loss Adjustment (QLA) approach to adjust +the semi-supervised objective, resulting in an enhanced SS-PNG-NW+. Employing +our proposed QLA, we improve BCE Loss and Dice loss at pixel and mask levels, +respectively. We conduct extensive experiments on PNG datasets, with our +SS-PNG-NW+ demonstrating promising results comparable to fully-supervised +models across all data ratios. Remarkably, our SS-PNG-NW+ outperforms +fully-supervised models with only 30% and 50% supervision data, exceeding their +performance by 0.8% and 1.1% respectively. This highlights the effectiveness of +our proposed SS-PNG-NW+ in overcoming the challenges posed by limited +annotations and enhancing the applicability of PNG tasks. The source code is +available at https://github.com/nini0919/SSPNG. + +
+
+ comment: ACM MM 2023 +
+
+
+
+
+ + ☆ Unsupervised Representation Learning for Diverse Deformable Shape + Collections + + +
+ We introduce a novel learning-based method for encoding and manipulating 3D +surface meshes. Our method is specifically designed to create an interpretable +embedding space for deformable shape collections. Unlike previous 3D mesh +autoencoders that require meshes to be in a 1-to-1 correspondence, our approach +is trained on diverse meshes in an unsupervised manner. Central to our method +is a spectral pooling technique that establishes a universal latent space, +breaking free from traditional constraints of mesh connectivity and shape +categories. The entire process consists of two stages. In the first stage, we +employ the functional map paradigm to extract point-to-point (p2p) maps between +a collection of shapes in an unsupervised manner. These p2p maps are then +utilized to construct a common latent space, which ensures straightforward +interpretation and independence from mesh connectivity and shape category. +Through extensive experiments, we demonstrate that our method achieves +excellent reconstructions and produces more realistic and smoother +interpolations than baseline approaches. + +
+
+ comment: Accepted at International Conference on 3D Vision 2024 +
+
+
+
+
+ + ☆ End-to-end Video Gaze Estimation via Capturing Head-face-eye + Spatial-temporal Interaction Context + + +
+ In this letter, we propose a new method, Multi-Clue Gaze (MCGaze), to +facilitate video gaze estimation by capturing the spatial-temporal interaction +context among head, face, and eye in an end-to-end manner, which has not +yet been well studied. The main advantage of MCGaze is that the tasks of clue +localization of head, face, and eye can be solved jointly for gaze estimation +in a single step, with joint optimization to seek optimal performance. During +this process, spatial-temporal context is exchanged among the clues on the head, +face, and eye. Accordingly, the final gazes obtained by fusing features from +various queries are aware of global clues from heads and faces and local +clues from eyes simultaneously, which improves performance. +Meanwhile, the single-step design also ensures high running efficiency. +Experiments on the challenging Gaze360 dataset verify the superiority of our +method. The source code will be released at +https://github.com/zgchen33/MCGaze. + +
+
+ comment: 5 pages, 3 figures, 3 tables +
+
+
+
+
+ + ☆ Direct Unsupervised Denoising + + +
+ Traditional supervised denoisers are trained using pairs of noisy input and +clean target images. They learn to predict a central tendency of the posterior +distribution over possible clean images. When, e.g., trained with the popular +quadratic loss function, the network's output will correspond to the minimum +mean square error (MMSE) estimate. Unsupervised denoisers based on Variational +AutoEncoders (VAEs) have succeeded in achieving state-of-the-art results while +requiring only unpaired noisy data as training input. In contrast to the +traditional supervised approach, unsupervised denoisers do not directly produce +a single prediction, such as the MMSE estimate, but allow us to draw samples +from the posterior distribution of clean solutions corresponding to the noisy +input. To approximate the MMSE estimate during inference, unsupervised methods +have to create and draw a large number of samples - a computationally expensive +process - rendering the approach inapplicable in many situations. Here, we +present an alternative approach that trains a deterministic network alongside +the VAE to directly predict a central tendency. Our method achieves results +that surpass the results achieved by the unsupervised method at a fraction of +the computational cost. + +
+
+
+
+
+ + ☆ er.autopilot 1.0: The Full Autonomous Stack for Oval Racing at High + Speeds + + +
+ The Indy Autonomous Challenge (IAC) brought together for the first time in +history nine autonomous racing teams competing at unprecedented speed and in +head-to-head scenarios, using independently developed software on open-wheel +racecars. This paper presents the complete software architecture used by team +TII EuroRacing (TII-ER), covering all the modules needed to avoid static +obstacles, perform active overtakes and reach speeds above 75 m/s (270 km/h). +In addition to the most common modules related to perception, planning, and +control, we discuss the approaches used for vehicle dynamics modelling, +simulation, telemetry, and safety. Overall results and the performance of each +module are described, as well as the lessons learned during the first two +events of the competition on oval tracks, where the team placed second +and third, respectively. + +
+
+ comment: Preprint: Accepted to Field Robotics "Opportunities and Challenges + with Autonomous Racing" Special Issue +
+
+
+
+
+ + ☆ Classifier-head Informed Feature Masking and Prototype-based Logit + Smoothing for Out-of-Distribution Detection + + +
+ Out-of-distribution (OOD) detection is essential when deploying neural +networks in the real world. One main challenge is that neural networks often +make overconfident predictions on OOD data. In this study, we propose an +effective post-hoc OOD detection method based on a new feature masking strategy +and a novel logit smoothing strategy. Feature masking determines the important +features at the penultimate layer for each in-distribution (ID) class based on +the weights of the ID class in the classifier head and masks the remaining features. +Logit smoothing computes the cosine similarity between the feature vector of +the test sample and the prototype of the predicted ID class at the penultimate +layer and uses the similarity as an adaptive temperature factor on the logit to +alleviate the network's overconfident predictions for OOD data. With these +strategies, we can reduce feature activation of OOD data and enlarge the gap in +OOD score between ID and OOD data. Extensive experiments on multiple standard +OOD detection benchmarks demonstrate the effectiveness of our method and its +compatibility with existing methods, with new state-of-the-art performance +achieved from our method. The source code will be released publicly. + +
+
+ comment: 10 pages, 7 figures +
+
+
+
+
+ + ☆ A Chebyshev Confidence Guided Source-Free Domain Adaptation Framework + for Medical Image Segmentation + + +
+ Source-free domain adaptation (SFDA) aims to adapt models trained on a +labeled source domain to an unlabeled target domain without access to +source data. In medical imaging scenarios, the practical significance of SFDA +methods has been emphasized due to privacy concerns. Recent state-of-the-art +SFDA methods primarily rely on self-training based on pseudo-labels (PLs). +Unfortunately, PLs suffer from accuracy deterioration caused by domain shift, +and thus limit the effectiveness of the adaptation process. To address this +issue, we propose a Chebyshev confidence guided SFDA framework to accurately +assess the reliability of PLs and generate self-improving PLs for +self-training. The Chebyshev confidence is estimated by calculating a probability +lower bound of the PL confidence, given the prediction and the corresponding +uncertainty. Leveraging the Chebyshev confidence, we introduce two +confidence-guided denoising methods: direct denoising and prototypical +denoising. Additionally, we propose a novel teacher-student joint training +scheme (TJTS) that incorporates a confidence weighting module to improve PLs +iteratively. The TJTS, in collaboration with the denoising methods, effectively +prevents the propagation of noise and enhances the accuracy of PLs. Extensive +experiments in diverse domain scenarios validate the effectiveness of our +proposed framework and establish its superiority over state-of-the-art SFDA +methods. Our paper contributes to the field of SFDA by providing a novel +approach for precisely estimating the reliability of pseudo-labels and a +framework for obtaining high-quality PLs, resulting in improved adaptation +performance. + +
+
+
+
+
+ + ☆ Text Augmented Spatial-aware Zero-shot Referring Image Segmentation EMNLP2023 + + +
+ In this paper, we study a challenging task of zero-shot referring image +segmentation. This task aims to identify the instance mask that is most related +to a referring expression without training on pixel-level annotations. Previous +research takes advantage of pre-trained cross-modal models, e.g., CLIP, to +align instance-level masks with referring expressions. Yet, CLIP only considers the global-level +alignment of image-text pairs, neglecting fine-grained matching between the +referring sentence and local image regions. To address this challenge, we +introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image +segmentation framework that is training-free and robust to various visual +encoders. TAS incorporates a mask proposal network for instance-level mask +extraction, a text-augmented visual-text matching score for mining the +image-text correlation, and a spatial rectifier for mask post-processing. +Notably, the text-augmented visual-text matching score leverages a $P$-score +and an $N$-score in addition to the typical visual-text matching score. The +$P$-score is utilized to close the visual-text domain gap through a surrogate +captioning model, where the score is computed between the surrogate +model-generated texts and the referring expression. The $N$-score considers the +fine-grained alignment of region-text pairs via negative phrase mining, +encouraging the masked image to be repelled from the mined distracting phrases. +Extensive experiments are conducted on various datasets, including RefCOCO, +RefCOCO+, and RefCOCOg. The proposed method clearly outperforms +state-of-the-art zero-shot referring image segmentation methods. + +
+
+ comment: Findings of EMNLP2023 +
+
+
+
+
+ + ☆ ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model + for Visual Question Answering in Vietnamese + + +
+ In recent years, Visual Question Answering (VQA) has gained significant +attention for its diverse applications, including intelligent car assistance, +aiding visually impaired individuals, and document image information retrieval +using natural language queries. VQA requires effective integration of +information from questions and images to generate accurate answers. Neural +models for VQA have made remarkable progress on large-scale datasets, with a +primary focus on resource-rich languages like English. To address this, we +introduce the ViCLEVR dataset, a pioneering collection for evaluating various +visual reasoning capabilities in Vietnamese while mitigating biases. The +dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), +each question annotated to specify the type of reasoning involved. Leveraging +this dataset, we conduct a comprehensive analysis of contemporary visual +reasoning systems, offering valuable insights into their strengths and +limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion +that identifies objects in images based on questions. The architecture +effectively employs transformers to enable simultaneous reasoning over textual +and visual data, merging both modalities at an early model stage. The +experimental findings demonstrate that our proposed model achieves +state-of-the-art performance across four evaluation metrics. The accompanying +code and dataset have been made publicly accessible at +https://github.com/kvt0012/ViCLEVR. This provision seeks to stimulate +advancements within the research community, fostering the development of more +multimodal fusion algorithms, specifically tailored to address the nuances of +low-resource languages, exemplified by Vietnamese. + +
+
+ comment: A pre-print version and submitted to journal +
+
+
+
+
+ + ☆ ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image + + +
+ We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view +synthesis for in-the-wild scenes. While existing methods are designed for +single objects with masked backgrounds, we propose new techniques to address +challenges introduced by in-the-wild multi-object scenes with complex +backgrounds. Specifically, we train a generative prior on a mixture of data +sources that capture object-centric, indoor, and outdoor scenes. To address +issues from data mixture such as depth-scale ambiguity, we propose a novel +camera conditioning parameterization and normalization scheme. Further, we +observe that Score Distillation Sampling (SDS) tends to truncate the +distribution of complex backgrounds during distillation of 360-degree scenes, +and propose "SDS anchoring" to improve the diversity of synthesized novel +views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset +in the zero-shot setting, even outperforming methods specifically trained on +DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark +for single-image novel view synthesis, and demonstrate strong performance in +this setting. Our code and data are at http://kylesargent.github.io/zeronvs/ + +
+
+ comment: 17 pages +
+
+
+
+
+ + ☆ FaultSeg Swin-UNETR: Transformer-Based Self-Supervised Pretraining Model + for Fault Recognition + + +
+ This paper introduces an approach to enhance seismic fault recognition +through self-supervised pretraining. Seismic fault interpretation holds great +significance in the fields of geophysics and geology. However, conventional +methods for seismic fault recognition encounter various issues, including +dependence on data quality and quantity, as well as susceptibility to +interpreter subjectivity. Currently, automated fault recognition methods +proposed based on small synthetic datasets experience performance degradation +when applied to actual seismic data. To address these challenges, we have +introduced the concept of self-supervised learning, utilizing a substantial +amount of relatively easily obtainable unlabeled seismic data for pretraining. +Specifically, we have employed the Swin Transformer model as the core network +and employed the SimMIM pretraining task to capture unique features related to +discontinuities in seismic data. During the fine-tuning phase, inspired by edge +detection techniques, we have also refined the structure of the Swin-UNETR +model, enabling multiscale decoding and fusion for more effective fault +detection. Experimental results demonstrate that our proposed method attains +state-of-the-art performance on the Thebe dataset, as measured by the OIS and +ODS metrics. + +
+
+
+
+
+ + ☆ Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General + Healthcare + + +
+ Large Language Models (LLMs) have introduced a new era of proficiency in +comprehending complex healthcare and biomedical topics. However, there is a +noticeable lack of models in languages other than English and models that can +interpret multi-modal input, which is crucial for global healthcare +accessibility. In response, this study introduces Qilin-Med-VL, the first +Chinese large vision-language model designed to integrate the analysis of +textual and visual data. Qilin-Med-VL combines a pre-trained Vision Transformer +(ViT) with a foundational LLM. It undergoes a thorough two-stage curriculum +training process that includes feature alignment and instruction tuning. This +method enhances the model's ability to generate medical captions and answer +complex medical queries. We also release ChiMed-VL, a dataset consisting of +more than 1M image-text pairs. This dataset has been carefully curated to +enable detailed and comprehensive interpretation of medical data using various +types of images. + +
+
+
+
+
+ + ☆ Multivessel Coronary Artery Segmentation and Stenosis Localisation using + Ensemble Learning MICCAI2023 + + +
+ Coronary angiography analysis is a common clinical task performed by +cardiologists to diagnose coronary artery disease (CAD) through an assessment +of atherosclerotic plaque's accumulation. This study introduces an end-to-end +machine learning solution developed as part of our solution for the MICCAI 2023 +Automatic Region-based Coronary Artery Disease diagnostics using x-ray +angiography imagEs (ARCADE) challenge, which aims to benchmark solutions for +multivessel coronary artery segmentation and potential stenotic lesion +localisation from X-ray coronary angiograms. We adopted a robust baseline model +training strategy to progressively improve performance, comprising five +successive stages of binary class pretraining, multivessel segmentation, +fine-tuning using class frequency weighted dataloaders, fine-tuning using +F1-based curriculum learning strategy (F1-CLS), and finally multi-target +angiogram view classifier-based collective adaptation. Unlike many other +medical imaging procedures, this task exhibits a notable degree of +interobserver variability. Our ensemble model combines the outputs from six baseline models +using the weighted ensembling approach, which our analysis shows +doubles the predictive accuracy of the proposed solution. The final prediction +was further refined, targeting the correction of misclassified blobs. Our +solution achieved a mean F1 score of $37.69\%$ for coronary artery +segmentation, and $39.41\%$ for stenosis localisation, positioning our team in +the 5th position on both leaderboards. This work demonstrates the potential of +automated tools to aid CAD diagnosis, guide interventions, and improve the +accuracy of stent injections in clinical settings. + +
+
+ comment: Submission report for ARCADE challenge hosted at MICCAI2023 +
+
+
+
+
+ + ☆ Shape-centered Representation Learning for Visible-Infrared Person + Re-identification + + +
+ Current Visible-Infrared Person Re-Identification (VI-ReID) methods +prioritize extracting distinguishing appearance features, ignoring the natural +resistance of body shape against modality changes. Initially, we gauged the +discriminative potential of shapes by a straightforward concatenation of shape +and appearance features. However, two unresolved issues persist in the +utilization of shape features. One pertains to the dependence on auxiliary +models for shape feature extraction in the inference phase, along with the +errors in generated infrared shapes due to the intrinsic modality disparity. +The other issue involves the inadequately explored correlation between shape +and appearance features. To tackle the aforementioned challenges, we propose +the Shape-centered Representation Learning framework (ScRL), which focuses on +learning shape features and appearance features associated with shapes. +Specifically, we devise the Shape Feature Propagation (SFP), facilitating +direct extraction of shape features from original images with minimal +complexity costs during inference. To restitute inaccuracies in infrared body +shapes at the feature level, we present the Infrared Shape Restitution (ISR). +Furthermore, to acquire appearance features related to shape, we design the +Appearance Feature Enhancement (AFE), which accentuates identity-related +features while suppressing identity-unrelated features guided by shape +features. Extensive experiments are conducted to validate the effectiveness of +the proposed ScRL. Achieving remarkable results, the Rank-1 (mAP) accuracy +attains 76.1%, 71.2%, 92.4% (72.6%, 52.9%, 86.7%) on the SYSU-MM01, HITSZ-VCM, +RegDB datasets respectively, outperforming existing state-of-the-art methods. + +
+
+
+
+
+ + ☆ Understanding Parameter Saliency via Extreme Value Theory + + +
+ Deep neural networks have been increasingly deployed throughout society in +recent years. It is useful to identify which parameters trigger +misclassification in diagnosing undesirable model behaviors. The concept of +parameter saliency is proposed and used to diagnose convolutional neural +networks (CNNs) by ranking convolution filters that may have caused +misclassification on the basis of parameter saliency. It is also shown that +fine-tuning the top ranking salient filters has efficiently corrected +misidentification on ImageNet. However, there is still a knowledge gap in terms +of understanding why parameter saliency ranking can find the filters inducing +misidentification. In this work, we attempt to bridge the gap by analyzing +parameter saliency ranking from a statistical viewpoint, namely, extreme value +theory. We first show that the existing work implicitly assumes that the +gradient norm computed for each filter follows a normal distribution. Then, we +clarify the relationship between parameter saliency and the score based on the +peaks-over-threshold (POT) method, which is often used to model extreme values. +Finally, we reformulate parameter saliency in terms of the POT method, where +this reformulation is regarded as statistical anomaly detection and does not +require the implicit assumptions of the existing parameter-saliency +formulation. Our experimental results demonstrate that our reformulation can +detect malicious filters as well. Furthermore, we show that the existing +parameter saliency method exhibits a bias against the depth of layers in deep +neural networks. In particular, this bias has the potential to inhibit the +discovery of filters that cause misidentification in situations where domain +shift occurs. In contrast, parameter saliency based on POT shows less of this +bias. + +
+
+
+
+
+ + ☆ Instance Segmentation under Occlusions via Location-aware Copy-Paste + Data Augmentation + + +
+ Occlusion is a long-standing problem in computer vision, particularly in +instance segmentation. ACM MMSports 2023 DeepSportRadar has introduced a +dataset that focuses on segmenting human subjects within a basketball context +and a specialized evaluation metric for occlusion scenarios. Given the modest +size of the dataset and the highly deformable nature of the objects to be +segmented, this challenge demands the application of robust data augmentation +techniques and wisely chosen deep learning architectures. Our work (ranked 1st +in the competition) first proposes a novel data augmentation technique, capable +of generating more training samples with wider distribution. Then, we adopt a +new architecture - Hybrid Task Cascade (HTC) framework with CBNetV2 as backbone +and MaskIoU head to improve segmentation performance. Furthermore, we employ a +Stochastic Weight Averaging (SWA) training strategy to improve the model's +generalization. As a result, we achieve a remarkable occlusion score (OM) of +0.533 on the challenge dataset, securing the top-1 position on the leaderboard. +Source code is available at +https://github.com/nguyendinhson-kaist/MMSports23-Seg-AutoID. + +
+
+
+
+
+ + ☆ Diversifying Spatial-Temporal Perception for Video Domain Generalization NeurIPS 2023 + + +
+ Video domain generalization aims to learn generalizable video classification +models for unseen target domains by training in a source domain. A critical +challenge of video domain generalization is to defend against the heavy +reliance on domain-specific cues extracted from the source domain when +recognizing target videos. To this end, we propose to perceive diverse +spatial-temporal cues in videos, aiming to discover potential domain-invariant +cues in addition to domain-specific cues. We contribute a novel model named +Spatial-Temporal Diversification Network (STDN), which improves the diversity +from both space and time dimensions of video data. First, our STDN proposes to +discover various types of spatial cues within individual frames by spatial +grouping. Then, our STDN proposes to explicitly model spatial-temporal +dependencies between video contents at multiple space-time scales by +spatial-temporal relation modeling. Extensive experiments on three benchmarks +of different types demonstrate the effectiveness and versatility of our +approach. + +
+
+ comment: Accepted to NeurIPS 2023. Code is available at + https://github.com/KunyuLin/STDN/ +
+
+
+
+
+ + ☆ 3D-Aware Visual Question Answering about Parts, Poses and Occlusions NeurIPS2023 + + +
+ Despite rapid progress in Visual question answering (VQA), existing datasets +and models mainly focus on testing reasoning in 2D. However, it is important +that VQA models also understand the 3D structure of visual scenes, for example +to support tasks like navigation or manipulation. This includes an +understanding of the 3D object pose, their parts and occlusions. In this work, +we introduce the task of 3D-aware VQA, which focuses on challenging questions +that require a compositional reasoning over the 3D structure of visual scenes. +We address 3D-aware VQA from both the dataset and the model perspective. First, +we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains +questions about object parts, their 3D poses, and occlusions. Second, we +propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: +probabilistic neural symbolic program execution for reasoning and deep neural +networks with 3D generative representations of objects for robust visual +recognition. Our experimental results show our model PO3D-VQA outperforms +existing methods significantly, but we still observe a significant performance +gap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an +important open research area. + +
+
+ comment: Accepted by NeurIPS2023 +
+
+
+
+
+ + ☆ DocStormer: Revitalizing Multi-Degraded Colored Document Images to + Pristine PDF + + +
+ For capturing colored document images, e.g. posters and magazines, it is +common that multiple degradations such as shadows, wrinkles, etc., are +simultaneously introduced due to external factors. Restoring multi-degraded +colored document images is a great challenge, yet overlooked, as most existing +algorithms focus on enhancing color-ignored document images via binarization. +Thus, we propose DocStormer, a novel algorithm designed to restore +multi-degraded colored documents to their potential pristine PDF. The +contributions are: firstly, we propose a "Perceive-then-Restore" paradigm with +a reinforced transformer block, which more effectively encodes and utilizes the +distribution of degradations. Secondly, we are the first to utilize GAN and +pristine PDF magazine images to narrow the distribution gap between the +enhanced results and PDF images, in pursuit of less degradation and better +visual quality. Thirdly, we propose a non-parametric strategy, PFILI, which +enables a smaller training scale and larger testing resolutions with acceptable +detail trade-off, while saving memory and inference time. Fourthly, we are the +first to propose a novel Multi-Degraded Colored Document image Enhancing +dataset, named MD-CDE, for both training and evaluation. Experimental results +show that the DocStormer exhibits superior performance, capable of revitalizing +multi-degraded colored documents into their potential pristine digital +versions, which fills the current academic gap from the perspective of method, +data, and task. + +
+
+
+
+
+ + ☆ Impressions: Understanding Visual Semiotics and Aesthetic Impact EMNLP 2023 + + +
+ Is aesthetic impact different from beauty? Is visual salience a reflection of +its capacity for effective communication? We present Impressions, a novel +dataset through which to investigate the semiotics of images, and how specific +visual features and design choices can elicit specific emotions, thoughts and +beliefs. We posit that the impactfulness of an image extends beyond formal +definitions of aesthetics, to its success as a communicative act, where style +contributes as much to meaning formation as the subject matter. However, prior +image captioning datasets are not designed to empower state-of-the-art +architectures to model potential human impressions or interpretations of +images. To fill this gap, we design an annotation task heavily inspired by +image analysis techniques in the Visual Arts to collect 1,440 image-caption +pairs and 4,320 unique annotations exploring impact, pragmatic image +description, impressions, and aesthetic design choices. We show that existing +multimodal image captioning and conditional generation models struggle to +simulate plausible human responses to images. However, this dataset +significantly improves their ability to model impressions and aesthetic +evaluations of images through fine-tuning and few-shot adaptation. + +
+
+ comment: To be published in EMNLP 2023 +
+
+
+
+
+ + ☆ Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D + Scene Representations + + +
+ Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations, +capable of high quality novel view synthesis of complex scenes. While NeRFs +have been applied to graphics, vision, and robotics, problems with slow +rendering speed and characteristic visual artifacts prevent adoption in many +use cases. In this work, we investigate combining an autoencoder (AE) with a +NeRF, in which latent features (instead of colours) are rendered and then +convolutionally decoded. The resulting latent-space NeRF can produce novel +views with higher quality than standard colour-space NeRFs, as the AE can +correct certain visual artifacts, while rendering over three times faster. Our +work is orthogonal to other techniques for improving NeRF efficiency. Further, +we can control the tradeoff between efficiency and image quality by shrinking +the AE architecture, achieving over 13 times faster rendering with only a small +drop in performance. We hope that our approach can form the basis of an +efficient, yet high-fidelity, 3D scene representation for downstream tasks, +especially when retaining differentiability is useful, as in many robotics +scenarios requiring continual learning. + +
+
+
+
+
+ + ☆ Siamese-DETR for Generic Multi-Object Tracking + + +
+ The ability to detect and track the dynamic objects in different scenes is +fundamental to real-world applications, e.g., autonomous driving and robot +navigation. However, traditional Multi-Object Tracking (MOT) is limited to +tracking objects belonging to the pre-defined closed-set categories. Recently, +Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) have been proposed to track +objects of interest beyond pre-defined categories with the given text prompt and +template image. However, an expensive, well pre-trained (vision-)language model +and fine-grained category annotations are required to train OVMOT models. In +this paper, we focus on GMOT and propose a simple but effective method, +Siamese-DETR, for GMOT. Only the commonly used detection datasets (e.g., COCO) +are required for training. Different from existing GMOT methods, which train a +Single Object Tracking (SOT) based detector to detect objects of interest and +then apply a data association based MOT tracker to get the trajectories, we +leverage the inherent object queries in DETR variants. Specifically: 1) The +multi-scale object queries are designed based on the given template image, +which are effective for detecting different scales of objects with the same +category as the template image; 2) A dynamic matching training strategy is +introduced to train Siamese-DETR on commonly used detection datasets, which +takes full advantage of provided annotations; 3) The online tracking pipeline +is simplified through a tracking-by-query manner by incorporating the tracked +boxes in the previous frame as additional query boxes. The complex data association +is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive +experimental results show that Siamese-DETR surpasses existing MOT methods on +the GMOT-40 dataset by a large margin. + +
+
+
+
+
+ + ☆ SmooSeg: Smoothness Prior for Unsupervised Semantic Segmentation NeurIPS 2023 + + +
+ Unsupervised semantic segmentation is a challenging task that segments images +into semantic groups without manual annotation. Prior works have primarily +focused on leveraging prior knowledge of semantic consistency or priori +concepts from self-supervised learning methods, which often overlook the +coherence property of image segments. In this paper, we demonstrate that the +smoothness prior, asserting that close features in a metric space share the +same semantics, can significantly simplify segmentation by casting unsupervised +semantic segmentation as an energy minimization problem. Under this paradigm, +we propose a novel approach called SmooSeg that harnesses self-supervised +learning methods to model the closeness relationships among observations as +smoothness signals. To effectively discover coherent semantic segments, we +introduce a novel smoothness loss that promotes piecewise smoothness within +segments while preserving discontinuities across different segments. +Additionally, to further enhance segmentation quality, we design an asymmetric +teacher-student style predictor that generates smoothly updated pseudo labels, +facilitating an optimal fit between observations and labeling outputs. Thanks +to the rich supervision cues of the smoothness prior, our SmooSeg significantly +outperforms STEGO in terms of pixel accuracy on three datasets: COCOStuff +(+14.9%), Cityscapes (+13.0%), and Potsdam-3 (+5.7%). + +
+
+ comment: Accepted by NeurIPS 2023. Code available: + https://github.com/mc-lan/SmooSeg +
+
+
+
+
+ + ☆ Grid Jigsaw Representation with CLIP: A New Perspective on Image + Clustering + + +
+ Unsupervised representation learning for image clustering is essential in +computer vision. Although the advancement of visual models has improved image +clustering with efficient visual representations, challenges still remain. +Firstly, these features often lack the ability to represent the internal +structure of images, hindering the accurate clustering of visually similar +images. Secondly, the existing features tend to lack finer-grained semantic +labels, limiting the ability to capture nuanced differences and similarities +between images. + In this paper, we first introduce a jigsaw-based strategy for image +clustering called Grid Jigsaw Representation (GJR), with a systematic exposition, +from pixel to feature, of the discrepancy between humans and computers. We emphasize +that this algorithm, which mimics how humans solve jigsaw puzzles, can effectively improve +the model's ability to distinguish spatial features between different samples and +enhance its clustering ability. GJR modules are appended to a variety of deep +convolutional networks and tested with significant improvements on a wide range +of benchmark datasets including CIFAR-10, CIFAR-100/20, STL-10, ImageNet-10 and +ImageNetDog-15. + On the other hand, convergence efficiency is always an important challenge +for unsupervised image clustering. Recently, pretrained representation learning +has made great progress and released models can extract mature visual +representations. Using a pretrained model as a feature +extractor can clearly speed up the convergence of clustering; our aim is to +provide a new perspective on image clustering with reasonable resource +usage, and a new baseline. Further, we propose the pretrain-based Grid +Jigsaw Representation (pGJR), which builds on GJR. The experimental results +show its effectiveness on the clustering task with respect to the three metrics ACC, NMI and +ARI, along with very fast convergence. + +
+
+
+
+
+ + ☆ What You See Is What You Detect: Towards better Object Densification in + 3D detection + + +
+ Recent works have demonstrated the importance of object completion in 3D +Perception from Lidar signal. Several methods have been proposed in which +modules were used to densify the point clouds produced by laser scanners, +leading to better recall and more accurate results. Pursuing that direction, +we present, in this work, a counter-intuitive perspective: the widely-used +full-shape completion approach actually leads to a higher error upper bound, +especially for far away objects and small objects like pedestrians. Based on +this observation, we introduce a visible part completion method that requires +only 11.3% of the prediction points that previous methods generate. To recover +the dense representation, we propose a mesh-deformation-based method to augment +the point set associated with visible foreground objects. Considering that our +approach focuses only on the visible part of the foreground objects to achieve +accurate 3D detection, we named our method What You See Is What You Detect +(WYSIWYD). Our proposed method is thus a detector-independent model that +consists of 2 parts: an Intra-Frustum Segmentation Transformer (IFST) and a +Mesh Depth Completion Network (MDCNet) that predicts the foreground depth from +mesh deformation. This way, our model does not require the time-consuming +full-depth completion task used by most pseudo-lidar-based methods. Our +experimental evaluation shows that our approach can provide up to 12.2% +performance improvements over most of the public baseline models on the KITTI +and NuScenes datasets, bringing the state-of-the-art to a new level. The code +will be available at +https://github.com/Orbis36/WYSIWYD + +
+
+
+
+
+ + ☆ One Style is All you Need to Generate a Video + + +
+ In this paper, we propose a style-based conditional video generative model. +We introduce a novel temporal generator based on a set of learned sinusoidal +bases. Our method learns dynamic representations of various actions that are +independent of image content and can be transferred between different actors. +Beyond the significant enhancement of video quality compared to prevalent +methods, we demonstrate that the disentangled dynamic and content permit their +independent manipulation, as well as temporal GAN-inversion to retrieve and +transfer a video motion from one content or identity to another without further +preprocessing such as landmark points. + +
+
+
+
+
+ + ♻ ☆ Benchmarking Spatial Relationships in Text-to-Image Generation + + +
+ Spatial understanding is a fundamental aspect of computer vision and integral +for human-level reasoning about images, making it an important component for +grounded language understanding. While recent text-to-image synthesis (T2I) +models have shown unprecedented improvements in photorealism, it is unclear +whether they have reliable spatial understanding capabilities. We investigate +the ability of T2I models to generate correct spatial relationships among +objects and present VISOR, an evaluation metric that captures how accurately +the spatial relationship described in text is generated in the image. To +benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that +contains sentences describing two or more objects and the spatial relationships +between them. We construct an automated evaluation pipeline to recognize +objects and their spatial relationships, and employ it in a large-scale +evaluation of T2I models. Our experiments reveal a surprising finding that, +although state-of-the-art T2I models exhibit high image quality, they are +severely limited in their ability to generate multiple objects or the specified +spatial relations between them. Our analyses demonstrate several biases and +artifacts of T2I models such as the difficulty with generating multiple +objects, a bias towards generating the first object mentioned, spatially +inconsistent outputs for equivalent relationships, and a correlation between +object co-occurrence and spatial understanding capabilities. We conduct a human +study that shows the alignment between VISOR and human judgement about spatial +understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to +the community in support of T2I reasoning research. + +
+
+ comment: preprint; Code and Data at https://github.com/microsoft/VISOR and + https://huggingface.co/datasets/tgokhale/sr2d_visor +
+
+
+
+
+ + ♻ ☆ Navigating Data Heterogeneity in Federated Learning A Semi-Supervised + Approach for Object Detection NeurIPS 2023 + + +
+ Federated Learning (FL) has emerged as a potent framework for training models +across distributed data sources while maintaining data privacy. Nevertheless, +it faces challenges with limited high-quality labels and non-IID client data, +particularly in applications like autonomous driving. To address these hurdles, +we navigate the uncharted waters of Semi-Supervised Federated Object Detection +(SSFOD). We present a pioneering SSFOD framework, designed for scenarios where +labeled data reside only at the server while clients possess unlabeled data. +Notably, our method represents the inaugural implementation of SSFOD for +clients with 0% labeled non-IID data, a stark contrast to previous studies that +maintain some subset of labels at each client. We propose FedSTO, a two-stage +strategy encompassing Selective Training followed by Orthogonally enhanced +full-parameter training, to effectively address data shift (e.g. weather +conditions) between server and clients. Our contributions include selectively +refining the backbone of the detector to avert overfitting, orthogonality +regularization to boost representation divergence, and local EMA-driven pseudo +label assignment to yield high-quality pseudo labels. Extensive validation on +prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) +attests to the efficacy of our approach, demonstrating state-of-the-art +results. Remarkably, FedSTO, using just 20-30% of labels, performs nearly as +well as fully-supervised centralized training methods. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ eP-ALM: Efficient Perceptual Augmentation of Language Models ICCV 2023 + + +
+ Large Language Models (LLMs) have so far impressed the world, with +unprecedented capabilities that emerge in models at large scales. On the vision +side, transformer models (i.e., ViT) are following the same trend, achieving +the best performance on challenging benchmarks. With the abundance of such +unimodal models, a natural question arises; do we need also to follow this +trend to tackle multimodal tasks? In this work, we propose to rather direct +effort to efficient adaptations of existing models, and propose to augment +Language Models with perception. Existing approaches for adapting pretrained +models for vision-language tasks still rely on several key components that +hinder their efficiency. In particular, they still train a large number of +parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP) +trained on huge image-text datasets, and add significant inference overhead. In +addition, most of these approaches have focused on Zero-Shot and In Context +Learning, with little to no effort on direct finetuning. We investigate the +minimal computational effort needed to adapt unimodal models for multimodal +tasks and propose a new challenging setup, alongside different approaches, that +efficiently adapts unimodal pretrained models. We show that by freezing more +than 99% of total parameters, training only one linear projection layer, and +prepending only one trainable token, our approach (dubbed eP-ALM) significantly +outperforms other baselines on VQA and Captioning across Image, Video, and +Audio modalities, following the proposed setup. The code is available here: +https://github.com/mshukor/eP-ALM. + +
+
+ comment: Accepted at ICCV 2023. Project page: + https://mshukor.github.io/eP-ALM.github.io/ +
+
+
+
+
+ + ♻ ☆ Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action + Recognition ICCV-2023 + + +
+ Recent video recognition models utilize Transformer models for long-range +spatio-temporal context modeling. Video transformer designs are based on +self-attention that can model global context at a high computational cost. In +comparison, convolutional designs for videos offer an efficient alternative but +lack long-range dependency modeling. Towards achieving the best of both +designs, this work proposes Video-FocalNet, an effective and efficient +architecture for video recognition that models both local and global contexts. +Video-FocalNet is based on a spatio-temporal focal modulation architecture that +reverses the interaction and aggregation steps of self-attention for better +efficiency. Further, the aggregation step and the interaction step are both +implemented using efficient convolution and element-wise multiplication +operations that are computationally less expensive than their self-attention +counterparts on video representations. We extensively explore the design space +of focal modulation-based spatio-temporal context modeling and demonstrate our +parallel spatial and temporal encoding design to be the optimal choice. +Video-FocalNets perform favorably well against the state-of-the-art +transformer-based models for video recognition on five large-scale datasets +(Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower +computational cost. Our code/models are released at +https://github.com/TalalWasim/Video-FocalNets. + +
+
+ comment: Accepted to ICCV-2023. Camera-Ready version. Project page: + https://TalalWasim.github.io/Video-FocalNets/ +
+
+
+
+
+ + ♻ ☆ High-performance real-world optical computing trained by in situ + model-free optimization + + +
+ Optical computing systems can provide high-speed and low-energy data +processing but face deficiencies in computationally demanding training and +simulation-to-reality gap. We propose a model-free solution for lightweight in +situ optimization of optical computing systems based on the score gradient +estimation algorithm. This approach treats the system as a black box and +back-propagates loss directly to the optical weights' probabilistic +distributions, hence circumventing the need for computation-heavy and biased +system simulation. We demonstrate a superior classification accuracy on the +MNIST and FMNIST datasets through experiments on a single-layer diffractive +optical computing system. Furthermore, we show its potential for image-free and +high-speed cell analysis. The inherent simplicity of our proposed method, +combined with its low demand for computational resources, expedites the +transition of optical computing from laboratory demonstrations to real-world +applications. + +
+
+
+
+
+ + ♻ ☆ DUBLIN -- Document Understanding By Language-Image Network + + +
+ Visual document understanding is a complex task that involves analyzing both +the text and the visual elements in document images. Existing models often rely +on manual feature engineering or domain-specific pipelines, which limit their +generalization ability across different document types and languages. In this +paper, we propose DUBLIN, which is pretrained on web pages using three novel +objectives: Masked Document Text Generation Task, Bounding Box Task, and +Rendered Question Answering Task, that leverage both the spatial and semantic +information in the document images. Our model achieves competitive or +state-of-the-art results on several benchmarks, such as Web-Based Structural +Reading Comprehension, Document Visual Question Answering, Key Information +Extraction, Diagram Understanding, and Table Question Answering. In particular, +we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 +and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms +the current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and +AI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve +competitive performance on RVL-CDIP document classification. Moreover, we +create new baselines for text-based datasets by rendering them as document +images to promote research in this direction. + +
+
+
+
+
+ + ♻ ☆ Kernelized Back-Projection Networks for Blind Super Resolution + + +
+ Since non-blind Super Resolution (SR) fails to super-resolve Low-Resolution +(LR) images degraded by arbitrary degradations, SR with the degradation model +is required. However, this paper reveals that non-blind SR that is trained +simply with various blur kernels exhibits comparable performance as those with +the degradation model for blind SR. This result motivates us to revisit +high-performance non-blind SR and extend it to blind SR with blur kernels. This +paper proposes two SR networks by integrating kernel estimation and SR branches +in an iterative end-to-end manner. In the first model, which is called the +Kernel Conditioned Back-Projection Network (KCBPN), the low-dimensional kernel +representations are estimated for conditioning the SR branch. In our second +model, the Kernelized BackProjection Network (KBPN), a raw kernel is estimated +and directly employed for modeling the image degradation. The estimated kernel +is employed not only for back-propagating its residual but also for +forward-propagating the residual to iterative stages. This forward-propagation +encourages these stages to learn a variety of different features in different +stages by focusing on pixels with large residuals in each stage. Experimental +results validate the effectiveness of our proposed networks for kernel +estimation and SR. We will release the code for this work. + +
+
+ comment: The first two authors contributed equally to this work +
+
+
+
+
+ + ♻ ☆ Generalized Neural Collapse for a Large Number of Classes + + +
+ Neural collapse provides an elegant mathematical characterization of learned +last layer representations (a.k.a. features) and classifier weights in deep +classification models. Such results not only provide insights but also motivate +new techniques for improving practical deep models. However, most of the +existing empirical and theoretical studies in neural collapse focus on the case +that the number of classes is small relative to the dimension of the feature +space. This paper extends neural collapse to cases where the number of classes +is much larger than the dimension of the feature space, which broadly occur for +language models, retrieval systems, and face recognition applications. We show +that the features and classifier exhibit a generalized neural collapse +phenomenon, where the minimum one-vs-rest margin is maximized. We provide +an empirical study to verify the occurrence of generalized neural collapse in +practical deep neural networks. Moreover, we provide a theoretical study to show +that the generalized neural collapse provably occurs under the unconstrained +feature model with spherical constraint, under certain technical conditions on +feature dimension and number of classes. + +
+
+ comment: 32 pages, 12 figures +
+
+
+
+
+ + ♻ ☆ HyperFields: Towards Zero-Shot Generation of NeRFs from Text + + +
+ We introduce HyperFields, a method for generating text-conditioned Neural +Radiance Fields (NeRFs) with a single forward pass and (optionally) some +fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns +a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF +distillation training, which distills scenes encoded in individual NeRFs into +one dynamic hypernetwork. These techniques enable a single network to fit over +a hundred unique scenes. We further demonstrate that HyperFields learns a more +general map between text and NeRFs, and consequently is capable of predicting +novel in-distribution and out-of-distribution scenes -- either zero-shot or +with a few finetuning steps. Finetuning HyperFields benefits from accelerated +convergence thanks to the learned general map, and is capable of synthesizing +novel scenes 5 to 10 times faster than existing neural optimization-based +methods. Our ablation experiments show that both the dynamic architecture and +NeRF distillation are critical to the expressivity of HyperFields. + +
+
+ comment: Project page: https://threedle.github.io/hyperfields/ +
+
+
+
+
+ + ♻ ☆ Implicit Convolutional Kernels for Steerable CNNs NeurIPS 2023 + + +
+ Steerable convolutional neural networks (CNNs) provide a general framework +for building neural networks equivariant to translations and transformations of +an origin-preserving group $G$, such as reflections and rotations. They rely on +standard convolutions with $G$-steerable kernels obtained by analytically +solving the group-specific equivariance constraint imposed onto the kernel +space. As the solution is tailored to a particular group $G$, implementing a +kernel basis does not generalize to other symmetry transformations, +complicating the development of general group equivariant models. We propose +using implicit neural representation via multi-layer perceptrons (MLPs) to +parameterize $G$-steerable kernels. The resulting framework offers a simple and +flexible way to implement Steerable CNNs and generalizes to any group $G$ for +which a $G$-equivariant MLP can be built. We prove the effectiveness of our +method on multiple tasks, including N-body simulations, point cloud +classification and molecular property prediction. + +
+
+ comment: Accepted to 37th Conference on Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Analyzing the Sample Complexity of Self-Supervised Image Reconstruction + Methods + + +
+ Supervised training of deep neural networks on pairs of clean image and noisy +measurement achieves state-of-the-art performance for many image reconstruction +tasks, but such training pairs are difficult to collect. Self-supervised +methods enable training based on noisy measurements only, without clean images. +In this work, we investigate the cost of self-supervised training in terms of +sample complexity for a class of self-supervised methods that enable the +computation of unbiased estimates of gradients of the supervised loss, +including noise2noise methods. We analytically show that a model trained with +such self-supervised training is as good as the same model trained in a +supervised fashion, but self-supervised training requires more examples than +supervised training. We then study self-supervised denoising and accelerated +MRI empirically and characterize the cost of self-supervised training in terms +of the number of additional samples required, and find that the performance gap +between self-supervised and supervised training vanishes as a function of the +training examples, at a problem-dependent rate, as predicted by our theory. + +
+
+
+
+
+ + ♻ ☆ Spatially Resolved Gene Expression Prediction from H&E Histology Images + via Bi-modal Contrastive Learning + + +
+ Histology imaging is an important tool in medical diagnosis and research, +enabling the examination of tissue structure and composition at the microscopic +level. Understanding the underlying molecular mechanisms of tissue architecture +is critical in uncovering disease mechanisms and developing effective +treatments. Gene expression profiling provides insight into the molecular +processes underlying tissue architecture, but the process can be time-consuming +and expensive. We present BLEEP (Bi-modaL Embedding for Expression Prediction), +a bi-modal embedding framework capable of generating spatially resolved gene +expression profiles of whole-slide Hematoxylin and eosin (H&E) stained +histology images. BLEEP uses contrastive learning to construct a +low-dimensional joint embedding space from a reference dataset using paired +image and expression profiles at micrometer resolution. With this approach, the +gene expression of any query image patch can be imputed using the expression +profiles from the reference dataset. We demonstrate BLEEP's effectiveness in +gene expression prediction by benchmarking its performance on a human liver +tissue dataset captured using the 10x Visium platform, where it achieves +significant improvements over existing methods. Our results demonstrate the +potential of BLEEP to provide insights into the molecular mechanisms underlying +tissue architecture, with important implications in diagnosis and research of +various diseases. The proposed approach can significantly reduce the time and +cost associated with gene expression profiling, opening up new avenues for +high-throughput analysis of histology images for both research and clinical +applications. + +
+
+
+
+
+ + ♻ ☆ A Regularized Conditional GAN for Posterior Sampling in Image Recovery + Problems + + +
+ In image recovery problems, one seeks to infer an image from distorted, +incomplete, and/or noise-corrupted measurements. Such problems arise in +magnetic resonance imaging (MRI), computed tomography, deblurring, +super-resolution, inpainting, phase retrieval, image-to-image translation, and +other applications. Given a training set of signal/measurement pairs, we seek +to do more than just produce one good image estimate. Rather, we aim to rapidly +and accurately sample from the posterior distribution. To do this, we propose a +regularized conditional Wasserstein GAN that generates dozens of high-quality +posterior samples per second. Our regularization comprises an $\ell_1$ penalty +and an adaptively weighted standard-deviation reward. Using quantitative +evaluation metrics like conditional Fréchet inception distance, we +demonstrate that our method produces state-of-the-art posterior samples in both +multicoil MRI and large-scale inpainting applications. The code for our model +can be found here: https://github.com/matt-bendel/rcGAN + +
+
+
+
+
+ + ♻ ☆ More complex encoder is not all you need + + +
+ U-Net and its variants have been widely used in medical image segmentation. +However, most current U-Net variants confine their improvement strategies to +building more complex encoder, while leaving the decoder unchanged or adopting +a simple symmetric structure. These approaches overlook the true functionality +of the decoder: receiving low-resolution feature maps from the encoder and +restoring feature map resolution and lost information through upsampling. As a +result, the decoder, especially its upsampling component, plays a crucial role +in enhancing segmentation outcomes. However, in 3D medical image segmentation, +the commonly used transposed convolution can result in visual artifacts. This +issue stems from the absence of direct relationship between adjacent pixels in +the output feature map. Furthermore, plain encoder has already possessed +sufficient feature extraction capability because downsampling operation leads +to the gradual expansion of the receptive field, but the loss of information +during downsampling process is unignorable. To address the gap in relevant +research, we extend our focus beyond the encoder and introduce neU-Net (i.e., +not complex encoder U-Net), which incorporates a novel Sub-pixel Convolution +for upsampling to construct a powerful decoder. Additionally, we introduce +multi-scale wavelet inputs module on the encoder side to provide additional +information. Our model design achieves excellent results, surpassing other +state-of-the-art methods on both the Synapse and ACDC datasets. + +
+
+
+
+
+ + ♻ ☆ Exploring Diverse In-Context Configurations for Image Captioning NeurIPS2023 + + +
+ After discovering that Language Models (LMs) can be good in-context few-shot +learners, numerous strategies have been proposed to optimize in-context +sequence configurations. Recently, researchers in Vision-Language (VL) domains +also develop their few-shot learners, while they only use the simplest way, +i.e., random sampling, to configure in-context image-text pairs. In order to +explore the effects of varying configurations on VL in-context learning, we +devised four strategies for image selection and four for caption assignment to +configure in-context image-text pairs for image captioning. Here Image +Captioning is used as the case study since it can be seen as the +visually-conditioned LM. Our comprehensive experiments yield two +counter-intuitive but valuable insights, highlighting the distinct +characteristics of VL in-context learning due to multi-modal synergy, as +compared to the NLP case. Furthermore, in our exploration of optimal +combination strategies, we observed an average performance enhancement of 20.7 +of CIDEr scores compared to the baseline. The code is given in +https://github.com/yongliang-wu/ExploreCfg. + +
+
+ comment: Accepted by NeurIPS2023 +
+
+
+
+
+ + ♻ ☆ Adjoint Rigid Transform Network: Task-conditioned Alignment of 3D Shapes + + +
+ Most learning methods for 3D data (point clouds, meshes) suffer significant +performance drops when the data is not carefully aligned to a canonical +orientation. Aligning real world 3D data collected from different sources is +non-trivial and requires manual intervention. In this paper, we propose the +Adjoint Rigid Transform (ART) Network, a neural module which can be integrated +with a variety of 3D networks to significantly boost their performance. ART +learns to rotate input shapes to a learned canonical orientation, which is +crucial for a lot of tasks such as shape reconstruction, interpolation, +non-rigid registration, and latent disentanglement. ART achieves this with +self-supervision and a rotation equivariance constraint on predicted rotations. +The remarkable result is that with only self-supervision, ART facilitates +learning a unique canonical orientation for both rigid and nonrigid shapes, +which leads to a notable boost in performance of aforementioned tasks. We will +release our code and pre-trained models for further research. + +
+
+
+
+
+ + ♻ ☆ TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion + Refinement + + +
+ We present TOCH, a method for refining incorrect 3D hand-object interaction +sequences using a data prior. Existing hand trackers, especially those that +rely on very few cameras, often produce visually unrealistic results with +hand-object intersection or missing contacts. Although correcting such errors +requires reasoning about temporal aspects of interaction, most previous works +focus on static grasps and contacts. The core of our method are TOCH fields, a +novel spatio-temporal representation for modeling correspondences between hands +and objects during interaction. TOCH fields are a point-wise, object-centric +representation, which encode the hand position relative to the object. +Leveraging this novel representation, we learn a latent manifold of plausible +TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate +that TOCH outperforms state-of-the-art 3D hand-object interaction models, which +are limited to static grasps and contacts. More importantly, our method +produces smooth interactions even before and after contact. Using a single +trained TOCH model, we quantitatively and qualitatively demonstrate its +usefulness for correcting erroneous sequences from off-the-shelf RGB/RGB-D +hand-object reconstruction methods and transferring grasps across objects. + +
+
+
+
+
+ + ♻ ☆ FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling + + +
+ With the availability of large-scale video datasets and the advances of +diffusion models, text-driven video generation has achieved substantial +progress. However, existing video generation models are typically trained on a +limited number of frames, resulting in the inability to generate high-fidelity +long videos during inference. Furthermore, these models only support +single-text conditions, whereas real-life scenarios often require multi-text +conditions as the video content changes over time. To tackle these challenges, +this study explores the potential of extending the text-driven capability to +generate longer videos conditioned on multiple texts. 1) We first analyze the +impact of initial noise in video diffusion models. Then building upon the +observation of noise, we propose FreeNoise, a tuning-free and time-efficient +paradigm to enhance the generative capabilities of pretrained video diffusion +models while preserving content consistency. Specifically, instead of +initializing noises for all frames, we reschedule a sequence of noises for +long-range correlation and perform temporal attention over them by window-based +function. 2) Additionally, we design a novel motion injection method to support +the generation of videos conditioned on multiple text prompts. Extensive +experiments validate the superiority of our paradigm in extending the +generative capabilities of video diffusion models. It is noteworthy that +compared with the previous best-performing method which brought about 255% +extra time cost, our method incurs only negligible time cost of approximately +17%. Generated video samples are available at our website: +http://haonanqiu.com/projects/FreeNoise.html. + +
+
+ comment: Project Page: http://haonanqiu.com/projects/FreeNoise.html Code Repo: + https://github.com/arthur-qiu/LongerCrafter +
+
+
+
+
+ + ♻ ☆ Learning Better with Less: Effective Augmentation for Sample-Efficient + Visual Reinforcement Learning NeurIPS 2023 + + +
+ Data augmentation (DA) is a crucial technique for enhancing the sample +efficiency of visual reinforcement learning (RL) algorithms. Notably, employing +simple observation transformations alone can yield outstanding performance +without extra auxiliary representation tasks or pre-trained encoders. However, +it remains unclear which attributes of DA account for its effectiveness in +achieving sample-efficient visual RL. To investigate this issue and further +explore the potential of DA, this work conducts comprehensive experiments to +assess the impact of DA's attributes on its efficacy and provides the following +insights and improvements: (1) For individual DA operations, we reveal that +both ample spatial diversity and slight hardness are indispensable. Building on +this finding, we introduce Random PadResize (Rand PR), a new DA operation that +offers abundant spatial diversity with minimal hardness. (2) For multi-type DA +fusion schemes, the increased DA hardness and unstable data distribution result +in the current fusion schemes being unable to achieve higher sample efficiency +than their corresponding individual operations. Taking the non-stationary +nature of RL into account, we propose a RL-tailored multi-type DA fusion scheme +called Cycling Augmentation (CycAug), which performs periodic cycles of +different DA operations to increase type diversity while maintaining data +distribution consistency. Extensive evaluations on the DeepMind Control suite +and CARLA driving simulator demonstrate that our methods achieve superior +sample efficiency compared with the prior state-of-the-art methods. + +
+
+ comment: NeurIPS 2023 poster +
+
+
+
+
+ + ♻ ☆ Object pop-up: Can we infer 3D objects and their poses from human + interactions alone? CVPR'23 + + +
+ The intimate entanglement between objects affordances and human poses is of +large interest, among others, for behavioural sciences, cognitive psychology, +and Computer Vision communities. In recent years, the latter has developed +several object-centric approaches: starting from items, learning pipelines +synthesizing human poses and dynamics in a realistic way, satisfying both +geometrical and functional expectations. However, the inverse perspective is +significantly less explored: Can we infer 3D objects and their poses from human +interactions alone? Our investigation follows this direction, showing that a +generic 3D human point cloud is enough to pop up an unobserved object, even +when the user is just imitating a functionality (e.g., looking through a +binocular) without involving a tangible counterpart. We validate our method +qualitatively and quantitatively, with synthetic data and sequences acquired +for the task, showing applicability for XR/VR. The code is available at +https://github.com/ptrvilya/object-popup. + +
+
+ comment: Accepted at CVPR'23 +
+
+
+
+
+ + ♻ ☆ Towards an Effective and Efficient Transformer for Rain-by-snow Weather + Removal + + +
+ Rain-by-snow weather removal is a specialized task in weather-degraded image +restoration aiming to eliminate coexisting rain streaks and snow particles. In +this paper, we propose RSFormer, an efficient and effective Transformer that +addresses this challenge. Initially, we explore the proximity of convolution +networks (ConvNets) and vision Transformers (ViTs) in hierarchical +architectures and experimentally find they perform approximately at intra-stage +feature learning. On this basis, we utilize a Transformer-like convolution +block (TCB) that replaces the computationally expensive self-attention while +preserving attention characteristics for adapting to input content. We also +demonstrate that cross-stage progression is critical for performance +improvement, and propose a global-local self-attention sampling mechanism +(GLASM) that down-/up-samples features while capturing both global and local +dependencies. Finally, we synthesize two novel rain-by-snow datasets, +RSCityScape and RS100K, to evaluate our proposed RSFormer. Extensive +experiments verify that RSFormer achieves the best trade-off between +performance and time-consumption compared to other restoration methods. For +instance, it outperforms Restormer with a 1.53% reduction in the number of +parameters and a 15.6% reduction in inference time. Datasets, source code and +pre-trained models are available at https://github.com/chdwyb/RSFormer. + +
+
+ comment: code is available at https://github.com/chdwyb/RSFormer +
+
+
+
+
+ + ♻ ☆ A Unified Framework for Discovering Discrete Symmetries + + +
+ We consider the problem of learning a function respecting a symmetry from +among a class of symmetries. We develop a unified framework that enables +symmetry discovery across a broad range of subgroups including locally +symmetric, dihedral and cyclic subgroups. At the core of the framework is a +novel architecture composed of linear, matrix-valued and non-linear functions +that expresses functions invariant to these subgroups in a principled manner. +The structure of the architecture enables us to leverage multi-armed bandit +algorithms and gradient descent to efficiently optimize over the linear and the +non-linear functions, respectively, and to infer the symmetry that is +ultimately learnt. We also discuss the necessity of the matrix-valued functions +in the architecture. Experiments on image-digit sum and polynomial regression +tasks demonstrate the effectiveness of our approach. + +
+
+
+
+
+ + ♻ ☆ StableVQA: A Deep No-Reference Quality Assessment Model for Video + Stability ACM MM'23 + + +
+ Video shakiness is an unpleasant distortion of User Generated Content (UGC) +videos, which is usually caused by the unstable hold of cameras. In recent +years, many video stabilization algorithms have been proposed, yet no specific +and accurate metric enables comprehensively evaluating the stability of videos. +Indeed, most existing quality assessment models evaluate video quality as a +whole without specifically taking the subjective experience of video stability +into consideration. Therefore, these models cannot measure the video stability +explicitly and precisely when severe shakes are present. In addition, there is +no large-scale video database in public that includes various degrees of shaky +videos with the corresponding subjective scores available, which hinders the +development of Video Quality Assessment for Stability (VQA-S). To this end, we +build a new database named StableDB that contains 1,952 diversely-shaky UGC +videos, where each video has a Mean Opinion Score (MOS) on the degree of video +stability rated by 34 subjects. Moreover, we elaborately design a novel VQA-S +model named StableVQA, which consists of three feature extractors to acquire +the optical flow, semantic, and blur features respectively, and a regression +layer to predict the final stability score. Extensive experiments demonstrate +that the StableVQA achieves a higher correlation with subjective opinions than +the existing VQA-S models and generic VQA models. The database and codes are +available at https://github.com/QMME/StableVQA. + +
+
+ comment: Accepted by ACM MM'23 +
+
+
+
+
+ + ♻ ☆ FLARE: Fast Learning of Animatable and Relightable Mesh Avatars SIGGRAPH + + +
+ Our goal is to efficiently learn personalized animatable 3D head avatars from +videos that are geometrically accurate, realistic, relightable, and compatible +with current rendering systems. While 3D meshes enable efficient processing and +are highly portable, they lack realism in terms of shape and appearance. Neural +representations, on the other hand, are realistic but lack compatibility and +are slow to train and render. Our key insight is that it is possible to +efficiently learn high-fidelity 3D mesh representations via differentiable +rendering by exploiting highly-optimized methods from traditional computer +graphics and approximating some of the components with neural networks. To that +end, we introduce FLARE, a technique that enables the creation of animatable +and relightable mesh avatars from a single monocular video. First, we learn a +canonical geometry using a mesh representation, enabling efficient +differentiable rasterization and straightforward animation via learned +blendshapes and linear blend skinning weights. Second, we follow +physically-based rendering and factor observed colors into intrinsic albedo, +roughness, and a neural representation of the illumination, allowing the +learned avatars to be relit in novel scenes. Since our input videos are +captured on a single device with a narrow field of view, modeling the +surrounding environment light is non-trivial. Based on the split-sum +approximation for modeling specular reflections, we address this by +approximating the pre-filtered environment map with a multi-layer perceptron +(MLP) modulated by the surface roughness, eliminating the need to explicitly +model the light. We demonstrate that our mesh-based avatar formulation, +combined with learned deformation, material, and lighting MLPs, produces +avatars with high-quality geometry and appearance, while also being efficient +to train and render compared to existing approaches. + +
+
+ comment: 15 pages, Accepted: ACM Transactions on Graphics (Proceedings of + SIGGRAPH Asia), 2023 +
+
+
+
+
+ + ♻ ☆ SPA: A Graph Spectral Alignment Perspective for Domain Adaptation NeurIPS 2023 + + +
+ Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to +extend the in-domain model to the distinctive target domains where the data +distributions differ. Most prior works focus on capturing the inter-domain +transferability but largely overlook rich intra-domain structures, which +empirically results in even worse discriminability. In this work, we introduce +a novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The +core of our method is briefly condensed as follows: (i)-by casting the DA +problem to graph primitives, SPA composes a coarse graph alignment mechanism +with a novel spectral regularizer towards aligning the domain graphs in +eigenspaces; (ii)-we further develop a fine-grained message propagation module +-- upon a novel neighbor-aware self-training mechanism -- in order for enhanced +discriminability in the target domain. On standardized benchmarks, the +extensive experiments of SPA demonstrate that its performance has surpassed the +existing cutting-edge DA methods. Coupled with dense model analysis, we +conclude that our approach indeed possesses superior efficacy, robustness, +discriminability, and transferability. Code and data are available at: +https://github.com/CrownX/SPA. + +
+
+ comment: NeurIPS 2023 camera ready +
+
+
+
+
+ + ♻ ☆ Global Structure-Aware Diffusion Process for Low-Light Image Enhancement NeurIPS 2023 + + +
+ This paper studies a diffusion-based framework to address the low-light image +enhancement problem. To harness the capabilities of diffusion models, we delve +into this intricate process and advocate for the regularization of its inherent +ODE-trajectory. To be specific, inspired by the recent research that low +curvature ODE-trajectory results in a stable and effective diffusion process, +we formulate a curvature regularization term anchored in the intrinsic +non-local structures of image data, i.e., global structure-aware +regularization, which gradually facilitates the preservation of complicated +details and the augmentation of contrast during the diffusion process. This +incorporation mitigates the adverse effects of noise and artifacts resulting +from the diffusion process, leading to a more precise and flexible enhancement. +To additionally promote learning in challenging regions, we introduce an +uncertainty-guided regularization technique, which wisely relaxes constraints +on the most extreme regions of the image. Experimental evaluations reveal that +the proposed diffusion-based framework, complemented by rank-informed +regularization, attains distinguished performance in low-light enhancement. The +outcomes indicate substantial advancements in image quality, noise suppression, +and contrast amplification in comparison with state-of-the-art methods. We +believe this innovative approach will stimulate further exploration and +advancement in low-light image processing, with potential implications for +other applications of diffusion models. The code is publicly available at +https://github.com/jinnh/GSAD. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Representation Learning via Consistent Assignment of Views over Random + Partitions NeurIPS 2023 + + +
+ We present Consistent Assignment of Views over Random Partitions (CARP), a +self-supervised clustering method for representation learning of visual +features. CARP learns prototypes in an end-to-end online fashion using gradient +descent without additional non-differentiable modules to solve the cluster +assignment problem. CARP optimizes a new pretext task based on random +partitions of prototypes that regularizes the model and enforces consistency +between views' assignments. Additionally, our method improves training +stability and prevents collapsed solutions in joint-embedding training. Through +an extensive evaluation, we demonstrate that CARP's representations are +suitable for learning downstream tasks. We evaluate CARP's representations +capabilities in 17 datasets across many standard protocols, including linear +evaluation, few-shot classification, k-NN, k-means, image retrieval, and copy +detection. We compare CARP performance to 11 existing self-supervised methods. +We extensively ablate our method and demonstrate that our proposed random +partition pretext task improves the quality of the learned representations by +devising multiple random classification tasks. In transfer learning tasks, CARP +achieves the best performance on average against many SSL methods trained for a +longer time. + +
+
+ comment: To appear in NeurIPS 2023. Code available at + https://github.com/sthalles/carp +
+
+
+
+
+ + ♻ ☆ Bayesian sparsification for deep neural networks with Bayesian model + reduction + + +
+ Deep learning's immense capabilities are often constrained by the complexity +of its models, leading to an increasing demand for effective sparsification +techniques. Bayesian sparsification for deep learning emerges as a crucial +approach, facilitating the design of models that are both computationally +efficient and competitive in terms of performance across various deep learning +applications. The state-of-the-art -- in Bayesian sparsification of deep neural +networks -- combines structural shrinkage priors on model weights with an +approximate inference scheme based on stochastic variational inference. +However, model inversion of the full generative model is exceptionally +computationally demanding, especially when compared to standard deep learning +of point estimates. In this context, we advocate for the use of Bayesian model +reduction (BMR) as a more efficient alternative for pruning of model weights. +As a generalization of the Savage-Dickey ratio, BMR allows a post-hoc +elimination of redundant model weights based on the posterior estimates under a +straightforward (non-hierarchical) generative model. Our comparative study +highlights the advantages of the BMR method relative to established approaches +based on hierarchical horseshoe priors over model weights. We illustrate the +potential of BMR across various deep learning architectures, from classical +networks like LeNet to modern frameworks such as Vision Transformers and +MLP-Mixers. + +
+
+
+
+
+ + ♻ ☆ MCUFormer: Deploying Vision Tranformers on Microcontrollers with Limited + Memory NeurIPS 2023 + + +
+ Due to the high price and heavy energy consumption of GPUs, deploying deep +models on IoT devices such as microcontrollers makes significant contributions +for ecological AI. Conventional methods successfully enable convolutional +neural network inference of high resolution images on microcontrollers, while +the framework for vision transformers that achieve the state-of-the-art +performance in many vision applications still remains unexplored. In this +paper, we propose a hardware-algorithm co-optimizations method called MCUFormer +to deploy vision transformers on microcontrollers with extremely limited +memory, where we jointly design transformer architecture and construct the +inference operator library to fit the memory resource constraint. More +specifically, we generalize the one-shot network architecture search (NAS) to +discover the optimal architecture with highest task performance given the +memory budget from the microcontrollers, where we enlarge the existing search +space of vision transformers by considering the low-rank decomposition +dimensions and patch resolution for memory reduction. For the construction of +the inference operator library of vision transformers, we schedule the memory +buffer during inference through operator integration, patch embedding +decomposition, and token overwriting, allowing the memory buffer to be fully +utilized to adapt to the forward pass of the vision transformer. Experimental +results demonstrate that our MCUFormer achieves 73.62\% top-1 accuracy on +ImageNet for image classification with 320KB memory on STM32F746 +microcontroller. Code is available at https://github.com/liangyn22/MCUFormer. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Text-driven Editing of 3D Scenes without Retraining + + +
+ Numerous diffusion models have recently been applied to image synthesis and +editing. However, editing 3D scenes is still in its early stages. It poses +various challenges, such as the requirement to design specific methods for +different editing types, retraining new models for various 3D scenes, and the +absence of convenient human interaction during editing. To tackle these issues, +we introduce a text-driven editing method, termed DN2N, which allows for the +direct acquisition of a NeRF model with universal editing capabilities, +eliminating the requirement for retraining. Our method employs off-the-shelf +text-based editing models of 2D images to modify the 3D scene images, followed +by a filtering process to discard poorly edited images that disrupt 3D +consistency. We then consider the remaining inconsistency as a problem of +removing noise perturbation, which can be solved by generating training data +with similar perturbation characteristics for training. We further propose +cross-view regularization terms to help the generalized NeRF model mitigate +these perturbations. Our text-driven method allows users to edit a 3D scene +with their desired description, which is more friendly, intuitive, and +practical than prior works. Empirical results show that our method achieves +multiple editing types, including but not limited to appearance editing, +weather transition, material changing, and style transfer. Most importantly, +our method generalizes well with editing abilities shared among a set of model +parameters without requiring a customized editing model for some specific +scenes, thus inferring novel views with editing effects directly from user +input. The project website is available at https://sk-fun.fun/DN2N + +
+
+ comment: Project Website: https://sk-fun.fun/DN2N +
+
+
+
+
+ + ♻ ☆ Decoupled Diffusion Models: Image to Zero and Zero to Noise + + +
+ Recent diffusion probabilistic models (DPMs) have shown remarkable abilities +of generated content, however, they often suffer from complex forward +processes, resulting in inefficient solutions for the reversed process and +prolonged sampling times. In this paper, we aim to address the aforementioned +challenges by focusing on the diffusion process itself that we propose to +decouple the intricate diffusion process into two comparatively simpler process +to improve the generative efficacy and speed. In particular, we present a novel +diffusion paradigm named DDM (Decoupled Diffusion Models) based on the Ito +diffusion process, in which the image distribution is approximated by an +explicit transition probability while the noise path is controlled by the +standard Wiener process. We find that decoupling the diffusion process reduces +the learning difficulty and the explicit transition probability improves the +generative speed significantly. We prove a new training objective for DPM, +which enables the model to learn to predict the noise and image components +separately. Moreover, given the novel forward diffusion equation, we derive the +reverse denoising formula of DDM that naturally supports fewer steps of +generation without ordinary differential equation (ODE) based accelerators. Our +experiments demonstrate that DDM outperforms previous DPMs by a large margin in +fewer function evaluations setting and gets comparable performances in long +function evaluations setting. We also show that our framework can be applied to +image-conditioned generation and high-resolution image synthesis, and that it +can generate high-quality images with only 10 function evaluations. + +
+
+
+
+
+ + ♻ ☆ Generalizing to Unseen Domains in Diabetic Retinopathy Classification WACV 2024 + + +
+ Diabetic retinopathy (DR) is caused by long-standing diabetes and is among +the fifth leading cause for visual impairments. The process of early diagnosis +and treatments could be helpful in curing the disease, however, the detection +procedure is rather challenging and mostly tedious. Therefore, automated +diabetic retinopathy classification using deep learning techniques has gained +interest in the medical imaging community. Akin to several other real-world +applications of deep learning, the typical assumption of i.i.d data is also +violated in DR classification that relies on deep learning. Therefore, +developing DR classification methods robust to unseen distributions is of great +value. In this paper, we study the problem of generalizing a model to unseen +distributions or domains (a.k.a domain generalization) in DR classification. To +this end, we propose a simple and effective domain generalization (DG) approach +that achieves self-distillation in vision transformers (ViT) via a novel +prediction softening mechanism. This prediction softening is an adaptive convex +combination of one-hot labels with the model's own knowledge. We perform extensive +experiments on challenging open-source DR classification datasets under both +multi-source and single-source DG settings with three different ViT backbones +to establish the efficacy and applicability of our approach against competing +methods. For the first time, we report the performance of several +state-of-the-art DG methods on open-source DR classification datasets after +conducting thorough experiments. Finally, our method is also capable of +delivering improved calibration performance than other methods, showing its +suitability for safety-critical applications, including healthcare. We hope +that our contributions would investigate more DG research across the medical +imaging community. + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ♻ ☆ Pre-training Contextualized World Models with In-the-wild Videos for + Reinforcement Learning NeurIPS 2023 + + +
+ Unsupervised pre-training methods utilizing large and diverse datasets have +achieved tremendous success across a range of domains. Recent work has +investigated such unsupervised pre-training methods for model-based +reinforcement learning (MBRL) but is limited to domain-specific or simulated +data. In this paper, we study the problem of pre-training world models with +abundant in-the-wild videos for efficient learning of downstream visual control +tasks. However, in-the-wild videos are complicated with various contextual +factors, such as intricate backgrounds and textured appearance, which precludes +a world model from extracting shared world knowledge to generalize better. To +tackle this issue, we introduce Contextualized World Models (ContextWM) that +explicitly separate context and dynamics modeling to overcome the complexity +and diversity of in-the-wild videos and facilitate knowledge transfer between +distinct scenes. Specifically, a contextualized extension of the latent +dynamics model is elaborately realized by incorporating a context encoder to +retain contextual information and empower the image decoder, which encourages +the latent dynamics model to concentrate on essential temporal variations. Our +experiments show that in-the-wild video pre-training equipped with ContextWM +can significantly improve the sample efficiency of MBRL in various domains, +including robotic manipulation, locomotion, and autonomous driving. Code is +available at this repository: https://github.com/thuml/ContextWM. + +
+
+ comment: NeurIPS 2023. Code is available at https://github.com/thuml/ContextWM +
+
+
+
+
+ + ♻ ☆ Joint-Relation Transformer for Multi-Person Motion Prediction + + +
+ Multi-person motion prediction is a challenging problem due to the dependency +of motion on both individual past movements and interactions with other people. +Transformer-based methods have shown promising results on this task, but they +miss the explicit relation representation between joints, such as skeleton +structure and pairwise distance, which is crucial for accurate interaction +modeling. In this paper, we propose the Joint-Relation Transformer, which +utilizes relation information to enhance interaction modeling and improve +future motion prediction. Our relation information contains the relative +distance and the intra-/inter-person physical constraints. To fuse relation and +joint information, we design a novel joint-relation fusion layer with +relation-aware attention to update both features. Additionally, we supervise +the relation information by forecasting future distance. Experiments show that +our method achieves a 13.4% improvement of 900ms VIM on 3DPW-SoMoF/RC and +17.8%/12.0% improvement of 3s MPJPE on CMU-Mpcap/MuPoTS-3D dataset. + +
+
+
+
+
+ + ♻ ☆ Jigsaw: Learning to Assemble Multiple Fractured Objects NeurIPS 2023 + + +
+ Automated assembly of 3D fractures is essential in orthopedics, archaeology, +and our daily life. This paper presents Jigsaw, a novel framework for +assembling physically broken 3D objects from multiple pieces. Our approach +leverages hierarchical features of global and local geometry to match and align +the fracture surfaces. Our framework consists of four components: (1) front-end +point feature extractor with attention layers, (2) surface segmentation to +separate fracture and original parts, (3) multi-parts matching to find +correspondences among fracture surface points, and (4) robust global alignment +to recover the global poses of the pieces. We show how to jointly learn +segmentation and matching and seamlessly integrate feature matching and +rigidity constraints. We evaluate Jigsaw on the Breaking Bad dataset and +achieve superior performance compared to state-of-the-art methods. Our method +also generalizes well to diverse fracture modes, objects, and unseen instances. +To the best of our knowledge, this is the first learning-based method designed +specifically for 3D fracture assembly over multiple pieces. Our code is +available at https://jiaxin-lu.github.io/Jigsaw/. + +
+
+ comment: 18 pages, 9 figures, NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Understanding the Latent Space of Diffusion Models through the Lens of + Riemannian Geometry NeurIPS 2023 + + +
+ Despite the success of diffusion models (DMs), we still lack a thorough +understanding of their latent space. To understand the latent space +$\mathbf{x}_t \in \mathcal{X}$, we analyze them from a geometrical perspective. +Our approach involves deriving the local latent basis within $\mathcal{X}$ by +leveraging the pullback metric associated with their encoding feature maps. +Remarkably, our discovered local latent basis enables image editing +capabilities by moving $\mathbf{x}_t$, the latent space of DMs, along the basis +vector at specific timesteps. We further analyze how the geometric structure of +DMs evolves over diffusion timesteps and differs across different text +conditions. This confirms the known phenomenon of coarse-to-fine generation, as +well as reveals novel insights such as the discrepancy between $\mathbf{x}_t$ +across timesteps, the effect of dataset complexity, and the time-varying +influence of text prompts. To the best of our knowledge, this paper is the +first to present image editing through $\mathbf{x}$-space traversal, editing +only once at specific timestep $t$ without any additional training, and +providing thorough analyses of the latent structure of DMs. The code to +reproduce our experiments can be found at +https://github.com/enkeejunior1/Diffusion-Pullback. + +
+
+ comment: This paper has been accepted for NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Visual Programming for Text-to-Image Generation and Evaluation NeurIPS 2023 + + +
+ As large language models have demonstrated impressive performance in many +domains, recent works have adopted language models (LMs) as controllers of +visual modules for vision-and-language tasks. While existing work focuses on +equipping LMs with visual understanding, we propose two novel +interpretable/explainable visual programming frameworks for text-to-image (T2I) +generation and evaluation. First, we introduce VPGen, an interpretable +step-by-step T2I generation framework that decomposes T2I generation into three +steps: object/count generation, layout generation, and image generation. We +employ an LM to handle the first two steps (object/count generation and layout +generation), by finetuning it on text-layout pairs. Our step-by-step T2I +generation framework provides stronger spatial control than end-to-end models, +the dominant approach for this task. Furthermore, we leverage the world +knowledge of pretrained LMs, overcoming the limitation of previous +layout-guided T2I works that can only handle predefined object classes. We +demonstrate that our VPGen has improved control in counts/spatial +relations/scales of objects than state-of-the-art T2I generation models. +Second, we introduce VPEval, an interpretable and explainable evaluation +framework for T2I generation based on visual programming. Unlike previous T2I +evaluations with a single scoring model that is accurate in some skills but +unreliable in others, VPEval produces evaluation programs that invoke a set of +visual modules that are experts in different skills, and also provides +visual+textual explanations of the evaluation results. Our analysis shows that +VPEval provides a more human-correlated evaluation for skill-specific and +open-ended prompts than widely used single model-based evaluation. We hope that +our work encourages future progress on interpretable/explainable generation and +evaluation for T2I models. + +
+
+ comment: NeurIPS 2023; Project website: https://vp-t2i.github.io +
+
+
+
+
+ + ♻ ☆ TIES-Merging: Resolving Interference When Merging Models NeurIPS 2023 + + +
+ Transfer learning - i.e., further fine-tuning a pre-trained model on a +downstream task - can confer significant advantages, including improved +downstream performance, faster convergence, and better sample efficiency. These +advantages have led to a proliferation of task-specific fine-tuned models, +which typically can only perform a single task and do not benefit from one +another. Recently, model merging techniques have emerged as a solution to +combine multiple task-specific models into a single multitask model without +performing additional training. However, existing merging methods often ignore +the interference between parameters of different models, resulting in large +performance drops when merging multiple models. In this paper, we demonstrate +that prior merging techniques inadvertently lose valuable information due to +two major sources of interference: (a) interference due to redundant parameter +values and (b) disagreement on the sign of a given parameter's values across +models. To address this, we propose our method, TRIM, ELECT SIGN & MERGE +(TIES-Merging), which introduces three novel steps when merging models: (1) +resetting parameters that only changed a small amount during fine-tuning, (2) +resolving sign conflicts, and (3) merging only the parameters that are in +alignment with the final agreed-upon sign. We find that TIES-Merging +outperforms several existing methods in diverse settings covering a range of +modalities, domains, number of tasks, model sizes, architectures, and +fine-tuning settings. We further analyze the impact of different types of +interference on model parameters, and highlight the importance of resolving +sign interference. Our code is available at +https://github.com/prateeky2806/ties-merging + +
+
+ comment: Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables +
+
+
+
+
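The three TIES-Merging steps listed in the abstract above (trim small changes, resolve sign conflicts, merge only the values that agree with the chosen sign) can be sketched on flat parameter vectors as below. This is an illustrative sketch, not the released code: the function name, the `keep_fraction` and `scale` parameters, electing the sign from the summed trimmed changes, and averaging the agreeing values are all assumptions made for the example.

```python
import numpy as np


def ties_merge(pretrained, finetuned_list, keep_fraction=0.2, scale=1.0):
    """Merge fine-tuned parameter vectors into one (hypothetical helper)."""
    # Change of each fine-tuned model relative to the pre-trained weights.
    task_vectors = [ft - pretrained for ft in finetuned_list]

    # Step 1 (trim): reset entries that changed only a small amount,
    # keeping roughly the top keep_fraction by magnitude (ties may keep more).
    trimmed = []
    for tv in task_vectors:
        k = max(1, int(round(keep_fraction * tv.size)))
        threshold = np.sort(np.abs(tv).ravel())[-k]
        trimmed.append(np.where(np.abs(tv) >= threshold, tv, 0.0))
    stacked = np.stack(trimmed)

    # Step 2 (elect sign): assumed rule, the sign of the summed changes.
    elected = np.sign(stacked.sum(axis=0))

    # Step 3 (merge): average only the nonzero entries whose sign agrees.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(axis=0)
    merged = (stacked * agree).sum(axis=0) / np.maximum(counts, 1)

    return pretrained + scale * merged


base = np.zeros(4)
model_a = np.array([1.0, 0.01, -2.0, 0.5])
model_b = np.array([3.0, 0.02, 1.0, -0.5])
merged_params = ties_merge(base, [model_a, model_b], keep_fraction=0.5)
```

In the third coordinate the two models disagree in sign; the larger change wins the election and the conflicting value is left out of the average instead of cancelling it.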
+ + ♻ ☆ Addressing Uncertainty in Imbalanced Histopathology Image Classification + of HER2 Breast Cancer: An interpretable Ensemble Approach with Threshold + Filtered Single Instance Evaluation (SIE) + + +
+ Breast Cancer (BC) is among women's most lethal health concerns. Early +diagnosis can alleviate the mortality rate by helping patients make efficient +treatment decisions. Human Epidermal Growth Factor Receptor (HER2) has become +one of the most lethal subtypes of BC. According to the College of American +Pathologists/American Society of Clinical Oncology (CAP/ASCO), the severity +level of HER2 expression can be classified in the range 0 to 3+. HER2 can be +detected effectively from immunohistochemical (IHC) and hematoxylin & eosin +(H&E) images of different classes such as 0, 1+, 2+, and 3+. An ensemble +approach integrated with threshold filtered single instance evaluation (SIE) +technique has been proposed in this study to diagnose BC from the +multi-categorical expression of HER2 subtypes. Initially, DenseNet201 and +Xception have been ensembled into a single classifier as feature extractors +with an effective combination of global average pooling, dropout layer, dense +layer with a swish activation function, and l2 regularizer, batch +normalization, etc. After that, extracted features have been processed through +single instance evaluation (SIE) to determine different confidence levels and +adjust the decision boundary among the imbalanced classes. This study has been +conducted on the BC immunohistochemical (BCI) dataset, which is classified by +pathologists into four stages of HER2 BC. The proposed approach, known as +DenseNet201-Xception-SIE, with a threshold value of 0.7 surpassed all other +existing state-of-the-art models with an accuracy of 97.12%, precision of 97.15%, +and recall of 97.68% on H&E data, and an accuracy of 97.56%, precision of 97.57%, +and recall of 98.00% on IHC data, a substantial +improvement. Finally, Grad-CAM and Guided Grad-CAM have been employed in this +study to interpret how the TL-based model works on the histopathology dataset and +makes decisions from the data. + +
+
+
+
+
+ + ♻ ☆ Controlling Text-to-Image Diffusion by Orthogonal Finetuning NeurIPS 2023 + + +
+ Large text-to-image diffusion models have impressive capabilities in +generating photorealistic images from text prompts. How to effectively guide or +control these powerful models to perform different downstream tasks becomes an +important open problem. To tackle this challenge, we introduce a principled +finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image +diffusion models to downstream tasks. Unlike existing methods, OFT can provably +preserve hyperspherical energy which characterizes the pairwise neuron +relationship on the unit hypersphere. We find that this property is crucial for +preserving the semantic generation ability of text-to-image diffusion models. +To improve finetuning stability, we further propose Constrained Orthogonal +Finetuning (COFT) which imposes an additional radius constraint to the +hypersphere. Specifically, we consider two important finetuning text-to-image +tasks: subject-driven generation where the goal is to generate subject-specific +images given a few images of a subject and a text prompt, and controllable +generation where the goal is to enable the model to take in additional control +signals. We empirically show that our OFT framework outperforms existing +methods in generation quality and convergence speed. + +
+
+ comment: NeurIPS 2023 (43 pages, 34 figures, project page: + https://oft.wyliu.com/) +
+
+
+
+
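The Orthogonal Finetuning abstract above says that OFT preserves the pairwise relationship between neurons on the unit hypersphere. The sketch below illustrates only the underlying linear-algebra fact: multiplying a weight matrix by an orthogonal matrix leaves the pairwise angles between its neurons unchanged. It is not the authors' method; treating neurons as columns of `W`, building the orthogonal matrix with a Cayley transform, and all function names are assumptions made for the example.

```python
import numpy as np


def cayley_orthogonal(raw):
    """Return an orthogonal matrix built from an arbitrary square matrix.

    S = raw - raw.T is skew-symmetric, so I + S is always invertible and
    (I + S)^{-1} (I - S) is orthogonal (Cayley transform).
    """
    skew = raw - raw.T
    identity = np.eye(raw.shape[0])
    return np.linalg.solve(identity + skew, identity - skew)


def pairwise_cosines(weights):
    """Cosine similarity between every pair of columns (neurons)."""
    unit = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    return unit.T @ unit


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                      # 3 neurons in 5 dimensions
R = cayley_orthogonal(rng.normal(size=(5, 5)))   # stand-in for a learned R
W_finetuned = R @ W

same_angles = np.allclose(pairwise_cosines(W), pairwise_cosines(W_finetuned))
```

Any quantity that depends only on these pairwise angles is therefore unchanged by such an update, which is the intuition behind the preservation claim in the abstract.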
+ + ♻ ☆ Towards Realistic Generative 3D Face Models + + +
+ In recent years, there has been significant progress in 2D generative face +models fueled by applications such as animation, synthetic data generation, and +digital avatars. However, due to the absence of 3D information, these 2D models +often struggle to accurately disentangle facial attributes like pose, +expression, and illumination, limiting their editing capabilities. To address +this limitation, this paper proposes a 3D controllable generative face model to +produce high-quality albedo and precise 3D shape leveraging existing 2D +generative models. By combining 2D face generative models with semantic face +manipulation, this method enables editing of detailed 3D rendered faces. The +proposed framework utilizes an alternating descent optimization approach over +shape and albedo. Differentiable rendering is used to train high-quality shapes +and albedo without 3D supervision. Moreover, this approach outperforms the +state-of-the-art (SOTA) methods in the well-known NoW benchmark for shape +reconstruction. It also outperforms the SOTA reconstruction models in +recovering rendered faces' identities across novel poses by an average of 10%. +Additionally, the paper demonstrates direct control of expressions in 3D faces +by exploiting latent space leading to text-based editing of 3D faces. + +
+
+ comment: Preprint +
+
+
+
+
+
+
+
+ + Information Retrieval 10 + +
+
+
+ + ☆ Text2Bundle: Towards Personalized Query-based Bundle Generation + + +
+ Bundle generation aims to provide a bundle of items for the user, and has +been widely studied and applied on online service platforms. Existing bundle +generation methods mainly utilize the user's preferences from historical +interactions in the common recommendation paradigm, and ignore the potential +textual query, which expresses the user's current explicit intention. In a +scenario where a user proactively queries a bundle with some natural +language description, the system should be able to generate a bundle that +exactly matches the user's intention through the user's query and preferences. +In this work, we define this user-friendly scenario as the Query-based Bundle +Generation task and propose a novel framework, Text2Bundle, that leverages both +the user's short-term interests from the query and the user's long-term +preferences from the historical interactions. Our framework consists of three +modules: (1) a query interest extractor that mines the user's fine-grained +interests from the query; (2) a unified state encoder that learns the current +bundle context state and the user's preferences based on historical interaction +and current query; and (3) a bundle generator that generates personalized and +complementary bundles using reinforcement learning with specifically designed +rewards. We conduct extensive experiments on three real-world datasets and +demonstrate the effectiveness of our framework compared with several +state-of-the-art methods. + +
+
+
+
+
+ + ☆ Chain-of-Choice Hierarchical Policy Learning for Conversational + Recommendation + + +
+ Conversational Recommender Systems (CRS) illuminate user preferences via +multi-round interactive dialogues, ultimately navigating towards precise and +satisfactory recommendations. However, contemporary CRS are limited to +inquiring binary or multi-choice questions based on a single attribute type +(e.g., color) per round, which causes excessive rounds of interaction and +diminishes the user's experience. To address this, we propose a more realistic +and efficient conversational recommendation problem setting, called +Multi-Type-Attribute Multi-round Conversational Recommendation (MTAMCR), which +enables CRS to inquire about multi-choice questions covering multiple types of +attributes in each round, thereby improving interactive efficiency. Moreover, +by formulating MTAMCR as a hierarchical reinforcement learning task, we propose +a Chain-of-Choice Hierarchical Policy Learning (CoCHPL) framework to enhance +both the questioning efficiency and recommendation effectiveness in MTAMCR. +Specifically, a long-term policy over options (i.e., ask or recommend) +determines the action type, while two short-term intra-option policies +sequentially generate the chain of attributes or items through multi-step +reasoning and selection, optimizing the diversity and interdependence of +questioning attributes. Finally, extensive experiments on four benchmarks +demonstrate the superior performance of CoCHPL over prevailing state-of-the-art +methods. + +
+
+ comment: Release with source code +
+
+
+
+
+ + ☆ Ranking with Slot Constraints + + +
+ We introduce the problem of ranking with slot constraints, which can be used +to model a wide range of application problems -- from college admission with +limited slots for different majors, to composing a stratified cohort of +eligible participants in a medical trial. We show that the conventional +Probability Ranking Principle (PRP) can be highly sub-optimal for +slot-constrained ranking problems, and we devise a new ranking algorithm, +called MatchRank. The goal of MatchRank is to produce rankings that maximize +the number of filled slots if candidates are evaluated by a human decision +maker in the order of the ranking. In this way, MatchRank generalizes the PRP, +and it subsumes the PRP as a special case when there are no slot constraints. +Our theoretical analysis shows that MatchRank has a strong approximation +guarantee without any independence assumptions between slots or candidates. +Furthermore, we show how MatchRank can be implemented efficiently. Beyond the +theoretical guarantees, empirical evaluations show that MatchRank can provide +substantial improvements over a range of synthetic and real-world tasks. + +
+
+
+
+
+ + ♻ ☆ Framework based on complex networks to model and mine patient pathways + + +
+ The automatic discovery of a model to represent the history of encounters of +a group of patients with the healthcare system -- the so-called "pathway of +patients" -- is a new field of research that supports clinical and +organisational decisions to improve the quality and efficiency of the treatment +provided. The pathways of patients with chronic conditions tend to vary +significantly from one person to another, have repetitive tasks, and demand the +analysis of multiple perspectives (interventions, diagnoses, medical +specialities, among others) influencing the results. Therefore, modelling and +mining those pathways is still a challenging task. In this work, we propose a +framework comprising: (i) a pathway model based on a multi-aspect graph, (ii) a +novel dissimilarity measurement to compare pathways taking the elapsed time +into account, and (iii) a mining method based on traditional centrality +measures to discover the most relevant steps of the pathways. We evaluated the +framework using the study cases of pregnancy and diabetes, which revealed its +usefulness in finding clusters of similar pathways, representing them in an +easy-to-interpret way, and highlighting the most significant patterns according +to multiple perspectives. + +
+
+ comment: 35 pages, 11 figures, 2 appendices +
+
+
+
+
+ + ♻ ☆ Machine Reading Comprehension using Case-based Reasoning + + +
+ We present an accurate and interpretable method for answer extraction in +machine reading comprehension that is reminiscent of case-based reasoning (CBR) +from classical AI. Our method (CBR-MRC) builds upon the hypothesis that +contextualized answers to similar questions share semantic similarities with +each other. Given a test question, CBR-MRC first retrieves a set of similar +cases from a non-parametric memory and then predicts an answer by selecting the +span in the test context that is most similar to the contextualized +representations of answers in the retrieved cases. The semi-parametric nature +of our approach allows it to attribute a prediction to the specific set of +evidence cases, making it a desirable choice for building reliable and +debuggable QA systems. We show that CBR-MRC provides high accuracy comparable +with large reader models and outperforms baselines by 11.5 and 8.4 EM on +NaturalQuestions and NewsQA, respectively. Further, we demonstrate the ability +of CBR-MRC in identifying not just the correct answer tokens but also the span +with the most relevant supporting evidence. Lastly, we observe that contexts +for certain question types show higher lexical diversity than others and find +that CBR-MRC is robust to these variations while performance using +fully-parametric methods drops. + +
+
+ comment: 9 pages, 2 figures +
+
+
+
+
+ + ♻ ☆ Is ChatGPT Good at Search? Investigating Large Language Models as + Re-Ranking Agents EMNLP 2023 + + +
+ Large Language Models (LLMs) have demonstrated remarkable zero-shot +generalization across various language-related tasks, including search engines. +However, existing work utilizes the generative ability of LLMs for Information +Retrieval (IR) rather than direct passage ranking. The discrepancy between the +pre-training objectives of LLMs and the ranking objective poses another +challenge. In this paper, we first investigate generative LLMs such as ChatGPT +and GPT-4 for relevance ranking in IR. Surprisingly, our experiments reveal +that properly instructed LLMs can deliver competitive, even superior results to +state-of-the-art supervised methods on popular IR benchmarks. Furthermore, to +address concerns about data contamination of LLMs, we collect a new test set +called NovelEval, based on the latest knowledge and aiming to verify the +model's ability to rank unknown knowledge. Finally, to improve efficiency in +real-world applications, we delve into the potential for distilling the ranking +capabilities of ChatGPT into small specialized models using a permutation +distillation scheme. Our evaluation results show that a distilled 440M +model outperforms a 3B supervised model on the BEIR benchmark. The code to +reproduce our results is available at www.github.com/sunnweiwei/RankGPT. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Is ChatGPT a Good Recommender? A Preliminary Study CIKM 2023 + + +
+ Recommendation systems have witnessed significant advancements and have been +widely used over the past decades. However, most traditional recommendation +methods are task-specific and therefore lack efficient generalization ability. +Recently, the emergence of ChatGPT has significantly advanced NLP tasks by +enhancing the capabilities of conversational models. Nonetheless, the +application of ChatGPT in the recommendation domain has not been thoroughly +investigated. In this paper, we employ ChatGPT as a general-purpose +recommendation model to explore its potential for transferring extensive +linguistic and world knowledge acquired from large-scale corpora to +recommendation scenarios. Specifically, we design a set of prompts and evaluate +ChatGPT's performance on five recommendation scenarios. Unlike traditional +recommendation methods, we do not fine-tune ChatGPT during the entire +evaluation process, relying only on the prompts themselves to convert +recommendation tasks into natural language tasks. Further, we explore the use +of few-shot prompting to inject interaction information that contains user +potential interest to help ChatGPT better understand user needs and interests. +Comprehensive experimental results on Amazon Beauty dataset show that ChatGPT +has achieved promising results in certain tasks and is capable of reaching the +baseline level in others. We conduct human evaluations on two +explainability-oriented tasks to more accurately evaluate the quality of +contents generated by different models. And the human evaluations show ChatGPT +can truly understand the provided information and generate clearer and more +reasonable results. We hope that our study can inspire researchers to further +explore the potential of language models like ChatGPT to improve recommendation +performance and contribute to the advancement of the recommendation systems +field. + +
+
+ comment: Accepted by CIKM 2023 GenRec Workshop +
+
+
+
+
+ + ♻ ☆ DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge + Graphs EMNLP 2023 + + +
+ Recent work within the Argument Mining community has shown the applicability +of Natural Language Processing systems for solving problems found within +competitive debate. One of the most important tasks within competitive debate +is for debaters to create high quality debate cases. We show that effective +debate cases can be constructed using constrained shortest path traversals on +Argumentative Semantic Knowledge Graphs. We study this potential in the context +of a type of American Competitive Debate, called Policy Debate, which already +has a large scale dataset targeting it called DebateSum. We significantly +improve upon DebateSum by introducing 53180 new examples, as well as further +useful metadata for every example, to the dataset. We leverage the txtai +semantic search and knowledge graph toolchain to produce and contribute 9 +semantic knowledge graphs built on this dataset. We create a unique method for +evaluating which knowledge graphs are better in the context of producing policy +debate cases. A demo which automatically generates debate cases, along with all +other code and the Knowledge Graphs, are open-sourced and made available to the +public here: https://huggingface.co/spaces/Hellisotherpeople/DebateKG + +
+
+ comment: 8 pages, Accepted to The 4th New Frontiers in Summarization Workshop + (EMNLP 2023), System Demonstration paper +
+
+
+
+
+ + ♻ ☆ Cross-Modal Retrieval: A Systematic Review of Methods and Future + Directions + + +
+ With the exponential surge in diverse multi-modal data, traditional uni-modal +retrieval methods struggle to meet the needs of users demanding access to data +from various modalities. To address this, cross-modal retrieval has emerged, +enabling interaction across modalities, facilitating semantic matching, and +leveraging complementarity and consistency between different modal data. +Although prior literature undertook a review of the cross-modal retrieval +field, it exhibits numerous deficiencies pertaining to timeliness, taxonomy, +and comprehensiveness. This paper conducts a comprehensive review of +cross-modal retrieval's evolution, spanning from shallow statistical analysis +techniques to vision-language pre-training models. Commencing with a +comprehensive taxonomy grounded in machine learning paradigms, mechanisms, and +models, the paper then delves deeply into the principles and architectures +underpinning existing cross-modal retrieval methods. Furthermore, it offers an +overview of widely used benchmarks, metrics, and performances. Lastly, the +paper probes the prospects and challenges that confront contemporary +cross-modal retrieval, while engaging in a discourse on potential directions +for further progress in the field. To facilitate the research on cross-modal +retrieval, we develop an open-source code repository at +https://github.com/BMC-SDNU/Cross-Modal-Retrieval. + +
+
+
+
+
+ + ♻ ☆ Pattern reconstruction with restricted Boltzmann machines + + +
+ Restricted Boltzmann machines are energy models made of a visible and a +hidden layer. We identify an effective energy function describing the +zero-temperature landscape on the visible units and depending only on the tail +behaviour of the hidden layer prior distribution. Studying the location of the +local minima of such an energy function, we show that the ability of a +restricted Boltzmann machine to reconstruct a random pattern depends indeed +only on the tail of the hidden prior distribution. We find that hidden priors +with strictly super-Gaussian tails give only a logarithmic loss in pattern +retrieval, while an efficient retrieval is much harder with hidden units with +strictly sub-Gaussian tails; if the hidden prior has Gaussian tails, the +retrieval capability is determined by the number of hidden units (as in the +Hopfield model). + +
+
+
+
+
+
+
+
+ + Machine Learning 150 + +
+
+
+ + ☆ FP8-LM: Training FP8 Large Language Models + + +
+ In this paper, we explore FP8 low-bit data formats for efficient training of +large language models (LLMs). Our key insight is that most variables, such as +gradients and optimizer states, in LLM training can employ low-precision data +formats without compromising model accuracy and requiring no changes to +hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision +framework for training LLMs. This framework offers three levels of FP8 +utilization to streamline mixed-precision and distributed parallel training for +LLMs. It gradually incorporates 8-bit gradients, optimizer states, and +distributed learning in an incremental manner. Experiment results show that, +during the training of GPT-175B model on H100 GPU platform, our FP8 +mixed-precision training framework not only achieved a remarkable 42% reduction +in real memory usage but also ran 64% faster than the widely adopted BF16 +framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer +Engine by 17%. This largely reduces the training costs for large foundation +models. Furthermore, our FP8 mixed-precision training methodology is generic. +It can be seamlessly applied to other tasks such as LLM instruction tuning and +reinforcement learning with human feedback, offering savings in fine-tuning +expenses. Our FP8 low-precision training framework is open-sourced at +https://github.com/Azure/MS-AMP (aka.ms/MS.AMP). + +
+
+
+
+
+ + ☆ Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models + + +
+ Generalist robot manipulators need to learn a wide variety of manipulation +skills across diverse environments. Current robot training pipelines rely on +humans to provide kinesthetic demonstrations or to program simulation +environments and to code up reward functions for reinforcement learning. Such +human involvement is an important bottleneck towards scaling up robot learning +across diverse tasks and environments. We propose Generation to Simulation +(Gen2Sim), a method for scaling up robot skill learning in simulation by +automating generation of 3D assets, task descriptions, task decompositions and +reward functions using large pre-trained generative models of language and +vision. We generate 3D assets for simulation by lifting open-world 2D +object-centric images to 3D using image diffusion models and querying LLMs to +determine plausible physics parameters. Given URDF files of generated and +human-developed assets, we chain-of-thought prompt LLMs to map these to +relevant task descriptions, temporal decompositions, and corresponding python +reward functions for reinforcement learning. We show Gen2Sim succeeds in +learning policies for diverse long horizon tasks, where reinforcement learning +with non temporally decomposed reward functions fails. Gen2Sim provides a +viable path for scaling up reinforcement learning for robot manipulators in +simulation, both by diversifying and expanding task and environment +development, and by facilitating the discovery of reinforcement-learned +behaviors through temporal task decomposition in RL. Our work contributes +hundreds of simulated assets, tasks and demonstrations, taking a step towards +fully autonomous robotic manipulation skill acquisition in simulation. + +
+
+
+
+
+ + ☆ Supervised and Penalized Baseline Correction + + +
+ Spectroscopic measurements can show distorted spectral shapes arising from a +mixture of absorbing and scattering contributions. These distortions (or +baselines) often manifest themselves as non-constant offsets or low-frequency +oscillations. As a result, these baselines can adversely affect analytical and +quantitative results. Baseline correction is an umbrella term for applying +pre-processing methods to obtain baseline spectra (the unwanted distortions) +and then removing the distortions by differencing. However, current +state-of-the-art baseline correction methods do not utilize analyte concentrations even if +they are available, or even if they contribute significantly to the observed +spectral variability. We examine a class of state-of-the-art methods (penalized +baseline correction) and modify them so that they can accommodate a priori +analyte concentrations and thereby enhance prediction. Performance will be +assessed on two near-infrared data sets across both classical penalized baseline +correction methods (without analyte information) and modified penalized +baseline correction methods (leveraging analyte information). + +
+
+ comment: 27 pages; 8 figures with a total of 18 subfigures; 2 tables +
+
+
+
+
+ + ☆ A Stability Principle for Learning under Non-Stationarity + + +
+ We develop a versatile framework for statistical learning in non-stationary +environments. In each time period, our approach applies a stability principle +to select a look-back window that maximizes the utilization of historical data +while keeping the cumulative bias within an acceptable range relative to the +stochastic error. Our theory showcases the adaptability of this approach to +unknown non-stationarity. The regret bound is minimax optimal up to logarithmic +factors when the population losses are strongly convex, or Lipschitz only. At +the heart of our analysis lie two novel components: a measure of similarity +between functions and a segmentation technique for dividing the non-stationary +data sequence into quasi-stationary pieces. + +
+
+ comment: 47 pages, 1 figure +
+
+
+
+
+ + ☆ Addressing GAN Training Instabilities via Tunable Classification Losses + + +
+ Generative adversarial networks (GANs), modeled as a zero-sum game between a +generator (G) and a discriminator (D), allow generating synthetic data with +formal guarantees. Noting that D is a classifier, we begin by reformulating the +GAN value function using class probability estimation (CPE) losses. We prove a +two-way correspondence between CPE loss GANs and $f$-GANs which minimize +$f$-divergences. We also show that all symmetric $f$-divergences are equivalent +in convergence. In the finite sample and model capacity setting, we define and +obtain bounds on estimation and generalization errors. We specialize these +results to $\alpha$-GANs, defined using $\alpha$-loss, a tunable CPE loss +family parametrized by $\alpha\in(0,\infty]$. We next introduce a class of +dual-objective GANs to address training instabilities of GANs by modeling each +player's objective using $\alpha$-loss to obtain $(\alpha_D,\alpha_G)$-GANs. We +show that the resulting non-zero sum game simplifies to minimizing an +$f$-divergence under appropriate conditions on $(\alpha_D,\alpha_G)$. +Generalizing this dual-objective formulation using CPE losses, we define and +obtain upper bounds on an appropriately defined estimation error. Finally, we +highlight the value of tuning $(\alpha_D,\alpha_G)$ in alleviating training +instabilities for the synthetic 2D Gaussian mixture ring as well as the large +publicly available Celeb-A and LSUN Classroom image datasets. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2302.14320 +
+
+
+
+
+ + ☆ Sustainable Concrete via Bayesian Optimization NeurIPS 2023 + + +
+ Eight percent of global carbon dioxide emissions can be attributed to the +production of cement, the main component of concrete, which is also the +dominant source of CO2 emissions in the construction of data centers. The +discovery of lower-carbon concrete formulae is therefore of high significance +for sustainability. However, experimenting with new concrete formulae is time +consuming and labor intensive, as one usually has to wait to record the +concrete's 28-day compressive strength, a quantity whose measurement can by its +definition not be accelerated. This provides an opportunity for experimental +design methodology like Bayesian Optimization (BO) to accelerate the search for +strong and sustainable concrete formulae. Herein, we 1) propose modeling steps +that make concrete strength amenable to be predicted accurately by a Gaussian +process model with relatively few measurements, 2) formulate the search for +sustainable concrete as a multi-objective optimization problem, and 3) leverage +the proposed model to carry out multi-objective BO with real-world strength +measurements of the algorithmically proposed mixes. Our experimental results +show improved trade-offs between the mixtures' global warming potential (GWP) +and their associated compressive strengths, compared to mixes based on current +industry practices. + +
+
+ comment: NeurIPS 2023 Workshop on Adaptive Experimental Design and Active + Learning in the Real World +
+
+
+
+
+ + ☆ Optimal Transport for Treatment Effect Estimation NeurIPS 2023 + + +
+ Estimating conditional average treatment effect from observational data is +highly challenging due to the existence of treatment selection bias. Prevalent +methods mitigate this issue by aligning distributions of different treatment +groups in the latent space. However, there are two critical problems that these +methods fail to address: (1) mini-batch sampling effects (MSE), which causes +misalignment in non-ideal mini-batches with outcome imbalance and outliers; (2) +unobserved confounder effects (UCE), which results in inaccurate discrepancy +calculation due to the neglect of unobserved confounders. To tackle these +problems, we propose a principled approach named Entire Space CounterFactual +Regression (ESCFR), which is a new take on optimal transport in the context of +causality. Specifically, based on the framework of stochastic optimal +transport, we propose a relaxed mass-preserving regularizer to address the MSE +issue and design a proximal factual outcome regularizer to handle the UCE +issue. Extensive experiments demonstrate that our proposed ESCFR can +successfully tackle the treatment selection bias and achieve significantly +better performance than state-of-the-art methods. + +
+
+ comment: Accepted as NeurIPS 2023 Poster +
+
+
+
+
+ + ☆ Heterogeneous Federated Learning with Group-Aware Prompt Tuning + + +
+ Transformers have achieved remarkable success in various machine-learning +tasks, prompting their widespread adoption. In this paper, we explore their +application in the context of federated learning (FL), with a particular focus +on heterogeneous scenarios where individual clients possess diverse local +datasets. To meet the computational and communication demands of FL, we +leverage pre-trained Transformers and use an efficient prompt-tuning strategy. +Our strategy introduces the concept of learning both shared and group prompts, +enabling the acquisition of universal knowledge and group-specific knowledge +simultaneously. Additionally, a prompt selection module assigns personalized +group prompts to each input, aligning the global model with the data +distribution of each client. This approach allows us to train a single global +model that can automatically adapt to various local client data distributions +without requiring local fine-tuning. In this way, our proposed method +effectively bridges the gap between global and personalized local models in +Federated Learning and surpasses alternative approaches that lack the +capability to adapt to previously unseen clients. The effectiveness of our +approach is rigorously validated through extensive experimentation and ablation +studies. + +
+
+
+
+
+ + ☆ LipSim: A Provably Robust Perceptual Similarity Metric + + +
+ Recent years have seen growing interest in developing and applying perceptual +similarity metrics. Research has shown the superiority of perceptual metrics +over pixel-wise metrics in aligning with human perception and serving as a +proxy for the human visual system. On the other hand, as perceptual metrics +rely on neural networks, there is a growing concern regarding their resilience, +given the established vulnerability of neural networks to adversarial attacks. +It is indeed logical to infer that perceptual metrics may inherit both the +strengths and shortcomings of neural networks. In this work, we demonstrate the +vulnerability of state-of-the-art perceptual similarity metrics based on an +ensemble of ViT-based feature extractors to adversarial attacks. We then +propose a framework to train a robust perceptual similarity metric called +LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging +1-Lipschitz neural networks as the backbone, LipSim provides guarded areas +around each data point and certificates for all perturbations within an +$\ell_2$ ball. Finally, a comprehensive set of experiments shows the +performance of LipSim in terms of natural and certified scores and on the image +retrieval application. The code is available at +https://github.com/SaraGhazanfari/LipSim. + +
+
+
+
+
+ + ☆ PlantPlotGAN: A Physics-Informed Generative Adversarial Network for + Plant Disease Prediction WACV + + +
+ Monitoring plantations is crucial for crop management and producing healthy +harvests. Unmanned Aerial Vehicles (UAVs) have been used to collect +multispectral images that aid in this monitoring. However, given the number of +hectares to be monitored and the limitations of flight, plant disease signals +become visually clear only in the later stages of plant growth and only if the +disease has spread throughout a significant portion of the plantation. This +limited amount of relevant data hampers the prediction models, as the +algorithms struggle to generalize patterns with unbalanced or unrealistic +augmented datasets effectively. To address this issue, we propose PlantPlotGAN, +a physics-informed generative model capable of creating synthetic multispectral +plot images with realistic vegetation indices. These indices served as a proxy +for disease detection and were used to evaluate if our model could help +increase the accuracy of prediction models. The results demonstrate that the +synthetic imagery generated from PlantPlotGAN outperforms state-of-the-art +methods regarding the Fréchet inception distance. Moreover, prediction models +achieve higher accuracy metrics when trained with synthetic and original +imagery for earlier plant disease detection compared to the training processes +based solely on real imagery. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV), 2024 +
+
+
+
+
+ + ☆ Structured Semidefinite Programming for Recovering Structured + Preconditioners + + +
+ We develop a general framework for finding approximately-optimal +preconditioners for solving linear systems. Leveraging this framework we obtain +improved runtimes for fundamental preconditioning and linear system solving +problems including the following. We give an algorithm which, given positive +definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with +$\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal +diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot +\mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the +optimal condition number of the rescaled matrix. We give an algorithm which, +given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse +of a graph Laplacian matrix or a constant spectral approximation of one, solves +linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. Our diagonal +preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ +attained by general-purpose semidefinite programming, and our solvers improve +state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the +current matrix multiplication constant. We attain our results via new +algorithms for a class of semidefinite programs (SDPs) we call +matrix-dictionary approximation SDPs, which we leverage to solve an associated +problem we call matrix-dictionary recovery. + +
+
+ comment: Merge of arXiv:1812.06295 and arXiv:2008.01722 +
+
+
+
+
+ + ☆ Learning to Search Feasible and Infeasible Regions of Routing Problems + with Flexible Neural k-Opt NeurIPS 2023 + + +
+ In this paper, we present Neural k-Opt (NeuOpt), a novel learning-to-search +(L2S) solver for routing problems. It learns to perform flexible k-opt +exchanges based on a tailored action factorization method and a customized +recurrent dual-stream decoder. As a pioneering work to circumvent the pure +feasibility masking scheme and enable the autonomous exploration of both +feasible and infeasible regions, we then propose the Guided Infeasible Region +Exploration (GIRE) scheme, which supplements the NeuOpt policy network with +feasibility-related features and leverages reward shaping to steer +reinforcement learning more effectively. Additionally, we equip NeuOpt with +Dynamic Data Augmentation (D2A) for more diverse searches during inference. +Extensive experiments on the Traveling Salesman Problem (TSP) and Capacitated +Vehicle Routing Problem (CVRP) demonstrate that our NeuOpt not only +significantly outstrips existing (masking-based) L2S solvers, but also +showcases superiority over the learning-to-construct (L2C) and +learning-to-predict (L2P) solvers. Notably, we offer fresh perspectives on how +neural solvers can handle VRP constraints. Our code is available: +https://github.com/yining043/NeuOpt. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ Guided Data Augmentation for Offline Reinforcement Learning and + Imitation Learning + + +
+ Learning from demonstration (LfD) is a popular technique that uses expert +demonstrations to learn robot control policies. However, the difficulty in +acquiring expert-quality demonstrations limits the applicability of LfD +methods: real-world data collection is often costly, and the quality of the +demonstrations depends greatly on the demonstrator's abilities and safety +concerns. A number of works have leveraged data augmentation (DA) to +inexpensively generate additional demonstration data, but most DA works +generate augmented data in a random fashion and ultimately produce highly +suboptimal data. In this work, we propose Guided Data Augmentation (GuDA), a +human-guided DA framework that generates expert-quality augmented data. The key +insight of GuDA is that while it may be difficult to demonstrate the sequence +of actions required to produce expert data, a user can often easily identify +when an augmented trajectory segment represents task progress. Thus, the user +can impose a series of simple rules on the DA process to automatically generate +augmented samples that approximate expert behavior. To extract a policy from +GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning +algorithms. We evaluate GuDA on a physical robot soccer task as well as +simulated D4RL navigation tasks, a simulated autonomous driving task, and a +simulated soccer task. Empirically, we find that GuDA enables learning from a +small set of potentially suboptimal demonstrations and substantially +outperforms a DA strategy that samples augmented data randomly. + +
+
+
+
+
+ + ☆ $α$-Mutual Information: A Tunable Privacy Measure for Privacy + Protection in Data Sharing ICML + + +
+ This paper adopts Arimoto's $\alpha$-Mutual Information as a tunable privacy +measure, in a privacy-preserving data release setting that aims to prevent +disclosing private data to adversaries. By fine-tuning the privacy metric, we +demonstrate that our approach yields superior models that effectively thwart +attackers across various performance dimensions. We formulate a general +distortion-based mechanism that manipulates the original data to offer privacy +protection. The distortion metrics are determined according to the data +structure of a specific experiment. We confront the problem expressed in the +formulation by employing a general adversarial deep learning framework that +consists of a releaser and an adversary, trained with opposite goals. This +study conducts empirical experiments on images and time-series data to verify +the functionality of $\alpha$-Mutual Information. We evaluate the +privacy-utility trade-off of customized models and compare them to mutual +information as the baseline measure. Finally, we analyze the consequence of an +attacker's access to side information about private data and witness that +adapting the privacy measure results in a more refined model than the +state-of-the-art in terms of resiliency against side information. + +
+
+ comment: 2023 22nd IEEE International Conference on Machine Learning and + Applications (ICMLA) +
+
+
+
+
+ + ☆ How Re-sampling Helps for Long-Tail Learning? NeurIPS 2023 + + +
+ Long-tail learning has received significant attention in recent years due to +the challenge it poses with extremely imbalanced datasets. In these datasets, +only a few classes (known as the head classes) have an adequate number of +training samples, while the rest of the classes (known as the tail classes) are +infrequent in the training data. Re-sampling is a classical and widely used +approach for addressing class imbalance issues. Unfortunately, recent studies +claim that re-sampling brings negligible performance improvements in modern +long-tail learning tasks. This paper aims to investigate this phenomenon +systematically. Our research shows that re-sampling can considerably improve +generalization when the training images do not contain semantically irrelevant +contexts. In other scenarios, however, it can learn unexpected spurious +correlations between irrelevant contexts and target labels. We design +experiments on two homogeneous datasets, one containing irrelevant context and +the other not, to confirm our findings. To prevent the learning of spurious +correlations, we propose a new context shift augmentation module that generates +diverse training images for the tail class by maintaining a context bank +extracted from the head-class images. Experiments demonstrate that our proposed +module can boost the generalization and outperform other approaches, including +class-balanced re-sampling, decoupled classifier re-training, and data +augmentation methods. The source code is available at +https://www.lamda.nju.edu.cn/code_CSA.ashx. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Davidsonian Scene Graph: Improving Reliability in Fine-grained + Evaluation for Text-Image Generation + + +
+ Evaluating text-to-image models is notoriously difficult. A strong recent +approach for assessing text-image faithfulness is based on QG/A (question +generation and answering), which uses pre-trained foundational models to +automatically generate a set of questions and answers from the prompt, and +output images are scored based on whether these answers extracted with a visual +question answering model are consistent with the prompt-based answers. This +kind of evaluation is naturally dependent on the quality of the underlying QG +and QA models. We identify and address several reliability challenges in +existing QG/A work: (a) QG questions should respect the prompt (avoiding +hallucinations, duplications, and omissions) and (b) VQA answers should be +consistent (not asserting that there is no motorcycle in an image while also +claiming the motorcycle is blue). We address these issues with Davidsonian +Scene Graph (DSG), an empirically grounded evaluation framework inspired by +formal semantics. DSG is an automatic, graph-based QG/A that is modularly +implemented to be adaptable to any QG/A module. DSG produces atomic and unique +questions organized in dependency graphs, which (i) ensure appropriate semantic +coverage and (ii) sidestep inconsistent answers. With extensive experimentation +and human evaluation on a range of model configurations (LLM, VQA, and T2I), we +empirically demonstrate that DSG addresses the challenges noted above. Finally, +we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 +prompts, covering a wide range of fine-grained semantic categories with a +balanced distribution. We will release the DSG-1k prompts and the corresponding +DSG questions. + +
+
+ comment: Project website: https://google.github.io/DSG +
+
+
+
+
+ + ☆ Deep Transformed Gaussian Processes + + +
+ Transformed Gaussian Processes (TGPs) are stochastic processes specified by +transforming samples from the joint distribution of a prior process +(typically a GP) using an invertible transformation, which increases the flexibility +of the base process. + Furthermore, they achieve competitive results compared with Deep Gaussian +Processes (DGPs), which are another generalization constructed by a +hierarchical concatenation of GPs. In this work, we propose a generalization of +TGPs named Deep Transformed Gaussian Processes (DTGPs), which follows the trend +of concatenating layers of stochastic processes. More precisely, we obtain a +multi-layer model in which each layer is a TGP. This generalization implies an +increase in flexibility with respect to both TGPs and DGPs. Exact inference in +such a model is intractable. However, we show that one can use variational +inference to approximate the required computations, yielding a straightforward +extension of the popular DSVI inference algorithm (Salimbeni et al., 2017). The +experiments conducted evaluate the proposed DTGPs on multiple regression +datasets, achieving good scalability and performance. + +
+
+
+
+
+ + ☆ TBDLNet: a network for classifying multidrug-resistant and + drug-sensitive tuberculosis + + +
+ This paper proposes applying a novel deep-learning model, TBDLNet, to +recognize CT images to classify multidrug-resistant and drug-sensitive +tuberculosis automatically. The pre-trained ResNet50 is selected to extract +features. Three randomized neural networks (RNNs) are used to alleviate the +overfitting problem. The ensemble of three RNNs is applied to boost the +robustness via majority voting. The proposed model is evaluated by five-fold +cross-validation. Five metrics are reported in this paper: accuracy, +sensitivity, precision, F1-score, and specificity. The TBDLNet achieves 0.9822 +accuracy, 0.9815 specificity, 0.9823 precision, 0.9829 sensitivity, and 0.9826 +F1-score. The TBDLNet is suitable for classifying +multidrug-resistant tuberculosis and drug-sensitive tuberculosis. It can detect +multidrug-resistant pulmonary tuberculosis as early as possible, which helps to +adjust the treatment plan in time and improve the treatment effect. + +
+
+
+
+
+ + ☆ One Model Fits All: Cross-Region Taxi-Demand Forecasting SP + + +
+ The growing demand for ride-hailing services has led to an increasing need +for accurate taxi demand prediction. Existing systems are limited to specific +regions, lacking generalizability to unseen areas. This paper presents a novel +taxi demand forecasting system that leverages a graph neural network to capture +spatial dependencies and patterns in urban environments. Additionally, the +proposed system employs a region-neutral approach, enabling it to train a model +that can be applied to any region, including unseen regions. To achieve this, +the framework incorporates the power of Variational Autoencoder to disentangle +the input features into region-specific and region-neutral components. The +region-neutral features facilitate cross-region taxi demand predictions, +allowing the model to generalize well across different urban areas. +Experimental results demonstrate the effectiveness of the proposed system in +accurately forecasting taxi demand, even in previously unobserved regions, thus +showcasing its potential for optimizing taxi services and improving +transportation efficiency on a broader scale. + +
+
+ comment: Accepted to The 31st ACM International Conference on Advances in + Geographic Information Systems(SIGSPATIAL '23) as a short paper in the + Research, Systems and Industrial Experience Papers track +
+
+
+
+
+ + ☆ Robustness of Algorithms for Causal Structure Learning to Hyperparameter + Choice + + +
+ Hyperparameters play a critical role in machine learning. Hyperparameter +tuning can make the difference between state-of-the-art and poor prediction +performance for any algorithm, but it is particularly challenging for structure +learning due to its unsupervised nature. As a result, hyperparameter tuning is +often neglected in favour of using the default values provided by a particular +implementation of an algorithm. While there have been numerous studies on +performance evaluation of causal discovery algorithms, how hyperparameters +affect individual algorithms, as well as the choice of the best algorithm for a +specific problem, has not been studied in depth before. This work addresses +this gap by investigating the influence of hyperparameters on causal structure +learning tasks. Specifically, we perform an empirical evaluation of +hyperparameter selection for some seminal learning algorithms on datasets of +varying levels of complexity. We find that, while the choice of algorithm +remains crucial to obtaining state-of-the-art performance, hyperparameter +selection in ensemble settings strongly influences the choice of algorithm, in +that a poor choice of hyperparameters can lead to analysts using algorithms +which do not give state-of-the-art performance for their data. + +
+
+ comment: 26 pages, 16 figures +
+
+
+
+
+ + ☆ Alignment and Outer Shell Isotropy for Hyperbolic Graph Contrastive + Learning + + +
+ Learning good self-supervised graph representations that are beneficial to +downstream tasks is challenging. Among a variety of methods, contrastive +learning enjoys competitive performance. The embeddings of contrastive learning +are arranged on a hypersphere that enables the Cosine distance measurement in +the Euclidean space. However, the underlying structure of many domains such as +graphs exhibits highly non-Euclidean latent geometry. To this end, we propose a +novel contrastive learning framework to learn high-quality graph embeddings. +Specifically, we design an alignment metric that effectively captures the +hierarchical data-invariant information, and we propose a substitute for the +uniformity metric to prevent the so-called dimensional collapse. We show that +in the hyperbolic space one has to address the leaf- and height-level +uniformity which are related to properties of trees, whereas in the ambient +space of the hyperbolic manifold, these notions translate into imposing an +isotropic ring density towards the boundary of the Poincaré ball. This ring density +can be easily imposed by promoting an isotropic feature distribution on the +tangent space of the manifold. In the experiments, we demonstrate the efficacy of +our proposed method across different hyperbolic graph embedding techniques in +both supervised and self-supervised learning settings. + +
+
+
+
+
+ + ☆ ArcheType: A Novel Framework for Open-Source Column Type Annotation + using Large Language Models + + +
+ Existing deep-learning approaches to semantic column type annotation (CTA) +have important shortcomings: they rely on semantic types which are fixed at +training time; require a large number of training samples per type and incur +large run-time inference costs; and their performance can degrade when +evaluated on novel datasets, even when types remain constant. Large language +models have exhibited strong zero-shot classification performance on a wide +range of tasks and in this paper we explore their use for CTA. We introduce +ArcheType, a simple, practical method for context sampling, prompt +serialization, model querying, and label remapping, which enables large +language models to solve column type annotation problems in a fully zero-shot +manner. We ablate each component of our method separately, and establish that +improvements to context sampling and label remapping provide the most +consistent gains. ArcheType establishes new state-of-the-art performance on +both zero-shot and fine-tuned CTA, including three new domain-specific +benchmarks, which we release, along with the code to reproduce our results at +https://github.com/penfever/ArcheType. + +
+
+ comment: 17 pages, 8 figures +
+
+
+
+
+ + ☆ Is Scaling Learned Optimizers Worth It? Evaluating The Value of VeLO's + 4000 TPU Months + + +
+ We analyze VeLO (versatile learned optimizer), the largest scale attempt to +train a general purpose "foundational" optimizer to date. VeLO was trained on +thousands of machine learning tasks using over 4000 TPU months with the goal of +producing an optimizer capable of generalizing to new problems while being +hyperparameter free, and outperforming industry standards such as Adam. We +independently evaluate VeLO on the MLCommons optimizer benchmark suite. We find +that, contrary to initial claims: (1) VeLO has a critical hyperparameter that +needs problem-specific tuning, (2) VeLO does not necessarily outperform +competitors in quality of solution found, and (3) VeLO is not faster than +competing optimizers at reducing the training loss. These observations call +into question VeLO's generality and the value of the investment in training it. + +
+
+
+
+
+ + ☆ Model-free Posterior Sampling via Learning Rate Randomization NeurIPS-2023 + + +
+ In this paper, we introduce Randomized Q-learning (RandQL), a novel +randomized model-free algorithm for regret minimization in episodic Markov +Decision Processes (MDPs). To the best of our knowledge, RandQL is the first +tractable model-free posterior sampling-based algorithm. We analyze the +performance of RandQL in both tabular and non-tabular metric space settings. In +tabular MDPs, RandQL achieves a regret bound of order +$\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, +$S$ is the number of states, $A$ is the number of actions, and $T$ is the +number of episodes. For a metric state-action space, RandQL enjoys a regret +bound of order $\widetilde{\mathcal{O}}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where +$d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic +exploration without using bonuses, relying instead on a novel idea of learning +rate randomization. Our empirical study shows that RandQL outperforms existing +approaches on baseline exploration environments. + +
+
+ comment: NeurIPS-2023 +
+
+
+
+
+ + ☆ Personas as a Way to Model Truthfulness in Language Models + + +
+ Large Language Models are trained on vast amounts of text from the internet, +which contains both factual and misleading information about the world. Can +language models discern truth from falsehood in this contradictory data? +Expanding on the view that LLMs can model different agents producing the +corpora, we hypothesize that they can cluster truthful text by modeling a +truthful persona: a group of agents that are likely to produce truthful text +and share similar features. For example, trustworthy sources like Wikipedia and +Science usually use formal writing styles and make consistent claims. By +modeling this persona, LLMs can generalize truthfulness beyond the specific +contexts in which each agent generated the training text. For example, the +model can infer that the agent "Wikipedia" will behave truthfully on topics +that were only generated by "Science" because they share a persona. We first +show evidence for the persona hypothesis via two observations: (1) we can probe +whether a model's answer will be truthful before it is generated; (2) +finetuning a model on a set of facts improves its truthfulness on unseen +topics. Next, using arithmetic as a synthetic environment, we show that +language models can separate true and false statements, and generalize +truthfulness across agents, but only if agents in the training data share a +truthful generative process that enables the creation of a truthful persona. +Overall, our findings suggest that models can exploit hierarchical structures +in the data to learn abstract concepts like truthfulness. + +
+
+
+
+
+ + ☆ Enhancing Enterprise Network Security: Comparing Machine-Level and + Process-Level Analysis for Dynamic Malware Detection + + +
+ Analysing malware is important to understand how malicious software works and +to develop appropriate detection and prevention methods. Dynamic analysis can +overcome evasion techniques commonly used to bypass static analysis and provide +insights into malware runtime activities. Much research on dynamic analysis +focused on investigating machine-level information (e.g., CPU, memory, network +usage) to identify whether a machine is running malicious activities. A +malicious machine does not necessarily mean all running processes on the +machine are also malicious. If we can isolate the malicious process instead of +isolating the whole machine, we could kill the malicious process, and the +machine can keep doing its job. Another challenge dynamic malware detection +research faces is that the samples are executed in one machine without any +background applications running. It is unrealistic as a computer typically runs +many benign (background) applications when a malware incident happens. Our +experiment with machine-level data shows that the existence of background +applications decreases previous state-of-the-art accuracy by about 20.12% on +average. We also proposed a process-level Recurrent Neural Network (RNN)-based +detection model. Our proposed model performs better than the machine-level +detection model; 0.049 increase in detection rate and a false-positive rate +below 0.1. + +
+
+ comment: Dataset link: https://github.com/bazz-066/cerberus-trace +
+
+
+
+
+ + ☆ Proportional Fairness in Clustering: A Social Choice Perspective + + +
+ We study the proportional clustering problem of Chen et al. [ICML'19] and +relate it to the area of multiwinner voting in computational social choice. We +show that any clustering satisfying a weak proportionality notion of Brill and +Peters [EC'23] simultaneously obtains the best known approximations to the +proportional fairness notion of Chen et al. [ICML'19], but also to individual +fairness [Jung et al., FORC'20] and the "core" [Li et al. ICML'21]. In fact, we +show that any approximation to proportional fairness is also an approximation +to individual fairness and vice versa. Finally, we also study stronger notions +of proportional representation, in which deviations do not only happen to +single, but multiple candidate centers, and show that stronger proportionality +notions of Brill and Peters [EC'23] imply approximations to these stronger +guarantees. + +
+
+
+
+
+ + ☆ Disentangled Representation Learning with Large Language Models for + Text-Attributed Graphs + + +
+ Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs +such as citation networks, e-commerce networks and social networks has +attracted considerable attention in the web community. Recently, large language +models (LLMs) have demonstrated exceptional capabilities across a wide range of +tasks. However, the existing works focus on harnessing the potential of LLMs +solely relying on prompts to convey graph structure information to LLMs, thus +suffering from insufficient understanding of the complex structural +relationships within TAGs. To address this problem, in this paper we present +the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the +reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model +incorporates graph structure information through tailored disentangled graph +neural network (GNN) layers, enabling LLMs to capture the intricate +relationships hidden in text-attributed graphs from multiple structural +factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing +computational costs and allowing much more flexibility in combining with +different LLM models. Experimental evaluations demonstrate the effectiveness of +the proposed DGTL model on achieving superior or comparable performance over +state-of-the-art baselines. Additionally, we also demonstrate that our DGTL +model can offer natural language explanations for predictions, thereby +significantly enhancing model interpretability. + +
+
+
+
+
+ + ☆ Improving Intrinsic Exploration by Creating Stationary Objectives ICLR 2024 + + +
+ Exploration bonuses in reinforcement learning guide long-horizon exploration +by defining custom intrinsic objectives. Count-based methods use the frequency +of state visits to derive an exploration bonus. In this paper, we identify that +any intrinsic reward function derived from count-based methods is +non-stationary and hence induces a difficult objective to optimize for the +agent. The key contribution of our work lies in transforming the original +non-stationary rewards into stationary rewards through an augmented state +representation. For this purpose, we introduce the Stationary Objectives For +Exploration (SOFE) framework. SOFE requires identifying sufficient statistics +for different exploration bonuses and finding an efficient encoding of these +statistics to use as input to a deep network. SOFE is based on proposing state +augmentations that expand the state space but hold the promise of simplifying +the optimization of the agent's objective. Our experiments show that SOFE +improves the agents' performance in challenging exploration problems, including +sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally +generated environments. + +
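The non-stationarity described in the abstract above can be illustrated with a small sketch. It is not the authors' SOFE implementation: the bonus form `scale / sqrt(N(s) + 1)`, the helper names, and the choice of the full visit-count table as the sufficient statistic are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def count_bonus(counts: Counter, state: Hashable, scale: float = 1.0) -> float:
    """Count-based bonus (assumed form): scale / sqrt(N(s) + 1)."""
    return scale / math.sqrt(counts[state] + 1)


def augmented_observation(
    state: Hashable, counts: Counter, all_states: Sequence[Hashable]
) -> Tuple[Hashable, Tuple[int, ...]]:
    """Pair the raw state with the visit counts (hypothetical encoding).

    As a function of the raw state alone, the bonus changes over time.
    As a function of this augmented observation it is fixed, because the
    counts that determine it are part of the input.
    """
    return state, tuple(counts[s] for s in all_states)


counts: Counter = Counter()
states = ["a", "b"]
bonus_first_visit = count_bonus(counts, "a")   # no visits recorded yet
counts["a"] += 3                               # the agent visits "a" three times
bonus_later = count_bonus(counts, "a")         # same raw state, smaller bonus
observation = augmented_observation("a", counts, states)
```

In a deep RL setting the count table would need a compact encoding before being fed to a network; the tuple used here only works for small discrete state spaces.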
+
+ comment: Under Review at ICLR 2024 +
+
+
+
+
+ + ☆ Unsupervised Representation Learning for Diverse Deformable Shape + Collections + + +
+ We introduce a novel learning-based method for encoding and manipulating 3D +surface meshes. Our method is specifically designed to create an interpretable +embedding space for deformable shape collections. Unlike previous 3D mesh +autoencoders that require meshes to be in a 1-to-1 correspondence, our approach +is trained on diverse meshes in an unsupervised manner. Central to our method +is a spectral pooling technique that establishes a universal latent space, +breaking free from traditional constraints of mesh connectivity and shape +categories. The entire process consists of two stages. In the first stage, we +employ the functional map paradigm to extract point-to-point (p2p) maps between +a collection of shapes in an unsupervised manner. These p2p maps are then +utilized to construct a common latent space, which ensures straightforward +interpretation and independence from mesh connectivity and shape category. +Through extensive experiments, we demonstrate that our method achieves +excellent reconstructions and produces more realistic and smoother +interpolations than baseline approaches. + +
+
+ comment: Accepted at International Conference on 3D Vision 2024 +
+
+
+
+
+ + ☆ Ask more, know better: Reinforce-Learned Prompt Questions for Decision + Making with Large Language Models + + +
+ Large language models (LLMs) demonstrate their promise in tackling +complicated practical challenges by combining action-based policies with chain +of thought (CoT) reasoning. Having high-quality prompts on hand, however, is +vital to the framework's effectiveness. Currently, these prompts are +handcrafted utilizing extensive human labor, resulting in CoT policies that +frequently fail to generalize. Human intervention is also required in order to +develop grounding functions that ensure low-level controllers appropriately +process CoT reasoning. In this paper, we take the first step towards a fully +integrated end-to-end framework for task-solving in real settings employing +complicated reasoning. To that purpose, we offer a new leader-follower bilevel +framework capable of learning to ask relevant questions (prompts) and +subsequently undertaking reasoning to guide the learning of actions to be +performed in an environment. A good prompt should make introspective revisions +based on historical findings, leading the CoT to consider the anticipated +goals. A prompt-generator policy has its own aim in our system, allowing it to +adapt to the action policy and automatically root the CoT process towards +outputs that lead to decisive, high-performing actions. Meanwhile, the action +policy is learning how to use the CoT outputs to take specific actions. Our +empirical data reveal that our system outperforms leading methods in agent +learning benchmarks such as Overcooked and FourRoom. + +
+
+
+
+
+ + ☆ Sample Complexity Bounds for Score-Matching: Causal Discovery and + Generative Modeling NeurIPS 2023 + + +
+ This paper provides statistical sample complexity bounds for score-matching +and its applications in causal discovery. We demonstrate that accurate +estimation of the score function is achievable by training a standard deep ReLU +neural network using stochastic gradient descent. We establish bounds on the +error rate of recovering causal relationships using the score-matching-based +causal discovery method of Rolland et al. [2022], assuming a sufficiently good +estimation of the score function. Finally, we analyze the upper bound of +score-matching estimation within the score-based generative modeling, which has +been applied for causal discovery but is also of independent interest within +the domain of generative models. + +
+
+ comment: Accepted in NeurIPS 2023 +
+
+
+
+
+ + ☆ A Global Multi-Unit Calibration as a Method for Large Scale IoT + Particulate Matter Monitoring Systems Deployments + + +
+ Scalable and effective calibration is a fundamental requirement for Low Cost +Air Quality Monitoring Systems and will enable accurate and pervasive +monitoring in cities. Because they suffer from environmental interference and +fabrication variance, these devices need complex, sensor-specific calibration +processes to reach sufficient accuracy for deployment +as indicative measurement devices in Air Quality (AQ) monitoring networks. +Concept and sensor drift often force the calibration process to be frequently +repeated. These issues lead to unbearable calibration costs, which prevent +massive deployment when accuracy is a concern. In this work, we propose a +global calibration methodology that requires zero transfer samples, as a technological enabler for +IoT AQ multisensor devices that rely on low-cost Particulate Matter (PM) +sensors. This methodology is based on field-recorded responses from a limited +number of IoT AQ multisensor units and on machine learning concepts, and can be +universally applied to all units of the same type. A multi-season test campaign +showed that, when applied to different sensors, the performance of this methodology +matches that of the state-of-the-art methodology, which requires deriving different +calibration parameters for each unit. If confirmed, these results +show that, when properly derived, a global calibration law can be exploited for +a large number of networked devices with dramatic cost reduction, eventually +allowing massive deployment of accurate IoT AQ monitoring devices. Furthermore, +this calibration model could easily be embedded on board the device or +implemented on the edge, allowing immediate access to accurate readings for +personal exposure monitoring applications as well as reducing long-range data +transfer needs. + +
+
+
+
+
+ + ☆ Transductive conformal inference with adaptive scores + + +
+ Conformal inference is a fundamental and versatile tool that provides +distribution-free guarantees for many machine learning tasks. We consider the +transductive setting, where decisions are made on a test sample of $m$ new +points, giving rise to $m$ conformal $p$-values. While classical results only +concern their marginal distribution, we show that their joint distribution +follows a Pólya urn model, and establish a concentration inequality for their +empirical distribution function. The results hold for arbitrary exchangeable +scores, including adaptive ones that can use the covariates of the +test+calibration samples at training stage for increased accuracy. We +demonstrate the usefulness of these theoretical results through uniform, +in-probability guarantees for two machine learning tasks of current interest: +interval prediction for transductive transfer learning and novelty detection +based on two-class classification. + +
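As a rough illustration of the objects in the abstract above, the sketch below computes the $m$ conformal $p$-values of a test sample against a shared calibration sample, together with their empirical distribution function. It uses the standard definition $p_i = (1 + \#\{j : S_j \ge S_{n+i}\})/(n+1)$ with larger scores treated as more atypical; the function names and the toy scores are assumptions, not taken from the paper.

```python
from typing import List, Sequence


def conformal_p_values(
    calib_scores: Sequence[float], test_scores: Sequence[float]
) -> List[float]:
    """Conformal p-value of each test score against the calibration scores."""
    n = len(calib_scores)
    p_values = []
    for s in test_scores:
        # Number of calibration scores at least as large as the test score.
        rank = sum(1 for c in calib_scores if c >= s)
        p_values.append((1 + rank) / (n + 1))
    return p_values


def empirical_cdf(p_values: Sequence[float], t: float) -> float:
    """Fraction of p-values that are at most t."""
    return sum(1 for p in p_values if p <= t) / len(p_values)


calib = [0.1, 0.2, 0.3, 0.4]   # toy calibration scores (hypothetical)
test = [0.35, 0.05, 0.5]       # toy test scores (hypothetical)
p_vals = conformal_p_values(calib, test)
cdf_at_04 = empirical_cdf(p_vals, 0.4)
```

Because all $m$ $p$-values reuse the same calibration scores, they are dependent; that dependence is what the joint-distribution result in the abstract addresses.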
+
+ comment: 27 pages, 6 Figures +
+
+
+
+
+ + ☆ Adversarial Anomaly Detection using Gaussian Priors and Nonlinear + Anomaly Scores ICDM + + +
+ Anomaly detection in imbalanced datasets is a frequent and crucial problem, +especially in the medical domain where retrieving and labeling irregularities +is often expensive. By combining the generative stability of a +$\beta$-variational autoencoder (VAE) with the discriminative strengths of +generative adversarial networks (GANs), we propose a novel model, +$\beta$-VAEGAN. We investigate methods for composing anomaly scores based on +the discriminative and reconstructive capabilities of our model. Existing work +focuses on linear combinations of these components to determine if data is +anomalous. We advance existing work by training a kernelized support vector +machine (SVM) on the respective error components to also consider nonlinear +relationships. This improves anomaly detection performance, while allowing +faster optimization. Lastly, we use the deviations from the Gaussian prior of +$\beta$-VAEGAN to form a novel anomaly score component. In comparison to +state-of-the-art work, we improve the $F_1$ score during anomaly detection from +0.85 to 0.92 on the widely used MITBIH Arrhythmia Database. + +
+
+ comment: accepted at AI4TS @ ICDMW 2023 +
+
+
+
+
+ + ☆ Unveiling the Potential of Probabilistic Embeddings in Self-Supervised + Learning AISTATS 2024 + + +
+ In recent years, self-supervised learning has played a pivotal role in +advancing machine learning by allowing models to acquire meaningful +representations from unlabeled data. An intriguing research avenue involves +developing self-supervised models within an information-theoretic framework, +but many studies often deviate from the stochasticity assumptions made when +deriving their objectives. To gain deeper insights into this issue, we propose +to explicitly model the representation with stochastic embeddings and assess +their effects on performance, information compression and potential for +out-of-distribution detection. From an information-theoretic perspective, we +seek to investigate the impact of probabilistic modeling on the information +bottleneck, shedding light on a trade-off between compression and preservation +of information in both representation and loss space. Emphasizing the +importance of distinguishing between these two spaces, we demonstrate how +constraining one can affect the other, potentially leading to performance +degradation. Moreover, our findings suggest that introducing an additional +bottleneck in the loss space can significantly enhance the ability to detect +out-of-distribution examples, only leveraging either representation features or +the variance of their underlying distribution. + +
+
+ comment: Under review by AISTATS 2024 +
+
+
+
+
+ + ☆ Lipschitz and Hölder Continuity in Reproducing Kernel Hilbert Spaces + + +
+ Reproducing kernel Hilbert spaces (RKHSs) are very important function spaces, +playing an important role in machine learning, statistics, numerical analysis +and pure mathematics. Since Lipschitz and Hölder continuity are important +regularity properties, with many applications in interpolation, approximation +and optimization problems, in this work we investigate these continuity notions +in RKHSs. We provide several sufficient conditions as well as an in-depth +investigation of reproducing kernels inducing prescribed Lipschitz or Hölder +continuity. Apart from new results, we also collect related known results from +the literature, making the present work also a convenient reference on this +topic. + +
+
+ comment: Preprint, under review +
+
+
+
+
+ + ☆ On kernel-based statistical learning in the mean field limit NeurIPS 2023 + + +
+ In many applications of machine learning, a large number of variables are +considered. Motivated by machine learning of interacting particle systems, we +consider the situation when the number of input variables goes to infinity. +First, we continue the recent investigation of the mean field limit of kernels +and their reproducing kernel Hilbert spaces, completing the existing theory. +Next, we provide results relevant for approximation with such kernels in the +mean field limit, including a representer theorem. Finally, we use these +kernels in the context of statistical learning in the mean field limit, +focusing on Support Vector Machines. In particular, we show mean field +convergence of empirical and infinite-sample solutions as well as the +convergence of the corresponding risks. On the one hand, our results establish +rigorous mean field limits in the context of kernel methods, providing new +theoretical tools and insights for large-scale problems. On the other hand, our +setting corresponds to a new form of limit of learning problems, which seems to +have not been investigated yet in the statistical learning theory literature. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ "Honey, Tell Me What's Wrong", Global Explanation of Textual + Discriminative Models through Cooperative Generation + + +
+ The ubiquity of complex machine learning has raised the importance of +model-agnostic explanation algorithms. These methods create artificial +instances by slightly perturbing real instances, capturing shifts in model +decisions. However, such methods rely on initial data and only explain +the decisions made for those instances. To tackle these problems, we propose +Therapy, the first global and model-agnostic explanation method adapted to text +which requires no input dataset. Therapy generates texts following the +distribution learned by a classifier through cooperative generation. Because it +does not rely on initial samples, it can generate explanations even when +data is absent (e.g., for confidentiality reasons). Moreover, unlike +existing methods that combine multiple local explanations into a global one, +Therapy offers a global overview of the model behavior on the input space. Our +experiments show that, although it uses no input data to generate samples, Therapy +provides insightful information about features used by the classifier that is +competitive with that from methods relying on input samples, and outperforms +them when input samples are not specific to the studied model. + +
+
+ comment: 8 pages plus references and 2 pages of appendices. 7 figures and 2 + tables +
+
+
+
+
+ + ☆ DP-SGD with weight clipping + + +
+ Recently, due to the popularity of deep neural networks and other methods +whose training typically relies on the optimization of an objective function, +and due to concerns for data privacy, there is a lot of interest in +differentially private gradient descent methods. To achieve differential +privacy guarantees with a minimum amount of noise, it is important to be able +to bound precisely the sensitivity of the information which the participants +will observe. In this study, we present a novel approach that mitigates the +bias arising from traditional gradient clipping. By leveraging public +information concerning the current global model and its location within the +search domain, we can achieve improved gradient bounds, leading to enhanced +sensitivity determinations and refined noise level adjustments. We extend the +state of the art algorithms, present improved differential privacy guarantees +requiring less noise and present an empirical evaluation. + +
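For context, the traditional gradient-clipping step that the abstract above contrasts itself with can be sketched as follows. This is the baseline DP-SGD update (clip each per-example gradient to an L2 bound, sum, add Gaussian noise scaled to that bound, average), not the weight-clipping method the paper proposes. The function names and parameters are hypothetical, and no privacy accounting is performed.

```python
import math
import random
from typing import List, Sequence


def clip_by_l2_norm(grad: Sequence[float], clip_norm: float) -> List[float]:
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(
    weights: Sequence[float],
    per_example_grads: Sequence[Sequence[float]],
    clip_norm: float,
    noise_multiplier: float,
    lr: float,
    rng: random.Random,
) -> List[float]:
    """One baseline DP-SGD update with per-example gradient clipping."""
    dim = len(weights)
    summed = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_by_l2_norm(grad, clip_norm)
        for i in range(dim):
            summed[i] += clipped[i]
    # The clipping bound is the sensitivity of the sum to one example,
    # so the noise standard deviation is proportional to it.
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    noisy_mean = [(summed[i] + rng.gauss(0.0, sigma)) / batch for i in range(dim)]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


clipped_example = clip_by_l2_norm([3.0, 4.0], clip_norm=1.0)
noiseless_update = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=[[3.0, 4.0], [0.3, 0.4]],
    clip_norm=1.0,
    noise_multiplier=0.0,  # noise disabled only to make the example deterministic
    lr=1.0,
    rng=random.Random(0),
)
```

The first example gradient has norm 5 and is rescaled, while the second has norm 0.5 and passes through unchanged. That uneven rescaling is the source of the clipping bias mentioned in the abstract.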
+
+
+
+
+ + ☆ Closing the Gap Between the Upper Bound and the Lower Bound of Adam's + Iteration Complexity NeurIPS 2023 + + +
+ Recently, Arjevani et al. [1] established a lower bound of iteration +complexity for the first-order optimization under an $L$-smooth condition and a +bounded noise variance assumption. However, a thorough review of existing +literature on Adam's convergence reveals a noticeable gap: none of them meet +the above lower bound. In this paper, we close the gap by deriving a new +convergence guarantee of Adam, with only an $L$-smooth condition and a bounded +noise variance assumption. Our results remain valid across a broad spectrum of +hyperparameters. Especially with properly chosen hyperparameters, we derive an +upper bound of the iteration complexity of Adam and show that it meets the +lower bound for first-order optimizers. To the best of our knowledge, this is +the first to establish such a tight upper bound for Adam's convergence. Our +proof utilizes novel techniques to handle the entanglement between momentum and +adaptive learning rate and to convert the first-order term in the Descent Lemma +to the gradient norm, which may be of independent interest. + +
+
+ comment: NeurIPS 2023 Accept +
+
+
+
+
+ + ☆ CEFL: Carbon-Efficient Federated Learning + + +
+ Federated Learning (FL) distributes machine learning (ML) training across +many edge devices to reduce data transfer overhead and protect data privacy. +Since FL model training may span millions of devices and is thus +resource-intensive, prior work has focused on improving its resource efficiency +to optimize time-to-accuracy. However, prior work generally treats all +resources the same, while, in practice, they may incur widely different costs, +which instead motivates optimizing cost-to-accuracy. To address the problem, we +design CEFL, which uses adaptive cost-aware client selection policies to +optimize an arbitrary cost metric when training FL models. Our policies extend +and combine prior work on utility-based client selection and critical learning +periods by making them cost-aware. We demonstrate CEFL by designing +carbon-efficient FL, where energy's carbon-intensity is the cost, and show that +it i) reduces carbon emissions by 93% and reduces training time by 50% +compared to random client selection and ii) reduces carbon emissions by 80%, +while only increasing training time by 38%, compared to a state-of-the-art +approach that optimizes training time. + +
+
+
+
+
+ + ☆ Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online + Reinforcement Learning NeurIPS 2023 + + +
+ Offline-to-online reinforcement learning (RL) is a training paradigm that +combines pre-training on a pre-collected dataset with fine-tuning in an online +environment. However, the incorporation of online fine-tuning can intensify the +well-known distributional shift problem. Existing solutions tackle this problem +by imposing a policy constraint on the policy improvement objective in both +offline and online learning. They typically advocate a single balance between +policy improvement and constraints across diverse data collections. This +one-size-fits-all manner may not optimally leverage each collected sample due +to the significant variation in data quality across different states. To this +end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective +framework that empowers existing algorithms to determine state-adaptive +improvement-constraint balances. FamO2O utilizes a universal model to train a +family of policies with different improvement/constraint intensities, and a +balance model to select a suitable policy for each state. Theoretically, we +prove that state-adaptive balances are necessary for achieving a higher policy +performance upper bound. Empirically, extensive experiments show that FamO2O +offers a statistically significant improvement over various existing methods, +achieving state-of-the-art performance on the D4RL benchmark. Codes are +available at https://github.com/LeapLabTHU/FamO2O. + +
+
+ comment: NeurIPS 2023 spotlight. 24 pages, 13 figures +
+
+
+
+
+ + ☆ A Comprehensive and Reliable Feature Attribution Method: Double-sided + Remove and Reconstruct (DoRaR) + + +
+ The limited transparency of the inner decision-making mechanism in deep +neural networks (DNN) and other machine learning (ML) models has hindered their +application in several domains. In order to tackle this issue, feature +attribution methods have been developed to identify the crucial features that +heavily influence decisions made by these black box models. However, many +feature attribution methods have inherent downsides. For example, one category +of feature attribution methods suffers from the artifacts problem, which feeds +out-of-distribution masked inputs directly through the classifier that was +originally trained on natural data points. Another category of feature +attribution methods finds explanations by using jointly trained feature +selectors and predictors. While avoiding the artifacts problem, this new +category suffers from the Encoding Prediction in the Explanation (EPITE) +problem, in which the predictor's decisions rely not on the features, but on +the masks that select those features. As a result, the credibility of +attribution results is undermined by these downsides. In this research, we +introduce the Double-sided Remove and Reconstruct (DoRaR) feature attribution +method, based on several improvement methods that address these issues. By +conducting thorough testing on MNIST, CIFAR10 and our own synthetic dataset, we +demonstrate that the DoRaR feature attribution method can effectively bypass +the above issues and can aid in training a feature selector that outperforms +other state-of-the-art feature attribution methods. Our code is available at +https://github.com/dxq21/DoRaR. + +
+
+ comment: 16 pages, 22 figures +
+
+
+
+
+ + ☆ Trustworthy Edge Machine Learning: A Survey + + +
+ The convergence of Edge Computing (EC) and Machine Learning (ML), known as +Edge Machine Learning (EML), has become a highly regarded research area by +utilizing distributed network resources to perform joint training and inference +in a cooperative manner. However, EML faces various challenges due to resource +constraints, heterogeneous network environments, and diverse service +requirements of different applications, which together affect the +trustworthiness of EML in the eyes of its stakeholders. This survey provides a +comprehensive summary of definitions, attributes, frameworks, techniques, and +solutions for trustworthy EML. Specifically, we first emphasize the importance +of trustworthy EML within the context of Sixth-Generation (6G) networks. We +then discuss the necessity of trustworthiness from the perspective of +challenges encountered during deployment and real-world application scenarios. +Subsequently, we provide a preliminary definition of trustworthy EML and +explore its key attributes. Following this, we introduce fundamental frameworks +and enabling technologies for trustworthy EML systems, and provide an in-depth +literature review of the latest solutions to enhance trustworthiness of EML. +Finally, we discuss corresponding research challenges and open issues. + +
+
+ comment: 27 pages, 7 figures, 10 tables +
+
+
+
+
+ + ☆ Transformers as Graph-to-Graph Models EMNLP 2023 + + +
+ We argue that Transformers are essentially graph-to-graph models, with +sequences just being a special case. Attention weights are functionally +equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes +this ability explicit, by inputting graph edges into the attention weight +computations and predicting graph edges with attention-like functions, thereby +integrating explicit graphs into the latent graphs learned by pretrained +Transformers. Adding iterative graph refinement provides a joint embedding of +input, output, and latent graphs, allowing non-autoregressive graph prediction +to optimise the complete graph without any bespoke pipeline or decoding +strategy. Empirical results show that this architecture achieves +state-of-the-art accuracies for modelling a variety of linguistic structures, +integrating very effectively with the latent linguistic representations learned +by pretraining. + +
+
+ comment: Accepted to Big Picture workshop at EMNLP 2023 +
+
+
+
+
+ + ☆ Lifting the Veil: Unlocking the Power of Depth in Q-learning + + +
+ With the help of massive data and rich computational resources, deep +Q-learning has been widely used in operations research and management science +and has contributed to great success in numerous applications, including +recommender systems, supply chains, games, and robotic manipulation. However, +the success of deep Q-learning lacks solid theoretical verification and +interpretability. The aim of this paper is to theoretically verify the power of +depth in deep Q-learning. Within the framework of statistical learning theory, +we rigorously prove that deep Q-learning outperforms its traditional version by +demonstrating its good generalization error bound. Our results reveal that the +main reason for the success of deep Q-learning is the excellent performance of +deep neural networks (deep nets) in capturing the special properties of rewards, +namely spatial sparseness and piecewise constancy, rather than their large +capacities. In this paper, we make fundamental contributions to the field of +reinforcement learning by answering the following three questions: Why does +deep Q-learning perform so well? When does deep Q-learning perform better than +traditional Q-learning? How many samples are required to achieve a specific +prediction accuracy for deep Q-learning? Our theoretical assertions are +verified by applying deep Q-learning in the well-known beer game in supply +chain management and a simulated recommender system. + +
+
+
+
+
+ + ☆ Improving the Knowledge Gradient Algorithm + + +
+ The knowledge gradient (KG) algorithm is a popular policy for the best arm +identification (BAI) problem. It is built on the simple idea of always choosing +the measurement that yields the greatest expected one-step improvement in the +estimate of the best mean of the arms. In this research, we show that this +policy has limitations that prevent the algorithm from being asymptotically optimal. We +next provide a remedy that keeps the one-step look-ahead of +KG, but instead chooses the measurement that yields the greatest one-step +improvement in the probability of selecting the best arm. The new policy is +called improved knowledge gradient (iKG). iKG can be shown to be asymptotically +optimal. In addition, we show that compared to KG, it is easier to extend iKG +to variant problems of BAI, with the $\epsilon$-good arm identification and +feasible arm identification as two examples. The superior performance of iKG +on these problems is further demonstrated using numerical examples. + +
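The one-step look-ahead that the abstract attributes to KG can be made concrete with the classical closed form for independent Gaussian beliefs and Gaussian measurement noise of known variance. The sketch below covers only that baseline KG score, not the iKG policy proposed in the paper. The Gaussian model, the function names, and the toy inputs are assumptions, and the code expects at least two arms with strictly positive prior variances.

```python
import math
from typing import List, Sequence


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def knowledge_gradient(
    means: Sequence[float], variances: Sequence[float], noise_var: float
) -> List[float]:
    """Expected one-step gain in the best posterior mean, per arm.

    Assumes independent normal beliefs N(means[x], variances[x]) and
    measurement noise with known variance noise_var.
    """
    scores = []
    for x, (mu, var) in enumerate(zip(means, variances)):
        best_other = max(m for i, m in enumerate(means) if i != x)
        # Standard deviation of the change in the posterior mean of arm x
        # caused by a single measurement of that arm.
        sigma_tilde = var / math.sqrt(var + noise_var)
        z = -abs(mu - best_other) / sigma_tilde
        scores.append(sigma_tilde * (z * normal_cdf(z) + normal_pdf(z)))
    return scores


kg_equal = knowledge_gradient([0.0, 0.0], [1.0, 1.0], noise_var=1.0)
kg_uneven = knowledge_gradient([0.0, 0.0], [1.0, 4.0], noise_var=1.0)
next_arm = max(range(len(kg_uneven)), key=lambda i: kg_uneven[i])
```

With equal means, the arm with the larger prior variance receives the larger score, so the policy measures it next.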
+
+ comment: 32 pages, 42 figures +
+
+
+
+
+ + ☆ Submodel Partitioning in Hierarchical Federated Learning: Algorithm + Design and Convergence Analysis + + +
+ Hierarchical federated learning (HFL) has demonstrated promising scalability +advantages over the traditional "star-topology" architecture-based federated +learning (FL). However, HFL still imposes significant computation, +communication, and storage burdens on the edge, especially when training a +large-scale model over resource-constrained Internet of Things (IoT) devices. +In this paper, we propose hierarchical independent submodel training (HIST), a +new FL methodology that aims to address these issues in hierarchical settings. +The key idea behind HIST is a hierarchical version of model partitioning, where +we partition the global model into disjoint submodels in each round, and +distribute them across different cells, so that each cell is responsible for +training only one partition of the full model. This enables each client to save +computation/storage costs while alleviating the communication loads throughout +the hierarchy. We characterize the convergence behavior of HIST for non-convex +loss functions under mild assumptions, showing the impact of several attributes +(e.g., number of cells, local and global aggregation frequency) on the +performance-efficiency tradeoff. Finally, through numerical experiments, we +verify that HIST is able to save communication costs by a wide margin while +achieving the same target testing accuracy. + +
+
+ comment: 14 pages, 4 figures +
+
+
+
+
+ + ☆ Impressions: Understanding Visual Semiotics and Aesthetic Impact EMNLP 2023 + + +
+ Is aesthetic impact different from beauty? Is visual salience a reflection of +its capacity for effective communication? We present Impressions, a novel +dataset through which to investigate the semiotics of images, and how specific +visual features and design choices can elicit specific emotions, thoughts and +beliefs. We posit that the impactfulness of an image extends beyond formal +definitions of aesthetics, to its success as a communicative act, where style +contributes as much to meaning formation as the subject matter. However, prior +image captioning datasets are not designed to empower state-of-the-art +architectures to model potential human impressions or interpretations of +images. To fill this gap, we design an annotation task heavily inspired by +image analysis techniques in the Visual Arts to collect 1,440 image-caption +pairs and 4,320 unique annotations exploring impact, pragmatic image +description, impressions, and aesthetic design choices. We show that existing +multimodal image captioning and conditional generation models struggle to +simulate plausible human responses to images. However, this dataset +significantly improves their ability to model impressions and aesthetic +evaluations of images through fine-tuning and few-shot adaptation. + +
+
+ comment: To be published in EMNLP 2023 +
+
+
+
+
+ + ☆ Machine Learning Infused Distributed Optimization for Coordinating + Virtual Power Plant Assets + + +
+ Amid the increasing interest in the deployment of Distributed Energy +Resources (DERs), the Virtual Power Plant (VPP) has emerged as a pivotal tool +for aggregating diverse DERs and facilitating their participation in wholesale +energy markets. These VPP deployments have been fueled by the Federal Energy +Regulatory Commission's Order 2222, which makes DERs and VPPs competitive +across market segments. However, the diversity and decentralized nature of DERs +present significant challenges to the scalable coordination of VPP assets. To +address efficiency and speed bottlenecks, this paper presents a novel machine +learning-assisted distributed optimization to coordinate VPP assets. Our +method, named LOOP-MAC (Learning to Optimize the Optimization Process for +Multi-agent Coordination), adopts a multi-agent coordination perspective where +each VPP agent manages multiple DERs and utilizes neural network approximators +to expedite the solution search. The LOOP-MAC method employs a gauge map to +guarantee strict compliance with local constraints, effectively reducing the +need for additional post-processing steps. Our results highlight the advantages +of LOOP-MAC, showcasing accelerated solution times per iteration and +significantly reduced convergence times. The LOOP-MAC method outperforms +conventional centralized and distributed optimization methods in optimization +tasks that require repetitive and sequential execution. + +
+
+
+
+
+ + ☆ A Sublinear-Time Spectral Clustering Oracle with Improved Preprocessing + Time NeurIPS'23 + + +
+ We address the problem of designing a sublinear-time spectral clustering +oracle for graphs that exhibit strong clusterability. Such graphs contain $k$ +latent clusters, each characterized by a large inner conductance (at least +$\varphi$) and a small outer conductance (at most $\varepsilon$). Our aim is to +preprocess the graph to enable clustering membership queries, with the key +requirement that both preprocessing and query answering should be performed in +sublinear time, and the resulting partition should be consistent with a +$k$-partition that is close to the ground-truth clustering. Previous oracles +have relied on either a $\textrm{poly}(k)\log n$ gap between inner and outer +conductances or exponential (in $k/\varepsilon$) preprocessing time. Our +algorithm relaxes these assumptions, albeit at the cost of a slightly higher +misclassification ratio. We also show that our clustering oracle is robust +against a few random edge deletions. To validate our theoretical bounds, we +conducted experiments on synthetic networks. + +
+
+ comment: To appear at NeurIPS'23 +
+
+
+
+
+ + ☆ ASPIRO: Any-shot Structured Parsing-error-Induced ReprOmpting for + Consistent Data-to-Text Generation EMNLP2023 + + +
+ We present ASPIRO, an approach for structured data verbalisation into short +template sentences in zero to few-shot settings. Unlike previous methods, our +approach prompts large language models (LLMs) to directly produce +entity-agnostic templates, rather than relying on LLMs to faithfully copy the +given example entities, or validating/crafting the templates manually. We +incorporate LLM re-prompting, triggered by algorithmic parsing checks, as well +as the PARENT metric induced consistency validation to identify and rectify +template generation problems in real-time. ASPIRO, compared to direct LLM +output, averages 66\% parsing error rate reduction in generated verbalisations +of RDF triples on the DART dataset. Our best 5-shot text-davinci-003 setup, +scoring BLEU of 50.62, METEOR of 45.16, BLEURT of 0.82, NUBIA of 0.87, and +PARENT of 0.8962 on the Rel2Text dataset, competes effectively with recent +fine-tuned pre-trained language models. + +
+
+ comment: Accepted to Findings of EMNLP2023, code available at + https://github.com/vejvarm/ASPIRO +
+
+
+
+
+ + ☆ Ranking with Slot Constraints + + +
+ We introduce the problem of ranking with slot constraints, which can be used +to model a wide range of application problems -- from college admission with +limited slots for different majors, to composing a stratified cohort of +eligible participants in a medical trial. We show that the conventional +Probability Ranking Principle (PRP) can be highly sub-optimal for +slot-constrained ranking problems, and we devise a new ranking algorithm, +called MatchRank. The goal of MatchRank is to produce rankings that maximize +the number of filled slots if candidates are evaluated by a human decision +maker in the order of the ranking. In this way, MatchRank generalizes the PRP, +and it subsumes the PRP as a special case when there are no slot constraints. +Our theoretical analysis shows that MatchRank has a strong approximation +guarantee without any independence assumptions between slots or candidates. +Furthermore, we show how MatchRank can be implemented efficiently. Beyond the +theoretical guarantees, empirical evaluations show that MatchRank can provide +substantial improvements over a range of synthetic and real-world tasks. + +
+
+
+
+
+ + ☆ Reproducibility in Multiple Instance Learning: A Case For Algorithmic + Unit Tests NeurIPS 2023 + + +
+ Multiple Instance Learning (MIL) is a sub-domain of classification problems +with positive and negative labels and a "bag" of inputs, where the label is +positive if and only if a positive element is contained within the bag, and +otherwise is negative. Training in this context requires associating the +bag-wide label to instance-level information, and implicitly contains a causal +assumption and asymmetry to the task (i.e., you can't swap the labels without +changing the semantics). MIL problems occur in healthcare (one malignant cell +indicates cancer), cyber security (one malicious executable makes an infected +computer), and many other tasks. In this work, we examine five of the most +prominent deep-MIL models and find that none of them respects the standard MIL +assumption. They are able to learn anti-correlated instances, i.e., defaulting +to "positive" labels until seeing a negative counter-example, which should not +be possible for a correct MIL model. We suspect that enhancements and other +works derived from these models will share the same issue. In any context in +which these models are being used, this creates the potential for learning +incorrect models, which creates risk of operational failure. We identify and +demonstrate this problem via a proposed "algorithmic unit test", where we +create synthetic datasets that can be solved by a MIL respecting model, and +which clearly reveal learning that violates MIL assumptions. The five evaluated +methods each fail one or more of these tests. This provides a model-agnostic +way to identify violations of modeling assumptions, which we hope will be +useful for future development and evaluation of MIL models. + +
+
+ comment: To appear in the 37th Conference on Neural Information Processing + Systems (NeurIPS 2023) +
+
+
+
+
+ + ☆ Function Space Bayesian Pseudocoreset for Bayesian Neural Networks + + +
+ A Bayesian pseudocoreset is a compact synthetic dataset summarizing essential +information of a large-scale dataset and thus can be used as a proxy dataset +for scalable Bayesian inference. Typically, a Bayesian pseudocoreset is +constructed by minimizing a divergence measure between the posterior +conditioning on the pseudocoreset and the posterior conditioning on the full +dataset. However, evaluating the divergence can be challenging, particularly +for models such as deep neural networks with high-dimensional parameters. In +this paper, we propose a novel Bayesian pseudocoreset construction method that +operates on a function space. Unlike previous methods, which construct and +match the coreset and full data posteriors in the space of model parameters +(weights), our method constructs variational approximations to the coreset +posterior on a function space and matches it to the full data posterior in the +function space. By working directly on the function space, our method can +bypass several challenges that may arise when working on a weight space, +including limited scalability and multi-modality issues. Through various +experiments, we demonstrate that the Bayesian pseudocoresets constructed with +our method enjoy enhanced uncertainty quantification and better robustness +across various model architectures. + +
+
+
+
+
+ + ☆ Boosting Data Analytics With Synthetic Volume Expansion + + +
+ Synthetic data generation, a cornerstone of Generative Artificial +Intelligence, signifies a paradigm shift in data science by addressing data +scarcity and privacy while enabling unprecedented performance. As synthetic +data gains prominence, questions arise concerning the accuracy of statistical +methods when applied to synthetic data compared to raw data. In this article, +we introduce the Synthetic Data Generation for Analytics framework. This +framework employs statistical methods on high-fidelity synthetic data generated +by advanced models such as tabular diffusion and Generative Pre-trained +Transformer models. These models, trained on raw data, are further enhanced +with insights from pertinent studies. A significant discovery within this +framework is the generational effect: the error of a statistical method on +synthetic data initially diminishes with added synthetic data but may +eventually increase or plateau. This phenomenon, rooted in the complexities of +replicating raw data distributions, highlights a "reflection point"--an optimal +threshold in the size of synthetic data determined by specific error metrics. +Through three illustrative case studies--sentiment analysis of texts, predictive +modeling of structured data, and inference in tabular data--we demonstrate the +effectiveness of this framework over traditional ones. We underline its +potential to amplify various statistical methods, including gradient boosting +for prediction and hypothesis testing, thereby underscoring the transformative +potential of synthetic data generation in data science. + +
+
+
+
+
+ + ☆ A Data-Centric Online Market for Machine Learning: From Discovery to + Pricing + + +
+ Data fuels machine learning (ML) - rich and high-quality training data is +essential to the success of ML. However, to transform ML from the race among a +few large corporations to an accessible technology that serves numerous normal +users' data analysis requests, there still exist important challenges. One gap +we observed is that many ML users can benefit from new data that other data +owners possess, whereas these data owners sit on piles of data without knowing +who can benefit from it. This gap creates the opportunity for building an +online market that can automatically connect supply with demand. While online +matching markets are prevalent (e.g., ride-hailing systems), designing a +data-centric market for ML exhibits many unprecedented challenges. + This paper develops new techniques to tackle two core challenges in designing +such a market: (a) to efficiently match demand with supply, we design an +algorithm to automatically discover useful data for any ML task from a pool of +thousands of datasets, achieving high-quality matching between ML models and +data; (b) to encourage market participation of ML users without much ML +expertise, we design a new pricing mechanism for selling data-augmented ML +models. Furthermore, our market is designed to be API-compatible with existing +online ML markets like Vertex AI and Sagemaker, making it easy to use while +providing better results due to joint data and model search. We envision that +the synergy of our data and model discovery algorithm and pricing mechanism +will be an important step towards building a new data-centric online market +that serves ML users effectively. + +
+
+
+
+
+ + ☆ Positional Encoding-based Resident Identification in Multi-resident + Smart Homes + + +
+ We propose a novel resident identification framework to identify residents in +a multi-occupant smart environment. The proposed framework employs a feature +extraction model based on the concepts of positional encoding. The feature +extraction model considers the locations of homes as a graph. We design a novel +algorithm to build such graphs from layout maps of smart environments. The +Node2Vec algorithm is used to transform the graph into high-dimensional node +embeddings. A Long Short-Term Memory (LSTM) model is introduced to predict the +identities of residents using temporal sequences of sensor events with the node +embeddings. Extensive experiments show that our proposed scheme effectively +identifies residents in a multi-occupant environment. Evaluation results on two +real-world datasets demonstrate that our proposed approach achieves 94.5% and +87.9% accuracy, respectively. + +
+
+ comment: 27 pages, 11 figures, 2 tables +
+
+
+
+
+ + ☆ Hybrid Optical Turbulence Models Using Machine Learning and Local + Measurements + + +
+ Accurate prediction of atmospheric optical turbulence in localized +environments is essential for estimating the performance of free-space optical +systems. Macro-meteorological models developed to predict turbulent effects in +one environment may fail when applied in new environments. However, existing +macro-meteorological models are expected to offer some predictive power. +Building a new model from locally-measured macro-meteorology and scintillometer +readings can require significant time and resources, as well as a large number +of observations. These challenges motivate the development of a +machine-learning informed hybrid model framework. By combining some baseline +macro-meteorological model with local observations, hybrid models were trained +to improve upon the predictive power of each baseline model. Comparisons +between the performance of the hybrid models, the selected baseline +macro-meteorological models, and machine-learning models trained only on local +observations highlight potential use cases for the hybrid model framework when +local data is expensive to collect. Both the hybrid and data-only models were +trained using the Gradient Boosted Decision Tree (GBDT) architecture with a +variable number of in-situ meteorological observations. The hybrid and +data-only models were found to outperform three baseline macro-meteorological +models, even for low numbers of observations, in some cases as little as one +day. For the first baseline macro-meteorological model investigated, the hybrid +model achieves an estimated 29% reduction in mean absolute error (MAE) using +only one days-equivalent of observation, growing to 41% after only two days, +and 68% after 180 days-equivalent training data. The number of days-equivalent +training data required is potentially indicative of the seasonal variation in +the local microclimate and its propagation environment. + +
+
+ comment: 15 pages, 8 figures +
+
+
+
+
+ + ♻ ☆ Framework based on complex networks to model and mine patient pathways + + +
+ The automatic discovery of a model to represent the history of encounters of +a group of patients with the healthcare system -- the so-called "pathway of +patients" -- is a new field of research that supports clinical and +organisational decisions to improve the quality and efficiency of the treatment +provided. The pathways of patients with chronic conditions tend to vary +significantly from one person to another, have repetitive tasks, and demand the +analysis of multiple perspectives (interventions, diagnoses, medical +specialities, among others) influencing the results. Therefore, modelling and +mining those pathways is still a challenging task. In this work, we propose a +framework comprising: (i) a pathway model based on a multi-aspect graph, (ii) a +novel dissimilarity measurement to compare pathways taking the elapsed time +into account, and (iii) a mining method based on traditional centrality +measures to discover the most relevant steps of the pathways. We evaluated the +framework using case studies of pregnancy and diabetes, which revealed its +usefulness in finding clusters of similar pathways, representing them in an +easy-to-interpret way, and highlighting the most significant patterns according +to multiple perspectives. + +
+
+ comment: 35 pages, 11 figures, 2 appendices +
+
+
+
+
+ + ♻ ☆ High-Dimensional Prediction for Sequential Decision Making + + +
+ We study the problem of making predictions of an adversarially chosen +high-dimensional state that are unbiased subject to an arbitrary collection of +conditioning events, with the goal of tailoring these events to downstream +decision makers. We give efficient algorithms for solving this problem, as well +as a number of applications that stem from choosing an appropriate set of +conditioning events. + For example, we can efficiently make predictions targeted at polynomially +many decision makers, giving each of them optimal swap regret if they +best-respond to our predictions. We generalize this to online combinatorial +optimization, where the decision makers have a very large action space, to give +the first algorithms offering polynomially many decision makers no regret on +polynomially many subsequences that may depend on their actions and the +context. We apply these results to get efficient no-subsequence-regret +algorithms in extensive-form games (EFGs), yielding a new family of regret +guarantees for EFGs that generalizes some existing EFG regret notions, e.g. +regret to informed causal deviations, and is generally incomparable to other +known such notions. + Next, we develop a novel transparent alternative to conformal prediction for +building valid online adversarial multiclass prediction sets. We produce class +scores that downstream algorithms can use for producing valid-coverage +prediction sets, as if these scores were the true conditional class +probabilities. We show this implies strong conditional validity guarantees +including set-size-conditional and multigroup-fair coverage for polynomially +many downstream prediction sets. Moreover, our class scores can be guaranteed +to have improved $L_2$ loss, cross-entropy loss, and generally any Bregman +loss, compared to any collection of benchmark models, yielding a +high-dimensional real-valued version of omniprediction. + +
+
+ comment: Added references, Arxiv abstract edited +
+
+
+
+
+ + ♻ ☆ Multi-scale Diffusion Denoised Smoothing NeurIPS 2023 + + +
+ Along with recent diffusion models, randomized smoothing has become one of a +few tangible approaches that offer adversarial robustness to models at scale, +e.g., those of large pre-trained models. Specifically, one can perform +randomized smoothing on any classifier via a simple "denoise-and-classify" +pipeline, so-called denoised smoothing, given that an accurate denoiser is +available - such as a diffusion model. In this paper, we present scalable methods +to address the current trade-off between certified robustness and accuracy in +denoised smoothing. Our key idea is to "selectively" apply smoothing among +multiple noise scales, coined multi-scale smoothing, which can be efficiently +implemented with a single diffusion model. This approach also suggests a new +objective to compare the collective robustness of multi-scale smoothed +classifiers, and questions which representation of the diffusion model would +maximize the objective. To address this, we propose to further fine-tune +the diffusion model (a) to perform consistent denoising whenever the original image +is recoverable, but (b) to generate rather diverse outputs otherwise. Our +experiments show that the proposed multi-scale smoothing scheme combined with +diffusion fine-tuning enables strong certified robustness at high +noise levels while maintaining accuracy close to non-smoothed classifiers. + +
+
+ comment: Published as a conference paper at NeurIPS 2023; Code is available at + https://github.com/jh-jeong/smoothing-multiscale +
+
+
+
+
+ + ♻ ☆ Towards Understanding Sycophancy in Language Models + + +
+ Human feedback is commonly utilized to finetune AI assistants. But human +feedback may also encourage model responses that match user beliefs over +truthful ones, a behaviour known as sycophancy. We investigate the prevalence +of sycophancy in models whose finetuning procedure made use of human feedback, +and the potential role of human preference judgments in such behavior. We first +demonstrate that five state-of-the-art AI assistants consistently exhibit +sycophancy across four varied free-form text-generation tasks. To understand if +human preferences drive this broadly observed behavior, we analyze existing +human preference data. We find that when a response matches a user's views, it +is more likely to be preferred. Moreover, both humans and preference models +(PMs) prefer convincingly-written sycophantic responses over correct ones a +non-negligible fraction of the time. Optimizing model outputs against PMs also +sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results +indicate that sycophancy is a general behavior of state-of-the-art AI +assistants, likely driven in part by human preference judgments favoring +sycophantic responses. + +
+
+ comment: 32 pages, 20 figures +
+
+
+
+
+ + ♻ ☆ Learning to Modulate pre-trained Models in RL + + +
+ Reinforcement Learning (RL) has been successful in various domains like +robotics, game playing, and simulation. While RL agents have shown impressive +capabilities in their specific tasks, they insufficiently adapt to new tasks. +In supervised learning, this adaptation problem is addressed by large-scale +pre-training followed by fine-tuning to new down-stream tasks. Recently, +pre-training on multiple tasks has been gaining traction in RL. However, +fine-tuning a pre-trained model often suffers from catastrophic forgetting. +That is, the performance on the pre-training tasks deteriorates when +fine-tuning on new tasks. To investigate the catastrophic forgetting +phenomenon, we first jointly pre-train a model on datasets from two benchmark +suites, namely Meta-World and DMControl. Then, we evaluate and compare a +variety of fine-tuning methods prevalent in natural language processing, both +in terms of performance on new tasks, and how well performance on pre-training +tasks is retained. Our study shows that with most fine-tuning approaches, the +performance on pre-training tasks deteriorates significantly. Therefore, we +propose a novel method, Learning-to-Modulate (L2M), that avoids the degradation +of learned skills by modulating the information flow of the frozen pre-trained +model via a learnable modulation pool. Our method achieves state-of-the-art +performance on the Continual-World benchmark, while retaining performance on +the pre-training tasks. Finally, to aid future research in this area, we +release a dataset encompassing 50 Meta-World and 16 DMControl tasks. + +
+
+ comment: 10 pages (+ references and appendix), Code: + https://github.com/ml-jku/L2M +
+
+
+
+
+ + ♻ ☆ Explainable Brain Age Prediction using coVariance Neural Networks NeurIPS 2023 + + +
+ In computational neuroscience, there has been an increased interest in +developing machine learning algorithms that leverage brain imaging data to +provide estimates of "brain age" for an individual. Importantly, the +discordance between brain age and chronological age (referred to as "brain age +gap") can capture accelerated aging due to adverse health conditions and +therefore, can reflect increased vulnerability towards neurological disease or +cognitive impairments. However, widespread adoption of brain age for clinical +decision support has been hindered due to lack of transparency and +methodological justifications in most existing brain age prediction algorithms. +In this paper, we leverage coVariance neural networks (VNN) to propose an +explanation-driven and anatomically interpretable framework for brain age +prediction using cortical thickness features. Specifically, our brain age +prediction framework extends beyond the coarse metric of brain age gap in +Alzheimer's disease (AD) and we make two important observations: (i) VNNs can +assign anatomical interpretability to elevated brain age gap in AD by +identifying contributing brain regions, (ii) the interpretability offered by +VNNs is contingent on their ability to exploit specific eigenvectors of the +anatomical covariance matrix. Together, these observations facilitate an +explainable and anatomically interpretable perspective to the task of brain age +prediction. + +
+
+ comment: Camera ready version for NeurIPS 2023. arXiv admin note: substantial + text overlap with arXiv:2305.01807 +
+
+
+
+
+ + ♻ ☆ MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational + Transcript Cleanup EMNLP 2023 + + +
+ Current disfluency detection models focus on individual utterances each from +a single speaker. However, numerous discontinuity phenomena in spoken +conversational transcripts occur across multiple turns, hampering human +readability and the performance of downstream NLP tasks. This study addresses +these phenomena by proposing an innovative Multi-Turn Cleanup task for spoken +conversational transcripts and collecting a new dataset, MultiTurnCleanup. We +design a data labeling schema to collect the high-quality dataset and provide +extensive data analysis. Furthermore, we leverage two modeling approaches for +experimental evaluation as benchmarks for future research. + +
+
+ comment: EMNLP 2023 main conference. Dataset: + https://github.com/huashen218/MultiTurnCleanup +
+
+
+
+
+ + ♻ ☆ Algorithmic Foundations of Empirical X-risk Minimization + + +
+ This manuscript introduces a new optimization framework for machine learning +and AI, named empirical X-risk minimization (EXM). X-risk is a term +introduced to represent a family of compositional measures or objectives, in +which each data point is compared with a large number of items explicitly or +implicitly for defining a risk function. It includes surrogate objectives of +many widely used measures and non-decomposable losses, e.g., AUROC, AUPRC, +partial AUROC, NDCG, MAP, precision/recall at top $K$ positions, precision at a +certain recall level, listwise losses, p-norm push, top push, global +contrastive losses, etc. While these non-decomposable objectives and their +optimization algorithms have been studied in the literature of machine +learning, computer vision, information retrieval, etc., optimizing these +objectives has encountered some unique challenges for deep learning. In this +paper, we present recent rigorous efforts for EXM with a focus on its +algorithmic foundations and its applications. We introduce a class of +algorithmic techniques for solving EXM with smooth non-convex objectives. We +formulate EXM into three special families of non-convex optimization problems +belonging to non-convex compositional optimization, non-convex min-max +optimization and non-convex bilevel optimization, respectively. For each family +of problems, we present some strong baseline algorithms and their complexities, +which will motivate further research for improving the existing results. +Discussions about the presented results and future studies are given at the +end. Efficient algorithms for optimizing a variety of X-risks are implemented +in the LibAUC library at www.libauc.org. + +
+
+
+
+
+ + ♻ ☆ Statistical Learning under Heterogeneous Distribution Shift + + +
+ This paper studies the prediction of a target $\mathbf{z}$ from a pair of +random variables $(\mathbf{x},\mathbf{y})$, where the ground-truth predictor is +additive $\mathbb{E}[\mathbf{z} \mid \mathbf{x},\mathbf{y}] = +f_\star(\mathbf{x}) +g_{\star}(\mathbf{y})$. We study the performance of +empirical risk minimization (ERM) over functions $f+g$, $f \in F$ and $g \in +G$, fit on a given training distribution, but evaluated on a test distribution +which exhibits covariate shift. We show that, when the class $F$ is "simpler" +than $G$ (measured, e.g., in terms of its metric entropy), our predictor is +more resilient to heterogeneous covariate shifts in which the shift in +$\mathbf{x}$ is much greater than that in $\mathbf{y}$. Our analysis proceeds +by demonstrating that ERM behaves qualitatively similarly to orthogonal machine +learning: the rate at which ERM recovers the $f$-component of the predictor has +only a lower-order dependence on the complexity of the class $G$, adjusted for +partial non-identifiability introduced by the additive structure. These +results rely on a novel H\"older style inequality for the Dudley integral which +may be of independent interest. Moreover, we corroborate our theoretical +findings with experiments demonstrating improved resilience to shifts in +"simpler" features across numerous domains. + +
+
+
+
+
+ + ♻ ☆ eP-ALM: Efficient Perceptual Augmentation of Language Models ICCV 2023 + + +
+ Large Language Models (LLMs) have so far impressed the world, with +unprecedented capabilities that emerge in models at large scales. On the vision +side, transformer models (i.e., ViT) are following the same trend, achieving +the best performance on challenging benchmarks. With the abundance of such +unimodal models, a natural question arises: do we also need to follow this +trend to tackle multimodal tasks? In this work, we propose instead to direct +effort toward efficient adaptations of existing models, and propose to augment +Language Models with perception. Existing approaches for adapting pretrained +models for vision-language tasks still rely on several key components that +hinder their efficiency. In particular, they still train a large number of +parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP) +trained on huge image-text datasets, and add significant inference overhead. In +addition, most of these approaches have focused on Zero-Shot and In Context +Learning, with little to no effort on direct finetuning. We investigate the +minimal computational effort needed to adapt unimodal models for multimodal +tasks and propose a new challenging setup, alongside different approaches, that +efficiently adapts unimodal pretrained models. We show that by freezing more +than 99% of total parameters, training only one linear projection layer, and +prepending only one trainable token, our approach (dubbed eP-ALM) significantly +outperforms other baselines on VQA and Captioning across Image, Video, and +Audio modalities, following the proposed setup. The code is available here: +https://github.com/mshukor/eP-ALM. + +
+
+ comment: Accepted at ICCV 2023. Project page: + https://mshukor.github.io/eP-ALM.github.io/ +
+
+
+
+
+ + ♻ ☆ The noise level in linear regression with dependent data + + +
+ We derive upper bounds for random design linear regression with dependent +($\beta$-mixing) data absent any realizability assumptions. In contrast to the +strictly realizable martingale noise regime, no sharp instance-optimal +non-asymptotics are available in the literature. Up to constant factors, our +analysis correctly recovers the variance term predicted by the Central Limit +Theorem -- the noise level of the problem -- and thus exhibits graceful +degradation as we introduce misspecification. Past a burn-in, our result is +sharp in the moderate deviations regime, and in particular does not inflate the +leading order term by mixing time factors. + +
+
+
+
+
+ + ♻ ☆ Enhancing drug and cell line representations via contrastive learning + for improved anti-cancer drug prioritization + + +
+ Due to cancer's complex nature and variable response to therapy, precision +oncology informed by omics sequence analysis has become the current standard of +care. However, the amount of data produced for each patient makes it difficult +to quickly identify the best treatment regimen. Moreover, limited data +availability has hindered computational methods' abilities to learn patterns +associated with effective drug-cell line pairs. In this work, we propose the +use of contrastive learning to improve learned drug and cell line +representations by preserving relationship structures associated with drug +mechanism of action and cell line cancer types. In addition to achieving +enhanced performance relative to a state-of-the-art method, we find that +classifiers using our learned representations exhibit a more balanced reliance +on drug- and cell line-derived features when making predictions. This +facilitates more personalized drug prioritizations that are informed by signals +related to drug resistance. + +
+
+ comment: 60 pages, 4 figures, 4 tables, 11 supplementary tables, 1 + supplementary note, submitted to Nature Communications +
+
+
+
+
+ + ♻ ☆ NeuroBack: Improving CDCL SAT Solving using Graph Neural Networks + + +
+ Propositional satisfiability (SAT) is an NP-complete problem that impacts +many research fields, such as planning, verification, and security. Mainstream +modern SAT solvers are based on the Conflict-Driven Clause Learning (CDCL) +algorithm. Recent work aimed to enhance CDCL SAT solvers using Graph Neural +Networks (GNNs). However, so far this approach either has not made solving more +effective, or required substantial GPU resources for frequent online model +inferences. Aiming to make GNN improvements practical, this paper proposes an +approach called NeuroBack, which builds on two insights: (1) predicting phases +(i.e., values) of variables appearing in the majority (or even all) of the +satisfying assignments is essential for CDCL SAT solving, and (2) it is +sufficient to query the neural model only once for the predictions before the +SAT solving starts. Once trained, the offline model inference allows NeuroBack +to execute exclusively on the CPU, removing its reliance on GPU resources. To +train NeuroBack, a new dataset called DataBack containing 120,286 data samples +is created. Finally, NeuroBack is implemented as an enhancement to a +state-of-the-art SAT solver called Kissat. As a result, it allowed Kissat to +solve 5.2% more problems on the recent SAT competition problem set, +SATCOMP-2022. NeuroBack therefore shows how machine learning can be harnessed +to improve SAT solving in an effective and practical manner. + +
+
+
+
+
+ + ♻ ☆ Modeling Path Importance for Effective Alzheimer's Disease Drug + Repurposing + + +
+ Recently, drug repurposing has emerged as an effective and resource-efficient +paradigm for AD drug discovery. Among various methods for drug repurposing, +network-based methods have shown promising results as they are capable of +leveraging complex networks that integrate multiple interaction types, such as +protein-protein interactions, to more effectively identify candidate drugs. +However, existing approaches typically assume paths of the same length in the +network have equal importance in identifying the therapeutic effect of drugs. +Other domains have found that same length paths do not necessarily have the +same importance. Thus, relying on this assumption may be deleterious to drug +repurposing attempts. In this work, we propose MPI (Modeling Path Importance), +a novel network-based method for AD drug repurposing. MPI is unique in that it +prioritizes important paths via learned node embeddings, which can effectively +capture a network's rich structural information. Thus, leveraging learned +embeddings allows MPI to effectively differentiate the importance among paths. +We evaluate MPI against a commonly used baseline method that identifies anti-AD +drug candidates primarily based on the shortest paths between drugs and AD in +the network. We observe that among the top-50 ranked drugs, MPI prioritizes +20.0% more drugs with anti-AD evidence compared to the baseline. Finally, Cox +proportional-hazard models produced from insurance claims data aid us in +identifying the use of etodolac, nicotine, and BBB-crossing ACE-INHs as having +a reduced risk of AD, suggesting such drugs may be viable candidates for +repurposing and should be explored further in future studies. + +
+
+ comment: 16 pages, 3 figures, 2 tables, 1 supplementary figure, 5 + supplementary tables. Preprint of an article accepted for publication in + Pacific Symposium on Biocomputing © 2023 World Scientific Publishing + Co., Singapore, http://psb.stanford.edu/ +
+
+
+
+
+ + ♻ ☆ Provably Fast Convergence of Independent Natural Policy Gradient for + Markov Potential Games NeurIPS 2023 + + +
+ This work studies an independent natural policy gradient (NPG) algorithm for +the multi-agent reinforcement learning problem in Markov potential games. It is +shown that, under mild technical assumptions and the introduction of the +\textit{suboptimality gap}, the independent NPG method with an oracle providing +exact policy evaluation asymptotically reaches an $\epsilon$-Nash Equilibrium +(NE) within $\mathcal{O}(1/\epsilon)$ iterations. This improves upon the +previous best result of $\mathcal{O}(1/\epsilon^2)$ iterations and is of the +same order, $\mathcal{O}(1/\epsilon)$, that is achievable for the single-agent +case. Empirical results for a synthetic potential game and a congestion game +are presented to verify the theoretical bounds. + +
+
+ comment: Will appear in NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Accountability in Offline Reinforcement Learning: Explaining Decisions + with a Corpus of Examples + + +
+ Learning controllers with offline data in decision-making systems is an +essential area of research due to its potential to reduce the risk of +applications in real-world systems. However, in responsibility-sensitive +settings such as healthcare, decision accountability is of paramount +importance, yet has not been adequately addressed by the literature. This paper +introduces the Accountable Offline Controller (AOC) that employs the offline +dataset as the Decision Corpus and performs accountable control based on a +tailored selection of examples, referred to as the Corpus Subset. AOC operates +effectively in low-data scenarios, can be extended to the strictly offline +imitation setting, and displays qualities of both conservation and +adaptability. We assess AOC's performance in both simulated and real-world +healthcare scenarios, emphasizing its capability to manage offline control +tasks with high levels of performance while maintaining accountability. + +
+
+
+
+
+ + ♻ ☆ Topological Parallax: A Geometric Specification for Deep Perception + Models NeurIPS 2023 + + +
+ For safety and robustness of AI systems, we introduce topological parallax as +a theoretical and computational tool that compares a trained model to a +reference dataset to determine whether they have similar multiscale geometric +structure. Our proofs and examples show that this geometric similarity between +dataset and model is essential to trustworthy interpolation and perturbation, +and we conjecture that this new concept will add value to the current debate +regarding the unclear relationship between overfitting and generalization in +applications of deep-learning. In typical DNN applications, an explicit +geometric description of the model is impossible, but parallax can estimate +topological features (components, cycles, voids, etc.) in the model by +examining the effect on the Rips complex of geodesic distortions using the +reference dataset. Thus, parallax indicates whether the model shares similar +multiscale geometric features with the dataset. Parallax presents theoretically +via topological data analysis [TDA] as a bi-filtered persistence module, and +the key properties of this module are stable under perturbation of the +reference dataset. + +
+
+ comment: 18 pages, 6 figures. Preprint submitted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Partial Counterfactual Identification of Continuous Outcomes with a + Curvature Sensitivity Model + + +
+ Counterfactual inference aims to answer retrospective "what if" questions and +thus belongs to the most fine-grained type of inference in Pearl's causality +ladder. Existing methods for counterfactual inference with continuous outcomes +aim at point identification and thus make strong and unnatural assumptions +about the underlying structural causal model. In this paper, we relax these +assumptions and aim at partial counterfactual identification of continuous +outcomes, i.e., when the counterfactual query resides in an ignorance interval +with informative bounds. We prove that, in general, the ignorance interval of +the counterfactual queries has non-informative bounds, already when functions +of structural causal models are continuously differentiable. As a remedy, we +propose a novel sensitivity model called Curvature Sensitivity Model. This +allows us to obtain informative bounds by bounding the curvature of level sets +of the functions. We further show that existing point counterfactual +identification methods are special cases of our Curvature Sensitivity Model +when the bound of the curvature is set to zero. We then propose an +implementation of our Curvature Sensitivity Model in the form of a novel deep +generative model, which we call Augmented Pseudo-Invertible Decoder. Our +implementation employs (i) residual normalizing flows with (ii) variational +augmentations. We empirically demonstrate the effectiveness of our Augmented +Pseudo-Invertible Decoder. To the best of our knowledge, ours is the first +partial identification model for Markovian structural causal models with +continuous outcomes. + +
+
+
+
+
+ + ♻ ☆ LeanDojo: Theorem Proving with Retrieval-Augmented Language Models NeurIPS 2023 + + +
+ Large language models (LLMs) have shown promise in proving formal theorems +using proof assistants such as Lean. However, existing methods are difficult to +reproduce or build on, due to private code, data, and large compute +requirements. This has created substantial barriers to research on machine +learning methods for theorem proving. This paper removes these barriers by +introducing LeanDojo: an open-source Lean playground consisting of toolkits, +data, models, and benchmarks. LeanDojo extracts data from Lean and enables +interaction with the proof environment programmatically. It contains +fine-grained annotations of premises in proofs, providing valuable data for +premise selection: a key bottleneck in theorem proving. Using this data, we +develop ReProver (Retrieval-Augmented Prover): an LLM-based prover augmented +with retrieval for selecting premises from a vast math library. It is +inexpensive and needs only one GPU week of training. Our retriever leverages +LeanDojo's program analysis capability to identify accessible premises and hard +negative examples, which makes retrieval much more effective. Furthermore, we +construct a new benchmark consisting of 98,734 theorems and proofs extracted +from Lean's math library. It features challenging data split requiring the +prover to generalize to theorems relying on novel premises that are never used +in training. We use this benchmark for training and evaluation, and +experimental results demonstrate the effectiveness of ReProver over +non-retrieval baselines and GPT-4. We thus provide the first set of open-source +LLM-based theorem provers without any proprietary datasets and release it under +a permissive MIT license to facilitate further research. + +
+
+ comment: Accepted to NeurIPS 2023 (Datasets and Benchmarks Track) as an oral + presentation. Data, code, and models available at https://leandojo.org/ +
+
+
+
+
+ + ♻ ☆ Fast and Regret Optimal Best Arm Identification: Fundamental Limits and + Low-Complexity Algorithms + + +
+ This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual +objectives: (i) quick identification and commitment to the optimal arm, and +(ii) reward maximization throughout a sequence of $T$ consecutive rounds. +Though each objective has been individually well-studied, i.e., best arm +identification for (i) and regret minimization for (ii), the simultaneous +realization of both objectives remains an open problem, despite its practical +importance. This paper introduces \emph{Regret Optimal Best Arm Identification} +(ROBAI) which aims to achieve these dual objectives. To solve ROBAI with both +pre-determined stopping time and adaptive stopping time requirements, we +present an algorithm called EOCP and its variants respectively, which not only +achieve asymptotic optimal regret in both Gaussian and general bandits, but +also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with +pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with adaptive +stopping time. We further characterize lower bounds on the commitment time +(equivalent to the sample complexity) of ROBAI, showing that EOCP and its +variants are sample optimal with pre-determined stopping time, and almost +sample optimal with adaptive stopping time. Numerical results confirm our +theoretical analysis and reveal an interesting "over-exploration" phenomenon +carried by classic UCB algorithms, such that EOCP has smaller regret even +though it stops exploration much earlier than UCB, i.e., $\mathcal{O}(\log T)$ +versus $\mathcal{O}(T)$, which suggests over-exploration is unnecessary and +potentially harmful to system performance. + +
+
+
+
+
+ + ♻ ☆ Causal Effect Identification in Uncertain Causal Networks NeurIPS 2023 + + +
+ Causal identification is at the core of the causal inference literature, +where complete algorithms have been proposed to identify causal queries of +interest. The validity of these algorithms hinges on the restrictive assumption +of having access to a correctly specified causal structure. In this work, we +study the setting where a probabilistic model of the causal structure is +available. Specifically, the edges in a causal graph exist with uncertainties +which may, for example, represent degree of belief from domain experts. +Alternatively, the uncertainty about an edge may reflect the confidence of a +particular statistical test. The question that naturally arises in this setting +is: Given such a probabilistic graph and a specific causal effect of interest, +what is the subgraph which has the highest plausibility and for which the +causal effect is identifiable? We show that answering this question reduces to +solving an NP-complete combinatorial optimization problem which we call the +edge ID problem. We propose efficient algorithms to approximate this problem +and evaluate them against both real-world networks and randomly generated +graphs. + +
+
+ comment: 27 pages, 9 figures, NeurIPS 2023 conference, causal identification, + causal discovery, probabilistic models +
+
+
+
+
+ + ♻ ☆ Language models show human-like content effects on reasoning tasks + + +
+ Abstract reasoning is a key ability for an intelligent system. Large language +models (LMs) achieve above-chance performance on abstract reasoning tasks, but +exhibit many imperfections. However, human abstract reasoning is also +imperfect. For example, human reasoning is affected by our real-world knowledge +and beliefs, and shows notable "content effects"; humans reason more reliably +when the semantic content of a problem supports the correct logical inferences. +These content-entangled reasoning patterns play a central role in debates about +the fundamental nature of human intelligence. Here, we investigate whether +language models, whose prior expectations capture some aspects +of human knowledge, similarly mix content into their answers +to logical problems. We explore this question across three logical reasoning +tasks: natural language inference, judging the logical validity of syllogisms, +and the Wason selection task. We evaluate state-of-the-art large language +models, as well as humans, and find that the language models reflect many of +the same patterns observed in humans across these tasks: like +humans, models answer more accurately when the semantic content of a task +supports the logical inferences. These parallels are reflected both in answer +patterns, and in lower-level features like the relationship between model +answer distributions and human response times. Our findings have implications +for understanding both these cognitive effects in humans, and the factors that +contribute to language model performance. + +
+
+
+
+
+ + ♻ ☆ Adaptive Webpage Fingerprinting from TLS Traces + + +
+ In webpage fingerprinting, an on-path adversary infers the specific webpage +loaded by a victim user by analysing the patterns in the encrypted TLS traffic +exchanged between the user's browser and the website's servers. This work +studies modern webpage fingerprinting adversaries against the TLS protocol; +aiming to shed light on their capabilities and inform potential defences. +Despite the importance of this research area (the majority of global Internet +users rely on standard web browsing with TLS) and the potential real-life +impact, most past works have focused on attacks specific to anonymity networks +(e.g., Tor). We introduce a TLS-specific model that: 1) scales to an +unprecedented number of target webpages, 2) can accurately classify thousands +of classes it never encountered during training, and 3) has low operational +costs even in scenarios of frequent page updates. Based on these findings, we +then discuss TLS-specific countermeasures and evaluate the effectiveness of the +existing padding capabilities provided by TLS 1.3. + +
+
+
+
+
+ + ♻ ☆ MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with + Reinforcement Learning + + +
+ Recently, Meta-Black-Box Optimization with Reinforcement Learning +(MetaBBO-RL) has showcased the power of leveraging RL at the meta-level to +mitigate manual fine-tuning of low-level black-box optimizers. However, this +field is hindered by the lack of a unified benchmark. To fill this gap, we +introduce MetaBox, the first benchmark platform expressly tailored for +developing and evaluating MetaBBO-RL methods. MetaBox offers a flexible +algorithmic template that allows users to effortlessly implement their unique +designs within the platform. Moreover, it provides a broad spectrum of over 300 +problem instances, collected from synthetic to realistic scenarios, and an +extensive library of 19 baseline methods, including both traditional black-box +optimizers and recent MetaBBO-RL methods. Besides, MetaBox introduces three +standardized performance metrics, enabling a more thorough assessment of the +methods. In a bid to illustrate the utility of MetaBox for facilitating +rigorous evaluation and in-depth analysis, we carry out a wide-ranging +benchmarking study on existing MetaBBO-RL methods. Our MetaBox is open-source +and accessible at: https://github.com/GMC-DRL/MetaBox. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function + Approximation: Minimax Optimal and Instance-Dependent Regret Bounds NeurIPS 2023 + + +
+ While numerous works have focused on devising efficient algorithms for +reinforcement learning (RL) with uniformly bounded rewards, it remains an open +question whether sample or time-efficient algorithms for RL with large +state-action space exist when the rewards are \emph{heavy-tailed}, i.e., with +only finite $(1+\epsilon)$-th moments for some $\epsilon\in(0,1]$. In this +work, we address the challenge of such rewards in RL with linear function +approximation. We first design an algorithm, \textsc{Heavy-OFUL}, for +heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round +regret of $\tilde{O}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} +\sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$, the +\emph{first} of this kind. Here, $d$ is the feature dimension, and +$\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of the reward at +the $t$-th round. We further show the above bound is minimax optimal when +applied to the worst-case instances in stochastic and deterministic linear +bandits. We then extend this algorithm to the RL settings with linear function +approximation. Our algorithm, termed as \textsc{Heavy-LSVI-UCB}, achieves the +\emph{first} computationally efficient \emph{instance-dependent} $K$-episode +regret of $\tilde{O}(d \sqrt{H \mathcal{U}^*} K^\frac{1}{1+\epsilon} + d +\sqrt{H \mathcal{V}^* K})$. Here, $H$ is length of the episode, and +$\mathcal{U}^*, \mathcal{V}^*$ are instance-dependent quantities scaling with +the central moment of reward and value functions, respectively. We also provide +a matching minimax lower bound $\Omega(d H K^{\frac{1}{1+\epsilon}} + d +\sqrt{H^3 K})$ to demonstrate the optimality of our algorithm in the worst +case. Our result is achieved via a novel robust self-normalized concentration +inequality that may be of independent interest in handling heavy-tailed noise +in general online regression problems. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Quantifying the Cost of Learning in Queueing Systems NeurIPS 2023 + + +
+ Queueing systems are widely applicable stochastic models with use cases in +communication networks, healthcare, service systems, etc. Although their +optimal control has been extensively studied, most existing approaches assume +perfect knowledge of the system parameters. Of course, this assumption rarely +holds in practice where there is parameter uncertainty, thus motivating a +recent line of work on bandit learning for queueing systems. This nascent +stream of research focuses on the asymptotic performance of the proposed +algorithms. + In this paper, we argue that an asymptotic metric, which focuses on +late-stage performance, is insufficient to capture the intrinsic statistical +complexity of learning in queueing systems which typically occurs in the early +stage. Instead, we propose the Cost of Learning in Queueing (CLQ), a new metric +that quantifies the maximum increase in time-averaged queue length caused by +parameter uncertainty. We characterize the CLQ of a single queue multi-server +system, and then extend these results to multi-queue multi-server systems and +networks of queues. In establishing our results, we propose a unified analysis +framework for CLQ that bridges Lyapunov and bandit analysis, provides +guarantees for a wide range of algorithms, and could be of independent +interest. + +
+
+ comment: A condensed version of this work was accepted for presentation at the + Conference on Neural Information Processing Systems (NeurIPS 2023). Compared + to the first version of the paper, the current version expands the comparison + with related work +
+
+
+
+
+ + ♻ ☆ High-performance real-world optical computing trained by in situ + model-free optimization + + +
+ Optical computing systems can provide high-speed and low-energy data +processing but suffer from computationally demanding training and a +simulation-to-reality gap. We propose a model-free solution for lightweight in +situ optimization of optical computing systems based on the score gradient +estimation algorithm. This approach treats the system as a black box and +back-propagates loss directly to the optical weights' probabilistic +distributions, hence circumventing the need for computation-heavy and biased +system simulation. We demonstrate a superior classification accuracy on the +MNIST and FMNIST datasets through experiments on a single-layer diffractive +optical computing system. Furthermore, we show its potential for image-free and +high-speed cell analysis. The inherent simplicity of our proposed method, +combined with its low demand for computational resources, expedites the +transition of optical computing from laboratory demonstrations to real-world +applications. + +
+
+
+
+
+ + ♻ ☆ Reliable Off-Policy Learning for Dosage Combinations NeurIPS 2023 + + +
+ Decision-making in personalized medicine such as cancer therapy or critical +care must often make choices for dosage combinations, i.e., multiple continuous +treatments. Existing work for this task has modeled the effect of multiple +treatments independently, while estimating the joint effect has received little +attention but comes with non-trivial challenges. In this paper, we propose a +novel method for reliable off-policy learning for dosage combinations. Our +method proceeds along three steps: (1) We develop a tailored neural network +that estimates the individualized dose-response function while accounting for +the joint effect of multiple dependent dosages. (2) We estimate the generalized +propensity score using conditional normalizing flows in order to detect regions +with limited overlap in the shared covariate-treatment space. (3) We present a +gradient-based learning algorithm to find the optimal, individualized dosage +combinations. Here, we ensure reliable estimation of the policy value by +avoiding regions with limited overlap. We finally perform an extensive +evaluation of our method to show its effectiveness. To the best of our +knowledge, ours is the first work to provide a method for reliable off-policy +learning for optimal dosage combinations. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Generalized Neural Collapse for a Large Number of Classes + + +
+ Neural collapse provides an elegant mathematical characterization of learned +last layer representations (a.k.a. features) and classifier weights in deep +classification models. Such results not only provide insights but also motivate +new techniques for improving practical deep models. However, most of the +existing empirical and theoretical studies in neural collapse focus on the case +that the number of classes is small relative to the dimension of the feature +space. This paper extends neural collapse to cases where the number of classes +is much larger than the dimension of the feature space, which broadly occur for +language models, retrieval systems, and face recognition applications. We show +that the features and classifier exhibit a generalized neural collapse +phenomenon, where the minimum one-vs-rest margin is maximized. We provide an +empirical study to verify the occurrence of generalized neural collapse in +practical deep neural networks. Moreover, we provide a theoretical study to show +that the generalized neural collapse provably occurs under the unconstrained +feature model with spherical constraint, under certain technical conditions on +feature dimension and number of classes. + +
+
+ comment: 32 pages, 12 figures +
+
+
+
+
+ + ♻ ☆ Deep Contract Design via Discontinuous Networks + + +
+ Contract design involves a principal who establishes contractual agreements +about payments for outcomes that arise from the actions of an agent. In this +paper, we initiate the study of deep learning for the automated design of +optimal contracts. We introduce a novel representation: the Discontinuous ReLU +(DeLU) network, which models the principal's utility as a discontinuous +piecewise affine function of the design of a contract where each piece +corresponds to the agent taking a particular action. DeLU networks implicitly +learn closed-form expressions for the incentive compatibility constraints of +the agent and the utility maximization objective of the principal, and support +parallel inference on each piece through linear programming or interior-point +methods that solve for optimal contracts. We provide empirical results that +demonstrate success in approximating the principal's utility function with a +small number of training samples and scaling to find approximately optimal +contracts on problems with a large number of actions and outcomes. + +
+
+
+
+
+ + ♻ ☆ Implicit Convolutional Kernels for Steerable CNNs NeurIPS 2023 + + +
+ Steerable convolutional neural networks (CNNs) provide a general framework +for building neural networks equivariant to translations and transformations of +an origin-preserving group $G$, such as reflections and rotations. They rely on +standard convolutions with $G$-steerable kernels obtained by analytically +solving the group-specific equivariance constraint imposed onto the kernel +space. As the solution is tailored to a particular group $G$, implementing a +kernel basis does not generalize to other symmetry transformations, +complicating the development of general group equivariant models. We propose +using implicit neural representation via multi-layer perceptrons (MLPs) to +parameterize $G$-steerable kernels. The resulting framework offers a simple and +flexible way to implement Steerable CNNs and generalizes to any group $G$ for +which a $G$-equivariant MLP can be built. We prove the effectiveness of our +method on multiple tasks, including N-body simulations, point cloud +classification and molecular property prediction. + +
+
+ comment: Accepted to 37th Conference on Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Human-Guided Complexity-Controlled Abstractions NeurIPS 2023 + + +
+ Neural networks often learn task-specific latent representations that fail to +generalize to novel settings or tasks. Conversely, humans learn discrete +representations (i.e., concepts or words) at a variety of abstraction levels +(e.g., "bird" vs. "sparrow") and deploy the appropriate abstraction based on +task. Inspired by this, we train neural models to generate a spectrum of +discrete representations, and control the complexity of the representations +(roughly, how many bits are allocated for encoding inputs) by tuning the +entropy of the distribution over representations. In finetuning experiments, +using only a small number of labeled examples for a new task, we show that (1) +tuning the representation to a task-appropriate complexity level supports the +highest finetuning performance, and (2) in a human-participant study, users +were able to identify the appropriate complexity level for a downstream task +using visualizations of discrete representations. Our results indicate a +promising direction for rapid model finetuning by leveraging human insight. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ IMP-MARL: a Suite of Environments for Large-scale Infrastructure + Management Planning via MARL + + +
+ We introduce IMP-MARL, an open-source suite of multi-agent reinforcement +learning (MARL) environments for large-scale Infrastructure Management Planning +(IMP), offering a platform for benchmarking the scalability of cooperative MARL +methods in real-world engineering applications. In IMP, a multi-component +engineering system is subject to a risk of failure due to its components' +damage condition. Specifically, each agent plans inspections and repairs for a +specific system component, aiming to minimise maintenance costs while +cooperating to minimise system failure risk. With IMP-MARL, we release several +environments including one related to offshore wind structural systems, in an +effort to meet today's needs to improve management strategies to support +sustainable and reliable energy systems. Supported by IMP practical engineering +environments featuring up to 100 agents, we conduct a benchmark campaign, where +the scalability and performance of state-of-the-art cooperative MARL methods +are compared against expert-based heuristic policies. The results reveal that +centralised training with decentralised execution methods scale better with the +number of agents than fully centralised or decentralised RL approaches, while +also outperforming expert-based heuristic policies in most IMP environments. +Based on our findings, we additionally outline remaining cooperation and +scalability challenges that future MARL methods should still address. Through +IMP-MARL, we encourage the implementation of new environments and the further +development of MARL methods. + +
+
+
+
+
+ + ♻ ☆ Android in the Wild: A Large-Scale Dataset for Android Device Control + + +
+ There is a growing interest in device-control systems that can interpret +human natural language instructions and execute them on a digital device by +directly controlling its user interface. We present a dataset for +device-control research, Android in the Wild (AITW), which is orders of +magnitude larger than current datasets. The dataset contains human +demonstrations of device interactions, including the screens and actions, and +corresponding natural language instructions. It consists of 715k episodes +spanning 30k unique instructions, four versions of Android (v10-13), and eight +device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It +contains multi-step tasks that require semantic understanding of language and +visual context. This dataset poses a new challenge: actions available through +the user interface must be inferred from their visual appearance. And, instead +of simple UI element-based actions, the action space consists of precise +gestures (e.g., horizontal scrolls to operate carousel widgets). We organize +our dataset to encourage robustness analysis of device-control systems, i.e., +how well a system performs in the presence of new task descriptions, new +applications, or new platform versions. We develop two agents and report +performance across the dataset. The dataset is available at +https://github.com/google-research/google-research/tree/master/android_in_the_wild. + +
+
+
+
+
+ + ♻ ☆ FAMO: Fast Adaptive Multitask Optimization + + +
+ One of the grand enduring goals of AI is to create generalist agents that can +learn multiple different tasks from diverse data via multitask learning (MTL). +However, in practice, applying gradient descent (GD) on the average loss across +all tasks may yield poor multitask performance due to severe under-optimization +of certain tasks. Previous approaches that manipulate task gradients for a more +balanced loss decrease require storing and computing all task gradients +($\mathcal{O}(k)$ space and time where $k$ is the number of tasks), limiting +their use in large-scale scenarios. In this work, we introduce Fast Adaptive +Multitask Optimization (FAMO), a dynamic weighting method that decreases task +losses in a balanced way using $\mathcal{O}(1)$ space and time. We conduct an +extensive set of experiments covering multi-task supervised and reinforcement +learning problems. Our results indicate that FAMO achieves comparable or +superior performance to state-of-the-art gradient manipulation techniques while +offering significant improvements in space and computational efficiency. Code +is available at https://github.com/Cranial-XIX/FAMO. + +
+
+
+
+
+ + ♻ ☆ Genes in Intelligent Agents + + +
+ Genes in nature have given life on Earth its current biological +intelligence through transmission and accumulation over billions of years. +Inspired by biological intelligence, artificial intelligence (AI) has been +devoted to building machine intelligence. Although it has achieved thriving +successes, machine intelligence still lags far behind biological +intelligence. The reason may be that animals are born with some +intelligence encoded in their genes, whereas machines lack such intelligence and +learn from scratch. Inspired by the genes of animals, we define the "genes" +of machines, named "learngenes", and propose Genetic Reinforcement +Learning (GRL). GRL is a computational framework that simulates the evolution +of organisms in reinforcement learning (RL) and leverages the learngenes to +learn and evolve intelligent agents. Leveraging GRL, we first show that +the learngenes take the form of fragments of the agents' neural networks +and can be inherited across generations. Second, we validate that the +learngenes can transfer ancestral experience to the agents and give them +instincts and strong learning abilities. Third, we justify the Lamarckian +inheritance of the intelligent agents and the continuous evolution of the +learngenes. Overall, the learngenes take machine intelligence one +more step toward biological intelligence. + +
+
+
+
+
+ + ♻ ☆ Fundamental Limits of Membership Inference Attacks on Machine Learning + Models + + +
+ Membership inference attacks (MIA) can reveal whether a particular data point +was part of the training dataset, potentially exposing sensitive information +about individuals. This article explores the fundamental statistical +limitations associated with MIAs on machine learning models. More precisely, we +first derive the statistical quantity that governs the effectiveness and +success of such attacks. Then, we investigate several situations for which we +provide bounds on this quantity of interest. This allows us to infer the +accuracy of potential attacks as a function of the number of samples and other +structural parameters of learning models, which in some cases can be directly +estimated from the dataset. + +
+
+
+
+
+ + ♻ ☆ Neural Latent Geometry Search: Product Manifold Inference via + Gromov-Hausdorff-Informed Bayesian Optimization + + +
+ Recent research indicates that the performance of machine learning models can +be improved by aligning the geometry of the latent space with the underlying +data structure. Rather than relying solely on Euclidean space, researchers have +proposed using hyperbolic and spherical spaces with constant curvature, or +combinations thereof, to better model the latent space and enhance model +performance. However, little attention has been given to the problem of +automatically identifying the optimal latent geometry for the downstream task. +We mathematically define this novel formulation and coin it as neural latent +geometry search (NLGS). More specifically, we introduce an initial attempt to +search for a latent geometry composed of a product of constant curvature model +spaces with a small number of query evaluations, under some simplifying +assumptions. To accomplish this, we propose a novel notion of distance between +candidate latent geometries based on the Gromov-Hausdorff distance from metric +geometry. In order to compute the Gromov-Hausdorff distance, we introduce a +mapping function that enables the comparison of different manifolds by +embedding them in a common high-dimensional ambient space. We then design a +graph search space based on the notion of smoothness between latent geometries +and employ the calculated distances as an additional inductive bias. Finally, +we use Bayesian optimization to search for the optimal latent geometry in a +query-efficient manner. This is a general method which can be applied to search +for the optimal latent geometry for a variety of models and downstream tasks. +We perform experiments on synthetic and real-world datasets to identify the +optimal latent geometry for multiple machine learning problems. + +
+
+
+
+
+ + ♻ ☆ Deep Gaussian Markov Random Fields for Graph-Structured Dynamical + Systems NeurIPS 2023 + + +
+ Probabilistic inference in high-dimensional state-space models is +computationally challenging. For many spatiotemporal systems, however, prior +knowledge about the dependency structure of state variables is available. We +leverage this structure to develop a computationally efficient approach to +state estimation and learning in graph-structured state-space models with +(partially) unknown dynamics and limited historical data. Building on recent +methods that combine ideas from deep learning with principled inference in +Gaussian Markov random fields (GMRF), we reformulate graph-structured +state-space models as Deep GMRFs defined by simple spatial and temporal graph +layers. This results in a flexible spatiotemporal prior that can be learned +efficiently from a single time sequence via variational inference. Under linear +Gaussian assumptions, we retain a closed-form posterior, which can be sampled +efficiently using the conjugate gradient method, scaling favourably compared to +classical Kalman filter based approaches. + +
+
+ comment: NeurIPS 2023; camera-ready version +
+
+
+
+
+ + ♻ Simulation-free Schrödinger bridges via score and flow matching ICML 2023 + + +
+ We present simulation-free score and flow matching ([SF]$^2$M), a +simulation-free objective for inferring stochastic dynamics given unpaired +samples drawn from arbitrary source and target distributions. Our method +generalizes both the score-matching loss used in the training of diffusion +models and the recently proposed flow matching loss used in the training of +continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic +generative modeling as a Schrödinger bridge problem. It relies on static +entropy-regularized optimal transport, or a minibatch approximation, to +efficiently learn the SB without simulating the learned stochastic process. We +find that [SF]$^2$M is more efficient and gives more accurate solutions to the +SB problem than simulation-based methods from prior work. Finally, we apply +[SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably, +[SF]$^2$M is the first method to accurately model cell dynamics in high +dimensions and can recover known gene regulatory networks from simulated data. + +
+
+ comment: A version of this paper appeared in the New Frontiers in Learning, + Control, and Dynamical Systems workshop at ICML 2023. Code: + https://github.com/atong01/conditional-flow-matching +
+
+
+
+
+ + ♻ ☆ Implicit variance regularization in non-contrastive SSL NeurIPS 2023 + + +
+ Non-contrastive SSL methods like BYOL and SimSiam rely on asymmetric +predictor networks to avoid representational collapse without negative samples. +Yet, how predictor networks facilitate stable learning is not fully understood. +While previous theoretical analyses assumed Euclidean losses, most practical +implementations rely on cosine similarity. To gain further theoretical insight +into non-contrastive SSL, we analytically study learning dynamics in +conjunction with Euclidean and cosine similarity in the eigenspace of +closed-form linear predictor networks. We show that both avoid collapse through +implicit variance regularization albeit through different dynamical mechanisms. +Moreover, we find that the eigenvalues act as effective learning rate +multipliers and propose a family of isotropic loss functions (IsoLoss) that +equalize convergence rates across eigenmodes. Empirically, IsoLoss speeds up +the initial learning dynamics and increases robustness, thereby allowing us to +dispense with the EMA target network typically used with non-contrastive +methods. Our analysis sheds light on the variance regularization mechanisms of +non-contrastive SSL and lays the theoretical grounds for crafting novel loss +functions that shape the learning dynamics of the predictor's spectrum. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Flow Matching for Scalable Simulation-Based Inference NeurIPS 2023 + + +
+ Neural posterior estimation methods based on discrete normalizing flows have +become established tools for simulation-based inference (SBI), but scaling them +to high-dimensional problems can be challenging. Building on recent advances in +generative modeling, we here present flow matching posterior estimation (FMPE), +a technique for SBI using continuous normalizing flows. Like diffusion models, +and in contrast to discrete flows, flow matching allows for unconstrained +architectures, providing enhanced flexibility for complex data modalities. Flow +matching, therefore, enables exact density evaluation, fast training, and +seamless scalability to large architectures--making it ideal for SBI. We show +that FMPE achieves competitive performance on an established SBI benchmark, and +then demonstrate its improved scalability on a challenging scientific problem: +for gravitational-wave inference, FMPE outperforms methods based on comparable +discrete flows, reducing training time by 30% with substantially improved +accuracy. Our work underscores the potential of FMPE to enhance performance in +challenging inference scenarios, thereby paving the way for more advanced +applications to scientific problems. + +
+
+ comment: NeurIPS 2023. Code available at + https://github.com/dingo-gw/flow-matching-posterior-estimation +
+
+
+
+
+ + ♻ ☆ Can large language models replace humans in the systematic review + process? Evaluating GPT-4's efficacy in screening and extracting data from + peer-reviewed and grey literature in multiple languages + + +
+ Systematic reviews are vital for guiding practice, research, and policy, yet +they are often slow and labour-intensive. Large language models (LLMs) could +offer a way to speed up and automate systematic reviews, but their performance +in such tasks has not been comprehensively evaluated against humans, and no +study has tested GPT-4, the biggest LLM so far. This pre-registered study +evaluates GPT-4's capability in title/abstract screening, full-text review, and +data extraction across various literature types and languages using a +'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human +performance in most tasks, results were skewed by chance agreement and dataset +imbalance. After adjusting for these, there was a moderate level of performance +for data extraction, and - barring studies that used highly reliable prompts - +screening performance levelled at none to moderate for different stages and +languages. When screening full-text literature using highly reliable prompts, +GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key +studies using highly reliable prompts improved its performance even more. Our +findings indicate that, currently, substantial caution should be used if LLMs +are being used to conduct systematic reviews, but suggest that, for certain +systematic review tasks delivered under reliable prompts, LLMs can rival human +performance. + +
+
+ comment: 9 pages, 2 figures, 1 table +
+
+
+
+
+ + ♻ ☆ A Neurocomputational Account of Flexible Goal-directed Cognition and + Consciousness: The Goal-Aligning Representation Internal Manipulation Theory + (GARIM) + + +
+ Goal-directed manipulation of representations is a key element of human +flexible behaviour, while consciousness is often related to several aspects of +higher-order cognition and human flexibility. Currently these two phenomena are +only partially integrated (e.g., see Neurorepresentationalism) and this (a) +limits our understanding of neuro-computational processes that lead conscious +states to produce flexible goal-directed behaviours, (b) prevents a +computational formalisation of conscious goal-directed manipulations of +representations occurring in the brain, and (c) inhibits the exploitation of +this knowledge for modelling and technological purposes. Addressing these +issues, here we extend our 'three-component theory of flexible cognition' by +proposing the 'Goal-Aligning Representations Internal Manipulation' (GARIM) +theory of conscious and flexible goal-directed cognition. The central idea of +the theory is that conscious states support the active manipulation of +goal-relevant internal representations (e.g., of world states, objects, and +action sequences) to make them more aligned with the pursued goals. This leads +to the generation of the knowledge which is necessary to face novel +situations/goals, thus increasing the flexibility of goal-directed behaviours. +The GARIM theory integrates key aspects of the main theories of consciousness +into the functional neuro-computational framework of goal-directed behaviour. +Moreover, it takes into account the subjective sensation of agency that +accompanies conscious goal-directed processes ('GARIM agency'). The proposal +also has implications for experimental studies on consciousness and clinical +aspects of conscious goal-directed behaviour. Finally, the GARIM theory could benefit +technological fields such as autonomous robotics and machine learning (e.g., +the manipulation process may describe the operations performed by systems based +on transformers). + +
+
+
+
+
+ + ♻ ☆ DeepPCR: Parallelizing Sequential Operations in Neural Networks + + +
+ Parallelization techniques have become ubiquitous for accelerating inference +and training of deep neural networks. Despite this, several operations are +still performed in a sequential manner. For instance, the forward and backward +passes are executed layer-by-layer, and the output of diffusion models is +produced by applying a sequence of denoising steps. This sequential approach +results in a computational cost proportional to the number of steps involved, +presenting a potential bottleneck as the number of steps increases. In this +work, we introduce DeepPCR, a novel algorithm which parallelizes typically +sequential operations in order to speed up inference and training of neural +networks. DeepPCR is based on interpreting a sequence of $L$ steps as the +solution of a specific system of equations, which we recover using the Parallel +Cyclic Reduction algorithm. This reduces the complexity of computing the +sequential operations from $\mathcal{O}(L)$ to $\mathcal{O}(\log_2L)$, thus +yielding a speedup for large $L$. To verify the theoretical lower complexity of +the algorithm, and to identify regimes for speedup, we test the effectiveness +of DeepPCR in parallelizing the forward and backward pass in multi-layer +perceptrons, and reach speedups of up to $30\times$ for the forward and +$200\times$ for the backward pass. We additionally showcase the flexibility of +DeepPCR by parallelizing training of ResNets with as many as 1024 layers, and +generation in diffusion models, enabling up to $7\times$ faster training and +$11\times$ faster generation, respectively, when compared to the sequential +approach. + +
+
+
+
+
+ + ♻ ☆ Disentangling Structure and Style: Political Bias Detection in News by + Inducing Document Hierarchy EMNLP 2023 + + +
+ We address an important gap in detecting political bias in news articles. +Previous works that perform document classification can be influenced by the +writing style of each news outlet, leading to overfitting and limited +generalizability. Our approach overcomes this limitation by considering both +the sentence-level semantics and the document-level rhetorical structure, +resulting in a more robust and style-agnostic approach to detecting political +bias in news articles. We introduce a novel multi-head hierarchical attention +model that effectively encodes the structure of long documents through a +diverse ensemble of attention heads. While journalism follows a formalized +rhetorical structure, the writing style may vary by news outlet. We demonstrate +that our method overcomes this domain dependency and outperforms previous +approaches for robustness and accuracy. Further analysis and human evaluation +demonstrate the ability of our model to capture common discourse structures in +journalism. Our code is available at: +https://github.com/xfactlab/emnlp2023-Document-Hierarchy + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Semantic HELM: A Human-Readable Memory for Reinforcement Learning NeurIPS 2023 + + +
+ Reinforcement learning agents deployed in the real world often have to cope +with partially observable environments. Therefore, most agents employ memory +mechanisms to approximate the state of the environment. Recently, there have +been impressive success stories in mastering partially observable environments, +mostly in the realm of computer games like Dota 2, StarCraft II, or MineCraft. +However, existing methods lack interpretability in the sense that it is not +comprehensible for humans what the agent stores in its memory. In this regard, +we propose a novel memory mechanism that represents past events in human +language. Our method uses CLIP to associate visual inputs with language tokens. +Then we feed these tokens to a pretrained language model that serves the agent +as memory and provides it with a coherent and human-readable representation of +the past. We train our memory mechanism on a set of partially observable +environments and find that it excels on tasks that require a memory component, +while mostly attaining performance on-par with strong baselines on tasks that +do not. On a challenging continuous recognition task, where memorizing the past +is crucial, our memory mechanism converges two orders of magnitude faster than +prior methods. Since our memory mechanism is human-readable, we can peek at an +agent's memory and check whether crucial pieces of information have been +stored. This significantly enhances troubleshooting and paves the way toward +more interpretable agents. + +
+
+ comment: To appear at NeurIPS 2023, 10 pages (+ references and appendix), + Code: https://github.com/ml-jku/helm +
+
+
+
+
+ + ♻ ☆ Towards Control-Centric Representations in Reinforcement Learning from + Images + + +
+ Image-based Reinforcement Learning is a practical yet challenging task. A +major hurdle lies in extracting control-centric representations while +disregarding irrelevant information. While approaches that follow the +bisimulation principle exhibit the potential in learning state representations +to address this issue, they still grapple with the limited expressive capacity +of latent dynamics and the inadaptability to sparse reward environments. To +address these limitations, we introduce ReBis, which aims to capture +control-centric information by integrating reward-free control information +alongside reward-specific knowledge. ReBis utilizes a transformer architecture +to implicitly model the dynamics and incorporates block-wise masking to +eliminate spatiotemporal redundancy. Moreover, ReBis combines +bisimulation-based loss with asymmetric reconstruction loss to prevent feature +collapse in environments with sparse rewards. Empirical studies on two large +benchmarks, including Atari games and DeepMind Control Suite, demonstrate that +ReBis has superior performance compared to existing methods, proving its +effectiveness. + +
+
+
+
+
+ + ♻ ☆ Learning Better with Less: Effective Augmentation for Sample-Efficient + Visual Reinforcement Learning NeurIPS 2023 + + +
+ Data augmentation (DA) is a crucial technique for enhancing the sample +efficiency of visual reinforcement learning (RL) algorithms. Notably, employing +simple observation transformations alone can yield outstanding performance +without extra auxiliary representation tasks or pre-trained encoders. However, +it remains unclear which attributes of DA account for its effectiveness in +achieving sample-efficient visual RL. To investigate this issue and further +explore the potential of DA, this work conducts comprehensive experiments to +assess the impact of DA's attributes on its efficacy and provides the following +insights and improvements: (1) For individual DA operations, we reveal that +both ample spatial diversity and slight hardness are indispensable. Building on +this finding, we introduce Random PadResize (Rand PR), a new DA operation that +offers abundant spatial diversity with minimal hardness. (2) For multi-type DA +fusion schemes, the increased DA hardness and unstable data distribution result +in the current fusion schemes being unable to achieve higher sample efficiency +than their corresponding individual operations. Taking the non-stationary +nature of RL into account, we propose a RL-tailored multi-type DA fusion scheme +called Cycling Augmentation (CycAug), which performs periodic cycles of +different DA operations to increase type diversity while maintaining data +distribution consistency. Extensive evaluations on the DeepMind Control suite +and CARLA driving simulator demonstrate that our methods achieve superior +sample efficiency compared with the prior state-of-the-art methods. + +
+
+ comment: NeurIPS 2023 poster +
+
+
+
+
+ + ♻ ☆ Epidemic Learning: Boosting Decentralized Learning with Randomized + Communication NeurIPS 2023 + + +
+ We present Epidemic Learning (EL), a simple yet powerful decentralized +learning (DL) algorithm that leverages changing communication topologies to +achieve faster model convergence compared to conventional DL approaches. At +each round of EL, each node sends its model updates to a random sample of $s$ +other nodes (in a system of $n$ nodes). We provide an extensive theoretical +analysis of EL, demonstrating that its changing topology culminates in superior +convergence properties compared to the state-of-the-art (static and dynamic) +topologies. Considering smooth non-convex loss functions, the number of +transient iterations for EL, i.e., the rounds required to achieve asymptotic +linear speedup, is in $O(n^3/s^2)$ which outperforms the best-known bound +$O(n^3)$ by a factor of $s^2$, indicating the benefit of randomized +communication for DL. We empirically evaluate EL in a 96-node network and +compare its performance with state-of-the-art DL approaches. Our results +illustrate that EL converges up to $1.7\times$ quicker than baseline DL +algorithms and attains $2.2\%$ higher accuracy for the same communication +volume. + +
+
+ comment: Accepted paper at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Entropy-dissipation Informed Neural Network for McKean-Vlasov Type PDEs NeurIPS 2023 + + +
+ We extend the concept of self-consistency for the Fokker-Planck equation +(FPE) to the more general McKean-Vlasov equation (MVE). While FPE describes the +macroscopic behavior of particles under drift and diffusion, MVE accounts for +the additional inter-particle interactions, which are often highly singular in +physical systems. Two important examples considered in this paper are the MVE +with Coulomb interactions and the vorticity formulation of the 2D Navier-Stokes +equation. We show that a generalized self-consistency potential controls the +KL-divergence between a hypothesis solution to the ground truth, through +entropy dissipation. Built on this result, we propose to solve the MVEs by +minimizing this potential function, while utilizing the neural networks for +function approximation. We validate the empirical performance of our approach +by comparing with state-of-the-art NN-based PDE solvers on several example +problems. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Generalized Teacher Forcing for Learning Chaotic Dynamics ICML 2023 + + +
+ Chaotic dynamical systems (DS) are ubiquitous in nature and society. Often we +are interested in reconstructing such systems from observed time series for +prediction or mechanistic insight, where by reconstruction we mean learning +geometrical and invariant temporal properties of the system in question (like +attractors). However, training reconstruction algorithms like recurrent neural +networks (RNNs) on such systems by gradient-descent based techniques faces +severe challenges. This is mainly due to exploding gradients caused by the +exponential divergence of trajectories in chaotic systems. Moreover, for +(scientific) interpretability we wish to have as low dimensional +reconstructions as possible, preferably in a model which is mathematically +tractable. Here we report that a surprisingly simple modification of teacher +forcing leads to provably strictly all-time bounded gradients in training on +chaotic systems, and, when paired with a simple architectural rearrangement of +a tractable RNN design, piecewise-linear RNNs (PLRNNs), allows for faithful +reconstruction in spaces of at most the dimensionality of the observed system. +We show on several DS that with these amendments we can reconstruct DS better +than current SOTA algorithms, in much lower dimensions. Performance differences +were particularly compelling on real world data with which most other methods +severely struggled. This work thus led to a simple yet powerful DS +reconstruction algorithm which is highly interpretable at the same time. + +
+
+ comment: Published in the Proceedings of the 40th International Conference on + Machine Learning (ICML 2023) +
+
+
+
+
+ + ♻ ☆ Looping in the Human: Collaborative and Explainable Bayesian + Optimization + + +
+ Like many optimizers, Bayesian optimization often falls short of gaining user +trust due to opacity. While attempts have been made to develop human-centric +optimizers, they typically assume user knowledge is well-specified and +error-free, employing users mainly as supervisors of the optimization process. +We relax these assumptions and propose a more balanced human-AI partnership +with our Collaborative and Explainable Bayesian Optimization (CoExBO) +framework. Instead of explicitly requiring a user to provide a knowledge model, +CoExBO employs preference learning to seamlessly integrate human insights into +the optimization, resulting in algorithmic suggestions that resonate with user +preference. CoExBO explains its candidate selection every iteration to foster +trust, empowering users with a clearer grasp of the optimization. Furthermore, +CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with +extreme adversarial interventions, the algorithm converges asymptotically to a +vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI +teaming experiments in lithium-ion battery design, highlighting substantial +improvements over conventional methods. + +
+
+ comment: 22 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ A Unified Framework for Discovering Discrete Symmetries + + +
+ We consider the problem of learning a function respecting a symmetry from +among a class of symmetries. We develop a unified framework that enables +symmetry discovery across a broad range of subgroups including locally +symmetric, dihedral and cyclic subgroups. At the core of the framework is a +novel architecture composed of linear, matrix-valued and non-linear functions +that expresses functions invariant to these subgroups in a principled manner. +The structure of the architecture enables us to leverage multi-armed bandit +algorithms and gradient descent to efficiently optimize over the linear and the +non-linear functions, respectively, and to infer the symmetry that is +ultimately learnt. We also discuss the necessity of the matrix-valued functions +in the architecture. Experiments on image-digit sum and polynomial regression +tasks demonstrate the effectiveness of our approach. + +
+
+
+
+
+ + ♻ ☆ Direct Preference-based Policy Optimization without Reward Modeling NeurIPS 2023 + + +
+ Preference-based reinforcement learning (PbRL) is an approach that enables RL +agents to learn from preference, which is particularly useful when formulating +a reward function is challenging. Existing PbRL methods generally involve a +two-step procedure: they first learn a reward model based on given preference +data and then employ off-the-shelf reinforcement learning algorithms using the +learned reward model. However, obtaining an accurate reward model solely from +preference information, especially when the preference is from human teachers, +can be difficult. Instead, we propose a PbRL algorithm that directly learns +from preference without requiring any reward modeling. To achieve this, we +adopt a contrastive learning framework to design a novel policy scoring metric +that assigns a high score to policies that align with the given preferences. We +apply our algorithm to offline RL tasks with actual human preference labels and +show that our algorithm outperforms or is on par with the existing PbRL +methods. Notably, on high-dimensional control tasks, our algorithm surpasses +offline RL methods that learn with ground-truth reward information. Finally, we +show that our algorithm can be successfully applied to fine-tune large language +models. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Learning via Wasserstein-Based High Probability Generalisation Bounds NeurIPS 2023 + + +
+ Minimising upper bounds on the population risk or the generalisation gap has +been widely used in structural risk minimisation (SRM) -- this is in particular +at the core of PAC-Bayesian learning. Despite its successes and unfailing surge +of interest in recent years, a limitation of the PAC-Bayesian framework is that +most bounds involve a Kullback-Leibler (KL) divergence term (or its +variations), which might exhibit erratic behavior and fail to capture the +underlying geometric structure of the learning problem -- hence restricting its +use in practical applications. As a remedy, recent studies have attempted to +replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein +distance. Even though these bounds alleviated the aforementioned issues to a +certain extent, they either hold in expectation, are for bounded losses, or are +nontrivial to minimize in an SRM framework. In this work, we contribute to this +line of research and prove novel Wasserstein distance-based PAC-Bayesian +generalisation bounds for both batch learning with independent and identically +distributed (i.i.d.) data, and online learning with potentially non-i.i.d. +data. Contrary to previous art, our bounds are stronger in the sense that (i) +they hold with high probability, (ii) they apply to unbounded (potentially +heavy-tailed) losses, and (iii) they lead to optimizable training objectives +that can be used in SRM. As a result we derive novel Wasserstein-based +PAC-Bayesian learning algorithms and we illustrate their empirical advantage on +a variety of experiments. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Unpaired Multi-Domain Causal Representation Learning + + +
+ The goal of causal representation learning is to find a representation of +data that consists of causally related latent variables. We consider a setup +where one has access to data from multiple domains that potentially share a +causal representation. Crucially, observations in different domains are assumed +to be unpaired, that is, we only observe the marginal distribution in each +domain but not their joint distribution. In this paper, we give sufficient +conditions for identifiability of the joint distribution and the shared causal +graph in a linear setup. Identifiability holds if we can uniquely recover the +joint distribution and the shared causal representation from the marginal +distributions in each domain. We transform our identifiability results into a +practical method to recover the shared latent causal graph. + +
+
+
+
+
+ + ♻ ☆ Macro Placement by Wire-Mask-Guided Black-Box Optimization NeurIPS'23 + + +
+ The development of very large-scale integration (VLSI) technology has posed +new challenges for electronic design automation (EDA) techniques in chip +floorplanning. During this process, macro placement is an important subproblem, +which tries to determine the positions of all macros with the aim of minimizing +half-perimeter wirelength (HPWL) and avoiding overlapping. Previous methods +include packing-based, analytical and reinforcement learning methods. In this +paper, we propose a new black-box optimization (BBO) framework (called +WireMask-BBO) for macro placement, by using a wire-mask-guided greedy procedure +for objective evaluation. Equipped with different BBO algorithms, WireMask-BBO +empirically achieves significant improvements over previous methods, i.e., +achieves significantly shorter HPWL by using much less time. Furthermore, it +can fine-tune existing placements by treating them as initial solutions, which +can bring up to 50% improvement in HPWL. WireMask-BBO has the potential to +significantly improve the quality and efficiency of chip floorplanning, which +makes it appealing to researchers and practitioners in EDA and will also +promote the application of BBO. Our code is available at +https://github.com/lamda-bbo/WireMask-BBO. + +
+
+ comment: Update NeurIPS'23 camera ready version +
+
+
+
+
+ + ♻ ☆ Representation Learning via Consistent Assignment of Views over Random + Partitions NeurIPS 2023 + + +
+ We present Consistent Assignment of Views over Random Partitions (CARP), a +self-supervised clustering method for representation learning of visual +features. CARP learns prototypes in an end-to-end online fashion using gradient +descent without additional non-differentiable modules to solve the cluster +assignment problem. CARP optimizes a new pretext task based on random +partitions of prototypes that regularizes the model and enforces consistency +between views' assignments. Additionally, our method improves training +stability and prevents collapsed solutions in joint-embedding training. Through +an extensive evaluation, we demonstrate that CARP's representations are +suitable for learning downstream tasks. We evaluate CARP's representation +capabilities on 17 datasets across many standard protocols, including linear +evaluation, few-shot classification, k-NN, k-means, image retrieval, and copy +detection. We compare CARP performance to 11 existing self-supervised methods. +We extensively ablate our method and demonstrate that our proposed random +partition pretext task improves the quality of the learned representations by +devising multiple random classification tasks. In transfer learning tasks, CARP +achieves the best performance on average against many SSL methods trained for a +longer time. + +
+
+ comment: To appear in NeurIPS 2023. Code available at + https://github.com/sthalles/carp +
+
+
+
+
+ + ♻ ☆ AdaTask: Adaptive Multitask Online Learning + + +
+ We introduce and analyze AdaTask, a multitask online learning algorithm that +adapts to the unknown structure of the tasks. When the $N$ tasks are +stochastically activated, we show that the regret of AdaTask is better, by a +factor that can be as large as $\sqrt{N}$, than the regret achieved by running +$N$ independent algorithms, one for each task. AdaTask can be seen as a +comparator-adaptive version of Follow-the-Regularized-Leader with a Mahalanobis +norm potential. Through a variational formulation of this potential, our +analysis reveals how AdaTask jointly learns the tasks and their structure. +Experiments supporting our findings are presented. + +
+
+ comment: The proof of Theorem 3 is wrong: in the display equation below + Equation (22), bottom of page 15, the gradient of $\phi_{t+1}$ is missing a + factor $1/(\alpha\eta_t)$ +
+
+
+
+
+ + ♻ ☆ Pitfall of Optimism: Distributional Reinforcement Learning by + Randomizing Risk Criterion NeurIPS 2023 + + +
+ Distributional reinforcement learning algorithms have attempted to utilize +estimated uncertainty for exploration, such as optimism in the face of +uncertainty. However, using the estimated variance for optimistic exploration +may cause biased data collection and hinder convergence or performance. In this +paper, we present a novel distributional reinforcement learning algorithm that +selects actions by randomizing risk criterion to avoid one-sided tendency on +risk. We provide a perturbed distributional Bellman optimality operator by +distorting the risk measure and prove the convergence and optimality of the +proposed method with the weaker contraction property. Our theoretical results +support that the proposed method does not fall into biased exploration and is +guaranteed to converge to an optimal return. Finally, we empirically show that +our method outperforms other existing distribution-based algorithms in various +environments including Atari 55 games. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Bayesian sparsification for deep neural networks with Bayesian model + reduction + + +
+ Deep learning's immense capabilities are often constrained by the complexity +of its models, leading to an increasing demand for effective sparsification +techniques. Bayesian sparsification for deep learning emerges as a crucial +approach, facilitating the design of models that are both computationally +efficient and competitive in terms of performance across various deep learning +applications. The state-of-the-art -- in Bayesian sparsification of deep neural +networks -- combines structural shrinkage priors on model weights with an +approximate inference scheme based on stochastic variational inference. +However, model inversion of the full generative model is exceptionally +computationally demanding, especially when compared to standard deep learning +of point estimates. In this context, we advocate for the use of Bayesian model +reduction (BMR) as a more efficient alternative for pruning of model weights. +As a generalization of the Savage-Dickey ratio, BMR allows a post-hoc +elimination of redundant model weights based on the posterior estimates under a +straightforward (non-hierarchical) generative model. Our comparative study +highlights the advantages of the BMR method relative to established approaches +based on hierarchical horseshoe priors over model weights. We illustrate the +potential of BMR across various deep learning architectures, from classical +networks like LeNet to modern frameworks such as Vision Transformers and +MLP-Mixers. + +
+
+
+
+
+ + ♻ ☆ Distributional Learning of Variational AutoEncoder: Application to + Synthetic Data Generation + + +
+ The Gaussianity assumption has been consistently criticized as a main +limitation of the Variational Autoencoder (VAE) despite its efficiency in +computational modeling. In this paper, we propose a new approach that expands +the model capacity (i.e., expressive power of distributional family) without +sacrificing the computational advantages of the VAE framework. Our VAE model's +decoder is composed of an infinite mixture of asymmetric Laplace distribution, +which possesses general distribution fitting capabilities for continuous +variables. Our model is represented by a special form of a nonparametric +M-estimator for estimating general quantile functions, and we theoretically +establish the relevance between the proposed model and quantile estimation. We +apply the proposed model to synthetic data generation, and particularly, our +model demonstrates superiority in easily adjusting the level of data privacy. + +
+
+
+
+
+ + ♻ ☆ Memory Efficient Optimizers with 4-bit States NeurIPS 2023 + + +
+ Optimizer states are a major source of memory consumption for training neural +networks, limiting the maximum trainable model within a given memory budget. +Compressing the optimizer states from 32-bit floating points to lower bitwidth +is promising to reduce the training memory footprint, while the current lowest +achievable bitwidth is 8-bit. In this work, we push optimizer states bitwidth +down to 4-bit through a detailed empirical analysis of first and second +moments. Specifically, we find that moments have complicated outlier patterns +that current block-wise quantization cannot accurately approximate. We use a +smaller block size and propose to utilize both row-wise and column-wise +information for better quantization. We further identify a zero point problem +of quantizing the second moment, and solve this problem with a linear quantizer +that excludes the zero point. Our 4-bit optimizers are evaluated on a wide +variety of benchmarks including natural language understanding, machine +translation, image classification, and instruction tuning. On all the tasks our +optimizers can achieve comparable accuracy with their full-precision +counterparts, while enjoying better memory efficiency. + +
+
+ comment: v3: camera ready revisions for NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ On the Probability of Necessity and Sufficiency of Explaining Graph + Neural Networks: A Lower Bound Optimization Approach + + +
+ The explainability of Graph Neural Networks (GNNs) is critical to various GNN +applications, yet it remains a significant challenge. A convincing explanation +should be both necessary and sufficient simultaneously. However, existing GNN +explaining approaches focus on only one of the two aspects, necessity or +sufficiency, or a heuristic trade-off between the two. Theoretically, the +Probability of Necessity and Sufficiency (PNS) holds the potential to identify +the most necessary and sufficient explanation since it can mathematically +quantify the necessity and sufficiency of an explanation. Nevertheless, the +difficulty of obtaining PNS due to non-monotonicity and the challenge of +counterfactual estimation limit its wide use. To address the +non-identifiability of PNS, we resort to a lower bound of PNS that can be +optimized via counterfactual estimation, and propose a framework of Necessary +and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. +Specifically, we depict the GNN as a structural causal model (SCM), and +estimate the probability of counterfactual via the intervention under the SCM. +Additionally, we leverage continuous masks with a sampling strategy to optimize +the lower bound to enhance the scalability. Empirical results demonstrate that +NSEG outperforms state-of-the-art methods, consistently generating the most +necessary and sufficient explanations. + +
+
+ comment: 36 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ ReSQueing Parallel and Private Stochastic Convex Optimization + + +
+ We introduce a new tool for stochastic convex optimization (SCO): a +Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function +convolved with a (Gaussian) probability density. Combining ReSQue with recent +advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop +algorithms achieving state-of-the-art complexities for SCO in parallel and +private settings. For a SCO objective constrained to the unit ball in +$\mathbb{R}^d$, we obtain the following results (up to polylogarithmic +factors). We give a parallel algorithm obtaining optimization error +$\epsilon_{\text{opt}}$ with $d^{1/3}\epsilon_{\text{opt}}^{-2/3}$ gradient +oracle query depth and $d^{1/3}\epsilon_{\text{opt}}^{-2/3} + +\epsilon_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a +bounded-variance stochastic gradient estimator. For $\epsilon_{\text{opt}} \in +[d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of +[BJLLS19] while maintaining the optimal total work of stochastic gradient +descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, +BFGT20, AFKT21, KLL21] established that if $n \gtrsim d +\epsilon_{\text{dp}}^{-2}$, $(\epsilon_{\text{dp}}, \delta)$-differential +privacy is attained at no asymptotic cost to the SCO utility. However, these +prior works all required a superlinear number of gradient queries. We close +this gap for sufficiently large $n \gtrsim d^2 \epsilon_{\text{dp}}^{-3}$, by +using ReSQue to design an algorithm with near-linear gradient query complexity +in this regime. + +
+
+
+
+
+ + ♻ ☆ Look Beneath the Surface: Exploiting Fundamental Symmetry for + Sample-Efficient Offline RL NeurIPS 2023 + + +
+ Offline reinforcement learning (RL) offers an appealing approach to +real-world tasks by learning policies from pre-collected datasets without +interacting with the environment. However, the performance of existing offline +RL algorithms heavily depends on the scale and state-action space coverage of +datasets. Real-world data collection is often expensive and uncontrollable, +leading to small and narrowly covered datasets and posing significant +challenges for practical deployments of offline RL. In this paper, we provide a +new insight that leveraging the fundamental symmetry of system dynamics can +substantially enhance offline RL performance under small datasets. +Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced +Dynamics Model (TDM), which establishes consistency between a pair of forward +and reverse latent dynamics. TDM provides both well-behaved representations for +small datasets and a new reliability measure for OOD samples based on +compliance with the T-symmetry. These can be readily used to construct a new +offline RL algorithm (TSRL) with less conservative policy constraints and a +reliable latent space data augmentation procedure. Based on extensive +experiments, we find TSRL achieves great performance on small benchmark +datasets with as few as 1% of the original samples, which significantly +outperforms the recent offline RL algorithms in terms of data efficiency and +generalizability. Code is available at: https://github.com/pcheng2/TSRL + +
+
+ comment: Accepted in NeurIPS 2023; The first two authors contributed equally +
+
+
+
+
+ + ♻ ☆ Data driven modeling of self-similar dynamics + + +
+ Multiscale modeling of complex systems is crucial for understanding their +intricacies. Data-driven multiscale modeling has emerged as a promising +approach to tackle challenges associated with complex systems. On the other +hand, self-similarity is prevalent in complex systems, hinting that large-scale +complex systems can be modeled at a reduced cost. In this paper, we introduce a +multiscale neural network framework that incorporates self-similarity as prior +knowledge, facilitating the modeling of self-similar dynamical systems. For +deterministic dynamics, our framework can discern whether the dynamics are +self-similar. For uncertain dynamics, it can compare and determine which +parameter set is closer to self-similarity. The framework allows us to extract +scale-invariant kernels from the dynamics for modeling at any scale. Moreover, +our method can identify the power law exponents in self-similar systems. +Preliminary tests on the Ising model yielded critical exponents consistent with +theoretical expectations, providing valuable insights for addressing critical +phase transitions in non-equilibrium systems. + +
+
+ comment: 11 pages, 5 figures, 1 table +
+
+
+
+
+ + ♻ ☆ Practical Sharpness-Aware Minimization Cannot Converge All the Way to + Optima NeurIPS 2023 + + +
+ Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step +based on the gradient at a perturbation $y_t = x_t + \rho \frac{\nabla +f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing +studies prove convergence of SAM for smooth functions, but they do so by +assuming decaying perturbation size $\rho$ and/or no gradient normalization in +$y_t$, which is detached from practice. To address this gap, we study +deterministic/stochastic versions of SAM with practical configurations (i.e., +constant $\rho$ and gradient normalization in $y_t$) and explore their +convergence properties on smooth functions with (non)convexity assumptions. +Perhaps surprisingly, in many scenarios, we find out that SAM has limited +capability to converge to global minima or stationary points. For smooth +strongly convex functions, we show that while deterministic SAM enjoys tight +global convergence rates of $\tilde \Theta(\frac{1}{T^2})$, the convergence +bound of stochastic SAM suffers an inevitable additive term $O(\rho^2)$, +indicating convergence only up to neighborhoods of optima. In fact, such +$O(\rho^2)$ factors arise for stochastic SAM in all the settings we consider, +and also for deterministic SAM in nonconvex cases; importantly, we prove by +examples that such terms are unavoidable. Our results highlight vastly +different characteristics of SAM with vs. without decaying perturbation size or +gradient normalization, and suggest that the intuitions gained from one version +may not apply to the other. + +
+
+ comment: 39 pages. v3 NeurIPS 2023 camera ready version +
+
+
+
+
+ + ♻ ☆ Truncated Affinity Maximization: One-class Homophily Modeling for Graph + Anomaly Detection + + +
+ One prevalent property we find empirically in real-world graph anomaly +detection (GAD) datasets is a one-class homophily, i.e., normal nodes tend to +have strong connection/affinity with each other, while the homophily in +abnormal nodes is significantly weaker than normal nodes. However, this +anomaly-discriminative property is ignored by existing GAD methods that are +typically built using a conventional anomaly detection objective, such as data +reconstruction. In this work, we explore this property to introduce a novel +unsupervised anomaly scoring measure for GAD -- local node affinity -- that +assigns a larger anomaly score to nodes that are less affiliated with their +neighbors, with the affinity defined as similarity on node +attributes/representations. We further propose Truncated Affinity Maximization +(TAM) that learns tailored node representations for our anomaly measure by +maximizing the local affinity of nodes to their neighbors. Optimizing on the +original graph structure can be biased by non-homophily edges (i.e., edges +connecting normal and abnormal nodes). Thus, TAM is instead optimized on +truncated graphs where non-homophily edges are removed iteratively to mitigate +this bias. The learned representations result in significantly stronger local +affinity for normal nodes than abnormal nodes. Extensive empirical results on +six real-world GAD datasets show that TAM substantially outperforms seven +competing models, achieving over 10% increase in AUROC/AUPRC compared to the +best contenders on challenging datasets. Our code will be made available at +https://github.com/mala-lab/TAM-master/. + +
+
+ comment: 19 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ DebateKG: Automatic Policy Debate Case Creation with Semantic Knowledge + Graphs EMNLP 2023 + + +
+ Recent work within the Argument Mining community has shown the applicability +of Natural Language Processing systems for solving problems found within +competitive debate. One of the most important tasks within competitive debate +is for debaters to create high quality debate cases. We show that effective +debate cases can be constructed using constrained shortest path traversals on +Argumentative Semantic Knowledge Graphs. We study this potential in the context +of a type of American Competitive Debate, called Policy Debate, which already +has a large scale dataset targeting it called DebateSum. We significantly +improve upon DebateSum by introducing 53180 new examples, as well as further +useful metadata for every example, to the dataset. We leverage the txtai +semantic search and knowledge graph toolchain to produce and contribute 9 +semantic knowledge graphs built on this dataset. We create a unique method for +evaluating which knowledge graphs are better in the context of producing policy +debate cases. A demo which automatically generates debate cases, along with all +other code and the Knowledge Graphs, are open-sourced and made available to the +public here: https://huggingface.co/spaces/Hellisotherpeople/DebateKG + +
+
+ comment: 8 pages, Accepted to The 4th New Frontiers in Summarization Workshop + (EMNLP 2023), System Demonstration paper +
+
+
+
+
+ + ♻ ☆ Towards Personalized Federated Learning via Heterogeneous Model + Reassembly NeurIPS 2023 + + +
+ This paper focuses on addressing the practical yet challenging problem of +model heterogeneity in federated learning, where clients possess models with +different network structures. To tackle this problem, we propose a novel +framework called pFedHR, which leverages heterogeneous model reassembly to +achieve personalized federated learning. In particular, we approach the problem +of heterogeneous model personalization as a model-matching optimization task on +the server side. Moreover, pFedHR automatically and dynamically generates +informative and diverse personalized candidates with minimal human +intervention. Furthermore, our proposed heterogeneous model reassembly +technique mitigates the adverse impact introduced by using public data with +different distributions from the client data to a certain extent. Experimental +results demonstrate that pFedHR outperforms baselines on three datasets under +both IID and Non-IID settings. Additionally, pFedHR effectively reduces the +adverse impact of using different public data and dynamically generates diverse +personalized models in an automated manner. + +
+
+ comment: This paper has been accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Jorge: Approximate Preconditioning for GPU-efficient Second-order + Optimization + + +
+ Despite their better convergence properties compared to first-order +optimizers, second-order optimizers for deep learning have been less popular +due to their significant computational costs. The primary efficiency bottleneck +in such optimizers is matrix inverse calculations in the preconditioning step, +which are expensive to compute on GPUs. In this paper, we introduce Jorge, a +second-order optimizer that promises the best of both worlds -- rapid +convergence benefits of second-order methods, and high computational efficiency +typical of first-order methods. We address the primary computational bottleneck +of computing matrix inverses by completely eliminating them using an +approximation of the preconditioner computation. This makes Jorge extremely +efficient on GPUs in terms of wall-clock time. Further, we describe an approach +to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, +thereby significantly minimizing tuning efforts. Our empirical evaluations +demonstrate the distinct advantages of using Jorge, outperforming +state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple +deep learning models, both in terms of sample efficiency and wall-clock time. + +
+
+
+
+
+ + ♻ ☆ Beta Diffusion NeurIPS 2023 + + +
+ We introduce beta diffusion, a novel generative modeling method that +integrates demasking and denoising to generate data within bounded ranges. +Using scaled and shifted beta distributions, beta diffusion utilizes +multiplicative transitions over time to create both forward and reverse +diffusion processes, maintaining beta distributions in both the forward +marginals and the reverse conditionals, given the data at any point in time. +Unlike traditional diffusion-based generative models relying on additive +Gaussian noise and reweighted evidence lower bounds (ELBOs), beta diffusion is +multiplicative and optimized with KL-divergence upper bounds (KLUBs) derived +from the convexity of the KL divergence. We demonstrate that the proposed KLUBs +are more effective for optimizing beta diffusion compared to negative ELBOs, +which can also be derived as the KLUBs of the same KL divergence with its two +arguments swapped. The loss function of beta diffusion, expressed in terms of +Bregman divergence, further supports the efficacy of KLUBs for optimization. +Experimental results on both synthetic data and natural images demonstrate the +unique capabilities of beta diffusion in generative modeling of range-bounded +data and validate the effectiveness of KLUBs in optimizing diffusion models, +thereby making them valuable additions to the family of diffusion-based +generative models and the optimization techniques used to train them. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Pre-training Contextualized World Models with In-the-wild Videos for + Reinforcement Learning NeurIPS 2023 + + +
+ Unsupervised pre-training methods utilizing large and diverse datasets have +achieved tremendous success across a range of domains. Recent work has +investigated such unsupervised pre-training methods for model-based +reinforcement learning (MBRL) but is limited to domain-specific or simulated +data. In this paper, we study the problem of pre-training world models with +abundant in-the-wild videos for efficient learning of downstream visual control +tasks. However, in-the-wild videos are complicated with various contextual +factors, such as intricate backgrounds and textured appearance, which precludes +a world model from extracting shared world knowledge to generalize better. To +tackle this issue, we introduce Contextualized World Models (ContextWM) that +explicitly separate context and dynamics modeling to overcome the complexity +and diversity of in-the-wild videos and facilitate knowledge transfer between +distinct scenes. Specifically, a contextualized extension of the latent +dynamics model is elaborately realized by incorporating a context encoder to +retain contextual information and empower the image decoder, which encourages +the latent dynamics model to concentrate on essential temporal variations. Our +experiments show that in-the-wild video pre-training equipped with ContextWM +can significantly improve the sample efficiency of MBRL in various domains, +including robotic manipulation, locomotion, and autonomous driving. Code is +available at this repository: https://github.com/thuml/ContextWM. + +
+
+ comment: NeurIPS 2023. Code is available at https://github.com/thuml/ContextWM +
+
+
+
+
+ + ♻ ☆ Accented Speech Recognition With Accent-specific Codebooks EMNLP 2023 + + +
+ Speech accents pose a significant challenge to state-of-the-art automatic +speech recognition (ASR) systems. Degradation in performance across +underrepresented accents is a severe deterrent to the inclusive adoption of +ASR. In this work, we propose a novel accent adaptation approach for end-to-end +ASR systems using cross-attention with a trainable set of codebooks. These +learnable codebooks capture accent-specific information and are integrated +within the ASR encoder layers. The model is trained on accented English speech, +while the test data also contained accents which were not seen during training. +On the Mozilla Common Voice multi-accented dataset, we show that our proposed +approach yields significant performance gains not only on the seen English +accents (up to $37\%$ relative improvement in word error rate) but also on the +unseen accents (up to $5\%$ relative improvement in WER). Further, we +illustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We +also compare the performance with other approaches based on accent adversarial +training. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference (Long Paper) +
+
+
+
+
+ + ♻ ☆ Vicarious Offense and Noise Audit of Offensive Speech Classifiers: + Unifying Human and Machine Disagreement on What is Offensive EMNLP 2023 + + +
+ Offensive speech detection is a key component of content moderation. However, +what is offensive can be highly subjective. This paper investigates how machine +and human moderators disagree on what is offensive when it comes to real-world +social web political discourse. We show that (1) there is extensive +disagreement among the moderators (humans and machines); and (2) human and +large-language-model classifiers are unable to predict how other human raters +will respond, based on their political leanings. For (1), we conduct a noise +audit at an unprecedented scale that combines both machine and human responses. +For (2), we introduce a first-of-its-kind dataset of vicarious offense. Our +noise audit reveals that moderation outcomes vary wildly across different +machine moderators. Our experiments with human moderators suggest that +political leanings combined with sensitive issues affect both first-person and +vicarious offense. The dataset is available through +https://github.com/Homan-Lab/voiced. + +
+
+ comment: Accepted to appear at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ The Expressive Power of Low-Rank Adaptation + + +
+ Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that +leverages low-rank adaptation of weight matrices, has emerged as a prevalent +technique for fine-tuning pre-trained models such as large language models and +diffusion models. Despite its huge success in practice, the theoretical +underpinnings of LoRA have largely remained unexplored. This paper takes the +first step to bridge this gap by theoretically analyzing the expressive power +of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any +model $f$ to accurately represent any smaller target model $\overline{f}$ if +LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of +}\overline{f}}{\text{depth of }f}$. We also quantify the approximation error +when LoRA-rank is lower than the threshold. For Transformer networks, we show +any model can be adapted to a target model of the same size with +rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. + +
+
+ comment: 40 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ SageFormer: Series-Aware Framework for Long-term Multivariate Time + Series Forecasting + + +
+ In the burgeoning ecosystem of Internet of Things, multivariate time series +(MTS) data has become ubiquitous, highlighting the fundamental role of time +series forecasting across numerous applications. The crucial challenge of +long-term MTS forecasting requires adept models capable of capturing both +intra- and inter-series dependencies. Recent advancements in deep learning, +notably Transformers, have shown promise. However, many prevailing methods +either marginalize inter-series dependencies or overlook them entirely. To +bridge this gap, this paper introduces a novel series-aware framework, +explicitly designed to emphasize the significance of such dependencies. At the +heart of this framework lies our specific implementation: the SageFormer. As a +Series-aware Graph-enhanced Transformer model, SageFormer proficiently discerns +and models the intricate relationships between series using graph structures. +Beyond capturing diverse temporal patterns, it also curtails redundant +information across series. Notably, the series-aware framework seamlessly +integrates with existing Transformer-based models, enriching their ability to +comprehend inter-series relationships. Extensive experiments on real-world and +synthetic datasets validate the superior performance of SageFormer against +contemporary state-of-the-art approaches. + +
+
+ comment: under review +
+
+
+
+
+ + ♻ ☆ TACO: Temporal Latent Action-Driven Contrastive Loss for Visual + Reinforcement Learning NeurIPS 2023 + + +
+ Despite recent progress in reinforcement learning (RL) from raw pixel data, +sample inefficiency continues to present a substantial obstacle. Prior works +have attempted to address this challenge by creating self-supervised auxiliary +tasks, aiming to enrich the agent's learned representations with +control-relevant information for future state prediction. However, these +objectives are often insufficient to learn representations that can represent +the optimal policy or value function, and they often consider tasks with small, +abstract discrete action spaces and thus overlook the importance of action +representation learning in continuous control. In this paper, we introduce +TACO: Temporal Action-driven Contrastive Learning, a simple yet powerful +temporal contrastive learning approach that facilitates the concurrent +acquisition of latent state and action representations for agents. TACO +simultaneously learns a state and an action representation by optimizing the +mutual information between representations of current states paired with action +sequences and representations of the corresponding future states. +Theoretically, TACO can be shown to learn state and action representations that +encompass sufficient information for control, thereby improving sample +efficiency. For online RL, TACO achieves 40% performance boost after one +million environment interaction steps on average across nine challenging visual +continuous control tasks from Deepmind Control Suite. In addition, we show that +TACO can also serve as a plug-and-play module adding to existing offline visual +RL methods to establish the new state-of-the-art performance for offline visual +RL across offline datasets with varying quality. + +
+
+ comment: Accepted at 37th Conference on Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ A simple uniformly optimal method without line search for convex + optimization + + +
+ Line search (or backtracking) procedures have been widely employed in +first-order methods for solving convex optimization problems, especially those +with unknown problem parameters (e.g., the Lipschitz constant). In this paper, we +show that line search is superfluous in attaining the optimal rate of +convergence for solving a convex optimization problem whose parameters are not +given a priori. In particular, we present a novel accelerated gradient descent +type algorithm called auto-conditioned fast gradient method (AC-FGM) that can +achieve an optimal $\mathcal{O}(1/k^2)$ rate of convergence for smooth convex +optimization without requiring an estimate of a global Lipschitz constant or +the employment of line search procedures. We then extend AC-FGM to solve convex +optimization problems with H\"{o}lder continuous gradients and show that it +automatically achieves the optimal rates of convergence uniformly for all +problem classes with the desired accuracy of the solution as the only input. +Finally, we report some encouraging numerical results that demonstrate the +advantages of AC-FGM over the previously developed parameter-free methods for +convex optimization. + +
+
+
+
+
+ + ♻ ☆ HYTREL: Hypergraph-enhanced Tabular Data Representation Learning NeurIPS 2023 + + +
+ Language models pretrained on large collections of tabular data have +demonstrated their effectiveness in several downstream tasks. However, many of +these models do not take into account the row/column permutation invariances, +hierarchical structure, etc. that exist in tabular data. To alleviate these +limitations, we propose HYTREL, a tabular language model, that captures the +permutation invariances and three more structural properties of tabular data by +using hypergraphs - where the table cells make up the nodes and the cells +occurring jointly together in each row, column, and the entire table are used +to form three different types of hyperedges. We show that HYTREL is maximally +invariant under certain conditions for tabular data, i.e., two tables obtain +the same representations via HYTREL iff the two tables are identical up to +permutations. Our empirical results demonstrate that HYTREL consistently +outperforms other competitive baselines on four downstream tasks with minimal +pretraining, illustrating the advantages of incorporating the inductive biases +associated with tabular data into the representations. Finally, our qualitative +analyses showcase that HYTREL can assimilate the table structures to generate +robust representations for the cells, rows, columns, and the entire table. + +
+
+ comment: NeurIPS 2023 (spotlight) +
+
+
+
+
+ + ♻ ☆ Visual Programming for Text-to-Image Generation and Evaluation NeurIPS 2023 + + +
+ As large language models have demonstrated impressive performance in many +domains, recent works have adopted language models (LMs) as controllers of +visual modules for vision-and-language tasks. While existing work focuses on +equipping LMs with visual understanding, we propose two novel +interpretable/explainable visual programming frameworks for text-to-image (T2I) +generation and evaluation. First, we introduce VPGen, an interpretable +step-by-step T2I generation framework that decomposes T2I generation into three +steps: object/count generation, layout generation, and image generation. We +employ an LM to handle the first two steps (object/count generation and layout +generation), by finetuning it on text-layout pairs. Our step-by-step T2I +generation framework provides stronger spatial control than end-to-end models, +the dominant approach for this task. Furthermore, we leverage the world +knowledge of pretrained LMs, overcoming the limitation of previous +layout-guided T2I works that can only handle predefined object classes. We +demonstrate that our VPGen has improved control in counts/spatial +relations/scales of objects than state-of-the-art T2I generation models. +Second, we introduce VPEval, an interpretable and explainable evaluation +framework for T2I generation based on visual programming. Unlike previous T2I +evaluations with a single scoring model that is accurate in some skills but +unreliable in others, VPEval produces evaluation programs that invoke a set of +visual modules that are experts in different skills, and also provides +visual+textual explanations of the evaluation results. Our analysis shows that +VPEval provides a more human-correlated evaluation for skill-specific and +open-ended prompts than widely used single model-based evaluation. We hope that +our work encourages future progress on interpretable/explainable generation and +evaluation for T2I models. + +
+
+ comment: NeurIPS 2023; Project website: https://vp-t2i.github.io +
+
+
+
+
+ + ♻ ☆ Understanding Code Semantics: An Evaluation of Transformer Models in + Summarization EMNLP 2023 + + +
+ This paper delves into the intricacies of code summarization using advanced +transformer-based language models. Through empirical studies, we evaluate the +efficacy of code summarization by altering function and variable names to +explore whether models truly understand code semantics or merely rely on +textual cues. We have also introduced adversaries like dead code and commented +code across three programming languages (Python, Javascript, and Java) to +further scrutinize the model's understanding. Ultimately, our research aims to +offer valuable insights into the inner workings of transformer-based LMs, +enhancing their ability to understand code and contributing to more efficient +software development practices and maintenance workflows. + +
+
+ comment: Accepted at GenBench, EMNLP 2023. All authors are co-first authors + and have equal contributions +
+
+
+
+
+ + ♻ ☆ TIES-Merging: Resolving Interference When Merging Models NeurIPS 2023 + + +
+ Transfer learning - i.e., further fine-tuning a pre-trained model on a +downstream task - can confer significant advantages, including improved +downstream performance, faster convergence, and better sample efficiency. These +advantages have led to a proliferation of task-specific fine-tuned models, +which typically can only perform a single task and do not benefit from one +another. Recently, model merging techniques have emerged as a solution to +combine multiple task-specific models into a single multitask model without +performing additional training. However, existing merging methods often ignore +the interference between parameters of different models, resulting in large +performance drops when merging multiple models. In this paper, we demonstrate +that prior merging techniques inadvertently lose valuable information due to +two major sources of interference: (a) interference due to redundant parameter +values and (b) disagreement on the sign of a given parameter's values across +models. To address this, we propose our method, TRIM, ELECT SIGN & MERGE +(TIES-Merging), which introduces three novel steps when merging models: (1) +resetting parameters that only changed a small amount during fine-tuning, (2) +resolving sign conflicts, and (3) merging only the parameters that are in +alignment with the final agreed-upon sign. We find that TIES-Merging +outperforms several existing methods in diverse settings covering a range of +modalities, domains, number of tasks, model sizes, architectures, and +fine-tuning settings. We further analyze the impact of different types of +interference on model parameters, and highlight the importance of resolving +sign interference. Our code is available at +https://github.com/prateeky2806/ties-merging + +
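The three steps named in the abstract (trim, elect sign, merge) can be illustrated with a minimal pure-Python sketch on flat parameter lists. This is not the authors' implementation: the function name `ties_merge`, the `keep_fraction` and `scale` parameters, the top-magnitude trimming rule, and the choice of averaging the agreeing entries are all assumptions made for illustration.

```python
def ties_merge(base, finetuned, keep_fraction=0.2, scale=1.0):
    """Merge fine-tuned parameter vectors into one vector (illustrative sketch).

    base: list of floats, the shared pre-trained parameters.
    finetuned: list of lists of floats, one per task-specific model.
    keep_fraction, scale: hypothetical knobs, not taken from the abstract.
    """
    # Task vectors: how far each fine-tuned model moved away from the base.
    deltas = [[f - b for f, b in zip(model, base)] for model in finetuned]

    # Step 1 (trim): reset all but the largest-magnitude changes to zero.
    keep = max(1, int(round(keep_fraction * len(base))))
    trimmed = []
    for delta in deltas:
        threshold = sorted((abs(d) for d in delta), reverse=True)[keep - 1]
        trimmed.append([d if d != 0 and abs(d) >= threshold else 0.0 for d in delta])

    merged = []
    for i, b in enumerate(base):
        column = [t[i] for t in trimmed]
        # Step 2 (elect sign): the sign carrying the larger total magnitude wins.
        total = sum(column)
        sign = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        # Step 3 (merge): average only the entries that agree with that sign.
        agreeing = [c for c in column if c * sign > 0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * update)
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0]
    models = [[1.0, 0.01, -2.0, 0.0], [3.0, -0.02, 0.5, 0.0]]
    print(ties_merge(base, models, keep_fraction=0.5))
```

In the toy call, the third parameter has a sign conflict (-2.0 versus 0.5); the negative side has more mass, so only -2.0 contributes, whereas a plain average would have pulled it toward zero.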
+
+ comment: Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables +
+
+
+
+
+ + ♻ ☆ De-novo Chemical Reaction Generation by Means of Temporarily + Convolutional Neural Networks + + +
+ We present here a combination of two networks, Recurrent Neural Networks +(RNN) and Temporarily Convolutional Neural Networks (TCN) in de novo reaction +generation using the novel Reaction Smiles-like representation of reactions +(CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks +are known for their autoregressive properties and are frequently used in +language modelling with direct application to SMILES generation. The relatively +novel TCNs possess similar properties with wide receptive field while obeying +the causality required for natural language processing (NLP). The combination +of both latent representations expressed through TCN and RNN results in an +overall better performance compared to RNN alone. Additionally, it is shown +that different fine-tuning protocols have a profound impact on generative scope +of the model when applied on a dataset of interest via transfer learning. + +
+
+
+
+
+ + ♻ ☆ Synthetic Experience Replay NeurIPS + + +
+ A key theme in the past decade has been that when large neural networks and +large datasets combine they can produce remarkable results. In deep +reinforcement learning (RL), this paradigm is commonly made possible through +experience replay, whereby a dataset of past experiences is used to train a +policy or value function. However, unlike in supervised or self-supervised +learning, an RL agent has to collect its own data, which is often limited. +Thus, it is challenging to reap the benefits of deep learning, and even small +neural networks can overfit at the start of training. In this work, we leverage +the tremendous recent progress in generative modeling and propose Synthetic +Experience Replay (SynthER), a diffusion-based approach to flexibly upsample an +agent's collected experience. We show that SynthER is an effective method for +training RL agents across offline and online settings, in both proprioceptive +and pixel-based environments. In offline settings, we observe drastic +improvements when upsampling small offline datasets and see that additional +synthetic data also allows us to effectively train larger networks. +Furthermore, SynthER enables online agents to train with a much higher +update-to-data ratio than before, leading to a significant increase in sample +efficiency, without any algorithmic changes. We believe that synthetic training +data could open the door to realizing the full potential of deep learning for +replay-based RL algorithms from limited data. Finally, we open-source our code +at https://github.com/conglu1997/SynthER. + +
+
+ comment: Published at NeurIPS, 2023 +
+
+
+
+
+ + ♻ ☆ Controlling Text-to-Image Diffusion by Orthogonal Finetuning NeurIPS 2023 + + +
+ Large text-to-image diffusion models have impressive capabilities in +generating photorealistic images from text prompts. How to effectively guide or +control these powerful models to perform different downstream tasks becomes an +important open problem. To tackle this challenge, we introduce a principled +finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image +diffusion models to downstream tasks. Unlike existing methods, OFT can provably +preserve hyperspherical energy which characterizes the pairwise neuron +relationship on the unit hypersphere. We find that this property is crucial for +preserving the semantic generation ability of text-to-image diffusion models. +To improve finetuning stability, we further propose Constrained Orthogonal +Finetuning (COFT) which imposes an additional radius constraint to the +hypersphere. Specifically, we consider two important finetuning text-to-image +tasks: subject-driven generation where the goal is to generate subject-specific +images given a few images of a subject and a text prompt, and controllable +generation where the goal is to enable the model to take in additional control +signals. We empirically show that our OFT framework outperforms existing +methods in generation quality and convergence speed. + +
+
+ comment: NeurIPS 2023 (43 pages, 34 figures, project page: + https://oft.wyliu.com/) +
+
+
+
+
+ + ♻ ☆ Replicable Clustering NeurIPS 2023 + + +
+ We design replicable algorithms in the context of statistical clustering +under the recently introduced notion of replicability from Impagliazzo et al. +[2022]. According to this definition, a clustering algorithm is replicable if, +with high probability, its output induces the exact same partition of the +sample space after two executions on different inputs drawn from the same +distribution, when its internal randomness is shared across the executions. We +propose such algorithms for the statistical $k$-medians, statistical $k$-means, +and statistical $k$-centers problems by utilizing approximation routines for +their combinatorial counterparts in a black-box manner. In particular, we +demonstrate a replicable $O(1)$-approximation algorithm for statistical +Euclidean $k$-medians ($k$-means) with $\operatorname{poly}(d)$ sample +complexity. We also describe an $O(1)$-approximation algorithm with an +additional $O(1)$-additive error for statistical Euclidean $k$-centers, albeit +with $\exp(d)$ sample complexity. In addition, we provide experiments on +synthetic distributions in 2D using the $k$-means++ implementation from sklearn +as a black-box that validate our theoretical results. + +
+
+ comment: to be published in NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Practical Contextual Bandits with Feedback Graphs + + +
+ While contextual bandits have a mature theory, how to effectively leverage different +feedback patterns to speed up learning remains unclear. The setting of bandits with +feedback graphs, which interpolates between the full-information and bandit +regimes, provides a promising framework to mitigate the statistical complexity +of learning. In this paper, we propose and analyze an approach to contextual +bandits with feedback graphs based upon reduction to regression. The resulting +algorithms are computationally practical and achieve established minimax rates, +thereby reducing the statistical complexity in real-world applications. + +
+
+
+
+
+
+
+
+ + Multimedia 5 + +
+
+
+ + ☆ Enabling Acoustic Audience Feedback in Large Virtual Events + + +
+ The COVID-19 pandemic shifted many events in our daily lives into the virtual +domain. While virtual conference systems provide an alternative to physical +meetings, larger events require a muted audience to avoid an accumulation of +background noise and distorted audio. However, performing artists strongly rely +on the feedback of their audience. We propose a concept for a virtual audience +framework which supports all participants with the ambience of a real audience. +Audience feedback is collected locally, allowing users to express enthusiasm or +discontent by selecting means such as clapping, whistling, booing, and +laughter. This feedback is sent as abstract information to a virtual audience +server. We broadcast the combined virtual audience feedback information to all +participants, and each client can synthesize it into a single acoustic feedback +signal. The synthesis can be done by turning the collective audience feedback +into a prompt that is fed to state-of-the-art models such as AudioGen. This +way, each user hears a single acoustic feedback sound of the entire virtual +event, without needing to unmute or risking distorted, unsynchronized +feedback. + +
+
+ comment: 4 pages, 2 figures +
+
+
+
+
+ + ☆ Improved Lossless Coding for Storage and Transmission of Multichannel + Immersive Audio + + +
+ In this paper, techniques for improving multichannel lossless coding are +examined. A method is proposed for the simultaneous coding of two or more +different renderings (mixes) of the same content. The signal model uses both +past samples of the upmix, and the current time samples of downmix samples to +predict the upmix. Model parameters are optimized via a general linear solver, +and the prediction residual is Rice coded. Additionally, the use of an SVD +projection prior to residual coding is proposed. A comparison is made against +various baselines, including FLAC. The proposed methods show improved +compression ratios for the storage and transmission of immersive audio. + +
+
+
+
+
+ + ☆ Large-scale Foundation Models and Generative AI for BigData Neuroscience + + +
+ Recent advances in machine learning have made revolutionary breakthroughs in +computer games, image and natural language understanding, and scientific +discovery. Foundation models and large-scale language models (LLMs) have +recently achieved human-like intelligence thanks to BigData. With the help of +self-supervised learning (SSL) and transfer learning, these models may +potentially reshape the landscapes of neuroscience research and make a +significant impact on the future. Here we present a mini-review on recent +advances in foundation models and generative AI models as well as their +applications in neuroscience, including natural language and speech, semantic +memory, brain-machine interfaces (BMIs), and data augmentation. We argue that +this paradigm-shift framework will open new avenues for many neuroscience +research directions and discuss the accompanying challenges and opportunities. + +
+
+
+
+
+ + ♻ ☆ Separate Anything You Describe + + +
+ Language-queried audio source separation (LASS) is a new paradigm for +computational auditory scene analysis (CASA). LASS aims to separate a target +sound from an audio mixture given a natural language query, which provides a +natural and scalable interface for digital audio applications. Recent works on +LASS, despite attaining promising separation performance on specific sources +(e.g., musical instruments, limited classes of audio events), are unable to +separate audio concepts in the open domain. In this work, we introduce +AudioSep, a foundation model for open-domain audio source separation with +natural language queries. We train AudioSep on large-scale multimodal datasets +and extensively evaluate its capabilities on numerous tasks including audio +event separation, musical instrument separation, and speech enhancement. +AudioSep demonstrates strong separation performance and impressive zero-shot +generalization ability using audio captions or text labels as queries, +substantially outperforming previous audio-queried and language-queried sound +separation models. For reproducibility of this work, we will release the source +code, evaluation benchmark and pre-trained model at: +https://github.com/Audio-AGI/AudioSep. + +
+
+ comment: Code, benchmark and pre-trained models: + https://github.com/Audio-AGI/AudioSep +
+
+
+
+
+ + ♻ ☆ Cross-Modal Retrieval: A Systematic Review of Methods and Future + Directions + + +
+ With the exponential surge in diverse multi-modal data, traditional uni-modal +retrieval methods struggle to meet the needs of users demanding access to data +from various modalities. To address this, cross-modal retrieval has emerged, +enabling interaction across modalities, facilitating semantic matching, and +leveraging complementarity and consistency between different modal data. +Although prior literature undertook a review of the cross-modal retrieval +field, it exhibits numerous deficiencies pertaining to timeliness, taxonomy, +and comprehensiveness. This paper conducts a comprehensive review of +cross-modal retrieval's evolution, spanning from shallow statistical analysis +techniques to vision-language pre-training models. Commencing with a +comprehensive taxonomy grounded in machine learning paradigms, mechanisms, and +models, the paper then delves deeply into the principles and architectures +underpinning existing cross-modal retrieval methods. Furthermore, it offers an +overview of widely used benchmarks, metrics, and performances. Lastly, the +paper probes the prospects and challenges that confront contemporary +cross-modal retrieval, while engaging in a discourse on potential directions +for further progress in the field. To facilitate the research on cross-modal +retrieval, we develop an open-source code repository at +https://github.com/BMC-SDNU/Cross-Modal-Retrieval. + +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 144 + +
+
+
+ + ☆ torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free + Deep Learning Studies: A Case Study on NLP EMNLP 2023 + + +
+ Reproducibility in scientific work has become increasingly important +in research communities such as machine learning, natural language processing, +and computer vision due to the rapid development of these research +domains supported by recent advances in deep learning. In this work, we present +a significantly upgraded version of torchdistill, a modular-driven, coding-free +deep learning framework whose initial release +supports only image classification and object detection tasks for reproducible +knowledge distillation experiments. To demonstrate that the upgraded framework +can support more tasks with third-party libraries, we reproduce the GLUE +benchmark results of BERT models using a script based on the upgraded +torchdistill, harmonizing with various Hugging Face libraries. All 27 +fine-tuned BERT models and the configurations to reproduce the results are +published at Hugging Face, and the model weights have already been widely used +in research communities. We also reimplement popular small-sized models and new +knowledge distillation methods and perform additional experiments for computer +vision tasks. + +
+
+ comment: Accepted at the 3rd Workshop for Natural Language Processing Open + Source Software (NLP-OSS) at EMNLP 2023 +
+
+
+
+
+ + ☆ In-Context Learning Dynamics with Random Binary Sequences + + +
+ Large language models (LLMs) trained on huge corpora of text datasets +demonstrate complex, emergent capabilities, achieving state-of-the-art +performance on tasks they were not explicitly trained for. The precise nature +of LLM capabilities is often mysterious, and different prompts can elicit +different capabilities through in-context learning. We propose a Cognitive +Interpretability framework that enables us to analyze in-context learning +dynamics to understand latent concepts in LLMs underlying behavioral patterns. +This provides a more nuanced understanding than success-or-failure evaluation +benchmarks, but does not require observing internal activations as a +mechanistic interpretation of circuits would. Inspired by the cognitive science +of human randomness perception, we use random binary sequences as context and +study dynamics of in-context learning by manipulating properties of context +data, such as sequence length. In the latest GPT-3.5+ models, we find emergent +abilities to generate pseudo-random numbers and learn basic formal languages, +with striking in-context learning dynamics where model outputs transition +sharply from pseudo-random behaviors to deterministic repetition. + +
+
+
+
+
+ + ☆ JudgeLM: Fine-tuned Large Language Models are Scalable Judges + + +
+ Evaluating Large Language Models (LLMs) in open-ended scenarios is +challenging because existing benchmarks and metrics cannot measure them +comprehensively. To address this problem, we propose to fine-tune LLMs as +scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in +open-ended benchmarks. We first propose a comprehensive, large-scale, +high-quality dataset containing task seeds, LLM-generated answers, and +GPT-4-generated judgments for fine-tuning high-performance judges, as well as a +new benchmark for evaluating the judges. We train JudgeLM at different scales +(7B, 13B, and 33B parameters) and conduct a systematic analysis of its +capabilities and behaviors. We then analyze the key biases in fine-tuning an LLM +as a judge and categorize them as position bias, knowledge bias, and format bias. +To address these issues, JudgeLM introduces a bag of techniques including swap +augmentation, reference support, and reference drop, which clearly enhance the +judge's performance. JudgeLM obtains state-of-the-art judge performance on +both the existing PandaLM benchmark and our proposed new benchmark. JudgeLM +is efficient: JudgeLM-7B needs only 3 minutes to judge 5K samples with 8 +A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an +agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM +also demonstrates extended capabilities as a judge of single answers, +multimodal models, multiple answers, and multi-turn chat. + +
+
+ comment: 30 pages, 23 figures +
+
+
+
+
+ + ☆ InstOptima: Evolutionary Multi-objective Instruction Optimization via + Large Language Model-based Instruction Operators EMNLP + + +
+ Instruction-based language modeling has received significant attention in +pretrained language models. However, the efficiency of instruction engineering +remains low and hinders the development of instruction studies. Recent studies +have focused on automating instruction generation, but they primarily aim to +improve performance without considering other crucial objectives that impact +instruction quality, such as instruction length and perplexity. Therefore, we +propose a novel approach (i.e., InstOptima) that treats instruction generation +as an evolutionary multi-objective optimization problem. In contrast to text +edition-based methods, our approach utilizes a large language model (LLM) to +simulate instruction operators, including mutation and crossover. Furthermore, +we introduce an objective-guided mechanism for these operators, allowing the +LLM to comprehend the objectives and enhance the quality of the generated +instructions. Experimental results demonstrate improved fine-tuning performance +and the generation of a diverse set of high-quality instructions. + +
+
+ comment: Accepted by EMNLP Findings +
+
+
+
+
+ + ☆ Proving Test Set Contamination in Black Box Language Models + + +
+ Large language models are trained on vast amounts of internet data, prompting +concerns and speculation that they have memorized public benchmarks. Going from +speculation to proof of contamination is challenging, as the pretraining data +used by proprietary models are often not publicly accessible. We show that it +is possible to provide provable guarantees of test set contamination in +language models without access to pretraining data or model weights. Our +approach leverages the fact that when there is no data contamination, all +orderings of an exchangeable benchmark should be equally likely. In contrast, +the tendency for language models to memorize example order means that a +contaminated language model will find certain canonical orderings to be much +more likely than others. Our test flags potential contamination whenever the +likelihood of a canonically ordered benchmark dataset is significantly higher +than the likelihood after shuffling the examples. We demonstrate that our +procedure is sensitive enough to reliably prove test set contamination in +challenging situations, including models as small as 1.4 billion parameters, on +small test sets of only 1000 examples, and datasets that appear only a few +times in the pretraining corpus. Using our test, we audit five popular publicly +accessible language models for test set contamination and find little evidence +for pervasive contamination. + +
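The core idea in the abstract, comparing the likelihood of the canonical example order against shuffled orders, can be sketched as a plain permutation test. This is a simplified illustration and not the paper's exact procedure: `contamination_p_value`, `log_likelihood`, `num_shuffles`, and `seed` are hypothetical names, and the scoring callable stands in for a real model's log-likelihood of the concatenated examples.

```python
import random


def contamination_p_value(examples, log_likelihood, num_shuffles=99, seed=0):
    """Permutation-test p-value for 'the canonical order is unusually likely'.

    examples: benchmark examples in their published (canonical) order.
    log_likelihood: hypothetical callable scoring a sequence of examples;
        in practice this would query a language model.
    """
    rng = random.Random(seed)
    canonical_score = log_likelihood(examples)

    # Count shuffles that score at least as high as the canonical order.
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical_score:
            at_least_as_high += 1

    # Add-one correction keeps the p-value strictly positive.
    return (1 + at_least_as_high) / (1 + num_shuffles)


def toy_memorizing_scorer(sequence):
    """Toy stand-in for a contaminated model: rewards canonical adjacency."""
    return sum(1 for a, b in zip(sequence, sequence[1:]) if b == a + 1)


if __name__ == "__main__":
    data = list(range(20))
    print(contamination_p_value(data, toy_memorizing_scorer))
    print(contamination_p_value(data, lambda seq: 0.0))
```

With an order-insensitive scorer every shuffle ties the canonical score, so the p-value is 1.0 and nothing is flagged; a scorer that has memorized the ordering yields a small p-value.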
+
+
+
+
+ + ☆ Uncovering Meanings of Embeddings via Partial Orthogonality + + +
+ Machine learning tools often rely on embedding text as vectors of real +numbers. In this paper, we study how the semantic structure of language is +encoded in the algebraic structure of such embeddings. Specifically, we look at +a notion of ``semantic independence'' capturing the idea that, e.g., +``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such +examples are intuitive, it is difficult to formalize such a notion of semantic +independence. The key observation here is that any sensible formalization +should obey a set of so-called independence axioms, and thus any algebraic +encoding of this structure should also obey these axioms. This leads us +naturally to use partial orthogonality as the relevant algebraic structure. We +develop theory and methods that allow us to demonstrate that partial +orthogonality does indeed capture semantic independence. Complementary to this, +we also introduce the concept of independence preserving embeddings where +embeddings preserve the conditional independence structures of a distribution, +and we prove the existence of such embeddings and approximations to them. + +
+
+
+
+
+ + ☆ LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset + + +
+ As an important component of intelligent legal systems, legal case retrieval +plays a critical role in ensuring judicial justice and fairness. However, the +development of legal case retrieval technologies in the Chinese legal system is +restricted by three problems in existing datasets: limited data size, narrow +definitions of legal relevance, and naive candidate pooling strategies used in +data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale +Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 +candidates extracted from 4.3 million criminal case documents. To the best of +our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval +datasets, providing extensive coverage of criminal charges. Additionally, we +enrich the existing relevance criteria by considering three key aspects: +characterization, penalty, and procedure. These comprehensive criteria enrich the +dataset and may provide a more holistic perspective. Furthermore, we propose a +two-level candidate set pooling strategy that effectively identifies potential +candidates for each query case. It's important to note that all cases in the +dataset have been annotated by multiple legal experts specializing in criminal +law. Their expertise ensures the accuracy and reliability of the annotations. +We evaluate several state-of-the-art retrieval models on LeCaRDv2, +demonstrating that there is still significant room for improvement in legal +case retrieval. The details of LeCaRDv2 can be found at the anonymous website +https://github.com/anonymous1113243/LeCaRDv2. + +
+
+
+
+
+ + ☆ Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in + Ghana + + +
+ This paper reports on a set of three recent experiments utilizing large-scale +speech models to evaluate the oral reading fluency (ORF) of students in Ghana. +While ORF is a well-established measure of foundational literacy, assessing it +typically requires one-on-one sessions between a student and a trained +evaluator, a process that is time-consuming and costly. Automating the +evaluation of ORF could support better literacy instruction, particularly in +education contexts where formative assessment is uncommon due to large class +sizes and limited resources. To our knowledge, this research is among the first +to examine the use of the most recent versions of large-scale speech models +(Whisper V2 and wav2vec2.0) for ORF assessment in the Global South. + We find that Whisper V2 produces transcriptions of Ghanaian students reading +aloud with a Word Error Rate of 13.5. This is close to the model's average WER +on adult speech (12.8) and would have been considered state-of-the-art for +children's speech transcription only a few years ago. We also find that when +these transcriptions are used to produce fully automated ORF scores, they +closely align with scores generated by expert human graders, with a correlation +coefficient of 0.96. Importantly, these results were achieved on a +representative dataset (i.e., students with regional accents, recordings taken +in actual classrooms), using a free and publicly available speech model out of +the box (i.e., no fine-tuning). This suggests that using large-scale speech +models to assess ORF may be feasible to implement and scale in lower-resource, +linguistically diverse educational contexts. + +
+
+
+
+
+ + ☆ Lil-Bevo: Explorations of Strategies for Training Language Models in + More Humanlike Ways + + +
+ We present Lil-Bevo, our submission to the BabyLM Challenge. We pretrained +our masked language models with three ingredients: an initial pretraining with +music data, training on shorter sequences before training on longer ones, and +masking specific tokens to target some of the BLiMP subtasks. Overall, our +baseline models performed above chance, but far below the performance levels of +larger LLMs trained on more data. We found that training on short sequences +performed better than training on longer sequences. Pretraining on music may +help performance marginally, but, if so, the effect seems small. Our targeted +Masked Language Modeling augmentation did not seem to improve model performance +in general, but did seem to help on some of the specific BLiMP tasks that we +were targeting (e.g., Negative Polarity Items). Training performant LLMs on +small amounts of data is a difficult but potentially informative task. While +some of our techniques showed some promise, more work is needed to explore +whether they can improve performance more than the modest gains here. Our code +is available at https://github.com/venkatasg/Lil-Bevo and our models at +https://huggingface.co/collections/venkatasg/babylm-653591cdb66f4bf68922873a + +
+
+ comment: Proceedings of the BabyLM Challenge +
+
+
+
+
+ + ☆ An Open Source Data Contamination Report for Llama Series Models + + +
+ Data contamination in language model evaluation is increasingly prevalent as +the popularity of large language models grows. It allows models to "cheat" via +memorisation instead of displaying true capabilities. Therefore, contamination +analysis has become a crucial part of reliable model evaluation to validate +results. However, existing contamination analysis is usually conducted +internally by LLM developers and often lacks transparency and completeness. +This paper presents an open source data contamination report for the Llama +series models. We analyse six popular multi-choice QA benchmarks and quantify +their overlap with the training set of Llama. Various levels of +contamination ranging from 1\% to 8.7\% are found across benchmarks. Our +comparison also reveals that Llama models can gain over 5\% higher accuracy on +contaminated subsets versus clean subsets. Data and code are available at: +https://github.com/liyucheng09/Contamination_Detector. + +
+
+
+
+
+ + ☆ PAC-tuning: Fine-tuning Pretrained Language Models with PAC-driven + Perturbed Gradient Descent EMNLP23 + + +
+ Fine-tuning pretrained language models (PLMs) for downstream tasks is a +large-scale optimization problem, in which the choice of the training algorithm +critically determines how well the trained model can generalize to unseen test +data, especially in the context of few-shot learning. To achieve good +generalization performance and avoid overfitting, techniques such as data +augmentation and pruning are often applied. However, adding these +regularizations necessitates heavy tuning of the hyperparameters of +optimization algorithms, such as the popular Adam optimizer. In this paper, we +propose a two-stage fine-tuning method, PAC-tuning, to address this +optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly +minimizes the PAC-Bayes generalization bound to learn proper parameter +distribution. Second, PAC-tuning modifies the gradient by injecting noise with +the variance learned in the first stage into the model parameters during +training, resulting in a variant of perturbed gradient descent (PGD). In the +past, the few-shot scenario posed difficulties for PAC-Bayes training because +the PAC-Bayes bound, when applied to large models with limited training data, +might not be stringent. Our experimental results across 5 GLUE benchmark tasks +demonstrate that PAC-tuning successfully handles the challenges of fine-tuning +tasks and outperforms strong baseline methods by a visible margin, further +confirming the potential to apply PAC training for any other settings where the +Adam optimizer is currently used for training. + +
+
+ comment: Accepted to EMNLP23 main +
+
+
+
+
+ + ☆ Global Voices, Local Biases: Socio-Cultural Prejudices across Languages EMNLP 2023 + + +
+ Human biases are ubiquitous but not uniform: disparities exist across +linguistic, cultural, and societal borders. As large amounts of recent +literature suggest, language models (LMs) trained on human data can reflect and +often amplify the effects of these social biases. However, the vast majority of +existing studies on bias are heavily skewed towards Western and European +languages. In this work, we scale the Word Embedding Association Test (WEAT) to +24 languages, enabling broader studies and yielding interesting findings about +LM bias. We additionally enhance this data with culturally relevant information +for each language, capturing local contexts on a global scale. Further, to +encompass more widely prevalent societal biases, we examine new bias dimensions +across toxicity, ableism, and more. Moreover, we delve deeper into the Indian +linguistic landscape, conducting a comprehensive regional bias analysis across +six prevalent Indian languages. Finally, we highlight the significance of these +social biases and the new dimensions through an extensive comparison of +embedding methods, reinforcing the need to address them in pursuit of more +equitable language models. All code, data and results are available here: +https://github.com/iamshnoo/weathub. + +
+
+ comment: accepted at EMNLP 2023 +
+
+
+
+
+ + ☆ 1D-Touch: NLP-Assisted Coarse Text Selection via a Semi-Direct Gesture + + +
+ Existing text selection techniques on touchscreen focus on improving the +control for moving the carets. Coarse-grained text selection on word and phrase +levels has not received much support beyond word-snapping and entity +recognition. We introduce 1D-Touch, a novel text selection method that +complements the carets-based sub-word selection by facilitating the selection +of semantic units of words and above. This method employs a simple vertical +slide gesture to expand and contract a selection area from a word. The +expansion can be by words or by semantic chunks ranging from sub-phrases to +sentences. This technique shifts the concept of text selection, from defining a +range by locating the first and last words, towards a dynamic process of +expanding and contracting a textual semantic entity. To understand the effects +of our approach, we prototyped and tested two variants: WordTouch, which offers +a straightforward word-by-word expansion, and ChunkTouch, which leverages NLP +to chunk text into syntactic units, allowing the selection to grow by +semantically meaningful units in response to the sliding gesture. Our +evaluation, focused on the coarse-grained selection tasks handled by 1D-Touch, +shows a 20% improvement over the default word-snapping selection method on +Android. + +
+
+
+
+
+ + ☆ DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct + Speech-to-Speech Translation EMNLP2023 + + +
+ While Diffusion Generative Models have achieved great success on image +generation tasks, how to efficiently and effectively incorporate them into +speech generation, especially translation tasks, remains a non-trivial problem. +Specifically, due to the low information density of speech data, the +transformed discrete speech unit sequence is much longer than the corresponding +text transcription, posing significant challenges to existing auto-regressive +models. Furthermore, it is not optimal to naively apply discrete diffusion on +the speech unit sequence while disregarding the continuous space structure, +which will degrade the generation performance significantly. In this paper, we +propose a novel diffusion model by applying the diffusion forward process in +the \textit{continuous} speech representation space, while employing the +diffusion backward process in the \textit{discrete} speech unit space. In this +way, we preserve the semantic structure of the continuous speech representation +space in the diffusion process and integrate the continuous and discrete +diffusion models. We conduct extensive experiments on the textless direct +speech-to-speech translation task, where the proposed method achieves +comparable results to the computationally intensive auto-regressive baselines +(500 steps on average) with significantly fewer decoding steps (50 steps). + +
+
+ comment: Accepted in EMNLP2023 main conference +
+
+
+
+
+ + ☆ Navigating to Success in Multi-Modal Human-Robot Collaboration: Analysis + and Corpus Release + + +
+ Human-guided robotic exploration is a useful approach to gathering +information at remote locations, especially those that might be too risky, +inhospitable, or inaccessible for humans. Maintaining common ground between the +remotely-located partners is a challenge, one that can be facilitated by +multi-modal communication. In this paper, we explore how participants utilized +multiple modalities to investigate a remote location with the help of a robotic +partner. Participants issued spoken natural language instructions and received +from the robot: text-based feedback, continuous 2D LIDAR mapping, and +upon-request static photographs. We noticed that different strategies were +adopted in terms of use of the modalities, and hypothesize that these +differences may be correlated with success at several exploration sub-tasks. We +found that requesting photos may have improved the identification and counting +of some key entities (doorways in particular) and that this strategy did not +hinder the amount of overall area exploration. Future work with larger samples +may reveal the effects of more nuanced photo and dialogue strategies, which can +inform the training of robotic agents. Additionally, we announce the release of +our unique multi-modal corpus of human-robot communication in an exploration +context: SCOUT, the Situated Corpus on Understanding Transactions. + +
+
+ comment: 7 pages, 3 figures +
+
+
+
+
+ + ☆ Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models + + +
+ With LLMs shifting their role from statistical modeling of language to +serving as general-purpose AI agents, how should LLM evaluations change? +Arguably, a key ability of an AI agent is to flexibly combine, as needed, the +basic skills it has learned. The capability to combine skills plays an +important role in (human) pedagogy and also in a paper on emergence phenomena +(Arora & Goyal, 2023). + This work introduces Skill-Mix, a new evaluation to measure ability to +combine skills. Using a list of $N$ skills the evaluator repeatedly picks +random subsets of $k$ skills and asks the LLM to produce text combining that +subset of skills. Since the number of subsets grows like $N^k$, for even modest +$k$ this evaluation will, with high probability, require the LLM to produce +text significantly different from any text in the training set. The paper +develops a methodology for (a) designing and administering such an evaluation, +and (b) automatic grading (plus spot-checking by humans) of the results using +GPT-4 as well as the open LLaMA-2 70B model. + Administering a version of Skill-Mix to popular chatbots gave results that, while +generally in line with prior expectations, contained surprises. Sizeable +differences exist among model capabilities that are not captured by their +ranking on popular LLM leaderboards ("cramming for the leaderboard"). +Furthermore, simple probability calculations indicate that GPT-4's reasonable +performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior +(Bender et al., 2021), i.e., it combines skills in ways that it had not seen +during training. + We sketch how the methodology can lead to a Skill-Mix based eco-system of +open evaluations for AI capabilities of future models. + +
+
+
+
+
+ + ☆ Towards Matching Phones and Speech Representations + + +
+ Learning phone types from phone instances is a long-standing problem +that remains open. In this work, we revisit this problem in the context of +self-supervised learning, and pose it as the problem of matching cluster +centroids to phone embeddings. We study two key properties that enable +matching, namely, whether cluster centroids of self-supervised representations +reduce the variability of phone instances and respect the relationship among +phones. We then use the matching result to produce pseudo-labels and introduce +a new loss function for improving self-supervised representations. Our +experiments show that the matching result captures the relationship among +phones. Training the new loss function jointly with the regular self-supervised +losses, such as APC and CPC, significantly improves the downstream phone +classification. + +
+
+ comment: Accepted to ASRU 2023 +
+
+
+
+
+ + ☆ Unpacking the Ethical Value Alignment in Big Models + + +
+ Big models have greatly advanced AI's ability to understand, generate, and +manipulate information and content, enabling numerous applications. However, as +these models become increasingly integrated into everyday life, their inherent +ethical values and potential biases pose unforeseen risks to society. This +paper provides an overview of the risks and challenges associated with big +models, surveys existing AI ethics guidelines, and examines the ethical +implications arising from the limitations of these models. Taking a normative +ethics perspective, we propose a reassessment of recent normative guidelines, +highlighting the importance of collaborative efforts in academia to establish a +unified and universal AI ethics framework. Furthermore, we investigate the +moral inclinations of current mainstream LLMs using the Moral Foundation +theory, analyze existing alignment algorithms, and outline the unique +challenges encountered in aligning ethical values within them. To address these +challenges, we introduce a novel conceptual paradigm for aligning the ethical +values of big models and discuss promising research directions for alignment +criteria, evaluation, and method, representing an initial step towards the +interdisciplinary construction of ethically aligned AI. + This paper is a modified English version of our Chinese paper +https://crad.ict.ac.cn/cn/article/doi/10.7544/issn1000-1239.202330553, intended +to help non-Chinese native speakers better understand our work. + +
+
+
+
+
+ + ☆ Evaluating Bias and Fairness in Gender-Neutral Pretrained + Vision-and-Language Models EMNLP 2024 + + +
+ Pretrained machine learning models are known to perpetuate and even amplify +existing biases in data, which can result in unfair outcomes that ultimately +impact user experience. Therefore, it is crucial to understand the mechanisms +behind those prejudicial biases to ensure that model performance does not +result in discriminatory behaviour toward certain groups or populations. In +this work, we define gender bias as our case study. We quantify bias +amplification in pretraining and after fine-tuning on three families of +vision-and-language models. We investigate the connection, if any, between the +two learning stages, and evaluate how bias amplification reflects on model +performance. Overall, we find that bias amplification in pretraining and after +fine-tuning are independent. We then examine the effect of continued +pretraining on gender-neutral data, finding that this reduces group +disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without +significantly compromising task performance. + +
+
+ comment: To appear in EMNLP 2024 +
+
+
+
+
+ + ☆ Can large language models replace humans in the systematic review + process? Evaluating GPT-4's efficacy in screening and extracting data from + peer-reviewed and grey literature in multiple languages + + +
+ Systematic reviews are vital for guiding practice, research, and policy, yet +they are often slow and labour-intensive. Large language models (LLMs) could +offer a way to speed up and automate systematic reviews, but their performance +in such tasks has not been comprehensively evaluated against humans, and no +study has tested GPT-4, the biggest LLM so far. This pre-registered study +evaluates GPT-4's capability in title/abstract screening, full-text review, and +data extraction across various literature types and languages using a +'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human +performance in most tasks, results were skewed by chance agreement and dataset +imbalance. After adjusting for these, there was a moderate level of performance +for data extraction, and - barring studies that used highly reliable prompts - +screening performance levelled at none to moderate for different stages and +languages. When screening full-text literature using highly reliable prompts, +GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key +studies using highly reliable prompts improved its performance even more. Our +findings indicate that, currently, substantial caution should be used if LLMs +are being used to conduct systematic reviews, but suggest that, for certain +systematic review tasks delivered under reliable prompts, LLMs can rival human +performance. + +
+
+ comment: 9 pages, 2 figures, 1 table +
+
+
+
+
+ + ☆ The Validity of Evaluation Results: Assessing Concurrence Across + Compositionality Benchmarks CoNLL2023 + + +
+ NLP models have progressed drastically in recent years, according to numerous +datasets proposed to evaluate performance. Questions remain, however, about how +particular dataset design choices may impact the conclusions we draw about +model capabilities. In this work, we investigate this question in the domain of +compositional generalization. We examine the performance of six modeling +approaches across 4 datasets, split according to 8 compositional splitting +strategies, ranking models by 18 compositional generalization splits in total. +Our results show that: i) the datasets, although all designed to evaluate +compositional generalization, rank modeling approaches differently; ii) +datasets generated by humans align better with each other than they do with +synthetic datasets, or than synthetic datasets among themselves; iii) +generally, whether datasets are sampled from the same source is more predictive +of the resulting model ranking than whether they maintain the same +interpretation of compositionality; and iv) which lexical items are used in the +data can strongly impact conclusions. Overall, our results demonstrate that +much work remains to be done when it comes to assessing whether popular +evaluation datasets measure what they intend to measure, and suggest that +elucidating more rigorous standards for establishing the validity of evaluation +sets could benefit the field. + +
+
+ comment: CoNLL2023 +
+
+
+
+
+ + ☆ The Expressive Power of Low-Rank Adaptation + + +
+ Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that +leverages low-rank adaptation of weight matrices, has emerged as a prevalent +technique for fine-tuning pre-trained models such as large language models and +diffusion models. Despite its huge success in practice, the theoretical +underpinnings of LoRA have largely remained unexplored. This paper takes the +first step to bridge this gap by theoretically analyzing the expressive power +of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any +model $f$ to accurately represent any smaller target model $\overline{f}$ if +LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of +}\overline{f}}{\text{depth of }f}$. We also quantify the approximation error +when LoRA-rank is lower than the threshold. For Transformer networks, we show +any model can be adapted to a target model of the same size with +rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. + +
+
+ comment: 40 pages, 5 figures +
+
+
+
+
+ + ☆ CompeteAI: Understanding the Competition Behaviors in Large Language + Model-based Agents + + +
+ Large language models (LLMs) have been widely used as agents to complete +different tasks, such as personal assistance or event planning. While most work +has focused on cooperation and collaboration between agents, little work +explores competition, another important mechanism that fosters the development +of society and economy. In this paper, we seek to examine the competition +behaviors in LLM-based agents. We first propose a general framework to study +the competition between agents. Then, we implement a practical competitive +environment using GPT-4 to simulate a virtual town with two types of agents, +restaurant agents and customer agents. Specifically, restaurant +agents compete with each other to attract more customers, where the competition +encourages them to transform, such as by cultivating new operating strategies. The +results of our experiments reveal several interesting findings ranging from +social learning to the Matthew Effect, which align well with existing sociological +and economic theories. We believe that competition between agents deserves +further investigation to help us understand society better. The code will be +released soon. + +
+
+ comment: Technical report; 21 pages +
+
+
+
+
+ + ☆ The IMS Toucan System for the Blizzard Challenge 2023 + + +
+ For our contribution to the Blizzard Challenge 2023, we improved on the +system we submitted to the Blizzard Challenge 2021. Our approach entails a +rule-based text-to-phoneme processing system that includes rule-based +disambiguation of homographs in the French language. It then transforms the +phonemes to spectrograms as intermediate representations using a fast and +efficient non-autoregressive synthesis architecture based on Conformer and +Glow. A GAN based neural vocoder that combines recent state-of-the-art +approaches converts the spectrogram to the final wave. We carefully designed +the data processing, training, and inference procedures for the challenge data. +Our system identifier is G. Open source code and demo are available. + +
+
+ comment: Published at the Blizzard Challenge Workshop 2023, colocated with the + Speech Synthesis Workshop 2023, a satellite event of the Interspeech 2023 +
+
+
+
+
+ + ☆ Improving Zero-shot Reader by Reducing Distractions from Irrelevant + Documents in Open-Domain Question Answering EMNLP 2023 + + +
+ Large language models (LLMs) enable zero-shot approaches in open-domain +question answering (ODQA), yet advancements in the reader remain limited +compared to the retriever. This study examines the feasibility of a zero-shot +reader that addresses the challenges of computational cost and the need for +labeled data. We find that LLMs are distracted by irrelevant documents in +the retrieved set and by the overconfidence of the generated answers when they are +exploited as zero-shot readers. To tackle these problems, we mitigate the +impact of such documents via Distraction-aware Answer Selection (DAS) with a +negation-based instruction and score adjustment for proper answer selection. +Experimental results show that our approach successfully handles distraction +across diverse scenarios, enhancing the performance of zero-shot readers. +Furthermore, unlike supervised readers struggling with unseen data, zero-shot +readers demonstrate outstanding transferability without any training. + +
+
+ comment: Findings of EMNLP 2023 Camera Ready +
+
+
+
+
+ + ☆ LightLM: A Lightweight Deep and Narrow Language Model for Generative + Recommendation + + +
+ This paper presents LightLM, a lightweight Transformer-based language model +for generative recommendation. While Transformer-based generative modeling has +gained importance in various AI sub-fields such as NLP and vision, generative +recommendation is still in its infancy due to its unique demand on personalized +generative modeling. Existing works on generative recommendation often use +NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are +heavy-weight and are not specifically designed for recommendation tasks. +LightLM tackles the issue by introducing a light-weight deep and narrow +Transformer architecture, which is specifically tailored for direct generation +of recommendation items. This structure is especially apt for straightforward +generative recommendation and stems from the observation that a language model +does not have to be too wide for this task, as the input predominantly consists +of short tokens that are well-suited for the model's capacity. We also show +that our devised user and item ID indexing methods, i.e., Spectral +Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enable +the deep and narrow Transformer architecture to outperform large-scale language +models for recommendation. Besides, to address the hallucination problem of +generating items as output, we propose the constrained generation process for +generative recommenders. Experiments on real-world datasets show that LightLM +outperforms various competitive baselines in terms of both recommendation +accuracy and efficiency. The code can be found at +https://github.com/dongyuanjushi/LightLM. + +
+
+
+
+
+ + ☆ Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech + Systems for the MADASR 2023 Challenge + + +
+ This paper describes Tallinn University of Technology (TalTech) systems +developed for the ASRU MADASR 2023 Challenge. The challenge focuses on +automatic speech recognition of dialect-rich Indian languages with limited +training audio and text data. TalTech participated in two tracks of the +challenge: Track 1 that allowed using only the provided training data and Track +3 which allowed using additional audio data. In both tracks, we relied on +wav2vec2.0 models. Our methodology diverges from the traditional procedure of +finetuning pretrained wav2vec2.0 models in two key points: firstly, through the +implementation of the aligned data augmentation technique to enhance the +linguistic diversity of the training data, and secondly, via the application of +deep prefix tuning for dialect adaptation of wav2vec2.0 models. In both tracks, +our approach yielded significant improvements over the provided baselines, +achieving the lowest word error rates across all participating teams. + +
+
+
+
+
+ + ☆ ''Fifty Shades of Bias'': Normative Ratings of Gender Bias in GPT + Generated English Text EMNLP 2023 + + +
+ Language serves as a powerful tool for the manifestation of societal belief +systems. In doing so, it also perpetuates the prevalent biases in our society. +Gender bias is one of the most pervasive biases in our society and is seen in +online and offline discourses. With LLMs increasingly gaining human-like +fluency in text generation, gaining a nuanced understanding of the biases these +systems can generate is imperative. Prior work often treats gender bias as a +binary classification task. However, acknowledging that bias must be perceived +at a relative scale, we investigate the generation and consequent receptivity +of manual annotators to bias of varying degrees. Specifically, we create the +first dataset of GPT-generated English text with normative ratings of gender +bias. Ratings were obtained using Best--Worst Scaling -- an efficient +comparative annotation framework. Next, we systematically analyze the variation +of themes of gender biases in the observed ranking and show that +identity-attack is most closely related to gender bias. Finally, we show the +performance of existing automated models trained on related concepts on our +dataset. + +
+
+ comment: Camera-ready version in EMNLP 2023 +
+
+
+
+
+ + ☆ PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word + Tokenization on Downstream Applications + + +
+ Large protein language models are adept at capturing the underlying +evolutionary information in primary structures, offering significant practical +value for protein engineering. Compared to natural language models, protein +amino acid sequences have a smaller data volume and a limited combinatorial +space. Choosing an appropriate vocabulary size to optimize the pre-trained +model is a pivotal issue. Moreover, despite the wealth of benchmarks and +studies in the natural language community, there remains a lack of a +comprehensive benchmark for systematically evaluating protein language model +quality. Given these challenges, PETA trained language models with 14 different +vocabulary sizes under three tokenization methods. It conducted thousands of +tests on 33 diverse downstream datasets to assess the models' transfer learning +capabilities, incorporating two classification heads and three random seeds to +mitigate potential biases. Extensive experiments indicate that vocabulary sizes +between 50 and 200 optimize the model, whereas sizes exceeding 800 +detrimentally affect the model's representational performance. Our code, model +weights and datasets are available at +https://github.com/ginnm/ProteinPretraining. + +
+
+ comment: 46 pages, 4 figures, 9 tables +
+
+
+
+
+ + ☆ Harnessing GPT-3.5-turbo for Rhetorical Role Prediction in Legal Cases + + +
+ We propose a comprehensive study of one-stage elicitation techniques for +querying a large pre-trained generative transformer (GPT-3.5-turbo) in the +rhetorical role prediction task of legal cases. This task is known to require +textual context. Our study explores strategies such as zero- and few-shot +prompting, task specification with definitions and clarification of annotation +ambiguities, textual context and reasoning with general prompts and specific +questions. We show that the number of examples, the definition of labels, the +presentation of the (labelled) textual context and specific questions about +this context have a positive influence on the performance of the model. Given +non-equivalent test set configurations, we observed that prompting with a few +labelled examples from direct context can lead the model to a better +performance than a supervised fine-tuned multi-class classifier based on the +BERT encoder (weighted F1 score = 72%). But there is still a gap to reach +the performance of the best systems (= 86%) in the LegalEval 2023 task which, on +the other hand, require dedicated resources, architectures and training. + +
+
+
+
+
+ + ☆ Tackling the Matrix Multiplication Micro-kernel Generation with Exo + + +
+ The optimization of the matrix multiplication (or GEMM) has been a need +during the last decades. This operation is considered the flagship of current +linear algebra libraries such as BLIS, OpenBLAS, or Intel OneAPI because of its +widespread use in a large variety of scientific applications. The GEMM is +usually implemented following the GotoBLAS philosophy, which tiles the GEMM +operands and uses a series of nested loops for performance improvement. These +approaches extract the maximum computational power of the architectures through +small pieces of hardware-oriented, high-performance code called micro-kernels. +However, this approach forces developers to generate, with a non-negligible +effort, a dedicated micro-kernel for each new hardware. + In this work, we present a step-by-step procedure for generating +micro-kernels with the Exo compiler that perform close to (or even better +than) manually developed micro-kernels written with intrinsic functions or +assembly language. Our solution also improves the portability of the generated +code, since a hardware target is fully specified by a concise library-based +description of its instructions. + +
+
+ comment: 11 pages, 18 figures. Presented at CGO 2024. It includes a software + artifact step-by-step execution +
+
+
+
+
+ + ☆ Meaning and understanding in large language models + + +
+ Can a machine understand the meanings of natural language? Recent +developments in the generative large language models (LLMs) of artificial +intelligence have led to the belief that traditional philosophical assumptions +about machine understanding of language need to be revised. This article +critically evaluates the prevailing tendency to regard machine language +performance as mere syntactic manipulation and the simulation of understanding, +which is only partial and very shallow, without sufficient referential +grounding in the world. The aim is to highlight the conditions crucial to +attributing natural language understanding to state-of-the-art LLMs, where it +can be legitimately argued that LLMs not only use syntax but also semantics, +their understanding not being simulated but duplicated; and determine how they +ground the meanings of linguistic expressions. + +
+
+ comment: 20 pages +
+
+
+
+
+ + ☆ ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in + Real-World User-AI Conversation + + +
+ Despite remarkable advances that large language models have achieved in +chatbots, maintaining a non-toxic user-AI interactive environment has become +increasingly critical nowadays. However, previous efforts in toxicity detection +have been mostly based on benchmarks derived from social media content, leaving +the unique challenges inherent to real-world user-AI interactions +insufficiently explored. In this work, we introduce ToxicChat, a novel +benchmark based on real user queries from an open-source chatbot. This +benchmark contains the rich, nuanced phenomena that can be tricky for current +toxicity detection models to identify, revealing a significant domain +difference compared to social media content. Our systematic evaluation of +models trained on existing toxicity datasets has shown their shortcomings when +applied to this unique domain of ToxicChat. Our work illuminates the +potentially overlooked challenges of toxicity detection in real-world user-AI +conversations. In the future, ToxicChat can be a valuable resource to drive +further advancements toward building a safe and healthy environment for user-AI +interactions. + +
+
+
+
+
+ + ☆ Dialogue-based generation of self-driving simulation scenarios using + Large Language Models + + +
+ Simulation is an invaluable tool for developing and evaluating controllers +for self-driving cars. Current simulation frameworks are driven by +highly specialized domain-specific languages, and so a natural language +interface would greatly enhance usability. But there is often a gap, consisting +of tacit assumptions the user is making, between a concise English utterance +and the executable code that captures the user's intent. In this paper we +describe a system that addresses this issue by supporting an extended +multimodal interaction: the user can follow up prior instructions with +refinements or revisions, in reaction to the simulations that have been +generated from their utterances so far. We use Large Language Models (LLMs) to +map the user's English utterances in this interaction into domain-specific +code, and so we explore the extent to which LLMs capture the context +sensitivity that's necessary for computing the speaker's intended message in +discourse. + +
+
+ comment: 12 pages, 6 figures, SpLU-RoboNLP 2023 +
+
+
+
+
+ + ☆ Language and Mental Health: Measures of Emotion Dynamics from Text as + Linguistic Biosocial Markers + + +
+ Research in psychopathology has shown that, at an aggregate level, the +patterns of emotional change over time -- emotion dynamics -- are indicators of +one's mental health. One's patterns of emotion change have traditionally been +determined through self-reports of emotions; however, there are known issues +with accuracy, bias, and convenience. Recent approaches to determining emotion +dynamics from one's everyday utterances address many of these concerns, but +it is not yet known whether these measures of utterance emotion dynamics (UED) +correlate with mental health diagnoses. Here, for the first time, we study the +relationship between tweet emotion dynamics and mental health disorders. We +find that each of the UED metrics studied varied by the user's self-disclosed +diagnosis. For example: average valence was significantly higher (i.e., more +positive text) in the control group compared to users with ADHD, MDD, and PTSD. +Valence variability was significantly lower in the control group compared to +ADHD, depression, bipolar disorder, MDD, PTSD, and OCD but not PPD. Rise and +recovery rates of valence also exhibited significant differences from the +control. This work provides important early evidence for how linguistic cues +pertaining to emotion dynamics can play a crucial role as biosocial markers for +mental illnesses and aid in the understanding, diagnosis, and management of +mental health disorders. + +
+
+ comment: 9 pages, 3 figures +
+
+
+
+
+ + ☆ Cultural Adaptation of Recipes ACL + + +
+ Building upon the considerable advances in Large Language Models (LLMs), we +are now equipped to address more sophisticated tasks demanding a nuanced +understanding of cross-cultural contexts. A key example is recipe adaptation, +which goes beyond simple translation to include a grasp of ingredients, +culinary techniques, and dietary preferences specific to a given culture. We +introduce a new task involving the translation and cultural adaptation of +recipes between Chinese and English-speaking cuisines. To support this +investigation, we present CulturalRecipes, a unique dataset comprised of +automatically paired recipes written in Mandarin Chinese and English. This +dataset is further enriched with a human-written and curated test set. In this +intricate task of cross-cultural recipe adaptation, we evaluate the performance +of various methods, including GPT-4 and other LLMs, traditional machine +translation, and information retrieval techniques. Our comprehensive analysis +includes both automatic and human evaluation metrics. While GPT-4 exhibits +impressive abilities in adapting Chinese recipes into English, it still lags +behind human expertise when translating English recipes into Chinese. This +underscores the multifaceted nature of cultural adaptations. We anticipate that +these insights will significantly contribute to future research on +culturally-aware language models and their practical application in culturally +diverse contexts. + +
+
+ comment: Accepted to TACL +
+
+
+
+
+ + ☆ ACT-SQL: In-Context Learning for Text-to-SQL with + Automatically-Generated Chain-of-Thought + + +
+ Recently Large Language Models (LLMs) have been proven to have strong +abilities in various domains and tasks. We study the problem of prompt +designing in the text-to-SQL task and attempt to improve the LLMs' reasoning +ability when generating SQL queries. Besides the trivial few-shot in-context +learning setting, we design our chain-of-thought (CoT) prompt with a similar +method to schema linking. We provide a method named ACT-SQL to automatically +generate auto-CoT exemplars and thus the whole process doesn't need manual +labeling. Our approach is cost-saving since we only use the LLMs' API call once +when generating one SQL query. Furthermore, we extend our in-context learning +method to the multi-turn text-to-SQL task. The experiment results show that the +LLMs' performance can benefit from our ACT-SQL approach. Our approach achieves +SOTA performance on the Spider dev set among existing in-context learning +approaches. + +
+
+
+
+
+ + ☆ Arabic Fine-Grained Entity Recognition + + +
+ Traditional NER systems are typically trained to recognize coarse-grained +entities, and less attention is given to classifying entities into a hierarchy +of fine-grained lower-level subtypes. This article aims to advance Arabic NER +with fine-grained entities. We chose to extend Wojood (an open-source Nested +Arabic Named Entity Corpus) with subtypes. In particular, four main entity +types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), +and facility (FAC), are extended with 31 subtypes. To do this, we first revised +Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's +ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC, +ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE +sub-types. We refer to this extended version of Wojood as WojoodFine. To +evaluate our annotations, we measured the inter-annotator agreement (IAA) using +both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. +To compute the baselines of WojoodFine, we fine-tune three pre-trained Arabic +BERT encoders in three settings: flat NER, nested NER and nested NER with +subtypes, and achieve F1 scores of 0.920, 0.866, and 0.885, respectively. Our +corpus and models are open-source and available at +https://sina.birzeit.edu/wojood/. + +
+
+
+
+
+ + ☆ Nabra: Syrian Arabic Dialects with Morphological Annotations + + +
+ This paper presents Nabra, a corpus of Syrian Arabic dialects with +morphological annotations. A team of Syrian natives collected more than 6K +sentences containing about 60K words from several sources including social +media posts, scripts of movies and series, lyrics of songs and local proverbs +to build Nabra. Nabra covers several local Syrian dialects including those of +Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and +Suwayda. A team of nine annotators annotated the 60K tokens with full +morphological annotations across sentence contexts. We trained the annotators +to follow methodological annotation guidelines to ensure unique morpheme +annotations, and normalized the annotations. F1 and kappa agreement scores +ranged between 74% and 98% across features, showing the excellent quality of +Nabra annotations. Our corpora are open-source and publicly available as part +of the Currasat portal https://sina.birzeit.edu/currasat. + +
+
+
+
+
+ + ☆ An Ensemble Method Based on the Combination of Transformers with + Convolutional Neural Networks to Detect Artificially Generated Text ALT + + +
+ Thanks to the state-of-the-art Large Language Models (LLMs), language +generation has reached outstanding levels. These models are capable of +generating high quality content, thus making it a challenging task to distinguish +generated text from human-written content. Despite the advantages provided by +Natural Language Generation, the inability to distinguish automatically +generated text can raise ethical concerns in terms of authenticity. +Consequently, it is important to design and develop methodologies to detect +artificial content. In our work, we present some classification models +constructed by ensembling transformer models such as SciBERT, DeBERTa and +XLNet, with Convolutional Neural Networks (CNNs). Our experiments demonstrate +that the considered ensemble architectures surpass the performance of the +individual transformer models for classification. Furthermore, the proposed +SciBERT-CNN ensemble model produced an F1-score of 98.36% on the ALTA shared +task 2023 data. + +
+
+ comment: In Proceedings of the 21st Annual Workshop of the Australasian + Language Technology Association (ALTA 2023) +
+
+
+
+
+ + ☆ FormaT5: Abstention and Examples for Conditional Table Formatting with + Natural Language VLDB 2024 + + +
+ Formatting is an important property in tables for visualization, +presentation, and analysis. Spreadsheet software allows users to automatically +format their tables by writing data-dependent conditional formatting (CF) +rules. Writing such rules is often challenging for users as it requires them to +understand and implement the underlying logic. We present FormaT5, a +transformer-based model that can generate a CF rule given the target table and +a natural language description of the desired formatting logic. We find that +user descriptions for these tasks are often under-specified or ambiguous, +making it harder for code generation systems to accurately learn the desired +rule in a single step. To tackle this problem of under-specification and +minimise argument errors, FormaT5 learns to predict placeholders through an +abstention objective. These placeholders can then be filled by a second model +or, when examples of rows that should be formatted are available, by a +programming-by-example system. To evaluate FormaT5 on diverse and real +scenarios, we create an extensive benchmark of 1053 CF tasks, containing +real-world descriptions collected from four different sources. We release our +benchmarks to encourage research in this area. Abstention and filling allow +FormaT5 to outperform 8 different neural approaches on our benchmarks, both +with and without examples. Our results illustrate the value of building +domain-specific learning systems. + +
+
+ comment: VLDB 2024, 14 pages +
+
+
+
+
+ + ☆ Comparing Photorealistic and Animated Embodied Conversational Agents in + Serious Games: An Empirical Study on User Experience + + +
+ Embodied conversational agents (ECAs) are paradigms of conversational user +interfaces in the form of embodied characters. While ECAs offer various +manipulable features, this paper focuses on a study conducted to explore two +distinct levels of presentation realism. The two agent versions are +photorealistic and animated. The study aims to provide insights and design +suggestions for speech-enabled ECAs within serious game environments. A +within-subjects, two-by-two factorial design was employed for this research +with a cohort of 36 participants balanced for gender. The results showed that +both the photorealistic and the animated versions were perceived as highly +usable, with overall mean scores of 5.76 and 5.71, respectively. However, 69.4 +per cent of the participants stated they preferred the photorealistic version, +25 per cent stated they preferred the animated version and 5.6 per cent had no +stated preference. The photorealistic agents were perceived as more realistic +and human-like, while the animated characters made the task feel more like a +game. Even though the agents' realism had no significant effect on usability, +it positively influenced participants' perceptions of the agent. This research +aims to lay the groundwork for future studies on ECA realism's impact in +serious games across diverse contexts. + +
+
+ comment: 21 pages, 14 figures, preprint to be published in HCI INTERNATIONAL + 2023 25TH INTERNATIONAL CONFERENCE ON HUMAN-COMPUTER INTERACTION proceedings +
+
+
+
+
+ + ☆ Learning to Abstract with Nonparametric Variational Information + Bottleneck EMNLP 2023 + + +
+ Learned representations at the level of characters, sub-words, words and +sentences, have each contributed to advances in understanding different NLP +tasks and linguistic phenomena. However, learning textual embeddings is costly +as they are tokenization specific and require different models to be trained +for each level of abstraction. We introduce a novel language representation +model which can learn to compress to different levels of abstraction at +different layers of the same model. We apply Nonparametric Variational +Information Bottleneck (NVIB) to stacked Transformer self-attention layers in +the encoder, which encourages an information-theoretic compression of the +representations through the model. We find that the layers within the model +correspond to increasing levels of abstraction and that their representations +are more linguistically informed. Finally, we show that NVIB compression +results in a model which is more robust to adversarial perturbations. + +
+
+ comment: Accepted to Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Automatic Logical Forms improve fidelity in Table-to-Text generation + + +
+ Table-to-text systems generate natural language statements from structured +data like tables. While end-to-end techniques suffer from low factual +correctness (fidelity), a previous study reported gains when using manual +logical forms (LF) that represent the selected content and the semantics of the +target text. Given the manual step, it was not clear whether automatic LFs +would be effective, or whether the improvement came from content selection +alone. We present TlT which, given a table and a selection of the content, +first produces LFs and then the textual statement. We show for the first time +that automatic LFs improve quality, with an increase in fidelity of 30 points +over a comparable system not using LFs. Our experiments allow us to quantify the +remaining challenges for high factual correctness, with automatic selection of +content coming first, followed by better Logic-to-Text generation and, to a +lesser extent, better Table-to-Logic parsing. + +
+
+
+
+
+ + ☆ Understanding the Role of Input Token Characters in Language Models: How + Does Information Loss Affect Performance? EMNLP 2023 + + +
+ Understanding how and what pre-trained language models (PLMs) learn about +language is an open challenge in natural language processing. Previous work has +focused on identifying whether they capture semantic and syntactic information, +and how the data or the pre-training objective affects their performance. +However, to the best of our knowledge, no previous work has specifically +examined how information loss in input token characters affects the performance +of PLMs. In this study, we address this gap by pre-training language models +using small subsets of characters from individual tokens. Surprisingly, we find +that even when pre-training under extreme settings, i.e. using only one character of +each token, the performance retention in standard NLU benchmarks and probing +tasks compared to full-token models is high. For instance, a model pre-trained +only on single first characters from tokens achieves performance retention of +approximately 90% and 77% of the full-token model in SuperGLUE and GLUE +tasks, respectively. + +
+
+ comment: To appear at EMNLP 2023 +
+
+
+
+
+ + ☆ Joint Entity and Relation Extraction with Span Pruning and Hypergraph + Neural Networks EMNLP + + +
+ Entity and Relation Extraction (ERE) is an important task in information +extraction. Recent marker-based pipeline models achieve state-of-the-art +performance, but still suffer from the error propagation issue. Also, most +current ERE models do not take into account higher-order interactions between +multiple entities and relations, while higher-order modeling could be +beneficial. In this work, we propose HyperGraph neural network for ERE +($\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based +pipeline model). To alleviate error propagation, we use a high-recall pruner +mechanism to transfer the burden of entity identification and labeling from the +NER module to the joint module of our model. For higher-order modeling, we +build a hypergraph, where nodes are entities (provided by the span pruner) and +relations thereof, and hyperedges encode interactions between two different +relations or between a relation and its associated subject and object entities. +We then run a hypergraph neural network for higher-order inference by applying +message passing over the built hypergraph. Experiments on three widely used +benchmarks (\acef{}, \ace{} and \scierc{}) for the ERE task show significant +improvements over the previous state-of-the-art PL-marker. + +
+
+ comment: Accepted to Proceedings of EMNLP, 2023 +
+
+
+
+
+ + ☆ EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual + Representation Learning NeurIPS 2023 + + +
+ Expressing universal semantics common to all languages is helpful in +understanding the meanings of complex and culture-specific sentences. The +research theme underlying this scenario focuses on learning universal +representations across languages with the usage of massive parallel corpora. +However, due to the sparsity and scarcity of parallel data, there is still a +big challenge in learning authentic "universals" for any two languages. In +this paper, we propose EMMA-X: an EM-like Multilingual pre-training Algorithm, +to learn (X)Cross-lingual universals with the aid of excessive multilingual +non-parallel data. EMMA-X unifies the cross-lingual representation learning +task and an extra semantic relation prediction task within an EM framework. +Both the extra semantic classifier and the cross-lingual sentence encoder +approximate the semantic relation of two sentences, and supervise each other +until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly +introduced benchmark containing 12 widely studied cross-lingual tasks that +fully depend on sentence-level representations. Results reveal that EMMA-X +achieves state-of-the-art performance. Further geometric analysis of the built +representation space with three requirements demonstrates the superiority of +EMMA-X over advanced models. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Codebook Features: Sparse and Discrete Interpretability for Neural + Networks + + +
+ Understanding neural networks is challenging in part because of the dense, +continuous nature of their hidden states. We explore whether we can train +neural networks to have hidden states that are sparse, discrete, and more +interpretable by quantizing their continuous features into what we call +codebook features. Codebook features are produced by finetuning neural networks +with vector quantization bottlenecks at each layer, producing a network whose +hidden features are the sum of a small number of discrete vector codes chosen +from a larger codebook. Surprisingly, we find that neural networks can operate +under this extreme bottleneck with only modest degradation in performance. This +sparse, discrete bottleneck also provides an intuitive way of controlling +neural network behavior: first, find codes that activate when the desired +behavior is present, then activate those same codes during generation to elicit +that behavior. We validate our approach by training codebook Transformers on +several different datasets. First, we explore a finite state machine dataset +with far more hidden states than neurons. In this setting, our approach +overcomes the superposition problem by assigning states to distinct codes, and +we find that we can make the neural network behave as if it is in a different +state by activating the code for that state. Second, we train Transformer +language models with up to 410M parameters on two natural language datasets. We +identify codes in these models representing diverse, disentangled concepts +(ranging from negative emotions to months of the year) and find that we can +guide the model to generate different topics by activating the appropriate +codes during inference. Overall, codebook features appear to be a promising +unit of analysis and control for neural networks and interpretability. Our +codebase and models are open-sourced at +https://github.com/taufeeque9/codebook-features. + +
+
+
+
+
+ + ☆ TST$^\mathrm{R}$: Target Similarity Tuning Meets the Real World EMNLP + + +
+ Target similarity tuning (TST) is a method of selecting relevant examples in +natural language (NL) to code generation through large language models (LLMs) +to improve performance. Its goal is to adapt a sentence embedding model to have +the similarity between two NL inputs match the similarity between their +associated code outputs. In this paper, we propose different methods to apply +and improve TST in the real world. First, we replace the sentence transformer +with embeddings from a larger model, which reduces sensitivity to the language +distribution and thus provides more flexibility in synthetic generation of +examples, and we train a tiny model that transforms these embeddings to a space +where embedding similarity matches code similarity, which allows the model to +remain a black box and only requires a few matrix multiplications at inference +time. Second, we show how to efficiently select a smaller number of training +examples to train the TST model. Third, we introduce a ranking-based evaluation +for TST that does not require end-to-end code generation experiments, which can +be expensive to perform. + +
+
+ comment: Accepted for EMNLP-Findings, 2023 +
+
+
+
+
+ + ☆ Beyond MLE: Convex Learning for Text Generation NeurIPS 2023 + + +
+ Maximum likelihood estimation (MLE) is a statistical method used to estimate +the parameters of a probability distribution that best explain the observed +data. In the context of text generation, MLE is often used to train generative +language models, which can then be used to generate new text. However, we argue +that MLE is not always necessary and optimal, especially for closed-ended text +generation tasks like machine translation. In these tasks, the goal of the model is +to generate the most appropriate response, which does not necessarily require +it to estimate the entire data distribution with MLE. To this end, we propose a +novel class of training objectives based on convex functions, which enables +text generation models to focus on highly probable outputs without having to +estimate the entire data distribution. We investigate the theoretical +properties of the optimal predicted distribution when applying convex functions +to the loss, demonstrating that convex functions can sharpen the optimal +distribution, thereby enabling the model to better capture outputs with high +probabilities. Experiments on various text generation tasks and models show the +effectiveness of our approach. It enables autoregressive models to bridge the +gap between greedy and beam search, and facilitates the learning of +non-autoregressive models with a maximum improvement of 9+ BLEU points. +Moreover, our approach also exhibits significant impact on large language +models (LLMs), substantially enhancing their generative capability on various +tasks. Source code is available at +https://github.com/ictnlp/Convex-Learning. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Efficient Data Fusion using the Tsetlin Machine + + +
+ We propose a novel way of assessing and fusing noisy dynamic data using a +Tsetlin Machine (TM). Our approach consists of monitoring how explanations, in the form +of logical clauses that a TM learns, change with possible noise in dynamic +data. This way, the TM can recognize the noise by lowering weights of previously +learned clauses, or reflect it in the form of new clauses. We also perform a +comprehensive experimental study using notably different datasets that +demonstrates the high performance of the proposed approach. + +
+
+
+
+
+ + ☆ How do Language Models Bind Entities in Context? + + +
+ To correctly use in-context information, language models (LMs) must bind +entities to their attributes. For example, given a context describing a "green +square" and a "blue circle", LMs must bind the shapes to their respective +colors. We analyze LM representations and identify the binding ID mechanism: a +general mechanism for solving the binding problem, which we observe in every +sufficiently large model from the Pythia and LLaMA families. Using causal +interventions, we show that LMs' internal activations represent binding +information by attaching binding ID vectors to corresponding entities and +attributes. We further show that binding ID vectors form a continuous subspace, +in which distances between binding ID vectors reflect their discernability. +Overall, our results uncover interpretable strategies in LMs for representing +symbolic knowledge in-context, providing a step towards understanding general +in-context reasoning in large-scale LMs. + +
+
+
+
+
+ + ☆ X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity EMNLP 2023 + + +
+ Cross-lingual transfer (XLT) is an emergent ability of multilingual language +models that preserves their performance on a task to a significant extent when +evaluated in languages that were not included in the fine-tuning process. While +English, due to its widespread usage, is typically regarded as the primary +language for model adaption in various tasks, recent studies have revealed that +the efficacy of XLT can be amplified by selecting the most appropriate source +languages based on specific conditions. In this work, we propose the +utilization of sub-network similarity between two languages as a proxy for +predicting the compatibility of the languages in the context of XLT. Our +approach is model-oriented, better reflecting the inner workings of foundation +models. In addition, it requires only a moderate amount of raw text from +candidate languages, distinguishing it from the majority of previous methods +that rely on external resources. In experiments, we demonstrate that our method +is more effective than baselines across diverse tasks. Specifically, it shows +proficiency in ranking candidates for zero-shot XLT, achieving an improvement +of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that +confirm the utility of sub-networks for XLT prediction. + +
+
+ comment: Accepted to EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Supercharging academic writing with generative AI: framework, + techniques, and caveats + + +
+ Academic writing is an indispensable yet laborious part of the research +enterprise. This Perspective maps out principles and methods for using +generative artificial intelligence (AI), specifically large language models +(LLMs), to elevate the quality and efficiency of academic writing. We introduce +a human-AI collaborative framework that delineates the rationale (why), process +(how), and nature (what) of AI engagement in writing. The framework pinpoints +both short-term and long-term reasons for engagement and their underlying +mechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals +the role of AI throughout the writing process, conceptualized through a +two-stage model for human-AI collaborative writing, and the nature of AI +assistance in writing, represented through a model of writing-assistance types +and levels. Building on this framework, we describe effective prompting +techniques for incorporating AI into the writing routine (outlining, drafting, +and editing) as well as strategies for maintaining rigorous scholarship, +adhering to varied journal policies, and avoiding overreliance on AI. +Ultimately, the prudent integration of AI into academic writing can ease the +communication burden, empower authors, accelerate discovery, and promote +diversity in science. + +
+
+ comment: 14 pages, 2 figures, 1 table, 1 box +
+
+
+
+
+ + ☆ Symbolic Planning and Code Generation for Grounded Dialogue EMNLP 2023 + + +
+ Large language models (LLMs) excel at processing and generating both text and +code. However, LLMs have had limited applicability in grounded task-oriented +dialogue as they are difficult to steer toward task objectives and fail to +handle novel grounding. We present a modular and interpretable grounded +dialogue system that addresses these shortcomings by composing LLMs with a +symbolic planner and grounded code execution. Our system consists of a reader +and planner: the reader leverages an LLM to convert partner utterances into +executable code, calling functions that perform grounding. The translated +code's output is stored to track dialogue state, while a symbolic planner +determines the next appropriate response. We evaluate our system's performance +on the demanding OneCommon dialogue task, involving collaborative reference +resolution on abstract images of scattered dots. Our system substantially +outperforms the previous state-of-the-art, including improving task success in +human evaluations from 56% to 69% in the most challenging setting. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Incorporating Probing Signals into Multimodal Machine Translation via + Visual Question-Answering Pairs EMNLP2023 + + +
+ This paper presents an in-depth study of multimodal machine translation +(MMT), examining the prevailing understanding that MMT systems exhibit +decreased sensitivity to visual information when text inputs are complete. +Instead, we attribute this phenomenon to insufficient cross-modal interaction, +rather than image information redundancy. A novel approach is proposed to +generate parallel Visual Question-Answering (VQA) style pairs from the source +text, fostering more robust cross-modal interaction. Using Large Language +Models (LLMs), we explicitly model the probing signal in MMT to convert it into +VQA-style data to create the Multi30K-VQA dataset. An MMT-VQA multitask +learning framework is introduced to incorporate explicit probing signals from +the dataset into the MMT training process. Experimental results on two +widely-used benchmarks demonstrate the effectiveness of this novel approach. +Our code and data will be available at: +https://github.com/libeineu/MMT-VQA. + +
+
+ comment: Findings of EMNLP2023 +
+
+
+
+
+ + ☆ M2C: Towards Automatic Multimodal Manga Complement EMNLP2023 + + +
+ Multimodal manga analysis focuses on enhancing manga understanding with +visual and textual features, which has attracted considerable attention from +both natural language processing and computer vision communities. Currently, +most comics are hand-drawn and prone to problems such as missing pages, text +contamination, and aging, resulting in missing comic text content and seriously +hindering human comprehension. In other words, the Multimodal Manga Complement +(M2C) task has not been investigated, which aims to handle the aforementioned +issues by providing a shared semantic space for vision and language +understanding. To this end, we first propose the Multimodal Manga Complement +task by establishing a new M2C benchmark dataset covering two languages. First, +we design a manga argumentation method called MCoT to mine event knowledge in +comics with large language models. Then, an effective baseline FVP-M$^{2}$ +using fine-grained visual prompts is proposed to support manga complement. +Extensive experimental results show the effectiveness of the FVP-M$^{2}$ method for +Multimodal Manga Complement. + +
+
+ comment: EMNLP2023. arXiv admin note: text overlap with arXiv:2210.15461 +
+
+
+
+
+ + ☆ Test-time Augmentation for Factual Probing EMNLP 2023 + + +
+ Factual probing is a method that uses prompts to test if a language model +"knows" certain world knowledge facts. A problem in factual probing is that +small changes to the prompt can lead to large changes in model output. Previous +work aimed to alleviate this problem by optimizing prompts via text mining or +fine-tuning. However, such approaches are relation-specific and do not +generalize to unseen relation types. Here, we propose to use test-time +augmentation (TTA) as a relation-agnostic method for reducing sensitivity to +prompt variations by automatically augmenting and ensembling prompts at test +time. Experiments show improved model calibration, i.e., with TTA, model +confidence better reflects prediction accuracy. Improvements in prediction +accuracy are observed for some models, but for other models, TTA leads to +degradation. Error analysis identifies the difficulty of producing high-quality +prompt variations as the main challenge for TTA. + +
+
+ comment: 12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper) +
+
+
+
+
+ + ☆ Topic Segmentation of Semi-Structured and Unstructured Conversational + Datasets using Language Models + + +
+ Breaking down a document or a conversation into multiple contiguous segments +based on its semantic structure is an important and challenging problem in NLP, +which can assist many downstream tasks. However, current works on topic +segmentation often focus on segmentation of structured texts. In this paper, we +comprehensively analyze the generalization capabilities of state-of-the-art +topic segmentation models on unstructured texts. We find that: (a) Current +strategies of pre-training on a large corpus of structured text such as +Wiki-727K do not help in transferability to unstructured conversational data. +(b) Training from scratch with only a relatively small-sized dataset of the +target unstructured domain improves the segmentation results by a significant +margin. We stress-test our proposed Topic Segmentation approach by +experimenting with multiple loss functions, in order to mitigate effects of +imbalance in unstructured conversational datasets. Our empirical evaluation +indicates that Focal Loss function is a robust alternative to Cross-Entropy and +re-weighted Cross-Entropy loss function when segmenting unstructured and +semi-structured chats. + +
+
+ comment: Accepted to IntelliSys 2023. arXiv admin note: substantial text + overlap with arXiv:2211.14954 +
+
+
+
+
+ + ☆ FLEEK: Factual Error Detection and Correction with Evidence Retrieved + from External Knowledge EMNLP 2023 + + +
+ Detecting factual errors in textual information, whether generated by large +language models (LLM) or curated by humans, is crucial for making informed +decisions. LLMs' inability to attribute their claims to external knowledge and +their tendency to hallucinate make it difficult to rely on their responses. +Humans, too, are prone to factual errors in their writing. Since manual +detection and correction of factual errors is labor-intensive, developing an +automatic approach can greatly reduce human effort. We present FLEEK, a +prototype tool that automatically extracts factual claims from text, gathers +evidence from external knowledge sources, evaluates the factuality of each +claim, and suggests revisions for identified errors using the collected +evidence. Initial empirical evaluation on fact error detection (77-85% F1) +shows the potential of FLEEK. A video demo of FLEEK can be found at +https://youtu.be/NapJFUlkPdQ. + +
+
+ comment: EMNLP 2023 (Demonstration Track) +
+
+
+
+
+ + ☆ Transformers Learn Higher-Order Optimization Methods for In-Context + Learning: A Study with Linear Models + + +
+ Transformers are remarkably good at in-context learning (ICL) -- learning +from demonstrations without parameter updates -- but how they perform ICL +remains a mystery. Recent work suggests that Transformers may learn in-context +by internally running Gradient Descent, a first-order optimization method. In +this paper, we instead demonstrate that Transformers learn to implement +higher-order optimization methods to perform ICL. Focusing on in-context linear +regression, we show that Transformers learn to implement an algorithm very +similar to Iterative Newton's Method, a higher-order optimization method, +rather than Gradient Descent. Empirically, we show that predictions from +successive Transformer layers closely match different iterations of Newton's +Method linearly, with each middle layer roughly computing 3 iterations. In +contrast, exponentially more Gradient Descent steps are needed to match an +additional Transformer layer; this suggests that Transformers have a +comparable rate of convergence with high-order methods such as Iterative +Newton, which are exponentially faster than Gradient Descent. We also show that +Transformers can learn in-context on ill-conditioned data, a setting where +Gradient Descent struggles but Iterative Newton succeeds. Finally, we show +theoretical results which support our empirical findings and have a close +correspondence with them: we prove that Transformers can implement $k$ +iterations of Newton's method with $\mathcal{O}(k)$ layers. + +
+
+
+
+
+ + ☆ Style-Aware Radiology Report Generation with RadGraph and Few-Shot + Prompting EMNLP 2023 + + +
+ Automatically generated reports from medical images promise to improve the +workflow of radiologists. Existing methods consider an image-to-report modeling +task by directly generating a fully-fledged report from an image. However, this +conflates the content of the report (e.g., findings and their attributes) with +its style (e.g., format and choice of words), which can lead to clinically +inaccurate reports. To address this, we propose a two-step approach for +radiology report generation. First, we extract the content from an image; then, +we verbalize the extracted content into a report that matches the style of a +specific radiologist. For this, we leverage RadGraph -- a graph representation +of reports -- together with large language models (LLMs). In our quantitative +evaluations, we find that our approach leads to beneficial performance. Our +human evaluation with clinical raters highlights that the AI-generated reports +are indistinguishably tailored to the style of individual radiologists despite +leveraging only a few examples as context. + +
+
+ comment: Accepted to Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ TIMELINE: Exhaustive Annotation of Temporal Relations Supporting the + Automatic Ordering of Events in News Articles EMNLP 2023 + + +
+ Temporal relation extraction models have thus far been hindered by a number +of issues in existing temporal relation-annotated news datasets, including: (1) +low inter-annotator agreement due to the lack of specificity of their +annotation guidelines in terms of what counts as a temporal relation; (2) the +exclusion of long-distance relations within a given document (those spanning +across different paragraphs); and (3) the exclusion of events that are not +centred on verbs. This paper aims to alleviate these issues by presenting a new +annotation scheme that clearly defines the criteria based on which temporal +relations should be annotated. Additionally, the scheme includes events even if +they are not expressed as verbs (e.g., nominalised events). Furthermore, we +propose a method for annotating all temporal relations -- including +long-distance ones -- which automates the process, hence reducing time and +manual effort on the part of annotators. The result is a new dataset, the +TIMELINE corpus, in which improved inter-annotator agreement was obtained, in +comparison with previously reported temporal relation datasets. We report the +results of training and evaluating baseline temporal relation extraction models +on the new corpus, and compare them with results obtained on the widely used +MATRES corpus. + +
+
+ comment: Accepted for publication in EMNLP 2023: 13 pages, 3 figures and 14 + tables +
+
+
+
+
+ + ☆ "You Are An Expert Linguistic Annotator": Limits of LLMs as Analyzers of + Abstract Meaning Representation EMNLP 2023 + + +
+ Large language models (LLMs) show amazing proficiency and fluency in the use +of language. Does this mean that they have also acquired insightful linguistic +knowledge about the language, to an extent that they can serve as an "expert +linguistic annotator"? In this paper, we examine the successes and limitations +of the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaning +structure, focusing on the Abstract Meaning Representation (AMR; Banarescu et +al. 2013) parsing formalism, which provides rich graphical representations of +sentence meaning structure while abstracting away from surface forms. We +compare models' analysis of this semantic structure across two settings: 1) +direct production of AMR parses based on zero- and few-shot prompts, and 2) +indirect partial reconstruction of AMR via metalinguistic natural language +queries (e.g., "Identify the primary event of this sentence, and the predicate +corresponding to that event."). Across these settings, we find that models can +reliably reproduce the basic format of AMR, and can often capture core event, +argument, and modifier structure -- however, model outputs are prone to +frequent and major errors, and holistic analysis of parse acceptability shows +that even with few-shot demonstrations, models have virtually 0% success in +producing fully accurate parses. Eliciting natural language responses produces +similar patterns of errors. Overall, our findings indicate that these models +out-of-the-box can capture aspects of semantic structure, but there remain key +limitations in their ability to support fully accurate semantic analyses or +parses. + +
+
+ comment: EMNLP 2023 Findings (short) +
+
+
+
+
+ + ☆ Utilizing Language Models for Energy Load Forecasting + + +
+ Energy load forecasting plays a crucial role in optimizing resource +allocation and managing energy consumption in buildings and cities. In this +paper, we propose a novel approach that leverages language models for energy +load forecasting. We employ prompting techniques to convert energy consumption +data into descriptive sentences, enabling fine-tuning of language models. By +adopting an autoregressive generating approach, our proposed method enables +predictions of various horizons of future energy load consumption. Through +extensive experiments on real-world datasets, we demonstrate the effectiveness +and accuracy of our proposed method. Our results indicate that utilizing +language models for energy load forecasting holds promise for enhancing energy +efficiency and facilitating intelligent decision-making in energy systems. + +
+
+ comment: BuildSys 2023 Accepted +
+
+
+
+
+ + ☆ Evaluation of large language models using an Indian language LGBTI+ + lexicon + + +
+ Large language models (LLMs) are typically evaluated on the basis of +task-based benchmarks such as MMLU. Such benchmarks do not examine responsible +behaviour of LLMs in specific contexts. This is particularly true in the LGBTI+ +context where social stereotypes may result in variation in LGBTI+ terminology. +Therefore, domain-specific lexicons or dictionaries may be useful as a +representative list of words against which the LLM's behaviour needs to be +evaluated. This paper presents a methodology for evaluation of LLMs using an +LGBTI+ lexicon in Indian languages. The methodology consists of four steps: +formulating NLP tasks relevant to the expected behaviour, creating prompts that +test LLMs, using the LLMs to obtain the output and, finally, manually +evaluating the results. Our qualitative analysis shows that the three LLMs we +experiment on are unable to detect underlying hateful content. Similarly, we +observe limitations in using machine translation as means to evaluate natural +language understanding in languages other than English. The methodology +presented in this paper can be useful for LGBTI+ lexicons in other languages as +well as other domain-specific lexicons. The work done in this paper opens +avenues for responsible behaviour of LLMs, as demonstrated in the context of +prevalent social perception of the LGBTI+ community. + +
+
+ comment: Selected for publication in the AI Ethics Journal published by the + Artificial Intelligence Robotics Ethics Society (AIRES) +
+
+
+
+
+ + ☆ Words, Subwords, and Morphemes: What Really Matters in the + Surprisal-Reading Time Relationship? EMNLP 2023 + + +
+ An important assumption that comes with using LLMs on psycholinguistic data +has gone unverified. LLM-based predictions are based on subword tokenization, +not decomposition of words into morphemes. Does that matter? We carefully test +this by comparing surprisal estimates using orthographic, morphological, and +BPE tokenization against reading time data. Our results replicate previous +findings and provide evidence that in the aggregate, predictions using BPE +tokenization do not suffer relative to morphological and orthographic +segmentation. However, a finer-grained analysis points to potential issues with +relying on BPE-based tokenization, as well as providing promising results +involving morphologically-aware surprisal estimates and suggesting a new method +for evaluating morphological prediction. + +
+
+ comment: Accepted to Findings of EMNLP 2023; 10 pages, 5 figures +
+
+
+
+
+ + ☆ GROOViST: A Metric for Grounding Objects in Visual Storytelling EMNLP 2023 + + +
+ A proper evaluation of stories generated for a sequence of images -- the task +commonly referred to as visual storytelling -- must consider multiple aspects, +such as coherence, grammatical correctness, and visual grounding. In this work, +we focus on evaluating the degree of grounding, that is, the extent to which a +story is about the entities shown in the images. We analyze current metrics, +both designed for this purpose and for general vision-text alignment. Given +their observed shortcomings, we propose a novel evaluation tool, GROOViST, that +accounts for cross-modal dependencies, temporal misalignments (the fact that +the order in which entities appear in the story and the image sequence may not +match), and human intuitions on visual grounding. An additional advantage of +GROOViST is its modular design, where the contribution of each component can be +assessed and interpreted individually. + +
+
+ comment: In EMNLP 2023 main conference proceedings (to appear) +
+
+
+
+
+ + ☆ Social Contract AI: Aligning AI Assistants with Implicit Group Norms NeurIPS 2023 + + +
+ We explore the idea of aligning an AI assistant by inverting a model of +users' (unknown) preferences from observed interactions. To validate our +proposal, we run proof-of-concept simulations in the economic ultimatum game, +formalizing user preferences as policies that guide the actions of simulated +players. We find that the AI assistant accurately aligns its behavior to match +standard policies from the economic literature (e.g., selfish, altruistic). +However, the assistant's learned policies lack robustness and exhibit limited +generalization in an out-of-distribution setting when confronted with a +currency (e.g., grams of medicine) that was not included in the assistant's +training distribution. Additionally, we find that when there is inconsistency +in the relationship between language use and an unknown policy (e.g., an +altruistic policy combined with rude language), the assistant's learning of the +policy is slowed. Overall, our preliminary results suggest that developing +simulation frameworks in which AI assistants need to infer preferences from +diverse users can provide a valuable approach for studying practical alignment +questions. + +
+
+ comment: SoLaR NeurIPS 2023 Workshop (https://solar-neurips.github.io/) +
+
+
+
+
+ + ☆ A Framework for Automated Measurement of Responsible AI Harms in + Generative AI Applications + + +
+ We present a framework for the automated measurement of responsible AI (RAI) +metrics for large language models (LLMs) and associated products and services. +Our framework for automatically measuring harms from LLMs builds on existing +technical and sociotechnical expertise and leverages the capabilities of +state-of-the-art LLMs, such as GPT-4. We use this framework to run through +several case studies investigating how different LLMs may violate a range of +RAI-related principles. The framework may be employed alongside domain-specific +sociotechnical expertise to create measurements for new harm areas in the +future. By implementing this framework, we aim to enable more advanced harm +measurement efforts and further the responsible use of LLMs. + +
+
+ comment: This is a living document +
+
+
+
+
+ + ☆ Salespeople vs SalesBot: Exploring the Role of Educational Value in + Conversational Recommender Systems + + +
+ Making big purchases requires consumers to research or consult a salesperson +to gain domain expertise. However, existing conversational recommender systems +(CRS) often overlook users' lack of background knowledge, focusing solely on +gathering preferences. In this work, we define a new problem space for +conversational agents that aim to provide both product recommendations and +educational value through mixed-type mixed-initiative dialog. We introduce +SalesOps, a framework that facilitates the simulation and evaluation of such +systems by leveraging recent advancements in large language models (LLMs). We +build SalesBot and ShopperBot, a pair of LLM-powered agents that can simulate +either side of the framework. A comprehensive human study compares SalesBot +against professional salespeople, revealing that although SalesBot approaches +professional performance in terms of fluency and informativeness, it lags +behind in recommendation quality. We emphasize the distinct limitations both +face in providing truthful information, highlighting the challenges of ensuring +faithfulness in the CRS context. We release our code and make all data +available. + +
+
+
+
+
+ + ☆ StyleBART: Decorate Pretrained Model with Style Adapters for + Unsupervised Stylistic Headline Generation + + +
+ Stylistic headline generation is the task of generating a headline that not +only summarizes the content of an article, but also reflects a desired style +that attracts users. As style-specific article-headline pairs are scarce, +previous research focuses on unsupervised approaches with a standard headline +generation dataset and mono-style corpora. In this work, we follow this line +and propose StyleBART, an unsupervised approach for stylistic headline +generation. Our method decorates the pretrained BART model with adapters that +are responsible for different styles and allows the generation of headlines +with diverse styles by simply switching the adapters. Different from previous +works, StyleBART separates the task of style learning and headline generation, +making it possible to freely combine the base model and the style adapters +during inference. We further propose an inverse paraphrasing task to enhance +the style adapters. Extensive automatic and human evaluations show that +StyleBART achieves new state-of-the-art performance in the unsupervised +stylistic headline generation task, producing high-quality headlines with the +desired style. + +
+
+
+
+
+ + ☆ ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural + Languages CoNLL 2023 + + +
+ Building multi-modal language models has been a trend in recent years, +where additional modalities such as image, video, speech, etc. are jointly +learned along with natural languages (i.e., textual information). Despite the +success of these multi-modal language models with different modalities, there +is no existing solution for neural network architectures and natural languages. +Providing neural architectural information as a new modality allows us to +provide fast architecture-2-text and text-2-architecture retrieval/generation +services on the cloud with a single inference. Such a solution is valuable in +terms of helping beginner and intermediate ML users to come up with better +neural architectures or AutoML approaches with a simple text query. In this +paper, we propose ArchBERT, a bi-modal model for joint learning and +understanding of neural architectures and natural languages, which opens up new +avenues for research in this area. We also introduce a pre-training strategy +named Masked Architecture Modeling (MAM) for a more generalized joint learning. +Moreover, we introduce and publicly release two new bi-modal datasets for +training and validating our methods. ArchBERT's performance is verified +through a set of numerical experiments on different downstream tasks such as +architecture-oriented reasoning, question answering, and captioning +(summarization). Datasets, codes, and demos are available as supplementary +materials. + +
+
+ comment: CoNLL 2023 +
+
+
+
+
+ + ☆ Investigating Multilingual Coreference Resolution by Universal + Annotations EMNLP2023 + + +
+ Multilingual coreference resolution (MCR) has been a long-standing and +challenging task. With the newly proposed multilingual coreference dataset, +CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by +using its harmonized universal morphosyntactic and coreference annotations. +First, we study coreference by examining the ground truth data at different +linguistic levels, namely mention, entity and document levels, and across +different genres, to gain insights into the characteristics of coreference +across multiple languages. Second, we perform an error analysis of the most +challenging cases that the SotA system fails to resolve in the CRAC 2022 shared +task using the universal annotations. Last, based on this analysis, we extract +features from universal morphosyntactic annotations and integrate these +features into a baseline system to assess their potential benefits for the MCR +task. Our results show that our best configuration of features improves the +baseline by 0.9% F1 score. + +
+
+ comment: Accepted at Findings of EMNLP2023 +
+
+
+
+
+ + ☆ ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training + Quantization Framework for W8A8 Transformers + + +
+ Quantization techniques are pivotal in reducing the memory and computational +demands of deep neural network inference. Existing solutions, such as +ZeroQuant, offer dynamic quantization for models like BERT and GPT but overlook +crucial memory-bounded operators and the complexities of per-token +quantization. Addressing these gaps, we present a novel, fully +hardware-enhanced robust optimized post-training W8A8 quantization framework, +ZeroQuant-HERO. This framework uniquely integrates both memory bandwidth and +compute-intensive operators, aiming for optimal hardware performance. +Additionally, it offers flexibility by allowing specific INT8 modules to switch +to FP16/BF16 mode, enhancing accuracy. + +
+
+ comment: 8 pages, 2 figures +
+
+
+
+
+ + ☆ Large Language Models as Generalizable Policies for Embodied Tasks + + +
+ We show that large language models (LLMs) can be adapted to be generalizable +policies for embodied visual tasks. Our approach, called Large LAnguage model +Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take +as input text instructions and visual egocentric observations and output +actions directly in the environment. Using reinforcement learning, we train +LLaRP to see and act solely through environmental interactions. We show that +LLaRP is robust to complex paraphrasings of task instructions and can +generalize to new tasks that require novel optimal behavior. In particular, on +1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other +common learned baselines or zero-shot applications of LLMs. Finally, to aid the +community in studying language conditioned, massively multi-task, embodied AI +problems we release a novel benchmark, Language Rearrangement, consisting of +150,000 training and 1,000 testing tasks for language-conditioned +rearrangement. Video examples of LLaRP in unseen Language Rearrangement +instructions are at https://llm-rl.github.io. + +
+
+
+
+
+ + ☆ From Transcripts to Insights: Uncovering Corporate Risks Using + Generative AI + + +
+ We explore the value of generative AI tools, such as ChatGPT, in helping +investors uncover dimensions of corporate risk. We develop and validate +firm-level measures of risk exposure to political, climate, and AI-related +risks. Using the GPT 3.5 model to generate risk summaries and assessments from +the context provided by earnings call transcripts, we show that GPT-based +measures possess significant information content and outperform the existing +risk measures in predicting (abnormal) firm-level volatility and firms' choices +such as investment and innovation. Importantly, information in risk assessments +dominates that in risk summaries, establishing the value of general AI +knowledge. We also find that generative AI is effective at detecting emerging +risks, such as AI risk, which has soared in recent quarters. Our measures +perform well both within and outside the GPT's training window and are priced +in equity markets. Taken together, an AI-based approach to risk measurement +provides useful insights to users of corporate disclosures at a low cost. + +
+
+
+
+
+ + ☆ Outlier Dimensions Encode Task-Specific Knowledge EMNLP 2023 + + +
+ Representations from large language models (LLMs) are known to be dominated +by a small subset of dimensions with exceedingly high variance. Previous works +have argued that although ablating these outlier dimensions in LLM +representations hurts downstream performance, outlier dimensions are +detrimental to the representational quality of embeddings. In this study, we +investigate how fine-tuning impacts outlier dimensions and show that 1) outlier +dimensions that occur in pre-training persist in fine-tuned models and 2) a +single outlier dimension can complete downstream tasks with a minimal error +rate. Our results suggest that outlier dimensions can encode crucial +task-specific knowledge and that the value of a representation in a single +outlier dimension drives downstream model decisions. + +
+
+ comment: Camera-ready version for EMNLP 2023 +
+
+
+
+
+ + ☆ Nearest Neighbor Search over Vectorized Lexico-Syntactic Patterns for + Relation Extraction from Financial Documents + + +
+ Relation extraction (RE) has achieved remarkable progress with the help of +pre-trained language models. However, existing RE models are usually incapable +of handling two situations: implicit expressions and long-tail relation +classes, caused by language complexity and data sparsity. Further, these +approaches and models are largely inaccessible to users who don't have direct +access to large language models (LLMs) and/or infrastructure for supervised +training or fine-tuning. Rule-based systems also struggle with implicit +expressions. Apart from this, real-world financial documents such as various +10-X reports (including 10-K, 10-Q, etc.) of publicly traded companies pose +another challenge to rule-based systems in terms of longer and complex +sentences. In this paper, we introduce a simple approach that consults training +relations at test time through a nearest-neighbor search over dense vectors of +lexico-syntactic patterns and provides a simple yet effective means to tackle +the above issues. We evaluate our approach on REFinD and show that our method +achieves state-of-the-art performance. We further show that it can provide a +good start for a human-in-the-loop setup when a small number of annotations are +available and it is also beneficial when domain experts can provide high +quality patterns. + +
+
+
+
+
+ + ☆ Is Explanation the Cure? Misinformation Mitigation in the Short Term and + Long Term EMNLP + + +
+ With advancements in natural language processing (NLP) models, automatic +explanation generation has been proposed to mitigate misinformation on social +media platforms in addition to adding warning labels to identified fake news. +While many researchers have focused on generating good explanations, how these +explanations can really help humans combat fake news is under-explored. In this +study, we compare the effectiveness of a warning label and the state-of-the-art +counterfactual explanations generated by GPT-4 in debunking misinformation. In +a two-wave, online human-subject study, participants (N = 215) were randomly +assigned to a control group in which false contents are shown without any +intervention, a warning tag group in which the false claims were labeled, or an +explanation group in which the false contents were accompanied by GPT-4 +generated explanations. Our results show that both interventions significantly +decrease participants' self-reported belief in fake claims in an equivalent +manner for the short-term and long-term. We discuss the implications of our +findings and directions for future NLP-based misinformation debunking +strategies. + +
+
+ comment: EMNLP Findings 2023 +
+
+
+
+
+ + ☆ The impact of using an AI chatbot to respond to patient messages + + +
+ Documentation burden is a major contributor to clinician burnout, which is +rising nationally and is an urgent threat to our ability to care for patients. +Artificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician +burden by assisting with documentation. Although many hospitals are actively +integrating such systems into electronic medical record systems, AI chatbots' +utility and impact on clinical decision-making have not been studied for this +intended use. We are the first to examine the utility of large language models +in helping clinicians draft responses to patient questions. In our two-stage +cross-sectional study, 6 oncologists responded to 100 realistic synthetic +cancer patient scenarios and portal messages developed to reflect common +medical situations, first manually, then with AI assistance. + We find AI-assisted responses were longer, less readable, but provided +acceptable drafts without edits 58% of the time. AI assistance improved efficiency +77% of the time, with low harm risk (82% safe). However, 7.7% of unedited AI responses +could cause severe harm. In 31% of cases, physicians thought AI drafts were +human-written. AI assistance led to more patient education recommendations and +fewer clinical actions than manual responses. Results show promise for AI to +improve clinician efficiency and patient care through assisting documentation, +if used judiciously. Monitoring model outputs and human-AI interaction remains +crucial for safe implementation. + +
+
+ comment: 4 figures and tables in main, submitted for review +
+
+
+
+
+ + ☆ Non-contrastive sentence representations via self-supervision EMNLP 2023 + + +
+ Sample contrastive methods, typically referred to simply as contrastive, are +the foundation of most unsupervised methods to learn text and sentence +embeddings. On the other hand, a different class of self-supervised loss +functions and methods has been considered in the computer vision community and +referred to as dimension contrastive. In this paper, we thoroughly compare this +class of methods with the standard baseline for contrastive sentence +embeddings, SimCSE. We find that self-supervised embeddings trained using +dimension contrastive objectives can outperform SimCSE on downstream tasks +without needing auxiliary loss functions. + +
+
+ comment: Submitted and rejected by EMNLP 2023. Contact the authors for a copy + of the "reviews" +
+
+
+
+
+ + ♻ ☆ Detecting and Mitigating Hallucinations in Multilingual Summarisation EMNLP 2023 + + +
+ Hallucinations pose a significant challenge to the reliability of neural +models for abstractive summarisation. While automatically generated summaries +may be fluent, they often lack faithfulness to the original document. This +issue becomes even more pronounced in low-resource settings, such as +cross-lingual transfer. With the existing faithful metrics focusing on English, +even measuring the extent of this phenomenon in cross-lingual settings is hard. +To address this, we first develop a novel metric, mFACT, evaluating the +faithfulness of non-English summaries, leveraging translation-based transfer +from multiple English faithfulness metrics. We then propose a simple but +effective method to reduce hallucinations with a cross-lingual transfer, which +weighs the loss of each training example by its faithfulness score. Through +extensive experiments in multiple languages, we demonstrate that mFACT is the +metric that is most suited to detect hallucinations. Moreover, we find that our +proposed loss weighting method drastically increases both performance and +faithfulness according to both automatic and human evaluation when compared to +strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset +are available at https://github.com/yfqiu-nlp/mfact-summ. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ AutoPlan: Automatic Planning of Interactive Decision-Making Tasks With + Large Language Models EMNLP 2023 + + +
+ Recent large language models (LLMs) are promising for making decisions in +grounded environments. However, LLMs frequently fail in complex decision-making +tasks due to the misalignment between the pre-trained knowledge in LLMs and the +actual rules in the environment. Existing methods require either costly +gradient computation or lengthy in-context demonstrations. In this paper, we +propose AutoPlan, an approach to guide LLM-based agents to accomplish +interactive decision-making tasks. AutoPlan augments the LLM prompt with a +task-solving plan and optimizes it through iterative experience collection and +reflection. Our experiments show that AutoPlan, though using no in-context +demonstrations, achieves success rates on par with the baselines using +human-written demonstrations on ALFWorld and even outperforms them by 8% on +HotpotQA. The code is available at https://github.com/owaski/AutoPlan. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ DocumentNet: Bridging the Data Gap in Document Pre-Training EMNLP 2023 + + +
+ Document understanding tasks, in particular, Visually-rich Document Entity +Retrieval (VDER), have gained significant attention in recent years thanks to +their broad applications in enterprise AI. However, publicly available data +have been scarce for these tasks due to strict privacy constraints and high +annotation costs. To make things worse, the non-overlapping entity spaces from +different datasets hinder the knowledge transfer between document types. In +this paper, we propose a method to collect massive-scale and weakly labeled +data from the web to benefit the training of VDER models. The collected +dataset, named DocumentNet, does not depend on specific document types or +entity sets, making it universally applicable to all VDER tasks. The current +DocumentNet consists of 30M documents spanning nearly 400 document types +organized in a four-level ontology. Experiments on a set of broadly adopted +VDER tasks show significant improvements when DocumentNet is incorporated into +the pre-training for both classic and few-shot learning settings. With the +recent emergence of large language models (LLMs), DocumentNet provides a large +data source to extend their multi-modal capabilities for VDER. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Multi-grained Hypergraph Interest Modeling for Conversational + Recommendation + + +
+ Conversational recommender system (CRS) interacts with users through +multi-turn dialogues in natural language, which aims to provide high-quality +recommendations for user's instant information need. Although great efforts +have been made to develop effective CRS, most of them still focus on the +contextual information from the current dialogue, usually suffering from the +data scarcity issue. Therefore, we consider leveraging historical dialogue data +to enrich the limited contexts of the current dialogue session. + In this paper, we propose a novel multi-grained hypergraph interest modeling +approach to capture user interest beneath intricate historical data from +different perspectives. As the core idea, we employ hypergraph to represent +complicated semantic relations underlying historical dialogues. In our +approach, we first employ the hypergraph structure to model users' historical +dialogue sessions and form a session-based hypergraph, which captures +coarse-grained, session-level relations. Second, to alleviate the issue of data +scarcity, we use an external knowledge graph and construct a knowledge-based +hypergraph considering fine-grained, entity-level semantics. We further conduct +multi-grained hypergraph convolution on the two kinds of hypergraphs, and +utilize the enhanced representations to develop interest-aware CRS. Extensive +experiments on two benchmarks ReDial and TG-ReDial validate the effectiveness +of our approach on both recommendation and conversation tasks. Code is +available at: https://github.com/RUCAIBox/MHIM. + +
+
+
+
+
+ + ♻ ☆ Editing Common Sense in Transformers EMNLP 2023 + + +
+ Editing model parameters directly in Transformers makes updating open-source +transformer-based models possible without re-training (Meng et al., 2023). +However, these editing methods have only been evaluated on statements about +encyclopedic knowledge with a single correct answer. Commonsense knowledge with +multiple correct answers, e.g., an apple can be green or red but not +transparent, has not been studied but is as essential for enhancing +transformers' reliability and usefulness. In this paper, we investigate whether +commonsense judgments are causally associated with localized, editable +parameters in Transformers, and we provide an affirmative answer. We find that +directly applying the MEMIT editing algorithm results in sub-par performance +and improve it for the commonsense domain by varying edit tokens and improving +the layer selection strategy, i.e., $MEMIT_{CSK}$. GPT-2 Large and XL models +edited using $MEMIT_{CSK}$ outperform best-fine-tuned baselines by 10.97% and +10.73% F1 scores on PEP3k and 20Q datasets. In addition, we propose a novel +evaluation dataset, PROBE SET, that contains unaffected and affected +neighborhoods, affected paraphrases, and affected reasoning challenges. +$MEMIT_{CSK}$ performs well across the metrics while fine-tuning baselines show +significant trade-offs between unaffected and affected metrics. These results +suggest a compelling future direction for incorporating feedback about common +sense into Transformers through direct model editing. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference. Anshita, Debanjan, Akshay are + co-first authors. Code and datasets for all experiments are available at + https://github.com/anshitag/memit_csk +
+
+
+
+
+ + ♻ ☆ On Classifying Continuous Constraint Satisfaction Problems + + +
+ A continuous constraint satisfaction problem (CCSP) is a constraint +satisfaction problem (CSP) with an interval domain $U \subset \mathbb{R}$. We +engage in a systematic study to classify CCSPs that are complete for the +Existential Theory of the Reals, i.e., ER-complete. To define this class, we +first consider the problem ETR, which also stands for Existential Theory of the +Reals. In an instance of this problem we are given some sentence of the form +$\exists x_1, \ldots, x_n \in \mathbb{R} : \Phi(x_1, \ldots, x_n)$, where +$\Phi$ is a well-formed quantifier-free formula consisting of the symbols $\{0, +1, +, \cdot, \geq, >, \wedge, \vee, \neg\}$, and the goal is to check whether this +sentence is true. Now the class ER is the family of all problems that admit a +polynomial-time many-one reduction to ETR. It is known that NP $\subseteq$ ER +$\subseteq$ PSPACE. + We restrict our attention to CCSPs with addition constraints ($x + y = z$) +and some other mild technical condition. Previously, it was shown that +multiplication constraints ($x \cdot y = z$), squaring constraints ($x^2 = y$), +or inversion constraints ($x\cdot y = 1$) are sufficient to establish +ER-completeness. We extend this in the strongest possible sense for equality +constraints as follows. We show that CCSPs (with addition constraints and some +other mild technical condition) that have any one well-behaved curved equality +constraint ($f(x,y) = 0$) are ER-complete. We further extend our results to +inequality constraints. We show that any well-behaved convexly curved and any +well-behaved concavely curved inequality constraint ($f(x,y) \geq 0$ and +$g(x,y) \geq 0$) imply ER-completeness on the class of such CCSPs. + +
+
+ comment: 39 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ Multilingual Natural Language Processing Model for Radiology Reports -- + The Summary is all you need! + + +
+ The impression section of a radiology report summarizes important radiology +findings and plays a critical role in communicating these findings to +physicians. However, the preparation of these summaries is time-consuming and +error-prone for radiologists. Recently, numerous models for radiology report +summarization have been developed. Nevertheless, there is currently no model +that can summarize these reports in multiple languages. Such a model could +greatly improve future research and the development of Deep Learning models +that incorporate data from patients with different ethnic backgrounds. In this +study, the generation of radiology impressions in different languages was +automated by fine-tuning a model, publicly available, based on a multilingual +text-to-text Transformer to summarize findings available in English, +Portuguese, and German radiology reports. In a blind test, two board-certified +radiologists indicated that for at least 70% of the system-generated summaries, +the quality matched or exceeded the corresponding human-written summaries, +suggesting substantial clinical reliability. Furthermore, this study showed +that the multilingual model outperformed other models that specialized in +summarizing radiology reports in only one language, as well as models that were +not specifically designed for summarizing radiology reports, such as ChatGPT. + +
+
+ comment: Problems with the model +
+
+
+
+
+ + ♻ ☆ Event knowledge in large language models: the gap between the impossible + and the unlikely + + +
+ Word co-occurrence patterns in language corpora contain a surprising amount +of conceptual knowledge. Large language models (LLMs), trained to predict words +in context, leverage these patterns to achieve impressive performance on +diverse semantic tasks requiring world knowledge. An important but understudied +question about LLMs' semantic abilities is whether they acquire generalized +knowledge of common events. Here, we test whether five pre-trained LLMs (from +2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions +of agent-patient interactions than to minimally different implausible versions +of the same event. Using three curated sets of minimal sentence pairs (total +n=1,215), we found that pre-trained LLMs possess substantial event knowledge, +outperforming other distributional language models. In particular, they almost +always assign higher likelihood to possible vs. impossible events (The teacher +bought the laptop vs. The laptop bought the teacher). However, LLMs show less +consistent preferences for likely vs. unlikely events (The nanny tutored the +boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM +scores are driven by both plausibility and surface-level sentence features, +(ii) LLM scores generalize well across syntactic variants (active vs. passive +constructions) but less well across semantic variants (synonymous sentences), +(iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence +plausibility serves as an organizing dimension in internal LLM representations. +Overall, our results show that important aspects of event knowledge naturally +emerge from distributional linguistic patterns, but also highlight a gap +between representations of possible/impossible and likely/unlikely events. + +
+
+ comment: The two lead authors have contributed equally to this work +
+
+
+
+
+ + ♻ ☆ NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial + Reports + + +
+ How can we interpret and retrieve medical evidence to support clinical +decisions? Clinical trial reports (CTR) amassed over the years contain +indispensable information for the development of personalized medicine. +However, it is practically infeasible to manually inspect over 400,000+ +clinical trial reports in order to find the best evidence for experimental +treatments. Natural Language Inference (NLI) offers a potential solution to +this problem, by allowing the scalable computation of textual entailment. +However, existing NLI models perform poorly on biomedical corpora, and +previously published datasets fail to capture the full complexity of inference +over CTRs. In this work, we present a novel resource to advance research on NLI +for reasoning on CTRs. The resource includes two main tasks. Firstly, to +determine the inference relation between a natural language statement, and a +CTR. Secondly, to retrieve supporting facts to justify the predicted relation. +We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these +tasks. Baselines on this corpus expose the limitations of existing NLI models, +with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To +the best of our knowledge, we are the first to design a task that covers the +interpretation of full CTRs. To encourage further work on this challenging +dataset, we make the corpus, competition leaderboard, website and code to +replicate the baseline experiments available at: +https://github.com/ai-systems/nli4ct + +
+
+ comment: 15 pages +
+
+
+
+
+ + ♻ ☆ Lexical Diversity in Kinship Across Languages and Dialects + + +
+ Languages are known to describe the world in diverse ways. Across lexicons, +diversity is pervasive, appearing through phenomena such as lexical gaps and +untranslatability. However, in computational resources, such as multilingual +lexical databases, diversity is hardly ever represented. In this paper, we +introduce a method to enrich computational lexicons with content relating to +linguistic diversity. The method is verified through two large-scale case +studies on kinship terminology, a domain known to be diverse across languages +and cultures: one case study deals with seven Arabic dialects, while the other +one with three Indonesian languages. Our results, made available as browseable +and downloadable computational resources, extend prior linguistics research on +kinship terminology, and provide insight into the extent of diversity even +within linguistically and culturally close communities. + +
+
+
+
+
+ + ♻ ☆ Visually-Situated Natural Language Understanding with Contrastive + Reading Model and Frozen Large Language Models EMNLP 2023 + + +
+ Recent advances in Large Language Models (LLMs) have stimulated a surge of +research aimed at extending their applications to the visual domain. While +these models exhibit promise in generating abstract image captions and +facilitating natural conversations, their performance on text-rich images still +requires improvement. In this paper, we introduce Contrastive Reading Model +(Cream), a novel neural architecture designed to enhance the language-image +understanding capability of LLMs by capturing intricate details that are often +overlooked in existing methods. Cream combines vision and auxiliary encoders, +fortified by a contrastive feature alignment technique, to achieve a more +effective comprehension of language information in visually situated contexts +within the images. Our approach bridges the gap between vision and language +understanding, paving the way for the development of more sophisticated +Document Intelligence Assistants. Through rigorous evaluations across diverse +visually-situated language understanding tasks that demand reasoning +capabilities, we demonstrate the compelling performance of Cream, positioning +it as a prominent model in the field of visual document understanding. We +provide our codebase and newly-generated datasets at +https://github.com/naver-ai/cream . + +
+
+ comment: 22 pages; To appear at EMNLP 2023 Main Conference (Project page: + https://naver-ai.github.io/cream ) +
+
+
+
+
+ + ♻ ☆ Large Content And Behavior Models To Understand, Simulate, And Optimize + Content And Behavior + + +
+ Shannon, in his seminal paper introducing information theory, divided the +communication into three levels: technical, semantic, and effectiveness. While +the technical level is concerned with accurate reconstruction of transmitted +symbols, the semantic and effectiveness levels deal with the inferred meaning +and its effect on the receiver. Thanks to telecommunications, the first level +problem has produced great advances like the internet. Large Language Models +(LLMs) make some progress towards the second goal, but the third level still +remains largely untouched. The third problem deals with predicting and +optimizing communication for desired receiver behavior. LLMs, while showing +wide generalization capabilities across a wide range of tasks, are unable to +solve for this. One reason for the underperformance could be a lack of +``behavior tokens'' in LLMs' training corpora. Behavior tokens define receiver +behavior over a communication, such as shares, likes, clicks, purchases, +retweets, etc. While preprocessing data for LLM training, behavior tokens are +often removed from the corpora as noise. Therefore, in this paper, we make some +initial progress towards reintroducing behavior tokens in LLM training. The +trained models, other than showing similar performance to LLMs on content +understanding tasks, show generalization capabilities on behavior simulation, +content simulation, behavior understanding, and behavior domain adaptation. +Using a wide range of tasks on two corpora, we show results on all these +capabilities. We call these models Large Content and Behavior Models (LCBMs). +Further, to spur more research on LCBMs, we release our new Content Behavior +Corpus (CBC), a repository containing communicator, message, and corresponding +receiver behavior. + +
+
+
+
+
+ + ♻ ☆ Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into + the Morphological Capabilities of a Large Language Model EMNLP 2023 + + +
+ Large language models (LLMs) have recently reached an impressive level of +linguistic capability, prompting comparisons with human language skills. +However, there have been relatively few systematic inquiries into the +linguistic capabilities of the latest generation of LLMs, and those studies +that do exist (i) ignore the remarkable ability of humans to generalize, (ii) +focus only on English, and (iii) investigate syntax or semantics and overlook +other capabilities that lie at the heart of human language, like morphology. +Here, we close these gaps by conducting the first rigorous analysis of the +morphological capabilities of ChatGPT in four typologically varied languages +(specifically, English, German, Tamil, and Turkish). We apply a version of +Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for +the four examined languages. We find that ChatGPT massively underperforms +purpose-built systems, particularly in English. Overall, our results -- through +the lens of morphology -- cast a new light on the linguistic capabilities of +ChatGPT, suggesting that claims of human-like language skills are premature and +misleading. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Describe me an Aucklet: Generating Grounded Perceptual Category + Descriptions EMNLP + + +
+ Human speakers can generate descriptions of perceptual concepts, abstracted +from the instance-level. Moreover, such descriptions can be used by other +speakers to learn provisional representations of those concepts. Learning and +using abstract perceptual concepts is under-investigated in the +language-and-vision field. The problem is also highly relevant to the field of +representation learning in multi-modal NLP. In this paper, we introduce a +framework for testing category-level perceptual grounding in multi-modal +language models. In particular, we train separate neural networks to generate +and interpret descriptions of visual categories. We measure the communicative +success of the two models with the zero-shot classification performance of the +interpretation model, which we argue is an indicator of perceptual grounding. +Using this framework, we compare the performance of prototype- and +exemplar-based representations. Finally, we show that communicative success +exposes performance issues in the generation model, not captured by traditional +intrinsic NLG evaluation metrics, and argue that these issues stem from a +failure to properly ground language in vision at the category level. + +
+
+ comment: To appear in Proceedings of the 2023 Conference on Empirical Methods + in Natural Language Processing (EMNLP, Main) +
+
+
+
+
+ + ♻ ☆ Adapting Offline Speech Translation Models for Streaming with + Future-Aware Distillation and Inference EMNLP 2023 + + +
+ A popular approach to streaming speech translation is to employ a single +offline model with a wait-k policy to support different latency requirements, +which is simpler than training multiple online models with different latency +constraints. However, there is a mismatch problem in using a model trained with +complete utterances for streaming inference with partial input. We demonstrate +that speech representations extracted at the end of a streaming input are +significantly different from those extracted from a complete utterance. To +address this issue, we propose a new approach called Future-Aware Streaming +Translation (FAST) that adapts an offline ST model for streaming input. FAST +includes a Future-Aware Inference (FAI) strategy that incorporates future +context through a trainable masked embedding, and a Future-Aware Distillation +(FAD) framework that transfers future context from an approximation of full +speech to streaming input. Our experiments on the MuST-C EnDe, EnEs, and EnFr +benchmarks show that FAST achieves better trade-offs between translation +quality and latency than strong baselines. Extensive analyses suggest that our +methods effectively alleviate the aforementioned mismatch problem between +offline training and online inference. + +
+
+ comment: Accepted to EMNLP 2023 main conference +
+
+
+
+
+ + ♻ ☆ Recycle-and-Distill: Universal Compression Strategy for + Transformer-based Speech SSL Models with Attention Map Reusing and Masking + Distillation + + +
+ Transformer-based speech self-supervised learning (SSL) models, such as +HuBERT, show surprising performance in various speech processing tasks. +However, the huge number of parameters in speech SSL models necessitates +compression to a more compact model for wider usage in academia or small +companies. In this study, we suggest reusing attention maps across the +Transformer layers, so as to remove key and query parameters while retaining +the number of layers. Furthermore, we propose a novel masking distillation +strategy to improve the student model's speech representation quality. We +extend the distillation loss to utilize both masked and unmasked speech frames +to fully leverage the teacher model's high-quality representation. Our +universal compression strategy yields a student model that achieves a phoneme +error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB +benchmark. + +
+
+ comment: Proceedings of Interspeech 2023. Code URL: + https://github.com/sungnyun/ARMHuBERT +
+
+
+
+
+ + ♻ ☆ A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In + Zero Shot EMNLP-23 + + +
+ Multimedia content, such as advertisements and story videos, exhibit a rich +blend of creativity and multiple modalities. They incorporate elements like +text, visuals, audio, and storytelling techniques, employing devices like +emotions, symbolism, and slogans to convey meaning. There is a dearth of large +annotated training datasets in the multimedia domain hindering the development +of supervised learning models with satisfactory performance for real-world +applications. On the other hand, the rise of large language models (LLMs) has +witnessed remarkable zero-shot performance in various natural language +processing (NLP) tasks, such as emotion classification, question-answering, and +topic classification. To leverage such advanced techniques to bridge this +performance gap in multimedia understanding, we propose verbalizing long videos +to generate their descriptions in natural language, followed by performing +video-understanding tasks on the generated story as opposed to the original +video. Through extensive experiments on fifteen video-understanding tasks, we +demonstrate that our method, despite being zero-shot, achieves significantly +better results than supervised baselines for video understanding. Furthermore, +to alleviate a lack of story understanding benchmarks, we publicly release the +first dataset on a crucial task in computational social science on persuasion +strategy identification. + +
+
+ comment: Accepted to EMNLP-23 TL;DR: Video understanding lags far behind NLP; + LLMs excel in zero-shot. Our approach utilizes LLMs to verbalize videos, + creating stories for zero-shot video understanding. This yields + state-of-the-art results across five datasets, covering fifteen tasks +
+
+
+
+
+ + ♻ ☆ Can Language Models Laugh at YouTube Short-form Videos? EMNLP 2023 + + +
+ As short-form funny videos on social networks are gaining popularity, it +becomes demanding for AI models to understand them for better communication +with humans. Unfortunately, previous video humor datasets target specific +domains, such as speeches or sitcoms, and mostly focus on verbal cues. We +curate a user-generated dataset of 10K multimodal funny videos from YouTube, +called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both +verbal and visual elements contributing to humor. After filtering, we annotate +each video with timestamps and text explanations for funny moments. Our +ExFunTube is unique over existing datasets in that our videos cover a wide +range of domains with various types of humor that necessitate a multimodal +understanding of the content. Also, we develop a zero-shot video-to-text +prompting to maximize video humor understanding of large language models +(LLMs). With three different evaluation methods using automatic scores, +rationale quality experiments, and human evaluations, we show that our +prompting significantly improves LLMs' ability for humor explanation. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with the +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong continual learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ♻ ☆ SkyMath: Technical Report + + +
+ Large language models (LLMs) have shown great potential to solve varieties of +natural language processing (NLP) tasks, including mathematical reasoning. In +this work, we present SkyMath, a large language model for mathematics with 13 +billion parameters. By applying self-compare fine-tuning, we have enhanced +mathematical reasoning abilities of Skywork-13B-Base remarkably. On GSM8K, +SkyMath outperforms all known open-source models of similar size and has +established a new SOTA performance. + +
+
+
+
+
+ + ♻ ☆ Is ChatGPT A Good Keyphrase Generator? A Preliminary Study + + +
+ The emergence of ChatGPT has recently garnered significant attention from the +computational linguistics community. To demonstrate its capabilities as a +keyphrase generator, we conduct a preliminary evaluation of ChatGPT for the +keyphrase generation task. We evaluate its performance in various aspects, +including keyphrase generation prompts, keyphrase generation diversity, and +long document understanding. Our evaluation is based on six benchmark datasets, +and we adopt the prompt suggested by OpenAI while extending it to six candidate +prompts. We find that ChatGPT performs exceptionally well on all six candidate +prompts, with minor performance differences observed across the datasets. Based +on our findings, we conclude that ChatGPT has great potential for keyphrase +generation. Moreover, we discover that ChatGPT still faces challenges when it +comes to generating absent keyphrases. Meanwhile, in the final section, we also +present some limitations and future expansions of this report. + +
+
+ comment: Technical Report, 7 pages +
+
+
+
+
+ + ♻ ☆ Read and Reap the Rewards: Learning to Play Atari with the Help of + Instruction Manuals + + +
+ High sample complexity has long been a challenge for RL. On the other hand, +humans learn to perform tasks not only from interaction or demonstrations, but +also by reading unstructured text documents, e.g., instruction manuals. +Instruction manuals and wiki pages are among the most abundant data that could +inform agents of valuable features and policies or task-specific environmental +dynamics and reward structures. Therefore, we hypothesize that the ability to +utilize human-written instruction manuals to assist learning policies for +specific tasks should lead to a more efficient and better-performing agent. We +propose the Read and Reward framework. Read and Reward speeds up RL algorithms +on Atari games by reading manuals released by the Atari game developers. Our +framework consists of a QA Extraction module that extracts and summarizes +relevant information from the manual and a Reasoning module that evaluates +object-agent interactions based on information from the manual. An auxiliary +reward is then provided to a standard A2C RL agent, when interaction is +detected. Experimentally, various RL algorithms obtain significant improvement +in performance and training speed when assisted by our design. + +
+
+
+
+
+ + ♻ ☆ What are Public Concerns about ChatGPT? A Novel Self-Supervised Neural + Topic Model Tells You + + +
+ The recently released artificial intelligence conversational agent, ChatGPT, +has gained significant attention in academia and real life. A multitude of +early ChatGPT users eagerly explore its capabilities and share their opinions +on it via social media. Both user queries and social media posts express public +concerns regarding this advanced dialogue system. To mine public concerns about +ChatGPT, a novel Self-Supervised neural Topic Model (SSTM), which formalizes +topic modeling as a representation learning procedure, is proposed in this +paper. Extensive experiments have been conducted on Twitter posts about ChatGPT +and queries asked by ChatGPT users. And experimental results demonstrate that +the proposed approach could extract higher quality public concerns with +improved interpretability and diversity, surpassing the performance of +state-of-the-art approaches. + +
+
+ comment: The paper requires major revision +
+
+
+
+
+ + ♻ ☆ CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity + and Infant Care NeurIPS 2023 + + +
+ The recent advances in natural language processing (NLP) have led to a new +trend of applying large language models (LLMs) to real-world scenarios. While +the latest LLMs are astonishingly fluent when interacting with humans, they +suffer from the misinformation problem by unintentionally generating factually +false statements. This can lead to harmful consequences, especially when +produced within sensitive contexts, such as healthcare. Yet few previous works +have focused on evaluating misinformation in the long-form (LF) generation of +LLMs, especially for knowledge-intensive topics. Moreover, although LLMs have +been shown to perform well in different languages, misinformation evaluation +has been mostly conducted in English. To this end, we present a benchmark, +CARE-MI, for evaluating LLM misinformation in: 1) a sensitive topic, +specifically the maternity and infant care domain; and 2) a language other than +English, namely Chinese. Most importantly, we provide an innovative paradigm +for building LF generation evaluation benchmarks that can be transferred to +other knowledge-intensive domains and low-resourced languages. Our proposed +benchmark fills the gap between the extensive usage of LLMs and the lack of +datasets for assessing the misinformation generated by these models. It +contains 1,612 expert-checked questions, accompanied by human-selected +references. Using our benchmark, we conduct extensive experiments and find +that current Chinese LLMs are far from perfect in the topic of maternity and +infant care. In an effort to minimize the reliance on human resources for +performance evaluation, we offer off-the-shelf judgment models for +automatically assessing the LF output of LLMs given benchmark questions. +Moreover, we compare potential solutions for LF generation evaluation and +provide insights for building better automated metrics. + +
+
+ comment: NeurIPS 2023 Datasets and Benchmarks Track +
+
+
+
+
+ + ♻ ☆ Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, + and LLMs Evaluations NeurIPS 2023 + + +
+ This paper reexamines the research on out-of-distribution (OOD) robustness in +the field of NLP. We find that the distribution shift settings in previous +studies commonly lack adequate challenges, hindering the accurate evaluation of +OOD robustness. To address these issues, we propose a benchmark construction +protocol that ensures clear differentiation and challenging distribution +shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution +robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we +conduct a series of experiments on pre-trained language models for analysis and +evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the +relationship between in-distribution (ID) and OOD performance. We identify +three typical types that unveil the inner learning mechanism, which could +potentially facilitate the forecasting of OOD robustness, correlating with the +advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and +find that, despite exhibiting some effectiveness in specific cases, they do not +offer significant improvement compared to vanilla fine-tuning. Further, we +evaluate 5 LLMs with various adaptation paradigms and find that when sufficient +ID data is available, fine-tuning domain-specific models outperform LLMs on ID +examples significantly. However, in the case of OOD instances, prioritizing +LLMs with in-context learning yields better results. We identify that both +fine-tuned small models and LLMs face challenges in effectively addressing +downstream tasks. The code is public at +https://github.com/lifan-yuan/OOD_NLP. + +
+
+ comment: Accepted to NeurIPS 2023 Dataset and Benchmark Track. Code is + available at https://github.com/lifan-yuan/OOD_NLP +
+
+
+
+
+ + ♻ ☆ ZipLM: Inference-Aware Structured Pruning of Language Models NeurIPS 2023 + + +
+ The breakthrough performance of large language models (LLMs) comes with major +computational footprints and high deployment costs. In this paper, we progress +towards resolving this problem by proposing a novel structured compression +approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art +accuracy-vs-speedup, while matching a set of desired target runtime speedups in +any given inference environment. Specifically, given a model, a dataset, an +inference environment, as well as a set of speedup targets, ZipLM iteratively +identifies and removes components with the worst loss-runtime trade-off. Unlike +prior methods that specialize in either the post-training/one-shot or the +gradual compression setting, and only for specific families of models such as +BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed +models across all these settings. Furthermore, ZipLM achieves superior results +for a fraction of the computational cost relative to prior distillation and +pruning techniques, making it a cost-effective approach for generating an +entire family of smaller, faster, and highly accurate models, guaranteed to +meet the desired inference specifications. In particular, ZipLM outperforms all +prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and +TinyBERT. Moreover, it matches the performance of the heavily optimized +MobileBERT model, obtained via extensive architecture search, by simply pruning +the baseline BERT-large model. When compressing GPT2, ZipLM outperforms +DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: +https://github.com/IST-DASLab/ZipLM. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Estimating Large Language Model Capabilities without Labeled Test Data EMNLP 2023 + + +
+ Large Language Models (LLMs) have the impressive ability to perform +in-context learning (ICL) from only a few examples, but the success of ICL +varies widely from task to task. Thus, it is important to quickly determine +whether ICL is applicable to a new task, but directly evaluating ICL accuracy +can be expensive in situations where test data is expensive to annotate -- the +exact situations where ICL is most appealing. In this paper, we propose the +task of ICL accuracy estimation, in which we predict the accuracy of an LLM +when doing in-context learning on a new task given only unlabeled test data for +that task. To perform ICL accuracy estimation, we propose a method that trains +a meta-model using LLM confidence scores as features. We compare our method to +several strong accuracy estimation baselines on a new benchmark that covers 4 +LLMs and 3 task collections. The meta-model improves over all baselines across +8 out of 12 settings and achieves the same estimation performance as directly +evaluating on 40 collected labeled test examples per task. At the same time, no +existing approach provides an accurate and reliable ICL accuracy estimation in +every setting, highlighting the need for better ways to measure the uncertainty +of LLM predictions. + +
+
+ comment: Accepted to EMNLP 2023 Findings. Camera-ready version. Code: + https://github.com/harvey-fin/icl-estimate +
+
+
+
+
+ + ♻ ☆ Bhasha-Abhijnaanam: Native-script and romanized Language Identification + for 22 Indic languages ACL 2023 + + +
+ We create publicly available language identification (LID) datasets and +models in all 22 Indian languages listed in the Indian constitution in both +native-script and romanized text. First, we create Bhasha-Abhijnaanam, a +language identification test set for native-script as well as romanized text +which spans all 22 Indic languages. We also train IndicLID, a language +identifier for all the above-mentioned languages in both native and romanized +script. For native-script text, it has better language coverage than existing +LIDs and is competitive or better than other LIDs. IndicLID is the first LID +for romanized text in Indian languages. Two major challenges for romanized text +LID are the lack of training data and low-LID performance when languages are +similar. We provide simple and effective solutions to these problems. In +general, there has been limited work on romanized text in any language, and our +findings are relevant to other languages that need romanized language +identification. Our models are publicly available at +https://ai4bharat.iitm.ac.in/indiclid under open-source licenses. Our training +and test sets are also publicly available at +https://ai4bharat.iitm.ac.in/bhasha-abhijnaanam under open-source licenses. + +
+
+ comment: Accepted to ACL 2023 +
+
+
+
+
+ + ♻ ☆ Global Structure Knowledge-Guided Relation Extraction Method for + Visually-Rich Document EMNLP 2023 + + +
+ Visual Relation Extraction (VRE) is a powerful means of discovering +relationships between entities within visually-rich documents. Existing methods +often focus on manipulating entity features to find pairwise relations, yet +neglect the more fundamental structural information that links disparate entity +pairs together. The absence of global structure information may make the model +struggle to learn long-range relations and easily predict conflicted results. +To alleviate such limitations, we propose a GlObal Structure knowledge-guided +relation Extraction (GOSE) framework. GOSE initiates by generating preliminary +relation predictions on entity pairs extracted from a scanned image of the +document. Subsequently, global structural knowledge is captured from the +preceding iterative predictions, which are then incorporated into the +representations of the entities. This "generate-capture-incorporate" cycle is +repeated multiple times, allowing entity representations and global structure +knowledge to be mutually reinforced. Extensive experiments validate that +GOSE not only outperforms existing methods in the standard fine-tuning +setting but also reveals superior cross-lingual learning capabilities; indeed, +even yields stronger data-efficient performance in the low-resource setting. +The code for GOSE will be available at https://github.com/chenxn2020/GOSE. + +
+
+ comment: Accepted by EMNLP 2023 (Findings) +
+
+
+
+
+ + ♻ ☆ Aksharantar: Open Indic-language Transliteration datasets and models for + the Next Billion Users EMNLP + + +
+ Transliteration is very important in the Indian language context due to the +usage of multiple scripts and the widespread use of romanized inputs. However, +few training and evaluation sets are publicly available. We introduce +Aksharantar, the largest publicly available transliteration dataset for Indian +languages created by mining from monolingual and parallel corpora, as well as +collecting data from human annotators. The dataset contains 26 million +transliteration pairs for 21 Indic languages from 3 language families using 12 +scripts. Aksharantar is 21 times larger than existing datasets and is the first +publicly available dataset for 7 languages and 1 language family. We also +introduce the Aksharantar testset comprising 103k word pairs spanning 19 +languages that enables a fine-grained analysis of transliteration models on +native origin words, foreign words, frequent words, and rare words. Using the +training set, we trained IndicXlit, a multilingual transliteration model that +improves accuracy by 15% on the Dakshina test set, and establishes strong +baselines on the Aksharantar testset introduced in this work. The models, +mining scripts, transliteration guidelines, and datasets are available at +https://github.com/AI4Bharat/IndicXlit under open-source licenses. We hope the +availability of these large-scale, open resources will spur innovation for +Indic language transliteration and downstream applications. + +
+
+ comment: This manuscript is an extended version of the paper accepted to EMNLP + Findings 2023. You can find the EMNLP Findings version at + https://anoopkunchukuttan.gitlab.io/publications/emnlp_findings_2023_aksharantar.pdf +
+
+
+
+
+ + ♻ ☆ Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation EMNLP 2023 + + +
+ Instruction tuning has emerged to enhance the capabilities of large language +models (LLMs) to comprehend instructions and generate appropriate responses. +Existing methods either manually annotate or employ LLM (e.g., GPT-series) to +generate data for instruction tuning. However, they often overlook associating +instructions with existing annotated datasets. In this paper, we propose +Dynosaur, a dynamic growth paradigm for the automatic curation of +instruction-tuning data. Based on the metadata of existing datasets, we use +LLMs to automatically construct instruction-tuning data by identifying relevant +data fields and generating appropriate instructions. + By leveraging the existing annotated datasets, Dynosaur offers several +advantages: 1) it reduces the API cost for generating instructions (e.g., it +costs less than $12 USD by calling GPT-3.5-turbo for generating 800K +instruction tuning samples); 2) it provides high-quality data for instruction +tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform +with comparable data sizes); and 3) it supports the continuous improvement of +models by generating instruction-tuning data when a new annotated dataset +becomes available. We further investigate a continual learning scheme for +learning with the ever-growing instruction-tuning dataset, and demonstrate that +replaying tasks with diverse instruction embeddings not only helps mitigate +forgetting issues but generalizes to unseen tasks better. + Code and data are available at https://github.com/WadeYin9712/Dynosaur. + +
+
+ comment: EMNLP 2023. Code and data are available at + https://github.com/WadeYin9712/Dynosaur +
+
+
+
+
+ + ♻ ☆ A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue + Information Extraction + + +
+ This paper focuses on term-status pair extraction from medical dialogues +(MD-TSPE), which is essential in diagnosis dialogue systems and the automatic +scribe of electronic medical records (EMRs). In the past few years, works on +MD-TSPE have attracted increasing research attention, especially after the +remarkable progress made by generative methods. However, these generative +methods output a whole sequence consisting of term-status pairs in one stage +and ignore integrating prior knowledge, which demands a deeper understanding to +model the relationship between terms and infer the status of each term. This +paper presents a knowledge-enhanced two-stage generative framework (KTGF) to +address the above challenges. Using task-specific prompts, we employ a single +model to complete the MD-TSPE through two phases in a unified generative form: +we generate all terms first and then generate the status of each generated +term. In this way, the relationship between terms can be learned more +effectively from the sequence containing only terms in the first phase, and our +designed knowledge-enhanced prompt in the second phase can leverage the +category and status candidates of the generated term for status generation. +Furthermore, our proposed special status "not mentioned" makes more terms +available and enriches the training data in the second phase, which is critical +in the low-resource setting. The experiments on the Chunyu and CMDD datasets +show that the proposed method achieves superior results compared to the +state-of-the-art models in the full training and low-resource settings. + +
+
+ comment: Published in Machine Intelligence Research +
+
+
+
+
+ + ♻ ☆ DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning + in Language Models NeurIPS 2023 + + +
+ A long-standing goal of AI systems is to perform complex multimodal reasoning +like humans. Recently, large language models (LLMs) have made remarkable +strides in such multi-step reasoning on the language modality solely by +leveraging the chain of thought (CoT) to mimic human thinking. However, the +transfer of these advancements to multimodal contexts introduces heightened +challenges, including but not limited to the impractical need for +labor-intensive annotation and the limitations in terms of flexibility, +generalizability, and explainability. To evoke CoT reasoning in multimodality, +this work first conducts an in-depth analysis of these challenges posed by +multimodality and presents two key insights: "keeping critical thinking" and +"letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this +study proposes a novel DDCoT prompting that maintains a critical attitude +through negative-space prompting and incorporates multimodality into reasoning +by first dividing the reasoning responsibility of LLMs into reasoning and +recognition and then integrating the visual recognition capability of visual +models into the joint reasoning process. The rationales generated by DDCoT not +only improve the reasoning abilities of both large and small language models in +zero-shot prompting and fine-tuning learning, significantly outperforming +state-of-the-art methods but also exhibit impressive generalizability and +explainability. + +
+
+ comment: 24 pages, 13 figures, to be published in NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ AlpaGasus: Training A Better Alpaca with Fewer Data + + +
+ Large language models (LLMs) strengthen instruction-following capability +through instruction-finetuning (IFT) on supervised instruction/response data. +However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly +contain many low-quality instances with incorrect or irrelevant responses, +which are misleading and detrimental to IFT. In this paper, we propose a simple +and effective data selection strategy that automatically identifies and filters +out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we +introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered +from the 52k Alpaca data. AlpaGasus significantly outperforms the original +Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human +evaluation. Its 13B variant matches >90% performance of its teacher LLM +(i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also +provides 5.7x faster training, reducing the training time for a 7B variant from +80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the +efficacy of our method across diverse datasets, base models, and LLM filters. +Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be +generally applied to instruction-tuning data, leading to faster training and +better instruction-following models. Our project page is available at: +https://lichang-chen.github.io/AlpaGasus/ + +
+
+ comment: 32 Pages; 29 Figures; 15 Tables +
+
+
+
+
+ + ♻ ☆ Unraveling Feature Extraction Mechanisms in Neural Networks EMNLP 2023 + + +
+ The underlying mechanism of neural networks in capturing precise knowledge +has been the subject of consistent research efforts. In this work, we propose a +theoretical approach based on Neural Tangent Kernels (NTKs) to investigate such +mechanisms. Specifically, considering the infinite network width, we +hypothesize the learning dynamics of target models may intuitively unravel the +features they acquire from training data, deepening our insights into their +internal mechanisms. We apply our approach to several fundamental models and +reveal how these models leverage statistical features during gradient descent +and how they are integrated into final decisions. We also discovered that the +choice of activation function can affect feature extraction. For instance, the +use of the ReLU activation function could potentially introduce a bias +in features, providing a plausible explanation for its replacement with +alternative functions in recent pre-trained language models. Additionally, we +find that while self-attention and CNN models may exhibit limitations in +learning n-grams, multiplication-based models seem to excel in this area. We +verify these theoretical findings through experiments and find that they can be +applied to analyze language modeling tasks, which can be regarded as a special +variant of classification. Our contributions offer insights into the roles and +capacities of fundamental components within large language models, thereby +aiding the broader understanding of these complex systems. + +
+
+ comment: Accepted by EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Simultaneous Machine Translation with Tailored Reference EMNLP 2023 + + +
+ Simultaneous machine translation (SiMT) generates translation while reading +the whole source sentence. However, existing SiMT models are typically trained +using the same reference disregarding the varying amounts of available source +information at different latency. Training the model with ground-truth at low +latency may introduce forced anticipations, whereas utilizing reference +consistent with the source word order at high latency results in performance +degradation. Consequently, it is crucial to train the SiMT model with +appropriate reference that avoids forced anticipations during training while +maintaining high quality. In this paper, we propose a novel method that +provides tailored reference for the SiMT models trained at different latency by +rephrasing the ground-truth. Specifically, we introduce the tailor, induced by +reinforcement learning, to modify ground-truth to the tailored reference. The +SiMT model is trained with the tailored reference and jointly optimized with +the tailor to enhance performance. Importantly, our method is applicable to a +wide range of current SiMT approaches. Experiments on three translation tasks +demonstrate that our method achieves state-of-the-art performance in both fixed +and adaptive policies. + +
+
+ comment: Accepted to EMNLP 2023; 15 pages, 8 figures +
+
+
+
+
+ + ♻ ☆ ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist + Examination EMNLP 2023 + + +
+ As ChatGPT and GPT-4 spearhead the development of Large Language Models +(LLMs), more researchers are investigating their performance across various +tasks. But more research needs to be done on the interpretability capabilities +of LLMs, that is, the ability to generate reasons after an answer has been +given. Existing explanation datasets are mostly English-language general +knowledge questions, which leads to insufficient thematic and linguistic +diversity. To address the language bias and lack of medical resources in +generating rationales QA datasets, we present ExplainCPE (over 7k instances), a +challenging medical benchmark in Simplified Chinese. We analyzed the errors of +ChatGPT and GPT-4, pointing out the limitations of current LLMs in +understanding text and computational reasoning. During the experiment, we also +found that different LLMs have different preferences for in-context learning. +ExplainCPE presents a significant challenge, but its potential for further +investigation is promising, and it can be used to evaluate the ability of a +model to generate explanations. AI safety and trustworthiness need more +attention, and this work makes the first step to explore the medical +interpretability of LLMs. The dataset is available at +https://github.com/HITsz-TMG/ExplainCPE. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ DEFT: Data Efficient Fine-Tuning for Large Language Models via + Unsupervised Core-Set Selection + + +
+ Recent advances have led to the availability of many pre-trained language +models (PLMs); however, a question that remains is how much data is truly +needed to fine-tune PLMs for downstream tasks? In this work, we introduce DEFT, +a data-efficient fine-tuning framework that leverages unsupervised core-set +selection to minimize the amount of data needed to fine-tune PLMs for +downstream tasks. We demonstrate the efficacy of our DEFT framework in the +context of text-editing LMs, and compare to the state-of-the-art text-editing +model, CoEDIT. Our quantitative and qualitative results demonstrate that DEFT +models are just as accurate as CoEDIT while being finetuned on ~70% less data. + +
+
+
+
+
+ + ♻ ☆ pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks + + +
+ In recent years, the extraction of opinions and information from +user-generated text has attracted a lot of interest, largely due to the +unprecedented volume of content in Social Media. However, social researchers +face some issues in adopting cutting-edge tools for these tasks, as they are +usually behind commercial APIs, unavailable for other languages than English, +or very complex to use for non-experts. To address these issues, we present +pysentimiento, a comprehensive multilingual Python toolkit designed for opinion +mining and other Social NLP tasks. This open-source library brings +state-of-the-art models for Spanish, English, Italian, and Portuguese in an +easy-to-use Python library, allowing researchers to leverage these techniques. +We present a comprehensive assessment of performance for several pre-trained +language models across a variety of tasks, languages, and datasets, including +an evaluation of fairness in the results. + +
+
+
+
+
+ + ♻ ☆ Medical Text Simplification: Optimizing for Readability with + Unlikelihood Training and Reranked Beam Search Decoding EMNLP 2023 + + +
+ Text simplification has emerged as an increasingly useful application of AI +for bridging the communication gap in specialized fields such as medicine, +where the lexicon is often dominated by technical jargon and complex +constructs. Despite notable progress, methods in medical simplification +sometimes result in the generated text having lower quality and diversity. In +this work, we explore ways to further improve the readability of text +simplification in the medical domain. We propose (1) a new unlikelihood loss +that encourages generation of simpler terms and (2) a reranked beam search +decoding method that optimizes for simplicity, which achieve better performance +on readability metrics on three datasets. This study's findings offer promising +avenues for improving text simplification in the medical field. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Evaluating Object Hallucination in Large Vision-Language Models EMNLP 2023 + + +
+ Inspired by the superior language abilities of large language models (LLM), +large vision-language models (LVLM) have been recently explored by integrating +powerful LLMs for improving the performance on complex multimodal tasks. +Despite the promising progress on LVLMs, we find that LVLMs suffer from the +hallucination problem, i.e. they tend to generate objects that are inconsistent +with the target images in the descriptions. To investigate it, this work +presents the first systematic study on object hallucination of LVLMs. We +conduct the evaluation experiments on several representative LVLMs, and show +that they mostly suffer from severe object hallucination issue. We further +discuss that the visual instructions may influence the hallucination, and find +that: objects that frequently occur in the visual instructions or co-occur with +the image objects, are obviously prone to be hallucinated by LVLMs. Besides, we +find that existing evaluation methods might be affected by the input +instructions and generation styles of LVLMs. Thus, we further design an +improved evaluation method for object hallucination by proposing a +polling-based query method called POPE. Experiment results demonstrate that our +POPE can evaluate the object hallucination in a more stable and flexible way. +Our codes and data are publicly available at https://github.com/RUCAIBox/POPE. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Asymmetric feature interaction for interpreting model predictions ACL 2023 + + +
+ In natural language processing (NLP), deep neural networks (DNNs) could model +complex interactions between context and have achieved impressive results on a +range of NLP tasks. Prior works on feature interaction attribution mainly focus +on studying symmetric interaction that only explains the additional influence +of a set of words in combination, which fails to capture asymmetric influence +that contributes to model prediction. In this work, we propose an asymmetric +feature interaction attribution explanation model that aims to explore +asymmetric higher-order feature interactions in the inference of deep neural +NLP models. By representing our explanation with a directed interaction graph, +we experimentally demonstrate interpretability of the graph to discover +asymmetric feature interactions. Experimental results on two sentiment +classification datasets show the superiority of our model against the +state-of-the-art feature interaction attribution methods in identifying +influential features for model predictions. Our code is available at +https://github.com/StillLu/ASIV. + +
+
+ comment: Accepted by Findings of the Association for Computational + Linguistics: ACL 2023 (long paper) +
+
+
+
+
+ + ♻ ☆ Scaling Data-Constrained Language Models + + +
+ The current trend of scaling language models involves increasing both +parameter count and training dataset size. Extrapolating this trend suggests +that training dataset size may soon be limited by the amount of text data +available on the internet. Motivated by this limit, we investigate scaling +language models in data-constrained regimes. Specifically, we run a large set +of experiments varying the extent of data repetition and compute budget, +ranging up to 900 billion training tokens and 9 billion parameter models. We +find that with constrained data for a fixed compute budget, training with up to +4 epochs of repeated data yields negligible changes to loss compared to having +unique data. However, with more repetition, the value of adding compute +eventually decays to zero. We propose and empirically validate a scaling law +for compute optimality that accounts for the decreasing value of repeated +tokens and excess parameters. Finally, we experiment with approaches mitigating +data scarcity, including augmenting the training dataset with code data or +removing commonly used filters. Models and datasets from our 400 training runs +are freely available at https://github.com/huggingface/datablations. + +
+
+ comment: 50 pages (9 main), 39 figures, 15 tables +
+
+
+
+
+ + ♻ ☆ Self-Evaluation Guided Beam Search for Reasoning NeurIPS 2023 + + +
+ Breaking down a problem into intermediate steps has demonstrated impressive +performance in Large Language Model (LLM) reasoning. However, the growth of the +reasoning chain introduces uncertainty and error accumulation, making it +challenging to elicit accurate final results. To tackle this challenge of +uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation +mechanism to guide and calibrate the reasoning process of LLMs. We propose a +decoding algorithm integrating the self-evaluation guidance via stochastic beam +search. The self-evaluation guidance serves as a better-calibrated automatic +criterion, facilitating an efficient search in the reasoning space and +resulting in superior prediction quality. Stochastic beam search balances +exploitation and exploration of the search space with temperature-controlled +randomness. Our approach surpasses the corresponding Codex-backboned baselines +in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, +and StrategyQA benchmarks, respectively. Experiment results with Llama-2 on +arithmetic reasoning demonstrate the efficiency of our method in outperforming +the baseline methods with comparable computational budgets. Further analysis in +multi-step reasoning finds our self-evaluation guidance pinpoints logic +failures and leads to higher consistency and robustness. Our code is publicly +available at https://guideddecoding.github.io/. + +
+
+ comment: NeurIPS 2023. 10 pages, 7 figures, 4 tables (33 pages, 14 figures, 15 + tables including references and appendices) +
+
+
+
+
+ + ♻ ☆ Large Language Models as General Pattern Machines + + +
+ We observe that pre-trained large language models (LLMs) are capable of +autoregressively completing complex token sequences -- from arbitrary ones +procedurally generated by probabilistic context-free grammars (PCFG), to +richer spatial patterns found in the Abstraction and Reasoning Corpus (ARC), a +general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern +completion proficiency can be partially retained even when the sequences are +expressed using tokens randomly sampled from the vocabulary. These results +suggest that without any additional training, LLMs can serve as general +sequence modelers, driven by in-context learning. In this work, we investigate +how these zero-shot capabilities may be applied to problems in robotics -- from +extrapolating sequences of numbers that represent states over time to complete +simple motions, to least-to-most prompting of reward-conditioned trajectories +that can discover and represent closed-loop policies (e.g., a stabilizing +controller for CartPole). While difficult to deploy today for real systems due +to latency, context size limitations, and compute costs, the approach of using +LLMs to drive low-level control may provide an exciting glimpse into how the +patterns among words could be transferred to actions. + +
+
+ comment: 21 pages, 25 figures. To appear at Conference on Robot Learning + (CoRL) 2023 +
+
+
+
+
+ + ♻ ☆ CLASS: A Design Framework for building Intelligent Tutoring Systems + based on Learning Science principles EMNLP 2023 + + +
+ We present a design framework called Conversational Learning with Analytical +Step-by-Step Strategies (CLASS) for building advanced Intelligent Tutoring +Systems (ITS) powered by high-performance Large Language Models (LLMs). The +CLASS framework empowers ITS with two key capabilities. First, through a +carefully curated scaffolding dataset, CLASS equips ITS with essential +problem-solving strategies, enabling it to provide tutor-like, step-by-step +guidance to students. Second, by using a dynamic conversational dataset, CLASS +assists ITS in facilitating natural language interactions, fostering engaging +student-tutor conversations. The CLASS framework also provides valuable +insights into ITS' internal decision-making process which allows seamless +integration of user feedback, thus enabling continuous refinement and +improvement. We also present a proof-of-concept ITS, referred to as SPOCK, +which is trained using the CLASS framework with a focus on introductory +college-level biology content. A carefully constructed protocol was developed +for SPOCK's preliminary evaluation, examining aspects such as the factual +accuracy and relevance of its responses. Experts in the field of biology +offered favorable remarks, particularly highlighting SPOCK's capability to +break down questions into manageable subproblems and provide encouraging +responses to students. Code and models are available at +https://github.com/luffycodes/Tutorbot-Spock. + +
+
+ comment: Paper accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ What Else Do I Need to Know? The Effect of Background Information on + Users' Reliance on QA Systems + + +
+ NLP systems have shown impressive performance at answering questions by +retrieving relevant context. However, with the increasingly large models, it is +impossible and often undesirable to constrain models' knowledge or reasoning to +only the retrieved context. This leads to a mismatch between the information +that the models access to derive the answer and the information that is +available to the user to assess the model predicted answer. In this work, we +study how users interact with QA systems in the absence of sufficient +information to assess their predictions. Further, we ask whether adding the +requisite background helps mitigate users' over-reliance on predictions. Our +study reveals that users rely on model predictions even in the absence of +sufficient information needed to assess the model's correctness. Providing the +relevant background, however, helps users better catch model errors, reducing +over-reliance on incorrect predictions. On the flip side, background +information also increases users' confidence in their accurate as well as +inaccurate judgments. Our work highlights that supporting users' verification +of QA predictions is an important, yet challenging, problem. + +
+
+
+
+
+ + ♻ ☆ Coverage-based Example Selection for In-Context Learning EMNLP 2023 + + +
+ In-context learning (ICL), the ability of large language models to perform +novel tasks by conditioning on a prompt with a few task examples, requires +these examples to be informative about the test instance. The standard approach +of independently ranking and selecting the most similar examples selects +redundant examples while omitting important information. In this work, we show +that BERTScore-Recall (BSR) selects better examples that demonstrate more of +the salient aspects, e.g. reasoning patterns, of the test input. We further +extend BSR and many standard metrics to easily optimizable set-level metrics, +giving still better coverage of those salient aspects. On 15 datasets spanning +6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric +for in-context example selection across the board, and (2) for compositional +tasks, set selection using Set-BSR outperforms independent ranking by up to 17 +points on average and, despite being training-free, surpasses methods that +leverage task or LLM-specific training. + +
+
+ comment: EMNLP 2023 (Findings) +
+
+
+
+
+ + ♻ ☆ Knowledge Editing for Large Language Models: A Survey + + +
+ Large language models (LLMs) have recently transformed both the academic and +industrial landscapes due to their remarkable capacity to understand, analyze, +and generate texts based on their vast knowledge and reasoning ability. +Nevertheless, one major drawback of LLMs is their substantial computational +cost for pre-training due to their unprecedented amounts of parameters. The +disadvantage is exacerbated when new knowledge frequently needs to be +introduced into the pre-trained model. Therefore, it is imperative to develop +effective and efficient techniques to update pre-trained LLMs. Traditional +methods encode new knowledge in pre-trained LLMs through direct fine-tuning. +However, naively re-training LLMs can be computationally intensive and risks +degenerating valuable pre-trained knowledge irrelevant to the update in the +model. Recently, Knowledge-based Model Editing (KME) has attracted increasing +attention, which aims to precisely modify the LLMs to incorporate specific +knowledge, without negatively influencing other irrelevant knowledge. In this +survey, we aim to provide a comprehensive and in-depth overview of recent +advances in the field of KME. We first introduce a general formulation of KME +to encompass different KME strategies. Afterward, we provide an innovative +taxonomy of KME techniques based on how the new knowledge is introduced into +pre-trained LLMs, and investigate existing KME strategies while analyzing key +insights, advantages, and limitations of methods from each category. Moreover, +representative metrics, datasets, and applications of KME are introduced +accordingly. Finally, we provide an in-depth analysis regarding the +practicality and remaining challenges of KME and suggest promising research +directions for further advancement in this field. + +
+
+ comment: 31 pages +
+
+
+
+
+ + ♻ ☆ YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English + Parallel Corpus + + +
+ Machine learning for sign languages is bottlenecked by data. In this paper, +we present YouTube-ASL, a large-scale, open-domain corpus of American Sign +Language (ASL) videos and accompanying English captions drawn from YouTube. +With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as +large and has ~10x as many unique signers as the largest prior ASL dataset. We +train baseline models for ASL to English translation on YouTube-ASL and +evaluate them on How2Sign, where we achieve a new finetuned state of the art of +12.39 BLEU and, for the first time, report zero-shot results. + +
+
+
+
+
+ + ♻ ☆ mLongT5: A Multilingual and Efficient Text-To-Text Transformer for + Longer Sequences + + +
+ We present our work on developing a multilingual, efficient text-to-text +transformer that is suitable for handling long inputs. This model, called +mLongT5, builds upon the architecture of LongT5, while leveraging the +multilingual datasets used for pretraining mT5 and the pretraining tasks of +UL2. We evaluate this model on a variety of multilingual summarization and +question-answering tasks, and the results show stronger performance for mLongT5 +when compared to existing multilingual models such as mBART or M-BERT. + +
+
+
+
+
+ + ♻ ☆ Embedding structure matters: Comparing methods to adapt multilingual + vocabularies to new languages + + +
+ Pre-trained multilingual language models underpin a large portion of modern +NLP tools outside of English. A strong baseline for specializing these models +for specific languages is Language-Adaptive Pre-Training (LAPT). However, +retaining a large cross-lingual vocabulary and embedding matrix comes at +considerable excess computational cost during adaptation. In this study, we +propose several simple techniques to replace a cross-lingual vocabulary with a +compact, language-specific one. Namely, we address strategies for +re-initializing the token embedding matrix after vocabulary specialization. We +then provide a systematic experimental comparison of our techniques, in +addition to the recently-proposed Focus method. We demonstrate that: 1) +Embedding-replacement techniques in the monolingual transfer literature are +inadequate for adapting multilingual models. 2) Replacing cross-lingual +vocabularies with smaller specialized ones provides an efficient method to +improve performance in low-resource languages. 3) Simple embedding +re-initialization techniques based on script-wise sub-distributions rival +techniques such as Focus, which rely on similarity scores obtained from an +auxiliary model. + +
+
+ comment: Camera-ready for Proceedings of the 3rd Workshop on Multilingual + Representation Learning +
+
+
+
+
+ + ♻ ☆ Learning to reason over visual objects ICLR 2023 + + +
+ A core component of human intelligence is the ability to identify abstract +patterns inherent in complex, high-dimensional perceptual data, as exemplified +by visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated +by the goal of designing AI systems with this capacity, recent work has focused +on evaluating whether neural networks can learn to solve RPM-like problems. +Previous work has generally found that strong performance on these problems +requires the incorporation of inductive biases that are specific to the RPM +problem format, raising the question of whether such models might be more +broadly useful. Here, we investigated the extent to which a general-purpose +mechanism for processing visual scenes in terms of objects might help promote +abstract visual reasoning. We found that a simple model, consisting only of an +object-centric encoder and a transformer reasoning module, achieved +state-of-the-art results on both of two challenging RPM-like benchmarks (PGM +and I-RAVEN), as well as a novel benchmark with greater visual complexity +(CLEVR-Matrices). These results suggest that an inductive bias for +object-centric processing may be a key component of abstract visual reasoning, +obviating the need for problem-specific inductive biases. + +
+
+ comment: ICLR 2023 +
+
+
+
+
+ + ♻ ☆ Block-State Transformers NeurIPS'23 + + +
+ State space models (SSMs) have shown impressive results on tasks that require +modeling long-range dependencies and efficiently scale to long sequences owing +to their subquadratic runtime complexity. Originally designed for continuous +signals, SSMs have shown superior performance on a plethora of tasks, in vision +and audio; however, SSMs still lag Transformer performance in Language Modeling +tasks. In this work, we propose a hybrid layer named Block-State Transformer +(BST), that internally combines an SSM sublayer for long-range +contextualization, and a Block Transformer sublayer for short-term +representation of sequences. We study three different, and completely +parallelizable, variants that integrate SSMs and block-wise attention. We show +that our model outperforms similar Transformer-based architectures on language +modeling perplexity and generalizes to longer sequences. In addition, the +Block-State Transformer demonstrates more than tenfold increase in speed at the +layer level compared to the Block-Recurrent Transformer when model +parallelization is employed. + +
+
+ comment: NeurIPS'23 - Thirty-seventh Conference on Neural Information + Processing Systems +
+
+
+
+
+ + ♻ ☆ On the Risk of Misinformation Pollution with Large Language Models EMNLP 2023 + + +
+ In this paper, we comprehensively investigate the potential misuse of modern +Large Language Models (LLMs) for generating credible-sounding misinformation +and its subsequent impact on information-intensive applications, particularly +Open-Domain Question Answering (ODQA) systems. We establish a threat model and +simulate potential misuse scenarios, both unintentional and intentional, to +assess the extent to which LLMs can be utilized to produce misinformation. Our +study reveals that LLMs can act as effective misinformation generators, leading +to a significant degradation in the performance of ODQA systems. To mitigate +the harm caused by LLM-generated misinformation, we explore three defense +strategies: prompting, misinformation detection, and majority voting. While +initial results show promising trends for these defensive strategies, much more +work needs to be done to address the challenge of misinformation pollution. Our +work highlights the need for further research and interdisciplinary +collaboration to address LLM-generated misinformation and to promote +responsible use of LLMs. + +
+
+ comment: EMNLP 2023 (Findings; Long Paper) +
+
+
+
+
+ + ♻ ☆ Effects of sub-word segmentation on performance of transformer language + models EMNLP 2023 + + +
+ Language modeling is a fundamental task in natural language processing, which +has been thoroughly explored with various architectures and hyperparameters. +However, few studies focus on the effect of sub-word segmentation on the +performance of language models (LMs). In this paper, we compare GPT and BERT +models trained with the statistical segmentation algorithm BPE vs. two +unsupervised algorithms for morphological segmentation -- Morfessor and +StateMorph. We train the models for several languages -- including ones with +very rich morphology -- and compare their performance with different +segmentation algorithms, vocabulary sizes, and model sizes. The results show +that training with morphological segmentation allows the LMs to: 1. achieve +lower perplexity, 2. converge more efficiently in terms of training time, and +3. achieve equivalent or better evaluation scores on downstream tasks. Lastly, +we show 4. that LMs of smaller size using morphological segmentation can +perform comparably to models of larger size trained with BPE -- both in terms +of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) impact +on sustainability of LMs, since they reduce the model cost: size and +computation time. While (2) reduces cost only in the training phase, (4) does +so also in the inference phase. + +
+
+ comment: This submission published in EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Company classification using zero-shot learning + + +
+ In recent years, natural language processing (NLP) has become increasingly +important in a variety of business applications, including sentiment analysis, +text classification, and named entity recognition. In this paper, we propose an +approach for company classification using NLP and zero-shot learning. Our +method utilizes pre-trained transformer models to extract features from company +descriptions, and then applies zero-shot learning to classify companies into +relevant categories without the need for specific training data for each +category. We evaluate our approach on a dataset obtained through the Wharton +Research Data Services (WRDS), which comprises textual descriptions of publicly +traded companies. We demonstrate that the approach can streamline the process +of company classification, thereby reducing the time and resources required in +traditional approaches such as the Global Industry Classification Standard +(GICS). The results show that this method has potential for automation of +company classification, making it a promising avenue for future research in +this area. + +
+
+ comment: 6 pages, 1 figure, 4 tables, conference paper, published in the 20th + International Conference on Informatics and Information Technologies (CIIT + 2023) +
+
+
+
+
+ + ♻ ☆ Reward-Augmented Decoding: Efficient Controlled Text Generation With a + Unidirectional Reward Model + + +
+ While large language models have proven effective in a huge range of +downstream applications, they often generate text that is problematic or lacks +a desired attribute. In this paper, we introduce Reward-Augmented Decoding +(RAD), a text generation procedure that uses a small unidirectional reward +model to encourage a language model to generate text that has certain +properties. Specifically, RAD uses the reward model to score generations as +they are produced and rescales sampling probabilities to favor high-reward +tokens. By using a unidirectional reward model, RAD can cache activations from +prior generation steps to decrease computational overhead. Through experiments +on generating non-toxic and sentiment-controlled text, we demonstrate that RAD +performs best among methods that change only the generation procedure and +matches the performance of state-of-the-art methods that involve re-training +the language model. We further validate that RAD is effective on very large +language models while incurring a minimal computational overhead. + +
+
+
+
+
+ + ♻ ☆ LR-Sum: Summarization for Less-Resourced Languages + + +
+ This preprint describes work in progress on LR-Sum, a new +permissively-licensed dataset created with the goal of enabling further +research in automatic summarization for less-resourced languages. LR-Sum +contains human-written summaries for 40 languages, many of which are +less-resourced. We describe our process for extracting and filtering the +dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The +source data is public domain newswire collected from Voice of America +websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), +making it one of the most openly-licensed multilingual summarization datasets. +We describe how we plan to use the data for modeling experiments and discuss +limitations of the dataset. + +
+
+
+
+
+ + ♻ ☆ A Causal View of Entity Bias in (Large) Language Models EMNLP 2023 + + +
+ Entity bias widely affects pretrained (large) language models, causing them +to rely on (biased) parametric knowledge to make unfaithful predictions. +Although causality-inspired methods have shown great potential to mitigate +entity bias, it is hard to precisely estimate the parameters of underlying +causal models in practice. The rise of black-box LLMs also makes the situation +even worse, because of their inaccessible parameters and uncalibrated logits. +To address these problems, we propose a specific structured causal model (SCM) +whose parameters are comparatively easier to estimate. Building upon this SCM, +we propose causal intervention techniques to mitigate entity bias for both +white-box and black-box settings. The proposed causal intervention perturbs the +original entity with neighboring entities. This intervention reduces specific +biasing information pertaining to the original entity while still preserving +sufficient semantic information from similar entities. Under the white-box +setting, our training-time intervention improves OOD performance of PLMs on +relation extraction (RE) and machine reading comprehension (MRC) by 5.7 points +and by 9.1 points, respectively. Under the black-box setting, our in-context +intervention effectively reduces the entity-based knowledge conflicts of +GPT-3.5, achieving up to 20.5 points of improvement of exact match accuracy on +MRC and up to 17.6 points of reduction in memorization ratio on RE. Our code is +available at https://github.com/luka-group/Causal-View-of-Entity-Bias. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Automatic Calibration and Error Correction for Generative Large Language + Models via Pareto Optimal Self-Supervision + + +
+ Generative large language models (LLMs) have demonstrated remarkable +capabilities for a wide range of applications, but reducing ungrounded or +erroneous responses remains a major growth area. Unlike for task-specific models, +there is no effective method to calibrate the confidence level of LLM +responses to indicate potential errors and facilitate human-in-the-loop +verification. An important source of calibration stems from expert-stipulated +programmatic supervision, which is often available at low cost but has its own +limitations such as noise and coverage. In this paper, we introduce a Pareto +optimal self-supervision framework that can leverage available programmatic +supervision to systematically calibrate LLM responses by producing a risk score +for every LLM response, without any additional manual effort. This is +accomplished by learning a harmonizer model to align with LLM output as well as +other weak supervision sources. The model assigns higher risk scores to more +uncertain LLM responses and facilitates error correction. Experiments on +standard relation extraction and classification tasks in biomedical and general +domains demonstrate that the proposed risk score is highly correlated with the +actual LLM error rate. By using a dynamic prompting strategy based on the risk +score, we observed significant accuracy improvement for off-the-shelf LLMs, +boosting GPT-3.5 results past the state-of-the-art (SOTA) weak supervision model +and GPT-4 results past SOTA supervised results on challenging evaluation +datasets. + +
+
+
+
+
+ + ♻ ☆ INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained + Feedback EMNLP2023 + + +
+ Automatically evaluating the quality of language generation is critical. +Although recent learned metrics show high correlation with human judgement, +these metrics cannot explain their verdict or associate the scores with +defects in generated text. To address this limitation, we present +InstructScore, an explainable evaluation metric for text generation. By +harnessing both explicit human instruction and the implicit knowledge of GPT-4, +we fine-tune a text evaluation metric based on LLaMA, producing both a score +for generated text and a human readable diagnostic report. We evaluate +InstructScore on a variety of generation tasks, including translation, +captioning, data-to-text and commonsense generation. Experiments show that our +7B model surpasses all other unsupervised metrics, including those based on +175B GPT-3 and GPT-4. Surprisingly, our InstructScore, even without direct +supervision from human-rated data, achieves performance levels on par with +state-of-the-art metrics like COMET22, which were fine-tuned on human ratings. + +
+
+ comment: Accepted to EMNLP2023 Main Conference +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 159 + +
+
+
+ + ☆ Fantastic Gains and Where to Find Them: On the Existence and Prospect of + General Knowledge Transfer between Any Pretrained Model + + +
+ Training deep networks requires various design decisions regarding for +instance their architecture, data augmentation, or optimization. In this work, +we find these training variations to result in networks learning unique feature +sets from the data. Using public model libraries comprising thousands of models +trained on canonical datasets like ImageNet, we observe that for arbitrary +pairings of pretrained models, one model extracts significant data context +unavailable in the other -- independent of overall performance. Given any +arbitrary pairing of pretrained models and no external rankings (such as +separate test sets, e.g. due to data privacy), we investigate if it is possible +to transfer such "complementary" knowledge from one model to another without +performance degradation -- a task made particularly difficult as additional +knowledge can be contained in stronger, equiperformant or weaker models. Yet +facilitating robust transfer in scenarios agnostic to pretrained model pairings +would unlock auxiliary gains and knowledge fusion from any model repository +without restrictions on model and problem specifics - including from weaker, +lower-performance models. This work therefore provides an initial, in-depth +exploration on the viability of such general-purpose knowledge transfer. Across +large-scale experiments, we first reveal the shortcomings of standard knowledge +distillation techniques, and then propose a much more general extension through +data partitioning for successful transfer between nearly all pretrained models, +which we show can also be done unsupervised. Finally, we assess both the +scalability and impact of fundamental model properties on successful +model-agnostic knowledge transfer. + +
+
+
+
+
+ + ☆ A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised + Video Anomaly Detection WACV + + +
+ Detection of anomalous events in videos is an important problem in +applications such as surveillance. Video anomaly detection (VAD) is +well-studied in the one-class classification (OCC) and weakly supervised (WS) +settings. However, fully unsupervised (US) video anomaly detection methods, +which learn a complete system without any annotation or human supervision, have +not been explored in depth. This is because the lack of any ground truth +annotations significantly increases the magnitude of the VAD challenge. To +address this challenge, we propose a simple-but-effective two-stage +pseudo-label generation framework that produces segment-level (normal/anomaly) +pseudo-labels, which can be further used to train a segment-level anomaly +detector in a supervised manner. The proposed coarse-to-fine pseudo-label +(C2FPL) generator employs carefully-designed hierarchical divisive clustering +and statistical hypothesis testing to identify anomalous video segments from a +set of completely unlabeled videos. The trained anomaly detector can be +directly applied on segments of an unseen test video to obtain segment-level, +and subsequently, frame-level anomaly predictions. Extensive studies on two +large-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that +the proposed unsupervised approach achieves superior performance compared to +all existing OCC and US methods, while yielding comparable performance to the +state-of-the-art WS methods. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV), 2024 +
+
+
+
+
+ + ☆ 6-DoF Stability Field via Diffusion Models + + +
+ A core capability for robot manipulation is reasoning over where and how to +stably place objects in cluttered environments. Traditionally, robots have +relied on object-specific, hand-crafted heuristics in order to perform such +reasoning, with limited generalizability beyond a small number of object +instances and object interaction patterns. Recent approaches instead learn +notions of physical interaction, namely motion prediction, but require +supervision in the form of labeled object information or come at the cost of +high sample complexity, and do not directly reason over stability or object +placement. We present 6-DoFusion, a generative model capable of generating 3D +poses of an object that produces a stable configuration of a given scene. +Underlying 6-DoFusion is a diffusion model that incrementally refines a +randomly initialized SE(3) pose to generate a sample from a learned, +context-dependent distribution over stable poses. We evaluate our model on +different object placement and stacking tasks, demonstrating its ability to +construct stable scenes that involve novel object classes as well as to improve +the accuracy of state-of-the-art 3D pose estimation methods. + +
+
+ comment: In submission +
+
+
+
+
+ + ☆ Defending Against Transfer Attacks From Public Models + + +
+ Adversarial attacks have been a looming and unaddressed threat in the +industry. However, through a decade-long history of the robustness evaluation +literature, we have learned that mounting a strong or optimal attack is +challenging. It requires both machine learning and domain expertise. In other +words, the white-box threat model, religiously assumed by a large majority of +the past literature, is unrealistic. In this paper, we propose a new practical +threat model where the adversary relies on transfer attacks through publicly +available surrogate models. We argue that this setting will become the most +prevalent for security-sensitive applications in the future. We evaluate the +transfer attacks in this setting and propose a specialized defense method based +on a game-theoretic perspective. The defenses are evaluated under 24 public +models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and +ImageNet). Under this threat model, our defense, PubDef, outperforms the +state-of-the-art white-box adversarial training by a large margin with almost +no loss in the normal accuracy. For instance, on ImageNet, our defense achieves +62% accuracy under the strongest transfer attack vs only 36% of the best +adversarially trained model. Its accuracy when not under attack is only 2% +lower than that of an undefended model (78% vs 80%). We release our code at +https://github.com/wagner-group/pubdef. + +
+
+ comment: Under submission. Code available at + https://github.com/wagner-group/pubdef +
+
+
+
+
+ + ☆ torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free + Deep Learning Studies: A Case Study on NLP EMNLP 2023 + + +
+ Reproducibility in scientific work has become increasingly important in +research communities such as machine learning, natural language processing, +and computer vision due to the rapid development of the research +domains supported by recent advances in deep learning. In this work, we present +a significantly upgraded version of torchdistill, a modular-driven coding-free +deep learning framework whose initial release supported only image +classification and object detection tasks for reproducible +knowledge distillation experiments. To demonstrate that the upgraded framework +can support more tasks with third-party libraries, we reproduce the GLUE +benchmark results of BERT models using a script based on the upgraded +torchdistill, harmonizing with various Hugging Face libraries. All the 27 +fine-tuned BERT models and configurations to reproduce the results are +published at Hugging Face, and the model weights have already been widely used +in research communities. We also reimplement popular small-sized models and new +knowledge distillation methods and perform additional experiments for computer +vision tasks. + +
+
+ comment: Accepted at the 3rd Workshop for Natural Language Processing Open + Source Software (NLP-OSS) at EMNLP 2023 +
+
+
+
+
+ + ☆ Drive Anywhere: Generalizable End-to-end Autonomous Driving with + Multi-modal Foundation Models + + +
+ As autonomous driving technology matures, end-to-end methodologies have +emerged as a leading strategy, promising seamless integration from perception +to control via deep learning. However, existing systems grapple with challenges +such as unexpected open set environments and the complexity of black-box +models. At the same time, the evolution of deep learning introduces larger, +multimodal foundational models, offering multi-modal visual and textual +understanding. In this paper, we harness these multimodal foundation models to +enhance the robustness and adaptability of autonomous driving systems, enabling +out-of-distribution, end-to-end, multimodal, and more explainable autonomy. +Specifically, we present an approach to apply end-to-end open-set (any +environment/scene) autonomous driving that is capable of providing driving +decisions from representations queryable by image and text. To do so, we +introduce a method to extract nuanced spatial (pixel/patch-aligned) features +from transformers to enable the encapsulation of both spatial and semantic +features. Our approach (i) demonstrates unparalleled results in diverse tests +while achieving significantly greater robustness in out-of-distribution +situations, and (ii) allows the incorporation of latent space simulation (via +text) for improved training (data augmentation via text) and policy debugging. +We encourage the reader to check our explainer video at +https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the +code and demos on our project webpage at https://drive-anywhere.github.io/. + +
+
+ comment: Project webpage: https://drive-anywhere.github.io Explainer video: + https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be +
+
+
+
+
+ + ☆ DeepShaRM: Multi-View Shape and Reflectance Map Recovery Under Unknown + Lighting 3DV 2024 + + +
+ Geometry reconstruction of textureless, non-Lambertian objects under unknown +natural illumination (i.e., in the wild) remains challenging as correspondences +cannot be established and the reflectance cannot be expressed in simple +analytical forms. We derive a novel multi-view method, DeepShaRM, that achieves +state-of-the-art accuracy on this challenging task. Unlike past methods that +formulate this as inverse-rendering, i.e., estimation of reflectance, +illumination, and geometry from images, our key idea is to realize that +reflectance and illumination need not be disentangled and instead estimated as +a compound reflectance map. We introduce a novel deep reflectance map +estimation network that recovers the camera-view reflectance maps from the +surface normals of the current geometry estimate and the input multi-view +images. The network also explicitly estimates per-pixel confidence scores to +handle global light transport effects. A deep shape-from-shading network then +updates the geometry estimate expressed with a signed distance function using +the recovered reflectance maps. By alternating between these two, and, most +important, by bypassing the ill-posed problem of reflectance and illumination +decomposition, the method accurately recovers object geometry in these +challenging settings. Extensive experiments on both synthetic and real-world +data clearly demonstrate its state-of-the-art accuracy. + +
+
+ comment: 3DV 2024 +
+
+
+
+
+ + ☆ A Survey on Transferability of Adversarial Examples across Deep Neural + Networks + + +
+ The emergence of Deep Neural Networks (DNNs) has revolutionized various +domains, enabling the resolution of complex tasks spanning image recognition, +natural language processing, and scientific problem-solving. However, this +progress has also exposed a concerning vulnerability: adversarial examples. +These crafted inputs, imperceptible to humans, can manipulate machine learning +models into making erroneous predictions, raising concerns for safety-critical +applications. An intriguing property of this phenomenon is the transferability +of adversarial examples, where perturbations crafted for one model can deceive +another, often with a different architecture. This property enables +"black-box" attacks, circumventing the need for detailed knowledge of the +target model. This survey explores the landscape of the transferability of +adversarial examples. We categorize existing methodologies +to enhance adversarial transferability and discuss the fundamental principles +guiding each approach. While the predominant body of research primarily +concentrates on image classification, we also extend our discussion to +encompass other vision tasks and beyond. Challenges and future prospects are +discussed, highlighting the importance of fortifying DNNs against adversarial +vulnerabilities in an evolving landscape. + +
+
+
+
+
+ + ☆ MimicGen: A Data Generation System for Scalable Robot Learning using + Human Demonstrations + + +
+ Imitation learning from a large set of human demonstrations has proved to be +an effective paradigm for building capable robot agents. However, the +demonstrations can be extremely costly and time-consuming to collect. We +introduce MimicGen, a system for automatically synthesizing large-scale, rich +datasets from only a small number of human demonstrations by adapting them to +new contexts. We use MimicGen to generate over 50K demonstrations across 18 +tasks with diverse scene configurations, object instances, and robot arms from +just ~200 human demonstrations. We show that robot agents can be effectively +trained on this generated dataset by imitation learning to achieve strong +performance in long-horizon and high-precision tasks, such as multi-part +assembly and coffee preparation, across broad initial state distributions. We +further demonstrate that the effectiveness and utility of MimicGen data compare +favorably to collecting additional human demonstrations, making it a powerful +and economical approach towards scaling up robot learning. Datasets, simulation +environments, videos, and more at https://mimicgen.github.io . + +
+
+ comment: Conference on Robot Learning (CoRL) 2023 +
+
+
+
+
+ + ☆ SPA: A Graph Spectral Alignment Perspective for Domain Adaptation + + +
+ Unsupervised domain adaptation (UDA) is a pivotal form in machine learning to +extend the in-domain model to the distinctive target domains where the data +distributions differ. Most prior works focus on capturing the inter-domain +transferability but largely overlook rich intra-domain structures, which +empirically results in even worse discriminability. In this work, we introduce +a novel graph SPectral Alignment (SPA) framework to tackle the tradeoff. The +core of our method is briefly condensed as follows: (i)-by casting the DA +problem to graph primitives, SPA composes a coarse graph alignment mechanism +with a novel spectral regularizer towards aligning the domain graphs in +eigenspaces; (ii)-we further develop a fine-grained message propagation module +-- upon a novel neighbor-aware self-training mechanism -- in order for enhanced +discriminability in the target domain. On standardized benchmarks, the +extensive experiments of SPA demonstrate that its performance has surpassed the +existing cutting-edge DA methods. Coupled with dense model analysis, we +conclude that our approach indeed possesses superior efficacy, robustness, +discriminability, and transferability. Code and data are available at: +https://github.com/CrownX/SPA. + +
+
+
+
+
+ + ☆ Noise-Free Score Distillation + + +
+ Score Distillation Sampling (SDS) has emerged as the de facto approach for +text-to-content generation in non-image domains. In this paper, we reexamine +the SDS process and introduce a straightforward interpretation that demystifies +the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the +distillation of an undesired noise term. Building upon our interpretation, we +propose a novel Noise-Free Score Distillation (NFSD) process, which requires +minimal modifications to the original SDS framework. Through this streamlined +design, we achieve more effective distillation of pre-trained text-to-image +diffusion models while using a nominal CFG scale. This strategic choice allows +us to prevent the over-smoothing of results, ensuring that the generated data +is both realistic and complies with the desired prompt. To demonstrate the +efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as +well as several other methods. + +
+
+ comment: Project page at https://orenkatzir.github.io/nfsd/ +
+
+
+
+
+ + ☆ Global Structure-Aware Diffusion Process for Low-Light Image Enhancement NeurIPS 2023 + + +
+ This paper studies a diffusion-based framework to address the low-light image +enhancement problem. To harness the capabilities of diffusion models, we delve +into this intricate process and advocate for the regularization of its inherent +ODE-trajectory. To be specific, inspired by the recent research that low +curvature ODE-trajectory results in a stable and effective diffusion process, +we formulate a curvature regularization term anchored in the intrinsic +non-local structures of image data, i.e., global structure-aware +regularization, which gradually facilitates the preservation of complicated +details and the augmentation of contrast during the diffusion process. This +incorporation mitigates the adverse effects of noise and artifacts resulting +from the diffusion process, leading to a more precise and flexible enhancement. +To additionally promote learning in challenging regions, we introduce an +uncertainty-guided regularization technique, which wisely relaxes constraints +on the most extreme regions of the image. Experimental evaluations reveal that +the proposed diffusion-based framework, complemented by rank-informed +regularization, attains distinguished performance in low-light enhancement. The +outcomes indicate substantial advancements in image quality, noise suppression, +and contrast amplification in comparison with state-of-the-art methods. We +believe this innovative approach will stimulate further exploration and +advancement in low-light image processing, with potential implications for +other applications of diffusion models. The code is publicly available at +https://github.com/jinnh/GSAD. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ SD4Match: Learning to Prompt Stable Diffusion Model for Semantic + Matching + + +
+ In this paper, we address the challenge of matching semantically similar +keypoints across image pairs. Existing research indicates that the intermediate +output of the UNet within the Stable Diffusion (SD) can serve as robust image +feature maps for such a matching task. We demonstrate that by employing a basic +prompt tuning technique, the inherent potential of Stable Diffusion can be +harnessed, resulting in a significant enhancement in accuracy over previous +approaches. We further introduce a novel conditional prompting module that +conditions the prompt on the local details of the input image pairs, leading to +a further improvement in performance. We designate our approach as SD4Match, +short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of +SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets +new benchmarks in accuracy across all these datasets. Particularly, SD4Match +outperforms the previous state-of-the-art by a margin of 12 percentage points +on the challenging SPair-71k dataset. + +
+
+
+
+
+ + ☆ Instability of computer vision models is a necessary result of the task + itself + + +
+ Adversarial examples resulting from instability of current computer vision +models are an extremely important topic due to their potential to compromise +any application. In this paper we demonstrate that instability is inevitable +due to a) symmetries (translational invariance) of the data, b) the categorical +nature of the classification task, and c) the fundamental discrepancy of +classifying images as objects themselves. The issue is further exacerbated by +non-exhaustive labelling of the training data. Therefore we conclude that +instability is a necessary result of how the problem of computer vision is +currently formulated. While the problem cannot be eliminated, through the +analysis of the causes, we have arrived at ways in which it can be partially +alleviated. These include i) increasing the resolution of images, ii) providing +contextual information for the image, iii) exhaustive labelling of training +data, and iv) preventing attackers from frequent access to the computer vision +system. + +
+
+
+
+
+ + ☆ SoK: Pitfalls in Evaluating Black-Box Attacks + + +
+ Numerous works study black-box attacks on image classifiers. However, these +works make different assumptions on the adversary's knowledge and current +literature lacks a cohesive organization centered around the threat model. To +systematize knowledge in this area, we propose a taxonomy over the threat space +spanning the axes of feedback granularity, the access of interactive queries, +and the quality and quantity of the auxiliary data available to the attacker. +Our new taxonomy provides three key insights. 1) Despite extensive literature, +numerous under-explored threat spaces exist, which cannot be trivially solved +by adapting techniques from well-explored settings. We demonstrate this by +establishing a new state-of-the-art in the less-studied setting of access to +top-k confidence scores by adapting techniques from well-explored settings of +accessing the complete confidence vector, but show how it still falls short of +the more restrictive setting that only obtains the prediction label, +highlighting the need for more research. 2) Identifying the threat model of +different attacks uncovers stronger baselines that challenge prior +state-of-the-art claims. We demonstrate this by enhancing an initially weaker +baseline (under interactive query access) via surrogate models, effectively +overturning claims in the respective paper. 3) Our taxonomy reveals +interactions between attacker knowledge that connect well to related areas, +such as model inversion and extraction attacks. We discuss how advances in +other areas can enable potentially stronger black-box attacks. Finally, we +emphasize the need for a more realistic assessment of attack success by +factoring in local attack runtime. This approach reveals the potential for +certain attacks to achieve notably higher success rates and the need to +evaluate attacks in diverse and harder settings, highlighting the need for +better selection criteria. + +
+
+
+
+
+ + ☆ Evaluating Bias and Fairness in Gender-Neutral Pretrained + Vision-and-Language Models EMNLP 2024 + + +
+ Pretrained machine learning models are known to perpetuate and even amplify +existing biases in data, which can result in unfair outcomes that ultimately +impact user experience. Therefore, it is crucial to understand the mechanisms +behind those prejudicial biases to ensure that model performance does not +result in discriminatory behaviour toward certain groups or populations. In +this work, we define gender bias as our case study. We quantify bias +amplification in pretraining and after fine-tuning on three families of +vision-and-language models. We investigate the connection, if any, between the +two learning stages, and evaluate how bias amplification reflects on model +performance. Overall, we find that bias amplification in pretraining and after +fine-tuning are independent. We then examine the effect of continued +pretraining on gender-neutral data, finding that this reduces group +disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without +significantly compromising task performance. + +
+
+ comment: To appear in EMNLP 2024 +
+
+
+
+
+ + ☆ Masked Space-Time Hash Encoding for Efficient Dynamic Scene + Reconstruction NeurIPS 2023 + + +
+ In this paper, we propose the Masked Space-Time Hash encoding (MSTH), a novel +method for efficiently reconstructing dynamic 3D scenes from multi-view or +monocular videos. Based on the observation that dynamic scenes often contain +substantial static areas that result in redundancy in storage and computations, +MSTH represents a dynamic scene as a weighted combination of a 3D hash encoding +and a 4D hash encoding. The weights for the two components are represented by a +learnable mask which is guided by an uncertainty-based objective to reflect the +spatial and temporal importance of each 3D position. With this design, our +method can reduce the hash collision rate by avoiding redundant queries and +modifications on static areas, making it feasible to represent a large number +of space-time voxels by hash tables with small size. Besides, without the +requirements to fit the large numbers of temporally redundant features +independently, our method is easier to optimize and converge rapidly with only +twenty minutes of training for a 300-frame dynamic scene. As a result, MSTH +obtains consistently better results than previous methods with only 20 minutes +of training time and 130 MB of memory storage. Code is available at +https://github.com/masked-spacetime-hashing/msth + +
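The masked combination described in the MSTH abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper or its code: the hash constants, the single-level tables, the function names `toy_hash`, `blend`, and `encode`, and the choice that the mask weights the static 3D branch while `1 - mask` weights the dynamic 4D branch.

```python
# Toy sketch of a mask-weighted mix of a 3D (static) and a 4D (space-time)
# hash encoding. All names and constants are hypothetical.

def toy_hash(coords, table_size):
    """Map integer grid coordinates to a table slot (illustrative constants)."""
    multipliers = (1, 2654435761, 805459861, 3674653429)
    h = 0
    for c, m in zip(coords, multipliers):
        h ^= c * m
    return h % table_size

def blend(feat_3d, feat_4d, mask):
    """Per-position weighted combination; mask is assumed to lie in [0, 1]."""
    return [mask * s + (1.0 - mask) * d for s, d in zip(feat_3d, feat_4d)]

def encode(x, y, z, t, mask, table_3d, table_4d):
    """Look up both tables and mix them with the mask for this 3D position."""
    f3 = table_3d[toy_hash((x, y, z), len(table_3d))]      # ignores time
    f4 = table_4d[toy_hash((x, y, z, t), len(table_4d))]   # depends on time
    return blend(f3, f4, mask)
```

Under these assumptions, a position whose mask is 1 never reads a time-dependent feature, which is one way to picture how static regions could avoid occupying slots in the 4D table.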
+
+ comment: NeurIPS 2023 (Spotlight) +
+
+
+
+
+ + ☆ FLARE: Fast Learning of Animatable and Relightable Mesh Avatars SIGGRAPH + + +
+ Our goal is to efficiently learn personalized animatable 3D head avatars from +videos that are geometrically accurate, realistic, relightable, and compatible +with current rendering systems. While 3D meshes enable efficient processing and +are highly portable, they lack realism in terms of shape and appearance. Neural +representations, on the other hand, are realistic but lack compatibility and +are slow to train and render. Our key insight is that it is possible to +efficiently learn high-fidelity 3D mesh representations via differentiable +rendering by exploiting highly-optimized methods from traditional computer +graphics and approximating some of the components with neural networks. To that +end, we introduce FLARE, a technique that enables the creation of animatable +and relightable mesh avatars from a single monocular video. First, we learn a +canonical geometry using a mesh representation, enabling efficient +differentiable rasterization and straightforward animation via learned +blendshapes and linear blend skinning weights. Second, we follow +physically-based rendering and factor observed colors into intrinsic albedo, +roughness, and a neural representation of the illumination, allowing the +learned avatars to be relit in novel scenes. Since our input videos are +captured on a single device with a narrow field of view, modeling the +surrounding environment light is non-trivial. Based on the split-sum +approximation for modeling specular reflections, we address this by +approximating the pre-filtered environment map with a multi-layer perceptron +(MLP) modulated by the surface roughness, eliminating the need to explicitly +model the light. We demonstrate that our mesh-based avatar formulation, +combined with learned deformation, material, and lighting MLPs, produces +avatars with high-quality geometry and appearance, while also being efficient +to train and render compared to existing approaches. + +
+
+ comment: 15 pages, Accepted: ACM Transactions on Graphics (Proceedings of + SIGGRAPH Asia), 2023 +
+
+
+
+
+ + ☆ Revisiting the Distillation of Image Representations into Point Clouds + for Autonomous Driving + + +
+ Self-supervised image networks can be used to address complex 2D tasks (e.g., +semantic segmentation, object discovery) very efficiently and with little or no +downstream supervision. However, self-supervised 3D networks on lidar data do +not perform as well for now. A few methods therefore propose to distill +high-quality self-supervised 2D features into 3D networks. The most recent ones +doing so on autonomous driving data show promising results. Yet, a performance +gap persists between these distilled features and fully-supervised ones. In +this work, we revisit 2D-to-3D distillation. First, we propose, for semantic +segmentation, a simple approach that leads to a significant improvement +compared to prior 3D distillation methods. Second, we show that distillation in +high capacity 3D networks is key to reach high quality 3D features. This +actually allows us to significantly close the gap between unsupervised +distilled 3D features and fully-supervised ones. Last, we show that our +high-quality distilled representations can also be used for open-vocabulary +segmentation and background/foreground discovery. + +
+
+
+
+
+ + ☆ A Hybrid Graph Network for Complex Activity Detection in Video WACV 2024 + + +
+ Interpretation and understanding of video presents a challenging computer +vision task in numerous fields - e.g. autonomous driving and sports analytics. +Existing approaches to interpreting the actions taking place within a video +clip are based upon Temporal Action Localisation (TAL), which typically +identifies short-term actions. The emerging field of Complex Activity Detection +(CompAD) extends this analysis to long-term activities, with a deeper +understanding obtained by modelling the internal structure of a complex +activity taking place within the video. We address the CompAD problem using a +hybrid graph neural network which combines attention applied to a graph +encoding the local (short-term) dynamic scene with a temporal graph modelling +the overall long-duration activity. Our approach is as follows: i) Firstly, we +propose a novel feature extraction technique which, for each video snippet, +generates spatiotemporal `tubes' for the active elements (`agents') in the +(local) scene by detecting individual objects, tracking them and then +extracting 3D features from all the agent tubes as well as the overall scene. +ii) Next, we construct a local scene graph where each node (representing either +an agent tube or the scene) is connected to all other nodes. Attention is then +applied to this graph to obtain an overall representation of the local dynamic +scene. iii) Finally, all local scene graph representations are interconnected +via a temporal graph, to estimate the complex activity class together with its +start and end time. The proposed framework outperforms all previous +state-of-the-art methods on all three datasets including ActivityNet-1.3, +Thumos-14, and ROAD. + +
+
+ comment: This paper is accepted at WACV 2024 +
+
+
+
+
+ + ☆ Cross-modal Active Complementary Learning with Self-refining + Correspondence NeurIPS 2023 + + +
+ Image-text matching, which is fundamental to understanding the latent +correspondence across visual and textual modalities, has recently attracted +growing attention from academia and industry. However, most existing +methods implicitly assume the training pairs are well-aligned while ignoring +the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby +inevitably leading to a performance drop. Although some methods attempt to +address such noise, they still face two challenging problems: excessive +memorizing/overfitting and unreliable correction for NC, especially under high +noise. To address the two problems, we propose a generalized Cross-modal Robust +Complementary Learning framework (CRCL), which benefits from a novel Active +Complementary Loss (ACL) and an efficient Self-refining Correspondence +Correction (SCC) to improve the robustness of existing methods. Specifically, +ACL exploits active and complementary learning losses to reduce the risk of +providing erroneous supervision, leading to theoretically and experimentally +demonstrated robustness against NC. SCC utilizes multiple self-refining +processes with momentum correction to enlarge the receptive field for +correcting correspondences, thereby alleviating error accumulation and +achieving accurate and stable corrections. We carry out extensive experiments +on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify +the superior robustness of our CRCL against synthetic and real-world noisy +correspondences. + +
+
+ comment: This paper is accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Towards Learning Monocular 3D Object Localization From 2D Labels using + the Physical Laws of Motion + + +
+ We present a novel method for precise 3D object localization in single images +from a single calibrated camera using only 2D labels. No expensive 3D labels +are needed. Thus, instead of using 3D labels, our model is trained with +easy-to-annotate 2D labels along with the physical knowledge of the object's +motion. Given this information, the model can infer the latent third dimension, +even though it has never seen this information during training. Our method is +evaluated on both synthetic and real-world datasets, and we are able to achieve +a mean distance error of just 6 cm in our experiments on real data. The results +indicate the method's potential as a step towards learning 3D object location +estimation, where collecting 3D data for training is not feasible. + +
+
+
+
+
+ + ☆ OTMatch: Improving Semi-Supervised Learning with Optimal Transport + + +
+ Semi-supervised learning has made remarkable strides by effectively utilizing +a limited amount of labeled data while capitalizing on the abundant information +present in unlabeled data. However, current algorithms often prioritize +aligning image predictions with specific classes generated through +self-training techniques, thereby neglecting the inherent relationships that +exist within these classes. In this paper, we present a new approach called +OTMatch, which leverages semantic relationships among classes by employing an +optimal transport loss function. By utilizing optimal transport, our proposed +method consistently outperforms established state-of-the-art methods. Notably, +OTMatch achieves 3.18%, 3.46%, and 1.28% error rate reductions over the current +state-of-the-art method, FreeMatch, on CIFAR-10 with 1 label per class, STL-10 +with 4 labels per class, and ImageNet with 100 labels per class, respectively. +This demonstrates the effectiveness and superiority of +our approach in harnessing semantic relationships to enhance learning +performance in a semi-supervised setting. + +
+
+
+
+
+ + ☆ Generating by Understanding: Neural Visual Generation with Logical + Symbol Groundings + + +
+ Despite the great success of neural visual generative models in recent years, +integrating them with strong symbolic knowledge reasoning systems remains a +challenging task. The main challenges are two-fold: one is symbol assignment, +i.e. bonding latent factors of neural visual generators with meaningful symbols +from knowledge reasoning systems. Another is rule learning, i.e. learning new +rules, which govern the generative process of the data, to augment the +knowledge reasoning systems. To deal with these symbol grounding problems, we +propose a neural-symbolic learning approach, Abductive Visual Generation +(AbdGen), for integrating logic programming systems with neural visual +generative models based on the abductive learning framework. To achieve +reliable and efficient symbol assignment, the quantized abduction method is +introduced for generating abduction proposals by the nearest-neighbor lookups +within semantic codebooks. To achieve precise rule learning, the contrastive +meta-abduction method is proposed to eliminate wrong rules with positive cases +and avoid less-informative rules with negative cases simultaneously. +Experimental results on various benchmark datasets show that compared to the +baselines, AbdGen requires significantly less instance-level labeling +information for symbol assignment. Furthermore, our approach can effectively +learn underlying logical generative rules from data, which is beyond the +capability of existing approaches. + +
+
+
+
+
+ + ☆ Sign Languague Recognition without frame-sequencing constraints: A proof + of concept on the Argentinian Sign Language + + +
+ Automatic sign language recognition (SLR) is an important topic within the +areas of human-computer interaction and machine learning. On the one hand, it +poses a complex challenge that requires the intervention of various knowledge +areas, such as video processing, image processing, intelligent systems and +linguistics. On the other hand, robust recognition of sign language could +assist in the translation process and the integration of hearing-impaired +people, as well as the teaching of sign language for the hearing population. + SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or +similar models to recognize signs. Such techniques exploit the sequential +ordering of frames to reduce the number of hypotheses. This paper presents a +general probabilistic model for sign classification that combines +sub-classifiers based on different types of features such as position, movement +and handshape. The model employs a bag-of-words approach in all classification +steps, to explore the hypothesis that ordering is not essential for +recognition. The proposed model achieved an accuracy rate of 97% on an +Argentinian Sign Language dataset containing 64 classes of signs and 3200 +samples, providing some evidence that indeed recognition without ordering is +possible. + +
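The order-free idea in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's model: frames are assumed to be already quantized into discrete codewords per feature type, each sub-classifier is assumed to be a simple per-class word distribution, and the sub-classifiers are combined by summing log-likelihoods (a naive independence assumption). The names `bag_of_words`, `log_likelihood`, and `classify` are hypothetical.

```python
from collections import Counter
import math

def bag_of_words(frame_codes):
    """Count codewords across frames; frame order is discarded on purpose."""
    return Counter(frame_codes)

def log_likelihood(bag, word_probs, floor=1e-6):
    """Multinomial-style score of a bag under one class's word distribution."""
    return sum(n * math.log(word_probs.get(w, floor)) for w, n in bag.items())

def classify(sample, models):
    """sample: {feature_type: [codeword per frame]}
    models: {class_name: {feature_type: {codeword: probability}}}
    Scores from the per-feature sub-classifiers are added in log space."""
    scores = {}
    for cls, per_feature in models.items():
        scores[cls] = sum(
            log_likelihood(bag_of_words(sample[f]), probs)
            for f, probs in per_feature.items()
        )
    return max(scores, key=scores.get)
```

Because each feature stream is reduced to counts, shuffling the frames of a sample leaves its bag, and therefore the per-class evidence, unchanged.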
+
+ comment: IBERAMIA 2016 +
+
+
+
+
+ + ☆ Uncertainty-weighted Loss Functions for Improved Adversarial Attacks on + Semantic Segmentation + + +
+ State-of-the-art deep neural networks have been shown to be extremely +powerful in a variety of perceptual tasks like semantic segmentation. However, +these networks are vulnerable to adversarial perturbations of the input which +are imperceptible to humans but lead to incorrect predictions. Treating image +segmentation as a sum of pixel-wise classifications, adversarial attacks +developed for classification models were shown to be applicable to segmentation +models as well. In this work, we present simple uncertainty-based weighting +schemes for the loss functions of such attacks that (i) put higher weights on +pixel classifications which can be more easily perturbed and (ii) zero out the +pixel-wise losses corresponding to those pixels that are already confidently +misclassified. The weighting schemes can be easily integrated into the loss +function of a range of well-known adversarial attackers with minimal additional +computational overhead, but lead to significantly improved perturbation +performance, as we demonstrate in our empirical analysis on several datasets +and models. + +
+
+
+
+
+ + ☆ LSA64: An Argentinian Sign Language Dataset + + +
+ Automatic sign language recognition is a research area that encompasses +human-computer interaction, computer vision and machine learning. Robust +automatic recognition of sign language could assist in the translation process +and the integration of hearing-impaired people, as well as the teaching of sign +language to the hearing population. Sign languages differ significantly in +different countries and even regions, and their syntax and semantics are +different as well from those of written languages. While the techniques for +automatic sign language recognition are mostly the same for different +languages, training a recognition system for a new language requires having an +entire dataset for that language. This paper presents a dataset of 64 signs +from the Argentinian Sign Language (LSA). The dataset, called LSA64, contains +3200 videos of 64 different LSA signs recorded by 10 subjects, and is a first +step towards building a comprehensive research-level dataset of Argentinian +signs, specifically tailored to sign language recognition or other machine +learning tasks. The subjects that performed the signs wore colored gloves to +ease the hand tracking and segmentation steps, allowing experiments on the +dataset to focus specifically on the recognition of signs. We also present a +pre-processed version of the dataset, from which we computed statistics of +movement, position and handshape of the signs. + +
+
+ comment: Published in CACIC XXII +
+
+
+
+
+ + ☆ Handshape recognition for Argentinian Sign Language using ProbSom + + +
+ Automatic sign language recognition is an important topic within the areas of +human-computer interaction and machine learning. On the one hand, it poses a +complex challenge that requires the intervention of various knowledge areas, +such as video processing, image processing, intelligent systems and +linguistics. On the other hand, robust recognition of sign language could +assist in the translation process and the integration of hearing-impaired +people. + This paper offers two main contributions: first, the creation of a database +of handshapes for the Argentinian Sign Language (LSA), which is a topic that +has barely been discussed so far. Secondly, a technique for image processing, +descriptor extraction and subsequent handshape classification using a +supervised adaptation of self-organizing maps that is called ProbSom. This +technique is compared to others in the state of the art, such as Support Vector +Machines (SVM), Random Forests, and Neural Networks. + The database that was built contains 800 images with 16 LSA handshapes, and +is a first step towards building a comprehensive database of Argentinian signs. +The ProbSom-based neural classifier, using the proposed descriptor, achieved an +accuracy rate above 90%. + +
+
+
+
+
+ + ☆ Distribution of Action Movements (DAM): A Descriptor for Human Action + Recognition + + +
+ Human action recognition from skeletal data is an important and active area +of research in which the state of the art has not yet achieved near-perfect +accuracy on many well-known datasets. In this paper, we introduce the +Distribution of Action Movements Descriptor, a novel action descriptor based on +the distribution of the directions of the motions of the joints between frames, +over the set of all possible motions in the dataset. The descriptor is computed +as a normalized histogram over a set of representative directions of the +joints, which are in turn obtained via clustering. While the descriptor is +global in the sense that it represents the overall distribution of movement +directions of an action, it is able to partially retain its temporal structure +by applying a windowing scheme. + The descriptor, together with a standard classifier, outperforms several +state-of-the-art techniques on many well-known datasets. + +
+
+
+
+
+ + ☆ AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image + Detectors + + +
+ Deep generative models can create remarkably photorealistic fake images while +raising concerns about misinformation and copyright infringement, known as +deepfake threats. Deepfake detection techniques are developed to distinguish +between real and fake images, where the existing methods typically learn +classifiers in the image domain or various feature domains. However, the +generalizability of deepfake detection against emerging and more advanced +generative models remains challenging. In this paper, inspired by the +zero-shot advantages of Vision-Language Models (VLMs), we propose a novel +approach using VLMs (e.g. InstructBLIP) and prompt tuning techniques to improve +the deepfake detection accuracy over unseen data. We formulate deepfake +detection as a visual question answering problem, and tune soft prompts for +InstructBLIP to answer whether a query image is real or fake. We conduct +full-spectrum experiments on datasets from 3 held-in and 13 held-out generative +models, covering modern text-to-image generation, image editing and image +attacks. Results demonstrate that (1) the deepfake detection accuracy can be +significantly and consistently improved (from 58.8% to 91.31%, in average +accuracy over unseen data) using pretrained vision-language models with prompt +tuning; (2) our superior performance comes with fewer trainable parameters, +resulting in an effective and efficient solution for deepfake detection. Code +and models can be found at https://github.com/nctu-eva-lab/AntifakePrompt. + +
+
+
+
+
+ + ☆ Circuit as Set of Points + + +
+ As the size of circuit designs continues to grow rapidly, artificial +intelligence technologies are being extensively used in Electronic Design +Automation (EDA) to assist with circuit design. Placement and routing are the +most time-consuming parts of the physical design process, and how to quickly +evaluate the placement has become a hot research topic. Prior works either +transformed circuit designs into images using hand-crafted methods and then +used Convolutional Neural Networks (CNN) to extract features, which are limited +by the quality of the hand-crafted methods and could not achieve end-to-end +training, or treated the circuit design as a graph structure and used Graph +Neural Networks (GNN) to extract features, which require time-consuming +preprocessing. In our work, we propose a novel perspective for circuit design +by treating circuit components as point clouds and using Transformer-based +point cloud perception methods to extract features from the circuit. This +approach enables direct feature extraction from raw data without any +preprocessing, allows for end-to-end training, and results in high performance. +Experimental results show that our method achieves state-of-the-art performance +in congestion prediction tasks on both the CircuitNet and ISPD2015 datasets, as +well as in design rule check (DRC) violation prediction tasks on the CircuitNet +dataset. Our method establishes a bridge between the relatively mature point +cloud perception methods and the fast-developing EDA algorithms, enabling us to +leverage more collective intelligence to solve this task. To facilitate the +research of open EDA design, source codes and pre-trained models are released +at https://github.com/hustvl/circuitformer. + +
+
+
+
+
+ + ☆ Detection Defenses: An Empty Promise against Adversarial Patch Attacks + on Optical Flow WACV 2024 + + +
+ Adversarial patches undermine the reliability of optical flow predictions +when placed in arbitrary scene locations. Therefore, they pose a realistic +threat to real-world motion detection and its downstream applications. +Potential remedies are defense strategies that detect and remove adversarial +patches, but their influence on the underlying motion prediction has not been +investigated. In this paper, we thoroughly examine the currently available +detect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art +optical flow methods, and illuminate their side effects on the quality and +robustness of the final flow predictions. In particular, we implement +defense-aware attacks to investigate whether current defenses are able to +withstand attacks that take the defense mechanism into account. Our experiments +yield two surprising results: Detect-and-remove defenses not only lower the +optical flow quality on benign scenes; in doing so, they also harm the +robustness under patch attacks for all tested optical flow methods except +FlowNetC. As currently employed detect-and-remove defenses fail to deliver the +promised adversarial robustness for optical flow, they evoke a false sense of +security. The code is available at +https://github.com/cv-stuttgart/DetectionDefenses. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ☆ Learning Temporal Sentence Grounding From Narrated EgoVideos BMVC 2023 + + +
+ The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens +presents a new challenge for the task of Temporal Sentence Grounding (TSG). +Compared to traditional benchmarks on which this task is evaluated, these +datasets offer finer-grained sentences to ground in notably longer videos. In +this paper, we develop an approach for learning to ground sentences in these +datasets using only narrations and their corresponding rough narration +timestamps. We propose to artificially merge clips to train for temporal +grounding in a contrastive manner using text-conditioning attention. This Clip +Merging (CliMer) approach is shown to be effective when compared with a high +performing TSG method -- e.g. mean R@1 improves from 3.9 to 5.7 on Ego4D and +from 10.7 to 13.0 on EPIC-Kitchens. Code and data splits available from: +https://github.com/keflanagan/CliMer + +
+
+ comment: Accepted in BMVC 2023 +
+
+
+
+
+ + ☆ YOLO-BEV: Generating Bird's-Eye View in the Same Way as 2D Object + Detection + + +
+ Vehicle perception systems strive to achieve comprehensive and rapid visual +interpretation of their surroundings for improved safety and navigation. We +introduce YOLO-BEV, an efficient framework that harnesses a unique surround-camera +setup to generate a 2D bird's-eye view of the vehicular environment. By +strategically positioning eight cameras at 45-degree intervals, our +system captures and integrates imagery into a coherent 3x3 grid format, leaving +the center blank, providing an enriched spatial representation that facilitates +efficient processing. In our approach, we employ YOLO's detection mechanism, +favoring its inherent advantages of swift response and compact model structure. +Instead of leveraging the conventional YOLO detection head, we augment it with +a custom-designed detection head, translating the panoramically captured data +into a unified bird's-eye view map of the ego car. Preliminary results validate the +feasibility of YOLO-BEV in real-time vehicular perception tasks. With its +streamlined architecture and potential for rapid deployment due to its small +parameter count, YOLO-BEV stands as a promising tool that may reshape future +perspectives in autonomous driving systems. + +
+
+
+
+
+ + ☆ SE(3) Diffusion Model-based Point Cloud Registration for Robust 6D + Object Pose Estimation NeurIPS-2023 + + +
+ In this paper, we introduce an SE(3) diffusion model-based point cloud +registration framework for 6D object pose estimation in real-world scenarios. +Our approach formulates the 3D registration task as a denoising diffusion +process, which progressively refines the pose of the source point cloud to +obtain a precise alignment with the model point cloud. Training our framework +involves two operations: An SE(3) diffusion process and an SE(3) reverse +process. The SE(3) diffusion process gradually perturbs the optimal rigid +transformation of a pair of point clouds by continuously injecting noise +(perturbation transformation). By contrast, the SE(3) reverse process focuses +on learning a denoising network that refines the noisy transformation +step-by-step, bringing it closer to the optimal transformation for accurate +pose estimation. Unlike standard diffusion models used in linear Euclidean +spaces, our diffusion model operates on the SE(3) manifold. This requires +exploiting the linear Lie algebra $\mathfrak{se}(3)$ associated with SE(3) to +constrain the transformation transitions during the diffusion and reverse +processes. Additionally, to effectively train our denoising network, we derive +a registration-specific variational lower bound as the optimization objective +for model learning. Furthermore, we show that our denoising network can be +constructed with a surrogate registration model, making our approach applicable +to different deep registration networks. Extensive experiments demonstrate that +our diffusion registration framework presents outstanding pose estimation +performance on the real-world TUD-L, LINEMOD, and Occluded-LINEMOD datasets. + +
+
+ comment: Accepted by NeurIPS-2023 +
+
+
+
+
+ + ☆ Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning + + +
+ Ahead-of-time forecasting of the output power of power plants is essential +for the stability of the electricity grid and ensuring uninterrupted service. +However, forecasting renewable energy sources is difficult due to the chaotic +behavior of natural energy sources. This paper presents a new approach to +estimate short-term solar irradiance from sky images. The proposed algorithm +extracts features from sky images and uses learning-based techniques to estimate +the solar irradiance. The performance of the proposed machine learning (ML) +algorithm is evaluated using two publicly available datasets of sky images. +The datasets contain over 350,000 images for an interval of 16 years, from 2004 +to 2020, with the corresponding global horizontal irradiance (GHI) of each +image as the ground truth. Compared to the state-of-the-art computationally +heavy algorithms proposed in the literature, our approach achieves competitive +results with much less computational complexity for both nowcasting and +forecasting up to 4 h ahead of time. + +
+
+ comment: Published in MDPI Electronics Journal +
+
+
+
+
+ + ☆ CADS: Unleashing the Diversity of Diffusion Models through + Condition-Annealed Sampling + + +
+ While conditional diffusion models are known to have good coverage of the +data distribution, they still face limitations in output diversity, +particularly when sampled with a high classifier-free guidance scale for +optimal image quality or when trained on small datasets. We attribute this +problem to the role of the conditioning signal in inference and offer an +improved sampling strategy for diffusion models that can increase generation +diversity, especially at high guidance scales, with minimal loss of sample +quality. Our sampling strategy anneals the conditioning signal by adding +scheduled, monotonically decreasing Gaussian noise to the conditioning vector +during inference to balance diversity and condition alignment. Our +Condition-Annealed Diffusion Sampler (CADS) can be used with any pretrained +model and sampling algorithm, and we show that it boosts the diversity of +diffusion models in various conditional generation tasks. Further, using an +existing pretrained diffusion model, CADS achieves a new state-of-the-art FID +of 1.70 and 2.31 for class-conditional ImageNet generation at 256$\times$256 +and 512$\times$512 respectively. + +
+
+
+
+
+ + ☆ C-Disentanglement: Discovering Causally-Independent Generative Factors + under an Inductive Bias of Confounder + + +
+ Representation learning assumes that real-world data is generated by a few +semantically meaningful generative factors (i.e., sources of variation) and +aims to discover them in the latent space. These factors are expected to be +causally disentangled, meaning that distinct factors are encoded into separate +latent variables, and changes in one factor will not affect the values of the +others. Compared to statistical independence, causal disentanglement allows +more controllable data generation, improved robustness, and better +generalization. However, most existing work assumes unconfoundedness in the +discovery process, i.e., that there are no common causes of the generative +factors, and thus obtains only statistical independence. In this paper, we recognize the +importance of modeling confounders in discovering causal generative factors. +Unfortunately, such factors are not identifiable without proper inductive bias. +We fill the gap by introducing a framework entitled Confounded-Disentanglement +(C-Disentanglement), the first framework that explicitly introduces the +inductive bias of confounder via labels from domain expertise. In addition, we +accordingly propose an approach to sufficiently identify the causally +disentangled factors under any inductive bias of the confounder. We conduct +extensive experiments on both synthetic and real-world datasets. Our method +demonstrates competitive results compared to various SOTA baselines in +obtaining causally disentangled features and downstream tasks under domain +shifts. + +
+
+ comment: accepted to Neurips 2023 +
+
+
+
+
+ + ☆ IndustReal: A Dataset for Procedure Step Recognition Handling Execution + Errors in Egocentric Videos in an Industrial-Like Setting WACV 2024 + + +
+ Although action recognition for procedural tasks has received notable +attention, it has a fundamental flaw in that no measure of success for actions +is provided. This limits the applicability of such systems especially within +the industrial domain, since the outcome of procedural actions is often +significantly more important than the mere execution. To address this +limitation, we define the novel task of procedure step recognition (PSR), +focusing on recognizing the correct completion and order of procedural steps. +Alongside the new task, we also present the multi-modal IndustReal dataset. +Unlike currently available datasets, IndustReal contains procedural errors +(such as omissions) as well as execution errors. A significant part of these +errors are exclusively present in the validation and test sets, making +IndustReal suitable to evaluate robustness of algorithms to new, unseen +mistakes. Additionally, to encourage reproducibility and allow for scalable +approaches trained on synthetic data, the 3D models of all parts are publicly +available. Annotations and benchmark performance are provided for action +recognition and assembly state detection, as well as the new PSR task. +IndustReal, along with the code and model weights, is available at: +https://github.com/TimSchoonbeek/IndustReal . + +
+
+ comment: Accepted for WACV 2024. 15 pages, 9 figures, including supplementary + materials +
+
+
+
+
+ + ☆ Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with + Rich Semantics + + +
+ Defect inspection is paramount within the closed-loop manufacturing system. +However, existing datasets for defect inspection often lack the precision and +semantic granularity required for practical applications. In this paper, we +introduce the Defect Spectrum, a comprehensive benchmark that offers precise, +semantic-abundant, and large-scale annotations for a wide range of industrial +defects. Building on four key industrial benchmarks, our dataset refines +existing annotations and introduces rich semantic details, distinguishing +multiple defect types within a single image. Furthermore, we introduce +Defect-Gen, a two-stage diffusion-based generator designed to create +high-quality and diverse defective images, even when working with limited +datasets. The synthetic images generated by Defect-Gen significantly enhance +the efficacy of defect inspection models. Overall, the Defect Spectrum dataset +demonstrates its potential in defect inspection research, offering a solid +platform for testing and refining advanced models. + +
+
+
+
+
+ + ☆ Scale-Adaptive Feature Aggregation for Efficient Space-Time Video + Super-Resolution WACV2024 + + +
+ The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual +quality of videos by simultaneously performing video frame interpolation (VFI) +and video super-resolution (VSR). However, facing the challenge of the +additional temporal dimension and scale inconsistency, most existing STVSR +methods are complex and inflexible in dynamically modeling different motion +amplitudes. In this work, we find that choosing an appropriate processing scale +achieves remarkable benefits in flow-based feature propagation. We propose a +novel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects +sub-networks with different processing scales for individual samples. +Experiments on four public STVSR benchmarks demonstrate that SAFA achieves +state-of-the-art performance. Our SAFA network outperforms recent +state-of-the-art methods such as TMNet and VideoINR by an average improvement +of over 0.5dB on PSNR, while requiring less than half the number of parameters +and only 1/3 of the computational cost. + +
+
+ comment: WACV2024, 16 pages +
+
+
+
+
+ + ☆ RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open + Environments NeurIPS 2023 + + +
+ Intention-oriented object detection aims to detect desired objects based on +specific intentions or requirements. For instance, when we desire to "lie down +and rest", we instinctively seek out a suitable option such as a "bed" or a +"sofa" that can fulfill our needs. Previous work in this area is limited either +by the number of intention descriptions or by the affordance vocabulary +available for intention objects. These limitations make it challenging to +handle intentions in open environments effectively. To facilitate this +research, we construct a comprehensive dataset called Reasoning +Intention-Oriented Objects (RIO). In particular, RIO is specifically designed +to incorporate diverse real-world scenarios and a wide range of object +categories. It offers the following key features: 1) intention descriptions in +RIO are represented as natural sentences rather than a mere word or verb +phrase, making them more practical and meaningful; 2) the intention +descriptions are contextually relevant to the scene, enabling a broader range +of potential functionalities associated with the objects; 3) the dataset +comprises a total of 40,214 images and 130,585 intention-object pairs. With the +proposed RIO, we evaluate the ability of some existing models to reason +intention-oriented objects in open environments. + +
+
+ comment: NeurIPS 2023 D&B accepted. See our project page for more details: + https://reasonio.github.io/ +
+
+
+
+
+ + ☆ BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point + Clouds 3DV 2024 + + +
+ We present a surprisingly simple and efficient method for self-supervision of +3D backbones on automotive Lidar point clouds. We design a contrastive loss +between features of Lidar scans captured in the same scene. Several such +approaches have been proposed in the literature, from PointContrast, which uses +a contrast at the level of points, to the state-of-the-art TARL, which uses a +contrast at the level of segments, roughly corresponding to objects. While the +former enjoys a great simplicity of implementation, it is surpassed by the +latter, which however requires costly pre-processing. In BEVContrast, we +define our contrast at the level of 2D cells in the Bird's Eye View plane. +Resulting cell-level representations offer a good trade-off between the +point-level representations exploited in PointContrast and segment-level +representations exploited in TARL: we retain the simplicity of PointContrast +(cell representations are cheap to compute) while surpassing the performance of +TARL in downstream semantic segmentation. + +
+
+ comment: Accepted to 3DV 2024 +
+
+
+
+
+ + ☆ Attribute Based Interpretable Evaluation Metrics for Generative Models + + +
+ When the training dataset comprises a 1:1 proportion of dogs to cats, a +generative model that produces 1:1 dogs and cats better resembles the training +species distribution than another model with 3:1 dogs and cats. Can we capture +this phenomenon using existing metrics? Unfortunately, we cannot, because these +metrics do not provide any interpretability beyond "diversity". In this +context, we propose a new evaluation protocol that measures the divergence of a +set of generated images from the training set regarding the distribution of +attribute strengths as follows. Single-attribute Divergence (SaD) measures the +divergence regarding PDFs of a single attribute. Paired-attribute Divergence +(PaD) measures the divergence regarding joint PDFs of a pair of attributes. +They reveal which attributes the models struggle with. For measuring the attribute +strengths of an image, we propose Heterogeneous CLIPScore (HCS), which measures +the cosine similarity between image and text vectors with heterogeneous initial +points. With SaD and PaD, we reveal the following about existing generative +models. ProjectedGAN generates implausible attribute relationships such as a +baby with a beard even though it has competitive scores on existing metrics. +Diffusion models struggle to capture diverse colors in the datasets. With more +sampling timesteps, the latent diffusion model generates more minor objects, +including earrings and necklaces. Stable Diffusion v1.5 better captures the +attributes than v2.1. Our metrics lay a foundation for explainable evaluations +of generative models. + +
+
+
+
+
+ + ☆ Generalizing to Unseen Domains in Diabetic Retinopathy Classification WACV 2024 + + +
+ Diabetic retinopathy (DR) is caused by long-standing diabetes and is the +fifth leading cause of visual impairment. Early diagnosis +and treatment could be helpful in curing the disease; however, the detection +procedure is rather challenging and mostly tedious. Therefore, automated +diabetic retinopathy classification using deep learning techniques has gained +interest in the medical imaging community. Akin to several other real-world +applications of deep learning, the typical assumption of i.i.d. data is also +violated in DR classification that relies on deep learning. Therefore, +developing DR classification methods robust to unseen distributions is of great +value. In this paper, we study the problem of generalizing a model to unseen +distributions or domains (a.k.a. domain generalization) in DR classification. To +this end, we propose a simple and effective domain generalization (DG) approach +that achieves self-distillation in vision transformers (ViT) via a novel +prediction softening mechanism. This prediction softening is an adaptive convex +combination of one-hot labels with the model's own knowledge. We perform extensive +experiments on challenging open-source DR classification datasets under both +multi-source and single-source DG settings with three different ViT backbones +to establish the efficacy and applicability of our approach against competing +methods. For the first time, we report the performance of several +state-of-the-art DG methods on open-source DR classification datasets after +conducting thorough experiments. Finally, our method is also capable of +delivering better calibration performance than other methods, showing its +suitability for safety-critical applications, including healthcare. We hope +that our contributions will instigate more DG research across the medical +imaging community. + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ☆ Prototypical Contrastive Learning-based CLIP Fine-tuning for Object + Re-identification + + +
+ This work aims to adapt large-scale pre-trained vision-language models, such +as contrastive language-image pretraining (CLIP), to enhance the performance of +object re-identification (Re-ID) across various supervision settings. Although +prompt learning has enabled a recent work named CLIP-ReID to achieve promising +performance, the underlying mechanisms and the necessity of prompt learning +remain unclear due to the absence of semantic labels in Re-ID tasks. In this +work, we first analyze the role of prompt learning in CLIP-ReID and identify its +limitations. Based on our investigations, we propose a simple yet effective +approach to adapt CLIP for supervised object Re-ID. Our approach directly +fine-tunes the image encoder of CLIP using a prototypical contrastive learning +(PCL) loss, eliminating the need for prompt learning. Experimental results on +both person and vehicle Re-ID datasets demonstrate the competitiveness of our +method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP +fine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art +performance. + +
+
+
+
+
+ + ☆ Three-dimensional Bone Image Synthesis with Generative Adversarial + Networks + + +
+ Medical image processing has been highlighted as an area where deep +learning-based models have the greatest potential. However, in the medical +field in particular, problems of data availability and privacy are hampering +research progress and thus rapid implementation in clinical routine. The +generation of synthetic data not only ensures privacy, but also makes it possible to +"draw" new patients with specific characteristics, enabling the +development of data-driven models on a much larger scale. This work +demonstrates that three-dimensional generative adversarial networks (GANs) can +be efficiently trained to generate high-resolution medical volumes with finely +detailed voxel-based architectures. In addition, GAN inversion is successfully +implemented for the three-dimensional setting and used for extensive research +on model interpretability and applications such as image morphing, attribute +editing and style mixing. The results are comprehensively validated on a +database of three-dimensional HR-pQCT instances representing the bone +micro-architecture of the distal radius. + +
+
+ comment: Submitted to the journal Artificial Intelligence in Medicine +
+
+
+
+
+ + ☆ Emotion Recognition by Video: A review + + +
+ Video emotion recognition is an important branch of affective computing, and +its solutions can be applied in different fields such as human-computer +interaction (HCI) and intelligent medical treatment. Although the number of +papers published in the field of emotion recognition is increasing, there are +few comprehensive literature reviews covering related research on video emotion +recognition. Therefore, this paper selects articles published from 2015 to 2023 +to systematize the existing trends in video emotion recognition in related +studies. In this paper, we first discuss two typical emotion models, then we +discuss databases that are frequently used for video emotion +recognition, including unimodal databases and multimodal databases. Next, we +examine and classify the specific structure and performance of modern unimodal +and multimodal video emotion recognition methods, discuss the benefits and +drawbacks of each, and then compare them in detail in the tables. Further, +we summarize the main difficulties currently faced by video emotion +recognition tasks and point out some of the most promising future +directions, such as establishing an open benchmark database and better multimodal +fusion strategies. The main objective of this paper is to help academic +and industrial researchers keep up to date with the latest advances and +new developments in this fast-moving, high-impact field of video emotion +recognition. + +
+
+
+
+
+ + ☆ Weakly-Supervised Surgical Phase Recognition + + +
+ A key element of computer-assisted surgery systems is phase recognition of +surgical videos. Existing phase recognition algorithms require frame-wise +annotation of a large number of videos, which is time-consuming and costly. In +this work we join concepts of graph segmentation with self-supervised learning +to derive a random-walk solution for per-frame phase prediction. Furthermore, +we utilize within our method two forms of weak supervision: sparse timestamps +or few-shot learning. The proposed algorithm enjoys low complexity and can +operate in low-data regimes. We validate our method by running experiments with +the public Cholec80 dataset of laparoscopic cholecystectomy videos, +demonstrating promising performance in multiple setups. + +
+
+
+
+
+ + ☆ Lookup Table meets Local Laplacian Filter: Pyramid Reconstruction + Network for Tone Mapping + + +
+ Tone mapping aims to convert high dynamic range (HDR) images to low dynamic +range (LDR) representations, a critical task in the camera imaging pipeline. In +recent years, 3-Dimensional LookUp Table (3D LUT) based methods have gained +attention due to their ability to strike a favorable balance between +enhancement performance and computational efficiency. However, these methods +often fail to deliver satisfactory results in local areas since the look-up +table is a global operator for tone mapping, which works based on pixel values +and fails to incorporate crucial local information. To this end, this paper +aims to address this issue by exploring a novel strategy that integrates global +and local operators by utilizing closed-form Laplacian pyramid decomposition +and reconstruction. Specifically, we employ image-adaptive 3D LUTs to +manipulate the tone in the low-frequency image by leveraging the specific +characteristics of the frequency information. Furthermore, we utilize local +Laplacian filters to refine the edge details in the high-frequency components +in an adaptive manner. Local Laplacian filters are widely used to preserve edge +details in photographs, but their conventional usage involves manual tuning and +fixed implementation within camera imaging pipelines or photo editing tools. We +propose to learn parameter value maps progressively for local Laplacian filters +from annotated data using a lightweight network. Our model achieves +simultaneous global tone manipulation and local edge detail preservation in an +end-to-end manner. Extensive experimental results on two benchmark datasets +demonstrate that the proposed method performs favorably against +state-of-the-art methods. + +
+
+ comment: 12 pages, 6 figures, accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Exploring Iterative Refinement with Diffusion Models for Video Grounding + + +
+ Video grounding aims to localize the target moment in an untrimmed video +corresponding to a given sentence query. Existing methods typically select the +best prediction from a set of predefined proposals or directly regress the +target span in a single-shot manner, resulting in the absence of a systematic +prediction refinement process. In this paper, we propose DiffusionVG, a novel +framework with diffusion models that formulates video grounding as a +conditional generation task, where the target span is generated from Gaussian +noise inputs and iteratively refined in the reverse diffusion process. During +training, DiffusionVG progressively adds noise to the target span with a fixed +forward diffusion process and learns to recover the target span in the reverse +diffusion process. In inference, DiffusionVG can generate the target span from +Gaussian noise inputs by the learned reverse diffusion process conditioned on +the video-sentence representations. Our DiffusionVG follows the encoder-decoder +architecture, which first encodes the video-sentence features and then iteratively +denoises the predicted spans in its specialized span refining decoder. Without +bells and whistles, our DiffusionVG demonstrates competitive or even superior +performance compared to existing well-crafted models on mainstream Charades-STA +and ActivityNet Captions benchmarks. + +
+
+
+
+
+ + ☆ Blind Image Super-resolution with Rich Texture-Aware Codebooks + + +
+ Blind super-resolution (BSR) methods based on high-resolution (HR) +reconstruction codebooks have achieved promising results in recent years. +However, we find that a codebook based on HR reconstruction may not effectively +capture the complex correlations between low-resolution (LR) and HR images. In +detail, multiple HR images may produce similar LR versions due to complex blind +degradations, causing HR-only codebooks to have limited texture +diversity when faced with confusing LR inputs. To alleviate this problem, we +propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists +of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware +Texture Prior Module (PTPM). DTPM effectively mines the cross-resolution +correlation of textures between LR and HR images by exploiting the +cross-resolution correspondence of textures. PTPM uses patch-wise semantic +pre-training to correct the misperception of texture similarity in the +high-level semantic regularization. By taking advantage of this, RTCNet +effectively removes the misalignment of confusing textures between HR and +LR in the BSR scenarios. Experiments show that RTCNet outperforms +state-of-the-art methods on various benchmarks by 0.16 to 0.46 dB. + +
+
+
+
+
+ + ☆ Understanding the Effects of Projectors in Knowledge Distillation + + +
+ Conventionally, during the knowledge distillation process (e.g. feature +distillation), an additional projector is often required to perform feature +transformation due to the dimension mismatch between the teacher and the +student networks. Interestingly, we discovered that even if the student and the +teacher have the same feature dimensions, adding a projector still helps to +improve the distillation performance. In addition, projectors even improve +logit distillation if we add them to the architecture too. Inspired by these +surprising findings and the general lack of understanding of the projectors in +the knowledge distillation process from existing literature, this paper +investigates the implicit role that projectors play but so far have been +overlooked. Our empirical study shows that the student with a projector (1) +obtains a better trade-off between the training accuracy and the testing +accuracy compared to the student without a projector when it has the same +feature dimensions as the teacher, (2) better preserves its similarity to the +teacher beyond shallow and numeric resemblance, from the view of Centered +Kernel Alignment (CKA), and (3) avoids being over-confident as the teacher does +at the testing phase. Motivated by the positive effects of projectors, we +propose a projector ensemble-based feature distillation method to further +improve distillation performance. Despite the simplicity of the proposed +strategy, empirical results from the evaluation of classification tasks on +benchmark datasets demonstrate the superior classification performance of our +method on a broad range of teacher-student pairs and verify from the aspects of +CKA and model calibration that the student's features are of improved quality +with the projector ensemble design. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2210.15274 +
+
+
+
+
+ + ☆ Bridging The Gaps Between Token Pruning and Full Pre-training via Masked + Fine-tuning + + +
+ Despite the success of transformers on various computer vision tasks, they +suffer from excessive memory and computational cost. Some works present dynamic +vision transformers to accelerate inference by pruning redundant tokens. A key +to improving token pruning is using well-trained models as initialization for +faster convergence and better performance. However, current base models usually +adopt full image training, i.e., using full images as inputs and keeping the +whole feature maps through the forward process, which causes inconsistencies +with dynamic models that gradually reduce tokens, including calculation +pattern, information amount and token selection strategy inconsistencies. +Inspired by MAE, which performs a masking-and-reconstruction self-supervised task, +we devise masked fine-tuning to bridge the gaps between pre-trained base models +used for initialization and token pruning based dynamic vision transformers, by +masking image patches and predicting the image class label based on the remaining +unmasked patches. Extensive experiments on ImageNet demonstrate that base +models via masked fine-tuning gain strong robustness to occlusion and +information loss. With this better initialization, Dynamic ViT achieves +higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs. +81.3%, and 62.3% vs. 58.9% for DeiT based Dynamic ViT/0.8 and Dynamic ViT/0.3). +Moreover, we apply our method to different token pruning based dynamic vision +transformers, different pre-trained models and randomly initialized models to +demonstrate the generalization ability. + +
+
+ comment: Submitted to TIP +
+
+
+
+
+ + ☆ A Deep Learning Approach to Teeth Segmentation and Orientation from + Panoramic X-rays + + +
+ Accurate teeth segmentation and orientation are fundamental in modern oral +healthcare, enabling precise diagnosis, treatment planning, and dental implant +design. In this study, we present a comprehensive approach to teeth +segmentation and orientation from panoramic X-ray images, leveraging deep +learning techniques. We build our model based on FUSegNet, a popular model +originally developed for wound segmentation, and introduce modifications by +incorporating grid-based attention gates into the skip connections. We +introduce oriented bounding box (OBB) generation through principal component +analysis (PCA) for precise tooth orientation estimation. Evaluating our +approach on the publicly available DNS dataset, comprising 543 panoramic X-ray +images, we achieve the highest Intersection-over-Union (IoU) score of 82.43% +and Dice Similarity Coefficient (DSC) score of 90.37% among compared models in +teeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU) +score of 82.82%. We also conduct detailed analyses of individual tooth labels +and categorical performance, shedding light on strengths and weaknesses. The +proposed model's accuracy and versatility offer promising prospects for +improving dental diagnoses, treatment planning, and personalized healthcare in +the oral domain. Our generated OBB coordinates and codes are available at +https://github.com/mrinal054/Instance_teeth_segmentation. + +
+
+
+
+
+ + ☆ MO-YOLO: End-to-End Multiple-Object Tracking Method with YOLO and MOTR + + +
+ This paper aims to address critical issues in the field of Multi-Object +Tracking (MOT) by proposing an efficient and computationally resource-efficient +end-to-end multi-object tracking model, named MO-YOLO. Traditional MOT methods +typically involve two separate steps: object detection and object tracking, +leading to computational complexity and error propagation issues. Recent +research has demonstrated outstanding performance in end-to-end MOT models +based on Transformer architectures, but they require substantial hardware +support. MO-YOLO combines the strengths of YOLO and RT-DETR models to construct +a high-efficiency, lightweight, and resource-efficient end-to-end multi-object +tracking network, offering new opportunities in the multi-object tracking +domain. On the MOT17 dataset, MOTR\cite{zeng2022motr} requires training with 8 +GeForce 2080 Ti GPUs for 4 days to achieve satisfactory results, while MO-YOLO +only requires 1 GeForce 2080 Ti GPU and 12 hours of training to achieve +comparable performance. + +
+
+
+
+
+ + ☆ Improving Denoising Diffusion Models via Simultaneous Estimation of + Image and Noise + + +
+ This paper introduces two key contributions aimed at improving the speed and +quality of images generated through inverse diffusion processes. The first +contribution involves reparameterizing the diffusion process in terms of the +angle on a quarter-circular arc between the image and noise, specifically +setting the conventional $\displaystyle \sqrt{\bar{\alpha}}=\cos(\eta)$. This +reparameterization eliminates two singularities and allows for the expression +of diffusion evolution as a well-behaved ordinary differential equation (ODE). +In turn, this allows higher-order ODE solvers such as Runge-Kutta methods to be +used effectively. The second contribution is to directly estimate both the +image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which +enables more stable calculations of the update step in the inverse diffusion +steps, as accurate estimation of both the image and noise is crucial at +different stages of the process. Together with these changes, our model +achieves faster generation, with the ability to converge on high-quality images +more quickly, and higher quality of the generated images, as measured by +metrics such as Frechet Inception Distance (FID), spatial Frechet Inception +Distance (sFID), precision, and recall. + +
+
+
+
+
+ + ☆ Bridging Phylogeny and Taxonomy with Protein-protein Interaction + Networks + + +
+ The protein-protein interaction (PPI) network provides an overview of the +complex biological reactions vital to an organism's metabolism and survival. +Even though PPI networks have been compared across organisms in detail in the past, +there has not been large-scale research on how individual PPI networks reflect +the species relationships. In this study we aim to increase our +understanding of the tree of life and taxonomy by gleaning information from the +PPI networks. We successfully created (1) a predictor of network statistics based +on known traits of existing species in the phylogeny, and (2) a taxonomic +classifier of organisms using the known protein network statistics, whether +experimentally determined or predicted de novo. With the knowledge of protein +interactions at their core, our two models effectively connect two fields with +widely diverging methodologies - the phylogeny and taxonomy of species. + +
+
+
+
+
+ + ☆ Low-Dimensional Gradient Helps Out-of-Distribution Detection + + +
+ Detecting out-of-distribution (OOD) samples is essential for ensuring the +reliability of deep neural networks (DNNs) in real-world scenarios. While +previous research has predominantly investigated the disparity between +in-distribution (ID) and OOD data through forward information analysis, the +discrepancy in parameter gradients during the backward process of DNNs has +received insufficient attention. Existing studies on gradient disparities +mainly focus on the utilization of gradient norms, neglecting the wealth of +information embedded in gradient directions. To bridge this gap, in this paper, +we conduct a comprehensive investigation into leveraging the entirety of +gradient information for OOD detection. The primary challenge arises from the +high dimensionality of gradients due to the large number of network parameters. +To solve this problem, we propose performing linear dimension reduction on the +gradient using a designated subspace that comprises principal components. This +innovative technique enables us to obtain a low-dimensional representation of +the gradient with minimal information loss. Subsequently, by integrating the +reduced gradient with various existing detection score functions, our approach +demonstrates superior performance across a wide range of detection tasks. For +instance, on the ImageNet benchmark, our method achieves an average reduction +of 11.15% in the false positive rate at 95% recall (FPR95) compared to the +current state-of-the-art approach. The code would be released. + +
+
+
+
+
+ + ☆ CosmosDSR -- a methodology for automated detection and tracking of + orbital debris using the Unscented Kalman Filter + + +
+ The Kessler syndrome refers to the escalating space debris from frequent +space activities, threatening future space exploration. Addressing this issue +is vital. Several AI models, including Convolutional Neural Networks (CNN), +Kernel Principal Component Analysis (KPCA), and Model-Agnostic Meta-Learning +(MAML), have been assessed with various data types. Earlier studies highlighted +the combination of the YOLO object detector and a linear Kalman filter for +object detection and tracking. Building on this, our project introduces +CosmosDSR, a novel methodology combining YOLOv3 with an Unscented Kalman Filter +for tracking satellites in sequential images, compared to a linear Kalman +filter. Using the SPARK dataset from the University of Luxembourg for training +and testing, the YOLOv3 precisely detected and classified all satellite +categories (mAP=97.18%, F1=0.95) with few errors (TP=4163, FP=209, FN=237). +Both CosmosDSR and the LKF tracked satellites accurately (UKF: +MSE=2.83/RMSE=1.66, LKF: MSE=2.84/RMSE=1.66). Despite concerns of class +imbalance and the absence of real images, the model shows promise. Future work +should address these limitations, increase tracking sample size, and improve +metrics. This research suggests the algorithm's potential in detecting and +tracking satellites, paving the way for solutions to the Kessler syndrome. + +
+
+ comment: 7 figures, 15 pages inc refs +
+
+
+
+
+ + ☆ Learning depth from monocular video sequences + + +
+ Learning a single-image depth estimation model from monocular video sequences is +a very challenging problem. In this paper, we propose a novel training loss +which enables us to include more images for supervision during the training +process. We propose a simple yet effective model to account for the frame-to-frame +pixel motion. We also design a novel network architecture for single-image +estimation. When combined, our method produces state-of-the-art results for +monocular depth estimation on the KITTI dataset in the self-supervised setting. + +
+
+
+
+
+ + ☆ Deep Imbalanced Regression via Hierarchical Classification Adjustment + + +
+ Regression tasks in computer vision, such as age estimation or counting, are +often formulated into classification by quantizing the target space into +classes. Yet real-world data is often imbalanced -- the majority of training +samples lie in a head range of target values, while a minority of samples span +a usually larger tail range. By selecting the class quantization, one can +adjust imbalanced regression targets into balanced classification outputs, +though there are trade-offs in balancing classification accuracy and +quantization error. To improve regression performance over the entire range of +data, we propose to construct hierarchical classifiers for solving imbalanced +regression tasks. The fine-grained classifiers limit the quantization error +while being modulated by the coarse predictions to ensure high accuracy. +Standard hierarchical classification approaches, however, when applied to the +regression problem, fail to ensure that predicted ranges remain consistent +across the hierarchy. As such, we propose a range-preserving distillation +process that can effectively learn a single classifier from the set of +hierarchical classifiers. Our novel hierarchical classification adjustment +(HCA) for imbalanced regression shows superior results on three diverse tasks: +age estimation, crowd counting and depth estimation. We will release the source +code upon acceptance. + +
+
+ comment: 14 pages, 5 figures +
+
+
+
+
+ + ☆ Technical Note: Feasibility of translating 3.0T-trained Deep-Learning + Segmentation Models Out-of-the-Box on Low-Field MRI 0.55T Knee-MRI of Healthy + Controls + + +
+ In the current study, our purpose is to evaluate the feasibility of applying +deep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in +healthy controls scanned at 0.55T and compared with 3.0T. The current study +assesses the performance of standard in-practice bone and cartilage +segmentation algorithms at 0.55T, both qualitatively and quantitatively, in +terms of comparing segmentation performance, areas of improvement, and +compartment-wise cartilage thickness values between 0.55T and 3.0T. Initial +results demonstrate a usable to good technical feasibility of translating +existing quantitative deep-learning-based image segmentation techniques, +trained on 3.0T, to 0.55T knee MRI, in a multi-vendor acquisition +environment. Especially in terms of segmenting cartilage compartments, the +models perform almost equivalently to 3.0T in terms of Likert ranking. The 0.55T +low-field, sustainable and easy-to-install MRI, as demonstrated, can thus be +utilized for evaluating knee cartilage thickness and bone segmentations aided +by established DL algorithms trained at higher field strengths out-of-the-box +initially. This could be utilized at widely dispersed point-of-care locations +that lack radiologists available to manually segment low-field images, at +least until a sufficient pool of low-field data is collated. With further +fine-tuning with manual labeling of low-field data or utilizing synthesized +higher SNR images from low-field images, OA biomarker quantification +performance is likely to improve further. + +
+
+ comment: 11 Pages, 3 Figures, 2 Tables +
+
+
+
+
+ + ☆ Simple Baselines for Projection-based Full-reference and No-reference + Point Cloud Quality Assessment + + +
+ Point clouds are widely used in 3D content representation and have various +applications in multimedia. However, compression and simplification processes +inevitably result in the loss of quality-aware information under storage and +bandwidth constraints. Therefore, there is an increasing need for effective +methods to quantify the degree of distortion in point clouds. In this paper, we +propose simple baselines for projection-based point cloud quality assessment +(PCQA) to tackle this challenge. We use multi-projections obtained via a common +cube-like projection process from the point clouds for both full-reference (FR) +and no-reference (NR) PCQA tasks. Quality-aware features are extracted with +popular vision backbones. The FR quality representation is computed as the +similarity between the feature maps of reference and distorted projections, +while the NR quality representation is obtained by simply squeezing the feature +maps of distorted projections with average pooling. The corresponding quality +representations are regressed into visual quality scores by fully-connected +layers. Taking part in the ICIP 2023 PCVQA Challenge, we succeeded in achieving +the top spot in four out of the five competition tracks. + +
+
+
+
+
+ + ☆ A Classifier Using Global Character Level and Local Sub-unit Level + Features for Hindi Online Handwritten Character Recognition + + +
+ A classifier is developed that defines a joint distribution of global +character features, number of sub-units and local sub-unit features to model +Hindi online handwritten characters. The classifier uses latent variables to +model the structure of sub-units. The classifier uses histograms of points, +orientations, and dynamics of orientations (HPOD) features to represent +characters at global character level and local sub-unit level and is +independent of character stroke order and stroke direction variations. The +parameters of the classifier are estimated using the maximum likelihood method. +Different classifiers and features used in other studies are considered in this +study for classification performance comparison with the developed classifier. +The classifiers considered are Second Order Statistics (SOS), Sub-space (SS), +Fisher Discriminant (FD), Feedforward Neural Network (FFN) and Support Vector +Machines (SVM) and the features considered are Spatio Temporal (ST), Discrete +Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Wavelet +Transform (DWT), Spatial (SP) and Histograms of Oriented Gradients (HOG). Hindi +character datasets used for training and testing the developed classifier +consist of samples of handwritten characters from 96 different character +classes. There are 12832 samples with an average of 133 samples per character +class in the training set and 2821 samples with an average of 29 samples per +character class in the testing set. The developed classifier has the highest +accuracy of 93.5\% on the testing set compared to that of the classifiers +trained on different features extracted from the same training set and +evaluated on the same testing set considered in this study. + +
+
+ comment: 23 pages, 8 jpg figures. arXiv admin note: text overlap with + arXiv:2310.08222 +
+
+
+
+
+ + ☆ Comparison of Cross-Entropy, Dice, and Focal Loss for Sea Ice Type + Segmentation + + +
+ Up-to-date sea ice charts are crucial for safer navigation in ice-infested +waters. Recently, Convolutional Neural Network (CNN) models have shown the potential +to accelerate the generation of ice maps for large regions. However, results +from CNN models still need to undergo scrutiny, as higher metric performance +does not always translate to adequate outputs. Sea ice type classes are imbalanced, +requiring special treatment during training. We evaluate how three different +loss functions, some developed for imbalanced class problems, affect the +performance of CNN models trained to predict the dominant ice type in +Sentinel-1 images. Although Dice and Focal loss produce higher +metrics, results from cross-entropy seem generally more physically consistent. + +
+
+
+
+
+ + ☆ Virtual Accessory Try-On via Keypoint Hallucination + + +
+ The virtual try-on task refers to fitting the clothes from one image onto +another portrait image. In this paper, we focus on virtual accessory try-on, +which fits accessories (e.g., glasses, ties) onto a face or portrait image. +Unlike clothing try-on, which relies on the human silhouette as guidance, accessory +try-on warps the accessory into an appropriate location and shape to generate a +plausible composite image. In contrast to previous try-on methods that treat +foreground (i.e., accessories) and background (i.e., human faces or bodies) +equally, we propose a background-oriented network to utilize the prior +knowledge of human bodies and accessories. Specifically, our approach learns +the human body priors and hallucinates the target locations of specified +foreground keypoints in the background. Then our approach injects +foreground information with accessory priors into the background UNet. Based on +the hallucinated target locations, the warping parameters are calculated to +warp the foreground. Moreover, this background-oriented network can also easily +incorporate auxiliary human face/body semantic segmentation supervision to +further boost performance. Experiments conducted on the STRAT dataset validate the +effectiveness of our proposed method. + +
+
+
+
+
+ + ☆ Task-driven Prompt Evolution for Foundation Models + + +
+ Promptable foundation models, particularly Segment Anything Model (SAM), have +emerged as a promising alternative to the traditional task-specific supervised +learning for image segmentation. However, many evaluation studies have found +their performance on medical imaging modalities to be underwhelming +compared to conventional deep learning methods. In the world of large +pre-trained language and vision-language models, learning prompts from +downstream tasks has achieved considerable success in improving performance. In +this work, we propose a plug-and-play Prompt Optimization Technique for +foundation models like SAM (SAMPOT) that utilizes the downstream segmentation +task to optimize the human-provided prompt to obtain improved performance. We +demonstrate the utility of SAMPOT on lung segmentation in chest X-ray images +and obtain an improvement on a significant number of cases ($\sim75\%$) over +human-provided initial prompts. We hope this work will lead to further +investigations in the nascent field of automatic visual prompt-tuning. + +
+
+
+
+
+ + ☆ Deep Learning on SAR Imagery: Transfer Learning Versus Randomly + Initialized Weights + + +
+ Deploying deep learning on Synthetic Aperture Radar (SAR) data is becoming +more common for mapping purposes. One such case is sea ice, which is highly +dynamic and rapidly changes as a result of the combined effect of wind, +temperature, and ocean currents. Therefore, frequent mapping of sea ice is +necessary to ensure safe marine navigation. However, there is a general +shortage of expert-labeled data to train deep learning algorithms. Fine-tuning +a pre-trained model on SAR imagery is a potential solution. In this paper, we +compare the performance of deep learning models trained from scratch using +randomly initialized weights against pre-trained models that we fine-tune for +this purpose. Our results show that pre-trained models lead to better results, +especially on test samples from the melt season. + +
+
+
+
+
+ + ☆ Enhancing sea ice segmentation in Sentinel-1 images with atrous + convolutions + + +
+ Due to the growing volume of remote sensing data and the low latency required +for safe marine navigation, machine learning (ML) algorithms are being +developed to accelerate sea ice chart generation, currently a manual +interpretation task. However, the low signal-to-noise ratio of the freely +available Sentinel-1 Synthetic Aperture Radar (SAR) imagery, the ambiguity of +backscatter signals for ice types, and the scarcity of open-source +high-resolution labelled data make automating sea ice mapping challenging. We +use Extreme Earth version 2, a high-resolution benchmark dataset generated for +ML training and evaluation, to investigate the effectiveness of ML for +automated sea ice mapping. Our customized pipeline combines ResNets and Atrous +Spatial Pyramid Pooling for SAR image segmentation. We investigate the +performance of our model for: i) binary classification of sea ice and open +water in a segmentation framework; and ii) a multiclass segmentation of five +sea ice types. For binary ice-water classification, models trained with our +largest training set have weighted F1 scores all greater than 0.95 for January +and July test scenes. Specifically, the median weighted F1 score was 0.98, +indicating high performance for both months. By comparison, a competitive +baseline U-Net has a weighted average F1 score ranging from 0.92 to 0.94 +(median 0.93) for July, and 0.97 to 0.98 (median 0.97) for January. Multiclass +ice type classification is more challenging, and even though our models achieve +2% improvement in weighted F1 average compared to the baseline U-Net, test +weighted F1 is generally between 0.6 and 0.80. Our approach can efficiently +segment full SAR scenes in one run, is faster than the baseline U-Net, retains +spatial resolution and dimension, and is more robust against noise compared to +approaches that rely on patch classification. + +
+
+
+
+
+ + ☆ LP-OVOD: Open-Vocabulary Object Detection by Linear Probing WACV 2024 + + +
+ This paper addresses the challenging problem of open-vocabulary object +detection (OVOD) where an object detector must identify both seen and unseen +classes in test images without labeled examples of the unseen classes in +training. A typical approach for OVOD is to use joint text-image embeddings of +CLIP to assign box proposals to their closest text label. However, this method +has a critical issue: many low-quality boxes, such as over- and +under-covered-object boxes, have the same similarity score as high-quality +boxes since CLIP is not trained on exact object location information. To +address this issue, we propose a novel method, LP-OVOD, that discards +low-quality boxes by training a sigmoid linear classifier on pseudo labels +retrieved from the top relevant region proposals to the novel text. +Experimental results on COCO affirm the superior performance of our approach +over the state of the art, achieving $\textbf{40.5}$ in $\text{AP}_{novel}$ +using ResNet50 as the backbone and without external datasets or knowing novel +classes during training. Our code will be available at +https://github.com/VinAIResearch/LP-OVOD. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ☆ Navigating Data Heterogeneity in Federated Learning: A Semi-Supervised + Approach for Object Detection NeurIPS 2023 + + +
+ Federated Learning (FL) has emerged as a potent framework for training models +across distributed data sources while maintaining data privacy. Nevertheless, +it faces challenges with limited high-quality labels and non-IID client data, +particularly in applications like autonomous driving. To address these hurdles, +we navigate the uncharted waters of Semi-Supervised Federated Object Detection +(SSFOD). We present a pioneering SSFOD framework, designed for scenarios where +labeled data reside only at the server while clients possess unlabeled data. +Notably, our method represents the inaugural implementation of SSFOD for +clients with 0% labeled non-IID data, a stark contrast to previous studies that +maintain some subset of labels at each client. We propose FedSTO, a two-stage +strategy encompassing Selective Training followed by Orthogonally enhanced +full-parameter training, to effectively address data shift (e.g. weather +conditions) between server and clients. Our contributions include selectively +refining the backbone of the detector to avert overfitting, orthogonality +regularization to boost representation divergence, and local EMA-driven pseudo +label assignment to yield high-quality pseudo labels. Extensive validation on +prominent autonomous driving datasets (BDD100K, Cityscapes, and SODA10M) +attests to the efficacy of our approach, demonstrating state-of-the-art +results. Remarkably, FedSTO, using just 20-30% of labels, performs nearly as +well as fully-supervised centralized training methods. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Automating lichen monitoring in ecological studies using instance + segmentation of time-lapse images ICML + + +
+ Lichens are symbiotic organisms composed of fungi, algae, and/or +cyanobacteria that thrive in a variety of environments. They play important +roles in carbon and nitrogen cycling, and contribute directly and indirectly to +biodiversity. Ecologists typically monitor lichens by using them as indicators +to assess air quality and habitat conditions. In particular, epiphytic lichens, +which live on trees, are key markers of air quality and environmental health. A +new method of monitoring epiphytic lichens involves using time-lapse cameras to +gather images of lichen populations. These cameras are used by ecologists in +Newfoundland and Labrador to subsequently analyze and manually segment the +images to determine lichen thalli condition and change. These methods are +time-consuming and susceptible to observer bias. In this work, we aim to +automate the monitoring of lichens over extended periods and to estimate their +biomass and condition to facilitate the task of ecologists. To accomplish this, +our proposed framework uses semantic segmentation with an effective training +approach to automate monitoring and biomass estimation of epiphytic lichens on +time-lapse images. We show that our method has the potential to significantly +improve the accuracy and efficiency of lichen population monitoring, making it +a valuable tool for forest ecologists and environmental scientists to evaluate +the impact of climate change on Canada's forests. To the best of our knowledge, +this is the first time that such an approach has been used to assist ecologists +in monitoring and analyzing epiphytic lichens. + +
+
+ comment: 6 pages, 3 Figures, 8 Tables, Accepted for publication in IEEE + International Conference on Machine Learning and Applications (ICMLA), + copyright IEEE +
+
+
+
+
+ + ☆ HCT: Hybrid Convnet-Transformer for Parkinson's disease detection and + severity prediction from gait ICML + + +
+ In this paper, we propose a novel deep learning method based on a new Hybrid +ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) +from gait data. We adopt a two-step approach by dividing the problem into two +sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy +versus parkinsonian patients. If the patient is parkinsonian, a multi-class +Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to +assess the PD severity stage. Our hybrid architecture exploits the strengths of +both Convolutional Neural Networks (ConvNets) and Transformers to accurately +detect PD and determine the severity stage. In particular, we take advantage of +ConvNets to capture local patterns and correlations in the data, while we +exploit Transformers for handling long-term dependencies in the input signal. +We show that our hybrid method achieves superior performance when compared to +other state-of-the-art methods, with a PD detection accuracy of 97% and a +severity staging accuracy of 87%. Our source code is available at: +https://github.com/SafwenNaimi + +
+
+ comment: 6 pages, 6 figures, 3 tables, Accepted for publication in IEEE + International Conference on Machine Learning and Applications (ICMLA), + copyright IEEE +
+
+
+
+
+ + ☆ HyperFields: Towards Zero-Shot Generation of NeRFs from Text + + +
+ We introduce HyperFields, a method for generating text-conditioned Neural +Radiance Fields (NeRFs) with a single forward pass and (optionally) some +fine-tuning. Key to our approach are: (i) a dynamic hypernetwork, which learns +a smooth mapping from text token embeddings to the space of NeRFs; (ii) NeRF +distillation training, which distills scenes encoded in individual NeRFs into +one dynamic hypernetwork. These techniques enable a single network to fit over +a hundred unique scenes. We further demonstrate that HyperFields learns a more +general map between text and NeRFs, and consequently is capable of predicting +novel in-distribution and out-of-distribution scenes -- either zero-shot or +with a few finetuning steps. Finetuning HyperFields benefits from accelerated +convergence thanks to the learned general map, and is capable of synthesizing +novel scenes 5 to 10 times faster than existing neural optimization-based +methods. Our ablation experiments show that both the dynamic architecture and +NeRF distillation are critical to the expressivity of HyperFields. + +
+
+ comment: Project page: https://threedle.github.io/hyperfields/ +
+
+
+
+
+ + ☆ Image Prior and Posterior Conditional Probability Representation for + Efficient Damage Assessment + + +
+ It is important to quantify Damage Assessment (DA) for Human Assistance and +Disaster Response (HADR) applications. In this paper, to achieve efficient and +scalable DA in HADR, an image prior and posterior conditional probability +(IP2CP) is developed as an effective computational imaging representation. +Equipped with the IP2CP representation, the matching pre- and post-disaster +images are effectively encoded into one image that is then processed using deep +learning approaches to determine the damage levels. Two scenarios of crucial +importance for the practical use of DA in HADR applications are examined: +pixel-wise semantic segmentation and patch-based contrastive learning-based +global damage classification. Results achieved by IP2CP in both scenarios +demonstrate promising performances, showing that our IP2CP-based methods within +the deep learning framework can effectively achieve data and computational +efficiency, which is of utmost importance for the DA in HADR applications. + +
+
+ comment: 6 pages, 2 figures +
+
+
+
+
+ + ☆ ControlLLM: Augment Language Models with Tools by Searching on Graphs + + +
+ We present ControlLLM, a novel framework that enables large language models +(LLMs) to utilize multi-modal tools for solving complex real-world tasks. +Despite the remarkable performance of LLMs, they still struggle with tool +invocation due to ambiguous user prompts, inaccurate tool selection and +parameterization, and inefficient tool scheduling. To overcome these +challenges, our framework comprises three key components: (1) a \textit{task +decomposer} that breaks down a complex task into clear subtasks with +well-defined inputs and outputs; (2) a \textit{Thoughts-on-Graph (ToG) +paradigm} that searches the optimal solution path on a pre-built tool graph, +which specifies the parameter and dependency relations among different tools; +and (3) an \textit{execution engine with a rich toolbox} that interprets the +solution path and runs the tools efficiently on different computational +devices. We evaluate our framework on diverse tasks involving image, audio, and +video processing, demonstrating its superior accuracy, efficiency, and +versatility compared to existing methods. + +
+
+ comment: 22 pages, 9 figures, 10 tables +
+
+
+
+
+ + ☆ AutoCT: Automated CT registration, segmentation, and quantification + + +
+ The processing and analysis of computed tomography (CT) imaging is important +for both basic scientific development and clinical applications. In AutoCT, we +provide a comprehensive pipeline that integrates an end-to-end automatic +preprocessing, registration, segmentation, and quantitative analysis of 3D CT +scans. The engineered pipeline enables atlas-based CT segmentation and +quantification leveraging diffeomorphic transformations through efficient +forward and inverse mappings. The extracted localized features from the +deformation field allow for downstream statistical learning that may facilitate +medical diagnostics. On a lightweight and portable software platform, AutoCT +provides a new toolkit for the CT imaging community to underpin the deployment +of artificial intelligence-driven applications. + +
+
+
+
+
+ + ☆ Graph Convolutional Networks for Complex Traffic Scenario Classification + + +
+ A scenario-based testing approach can reduce the time required to obtain +statistically significant evidence of the safety of Automated Driving Systems +(ADS). Identifying these scenarios in an automated manner is a challenging +task. Most methods on scenario classification do not work for complex scenarios +with diverse environments (highways, urban) and interaction with other traffic +agents. This is mirrored in their approaches which model an individual vehicle +in relation to its environment, but neglect the interaction between multiple +vehicles (e.g. cut-ins, stationary lead vehicle). Furthermore, existing +datasets lack diversity and do not have per-frame annotations to accurately +learn the start and end time of a scenario. We propose a method for complex +traffic scenario classification that is able to model the interaction of a +vehicle with the environment, as well as other agents. We use Graph +Convolutional Networks to model spatial and temporal aspects of these +scenarios. Expanding the nuScenes and Argoverse 2 driving datasets, we +introduce a scenario-labeled dataset, which covers different driving +environments and is annotated per frame. Training our method on this dataset, +we present a promising baseline for future research on per-frame complex +scenario classification. + +
+
+ comment: Netherlands Conference on Computer Vision (NCCV) 2023 camera-ready + + supplementary material +
+
+
+
+
+ + ☆ GROOViST: A Metric for Grounding Objects in Visual Storytelling EMNLP 2023 + + +
+ A proper evaluation of stories generated for a sequence of images -- the task +commonly referred to as visual storytelling -- must consider multiple aspects, +such as coherence, grammatical correctness, and visual grounding. In this work, +we focus on evaluating the degree of grounding, that is, the extent to which a +story is about the entities shown in the images. We analyze current metrics, +both designed for this purpose and for general vision-text alignment. Given +their observed shortcomings, we propose a novel evaluation tool, GROOViST, that +accounts for cross-modal dependencies, temporal misalignments (the fact that +the order in which entities appear in the story and the image sequence may not +match), and human intuitions on visual grounding. An additional advantage of +GROOViST is its modular design, where the contribution of each component can be +assessed and interpreted individually. + +
+
+ comment: In EMNLP 2023 main conference proceedings (to appear) +
+
+
+
+
+ + ☆ A Dataset of Relighted 3D Interacting Hands NeurIPS 2023 + + +
+ Two-hand interaction is one of the most challenging signals to analyze +due to the self-similarity, complicated articulations, and occlusions of hands. +Although several datasets have been proposed for two-hand interaction +analysis, none of them achieves both 1) diverse and realistic image appearances +and 2) diverse and large-scale ground-truth (GT) 3D poses. In +this work, we propose Re:InterHand, a dataset of relighted 3D interacting hands +that achieves both goals. To this end, we employ a state-of-the-art hand +relighting network with our accurately tracked two-hand 3D poses. We compare +Re:InterHand with existing 3D interacting hands datasets and show its +benefits. Re:InterHand is available at +https://mks0601.github.io/ReInterHand/. + +
+
+ comment: Accepted by NeurIPS 2023 (Datasets and Benchmarks Track) +
+
+
+
+
+ + ☆ SynergyNet: Bridging the Gap between Discrete and Continuous + Representations for Precise Medical Image Segmentation WACV 2024 + + +
+ In recent years, continuous latent space (CLS) and discrete latent space +(DLS) deep learning models have been proposed for medical image analysis for +improved performance. However, these models encounter distinct challenges. CLS +models capture intricate details but often lack interpretability in terms of +structural representation and robustness due to their emphasis on low-level +features. Conversely, DLS models offer interpretability, robustness, and the +ability to capture coarse-grained information thanks to their structured latent +space. However, DLS models have limited efficacy in capturing fine-grained +details. To address the limitations of both DLS and CLS models, we propose +SynergyNet, a novel bottleneck architecture designed to enhance existing +encoder-decoder segmentation frameworks. SynergyNet seamlessly integrates +discrete and continuous representations to harness complementary information +and successfully preserves both fine and coarse-grained details in the learned +representations. Our extensive experiments on multi-organ segmentation and +cardiac datasets demonstrate that SynergyNet outperforms other state-of-the-art +methods, including TransUNet, improving Dice scores by 2.16% and Hausdorff +scores by 11.13%. When evaluating skin lesion and brain +tumor segmentation datasets, we observe a remarkable improvement of 1.71% in +Intersection-over-Union scores for skin lesion segmentation and of 8.58% for +brain tumor segmentation. Our innovative approach paves the way for enhancing +the overall performance and capabilities of deep learning models in the +critical domain of medical image analysis. + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ☆ Alzheimers Disease Diagnosis by Deep Learning Using MRI-Based Approaches + + +
+ The most frequent kind of dementia of the nervous system, Alzheimer's +disease, weakens several brain processes (such as memory) and eventually +results in death. The clinical study uses magnetic resonance imaging to +diagnose AD. Deep learning algorithms are capable of pattern recognition and +feature extraction from the inputted raw data. As early diagnosis and stage +detection are the most crucial elements in enhancing patient care and treatment +outcomes, deep learning algorithms for MRI images have recently allowed for +diagnosing a medical condition at the beginning stage and identifying +particular symptoms of Alzheimer's disease. As a result, we aimed to analyze +five specific studies focused on AD diagnosis using MRI-based deep learning +algorithms between 2021 and 2023 in this study. To completely illustrate the +differences between these techniques and comprehend how deep learning +algorithms function, we attempted to explore selected approaches in depth. + +
+
+
+
+
+ + ☆ Improving Traffic Density Forecasting in Intelligent Transportation + Systems Using Gated Graph Neural Networks + + +
+ This study delves into the application of graph neural networks in the realm +of traffic forecasting, a crucial facet of intelligent transportation systems. +Accurate traffic predictions are vital for functions like trip planning, +traffic control, and vehicle routing in such systems. Three prominent GNN +architectures, Graph Convolutional Networks (GCNs), GraphSAGE (Graph Sample and Aggregation), and +Gated Graph Neural Networks (GGNNs), are explored within the context of traffic +prediction. Each architecture's methodology is thoroughly examined, including +layer configurations, activation functions, and hyperparameters. The primary +goal is to minimize prediction errors, with GGNNs emerging as the most +effective choice among the three models. The research outlines outcomes for +each architecture, elucidating their predictive performance through root mean +squared error (RMSE) and mean absolute error (MAE). Hypothetical results reveal +intriguing insights: GCNs display an RMSE of 9.10 and an MAE of 8.00, while +GraphSAGE shows improvement with an RMSE of 8.3 and an MAE of 7.5. Gated Graph +Neural Networks (GGNNs) exhibit the lowest RMSE at 9.15 and an impressive MAE +of 7.1, positioning them as the frontrunner. + +
+
+
+
+
+ + ☆ Advancing Brain Tumor Detection: A Thorough Investigation of CNNs, + Clustering, and SoftMax Classification in the Analysis of MRI Images + + +
+ Brain tumors pose a significant global health challenge due to their high +prevalence and mortality rates across all age groups. Detecting brain tumors at +an early stage is crucial for effective treatment and patient outcomes. This +study presents a comprehensive investigation into the use of Convolutional +Neural Networks (CNNs) for brain tumor detection using Magnetic Resonance +Imaging (MRI) images. The dataset, consisting of MRI scans from both healthy +individuals and patients with brain tumors, was processed and fed into the CNN +architecture. The SoftMax Fully Connected layer was employed to classify the +images, achieving an accuracy of 98%. To evaluate the CNN's performance, two +other classifiers, Radial Basis Function (RBF) and Decision Tree (DT), were +utilized, yielding accuracy rates of 98.24% and 95.64%, respectively. The study +also introduced a clustering method for feature extraction, improving CNN's +accuracy. Sensitivity, Specificity, and Precision were employed alongside +accuracy to comprehensively evaluate the network's performance. Notably, the +SoftMax classifier demonstrated the highest accuracy among the categorizers, +achieving 99.52% accuracy on test data. The presented research contributes to +the growing field of deep learning in medical image analysis. The combination +of CNNs and MRI data offers a promising tool for accurately detecting brain +tumors, with potential implications for early diagnosis and improved patient +care. + +
+
+
+
+
+ + ♻ ☆ Segment Any Building + + +
+ The task of identifying and segmenting buildings within remote sensing +imagery has perennially stood at the forefront of scholarly investigations. +This manuscript accentuates the potency of harnessing diversified datasets in +tandem with cutting-edge representation learning paradigms for building +segmentation in such images. Through the strategic amalgamation of disparate +datasets, we have not only expanded the informational horizon accessible for +model training but also manifested unparalleled performance metrics across +multiple datasets. Our avant-garde joint training regimen underscores the merit +of our approach, bearing significant implications in pivotal domains such as +urban infrastructural development, disaster mitigation strategies, and +ecological surveillance. Our methodology, predicated upon the fusion of +datasets and gleaning insights from pre-trained models, carves a new benchmark +in the annals of building segmentation endeavors. The outcomes of this research +both fortify the foundations for ensuing scholarly pursuits and presage a +horizon replete with innovative applications in the discipline of building +segmentation. + +
+
+ comment: CGI 2023 +
+
+
+
+
+ + ♻ ☆ SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from + Diffusion Models NeurIPS 2023 + + +
+ A potent class of generative models known as Diffusion Probabilistic Models +(DPMs) has become prominent. A forward diffusion process gradually adds noise +to data, while a model learns to gradually denoise. Sampling from pre-trained +DPMs is obtained by solving differential equations (DE) defined by the learnt +model, a process which has been shown to be prohibitively slow. Numerous efforts at +speeding up this process have consisted of crafting powerful ODE solvers. +Despite being quick, such solvers do not usually reach the optimal quality +achieved by available slow SDE solvers. Our goal is to propose SDE solvers that +reach optimal quality without requiring several hundreds or thousands of +function evaluations (NFEs). We propose Stochastic Explicit Exponential +Derivative-free Solvers (SEEDS), improving and generalizing Exponential +Integrator approaches to the stochastic case on several frameworks. After +carefully analyzing the formulation of exact solutions of diffusion SDEs, we +craft SEEDS to analytically compute the linear part of such solutions. Inspired +by the Exponential Time-Differencing method, SEEDS use a novel treatment of the +stochastic components of solutions, enabling the analytical computation of +their variance, and contain high-order terms that allow them to reach optimal quality +sampling $\sim3$-$5\times$ faster than previous SDE methods. We validate our +approach on several image generation benchmarks, showing that SEEDS outperform +or are competitive with previous SDE solvers. Contrary to the latter, SEEDS are +derivative and training free, and we fully prove strong convergence guarantees +for them. + +
+
+ comment: 60 pages. Camera-Ready version for the 37th Conference on Neural + Information Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Integrating View Conditions for Image Synthesis + + +
+ In the field of image processing, applying intricate semantic modifications +within existing images remains an enduring challenge. This paper introduces a +pioneering framework that integrates viewpoint information to enhance the +control of image editing tasks. By surveying existing object editing +methodologies, we distill three essential criteria, consistency, +controllability, and harmony, that should be met for an image editing method. +In contrast to previous approaches, our method takes the lead in satisfying all +three requirements for addressing the challenge of image synthesis. Through +comprehensive experiments, encompassing both quantitative assessments and +qualitative comparisons with contemporary state-of-the-art methods, we present +compelling evidence of our framework's superior performance across multiple +dimensions. This work establishes a promising avenue for advancing image +synthesis techniques and empowering precise object modifications while +preserving the visual coherence of the entire composition. + +
+
+
+
+
+ + ♻ ☆ TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for + Gaze Estimation + + +
+ Intelligent edge vision tasks encounter the critical challenge of ensuring +power and latency efficiency due to the typically heavy computational load they +impose on edge platforms. This work leverages one of the first "AI in sensor" +vision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power +end-to-end edge vision applications. We evaluate the IMX500 and compare it to +other edge platforms, such as the Google Coral Dev Micro and Sony Spresense, by +exploring gaze estimation as a case study. We propose TinyTracker, a highly +efficient, fully quantized model for 2D gaze estimation designed to maximize +the performance of the edge vision systems considered in this study. +TinyTracker achieves a 41x size reduction (600Kb) compared to iTracker [1] +without significant loss in gaze estimation accuracy (maximum of 0.16 cm when +fully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor +results in end-to-end latency of around 19ms. The camera takes around 17.9ms to +read, process and transmit the pixels to the accelerator. The inference time of +the network is 0.86ms with an additional 0.24 ms for retrieving the results +from the sensor. The overall energy consumption of the end-to-end system is 4.9 +mJ, including 0.06 mJ for inference. The end-to-end study shows that IMX500 is +1.7x faster than CoralMicro (19ms vs 34.4ms) and 7x more power efficient (4.9mJ +vs 34.2mJ). + +
+
+
+
+
+ + ♻ ☆ Spontaneous Symmetry Breaking in Generative Diffusion Models NeurIPS 2023 + + +
+ Generative diffusion models have recently emerged as a leading approach for +generating high-dimensional data. In this paper, we show that the dynamics of +these models exhibit a spontaneous symmetry breaking that divides the +generative dynamics into two distinct phases: 1) A linear steady-state dynamics +around a central fixed-point and 2) an attractor dynamics directed towards the +data manifold. These two "phases" are separated by the change in stability of +the central fixed-point, with the resulting window of instability being +responsible for the diversity of the generated samples. Using both theoretical +and empirical evidence, we show that an accurate simulation of the early +dynamics does not significantly contribute to the final generation, since early +fluctuations are reverted to the central fixed point. To leverage this insight, +we propose a Gaussian late initialization scheme, which significantly improves +model performance, achieving up to 3x FID improvements on fast samplers, while +also increasing sample diversity (e.g., racial composition of generated CelebA +images). Our work offers a new way to understand the generative dynamics of +diffusion models that has the potential to bring about higher performance and +less biased fast-samplers. + +
+
+ comment: As published at NeurIPS 2023, and the size of the file has been + optimized for fast downloading +
+
+
+
+
+ + ♻ ☆ Ponder: Point Cloud Pre-training via Neural Rendering + + +
+ We propose a novel approach to self-supervised learning of point cloud +representations by differentiable neural rendering. Motivated by the fact that +informative point cloud features should be able to encode rich geometry and +appearance cues and render realistic images, we train a point-cloud encoder +within a devised point-based neural renderer by comparing the rendered images +with real images on massive RGB-D data. The learned point-cloud encoder can be +easily integrated into various downstream tasks, including not only high-level +tasks like 3D detection and segmentation, but low-level tasks like 3D +reconstruction and image synthesis. Extensive experiments on various tasks +demonstrate the superiority of our approach compared to existing pre-training +methods. + +
+
+ comment: Project page: https://dihuang.me/ponder/ +
+
+
+
+
+ + ♻ ☆ DiffSketcher: Text Guided Vector Sketch Synthesis through Latent + Diffusion Models NIPS 2023 + + +
+ Even though trained mainly on images, we discover that pretrained diffusion +models show impressive power in guiding sketch synthesis. In this paper, we +present DiffSketcher, an innovative algorithm that creates \textit{vectorized} +free-hand sketches using natural language input. DiffSketcher is developed +based on a pre-trained text-to-image diffusion model. It performs the task by +directly optimizing a set of B\'ezier curves with an extended version of the +score distillation sampling (SDS) loss, which allows us to use a raster-level +diffusion model as a prior for optimizing a parametric vectorized sketch +generator. Furthermore, we explore attention maps embedded in the diffusion +model for effective stroke initialization to speed up the generation process. +The generated sketches demonstrate multiple levels of abstraction while +maintaining recognizability, underlying structure, and essential visual details +of the subject drawn. Our experiments show that DiffSketcher achieves greater +quality than prior work. The code and demo of DiffSketcher can be found at +https://ximinng.github.io/DiffSketcher-project/. + +
+
+ comment: Accepted by NIPS 2023. Project page: + https://ximinng.github.io/DiffSketcher-project/ +
+
+
+
+
+ + ♻ ☆ AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud + Dataset NeurIPS 2023 + + +
+ It is a long-term vision for the Autonomous Driving (AD) community that +perception models can learn from a large-scale point cloud dataset, to obtain +unified representations that can achieve promising results on different tasks +or benchmarks. Previous works mainly focus on the self-supervised pre-training +pipeline, meaning that they perform the pre-training and fine-tuning on the +same benchmark, which makes it difficult to attain performance scalability and +cross-dataset application for the pre-training checkpoint. In this paper, for +the first time, we are committed to building a large-scale pre-training +point-cloud dataset with diverse data distribution, and meanwhile learning +generalizable representations from such a diverse pre-training dataset. We +formulate the point-cloud pre-training task as a semi-supervised problem, which +leverages the few-shot labeled and massive unlabeled point-cloud data to +generate the unified backbone representations that can be directly applied to +many baseline models and benchmarks, decoupling the AD-related pre-training +process and downstream fine-tuning task. During the period of backbone +pre-training, by enhancing the scene- and instance-level distribution diversity +and exploiting the backbone's ability to learn from unknown instances, we +achieve significant performance gains on a series of downstream perception +benchmarks including Waymo, nuScenes, and KITTI, under different baseline +models like PV-RCNN++, SECOND, CenterPoint. + +
+
+ comment: Accepted by NeurIPS 2023. Project page: + https://jiakangyuan.github.io/AD-PT.github.io/ +
+
+
+
+
+ + ♻ ☆ StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual + Representation Learners + + +
+ We investigate the potential of learning visual representations using +synthetic images generated by text-to-image models. This is a natural question +in the light of the excellent performance of such models in generating +high-quality images. Specifically, we consider Stable Diffusion, one of the +leading open source text-to-image models. We show that (1) when the generative +model is configured with a proper classifier-free guidance scale, training +self-supervised methods on synthetic images can match or beat the real image +counterpart; (2) by treating the multiple images generated from the same text +prompt as positives for each other, we develop a multi-positive contrastive +learning method, which we call StableRep. With solely synthetic images, the +representations learned by StableRep surpass the performance of representations +learned by SimCLR and CLIP using the same set of text prompts and corresponding +real images, on large scale datasets. When we further add language supervision, +StableRep trained with 20M synthetic images achieves better accuracy than CLIP +trained with 50M real images. + +
+
+ comment: code is available at: + https://github.com/google-research/syn-rep-learn +
+
+
+
+
+ + ♻ ☆ Driving through the Concept Gridlock: Unraveling Explainability + Bottlenecks in Automated Driving + + +
+ Concept bottleneck models have been successfully used for explainable machine +learning by encoding information within the model with a set of human-defined +concepts. In the context of human-assisted or autonomous driving, +explainability models can help user acceptance and understanding of decisions +made by the autonomous vehicle, which can be used to rationalize and explain +driver or vehicle behavior. We propose a new approach using concept bottlenecks +as visual features for control command predictions and explanations of user and +vehicle behavior. We learn a human-understandable concept layer that we use to +explain sequential driving scenes while learning vehicle control commands. This +approach can then be used to determine whether a change in a preferred gap or +steering commands from a human (or autonomous vehicle) is led by an external +stimulus or change in preferences. We achieve competitive performance to latent +visual features while gaining interpretability within our model setup. + +
+
+
+
+
+ + ♻ ☆ Language-based Action Concept Spaces Improve Video Self-Supervised + Learning NeurIPS 2023 + + +
+ Recent contrastive language image pre-training has led to learning highly +transferable and robust image representations. However, adapting these models +to video domains with minimal supervision remains an open problem. We explore a +simple step in that direction, using language tied self-supervised learning to +adapt an image CLIP model to the video domain. A backbone modified for temporal +modeling is trained under self-distillation settings with train objectives +operating in an action concept space. Feature vectors of various action +concepts extracted from a language encoder using relevant textual prompts +construct this space. We introduce two train objectives, concept distillation +and concept alignment, that retain generality of original representations while +enforcing relations between actions and their attributes. Our approach improves +zero-shot and linear probing performance on three action recognition +benchmarks. + +
+
+ comment: Presented at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ PointDC:Unsupervised Semantic Segmentation of 3D Point Clouds via + Cross-modal Distillation and Super-Voxel Clustering ICCV + + +
+ Semantic segmentation of point clouds usually requires exhausting efforts of +human annotations, hence it attracts wide attention to the challenging topic of +learning from unlabeled or weaker forms of annotations. In this paper, we make +the first attempt at fully unsupervised semantic segmentation of point clouds, +which aims to delineate semantically meaningful objects without any form of +annotations. Previous unsupervised pipelines for 2D images fail in this +task on point clouds, due to: 1) Clustering Ambiguity caused by limited +magnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity +caused by the irregular sparsity of point clouds. Therefore, we propose a novel +framework, PointDC, which is comprised of two steps that handle the +aforementioned problems respectively: Cross-Modal Distillation (CMD) and +Super-Voxel Clustering (SVC). In the first stage of CMD, multi-view visual +features are back-projected to the 3D space and aggregated to a unified point +feature to distill the training of the point representation. In the second +stage of SVC, the point features are aggregated to super-voxels and then fed to +the iterative clustering process for excavating semantic classes. PointDC +yields a significant improvement over the prior state-of-the-art unsupervised +methods, on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic +segmentation benchmarks. + +
+
+ comment: Accepted by International Conference on Computer Vision (ICCV) 2023 +
+
+
+
+
+ + ♻ ☆ Training Methods of Multi-label Prediction Classifiers for Hyperspectral + Remote Sensing Images + + +
+ With their combined spectral depth and geometric resolution, hyperspectral +remote sensing images embed a wealth of complex, non-linear information that +challenges traditional computer vision techniques. Yet, deep learning methods +known for their representation learning capabilities prove more suitable for +handling such complexities. Unlike applications that focus on single-label, +pixel-level classification methods for hyperspectral remote sensing images, we +propose a multi-label, patch-level classification method based on a +two-component deep-learning network. We use patches of reduced spatial +dimension and a complete spectral depth extracted from the remote sensing +images. Additionally, we investigate three training schemes for our network: +Iterative, Joint, and Cascade. Experiments suggest that the Joint scheme is the +best-performing scheme; however, its application requires an expensive search +for the best weight combination of the loss constituents. The Iterative scheme +enables the sharing of features between the two parts of the network at the +early stages of training. It performs better on complex data with multi-labels. +Further experiments showed that methods designed with different architectures +performed well when trained on patches extracted and labeled according to our +sampling method. + +
+
+ comment: 1) Added references. 2) Updated the methodology figure and added new + figures to visualise the different training schemes. 3) Corrected typos. + 4) Revised the introduction; no change in results or discussion +
+
+
+
+
+ + ♻ ☆ Learning Environment-Aware Affordance for 3D Articulated Object + Manipulation under Occlusions NeurIPS + 2023 + + +
+ Perceiving and manipulating 3D articulated objects in diverse environments is +essential for home-assistant robots. Recent studies have shown that point-level +affordance provides actionable priors for downstream manipulation tasks. +However, existing works primarily focus on single-object scenarios with +homogeneous agents, overlooking the realistic constraints imposed by the +environment and the agent's morphology, e.g., occlusions and physical +limitations. In this paper, we propose an environment-aware affordance +framework that incorporates both object-level actionable priors and environment +constraints. Unlike object-centric affordance approaches, learning +environment-aware affordance faces the challenge of combinatorial explosion due +to the complexity of various occlusions, characterized by their quantities, +geometries, positions and poses. To address this and enhance data efficiency, +we introduce a novel contrastive affordance learning framework capable of +training on scenes containing a single occluder and generalizing to scenes with +complex occluder combinations. Experiments demonstrate the effectiveness of our +proposed approach in learning affordance considering environment constraints. +Project page at https://chengkaiacademycity.github.io/EnvAwareAfford/ + +
+
+ comment: In 37th Conference on Neural Information Processing Systems (NeurIPS + 2023). Website at https://chengkaiacademycity.github.io/EnvAwareAfford/ +
+
+
+
+
+ + ♻ ☆ ConvBKI: Real-Time Probabilistic Semantic Mapping Network with + Quantifiable Uncertainty + + +
+ In this paper, we develop a modular neural network for real-time semantic +mapping in uncertain environments, which explicitly updates per-voxel +probabilistic distributions within a neural network layer. Our approach +combines the reliability of classical probabilistic algorithms with the +performance and efficiency of modern neural networks. Although robotic +perception is often divided between modern differentiable methods and classical +explicit methods, a union of both is necessary for real-time and trustworthy +performance. We introduce a novel Convolutional Bayesian Kernel Inference +(ConvBKI) layer which incorporates semantic segmentation predictions online +into a 3D map through a depthwise convolution layer by leveraging conjugate +priors. We compare ConvBKI against state-of-the-art deep learning approaches +and probabilistic algorithms for mapping to evaluate reliability and +performance. We also create a Robot Operating System (ROS) package of ConvBKI +and test it on real-world perceptually challenging off-road driving data. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2209.10663 +
+
+
+
+
+ + ♻ ☆ Large Content And Behavior Models To Understand, Simulate, And Optimize + Content And Behavior + + +
+ Shannon, in his seminal paper introducing information theory, divided +communication into three levels: technical, semantic, and effectiveness. While +the technical level is concerned with accurate reconstruction of transmitted +symbols, the semantic and effectiveness levels deal with the inferred meaning +and its effect on the receiver. Thanks to telecommunications, the first level +problem has produced great advances like the internet. Large Language Models +(LLMs) make some progress towards the second goal, but the third level still +remains largely untouched. The third problem deals with predicting and +optimizing communication for desired receiver behavior. LLMs, while showing +wide generalization capabilities across a wide range of tasks, are unable to +solve for this. One reason for the underperformance could be a lack of +"behavior tokens" in LLMs' training corpora. Behavior tokens define receiver +behavior over a communication, such as shares, likes, clicks, purchases, +retweets, etc. While preprocessing data for LLM training, behavior tokens are +often removed from the corpora as noise. Therefore, in this paper, we make some +initial progress towards reintroducing behavior tokens in LLM training. The +trained models, other than showing similar performance to LLMs on content +understanding tasks, show generalization capabilities on behavior simulation, +content simulation, behavior understanding, and behavior domain adaptation. +Using a wide range of tasks on two corpora, we show results on all these +capabilities. We call these models Large Content and Behavior Models (LCBMs). +Further, to spur more research on LCBMs, we release our new Content Behavior +Corpus (CBC), a repository containing communicator, message, and corresponding +receiver behavior. + +
+
+
+
+
+ + ♻ ☆ Networks are Slacking Off: Understanding Generalization Problem in Image + Deraining NeurIPS 2023 + + +
+ Deep deraining networks consistently encounter substantial generalization +issues when deployed in real-world applications, although they are successful +in laboratory benchmarks. A prevailing perspective in deep learning encourages +using highly complex data for training, with the expectation that richer image +background content will facilitate overcoming the generalization problem. +However, through comprehensive and systematic experimentation, we discover that +this strategy does not enhance the generalization capability of these networks. +On the contrary, it exacerbates the tendency of networks to overfit specific +degradations. Our experiments reveal that better generalization in a deraining +network can be achieved by simplifying the complexity of the training +background images. This is because the networks are "slacking off" +during training, that is, learning the least complex elements in the image +background and degradation to minimize training loss. When the background +images are less complex than the rain streaks, the network will prioritize the +background reconstruction, thereby suppressing overfitting the rain patterns +and leading to improved generalization performance. Our research offers a +valuable perspective and methodology for better understanding the +generalization problem in low-level vision tasks and displays promising +potential for practical application. + +
+
+ comment: This article has been accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ A weighted-variance variational autoencoder model for speech enhancement + + +
+ We address speech enhancement based on variational autoencoders, which +involves learning a speech prior distribution in the time-frequency (TF) +domain. A zero-mean complex-valued Gaussian distribution is usually assumed for +the generative model, where the speech information is encoded in the variance +as a function of a latent variable. In contrast to this commonly used approach, +we propose a weighted variance generative model, where the contribution of each +spectrogram time-frame in parameter learning is weighted. We impose a Gamma +prior distribution on the weights, which would effectively lead to a Student's +t-distribution instead of Gaussian for speech generative modeling. We develop +efficient training and speech enhancement algorithms based on the proposed +generative model. Our experimental results on spectrogram auto-encoding and +speech enhancement demonstrate the effectiveness and robustness of the proposed +approach compared to the standard unweighted variance model. + +
+
+
+
+
+ + ♻ ☆ Tuning Multi-mode Token-level Prompt Alignment across Modalities NeurIPS2023 + + +
+ Advancements in prompt tuning of vision-language models have underscored +their potential in enhancing open-world visual concept comprehension. However, +prior works primarily focus on single-mode (only one prompt for each +modality) and holistic level (image or sentence) semantic alignment, which +fails to capture the sample diversity, leading to sub-optimal prompt discovery. +To address the limitation, we propose a multi-mode token-level tuning framework +that leverages optimal transport to learn and align a set of prompt +tokens across modalities. Specifically, we rely on two essential factors: 1) +multi-mode prompts discovery, which guarantees diverse semantic +representations, and 2) token-level alignment, which helps explore fine-grained +similarity. Consequently, the similarity can be calculated as a hierarchical +transportation problem between the modality-specific sets. Extensive +experiments on popular image recognition benchmarks show the superior +generalization and few-shot abilities of our approach. The qualitative analysis +demonstrates that the learned prompt tokens have the ability to capture diverse +visual concepts. + +
+
+ comment: In Proceedings of NeurIPS2023 +
+
+
+
+
+ + ♻ ☆ Parallel Spiking Neurons with High Efficiency and Ability to Learn + Long-term Dependencies NeurIPS 2023 + + +
+ Vanilla spiking neurons in Spiking Neural Networks (SNNs) use +charge-fire-reset neuronal dynamics, which can only be simulated serially and +can hardly learn long-time dependencies. We find that when removing reset, the +neuronal dynamics can be reformulated in a non-iterative form and parallelized. +By rewriting neuronal dynamics without reset to a general formulation, we +propose the Parallel Spiking Neuron (PSN), which generates hidden states that +are independent of their predecessors, resulting in parallelizable neuronal +dynamics and extremely high simulation speed. The weights of inputs in the PSN +are fully connected, which maximizes the utilization of temporal information. +To avoid the use of future inputs for step-by-step inference, the weights of +the PSN can be masked, resulting in the masked PSN. By sharing weights across +time-steps based on the masked PSN, the sliding PSN is proposed to handle +sequences of varying lengths. We evaluate the PSN family on simulation speed +and temporal/static data classification, and the results show the overwhelming +advantage of the PSN family in efficiency and accuracy. To the best of our +knowledge, this is the first study about parallelizing spiking neurons and can +be a cornerstone for the spiking deep learning research. Our codes are +available at https://github.com/fangwei123456/Parallel-Spiking-Neuron. + +
+
+ comment: Accepted in NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Robotic Fabric Flattening with Wrinkle Direction Detection + + +
+ Deformable Object Manipulation (DOM) is an important field of research as it +contributes to practical tasks such as automatic cloth handling, cable routing, +surgical operation, etc. Perception is considered one of the major challenges +in DOM due to the complex dynamics and high degree of freedom of deformable +objects. In this paper, we develop a novel image-processing algorithm based on +Gabor filters to extract useful features from cloth, and based on this, devise +a strategy for cloth flattening tasks. We also evaluate the overall framework +experimentally and compare it with three human operators. The results show that +our algorithm can determine the direction of wrinkles on the cloth accurately +in simulation as well as in real robot experiments. Furthermore, our +dewrinkling strategy compares favorably to baseline methods. The experiment +video is available on +https://sites.google.com/view/robotic-fabric-flattening/home + +
+
+ comment: Accepted by the 18th International Symposium on Experimental Robotics + (ISER 2023) +
+
+
+
+
+ + ♻ ☆ A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In + Zero Shot EMNLP-23 + + +
+ Multimedia content, such as advertisements and story videos, exhibit a rich +blend of creativity and multiple modalities. They incorporate elements like +text, visuals, audio, and storytelling techniques, employing devices like +emotions, symbolism, and slogans to convey meaning. There is a dearth of large +annotated training datasets in the multimedia domain hindering the development +of supervised learning models with satisfactory performance for real-world +applications. On the other hand, the rise of large language models (LLMs) has +witnessed remarkable zero-shot performance in various natural language +processing (NLP) tasks, such as emotion classification, question-answering, and +topic classification. To leverage such advanced techniques to bridge this +performance gap in multimedia understanding, we propose verbalizing long videos +to generate their descriptions in natural language, followed by performing +video-understanding tasks on the generated story as opposed to the original +video. Through extensive experiments on fifteen video-understanding tasks, we +demonstrate that our method, despite being zero-shot, achieves significantly +better results than supervised baselines for video understanding. Furthermore, +to alleviate a lack of story understanding benchmarks, we publicly release the +first dataset on a crucial task in computational social science on persuasion +strategy identification. + +
+
+ comment: Accepted to EMNLP-23 TL;DR: Video understanding lags far behind NLP; + LLMs excel in zero-shot. Our approach utilizes LLMs to verbalize videos, + creating stories for zero-shot video understanding. This yields + state-of-the-art results across five datasets, covering fifteen tasks +
+
+
+
+
+ + ♻ ☆ Time-Conditioned Generative Modeling of Object-Centric Representations + for Video Decomposition and Prediction + + +
+ When perceiving the world from multiple viewpoints, humans have the ability +to reason about the complete objects in a compositional manner even when an +object is completely occluded from certain viewpoints. Meanwhile, humans are +able to imagine novel views after observing multiple viewpoints. Recent +remarkable advances in multi-view object-centric learning still leave some +unresolved problems: 1) The shapes of partially or completely occluded objects +cannot be well reconstructed. 2) The novel viewpoint prediction depends on +expensive viewpoint annotations rather than implicit rules in view +representations. In this paper, we introduce a time-conditioned generative +model for videos. To reconstruct the complete shape of an object accurately, we +enhance the disentanglement between the latent representations of objects and +views, where the latent representations of time-conditioned views are jointly +inferred with a Transformer and then are input to a sequential extension of +Slot Attention to learn object-centric representations. In addition, Gaussian +processes are employed as priors of view latent variables for video generation +and novel-view prediction without viewpoint annotations. Experiments on +multiple datasets demonstrate that the proposed model can make object-centric +video decomposition, reconstruct the complete shapes of occluded objects, and +make novel-view predictions. + +
+
+
+
+
+ + ♻ ☆ COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action + Spotting using Transformers + + +
+ We present COMEDIAN, a novel pipeline to initialize spatiotemporal +transformers for action spotting, which involves self-supervised learning and +knowledge distillation. Action spotting is a timestamp-level temporal action +detection task. Our pipeline consists of three steps, with two initialization +stages. First, we perform self-supervised initialization of a spatial +transformer using short videos as input. Additionally, we initialize a temporal +transformer that enhances the spatial transformer's outputs with global context +through knowledge distillation from a pre-computed feature bank aligned with +each short video segment. In the final step, we fine-tune the transformers to +the action spotting task. The experiments, conducted on the SoccerNet-v2 +dataset, demonstrate state-of-the-art performance and validate the +effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several +advantages of our pretraining pipeline, including improved performance and +faster convergence compared to non-pretrained models. + +
+
+ comment: Source code is available here: + https://github.com/juliendenize/eztorch +
+
+
+
+
+ + ♻ ☆ Changes to Captions: An Attentive Network for Remote Sensing Change + Captioning + + +
+ In recent years, advanced research has focused on the direct learning and +analysis of remote sensing images using natural language processing (NLP) +techniques. The ability to accurately describe changes occurring in +multi-temporal remote sensing images is becoming increasingly important for +geospatial understanding and land planning. Unlike natural image change +captioning tasks, remote sensing change captioning aims to capture the most +significant changes, irrespective of various influential factors such as +illumination, seasonal effects, and complex land covers. In this study, we +highlight the significance of accurately describing changes in remote sensing +images and present a comparison of the change captioning task for natural and +synthetic images and remote sensing images. To address the challenge of +generating accurate captions, we propose an attentive changes-to-captions +network, called Chg2Cap for short, for bi-temporal remote sensing images. The +network comprises three main components: 1) a Siamese CNN-based feature +extractor to collect high-level representations for each image pair; 2) an +attentive decoder that includes a hierarchical self-attention block to locate +change-related features and a residual block to generate the image embedding; +and 3) a transformer-based caption generator to decode the relationship between +the image embedding and the word embedding into a description. The proposed +Chg2Cap network is evaluated on two representative remote sensing datasets, and +a comprehensive experimental analysis is provided. The code and pre-trained +models will be available online at https://github.com/ShizhenChang/Chg2Cap. + +
+
+
+
+
+ + ♻ ☆ A Robust Morphological Approach for Semantic Segmentation of Very High + Resolution Images + + +
+ State-of-the-art methods for semantic segmentation of images involve +computationally intensive neural network architectures. Most of these methods +are not adaptable to high-resolution image segmentation due to memory and other +computational issues. Typical approaches in literature involve design of neural +network architectures that can fuse global information from low-resolution +images and local information from the high-resolution counterparts. However, +architectures designed for processing high resolution images are unnecessarily +complex and involve a lot of hyper parameters that can be difficult to tune. +Also, most of these architectures require ground truth annotations of the high +resolution images to train, which can be hard to obtain. In this article, we +develop a robust pipeline based on mathematical morphological (MM) operators +that can seamlessly extend any existing semantic segmentation algorithm to high +resolution images. Our method does not require the ground truth annotations of +the high resolution images. It is based on efficiently utilizing information +from the low-resolution counterparts, and gradient information on the +high-resolution images. We obtain high quality seeds from the inferred labels +on low-resolution images using traditional morphological operators and +propagate seed labels using a random walker to refine the semantic labels at +the boundaries. We show that the semantic segmentation results obtained by our +method beat the existing state-of-the-art algorithms on high-resolution images. +We empirically prove the robustness of our approach to the hyper parameters +used in our pipeline. Further, we characterize some necessary conditions under +which our pipeline is applicable and provide an in-depth analysis of the +proposed approach. + +
+
+ comment: Under review at Computer Vision and Image Understanding +
+
+
+
+
+ + ♻ ☆ Can Language Models Laugh at YouTube Short-form Videos? EMNLP 2023 + + +
+ As short-form funny videos on social networks are gaining popularity, it +becomes demanding for AI models to understand them for better communication +with humans. Unfortunately, previous video humor datasets target specific +domains, such as speeches or sitcoms, and mostly focus on verbal cues. We +curate a user-generated dataset of 10K multimodal funny videos from YouTube, +called ExFunTube. Using a video filtering pipeline with GPT-3.5, we verify both +verbal and visual elements contributing to humor. After filtering, we annotate +each video with timestamps and text explanations for funny moments. Our +ExFunTube is unique over existing datasets in that our videos cover a wide +range of domains with various types of humor that necessitate a multimodal +understanding of the content. Also, we develop a zero-shot video-to-text +prompting to maximize video humor understanding of large language models +(LLMs). With three different evaluation methods using automatic scores, +rationale quality experiments, and human evaluations, we show that our +prompting significantly improves LLMs' ability for humor explanation. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Training-free Diffusion Model Adaptation for Variable-Sized + Text-to-Image Synthesis NeurIPS 2023 + + +
+ Diffusion models (DMs) have recently gained attention with state-of-the-art +performance in text-to-image synthesis. Abiding by the tradition in deep +learning, DMs are trained and evaluated on images with fixed sizes. +However, users demand images with specific sizes and various +aspect ratios. This paper focuses on adapting text-to-image diffusion models to +handle such variety while maintaining visual fidelity. First we observe that, +during the synthesis, lower resolution images suffer from incomplete object +portrayal, while higher resolution images exhibit repetitively disordered +presentation. Next, we establish a statistical relationship indicating that +attention entropy changes with token quantity, suggesting that models aggregate +spatial information in proportion to image resolution. The subsequent +interpretation of our observations is that objects are incompletely depicted +due to limited spatial information for low resolutions, while repetitively +disorganized presentation arises from redundant spatial information for high +resolutions. From this perspective, we propose a scaling factor to alleviate +the change of attention entropy and mitigate the defective pattern observed. +Extensive experimental results validate the efficacy of the proposed scaling +factor, enabling models to achieve better visual effects, image quality, and +text alignment. Notably, these improvements are achieved without additional +training or fine-tuning techniques. + +
+
+ comment: Accepted by NeurIPS 2023. 23 pages, 13 figures +
+
+
+
+
+ + ♻ ☆ Estimation of control area in badminton doubles with pose information + from top and back view drone videos + + +
+ The application of visual tracking to the performance analysis of sports +players in dynamic competitions is vital for effective coaching. In doubles +matches, coordinated positioning is crucial for maintaining control of the +court and minimizing opponents' scoring opportunities. The analysis of such +teamwork plays a vital role in understanding the dynamics of the game. However, +previous studies have primarily focused on analyzing and assessing singles +players without considering occlusion in broadcast videos. These studies have +relied on discrete representations, which involve the analysis and +representation of specific actions (e.g., strokes) or events that occur during +the game while overlooking the meaningful spatial distribution. In this work, +we present the first annotated drone dataset from top and back views in +badminton doubles and propose a framework to estimate the control area +probability map, which can be used to evaluate teamwork performance. We present +an efficient framework of deep neural networks that enables the calculation of +full probability surfaces. This framework utilizes the embedding of a Gaussian +mixture map of players' positions and employs graph convolution on their poses. +In the experiment, we verify our approach by comparing various baselines and +discovering the correlations between the score and control area. Additionally, +we propose a practical application for assessing optimal positioning to provide +instructions during a game. Our approach offers both visual and quantitative +evaluations of players' movements, thereby providing valuable insights into +doubles teamwork. The dataset and related project code is available at +https://github.com/Ning-D/Drone_BD_ControlArea + +
+
+ comment: 15 pages, 10 figures, to appear in Multimedia Tools and Applications +
+
+
+
+
+ + ♻ ☆ RoCNet: 3D Robust Registration of Point-Clouds using Deep Learning + + +
+ This paper introduces a new method for 3D point cloud registration based on +deep learning. The architecture is composed of three distinct blocks: (i) an +encoder composed of a convolutional graph-based descriptor that encodes the +immediate neighbourhood of each point and an attention mechanism that encodes +the variations of the surface normals. Such descriptors are refined by +highlighting attention between the points of the same set and then between the +points of the two sets. (ii) a matching process that estimates a matrix of +correspondences using the Sinkhorn algorithm. (iii) Finally, the rigid +transformation between the two point clouds is calculated by RANSAC using the +Kc best scores from the correspondence matrix. We conduct experiments on the +ModelNet40 dataset, and our proposed architecture shows very promising results, +outperforming state-of-the-art methods in most of the simulated configurations, +including partial overlap and data augmentation with Gaussian noise. + +
+
+ comment: 8 pages +
+
+
+
+
+ + ♻ ☆ NeRFtrinsic Four: An End-To-End Trainable NeRF Jointly Optimizing + Diverse Intrinsic and Extrinsic Camera Parameters + + +
+ Novel view synthesis using neural radiance fields (NeRF) is the +state-of-the-art technique for generating high-quality images from novel +viewpoints. Existing methods require a priori knowledge about extrinsic and +intrinsic camera parameters. This limits their applicability to synthetic +scenes, or real-world scenarios with the necessity of a preprocessing step. +Current research on the joint optimization of camera parameters and NeRF +focuses on refining noisy extrinsic camera parameters and often relies on the +preprocessing of intrinsic camera parameters. Further approaches are limited to +cover only one single camera intrinsic. To address these limitations, we +propose a novel end-to-end trainable approach called NeRFtrinsic Four. We +utilize Gaussian Fourier features to estimate extrinsic camera parameters and +dynamically predict varying intrinsic camera parameters through the supervision +of the projection error. Our approach outperforms existing joint optimization +methods on LLFF and BLEFF. In addition to these existing datasets, we introduce +a new dataset called iFF with varying intrinsic camera parameters. NeRFtrinsic +Four is a step forward in joint optimization NeRF-based view synthesis and +enables more realistic and flexible rendering in real-world scenarios with +varying camera parameters. + +
+
+
+
+
+ + ♻ ☆ Quality-Aware Network for Face Parsing CVPR 2021 + + +
+ This is a very short technical report, which introduces the solution of +Team BUPT-CASIA for the Short-video Face Parsing Track of The 3rd Person in Context +(PIC) Workshop and Challenge at CVPR 2021. + Face parsing has recently attracted increasing interest due to its numerous +application potentials. Generally speaking, it has a lot in common with human +parsing, such as task setting, data characteristics, number of categories and +so on. Therefore, this work applies a state-of-the-art human parsing method to +the face parsing task to explore the similarities and differences between them. Our +submission achieves a score of 86.84% and wins 2nd place in the challenge. + +
+
+ comment: 2nd place in Short-video Face Parsing Track of The 3rd Person in + Context (PIC) Workshop and Challenge at CVPR 2021 +
+
+
+
+
+ + ♻ ☆ DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model + Statistics NeurIPS 2023 + + +
+ Diffusion probabilistic models (DPMs) have exhibited excellent performance +for high-fidelity image generation while suffering from inefficient sampling. +Recent works accelerate the sampling procedure by proposing fast ODE solvers +that leverage the specific ODE form of DPMs. However, they highly rely on +specific parameterization during inference (such as noise/data prediction), +which might not be the optimal choice. In this work, we propose a novel +formulation towards the optimal parameterization during sampling that minimizes +the first-order discretization error of the ODE solution. Based on such +formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs +by introducing several coefficients efficiently computed on the pretrained +model, which we call empirical model statistics. We further +incorporate multistep methods and a predictor-corrector framework, and propose +some techniques for improving sample quality at small numbers of function +evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 +achieves consistently better or comparable performance in both unconditional +and conditional sampling with both pixel-space and latent-space DPMs, +especially in 5 to 10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) +on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable +Diffusion, bringing a speed-up of 15% to 30% compared to previous +state-of-the-art training-free methods. Code is available at +https://github.com/thu-ml/DPM-Solver-v3. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ DDF-HO: Hand-Held Object Reconstruction via Conditional Directed + Distance Field NeurIPS 2023 + + +
+ Reconstructing hand-held objects from a single RGB image is an important and +challenging problem. Existing works utilizing Signed Distance Fields (SDF) +reveal limitations in comprehensively capturing the complex hand-object +interactions, since SDF is only reliable within the proximity of the target, +and hence, infeasible to simultaneously encode local hand and object cues. To +address this issue, we propose DDF-HO, a novel approach leveraging Directed +Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in +3D space, consisting of an origin and a direction, to corresponding DDF values, +including a binary visibility signal determining whether the ray intersects the +objects and a distance value measuring the distance from origin to target in +the given direction. We randomly sample multiple rays and collect local to +global geometric features for them by introducing a novel 2D ray-based feature +aggregation scheme and a 3D intersection-aware hand pose embedding, combining +2D-3D features to model hand-object interactions. Extensive experiments on +synthetic and real-world datasets demonstrate that DDF-HO consistently +outperforms all baseline methods by a large margin, especially under Chamfer +Distance, with about 80% leap forward. Codes are available at +https://github.com/ZhangCYG/DDFHO. + +
+
+ comment: Camera Ready for NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained + Models in Few-Shot Learning NeurIPS 2023 + + +
+ Due to the limited availability of data, existing few-shot learning methods +trained from scratch fail to achieve satisfactory performance. In contrast, +large-scale pre-trained models such as CLIP demonstrate remarkable few-shot and +zero-shot capabilities. To enhance the performance of pre-trained models for +downstream tasks, fine-tuning the model on downstream data is frequently +necessary. However, fine-tuning the pre-trained model leads to a decrease in +its generalizability in the presence of distribution shift, while the limited +number of samples in few-shot learning makes the model highly susceptible to +overfitting. Consequently, existing methods for fine-tuning few-shot learning +primarily focus on fine-tuning the model's classification head or introducing +additional structure. In this paper, we introduce a fine-tuning approach termed +Feature Discrimination Alignment (FD-Align). Our method aims to bolster the +model's generalizability by preserving the consistency of spurious features +across the fine-tuning process. Extensive experimental results validate the +efficacy of our approach for both ID and OOD tasks. Once fine-tuned, the model +can seamlessly integrate with existing methods, leading to performance +improvements. Our code can be found in https://github.com/skingorz/FD-Align. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields ICCV 2023 + + +
+ Neural Radiance Field (NeRF) has shown impressive performance in novel view +synthesis via implicit scene representation. However, it usually suffers from +poor scalability as requiring densely sampled images for each new scene. +Several studies have attempted to mitigate this problem by integrating +Multi-View Stereo (MVS) technique into NeRF while they still entail a +cumbersome fine-tuning process for new scenes. Notably, the rendering quality +will drop severely without this fine-tuning process and the errors mainly +appear around the high-frequency features. In the light of this observation, we +design WaveNeRF, which integrates wavelet frequency decomposition into MVS and +NeRF to achieve generalizable yet high-quality synthesis without any per-scene +optimization. To preserve high-frequency information when generating 3D feature +volumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating +the discrete wavelet transform into the classical cascade MVS, which +disentangles high-frequency information explicitly. With that, disentangled +frequency features can be injected into classic NeRF via a novel hybrid neural +renderer to yield faithful high-frequency details, and an intuitive +frequency-guided sampling strategy can be designed to suppress artifacts around +high-frequency regions. Extensive experiments over three widely studied +benchmarks show that WaveNeRF achieves superior generalizable radiance field +modeling when only given three images as input. + +
+
+ comment: Accepted to ICCV 2023. Project website: + https://mxuai.github.io/WaveNeRF/ +
+
+
+
+
+ + ♻ ☆ DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion + Prior + + +
+ We present DreamCraft3D, a hierarchical 3D content generation method that +produces high-fidelity and coherent 3D objects. We tackle the problem by +leveraging a 2D reference image to guide the stages of geometry sculpting and +texture boosting. A central focus of this work is to address the consistency +issue that existing works encounter. To sculpt geometries that render +coherently, we perform score distillation sampling via a view-dependent +diffusion model. This 3D prior, alongside several training strategies, +prioritizes the geometry consistency but compromises the texture fidelity. We +further propose Bootstrapped Score Distillation to specifically boost the +texture. We train a personalized diffusion model, Dreambooth, on the augmented +renderings of the scene, imbuing it with 3D knowledge of the scene being +optimized. The score distillation from this 3D-aware diffusion prior provides +view-consistent guidance for the scene. Notably, through an alternating +optimization of the diffusion prior and 3D scene representation, we achieve +mutually reinforcing improvements: the optimized 3D scene aids in training the +scene-specific diffusion model, which offers increasingly view-consistent +guidance for 3D optimization. The optimization is thus bootstrapped and leads +to substantial texture boosting. With tailored 3D priors throughout the +hierarchical generation, DreamCraft3D generates coherent 3D objects with +photorealistic renderings, advancing the state-of-the-art in 3D content +generation. Code available at https://github.com/deepseek-ai/DreamCraft3D. + +
+
+ comment: Project Page: https://mrtornado24.github.io/DreamCraft3D/ +
+
+
+
+
+ + ♻ ☆ CoFiI2P: Coarse-to-Fine Correspondences for Image-to-Point Cloud + Registration + + +
+ Image-to-point cloud (I2P) registration is a fundamental task in the field of +autonomous vehicles and transportation systems for cross-modality data fusion +and localization. Existing I2P registration methods estimate correspondences at +the point/pixel level, often overlooking global alignment. However, I2P +matching can easily converge to a local optimum when performed without +high-level guidance from global constraints. To address this issue, this paper +introduces CoFiI2P, a novel I2P registration network that extracts +correspondences in a coarse-to-fine manner to achieve the globally optimal +solution. First, the image and point cloud data are processed through a Siamese +encoder-decoder network for hierarchical feature extraction. Second, a +coarse-to-fine matching module is designed to leverage these features and +establish robust feature correspondences. Specifically, in the coarse matching +phase, a novel I2P transformer module is employed to capture both homogeneous +and heterogeneous global information from the image and point cloud data. This +enables the estimation of coarse super-point/super-pixel matching pairs with +discriminative descriptors. In the fine matching module, point/pixel pairs are +established with the guidance of super-point/super-pixel correspondences. +Finally, based on matching pairs, the transform matrix is estimated with the +EPnP-RANSAC algorithm. Extensive experiments conducted on the KITTI dataset +demonstrate that CoFiI2P achieves impressive results, with a relative rotation +error (RRE) of 1.14 degrees and a relative translation error (RTE) of 0.29 +meters. These results represent a significant improvement of 84% in RRE and +89% in RTE compared to the current state-of-the-art (SOTA) method. Qualitative +results are available at https://youtu.be/ovbedasXuZE. The source code will be +publicly released at https://github.com/kang-1-2-3/CoFiI2P. + +
+
+ comment: demo video: https://youtu.be/ovbedasXuZE; source code (will be + public): https://github.com/kang-1-2-3/CoFiI2P +
+
+
+
+
+ + ♻ ☆ GNeSF: Generalizable Neural Semantic Fields NeurIPS 2023 + + +
+ 3D scene segmentation based on neural implicit representation has emerged +recently with the advantage of training only on 2D supervision. However, +existing approaches still require expensive per-scene optimization that +prohibits generalization to novel scenes during inference. To circumvent this +problem, we introduce a generalizable 3D segmentation framework based on +implicit representation. Specifically, our framework takes in multi-view image +features and semantic maps as the inputs instead of only spatial information to +avoid overfitting to scene-specific geometric and semantic information. We +propose a novel soft voting mechanism to aggregate the 2D semantic information +from different views for each 3D point. In addition to the image features, view +difference information is also encoded in our framework to predict the voting +scores. Intuitively, this allows the semantic information from nearby views to +contribute more compared to distant ones. Furthermore, a visibility module is +also designed to detect and filter out detrimental information from occluded +views. Due to the generalizability of our proposed method, we can synthesize +semantic maps or conduct 3D semantic segmentation for novel scenes with solely +2D semantic supervision. Experimental results show that our approach achieves +comparable performance with scene-specific approaches. More importantly, our +approach can even outperform existing strong supervision-based approaches with +only 2D annotations. Our source code is available at: +https://github.com/HLinChen/GNeSF. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual + Grounding + + +
+ Multimodal transformer exhibits high capacity and flexibility to align image +and text for visual grounding. However, the existing encoder-only grounding +framework (e.g., TransVG) suffers from heavy computation due to the +self-attention operation with quadratic time complexity. To address this issue, +we present a new multimodal transformer architecture, coined as Dynamic +Multimodal DETR (Dynamic MDETR), by decoupling the whole grounding process into +encoding and decoding phases. The key observation is that there exists high +spatial redundancy in images. Thus, we devise a new dynamic multimodal +transformer decoder by exploiting this sparsity prior to speed up the visual +grounding process. Specifically, our dynamic decoder is composed of a 2D +adaptive sampling module and a text guided decoding module. The sampling module +aims to select these informative patches by predicting the offsets with respect +to a reference point, while the decoding module works for extracting the +grounded object information by performing cross attention between image +features and text features. These two modules are stacked alternately to +gradually bridge the modality gap and iteratively refine the reference point of +grounded object, eventually realizing the objective of visual grounding. +Extensive experiments on five benchmarks demonstrate that our proposed Dynamic +MDETR achieves competitive trade-offs between computation and accuracy. +Notably, using only 9% feature points in the decoder, we can reduce ~44% GFLOPs +of the multimodal transformer, but still get higher accuracy than the +encoder-only counterpart. In addition, to verify its generalization ability and +scale up our Dynamic MDETR, we build the first one-stage CLIP empowered visual +grounding framework, and achieve the state-of-the-art performance on these +benchmarks. + +
+
+ comment: Accepted by IEEE Transactions on Pattern Analysis and Machine + Intelligence (TPAMI) in October 2023 +
+
+
+
+
+ + ♻ ☆ Optimization-Inspired Learning with Architecture Augmentations and + Control Mechanisms for Low-Level Vision + + +
+ In recent years, there has been a growing interest in combining learnable +modules with numerical optimization to solve low-level vision tasks. However, +most existing approaches focus on designing specialized schemes to generate +image/feature propagation. There is a lack of unified consideration to +construct propagative modules, provide theoretical analysis tools, and design +effective learning mechanisms. To mitigate the above issues, this paper +proposes a unified optimization-inspired learning framework to aggregate +Generative, Discriminative, and Corrective (GDC for short) principles with +strong generalization for diverse optimization models. Specifically, by +introducing a general energy minimization model and formulating its descent +direction from different viewpoints (i.e., in a generative manner, based on the +discriminative metric and with optimality-based correction), we construct three +propagative modules to effectively solve the optimization models with flexible +combinations. We design two control mechanisms that provide the non-trivial +theoretical guarantees for both fully- and partially-defined optimization +formulations. Under the support of theoretical guarantees, we can introduce +diverse architecture augmentation strategies such as normalization and search +to ensure stable propagation with convergence and seamlessly integrate the +suitable modules into the propagation respectively. Extensive experiments +across varied low-level vision tasks validate the efficacy and adaptability of +GDC. The codes are available at +https://github.com/LiuZhu-CV/GDC-OptimizationLearning + +
+
+ comment: 14 pages. The codes are available at + https://github.com/LiuZhu-CV/GDC-OptimizationLearning +
+
+
+
+
+ + ♻ ☆ Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards + Enhancing Text Spotting Performance WACV 2024 + + +
+ The adaptation capability to a wide range of domains is crucial for scene +text spotting models when deployed to real-world conditions. However, existing +state-of-the-art (SOTA) approaches usually incorporate scene text detection and +recognition simply by pretraining on natural scene text datasets, which do not +directly exploit the intermediate feature representations between multiple +domains. Here, we investigate the problem of domain-adaptive scene text +spotting, i.e., training a model on multi-domain source data such that it can +directly adapt to target domains rather than being specialized for a specific +domain or scenario. Further, we investigate a transformer baseline called +Swin-TESTR to focus on solving scene-text spotting for both regular and +arbitrary-shaped scene text along with an exhaustive evaluation. The results +clearly demonstrate the potential of intermediate representations to achieve +significant performance on text spotting benchmarks across multiple domains +(e.g., language, synth-to-real, and documents), both in terms of accuracy and +efficiency. + +
+
+ comment: Accepted to the 2024 IEEE/CVF Winter Conference on Applications of + Computer Vision (WACV 2024) +
+
+
+
+
+ + ♻ ☆ Semi-supervised Deep Multi-view Stereo + + +
+ Significant progress has been witnessed in learning-based Multi-view Stereo +(MVS) under supervised and unsupervised settings. To combine their respective +merits in accuracy and completeness, while reducing the demand for expensive +labeled data, this paper explores the problem of learning-based MVS in a +semi-supervised setting where only a tiny part of the MVS data is attached with +dense depth ground truth. However, the huge variation of scenarios and +flexible settings in views may break the basic assumption in classic +semi-supervised learning, that unlabeled data and labeled data share the same +label space and data distribution, which we name the semi-supervised distribution-gap +ambiguity in the MVS problem. To handle these issues, we propose a novel +semi-supervised distribution-augmented MVS framework, namely SDA-MVS. For the +simple case where the basic assumption holds in MVS data, consistency +regularization encourages the model predictions to be consistent between +the original sample and a randomly augmented sample. For the more troublesome case +where the basic assumption is violated in MVS data, we propose a novel style +consistency loss to alleviate the negative effect caused by the distribution +gap. The visual style of the unlabeled sample is transferred to the labeled sample to +shrink the gap, and the model prediction of the generated sample is further +supervised with the label of the original labeled sample. The experimental results +in semi-supervised settings of multiple MVS datasets show the superior +performance of the proposed method. With the same settings in backbone network, +our proposed SDA-MVS outperforms its fully-supervised and unsupervised +baselines. + +
+
+ comment: This paper is accepted in ACMMM-2023. The code is released at: + https://github.com/ToughStoneX/Semi-MVS +
+
+
+
+
+ + ♻ ☆ MiniGPT-v2: large language model as a unified interface for + vision-language multi-task learning + + +
+ Large language models have shown their remarkable capabilities as a general +interface for various language-related applications. Motivated by this, we +target to build a unified interface for completing many vision-language tasks +including image description, visual question answering, and visual grounding, +among others. The challenge is to use a single model for performing diverse +vision-language tasks effectively with simple multi-modal instructions. Towards +this objective, we introduce MiniGPT-v2, a model that can be treated as a +unified interface for better handling various vision-language tasks. We propose +using unique identifiers for different tasks when training the model. These +identifiers enable our model to better distinguish each task instruction +effortlessly and also improve the model learning efficiency for each task. +After the three-stage training, the experimental results show that MiniGPT-v2 +achieves strong performance on many visual question-answering and visual +grounding benchmarks compared to other vision-language generalist models. Our +model and codes are available at https://minigpt-v2.github.io/ + +
+
+ comment: fix small typos +
+
+
+
+
+ + ♻ ☆ DualStreamFoveaNet: A Dual Stream Fusion Architecture with Anatomical + Awareness for Robust Fovea Localization + + +
+ Accurate fovea localization is essential for analyzing retinal diseases to +prevent irreversible vision loss. While current deep learning-based methods +outperform traditional ones, they still face challenges such as the lack of +local anatomical landmarks around the fovea, the inability to robustly handle +diseased retinal images, and the variations in image conditions. In this paper, +we propose a novel transformer-based architecture called DualStreamFoveaNet +(DSFN) for multi-cue fusion. This architecture explicitly incorporates +long-range connections and global features using retina and vessel +distributions for robust fovea localization. We introduce a spatial attention +mechanism in the dual-stream encoder to extract and fuse self-learned +anatomical information, focusing more on features distributed along blood +vessels and significantly reducing computational costs by decreasing token +numbers. Our extensive experiments show that the proposed architecture achieves +state-of-the-art performance on two public datasets and one large-scale private +dataset. Furthermore, we demonstrate that the DSFN is more robust on both +normal and diseased retina images and has better generalization capacity in +cross-dataset experiments. + +
+
+ comment: This paper is prepared for IEEE Transactions on Biomedical + Engineering +
+
+
+
+
+ + ♻ ☆ DeepIron: Predicting Unwarped Garment Texture from a Single Image + + +
+ Realistic reconstruction of 3D clothing from an image has wide applications, +such as avatar creation and virtual try-on. This paper presents a novel +framework that reconstructs the texture map for 3D garments from a single image +with pose. Assuming that 3D garments are modeled by stitching 2D garment sewing +patterns, our specific goal is to generate a texture image for the sewing +patterns. A key component of our framework, the Texture Unwarper, infers the +original texture image from the input clothing image, which exhibits warping +and occlusion of texture due to the user's body shape and pose. The Texture +Unwarper effectively transforms between the input and output images by mapping +the latent spaces of the two images. By inferring the unwarped original texture +of the input garment, our method helps reconstruct 3D garment models that can +show high-quality texture images realistically deformed for new poses. We +validate the effectiveness of our approach through a comparison with other +methods and ablation studies. + +
+
+
+
+
+ + ♻ ☆ Control3Diff: Learning Controllable 3D Diffusion Models from Single-view + Images 3DV24 + + +
+ Diffusion models have recently become the de-facto approach for generative +modeling in the 2D domain. However, extending diffusion models to 3D is +challenging due to the difficulties in acquiring 3D ground truth data for +training. On the other hand, 3D GANs that integrate implicit 3D representations +into GANs have shown remarkable 3D-aware generation when trained only on +single-view image datasets. However, 3D GANs do not provide straightforward +ways to precisely control image synthesis. To address these challenges, we +present Control3Diff, a 3D diffusion model that combines the strengths of +diffusion models and 3D GANs for versatile, controllable 3D-aware image +synthesis for single-view datasets. Control3Diff explicitly models the +underlying latent distribution (optionally conditioned on external inputs), +thus enabling direct control during the diffusion process. Moreover, our +approach is general and applicable to any type of controlling input, allowing +us to train it with the same diffusion objective without any auxiliary +supervision. We validate the efficacy of Control3Diff on standard image +generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various +conditioning inputs such as images, sketches, and text prompts. Please see the +project website (https://jiataogu.me/control3diff) for video comparisons. + +
+
+ comment: Accepted by 3DV24 +
+
+
+
+
+ + ♻ ☆ Mitigate Replication and Copying in Diffusion Models with Generalized + Caption and Dual Fusion Enhancement + + +
+ While diffusion models demonstrate a remarkable capability for generating +high-quality images, their tendency to 'replicate' training data raises privacy +concerns. Although recent research suggests that this replication may stem from +the insufficient generalization of training data captions and duplication of +training images, effective mitigation strategies remain elusive. To address +this gap, our paper first introduces a generality score that measures the +caption generality and employs a large language model (LLM) to generalize training +captions. Subsequently, we leverage generalized captions and propose a novel +dual fusion enhancement approach to mitigate the replication of diffusion +models. Our empirical results demonstrate that our proposed methods can +significantly reduce replication by 43.5% compared to the original diffusion +model while maintaining the diversity and quality of generations. + +
+
+
+
+
+ + ♻ ☆ Lithium Metal Battery Quality Control via Transformer-CNN Segmentation + + +
+ Lithium metal battery (LMB) has the potential to be the next-generation +battery system because of its high theoretical energy density. However, defects +known as dendrites are formed by heterogeneous lithium (Li) plating, which +hinders the development and utilization of LMBs. Non-destructive techniques to +observe the dendrite morphology often use X-ray computed tomography (XCT) to +provide cross-sectional views. To retrieve three-dimensional structures inside +a battery, image segmentation becomes essential to quantitatively analyze XCT +images. This work proposes a new semantic segmentation approach using a +transformer-based neural network called TransforCNN that is capable of +segmenting out dendrites from XCT data. In addition, we compare the performance +of the proposed TransforCNN with three other algorithms, such as U-Net, Y-Net, +and E-Net, consisting of an Ensemble Network model for XCT analysis. Our +results show the advantages of using TransforCNN when evaluating +over-segmentation metrics, such as mean Intersection over Union (mIoU) and mean +Dice Similarity Coefficient (mDSC) as well as through several qualitatively +comparative visualizations. + +
+
+ comment: 15 pages, 12 figures +
+
+
+
+
+ + ♻ ☆ DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning + in Language Models NeurIPS 2023 + + +
+ A long-standing goal of AI systems is to perform complex multimodal reasoning +like humans. Recently, large language models (LLMs) have made remarkable +strides in such multi-step reasoning on the language modality solely by +leveraging the chain of thought (CoT) to mimic human thinking. However, the +transfer of these advancements to multimodal contexts introduces heightened +challenges, including but not limited to the impractical need for +labor-intensive annotation and the limitations in terms of flexibility, +generalizability, and explainability. To evoke CoT reasoning in multimodality, +this work first conducts an in-depth analysis of these challenges posed by +multimodality and presents two key insights: "keeping critical thinking" and +"letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this +study proposes a novel DDCoT prompting that maintains a critical attitude +through negative-space prompting and incorporates multimodality into reasoning +by first dividing the reasoning responsibility of LLMs into reasoning and +recognition and then integrating the visual recognition capability of visual +models into the joint reasoning process. The rationales generated by DDCoT not +only improve the reasoning abilities of both large and small language models in +zero-shot prompting and fine-tuning learning, significantly outperforming +state-of-the-art methods but also exhibit impressive generalizability and +explainability. + +
+
+ comment: 24 pages, 13 figures, to be published in NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Fine-tuning Global Model via Data-Free Knowledge Distillation for + Non-IID Federated Learning CVPR2022 + + +
+ Federated Learning (FL) is an emerging distributed learning paradigm under +privacy constraint. Data heterogeneity is one of the main challenges in FL, +which results in slow convergence and degraded performance. Most existing +approaches only tackle the heterogeneity challenge by restricting the local +model update in client, ignoring the performance drop caused by direct global +model aggregation. Instead, we propose a data-free knowledge distillation +method to fine-tune the global model in the server (FedFTG), which relieves the +issue of direct model aggregation. Concretely, FedFTG explores the input space +of local models through a generator, and uses it to transfer the knowledge from +local models to the global model. Besides, we propose a hard sample mining +scheme to achieve effective knowledge distillation throughout the training. In +addition, we develop customized label sampling and class-level ensemble to +derive maximum utilization of knowledge, which implicitly mitigates the +distribution discrepancy across clients. Extensive experiments show that our +FedFTG significantly outperforms the state-of-the-art (SOTA) FL algorithms and +can serve as a strong plugin for enhancing FedAvg, FedProx, FedDyn, and +SCAFFOLD. + +
+
+ comment: This paper is accepted by CVPR2022 +
+
+
+
+
+ + ♻ ☆ Improving Multimodal Datasets with Image Captioning NeurIPS 2023 + + +
+ Massive web datasets play a key role in the success of large vision-language +models like CLIP and Flamingo. However, the raw web data is noisy, and existing +filtering methods to reduce noise often come at the expense of data diversity. +Our work focuses on caption quality as one major source of noise, and studies +how generated captions can increase the utility of web-scraped datapoints with +nondescript text. Through exploring different mixing strategies for raw and +generated captions, we outperform the best filtering method proposed by the +DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a +candidate pool of 128M image-text pairs. Our best approach is also 2x better at +Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an +effective source of text supervision. In experimenting with different image +captioning models, we also demonstrate that the performance of a model on +standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable +indicator of the utility of the captions it generates for multimodal training. +Finally, our experiments with using generated captions at DataComp's large +scale (1.28B image-text pairs) offer insights into the limitations of synthetic +text, as well as the importance of image curation with increasing training data +quantity. The synthetic captions used in our experiments are now available on +HuggingFace. + +
+
+ comment: Accepted at NeurIPS 2023 Datasets & Benchmarks +
+
+
+
+
+ + ♻ ☆ Cross-Stream Contrastive Learning for Self-Supervised Skeleton-Based + Action Recognition + + +
+ Self-supervised skeleton-based action recognition enjoys a rapid growth along +with the development of contrastive learning. The existing methods rely on +imposing invariance to augmentations of 3D skeleton within a single data +stream, which merely leverages the easy positive pairs and limits the ability +to explore the complicated movement patterns. In this paper, we advocate that +the defect of single-stream contrast and the lack of necessary feature +transformation are responsible for easy positives, and therefore propose a +Cross-Stream Contrastive Learning framework for skeleton-based action +Representation learning (CSCLR). Specifically, the proposed CSCLR not only +utilizes intra-stream contrast pairs, but introduces inter-stream contrast +pairs as hard samples to formulate a better representation learning. Besides, +to further exploit the potential of positive pairs and increase the robustness +of self-supervised representation learning, we propose a Positive Feature +Transformation (PFT) strategy which adopts feature-level manipulation to +increase the variance of positive pairs. To validate the effectiveness of our +method, we conduct extensive experiments on three benchmark datasets NTU-RGB+D +60, NTU-RGB+D 120 and PKU-MMD. Experimental results show that our proposed +CSCLR exceeds the state-of-the-art methods on a diverse range of evaluation +protocols. + +
+
+ comment: 15 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language + Perspective NeurIPS 2023 + + +
+ We focus on the weakly-supervised audio-visual video parsing task (AVVP), +which aims to identify and locate all the events in audio/visual modalities. +Previous works only concentrate on video-level overall label denoising across +modalities, but overlook the segment-level label noise, where adjacent video +segments (i.e., 1-second video clips) may contain different events. However, +recognizing events in the segment is challenging because its label could be any +combination of events that occur in the video. To address this issue, we +consider tackling AVVP from the language perspective, since language could +freely describe how various events appear in each segment beyond fixed labels. +Specifically, we design language prompts to describe all cases of event +appearance for each video. Then, the similarity between language prompts and +segments is calculated, where the event of the most similar prompt is regarded +as the segment-level label. In addition, to deal with the mislabeled segments, +we propose to perform dynamic re-weighting on the unreliable segments to adjust +their labels. Experiments show that our simple yet effective approach +outperforms state-of-the-art methods by a large margin. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Diffusion in Diffusion: Cyclic One-Way Diffusion for + Text-Vision-Conditioned Generation + + +
+ Originating from the diffusion phenomenon in physics that describes particle +movement, the diffusion generative models inherit the characteristics of +stochastic random walk in the data space along the denoising trajectory. +However, the intrinsic mutual interference among image regions contradicts the +need for practical downstream application scenarios where the preservation of +low-level pixel information from given conditioning is desired (e.g., +customization tasks like personalized generation and inpainting based on a +user-provided single image). In this work, we investigate the diffusion +(physics) in diffusion (machine learning) properties and propose our Cyclic +One-Way Diffusion (COW) method to control the direction of diffusion phenomenon +given a pre-trained frozen diffusion model for versatile customization +application scenarios, where the low-level pixel information from the +conditioning needs to be preserved. Notably, unlike most current methods that +incorporate additional conditions by fine-tuning the base text-to-image +diffusion model or learning auxiliary networks, our method provides a novel +perspective to understand the task needs and is applicable to a wider range of +customization scenarios in a learning-free manner. Extensive experiment results +show that our proposed COW can achieve more flexible customization based on +strict visual conditions in different application settings. + +
+
+
+
+
+ + ♻ ☆ Convolutional Visual Prompt for Robust Visual Perception + + +
+ Vision models are often vulnerable to out-of-distribution (OOD) samples +without adapting. While visual prompts offer a lightweight method of +input-space adaptation for large-scale vision models, they rely on a +high-dimensional additive vector and labeled data. This leads to overfitting +when adapting models in a self-supervised test-time setting without labels. We +introduce convolutional visual prompts (CVP) for label-free test-time +adaptation for robust visual perception. The structured nature of CVP demands +fewer trainable parameters, less than 1% compared to standard visual prompts, +combating overfitting. Extensive experiments and analysis on a wide variety of +OOD visual perception tasks show that our approach is effective, improving +robustness by up to 5.87% over several large-scale models. + +
+
+
+
+
+ + ♻ ☆ Evaluating Object Hallucination in Large Vision-Language Models EMNLP 2023 + + +
+ Inspired by the superior language abilities of large language models (LLM), +large vision-language models (LVLM) have been recently explored by integrating +powerful LLMs for improving the performance on complex multimodal tasks. +Despite the promising progress on LVLMs, we find that LVLMs suffer from the +hallucination problem, i.e., they tend to generate objects that are inconsistent +with the target images in the descriptions. To investigate it, this work +presents the first systematic study on object hallucination of LVLMs. We +conduct the evaluation experiments on several representative LVLMs, and show +that they mostly suffer from severe object hallucination issues. We further +discuss that the visual instructions may influence the hallucination, and find +that objects that frequently occur in the visual instructions or co-occur with +the image objects are obviously prone to be hallucinated by LVLMs. Besides, we +find that existing evaluation methods might be affected by the input +instructions and generation styles of LVLMs. Thus, we further design an +improved evaluation method for object hallucination by proposing a +polling-based query method called POPE. Experiment results demonstrate that our +POPE can evaluate the object hallucination in a more stable and flexible way. +Our code and data are publicly available at https://github.com/RUCAIBox/POPE. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Road Network Guided Fine-Grained Urban Traffic Flow Inference + + +
+ Accurate inference of fine-grained traffic flow from coarse-grained flow is an +emerging yet crucial problem, which can help greatly reduce the number of +required traffic monitoring sensors for cost savings. In this work, we notice +that traffic flow has a high correlation with road network, which was either +completely ignored or simply treated as an external factor in previous works. +To address this problem, we propose a novel Road-Aware Traffic Flow +Magnifier (RATFM) that explicitly exploits the prior knowledge of road networks +to fully learn the road-aware spatial distribution of fine-grained traffic +flow. Specifically, a multi-directional 1D convolutional layer is first +introduced to extract the semantic feature of the road network. Subsequently, +we incorporate the road network feature and coarse-grained flow feature to +regularize the short-range spatial distribution modeling of road-relative +traffic flow. Furthermore, we take the road network feature as a query to +capture the long-range spatial distribution of traffic flow with a transformer +architecture. Benefiting from the road-aware inference mechanism, our method +can generate high-quality fine-grained traffic flow maps. Extensive experiments +on three real-world datasets show that the proposed RATFM outperforms +state-of-the-art models under various scenarios. Our code and datasets are +released at https://github.com/luimoli/RATFM. + +
+
+ comment: This work has been accepted to TNNLS +
+
+
+
+
+ + ♻ ☆ R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image + Generation + + +
+ Recent text-to-image (T2I) diffusion models have achieved remarkable progress +in generating high-quality images given text-prompts as input. However, these +models fail to convey appropriate spatial composition specified by a layout +instruction. In this work, we probe into zero-shot grounded T2I generation with +diffusion models, that is, generating images corresponding to the input layout +information without training auxiliary modules or finetuning diffusion models. +We propose a Region and Boundary (R&B) aware cross-attention guidance approach +that gradually modulates the attention maps of diffusion model during +generative process, and assists the model to synthesize images (1) with high +fidelity, (2) highly compatible with textual input, and (3) interpreting layout +instructions accurately. Specifically, we leverage the discrete sampling to +bridge the gap between consecutive attention maps and discrete layout +constraints, and design a region-aware loss to refine the generative layout +during diffusion process. We further propose a boundary-aware loss to +strengthen object discriminability within the corresponding regions. +Experimental results show that our method outperforms existing state-of-the-art +zero-shot grounded T2I generation methods by a large margin both qualitatively +and quantitatively on several benchmarks. + +
+
+ comment: Preprint. Under review. Project page: + https://sagileo.github.io/Region-and-Boundary +
+
+
+
+
+ + ♻ ☆ CHITNet: A Complementary to Harmonious Information Transfer Network for + Infrared and Visible Image Fusion + + +
+ Current infrared and visible image fusion (IVIF) methods go to great lengths +to excavate complementary features and design complex fusion strategies, which +is extremely challenging. To this end, we rethink the IVIF outside the box, +proposing a complementary to harmonious information transfer network (CHITNet). +It reasonably transfers complementary information into harmonious one, which +integrates both the shared and complementary features from two modalities. +Specifically, to skillfully sidestep aggregating complementary information in +IVIF, we design a mutual information transfer (MIT) module to mutually +represent features from two modalities, roughly transferring complementary +information into harmonious one. Then, a harmonious information acquisition +supervised by source image (HIASSI) module is devised to further ensure the +complementary to harmonious information transfer after MIT. Meanwhile, we also +propose a structure information preservation (SIP) module to guarantee that the +edge structure information of the source images can be transferred to the +fusion results. Moreover, a mutual promotion training paradigm (MPTP) with +interaction loss is adopted to facilitate better collaboration among MIT, +HIASSI and SIP. In this way, the proposed method is able to generate fused +images with higher qualities. Extensive experimental results demonstrate the +superiority of our CHITNet over state-of-the-art algorithms in terms of visual +quality and quantitative evaluations. + +
+
+
+
+
+ + ♻ ☆ DEMIST: A deep-learning-based task-specific denoising approach for + myocardial perfusion SPECT + + +
+ There is an important need for methods to process myocardial perfusion +imaging (MPI) SPECT images acquired at lower radiation dose and/or acquisition +time such that the processed images improve observer performance on the +clinical task of detecting perfusion defects. To address this need, we build +upon concepts from model-observer theory and our understanding of the human +visual system to propose a Detection task-specific deep-learning-based approach +for denoising MPI SPECT images (DEMIST). The approach, while performing +denoising, is designed to preserve features that influence observer performance +on detection tasks. We objectively evaluated DEMIST on the task of detecting +perfusion defects using a retrospective study with anonymized clinical data in +patients who underwent MPI studies across two scanners (N = 338). The +evaluation was performed at low-dose levels of 6.25%, 12.5% and 25% and using +an anthropomorphic channelized Hotelling observer. Performance was quantified +using area under the receiver operating characteristics curve (AUC). Images +denoised with DEMIST yielded significantly higher AUC compared to corresponding +low-dose images and images denoised with a commonly used task-agnostic DL-based +denoising method. Similar results were observed with stratified analysis based +on patient sex and defect type. Additionally, DEMIST improved visual fidelity +of the low-dose images as quantified using root mean squared error and +structural similarity index metric. A mathematical analysis revealed that +DEMIST preserved features that assist in detection tasks while improving the +noise properties, resulting in improved observer performance. The results +provide strong evidence for further clinical evaluation of DEMIST to denoise +low-count images in MPI SPECT. + +
+
+
+
+
+ + ♻ ☆ DFPENet-geology: A Deep Learning Framework for High Precision + Recognition and Segmentation of Co-seismic Landslides + + +
+ Automatic recognition and segmentation methods have now become an essential +requirement in identifying co-seismic landslides, which are fundamental for +disaster assessment and mitigation in large-scale earthquakes. This approach +used to be carried out through pixel-based or object-oriented methods. However, +due to the massive amount of remote sensing data, variations in different +earthquake scenarios, and the efficiency requirement for post-earthquake +rescue, these methods are difficult to develop into an accurate, rapid, +comprehensive, and general (cross-scene) solution for co-seismic landslide +recognition. This paper develops a robust model, Dense Feature Pyramid with +Encoder-decoder Network (DFPENet), to understand and fuse the multi-scale +features of objects in remote sensing images. The proposed method achieves a +competitive segmentation accuracy on the public ISPRS 2D Semantic. Furthermore, +a comprehensive and widely-used scheme is proposed for co-seismic landslide +recognition, which integrates image features extracted from the DFPENet model, +geologic features, temporal resolution, landslide spatial analysis, and +transfer learning, while only RGB images are used. To corroborate its +feasibility and applicability, the proposed scheme is applied to two +earthquake-triggered landslides in Jiuzhaigou (China) and Hokkaido (Japan), +using available pre- and post-earthquake remote sensing images. + +
+
+ comment: 35 pages, 11 figures +
+
+
+
+
+ + ♻ ☆ Block Coordinate Plug-and-Play Methods for Blind Inverse Problems + + +
+ Plug-and-play (PnP) prior is a well-known class of methods for solving +imaging inverse problems by computing fixed-points of operators combining +physical measurement models and learned image denoisers. While PnP methods have +been extensively used for image recovery with known measurement operators, +there is little work on PnP for solving blind inverse problems. We address this +gap by presenting a new block-coordinate PnP (BC-PnP) method that efficiently +solves this joint estimation problem by introducing learned denoisers as priors +on both the unknown image and the unknown measurement operator. We present a +new convergence theory for BC-PnP compatible with blind inverse problems by +considering nonconvex data-fidelity terms and expansive denoisers. Our theory +analyzes the convergence of BC-PnP to a stationary point of an implicit +function associated with an approximate minimum mean-squared error (MMSE) +denoiser. We numerically validate our method on two blind inverse problems: +automatic coil sensitivity estimation in magnetic resonance imaging (MRI) and +blind image deblurring. Our results show that BC-PnP provides an efficient and +principled framework for using denoisers as PnP priors for jointly estimating +measurement operators and images. + +
+
+
+
+
+ + ♻ ☆ Assessment of a new GeoAI foundation model for flood inundation mapping SP + + +
+ Vision foundation models are a new frontier in Geospatial Artificial +Intelligence (GeoAI), an interdisciplinary research area that applies and +extends AI for geospatial problem solving and geographic knowledge discovery, +because of their potential to enable powerful image analysis by learning and +extracting important image features from vast amounts of geospatial data. This +paper evaluates the performance of the first-of-its-kind geospatial foundation +model, IBM-NASA's Prithvi, to support a crucial geospatial analysis task: flood +inundation mapping. This model is compared with convolutional neural network +and vision transformer-based architectures in terms of mapping accuracy for +flooded areas. A benchmark dataset, Sen1Floods11, is used in the experiments, +and the models' predictability, generalizability, and transferability are +evaluated based on both a test dataset and a dataset that is completely unseen +by the model. Results show the good transferability of the Prithvi model, +highlighting its performance advantages in segmenting flooded areas in +previously unseen regions. The findings also indicate areas for improvement for +the Prithvi model in terms of adopting multi-scale representation learning, +developing more end-to-end pipelines for high-level image analysis tasks, and +offering more flexibility in terms of input data bands. + +
+
+ comment: 8 pages, 4 figures, Accepted for the 6th ACM SIGSPATIAL International + Workshop on AI for Geographic Knowledge Discovery +
+
+
+
+
+ + ♻ ☆ YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English + Parallel Corpus + + +
+ Machine learning for sign languages is bottlenecked by data. In this paper, +we present YouTube-ASL, a large-scale, open-domain corpus of American Sign +Language (ASL) videos and accompanying English captions drawn from YouTube. +With ~1000 hours of videos and >2500 unique signers, YouTube-ASL is ~3x as +large and has ~10x as many unique signers as the largest prior ASL dataset. We +train baseline models for ASL to English translation on YouTube-ASL and +evaluate them on How2Sign, where we achieve a new finetuned state of the art of +12.39 BLEU and, for the first time, report zero-shot results. + +
+
+
+
+
+ + ♻ ☆ Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields + + +
+ Neural Radiance Field training can be accelerated through the use of +grid-based representations in NeRF's learned mapping from spatial coordinates +to colors and volumetric density. However, these grid-based approaches lack an +explicit understanding of scale and therefore often introduce aliasing, usually +in the form of jaggies or missing scene content. Anti-aliasing has previously +been addressed by mip-NeRF 360, which reasons about sub-volumes along a cone +rather than points along a ray, but this approach is not natively compatible +with current grid-based techniques. We show how ideas from rendering and signal +processing can be used to construct a technique that combines mip-NeRF 360 and +grid-based models such as Instant NGP to yield error rates that are 8% - 77% +lower than either prior technique, and that trains 24x faster than mip-NeRF +360. + +
+
+ comment: Project page: https://jonbarron.info/zipnerf/ +
+
+
+
+
+ + ♻ ☆ Retinexformer: One-stage Retinex-based Transformer for Low-light Image + Enhancement ICCV 2023 + + +
+ When enhancing low-light images, many deep learning algorithms are based on +the Retinex theory. However, the Retinex model does not consider the +corruptions hidden in the dark or introduced by the light-up process. Besides, +these methods usually require a tedious multi-stage training pipeline and rely +on convolutional neural networks, showing limitations in capturing long-range +dependencies. In this paper, we formulate a simple yet principled One-stage +Retinex-based Framework (ORF). ORF first estimates the illumination information +to light up the low-light image and then restores the corruption to produce the +enhanced image. We design an Illumination-Guided Transformer (IGT) that +utilizes illumination representations to direct the modeling of non-local +interactions of regions with different lighting conditions. By plugging IGT +into ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative +and qualitative experiments demonstrate that our Retinexformer significantly +outperforms state-of-the-art methods on thirteen benchmarks. The user study and +application on low-light object detection also reveal the latent practical +values of our method. Code, models, and results are available at +https://github.com/caiyuanhao1998/Retinexformer + +
+
+ comment: ICCV 2023; The first Transformer-based method for low-light image + enhancement +
+
+
+
+
+ + ♻ ☆ Incremental Multimodal Surface Mapping via Self-Organizing Gaussian + Mixture Models + + +
+ This letter describes an incremental multimodal surface mapping methodology, +which represents the environment as a continuous probabilistic model. This +model enables high-resolution reconstruction while simultaneously compressing +spatial and intensity point cloud data. The strategy employed in this work +utilizes Gaussian mixture models (GMMs) to represent the environment. While +prior GMM-based mapping works have developed methodologies to determine the +number of mixture components using information-theoretic techniques, these +approaches either operate on individual sensor observations, making them +unsuitable for incremental mapping, or are not real-time viable, especially for +applications where high-fidelity modeling is required. To bridge this gap, this +letter introduces a spatial hash map for rapid GMM submap extraction combined +with an approach to determine relevant and redundant data in a point cloud. +These contributions increase computational speed by an order of magnitude +compared to state-of-the-art incremental GMM-based mapping. In addition, the +proposed approach yields a superior tradeoff in map accuracy and size when +compared to state-of-the-art mapping methodologies (both GMM- and not +GMM-based). Evaluations are conducted using both simulated and real-world data. +The software is released open-source to benefit the robotics community. + +
+
+ comment: 8 pages, 7 figures, published in IEEE Robotics and Automation Letters +
+
+
+
+
+ + ♻ ☆ Learning to reason over visual objects ICLR 2023 + + +
+ A core component of human intelligence is the ability to identify abstract +patterns inherent in complex, high-dimensional perceptual data, as exemplified +by visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated +by the goal of designing AI systems with this capacity, recent work has focused +on evaluating whether neural networks can learn to solve RPM-like problems. +Previous work has generally found that strong performance on these problems +requires the incorporation of inductive biases that are specific to the RPM +problem format, raising the question of whether such models might be more +broadly useful. Here, we investigated the extent to which a general-purpose +mechanism for processing visual scenes in terms of objects might help promote +abstract visual reasoning. We found that a simple model, consisting only of an +object-centric encoder and a transformer reasoning module, achieved +state-of-the-art results on both of two challenging RPM-like benchmarks (PGM +and I-RAVEN), as well as a novel benchmark with greater visual complexity +(CLEVR-Matrices). These results suggest that an inductive bias for +object-centric processing may be a key component of abstract visual reasoning, +obviating the need for problem-specific inductive biases. + +
+
+ comment: ICLR 2023 +
+
+
+
+
+ + ♻ ☆ LightSpeed: Light and Fast Neural Light Fields on Mobile Devices + + +
+ Real-time novel-view image synthesis on mobile devices is prohibitive due to +the limited computational power and storage. Using volumetric rendering +methods, such as NeRF and its derivatives, on mobile devices is not suitable +due to the high computational cost of volumetric rendering. On the other hand, +recent advances in neural light field representations have shown promising +real-time view synthesis results on mobile devices. Neural light field methods +learn a direct mapping from a ray representation to the pixel color. The +current choice of ray representation is either stratified ray sampling or +Plucker coordinates, overlooking the classic light slab (two-plane) +representation, the preferred representation to interpolate between light field +views. In this work, we find that using the light slab representation is an +efficient representation for learning a neural light field. More importantly, +it is a lower-dimensional ray representation enabling us to learn the 4D ray +space using feature grids which are significantly faster to train and render. +Although mostly designed for frontal views, we show that the light-slab +representation can be further extended to non-frontal scenes using a +divide-and-conquer strategy. Our method offers superior rendering quality +compared to previous light field methods and achieves a significantly improved +trade-off between rendering quality and speed. + +
+
+ comment: Project Page: http://lightspeed-r2l.github.io/ . Add camera ready + version +
+
+
+
+
+ + ♻ ☆ Table Detection for Visually Rich Document Images + + +
+ Table Detection (TD) is a fundamental task to enable visually rich document +understanding, which requires the model to extract information without +information loss. However, popular Intersection over Union (IoU) based +evaluation metrics and IoU-based loss functions for the detection models cannot +directly represent the degree of information loss for the prediction results. +Therefore, we propose to decouple IoU into a ground truth coverage term and a +prediction coverage term, in which the former can be used to measure the +information loss of the prediction results. Besides, considering the sparse +distribution of tables in document images, we use SparseR-CNN as the base model +and further improve the model by using Gaussian Noise Augmented Image Size +region proposals and many-to-one label assignments. Results under comprehensive +experiments show that the proposed method can consistently outperform +state-of-the-art methods with different IoU-based metrics under various +datasets and demonstrate that the proposed decoupled IoU loss can enable the +model to alleviate information loss. + +
+
+ comment: Accepted by Knowledge-Based Systems +
+
+
+
+
+ + ♻ ☆ Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment + Alignment WACV 2024 + + +
+ This paper presents an unsupervised transformer-based framework for temporal +activity segmentation which leverages not only frame-level cues but also +segment-level cues. This is in contrast with previous methods which often rely +on frame-level information only. Our approach begins with a frame-level +prediction module which estimates framewise action classes via a transformer +encoder. The frame-level prediction module is trained in an unsupervised manner +via temporal optimal transport. To exploit segment-level information, we +utilize a segment-level prediction module and a frame-to-segment alignment +module. The former includes a transformer decoder for estimating video +transcripts, while the latter matches frame-level features with segment-level +features, yielding permutation-aware segmentation results. Moreover, inspired +by temporal optimal transport, we introduce simple-yet-effective pseudo labels +for unsupervised training of the above modules. Our experiments on four public +datasets, i.e., 50 Salads, YouTube Instructions, Breakfast, and Desktop +Assembly show that our approach achieves comparable or better performance than +previous methods in unsupervised activity segmentation. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ♻ ☆ Neural Bounding + + +
+ Bounding volumes are an established concept in computer graphics and vision +tasks but have seen little change since their early inception. In this work, we +study the use of neural networks as bounding volumes. Our key observation is +that bounding, which so far has primarily been considered a problem of +computational geometry, can be redefined as a problem of learning to classify +space into free or occupied. This learning-based approach is particularly +advantageous in high-dimensional spaces, such as animated scenes with complex +queries, where neural networks are known to excel. However, unlocking neural +bounding requires a twist: allowing -- but also limiting -- false positives, +while ensuring that the number of false negatives is strictly zero. We enable +such tight and conservative results using a dynamically-weighted asymmetric +loss function. Our results show that our neural bounding produces up to an +order of magnitude fewer false positives than traditional methods. + +
+
+
+
+
+ + ♻ ☆ Training-based Model Refinement and Representation Disagreement for + Semi-Supervised Object Detection WACV + + +
+ Semi-supervised object detection (SSOD) aims to improve the performance and +generalization of existing object detectors by utilizing limited labeled data +and extensive unlabeled data. Despite many advances, recent SSOD methods are +still challenged by inadequate model refinement using the classical exponential +moving average (EMA) strategy, the consensus of Teacher-Student models in the +latter stages of training (i.e., losing their distinctiveness), and +noisy/misleading pseudo-labels. This paper proposes a novel training-based +model refinement (TMR) stage and a simple yet effective representation +disagreement (RD) strategy to address the limitations of classical EMA and the +consensus problem. The TMR stage of Teacher-Student models optimizes the +lightweight scaling operation to refine the model's weights and prevent +overfitting or forgetting learned patterns from unlabeled data. Meanwhile, the +RD strategy helps keep these models diverged to encourage the student model to +explore additional patterns in unlabeled data. Our approach can be integrated +into established SSOD methods and is empirically validated using two baseline +methods, with and without cascade regression, to generate more reliable +pseudo-labels. Extensive experiments demonstrate the superior performance of +our approach over state-of-the-art SSOD methods. Specifically, the proposed +approach outperforms the baseline Unbiased-Teacher-v2 (& Unbiased-Teacher-v1) +method by an average mAP margin of 2.23, 2.1, and 3.36 (& 2.07, 1.9, and 3.27) +on COCO-standard, COCO-additional, and Pascal VOC datasets, respectively. + +
+
+ comment: Accepted in IEEE/CVF Winter Applications of Computer Vision (WACV) + 2024 +
+
+
+
+
+
+
+
+ + Information Retrieval 9 + +
+
+
+ + ☆ LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset + + +
+ As an important component of intelligent legal systems, legal case retrieval +plays a critical role in ensuring judicial justice and fairness. However, the +development of legal case retrieval technologies in the Chinese legal system is +restricted by three problems in existing datasets: limited data size, narrow +definitions of legal relevance, and naive candidate pooling strategies used in +data sampling. To alleviate these issues, we introduce LeCaRDv2, a large-scale +Legal Case Retrieval Dataset (version 2). It consists of 800 queries and 55,192 +candidates extracted from 4.3 million criminal case documents. To the best of +our knowledge, LeCaRDv2 is one of the largest Chinese legal case retrieval +datasets, providing extensive coverage of criminal charges. Additionally, we +enrich the existing relevance criteria by considering three key aspects: +characterization, penalty, and procedure. These comprehensive criteria enrich the +dataset and may provide a more holistic perspective. Furthermore, we propose a +two-level candidate set pooling strategy that effectively identifies potential +candidates for each query case. It's important to note that all cases in the +dataset have been annotated by multiple legal experts specializing in criminal +law. Their expertise ensures the accuracy and reliability of the annotations. +We evaluate several state-of-the-art retrieval models on LeCaRDv2, +demonstrating that there is still significant room for improvement in legal +case retrieval. The details of LeCaRDv2 can be found at the anonymous website +https://github.com/anonymous1113243/LeCaRDv2. + +
+
+
+
+
+ + ☆ LightLM: A Lightweight Deep and Narrow Language Model for Generative + Recommendation + + +
+ This paper presents LightLM, a lightweight Transformer-based language model +for generative recommendation. While Transformer-based generative modeling has +gained importance in various AI sub-fields such as NLP and vision, generative +recommendation is still in its infancy due to its unique demand on personalized +generative modeling. Existing works on generative recommendation often use +NLP-oriented Transformer architectures such as T5, GPT, LLaMA and M6, which are +heavy-weight and are not specifically designed for recommendation tasks. +LightLM tackles the issue by introducing a light-weight deep and narrow +Transformer architecture, which is specifically tailored for direct generation +of recommendation items. This structure is especially apt for straightforward +generative recommendation and stems from the observation that the language model +does not have to be too wide for this task, as the input predominantly consists +of short tokens that are well-suited for the model's capacity. We also show +that our devised user and item ID indexing methods, i.e., Spectral +Collaborative Indexing (SCI) and Graph Collaborative Indexing (GCI), enable +the deep and narrow Transformer architecture to outperform large-scale language +models for recommendation. Besides, to address the hallucination problem of +generating items as output, we propose the constrained generation process for +generative recommenders. Experiments on real-world datasets show that LightLM +outperforms various competitive baselines in terms of both recommendation +accuracy and efficiency. The code can be found at +https://github.com/dongyuanjushi/LightLM. + +
+
+
+
+
+ + ☆ FMMRec: Fairness-aware Multimodal Recommendation + + +
+ Recently, multimodal recommendations have gained increasing attention for +effectively addressing the data sparsity problem by incorporating +modality-based representations. Although multimodal recommendations excel in +accuracy, the introduction of different modalities (e.g., images, text, and +audio) may expose more users' sensitive information (e.g., gender and age) to +recommender systems, resulting in potentially more serious unfairness issues. +Despite many efforts on fairness, existing fairness-aware methods are either +incompatible with multimodal scenarios, or lead to suboptimal fairness +performance due to neglecting sensitive information of multimodal content. To +achieve counterfactual fairness in multimodal recommendations, we propose a +novel fairness-aware multimodal recommendation approach (dubbed as FMMRec) to +disentangle the sensitive and non-sensitive information from modal +representations and leverage the disentangled modal representations to guide +fairer representation learning. Specifically, we first disentangle biased and +filtered modal representations by maximizing and minimizing their sensitive +attribute prediction ability respectively. With the disentangled modal +representations, we mine the modality-based unfair and fair (corresponding to +biased and filtered) user-user structures for enhancing explicit user +representation with the biased and filtered neighbors from the corresponding +structures, followed by adversarially filtering out sensitive information. +Experiments on two real-world public datasets demonstrate the superiority of +our FMMRec relative to the state-of-the-art baselines. Our source code is +available at https://anonymous.4open.science/r/FMMRec. + +
+
+
+
+
+ + ☆ Exploring the Potential of Generative AI for the World Wide Web + + +
+ Generative Artificial Intelligence (AI) is a cutting-edge technology capable +of producing text, images, and various media content leveraging generative +models and user prompts. Between 2022 and 2023, generative AI surged in +popularity with a plethora of applications spanning from AI-powered movies to +chatbots. In this paper, we delve into the potential of generative AI within +the realm of the World Wide Web, specifically focusing on image generation. Web +developers already harness generative AI to help craft text and images, +while Web browsers might use it in the future to locally generate images for +tasks like repairing broken webpages, conserving bandwidth, and enhancing +privacy. To explore this research area, we have developed WebDiffusion, a tool +that allows simulating a Web powered by stable diffusion, a popular +text-to-image model, from both a client and server perspective. WebDiffusion +further supports crowdsourcing of user opinions, which we use to evaluate the +quality and accuracy of 409 AI-generated images sourced from 60 webpages. Our +findings suggest that generative AI is already capable of producing pertinent +and high-quality Web images, even without requiring Web designers to manually +input prompts, just by leveraging contextual information available within the +webpages. However, we acknowledge that direct in-browser image generation +remains a challenge, as only highly powerful GPUs, such as the A40 and A100, +can (partially) compete with classic image downloads. Nevertheless, this +approach could be valuable for a subset of the images, for example when fixing +broken webpages or handling highly private content. + +
+
+ comment: 11 pages, 9 figures +
+
+
+
+
+ + ☆ GNN-GMVO: Graph Neural Networks for Optimizing Gross Merchandise Value + in Similar Item Recommendation + + +
+ Similar item recommendation is a critical task in the e-Commerce industry, +which helps customers explore similar and relevant alternatives based on the +products they are interested in. Unlike traditional machine learning models, Graph +Neural Networks (GNNs), by design, can understand complex relations like +similarity between products. However, in contrast to their wide usage in +retrieval tasks and their focus on optimizing the relevance, the current GNN +architectures are not tailored toward maximizing revenue-related objectives +such as Gross Merchandise Value (GMV), which is one of the major business +metrics for e-Commerce companies. In addition, defining accurate edge relations +in GNNs is non-trivial in large-scale e-Commerce systems, due to the +heterogeneous nature of the item-item relationships. This work aims to address +these issues by designing a new GNN architecture called GNN-GMVO (Graph Neural +Network - Gross Merchandise Value Optimizer). This model directly optimizes GMV +while considering the complex relations between items. In addition, we propose +a customized edge construction method to tailor the model toward the similar item +recommendation task and alleviate the noisy and complex item-item relations. In +our comprehensive experiments on three real-world datasets, we show higher +prediction performance and expected GMV for top ranked items recommended by our +model when compared with selected state-of-the-art benchmark models. + +
+
+ comment: 9 pages, 3 figures, 43 citations +
+
+
+
+
+ + ♻ ☆ DocumentNet: Bridging the Data Gap in Document Pre-Training EMNLP 2023 + + +
+ Document understanding tasks, in particular, Visually-rich Document Entity +Retrieval (VDER), have gained significant attention in recent years thanks to +their broad applications in enterprise AI. However, publicly available data +have been scarce for these tasks due to strict privacy constraints and high +annotation costs. To make things worse, the non-overlapping entity spaces from +different datasets hinder the knowledge transfer between document types. In +this paper, we propose a method to collect massive-scale and weakly labeled +data from the web to benefit the training of VDER models. The collected +dataset, named DocumentNet, does not depend on specific document types or +entity sets, making it universally applicable to all VDER tasks. The current +DocumentNet consists of 30M documents spanning nearly 400 document types +organized in a four-level ontology. Experiments on a set of broadly adopted +VDER tasks show significant improvements when DocumentNet is incorporated into +the pre-training for both classic and few-shot learning settings. With the +recent emergence of large language models (LLMs), DocumentNet provides a large +data source to extend their multi-modal capabilities for VDER. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Multi-grained Hypergraph Interest Modeling for Conversational + Recommendation + + +
+ Conversational recommender systems (CRSs) interact with users through +multi-turn dialogues in natural language, aiming to provide high-quality +recommendations for users' instant information needs. Although great efforts +have been made to develop effective CRSs, most of them still focus on the +contextual information from the current dialogue, usually suffering from the +data scarcity issue. Therefore, we consider leveraging historical dialogue data +to enrich the limited contexts of the current dialogue session. + In this paper, we propose a novel multi-grained hypergraph interest modeling +approach to capture user interest beneath intricate historical data from +different perspectives. As the core idea, we employ hypergraphs to represent +complicated semantic relations underlying historical dialogues. In our +approach, we first employ the hypergraph structure to model users' historical +dialogue sessions and form a session-based hypergraph, which captures +coarse-grained, session-level relations. Second, to alleviate the issue of data +scarcity, we use an external knowledge graph and construct a knowledge-based +hypergraph considering fine-grained, entity-level semantics. We further conduct +multi-grained hypergraph convolution on the two kinds of hypergraphs, and +utilize the enhanced representations to develop an interest-aware CRS. Extensive +experiments on two benchmarks ReDial and TG-ReDial validate the effectiveness +of our approach on both recommendation and conversation tasks. Code is +available at: https://github.com/RUCAIBox/MHIM. + +
+
+
+
+
+ + ♻ ☆ Distributionally Robust Unsupervised Dense Retrieval Training on Web + Graphs + + +
+ This paper introduces Web-DRO, an unsupervised dense retrieval model, which +clusters documents based on web structures and reweights the groups during +contrastive training. Specifically, we first leverage web graph links and +contrastively train an embedding model for clustering anchor-document pairs. +Then we use Group Distributional Robust Optimization to reweight different +clusters of anchor-document pairs, which guides the model to assign more +weights to the group with higher contrastive loss and pay more attention to the +worst case during training. Our experiments on MS MARCO and BEIR show that our +model, Web-DRO, significantly improves the retrieval effectiveness in +unsupervised scenarios. A comparison of clustering techniques shows that +training on the web graph combining URL information reaches optimal performance +on clustering. Further analysis confirms that group weights are stable and +valid, indicating consistent model preferences as well as effective +up-weighting of valuable groups and down-weighting of uninformative ones. The +code of this paper can be obtained from https://github.com/OpenMatch/Web-DRO. + +
+
+ comment: 9 pages, 5 figures, 5 tables +
+
+
+
+
+ + ♻ ☆ Table Detection for Visually Rich Document Images + + +
+ Table Detection (TD) is a fundamental task to enable visually rich document +understanding, which requires the model to extract information without +information loss. However, popular Intersection over Union (IoU) based +evaluation metrics and IoU-based loss functions for the detection models cannot +directly represent the degree of information loss for the prediction results. +Therefore, we propose to decouple IoU into a ground truth coverage term and a +prediction coverage term, in which the former can be used to measure the +information loss of the prediction results. Besides, considering the sparse +distribution of tables in document images, we use SparseR-CNN as the base model +and further improve the model by using Gaussian Noise Augmented Image Size +region proposals and many-to-one label assignments. Results under comprehensive +experiments show that the proposed method can consistently outperform +state-of-the-art methods with different IoU-based metrics under various +datasets and demonstrate that the proposed decoupled IoU loss can enable the +model to alleviate information loss. + +
+
+ comment: Accepted by Knowledge-Based Systems +
+
+
+
+
+
+
+
+ + Machine Learning 150 + +
+
+
+ + ☆ Fantastic Gains and Where to Find Them: On the Existence and Prospect of + General Knowledge Transfer between Any Pretrained Model + + +
+ Training deep networks requires various design decisions regarding for +instance their architecture, data augmentation, or optimization. In this work, +we find these training variations to result in networks learning unique feature +sets from the data. Using public model libraries comprising thousands of models +trained on canonical datasets like ImageNet, we observe that for arbitrary +pairings of pretrained models, one model extracts significant data context +unavailable in the other -- independent of overall performance. Given any +arbitrary pairing of pretrained models and no external rankings (such as +separate test sets, e.g. due to data privacy), we investigate if it is possible +to transfer such "complementary" knowledge from one model to another without +performance degradation -- a task made particularly difficult as additional +knowledge can be contained in stronger, equiperformant or weaker models. Yet +facilitating robust transfer in scenarios agnostic to pretrained model pairings +would unlock auxiliary gains and knowledge fusion from any model repository +without restrictions on model and problem specifics - including from weaker, +lower-performance models. This work therefore provides an initial, in-depth +exploration on the viability of such general-purpose knowledge transfer. Across +large-scale experiments, we first reveal the shortcomings of standard knowledge +distillation techniques, and then propose a much more general extension through +data partitioning for successful transfer between nearly all pretrained models, +which we show can also be done unsupervised. Finally, we assess both the +scalability and impact of fundamental model properties on successful +model-agnostic knowledge transfer. + +
+
+
+
+
+ + ☆ High-Dimensional Prediction for Sequential Decision Making + + +
+ We study the problem of making predictions of an adversarially chosen +high-dimensional state that are unbiased subject to an arbitrary collection of +conditioning events, with the goal of tailoring these events to downstream +decision makers. We give efficient algorithms for solving this problem, as well +as a number of applications that stem from choosing an appropriate set of +conditioning events. + +
+
+
+
+
+ + ☆ A Coarse-to-Fine Pseudo-Labeling (C2FPL) Framework for Unsupervised + Video Anomaly Detection WACV + + +
+ Detection of anomalous events in videos is an important problem in +applications such as surveillance. Video anomaly detection (VAD) is +well-studied in the one-class classification (OCC) and weakly supervised (WS) +settings. However, fully unsupervised (US) video anomaly detection methods, +which learn a complete system without any annotation or human supervision, have +not been explored in depth. This is because the lack of any ground truth +annotations significantly increases the magnitude of the VAD challenge. To +address this challenge, we propose a simple-but-effective two-stage +pseudo-label generation framework that produces segment-level (normal/anomaly) +pseudo-labels, which can be further used to train a segment-level anomaly +detector in a supervised manner. The proposed coarse-to-fine pseudo-label +(C2FPL) generator employs carefully-designed hierarchical divisive clustering +and statistical hypothesis testing to identify anomalous video segments from a +set of completely unlabeled videos. The trained anomaly detector can be +directly applied on segments of an unseen test video to obtain segment-level, +and subsequently, frame-level anomaly predictions. Extensive studies on two +large-scale public-domain datasets, UCF-Crime and XD-Violence, demonstrate that +the proposed unsupervised approach achieves superior performance compared to +all existing OCC and US methods, while yielding comparable performance to the +state-of-the-art WS methods. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV), 2024 +
+
+
+
+
+ + ☆ Do Graph Neural Networks Dream of Landau Damping? Insights from Kinetic + Simulations of a Plasma Sheet Model + + +
+ We explore the possibility of fully replacing a plasma physics kinetic +simulator with a graph neural network-based simulator. We focus on this class +of surrogate models given the similarity between their message-passing update +mechanism and the traditional physics solver update, and the possibility of +enforcing known physical priors into the graph construction and update. We show +that our model learns the kinetic plasma dynamics of the one-dimensional plasma +model, a predecessor of contemporary kinetic plasma simulation codes, and +recovers a wide range of well-known kinetic plasma processes, including plasma +thermalization, electrostatic fluctuations about thermal equilibrium, the +drag on a fast sheet, and Landau damping. We compare the performance against the +original plasma model in terms of run-time, conservation laws, and temporal +evolution of key physical quantities. The limitations of the model are +presented and possible directions for higher-dimensional surrogate models for +kinetic plasmas are discussed. + +
+
+ comment: 27 pages, 14 figures +
+
+
+
+
+ + ☆ Defending Against Transfer Attacks From Public Models + + +
+ Adversarial attacks have been a looming and unaddressed threat in the +industry. However, through a decade-long history of the robustness evaluation +literature, we have learned that mounting a strong or optimal attack is +challenging. It requires both machine learning and domain expertise. In other +words, the white-box threat model, religiously assumed by a large majority of +the past literature, is unrealistic. In this paper, we propose a new practical +threat model where the adversary relies on transfer attacks through publicly +available surrogate models. We argue that this setting will become the most +prevalent for security-sensitive applications in the future. We evaluate the +transfer attacks in this setting and propose a specialized defense method based +on a game-theoretic perspective. The defenses are evaluated under 24 public +models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and +ImageNet). Under this threat model, our defense, PubDef, outperforms the +state-of-the-art white-box adversarial training by a large margin with almost +no loss in the normal accuracy. For instance, on ImageNet, our defense achieves +62% accuracy under the strongest transfer attack vs only 36% of the best +adversarially trained model. Its accuracy when not under attack is only 2% +lower than that of an undefended model (78% vs 80%). We release our code at +https://github.com/wagner-group/pubdef. + +
+
+ comment: Under submission. Code available at + https://github.com/wagner-group/pubdef +
+
+
+
+
+ + ☆ torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free + Deep Learning Studies: A Case Study on NLP EMNLP 2023 + + +
+ Reproducibility in scientific work has become increasingly important +in research communities such as machine learning, natural language processing, +and computer vision due to the rapid development of these research +domains, supported by recent advances in deep learning. In this work, we present +a significantly upgraded version of torchdistill, a modular-driven coding-free +deep learning framework whose initial release +supported only image classification and object detection tasks for reproducible +knowledge distillation experiments. To demonstrate that the upgraded framework +can support more tasks with third-party libraries, we reproduce the GLUE +benchmark results of BERT models using a script based on the upgraded +torchdistill, harmonizing with various Hugging Face libraries. All the 27 +fine-tuned BERT models and configurations to reproduce the results are +published at Hugging Face, and the model weights have already been widely used +in research communities. We also reimplement popular small-sized models and new +knowledge distillation methods and perform additional experiments for computer +vision tasks. + +
+
+ comment: Accepted at the 3rd Workshop for Natural Language Processing Open + Source Software (NLP-OSS) at EMNLP 2023 +
+
+
+
+
+ + ☆ Where you go is who you are -- A study on machine learning based + semantic privacy attacks + + +
+ Concerns about data privacy are omnipresent, given the increasing usage of +digital applications and their underlying business model that includes selling +user data. Location data are particularly sensitive since they allow us to infer +activity patterns and interests of users, e.g., by categorizing visited +locations based on nearby points of interest (POI). On top of that, machine +learning methods provide new powerful tools to interpret big data. In light of +these considerations, we raise the following question: What is the actual risk +that realistic, machine learning based privacy attacks can obtain meaningful +semantic information from raw location data, subject to inaccuracies in the +data? In response, we present a systematic analysis of two attack scenarios, +namely location categorization and user profiling. Experiments on the +Foursquare dataset and tracking data demonstrate the potential for abuse of +high-quality spatial information, leading to a significant privacy loss even +with location inaccuracy of up to 200m. With location obfuscation of more than +1 km, spatial information hardly adds any value, but a high privacy risk solely +from temporal information remains. The availability of public context data such +as POIs plays a key role in inference based on spatial information. Our +findings point out the risks of ever-growing databases of tracking data and +spatial context data, which policymakers should consider for privacy +regulations, and which could guide individuals in their personal location +protection measures. + +
+
+
+
+
+ + ☆ Drive Anywhere: Generalizable End-to-end Autonomous Driving with + Multi-modal Foundation Models + + +
+ As autonomous driving technology matures, end-to-end methodologies have +emerged as a leading strategy, promising seamless integration from perception +to control via deep learning. However, existing systems grapple with challenges +such as unexpected open set environments and the complexity of black-box +models. At the same time, the evolution of deep learning introduces larger, +multimodal foundational models, offering multi-modal visual and textual +understanding. In this paper, we harness these multimodal foundation models to +enhance the robustness and adaptability of autonomous driving systems, enabling +out-of-distribution, end-to-end, multimodal, and more explainable autonomy. +Specifically, we present an approach to apply end-to-end open-set (any +environment/scene) autonomous driving that is capable of providing driving +decisions from representations queryable by image and text. To do so, we +introduce a method to extract nuanced spatial (pixel/patch-aligned) features +from transformers to enable the encapsulation of both spatial and semantic +features. Our approach (i) demonstrates unparalleled results in diverse tests +while achieving significantly greater robustness in out-of-distribution +situations, and (ii) allows the incorporation of latent space simulation (via +text) for improved training (data augmentation via text) and policy debugging. +We encourage the reader to check our explainer video at +https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be and to view the +code and demos on our project webpage at https://drive-anywhere.github.io/. + +
+
+ comment: Project webpage: https://drive-anywhere.github.io Explainer video: + https://www.youtube.com/watch?v=4n-DJf8vXxo&feature=youtu.be +
+
+
+
+
+ + ☆ In-Context Learning Dynamics with Random Binary Sequences + + +
+ Large language models (LLMs) trained on huge corpora of text datasets +demonstrate complex, emergent capabilities, achieving state-of-the-art +performance on tasks they were not explicitly trained for. The precise nature +of LLM capabilities is often mysterious, and different prompts can elicit +different capabilities through in-context learning. We propose a Cognitive +Interpretability framework that enables us to analyze in-context learning +dynamics to understand latent concepts in LLMs underlying behavioral patterns. +This provides a more nuanced understanding than success-or-failure evaluation +benchmarks, but does not require observing internal activations as a +mechanistic interpretation of circuits would. Inspired by the cognitive science +of human randomness perception, we use random binary sequences as context and +study dynamics of in-context learning by manipulating properties of context +data, such as sequence length. In the latest GPT-3.5+ models, we find emergent +abilities to generate pseudo-random numbers and learn basic formal languages, +with striking in-context learning dynamics where model outputs transition +sharply from pseudo-random behaviors to deterministic repetition. + +
+
+
+
+
+ + ☆ Generative Fractional Diffusion Models + + +
+ We generalize the continuous time framework for score-based generative models +from an underlying Brownian motion (BM) to an approximation of fractional +Brownian motion (FBM). We derive a continuous reparameterization trick and the +reverse time model by representing FBM as a stochastic integral over a family +of Ornstein-Uhlenbeck processes to define generative fractional diffusion +models (GFDM) with driving noise converging to a non-Markovian process of +infinite quadratic variation. The Hurst index $H\in(0,1)$ of FBM enables +control of the roughness of the distribution transforming path. To the best of +our knowledge, this is the first attempt to build a generative model upon a +stochastic process with infinite quadratic variation. + +
+
+
+
+
+ + ☆ Grow Your Limits: Continuous Improvement with Real-World RL for Robotic + Locomotion + + +
+ Deep reinforcement learning (RL) can enable robots to autonomously acquire +complex behaviors, such as legged locomotion. However, RL in the real world is +complicated by constraints on efficiency, safety, and overall training +stability, which limits its practical applicability. We present APRL, a policy +regularization framework that modulates the robot's exploration over the course +of training, striking a balance between flexible improvement potential and +focused, efficient exploration. APRL enables a quadrupedal robot to efficiently +learn to walk entirely in the real world within minutes and continue to improve +with more training where prior work saturates in performance. We demonstrate +that continued training with APRL results in a policy that is substantially +more capable of navigating challenging situations and is able to adapt to +changes in dynamics with continued training. + +
+
+ comment: First two authors contributed equally. Project website: + https://sites.google.com/berkeley.edu/aprl +
+
+
+
+
+ + ☆ Proving Test Set Contamination in Black Box Language Models + + +
+ Large language models are trained on vast amounts of internet data, prompting +concerns and speculation that they have memorized public benchmarks. Going from +speculation to proof of contamination is challenging, as the pretraining data +used by proprietary models are often not publicly accessible. We show that it +is possible to provide provable guarantees of test set contamination in +language models without access to pretraining data or model weights. Our +approach leverages the fact that when there is no data contamination, all +orderings of an exchangeable benchmark should be equally likely. In contrast, +the tendency for language models to memorize example order means that a +contaminated language model will find certain canonical orderings to be much +more likely than others. Our test flags potential contamination whenever the +likelihood of a canonically ordered benchmark dataset is significantly higher +than the likelihood after shuffling the examples. We demonstrate that our +procedure is sensitive enough to reliably prove test set contamination in +challenging situations, including models as small as 1.4 billion parameters, on +small test sets of only 1000 examples, and datasets that appear only a few +times in the pretraining corpus. Using our test, we audit five popular publicly +accessible language models for test set contamination and find little evidence +for pervasive contamination. + +
+
+
+
+
+ + ☆ Combating Representation Learning Disparity with Geometric Harmonization NeurIPS 2023 + + +
+ Self-supervised learning (SSL) as an effective paradigm of representation +learning has achieved tremendous success on various curated datasets in diverse +scenarios. Nevertheless, when facing the long-tailed distribution in real-world +applications, it is still hard for existing methods to capture transferable and +robust representation. Conventional SSL methods, pursuing sample-level +uniformity, easily lead to representation learning disparity where head +classes dominate the feature regime but tail classes passively collapse. To +address this problem, we propose a novel Geometric Harmonization (GH) method to +encourage category-level uniformity in representation learning, which is more +benign to the minority and almost does not hurt the majority under long-tailed +distribution. Specifically, GH measures the population statistics of the embedding +space on top of self-supervised learning, and then infers a fine-grained +instance-wise calibration to constrain the space expansion of head classes and +avoid the passive collapse of tail classes. Our proposal does not alter the +setting of SSL and can be easily integrated into existing methods in a low-cost +manner. Extensive results on a range of benchmark datasets show the +effectiveness of GH with high tolerance to the distribution skewness. Our code +is available at https://github.com/MediaBrain-SJTU/Geometric-Harmonization. + +
+
+ comment: Accepted to NeurIPS 2023 (spotlight) +
+
+
+
+
+ + ☆ Uncovering Meanings of Embeddings via Partial Orthogonality + + +
+ Machine learning tools often rely on embedding text as vectors of real +numbers. In this paper, we study how the semantic structure of language is +encoded in the algebraic structure of such embeddings. Specifically, we look at +a notion of ``semantic independence'' capturing the idea that, e.g., +``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such +examples are intuitive, it is difficult to formalize such a notion of semantic +independence. The key observation here is that any sensible formalization +should obey a set of so-called independence axioms, and thus any algebraic +encoding of this structure should also obey these axioms. This leads us +naturally to use partial orthogonality as the relevant algebraic structure. We +develop theory and methods that allow us to demonstrate that partial +orthogonality does indeed capture semantic independence. Complementary to this, +we also introduce the concept of independence preserving embeddings where +embeddings preserve the conditional independence structures of a distribution, +and we prove the existence of such embeddings and approximations to them. + +
+
+
+
+
+ + ☆ A qualitative difference between gradient flows of convex functions in + finite- and infinite-dimensional Hilbert spaces + + +
+ We consider gradient flow/gradient descent and heavy ball/accelerated +gradient descent optimization for convex objective functions. In the gradient +flow case, we prove the following: + 1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can +be arbitrarily slow. + 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is +integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as +$t\to\infty$. + 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as +slowly as any given function which is monotone decreasing and integrable at +$\infty$, even for a fixed quadratic objective. + 4. In finite dimension (or more generally, for all gradient flow curves of +finite length), this is not optimal: We prove that there are convex monotone +decreasing integrable functions $g(t)$ which decrease to zero slower than +$f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. +For instance, we show that any gradient flow $x_t$ of a convex function $f$ in +finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot +\big\{f(x_t) -\inf f\big\}\big)=0$. + This improves on the commonly reported $O(1/t)$ rate and provides a sharp +characterization of the energy decay law. We also note that it is impossible to +establish a rate $O(1/(t\phi(t)))$ for any function $\phi$ which satisfies +$\lim_{t\to\infty}\phi(t) = \infty$, even asymptotically. + Similar results are obtained in related settings for (1) discrete time +gradient descent, (2) stochastic gradient descent with multiplicative noise and +(3) the heavy ball ODE. In the case of stochastic gradient descent, the +summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to +\inf f$ almost surely - an improvement on the convergence almost surely up to a +subsequence which follows from the $O(1/n)$ decay estimate. + +
+
+
+
+
+ + ☆ MimicGen: A Data Generation System for Scalable Robot Learning using + Human Demonstrations + + +
+ Imitation learning from a large set of human demonstrations has proved to be +an effective paradigm for building capable robot agents. However, the +demonstrations can be extremely costly and time-consuming to collect. We +introduce MimicGen, a system for automatically synthesizing large-scale, rich +datasets from only a small number of human demonstrations by adapting them to +new contexts. We use MimicGen to generate over 50K demonstrations across 18 +tasks with diverse scene configurations, object instances, and robot arms from +just ~200 human demonstrations. We show that robot agents can be effectively +trained on this generated dataset by imitation learning to achieve strong +performance in long-horizon and high-precision tasks, such as multi-part +assembly and coffee preparation, across broad initial state distributions. We +further demonstrate that the effectiveness and utility of MimicGen data compare +favorably to collecting additional human demonstrations, making it a powerful +and economical approach towards scaling up robot learning. Datasets, simulation +environments, videos, and more at https://mimicgen.github.io . + +
+
+ comment: Conference on Robot Learning (CoRL) 2023 +
+
+
+
+
+ + ☆ PAC-tuning:Fine-tuning Pretrained Language Models with PAC-driven + Perturbed Gradient Descent EMNLP23 + + +
+ Fine-tuning pretrained language models (PLMs) for downstream tasks is a +large-scale optimization problem, in which the choice of the training algorithm +critically determines how well the trained model can generalize to unseen test +data, especially in the context of few-shot learning. To achieve good +generalization performance and avoid overfitting, techniques such as data +augmentation and pruning are often applied. However, adding these +regularizations necessitates heavy tuning of the hyperparameters of +optimization algorithms, such as the popular Adam optimizer. In this paper, we +propose a two-stage fine-tuning method, PAC-tuning, to address this +optimization challenge. First, based on PAC-Bayes training, PAC-tuning directly +minimizes the PAC-Bayes generalization bound to learn proper parameter +distribution. Second, PAC-tuning modifies the gradient by injecting noise with +the variance learned in the first stage into the model parameters during +training, resulting in a variant of perturbed gradient descent (PGD). In the +past, the few-shot scenario posed difficulties for PAC-Bayes training because +the PAC-Bayes bound, when applied to large models with limited training data, +might not be stringent. Our experimental results across 5 GLUE benchmark tasks +demonstrate that PAC-tuning successfully handles the challenges of fine-tuning +tasks and outperforms strong baseline methods by a visible margin, further +confirming the potential to apply PAC training for any other settings where the +Adam optimizer is currently used for training. + +
+
+ comment: Accepted to EMNLP23 main +
+
+
+
+
+ + ☆ A minimax optimal control approach for robust neural ODEs + + +
+ In this paper, we address the adversarial training of neural ODEs from a +robust control perspective. This is an alternative to the classical training +via empirical risk minimization, and it is widely used to enforce reliable +outcomes for input perturbations. Neural ODEs allow the interpretation of deep +neural networks as discretizations of control systems, unlocking powerful tools +from control theory for the development and the understanding of machine +learning. In this specific case, we formulate the adversarial training with +perturbed data as a minimax optimal control problem, for which we derive first +order optimality conditions in the form of Pontryagin's Maximum Principle. We +provide a novel interpretation of robust training leading to an alternative +weighted technique, which we test on a low-dimensional classification task. + +
+
+ comment: 6 pages, 2 figures and 1 table +
+
+
+
+
+ + ☆ Convergence of flow-based generative models via proximal gradient + descent in Wasserstein space + + +
+ Flow-based generative models enjoy certain advantages in computing the data +generation and the likelihood, and have recently shown competitive empirical +performance. Compared to the accumulating theoretical studies on related +score-based diffusion models, analysis of flow-based models, which are +deterministic in both forward (data-to-noise) and reverse (noise-to-data) +directions, remains sparse. In this paper, we provide a theoretical guarantee of +generating data distribution by a progressive flow model, the so-called JKO +flow model, which implements the Jordan-Kinderlehrer-Otto (JKO) scheme in a +normalizing flow network. Leveraging the exponential convergence of the +proximal gradient descent (GD) in Wasserstein space, we prove the +Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be +$O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps +($N$ Residual Blocks in the flow) where $\varepsilon $ is the error in the +per-step first-order condition. The assumption on data density is merely a +finite second moment, and the theory extends to data distributions without +density and when there are inversion errors in the reverse process where we +obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of +the JKO-type $W_2$-proximal GD is proved for a general class of convex +objective functionals that includes the KL divergence as a special case, which +can be of independent interest. + +
+
+
+
+
+ + ☆ BLIS-Net: Classifying and Analyzing Signals on Graphs + + +
+ Graph neural networks (GNNs) have emerged as a powerful tool for tasks such +as node classification and graph classification. However, much less work has +been done on signal classification, where the data consists of many functions +(referred to as signals) defined on the vertices of a single graph. These tasks +require networks designed differently from those designed for traditional GNN +tasks. Indeed, traditional GNNs rely on localized low-pass filters, and signals +of interest may have intricate multi-frequency behavior and exhibit long range +interactions. This motivates us to introduce the BLIS-Net (Bi-Lipschitz +Scattering Net), a novel GNN that builds on the previously introduced geometric +scattering transform. Our network is able to capture both local and global +signal structure and is able to capture both low-frequency and high-frequency +information. We make several crucial changes to the original geometric +scattering architecture which we prove increase the ability of our network to +capture information about the input signal and show that BLIS-Net achieves +superior performance on both synthetic and real-world data sets based on +traffic flow and fMRI data. + +
+
+
+
+
+ + ☆ SD4Match: Learning to Prompt Stable Diffusion Model for Semantic + Matching + + +
+ In this paper, we address the challenge of matching semantically similar +keypoints across image pairs. Existing research indicates that the intermediate +output of the UNet within the Stable Diffusion (SD) can serve as robust image +feature maps for such a matching task. We demonstrate that by employing a basic +prompt tuning technique, the inherent potential of Stable Diffusion can be +harnessed, resulting in a significant enhancement in accuracy over previous +approaches. We further introduce a novel conditional prompting module that +conditions the prompt on the local details of the input image pairs, leading to +a further improvement in performance. We designate our approach as SD4Match, +short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of +SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets +new benchmarks in accuracy across all these datasets. Particularly, SD4Match +outperforms the previous state-of-the-art by a margin of 12 percentage points +on the challenging SPair-71k dataset. + +
+
+
+
+
+ + ☆ Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models + + +
+ With LLMs shifting their role from statistical modeling of language to +serving as general-purpose AI agents, how should LLM evaluations change? +Arguably, a key ability of an AI agent is to flexibly combine, as needed, the +basic skills it has learned. The capability to combine skills plays an +important role in (human) pedagogy and also in a paper on emergence phenomena +(Arora & Goyal, 2023). + This work introduces Skill-Mix, a new evaluation to measure ability to +combine skills. Using a list of $N$ skills the evaluator repeatedly picks +random subsets of $k$ skills and asks the LLM to produce text combining that +subset of skills. Since the number of subsets grows like $N^k$, for even modest +$k$ this evaluation will, with high probability, require the LLM to produce +text significantly different from any text in the training set. The paper +develops a methodology for (a) designing and administering such an evaluation, +and (b) automatic grading (plus spot-checking by humans) of the results using +GPT-4 as well as the open LLaMA-2 70B model. + Administering a version of Skill-Mix to popular chatbots gave results that, while +generally in line with prior expectations, contained surprises. Sizeable +differences exist among model capabilities that are not captured by their +ranking on popular LLM leaderboards ("cramming for the leaderboard"). +Furthermore, simple probability calculations indicate that GPT-4's reasonable +performance on $k=5$ is suggestive of going beyond "stochastic parrot" behavior +(Bender et al., 2021), i.e., it combines skills in ways that it had not seen +during training. + We sketch how the methodology can lead to a Skill-Mix based eco-system of +open evaluations for AI capabilities of future models. + +
+
+
+
+
+ + ☆ Bifurcations and loss jumps in RNN training + + +
+ Recurrent neural networks (RNNs) are popular machine learning tools for +modeling and forecasting sequential data and for inferring dynamical systems +(DS) from observed time series. Concepts from DS theory (DST) have variously +been used to further our understanding of both how trained RNNs solve complex +tasks and the training process itself. Bifurcations are particularly important +phenomena in DS, including RNNs, that refer to topological (qualitative) +changes in a system's dynamical behavior as one or more of its parameters are +varied. Knowing the bifurcation structure of an RNN thus allows one to deduce +many of its computational and dynamical properties, like its sensitivity to +parameter variations or its behavior during training. In particular, +bifurcations may account for sudden loss jumps observed in RNN training that +could severely impede the training process. Here we first mathematically prove +for a particular class of ReLU-based RNNs that certain bifurcations are indeed +associated with loss gradients tending toward infinity or zero. We then +introduce a novel heuristic algorithm for detecting all fixed points and +k-cycles in ReLU-based RNNs and their existence and stability regions, hence +bifurcation manifolds in parameter space. In contrast to previous numerical +algorithms for finding fixed points and common continuation methods, our +algorithm provides exact results and returns fixed points and cycles up to high +orders with surprisingly good scaling behavior. We exemplify the algorithm on +the analysis of the training process of RNNs, and find that the recently +introduced technique of generalized teacher forcing completely avoids certain +types of bifurcations in training. Thus, besides facilitating the DST analysis +of trained RNNs, our algorithm provides a powerful instrument for analyzing the +training process itself. + +
+
+
+
+
+ + ☆ Towards Matching Phones and Speech Representations + + +
+ Learning phone types from phone instances has been a long-standing problem, +while still being open. In this work, we revisit this problem in the context of +self-supervised learning, and pose it as the problem of matching cluster +centroids to phone embeddings. We study two key properties that enable +matching, namely, whether cluster centroids of self-supervised representations +reduce the variability of phone instances and respect the relationship among +phones. We then use the matching result to produce pseudo-labels and introduce +a new loss function for improving self-supervised representations. Our +experiments show that the matching result captures the relationship among +phones. Training the new loss function jointly with the regular self-supervised +losses, such as APC and CPC, significantly improves the downstream phone +classification. + +
+
+ comment: Accepted to ASRU 2023 +
+
+
+
+
+ + ☆ Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient + Descent + + +
+ We propose a new algorithm for efficiently solving the damped Fisher matrix +in large-scale scenarios where the number of parameters significantly exceeds +the number of available samples. This problem is fundamental for natural +gradient descent and stochastic reconfiguration. Our algorithm is based on +Cholesky decomposition and is generally applicable. Benchmark results show that +the algorithm is significantly faster than existing methods. + +
+
+
+
+
+ + ☆ Interactive Robot Learning from Verbal Correction + + +
+ The ability to learn and refine behavior after deployment has become ever +more important for robots as we design them to operate in unstructured +environments like households. In this work, we design a new learning system +based on a large language model (LLM), OLAF, that allows everyday users to teach +a robot using verbal corrections when the robot makes mistakes, e.g., by saying +"Stop what you're doing. You should move closer to the cup." A key feature of +OLAF is its ability to update the robot's visuomotor neural policy based on the +verbal feedback to avoid repeating mistakes in the future. This is in contrast +to existing LLM-based robotic systems, which only follow verbal commands or +corrections but do not learn from them. We demonstrate the efficacy of our design +in experiments where a user teaches a robot to perform long-horizon +manipulation tasks both in simulation and on physical hardware, achieving on +average 20.0% improvement in policy success rate. Videos and more results are +at https://ut-austin-rpl.github.io/olaf/ + +
+
+
+
+
+ + ☆ Model-Based Runtime Monitoring with Interactive Imitation Learning + + +
+ Robot learning methods have recently made great strides, but generalization +and robustness challenges still hinder their widespread deployment. Failing to +detect and address potential failures renders state-of-the-art learning systems +not combat-ready for high-stakes tasks. Recent advances in interactive +imitation learning have presented a promising framework for human-robot +teaming, enabling the robots to operate safely and continually improve their +performances over long-term deployments. Nonetheless, existing methods +typically require constant human supervision and preemptive feedback, limiting +their practicality in realistic domains. This work aims to endow a robot with +the ability to monitor and detect errors during task execution. We introduce a +model-based runtime monitoring algorithm that learns from deployment data to +detect system anomalies and anticipate failures. Unlike prior work that cannot +foresee future failures or requires failure experiences for training, our +method learns a latent-space dynamics model and a failure classifier, enabling +our method to simulate future action outcomes and detect out-of-distribution +and high-risk states preemptively. We train our method within an interactive +imitation learning framework, where it continually updates the model from the +experiences of the human-robot team collected using trustworthy deployments. +Consequently, our method reduces the human workload needed over time while +ensuring reliable task execution. Our method outperforms the baselines across +system-level and unit-test metrics, with 23% and 40% higher success rates in +simulation and on physical hardware, respectively. More information at +https://ut-austin-rpl.github.io/sirius-runtime-monitor/ + +
+
+
+
+
+ + ☆ Human-Guided Complexity-Controlled Abstractions NeurIPS 2023 + + +
+ Neural networks often learn task-specific latent representations that fail to +generalize to novel settings or tasks. Conversely, humans learn discrete +representations (i.e., concepts or words) at a variety of abstraction levels +(e.g., ``bird'' vs. ``sparrow'') and deploy the appropriate abstraction based +on task. Inspired by this, we train neural models to generate a spectrum of +discrete representations, and control the complexity of the representations +(roughly, how many bits are allocated for encoding inputs) by tuning the +entropy of the distribution over representations. In finetuning experiments, +using only a small number of labeled examples for a new task, we show that (1) +tuning the representation to a task-appropriate complexity level supports the +highest finetuning performance, and (2) in a human-participant study, users +were able to identify the appropriate complexity level for a downstream task +using visualizations of discrete representations. Our results indicate a +promising direction for rapid model finetuning by leveraging human insight. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Hierarchical Ensemble-Based Feature Selection for Time Series + Forecasting + + +
+ We study a novel ensemble approach for feature selection based on +hierarchical stacking in cases of non-stationarity and limited number of +samples with large number of features. Our approach exploits the co-dependency +between features using a hierarchical structure. Initially, a machine learning +model is trained using a subset of features, and then the model's output is +updated using another algorithm with the remaining features to minimize the +target loss. This hierarchical structure allows for flexible depth and feature +selection. By exploiting feature co-dependency hierarchically, our proposed +approach overcomes the limitations of traditional feature selection methods and +feature importance scores. The effectiveness of the approach is demonstrated on +synthetic and real-life datasets, indicating improved performance with +scalability and stability compared to the traditional methods and +state-of-the-art approaches. + +
+
+
+
+
+ + ☆ EqDrive: Efficient Equivariant Motion Forecasting with Multi-Modality + for Autonomous Driving + + +
+ Forecasting vehicular motions in autonomous driving requires a deep +understanding of agent interactions and the preservation of motion equivariance +under Euclidean geometric transformations. Traditional models often lack the +sophistication needed to handle the intricate dynamics inherent to autonomous +vehicles and the interaction relationships among agents in the scene. As a +result, these models have a lower model capacity, which then leads to higher +prediction errors and lower training efficiency. In our research, we employ +EqMotion, a leading equivariant particle and human prediction model that also +accounts for invariant agent interactions, for the task of multi-agent vehicle +motion forecasting. In addition, we use a multi-modal prediction mechanism to +account for multiple possible future paths in a probabilistic manner. By +leveraging EqMotion, our model achieves state-of-the-art (SOTA) performance +with fewer parameters (1.2 million) and a significantly reduced training time +(less than 2 hours). + +
+
+ comment: 6 pages, 7 figures +
+
+
+
+
+ + ☆ Little Exploration is All You Need + + +
+ The prevailing principle of "Optimism in the Face of Uncertainty" advocates +for the incorporation of an exploration bonus, generally assumed to be +proportional to the inverse square root of the visit count ($1/\sqrt{n}$), +where $n$ is the number of visits to a particular state-action pair. This +approach, however, exclusively focuses on "uncertainty," neglecting the +inherent "difficulty" of different options. To address this gap, we introduce a +novel modification of the standard UCB algorithm in the multi-armed bandit problem, +proposing an adjusted bonus term of $1/n^\tau$, where $\tau > 1/2$, that +accounts for task difficulty. Our proposed algorithm, denoted as UCB$^\tau$, is +substantiated through comprehensive regret and risk analyses, confirming its +theoretical robustness. Comparative evaluations with standard UCB and Thompson +Sampling algorithms on synthetic datasets demonstrate that UCB$^\tau$ not only +outperforms in efficacy but also exhibits lower risk across various +environmental conditions and hyperparameter settings. + +
+
+
+
+
+ + ☆ Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic + Forgetting in Curiosity NeurIPS 2023 + + +
+ Deep reinforcement learning methods exhibit impressive performance on a range +of tasks but still struggle on hard exploration tasks in large environments +with sparse rewards. To address this, intrinsic rewards can be generated using +forward model prediction errors that decrease as the environment becomes known, +and incentivize an agent to explore novel states. While prediction-based +intrinsic rewards can help agents solve hard exploration tasks, they can suffer +from catastrophic forgetting and actually increase at visited states. We first +examine the conditions and causes of catastrophic forgetting in grid world +environments. We then propose a new method FARCuriosity, inspired by how humans +and animals learn. The method depends on fragmentation and recall: an agent +fragments an environment based on surprisal, and uses different local curiosity +modules (prediction-based intrinsic reward functions) for each fragment so that +modules are not trained on the entire environment. At each fragmentation event, +the agent stores the current module in long-term memory (LTM) and either +initializes a new module or recalls a previously stored module based on its +match with the current state. With fragmentation and recall, FARCuriosity +achieves less forgetting and better overall performance in games with varied +and heterogeneous environments in the Atari benchmark suite of tasks. Thus, +this work highlights the problem of catastrophic forgetting in prediction-based +curiosity methods and proposes a solution. + +
+
+ comment: NeurIPS 2023 Workshop - Intrinsically Motivated Open-ended Learning +
+
+
+
+
+ + ☆ SoK: Pitfalls in Evaluating Black-Box Attacks + + +
+ Numerous works study black-box attacks on image classifiers. However, these +works make different assumptions on the adversary's knowledge and current +literature lacks a cohesive organization centered around the threat model. To +systematize knowledge in this area, we propose a taxonomy over the threat space +spanning the axes of feedback granularity, the access of interactive queries, +and the quality and quantity of the auxiliary data available to the attacker. +Our new taxonomy provides three key insights. 1) Despite extensive literature, +numerous under-explored threat spaces exist, which cannot be trivially solved +by adapting techniques from well-explored settings. We demonstrate this by +establishing a new state-of-the-art in the less-studied setting of access to +top-k confidence scores by adapting techniques from well-explored settings of +accessing the complete confidence vector, but show how it still falls short of +the more restrictive setting that only obtains the prediction label, +highlighting the need for more research. 2) Identifying the threat model of +different attacks uncovers stronger baselines that challenge prior +state-of-the-art claims. We demonstrate this by enhancing an initially weaker +baseline (under interactive query access) via surrogate models, effectively +overturning claims in the respective paper. 3) Our taxonomy reveals +interactions between attacker knowledge that connect well to related areas, +such as model inversion and extraction attacks. We discuss how advances in +other areas can enable potentially stronger black-box attacks. Finally, we +emphasize the need for a more realistic assessment of attack success by +factoring in local attack runtime. This approach reveals the potential for +certain attacks to achieve notably higher success rates and the need to +evaluate attacks in diverse and harder settings, highlighting the need for +better selection criteria. + +
+
+
+
+
+ + ☆ Learning Regularized Graphon Mean-Field Games with Unknown Graphons + + +
+ We design and analyze reinforcement learning algorithms for Graphon +Mean-Field Games (GMFGs). In contrast to previous works that require the +precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of +the regularized GMFGs when the graphons are unknown. Our contributions are +threefold. First, we propose the Proximal Policy Optimization for GMFG +(GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ +after $T$ iterations with an estimation oracle, improving on a previous work by +Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we +design efficient algorithms to estimate the transition kernels, reward +functions, and graphons from sampled agents. Convergence rates are then derived +when the positions of the agents are either known or unknown. Results for the +combination of the optimization algorithm GMFG-PPO and the estimation algorithm +are then provided. These algorithms are the first specifically designed for +learning graphons from sampled agents. Finally, the efficacy of the proposed +algorithms is corroborated through simulations. These simulations demonstrate +that learning the unknown graphons reduces the exploitability effectively. + +
+
+
+
+
+ + ☆ Evaluating Bias and Fairness in Gender-Neutral Pretrained + Vision-and-Language Models EMNLP 2024 + + +
+ Pretrained machine learning models are known to perpetuate and even amplify +existing biases in data, which can result in unfair outcomes that ultimately +impact user experience. Therefore, it is crucial to understand the mechanisms +behind those prejudicial biases to ensure that model performance does not +result in discriminatory behaviour toward certain groups or populations. In +this work, we define gender bias as our case study. We quantify bias +amplification in pretraining and after fine-tuning on three families of +vision-and-language models. We investigate the connection, if any, between the +two learning stages, and evaluate how bias amplification reflects on model +performance. Overall, we find that bias amplification in pretraining and after +fine-tuning are independent. We then examine the effect of continued +pretraining on gender-neutral data, finding that this reduces group +disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without +significantly compromising task performance. + +
+
+ comment: To appear in EMNLP 2024 +
+
+
+
+
+ + ☆ Can large language models replace humans in the systematic review + process? Evaluating GPT-4's efficacy in screening and extracting data from + peer-reviewed and grey literature in multiple languages + + +
+ Systematic reviews are vital for guiding practice, research, and policy, yet +they are often slow and labour-intensive. Large language models (LLMs) could +offer a way to speed up and automate systematic reviews, but their performance +in such tasks has not been comprehensively evaluated against humans, and no +study has tested GPT-4, the biggest LLM so far. This pre-registered study +evaluates GPT-4's capability in title/abstract screening, full-text review, and +data extraction across various literature types and languages using a +'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human +performance in most tasks, results were skewed by chance agreement and dataset +imbalance. After adjusting for these, there was a moderate level of performance +for data extraction, and - barring studies that used highly reliable prompts - +screening performance levelled at none to moderate for different stages and +languages. When screening full-text literature using highly reliable prompts, +GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key +studies using highly reliable prompts improved its performance even more. Our +findings indicate that, currently, substantial caution should be used if LLMs +are being used to conduct systematic reviews, but suggest that, for certain +systematic review tasks delivered under reliable prompts, LLMs can rival human +performance. + +
+
+ comment: 9 pages, 2 figures, 1 table +
+
+
+
+
+ + ☆ The Expressive Power of Low-Rank Adaptation + + +
+ Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that +leverages low-rank adaptation of weight matrices, has emerged as a prevalent +technique for fine-tuning pre-trained models such as large language models and +diffusion models. Despite its huge success in practice, the theoretical +underpinnings of LoRA have largely remained unexplored. This paper takes the +first step to bridge this gap by theoretically analyzing the expressive power +of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any +model $f$ to accurately represent any smaller target model $\overline{f}$ if +LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of +}\overline{f}}{\text{depth of }f}$. We also quantify the approximation error +when LoRA-rank is lower than the threshold. For Transformer networks, we show +any model can be adapted to a target model of the same size with +rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters. + +
+
+ comment: 40 pages, 5 figures +
+
+
+
+
+ + ☆ Controllable Generation of Artificial Speaker Embeddings through + Discovery of Principal Directions ISCA + + +
+ Customizing voice and speaking style in a speech synthesis system with +intuitive and fine-grained controls is challenging, given that little data with +appropriate labels is available. Furthermore, editing an existing human's voice +also comes with ethical concerns. In this paper, we propose a method to +generate artificial speaker embeddings that cannot be linked to a real human +while offering intuitive and fine-grained control over the voice and speaking +style of the embeddings, without requiring any labels for speaker or style. The +artificial and controllable embeddings can be fed to a speech synthesis system, +conditioned on embeddings of real humans during training, without sacrificing +privacy during inference. + +
+
+ comment: Published at ISCA Interspeech 2023 + https://www.isca-speech.org/archive/interspeech_2023/lux23_interspeech.html +
+
+
+
+
+ + ☆ The IMS Toucan System for the Blizzard Challenge 2023 + + +
+ For our contribution to the Blizzard Challenge 2023, we improved on the +system we submitted to the Blizzard Challenge 2021. Our approach entails a +rule-based text-to-phoneme processing system that includes rule-based +disambiguation of homographs in the French language. It then transforms the +phonemes to spectrograms as intermediate representations using a fast and +efficient non-autoregressive synthesis architecture based on Conformer and +Glow. A GAN based neural vocoder that combines recent state-of-the-art +approaches converts the spectrogram to the final wave. We carefully designed +the data processing, training, and inference procedures for the challenge data. +Our system identifier is G. Open source code and demo are available. + +
+
+ comment: Published at the Blizzard Challenge Workshop 2023, colocated with the + Speech Synthesis Workshop 2023, a satellite event of the Interspeech 2023 +
+
+
+
+
+ + ☆ CBD: A Certified Backdoor Detector Based on Local Dominant Probability NeurIPS 2023 + + +
+ Backdoor attack is a common threat to deep neural networks. During testing, +samples embedded with a backdoor trigger will be misclassified as an +adversarial target by a backdoored model, while samples without the backdoor +trigger will be correctly classified. In this paper, we present the first +certified backdoor detector (CBD), which is based on a novel, adjustable +conformal prediction scheme based on our proposed statistic local dominant +probability. For any classifier under inspection, CBD provides 1) a detection +inference, 2) the condition under which the attacks are guaranteed to be +detectable for the same classification domain, and 3) a probabilistic upper +bound for the false positive rate. Our theoretical results show that attacks +with triggers that are more resilient to test-time noise and have smaller +perturbation magnitudes are more likely to be detected with guarantees. +Moreover, we conduct extensive experiments on four benchmark datasets +considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves +comparable or even higher detection accuracy than state-of-the-art detectors, +and additionally provides detection certification. Notably, for backdoor +attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ that +achieve an attack success rate above 90\%, CBD achieves 100\% (98\%), 100\% +(84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true +positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and +TinyImageNet, respectively, with low false positive rates. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Tackling Interference Induced by Data Training Loops in A/B Tests: A + Weighted Training Approach + + +
+ In modern recommendation systems, the standard pipeline involves training +machine learning models on historical data to predict user behaviors and +improve recommendations continuously. However, these data training loops can +introduce interference in A/B tests, where data generated by control and +treatment algorithms, potentially with different distributions, are combined. +To address these challenges, we introduce a novel approach called weighted +training. This approach entails training a model to predict the probability of +each data point appearing in either the treatment or control data and +subsequently applying weighted losses during model training. We demonstrate +that this approach achieves the least variance among all estimators without +causing shifts in the training distributions. Through simulation studies, we +demonstrate the lower bias and variance of our approach compared to other +methods. + +
+
+
+
+
+ + ☆ Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation + Models: A Multi-Agent Deep Reinforcement Learning Approach + + +
+ The efficient deployment and fine-tuning of foundation models are pivotal in +contemporary artificial intelligence. In this study, we present a +groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation +models, specifically designed to enhance local task performance on user +equipment (UE). Central to our approach is the innovative Emulator-Adapter +architecture, segmenting the foundation model into two cohesive modules. This +design not only conserves computational resources but also ensures adaptability +and fine-tuning efficiency for downstream tasks. Additionally, we introduce an +advanced resource allocation mechanism that is fine-tuned to the needs of the +Emulator-Adapter structure in decentralized settings. To address the challenges +presented by this system, we employ a hybrid multi-agent Deep Reinforcement +Learning (DRL) strategy, adept at handling mixed discrete-continuous action +spaces, ensuring dynamic and optimal resource allocations. Our comprehensive +simulations and validations underscore the practical viability of our approach, +demonstrating its robustness, efficiency, and scalability. Collectively, this +work offers a fresh perspective on deploying foundation models and balancing +computational efficiency with task proficiency. + +
+
+
+
+
+ + ☆ FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine + Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation + Models with Mobile Edge Computing + + +
+ The emergence of foundation models, including language and vision models, has +reshaped AI's landscape, offering capabilities across various applications. +Deploying and fine-tuning these large models, like GPT-3 and BERT, presents +challenges, especially in the current foundation model era. We introduce +Emulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning +(PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we +expand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses +adapters, emulators, and PEFT for federated model tuning, enhancing model +privacy and memory efficiency. Adapters adjust pre-trained models, while +emulators give a compact representation of original models, addressing both +privacy and efficiency. Adaptable to various neural networks, our approach also +uses deep reinforcement learning for hyper-parameter optimization. We tested +FedPEAT in a unique scenario with a server participating in collaborative +federated tuning, showcasing its potential in tackling foundation model +challenges. + +
+
+
+
+
+ + ☆ Bias in Evaluation Processes: An Optimization-Based Model NeurIPS 2023 + + +
+ Biases with respect to socially-salient attributes of individuals have been +well documented in evaluation processes used in settings such as admissions and +hiring. We view such an evaluation process as a transformation of a +distribution of the true utility of an individual for a task to an observed +distribution and model it as a solution to a loss minimization problem subject +to an information constraint. Our model has two parameters that have been +identified as factors leading to biases: the resource-information trade-off +parameter in the information constraint and the risk-averseness parameter in +the loss function. We characterize the distributions that arise from our model +and study the effect of the parameters on the observed distribution. The +outputs of our model enrich the class of distributions that can be used to +capture variation across groups in the observed evaluations. We empirically +validate our model by fitting real-world datasets and use it to study the +effect of interventions in a downstream selection task. These results +contribute to an understanding of the emergence of bias in evaluation processes +and provide tools to guide the deployment of interventions to mitigate biases. + +
+
+ comment: The conference version of this paper appears in NeurIPS 2023 +
+
+
+
+
+ + ☆ Fair collaborative vehicle routing: A deep multi-agent reinforcement + learning approach + + +
+ Collaborative vehicle routing occurs when carriers collaborate through +sharing their transportation requests and performing transportation requests on +behalf of each other. This achieves economies of scale, thus reducing cost, +greenhouse gas emissions and road congestion. But which carrier should partner +with whom, and how much should each carrier be compensated? Traditional game +theoretic solution concepts are expensive to calculate as the characteristic +function scales exponentially with the number of agents. This would require +solving the vehicle routing problem (NP-hard) an exponential number of times. +We therefore propose to model this problem as a coalitional bargaining game +solved using deep multi-agent reinforcement learning, where - crucially - +agents are not given access to the characteristic function. Instead, we +implicitly reason about the characteristic function; thus, when deployed in +production, we only need to evaluate the expensive post-collaboration vehicle +routing problem once. Our contribution is that we are the first to consider +both the route allocation problem and gain sharing problem simultaneously - +without access to the expensive characteristic function. Through decentralised +machine learning, our agents bargain with each other and agree to outcomes that +correlate well with the Shapley value - a fair profit allocation mechanism. +Importantly, we are able to achieve a reduction in run-time of 88%. + +
+
+ comment: Final, published version can be found here: + https://www.sciencedirect.com/science/article/pii/S0968090X23003662 +
+
+
+
+
+ + ☆ Secure short-term load forecasting for smart grids with + transformer-based federated learning + + +
+ Electricity load forecasting is an essential task within smart grids to +assist demand and supply balance. While advanced deep learning models require +large amounts of high-resolution data for accurate short-term load predictions, +fine-grained load profiles can expose users' electricity consumption behaviors, +which raises privacy and security concerns. One solution to improve data +privacy is federated learning, where models are trained locally on private +data, and only the trained model parameters are merged and updated on a global +server. Therefore, this paper presents a novel transformer-based deep learning +approach with federated learning for short-term electricity load prediction. To +evaluate our results, we benchmark our federated learning architecture against +central and local learning and compare the performance of our model to long +short-term memory models and convolutional neural networks. Our simulations are +based on a dataset from a German university campus and show that +transformer-based forecasting is a promising alternative to state-of-the-art +models within federated learning. + +
+
+
+
+
+ + ☆ Foundation Model Based Native AI Framework in 6G with Cloud-Edge-End + Collaboration + + +
+ Future wireless communication networks are in a position to move beyond +data-centric, device-oriented connectivity and offer intelligent, immersive +experiences based on task-oriented connections, especially in the context of +the thriving development of pre-trained foundation models (PFM) and the +evolving vision of 6G native artificial intelligence (AI). Therefore, +redefining modes of collaboration between devices and servers and constructing +native intelligence libraries become critically important in 6G. In this paper, +we analyze the challenges of achieving 6G native AI from the perspectives of +data, intelligence, and networks. Then, we propose a 6G native AI framework +based on foundation models, provide a customization approach for intent-aware +PFM, present a construction of a task-oriented AI toolkit, and outline a novel +cloud-edge-end collaboration paradigm. As a practical use case, we apply this +framework for orchestration, achieving the maximum sum rate within a wireless +communication system, and presenting preliminary evaluation results. Finally, +we outline research directions for achieving native AI in 6G. + +
+
+ comment: 8 pages, 4 figures, 1 table +
+
+
+
+
+ + ☆ Cross-modal Active Complementary Learning with Self-refining + Correspondence NeurIPS 2023 + + +
+ Recently, image-text matching has attracted more and more attention from +academia and industry, which is fundamental to understanding the latent +correspondence across visual and textual modalities. However, most existing +methods implicitly assume the training pairs are well-aligned while ignoring +the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby +inevitably leading to a performance drop. Although some methods attempt to +address such noise, they still face two challenging problems: excessive +memorizing/overfitting and unreliable correction for NC, especially under high +noise. To address the two problems, we propose a generalized Cross-modal Robust +Complementary Learning framework (CRCL), which benefits from a novel Active +Complementary Loss (ACL) and an efficient Self-refining Correspondence +Correction (SCC) to improve the robustness of existing methods. Specifically, +ACL exploits active and complementary learning losses to reduce the risk of +providing erroneous supervision, leading to theoretically and experimentally +demonstrated robustness against NC. SCC utilizes multiple self-refining +processes with momentum correction to enlarge the receptive field for +correcting correspondences, thereby alleviating error accumulation and +achieving accurate and stable corrections. We carry out extensive experiments +on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify +the superior robustness of our CRCL against synthetic and real-world noisy +correspondences. + +
+
+ comment: This paper is accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ The statistical thermodynamics of generative diffusion models + + +
+ Generative diffusion models have achieved spectacular performance in many +areas of generative modeling. While the fundamental ideas behind these models +come from non-equilibrium physics, in this paper we show that many aspects of +these models can be understood using the tools of equilibrium statistical +mechanics. Using this reformulation, we show that generative diffusion models +undergo second-order phase transitions corresponding to symmetry breaking +phenomena. We argue that this leads to a form of instability that lies at the +heart of their generative capabilities and that can be described by a set of +mean field critical exponents. We conclude by analyzing recent work connecting +diffusion models and associative memory networks in view of the thermodynamic +formulations. + +
+
+
+
+
+ + ☆ Bayesian Neural Controlled Differential Equations for Treatment Effect + Estimation + + +
+ Treatment effect estimation in continuous time is crucial for personalized +medicine. However, existing methods for this task are limited to point +estimates of the potential outcomes, whereas uncertainty estimates have been +ignored. Needless to say, uncertainty quantification is crucial for reliable +decision-making in medical applications. To fill this gap, we propose a novel +Bayesian neural controlled differential equation (BNCDE) for treatment effect +estimation in continuous time. In our BNCDE, the time dimension is modeled +through a coupled system of neural controlled differential equations and neural +stochastic differential equations, where the neural stochastic differential +equations allow for tractable variational Bayesian inference. Thereby, for an +assigned sequence of treatments, our BNCDE provides meaningful posterior +predictive distributions of the potential outcomes. To the best of our +knowledge, ours is the first tailored neural method to provide uncertainty +estimates of treatment effects in continuous time. As such, our method is of +direct practical value for promoting reliable decision-making in medicine. + +
+
+
+
+
+ + ☆ Towards Learning Monocular 3D Object Localization From 2D Labels using + the Physical Laws of Motion + + +
+ We present a novel method for precise 3D object localization in single images +from a single calibrated camera using only 2D labels. No expensive 3D labels +are needed. Thus, instead of using 3D labels, our model is trained with +easy-to-annotate 2D labels along with the physical knowledge of the object's +motion. Given this information, the model can infer the latent third dimension, +even though it has never seen this information during training. Our method is +evaluated on both synthetic and real-world datasets, and we are able to achieve +a mean distance error of just 6 cm in our experiments on real data. The results +indicate the method's potential as a step towards learning 3D object location +estimation, where collecting 3D data for training is not feasible. + +
+
+
+
+
+ + ☆ Coalitional Bargaining via Reinforcement Learning: An Application to + Collaborative Vehicle Routing NeurIPS 2021 + + +
+ Collaborative Vehicle Routing is where delivery companies cooperate by +sharing their delivery information and performing delivery requests on behalf +of each other. This achieves economies of scale and thus reduces cost, +greenhouse gas emissions, and road congestion. But which company should partner +with whom, and how much should each company be compensated? Traditional game +theoretic solution concepts, such as the Shapley value or nucleolus, are +difficult to calculate for the real-world problem of Collaborative Vehicle +Routing due to the characteristic function scaling exponentially with the +number of agents. This would require solving the Vehicle Routing Problem (an +NP-Hard problem) an exponential number of times. We therefore propose to model +this problem as a coalitional bargaining game where - crucially - agents are +not given access to the characteristic function. Instead, we implicitly reason +about the characteristic function, and thus eliminate the need to evaluate the +VRP an exponential number of times - we only need to evaluate it once. Our +contribution is that our decentralised approach is both scalable and considers +the self-interested nature of companies. The agents learn using a modified +Independent Proximal Policy Optimisation. Our RL agents outperform a strong +heuristic bot. The agents correctly identify the optimal coalitions 79% of the +time with an average optimality gap of 4.2% and reduction in run-time of 62%. + +
+
+ comment: Accepted to NeurIPS 2021 Workshop on Cooperative AI +
+
+
+
+
+ + ☆ Sign Language Recognition without frame-sequencing constraints: A proof + of concept on the Argentinian Sign Language + + +
+ Automatic sign language recognition (SLR) is an important topic within the +areas of human-computer interaction and machine learning. On the one hand, it +poses a complex challenge that requires the intervention of various knowledge +areas, such as video processing, image processing, intelligent systems and +linguistics. On the other hand, robust recognition of sign language could +assist in the translation process and the integration of hearing-impaired +people, as well as the teaching of sign language for the hearing population. + SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or +similar models to recognize signs. Such techniques exploit the sequential +ordering of frames to reduce the number of hypotheses. This paper presents a +general probabilistic model for sign classification that combines +sub-classifiers based on different types of features such as position, movement +and handshape. The model employs a bag-of-words approach in all classification +steps, to explore the hypothesis that ordering is not essential for +recognition. The proposed model achieved an accuracy rate of 97% on an +Argentinian Sign Language dataset containing 64 classes of signs and 3200 +samples, providing some evidence that indeed recognition without ordering is +possible. + +
+
+ comment: IBERAMIA 2016 +
+
+
+
+
+ + ☆ Likelihood-based Out-of-Distribution Detection with Denoising Diffusion + Probabilistic Models BMVC 2023 + + +
+ Out-of-Distribution detection between dataset pairs has been extensively +explored with generative models. We show that likelihood-based +Out-of-Distribution detection can be extended to diffusion models by leveraging +the fact that they, like other likelihood-based generative models, are +dramatically affected by the input sample complexity. Currently, all +Out-of-Distribution detection methods with Diffusion Models are +reconstruction-based. We propose a new likelihood ratio for Out-of-Distribution +detection with Deep Denoising Diffusion Models, which we call the Complexity +Corrected Likelihood Ratio. Our likelihood ratio is constructed using Evidence +Lower-Bound evaluations from an individual model at various noising levels. We +present results that are comparable to state-of-the-art Out-of-Distribution +detection methods with generative models. + +
+
+ comment: 9 pages (main paper), 3 pages (acknowledgements & references), 3 + figures, 2 tables, 1 algorithm, work accepted for BMVC 2023 +
+
+
+
+
+ + ☆ Handshape recognition for Argentinian Sign Language using ProbSom + + +
+ Automatic sign language recognition is an important topic within the areas of +human-computer interaction and machine learning. On the one hand, it poses a +complex challenge that requires the intervention of various knowledge areas, +such as video processing, image processing, intelligent systems and +linguistics. On the other hand, robust recognition of sign language could +assist in the translation process and the integration of hearing-impaired +people. + This paper offers two main contributions: first, the creation of a database +of handshapes for the Argentinian Sign Language (LSA), which is a topic that +has barely been discussed so far. Secondly, a technique for image processing, +descriptor extraction and subsequent handshape classification using a +supervised adaptation of self-organizing maps that is called ProbSom. This +technique is compared to others in the state of the art, such as Support Vector +Machines (SVM), Random Forests, and Neural Networks. + The database that was built contains 800 images with 16 LSA handshapes, and +is a first step towards building a comprehensive database of Argentinian signs. +The ProbSom-based neural classifier, using the proposed descriptor, achieved an +accuracy rate above 90%. + +
+
+
+
+
+ + ☆ Causal Modeling with Stationary Diffusions + + +
+ We develop a novel approach towards causal inference. Rather than structural +equations over a causal graph, we learn stochastic differential equations +(SDEs) whose stationary densities model a system's behavior under +interventions. These stationary diffusion models do not require the formalism +of causal graphs, let alone the common assumption of acyclicity. We show that +in several cases, they generalize to unseen interventions on their variables, +often better than classical approaches. Our inference method is based on a new +theoretical result that expresses a stationarity condition on the diffusion's +generator in a reproducing kernel Hilbert space. The resulting kernel deviation +from stationarity (KDS) is an objective function of independent interest. + +
+
+
+
+
+ + ☆ Invariance Measures for Neural Networks + + +
+ Invariances in neural networks are useful and necessary for many tasks. +However, the representation of the invariance of most neural network models has +not been characterized. We propose measures to quantify the invariance of +neural networks in terms of their internal representation. The measures are +efficient and interpretable, and can be applied to any neural network model. +They are also more sensitive to invariance than previously defined measures. We +validate the measures and their properties in the domain of affine +transformations and the CIFAR10 and MNIST datasets, including their stability +and interpretability. Using the measures, we perform a first analysis of CNN +models and show that their internal invariance is remarkably stable to random +weight initializations, but not to changes in dataset or transformation. We +believe the measures will enable new avenues of research in invariance +representation. + +
+
+
+
+
+ + ☆ Detection Defenses: An Empty Promise against Adversarial Patch Attacks + on Optical Flow WACV 2024 + + +
+ Adversarial patches undermine the reliability of optical flow predictions +when placed in arbitrary scene locations. Therefore, they pose a realistic +threat to real-world motion detection and its downstream applications. +Potential remedies are defense strategies that detect and remove adversarial +patches, but their influence on the underlying motion prediction has not been +investigated. In this paper, we thoroughly examine the currently available +detect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art +optical flow methods, and illuminate their side effects on the quality and +robustness of the final flow predictions. In particular, we implement +defense-aware attacks to investigate whether current defenses are able to +withstand attacks that take the defense mechanism into account. Our experiments +yield two surprising results: Detect-and-remove defenses do not only lower the +optical flow quality on benign scenes, in doing so, they also harm the +robustness under patch attacks for all tested optical flow methods except +FlowNetC. As currently employed detect-and-remove defenses fail to deliver the +promised adversarial robustness for optical flow, they evoke a false sense of +security. The code is available at +https://github.com/cv-stuttgart/DetectionDefenses. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ☆ Enhancing Graph Neural Networks with Structure-Based Prompt + + +
+ Graph Neural Networks (GNNs) are powerful in learning semantics of graph +data. Recently, a new paradigm "pre-train, prompt" has shown promising results +in adapting GNNs to various tasks with less supervised data. The success of +such a paradigm can be attributed to the more consistent objectives of +pre-training and task-oriented prompt tuning, where the pre-trained knowledge +can be effectively transferred to downstream tasks. However, an overlooked +issue in existing studies is that the structure information of the graph is usually +exploited during pre-training for learning node representations, while +neglected in the prompt tuning stage for learning task-specific parameters. To +bridge this gap, we propose a novel structure-based prompting method for GNNs, +namely SAP, which consistently exploits structure information in both +pre-training and prompt tuning stages. In particular, SAP 1) employs +dual-view contrastive learning to align the latent semantic spaces of node +attributes and graph structure, and 2) incorporates structure information in +the prompted graph to elicit more pre-trained knowledge in prompt tuning. We +conduct extensive experiments on node classification and graph classification +tasks to show the effectiveness of SAP. Moreover, we show that SAP can lead to +better performance in more challenging few-shot scenarios on both homophilous +and heterophilous graphs. + +
+
+
+
+
+ + ☆ A Challenge in Reweighting Data with Bilevel Optimization + + +
+ In many scenarios, one uses a large training set to train a model with the +goal of performing well on a smaller testing set with a different distribution. +Learning a weight for each data point of the training set is an appealing +solution, as it ideally allows one to automatically learn the importance of +each training point for generalization on the testing set. This task is usually +formalized as a bilevel optimization problem. Classical bilevel solvers are +based on a warm-start strategy where both the parameters of the models and the +data weights are learned at the same time. We show that this joint dynamic may +lead to sub-optimal solutions, for which the final data weights are very +sparse. This finding illustrates the difficulty of data reweighting and offers +a clue as to why this method is rarely used in practice. + +
+
+
+
+
+ + ☆ Multitask Online Learning: Listen to the Neighborhood Buzz + + +
+ We study multitask online learning in a setting where agents can only +exchange information with their neighbors on an arbitrary communication +network. We introduce $\texttt{MT-CO}_2\texttt{OL}$, a decentralized algorithm +for this setting whose regret depends on the interplay between the task +similarities and the network structure. Our analysis shows that the regret of +$\texttt{MT-CO}_2\texttt{OL}$ is never worse (up to constants) than the bound +obtained when agents do not share information. On the other hand, our bounds +significantly improve when neighboring agents operate on similar tasks. In +addition, we prove that our algorithm can be made differentially private with a +negligible impact on the regret when the losses are linear. Finally, we provide +experimental support for our theory. + +
+
+
+
+
+ + ☆ On the recognition of the game type based on physiological signals and + eye tracking + + +
+ Automated interpretation of signals yields many impressive applications in +the areas of affective computing and human activity recognition (HAR). In this +paper we ask whether cognitive activity can be recognized +on the basis of a particular set of signals. We use recognition of the game played +by the participant as a playground for exploring the problem. We build +a classifier of three different games (Space Invaders, Tetris, Tower Defence) and +the inter-game pause. We validate the classifier in player-independent and +player-dependent scenarios. We discuss the improvement in the player-dependent +scenario in the context of biometric person recognition. On the basis of the +results obtained in game classification, we consider potential applications in +smart surveillance and quantified self. + +
+
+ comment: 5 pages, 3 figures, extended version of ESM paper +
+
+
+
+
+ + ☆ Optimization dependent generalization bound for ReLU networks based on + sensitivity in the tangent bundle NeurIPS 2023 + + +
+ Recent advances in deep learning have given us some very promising results on +the generalization ability of deep neural networks; however, the literature still +lacks a comprehensive theory explaining why heavily over-parametrized models +are able to generalize well while fitting the training data. In this paper we +propose a PAC type bound on the generalization error of feedforward ReLU +networks via estimating the Rademacher complexity of the set of networks +available from an initial parameter vector via gradient descent. The key idea +is to bound the sensitivity of the network's gradient to perturbation of the +input data along the optimization trajectory. The obtained bound does not +explicitly depend on the depth of the network. Our results are experimentally +verified on the MNIST and CIFAR-10 datasets. + +
+
+ comment: 17 pages, 5 figures, OPT2023: 15th Annual Workshop on Optimization + for Machine Learning at the 37th NeurIPS 2023, New Orleans, LA, USA +
+
+
+
+
+ + ☆ Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal + Graph Learning + + +
+ Spatio-temporal graph learning is a fundamental problem in the Web of Things +era, which enables a plethora of Web applications such as smart cities, human +mobility and climate analysis. Existing approaches tackle different learning +tasks independently, tailoring their models to unique task characteristics. +These methods, however, fall short of modeling intrinsic uncertainties in the +spatio-temporal data. Meanwhile, their specialized designs limit their +universality as general spatio-temporal learning solutions. In this paper, we +propose to model the learning tasks in a unified perspective, viewing them as +predictions based on conditional information with shared spatio-temporal +patterns. Based on this proposal, we introduce Unified Spatio-Temporal +Diffusion Models (USTD) to address the tasks uniformly within the +uncertainty-aware diffusion framework. USTD is holistically designed, +comprising a shared spatio-temporal encoder and attention-based denoising +networks that are task-specific. The shared encoder, optimized by a +pre-training strategy, effectively captures conditional spatio-temporal +patterns. The denoising networks, utilizing both cross- and self-attention, +integrate conditional dependencies and generate predictions. Opting for +forecasting and kriging as downstream tasks, we design Spatial Gated Attention (SGA) +and Temporal Gated Attention (TGA) for each task, with different emphases on +the spatial and temporal dimensions, respectively. By combining the advantages +of deterministic encoders and probabilistic diffusion models, USTD achieves +state-of-the-art performances compared to deterministic and probabilistic +baselines in both tasks, while also providing valuable uncertainty estimates. + +
+
+
+
+
+ + ☆ Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning + + +
+ Ahead-of-time forecasting of the output power of power plants is essential +for the stability of the electricity grid and ensuring uninterrupted service. +However, forecasting renewable energy sources is difficult due to the chaotic +behavior of natural energy sources. This paper presents a new approach to +estimate short-term solar irradiance from sky images. The proposed algorithm +extracts features from sky images and uses learning-based techniques to estimate +the solar irradiance. The performance of the proposed machine learning (ML) +algorithm is evaluated using two publicly available datasets of sky images. +The datasets contain over 350,000 images for an interval of 16 years, from 2004 +to 2020, with the corresponding global horizontal irradiance (GHI) of each +image as the ground truth. Compared to the state-of-the-art computationally +heavy algorithms proposed in the literature, our approach achieves competitive +results with much less computational complexity for both nowcasting and +forecasting up to 4 h ahead of time. + +
+
+ comment: Published in MDPI Electronics Journal +
+
+
+
+
+ + ☆ Exploring the Trie of Rules: a fast data structure for the + representation of association rules + + +
+ Association rule mining techniques can generate a large volume of sequential +data when implemented on transactional databases. Extracting insights from a +large set of association rules has been found to be a challenging process. When +examining a ruleset, the fundamental question is how to summarise and represent +meaningful mined knowledge efficiently. Many algorithms and strategies have +been developed to address the issue of knowledge extraction; however, the +effectiveness of this process can be limited by the data structures. A better +data structure can significantly affect the speed of the knowledge extraction +process. This paper proposes a novel data structure, called the Trie of rules, +for storing a ruleset that is generated by association rule mining. The +resulting data structure is a prefix-tree graph structure made of pre-mined +rules. This graph stores the rules as paths within the prefix-tree in a way +that similar rules overlay each other. Each node in the tree represents a rule +where a consequent is this node, and an antecedent is a path from this node to +the root of the tree. The evaluation showed that the proposed representation +technique is promising. It compresses a ruleset with almost no data loss and +benefits in terms of time for basic operations such as searching for a specific +rule and sorting, which is the base for many knowledge discovery methods. +Moreover, our method demonstrated a significant improvement in traversing time, +achieving an 8-fold increase compared to traditional data structures. + +
+
+ comment: 12 pages, 13 figures, preprint of journal article +
+
+
+
+
+ + ☆ De-novo Chemical Reaction Generation by Means of Temporarily + Convolutional Neural Networks + + +
+ We present here a combination of two networks, Recurrent Neural Networks +(RNN) and Temporarily Convolutional Neural Networks (TCN) in de novo reaction +generation using the novel Reaction Smiles-like representation of reactions +(CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks +are known for their autoregressive properties and are frequently used in +language modelling with direct application to SMILES generation. The relatively +novel TCNs possess similar properties with wide receptive field while obeying +the causality required for natural language processing (NLP). The combination +of both latent representations expressed through TCN and RNN results in an +overall better performance compared to RNN alone. Additionally, it is shown +that different fine-tuning protocols have a profound impact on generative scope +of the model when applied on a dataset of interest via transfer learning. + +
+
+
+
+
+ + ☆ A multi-artifact EEG denoising by frequency-based deep learning + + +
+ Electroencephalographic (EEG) signals are fundamental to neuroscience +research and clinical applications such as brain-computer interfaces and +neurological disorder diagnosis. These signals are typically a combination of +neurological activity and noise, originating from various sources, including +physiological artifacts like ocular and muscular movements. Under this setting, +we tackle the challenge of distinguishing neurological activity from +noise-related sources. We develop a novel EEG denoising model that operates in +the frequency domain, leveraging prior knowledge about noise spectral features +to adaptively compute optimal convolutional filters for noise separation. The +model is trained to learn an empirical relationship connecting the spectral +characteristics of noise and noisy signal to a non-linear transformation which +allows signal denoising. Performance evaluation on the EEGdenoiseNet dataset +shows that the proposed model achieves optimal results according to both +temporal and spectral metrics. The model is found to remove physiological +artifacts from input EEG data, thus achieving effective EEG denoising. Indeed, +the model performance either matches or outperforms that achieved by benchmark +models, proving to effectively remove both muscle and ocular artifacts without +the need to perform any training on the particular type of artifact. + +
+
+ comment: Accepted at the Italian Workshop on Artificial Intelligence for + Human-Machine Interaction (AIxHMI 2023), November 06, 2023, Rome, Italy +
+
+
+
+
+ + ☆ On Forecast Stability + + +
+ Forecasts are typically not produced in a vacuum but in a business context, +where forecasts are generated on a regular basis and interact with each other. +For decisions, it may be important that forecasts do not change arbitrarily, +and are stable in some sense. However, this area has received only limited +attention in the forecasting literature. In this paper, we explore two types of +forecast stability that we call vertical stability and horizontal stability. +The existing works in the literature are only applicable to certain base models +and extending these frameworks to be compatible with any base model is not +straightforward. Furthermore, these frameworks can only stabilise the forecasts +vertically. To fill this gap, we propose a simple linear-interpolation-based +approach that is applicable to stabilise the forecasts provided by any base +model vertically and horizontally. The approach can produce both accurate and +stable forecasts. Using N-BEATS, Pooled Regression and LightGBM as the base +models, in our evaluation on four publicly available datasets, the proposed +framework is able to achieve significantly higher stability and/or accuracy +compared to a set of benchmarks including a state-of-the-art forecast +stabilisation method across three error metrics and six stability metrics. + +
+
+
+
+
+ + ☆ CQM: Curriculum Reinforcement Learning with a Quantized World Model NeurIPS 2023 + + +
+ Recent curriculum Reinforcement Learning (RL) has shown notable progress in +solving complex tasks by proposing sequences of surrogate tasks. However, the +previous approaches often face challenges when they generate curriculum goals +in a high-dimensional space. Thus, they usually rely on manually specified goal +spaces. To alleviate this limitation and improve the scalability of the +curriculum, we propose a novel curriculum method that automatically defines the +semantic goal space which contains vital information for the curriculum +process, and suggests curriculum goals over it. To define the semantic goal +space, our method discretizes continuous observations via vector +quantized-variational autoencoders (VQ-VAE) and restores the temporal relations +between the discretized observations by a graph. Concurrently, ours suggests +uncertainty and temporal distance-aware curriculum goals that converge to the +final goals over the automatically composed goal space. We demonstrate that the +proposed method allows efficient explorations in an uninformed environment with +raw goal examples only. Also, ours outperforms the state-of-the-art curriculum +RL methods on data efficiency and performance, in various goal-reaching tasks +even with ego-centric visual inputs. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ C-Disentanglement: Discovering Causally-Independent Generative Factors + under an Inductive Bias of Confounder + + +
+ Representation learning assumes that real-world data is generated by a few +semantically meaningful generative factors (i.e., sources of variation) and +aims to discover them in the latent space. These factors are expected to be +causally disentangled, meaning that distinct factors are encoded into separate +latent variables, and changes in one factor will not affect the values of the +others. Compared to statistical independence, causal disentanglement allows +more controllable data generation, improved robustness, and better +generalization. However, most existing work assumes unconfoundedness in the +discovery process, i.e., that there are no common causes of the generative factors, +and thus obtains only statistical independence. In this paper, we recognize the +importance of modeling confounders in discovering causal generative factors. +Unfortunately, such factors are not identifiable without proper inductive bias. +We fill the gap by introducing a framework entitled Confounded-Disentanglement +(C-Disentanglement), the first framework that explicitly introduces the +inductive bias of confounder via labels from domain expertise. In addition, we +accordingly propose an approach to sufficiently identify the causally +disentangled factors under any inductive bias of the confounder. We conduct +extensive experiments on both synthetic and real-world datasets. Our method +demonstrates competitive results compared to various SOTA baselines in +obtaining causally disentangled features and downstream tasks under domain +shifts. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Demonstration-Regularized RL + + +
+ Incorporating expert demonstrations has empirically helped to improve the +sample efficiency of reinforcement learning (RL). This paper quantifies +theoretically to what extent this extra information reduces RL's sample +complexity. In particular, we study the demonstration-regularized reinforcement +learning that leverages the expert demonstrations by KL-regularization for a +policy learned by behavior cloning. Our findings reveal that using +$N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal +policy at a sample complexity of order +$\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ +in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 +N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is +the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number +of states in the finite case and $d$ the dimension of the feature space in the +linear case. As a by-product, we provide tight convergence guarantees for the +behaviour cloning procedure under general assumptions on the policy classes. +Additionally, we establish that demonstration-regularized methods are provably +efficient for reinforcement learning from human feedback (RLHF). In this +respect, we provide theoretical evidence showing the benefits of +KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid +pessimism injection by employing computationally feasible regularization to +handle reward estimation uncertainty, thus setting our approach apart from the +prior works. + +
+
+
+
+
+ + ☆ BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point + Clouds 3DV 2024 + + +
+ We present a surprisingly simple and efficient method for self-supervision of +3D backbones on automotive Lidar point clouds. We design a contrastive loss +between features of Lidar scans captured in the same scene. Several such +approaches have been proposed in the literature, from PointContrast, which uses +a contrast at the level of points, to the state-of-the-art TARL, which uses a +contrast at the level of segments, roughly corresponding to objects. While the +former enjoys a great simplicity of implementation, it is surpassed by the +latter, which however requires a costly pre-processing. In BEVContrast, we +define our contrast at the level of 2D cells in the Bird's Eye View plane. +Resulting cell-level representations offer a good trade-off between the +point-level representations exploited in PointContrast and segment-level +representations exploited in TARL: we retain the simplicity of PointContrast +(cell representations are cheap to compute) while surpassing the performance of +TARL in downstream semantic segmentation. + +
+
+ comment: Accepted to 3DV 2024 +
+
+
+
+
+ + ☆ Looping in the Human: Collaborative and Explainable Bayesian + Optimization + + +
+ Like many optimizers, Bayesian optimization often falls short of gaining user +trust due to opacity. While attempts have been made to develop human-centric +optimizers, they typically assume user knowledge is well-specified and +error-free, employing users mainly as supervisors of the optimization process. +We relax these assumptions and propose a more balanced human-AI partnership +with our Collaborative and Explainable Bayesian Optimization (CoExBO) +framework. Instead of explicitly requiring a user to provide a knowledge model, +CoExBO employs preference learning to seamlessly integrate human insights into +the optimization, resulting in algorithmic suggestions that resonate with user +preference. CoExBO explains its candidate selection every iteration to foster +trust, empowering users with a clearer grasp of the optimization. Furthermore, +CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with +extreme adversarial interventions, the algorithm converges asymptotically to a +vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI +teaming experiments in lithium-ion battery design, highlighting substantial +improvements over conventional methods. + +
+
+ comment: 22 pages, 9 figures +
+
+
+
+
+ + ☆ Variance of ML-based software fault predictors: are we really improving + fault prediction? + + +
+ Software quality assurance activities become increasingly difficult as +software systems become more and more complex and continuously grow in size. +Moreover, testing becomes even more expensive when dealing with large-scale +systems. Thus, to effectively allocate quality assurance resources, researchers +have proposed fault prediction (FP) which utilizes machine learning (ML) to +predict fault-prone code areas. However, ML algorithms typically make use of +stochastic elements to increase the prediction models' generalizability and +efficiency of the training process. These stochastic elements, also known as +nondeterminism-introducing (NI) factors, lead to variance in the training +process and as a result, lead to variance in prediction accuracy and training +time. This variance poses a challenge for reproducibility in research. More +importantly, while fault prediction models may have shown good performance in +the lab (e.g., oftentimes involving multiple runs and averaging outcomes), +high variance of results can pose the risk that these models show low +performance when applied in practice. In this work, we experimentally analyze +the variance of a state-of-the-art fault prediction approach. Our experimental +results indicate that NI factors can indeed cause considerable variance in the +fault prediction models' accuracy. We observed a maximum variance of 10.10% in +terms of the per-class accuracy metric. We thus also discuss how to deal with +such variance. + +
+
+
+
+
+ + ☆ fairret: a Framework for Differentiable Fairness Regularization Terms + + +
+ Current tools for machine learning fairness only admit a limited range of +fairness definitions and have seen little integration with automatic +differentiation libraries, despite the central role these libraries play in +modern machine learning pipelines. + We introduce a framework of fairness regularization terms (fairrets) which +quantify bias as modular objectives that are easily integrated in automatic +differentiation pipelines. By employing a general definition of fairness in +terms of linear-fractional statistics, a wide class of fairrets can be computed +efficiently. Experiments show the behavior of their gradients and their utility +in enforcing fairness with minimal loss of predictive power compared to +baselines. Our contribution includes a PyTorch implementation of the fairret +framework. + +
+
+
+
+
+ + ☆ IDENAS: Internal Dependency Exploration for Neural Architecture Search + + +
+ Machine learning is a powerful tool for extracting valuable information and +making various predictions from diverse datasets. Traditional algorithms rely +on well-defined input and output variables; however, there are scenarios where +the distinction between the input and output variables and the underlying, +associated (input and output) layers of the model is unknown. Neural +Architecture Search (NAS) and Feature Selection have emerged as promising +solutions in such scenarios. This research proposes IDENAS, an Internal +Dependency-based Exploration for Neural Architecture Search, integrating NAS +with feature selection. The methodology explores internal dependencies in the +complete parameter space for classification involving 1D sensor and 2D image +data as well. IDENAS employs a modified encoder-decoder model and the +Sequential Forward Search (SFS) algorithm, combining input-output configuration +search with embedded feature selection. Experimental results demonstrate +IDENAS's superior performance in comparison to other algorithms, showcasing its +effectiveness in model development pipelines and automated machine learning. On +average, IDENAS achieved significant modelling improvements, underscoring its +significant contribution to advancing the state-of-the-art in neural +architecture search and feature selection integration. + +
+
+ comment: 57 pages, 19 figures + appendix, the related software code can be + found under the link: https://github.com/viharoszsolt/IDENAS +
+
+
+
+
+ + ☆ Grokking Beyond Neural Networks: An Empirical Exploration with Model + Complexity + + +
+ In some settings neural networks exhibit a phenomenon known as grokking, +where they achieve perfect or near-perfect accuracy on the validation set long +after the same performance has been achieved on the training set. In this +paper, we discover that grokking is not limited to neural networks but occurs +in other settings such as Gaussian process (GP) classification, GP regression +and linear regression. We also uncover a mechanism by which to induce grokking +on algorithmic datasets via the addition of dimensions containing spurious +information. The presence of the phenomenon in non-neural architectures +provides evidence that grokking is not specific to SGD or weight norm +regularisation. Instead, grokking may be possible in any setting where solution +search is guided by complexity and error. Based on this insight and further +trends we see in the training trajectories of a Bayesian neural network (BNN) +and GP regression model, we make progress towards a more general theory of +grokking. Specifically, we hypothesise that the phenomenon is governed by the +accessibility of certain regions in the error and complexity landscapes. + +
+
+
+
+
+ + ☆ CROP: Conservative Reward for Model-based Offline Policy Optimization + + +
+ Offline reinforcement learning (RL) aims to optimize policy using collected +data without online interactions. Model-based approaches are particularly +appealing for addressing offline RL challenges due to their capability to +mitigate the limitations of offline data through data generation using models. +Prior research has demonstrated that introducing conservatism into the model or +Q-function during policy optimization can effectively alleviate the prevalent +distribution drift problem in offline RL. However, the investigation into the +impacts of conservatism in reward estimation is still lacking. This paper +proposes a novel model-based offline RL algorithm, Conservative Reward for +model-based Offline Policy optimization (CROP), which conservatively estimates +the reward in model training. To achieve a conservative reward estimation, CROP +simultaneously minimizes the estimation error and the reward of random actions. +Theoretical analysis shows that this conservative reward mechanism leads to a +conservative policy evaluation and helps mitigate distribution drift. +Experiments on D4RL benchmarks showcase that the performance of CROP is +comparable to the state-of-the-art baselines. Notably, CROP establishes an +innovative connection between offline and online RL, highlighting that offline +RL problems can be tackled by applying online RL techniques to the empirical +Markov decision process trained with a conservative reward. The source code is +available at https://github.com/G0K0URURI/CROP.git. + +
+
+
+
+
+ + ☆ Joint Entity and Relation Extraction with Span Pruning and Hypergraph + Neural Networks EMNLP + + +
+ Entity and Relation Extraction (ERE) is an important task in information +extraction. Recent marker-based pipeline models achieve state-of-the-art +performance, but still suffer from the error propagation issue. Also, most +current ERE models do not take into account higher-order interactions between +multiple entities and relations, while higher-order modeling could be +beneficial. In this work, we propose HyperGraph neural network for ERE +($\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based +pipeline model). To alleviate error propagation, we use a high-recall pruner +mechanism to transfer the burden of entity identification and labeling from the +NER module to the joint module of our model. For higher-order modeling, we +build a hypergraph, where nodes are entities (provided by the span pruner) and +relations thereof, and hyperedges encode interactions between two different +relations or between a relation and its associated subject and object entities. +We then run a hypergraph neural network for higher-order inference by applying +message passing over the built hypergraph. Experiments on three widely used +benchmarks (\acef{}, \ace{} and \scierc{}) for the ERE task show significant +improvements over the previous state-of-the-art PL-marker. + +
+
+ comment: Accepted to Proceedings of EMNLP, 2023 +
+
+
+
+
+ + ☆ Codebook Features: Sparse and Discrete Interpretability for Neural + Networks + + +
+ Understanding neural networks is challenging in part because of the dense, +continuous nature of their hidden states. We explore whether we can train +neural networks to have hidden states that are sparse, discrete, and more +interpretable by quantizing their continuous features into what we call +codebook features. Codebook features are produced by finetuning neural networks +with vector quantization bottlenecks at each layer, producing a network whose +hidden features are the sum of a small number of discrete vector codes chosen +from a larger codebook. Surprisingly, we find that neural networks can operate +under this extreme bottleneck with only modest degradation in performance. This +sparse, discrete bottleneck also provides an intuitive way of controlling +neural network behavior: first, find codes that activate when the desired +behavior is present, then activate those same codes during generation to elicit +that behavior. We validate our approach by training codebook Transformers on +several different datasets. First, we explore a finite state machine dataset +with far more hidden states than neurons. In this setting, our approach +overcomes the superposition problem by assigning states to distinct codes, and +we find that we can make the neural network behave as if it is in a different +state by activating the code for that state. Second, we train Transformer +language models with up to 410M parameters on two natural language datasets. We +identify codes in these models representing diverse, disentangled concepts +(ranging from negative emotions to months of the year) and find that we can +guide the model to generate different topics by activating the appropriate +codes during inference. Overall, codebook features appear to be a promising +unit of analysis and control for neural networks and interpretability. Our +codebase and models are open-sourced at +https://github.com/taufeeque9/codebook-features. + +
+
+
+
+
+ + ☆ Beyond MLE: Convex Learning for Text Generation NeurIPS 2023 + + +
+ Maximum likelihood estimation (MLE) is a statistical method used to estimate +the parameters of a probability distribution that best explain the observed +data. In the context of text generation, MLE is often used to train generative +language models, which can then be used to generate new text. However, we argue +that MLE is not always necessary and optimal, especially for closed-ended text +generation tasks like machine translation. In these tasks, the goal of the model is +to generate the most appropriate response, which does not necessarily require +it to estimate the entire data distribution with MLE. To this end, we propose a +novel class of training objectives based on convex functions, which enables +text generation models to focus on highly probable outputs without having to +estimate the entire data distribution. We investigate the theoretical +properties of the optimal predicted distribution when applying convex functions +to the loss, demonstrating that convex functions can sharpen the optimal +distribution, thereby enabling the model to better capture outputs with high +probabilities. Experiments on various text generation tasks and models show the +effectiveness of our approach. It enables autoregressive models to bridge the +gap between greedy and beam search, and facilitates the learning of +non-autoregressive models with a maximum improvement of 9+ BLEU points. +Moreover, our approach also exhibits significant impact on large language +models (LLMs), substantially enhancing their generative capability on various +tasks. Source code is available at +https://github.com/ictnlp/Convex-Learning. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Weakly-Supervised Surgical Phase Recognition + + +
+ A key element of computer-assisted surgery systems is phase recognition of +surgical videos. Existing phase recognition algorithms require frame-wise +annotation of a large number of videos, which is costly in both time and money. In +this work we join concepts of graph segmentation with self-supervised learning +to derive a random-walk solution for per-frame phase prediction. Furthermore, +we utilize within our method two forms of weak supervision: sparse timestamps +or few-shot learning. The proposed algorithm enjoys low complexity and can +operate in low-data regimes. We validate our method by running experiments with +the public Cholec80 dataset of laparoscopic cholecystectomy videos, +demonstrating promising performance in multiple setups. + +
+
+
+
+
+ + ☆ miditok: A Python package for MIDI file tokenization + + +
+ Recent progress in natural language processing has been adapted to the +symbolic music modality. Language models, such as Transformers, have been used +with symbolic music for a variety of tasks, including music generation, +modeling and transcription, with state-of-the-art performances. These models are +beginning to be used in production products. To encode and decode music for the +backbone model, they need to rely on tokenizers, whose role is to serialize +music into sequences of distinct elements called tokens. MidiTok is an +open-source library that tokenizes symbolic music with great flexibility +and extended features. It features the most popular music tokenizations, under +a unified API. It is made to be easily used and extensible for everyone. + +
+
+ comment: Updated and comprehensive report. Original ISMIR 2021 document at + https://archives.ismir.net/ismir2021/latebreaking/000005.pdf +
+
+
+
+
+ + ☆ Taming Gradient Variance in Federated Learning with Networked Control + Variates + + +
+ Federated learning, a decentralized approach to machine learning, faces +significant challenges such as extensive communication overheads, slow +convergence, and unstable improvements. These challenges primarily stem from +the gradient variance due to heterogeneous client data distributions. To +address this, we introduce a novel Networked Control Variates (FedNCV) +framework for Federated Learning. We adopt the REINFORCE Leave-One-Out (RLOO) +as a fundamental control variate unit in the FedNCV framework, implemented at +both client and server levels. At the client level, the RLOO control variate is +employed to optimize local gradient updates, mitigating the variance introduced +by data samples. Once relayed to the server, the RLOO-based estimator further +provides an unbiased and low-variance aggregated gradient, leading to robust +global updates. This dual-side application is formalized as a linear +combination of composite control variates. We provide a mathematical expression +capturing this integration of double control variates within FedNCV and present +three theoretical results with corresponding proofs. This unique dual structure +equips FedNCV to address data heterogeneity and scalability issues, thus +potentially paving the way for large-scale applications. Moreover, we tested +FedNCV on six diverse datasets under a Dirichlet distribution with $\alpha$ = +0.1, and benchmarked its performance against six SOTA methods, demonstrating +its superiority. + +
+
+ comment: 14 pages +
+
+
+
+
+ + ☆ How do Language Models Bind Entities in Context? + + +
+ To correctly use in-context information, language models (LMs) must bind +entities to their attributes. For example, given a context describing a "green +square" and a "blue circle", LMs must bind the shapes to their respective +colors. We analyze LM representations and identify the binding ID mechanism: a +general mechanism for solving the binding problem, which we observe in every +sufficiently large model from the Pythia and LLaMA families. Using causal +interventions, we show that LMs' internal activations represent binding +information by attaching binding ID vectors to corresponding entities and +attributes. We further show that binding ID vectors form a continuous subspace, +in which distances between binding ID vectors reflect their discernability. +Overall, our results uncover interpretable strategies in LMs for representing +symbolic knowledge in-context, providing a step towards understanding general +in-context reasoning in large-scale LMs. + +
+
+
+
+
+ + ♻ ☆ TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023 + + +
+ Deep learning (DL) models for tabular data problems (e.g. classification, +regression) are currently receiving increasingly more attention from +researchers. However, despite the recent efforts, the non-DL algorithms based +on gradient-boosted decision trees (GBDT) remain a strong go-to solution for +these problems. One of the research directions aimed at improving the position +of tabular DL involves designing so-called retrieval-augmented models. For a +target object, such models retrieve other objects (e.g. the nearest neighbors) +from the available training data and use their features and labels to make a +better prediction. + In this work, we present TabR -- essentially, a feed-forward network with a +custom k-Nearest-Neighbors-like component in the middle. On a set of public +benchmarks with datasets up to several million objects, TabR marks a big step +forward for tabular DL: it demonstrates the best average performance among +tabular DL models, becomes the new state-of-the-art on several datasets, and +even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark +(see Figure 1). Among the important findings and technical details powering +TabR, the main ones lie in the attention-like mechanism that is responsible for +retrieving the nearest neighbors and extracting valuable signal from them. In +addition to the much higher performance, TabR is simple and significantly more +efficient compared to prior retrieval-based tabular DL models. + +
+
+ comment: Code: https://github.com/yandex-research/tabular-dl-tabr +
+
+
+
+
+ + ♻ ☆ Online Estimation and Community Detection of Network Point Processes for + Event Streams + + +
+ A common goal in network modeling is to uncover the latent community +structure present among nodes. For many real-world networks, the true +connections consist of events arriving as streams, which are then aggregated to +form edges, ignoring the dynamic temporal component. A natural way to take +account of these temporal dynamics of interactions is to use point processes as +the foundation of network models for community detection. Computational +complexity hampers the scalability of such approaches to large sparse networks. +To circumvent this challenge, we propose a fast online variational inference +algorithm for estimating the latent structure underlying dynamic event arrivals +on a network, using continuous-time point process latent network models. We +describe this procedure for network models capturing community structure. This +structure can be learned as new events are observed on the network, updating +the inferred community assignments. We investigate the theoretical properties +of such an inference scheme, and provide regret bounds on the loss function of +this procedure. The proposed inference procedure is then thoroughly compared, +using both simulation studies and real data, to non-online variants. We +demonstrate that online inference can obtain comparable performance, in terms +of community recovery, to non-online variants, while realising computational +gains. Our proposed inference framework can also be readily modified to +incorporate other popular network structures. + +
+
+ comment: 45 pages +
+
+
+
+
+ + ♻ ☆ Variance Reduced Halpern Iteration for Finite-Sum Monotone Inclusions + + +
+ Machine learning approaches relying on such criteria as adversarial +robustness or multi-agent settings have raised the need for solving +game-theoretic equilibrium problems. Of particular relevance to these +applications are methods targeting finite-sum structure, which generically +arises in empirical variants of learning problems in these contexts. Further, +methods with computable approximation errors are highly desirable, as they +provide verifiable exit criteria. Motivated by these applications, we study +finite-sum monotone inclusion problems, which model broad classes of +equilibrium problems. Our main contributions are variants of the classical +Halpern iteration that employ variance reduction to obtain improved complexity +guarantees in which $n$ component operators in the finite sum are ``on +average'' either cocoercive or Lipschitz continuous and monotone, with +parameter $L$. The resulting oracle complexity of our methods, which provide +guarantees for the last iterate and for a (computable) operator norm residual, +is $\widetilde{\mathcal{O}}( n + \sqrt{n}L\varepsilon^{-1})$, which improves +upon existing methods by a factor up to $\sqrt{n}$. This constitutes the first +variance reduction-type result for general finite-sum monotone inclusions and +for more specific problems such as convex-concave optimization when operator +norm residual is the optimality measure. We further argue that, up to +poly-logarithmic factors, this complexity is unimprovable in the monotone +Lipschitz setting; i.e., the provided result is near-optimal. + +
+
+
+
+
+ + ♻ ☆ CEIL: Generalized Contextual Imitation Learning NeurIPS 2023 + + +
+ In this paper, we present \textbf{C}ont\textbf{E}xtual \textbf{I}mitation +\textbf{L}earning~(CEIL), a general and broadly applicable algorithm for +imitation learning (IL). Inspired by the formulation of hindsight information +matching, we derive CEIL by explicitly learning a hindsight embedding function +together with a contextual policy using the hindsight embeddings. To achieve +the expert matching objective for IL, we advocate for optimizing a contextual +variable such that it biases the contextual policy towards mimicking expert +behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL +is a generalist that can be effectively applied to multiple settings including: +1)~learning from observations (LfO), 2)~offline IL, 3)~cross-domain IL +(mismatched experts), and 4) one-shot IL settings. Empirically, we evaluate +CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). +Compared to prior state-of-the-art baselines, we show that CEIL is more +sample-efficient in most online IL tasks and achieves better or competitive +performances in offline tasks. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Gaussian Membership Inference Privacy NeurIPS 2023 + + +
+ We propose a novel and practical privacy notion called $f$-Membership +Inference Privacy ($f$-MIP), which explicitly considers the capabilities of +realistic adversaries under the membership inference attack threat model. +Consequently, $f$-MIP offers interpretable privacy guarantees and improved +utility (e.g., better classification accuracy). In particular, we derive a +parametric family of $f$-MIP guarantees that we refer to as $\mu$-Gaussian +Membership Inference Privacy ($\mu$-GMIP) by theoretically analyzing likelihood +ratio-based membership inference attacks on stochastic gradient descent (SGD). +Our analysis highlights that models trained with standard SGD already offer an +elementary level of MIP. Additionally, we show how $f$-MIP can be amplified by +adding noise to gradient updates. Our analysis further yields an analytical +membership inference attack that offers two distinct advantages over previous +approaches. First, unlike existing state-of-the-art attacks that require +training hundreds of shadow models, our attack does not require any shadow +model. Second, our analytical attack enables straightforward auditing of our +privacy notion $f$-MIP. Finally, we quantify how various hyperparameters (e.g., +batch size, number of model parameters) and specific data characteristics +determine an attacker's ability to accurately infer a point's membership in the +training set. We demonstrate the effectiveness of our method on models trained +on vision and tabular datasets. + +
+
+ comment: NeurIPS 2023 camera-ready. The first two authors contributed equally +
+
+
+
+
+ + ♻ ☆ Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL + with General Regularizers and Multiple Optimal Arms NeurIPS 2023 + + +
+ We study the problem of designing adaptive multi-armed bandit algorithms that +perform optimally in both the stochastic setting and the adversarial setting +simultaneously (often known as a best-of-both-world guarantee). A line of +recent works shows that when configured and analyzed properly, the +Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the +adversarial setting, can in fact optimally adapt to the stochastic setting as +well. Such results, however, critically rely on an assumption that there exists +one unique optimal arm. Recently, Ito (2021) took the first step to remove such +an undesirable uniqueness assumption for one particular FTRL algorithm with the +$\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly +improve and generalize this result, showing that uniqueness is unnecessary for +FTRL with a broad family of regularizers and a new learning rate schedule. For +some regularizers, our regret bounds also improve upon prior results even when +uniqueness holds. We further provide an application of our results to the +decoupled exploration and exploitation problem, demonstrating that our +techniques are broadly applicable. + +
+
+ comment: Update the camera-ready version for NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Multi-scale Diffusion Denoised Smoothing NeurIPS 2023 + + +
+ Along with recent diffusion models, randomized smoothing has become one of a +few tangible approaches that offer adversarial robustness to models at scale, +e.g., those of large pre-trained models. Specifically, one can perform +randomized smoothing on any classifier via a simple "denoise-and-classify" +pipeline, so-called denoised smoothing, given that an accurate denoiser is +available - such as a diffusion model. In this paper, we present scalable methods +to address the current trade-off between certified robustness and accuracy in +denoised smoothing. Our key idea is to "selectively" apply smoothing among +multiple noise scales, coined multi-scale smoothing, which can be efficiently +implemented with a single diffusion model. This approach also suggests a new +objective to compare the collective robustness of multi-scale smoothed +classifiers, and questions which representation of the diffusion model would +maximize the objective. To address this, we propose to further fine-tune the +diffusion model (a) to perform consistent denoising whenever the original image +is recoverable, but (b) to generate rather diverse outputs otherwise. Our +experiments show that the proposed multi-scale smoothing scheme combined with +diffusion fine-tuning enables strong certified robustness available with high +noise level while maintaining its accuracy closer to non-smoothed classifiers. + +
+
+ comment: Published as a conference paper at NeurIPS 2023; Code is available at + https://github.com/jh-jeong/smoothing-multiscale +
+
+
+
+
+ + ♻ ☆ No-Regret Online Reinforcement Learning with Adversarial Losses and + Transitions NeurIPS 2023 + + +
+ Existing online learning algorithms for adversarial Markov Decision Processes +achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the +loss functions are chosen arbitrarily by an adversary, with the caveat that the +transition function has to be fixed. This is because it has been shown that +adversarial transition functions make no-regret learning impossible. Despite +such impossibility results, in this work, we develop algorithms that can handle +both adversarial losses and adversarial transitions, with regret increasing +smoothly in the degree of maliciousness of the adversary. More concretely, we +first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + +C^{\textsf{P}})$ regret where $C^{\textsf{P}}$ measures how adversarial the +transition functions are and can be at most ${O}(T)$. While this algorithm +itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box +reduction approach that removes this requirement. Moreover, we also show that +further refinements of the algorithm not only maintain the same regret bound, +but also simultaneously adapt to easier environments (where losses are +generated in a certain stochastically constrained manner as in Jin et al. +[2021]) and achieve $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + +C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient +and $C^{\textsf{L}}$ is the amount of corruption on losses. + +
+
+ comment: Update the camera-ready version for NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Risk-Averse Model Uncertainty for Distributionally Robust Safe + Reinforcement Learning NeurIPS + 2023 + + +
+ Many real-world domains require safe decision making in uncertain +environments. In this work, we introduce a deep reinforcement learning +framework for approaching this important problem. We consider a distribution +over transition models, and apply a risk-averse perspective towards model +uncertainty through the use of coherent distortion risk measures. We provide +robustness guarantees for this framework by showing it is equivalent to a +specific class of distributionally robust safe reinforcement learning problems. +Unlike existing approaches to robustness in deep reinforcement learning, +however, our formulation does not involve minimax optimization. This leads to +an efficient, model-free implementation of our approach that only requires +standard data collection from a single training environment. In experiments on +continuous control tasks with safety constraints, we demonstrate that our +framework produces robust performance and safety at deployment time across a +range of perturbed test environments. + +
+
+ comment: 37th Conference on Neural Information Processing Systems (NeurIPS + 2023) +
+
+
+
+
+ + ♻ ☆ Detecting and Mitigating Hallucinations in Multilingual Summarisation EMNLP 2023 + + +
+ Hallucinations pose a significant challenge to the reliability of neural +models for abstractive summarisation. While automatically generated summaries +may be fluent, they often lack faithfulness to the original document. This +issue becomes even more pronounced in low-resource settings, such as +cross-lingual transfer. With the existing faithful metrics focusing on English, +even measuring the extent of this phenomenon in cross-lingual settings is hard. +To address this, we first develop a novel metric, mFACT, evaluating the +faithfulness of non-English summaries, leveraging translation-based transfer +from multiple English faithfulness metrics. We then propose a simple but +effective method to reduce hallucinations with a cross-lingual transfer, which +weighs the loss of each training example by its faithfulness score. Through +extensive experiments in multiple languages, we demonstrate that mFACT is the +metric that is most suited to detect hallucinations. Moreover, we find that our +proposed loss weighting method drastically increases both performance and +faithfulness according to both automatic and human evaluation when compared to +strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset +are available at https://github.com/yfqiu-nlp/mfact-summ. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Optimal Scoring Rule Design under Partial Knowledge + + +
+ This paper studies the design of optimal proper scoring rules when the +principal has partial knowledge of an agent's signal distribution. Recent work +characterizes the proper scoring rules that maximize the increase of an agent's +payoff when the agent chooses to access a costly signal to refine a posterior +belief from her prior prediction, under the assumption that the agent's signal +distribution is fully known to the principal. In our setting, the principal +only knows about a set of distributions where the agent's signal distribution +belongs. We formulate the scoring rule design problem as a max-min optimization +that maximizes the worst-case increase in payoff across the set of +distributions. + We propose an efficient algorithm to compute an optimal scoring rule when the +set of distributions is finite, and devise a fully polynomial-time +approximation scheme that accommodates various infinite sets of distributions. +We further remark that widely used scoring rules, such as the quadratic and log +rules, as well as previously identified optimal scoring rules under full +knowledge, can be far from optimal in our partial knowledge settings. + +
+
+
+
+
+ + ♻ ☆ SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from + Diffusion Models NeurIPS 2023 + + +
+ A potent class of generative models known as Diffusion Probabilistic Models +(DPMs) has become prominent. A forward diffusion process gradually adds noise +to data, while a model learns to gradually denoise. Sampling from pre-trained +DPMs is obtained by solving differential equations (DE) defined by the learnt +model, a process which has been shown to be prohibitively slow. Numerous efforts on +speeding up this process have consisted of crafting powerful ODE solvers. +Despite being quick, such solvers do not usually reach the optimal quality +achieved by available slow SDE solvers. Our goal is to propose SDE solvers that +reach optimal quality without requiring several hundreds or thousands of NFEs +to achieve that goal. We propose Stochastic Explicit Exponential +Derivative-free Solvers (SEEDS), improving and generalizing Exponential +Integrator approaches to the stochastic case on several frameworks. After +carefully analyzing the formulation of exact solutions of diffusion SDEs, we +craft SEEDS to analytically compute the linear part of such solutions. Inspired +by the Exponential Time-Differencing method, SEEDS use a novel treatment of the +stochastic components of solutions, enabling the analytical computation of +their variance, and contain high-order terms, allowing them to reach optimal quality +sampling $\sim3$-$5\times$ faster than previous SDE methods. We validate our +approach on several image generation benchmarks, showing that SEEDS outperform +or are competitive with previous SDE solvers. Contrary to the latter, SEEDS are +derivative-free and training-free, and we fully prove strong convergence guarantees +for them. + +
+
+ comment: 60 pages. Camera-Ready version for the 37th Conference on Neural + Information Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ What Makes Data Suitable for a Locally Connected Neural Network? A + Necessary and Sufficient Condition Based on Quantum Entanglement NeurIPS 2023 + + +
+ The question of what makes a data distribution suitable for deep learning is +a fundamental open problem. Focusing on locally connected neural networks (a +prevalent family of architectures that includes convolutional and recurrent +neural networks as well as local self-attention models), we address this +problem by adopting theoretical tools from quantum physics. Our main +theoretical result states that a certain locally connected neural network is +capable of accurate prediction over a data distribution if and only if the data +distribution admits low quantum entanglement under certain canonical partitions +of features. As a practical application of this result, we derive a +preprocessing method for enhancing the suitability of a data distribution to +locally connected neural networks. Experiments with widespread models over +various datasets demonstrate our findings. We hope that our use of quantum +entanglement will encourage further adoption of tools from physics for formally +reasoning about the relation between deep learning and real-world data. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Spontaneous Symmetry Breaking in Generative Diffusion Models NeurIPS 2023 + + +
+ Generative diffusion models have recently emerged as a leading approach for +generating high-dimensional data. In this paper, we show that the dynamics of +these models exhibit a spontaneous symmetry breaking that divides the +generative dynamics into two distinct phases: 1) A linear steady-state dynamics +around a central fixed-point and 2) an attractor dynamics directed towards the +data manifold. These two "phases" are separated by the change in stability of +the central fixed-point, with the resulting window of instability being +responsible for the diversity of the generated samples. Using both theoretical +and empirical evidence, we show that an accurate simulation of the early +dynamics does not significantly contribute to the final generation, since early +fluctuations are reverted to the central fixed point. To leverage this insight, +we propose a Gaussian late initialization scheme, which significantly improves +model performance, achieving up to 3x FID improvements on fast samplers, while +also increasing sample diversity (e.g., racial composition of generated CelebA +images). Our work offers a new way to understand the generative dynamics of +diffusion models that has the potential to bring about higher performance and +less biased fast-samplers. + +
+
+ comment: As published at NeurIPS 2023, and the size of the file has been + optimized for fast downloading +
+
+
+
+
+ + ♻ ☆ Out-of-Distribution Detection in Time-Series Domain: A Novel Seasonal + Ratio Scoring Approach + + +
+ Safe deployment of time-series classifiers for real-world applications relies +on the ability to detect data that is not generated from the same +distribution as the training data. This task is referred to as out-of-distribution +(OOD) detection. We consider the novel problem of OOD detection for the +time-series domain. We discuss the unique challenges posed by time-series data +and explain why prior methods from the image domain will perform poorly. +Motivated by these challenges, this paper proposes a novel {\em Seasonal Ratio +Scoring (SRS)} approach. SRS consists of three key algorithmic steps. First, +each input is decomposed into a class-wise semantic component and a remainder. +Second, this decomposition is employed to estimate the class-wise conditional +likelihoods of the input and remainder using deep generative models. The +seasonal ratio score is computed from these estimates. Third, a threshold +interval is identified from the in-distribution data to detect OOD examples. +Experiments on diverse real-world benchmarks demonstrate that the SRS method is +well-suited for time-series OOD detection when compared to baseline methods. +Open-source code for the SRS method is provided at +https://github.com/tahabelkhouja/SRS + +
+
+ comment: Accepted for publication at ACM Transactions on Intelligent Systems + and Technology (TIST) +
+
+
+
+
+ + ♻ ☆ Adaptive whitening with fast gain modulation and slow synaptic + plasticity NeurIPS 2023 + + +
+ Neurons in early sensory areas rapidly adapt to changing sensory statistics, +both by normalizing the variance of their individual responses and by reducing +correlations between their responses. Together, these transformations may be +viewed as an adaptive form of statistical whitening. Existing mechanistic +models of adaptive whitening exclusively use either synaptic plasticity or gain +modulation as the biological substrate for adaptation; however, on their own, +each of these models has significant limitations. In this work, we unify these +approaches in a normative multi-timescale mechanistic model that adaptively +whitens its responses with complementary computational roles for synaptic +plasticity and gain modulation. Gains are modified on a fast timescale to adapt +to the current statistical context, whereas synapses are modified on a slow +timescale to match structural properties of the input statistics that are +invariant across contexts. Our model is derived from a novel multi-timescale +whitening objective that factorizes the inverse whitening matrix into basis +vectors, which correspond to synaptic weights, and a diagonal matrix, which +corresponds to neuronal gains. We test our model on synthetic and natural +datasets and find that the synapses learn optimal configurations over long +timescales that enable adaptive whitening on short timescales using gain +modulation. + +
+
+ comment: NeurIPS 2023 Spotlight; 18 pages, 8 figures +
+
+
+
+
+ + ♻ ☆ Sequential Memory with Temporal Predictive Coding NeurIPS + 2023 + + +
+ Forming accurate memory of sequential stimuli is a fundamental function of +biological agents. However, the computational mechanism underlying sequential +memory in the brain remains unclear. Inspired by neuroscience theories and +recent successes in applying predictive coding (PC) to \emph{static} memory +tasks, in this work we propose a novel PC-based model for \emph{sequential} +memory, called \emph{temporal predictive coding} (tPC). We show that our tPC +models can memorize and retrieve sequential inputs accurately with a +biologically plausible neural implementation. Importantly, our analytical study +reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) +with an implicit statistical whitening process, which leads to more stable +performance in sequential memory tasks of structured inputs. Moreover, we find +that tPC exhibits properties consistent with behavioral observations and +theories in neuroscience, thereby strengthening its biological relevance. Our +work establishes a possible computational mechanism underlying sequential +memory in the brain that can also be theoretically interpreted using existing +memory model frameworks. + +
+
+ comment: 37th Conference on Neural Information Processing Systems (NeurIPS + 2023) +
+
+
+
+
+ + ♻ ☆ Improving Neural Additive Models with Bayesian Principles + + +
+ Neural additive models (NAMs) can improve the interpretability of deep neural +networks by handling input features in separate additive sub-networks. However, +they lack inherent mechanisms that provide calibrated uncertainties and enable +selection of relevant features and interactions. Approaching NAMs from a +Bayesian perspective, we enhance them in three primary ways, namely by a) +providing credible intervals for the individual additive sub-networks; b) +estimating the marginal likelihood to perform an implicit selection of features +via an empirical Bayes procedure; and c) enabling a ranking of feature pairs as +candidates for second-order interaction in fine-tuned models. In particular, we +develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical +performance on tabular datasets and challenging real-world medical tasks. + +
+
+
+
+
+ + ♻ ☆ Explanations Based on Item Response Theory (eXirt): A Model-Specific + Method to Explain Tree-Ensemble Model in Trust Perspective + + +
+ In recent years, XAI researchers have been formalizing proposals and +developing new methods to explain black box models, with no general consensus +in the community on which method to use to explain these models, with this +choice being almost directly linked to the popularity of a specific method. +Methods such as Ciu, Dalex, Eli5, Lofo, Shap and Skater emerged with the +proposal to explain black box models through global rankings of feature +relevance, which based on different methodologies, generate global explanations +that indicate how the model's inputs explain its predictions. In this context, +41 datasets, 4 tree-ensemble algorithms (Light Gradient Boosting, CatBoost, +Random Forest, and Gradient Boosting), and 6 XAI methods were used to support +the launch of a new XAI method, called eXirt, based on Item Response Theory - +IRT and aimed at tree-ensemble black box models that use tabular data referring +to binary classification problems. In the first set of analyses, the 164 global +feature relevance ranks of the eXirt were compared with 984 ranks of the other +XAI methods present in the literature, seeking to highlight their similarities +and differences. In a second analysis, exclusive explanations of the eXirt +based on Explanation-by-example were presented that help in understanding the +model trust. Thus, it was verified that eXirt is able to generate global +explanations of tree-ensemble models and also local explanations of instances +of models through IRT, showing how this consolidated theory can be used in +machine learning in order to obtain explainable and reliable models. + +
+
+ comment: 54 pages, 15 Figures, 3 Equations, 7 tables +
+
+
+
+
+ + ♻ ☆ The Wasserstein Believer: Learning Belief Updates for Partially + Observable Environments through Reliable Latent Space Models + + +
+ Partially Observable Markov Decision Processes (POMDPs) are used to model +environments where the full state cannot be perceived by an agent. As such the +agent needs to reason taking into account the past observations and actions. +However, simply remembering the full history is generally intractable due to +the exponential growth in the history space. Maintaining a probability +distribution that models the belief over what the true state is can be used as +a sufficient statistic of the history, but its computation requires access to +the model of the environment and is often intractable. While SOTA algorithms +use Recurrent Neural Networks to compress the observation-action history aiming +to learn a sufficient statistic, they lack guarantees of success and can lead +to sub-optimal policies. To overcome this, we propose the Wasserstein Belief +Updater, an RL algorithm that learns a latent model of the POMDP and an +approximation of the belief update. Our approach comes with theoretical +guarantees on the quality of our approximation ensuring that our outputted +beliefs allow for learning the optimal value function. + +
+
+
+
+
+ + ♻ ☆ Trajectory Alignment: Understanding the Edge of Stability Phenomenon via + Bifurcation Theory NeurIPS 2023 + + +
+ Cohen et al. (2021) empirically study the evolution of the largest eigenvalue +of the loss Hessian, also known as sharpness, along the gradient descent (GD) +trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness +increases at the early phase of training (referred to as progressive +sharpening), and eventually saturates close to the threshold of $2 / +\text{(step size)}$. In this paper, we start by demonstrating through empirical +studies that when the EoS phenomenon occurs, different GD trajectories (after a +proper reparameterization) align on a specific bifurcation diagram independent +of initialization. We then rigorously prove this trajectory alignment +phenomenon for a two-layer fully-connected linear network and a single-neuron +nonlinear network trained with a single data point. Our trajectory alignment +analysis establishes both progressive sharpening and EoS phenomena, +encompassing and extending recent findings in the literature. + +
+
+ comment: NeurIPS 2023 camera-ready; 51 pages +
+
+
+
+
+ + ♻ ☆ Driving through the Concept Gridlock: Unraveling Explainability + Bottlenecks in Automated Driving + + +
+ Concept bottleneck models have been successfully used for explainable machine +learning by encoding information within the model with a set of human-defined +concepts. In the context of human-assisted or autonomous driving, +explainability models can help user acceptance and understanding of decisions +made by the autonomous vehicle, which can be used to rationalize and explain +driver or vehicle behavior. We propose a new approach using concept bottlenecks +as visual features for control command predictions and explanations of user and +vehicle behavior. We learn a human-understandable concept layer that we use to +explain sequential driving scenes while learning vehicle control commands. This +approach can then be used to determine whether a change in a preferred gap or +steering commands from a human (or autonomous vehicle) is led by an external +stimulus or change in preferences. We achieve competitive performance to latent +visual features while gaining interpretability within our model setup. + +
+
+
+
+
+ + ♻ ☆ Leveraging Ensemble Diversity for Robust Self-Training in the Presence + of Sample Selection Bias + + +
+ Self-training is a well-known approach for semi-supervised learning. It +consists of iteratively assigning pseudo-labels to unlabeled data for which the +model is confident and treating them as labeled examples. For neural networks, +softmax prediction probabilities are often used as a confidence measure, +despite the fact that they are known to be overconfident, even for wrong +predictions. This phenomenon is particularly intensified in the presence of +sample selection bias, i.e., when data labeling is subject to some constraint. +To address this issue, we propose a novel confidence measure, called +$\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of +linear classifiers. We provide the theoretical analysis of our approach by +studying stationary points and describing the relationship between the +diversity of the individual members and their performance. We empirically +demonstrate the benefit of our confidence measure for three different +pseudo-labeling policies on classification datasets of various data modalities. + +
+
+
+
+
+ + ♻ ☆ A Theoretical Explanation of Activation Sparsity through Flat Minima and + Adversarial Robustness + + +
+ A recent empirical observation (Li et al., 2022b) of activation sparsity in +MLP blocks offers an opportunity to drastically reduce computation costs for +free. Although they attribute it to training dynamics, existing theoretical +explanations of activation sparsity are restricted to shallow networks, small +training steps and special training, despite its emergence in deep models +standardly trained for a large number of steps. To fill these gaps, we propose +the notion of gradient sparsity as one source of activation sparsity and a +theoretical explanation based on it that sees sparsity as a necessary step to +adversarial robustness w.r.t. hidden features and parameters, which is +approximately the flatness of minima for well-learned models. The theory +applies to standardly trained LayerNorm-ed MLPs, and further to Transformers or +other architectures trained with weight noises. Eliminating other sources of +flatness except for sparsity, we discover the phenomenon that the ratio between +the largest and smallest non-zero singular values of weight matrices is small. +When discussing the emergence of this spectral concentration, we use random +matrix theory (RMT) as a powerful tool to analyze stochastic gradient noises. +Validation experiments are conducted to verify our gradient-sparsity-based +explanation. We propose two plug-and-play modules for both training and +finetuning for sparsity. Experiments on ImageNet-1k and C4 demonstrate their +50% sparsity improvements, indicating further potential cost reduction in both +training and inference. + +
+
+
+
+
+ + ♻ ☆ Label Embedding via Low-Coherence Matrices + + +
+ Label embedding is a framework for multiclass classification problems where +each label is represented by a distinct vector of some fixed dimension, and +training involves matching model output to the vector representing the correct +label. While label embedding has been successfully applied in extreme +classification and zero-shot learning, and offers both computational and +statistical advantages, its theoretical foundations remain poorly understood. +This work presents an analysis of label embedding in the context of extreme +multiclass classification, where the number of classes $C$ is very large. We +present an excess risk bound that reveals a trade-off between computational and +statistical efficiency, quantified via the coherence of the embedding matrix. +We further show that under the Massart noise condition, the statistical penalty +for label embedding vanishes with sufficiently low coherence. Our analysis +supports an algorithm that is simple, scalable, and easily parallelizable, and +experimental results demonstrate its effectiveness in large-scale applications. + +
+
+
+
+
+ + ♻ ☆ Look Beneath the Surface: Exploiting Fundamental Symmetry for + Sample-Efficient Offline RL + + +
+ Offline reinforcement learning (RL) offers an appealing approach to +real-world tasks by learning policies from pre-collected datasets without +interacting with the environment. However, the performance of existing offline +RL algorithms heavily depends on the scale and state-action space coverage of +datasets. Real-world data collection is often expensive and uncontrollable, +leading to small and narrowly covered datasets and posing significant +challenges for practical deployments of offline RL. In this paper, we provide a +new insight that leveraging the fundamental symmetry of system dynamics can +substantially enhance offline RL performance under small datasets. +Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced +Dynamics Model (TDM), which establishes consistency between a pair of forward +and reverse latent dynamics. TDM provides both well-behaved representations for +small datasets and a new reliability measure for OOD samples based on +compliance with the T-symmetry. These can be readily used to construct a new +offline RL algorithm (TSRL) with less conservative policy constraints and a +reliable latent space data augmentation procedure. Based on extensive +experiments, we find TSRL achieves great performance on small benchmark +datasets with as few as 1% of the original samples, which significantly +outperforms the recent offline RL algorithms in terms of data efficiency and +generalizability. Code is available at: https://github.com/pcheng2/TSRL + +
+
+ comment: The first two authors contributed equally +
+
+
+
+
+ + ♻ ☆ A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge + Graphs NeurIPS 2023 + + +
+ Graph neural networks are prominent models for representation learning over +graph-structured data. While the capabilities and limitations of these models +are well-understood for simple graphs, our understanding remains incomplete in +the context of knowledge graphs. Our goal is to provide a systematic +understanding of the landscape of graph neural networks for knowledge graphs +pertaining to the prominent task of link prediction. Our analysis entails a +unifying perspective on seemingly unrelated models and unlocks a series of +other models. The expressive power of various models is characterized via a +corresponding relational Weisfeiler-Leman algorithm. This analysis is extended +to provide a precise logical characterization of the class of functions +captured by a class of graph neural networks. The theoretical findings +presented in this paper explain the benefits of some widely employed practical +design choices, which are validated empirically. + +
+
+ comment: Proceedings of the Thirty-Seventh Annual Conference on Advances in + Neural Information Processing Systems (NeurIPS 2023). Code available at: + https://github.com/HxyScotthuang/CMPNN +
+
+
+
+
+ + ♻ ☆ Language-based Action Concept Spaces Improve Video Self-Supervised + Learning NeurIPS 2023 + + +
+ Recent contrastive language image pre-training has led to learning highly +transferable and robust image representations. However, adapting these models +to video domains with minimal supervision remains an open problem. We explore a +simple step in that direction, using language tied self-supervised learning to +adapt an image CLIP model to the video domain. A backbone modified for temporal +modeling is trained under self-distillation settings with train objectives +operating in an action concept space. Feature vectors of various action +concepts extracted from a language encoder using relevant textual prompts +construct this space. We introduce two train objectives, concept distillation +and concept alignment, that retain generality of original representations while +enforcing relations between actions and their attributes. Our approach improves +zero-shot and linear probing performance on three action recognition +benchmarks. + +
+
+ comment: Presented at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Neural (Tangent Kernel) Collapse + + +
+ This work bridges two important concepts: the Neural Tangent Kernel (NTK), +which captures the evolution of deep neural networks (DNNs) during training, +and the Neural Collapse (NC) phenomenon, which refers to the emergence of +symmetry and structure in the last-layer features of well-trained +classification DNNs. We adopt the natural assumption that the empirical NTK +develops a block structure aligned with the class labels, i.e., samples within +the same class have stronger correlations than samples from different classes. +Under this assumption, we derive the dynamics of DNNs trained with mean squared +(MSE) loss and break them into interpretable phases. Moreover, we identify an +invariant that captures the essence of the dynamics, and use it to prove the +emergence of NC in DNNs with block-structured NTK. We provide large-scale +numerical experiments on three common DNN architectures and three benchmark +datasets to support our theory. + +
+
+
+
+
+ + ♻ ☆ Towards Better Generalization with Flexible Representation of + Multi-Module Graph Neural Networks + + +
+ Graph neural networks (GNNs) have become compelling models designed to +perform learning and inference on graph-structured data. However, little work +has been done to understand the fundamental limitations of GNNs for scaling to +larger graphs and generalizing to out-of-distribution (OOD) inputs. In this +paper, we use a random graph generator to systematically investigate how the +graph size and structural properties affect the predictive performance of GNNs. +We present specific evidence that the average node degree is a key feature in +determining whether GNNs can generalize to unseen graphs, and that the use of +multiple node update functions can improve the generalization performance of +GNNs when dealing with graphs of multimodal degree distributions. Accordingly, +we propose a multi-module GNN framework that allows the network to adapt +flexibly to new graphs by generalizing a single canonical nonlinear +transformation over aggregated inputs. Our results show that the multi-module +GNNs improve the OOD generalization on a variety of inference tasks in the +direction of diverse structural features. + +
+
+
+
+
+ + ♻ ☆ NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial + Reports + + +
+ How can we interpret and retrieve medical evidence to support clinical +decisions? Clinical trial reports (CTR) amassed over the years contain +indispensable information for the development of personalized medicine. +However, it is practically infeasible to manually inspect the 400,000+ +clinical trial reports in order to find the best evidence for experimental +treatments. Natural Language Inference (NLI) offers a potential solution to +this problem, by allowing the scalable computation of textual entailment. +However, existing NLI models perform poorly on biomedical corpora, and +previously published datasets fail to capture the full complexity of inference +over CTRs. In this work, we present a novel resource to advance research on NLI +for reasoning on CTRs. The resource includes two main tasks. Firstly, to +determine the inference relation between a natural language statement and a +CTR. Secondly, to retrieve supporting facts to justify the predicted relation. +We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these +tasks. Baselines on this corpus expose the limitations of existing NLI models, +with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To +the best of our knowledge, we are the first to design a task that covers the +interpretation of full CTRs. To encourage further work on this challenging +dataset, we make the corpus, competition leaderboard, website and code to +replicate the baseline experiments available at: +https://github.com/ai-systems/nli4ct + +
+
+ comment: 15 pages +
+
+
+
+
+ + ♻ ☆ Finding Regions of Counterfactual Explanations via Robust Optimization + + +
+ Counterfactual explanations play an important role in detecting bias and +improving the explainability of data-driven classification models. A +counterfactual explanation (CE) is a minimal perturbed data point for which the +decision of the model changes. Most of the existing methods can only provide +one CE, which may not be achievable for the user. In this work we derive an +iterative method to calculate robust CEs, i.e. CEs that remain valid even after +the features are slightly perturbed. To this end, our method provides a whole +region of CEs allowing the user to choose a suitable recourse to obtain a +desired outcome. We use algorithmic ideas from robust optimization and prove +convergence results for the most common machine learning methods including +logistic regression, decision trees, random forests, and neural networks. Our +experiments show that our method can efficiently generate globally optimal +robust CEs for a variety of common data sets and classification models. + +
+
+
+
+
+ + ♻ ☆ A Comprehensive Study of Groundbreaking Machine Learning Research: + Analyzing highly cited and impactful publications across six decades + + +
+ Machine learning (ML) has emerged as a prominent field of research in +computer science and other related fields, thereby driving advancements in +other domains of interest. As the field continues to evolve, it is crucial to +understand the landscape of highly cited publications to identify key trends, +influential authors, and significant contributions made thus far. In this +paper, we present a comprehensive bibliometric analysis of highly cited ML +publications. We collected a dataset consisting of the top-cited papers from +reputable ML conferences and journals, covering the period from +1959 to 2022. We employed various bibliometric techniques to analyze the data, +including citation analysis, co-authorship analysis, keyword analysis, and +publication trends. Our findings reveal the most influential papers, highly +cited authors, and collaborative networks within the machine learning +community. We identify popular research themes and uncover emerging topics that +have recently gained significant attention. Furthermore, we examine the +geographical distribution of highly cited publications, highlighting the +dominance of certain countries in ML research. By shedding light on the +landscape of highly cited ML publications, our study provides valuable insights +for researchers, policymakers, and practitioners seeking to understand the key +developments and trends in this rapidly evolving field. + +
+
+ comment: Journal of Engineering Research (2023) +
+
+
+
+
+ + ♻ ☆ Efficient Diffusion Policies for Offline Reinforcement Learning NeurIPS 2023 + + +
+ Offline reinforcement learning (RL) aims to learn optimal policies from +offline datasets, where the parameterization of policies is crucial but often +overlooked. Recently, Diffusion-QL significantly boosts the performance of +offline RL by representing a policy with a diffusion model, whose success +relies on a parametrized Markov Chain with hundreds of steps for sampling. +However, Diffusion-QL suffers from two critical limitations. 1) It is +computationally inefficient to forward and backward through the whole Markov +chain during training. 2) It is incompatible with maximum likelihood-based RL +algorithms (e.g., policy gradient methods) as the likelihood of diffusion +models is intractable. Therefore, we propose efficient diffusion policy (EDP) +to overcome these two challenges. EDP approximately constructs actions from +corrupted ones at training to avoid running the sampling chain. We conduct +extensive experiments on the D4RL benchmark. The results show that EDP can +reduce the diffusion policy training time from 5 days to 5 hours on +gym-locomotion tasks. Moreover, we show that EDP is compatible with various +offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on +D4RL by large margins over previous methods. Our code is available at +https://github.com/sail-sg/edp. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ On Embeddings for Numerical Features in Tabular Deep Learning NeurIPS 2022 + + +
+ Recently, Transformer-like deep architectures have shown strong performance +on tabular data problems. Unlike traditional models, e.g., MLP, these +architectures map scalar values of numerical features to high-dimensional +embeddings before mixing them in the main backbone. In this work, we argue that +embeddings for numerical features are an underexplored degree of freedom in +tabular DL, which allows constructing more powerful DL models and competing +with GBDT on some traditionally GBDT-friendly benchmarks. We start by +describing two conceptually different approaches to building embedding modules: +the first one is based on a piecewise linear encoding of scalar values, and the +second one utilizes periodic activations. Then, we empirically demonstrate that +these two approaches can lead to significant performance boosts compared to the +embeddings based on conventional blocks such as linear layers and ReLU +activations. Importantly, we also show that embedding numerical features is +beneficial for many backbones, not only for Transformers. Specifically, after +proper embeddings, simple MLP-like models can perform on par with the +attention-based architectures. Overall, we highlight embeddings for numerical +features as an important design aspect with good potential for further +improvements in tabular DL. + +
+
+ comment: NeurIPS 2022 camera-ready. Code: + https://github.com/yandex-research/tabular-dl-num-embeddings (v3-v4: minor + changes) +
+
+
+
+
+ + ♻ ☆ Revisiting Deep Learning Models for Tabular Data NeurIPS 2021 + + +
+ The existing literature on deep learning for tabular data proposes a wide +range of novel architectures and reports competitive results on various +datasets. However, the proposed models are usually not properly compared to +each other and existing works often use different benchmarks and experiment +protocols. As a result, it is unclear for both researchers and practitioners +what models perform best. Additionally, the field still lacks effective +baselines, that is, the easy-to-use models that provide competitive performance +across different problems. + In this work, we perform an overview of the main families of DL architectures +for tabular data and raise the bar of baselines in tabular DL by identifying +two simple and powerful deep architectures. The first one is a ResNet-like +architecture which turns out to be a strong baseline that is often missing in +prior works. The second model is our simple adaptation of the Transformer +architecture for tabular data, which outperforms other solutions on most tasks. +Both models are compared to many existing architectures on a diverse set of +tasks under the same training and tuning protocols. We also compare the best DL +models with Gradient Boosted Decision Trees and conclude that there is still no +universally superior solution. + +
+
+ comment: NeurIPS 2021 camera-ready. Code: + https://github.com/yandex-research/tabular-dl-revisiting-models (v3-v5: minor + changes) +
+
+
+
+
+ + ♻ ☆ Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with + Application to Fairness + + +
+ We consider contextual bandit problems with knapsacks [CBwK], a problem where +at each round, a scalar reward is obtained and vector-valued costs are +suffered. The learner aims to maximize the cumulative rewards while ensuring +that the cumulative costs are lower than some predetermined cost constraints. +We assume that contexts come from a continuous set, that costs can be signed, +and that the expected reward and cost functions, while unknown, may be +uniformly estimated -- a typical assumption in the literature. In this setting, +total cost constraints had so far to be at least of order $T^{3/4}$, where $T$ +is the number of rounds, and were even typically assumed to depend linearly on +$T$. We are however motivated to use CBwK to impose a fairness constraint of +equalized average costs between groups: the budget associated with the +corresponding cost constraints should be as close as possible to the natural +deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy +based on projected-gradient-descent updates, that is able to deal with +total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. +This strategy is more direct and simpler than existing strategies in the +literature. It relies on a careful, adaptive, tuning of the step size. + +
+
+
+
+
+ + ♻ ☆ A weighted-variance variational autoencoder model for speech enhancement + + +
+ We address speech enhancement based on variational autoencoders, which +involves learning a speech prior distribution in the time-frequency (TF) +domain. A zero-mean complex-valued Gaussian distribution is usually assumed for +the generative model, where the speech information is encoded in the variance +as a function of a latent variable. In contrast to this commonly used approach, +we propose a weighted variance generative model, where the contribution of each +spectrogram time-frame in parameter learning is weighted. We impose a Gamma +prior distribution on the weights, which would effectively lead to a Student's +t-distribution instead of Gaussian for speech generative modeling. We develop +efficient training and speech enhancement algorithms based on the proposed +generative model. Our experimental results on spectrogram auto-encoding and +speech enhancement demonstrate the effectiveness and robustness of the proposed +approach compared to the standard unweighted variance model. + +
+
+
+
+
+ + ♻ ☆ Neural Optimal Transport with General Cost Functionals + + +
+ We introduce a novel neural network-based algorithm to compute optimal +transport (OT) plans for general cost functionals. In contrast to common +Euclidean costs, i.e., $\ell^1$ or $\ell^2$, such functionals provide more +flexibility and allow using auxiliary information, such as class labels, to +construct the required transport map. Existing methods for general costs are +discrete and have limitations in practice, i.e. they do not provide an +out-of-sample estimation. We address the challenge of designing a continuous OT +approach for general costs that generalizes to new data points in +high-dimensional spaces, such as images. Additionally, we provide the +theoretical error analysis for our recovered transport plans. As an +application, we construct a cost functional to map data distributions while +preserving the class-wise structure. + +
+
+
+
+
+ + ♻ ☆ Local Advantage Networks for Cooperative Multi-Agent Reinforcement + Learning + + +
+ Many recent successful off-policy multi-agent reinforcement learning (MARL) +algorithms for cooperative partially observable environments focus on finding +factorized value functions, leading to convoluted network structures. Building +on the structure of independent Q-learners, our LAN algorithm takes a radically +different approach, leveraging a dueling architecture to learn for each agent a +decentralized best-response policy via individual advantage functions. The +learning is stabilized by a centralized critic whose primary objective is to +reduce the moving target problem of the individual advantages. The critic, +whose network's size is independent of the number of agents, is cast aside +after learning. Evaluation on the StarCraft II multi-agent challenge benchmark +shows that LAN reaches state-of-the-art performance and is highly scalable with +respect to the number of agents, opening up a promising alternative direction +for MARL research. + +
+
+ comment: https://openreview.net/forum?id=adpKzWQunW +
+
+
+
+
+ + ♻ ☆ Deep machine learning for meteor monitoring: advances with transfer + learning and gradient-weighted class activation mapping + + +
+ In recent decades, the use of optical detection systems for meteor studies +has increased dramatically, resulting in huge amounts of data being analyzed. +Automated meteor detection tools are essential for studying the continuous +meteoroid incoming flux, recovering fresh meteorites, and achieving a better +understanding of our Solar System. Concerning meteor detection, distinguishing +false positives between meteor and non-meteor images has traditionally been +performed by hand, which is significantly time-consuming. To address this +issue, we developed a fully automated pipeline that uses Convolutional Neural +Networks (CNNs) to classify candidate meteor detections. Our new method is able +to detect meteors even in images that contain static elements such as clouds, +the Moon, and buildings. To accurately locate the meteor within each frame, we +employ the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. +This method facilitates the identification of the region of interest by +multiplying the activations from the last convolutional layer with the average +of the gradients across the feature map of that layer. By combining these +findings with the activation map derived from the first convolutional layer, we +effectively pinpoint the most probable pixel location of the meteor. We trained +and evaluated our model on a large dataset collected by the Spanish Meteor +Network (SPMN) and achieved a precision of 98\%. Our new methodology presented +here has the potential to reduce the workload of meteor scientists and station +operators and improve the accuracy of meteor tracking and classification. + +
+
+ comment: Accepted in Planetary and Space Science +
+
+
+
+
+ + ♻ ☆ Recycle-and-Distill: Universal Compression Strategy for + Transformer-based Speech SSL Models with Attention Map Reusing and Masking + Distillation + + +
+ Transformer-based speech self-supervised learning (SSL) models, such as +HuBERT, show surprising performance in various speech processing tasks. +However, the huge number of parameters in speech SSL models necessitates +compression to a more compact model for wider usage in academia or small +companies. In this study, we suggest reusing attention maps across the +Transformer layers, so as to remove key and query parameters while retaining +the number of layers. Furthermore, we propose a novel masking distillation +strategy to improve the student model's speech representation quality. We +extend the distillation loss to utilize both masked and unmasked speech frames +to fully leverage the teacher model's high-quality representation. Our +universal compression strategy yields a student model that achieves a phoneme +error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB +benchmark. + +
+
+ comment: Proceedings of Interspeech 2023. Code URL: + https://github.com/sungnyun/ARMHuBERT +
+
+
+
+
+ + ♻ ☆ RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers + via Randomized Deletion NeurIPS 2023 + + +
+ Randomized smoothing is a leading approach for constructing classifiers that +are certifiably robust against adversarial examples. Existing work on +randomized smoothing has focused on classifiers with continuous inputs, such as +images, where $\ell_p$-norm bounded adversaries are commonly studied. However, +there has been limited work for classifiers with discrete or variable-size +inputs, such as for source code, which require different threat models and +smoothing mechanisms. In this work, we adapt randomized smoothing for discrete +sequence classifiers to provide certified robustness against edit +distance-bounded adversaries. Our proposed smoothing mechanism randomized +deletion (RS-Del) applies random deletion edits, which are (perhaps +surprisingly) sufficient to confer robustness against adversarial deletion, +insertion and substitution edits. Our proof of certification deviates from the +established Neyman-Pearson approach, which is intractable in our setting, and +is instead organized around longest common subsequences. We present a case +study on malware detection--a binary classification problem on byte sequences +where classifier evasion is a well-established threat model. When applied to +the popular MalConv malware detection model, our smoothing mechanism RS-Del +achieves a certified accuracy of 91% at an edit distance radius of 128 bytes. + +
+
+ comment: To be published in NeurIPS 2023. 36 pages, 7 figures, 12 tables. + Includes 20 pages of appendices +
+
+
+
+
+ + ♻ ☆ Sketch-and-Project Meets Newton Method: Global $\mathcal O(k^{-2})$ + Convergence with Low-Rank Updates + + +
+ In this paper, we propose the first sketch-and-project Newton method with +fast $\mathcal O(k^{-2})$ global convergence rate for self-concordant +functions. Our method, SGN, can be viewed in three ways: i) as a +sketch-and-project algorithm projecting updates of Newton method, ii) as a +cubically regularized Newton method in sketched subspaces, and iii) as a damped +Newton method in sketched subspaces. SGN inherits the best of all three worlds: +cheap iteration costs of sketch-and-project methods, state-of-the-art $\mathcal +O(k^{-2})$ global convergence rate of full-rank Newton-like methods and the +algorithm simplicity of damped Newton methods. Finally, we demonstrate its +comparable empirical performance to baseline algorithms. + +
+
+ comment: 10 pages +
+
+
+
+
+ + ♻ ☆ Curvature Filtrations for Graph Generative Model Evaluation NeurIPS + + +
+ Graph generative model evaluation necessitates understanding differences +between graphs on the distributional level. This entails being able to harness +salient attributes of graphs in an efficient manner. Curvature constitutes one +such property that has recently proved its utility in characterising graphs. +Its expressive properties, stability, and practical utility in model evaluation +remain largely unexplored, however. We combine graph curvature descriptors with +emerging methods from topological data analysis to obtain robust, expressive +descriptors for evaluating graph generative models. + +
+
+ comment: Accepted at the 37th Conference on Neural Information Processing + Systems (NeurIPS) 2023 +
+
+
+
+
+ + ♻ ☆ Time-Conditioned Generative Modeling of Object-Centric Representations + for Video Decomposition and Prediction + + +
+ When perceiving the world from multiple viewpoints, humans have the ability +to reason about the complete objects in a compositional manner even when an +object is completely occluded from certain viewpoints. Meanwhile, humans are +able to imagine novel views after observing multiple viewpoints. Recent +remarkable advances in multi-view object-centric learning still leave some +unresolved problems: 1) The shapes of partially or completely occluded objects +cannot be well reconstructed. 2) The novel viewpoint prediction depends on +expensive viewpoint annotations rather than implicit rules in view +representations. In this paper, we introduce a time-conditioned generative +model for videos. To reconstruct the complete shape of an object accurately, we +enhance the disentanglement between the latent representations of objects and +views, where the latent representations of time-conditioned views are jointly +inferred with a Transformer and then are input to a sequential extension of +Slot Attention to learn object-centric representations. In addition, Gaussian +processes are employed as priors of view latent variables for video generation +and novel-view prediction without viewpoint annotations. Experiments on +multiple datasets demonstrate that the proposed model can make object-centric +video decomposition, reconstruct the complete shapes of occluded objects, and +make novel-view predictions. + +
+
+
+
+
+ + ♻ ☆ Learning Space-Time Continuous Neural PDEs from Partially Observed + States + + +
+ We introduce a novel grid-independent model for learning partial differential +equations (PDEs) from noisy and partial observations on irregular +spatiotemporal grids. We propose a space-time continuous latent neural PDE +model with an efficient probabilistic framework and a novel encoder design for +improved data efficiency and grid independence. The latent state dynamics are +governed by a PDE model that combines the collocation method and the method of +lines. We employ amortized variational inference for approximate posterior +estimation and utilize a multiple shooting technique for enhanced training +speed and stability. Our model demonstrates state-of-the-art performance on +complex synthetic and real-world datasets, overcoming limitations of previous +approaches and effectively handling partially-observed data. The proposed model +outperforms recent methods, showing its potential to advance data-driven PDE +modeling and enabling robust, grid-independent modeling of complex +partially-observed dynamic processes. + +
+
+
+
+
+ + ♻ ☆ COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action + Spotting using Transformers + + +
+ We present COMEDIAN, a novel pipeline to initialize spatiotemporal +transformers for action spotting, which involves self-supervised learning and +knowledge distillation. Action spotting is a timestamp-level temporal action +detection task. Our pipeline consists of three steps, with two initialization +stages. First, we perform self-supervised initialization of a spatial +transformer using short videos as input. Additionally, we initialize a temporal +transformer that enhances the spatial transformer's outputs with global context +through knowledge distillation from a pre-computed feature bank aligned with +each short video segment. In the final step, we fine-tune the transformers to +the action spotting task. The experiments, conducted on the SoccerNet-v2 +dataset, demonstrate state-of-the-art performance and validate the +effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several +advantages of our pretraining pipeline, including improved performance and +faster convergence compared to non-pretrained models. + +
+
+ comment: Source code is available here: + https://github.com/juliendenize/eztorch +
+
+
+
+
+ + ♻ ☆ Fairness and bias correction in machine learning for depression + prediction: results from four study populations + + +
+ A significant level of stigma and inequality exists in mental healthcare, +especially in under-served populations. Inequalities are reflected in the data +collected for scientific purposes. When not properly accounted for, machine +learning (ML) models learnt from data can reinforce these structural +inequalities or biases. Here, we present a systematic study of bias in ML +models designed to predict depression in four different case studies covering +different countries and populations. We find that standard ML approaches show +regularly biased behaviors. We also show that mitigation techniques, both +standard and our own post-hoc method, can be effective in reducing the level of +unfair bias. No single best ML model for depression prediction provides +equality of outcomes. This emphasizes the importance of analyzing fairness +during model selection and transparent reporting about the impact of debiasing +interventions. Finally, we provide practical recommendations to develop +bias-aware ML models for depression risk prediction. + +
+
+ comment: 11 pages, 2 figures +
+
+
+
+
+ + ♻ ☆ Learning Transferable Adversarial Robust Representations via Multi-view + Consistency NeurIPS + + +
+ Despite the success on few-shot learning problems, most meta-learned models +only focus on achieving good performance on clean examples and thus easily +break down when given adversarially perturbed samples. While some recent works +have shown that a combination of adversarial learning and meta-learning could +enhance the robustness of a meta-learner against adversarial attacks, they fail +to achieve generalizable adversarial robustness to unseen domains and tasks, +which is the ultimate goal of meta-learning. To address this challenge, we +propose a novel meta-adversarial multi-view representation learning framework +with dual encoders. Specifically, we introduce the discrepancy across the two +differently augmented samples of the same data instance by first updating the +encoder parameters with them and further imposing a novel label-free +adversarial attack to maximize their discrepancy. Then, we maximize the +consistency across the views to learn transferable robust representations +across domains and tasks. Through experimental validation on multiple +benchmarks, we demonstrate the effectiveness of our framework on few-shot +learning tasks from unseen domains, achieving over 10\% robust accuracy +improvements against previous adversarial meta-learning baselines. + +
+
+ comment: *Equal contribution (Author ordering determined by coin flip). + NeurIPS SafetyML workshop 2022, Under review +
+
+
+
+
+ + ♻ ☆ Mind the spikes: Benign overfitting of kernels and neural networks in + fixed dimension + + +
+ The success of over-parameterized neural networks trained to near-zero +training error has caused great interest in the phenomenon of benign +overfitting, where estimators are statistically consistent even though they +interpolate noisy training data. While benign overfitting in fixed dimension +has been established for some learning methods, current literature suggests +that for regression with typical kernel methods and wide neural networks, +benign overfitting requires a high-dimensional setting where the dimension +grows with the sample size. In this paper, we show that the smoothness of the +estimators, and not the dimension, is the key: benign overfitting is possible +if and only if the estimator's derivatives are large enough. We generalize +existing inconsistency results to non-interpolating models and more kernels to +show that benign overfitting with moderate derivatives is impossible in fixed +dimension. Conversely, we show that rate-optimal benign overfitting is possible +for regression with a sequence of spiky-smooth kernels with large derivatives. +Using neural tangent kernels, we translate our results to wide neural networks. +We prove that while infinite-width networks do not overfit benignly with the +ReLU activation, this can be fixed by adding small high-frequency fluctuations +to the activation function. Our experiments verify that such neural networks, +while overfitting, can indeed generalize well even on low-dimensional data +sets. + +
+
+ comment: We provide Python code to reproduce all of our experimental results + at https://github.com/moritzhaas/mind-the-spikes +
+
+
+
+
+ + ♻ ☆ Effective Targeted Attacks for Adversarial Self-Supervised Learning NeurIPS 2023 + + +
+ Recently, unsupervised adversarial training (AT) has been highlighted as a +means of achieving robustness in models without any label information. Previous +studies in unsupervised AT have mostly focused on implementing self-supervised +learning (SSL) frameworks, which maximize the instance-wise classification loss +to generate adversarial examples. However, we observe that simply maximizing +the self-supervised training loss with an untargeted adversarial attack often +results in generating ineffective adversaries that may not help improve the +robustness of the trained model, especially for non-contrastive SSL frameworks +without negative examples. To tackle this problem, we propose a novel positive +mining for targeted adversarial attack to generate effective adversaries for +adversarial SSL frameworks. Specifically, we introduce an algorithm that +selects the most confusing yet similar target example for a given instance +based on entropy and similarity, and subsequently perturbs the given instance +towards the selected target. Our method demonstrates significant enhancements +in robustness when applied to non-contrastive SSL frameworks, and less but +consistent robustness improvements with contrastive SSL frameworks, on the +benchmark datasets. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Training-free Diffusion Model Adaptation for Variable-Sized + Text-to-Image Synthesis NeurIPS 2023 + + +
+ Diffusion models (DMs) have recently gained attention with state-of-the-art +performance in text-to-image synthesis. Abiding by the tradition in deep +learning, DMs are trained and evaluated on images with fixed sizes. +However, users demand images with specific sizes and various +aspect ratios. This paper focuses on adapting text-to-image diffusion models to +handle such variety while maintaining visual fidelity. First we observe that, +during the synthesis, lower resolution images suffer from incomplete object +portrayal, while higher resolution images exhibit repetitively disordered +presentation. Next, we establish a statistical relationship indicating that +attention entropy changes with token quantity, suggesting that models aggregate +spatial information in proportion to image resolution. Our subsequent +interpretation of these observations is that objects are incompletely depicted +due to limited spatial information for low resolutions, while repetitively +disorganized presentation arises from redundant spatial information for high +resolutions. From this perspective, we propose a scaling factor to alleviate +the change of attention entropy and mitigate the defective pattern observed. +Extensive experimental results validate the efficacy of the proposed scaling +factor, enabling models to achieve better visual effects, image quality, and +text alignment. Notably, these improvements are achieved without additional +training or fine-tuning techniques. + +
+
+ comment: Accepted by NeurIPS 2023. 23 pages, 13 figures +
+
+
+
+
+ + ♻ ☆ Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset + Selection NeurIPS 2023 + + +
+ Adversarial contrastive learning (ACL) does not require expensive data +annotations but outputs a robust representation that withstands adversarial +attacks and also generalizes to a wide range of downstream tasks. However, ACL +needs tremendous running time to generate the adversarial variants of all +training data, which limits its scalability to large datasets. To speed up ACL, +this paper proposes a robustness-aware coreset selection (RCS) method. RCS does +not require label information and searches for an informative subset that +minimizes a representational divergence, which is the distance of the +representation between natural data and their virtual adversarial variants. The +vanilla solution of RCS via traversing all possible subsets is computationally +prohibitive. Therefore, we theoretically transform RCS into a surrogate problem +of submodular maximization, of which the greedy search is an efficient solution +with an optimality guarantee for the original problem. Empirically, our +comprehensive results corroborate that RCS can speed up ACL by a large margin +without significantly hurting the robustness transferability. Notably, to the +best of our knowledge, we are the first to conduct ACL efficiently on the +large-scale ImageNet-1K dataset to obtain an effective robust representation +via RCS. Our source code is at +https://github.com/GodXuxilie/Efficient_ACL_via_RCS. + +
+
+ comment: NeurIPS 2023 Spotlight +
+
+
+
+
+ + ♻ ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong continual learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ♻ ☆ Emergent representations in networks trained with the Forward-Forward + algorithm + + +
+ The Backpropagation algorithm has often been criticised for its lack of +biological realism. In an attempt to find a more biologically plausible +alternative, the recently introduced Forward-Forward algorithm replaces the +forward and backward passes of Backpropagation with two forward passes. In this +work, we show that the internal representations obtained by the Forward-Forward +algorithm can organise into category-specific ensembles exhibiting high +sparsity - i.e. composed of an extremely low number of active units. This +situation is reminiscent of what has been observed in cortical sensory areas, +where neuronal ensembles are suggested to serve as the functional building +blocks for perception and action. Interestingly, while this sparse pattern does +not typically arise in models trained with standard Backpropagation, it can +emerge in networks trained with Backpropagation on the same objective proposed +for the Forward-Forward algorithm. These results suggest that the learning +procedure proposed by Forward-Forward may be superior to Backpropagation in +modelling learning in the cortex, even when a backward pass is used. + +
+
+ comment: 16 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Read and Reap the Rewards: Learning to Play Atari with the Help of + Instruction Manuals + + +
+ High sample complexity has long been a challenge for RL. On the other hand, +humans learn to perform tasks not only from interaction or demonstrations, but +also by reading unstructured text documents, e.g., instruction manuals. +Instruction manuals and wiki pages are among the most abundant data that could +inform agents of valuable features and policies or task-specific environmental +dynamics and reward structures. Therefore, we hypothesize that the ability to +utilize human-written instruction manuals to assist learning policies for +specific tasks should lead to a more efficient and better-performing agent. We +propose the Read and Reward framework. Read and Reward speeds up RL algorithms +on Atari games by reading manuals released by the Atari game developers. Our +framework consists of a QA Extraction module that extracts and summarizes +relevant information from the manual and a Reasoning module that evaluates +object-agent interactions based on information from the manual. An auxiliary +reward is then provided to a standard A2C RL agent, when interaction is +detected. Experimentally, various RL algorithms obtain significant improvement +in performance and training speed when assisted by our design. + +
+
+
+
+
+ + ♻ ☆ DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model + Statistics NeurIPS 2023 + + +
+ Diffusion probabilistic models (DPMs) have exhibited excellent performance +for high-fidelity image generation while suffering from inefficient sampling. +Recent works accelerate the sampling procedure by proposing fast ODE solvers +that leverage the specific ODE form of DPMs. However, they highly rely on +specific parameterization during inference (such as noise/data prediction), +which might not be the optimal choice. In this work, we propose a novel +formulation towards the optimal parameterization during sampling that minimizes +the first-order discretization error of the ODE solution. Based on such +formulation, we propose \textit{DPM-Solver-v3}, a new fast ODE solver for DPMs +by introducing several coefficients efficiently computed on the pretrained +model, which we call \textit{empirical model statistics}. We further +incorporate multistep methods and a predictor-corrector framework, and propose +some techniques for improving sample quality at small numbers of function +evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 +achieves consistently better or comparable performance in both unconditional +and conditional sampling with both pixel-space and latent-space DPMs, +especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) +on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable +Diffusion, bringing a speed-up of 15\%$\sim$30\% compared to previous +state-of-the-art training-free methods. Code is available at +\url{https://github.com/thu-ml/DPM-Solver-v3}. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Improving the Timing Resolution of Positron Emission Tomography + Detectors Using Boosted Learning -- A Residual Physics Approach + + +
+ Artificial intelligence (AI) is entering medical imaging, mainly enhancing +image reconstruction. Nevertheless, improvements throughout the entire +processing, from signal detection to computation, potentially offer significant +benefits. This work presents a novel and versatile approach to detector +optimization using machine learning (ML) and residual physics. We apply the +concept to positron emission tomography (PET), intending to improve the +coincidence time resolution (CTR). PET visualizes metabolic processes in the +body by detecting photons with scintillation detectors. Improved CTR +performance offers the advantage of reducing radioactive dose exposure for +patients. Modern PET detectors with sophisticated concepts and read-out +topologies represent complex physical and electronic systems requiring +dedicated calibration techniques. Traditional methods primarily depend on +analytical formulations successfully describing the main detector +characteristics. However, when accounting for higher-order effects, additional +complexities arise matching theoretical models to experimental reality. Our +work addresses this challenge by combining traditional calibration with AI and +residual physics, presenting a highly promising approach. We present a residual +physics-based strategy using gradient tree boosting and physics-guided data +generation. The explainable AI framework SHapley Additive exPlanations (SHAP) +was used to identify known physical effects with learned patterns. In addition, +the models were tested against basic physical laws. We were able to improve the +CTR significantly (more than 20%) for clinically relevant detectors of 19 mm +height, reaching CTRs of 185 ps (450-550 keV). + +
+
+
+
+
+ + ♻ ☆ Bounding Box-based Multi-objective Bayesian Optimization of Risk + Measures under Input Uncertainty + + +
+ In this study, we propose a novel multi-objective Bayesian optimization +(MOBO) method to efficiently identify the Pareto front (PF) defined by risk +measures for black-box functions under the presence of input uncertainty (IU). +Existing BO methods for Pareto optimization in the presence of IU are +risk-specific or without theoretical guarantees, whereas our proposed method +addresses general risk measures and has theoretical guarantees. The basic idea +of the proposed method is to assume a Gaussian process (GP) model for the +black-box function and to construct high-probability bounding boxes for the +risk measures using the GP model. Furthermore, in order to reduce the +uncertainty of non-dominated bounding boxes, we propose a method of selecting +the next evaluation point using a maximin distance defined by the maximum value +of a quasi distance based on bounding boxes. As theoretical analysis, we prove +that the algorithm can return an arbitrarily accurate solution in a finite number +of iterations with high probability, for various risk measures such as Bayes +risk, worst-case risk, and value-at-risk. We also give a theoretical analysis +that takes into account approximation errors because there exist non-negligible +approximation errors (e.g., finite approximation of PFs and sampling-based +approximation of bounding boxes) in practice. Through numerical experiments, we confirm that the proposed +method outperforms existing methods not only in the setting with +IU but also in the setting of ordinary MOBO. + +
+
+ comment: 39 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ Leave-one-out Distinguishability in Machine Learning + + +
+ We introduce a new analytical framework to quantify the changes in a machine +learning algorithm's output distribution following the inclusion of a few data +points in its training set, a notion we define as leave-one-out +distinguishability (LOOD). This problem is key to measuring data +**memorization** and **information leakage** in machine learning, and the +**influence** of training data points on model predictions. We illustrate how +our method broadens and refines existing empirical measures of memorization and +privacy risks associated with training data. We use Gaussian processes to model +the randomness of machine learning algorithms, and validate LOOD with extensive +empirical analysis of information leakage using membership inference attacks. +Our theoretical framework enables us to investigate the causes of information +leakage and where the leakage is high. For example, we analyze the influence of +activation functions on data memorization. Additionally, our method allows us +to optimize queries that disclose the most significant information about the +training data in the leave-one-out setting. We illustrate how optimal queries +can be used for accurate **reconstruction** of training data. + +
+
+ comment: Fixed typos +
+
+
+
+
+ + ♻ ☆ SpatialRank: Urban Event Ranking with NDCG Optimization on + Spatiotemporal Data NeurIPS + 2023 + + +
+ The problem of urban event ranking aims at predicting the top-k most risky +locations of future events such as traffic accidents and crimes. This problem +is of fundamental importance to public safety and urban administration +especially when limited resources are available. The problem is, however, +challenging due to complex and dynamic spatio-temporal correlations between +locations, uneven distribution of urban events in space, and the difficulty of +correctly ranking nearby locations with similar features. Prior works on event +forecasting mostly aim at accurately predicting the actual risk score or counts +of events for all the locations. Rankings obtained as such usually have low +quality due to prediction errors. Learning-to-rank methods directly optimize +measures such as Normalized Discounted Cumulative Gain (NDCG), but cannot +handle the spatiotemporal autocorrelation existing among locations. In this +paper, we bridge the gap by proposing a novel spatial event ranking approach +named SpatialRank. SpatialRank features adaptive graph convolution layers that +dynamically learn the spatiotemporal dependencies across locations from data. +In addition, the model optimizes through surrogates a hybrid NDCG loss with a +spatial component to better rank neighboring spatial locations. We design an +importance-sampling with a spatial filtering algorithm to effectively evaluate +the loss during training. Comprehensive experiments on three real-world +datasets demonstrate that SpatialRank can effectively identify the top riskiest +locations of crimes and traffic accidents and outperform state-of-the-art methods +in terms of NDCG by up to 12.7%. + +
+
+ comment: 37th Conference on Neural Information Processing Systems (NeurIPS + 2023) +
+
+
+
+
+ + ♻ ☆ Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, + and LLMs Evaluations NeurIPS 2023 + + +
+ This paper reexamines the research on out-of-distribution (OOD) robustness in +the field of NLP. We find that the distribution shift settings in previous +studies commonly lack adequate challenges, hindering the accurate evaluation of +OOD robustness. To address these issues, we propose a benchmark construction +protocol that ensures clear differentiation and challenging distribution +shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution +robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we +conduct a series of experiments on pre-trained language models for analysis and +evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the +relationship between in-distribution (ID) and OOD performance. We identify +three typical types that unveil the inner learning mechanism, which could +potentially facilitate the forecasting of OOD robustness, correlating with the +advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and +find that, despite exhibiting some effectiveness in specific cases, they do not +offer significant improvement compared to vanilla fine-tuning. Further, we +evaluate 5 LLMs with various adaptation paradigms and find that when sufficient +ID data is available, fine-tuning domain-specific models outperform LLMs on ID +examples significantly. However, in the case of OOD instances, prioritizing +LLMs with in-context learning yields better results. We identify that both +fine-tuned small models and LLMs face challenges in effectively addressing +downstream tasks. The code is public at +\url{https://github.com/lifan-yuan/OOD_NLP}. + +
+
+ comment: Accepted to NeurIPS 2023 Dataset and Benchmark Track. Code is + available at \url{https://github.com/lifan-yuan/OOD_NLP} +
+
+
+
+
+ + ♻ ☆ Diversified Outlier Exposure for Out-of-Distribution Detection via + Informative Extrapolation NeurIPS 2023 + + +
+ Out-of-distribution (OOD) detection is important for deploying reliable +machine learning models on real-world applications. Recent advances in outlier +exposure have shown promising results on OOD detection via fine-tuning model +with informatively sampled auxiliary outliers. However, previous methods assume +that the collected outliers can be sufficiently large and representative to +cover the boundary between ID and OOD data, which might be impractical and +challenging. In this work, we propose a novel framework, namely, Diversified +Outlier Exposure (DivOE), for effective OOD detection via informative +extrapolation based on the given auxiliary outliers. Specifically, DivOE +introduces a new learning objective, which diversifies the auxiliary +distribution by explicitly synthesizing more informative outliers for +extrapolation during training. It leverages a multi-step optimization method to +generate novel outliers beyond the original ones, which is compatible with many +variants of outlier exposure. Extensive experiments and analyses have been +conducted to characterize and demonstrate the effectiveness of the proposed +DivOE. The code is publicly available at: https://github.com/tmlr-group/DivOE. + +
+
+ comment: accepted by NeurIPS 2023 +
+
+
+
+
+
+
+
+ + Multimedia 5 + +
+
+
+ + ☆ Architecture Design of a Networked Music Performance Platform for a + Chamber Choir + + +
+ This paper describes an architecture design process for a Networked Music +Performance (NMP) platform for medium-sized conducted music ensembles, based on +remote rehearsals of the Academic Choir of Gdansk University of Technology. The +issues of real-time remote communication, in-person music performance, and NMP +are described. Three iterative steps defining and extending the architecture of +the NMP platform with additional features to enhance its utility in remote +rehearsals are presented. The first iteration uses a regular video conferencing +platform, the second iteration uses dedicated NMP devices and tools, and the +third iteration adds video transmission and utilizes professional low-latency +audio and video workstations. For each iteration, the platform architecture is +defined and deployed with simultaneous usability tests. Its strengths and +weaknesses are identified through qualitative and quantitative measurements - +statistical analysis shows a significant improvement in rehearsal quality after +each iteration. The final optimal architecture is described and concluded with +guidelines for creating NMP systems for said music ensembles. + +
+
+
+
+
+ + ☆ Comparing Photorealistic and Animated Embodied Conversational Agents in + Serious Games: An Empirical Study on User Experience + + +
+ Embodied conversational agents (ECAs) are paradigms of conversational user +interfaces in the form of embodied characters. While ECAs offer various +manipulable features, this paper focuses on a study conducted to explore two +distinct levels of presentation realism. The two agent versions are +photorealistic and animated. The study aims to provide insights and design +suggestions for speech-enabled ECAs within serious game environments. A +within-subjects, two-by-two factorial design was employed for this research +with a cohort of 36 participants balanced for gender. The results showed that +both the photorealistic and the animated versions were perceived as highly +usable, with overall mean scores of 5.76 and 5.71, respectively. However, 69.4 +per cent of the participants stated they preferred the photorealistic version, +25 per cent stated they preferred the animated version and 5.6 per cent had no +stated preference. The photorealistic agents were perceived as more realistic +and human-like, while the animated characters made the task feel more like a +game. Even though the agents' realism had no significant effect on usability, +it positively influenced participants' perceptions of the agent. This research +aims to lay the groundwork for future studies on ECA realism's impact in +serious games across diverse contexts. + +
+
+ comment: 21 pages, 14 figures, preprint to be published in HCI INTERNATIONAL + 2023 25TH INTERNATIONAL CONFERENCE ON HUMAN-COMPUTER INTERACTION proceedings +
+
+
+
+
+ + ☆ Automatic Edge Error Judgment in Figure Skating Using 3D Pose Estimation + from a Monocular Camera and IMUs + + +
+ Automatic evaluation systems are a fundamental topic in sports technologies. +In many sports, such as figure skating, automated evaluation methods based on +pose estimation have been proposed. However, previous studies have evaluated +skaters' skills in 2D analysis. In this paper, we propose an automatic edge +error judgment system with a monocular smartphone camera and inertial sensors, +which enable us to analyze 3D motions. Edge error is one of the most +significant scoring items and is challenging to automatically judge due to its +3D motion. The results show that the model using 3D joint position coordinates +estimated from the monocular camera as the input feature had the highest +accuracy at 83% for unknown skaters' data. We also performed a detailed motion +analysis for edge error judgment. These results indicate that the monocular +camera can be used to judge edge errors automatically. We will provide the +figure skating single Lutz jump dataset, including pre-processed videos and +labels, at https://github.com/ryota-takedalab/JudgeAI-LutzEdge. + +
+
+
+
+
+ + ☆ ControlLLM: Augment Language Models with Tools by Searching on Graphs + + +
+ We present ControlLLM, a novel framework that enables large language models +(LLMs) to utilize multi-modal tools for solving complex real-world tasks. +Despite the remarkable performance of LLMs, they still struggle with tool +invocation due to ambiguous user prompts, inaccurate tool selection and +parameterization, and inefficient tool scheduling. To overcome these +challenges, our framework comprises three key components: (1) a \textit{task +decomposer} that breaks down a complex task into clear subtasks with +well-defined inputs and outputs; (2) a \textit{Thoughts-on-Graph (ToG) +paradigm} that searches the optimal solution path on a pre-built tool graph, +which specifies the parameter and dependency relations among different tools; +and (3) an \textit{execution engine with a rich toolbox} that interprets the +solution path and runs the tools efficiently on different computational +devices. We evaluate our framework on diverse tasks involving image, audio, and +video processing, demonstrating its superior accuracy, efficiency, and +versatility compared to existing methods. + +
+
+ comment: 22 pages, 9 figures, 10 tables +
+
+
+
+
+ + ♻ ☆ Evaluating Object Hallucination in Large Vision-Language Models EMNLP 2023 + + +
+ Inspired by the superior language abilities of large language models (LLM), +large vision-language models (LVLM) have been recently explored by integrating +powerful LLMs for improving the performance on complex multimodal tasks. +Despite the promising progress on LVLMs, we find that LVLMs suffer from the +hallucination problem, i.e. they tend to generate objects that are inconsistent +with the target images in the descriptions. To investigate it, this work +presents the first systematic study on object hallucination of LVLMs. We +conduct the evaluation experiments on several representative LVLMs, and show +that they mostly suffer from severe object hallucination issues. We further +discuss how the visual instructions may influence the hallucination, and find +that objects that frequently occur in the visual instructions or co-occur with +the image objects are obviously prone to be hallucinated by LVLMs. Besides, we +find that existing evaluation methods might be affected by the input +instructions and generation styles of LVLMs. Thus, we further design an +improved evaluation method for object hallucination by proposing a +polling-based query method called POPE. Experiment results demonstrate that our +POPE can evaluate the object hallucination in a more stable and flexible way. +Our codes and data are publicly available at https://github.com/RUCAIBox/POPE. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 136 + +
+
+
+ + ☆ LLM-FP4: 4-Bit Floating-Point Quantized Transformers EMNLP 2023 + + +
+ We propose LLM-FP4 for quantizing both weights and activations in large +language models (LLMs) down to 4-bit floating-point values, in a post-training +manner. Existing post-training quantization (PTQ) solutions are primarily +integer-based and struggle with bit widths below 8 bits. Compared to integer +quantization, floating-point (FP) quantization is more flexible and can better +handle long-tail or bell-shaped distributions, and it has emerged as a default +choice in many hardware platforms. One characteristic of FP quantization is +that its performance largely depends on the choice of exponent bits and +clipping range. In this regard, we construct a strong FP-PTQ baseline by +searching for the optimal quantization parameters. Furthermore, we observe a +high inter-channel variance and low intra-channel variance pattern in +activation distributions, which adds activation quantization difficulty. We +recognize this pattern to be consistent across a spectrum of transformer models +designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models. +To tackle this, we propose per-channel activation quantization and show that +these additional scaling factors can be reparameterized as exponential biases +of weights, incurring a negligible cost. Our method, for the first time, can +quantize both weights and activations in the LLaMA-13B to only 4-bit and +achieves an average score of 63.1 on the common sense zero-shot reasoning +tasks, which is only 5.8 lower than the full-precision model, significantly +outperforming the previous state-of-the-art by 12.7 points. Code is available +at: https://github.com/nbasyl/LLM-FP4. + +
+
+ comment: EMNLP 2023 Main Conference +
+
+
+
+
+ + ☆ Discrete Diffusion Language Modeling by Estimating the Ratios of the + Data Distribution + + +
+ Despite their groundbreaking performance for many generative modeling tasks, +diffusion models have fallen short on discrete data domains such as natural +language. Crucially, standard diffusion models rely on the well-established +theory of score matching, but efforts to generalize this to discrete structures +have not yielded the same empirical gains. In this work, we bridge this gap by +proposing score entropy, a novel discrete score matching loss that is more +stable than existing methods, forms an ELBO for maximum likelihood training, +and can be efficiently optimized with a denoising variant. We scale our Score +Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2, +achieving highly competitive likelihoods while also introducing distinct +algorithmic advantages. In particular, when comparing similarly sized SEDD and +GPT-2 models, SEDD attains comparable perplexities (normally within $+10\%$ of +and sometimes outperforming the baseline). Furthermore, SEDD models learn a +more faithful sequence distribution (around $4\times$ better compared to GPT-2 +models with ancestral sampling as measured by large models), can trade off +compute for generation quality (needing only $16\times$ fewer network +evaluations to match GPT-2), and enables arbitrary infilling beyond the +standard left to right prompting. + +
+
+ comment: 30 pages +
+
+
+
+
+ + ☆ Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity + and Relation Extraction + + +
+ How can we better extract entities and relations from text? Using multimodal +extraction with images and text obtains more signals for entities and +relations, and aligns them through graphs or hierarchical fusion, aiding in +extraction. Despite attempts at various fusions, previous works have overlooked +many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes +innovative pre-training objectives for entity-object and relation-image +alignment, extracting objects from images and aligning them with entity and +relation prompts for soft pseudo-labels. These labels are used as +self-supervised signals for pre-training, enhancing the ability to extract +entities and relations. Experiments on three datasets show an average 3.41% F1 +improvement over prior SOTA. Additionally, our method is orthogonal to previous +multimodal fusions, and using it on prior SOTA fusions further improves 5.47% +F1. + +
+
+ comment: Accepted to ACM Multimedia 2023 +
+
+
+
+
+ + ☆ Can GPT models Follow Human Summarization Guidelines? Evaluating ChatGPT + and GPT-4 for Dialogue Summarization + + +
+ This study explores the capabilities of prompt-driven Large Language Models +(LLMs) like ChatGPT and GPT-4 in adhering to human guidelines for dialogue +summarization. Experiments employed DialogSum (English social conversations) +and DECODA (French call center interactions), testing various prompts: +including prompts from existing literature and those from human summarization +guidelines, as well as a two-step prompt approach. Our findings indicate that +GPT models often produce lengthy summaries and deviate from human summarization +guidelines. However, using human guidelines as an intermediate step shows +promise, outperforming direct word-length constraint prompts in some cases. The +results reveal that GPT models exhibit unique stylistic tendencies in their +summaries. While BERTScores did not dramatically decrease for GPT outputs +suggesting semantic similarity to human references and specialised pre-trained +models, ROUGE scores reveal grammatical and lexical disparities between +GPT-generated and human-written summaries. These findings shed light on the +capabilities and limitations of GPT models in following human instructions for +dialogue summarization. + +
+
+
+
+
+ + ☆ Language Agnostic Code Embeddings + + +
+ Recently, code language models have achieved notable advancements in +addressing a diverse array of essential code comprehension and generation +tasks. Yet, the field lacks a comprehensive deep dive and understanding of the +code embeddings of multilingual code models. In this paper, we present a +comprehensive study on multilingual code embeddings, focusing on the +cross-lingual capabilities of these embeddings across different programming +languages. Through probing experiments, we demonstrate that code embeddings +comprise two distinct components: one deeply tied to the nuances and syntax of +a specific language, and the other remaining agnostic to these details, +primarily focusing on semantics. Further, we show that when we isolate and +eliminate this language-specific component, we witness significant improvements +in downstream code retrieval tasks, leading to an absolute increase of up to ++17 in the Mean Reciprocal Rank (MRR). + +
+
+
+
+
+ + ☆ Improving a Named Entity Recognizer Trained on Noisy Data with a Few + Clean Instances + + +
+ To achieve state-of-the-art performance, one still needs to train NER models +on large-scale, high-quality annotated data, an asset that is both costly and +time-intensive to accumulate. In contrast, real-world applications often resort +to massive low-quality labeled data through non-expert annotators via +crowdsourcing and external knowledge bases via distant supervision as a +cost-effective alternative. However, these annotation methods result in noisy +labels, which in turn lead to a notable decline in performance. Hence, we +propose to denoise the noisy NER data with guidance from a small set of clean +instances. Along with the main NER model we train a discriminator model and use +its outputs to recalibrate the sample weights. The discriminator is capable of +detecting both span and category errors with different discriminative prompts. +Results on public crowdsourcing and distant supervision datasets show that the +proposed method can consistently improve performance with a small guidance set. + +
+
+ comment: 14 pages +
+
+
+
+
+ + ☆ Detecting Pretraining Data from Large Language Models + + +
+ Although large language models (LLMs) are widely deployed, the data used to +train them is rarely disclosed. Given the incredible scale of this data, up to +trillions of tokens, it is all but certain that it includes potentially +problematic text such as copyrighted materials, personally identifiable +information, and test data for widely reported reference benchmarks. However, +we currently have no way to know which data of these types is included or in +what proportions. In this paper, we study the pretraining data detection +problem: given a piece of text and black-box access to an LLM without knowing +the pretraining data, can we determine if the model was trained on the provided +text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that +uses data created before and after model training to support gold truth +detection. We also introduce a new detection method Min-K% Prob based on a +simple hypothesis: an unseen example is likely to contain a few outlier words +with low probabilities under the LLM, while a seen example is less likely to +have words with such low probabilities. Min-K% Prob can be applied without any +knowledge about the pretraining corpus or any additional training, departing +from previous detection methods that require training a reference model on data +that is similar to the pretraining data. Moreover, our experiments demonstrate +that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous +methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book +detection, and contaminated downstream example detection, and find it a +consistently effective solution. + +
+
+
+
+
+ + ☆ The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing + & Attribution in AI + + +
+ The race to train language models on vast, diverse, and inconsistently +documented datasets has raised pressing concerns about the legal and ethical +risks for practitioners. To remedy these practices threatening data +transparency and understanding, we convene a multi-disciplinary effort between +legal and machine learning experts to systematically audit and trace 1800+ text +datasets. We develop tools and standards to trace the lineage of these +datasets, from their source, creators, series of license conditions, +properties, and subsequent use. Our landscape analysis highlights the sharp +divides in composition and focus of commercially open vs closed datasets, with +closed datasets monopolizing important categories: lower resource languages, +more creative tasks, richer topic variety, newer and more synthetic training +data. This points to a deepening divide in the types of data that are made +available under different license conditions, and heightened implications for +jurisdictional legal interpretations of copyright and fair use. We also observe +frequent miscategorization of licenses on widely used dataset hosting sites, +with license omission of 72%+ and error rates of 50%+. This points to a crisis +in misattribution and informed use of the most popular datasets driving many +recent breakthroughs. As a contribution to ongoing improvements in dataset +transparency and responsible use, we release our entire audit, with an +interactive UI, the Data Provenance Explorer, which allows practitioners to +trace and filter on data provenance for the most popular open source finetuning +data collections: www.dataprovenance.org. + +
+
+ comment: 30 pages (18 main), 6 figures, 5 tables +
+
+
+
+
+ + ☆ Kiki or Bouba? Sound Symbolism in Vision-and-Language Models NeurIPS 2023 + + +
+ Although the mapping between sound and meaning in human language is assumed +to be largely arbitrary, research in cognitive science has shown that there are +non-trivial correlations between particular sounds and meanings across +languages and demographic groups, a phenomenon known as sound symbolism. Among +the many dimensions of meaning, sound symbolism is particularly salient and +well-demonstrated with regards to cross-modal associations between language and +the visual domain. In this work, we address the question of whether sound +symbolism is reflected in vision-and-language models such as CLIP and Stable +Diffusion. Using zero-shot knowledge probing to investigate the inherent +knowledge of these models, we find strong evidence that they do show this +pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our +work provides a novel method for demonstrating sound symbolism and +understanding its nature using computational tools. Our code will be made +publicly available. + +
+
+ comment: Accepted to NeurIPS 2023 (spotlight). Project webpage: + https://kiki-bouba.github.io/ +
+
+
+
+
+ + ☆ DEFT: Data Efficient Fine-Tuning for Large Language Models via + Unsupervised Core-Set Selection + + +
+ Recent advances have led to the availability of many pre-trained language +models (PLMs); however, a question that remains is how much data is truly +needed to fine-tune PLMs for downstream tasks. In this work, we introduce DEFT, +a data-efficient fine-tuning framework that leverages unsupervised core-set +selection to minimize the amount of data needed to fine-tune PLMs for +downstream tasks. We demonstrate the efficacy of our DEFT framework in the +context of text-editing LMs, and compare to the state-of-the-art text-editing +model, CoEDIT (Raheja et al., 2023). Our quantitative and qualitative results +demonstrate that DEFT models are just as accurate as CoEDIT while being +fine-tuned on ~70% less data. + +
+
+
+
+
+ + ☆ SuperHF: Supervised Iterative Learning from Human Feedback NeurIPS 2023 + + +
+ While large language models demonstrate remarkable capabilities, they often +present challenges in terms of safety, alignment with human values, and +stability during training. Here, we focus on two prevalent methods used to +align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning +from Human Feedback (RLHF). SFT is simple and robust, powering a host of +open-source models, while RLHF is a more sophisticated method used in top-tier +models like ChatGPT but also suffers from instability and susceptibility to +reward hacking. We propose a novel approach, Supervised Iterative Learning from +Human Feedback (SuperHF), which seeks to leverage the strengths of both +methods. Our hypothesis is two-fold: that the reward model used in RLHF is +critical for efficient data use and model generalization and that the use of +Proximal Policy Optimization (PPO) in RLHF may not be necessary and could +contribute to instability issues. SuperHF replaces PPO with a simple supervised +loss and a Kullback-Leibler (KL) divergence prior. It creates its own training +data by repeatedly sampling a batch of model outputs and filtering them through +the reward model in an online learning regime. We then break down the reward +optimization problem into three components: robustly optimizing the training +rewards themselves, preventing reward hacking (exploitation of the reward model +that degrades model performance) as measured by a novel METEOR similarity +metric, and maintaining good performance on downstream evaluations. Our +experimental results show SuperHF exceeds PPO-based RLHF on the training +objective, easily and favorably trades off high reward with low reward hacking, +improves downstream calibration, and performs the same on our GPT-4 based +qualitative evaluation scheme all the while being significantly simpler to +implement, highlighting SuperHF's potential as a competitive language model +alignment technique. + +
+
+ comment: Accepted to the Socially Responsible Language Modelling Research + (SoLaR) workshop at NeurIPS 2023 +
+
+
+
+
+ + ☆ IntenDD: A Unified Contrastive Learning Approach for Intent Detection + and Discovery EMNLP 2023 + + +
+ Identifying intents from dialogue utterances forms an integral component of +task-oriented dialogue systems. Intent-related tasks are typically formulated +either as a classification task, where the utterances are classified into +predefined categories or as a clustering task when new and previously unknown +intent categories need to be discovered from these utterances. Further, the +intent classification may be modeled in a multiclass (MC) or multilabel (ML) +setup. While typically these tasks are modeled as separate tasks, we propose +IntenDD, a unified approach leveraging a shared utterance encoding backbone. +IntenDD uses an entirely unsupervised contrastive learning strategy for +representation learning, where pseudo-labels for the unlabeled utterances are +generated based on their lexical features. Additionally, we introduce a +two-step post-processing setup for the classification tasks using modified +adsorption. Here, first, the residuals in the training data are propagated +followed by smoothing the labels both modeled in a transductive setting. +Through extensive evaluations on various benchmark datasets, we find that our +approach consistently outperforms competitive baselines across all three tasks. +On average, IntenDD reports percentage improvements of 2.32%, 1.26%, and 1.52% +in their respective metrics for few-shot MC, few-shot ML, and the intent +discovery tasks respectively. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning + in Large Language Models EMNLP 2023 + + +
+ Theory of Mind (ToM) is the ability to reason about one's own and others' +mental states. ToM plays a critical role in the development of intelligence, +language understanding, and cognitive processes. While previous work has +primarily focused on first and second-order ToM, we explore higher-order ToM, +which involves recursive reasoning on others' beliefs. We introduce HI-TOM, a +Higher Order Theory of Mind benchmark. Our experimental evaluation using +various Large Language Models (LLMs) indicates a decline in performance on +higher-order ToM tasks, demonstrating the limitations of current LLMs. We +conduct a thorough analysis of different failure cases of LLMs, and share our +thoughts on the implications of our findings on the future of NLP. + +
+
+ comment: Accepted at Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ PROMINET: Prototype-based Multi-View Network for Interpretable Email + Response Prediction EMNLP 2023 + + +
+ Email is a widely used tool for business communication, and email marketing +has emerged as a cost-effective strategy for enterprises. While previous +studies have examined factors affecting email marketing performance, limited +research has focused on understanding email response behavior by considering +email content and metadata. This study proposes a Prototype-based Multi-view +Network (PROMINET) that incorporates semantic and structural information from +email data. By utilizing prototype learning, the PROMINET model generates +latent exemplars, enabling interpretable email response prediction. The model +maps learned semantic and structural exemplars to observed samples in the +training data at different levels of granularity, such as document, sentence, +or phrase. The approach is evaluated on two real-world email datasets: the +Enron corpus and an in-house Email Marketing corpus. Experimental results +demonstrate that the PROMINET model outperforms baseline models, achieving a +~3% improvement in F1 score on both datasets. Additionally, the model provides +interpretability through prototypes at different granularity levels while +maintaining comparable performance to non-interpretable models. The learned +prototypes also show potential for generating suggestions to enhance email text +editing and improve the likelihood of effective email responses. This research +contributes to enhancing sender-receiver communication and customer engagement +in email interactions. + +
+
+ comment: Accepted at EMNLP 2023 (industry) +
+
+
+
+
+ + ☆ DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in + Indo-European Languages EMNLP 2023 + + +
+ Disfluency correction (DC) is the process of removing disfluent elements like +fillers, repetitions and corrections from spoken utterances to create readable +and interpretable text. DC is a vital post-processing step applied to Automatic +Speech Recognition (ASR) outputs, before subsequent processing by downstream +language understanding tasks. Existing DC research has primarily focused on +English due to the unavailability of large-scale open-source datasets. Towards +the goal of multilingual disfluency correction, we present a high-quality +human-annotated DC corpus covering four important Indo-European languages: +English, Hindi, German and French. We provide extensive analysis of results of +state-of-the-art DC models across all four languages obtaining F1 scores of +97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To +demonstrate the benefits of DC on downstream tasks, we show that DC leads to +5.65 points increase in BLEU scores on average when used in conjunction with a +state-of-the-art Machine Translation (MT) system. We release code to run our +experiments along with our annotated dataset here. + +
+
+ comment: Accepted at EMNLP 2023 Findings +
+
+
+
+
+ + ☆ HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis EMNLP-23 + + +
+ Authorship Analysis, also known as stylometry, has been an essential aspect +of Natural Language Processing (NLP) for a long time. Likewise, the recent +advancement of Large Language Models (LLMs) has made authorship analysis +increasingly crucial for distinguishing between human-written and AI-generated +texts. However, these authorship analysis tasks have primarily been focused on +written texts, not considering spoken texts. Thus, we introduce the largest +benchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark). +HANSEN encompasses meticulous curation of existing speech datasets accompanied +by transcripts, alongside the creation of novel AI-generated spoken text +datasets. Together, it comprises 17 human datasets, and AI-generated spoken +texts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To +evaluate and demonstrate the utility of HANSEN, we perform Authorship +Attribution (AA) & Author Verification (AV) on human-spoken datasets and +conduct Human vs. AI spoken text detection using state-of-the-art (SOTA) +models. While SOTA methods, such as character n-gram or Transformer-based +models, exhibit similar AA & AV performance in human-spoken datasets compared to +written ones, there is much room for improvement in AI-generated spoken text +detection. The HANSEN benchmark is available at: +https://huggingface.co/datasets/HANSEN-REPO/HANSEN. + +
+
+ comment: 9 pages, EMNLP-23 findings, 5 pages appendix, 6 figures, 17 tables +
+
+
+
+
+ + ☆ Improving Conversational Recommendation Systems via Bias Analysis and + Language-Model-Enhanced Data Augmentation EMNLP 2023 + + +
+ Conversational Recommendation System (CRS) is a rapidly growing research area +that has gained significant attention alongside advancements in language +modelling techniques. However, the current state of conversational +recommendation faces numerous challenges due to its relative novelty and +limited existing contributions. In this study, we delve into benchmark datasets +for developing CRS models and address potential biases arising from the +feedback loop inherent in multi-turn interactions, including selection bias and +multiple popularity bias variants. Drawing inspiration from the success of +generative data via using language models and data augmentation techniques, we +present two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model +performance while mitigating biases. Through extensive experiments on ReDial +and TG-ReDial benchmark datasets, we show a consistent improvement of CRS +techniques with our data augmentation approaches and offer additional insights +on addressing multiple newly formulated biases. + +
+
+ comment: Accepted by EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Disentangling Extraction and Reasoning in Multi-hop Spatial Reasoning EMNLP + + +
+ Spatial reasoning over text is challenging as the models not only need to +extract the direct spatial information from the text but also reason over those +and infer implicit spatial relations. Recent studies highlight the struggles +even large language models encounter when it comes to performing spatial +reasoning over text. In this paper, we explore the potential benefits of +disentangling the processes of information extraction and reasoning in models +to address this challenge. To explore this, we design various models that +disentangle extraction and reasoning (either symbolic or neural) and compare +them with state-of-the-art (SOTA) baselines with no explicit design for these +parts. Our experimental results consistently demonstrate the efficacy of +disentangling, showcasing its ability to enhance models' generalizability +within realistic data domains. + +
+
+ comment: Accepted in EMNLP-Finding 2023 +
+
+
+
+
+ + ☆ SkyMath: Technical Report + + +
+ Large language models (LLMs) have shown great potential to solve varieties of +natural language processing (NLP) tasks, including mathematical reasoning. In +this work, we present SkyMath, a large language model for mathematics with 13 +billion parameters. By applying self-compare fine-tuning, we have enhanced +mathematical reasoning abilities of Skywork-13B-Base remarkably. On GSM8K, +SkyMath outperforms all known open-source models of similar size and has +established a new SOTA performance. + +
+
+
+
+
+ + ☆ LLM Performance Predictors are good initializers for Architecture Search + + +
+ Large language models (LLMs) have become an integral component in solving a +wide range of NLP tasks. In this work, we explore a novel use case of using +LLMs to build performance predictors (PP): models that, given a specific deep +neural network architecture, predict its performance on a downstream task. We +design PP prompts for LLMs consisting of: (i) role: description of the role +assigned to the LLM, (ii) instructions: set of instructions to be followed by +the LLM to carry out performance prediction, (iii) hyperparameters: a +definition of each architecture-specific hyperparameter and (iv) +demonstrations: sample architectures along with their efficiency metrics and +'training from scratch' performance. For machine translation (MT) tasks, we +discover that GPT-4 with our PP prompts (LLM-PP) can predict the performance of +architecture with a mean absolute error matching the SOTA and a marginal +degradation in rank correlation coefficient compared to SOTA performance +predictors. Further, we show that the predictions from LLM-PP can be distilled +to a small regression model (LLM-Distill-PP). LLM-Distill-PP models +surprisingly retain the performance of LLM-PP largely and can be a +cost-effective alternative for heavy use cases of performance estimation. +Specifically, for neural architecture search (NAS), we propose a Hybrid-Search +algorithm for NAS (HS-NAS), which uses LLM-Distill-PP for the initial part of +search, resorting to the baseline predictor for rest of the search. We show +that HS-NAS performs very similar to SOTA NAS across benchmarks, reduces search +hours by 50% roughly, and in some cases, improves latency, GFLOPs, and model +size. + +
+
+
+
+
+ + ☆ Detection of news written by the ChatGPT through authorship attribution + performed by a Bidirectional LSTM model + + +
+ The large language model-based chatbot ChatGPT has gained a lot of popularity +since its launch and has been used in a wide range of situations. This research +centers around a particular situation, in which ChatGPT is used to produce news +that will be consumed by the population, facilitating the +production of fake news, the spread of misinformation and a lack of trust in news +sources. Aware of these problems, this research aims to build an artificial +intelligence model capable of performing authorship attribution on news +articles, identifying the ones written by ChatGPT. To achieve this goal, a +dataset containing equal amounts of human-written and ChatGPT-written news was +assembled and different natural language processing techniques were used to +extract features from it that were used to train, validate and test three +models built with different techniques. The best performance was produced by +the Bidirectional Long Short Term Memory (LSTM) Neural Network model, achieving +91.57% accuracy when tested against the data from the testing set. + +
+
+
+
+
+ + ☆ BabyStories: Can Reinforcement Learning Teach Baby Language Models to + Write Better Stories? CoNLL + + +
+ Language models have seen significant growth in the size of their corpus, +leading to notable performance improvements. Yet, there has been limited +progress in developing models that handle smaller, more human-like datasets. As +part of the BabyLM shared task, this study explores the impact of reinforcement +learning from human feedback (RLHF) on language models pretrained from scratch +with a limited training corpus. Comparing two GPT-2 variants, the larger model +performs better in storytelling tasks after RLHF fine-tuning. These findings +suggest that RLHF techniques may be more advantageous for larger models due to +their higher learning and adaptation capacity, though more experiments are +needed to confirm this finding. These insights highlight the potential benefits +of RLHF fine-tuning for language models within limited data, enhancing their +ability to maintain narrative focus and coherence while adhering better to +initial instructions in storytelling tasks. The code for this work is publicly +available at https://github.com/Zephyr1022/BabyStories-UTSA. + +
+
+ comment: Accepted to BabyLM workshop at CoNLL +
+
+
+
+
+ + ☆ SSLCL: An Efficient Model-Agnostic Supervised Contrastive Learning + Framework for Emotion Recognition in Conversations + + +
+ Emotion recognition in conversations (ERC) is a rapidly evolving task within +the natural language processing community, which aims to detect the emotions +expressed by speakers during a conversation. Recently, a growing number of ERC +methods have focused on leveraging supervised contrastive learning (SCL) to +enhance the robustness and generalizability of learned features. However, +current SCL-based approaches in ERC are impeded by the constraint of large +batch sizes and the lack of compatibility with most existing ERC models. To +address these challenges, we propose an efficient and model-agnostic SCL +framework named Supervised Sample-Label Contrastive Learning with Soft-HGR +Maximal Correlation (SSLCL), which eliminates the need for a large batch size +and can be seamlessly integrated with existing ERC models without introducing +any model-specific assumptions. Specifically, we introduce a novel perspective +on utilizing label representations by projecting discrete labels into dense +embeddings through a shallow multilayer perceptron, and formulate the training +objective to maximize the similarity between sample features and their +corresponding ground-truth label embeddings, while minimizing the similarity +between sample features and label embeddings of disparate classes. Moreover, we +innovatively adopt the Soft-HGR maximal correlation as a measure of similarity +between sample features and label embeddings, leading to significant +performance improvements over conventional similarity measures. Additionally, +multimodal cues of utterances are effectively leveraged by SSLCL as data +augmentations to boost model performances. Extensive experiments on two ERC +benchmark datasets, IEMOCAP and MELD, demonstrate the compatibility and +superiority of our proposed SSLCL framework compared to existing +state-of-the-art SCL methods. Our code is available at +https://github.com/TaoShi1998/SSLCL. + +
+
+
+
+
+ + ☆ ChatGPT is a Potential Zero-Shot Dependency Parser + + +
+ Pre-trained language models have been widely used in dependency parsing task +and have achieved significant improvements in parser performance. However, it +remains an understudied question whether pre-trained language models can +spontaneously exhibit the ability of dependency parsing without introducing +additional parser structure in the zero-shot scenario. In this paper, we +propose to explore the dependency parsing ability of large language models such +as ChatGPT and conduct linguistic analysis. The experimental results +demonstrate that ChatGPT is a potential zero-shot dependency parser, and the +linguistic analysis also shows some unique preferences in parsing outputs. + +
+
+ comment: 10 pages +
+
+
+
+
+ + ☆ ArTST: Arabic Text and Speech Transformer + + +
+ We present ArTST, a pre-trained Arabic text and speech transformer for +supporting open-source speech technologies for the Arabic language. The model +architecture follows the unified-modal framework, SpeechT5, that was recently +released for English, and is focused on Modern Standard Arabic (MSA), with +plans to extend the model for dialectal and code-switched Arabic in future +editions. We pre-trained the model from scratch on MSA speech and text data, +and fine-tuned it for the following tasks: Automatic Speech Recognition (ASR), +Text-To-Speech synthesis (TTS), and spoken dialect identification. In our +experiments comparing ArTST with SpeechT5, as well as with previously reported +results in these tasks, ArTST performs on a par with or exceeding the current +state-of-the-art in all three tasks. Moreover, we find that our pre-training is +conducive for generalization, which is particularly evident in the low-resource +TTS task. The pre-trained model as well as the fine-tuned ASR and TTS models +are released for research use. + +
+
+ comment: 11 pages, 1 figure, SIGARAB ArabicNLP 2023 +
+
+
+
+
+ + ☆ Context Does Matter: End-to-end Panoptic Narrative Grounding with + Deformable Attention Refined Matching Network ICDM 2023 + + +
+ Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that +aims to segment visual objects in images based on dense narrative captions. The +current state-of-the-art methods first refine the representation of phrase by +aggregating the most similar $k$ image pixels, and then match the refined text +representations with the pixels of the image feature map to generate +segmentation results. However, simply aggregating sampled image features +ignores the contextual information, which can lead to phrase-to-pixel +mis-match. In this paper, we propose a novel learning framework called +Deformable Attention Refined Matching Network (DRMN), whose main idea is to +bring deformable attention in the iterative process of feature learning to +incorporate essential context information of different scales of pixels. DRMN +iteratively re-encodes pixels with the deformable attention network after +updating the feature representation of the top-$k$ most similar pixels. As +such, DRMN can lead to accurate yet discriminative pixel representations, +purify the top-$k$ most similar pixels, and consequently alleviate the +phrase-to-pixel mis-match substantially. Experimental results show that our +novel design significantly improves the matching results between text phrases +and image pixels. Concretely, DRMN achieves new state-of-the-art performance on +the PNG benchmark with an average recall improvement of 3.5%. The codes are +available at: https://github.com/JaMesLiMers/DRMN. + +
+
+ comment: Accepted by ICDM 2023 +
+
+
+
+
+ + ☆ Back Transcription as a Method for Evaluating Robustness of Natural + Language Understanding Models to Speech Recognition Errors EMNLP 2023 + + +
+ In a spoken dialogue system, an NLU model is preceded by a speech recognition +system that can deteriorate the performance of natural language understanding. +This paper proposes a method for investigating the impact of speech recognition +errors on the performance of natural language understanding models. The +proposed method combines the back transcription procedure with a fine-grained +technique for categorizing the errors that affect the performance of NLU +models. The method relies on the usage of synthesized speech for NLU +evaluation. We show that the use of synthesized speech in place of audio +recording does not change the outcomes of the presented technique in a +significant way. + +
+
+ comment: Accepted to EMNLP 2023 main conference +
+
+
+
+
+ + ☆ On the Interplay between Fairness and Explainability + + +
+ In order to build reliable and trustworthy NLP applications, models need to +be both fair across different demographics and explainable. Usually these two +objectives, fairness and explainability, are optimized and/or examined +independently of each other. Instead, we argue that forthcoming, trustworthy +NLP systems should consider both. In this work, we perform a first study to +understand how they influence each other: do fair(er) models rely on more +plausible rationales? and vice versa. To this end, we conduct experiments on +two English multi-class text classification datasets, BIOS and ECtHR, that +provide information on gender and nationality, respectively, as well as +human-annotated rationales. We fine-tune pre-trained language models with +several methods for (i) bias mitigation, which aims to improve fairness; (ii) +rationale extraction, which aims to produce plausible explanations. We find +that bias mitigation algorithms do not always lead to fairer models. Moreover, +we discover that empirical fairness and explainability are orthogonal. + +
+
+ comment: 15 pages (incl Appendix), 4 figures, 8 tables +
+
+
+
+
+ + ☆ Tailoring Personality Traits in Large Language Models via + Unsupervisedly-Built Personalized Lexicons + + +
+ Personality plays a pivotal role in shaping human expression patterns, and +empowering and manipulating large language models (LLMs) with personality +traits holds significant promise in enhancing the user experience of LLMs. +However, prior approaches either rely on fine-tuning LLMs on a corpus enriched +with personalized expressions or necessitate the manual crafting of prompts to +induce LLMs to produce personalized responses. The former approaches demand +substantial time and resources for collecting sufficient training examples +while the latter might fail in enabling the precise manipulation of the +personality traits at a fine-grained level (e.g., achieving high agreeableness +while reducing openness). In this study, we introduce a novel approach for +tailoring personality traits within LLMs, allowing for the incorporation of any +combination of the Big Five factors (i.e., openness, conscientiousness, +extraversion, agreeableness, and neuroticism) in a pluggable manner. This is +achieved by employing a set of Unsupervisedly-Built Personalized Lexicons +(UBPL) that are utilized to adjust the probability of the next token predicted +by the original LLMs during the decoding phase. This adjustment encourages the +models to generate words present in the personalized lexicons while preserving +the naturalness of the generated texts. Extensive experimentation demonstrates +the effectiveness of our approach in finely manipulating LLMs' personality +traits. Furthermore, our method can be seamlessly integrated into other LLMs +without necessitating updates to their parameters. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ☆ WSDMS: Debunk Fake News via Weakly Supervised Detection of Misinforming + Sentences with Contextualized Social Wisdom + + +
+ In recent years, we have witnessed an explosion of false and unconfirmed +information (i.e., rumors) that went viral on social media and shocked the +public. Rumors can trigger versatile, mostly controversial stance expressions +among social media users. Rumor verification and stance detection are different +yet relevant tasks. Fake news debunking primarily focuses on determining the +truthfulness of news articles, which oversimplifies the issue as fake news +often combines elements of both truth and falsehood. Thus, it becomes crucial +to identify specific instances of misinformation within the articles. In this +research, we investigate a novel task in the field of fake news debunking, +which involves detecting sentence-level misinformation. One of the major +challenges in this task is the absence of a training dataset with +sentence-level annotations regarding veracity. Inspired by the Multiple +Instance Learning (MIL) approach, we propose a model called Weakly Supervised +Detection of Misinforming Sentences (WSDMS). This model only requires bag-level +labels for training but is capable of inferring both sentence-level +misinformation and article-level veracity, aided by relevant social media +conversations that are attentively contextualized with news sentences. We +evaluate WSDMS on three real-world benchmarks and demonstrate that it +outperforms existing state-of-the-art baselines in debunking fake news at both +the sentence and article levels. + +
+
+
+
+
+ + ☆ Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained + Language Models EMNLP + + +
+ Pre-trained Language Models (PLMs) are trained on vast unlabeled data, rich +in world knowledge. This fact has sparked the interest of the community in +quantifying the amount of factual knowledge present in PLMs, as this explains +their performance on downstream tasks, and potentially justifies their use as +knowledge bases. In this work, we survey methods and datasets that are used to +probe PLMs for factual knowledge. Our contributions are: (1) We propose a +categorization scheme for factual probing methods that is based on how their +inputs, outputs and the probed PLMs are adapted; (2) We provide an overview of +the datasets used for factual probing; (3) We synthesize insights about +knowledge retention and prompt optimization in PLMs, analyze obstacles to +adopting PLMs as knowledge bases and outline directions for future work. + +
+
+ comment: Accepted at EMNLP Findings 2023 +
+
+
+
+
+ + ☆ 1-PAGER: One Pass Answer Generation and Evidence Retrieval EMNLP 2023 + + +
+ We present 1-Pager, the first system that answers a question and retrieves +evidence using a single Transformer-based model and decoding process. 1-Pager +incrementally partitions the retrieval corpus using constrained decoding to +select a document and answer string, and we show that this is competitive with +comparable retrieve-and-read alternatives according to both retrieval and +answer accuracy metrics. 1-Pager also outperforms the equivalent closed-book +question answering model, by grounding predictions in an evidence corpus. While +1-Pager is not yet on-par with more expensive systems that read many more +documents before generating an answer, we argue that it provides an important +step toward attributed generation by folding retrieval into the +sequence-to-sequence paradigm that is currently dominant in NLP. We also show +that the search paths used to partition the corpus are easy to read and +understand, paving a way forward for interpretable neural retrieval. + +
+
+ comment: Accepted at EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ FedTherapist: Mental Health Monitoring with User-Generated Linguistic + Expressions on Smartphones via Federated Learning EMNLP 2023 + + +
+ Psychiatrists diagnose mental disorders via the linguistic use of patients. +Still, due to data privacy, existing passive mental health monitoring systems +use alternative features such as activity, app usage, and location via mobile +devices. We propose FedTherapist, a mobile mental health monitoring system that +utilizes continuous speech and keyboard input in a privacy-preserving way via +federated learning. We explore multiple model designs by comparing their +performance and overhead for FedTherapist to overcome the complex nature of +on-device language model training on smartphones. We further propose a +Context-Aware Language Learning (CALL) methodology to effectively utilize +smartphones' large and noisy text for mental health signal sensing. Our +IRB-approved evaluation of the prediction of self-reported depression, stress, +anxiety, and mood from 46 participants shows higher accuracy of FedTherapist +compared with the performance with non-language features, achieving 0.15 AUROC +improvement and 8.21% MAE reduction. + +
+
+ comment: Accepted to the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP 2023) +
+
+
+
+
+ + ☆ R$^3$ Prompting: Review, Rephrase and Resolve for Chain-of-Thought + Reasoning in Large Language Models under Noisy Context + + +
+ With the help of Chain-of-Thought (CoT) prompting, Large Language Models +(LLMs) have achieved remarkable performance on various reasoning tasks. +However, most of them have been evaluated under noise-free context and the +dilemma for LLMs to produce inaccurate results under the noisy context has not +been fully investigated. Existing studies utilize trigger sentences to +encourage LLMs to concentrate on the relevant information but the trigger has +limited effect on final answer prediction. Inspired by the interactive CoT method, +where intermediate reasoning steps are promoted by multiple rounds of +interaction between users and LLMs, we propose a novel prompting method, namely +R$^3$ prompting, for CoT reasoning under noisy context. Specifically, R$^3$ +prompting interacts with LLMs to perform key sentence extraction, variable +declaration and answer prediction, which corresponds to a thought process of +reviewing, rephrasing and resolving. The responses generated at the last +interaction serve as hints to guide the responses of the next +interaction. Our experiments show that R$^3$ prompting significantly +outperforms existing CoT prompting methods on five reasoning tasks under noisy +context. With GPT-3.5-turbo, we observe 3.7% accuracy improvement on average on +the reasoning tasks under noisy context compared to the most competitive +prompting baseline. More analyses and ablation studies show the robustness and +generalization of the R$^3$ prompting method in solving reasoning tasks in LLMs +under noisy context. + +
+
+
+
+
+ + ☆ An Early Evaluation of GPT-4V(ision) + + +
+ In this paper, we evaluate different abilities of GPT-4V including visual +understanding, language understanding, visual puzzle solving, and understanding +of other modalities such as depth, thermal, video, and audio. To estimate +GPT-4V's performance, we manually construct 656 test instances and carefully +evaluate the results of GPT-4V. The highlights of our findings are as follows: +(1) GPT-4V exhibits impressive performance on English visual-centric benchmarks +but fails to recognize simple Chinese texts in the images; (2) GPT-4V shows +inconsistent refusal behavior when answering questions related to sensitive +traits such as gender, race, and age; (3) GPT-4V obtains worse results than +GPT-4 (API) on language understanding tasks including general language +understanding benchmarks and visual commonsense knowledge evaluation +benchmarks; (4) Few-shot prompting can improve GPT-4V's performance on both +visual understanding and language understanding; (5) GPT-4V struggles to find +the nuances between two similar images and solve the easy math picture puzzles; +(6) GPT-4V shows non-trivial performance on the tasks of similar modalities to +image, such as video and thermal. Our experimental results reveal the ability +and limitations of GPT-4V and we hope our paper can provide some insights into +the application and research of GPT-4V. + +
+
+ comment: Technical Report. Data are available at + https://github.com/albertwy/GPT-4V-Evaluation +
+
+
+
+
+ + ☆ CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task + Information Retrieval EMNLP 2023 + + +
+ We present the Charles University system for the MRL 2023 Shared Task on +Multi-lingual Multi-task Information Retrieval. The goal of the shared task was +to develop systems for named entity recognition and question answering in +several under-represented languages. Our solutions to both subtasks rely on the +translate-test approach. We first translate the unlabeled examples into English +using a multilingual machine translation model. Then, we run inference on the +translated data using a strong task-specific model. Finally, we project the +labeled data back into the original language. To keep the inferred tags on the +correct positions in the original language, we propose a method based on +scoring the candidate positions using a label-sensitive translation model. In +both settings, we experiment with finetuning the classification models on the +translated data. However, due to a domain mismatch between the development data +and the shared task validation and test sets, the finetuned models could not +outperform our baselines. + +
+
+ comment: 8 pages, 2 figures; System description paper at the MRL 2023 workshop + at EMNLP 2023 +
+
+
+
+
+ + ☆ Improving Diversity of Demographic Representation in Large Language + Models via Collective-Critiques and Self-Voting EMNLP 2023 + + +
+ A crucial challenge for generative large language models (LLMs) is diversity: +when a user's prompt is under-specified, models may follow implicit assumptions +while generating a response, which may result in homogenization of the +responses, as well as certain demographic groups being under-represented or +even erased from the generated responses. In this paper, we formalize diversity +of representation in generative LLMs. We present evaluation datasets and +propose metrics to measure diversity in generated responses along people and +culture axes. We find that LLMs understand the notion of diversity, and that +they can reason and critique their own responses for that goal. This finding +motivated a new prompting technique called collective-critique and self-voting +(CCSV) to self-improve people diversity of LLMs by tapping into its diversity +reasoning capabilities, without relying on handcrafted examples or prompt +tuning. Extensive empirical experiments with both human and automated +evaluations show that our proposed approach is effective at improving people +and culture diversity, and outperforms all baseline methods by a large margin. + +
+
+ comment: To appear at EMNLP 2023 main conference +
+
+
+
+
+ + ☆ OccuQuest: Mitigating Occupational Bias for Inclusive Large Language + Models + + +
+ The emergence of large language models (LLMs) has revolutionized natural +language processing tasks. However, existing instruction-tuning datasets suffer +from occupational bias: the majority of data relates to only a few occupations, +which hampers the ability of instruction-tuned LLMs to generate helpful responses to +professional queries from practitioners in specific fields. To mitigate this +issue and promote occupation-inclusive LLMs, we create an instruction-tuning +dataset named OccuQuest, which contains 110,000+ prompt-completion pairs +and 30,000+ dialogues covering over 1,000 occupations in 26 occupational +categories. We systematically request ChatGPT, organizing queries +hierarchically based on Occupation, Responsibility, Topic, and Question, to +ensure a comprehensive coverage of occupational specialty inquiries. By +comparing with three commonly used datasets (Dolly, ShareGPT, and WizardLM), we +observe that OccuQuest exhibits a more balanced distribution across +occupations. Furthermore, we assemble three test sets for comprehensive +evaluation, an occu-test set covering 25 occupational categories, an estate set +focusing on real estate, and an occu-quora set containing real-world questions +from Quora. We then fine-tune LLaMA on OccuQuest to obtain OccuLLaMA, which +significantly outperforms state-of-the-art LLaMA variants (Vicuna, Tulu, and +WizardLM) on professional questions in GPT-4 and human evaluations. Notably, on +the occu-quora set, OccuLLaMA reaches a high win rate of 86.4% against +WizardLM. + +
+
+
+
+
+ + ☆ Subspace Chronicles: How Linguistic Information Emerges, Shifts and + Interacts during Language Model Training EMNLP 2023 + + +
+ Representational spaces learned via language modeling are fundamental to +Natural Language Processing (NLP), however there has been limited understanding +regarding how and when during training various types of linguistic information +emerge and interact. Leveraging a novel information theoretic probing suite, +which enables direct comparisons of not just task performance, but their +representational subspaces, we analyze nine tasks covering syntax, semantics +and reasoning, across 2M pre-training steps and five seeds. We identify +critical learning phases across tasks and time, during which subspaces emerge, +share information, and later disentangle to specialize. Across these phases, +syntactic knowledge is acquired rapidly after 0.5% of full training. Continued +performance improvements primarily stem from the acquisition of open-domain +knowledge, while semantics and reasoning tasks benefit from later boosts to +long-range contextualization and higher specialization. Measuring cross-task +similarity further reveals that linguistically related tasks share information +throughout training, and do so more during the critical phase of learning than +before or after. Our findings have implications for model interpretability, +multi-task learning, and learning from limited data. + +
+
+ comment: Accepted at EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ CLEX: Continuous Length Extrapolation for Large Language Models + + +
+ Transformer-based Large Language Models (LLMs) are pioneering advances in +many natural language processing tasks; however, their exceptional capabilities +are restricted within the preset context window of Transformer. Position +Embedding (PE) scaling methods, while effective in extending the context window +to a specific length, either demonstrate notable limitations in their +extrapolation abilities or sacrifice partial performance within the context +window. Length extrapolation methods, although theoretically capable of +extending the context window beyond the training sequence length, often +underperform in practical long-context applications. To address these +challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We +generalise the PE scaling approaches to model the continuous dynamics by +ordinary differential equations over the length scaling factor, thereby +overcoming the constraints of current PE scaling methods designed for specific +lengths. Moreover, by extending the dynamics to desired context lengths beyond +the training sequence length, CLEX facilitates the length extrapolation with +impressive performance in practical tasks. We demonstrate that CLEX can be +seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such +as LLaMA and GPT-NeoX, with negligible impact on training and inference +latency. Experimental results reveal that CLEX can effectively extend the +context window to over 4x or almost 8x training length, with no deterioration +in performance. Furthermore, when evaluated on the practical LongBench +benchmark, our model trained on a 4k length exhibits competitive performance +against state-of-the-art open-source models trained on context lengths up to +32k. + +
+
+
+
+
+ + ☆ Diversity Enhanced Narrative Question Generation for Storybooks EMNLP 2023 + + +
+ Question generation (QG) from a given context can enhance comprehension, +engagement, assessment, and overall efficacy in learning or conversational +environments. Despite recent advancements in QG, the challenge of enhancing or +measuring the diversity of generated questions often remains unaddressed. In +this paper, we introduce a multi-question generation model (mQG), which is +capable of generating multiple, diverse, and answerable questions by focusing +on context and questions. To validate the answerability of the generated +questions, we employ a SQuAD2.0 fine-tuned question answering model, +classifying the questions as answerable or not. We train and evaluate mQG on +the FairytaleQA dataset, a well-structured QA dataset based on storybooks, with +narrative questions. We further apply a zero-shot adaptation on the TellMeWhy +and SQuAD1.1 datasets. mQG shows promising results across various evaluation +metrics, among strong baselines. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning + in Language Models NeurIPS 2023 + + +
+ A long-standing goal of AI systems is to perform complex multimodal reasoning +like humans. Recently, large language models (LLMs) have made remarkable +strides in such multi-step reasoning on the language modality solely by +leveraging the chain of thought (CoT) to mimic human thinking. However, the +transfer of these advancements to multimodal contexts introduces heightened +challenges, including but not limited to the impractical need for +labor-intensive annotation and the limitations in terms of flexibility, +generalizability, and explainability. To evoke CoT reasoning in multimodality, +this work first conducts an in-depth analysis of these challenges posed by +multimodality and presents two key insights: "keeping critical thinking" and +"letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this +study proposes a novel DDCoT prompting that maintains a critical attitude +through negative-space prompting and incorporates multimodality into reasoning +by first dividing the reasoning responsibility of LLMs into reasoning and +recognition and then integrating the visual recognition capability of visual +models into the joint reasoning process. The rationales generated by DDCoT not +only improve the reasoning abilities of both large and small language models in +zero-shot prompting and fine-tuning learning, significantly outperforming +state-of-the-art methods but also exhibit impressive generalizability and +explainability. + +
+
+ comment: 24 pages, 13 figures, to be published in NeurIPS 2023 +
+
+
+
+
+ + ☆ PromptAgent: Strategic Planning with Language Models Enables + Expert-level Prompt Optimization + + +
+ Highly effective, task-specific prompts are often heavily engineered by +experts to integrate detailed instructions and domain insights based on a deep +understanding of both instincts of large language models (LLMs) and the +intricacies of the target task. However, automating the generation of such +expert-level prompts remains elusive. Existing prompt optimization methods tend +to overlook the depth of domain knowledge and struggle to efficiently explore +the vast space of expert-level prompts. Addressing this, we present +PromptAgent, an optimization method that autonomously crafts prompts equivalent +in quality to those handcrafted by experts. At its core, PromptAgent views +prompt optimization as a strategic planning problem and employs a principled +planning algorithm, rooted in Monte Carlo tree search, to strategically +navigate the expert-level prompt space. Inspired by human-like trial-and-error +exploration, PromptAgent induces precise expert-level insights and in-depth +instructions by reflecting on model errors and generating constructive error +feedback. Such a novel framework allows the agent to iteratively examine +intermediate prompts (states), refine them based on error feedbacks (actions), +simulate future rewards, and search for high-reward paths leading to expert +prompts. We apply PromptAgent to 12 tasks spanning three practical domains: +BIG-Bench Hard (BBH), as well as domain-specific and general NLP tasks, showing +it significantly outperforms strong Chain-of-Thought and recent prompt +optimization baselines. Extensive analyses emphasize its capability to craft +expert-level, detailed, and domain-insightful prompts with great efficiency and +generalizability. + +
+
+ comment: 34 pages, 10 figures +
+
+
+
+
+ + ☆ Enhanced Simultaneous Machine Translation with Word-level Policies EMNLP 2023 + + +
+ Recent years have seen remarkable advances in the field of Simultaneous +Machine Translation (SiMT) due to the introduction of innovative policies that +dictate whether to READ or WRITE at each step of the translation process. +However, a common assumption in many existing studies is that operations are +carried out at the subword level, even though the standard unit for input and +output in most practical scenarios is typically at the word level. This paper +demonstrates that policies devised and validated at the subword level are +surpassed by those operating at the word level, which process multiple subwords +to form a complete word in a single step. Additionally, we suggest a method to +boost SiMT models using language models (LMs), wherein the proposed word-level +policy plays a vital role in addressing the subword disparity between LMs and +SiMT models. Code is available at https://github.com/xl8-ai/WordSiMT. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Decoding Stumpers: Large Language Models vs. Human Problem-Solvers + + +
+ This paper investigates the problem-solving capabilities of Large Language +Models (LLMs) by evaluating their performance on stumpers, unique single-step +intuition problems that pose challenges for human solvers but are easily +verifiable. We compare the performance of four state-of-the-art LLMs +(Davinci-2, Davinci-3, GPT-3.5-Turbo, GPT-4) to human participants. Our +findings reveal that the new-generation LLMs excel in solving stumpers and +surpass human performance. However, humans exhibit superior skills in verifying +solutions to the same problems. This research enhances our understanding of +LLMs' cognitive abilities and provides insights for enhancing their +problem-solving potential across various domains. + +
+
+
+
+
+ + ☆ Video Referring Expression Comprehension via Transformer with + Content-conditioned Query + + +
+ Video Referring Expression Comprehension (REC) aims to localize a target +object in videos based on the queried natural language. Recent improvements in +video REC have been made using Transformer-based methods with learnable +queries. However, we contend that this naive query design is not ideal given +the open-world nature of video REC brought by text supervision. With numerous +potential semantic categories, relying on only a few slow-updated queries is +insufficient to characterize them. Our solution to this problem is to create +dynamic queries that are conditioned on both the input video and language to +model the diverse objects referred to. Specifically, we place a fixed number of +learnable bounding boxes throughout the frame and use corresponding region +features to provide prior information. Also, we noticed that current query +features overlook the importance of cross-modal alignment. To address this, we +align specific phrases in the sentence with semantically relevant visual areas, +annotating them in existing video datasets (VID-Sentence and VidSTG). By +incorporating these two designs, our proposed model (called ConFormer) +outperforms other models on widely benchmarked datasets. For example, in the +testing split of VID-Sentence dataset, ConFormer achieves 8.75% absolute +improvement on Accu.@0.6 compared to the previous state-of-the-art model. + +
+
+ comment: Accepted to ACM International Conference on Multimedia Workshop (ACM + MM), 2023. arXiv admin note: substantial text overlap with arXiv:2210.02953 +
+
+
+
+
+ + ☆ ZGUL: Zero-shot Generalization to Unseen Languages using Multi-source + Ensembling of Language Adapters + + +
+ We tackle the problem of zero-shot cross-lingual transfer in NLP tasks via +the use of language adapters (LAs). Most of the earlier works have explored +training with adapter of a single source (often English), and testing either +using the target LA or LA of another related language. Training target LA +requires unlabeled data, which may not be readily available for low resource +unseen languages: those that are neither seen by the underlying multilingual +language model (e.g., mBERT), nor do we have any (labeled or unlabeled) data +for them. We posit that for more effective cross-lingual transfer, instead of +just one source LA, we need to leverage LAs of multiple (linguistically or +geographically related) source languages, both at train and test-time - which +we investigate via our novel neural architecture, ZGUL. Extensive +experimentation across four language groups, covering 15 unseen target +languages, demonstrates improvements of up to 3.2 average F1 points over +standard fine-tuning and other strong baselines on POS tagging and NER tasks. +We also extend ZGUL to settings where either (1) some unlabeled data or (2) +few-shot training examples are available for the target language. We find that +ZGUL continues to outperform baselines in these settings too. + +
+
+
+
+
+ + ☆ Transformer-based Live Update Generation for Soccer Matches from + Microblog Posts EMNLP 2023 + + +
+ It has been known to be difficult to generate adequate sports updates from a +sequence of vast amounts of diverse live tweets, although the live sports +viewing experience with tweets is gaining popularity. In this paper, we +focus on soccer matches and work on building a system to generate live updates +for soccer matches from tweets so that users can instantly grasp a match's +progress and enjoy the excitement of the match from raw tweets. Our proposed +system is based on a large pre-trained language model and incorporates a +mechanism to control the number of updates and a mechanism to reduce the +redundancy of duplicate and similar updates. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ InstructPTS: Instruction-Tuning LLMs for Product Title Summarization EMNLP 2023 + + +
+ E-commerce product catalogs contain billions of items. Most products have +lengthy titles, as sellers pack them with product attributes to improve +retrieval, and highlight key product aspects. This results in a gap between +such unnatural products titles, and how customers refer to them. It also limits +how e-commerce stores can use these seller-provided titles for recommendation, +QA, or review summarization. + Inspired by recent work on instruction-tuned LLMs, we present InstructPTS, a +controllable approach for the task of Product Title Summarization (PTS). +Trained using a novel instruction fine-tuning strategy, our approach is able to +summarize product titles according to various criteria (e.g. number of words in +a summary, inclusion of specific phrases, etc.). Extensive evaluation on a +real-world e-commerce catalog shows that compared to simple fine-tuning of +LLMs, our proposed approach can generate more accurate product name summaries, +with an improvement of over 14 and 8 BLEU and ROUGE points, respectively. + +
+
+ comment: Accepted by EMNLP 2023 (Industry Track) +
+
+
+
+
+ + ☆ From Simple to Complex: A Progressive Framework for Document-level + Informative Argument Extraction EMNLP 2023 + + +
+ Document-level Event Argument Extraction (EAE) requires the model to extract +arguments of multiple events from a single document. Considering the underlying +dependencies between these events, recent efforts leverage the idea of +"memory", where the results of already predicted events are cached and can be +retrieved to help the prediction of upcoming events. These methods extract +events according to their appearance order in the document, however, the event +that appears in the first sentence does not mean that it is the easiest to +extract. Existing methods might introduce noise to the extraction of upcoming +events if they rely on an incorrect prediction of previous events. In order to +provide more reliable memory, we propose a simple-to-complex progressive +framework for document-level EAE. Specifically, we first calculate the +difficulty of each event and then, we conduct the extraction following a +simple-to-complex order. In this way, the memory will store the most certain +results, and the model could use these reliable sources to help the prediction +of more difficult events. Experiments on WikiEvents show that our model +outperforms SOTA by 1.4% in F1, indicating the proposed simple-to-complex +framework is useful in the EAE task. + +
+
+ comment: Accepted to the Findings of EMNLP 2023 (Long Paper) +
+
+
+
+
+ + ☆ A Multi-Modal Multilingual Benchmark for Document Image Classification EMNLP 2023 + + +
+ Document image classification is different from plain-text document +classification and consists of classifying a document by understanding the +content and structure of documents such as forms, emails, and other such +documents. We show that the only existing dataset for this task (Lewis et al., +2006) has several limitations and we introduce two newly curated multilingual +datasets WIKI-DOC and MULTIEURLEX-DOC that overcome these limitations. We +further undertake a comprehensive study of popular visually-rich document +understanding or Document AI models in previously untested settings in document +image classification such as 1) multi-label classification, and 2) zero-shot +cross-lingual transfer setup. Experimental results show limitations of +multilingual Document AI models on cross-lingual transfer across typologically +distant languages. Our datasets and findings open the door for future research +into improving Document AI models. + +
+
+ comment: Accepted to EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Unraveling Feature Extraction Mechanisms in Neural Networks EMNLP 2023 + + +
+ The underlying mechanism of neural networks in capturing precise knowledge +has been the subject of consistent research efforts. In this work, we propose a +theoretical approach based on Neural Tangent Kernels (NTKs) to investigate such +mechanisms. Specifically, considering the infinite network width, we +hypothesize the learning dynamics of target models may intuitively unravel the +features they acquire from training data, deepening our insights into their +internal mechanisms. We apply our approach to several fundamental models and +reveal how these models leverage statistical features during gradient descent +and how they are integrated into final decisions. We also discovered that the +choice of activation function can affect feature extraction. For instance, the +use of the ReLU activation function could potentially introduce a bias +in features, providing a plausible explanation for its replacement with +alternative functions in recent pre-trained language models. Additionally, we +find that while self-attention and CNN models may exhibit limitations in +learning n-grams, multiplication-based models seem to excel in this area. We +verify these theoretical findings through experiments and find that they can be +applied to analyze language modeling tasks, which can be regarded as a special +variant of classification. Our contributions offer insights into the roles and +capacities of fundamental components within large language models, thereby +aiding the broader understanding of these complex systems. + +
+
+ comment: Accepted by EMNLP 2023 +
+
+
+
+
+ + ☆ A Comprehensive Evaluation of Constrained Text Generation for Large + Language Models + + +
+ Advancements in natural language generation (NLG) and large language models +(LLMs) have led to proficient text generation in various tasks. However, +integrating intricate constraints into neural text generation, due to LLMs' +opacity, remains challenging. This study investigates constrained text +generation for LLMs, where predefined constraints are applied during LLM's +generation process. Our research examines multiple LLMs, including ChatGPT and +GPT-4, categorizing constraints into lexical, structural, and relation-based +types. We also present various benchmarks to facilitate fair evaluation. The +study addresses some key research questions, including the extent of LLMs' +compliance with constraints. Results illuminate LLMs' capacity and deficiency +to incorporate constraints and provide insights for future developments in +constrained text generation. Codes and datasets will be released upon +acceptance. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ☆ RCAgent: Cloud Root Cause Analysis by Autonomous Agents with + Tool-Augmented Large Language Models + + +
+ Large language model (LLM) applications in cloud root cause analysis (RCA) +have been actively explored recently. However, current methods are still +reliant on manual workflow settings and do not unleash LLMs' decision-making +and environment interaction capabilities. We present RCAgent, a tool-augmented +LLM autonomous agent framework for practical and privacy-aware industrial RCA +usage. Running on an internally deployed model rather than GPT families, +RCAgent is capable of free-form data collection and comprehensive analysis with +tools. Our framework combines a variety of enhancements, including a unique +Self-Consistency for action trajectories, and a suite of methods for context +management, stabilization, and importing domain knowledge. Our experiments show +RCAgent's evident and consistent superiority over ReAct across all aspects of +RCA -- predicting root causes, solutions, evidence, and responsibilities -- and +tasks covered or uncovered by current rules, as validated by both automated +metrics and human evaluations. Furthermore, RCAgent has already been integrated +into the diagnosis and issue discovery workflow of the Real-time Compute +Platform for Apache Flink of Alibaba Cloud. + +
+
+
+
+
+ + ☆ Generative Pre-training for Speech with Flow Matching + + +
+ Generative models have gained more and more attention in recent years for +their remarkable success in tasks that required estimating and sampling data +distribution to generate high-fidelity synthetic data. In speech, +text-to-speech synthesis and neural vocoder are good examples where generative +models have shined. While generative models have been applied to different +applications in speech, there exists no general-purpose generative model that +models speech directly. In this work, we take a step toward this direction by +showing a single pre-trained generative model can be adapted to different +downstream tasks with strong performance. Specifically, we pre-trained a +generative model, named SpeechFlow, on 60k hours of untranscribed speech with +Flow Matching and masked conditions. Experiment results show the pre-trained +generative model can be fine-tuned with task-specific data to match or surpass +existing expert models on speech enhancement, separation, and synthesis. Our +work suggested a foundational model for generation tasks in speech can be built +with generative pre-training. + +
+
+ comment: Preprint, under review +
+
+
+
+
+ + ☆ CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment + of Coherence in Generated Texts + + +
+ Coherence is a linguistic term that refers to the relations between small +textual units (sentences, propositions), which make the text logically +consistent and meaningful to the reader. With the advances of generative +foundational models in NLP, there is a pressing need to automatically assess +the human-perceived coherence of automatically generated texts. Up until now, +little work has been done on explicitly assessing the coherence of generated +texts and analyzing the factors contributing to (in)coherence. Previous work on +the topic used other tasks, e.g., sentence reordering, as proxies of coherence, +rather than approaching coherence detection head-on. In this paper, we +introduce CoheSentia, a novel benchmark of human-perceived coherence of +automatically generated texts. Our annotation protocol reflects two +perspectives: one is global, assigning a single coherence score, and the other +is incremental, scoring sentence by sentence. The incremental method produces +an (in)coherence score for each text fragment and also pinpoints reasons for +incoherence at that point. Our benchmark contains 500 automatically-generated +and human-annotated paragraphs, each annotated in both methods, by multiple +raters. Our analysis shows that the inter-annotator agreement in the +incremental mode is higher than in the holistic alternative, and our +experiments show that standard LMs fine-tuned for coherence detection show +varied performance on the different factors contributing to (in)coherence. All +in all, these models yield unsatisfactory performance, emphasizing the need for +developing more reliable methods for coherence assessment. + +
+
+
+
+
+ + ☆ Samsung R&D Institute Philippines at WMT 2023 + + +
+ In this paper, we describe the constrained MT systems submitted by Samsung +R&D Institute Philippines to the WMT 2023 General Translation Task for two +directions: en$\rightarrow$he and he$\rightarrow$en. Our systems consist of +Transformer-based sequence-to-sequence models that are trained with a mix of +best practices: comprehensive data preprocessing pipelines, synthetic +backtranslated data, and the use of noisy channel reranking during online +decoding. Our models perform comparably to, and sometimes outperform, strong +baseline unconstrained systems such as mBART50 M2M and NLLB 200 MoE despite +having significantly fewer parameters on two public benchmarks: FLORES-200 and +NTREX-128. + +
+
+ comment: To appear in Proceedings of the Eighth Conference on Machine + Translation 2023 (WMT) +
+
+
+
+
+ + ☆ DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue + Assessment EMNLP 2023 + + +
+ Dialogue assessment plays a critical role in the development of open-domain +dialogue systems. Existing work is incapable of providing an end-to-end and +human-epistemic assessment dataset: it either provides only sub-metrics like +coherence, or its dialogues are conducted between annotators, far from real user +settings. In this paper, we release a large-scale dialogue quality assessment +dataset (DiQAD), for automatically assessing open-domain dialogue quality. +Specifically, we (1) establish the assessment criteria based on the dimensions +conforming to human judgements on dialogue qualities, and (2) annotate +large-scale dialogues conducted between real users based on these +annotation criteria, yielding around 100,000 dialogues. We conduct +several experiments and report the performances of the baselines as the +benchmark on DiQAD. The dataset is openly accessible at +https://github.com/yukunZhao/Dataset_Dialogue_quality_evaluation. + +
+
+ comment: Accepted to Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ URL-BERT: Training Webpage Representations via Social Media Engagements + + +
+ Understanding and representing webpages is crucial to online social networks +where users may share and engage with URLs. Common language model (LM) encoders +such as BERT can be used to understand and represent the textual content of +webpages. However, these representations may not model thematic information of +web domains and URLs or accurately capture their appeal to social media users. +In this work, we introduce a new pre-training objective that can be used to +adapt LMs to understand URLs and webpages. Our proposed framework consists of +two steps: (1) scalable graph embeddings to learn shallow representations of +URLs based on user engagement on social media and (2) a contrastive objective +that aligns LM representations with the aforementioned graph-based +representation. We apply our framework to the multilingual version of BERT to +obtain the model URL-BERT. We experimentally demonstrate that our continued +pre-training approach improves webpage understanding on a variety of tasks and +Twitter internal and external benchmarks. + +
+
+
+
+
+ + ☆ Is ChatGPT a Good Multi-Party Conversation Solver? EMNLP 2023 + + +
+ Large Language Models (LLMs) have emerged as influential instruments within +the realm of natural language processing; nevertheless, their capacity to +handle multi-party conversations (MPCs) -- a scenario marked by the presence of +multiple interlocutors involved in intricate information exchanges -- remains +uncharted. In this paper, we delve into the potential of generative LLMs such +as ChatGPT and GPT-4 within the context of MPCs. An empirical analysis is +conducted to assess the zero-shot learning capabilities of ChatGPT and GPT-4 by +subjecting them to evaluation across three MPC datasets that encompass five +representative tasks. The findings reveal that ChatGPT's performance on a +number of evaluated MPC tasks leaves much to be desired, whilst GPT-4's results +portend a promising future. Additionally, we endeavor to bolster performance +through the incorporation of MPC structures, encompassing both speaker and +addressee architecture. This study provides an exhaustive evaluation and +analysis of applying generative LLMs to MPCs, casting a light upon the +conception and creation of increasingly effective and robust MPC agents. +Concurrently, this work underscores the challenges implicit in the utilization +of LLMs for MPCs, such as deciphering graphical information flows and +generating stylistically consistent responses. + +
+
+ comment: Accepted by Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ XFEVER: Exploring Fact Verification across Languages + + +
+ This paper introduces the Cross-lingual Fact Extraction and VERification +(XFEVER) dataset designed for benchmarking the fact verification models across +different languages. We constructed it by translating the claim and evidence +texts of the Fact Extraction and VERification (FEVER) dataset into six +languages. The training and development sets were translated using machine +translation, whereas the test set includes texts translated by professional +translators and machine-translated texts. Using the XFEVER dataset, two +cross-lingual fact verification scenarios, zero-shot learning and +translate-train learning, are defined, and baseline models for each scenario +are also proposed in this paper. Experimental results show that the +multilingual language model can be used to build fact verification models in +different languages efficiently. However, the performance varies by language +and is somewhat inferior to the English case. We also found that we can +effectively mitigate model miscalibration by considering the prediction +similarity between the English and target languages. The XFEVER dataset, code, +and model checkpoints are available at +https://github.com/nii-yamagishilab/xfever. + +
+
+ comment: Accepted for an oral presentation at the 35th Conference on + Computational Linguistics and Speech Processing (ROCLING 2023) +
+
+
+
+
+ + ☆ CycleAlign: Iterative Distillation from Black-box LLM to White-box + Models for Better Human Alignment + + +
+ Language models trained on large-scale corpora often generate content that is +harmful, toxic, or contrary to human preferences, making their alignment with +human values a critical concern. Reinforcement learning from human feedback +(RLHF) with algorithms like PPO is a prevalent approach for alignment but is +often complex, unstable, and resource-intensive. Recently, ranking-based +alignment methods have emerged, offering stability and effectiveness by +replacing the RL framework with supervised fine-tuning, but they are costly due +to the need for annotated data. Considering that existing large language models +(LLMs) like ChatGPT are already relatively well-aligned and cost-friendly, +researchers have begun to align the language model with human preference from +AI feedback. The common practices, which unidirectionally distill the +instruction-following responses from LLMs, are constrained by their bottleneck. +Thus we introduce CycleAlign to distill alignment capabilities from +parameter-invisible LLMs (black-box) to a parameter-visible model (white-box) +in an iterative manner. With in-context learning (ICL) as the core of the +cycle, the black-box models are able to rank the model-generated responses +guided by human-crafted instructions and demonstrations about their preferences. +During iterative interaction, the white-box models also form a judgment about +the responses they generate. Consequently, the agreement ranking could be +viewed as a pseudo label to dynamically update the in-context demonstrations +and improve the preference ranking ability of black-box models. Through +multiple interactions, the CycleAlign framework could align the white-box model +with the black-box model effectively in a low-resource way. Empirical results +illustrate that the model fine-tuned by CycleAlign remarkably exceeds existing +methods, and achieves the state-of-the-art performance in alignment with human +value. + +
+
+
+
+
+ + ☆ Attention Lens: A Tool for Mechanistically Interpreting the Attention + Head Information Retrieval Mechanism + + +
+ Transformer-based Large Language Models (LLMs) are the state-of-the-art for +natural language tasks. Recent work has attempted to decode, by reverse +engineering the role of linear layers, the internal mechanisms by which LLMs +arrive at their final predictions for text completion tasks. Yet little is +known about the specific role of attention heads in producing the final token +prediction. We propose Attention Lens, a tool that enables researchers to +translate the outputs of attention heads into vocabulary tokens via learned +attention-head-specific transformations called lenses. Preliminary findings +from our trained lenses indicate that attention heads play highly specialized +roles in language models. The code for Attention Lens is available at +github.com/msakarvadia/AttentionLens. + +
+
+
+
+
+ + ☆ Multilingual Coarse Political Stance Classification of Media. The + Editorial Line of a ChatGPT and Bard Newspaper EMNLP 2023 + + +
+ Neutrality is difficult to achieve and, in politics, subjective. Traditional +media typically adopt an editorial line that can be used by their potential +readers as an indicator of the media bias. Several platforms currently rate +news outlets according to their political bias. The editorial line and the +ratings help readers in gathering a balanced view of news. But in the advent of +instruction-following language models, tasks such as writing a newspaper +article can be delegated to computers. Without imposing a biased persona, where +would an AI-based news outlet lie within the bias ratings? In this work, we use +the ratings of authentic news outlets to create a multilingual corpus of news +with coarse stance annotations (Left and Right) along with automatically +extracted topic annotations. We show that classifiers trained on this data are +able to identify the editorial line of most unseen newspapers in English, +German, Spanish and Catalan. We then apply the classifiers to 101 +newspaper-like articles written by ChatGPT and Bard in the 4 languages at +different time periods. We observe that, similarly to traditional newspapers, +ChatGPT editorial line evolves with time and, being a data-driven system, the +stance of the generated articles differs among languages. + +
+
+ comment: To be published at EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Enhancing Large Language Models for Secure Code Generation: A + Dataset-driven Study on Vulnerability Mitigation + + +
+ Large language models (LLMs) have brought significant advancements to code +generation, benefiting both novice and experienced developers. However, their +training using unsanitized data from open-source repositories, like GitHub, +introduces the risk of inadvertently propagating security vulnerabilities. To +effectively mitigate this concern, this paper presents a comprehensive study +focused on evaluating and enhancing code LLMs from a software security +perspective. We introduce SecuCoGen (uploaded as +supplemental material; it will be made publicly available after publication), +a meticulously curated dataset targeting 21 critical vulnerability types. +SecuCoGen comprises 180 samples and serves as the foundation for conducting +experiments on three crucial code-related tasks: code generation, code repair +and vulnerability classification, with a strong emphasis on security. Our +experimental results reveal that existing models often overlook security +concerns during code generation, leading to the generation of vulnerable code. +To address this, we propose effective approaches to mitigate the security +vulnerabilities and enhance the overall robustness of code generated by LLMs. +Moreover, our study identifies weaknesses in existing models' ability to repair +vulnerable code, even when provided with vulnerability information. +Additionally, certain vulnerability types pose challenges for the models, +hindering their performance in vulnerability classification. Based on these +findings, we believe our study will have a positive impact on the software +engineering community, inspiring the development of improved methods for +training and utilizing LLMs, thereby leading to safer and more trustworthy +model deployment. + +
+
+
+
+
+ + ☆ The Distributional Hypothesis Does Not Fully Explain the Benefits of + Masked Language Model Pretraining EMNLP 2023 + + +
+ We analyze the masked language modeling pretraining objective function from +the perspective of the distributional hypothesis. We investigate whether better +sample efficiency and the better generalization capability of models pretrained +with masked language modeling can be attributed to the semantic similarity +encoded in the pretraining data's distributional property. Via a synthetic +dataset, our analysis suggests that distributional property indeed leads to the +better sample efficiency of pretrained masked language models, but does not +fully explain the generalization capability. We also conduct analyses over two +real-world datasets and demonstrate that the distributional property does not +explain the generalization ability of pretrained natural language models +either. Our results illustrate our limited understanding of model pretraining +and provide future research directions. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ math-PVS: A Large Language Model Framework to Map Scientific + Publications to PVS Theories + + +
+ As artificial intelligence (AI) gains greater adoption in a wide variety of +applications, it has immense potential to contribute to mathematical discovery, +by guiding conjecture generation, constructing counterexamples, assisting in +formalizing mathematics, and discovering connections between different +mathematical areas, to name a few. + While prior work has leveraged computers for exhaustive mathematical proof +search, recent efforts based on large language models (LLMs) aspire to position +computing platforms as co-contributors in the mathematical research process. +Despite their current limitations in logic and mathematical tasks, there is +growing interest in melding theorem proving systems with foundation models. +This work investigates the applicability of LLMs in formalizing advanced +mathematical concepts and proposes a framework that can critically review and +check mathematical reasoning in research papers. Given the noted reasoning +shortcomings of LLMs, our approach synergizes the capabilities of proof +assistants, specifically PVS, with LLMs, enabling a bridge between textual +descriptions in academic papers and formal specifications in PVS. By harnessing +the PVS environment, coupled with data ingestion and conversion mechanisms, we +envision an automated process, called math-PVS, to extract and formalize +mathematical theorems from research papers, offering an innovative tool for +academic review and discovery. + +
+
+
+
+
+ + ☆ BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' + Generation EMNLP 2023 + + +
+ Large language models (LLMs) such as GPT-3 have demonstrated a strong +capability to generate coherent and contextually relevant text. However, amidst +their successes, a crucial issue persists: their generated outputs still lack +commonsense at times. Moreover, fine-tuning the entire LLM towards more +commonsensical outputs is computationally expensive if not infeasible. In this +paper, we present a computation-efficient framework that steers a frozen +Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., +producing a plausible output that incorporates a list of concepts in a +meaningful way). Specifically, we first construct a reference-free evaluator +that assigns a sentence with a commonsensical score by grounding the sentence +to a dynamic commonsense knowledge base from four different relational aspects. +We then use the scorer as the oracle for commonsense knowledge, and extend the +controllable generation method called NADO to train an auxiliary head that +guides a fixed PTLM to better satisfy the oracle. We test our framework on a +series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two +constrained concept-to-sentence benchmarks. Human evaluation results +demonstrate that our method consistently leads to the most commonsensical +outputs. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ On Surgical Fine-tuning for Language Encoders EMNLP 2023 + + +
+ Fine-tuning all the layers of a pre-trained neural language encoder (either +using all the parameters or using parameter-efficient methods) is often the +de-facto way of adapting it to a new task. We show evidence that for different +downstream language tasks, fine-tuning only a subset of layers is sufficient to +obtain performance that is close to and often better than fine-tuning all the +layers in the language encoder. We propose an efficient metric based on the +diagonal of the Fisher information matrix (FIM score), to select the candidate +layers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE +tasks and across distinct language encoders, that this metric can effectively +select layers leading to a strong downstream performance. Our work highlights +that task-specific information corresponding to a given downstream task is +often localized within a few layers, and tuning only those is sufficient for +strong performance. Additionally, we demonstrate the robustness of the FIM +score to rank layers in a manner that remains constant during the optimization +process. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Follow-on Question Suggestion via Voice Hints for Voice Assistants EMNLP'23 + + +
+ The adoption of voice assistants like Alexa or Siri has grown rapidly, +allowing users to instantly access information via voice search. Query +suggestion is a standard feature of screen-based search experiences, allowing +users to explore additional topics. However, this is not trivial to implement +in voice-based settings. To enable this, we tackle the novel task of suggesting +questions with compact and natural voice hints to allow users to ask follow-up +questions. + We define the task, ground it in syntactic theory and outline linguistic +desiderata for spoken hints. We propose baselines and an approach using +sequence-to-sequence Transformers to generate spoken hints from a list of +questions. Using a new dataset of 6681 input questions and human written hints, +we evaluated the models with automatic metrics and human evaluation. Results +show that a naive approach of concatenating suggested questions creates poor +voice hints. Our approach, which applies a linguistically-motivated pretraining +task was strongly preferred by humans for producing the most natural hints. + +
+
+ comment: Accepted as Long Paper at EMNLP'23 Findings +
+
+
+
+
+ + ☆ Controlled Decoding from Language Models + + +
+ We propose controlled decoding (CD), a novel off-policy reinforcement +learning method to control the autoregressive generation from language models +towards high reward outcomes. CD solves an off-policy reinforcement learning +problem through a value function for the reward, which we call a prefix scorer. +The prefix scorer is used at inference time to steer the generation towards +higher reward outcomes. We show that the prefix scorer may be trained on +(possibly) off-policy data to predict the expected reward when decoding is +continued from a partially decoded response. We empirically demonstrate that CD +is effective as a control mechanism on the Reddit conversations corpus. We also +show that the modularity of the design of CD makes it possible to control for +multiple rewards, effectively solving a multi-objective reinforcement learning +problem with no additional complexity. Finally, we show that CD can be applied +in a novel blockwise fashion at inference-time, again without the need for any +training-time changes, essentially bridging the gap between the popular +best-of-$K$ strategy and token-level reinforcement learning. This makes CD a +promising approach for alignment of language models. + +
+
+
+
+
+ + ☆ Conditionally Combining Robot Skills using Large Language Models + + +
+ This paper combines two contributions. First, we introduce an extension of +the Meta-World benchmark, which we call "Language-World," which allows a large +language model to operate in a simulated robotic environment using +semi-structured natural language queries and scripted skills described using +natural language. By using the same set of tasks as Meta-World, Language-World +results can be easily compared to Meta-World results, allowing for a point of +comparison between recent methods using Large Language Models (LLMs) and those +using Deep Reinforcement Learning. Second, we introduce a method we call Plan +Conditioned Behavioral Cloning (PCBC), that allows finetuning the behavior of +high-level plans using end-to-end demonstrations. Using Language-World, we show +that PCBC is able to achieve strong performance in a variety of few-shot +regimes, often achieving task generalization with as little as a single +demonstration. We have made Language-World available as open-source software at +https://github.com/krzentner/language-world/. + +
+
+
+
+
+ + ☆ An Integrative Survey on Mental Health Conversational Agents to Bridge + Computer Science and Medical Perspectives EMNLP 2023 + + +
+ Mental health conversational agents (a.k.a. chatbots) are widely studied for +their potential to offer accessible support to those experiencing mental health +challenges. Previous surveys on the topic primarily consider papers published +in either computer science or medicine, leading to a divide in understanding +and hindering the sharing of beneficial knowledge between both domains. To +bridge this gap, we conduct a comprehensive literature review using the PRISMA +framework, reviewing 534 papers published in both computer science and +medicine. Our systematic review reveals 136 key papers on building mental +health-related conversational agents with diverse characteristics of modeling +and experimental design techniques. We find that computer science papers focus +on LLM techniques and evaluating response quality using automated metrics with +little attention to the application while medical papers use rule-based +conversational agents and outcome metrics to measure the health outcomes of +participants. Based on our findings on transparency, ethics, and cultural +heterogeneity in this review, we provide a few recommendations to help bridge +the disciplinary divide and enable the cross-disciplinary development of mental +health conversational agents. + +
+
+ comment: Accepted in EMNLP 2023 Main Conference, camera ready +
+
+
+
+
+ + ☆ Data Augmentation for Emotion Detection in Small Imbalanced Text Data ICML + + +
+ Emotion recognition in text, the task of identifying emotions such as joy or +anger, is a challenging problem in NLP with many applications. One of the +challenges is the shortage of available datasets that have been annotated with +emotions. Certain existing datasets are small, follow different emotion +taxonomies and display imbalance in their emotion distribution. In this work, +we studied the impact of data augmentation techniques precisely when applied to +small imbalanced datasets, for which current state-of-the-art models (such as +RoBERTa) under-perform. Specifically, we utilized four data augmentation +methods (Easy Data Augmentation (EDA), static and contextual Embedding-based, and +ProtAugment) on three datasets that come from different sources and vary in +size, emotion categories and distributions. Our experimental results show that +using the augmented data when training the classifier model leads to +significant improvements. Finally, we conducted two case studies: a) directly +using the popular ChatGPT API to paraphrase text using different prompts, and +b) using external data to augment the training set. Results show the promising +potential of these methods. + +
+
+ comment: Accepted paper at IEEE ICMLA 2023 +
+
+
+
+
+ + ☆ This Reads Like That: Deep Learning for Interpretable Natural Language + Processing + + +
+ Prototype learning, a popular machine learning method designed for inherently +interpretable decisions, leverages similarities to learned prototypes for +classifying new data. While it is mainly applied in computer vision, in this +work, we build upon prior research and further explore the extension of +prototypical networks to natural language processing. We introduce a learned +weighted similarity measure that enhances the similarity computation by +focusing on informative dimensions of pre-trained sentence embeddings. +Additionally, we propose a post-hoc explainability mechanism that extracts +prediction-relevant words from both the prototype and input sentences. Finally, +we empirically demonstrate that our proposed method not only improves +predictive performance on the AG News and RT Polarity datasets over a previous +prototype-based approach, but also improves the faithfulness of explanations +compared to rationale-based recurrent convolutions. + +
+
+ comment: 10 pages, 1 figure, 5 tables +
+
+
+
+
+ + ☆ Quality > Quantity: Synthetic Corpora from Foundation Models for + Closed-Domain Extractive Question Answering + + +
+ Domain adaptation, the process of training a model in one domain and applying +it to another, has been extensively explored in machine learning. While +training a domain-specific foundation model (FM) from scratch is an option, +recent methods have focused on adapting pre-trained FMs for domain-specific +tasks. However, our experiments reveal that neither approach +consistently achieves state-of-the-art (SOTA) results in the target domain. In +this work, we study extractive question answering within closed domains and +introduce the concept of targeted pre-training. This involves determining and +generating relevant data to further pre-train our models, as opposed to the +conventional philosophy of utilizing domain-specific FMs trained on a wide +range of data. Our proposed framework uses Galactica to generate synthetic, +"targeted" corpora that align with specific writing styles and topics, such +as research papers and radiology reports. This process can be viewed as a form +of knowledge distillation. We apply our method to two biomedical extractive +question answering datasets, COVID-QA and RadQA, achieving a new benchmark on +the former and demonstrating overall improvements on the latter. Code available +at https://github.com/saptarshi059/CDQA-v1-Targetted-PreTraining/tree/main. + +
+
+
+
+
+ + ☆ How well can machine-generated texts be identified and can language + models be trained to avoid identification? + + +
+ With the rise of generative pre-trained transformer models such as GPT-3, +GPT-NeoX, or OPT, distinguishing human-generated texts from machine-generated +ones has become important. We refined five separate language models to generate +synthetic tweets, uncovering that shallow learning classification algorithms, +like Naive Bayes, achieve detection accuracy between 0.6 and 0.8. + Shallow learning classifiers differ from human-based detection, especially +when using higher temperature values during text generation, resulting in a +lower detection rate. Humans prioritize linguistic acceptability, which tends +to be higher at lower temperature values. In contrast, transformer-based +classifiers have an accuracy of 0.9 and above. We found that using a +reinforcement learning approach to refine our generative models can +successfully evade BERT-based classifiers with a detection accuracy of 0.15 or +less. + +
+
+ comment: This paper has been accepted for the upcoming 57th Hawaii + International Conference on System Sciences (HICSS-57) +
+
+
+
+
+ + ☆ STEER: Semantic Turn Extension-Expansion Recognition for Voice + Assistants EMNLP 2023 + + +
+ In the context of a voice assistant system, steering refers to the phenomenon +in which a user issues a follow-up command attempting to direct or clarify a +previous turn. We propose STEER, a steering detection model that predicts +whether a follow-up turn is a user's attempt to steer the previous command. +Constructing a training dataset for steering use cases poses challenges due to +the cold-start problem. To overcome this, we developed heuristic rules to +sample opt-in usage data, approximating positive and negative samples without +any annotation. Our experimental results show promising performance in +identifying steering intent, with over 95% accuracy on our sampled data. +Moreover, STEER, in conjunction with our sampling strategy, aligns effectively +with real-world steering scenarios, as evidenced by its strong zero-shot +performance on a human-graded evaluation set. In addition to relying solely on +user transcripts as input, we introduce STEER+, an enhanced version of the +model. STEER+ utilizes a semantic parse tree to provide more context on +out-of-vocabulary words, such as named entities that often occur at the +sentence boundary. This further improves model performance, reducing error rate +in domains where entities frequently appear, such as messaging. Lastly, we +present a data analysis that highlights the improvement in user experience when +voice assistants support steering use cases. + +
+
+ comment: EMNLP 2023 Industry Track +
+
+
+
+
+ + ☆ Understanding Social Structures from Contemporary Literary Fiction using + Character Interaction Graph -- Half Century Chronology of Influential Bengali + Writers + + +
+ Social structures and real-world incidents often influence contemporary +literary fiction. Existing research in literary fiction analysis explains these +real-world phenomena through the manual critical analysis of stories. +Conventional Natural Language Processing (NLP) methodologies, including +sentiment analysis, narrative summarization, and topic modeling, have +demonstrated substantial efficacy in analyzing and identifying similarities +within fictional works. However, the intricate dynamics of character +interactions within fiction necessitate a more nuanced approach that +incorporates visualization techniques. Character interaction graphs (or +networks) emerge as a highly suitable means for visualization and information +retrieval from the realm of fiction. Therefore, we leverage character +interaction graphs with NLP-derived features to explore a diverse spectrum of +societal inquiries about contemporary culture's impact on the landscape of +literary fiction. Our study involves constructing character interaction graphs +from fiction, extracting relevant graph features, and exploiting these features +to resolve various real-life queries. Experimental evaluation of influential +Bengali fiction over half a century demonstrates that character interaction +graphs can be highly effective in specific assessments and information +retrieval from literary fiction. Our data and codebase are available at +https://cutt.ly/fbMgGEM + +
+
+ comment: 8 pages, 11 figures, 6 pages appendix +
+
+
+
+
+ + ☆ Critic-Driven Decoding for Mitigating Hallucinations in Data-to-text + Generation EMNLP 2023 + + +
+ Hallucination of text ungrounded in the input is a well-known problem in +neural data-to-text generation. Many methods have been proposed to mitigate it, +but they typically require altering model architecture or collecting additional +data, and thus cannot be easily applied to an existing model. In this paper, we +explore a new way to mitigate hallucinations by combining the probabilistic +output of a generator language model (LM) with the output of a special "text +critic" classifier, which guides the generation by assessing the match between +the input data and the text generated so far. Our method does not need any +changes to the underlying LM's architecture or training procedure and can thus +be combined with any model and decoding operating on word probabilities. The +critic does not need any additional training data, using the base LM's training +data and synthetic negative examples. Our experimental results show that our +method improves over the baseline on the WebNLG and OpenDialKG benchmarks. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Zephyr: Direct Distillation of LM Alignment + + +
+ We aim to produce a smaller language model that is aligned to user intent. +Previous research has shown that applying distilled supervised fine-tuning +(dSFT) on larger models significantly improves task accuracy; however, these +models are unaligned, i.e. they do not respond well to natural prompts. To +distill this property, we experiment with the use of preference data from AI +Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, +we apply distilled direct preference optimization (dDPO) to learn a chat model +with significantly improved intent alignment. The approach requires only a few +hours of training without any additional sampling during fine-tuning. The final +result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B +parameter models, and requires no human annotation. In particular, results on +MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access +RLHF-based model. Code, models, data, and tutorials for the system are +available at https://github.com/huggingface/alignment-handbook. + +
+
+
+
+
+ + ☆ Learning Transfers over Several Programming Languages + + +
+ Large language models (LLMs) have recently become remarkably good at +improving developer productivity for high-resource programming languages. These +models use two kinds of data: large amounts of unlabeled code samples for +pretraining and relatively smaller amounts of labeled code samples for +fine-tuning or in-context learning. Unfortunately, many programming languages +are low-resource, lacking labeled samples for most tasks and often even lacking +unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or +new languages) miss out on the benefits of LLMs. Cross-lingual transfer +learning uses data from a source language to improve model performance on a +target language. It has been well-studied for natural languages, but has +received little attention for programming languages. This paper reports +extensive experiments on four tasks using a transformer-based LLM and 11 to 41 +programming languages to explore the following questions. First, how well +cross-lingual transfer works for a given task across different language pairs. +Second, given a task and target language, how to best choose a source language. +Third, the characteristics of a language pair that are predictive of transfer +performance, and fourth, how that depends on the given task. + +
+
+ comment: 16 pages, 5 figures, 5 tables +
+
+
+
+
+ + ☆ CL-MASR: A Continual Learning Benchmark for Multilingual ASR + + +
+ Modern multilingual automatic speech recognition (ASR) systems like Whisper +have made it possible to transcribe audio in multiple languages with a single +model. However, current state-of-the-art ASR models are typically evaluated on +individual languages or in a multi-task setting, overlooking the challenge of +continually learning new languages. There is insufficient research on how to +add new languages without losing valuable information from previous data. +Furthermore, existing continual learning benchmarks focus mostly on vision and +language tasks, leaving continual learning for multilingual ASR largely +unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for +studying multilingual ASR in a continual learning setting. CL-MASR provides a +diverse set of continual learning methods implemented on top of large-scale +pretrained ASR models, along with common metrics to assess the effectiveness of +learning new languages while addressing the issue of catastrophic forgetting. +To the best of our knowledge, CL-MASR is the first continual learning benchmark +for the multilingual ASR task. The code is available at +https://github.com/speechbrain/benchmarks. + +
+
+ comment: 16 pages, 5 figures, 5 tables +
+
+
+
+
+ + ☆ Physician Detection of Clinical Harm in Machine Translation: Quality + Estimation Aids in Reliance and Backtranslation Identifies Critical Errors EMNLP 2023 + + +
+ A major challenge in the practical use of Machine Translation (MT) is that +users lack guidance to make informed decisions about when to rely on outputs. +Progress in quality estimation research provides techniques to automatically +assess MT quality, but these techniques have primarily been evaluated in vitro +by comparison against human judgments outside of a specific context of use. +This paper evaluates quality estimation feedback in vivo with a human study +simulating decision-making in high-stakes medical settings. Using Emergency +Department discharge instructions, we study how interventions based on quality +estimation versus backtranslation assist physicians in deciding whether to show +MT outputs to a patient. We find that quality estimation improves appropriate +reliance on MT, but backtranslation helps physicians detect more clinically +harmful errors that QE alone often misses. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Divide et Impera: Multi-Transformer Architectures for Complex NLP-Tasks + + +
+ The growing capabilities of transformer models pave the way for solving +increasingly complex NLP tasks. A key to supporting application-specific +requirements is the ability to fine-tune. However, compiling a fine-tuning +dataset tailored to complex tasks is tedious and results in large datasets, +limiting the ability to control transformer output. We present an approach in +which complex tasks are divided into simpler subtasks. Multiple transformer +models are fine-tuned to one subtask each, and lined up to accomplish the +complex task. This simplifies the compilation of fine-tuning datasets and +increases overall controllability. Using the example of reducing gender bias as +a complex task, we demonstrate our approach and show that it performs better +than using a single model. + +
+
+ comment: Proceedings of the Swiss Text Analytics Conference 2023 +
+
+
+
+
+ + ♻ ☆ Hunayn: Elevating Translation Beyond the Literal + + +
+ This project introduces an advanced English-to-Arabic translator surpassing +conventional tools. Leveraging the Helsinki transformer (MarianMT), our +approach involves fine-tuning on a self-scraped, purely literary Arabic +dataset. Evaluations against Google Translate show consistent outperformance in +qualitative assessments. Notably, it excels in cultural sensitivity and context +accuracy. This research underscores the Helsinki transformer's superiority for +English-to-Arabic translation using a Fusha dataset. + +
+
+
+
+
+ + ♻ ☆ Talk2Care: Facilitating Asynchronous Patient-Provider Communication with + Large-Language-Model + + +
+ Despite the plethora of telehealth applications to assist home-based older +adults and healthcare providers, basic messaging and phone calls are still the +most common communication methods, which suffer from limited availability, +information loss, and process inefficiencies. One promising solution to +facilitate patient-provider communication is to leverage large language models +(LLMs) with their powerful natural conversation and summarization capability. +However, there is a limited understanding of LLMs' role during the +communication. We first conducted two interview studies with both older adults +(N=10) and healthcare providers (N=9) to understand their needs and +opportunities for LLMs in patient-provider asynchronous communication. Based on +the insights, we built an LLM-powered communication system, Talk2Care, and +designed interactive components for both groups: (1) For older adults, we +leveraged the convenience and accessibility of voice assistants (VAs) and built +an LLM-powered VA interface for effective information collection. (2) For +health providers, we built an LLM-based dashboard to summarize and present +important health information based on older adults' conversations with the VA. +We further conducted two user studies with older adults and providers to +evaluate the usability of the system. The results showed that Talk2Care could +facilitate the communication process, enrich the health information collected +from older adults, and considerably save providers' efforts and time. We +envision our work as an initial exploration of LLMs' capability in the +intersection of healthcare and interpersonal communication. + +
+
+ comment: Under submission to CHI2024 +
+
+
+
+
+ + ♻ ☆ Interpretable and Explainable Logical Policies via Neurally Guided + Symbolic Abstraction + + +
+ The limited priors required by neural networks make them the dominating +choice to encode and learn policies using reinforcement learning (RL). However, +they are also black-boxes, making it hard to understand the agent's behaviour, +especially when working on the image level. Therefore, neuro-symbolic RL aims +at creating policies that are interpretable in the first place. Unfortunately, +interpretability is not explainability. To achieve both, we introduce Neurally +gUided Differentiable loGic policiEs (NUDGE). NUDGE exploits trained neural +network-based agents to guide the search of candidate-weighted logic rules, +then uses differentiable logic to train the logic agents. Our experimental +evaluation demonstrates that NUDGE agents can induce interpretable and +explainable policies while outperforming purely neural ones and showing good +flexibility to environments of different initial states and problem sizes. + +
+
+ comment: 9 main pages + appendix (19 in total) +
+
+
+
+
+ + ♻ ☆ StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure EMNLP 2023 + + +
+ This work presents StrAE: a Structured Autoencoder framework that through +strict adherence to explicit structure, and use of a novel contrastive +objective over tree-structured representations, enables effective learning of +multi-level representations. Through comparison over different forms of +structure, we verify that our results are directly attributable to the +informativeness of the structure provided as input, and show that this is not +the case for existing tree models. We then further extend StrAE to allow the +model to define its own compositions using a simple localised-merge algorithm. +This variant, called Self-StrAE, outperforms baselines that don't involve +explicit hierarchical compositions, and is comparable to models given +informative structure (e.g. constituency parses). Our experiments are conducted +in a data-constrained (circa 10M tokens) setting to help tease apart the +contribution of the inductive bias to effective learning. However, we find that +this framework can be robust to scale, and when extended to a much larger +dataset (circa 100M tokens), our 430 parameter model performs comparably to a +6-layer RoBERTa many orders of magnitude larger in size. Our findings support +the utility of incorporating explicit composition as an inductive bias for +effective representation learning. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ♻ ☆ Is Attention always needed? A Case Study on Language Identification from + Speech + + +
+ Language Identification (LID) is a crucial preliminary process in the field +of Automatic Speech Recognition (ASR) that involves the identification of a +spoken language from audio samples. Contemporary systems that can process +speech in multiple languages require users to expressly designate one or more +languages prior to utilization. The LID task assumes a significant role in +scenarios where ASR systems are unable to comprehend the spoken language in +multilingual settings, leading to unsuccessful speech recognition outcomes. The +present study introduces convolutional recurrent neural network (CRNN) based +LID, designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) +characteristics of audio samples. Furthermore, we replicate certain +state-of-the-art methodologies, specifically the Convolutional Neural Network +(CNN) and Attention-based Convolutional Recurrent Neural Network (CRNN with +attention), and conduct a comparative analysis with our CRNN-based approach. We +conducted comprehensive evaluations on thirteen distinct Indian languages and +our model resulted in over 98% classification accuracy. The LID model exhibits +high-performance levels ranging from 97% to 100% for languages that are +linguistically similar. The proposed LID model exhibits a high degree of +extensibility to additional languages and demonstrates a strong resistance to +noise, achieving 91.2% accuracy in a noisy setting when applied to a European +Language (EU) dataset. + +
+
+ comment: Accepted for publication in Natural Language Engineering +
+
+
+
+
+ + ♻ ☆ A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for + Fairer Instruction-Tuned Machine Translation EMNLP 2023 + + +
+ Recent instruction fine-tuned models can solve multiple NLP tasks when +prompted to do so, with machine translation (MT) being a prominent use case. +However, current research often focuses on standard performance benchmarks, +leaving compelling fairness and ethical considerations behind. In MT, this +might lead to misgendered translations, resulting, among other harms, in the +perpetuation of stereotypes and prejudices. In this work, we address this gap +by investigating whether and to what extent such models exhibit gender bias in +machine translation and how we can mitigate it. Concretely, we compute +established gender bias metrics on the WinoMT corpus from English to German and +Spanish. We discover that IFT models default to male-inflected translations, +even disregarding female occupational stereotypes. Next, using interpretability +methods, we unveil that models systematically overlook the pronoun indicating +the gender of a target occupation in misgendered translations. Finally, based +on this finding, we propose an easy-to-implement and effective bias mitigation +solution based on few-shot learning that leads to significantly fairer +translations. + +
+
+ comment: Accepted at EMNLP 2023. Code and data at + https://github.com/MilaNLProc/interpretability-mt-gender-bias +
+
+
+
+
+ + ♻ ☆ MusicAgent: An AI Agent for Music Understanding and Generation with + Large Language Models + + +
+ AI-empowered music processing is a diverse field that encompasses dozens of +tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension +tasks (e.g., music classification). For developers and amateurs, it is very +difficult to grasp all of these tasks to satisfy their requirements in music +processing, especially considering the huge differences in the representations +of music data and the model applicability across platforms among various tasks. +Consequently, it is necessary to build a system to organize and integrate these +tasks, and thus help practitioners to automatically analyze their demand and +call suitable tools as solutions to fulfill their requirements. Inspired by the +recent success of large language models (LLMs) in task automation, we develop a +system, named MusicAgent, which integrates numerous music-related tools and an +autonomous workflow to address user requirements. More specifically, we build +1) a toolset that collects tools from diverse sources, including Hugging Face, +GitHub, and Web API, etc. 2) an autonomous workflow empowered by LLMs (e.g., +ChatGPT) to organize these tools and automatically decompose user requests into +multiple sub-tasks and invoke corresponding music tools. The primary goal of +this system is to free users from the intricacies of AI-music tools, enabling +them to concentrate on the creative aspect. By granting users the freedom to +effortlessly combine tools, the system offers a seamless and enriching music +experience. + +
+
+
+
+
+ + ♻ ☆ FACE: Evaluating Natural Language Generation with Fourier Analysis of + Cross-Entropy NeurIPS 2023 + + +
+ Measuring the distance between machine-produced and human language is a +critical open problem. Inspired by empirical findings from psycholinguistics on +the periodicity of entropy in language, we propose FACE, a set of metrics based +on Fourier Analysis of the estimated Cross-Entropy of language, for measuring +the similarity between model-generated and human-written languages. Based on an +open-ended generation task and the experimental data from previous studies, we +find that FACE can effectively identify the human-model gap, scales with model +size, reflects the outcomes of different sampling methods for decoding, +correlates well with other evaluation metrics and with human judgment scores. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Open Domain Multi-document Summarization: A Comprehensive Study of Model + Brittleness under Retrieval EMNLP + + +
+ Multi-document summarization (MDS) assumes a set of topic-related documents +are provided as input. In practice, this document set is not always available; +it would need to be retrieved given an information need, i.e. a question or +topic statement, a setting we dub "open-domain" MDS. We study this more +challenging setting by formalizing the task and bootstrapping it using existing +datasets, retrievers and summarizers. Via extensive automatic and human +evaluation, we determine: (1) state-of-the-art summarizers suffer large +reductions in performance when applied to open-domain MDS, (2) additional +training in the open-domain setting can reduce this sensitivity to imperfect +retrieval, and (3) summarizers are insensitive to the retrieval of duplicate +documents and the order of retrieved documents, but highly sensitive to other +errors, like the retrieval of irrelevant documents. Based on our results, we +provide practical guidelines to enable future work on open-domain MDS, e.g. how +to choose the number of retrieved documents to summarize. Our results suggest +that new retrieval and summarization methods and annotated resources for +training and evaluation are necessary for further progress in the open-domain +setting. + +
+
+ comment: Accepted to EMNLP Findings 2023 +
+
+
+
+
+ + ♻ ☆ Multiscale Superpixel Structured Difference Graph Convolutional Network + for VL Representation + + +
+ Within the multimodal field, the key to integrating vision and language lies +in establishing a good alignment strategy. Recently, benefiting from the +success of self-supervised learning, significant progress has been made in +multimodal semantic representation based on pre-trained models for vision and +language. However, there is still room for improvement in visual semantic +representation. The lack of spatial semantic coherence and vulnerability to +noise makes it challenging for current pixel or patch-based methods to +accurately extract complex scene boundaries. To this end, this paper develops +superpixel as a comprehensive compact representation of learnable image data, +which effectively reduces the number of visual primitives for subsequent +processing by clustering perceptually similar pixels. To mine more precise +topological relations, we propose a Multiscale Difference Graph Convolutional +Network (MDGCN). It parses the entire image as a fine-to-coarse hierarchical +structure of constituent visual patterns, and captures multiscale features by +progressively merging adjacent superpixels as graph nodes. Moreover, we predict +the differences between adjacent nodes through the graph structure, +facilitating key information aggregation of graph nodes to reason actual +semantic relations. Afterward, we design a multi-level fusion rule in a +bottom-up manner to avoid understanding deviation by learning complementary +spatial information at different regional scales. Our proposed method can be +well applied to multiple downstream task learning. Extensive experiments +demonstrate that our method is competitive with other state-of-the-art methods +in visual reasoning. Our code will be released upon publication. + +
+
+
+
+
+ + ♻ ☆ A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue + Information Extraction + + +
+ This paper focuses on term-status pair extraction from medical dialogues +(MD-TSPE), which is essential in diagnosis dialogue systems and the automatic +scribe of electronic medical records (EMRs). In the past few years, works on +MD-TSPE have attracted increasing research attention, especially after the +remarkable progress made by generative methods. However, these generative +methods output a whole sequence consisting of term-status pairs in one stage +and ignore integrating prior knowledge, which demands a deeper understanding to +model the relationship between terms and infer the status of each term. This +paper presents a knowledge-enhanced two-stage generative framework (KTGF) to +address the above challenges. Using task-specific prompts, we employ a single +model to complete the MD-TSPE through two phases in a unified generative form: +we generate all terms first and then generate the status of each generated +term. In this way, the relationship between terms can be learned more +effectively from the sequence containing only terms in the first phase, and our +designed knowledge-enhanced prompt in the second phase can leverage the +category and status candidates of the generated term for status generation. +Furthermore, our proposed special status "not mentioned" makes more terms +available and enriches the training data in the second phase, which is critical +in the low-resource setting. The experiments on the Chunyu and CMDD datasets +show that the proposed method achieves superior results compared to the +state-of-the-art models in the full training and low-resource settings. + +
+
+ comment: Published in Machine Intelligence Research +
+
+
+
+
+ + ♻ ☆ A Diachronic Analysis of Paradigm Shifts in NLP Research: When, How, and + Why? EMNLP 2023 + + +
+ Understanding the fundamental concepts and trends in a scientific field is +crucial for keeping abreast of its continuous advancement. In this study, we +propose a systematic framework for analyzing the evolution of research topics +in a scientific field using causal discovery and inference techniques. We +define three variables to encompass diverse facets of the evolution of research +topics within NLP and utilize a causal discovery algorithm to unveil the causal +connections among these variables using observational data. Subsequently, we +leverage this structure to measure the intensity of these relationships. By +conducting extensive experiments on the ACL Anthology corpus, we demonstrate +that our framework effectively uncovers evolutionary trends and the underlying +causes for a wide range of NLP research topics. Specifically, we show that +tasks and methods are primary drivers of research in NLP, with datasets +following, while metrics have minimal impact. + +
+
+ comment: accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Unify word-level and span-level tasks: NJUNLP's Participation for the + WMT2023 Quality Estimation Shared Task + + +
+ We introduce the submissions of the NJUNLP team to the WMT 2023 Quality +Estimation (QE) shared task. Our team submitted predictions for the +English-German language pair on both sub-tasks: (i) sentence- and word-level +quality prediction; and (ii) fine-grained error span detection. This year, we +further explore pseudo data methods for QE based on the NJUQE framework +(https://github.com/NJUNLP/njuqe). We generate pseudo MQM data using parallel +data from the WMT translation task. We pre-train the XLMR large model on pseudo +QE data, then fine-tune it on real QE data. At both stages, we jointly learn +sentence-level scores and word-level tags. Empirically, we conduct experiments +to find the key hyper-parameters that improve the performance. Technically, we +propose a simple method that converts the word-level outputs to fine-grained +error span results. Overall, our models achieved the best results in +English-German for both word-level and fine-grained error span detection +sub-tasks by a considerable margin. + +
+
+
+
+
+ + ♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks + + +
+ Ocean science, which delves into the oceans that are reservoirs of life and +biodiversity, is of great significance given that oceans cover over 70% of our +planet's surface. Recently, advances in Large Language Models (LLMs) have +transformed the paradigm in science. Despite the success in other domains, +current LLMs often fall short in catering to the needs of domain experts like +oceanographers, and the potential of LLMs for ocean science is under-explored. +The intrinsic reason may be the immense and intricate nature of ocean data as +well as the necessity for higher granularity and richness in knowledge. To +alleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean +domain, which is expert in various ocean science tasks. We propose DoInstruct, +a novel framework to automatically obtain a large volume of ocean domain +instruction data, which generates instructions based on multi-agent +collaboration. Additionally, we construct the first oceanography benchmark, +OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through +comprehensive experiments, OceanGPT not only shows a higher level of knowledge +expertise for ocean science tasks but also gains preliminary embodied +intelligence capabilities in ocean technology. Codes, data and checkpoints will +soon be available at https://github.com/zjunlp/KnowLM. + +
+
+ comment: Work in progress. Project Website: + https://zjunlp.github.io/project/OceanGPT/ +
+
+
+
+
+ + ♻ ☆ Select and Augment: Enhanced Dense Retrieval Knowledge Graph + Augmentation + + +
+ Injecting textual information into knowledge graph (KG) entity +representations has been a worthwhile expedition in terms of improving +performance in KG oriented tasks within the NLP community. External knowledge +often adopted to enhance KG embeddings ranges from semantically rich lexical +dependency parsed features to a set of relevant key words to entire text +descriptions supplied from an external corpus such as wikipedia and many more. +Despite the gains this innovation (Text-enhanced KG embeddings) has made, the +proposal in this work suggests that it can be improved even further. Instead of +using a single text description (which would not sufficiently represent an +entity because of the inherent lexical ambiguity of text), we propose a +multi-task framework that jointly selects a set of text descriptions relevant +to KG entities as well as align or augment KG embeddings with text +descriptions. Different from prior work that plugs formal entity descriptions +declared in knowledge bases, this framework leverages a retriever model to +selectively identify richer or highly relevant text descriptions to use in +augmenting entities. Furthermore, the framework treats the number of +descriptions to use in augmentation process as a parameter, which allows the +flexibility of enumerating across several numbers before identifying an +appropriate number. Experiment results for Link Prediction demonstrate a 5.5% +and 3.5% percentage increase in the Mean Reciprocal Rank (MRR) and Hits@10 +scores respectively, in comparison to text-enhanced knowledge graph +augmentation methods using traditional CNNs. + +
+
+ comment: Article has been published in the Journal of Artificial + Intelligence Research (JAIR) +
+
+
+
+
+ + ♻ ☆ Large Language Model for Multi-objective Evolutionary Optimization + + +
+ Multiobjective evolutionary algorithms (MOEAs) are major methods for solving +multiobjective optimization problems (MOPs). Many MOEAs have been proposed in +the past decades, of which the search operators need a carefully handcrafted +design with domain knowledge. Recently, some attempts have been made to replace +the manually designed operators in MOEAs with learning-based operators (e.g., +neural network models). However, much effort is still required for designing +and training such models, and the learned operators might not generalize well +on new problems. To tackle the above challenges, this work investigates a novel +approach that leverages the powerful large language model (LLM) to design MOEA +operators. With proper prompt engineering, we successfully let a general LLM +serve as a black-box search operator for decomposition-based MOEA (MOEA/D) in a +zero-shot manner. In addition, by learning from the LLM behavior, we further +design an explicit white-box operator with randomness and propose a new version +of decomposition-based MOEA, termed MOEA/D-LO. Experimental studies on +different test benchmarks show that our proposed method can achieve competitive +performance with widely used MOEAs. It is also promising to see the operator +only learned from a few instances can have robust generalization performance on +unseen problems with quite different patterns and settings. The results reveal +the potential benefits of using pre-trained LLMs in the design of MOEAs. + +
+
+
+
+
+ + ♻ ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with the +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong Continuous learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ♻ ☆ A Targeted Assessment of Incremental Processing in Neural Language Models + and Humans ACL 2021 + + +
+ We present a targeted, scaled-up comparison of incremental processing in +humans and neural language models by collecting by-word reaction time data for +sixteen different syntactic test suites across a range of structural phenomena. +Human reaction time data comes from a novel online experimental paradigm called +the Interpolated Maze task. We compare human reaction times to by-word +probabilities for four contemporary language models, with different +architectures and trained on a range of data set sizes. We find that across +many phenomena, both humans and language models show increased processing +difficulty in ungrammatical sentence regions with human and model `accuracy' +scores (a la Marvin and Linzen(2018)) about equal. However, although language +model outputs match humans in direction, we show that models systematically +under-predict the difference in magnitude of incremental processing difficulty +between grammatical and ungrammatical sentences. Specifically, when models +encounter syntactic violations they fail to accurately predict the longer +reaction times observed in the human data. These results call into question +whether contemporary language models are approaching human-like performance for +sensitivity to syntactic violations. + +
+
+ comment: Published in the proceedings of ACL 2021 +
+
+
+
+
+ + ♻ ☆ Is a Prestigious Job the same as a Prestigious Country? A Case Study on + Multilingual Sentence Embeddings and European Countries EMNLP 2023 + + +
+ We study how multilingual sentence representations capture European countries +and occupations and how this differs across European languages. We prompt the +models with templated sentences that we machine-translate into 12 European +languages and analyze the most prominent dimensions in the embeddings. Our +analysis reveals that the most prominent feature in the embedding is the +geopolitical distinction between Eastern and Western Europe and the country's +economic strength in terms of GDP. When prompted specifically for job prestige, +the embedding space clearly distinguishes high and low-prestige jobs. The +occupational dimension is uncorrelated with the most dominant country +dimensions in three out of four studied models. The exception is a small +distilled model that exhibits a connection between occupational prestige and +country of origin, which is a potential source of nationality-based +discrimination. Our findings are consistent across languages. + +
+
+ comment: 10 pages, 1 figure; Findings of EMNLP 2023, camera-ready +
+
+
+
+
+ + ♻ ☆ Legal NLP Meets MiCAR: Advancing the Analysis of Crypto White Papers + + +
+ In the rapidly evolving field of crypto assets, white papers are essential +documents for investor guidance, and are now subject to unprecedented content +requirements under the European Union's Markets in Crypto-Assets Regulation +(MiCAR). Natural Language Processing (NLP) can serve as a powerful tool for +both analyzing these documents and assisting in regulatory compliance. This +paper delivers two contributions to the topic. First, we survey existing +applications of textual analysis to unregulated crypto asset white papers, +uncovering a research gap that could be bridged with interdisciplinary +collaboration. We then conduct an analysis of the changes introduced by MiCAR, +highlighting the opportunities and challenges of integrating NLP within the new +regulatory framework. The findings set the stage for further research, with the +potential to benefit regulators, crypto asset issuers, and investors. + +
+
+ comment: Accepted at NLLP23 +
+
+
+
+
+ + ♻ ☆ Overview of the BioLaySumm 2023 Shared Task on Lay Summarization of + Biomedical Research Articles ACL2023 + + +
+ This paper presents the results of the shared task on Lay Summarisation of +Biomedical Research Articles (BioLaySumm), hosted at the BioNLP Workshop at ACL +2023. The goal of this shared task is to develop abstractive summarisation +models capable of generating "lay summaries" (i.e., summaries that are +comprehensible to non-technical audiences) in both a controllable and +non-controllable setting. There are two subtasks: 1) Lay Summarisation, where +the goal is for participants to build models for lay summary generation only, +given the full article text and the corresponding abstract as input; and 2) +Readability-controlled Summarisation, where the goal is for participants to +train models to generate both the technical abstract and the lay summary, given +an article's main text as input. In addition to overall results, we report on +the setup and insights from the BioLaySumm shared task, which attracted a total +of 20 participating teams across both subtasks. + +
+
+ comment: Published at BioNLP@ACL2023 +
+
+
+
+
+ + ♻ ☆ TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for + Inference Cost Reduction EMNLP 2023 + + +
+ Since ChatGPT released its API for public use, the number of applications +built on top of commercial large language models (LLMs) has increased exponentially. +One popular usage of such models is leveraging their in-context learning ability +and generating responses given user queries leveraging knowledge obtained by +retrieval augmentation. One problem of deploying commercial retrieval-augmented +LLMs is the cost due to the additionally retrieved context that largely +increases the input token size of the LLMs. To mitigate this, we propose a +token compression scheme that includes two methods: summarization compression +and semantic compression. The first method applies a T5-based model that is +fine-tuned by datasets generated using self-instruct containing samples with +varying lengths and reduces token size by doing summarization. The second method +further compresses the token size by removing words with lower impact on the +semantics. In order to adequately evaluate the effectiveness of the proposed +methods, we propose and utilize a dataset called Food-Recommendation DB (FRDB) +focusing on food recommendation for women around pregnancy period or infants. +Our summarization compression can reduce 65% of the retrieval token size with +further 0.3% improvement on the accuracy; semantic compression provides a more +flexible way to trade-off the token size with performance, for which we can +reduce the token size by 20% with only 1.6% of accuracy drop. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Evaluating Hallucinations in Chinese Large Language Models + + +
+ In this paper, we establish a benchmark named HalluQA (Chinese Hallucination +Question-Answering) to measure the hallucination phenomenon in Chinese large +language models. HalluQA contains 450 meticulously designed adversarial +questions, spanning multiple domains, and takes into account Chinese historical +culture, customs, and social phenomena. During the construction of HalluQA, we +consider two types of hallucinations: imitative falsehoods and factual errors, +and we construct adversarial samples based on GLM-130B and ChatGPT. For +evaluation, we design an automated evaluation method using GPT-4 to judge +whether a model output is hallucinated. We conduct extensive experiments on 24 +large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, and SparkDesk, +among others. Out of the 24 models, 18 achieved non-hallucination rates lower than +50%. This indicates that HalluQA is highly challenging. We analyze the primary +types of hallucinations in different types of models and their causes. +Additionally, we discuss which types of hallucinations should be prioritized +for different types of models. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ♻ ☆ AgentBench: Evaluating LLMs as Agents + + +
+ Large Language Models (LLMs) are becoming increasingly smart and autonomous, +targeting real-world pragmatic missions beyond traditional NLP tasks. As a +result, there has been an urgent need to evaluate LLMs as agents on challenging +tasks in interactive environments. We present AgentBench, a multi-dimensional +evolving benchmark that currently consists of 8 distinct environments to assess +LLM-as-Agent's reasoning and decision-making abilities in a multi-turn +open-ended generation setting. Our extensive test over 27 API-based and +open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong +ability of acting as agents in complex environments, there is a significant +disparity in performance between them and OSS competitors. We identify the +typical reasons of failures in environments and LLMs, showing that poor +long-term reasoning, decision-making, and instruction following abilities are +the main obstacles for developing usable LLM agents. Training on code and high +quality multi-turn alignment data could improve agent performance. Datasets, +environments, and an integrated evaluation package for AgentBench are released +at https://github.com/THUDM/AgentBench. + +
+
+ comment: 55 pages +
+
+
+
+
+ + ♻ ☆ An Investigation of LLMs' Inefficacy in Understanding Converse Relations EMNLP 2023 + + +
+ Large Language Models (LLMs) have achieved remarkable success in many formal +language oriented tasks, such as structural data-to-text and semantic parsing. +However, current benchmarks mostly follow the data distribution of the +pre-training data of LLMs. Therefore, a natural question arises: do LLMs +really understand the structured semantics of formal languages? In this paper, +we investigate this problem on a special case, converse binary relation. We +introduce a new benchmark ConvRe focusing on converse relations, which contains +17 relations and 1240 triples extracted from popular knowledge graph completion +datasets. Our ConvRe features two tasks, Re2Text and Text2Re, which are +formulated as multi-choice question answering to evaluate LLMs' ability to +determine the matching between relations and associated text. For the +evaluation protocol, apart from different prompting methods, we further +introduce variants to the test text and few-shot example text. We conduct +experiments on three popular LLM families and have observed various scaling +trends. The results suggest that LLMs often resort to shortcut learning and +still face challenges on our proposed benchmark. + +
+
+ comment: Accepted by EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs EMNLP 2023 + + +
+ Recent research has demonstrated that Large Language Models (LLMs) can +enhance their capabilities by utilizing external tools. However, three pivotal +questions remain unanswered: (1) How effective are current LLMs in utilizing +tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What +obstacles need to be overcome to leverage tools? To address these questions, we +introduce API-Bank, a groundbreaking benchmark, specifically designed for +tool-augmented LLMs. For the first question, we develop a runnable evaluation +system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 +API calls to assess the existing LLMs' capabilities in planning, retrieving, +and calling APIs. For the second question, we construct a comprehensive +training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 +distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM +initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits +improved tool utilization compared to GPT-3, while GPT-4 excels in planning. +However, there is still significant potential for further improvement. +Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 +pts and approaches the effectiveness of GPT-3.5. Through error analysis, we +highlight the key challenges for future research in this field to answer the +third question. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ FANToM: A Benchmark for Stress-testing Machine Theory of Mind in + Interactions EMNLP 2023 + + +
+ Theory of mind (ToM) evaluations currently focus on testing models using +passive narratives that inherently lack interactivity. We introduce FANToM, a +new benchmark designed to stress-test ToM within information-asymmetric +conversational contexts via question answering. Our benchmark draws upon +important theoretical requisites from psychology and necessary empirical +considerations when evaluating large language models (LLMs). In particular, we +formulate multiple types of questions that demand the same underlying reasoning +to identify illusory or false sense of ToM capabilities in LLMs. We show that +FANToM is challenging for state-of-the-art LLMs, which perform significantly +worse than humans even with chain-of-thought reasoning or fine-tuning. + +
+
+ comment: EMNLP 2023. Code and dataset can be found here: + https://hyunw.kim/fantom +
+
+
+
+
+ + ♻ ☆ Asking Clarification Questions to Handle Ambiguity in Open-Domain QA EMNLP 2023 + + +
+ Ambiguous questions persist in open-domain question answering, because +formulating a precise question with a unique answer is often challenging. +Previously, Min et al. (2020) have tackled this issue by generating +disambiguated questions for all possible interpretations of the ambiguous +question. This can be effective, but not ideal for providing an answer to the +user. Instead, we propose to ask a clarification question, where the user's +response will help identify the interpretation that best aligns with the user's +intention. We first present CAMBIGNQ, a dataset consisting of 5,654 ambiguous +questions, each with relevant passages, possible answers, and a clarification +question. The clarification questions were efficiently created by generating +them using InstructGPT and manually revising them as necessary. We then define +a pipeline of tasks and design appropriate evaluation metrics. Lastly, we +achieve 61.3 F1 on ambiguity detection and 40.5 F1 on clarification-based QA, +providing strong baselines for future work. + +
+
+ comment: 15 pages, 4 figures, accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video + Understanding EMNLP 2023 + + +
+ We present Video-LLaMA, a multi-modal framework that empowers Large Language +Models (LLMs) with the capability of understanding both visual and auditory +content in the video. Video-LLaMA bootstraps cross-modal training from the +frozen pre-trained visual and audio encoders and the frozen LLMs. Unlike +previous works that complement LLMs to process the visual or audio signals +only, Video-LLaMA enables video comprehension by tackling two challenges: (1) +capturing the temporal changes in visual scenes, (2) integrating audio-visual +signals. To counter the first challenge, we propose a Video Q-former to +assemble a pre-trained image encoder into our video encoder and introduce a +video-to-text generation task to learn video-language correspondence. For the +second challenge, we leverage ImageBind, a universal embedding model aligning +multiple modalities, as the pre-trained audio encoder and introduce an Audio +Q-former on top of ImageBind to learn reasonable auditory query embeddings for +the LLM module. To align the output of both visual and audio encoders with +LLM's embedding space, we first train Video-LLaMA on massive +video/image-caption pairs and then tune our model with visual-instruction +datasets of moderate amount but higher quality. We found Video-LLaMA shows the +ability to perceive and comprehend video content and generate meaningful +responses grounded in the visual and auditory information presented in the +videos. + +
+
+ comment: Accepted by EMNLP 2023's demo track; Code, Pretrained Model, and + Dataset: https://github.com/DAMO-NLP-SG/Video-LLaMA +
+
+
+
+
+ + ♻ ☆ VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System EMNLP'23 + + +
+ Arabic is a complex language with many varieties and dialects spoken by over +450 million people around the world. Due to the linguistic diversity and +variations, it is challenging to build a robust and generalized ASR system for +Arabic. In this work, we address this gap by developing and demoing a system, +dubbed VoxArabica, for dialect identification (DID) as well as automatic speech +recognition (ASR) of Arabic. We train a wide range of models such as HuBERT +(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR +tasks. Our DID models are trained to identify 17 different dialects in addition +to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data. +Additionally, for the remaining dialects in ASR, we provide the option to +choose various models such as Whisper and MMS in a zero-shot setting. We +integrate these models into a single web interface with diverse features such +as audio recording, file upload, model selection, and the option to raise flags +for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide +range of audiences concerned with Arabic research. Our system is currently +running at https://cdce-206-12-100-168.ngrok.io/. + +
+
+ comment: Accepted at ArabicNLP conference co-located with EMNLP'23. First + three authors contributed equally +
+
+
+
+
+ + ♻ ☆ GPT Understands, Too + + +
+ Prompting a pretrained language model with natural language patterns has been +proved effective for natural language understanding (NLU). However, our +preliminary study reveals that manual discrete prompts often lead to unstable +performance -- e.g., changing a single word in the prompt might result in +substantial performance drop. We propose a novel method P-Tuning that employs +trainable continuous prompt embeddings in concatenation with discrete prompts. +Empirically, P-Tuning not only stabilizes training by minimizing the gap +between various discrete prompts, but also improves performance by a sizeable +margin on a wide range of NLU tasks including LAMA and SuperGLUE. P-Tuning is +generally effective for both frozen and tuned language models, under both the +fully-supervised and few-shot settings. + +
+
+
+
+
+ + ♻ ☆ TaskDiff: A Similarity Metric for Task-Oriented Conversations EMNLP 2023 + + +
+ The popularity of conversational digital assistants has resulted in the +availability of large amounts of conversational data which can be utilized for +improved user experience and personalized response generation. Building these +assistants using popular large language models like ChatGPT also requires +additional emphasis on prompt engineering and evaluation methods. Textual +similarity metrics are a key ingredient for such analysis and evaluations. +While many similarity metrics have been proposed in the literature, they have +not proven effective for task-oriented conversations as they do not take +advantage of unique conversational features. To address this gap, we present +TaskDiff, a novel conversational similarity metric that utilizes different +dialogue components (utterances, intents, and slots) and their distributions to +compute similarity. Extensive experimental evaluation of TaskDiff on a +benchmark dataset demonstrates its superior performance and improved robustness +over other related approaches. + +
+
+ comment: Accepted to the main conference at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple + Experts Fine-tuning + + +
+ We propose Multiple Experts Fine-tuning Framework to build a financial large +language model (LLM), DISC-FinLLM. Our methodology improves general LLMs by +endowing them with multi-turn question answering abilities, domain text +processing capabilities, mathematical computation skills, and +retrieval-enhanced generation capabilities. We build a financial +instruction-tuning dataset named DISC-FIN-SFT, including instruction samples of +four categories (consulting, NLP tasks, computing and retrieval-augmented +generation). Evaluations conducted on multiple benchmarks demonstrate that our +model performs better than baseline models in various financial scenarios. +Further resources can be found at https://github.com/FudanDISC/DISC-FinLLM. + +
+
+ comment: 18 pages, 13 figures, 7 tables +
+
+
+
+
+ + ♻ ☆ GLM-130B: An Open Bilingual Pre-trained Model ICLR 2023 + + +
+ We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language +model with 130 billion parameters. It is an attempt to open-source a 100B-scale +model at least as good as GPT-3 (davinci) and unveil how models of such a scale +can be successfully pre-trained. Over the course of this effort, we face +numerous unexpected technical and engineering challenges, particularly on loss +spikes and divergence. In this paper, we introduce the training process of +GLM-130B including its design choices, training strategies for both efficiency +and stability, and engineering efforts. The resultant GLM-130B model offers +significant outperformance over GPT-3 175B (davinci) on a wide range of popular +English benchmarks while the performance advantage is not observed in OPT-175B +and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN +3.0 260B -- the largest Chinese language model -- across related benchmarks. +Finally, we leverage a unique scaling property of GLM-130B to reach INT4 +quantization without post training, with almost no performance loss, making it +the first among 100B-scale models and more importantly, allowing its effective +inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the +most affordable GPUs required for using 100B-scale models. The GLM-130B model +weights are publicly accessible and its code, training logs, related toolkit, +and lessons learned are open-sourced at +https://github.com/THUDM/GLM-130B/. + +
+
+ comment: Accepted to ICLR 2023 +
+
+
+
+
+ + ♻ ☆ Can Large Language Models Discern Evidence for Scientific Hypotheses? + Case Studies in the Social Sciences + + +
+ Hypothesis formulation and testing are central to empirical research. A +strong hypothesis is a best guess based on existing evidence and informed by a +comprehensive view of relevant literature. However, with the exponential increase +in the number of scientific articles published annually, manual aggregation and +synthesis of evidence related to a given hypothesis is a challenge. Our work +explores the ability of current large language models (LLMs) to discern +evidence that supports or refutes specific hypotheses based on the text of +scientific abstracts. We share a novel dataset for the task of scientific +hypothesis evidencing using community-driven annotations of studies in the +social sciences. We compare the performance of LLMs to several state-of-the-art +benchmarks and highlight opportunities for future research in this area. The +dataset is available at +https://github.com/Sai90000/ScientificHypothesisEvidencing.git + +
+
+
+
+
+ + ♻ ☆ WebWISE: Web Interface Control and Sequential Exploration with Large + Language Models + + +
+ The paper investigates using a Large Language Model (LLM) to automatically +perform web software tasks using click, scroll, and text input operations. +Previous approaches, such as reinforcement learning (RL) or imitation learning, +are inefficient to train and task-specific. Our method uses filtered Document +Object Model (DOM) elements as observations and performs tasks step-by-step, +sequentially generating small programs based on the current observations. We +use in-context learning, either benefiting from a single manually provided +example, or an automatically generated example based on a successful zero-shot +trial. We evaluate the proposed method on the MiniWob++ benchmark. With only +one in-context example, our WebWISE method achieves similar or better +performance than other methods that require many demonstrations or trials. + +
+
+
+
+
+ + ♻ ☆ SeamlessM4T: Massively Multilingual & Multimodal Machine Translation + + +
+ What does it take to create the Babel Fish, a tool that can help individuals +translate speech between any two languages? While recent breakthroughs in +text-based models have pushed machine translation coverage beyond 200 +languages, unified speech-to-speech translation models have yet to achieve +similar strides. More specifically, conventional speech-to-speech translation +systems rely on cascaded systems that perform translation progressively, +putting high-performing unified systems out of reach. To address these gaps, we +introduce SeamlessM4T, a single model that supports speech-to-speech +translation, speech-to-text translation, text-to-speech translation, +text-to-text translation, and automatic speech recognition for up to 100 +languages. To build this, we used 1 million hours of open speech audio data to +learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, +we created a multimodal corpus of automatically aligned speech translations. +Filtered and combined with human-labeled and pseudo-labeled data, we developed +the first multilingual system capable of translating from and into English for +both speech and text. On FLEURS, SeamlessM4T sets a new standard for +translations into multiple target languages, achieving an improvement of 20% +BLEU over the previous SOTA in direct speech-to-text translation. Compared to +strong cascaded models, SeamlessM4T improves the quality of into-English +translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in +speech-to-speech. Tested for robustness, our system performs better against +background noises and speaker variations in speech-to-text tasks compared to +the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and +added toxicity to assess translation safety. Finally, all contributions in this +work are open-sourced and accessible at +https://github.com/facebookresearch/seamless_communication + +
+
+
+
+
+ + ♻ ☆ Accented Speech Recognition With Accent-specific Codebooks EMNLP 2023 + + +
+ Speech accents pose a significant challenge to state-of-the-art automatic +speech recognition (ASR) systems. Degradation in performance across +underrepresented accents is a severe deterrent to the inclusive adoption of +ASR. In this work, we propose a novel accent adaptation approach for end-to-end +ASR systems using cross-attention with a trainable set of codebooks. These +learnable codebooks capture accent-specific information and are integrated +within the ASR encoder layers. The model is trained on accented English speech, +while the test data also contains accents that were not seen during training. +On the Mozilla Common Voice multi-accented dataset, we show that our proposed +approach yields significant performance gains not only on the seen English +accents (up to $37\%$ relative improvement in word error rate) but also on the +unseen accents (up to $5\%$ relative improvement in WER). Further, we +illustrate benefits for a zero-shot transfer setup on the L2-ARCTIC dataset. We +also compare the performance with other approaches based on accent adversarial +training. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference (Long Paper) +
+
+
+
+
+ + ♻ ☆ Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy? EMNLP 2023 + + +
+ Knowledge distillation (KD) was originally proposed as a method for knowledge transfer from one model to +another, but some recent studies have suggested that it is +in fact a form of regularization. Perhaps the strongest argument of all for +this new perspective comes from its apparent similarities with label smoothing +(LS). Here we re-examine this stated equivalence between the two methods by +comparing the predictive confidences of the models they train. Experiments on +four text classification tasks involving models of different sizes show that: +(a) In most settings, KD and LS drive model confidence in completely opposite +directions, and (b) In KD, the student inherits not only its knowledge but also +its confidence from the teacher, reinforcing the classical knowledge transfer +view. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Training Priors Predict Text-To-Image Model Performance + + +
+ Text-to-image models can often generate some relations, e.g., "astronaut +riding horse", but fail to generate other relations composed of the same basic +parts, e.g., "horse riding astronaut". These failures are often taken as +evidence that models rely on training priors rather than constructing novel +images compositionally. This paper tests this intuition on the Stable Diffusion +2.1 text-to-image model. By looking at the subject-verb-object (SVO) triads +that underlie these prompts (e.g., "astronaut", "ride", "horse"), we find that +the more often an SVO triad appears in the training data, the better the model +can generate an image aligned with that triad. Here, by aligned we mean that +each of the terms appears in the generated image in the proper relation to each +other. Surprisingly, this increased frequency also diminishes how well the +model can generate an image aligned with the flipped triad. For example, if +"astronaut riding horse" appears frequently in the training data, the image for +"horse riding astronaut" will tend to be poorly aligned. Our results thus show +that current models are biased to generate images with relations seen in +training, and provide new data to the ongoing debate on whether these +text-to-image models employ abstract compositional structure in a traditional +sense, or rather, interpolate between relations explicitly seen in the training +data. + +
+
+
+
+
+ + ♻ ☆ NormDial: A Comparable Bilingual Synthetic Dialog Dataset for Modeling + Social Norm Adherence and Violation EMNLP 2023 + + +
+ Social norms fundamentally shape interpersonal communication. We present +NormDial, a high-quality dyadic dialogue dataset with turn-by-turn annotations +of social norm adherences and violations for Chinese and American cultures. +Introducing the task of social norm observance detection, our dataset is +synthetically generated in both Chinese and English using a human-in-the-loop +pipeline by prompting large language models with a small collection of +expert-annotated social norms. We show that our generated dialogues are of high +quality through human evaluation and further evaluate the performance of +existing large language models on this task. Our findings point towards new +directions for understanding the nuances of social norms as they manifest in +conversational contexts that span across languages and cultures. + +
+
+ comment: EMNLP 2023 Main Conference, Short Paper; Data at + https://github.com/Aochong-Li/NormDial +
+
+
+
+
+ + ♻ ☆ WebArena: A Realistic Web Environment for Building Autonomous Agents + + +
+ With advances in generative AI, there is now potential for autonomous agents +to manage daily tasks via natural language commands. However, current agents +are primarily created and tested in simplified synthetic environments, leading +to a disconnect with real-world scenarios. In this paper, we build an +environment for language-guided agents that is highly realistic and +reproducible. Specifically, we focus on agents that perform tasks on the web, +and create an environment with fully functional websites from four common +domains: e-commerce, social forum discussions, collaborative software +development, and content management. Our environment is enriched with tools +(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage +human-like task-solving. Building upon our environment, we release a set of +benchmark tasks focusing on evaluating the functional correctness of task +completions. The tasks in our benchmark are diverse, long-horizon, and designed +to emulate tasks that humans routinely perform on the internet. We experiment +with several baseline agents, integrating recent techniques such as reasoning +before acting. The results demonstrate that solving complex tasks is +challenging: our best GPT-4-based agent only achieves an end-to-end task +success rate of 14.41%, significantly lower than the human performance of +78.24%. These results highlight the need for further development of robust +agents, that current state-of-the-art large language models are far from +perfect performance in these real-life tasks, and that WebArena can be used to +measure such progress. + +
+
+ comment: Our code, data, environment reproduction resources, and video + demonstrations are publicly available at https://webarena.dev/ +
+
+
+
+
+ + ♻ ☆ Can Knowledge Graphs Simplify Text? CIKM 2023 + + +
+ Knowledge Graph (KG)-to-Text Generation has seen recent improvements in +generating fluent and informative sentences which describe a given KG. As KGs +are widespread across multiple domains and contain important entity-relation +information, and as text simplification aims to reduce the complexity of a text +while preserving the meaning of the original text, we propose KGSimple, a novel +approach to unsupervised text simplification which infuses KG-established +techniques in order to construct a simplified KG path and generate a concise +text which preserves the original input's meaning. Through an iterative and +sampling KG-first approach, our model is capable of simplifying text when +starting from a KG by learning to keep important information while harnessing +KG-to-text generation to output fluent and descriptive sentences. We evaluate +various settings of the KGSimple model on currently-available KG-to-text +datasets, demonstrating its effectiveness compared to unsupervised text +simplification models which start with a given complex text. Our code is +available on GitHub. + +
+
+ comment: Accepted as a Main Conference Long Paper at CIKM 2023 +
+
+
+
+
+ + ♻ ☆ InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution EMNLP 2023 + + +
+ Over recent decades, significant advancements in cross-modal retrieval have been +mainly driven by breakthroughs in visual and linguistic modeling. However, a +recent study shows that multi-modal data representations tend to cluster within +a limited convex cone (the representation degeneration problem), which hinders +retrieval performance due to the inseparability of these representations. In +our study, we first empirically validate the presence of the representation +degeneration problem across multiple cross-modal benchmarks and methods. Next, +to address it, we introduce a novel method, called InvGC, a post-processing +technique inspired by graph convolution and average pooling. Specifically, +InvGC defines the graph topology within the datasets and then applies graph +convolution in a subtractive manner. This method effectively separates +representations by increasing the distances between data points. To improve the +efficiency and effectiveness of InvGC, we propose an advanced graph topology, +LocalAdj, which only aims to increase the distances between each data point and +its nearest neighbors. To understand why InvGC works, we present a detailed +theoretical analysis, proving that the lower bound of recall will be improved +after deploying InvGC. Extensive empirical results show that InvGC and InvGC +w/LocalAdj significantly mitigate the representation degeneration problem, +thereby enhancing retrieval performance. + Our code is available at +https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval + +
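The subtractive graph convolution described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the uniform adjacency (every other point is a neighbor, averaged), the strength `r`, and all function names are assumptions made for illustration, and LocalAdj is not reproduced.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(points):
    """Average cosine similarity over all unordered pairs of points."""
    pairs = [(i, j) for i in range(len(points)) for j in range(i + 1, len(points))]
    return sum(cosine(points[i], points[j]) for i, j in pairs) / len(pairs)

def inverse_graph_convolution(points, r=0.8):
    """Subtract r times the mean of all other points from each point.

    Assumption: a uniform, fully connected graph stands in for the
    similarity-based topology the abstract mentions.
    """
    n, dim = len(points), len(points[0])
    out = []
    for i in range(n):
        neighbor_mean = [
            sum(points[j][d] for j in range(n) if j != i) / (n - 1)
            for d in range(dim)
        ]
        out.append([points[i][d] - r * neighbor_mean[d] for d in range(dim)])
    return out

# Three representations squeezed into a narrow cone around the x-axis.
points = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]
separated = inverse_graph_convolution(points)
```

On this toy input, the mean pairwise cosine similarity drops after the subtraction, which is the separation effect the abstract describes.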
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Goal Driven Discovery of Distributional Differences via Language + Descriptions + + +
+ Mining large corpora can generate useful discoveries but is time-consuming +for humans. We formulate a new task, D5, that automatically discovers +differences between two large corpora in a goal-driven way. The task input is a +problem comprising a research goal "$\textit{comparing the side effects of drug +A and drug B}$" and a corpus pair (two large collections of patients' +self-reported reactions after taking each drug). The output is a language +description (discovery) of how these corpora differ (patients taking drug A +"$\textit{mention feelings of paranoia}$" more often). We build a D5 system, +and to quantitatively measure its performance, we 1) contribute a meta-dataset, +OpenD5, aggregating 675 open-ended problems ranging across business, social +sciences, humanities, machine learning, and health, and 2) propose a set of +unified evaluation metrics: validity, relevance, novelty, and significance. +With the dataset and the unified metrics, we confirm that language models can +use the goals to propose more relevant, novel, and significant candidate +discoveries. Finally, our system produces discoveries previously unknown to the +authors on a wide range of applications in OpenD5, including temporal and +demographic differences in discussion topics, political stances and stereotypes +in speech, insights in commercial reviews, and error patterns in NLP models. + +
+
+
+
+
+ + ♻ ☆ Natural Language Decompositions of Implicit Content Enable Better Text + Representations EMNLP 2023 + + +
+ When people interpret text, they rely on inferences that go beyond the +observed language itself. Inspired by this observation, we introduce a method +for the analysis of text that takes implicitly communicated content explicitly +into account. We use a large language model to produce sets of propositions +that are inferentially related to the text that has been observed, then +validate the plausibility of the generated content via human judgments. +Incorporating these explicit representations of implicit content proves useful +in multiple problem settings that involve the human interpretation of +utterances: assessing the similarity of arguments, making sense of a body of +opinion data, and modeling legislative behavior. Our results suggest that +modeling the meanings behind observed language, rather than the literal text +alone, is a valuable direction for NLP and particularly its applications to +social science. + +
+
+ comment: Accepted to EMNLP 2023 (Main conference) +
+
+
+
+
+ + ♻ ☆ Can LLMs Capture Intertemporal Preferences? + + +
+ We explore the viability of Large Language Models (LLMs), specifically +OpenAI's GPT-3.5 and GPT-4, in emulating human survey respondents and eliciting +preferences, with a focus on intertemporal choices. Leveraging the extensive +literature on intertemporal discounting for benchmarking, we examine responses +from LLMs across various languages and compare them to human responses, +exploring preferences between smaller, sooner, and larger, later rewards. Our +findings reveal that both GPT models demonstrate less patience than humans, +with GPT-3.5 exhibiting a lexicographic preference for earlier rewards, unlike +human decision-makers. Though GPT-4 does not display lexicographic preferences, +its measured discount rates are still considerably larger than those found in +humans. Interestingly, GPT models show greater patience in languages with weak +future tense references, such as German and Mandarin, aligning with existing +literature that suggests a correlation between language structure and +intertemporal preferences. We demonstrate how prompting GPT to explain its +decisions, a procedure we term "chain-of-thought conjoint," can mitigate, but +does not eliminate, discrepancies between LLM and human responses. While +directly eliciting preferences using LLMs may yield misleading results, +combining chain-of-thought conjoint with topic modeling aids in hypothesis +generation, enabling researchers to explore the underpinnings of preferences. +Chain-of-thought conjoint provides a structured framework for marketers to use +LLMs to identify potential attributes or factors that can explain preference +heterogeneity across different customers and contexts. + +
+
+
+
+
+ + ♻ ☆ Investigating Antigram Behaviour using Distributional Semantics + + +
+ The field of computational linguistics constantly presents new challenges and +topics for research, whether it be analyzing word usage changes over time or +identifying relationships between pairs of seemingly unrelated words. To this +end, we identify anagrams and antigrams as words possessing such unique +properties. The presented work is an exploration into generating anagrams from +a given word and determining whether there exist antigram (semantically +opposite anagram) relationships between the pairs of generated anagrams using +GloVe embeddings. We propose a rudimentary, yet interpretable, rule-based +algorithm for detecting antigrams. On a small dataset of just 12 antigrams, our +approach yielded an accuracy of 39%, which shows that there is much work left +to be done in this space. + +
+
+
+
+
+ + ♻ ☆ MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, + Bard, and Other Large Multimodal Models + + +
+ Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit +impressive problem-solving skills in many tasks and domains, but their ability +in mathematical reasoning in visual contexts has not been systematically +studied. To bridge this gap, we present MathVista, a benchmark designed to +combine challenges from diverse mathematical and visual tasks. It consists of +6,141 examples, derived from 28 existing multimodal datasets involving +mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and +PaperQA). Completing these tasks requires fine-grained, deep visual +understanding and compositional reasoning, which all state-of-the-art +foundation models find challenging. With MathVista, we have conducted a +comprehensive, quantitative evaluation of 12 prominent foundation models. The +best-performing GPT-4V model achieves an overall accuracy of 49.9%, +substantially outperforming Bard, the second-best performer, by 15.1%. Our +in-depth analysis reveals that the superiority of GPT-4V is mainly attributed +to its enhanced visual perception and mathematical reasoning. However, GPT-4V +still falls short of human performance by 10.4%, as it often struggles to +understand complex figures and perform rigorous reasoning. This significant gap +underscores the critical role that MathVista will play in the development of +general-purpose AI agents capable of tackling mathematically intensive and +visually rich real-world tasks. We further explore the new ability of +self-verification, the application of self-consistency, and the interactive +chatbot capabilities of GPT-4V, highlighting its promising potential for future +research. The project is available at https://mathvista.github.io/. + +
+
+ comment: 112 pages, 117 figures. Work in progress +
+
+
+
+
+ + ♻ ☆ DecipherPref: Analyzing Influential Factors in Human Preference + Judgments via GPT-4 + + +
+ Human preference judgments are pivotal in guiding large language models +(LLMs) to produce outputs that align with human values. Human evaluations are +also used in summarization tasks to compare outputs from various systems, +complementing existing automatic metrics. Despite their significance, however, +there has been limited research probing these pairwise or $k$-wise comparisons. +The collective impact and relative importance of factors such as output length, +informativeness, fluency, and factual consistency are still not well +understood. It is also unclear if there are other hidden factors influencing +human judgments. In this paper, we conduct an in-depth examination of a +collection of pairwise human judgments released by OpenAI. Utilizing the +Bradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in +these human judgments. We find that the most favored factors vary across tasks +and genres, whereas the least favored factors tend to be consistent, e.g., +outputs are too brief, contain excessive off-focus content or hallucinated +facts. Our findings have implications for the construction of balanced datasets +in human preference evaluations, which is a crucial step in shaping the +behaviors of future LLMs. + +
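As a rough illustration of how a Bradley-Terry-Luce model turns pairwise judgments into strengths, the sketch below fits strengths with the standard minorization-maximization update. The win counts are made-up illustrative data, the function name is hypothetical, and none of it is the paper's code or results.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate BTL strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1. Uses the classic MM update:
    p_i <- W_i / sum_j (n_ij / (p_i + p_j)).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Illustrative counts only: three hypothetical items, ten comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
```

Under the model, the probability that item i is preferred over item j is `strengths[i] / (strengths[i] + strengths[j])`, so the fitted values give a single ranking consistent with all the pairwise counts.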
+
+
+
+
+ + ♻ ☆ Multimodal Automated Fact-Checking: A Survey EMNLP + + +
+ Misinformation is often conveyed in multiple modalities, e.g. a miscaptioned +image. Multimodal misinformation is perceived as more credible by humans, and +spreads faster than its text-only counterparts. While an increasing body of +research investigates automated fact-checking (AFC), previous surveys mostly +focus on text. In this survey, we conceptualise a framework for AFC including +subtasks unique to multimodal misinformation. Furthermore, we discuss related +terms used in different communities and map them to our framework. We focus on +four modalities prevalent in real-world fact-checking: text, image, audio, and +video. We survey benchmarks and models, and discuss limitations and promising +directions for future research. + +
+
+ comment: The 2023 Conference on Empirical Methods in Natural Language + Processing (EMNLP): Findings +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 114 + +
+
+
+ + ☆ SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous + Manipulation + + +
+ Humans excel at transferring manipulation skills across diverse object +shapes, poses, and appearances due to their understanding of semantic +correspondences between different instances. To endow robots with a similar +high-level understanding, we develop a Distilled Feature Field (DFF) for 3D +scenes, leveraging large 2D vision models to distill semantic features from +multiview images. While current research demonstrates advanced performance in +reconstructing DFFs from dense views, the development of learning a DFF from +sparse views is relatively nascent, despite its prevalence in numerous +manipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a +novel method for acquiring view-consistent 3D DFFs from sparse RGBD +observations, enabling one-shot learning of dexterous manipulations that are +transferable to novel scenes. Specifically, we map the image features to the 3D +point cloud, allowing for propagation across the 3D space to establish a dense +feature field. At the core of SparseDFF is a lightweight feature refinement +network, optimized with a contrastive loss between pairwise views after +back-projecting the image features onto the 3D point cloud. Additionally, we +implement a point-pruning mechanism to augment feature continuity within each +local neighborhood. By establishing coherent feature fields on both source and +target scenes, we devise an energy function that facilitates the minimization +of feature discrepancies w.r.t. the end-effector parameters between the +demonstration and the target manipulation. We evaluate our approach using a +dexterous hand, mastering real-world manipulations on both rigid and deformable +objects, and showcase robust generalization in the face of object and +scene-context variations. + +
+
+
+
+
+ + ☆ LLM-FP4: 4-Bit Floating-Point Quantized Transformers EMNLP 2023 + + +
+ We propose LLM-FP4 for quantizing both weights and activations in large +language models (LLMs) down to 4-bit floating-point values, in a post-training +manner. Existing post-training quantization (PTQ) solutions are primarily +integer-based and struggle with bit widths below 8 bits. Compared to integer +quantization, floating-point (FP) quantization is more flexible and can better +handle long-tail or bell-shaped distributions, and it has emerged as a default +choice in many hardware platforms. One characteristic of FP quantization is +that its performance largely depends on the choice of exponent bits and +clipping range. In this regard, we construct a strong FP-PTQ baseline by +searching for the optimal quantization parameters. Furthermore, we observe a +high inter-channel variance and low intra-channel variance pattern in +activation distributions, which makes activation quantization more difficult. We +recognize this pattern to be consistent across a spectrum of transformer models +designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models. +To tackle this, we propose per-channel activation quantization and show that +these additional scaling factors can be reparameterized as exponential biases +of weights, incurring a negligible cost. Our method, for the first time, can +quantize both weights and activations in LLaMA-13B to only 4 bits and +achieves an average score of 63.1 on the common sense zero-shot reasoning +tasks, which is only 5.8 lower than the full-precision model, significantly +outperforming the previous state-of-the-art by 12.7 points. Code is available +at: https://github.com/nbasyl/LLM-FP4. + +
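To make the dependence on exponent bits and clipping range concrete, here is a minimal sketch of generic low-bit floating-point quantization. It is not the LLM-FP4 method: the grid construction (one sign bit, subnormals, no codes reserved for infinity or NaN), the nearest-value rounding, and the names `fp_grid` and `quantize` are assumptions for illustration.

```python
def fp_grid(exp_bits, man_bits, bias):
    """Enumerate every value of a small sign/exponent/mantissa format.

    Assumption: exponent code 0 encodes subnormals, and no codes are
    reserved for infinity or NaN.
    """
    values = set()
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / (2 ** man_bits)
            if e == 0:
                v = frac * 2.0 ** (1 - bias)          # subnormal
            else:
                v = (1.0 + frac) * 2.0 ** (e - bias)  # normal
            values.add(v)
            values.add(-v)
    return sorted(values)

def quantize(x, grid, clip_max):
    """Clip x to [-clip_max, clip_max], then round to the nearest grid point.

    The grid is rescaled so that its largest value equals clip_max.
    """
    scale = clip_max / max(grid)
    clipped = max(-clip_max, min(clip_max, x))
    nearest = min(grid, key=lambda g: abs(g * scale - clipped))
    return nearest * scale

# A 4-bit format: 1 sign bit, 2 exponent bits, 1 mantissa bit, bias 1.
grid = fp_grid(exp_bits=2, man_bits=1, bias=1)
positive = [v for v in grid if v >= 0]
q = quantize(2.4, grid, clip_max=6.0)
```

Changing `exp_bits` trades dynamic range against resolution near zero, and changing `clip_max` rescales the whole grid, which is why a search over both is a natural baseline.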
+
+ comment: EMNLP 2023 Main Conference +
+
+
+
+
+ + ☆ Proposal-Contrastive Pretraining for Object Detection from Fewer Data ICLR 2023 + + +
+ The use of pretrained deep neural networks represents an attractive way to +achieve strong results with little data available. When specialized in dense +problems such as object detection, learning local rather than global +information in images has proven to be more efficient. However, for +unsupervised pretraining, the popular contrastive learning requires a large +batch size and, therefore, a lot of resources. To address this problem, we are +interested in transformer-based object detectors that have recently gained +traction in the community with good performance and with the particularity of +generating many diverse object proposals. + In this work, we present Proposal Selection Contrast (ProSeCo), a novel +unsupervised overall pretraining approach that leverages this property. ProSeCo +uses the large number of object proposals generated by the detector for +contrastive learning, which allows the use of a smaller batch size, combined +with object-level features to learn local information in the images. To improve +the effectiveness of the contrastive loss, we introduce the object location +information in the selection of positive examples to take into account multiple +overlapping object proposals. When reusing a pretrained backbone, we advocate for +consistency in learning local information between the backbone and the +detection head. + We show that our method outperforms the state of the art in unsupervised +pretraining for object detection on standard and novel benchmarks in learning +with fewer data. + +
+
+ comment: Published as a conference paper at ICLR 2023 +
+
+
+
+
+ + ☆ LightSpeed: Light and Fast Neural Light Fields on Mobile Devices + + +
+ Real-time novel-view image synthesis on mobile devices is prohibitive due to +the limited computational power and storage. Using volumetric rendering +methods, such as NeRF and its derivatives, on mobile devices is not suitable +due to the high computational cost of volumetric rendering. On the other hand, +recent advances in neural light field representations have shown promising +real-time view synthesis results on mobile devices. Neural light field methods +learn a direct mapping from a ray representation to the pixel color. The +current choice of ray representation is either stratified ray sampling or +Plücker coordinates, overlooking the classic light slab (two-plane) +representation, the preferred representation to interpolate between light field +views. In this work, we find that the light slab representation is an +efficient representation for learning a neural light field. More importantly, +it is a lower-dimensional ray representation enabling us to learn the 4D ray +space using feature grids which are significantly faster to train and render. +Although mostly designed for frontal views, we show that the light-slab +representation can be further extended to non-frontal scenes using a +divide-and-conquer strategy. Our method offers superior rendering quality +compared to previous light field methods and achieves a significantly improved +trade-off between rendering quality and speed. + +
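The light slab (two-plane) representation mentioned here can be sketched as follows: a ray is reduced to four numbers by recording where it crosses two parallel planes. The plane positions `z_near` and `z_far` and the function name are assumptions for illustration, not taken from the paper.

```python
def light_slab_coordinates(origin, direction, z_near=0.0, z_far=1.0):
    """Map a 3D ray to 4D two-plane coordinates (u, v, s, t).

    (u, v) is the intersection with the plane z = z_near and (s, t) the
    intersection with the plane z = z_far. Plane positions are assumed.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("ray is parallel to the two planes")
    t_near = (z_near - oz) / dz   # ray parameter at the first plane
    t_far = (z_far - oz) / dz     # ray parameter at the second plane
    u, v = ox + t_near * dx, oy + t_near * dy
    s, t = ox + t_far * dx, oy + t_far * dy
    return (u, v, s, t)

# A ray starting behind both planes and slanting in +x.
coords = light_slab_coordinates((0.0, 0.0, -1.0), (1.0, 0.0, 1.0))
```

Because each ray becomes a point in a 4D space, it can index a feature grid directly. Rays parallel to the planes have no valid coordinates, which is consistent with the abstract's remark that the representation is mostly designed for frontal views.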
+
+ comment: Project Page: http://lightspeed-r2l.github.io/website/ +
+
+
+
+
+ + ☆ PERF: Panoramic Neural Radiance Field from a Single Panorama + + +
+ Neural Radiance Field (NeRF) has achieved substantial progress in novel view +synthesis given multi-view images. Recently, some works have attempted to train +a NeRF from a single image with 3D priors. They mainly focus on a limited field +of view and there are few invisible occlusions, which greatly limits their +scalability to real-world 360-degree panoramic scenarios with large-size +occlusions. In this paper, we present PERF, a 360-degree novel view synthesis +framework that trains a panoramic neural radiance field from a single panorama. +Notably, PERF allows 3D roaming in a complex scene without expensive and +tedious image collection. To achieve this goal, we propose a novel +collaborative RGBD inpainting method and a progressive inpainting-and-erasing +method to lift up a 360-degree 2D scene to a 3D scene. Specifically, we first +predict a panoramic depth map as initialization given a single panorama, and +reconstruct visible 3D regions with volume rendering. Then we introduce a +collaborative RGBD inpainting approach into a NeRF for completing RGB images +and depth maps from random views, which is derived from an RGB Stable Diffusion +model and a monocular depth estimator. Finally, we introduce an +inpainting-and-erasing strategy to avoid inconsistent geometry between a +newly-sampled view and reference views. The two components are integrated into +the learning of NeRFs in a unified optimization framework and achieve promising +results. Extensive experiments on Replica and a new dataset PERF-in-the-wild +demonstrate the superiority of our PERF over state-of-the-art methods. Our PERF +can be widely used for real-world applications, such as panorama-to-3D, +text-to-3D, and 3D scene stylization applications. Project page and code are +available at https://perf-project.github.io/. + +
+
+ comment: Project page and code: https://perf-project.github.io/ +
+
+
+
+
+ + ☆ TD-MPC2: Scalable, Robust World Models for Continuous Control + + +
+ TD-MPC is a model-based reinforcement learning (RL) algorithm that performs +local trajectory optimization in the latent space of a learned implicit +(decoder-free) world model. In this work, we present TD-MPC2: a series of +improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves +significantly over baselines across 104 online RL tasks spanning 4 diverse task +domains, achieving consistently strong results with a single set of +hyperparameters. We further show that agent capabilities increase with model +and data size, and successfully train a single 317M parameter agent to perform +80 tasks across multiple task domains, embodiments, and action spaces. We +conclude with an account of lessons, opportunities, and risks associated with +large TD-MPC2 agents. Explore videos, models, data, code, and more at +https://nicklashansen.github.io/td-mpc2 + +
+
+ comment: Explore videos, models, data, code, and more at + https://nicklashansen.github.io/td-mpc2 +
+
+
+
+
+ + ☆ CommonCanvas: An Open Diffusion Model Trained with Creative-Commons + Images + + +
+ We assemble a dataset of Creative-Commons-licensed (CC) images, which we use +to train a set of open diffusion models that are qualitatively competitive with +Stable Diffusion 2 (SD2). This task presents two challenges: (1) +high-resolution CC images lack the captions necessary to train text-to-image +generative models; (2) CC images are relatively scarce. In turn, to address +these challenges, we use an intuitive transfer learning technique to produce a +set of high-quality synthetic captions paired with curated CC images. We then +develop a data- and compute-efficient training recipe that requires as little +as 3% of the LAION-2B data needed to train existing SD2 models, but obtains +comparable quality. These results indicate that we have a sufficient number of +CC images (~70 million) for training high-quality models. Our training recipe +also implements a variety of optimizations that achieve ~3X training speed-ups, +enabling rapid model iteration. We leverage this recipe to train several +high-quality text-to-image models, which we dub the CommonCanvas family. Our +largest model achieves comparable performance to SD2 on a human evaluation, +despite being trained on our CC dataset that is significantly smaller than +LAION and using synthetic captions for training. We release our models, data, +and code at +https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md + +
+
+
+
+
+ + ☆ DreamCraft3D: Hierarchical 3D Generation with Bootstrapped Diffusion + Prior + + +
+ We present DreamCraft3D, a hierarchical 3D content generation method that +produces high-fidelity and coherent 3D objects. We tackle the problem by +leveraging a 2D reference image to guide the stages of geometry sculpting and +texture boosting. A central focus of this work is to address the consistency +issue that existing works encounter. To sculpt geometries that render +coherently, we perform score distillation sampling via a view-dependent +diffusion model. This 3D prior, alongside several training strategies, +prioritizes the geometry consistency but compromises the texture fidelity. We +further propose Bootstrapped Score Distillation to specifically boost the +texture. We train a personalized diffusion model, Dreambooth, on the augmented +renderings of the scene, imbuing it with 3D knowledge of the scene being +optimized. The score distillation from this 3D-aware diffusion prior provides +view-consistent guidance for the scene. Notably, through an alternating +optimization of the diffusion prior and 3D scene representation, we achieve +mutually reinforcing improvements: the optimized 3D scene aids in training the +scene-specific diffusion model, which offers increasingly view-consistent +guidance for 3D optimization. The optimization is thus bootstrapped and leads +to substantial texture boosting. With tailored 3D priors throughout the +hierarchical generation, DreamCraft3D generates coherent 3D objects with +photorealistic renderings, advancing the state-of-the-art in 3D content +generation. Code available at https://github.com/deepseek-ai/DreamCraft3D. + +
+
+ comment: Project Page: https://mrtornado24.github.io/DreamCraft3D/ +
+
+
+
+
+ + ☆ Exploring OCR Capabilities of GPT-4V(ision) : A Quantitative and + In-depth Evaluation + + +
+ This paper presents a comprehensive evaluation of the Optical Character +Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large +Multimodal Model (LMM). We assess the model's performance across a range of OCR +tasks, including scene text recognition, handwritten text recognition, +handwritten mathematical expression recognition, table structure recognition, +and information extraction from visually rich documents. The evaluation reveals +that GPT-4V performs well in recognizing and understanding Latin contents, but +struggles with multilingual scenarios and complex tasks. Based on these +observations, we delve deeper into the necessity of specialized OCR models and +deliberate on the strategies to fully harness the pretrained general LMMs like +GPT-4V for OCR downstream tasks. The study offers a critical reference for +future research in OCR with LMMs. Evaluation pipeline and results are available +at https://github.com/SCUT-DLVCLab/GPT-4V_OCR. + +
+
+
+
+
+ + ☆ Fingervein Verification using Convolutional Multi-Head Attention Network WACV + + +
+ Biometric verification systems are deployed in various security-based +access-control applications that require user-friendly and reliable person +verification. Among the different biometric characteristics, fingervein +biometrics have been extensively studied owing to their reliable verification +performance. Furthermore, fingervein patterns reside inside the skin and are +not visible outside; therefore, they possess inherent resistance to +presentation attacks and degradation due to external factors. In this paper, we +introduce a novel fingervein verification technique using a convolutional +multihead attention network called VeinAtnNet. The proposed VeinAtnNet is +designed to achieve light weight with a smaller number of learnable parameters +while extracting discriminant information from both normal and enhanced +fingervein images. The proposed VeinAtnNet was trained on the newly constructed +fingervein dataset with 300 unique fingervein patterns that were captured in +multiple sessions to obtain 92 samples per unique fingervein. Extensive +experiments were performed on the newly collected dataset FV-300 and the +publicly available FV-USM and FV-PolyU fingervein datasets. The performance of +the proposed method was compared with five state-of-the-art fingervein +verification systems, indicating the efficacy of the proposed VeinAtnNet. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV), 2024 +
+
+
+
+
+ + ☆ The GOOSE Dataset for Perception in Unstructured Environments + + +
+ The potential for deploying autonomous systems can be significantly increased +by improving the perception and interpretation of the environment. However, the +development of deep learning-based techniques for autonomous systems in +unstructured outdoor environments poses challenges due to limited data +availability for training and testing. To address this gap, we present the +German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset +specifically designed for unstructured outdoor environments. The GOOSE dataset +incorporates 10 000 labeled pairs of images and point clouds, which are +utilized to train a range of state-of-the-art segmentation models on both image +and point cloud data. We open source the dataset, along with an ontology for +unstructured terrain, as well as dataset standards and guidelines. This +initiative aims to establish a common framework, enabling the seamless +inclusion of existing datasets and a fast way to enhance the perception +capabilities of various robots operating in unstructured environments. The +dataset, pre-trained models for offroad perception, and additional +documentation can be found at https://goose-dataset.de/. + +
+
+ comment: Preprint; Submitted to IEEE for review +
+
+
+
+
+ + ☆ S$^3$-TTA: Scale-Style Selection for Test-Time Augmentation in + Biomedical Image Segmentation + + +
+ Deep-learning models have been successful in biomedical image segmentation. +To generalize for real-world deployment, test-time augmentation (TTA) methods +are often used to transform the test image into different versions that are +hopefully closer to the training domain. Unfortunately, due to the vast +diversity of instance scale and image styles, many augmented test images +produce undesirable results, thus lowering the overall performance. This work +proposes a new TTA framework, S$^3$-TTA, which selects the suitable image scale +and style for each test image based on a transformation consistency metric. In +addition, S$^3$-TTA constructs an end-to-end augmentation-segmentation +joint-training pipeline to ensure a task-oriented augmentation. On public +benchmarks for cell and lung segmentation, S$^3$-TTA demonstrates improvements +over the prior art by 3.4% and 1.3%, respectively, by simply augmenting the +input data in testing phase. + +
+
+
+
+
+ + ☆ Kiki or Bouba? Sound Symbolism in Vision-and-Language Models NeurIPS 2023 + + +
+ Although the mapping between sound and meaning in human language is assumed +to be largely arbitrary, research in cognitive science has shown that there are +non-trivial correlations between particular sounds and meanings across +languages and demographic groups, a phenomenon known as sound symbolism. Among +the many dimensions of meaning, sound symbolism is particularly salient and +well-demonstrated with regards to cross-modal associations between language and +the visual domain. In this work, we address the question of whether sound +symbolism is reflected in vision-and-language models such as CLIP and Stable +Diffusion. Using zero-shot knowledge probing to investigate the inherent +knowledge of these models, we find strong evidence that they do show this +pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our +work provides a novel method for demonstrating sound symbolism and +understanding its nature using computational tools. Our code will be made +publicly available. + +
+
+ comment: Accepted to NeurIPS 2023 (spotlight). Project webpage: + https://kiki-bouba.github.io/ +
+
+
+
+
+ + ☆ MixerFlow for Image Modelling + + +
+ Normalising flows are statistical models that transform a complex density +into a simpler density through the use of bijective transformations enabling +both density estimation and data generation from a single model. In the context +of image modelling, the predominant choice has been the Glow-based +architecture, whereas alternative architectures remain largely unexplored in +the research community. In this work, we propose a novel architecture called +MixerFlow, based on the MLP-Mixer architecture, further unifying the generative +and discriminative modelling architectures. MixerFlow offers an effective +mechanism for weight sharing for flow-based models. Our results demonstrate +better density estimation on image datasets under a fixed computational budget +and good scaling as the image resolution increases, making MixerFlow a powerful +yet simple alternative to the Glow-based architectures. We also show that +MixerFlow provides more informative embeddings than Glow-based architectures. + +
+
+
+
+
+ + ☆ ConvNets Match Vision Transformers at Scale + + +
+ Many researchers believe that ConvNets perform well on small or moderately +sized datasets, but are not competitive with Vision Transformers when given +access to datasets on the web-scale. We challenge this belief by evaluating a +performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset +of images often used for training foundation models. We consider pre-training +compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a +series of networks of increasing depth and width from the NFNet model family. +We observe a log-log scaling law between held out loss and compute budget. +After fine-tuning on ImageNet, NFNets match the reported performance of Vision +Transformers with comparable compute budgets. Our strongest fine-tuned model +achieves a Top-1 accuracy of 90.4%. + +
+
+
+
+
+ + ☆ CAD -- Contextual Multi-modal Alignment for Dynamic AVQA + + +
+ In the context of Audio Visual Question Answering (AVQA) tasks, the audio +visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and +3) Semantic. Existing AVQA methods suffer from two major shortcomings; the +audio-visual (AV) information passing through the network isn't aligned on +Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic +information is often not balanced within a context; this results in poor +performance. In this paper, we propose a novel end-to-end Contextual +Multi-modal Alignment (CAD) network that addresses the challenges in AVQA +methods by i) introducing a parameter-free stochastic Contextual block that +ensures robust audio and visual alignment on the Spatial level; ii) proposing a +pre-training technique for dynamic audio and visual alignment on Temporal level +in a self-supervised setting, and iii) introducing a cross-attention mechanism +to balance audio and visual information on Semantic level. The proposed novel +CAD network improves the overall performance over the state-of-the-art methods +on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our +proposed contributions to AVQA can be added to the existing methods to improve +their performance without additional complexity requirements. + +
+
+
+
+
+ + ☆ Metrically Scaled Monocular Depth Estimation through Sparse Priors for + Underwater Robots ICRA 2024 + + +
+ In this work, we address the problem of real-time dense depth estimation from +monocular images for mobile underwater vehicles. We formulate a deep learning +model that fuses sparse depth measurements from triangulated features to +improve the depth predictions and solve the problem of scale ambiguity. To +allow prior inputs of arbitrary sparsity, we apply a dense parameterization +method. Our model extends recent state-of-the-art approaches to monocular image +based depth estimation, using an efficient encoder-decoder backbone and modern +lightweight transformer optimization stage to encode global context. The +network is trained in a supervised fashion on the forward-looking underwater +dataset, FLSea. Evaluation results on this dataset demonstrate significant +improvement in depth prediction accuracy by the fusion of the sparse feature +priors. In addition, without any retraining, our method achieves similar depth +prediction accuracy on a downward looking dataset we collected with a diver +operated camera rig, conducting a survey of a coral reef. The method achieves +real-time performance, running at 160 FPS on a laptop GPU and 7 FPS on a single +CPU core and is suitable for direct deployment on embedded systems. The +implementation of this work is made publicly available at +https://github.com/ebnerluca/uw_depth. + +
+
+ comment: Submitted to ICRA 2024 +
+
+
+
+
+ + ☆ Interferometric Neural Networks + + +
+ On the one hand, artificial neural networks have many successful applications +in the field of machine learning and optimization. On the other hand, +interferometers are integral parts of any field that deals with waves such as +optics, astronomy, and quantum physics. Here, we introduce neural networks +composed of interferometers and then build generative adversarial networks from +them. Our networks do not have any classical layer and can be realized on +quantum computers or photonic chips. We demonstrate their applicability for +combinatorial optimization, image classification, and image generation. For +combinatorial optimization, our network consistently converges to the global +optimum or remains within a narrow range of it. In multi-class image +classification tasks, our networks achieve accuracies of 93% and 83%. Lastly, +we show their capability to generate images of digits from 0 to 9 as well as +human faces. + +
+
+ comment: 11 pages +
+
+
+
+
+ + ☆ A No-Reference Quality Assessment Method for Digital Human Head + + +
+ In recent years, digital humans have been widely applied in augmented/virtual +reality (A/VR), where viewers are allowed to freely observe and interact with +the volumetric content. However, the digital humans may be degraded with +various distortions during the procedure of generation and transmission. +Moreover, little effort has been put into the perceptual quality assessment of +digital humans. Therefore, it is urgent to carry out objective quality +assessment methods to tackle the challenge of digital human quality assessment +(DHQA). In this paper, we develop a novel no-reference (NR) method based on +Transformer to deal with DHQA in a multi-task manner. Specifically, the front +2D projections of the digital humans are rendered as inputs and the vision +transformer (ViT) is employed for the feature extraction. Then we design a +multi-task module to jointly classify the distortion types and predict the +perceptual quality levels of digital humans. The experimental results show that +the proposed method well correlates with the subjective ratings and outperforms +the state-of-the-art quality assessment methods. + +
+
+
+
+
+ + ☆ Rebuild City Buildings from Off-Nadir Aerial Images with Offset-Building + Model (OBM) + + +
+ Accurate measurement of the offset from roof-to-footprint in +very-high-resolution remote sensing imagery is crucial for urban information +extraction tasks. With the help of deep learning, existing methods typically +rely on two-stage CNN models to extract regions of interest on building feature +maps. At the first stage, a Region Proposal Network (RPN) is applied to extract +thousands of ROIs (Regions of Interest), which are then passed into a +Region-based Convolutional Neural Network (RCNN) to extract the wanted +information. However, because of inflexible RPN, these methods often lack +effective user interaction, encounter difficulties in instance correspondence, +and struggle to keep up with the advancements in general artificial +intelligence. This paper introduces an interactive Transformer model combined +with a prompt encoder to precisely extract building segmentation as well as the +offset vectors from roofs to footprints. In our model, a powerful module, +namely ROAM, was tailored for common problems in predicting roof-to-footprint +offsets. We tested our model's feasibility on the publicly available BONAI +dataset, achieving a significant reduction in Prompt-Instance-Level offset +errors ranging from 14.6% to 16.3%. Additionally, we developed a Distance-NMS +algorithm tailored for large-scale building offsets, significantly enhancing +the accuracy of predicted building offset angles and lengths in a +straightforward and efficient manner. To further validate the model's +robustness, we created a new test set using 0.5m remote sensing imagery from +Huizhou, China, for inference testing. Our code, training methods, and the +updated dataset will be accessible at https://github.com/likaiucas. + +
+
+ comment: 24 pages, 9 figures +
+
+
+
+
+ + ☆ Nighttime Driver Behavior Prediction Using Taillight Signal Recognition + via CNN-SVM Classifier + + +
+ This paper aims to enhance the ability to predict nighttime driving behavior +by identifying taillights of both human-driven and autonomous vehicles. The +proposed model incorporates a customized detector designed to accurately detect +front-vehicle taillights on the road. At the beginning of the detector, a +learnable pre-processing block is implemented, which extracts deep features +from input images and calculates the data rarity for each feature. In the next +step, drawing inspiration from soft attention, a weighted binary mask is +designed that guides the model to focus more on predetermined regions. This +research utilizes Convolutional Neural Networks (CNNs) to extract +distinguishing characteristics from these areas, then reduces dimensions using +Principal Component Analysis (PCA). Finally, the Support Vector Machine (SVM) +is used to predict the behavior of the vehicles. To train and evaluate the +model, a large-scale dataset is collected from two types of dash-cams and +Insta360 cameras from the rear view of Ford Motor Company vehicles. This +dataset includes over 12k frames captured during both daytime and nighttime +hours. To address the limited nighttime data, a unique pixel-wise image +processing technique is implemented to convert daytime images into realistic +night images. The findings from the experiments demonstrate that the proposed +methodology can accurately categorize vehicle behavior with 92.14% accuracy, +97.38% specificity, 92.09% sensitivity, 92.10% F1-measure, and 0.895 Cohen's +Kappa Statistic. Further details are available at +https://github.com/DeepCar/Taillight_Recognition. + +
+
+ comment: 12 pages, 10 figures +
+
+
+
+
+ + ☆ From Pointwise to Powerhouse: Initialising Neural Networks with + Generative Models + + +
+ Traditional initialisation methods, e.g. He and Xavier, have been effective +in avoiding the problem of vanishing or exploding gradients in neural networks. +However, they only use simple pointwise distributions, which model +one-dimensional variables. Moreover, they ignore most information about the +architecture and disregard past training experiences. These limitations can be +overcome by employing generative models for initialisation. In this paper, we +introduce two groups of new initialisation methods. First, we locally +initialise weight groups by employing variational autoencoders. Secondly, we +globally initialise full weight sets by employing graph hypernetworks. We +thoroughly evaluate the impact of the employed generative models on +state-of-the-art neural networks in terms of accuracy, convergence speed and +ensembling. Our results show that global initialisations result in higher +accuracy and faster initial convergence speed. However, the implementation +through graph hypernetworks leads to diminished ensemble performance on out of +distribution data. To counteract, we propose a modification called noise graph +hypernetwork, which encourages diversity in the produced ensemble members. +Furthermore, our approach might be able to transfer learned knowledge to +different image distributions. Our work provides insights into the potential, +the trade-offs and possible modifications of these new initialisation methods. + +
+
+
+
+
+ + ☆ DSAM-GN: Graph Network based on Dynamic Similarity Adjacency Matrices for + Vehicle Re-identification + + +
+ In recent years, vehicle re-identification (Re-ID) has gained increasing +importance in various applications such as assisted driving systems, traffic +flow management, and vehicle tracking, due to the growth of intelligent +transportation systems. However, the presence of extraneous background +information and occlusions can interfere with the learning of discriminative +features, leading to significant variations in the same vehicle image across +different scenarios. This paper proposes a method, named graph network based on +dynamic similarity adjacency matrices (DSAM-GN), which incorporates a novel +approach for constructing adjacency matrices to capture spatial relationships +of local features and reduce background noise. Specifically, the proposed +method divides the extracted vehicle features into different patches as nodes +within the graph network. A spatial attention-based similarity adjacency matrix +generation (SASAMG) module is employed to compute similarity matrices of nodes, +and a dynamic erasure operation is applied to disconnect nodes with low +similarity, resulting in similarity adjacency matrices. Finally, the nodes and +similarity adjacency matrices are fed into graph networks to extract more +discriminative features for vehicle Re-ID. Experimental results on public +datasets VeRi-776 and VehicleID demonstrate the effectiveness of the proposed +method compared with recent works. + +
+
+ comment: This paper has been accepted by the 20th Pacific Rim International + Conference on Artificial Intelligence in 2023 +
+
+
+
+
+ + ☆ Local Statistics for Generative Image Detection + + +
+ Diffusion models (DMs) are generative models that learn to synthesize images +from Gaussian noise. DMs can be trained to do a variety of tasks such as image +generation and image super-resolution. Researchers have made significant +improvement in the capability of synthesizing photorealistic images in the past +few years. These successes also hasten the need to address the potential misuse +of synthesized images. In this paper, we highlight the effectiveness of +computing local statistics, as opposed to global statistics, in distinguishing +digital camera images from DM-generated images. We hypothesized that local +statistics should be used to address the spatial non-stationarity problem in +images. We show that our approach produced promising results and it is also +robust to various perturbations such as image resizing and JPEG compression. + +
+
+
+
+
+ + ☆ CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary + Object Detection NeurIPS 2023 + + +
+ Deriving reliable region-word alignment from image-text pairs is critical to +learn object-level vision-language representations for open-vocabulary object +detection. Existing methods typically rely on pre-trained or self-trained +vision-language models for alignment, which are prone to limitations in +localization accuracy or generalization capabilities. In this paper, we propose +CoDet, a novel approach that overcomes the reliance on pre-aligned +vision-language space by reformulating region-word alignment as a co-occurring +object discovery problem. Intuitively, by grouping images that mention a shared +concept in their captions, objects corresponding to the shared concept shall +exhibit high co-occurrence among the group. CoDet then leverages visual +similarities to discover the co-occurring objects and align them with the +shared concept. Extensive experiments demonstrate that CoDet has superior +performances and compelling scalability in open-vocabulary detection, e.g., by +scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and +44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 +$\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at +https://github.com/CVMI-Lab/CoDet. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Robust Source-Free Domain Adaptation for Fundus Image Segmentation WACV2024 + + +
+ Unsupervised Domain Adaptation (UDA) is a learning technique that transfers +knowledge learned in the source domain from labelled training data to the +target domain with only unlabelled data. It is of significant importance to +medical image segmentation because of the usual lack of labelled training data. +Although extensive efforts have been made to optimize UDA techniques to improve +the accuracy of segmentation models in the target domain, few studies have +addressed the robustness of these models under UDA. In this study, we propose a +two-stage training strategy for robust domain adaptation. In the source +training stage, we utilize adversarial sample augmentation to enhance the +robustness and generalization capability of the source model. And in the target +training stage, we propose a novel robust pseudo-label and pseudo-boundary +(PLPB) method, which effectively utilizes unlabeled target data to generate +pseudo labels and pseudo boundaries that enable model self-adaptation without +requiring source data. Extensive experimental results on cross-domain fundus +image segmentation confirm the effectiveness and versatility of our method. +Source code of this study is openly accessible at +https://github.com/LinGrayy/PLPB. + +
+
+ comment: 10 pages, WACV2024 +
+
+
+
+
+ + ☆ Deep Learning Techniques for Cervical Cancer Diagnosis based on + Pathology and Colposcopy Images + + +
+ Cervical cancer is a prevalent disease affecting millions of women worldwide +every year. It requires significant attention, as early detection during the +precancerous stage provides an opportunity for a cure. The screening and +diagnosis of cervical cancer rely on cytology and colposcopy methods. Deep +learning, a promising technology in computer vision, has emerged as a potential +solution to improve the accuracy and efficiency of cervical cancer screening +compared to traditional clinical inspection methods that are prone to human +error. This review article discusses cervical cancer and its screening +processes, followed by the Deep Learning training process and the +classification, segmentation, and detection tasks for cervical cancer +diagnosis. Additionally, we explored the most common public datasets used in +both cytology and colposcopy and highlighted the popular and most utilized +architectures that researchers have applied to both cytology and colposcopy. We +reviewed 24 selected practical papers in this study and summarized them. This +article highlights the remarkable efficiency in enhancing the precision and +speed of cervical cancer analysis by Deep Learning, bringing us closer to early +diagnosis and saving lives. + +
+
+
+
+
+ + ☆ A Picture is Worth a Thousand Words: Principled Recaptioning Improves + Image Generation + + +
+ Text-to-image diffusion models achieved a remarkable leap in capabilities +over the last few years, enabling high-quality and diverse synthesis of images +from a textual prompt. However, even the most advanced models often struggle to +precisely follow all of the directions in their prompts. The vast majority of +these models are trained on datasets consisting of (image, caption) pairs where +the images often come from the web, and the captions are their HTML alternate +text. A notable example is the LAION dataset, used by Stable Diffusion and +other models. In this work we observe that these captions are often of low +quality, and argue that this significantly affects the model's capability to +understand nuanced semantics in the textual prompts. We show that by relabeling +the corpus with a specialized automatic captioning model and training a +text-to-image model on the recaptioned dataset, the model benefits +substantially across the board. First, in overall image quality: e.g. FID 14.84 +vs. the baseline of 17.87, and 64.3% improvement in faithful image generation +according to human evaluation. Second, in semantic alignment, e.g. semantic +object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and +positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the +corpus and provide evidence that this technique, which we call RECAP, both +reduces the train-inference discrepancy and provides the model with more +information per example, increasing sample efficiency and allowing the model to +better understand the relations between captions and images. + +
+
+
+
+
+ + ☆ EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression + Recognition + + +
+ Facial Expression Recognition (FER) is a crucial task in affective computing, +but its conventional focus on the seven basic emotions limits its applicability +to the complex and expanding emotional spectrum. To address the issue of new +and unseen emotions present in dynamic in-the-wild FER, we propose a novel +vision-language model that utilises sample-level text descriptions (i.e. +captions of the context, expressions or emotional cues) as natural language +supervision, aiming to enhance the learning of rich latent representations, for +zero-shot classification. To test this, we evaluate using zero-shot +classification of the model trained on sample-level descriptions on four +popular dynamic FER datasets. Our findings show that this approach yields +significant improvements when compared to baseline methods. Specifically, for +zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted +Average Recall and 5% in terms of Unweighted Average Recall on several +datasets. Furthermore, we evaluate the representations obtained from the +network trained using sample-level descriptions on the downstream task of +mental health symptom estimation, achieving performance comparable or superior +to state-of-the-art methods and strong agreement with human experts. Namely, we +achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia +symptom severity estimation, which is comparable to human experts' agreement. +The code is publicly available at: https://github.com/NickyFot/EmoCLIP. + +
+
+ comment: 10 pages, 3 figures +
+
+
+
+
+ + ☆ Driving through the Concept Gridlock: Unraveling Explainability + Bottlenecks + + +
+ Concept bottleneck models have been successfully used for explainable machine +learning by encoding information within the model with a set of human-defined +concepts. In the context of human-assisted or autonomous driving, +explainability models can help user acceptance and understanding of decisions +made by the autonomous vehicle, which can be used to rationalize and explain +driver or vehicle behavior. We propose a new approach using concept bottlenecks +as visual features for control command predictions and explanations of user and +vehicle behavior. We learn a human-understandable concept layer that we use to +explain sequential driving scenes while learning vehicle control commands. This +approach can then be used to determine whether a change in a preferred gap or +steering commands from a human (or autonomous vehicle) is led by an external +stimulus or change in preferences. We achieve competitive performance to latent +visual features while gaining interpretability within our model setup. + +
+
+
+
+
+ + ☆ EdgeCalib: Multi-Frame Weighted Edge Features for Automatic Targetless + LiDAR-Camera Calibration + + +
+ In multimodal perception systems, achieving precise extrinsic calibration +between LiDAR and camera is of critical importance. Previous calibration +methods often required specific targets or manual adjustments, making them both +labor-intensive and costly. Online calibration methods based on features have +been proposed, but these methods encounter challenges such as imprecise feature +extraction, unreliable cross-modality associations, and high scene-specific +requirements. To address this, we introduce an edge-based approach for +automatic online calibration of LiDAR and cameras in real-world scenarios. The +edge features, which are prevalent in various environments, are aligned in both +images and point clouds to determine the extrinsic parameters. Specifically, +stable and robust image edge features are extracted using a SAM-based method +and the edge features extracted from the point cloud are weighted through a +multi-frame weighting strategy for feature filtering. Finally, accurate +extrinsic parameters are optimized based on edge correspondence constraints. We +conducted evaluations on both the KITTI dataset and our dataset. The results +show a state-of-the-art rotation accuracy of 0.086° and a translation +accuracy of 0.977 cm, outperforming existing edge-based calibration methods in +both precision and robustness. + +
+
+
+
+
+ + ☆ Real-time 6-DoF Pose Estimation by an Event-based Camera using Active + LED Markers WACV 2024 + + +
+ Real-time applications for autonomous operations depend largely on fast and +robust vision-based localization systems. Since image processing tasks require +processing large amounts of data, the computational resources often limit the +performance of other processes. To overcome this limitation, traditional +marker-based localization systems are widely used since they are easy to +integrate and achieve reliable accuracy. However, classical marker-based +localization systems significantly depend on standard cameras with low frame +rates, which often lack accuracy due to motion blur. In contrast, event-based +cameras provide high temporal resolution and a high dynamic range, which can be +utilized for fast localization tasks, even under challenging visual conditions. +This paper proposes a simple but effective event-based pose estimation system +using active LED markers (ALM) for fast and accurate pose estimation. The +proposed algorithm is able to operate in real time with a latency below +0.5 ms while maintaining output rates of 3 kHz. +Experimental results in static and dynamic scenarios are presented to +demonstrate the performance of the proposed approach in terms of computational +speed and absolute accuracy, using the OptiTrack system as the basis for +measurement. + +
+
+ comment: 14 pages, 12 figures, this paper has been accepted to WACV 2024 +
+
+
+
+
+ + ☆ Context Does Matter: End-to-end Panoptic Narrative Grounding with + Deformable Attention Refined Matching Network ICDM 2023 + + +
+ Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that +aims to segment visual objects in images based on dense narrative captions. The +current state-of-the-art methods first refine the representation of each phrase by +aggregating the most similar $k$ image pixels, and then match the refined text +representations with the pixels of the image feature map to generate +segmentation results. However, simply aggregating sampled image features +ignores the contextual information, which can lead to phrase-to-pixel +mismatch. In this paper, we propose a novel learning framework called +Deformable Attention Refined Matching Network (DRMN), whose main idea is to +bring deformable attention into the iterative process of feature learning to +incorporate essential context information of different scales of pixels. DRMN +iteratively re-encodes pixels with the deformable attention network after +updating the feature representation of the top-$k$ most similar pixels. As +such, DRMN can lead to accurate yet discriminative pixel representations, +purify the top-$k$ most similar pixels, and consequently alleviate the +phrase-to-pixel mismatch substantially. Experimental results show that our +novel design significantly improves the matching results between text phrases +and image pixels. Concretely, DRMN achieves new state-of-the-art performance on +the PNG benchmark with an average recall improvement of 3.5%. The code is +available at: https://github.com/JaMesLiMers/DRMN. + +
+
+ comment: Accepted by ICDM 2023 +
+
+
+
+
+ + ☆ $\mathbb{VD}$-$\mathbb{GR}$: Boosting $\mathbb{V}$isual + $\mathbb{D}$ialog with Cascaded Spatial-Temporal Multi-Modal + $\mathbb{GR}$aphs WACV 2024 + + +
+ We propose $\mathbb{VD}$-$\mathbb{GR}$ - a novel visual dialog model that +combines pre-trained language models (LMs) with graph neural networks (GNNs). +Prior works mainly focused on one class of models at the expense of the other, +thus missing out on the opportunity of combining their respective benefits. At +the core of $\mathbb{VD}$-$\mathbb{GR}$ is a novel integration mechanism that +alternates between spatial-temporal multi-modal GNNs and BERT layers, and that +covers three distinct contributions: First, we use multi-modal GNNs to process +the features of each modality (image, question, and dialog history) and exploit +their local structures before performing BERT global attention. Second, we +propose hub-nodes that link to all other nodes within one modality graph, +allowing the model to propagate information from one GNN (modality) to the +other in a cascaded manner. Third, we augment the BERT hidden states with +fine-grained multi-modal GNN features before passing them to the next +$\mathbb{VD}$-$\mathbb{GR}$ layer. Evaluations on VisDial v1.0, VisDial v0.9, +VisDialConv, and VisPro show that $\mathbb{VD}$-$\mathbb{GR}$ achieves new +state-of-the-art results across all four datasets. + +
+
+ comment: WACV 2024 +
+
+
+
+
+ + ☆ Adaptive Uncertainty Estimation via High-Dimensional Testing on Latent + Representations NeurIPS 2023 + + +
+ Uncertainty estimation aims to evaluate the confidence of a trained deep +neural network. However, existing uncertainty estimation approaches rely on +low-dimensional distributional assumptions and thus suffer from the high +dimensionality of latent features. Existing approaches tend to focus on +uncertainty on discrete classification probabilities, which leads to poor +generalizability to uncertainty estimation for other tasks. Moreover, most of +the literature requires seeing the out-of-distribution (OOD) data in the +training for better estimation of uncertainty, which limits the uncertainty +estimation performance in practice because the OOD data are typically unseen. +To overcome these limitations, we propose a new framework using data-adaptive +high-dimensional hypothesis testing for uncertainty estimation, which leverages +the statistical properties of the feature representations. Our method directly +operates on latent representations and thus does not require retraining the +feature encoder under a modified objective. The test statistic relaxes the +feature distribution assumptions to high dimensionality, and it is more +discriminative to uncertainties in the latent representations. We demonstrate +that encoding features with Bayesian neural networks can enhance testing +performance and lead to more accurate uncertainty estimation. We further +introduce a family-wise testing procedure to determine the optimal threshold of +OOD detection, which minimizes the false discovery rate (FDR). Extensive +experiments validate the satisfactory performance of our framework on +uncertainty estimation and task-specific prediction over a variety of +competitors. The experiments on the OOD detection task also show satisfactory +performance of our method when the OOD data are unseen in the training. Codes +are available at https://github.com/HKU-MedAI/bnn_uncertainty. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Learning to Explain: A Model-Agnostic Framework for Explaining Black Box + Models + + +
+ We present Learning to Explain (LTX), a model-agnostic framework designed for +providing post-hoc explanations for vision models. The LTX framework introduces +an "explainer" model that generates explanation maps, highlighting the crucial +regions that justify the predictions made by the model being explained. To +train the explainer, we employ a two-stage process consisting of initial +pretraining followed by per-instance finetuning. During both stages of +training, we utilize a unique configuration where we compare the explained +model's prediction for a masked input with its original prediction for the +unmasked input. This approach enables the use of a novel counterfactual +objective, which aims to anticipate the model's output using masked versions of +the input image. Importantly, the LTX framework is not restricted to a specific +model architecture and can provide explanations for both Transformer-based and +convolutional models. Through our evaluations, we demonstrate that LTX +significantly outperforms the current state-of-the-art in explainability across +various metrics. + +
+
+
+
+
+ + ☆ Adapt Anything: Tailor Any Image Classifiers across Domains And + Categories Using Text-to-Image Diffusion Models + + +
+ We do not pursue a novel method in this paper, but aim to study if a modern +text-to-image diffusion model can tailor any task-adaptive image classifier +across domains and categories. Existing domain adaptive image classification +works exploit both source and target data for domain alignment so as to +transfer the knowledge learned from the labeled source data to the unlabeled +target data. However, as the development of the text-to-image diffusion model, +we wonder if the high-fidelity synthetic data from the text-to-image generator +can serve as a surrogate of the source data in real world. In this way, we do +not need to collect and annotate the source data for each domain adaptation +task in a one-for-one manner. Instead, we utilize only one off-the-shelf +text-to-image model to synthesize images with category labels derived from the +corresponding text prompts, and then leverage the surrogate data as a bridge to +transfer the knowledge embedded in the task-agnostic text-to-image generator to +the task-oriented image classifier via domain adaptation. Such a one-for-all +adaptation paradigm allows us to adapt anything in the world using only one +text-to-image generator as well as the corresponding unlabeled target data. +Extensive experiments validate the feasibility of the proposed idea, which even +surpasses the state-of-the-art domain adaptation works using the source data +collected and annotated in real world. + +
+
+ comment: 11 pages, 6 figures +
+
+
+
+
+ + ☆ Flow-Attention-based Spatio-Temporal Aggregation Network for 3D Mask + Detection NeurIPS 2023 + + +
+ Anti-spoofing detection has become a necessity for face recognition systems +due to the security threat posed by spoofing attacks. Despite great success in +traditional attacks, most deep-learning-based methods perform poorly in 3D +masks, which can highly simulate real faces in appearance and structure, +suffering generalizability insufficiency while focusing only on the spatial +domain with single frame input. This has been mitigated by the recent +introduction of a biomedical technology called rPPG (remote +photoplethysmography). However, rPPG-based methods are sensitive to noisy +interference and require at least one second (> 25 frames) of observation time, +which induces high computational overhead. To address these challenges, we +propose a novel 3D mask detection framework, called FASTEN +(Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the +network for focusing more on fine-grained details in large movements, which can +eliminate redundant spatio-temporal feature interference and quickly capture +splicing traces of 3D masks in fewer frames. Our proposed network contains +three key modules: 1) a facial optical flow network to obtain non-RGB +inter-frame flow information; 2) flow attention to assign different +significance to each frame; 3) spatio-temporal aggregation to aggregate +high-level spatial features and temporal transition features. Through extensive +experiments, FASTEN only requires five frames of input and outperforms eight +competitors for both intra-dataset and cross-dataset evaluations in terms of +multiple detection metrics. Moreover, FASTEN has been deployed in real-world +mobile devices for practical 3D mask detection. + +
+
+ comment: 13 pages, 5 figures. Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ ParisLuco3D: A high-quality target dataset for domain generalization of + LiDAR perception + + +
+ LiDAR is a sensor system that supports autonomous driving by gathering +precise geometric information about the scene. Exploiting this information for +perception is interesting as the amount of available data increases. + As the quantitative performance of various perception tasks has improved, the +focus has shifted from source-to-source perception to domain adaptation and +domain generalization for perception. These new goals require access to a large +variety of domains for evaluation. Unfortunately, the various annotation +strategies of data providers complicate the computation of cross-domain +performance based on the available data. + This paper provides a novel dataset, specifically designed for cross-domain +evaluation to make it easier to evaluate the performance of various source +datasets. Alongside the dataset, a flexible online benchmark is provided to +ensure a fair comparison across methods. + +
+
+
+
+
+ + ☆ Dual Defense: Adversarial, Traceable, and Invisible Robust Watermarking + against Face Swapping + + +
+ The malicious applications of deep forgery, represented by face swapping, +have introduced security threats such as misinformation dissemination and +identity fraud. While some research has proposed the use of robust watermarking +methods to trace the copyright of facial images for post-event traceability, +these methods cannot effectively prevent the generation of forgeries at the +source and curb their dissemination. To address this problem, we propose a +novel comprehensive active defense mechanism that combines traceability and +adversariality, called Dual Defense. Dual Defense invisibly embeds a single +robust watermark within the target face to actively respond to sudden cases of +malicious face swapping. It disrupts the output of the face swapping model +while maintaining the integrity of watermark information throughout the entire +dissemination process. This allows for watermark extraction at any stage of +image tracking for traceability. Specifically, we introduce a watermark +embedding network based on original-domain feature impersonation attack. This +network learns robust adversarial features of target facial images and embeds +watermarks, seeking a well-balanced trade-off between watermark invisibility, +adversariality, and traceability through perceptual adversarial encoding +strategies. Extensive experiments demonstrate that Dual Defense achieves +optimal overall defense success rates and exhibits promising universality in +anti-face swapping tasks and dataset generalization ability. It maintains +impressive adversariality and traceability in both original and robust +settings, surpassing current forgery defense methods that possess only one of +these capabilities, including CMUA-Watermark, Anti-Forgery, FakeTagger, or PGD +methods. + +
+
+
+
+
+ + ☆ An Early Evaluation of GPT-4V(ision) + + +
+ In this paper, we evaluate different abilities of GPT-4V including visual +understanding, language understanding, visual puzzle solving, and understanding +of other modalities such as depth, thermal, video, and audio. To estimate +GPT-4V's performance, we manually construct 656 test instances and carefully +evaluate the results of GPT-4V. The highlights of our findings are as follows: +(1) GPT-4V exhibits impressive performance on English visual-centric benchmarks +but fails to recognize simple Chinese texts in the images; (2) GPT-4V shows +inconsistent refusal behavior when answering questions related to sensitive +traits such as gender, race, and age; (3) GPT-4V obtains worse results than +GPT-4 (API) on language understanding tasks including general language +understanding benchmarks and visual commonsense knowledge evaluation +benchmarks; (4) Few-shot prompting can improve GPT-4V's performance on both +visual understanding and language understanding; (5) GPT-4V struggles to find +the nuances between two similar images and solve the easy math picture puzzles; +(6) GPT-4V shows non-trivial performance on the tasks of similar modalities to +image, such as video and thermal. Our experimental results reveal the ability +and limitations of GPT-4V and we hope our paper can provide some insights into +the application and research of GPT-4V. + +
+
+ comment: Technical Report. Data are available at + https://github.com/albertwy/GPT-4V-Evaluation +
+
+
+
+
+ + ☆ Learning Robust Deep Visual Representations from EEG Brain Recordings WACV 2024 + + +
+ Decoding the human brain has been a hallmark of neuroscientists and +Artificial Intelligence researchers alike. Reconstruction of visual images from +brain Electroencephalography (EEG) signals has garnered a lot of interest due +to its applications in brain-computer interfacing. This study proposes a +two-stage method where the first step is to obtain EEG-derived features for +robust learning of deep representations and subsequently utilize the learned +representation for image generation and classification. We demonstrate the +generalizability of our feature extraction pipeline across three different +datasets using deep-learning architectures with supervised and contrastive +learning methods. We have performed the zero-shot EEG classification task to +support the generalizability claim further. We observed that a subject +invariant linearly separable visual representation was learned using EEG data +alone in an unimodal setting that gives better k-means accuracy as compared to +a joint representation learning between EEG and images. Finally, we propose a +novel framework to transform unseen images into the EEG space and reconstruct +them with approximation, showcasing the potential for image reconstruction from +EEG signals. Our proposed image synthesis method from EEG shows 62.9% and +36.13% inception score improvement on the EEGCVPR40 and the Thoughtviz +datasets, which is better than state-of-the-art performance in GAN. + +
+
+ comment: Accepted in WACV 2024 +
+
+
+
+
+ + ☆ Enhancing Document Information Analysis with Multi-Task Pre-training: A + Robust Approach for Information Extraction in Visually-Rich Documents + + +
+ This paper introduces a deep learning model tailored for document information +analysis, emphasizing document classification, entity relation extraction, and +document visual question answering. The proposed model leverages +transformer-based models to encode all the information present in a document +image, including textual, visual, and layout information. The model is +pre-trained and subsequently fine-tuned for various document image analysis +tasks. The proposed model incorporates three additional tasks during the +pre-training phase, including reading order identification of different layout +segments in a document image, layout segments categorization as per PubLayNet, +and generation of the text sequence within a given layout segment (text block). +The model also incorporates a collective pre-training scheme where losses of +all the tasks under consideration, including pre-training and fine-tuning tasks +with all datasets, are considered. Additional encoder and decoder blocks are +added to the RoBERTa network to generate results for all tasks. The proposed +model achieved impressive results across all tasks, with an accuracy of 95.87% +on the RVL-CDIP dataset for document classification, F1 scores of 0.9306, +0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets +respectively for entity relation extraction, and an ANLS score of 0.8468 on the +DocVQA dataset for visual question answering. The results highlight the +effectiveness of the proposed model in understanding and interpreting complex +document layouts and content, making it a promising tool for document analysis +tasks. + +
+
+
+
+
+ + ☆ Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph + prediction 3DV 2024 + + +
+ 3D scene graphs are an emerging 3D scene representation that models both the +objects present in the scene as well as their relationships. However, learning +3D scene graphs is a challenging task because it requires not only object +labels but also relationship annotations, which are very scarce in datasets. +While it is widely accepted that pre-training is an effective approach to +improve model performance in low data regimes, in this paper, we find that +existing pre-training methods are ill-suited for 3D scene graphs. To solve this +issue, we present the first language-based pre-training approach for 3D scene +graphs, whereby we exploit the strong relationship between scene graphs and +language. To this end, we leverage the language encoder of CLIP, a popular +vision-language model, to distill its knowledge into our graph-based network. +We formulate a contrastive pre-training, which aligns text embeddings of +relationships (subject-predicate-object triplets) and predicted 3D graph +features. Our method achieves state-of-the-art results on the main semantic 3D +scene graph benchmark by showing improved effectiveness over pre-training +baselines and outperforming all the existing fully supervised scene graph +prediction methods by a significant margin. Furthermore, since our scene graph +features are language-aligned, we can query the language space of the +features in a zero-shot manner. In this paper, we show an example of utilizing +this property of the features to predict the room type of a scene without +further training. + +
+
+ comment: 3DV 2024. Project page: https://kochsebastian.com/lang3dsg +
+
+
+
+
+ + ☆ On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection NeurIPS 2023 + + +
+ Successful detection of Out-of-Distribution (OoD) data is becoming +increasingly important to ensure safe deployment of neural networks. One of the +main challenges in OoD detection is that neural networks output overconfident +predictions on OoD data, making it difficult to determine the OoD-ness of data solely +based on their predictions. Outlier exposure addresses this issue by +introducing an additional loss that encourages low-confidence predictions on +OoD data during training. While outlier exposure has shown promising potential +in improving OoD detection performance, all previous studies on outlier +exposure have been limited to utilizing visual outliers. Drawing inspiration +from the recent advancements in vision-language pre-training, this paper +ventures into the uncharted territory of textual outlier exposure. First, we +uncover the benefits of using textual outliers by replacing real or virtual +outliers in the image-domain with textual equivalents. Then, we propose various +ways of generating preferable textual outliers. Our extensive experiments +demonstrate that generated textual outliers achieve competitive performance on +large-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical +analyses of textual outliers to provide primary criteria for designing +advantageous textual outliers: near-distribution, descriptiveness, and +inclusion of visual semantics. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Gramian Attention Heads are Strong yet Efficient Vision Learners + + +
+ We introduce a novel architecture design that enhances expressiveness by +incorporating multiple head classifiers (i.e., classification heads) instead of +relying on channel expansion or additional building blocks. Our approach +employs attention-based aggregation, utilizing pairwise feature similarity to +enhance multiple lightweight heads with minimal resource overhead. We compute +the Gramian matrices to reinforce class tokens in an attention layer for each +head. This enables the heads to learn more discriminative representations, +enhancing their aggregation capabilities. Furthermore, we propose a learning +algorithm that encourages heads to complement each other by reducing +correlation for aggregation. Our models eventually surpass state-of-the-art +CNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K and +deliver remarkable performance across various downstream tasks, such as COCO +object instance segmentation, ADE20k semantic segmentation, and fine-grained +visual classification datasets. The effectiveness of our framework is +substantiated by practical experimental results and further underpinned by +generalization error bound. We release the code publicly at: +https://github.com/Lab-LVM/imagenet-models. + +
+
+
+
+
+ + ☆ Show from Tell: Audio-Visual Modelling in Clinical Settings + + +
+ Auditory and visual signals usually present together and correlate with each +other, not only in natural environments but also in clinical settings. However, +the audio-visual modelling in the latter case can be more challenging, due to +the different sources of audio/video signals and the noise (both signal-level +and semantic-level) in auditory signals -- usually speech. In this paper, we +consider audio-visual modelling in a clinical setting, providing a solution to +learn medical representations that benefit various clinical tasks, without +human expert annotation. A simple yet effective multi-modal self-supervised +learning framework is proposed for this purpose. The proposed approach is able +to localise anatomical regions of interest during ultrasound imaging, with only +speech audio as a reference. Experimental evaluations on a large-scale clinical +multi-modal ultrasound video dataset show that the proposed self-supervised +method learns good transferable anatomical representations that boost the +performance of automated downstream clinical tasks, even outperforming +fully-supervised solutions. + +
+
+
+
+
+ + ☆ DualMatch: Robust Semi-Supervised Learning with Dual-Level Interaction ECML + + +
+ Semi-supervised learning provides an expressive framework for exploiting +unlabeled data when labels are insufficient. Previous semi-supervised learning +methods typically match model predictions of different data-augmented views in +a single-level interaction manner, which relies heavily on the quality of +pseudo-labels and makes semi-supervised learning less robust. In this +paper, we propose a novel SSL method called DualMatch, in which the class +prediction jointly invokes feature embedding in a dual-level interaction +manner. DualMatch requires consistent regularizations for data augmentation, +specifically, 1) ensuring that different augmented views are regulated with +consistent class predictions, and 2) ensuring that different data of one class +are regulated with similar feature embeddings. Extensive experiments +demonstrate the effectiveness of DualMatch. In the standard SSL setting, the +proposal achieves a 9% error reduction compared with SOTA methods; even in a more +challenging class-imbalanced setting, it still achieves a 6% error +reduction. Code is available at https://github.com/CWangAI/DualMatch + +
+
+ comment: 14 pages, 8 figures, Accepted by ECMLPKDD 2023 +
+
+
+
+
+ + ☆ Towards Explainability in Monocular Depth Estimation + + +
+ The estimation of depth in two-dimensional images has long been a challenging +and extensively studied subject in computer vision. Recently, significant +progress has been made with the emergence of Deep Learning-based approaches, +which have proven highly successful. This paper focuses on the explainability +of monocular depth estimation methods, in terms of how humans perceive depth. +This preliminary study focuses on one of the most significant visual cues, +the relative size, which is prominent in almost all viewed images. We designed +a specific experiment to mimic the experiments in humans and have tested +state-of-the-art methods to indirectly assess the explainability in the context +defined. In addition, we observed that measuring the accuracy required further +attention and a particular approach is proposed to this end. The results show +that a mean accuracy of around 77% across methods is achieved, with some of the +methods performing markedly better, thus indirectly revealing their +corresponding potential to uncover monocular depth cues, like relative size. + +
+
+
+
+
+ + ☆ ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors NeurIPS 2023 + + +
+ Understanding the behavior of non-human primates is crucial for improving +animal welfare, modeling social behavior, and gaining insights into +distinctively human and phylogenetically shared behaviors. However, the lack of +datasets on non-human primate behavior hinders in-depth exploration of primate +social interactions, posing challenges to research on our closest living +relatives. To address these limitations, we present ChimpACT, a comprehensive +dataset for quantifying the longitudinal behavior and social relations of +chimpanzees within a social group. Spanning from 2015 to 2018, ChimpACT +features videos of a group of over 20 chimpanzees residing at the Leipzig Zoo, +Germany, with a particular focus on documenting the developmental trajectory of +one young male, Azibo. ChimpACT is both comprehensive and challenging, +consisting of 163 videos with a cumulative 160,500 frames, each richly +annotated with detection, identification, pose estimation, and fine-grained +spatiotemporal behavior labels. We benchmark representative methods of three +tracks on ChimpACT: (i) tracking and identification, (ii) pose estimation, and +(iii) spatiotemporal action detection of the chimpanzees. Our experiments +reveal that ChimpACT offers ample opportunities for both devising new methods +and adapting existing ones to solve fundamental computer vision tasks applied +to chimpanzee groups, such as detection, pose estimation, and behavior +analysis, ultimately deepening our comprehension of communication and sociality +in non-human primates. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning + in Language Models NeurIPS 2023 + + +
+ A long-standing goal of AI systems is to perform complex multimodal reasoning +like humans. Recently, large language models (LLMs) have made remarkable +strides in such multi-step reasoning on the language modality solely by +leveraging the chain of thought (CoT) to mimic human thinking. However, the +transfer of these advancements to multimodal contexts introduces heightened +challenges, including but not limited to the impractical need for +labor-intensive annotation and the limitations in terms of flexibility, +generalizability, and explainability. To evoke CoT reasoning in multimodality, +this work first conducts an in-depth analysis of these challenges posed by +multimodality and presents two key insights: "keeping critical thinking" and +"letting everyone do their jobs" in multimodal CoT reasoning. Furthermore, this +study proposes a novel DDCoT prompting that maintains a critical attitude +through negative-space prompting and incorporates multimodality into reasoning +by first dividing the reasoning responsibility of LLMs into reasoning and +recognition and then integrating the visual recognition capability of visual +models into the joint reasoning process. The rationales generated by DDCoT not +only improve the reasoning abilities of both large and small language models in +zero-shot prompting and fine-tuning learning, significantly outperforming +state-of-the-art methods but also exhibit impressive generalizability and +explainability. + +
+
+ comment: 24 pages, 13 figures, to be published in NeurIPS 2023 +
+
+
+
+
+ + ☆ On Pixel-level Performance Assessment in Anomaly Detection + + +
+ Anomaly detection methods have demonstrated remarkable success across various +applications. However, assessing their performance, particularly at the +pixel-level, presents a complex challenge due to the severe imbalance that is +most commonly present between normal and abnormal samples. Commonly adopted +evaluation metrics designed for pixel-level detection may not effectively +capture the nuanced performance variations arising from this class imbalance. +In this paper, we dissect the intricacies of this challenge, underscored by +visual evidence and statistical analysis, which leads us to examine the need for +evaluation metrics that account for the imbalance. We offer insights into more +accurate metrics, using eleven leading contemporary anomaly detection methods +on twenty-one anomaly detection problems. Overall, from this extensive +experimental evaluation, we can conclude that Precision-Recall-based metrics +can better capture relative method performance, making them more suitable for +the task. + +
+
+ comment: 5 pages, 5 figures, 1 table +
+
+
+
+
+ + ☆ An Integrative Paradigm for Enhanced Stroke Prediction: Synergizing + XGBoost and xDeepFM Algorithms + + +
+ Stroke prediction plays a crucial role in preventing and managing this +debilitating condition. In this study, we address the challenge of stroke +prediction using a comprehensive dataset, and propose an ensemble model that +combines the power of XGBoost and xDeepFM algorithms. Our work aims to improve +upon existing stroke prediction models by achieving higher accuracy and +robustness. Through rigorous experimentation, we validate the effectiveness of +our ensemble model using the AUC metric. Through comparing our findings with +those of other models in the field, we gain valuable insights into the merits +and drawbacks of various approaches. This, in turn, contributes significantly +to the progress of machine learning and deep learning techniques specifically +in the domain of stroke prediction. + +
+
+
+
+
+ + ☆ Video Referring Expression Comprehension via Transformer with + Content-conditioned Query + + +
+ Video Referring Expression Comprehension (REC) aims to localize a target +object in videos based on the queried natural language. Recent improvements in +video REC have been made using Transformer-based methods with learnable +queries. However, we contend that this naive query design is not ideal given +the open-world nature of video REC brought by text supervision. With numerous +potential semantic categories, relying on only a few slow-updated queries is +insufficient to characterize them. Our solution to this problem is to create +dynamic queries that are conditioned on both the input video and language to +model the diverse objects referred to. Specifically, we place a fixed number of +learnable bounding boxes throughout the frame and use corresponding region +features to provide prior information. Also, we noticed that current query +features overlook the importance of cross-modal alignment. To address this, we +align specific phrases in the sentence with semantically relevant visual areas, +annotating them in existing video datasets (VID-Sentence and VidSTG). By +incorporating these two designs, our proposed model (called ConFormer) +outperforms other models on widely benchmarked datasets. For example, in the +testing split of VID-Sentence dataset, ConFormer achieves 8.75% absolute +improvement on Accu.@0.6 compared to the previous state-of-the-art model. + +
+
+ comment: Accepted to ACM International Conference on Multimedia Workshop (ACM + MM), 2023. arXiv admin note: substantial text overlap with arXiv:2210.02953 +
+
+
+
+
+ + ☆ Fuse Your Latents: Video Editing with Multi-source Latent Diffusion + Models + + +
+ Latent Diffusion Models (LDMs) are renowned for their powerful capabilities +in image and video synthesis. Yet, video editing methods suffer from +insufficient pre-training data or video-by-video re-training cost. In +addressing this gap, we propose FLDM (Fused Latent Diffusion Model), a +training-free framework to achieve text-guided video editing by applying +off-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses +latents from an image LDM and a video LDM during the denoising process. In +this way, temporal consistency can be kept with the video LDM while the high fidelity +of the image LDM can also be exploited. Meanwhile, FLDM possesses high +flexibility since both the image LDM and the video LDM can be replaced, so advanced +image editing methods such as InstructPix2Pix and ControlNet can be exploited. +To the best of our knowledge, FLDM is the first method to adapt off-the-shelf +image editing methods into video LDMs for video editing. Extensive quantitative +and qualitative experiments demonstrate that FLDM can improve the textual +alignment and temporal consistency of edited videos. + +
+
+
+
+
+ + ☆ Winning Prize Comes from Losing Tickets: Improve Invariant Learning by + Exploring Variant Parameters for Out-of-Distribution Generalization + + +
+ Out-of-Distribution (OOD) Generalization aims to learn robust models that +generalize well to various environments without fitting to +distribution-specific features. Recent studies based on Lottery Ticket +Hypothesis (LTH) address this problem by minimizing the learning target to find +some of the parameters that are critical to the task. However, in OOD problems, +such solutions are suboptimal as the learning task contains severe distribution +noises, which can mislead the optimization process. Therefore, apart from +finding the task-related parameters (i.e., invariant parameters), we propose +Exploring Variant parameters for Invariant Learning (EVIL) which also leverages +the distribution knowledge to find the parameters that are sensitive to +distribution shift (i.e., variant parameters). Once the variant parameters are +left out of invariant learning, a robust subnetwork that is resistant to +distribution shift can be found. Additionally, the parameters that are +relatively stable across distributions can be considered invariant ones to +improve invariant learning. By fully exploring both variant and invariant +parameters, our EVIL can effectively identify a robust subnetwork to improve +OOD generalization. In extensive experiments on integrated testbed: DomainBed, +EVIL can effectively and efficiently enhance many popular methods, such as ERM, +IRM, SAM, etc. + +
+
+ comment: 27 pages, 9 figures +
+
+
+
+
+ + ☆ MVFAN: Multi-View Feature Assisted Network for 4D Radar Object Detection ICONIP 2023 + + +
+ 4D radar is recognized for its resilience and cost-effectiveness under +adverse weather conditions, thus playing a pivotal role in autonomous driving. +While cameras and LiDAR are typically the primary sensors used in perception +modules for autonomous vehicles, radar serves as a valuable supplementary +sensor. Unlike LiDAR and cameras, radar remains unimpaired by harsh weather +conditions, thereby offering a dependable alternative in challenging +environments. Developing radar-based 3D object detection not only augments the +competency of autonomous vehicles but also provides economic benefits. In +response, we propose the Multi-View Feature Assisted Network (MVFAN), +an end-to-end, anchor-free, and single-stage framework for 4D-radar-based 3D +object detection for autonomous vehicles. We tackle the issue of insufficient +feature utilization by introducing a novel Position Map Generation module to +enhance feature learning by reweighing foreground and background points, and +their features, considering the irregular distribution of radar point clouds. +Additionally, we propose a pioneering backbone, the Radar Feature Assisted +backbone, explicitly crafted to fully exploit the valuable Doppler velocity and +reflectivity data provided by the 4D radar sensor. Comprehensive experiments +and ablation studies carried out on Astyx and VoD datasets attest to the +efficacy of our framework. The incorporation of Doppler velocity and RCS +reflectivity dramatically improves the detection performance for small moving +objects such as pedestrians and cyclists. Consequently, our approach culminates +in a highly optimized 4D-radar-based 3D object detection capability for +autonomous driving systems, setting a new standard in the field. + +
+
+ comment: 19 Pages, 7 figures, Accepted by ICONIP 2023 +
+
+
+
+
+ + ☆ Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles + + +
+ In the dynamic realm of deepfake detection, this work presents an innovative +approach to validate video content. The methodology blends advanced +2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is +uniquely tailored to capture spatiotemporal features via sliding filters, +extending through both spatial and temporal dimensions. This configuration +enables nuanced pattern recognition in pixel arrangement and temporal evolution +across frames. Simultaneously, the 2D model leverages EfficientNet +architecture, harnessing auto-scaling in Convolutional Neural Networks. +Notably, this ensemble integrates Voting Ensembles and Adaptive Weighted +Ensembling. Strategic prioritization of the 3-dimensional model's output +capitalizes on its exceptional spatio-temporal feature extraction. Experimental +validation underscores the effectiveness of this strategy, showcasing its +potential in countering deepfake generation's deceptive practices. + +
+
+ comment: 6 pages, 2 figures +
+
+
+
+
+ + ☆ Frequency-Aware Transformer for Learned Image Compression + + +
+ Learned image compression (LIC) has gained traction as an effective solution +for image storage and transmission in recent years. However, existing LIC +methods are redundant in latent representation due to limitations in capturing +anisotropic frequency components and preserving directional details. To +overcome these challenges, we propose a novel frequency-aware transformer (FAT) +block that for the first time achieves multiscale directional analysis for +LIC. The FAT block comprises frequency-decomposition window attention (FDWA) +modules to capture multiscale and directional frequency components of natural +images. Additionally, we introduce a frequency-modulation feed-forward network +(FMFFN) to adaptively modulate different frequency components, improving +rate-distortion performance. Furthermore, we present a transformer-based +channel-wise autoregressive (T-CA) model that effectively exploits channel +dependencies. Experiments show that our method achieves state-of-the-art +rate-distortion performance compared to existing LIC methods, and evidently +outperforms the latest standardized codec VTM-12.1 by 14.5%, 15.1%, 13.0% in +BD-rate on the Kodak, Tecnick, and CLIC datasets. + +
+
+
+
+
+ + ☆ Open-NeRF: Towards Open Vocabulary NeRF Decomposition WACV 2024 + + +
+ In this paper, we address the challenge of decomposing Neural Radiance Fields +(NeRF) into objects from an open vocabulary, a critical task for object +manipulation in 3D reconstruction and view synthesis. Current techniques for +NeRF decomposition involve a trade-off between the flexibility of processing +open-vocabulary queries and the accuracy of 3D segmentation. We present +Open-vocabulary Embedded Neural Radiance Fields (Open-NeRF), which leverages +large-scale, off-the-shelf segmentation models like the Segment Anything Model +(SAM) and introduces an integrate-and-distill paradigm with hierarchical +embeddings to achieve both the flexibility of open-vocabulary querying and 3D +segmentation accuracy. Open-NeRF first utilizes large-scale foundation models +to generate hierarchical 2D mask proposals from varying viewpoints. These +proposals are then aligned via tracking approaches, integrated within the 3D +space, and subsequently distilled into the 3D field. This process ensures +consistent recognition and granularity of objects from different viewpoints, +even in challenging scenarios involving occlusion and indistinct features. Our +experimental results show that the proposed Open-NeRF outperforms +state-of-the-art methods such as LERF and FFD in +open-vocabulary scenarios. Open-NeRF offers a promising solution to NeRF +decomposition, guided by open-vocabulary queries, enabling novel applications +in robotics and vision-language interaction in open-world 3D scenes. + +
+
+ comment: Accepted by WACV 2024 +
+
+
+
+
+ + ☆ Towards Large-scale Masked Face Recognition ICCV2021 + + +
+ During the COVID-19 coronavirus epidemic, almost everyone is wearing masks, +which poses a huge challenge for deep learning-based face recognition +algorithms. In this paper, we will present our championship solutions +in ICCV MFR WebFace260M and InsightFace unconstrained tracks. We will focus on +four challenges in large-scale masked face recognition, i.e., super-large scale +training, data noise handling, masked and non-masked face recognition accuracy +balancing, and how to design inference-friendly model architecture. We hope +that the discussion on these four aspects can guide future research towards +more robust masked face recognition systems. + +
+
+ comment: the top1 solution for ICCV2021-MFR challenge +
+
+
+
+
+ + ☆ DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object + Detection + + +
+ Denoising diffusion models show remarkable performances in generative tasks, +and their potential applications in perception tasks are gaining interest. In +this paper, we introduce a novel framework named DiffRef3D which adopts the +diffusion process on 3D object detection with point clouds for the first time. +Specifically, we formulate the proposal refinement stage of two-stage 3D object +detectors as a conditional diffusion process. During training, DiffRef3D +gradually adds noise to the residuals between proposals and target objects, +then applies the noisy residuals to proposals to generate hypotheses. The +refinement module utilizes these hypotheses to denoise the noisy residuals and +generate accurate box predictions. In the inference phase, DiffRef3D generates +initial hypotheses by sampling noise from a Gaussian distribution as residuals +and refines the hypotheses through iterative steps. DiffRef3D is a versatile +proposal refinement framework that consistently improves the performance of +existing 3D object detection models. We demonstrate the significance of +DiffRef3D through extensive experiments on the KITTI benchmark. Code will be +available. + +
+
+
+
+
+ + ☆ Dolfin: Diffusion Layout Transformers without Autoencoder + + +
+ In this paper, we introduce a novel generative model, Diffusion Layout +Transformers without Autoencoder (Dolfin), which significantly improves the +modeling capability with reduced complexity compared to existing methods. +Dolfin employs a Transformer-based diffusion process to model layout +generation. In addition to an efficient bi-directional (non-causal joint) +sequence representation, we further propose an autoregressive diffusion model +(Dolfin-AR) that is especially adept at capturing rich semantic correlations +for the neighboring objects, such as alignment, size, and overlap. When +evaluated against standard generative layout benchmarks, Dolfin notably +improves performance across various metrics (fid, alignment, overlap, MaxIoU +and DocSim scores), enhancing transparency and interoperability in the process. +Moreover, Dolfin's applications extend beyond layout generation, making it +suitable for modeling geometric structures, such as line segments. Our +experiments present both qualitative and quantitative results to demonstrate +the advantages of Dolfin. + +
+
+
+
+
+ + ☆ Instance-wise Linearization of Neural Network for Model Interpretation + + +
+ Neural networks have achieved remarkable successes in many scientific fields. +However, the interpretability of neural network models is still a major +bottleneck to deploying such techniques in daily life. The challenge stems +from the non-linear behavior of the neural network, which raises the critical +question of how a model uses input features to make a decision. The classical +approach to address this challenge is feature attribution, which assigns an +importance score to each input feature and reveals its importance to the current +prediction. However, current feature attribution approaches often indicate the +importance of each input feature without detailing how the features are actually +processed by a model internally. These attribution approaches often raise the +concern of whether they highlight the correct features for a model prediction. + For a neural network model, the non-linear behavior is often caused by +non-linear activation units of a model. However, the computation behavior of a +prediction from a neural network model is locally linear, because one +prediction has only one activation pattern. Based on this observation, we propose +an instance-wise linearization approach that reformulates the forward computation +process of a neural network prediction. This approach reformulates different +layers of convolutional neural networks into linear matrix multiplications. +Aggregating all layers' computation, the complex convolutional neural network +operations behind a prediction can be described as a linear matrix multiplication $F(x) = W +\cdot x + b$. This equation not only provides a feature attribution map +that highlights the importance of the input features but also tells exactly how each +input feature contributes to a prediction. Furthermore, we discuss the +application of this technique in both supervised classification and unsupervised +neural network learning for parametric t-SNE dimension reduction. + +
+
+
+
+
+ + ☆ MotionAGFormer: Enhancing 3D Human Pose Estimation with a + Transformer-GCNFormer Network + + +
+ Recent transformer-based approaches have demonstrated excellent performance +in 3D human pose estimation. However, they have a holistic view and, by encoding +global relationships between all the joints, they do not capture the local +dependencies precisely. In this paper, we present a novel Attention-GCNFormer +(AGFormer) block that divides the number of channels by using two parallel +transformer and GCNFormer streams. Our proposed GCNFormer module exploits the +local relationship between adjacent joints, outputting a new representation +that is complementary to the transformer output. By fusing these two +representations in an adaptive way, AGFormer exhibits the ability to better +learn the underlying 3D structure. By stacking multiple AGFormer blocks, we +propose MotionAGFormer in four different variants, which can be chosen based on +the speed-accuracy trade-off. We evaluate our model on two popular benchmark +datasets: Human3.6M and MPI-INF-3DHP. MotionAGFormer-B achieves +state-of-the-art results, with P1 errors of 38.4mm and 16.2mm, respectively. +Remarkably, it uses a quarter of the parameters and is three times more +computationally efficient than the previous leading model on the Human3.6M dataset. +Code and models are available at https://github.com/TaatiTeam/MotionAGFormer. + +
+
+
+
+
+ + ☆ TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer + + +
+ Estimating the 6D object pose is an essential task in many applications. Due +to the lack of depth information, existing RGB-based methods are sensitive to +occlusion and illumination changes. How to extract and utilize the geometry +features in depth information is crucial to achieve accurate predictions. To +this end, we propose TransPose, a novel 6D pose framework that exploits a +Transformer Encoder with a geometry-aware module to develop better learning of +point cloud feature representations. Specifically, we first uniformly sample +the point cloud and extract local geometry features with the designed local feature +extractor based on a graph convolution network. To improve robustness to +occlusion, we adopt a Transformer to perform the exchange of global information, +so that each local feature contains global information. Finally, we introduce +a geometry-aware module in the Transformer Encoder, which forms an effective +constraint for point cloud feature learning and makes the global information +exchange more tightly coupled with point cloud tasks. Extensive experiments +indicate the effectiveness of TransPose: our pose estimation pipeline achieves +competitive results on three benchmark datasets. + +
+
+ comment: 10 pages, 5 figures, IEEE Journal +
+
+
+
+
+ + ☆ Deep Learning for Plant Identification and Disease Classification from + Leaf Images: Multi-prediction Approaches + + +
+ Deep learning plays an important role in modern agriculture, especially in +plant pathology using leaf images where convolutional neural networks (CNN) are +attracting a lot of attention. While numerous reviews have explored the +applications of deep learning within this research domain, there remains a +notable absence of an empirical study to offer insightful comparisons due to +the employment of varied datasets in the evaluation. Furthermore, a majority of +these approaches tend to address the problem as a singular prediction task, +overlooking the multifaceted nature of predicting various aspects of plant +species and disease types. Lastly, there is an evident need for a more profound +consideration of the semantic relationships that underlie plant species and +disease types. In this paper, we start our study by surveying current deep +learning approaches for plant identification and disease classification. We +categorise the approaches into multi-model, multi-label, multi-output, and +multi-task, in which different backbone CNNs can be employed. Furthermore, +based on the survey of existing approaches in plant pathology and the study of +available approaches in machine learning, we propose a new model named +Generalised Stacking Multi-output CNN (GSMo-CNN). To investigate the +effectiveness of different backbone CNNs and learning approaches, we conduct an +intensive experiment on three benchmark datasets Plant Village, Plant Leaves, +and PlantDoc. The experimental results demonstrate that InceptionV3 can be a +good choice for a backbone CNN as its performance is better than AlexNet, +VGG16, ResNet101, EfficientNet, MobileNet, and a custom CNN developed by us. +Interestingly, empirical results support the hypothesis that using a single +model can be comparable or better than using two models. Finally, we show that +the proposed GSMo-CNN achieves state-of-the-art performance on three benchmark +datasets. + +
+
+ comment: Jianping and Son are joint first authors (equal contribution) +
+
+
+
+
+ + ☆ SCB-ST-Dataset4: Extending the Spatio-Temporal Behavior Dataset in + Student Classroom Scenarios Through Image Dataset Method + + +
+ Using deep learning methods to detect students' classroom behavior +automatically is a promising approach for analyzing their class performance and +improving teaching effectiveness. However, the lack of publicly available +spatio-temporal datasets on student behavior, as well as the high cost of +manually labeling such datasets, pose significant challenges for researchers in +this field. To address this issue, we proposed a method for extending the +spatio-temporal behavior dataset in Student Classroom Scenarios +(SCB-ST-Dataset4) through an image dataset. Our SCB-ST-Dataset4 comprises 754094 +images with 25670 labels, focusing on 3 behaviors: hand-raising, reading, and +writing. Our proposed method can rapidly generate spatio-temporal behavioral +datasets without requiring annotation. Furthermore, we proposed a Behavior +Similarity Index (BSI) to explore the similarity of behaviors. We evaluated the +dataset using the YOLOv5, YOLOv7, YOLOv8, and SlowFast algorithms, achieving a +mean average precision (mAP) of up to 82.3%. The experiment further +demonstrates the effectiveness of our method. This dataset provides a robust +foundation for future research in student behavior detection, potentially +contributing to advancements in this field. The SCB-ST-Dataset4 is available +for download at: https://github.com/Whiffe/SCB-dataset. + +
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2310.02522; + text overlap with arXiv:2306.03318 +
+
+
+
+
+ + ☆ UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception + + +
+ Tremendous variations coupled with large degrees of freedom in UAV-based +imaging conditions lead to a significant lack of data in adequately learning +UAV-based perception models. Using various synthetic renderers in conjunction +with perception models is prevalent to create synthetic data to augment the +learning in the ground-based imaging domain. However, severe challenges in the +austere UAV-based domain require distinctive solutions to image synthesis for +data augmentation. In this work, we leverage recent advancements in neural +rendering to improve static and dynamic novel-view UAV-based image synthesis, +especially from high altitudes, capturing salient scene attributes. Finally, we +demonstrate that a considerable performance boost is achieved when a state-of-the-art +detection model is optimized primarily on hybrid sets of real and synthetic +data instead of the real or synthetic data separately. + +
+
+ comment: Video Link: https://www.youtube.com/watch?v=ucPzbPLqqpI +
+
+
+
+
+ + ☆ Exploring Question Decomposition for Zero-Shot VQA NeurIPS 2023 + + +
+ Visual question answering (VQA) has traditionally been treated as a +single-step task where each question receives the same amount of effort, unlike +natural human question-answering strategies. We explore a question +decomposition strategy for VQA to overcome this limitation. We probe the +ability of recently developed large vision-language models to use human-written +decompositions and produce their own decompositions of visual questions, +finding they are capable of learning both tasks from demonstrations alone. +However, we show that naive application of model-written decompositions can +hurt performance. We introduce a model-driven selective decomposition approach +for second-guessing predictions and correcting errors, and validate its +effectiveness on eight VQA tasks across three domains, showing consistent +improvements in accuracy, including improvements of >20% on medical VQA +datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA +reformulation of the challenging Winoground task. Project Site: +https://zaidkhan.me/decomposition-0shot-vqa/ + +
+
+ comment: NeurIPS 2023 Camera Ready +
+
+
+
+
+ + ☆ StochGradAdam: Accelerating Neural Networks Training with Stochastic + Gradient Sampling + + +
+ In the rapidly advancing domain of deep learning optimization, this paper +unveils the StochGradAdam optimizer, a novel adaptation of the well-regarded +Adam algorithm. Central to StochGradAdam is its gradient sampling technique. +This method not only ensures stable convergence but also leverages the +advantages of selective gradient consideration, fostering robust training by +potentially mitigating the effects of noisy or outlier data and enhancing the +exploration of the loss landscape for more dependable convergence. In both +image classification and segmentation tasks, StochGradAdam has demonstrated +superior performance compared to the traditional Adam optimizer. By judiciously +sampling a subset of gradients at each iteration, the optimizer is optimized +for managing intricate models. The paper provides a comprehensive exploration +of StochGradAdam's methodology, from its mathematical foundations to bias +correction strategies, heralding a promising advancement in deep learning +training techniques. + +
+
+
+
+
+ + ☆ Trust, but Verify: Robust Image Segmentation using Deep Learning + + +
+ We describe a method for verifying the output of a deep neural network for +medical image segmentation that is robust to several classes of random as well +as worst-case perturbations, i.e., adversarial attacks. This method is based on a +general approach recently developed by the authors called "Trust, but Verify" +wherein an auxiliary verification network produces predictions about certain +masked features in the input image using the segmentation as an input. A +well-designed auxiliary network will produce high-quality predictions when the +input segmentations are accurate, but will produce low-quality predictions when +the segmentations are incorrect. Checking the predictions of such a network +with the original image allows us to detect bad segmentations. However, to +ensure the verification method is truly robust, we need a method for checking +the quality of the predictions that does not itself rely on a black-box neural +network. Indeed, we show that previous methods for segmentation evaluation that +do use deep neural regression networks are vulnerable to false negatives, i.e., +they can inaccurately label bad segmentations as good. We describe the design of a +verification network that avoids such vulnerability and present results to +demonstrate its robustness compared to previous methods. + +
+
+ comment: 5 Pages, 8 Figures, conference +
+
+
+
+
+ + ☆ An Efficient Deep Learning-based approach for Recognizing Agricultural + Pests in the Wild + + +
+ One of the biggest challenges that farmers face is fighting insect +pests during agricultural production. Timely preventive measures can address the problem and +avoid economic losses. This requires +identifying insect pests in an easy and effective manner. Most insect +species resemble one another. Without proper help from +agricultural academics, it is very challenging for farmers to identify +crop pests accurately. To address this issue, we conducted extensive +experiments with different methods to find the best one. +This paper presents a detailed overview of the experiments, conducted mainly on +a robust dataset named IP102, including transfer learning with fine-tuning, +attention mechanisms, and a custom architecture. Some examples from another dataset, +D0, are also shown to demonstrate the robustness of the techniques we tested. + +
+
+
+
+
+ + ☆ Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo + Label Self-Refinement WACV 2024 + + +
+ Deep learning-based solutions for semantic segmentation suffer from +significant performance degradation when tested on data with different +characteristics than what was used during the training. Adapting the models +using annotated data from the new domain is not always practical. Unsupervised +Domain Adaptation (UDA) approaches are crucial in deploying these models in the +actual operating conditions. Recent state-of-the-art (SOTA) UDA methods employ +a teacher-student self-training approach, where a teacher model is used to +generate pseudo-labels for the new data which in turn guide the training +process of the student model. Though this approach has seen a lot of success, +it suffers from the issue of noisy pseudo-labels being propagated in the +training process. To address this issue, we propose an auxiliary pseudo-label +refinement network (PRN) for online refining of the pseudo labels and also +localizing the pixels whose predicted labels are likely to be noisy. Being able +to improve the quality of pseudo labels and select highly reliable ones, PRN +helps self-training of segmentation models to be robust against pseudo label +noise propagation during different stages of adaptation. We evaluate our +approach on benchmark datasets with three different domain shifts, and our +approach consistently performs significantly better than the previous +state-of-the-art methods. + +
+
+ comment: WACV 2024 +
+
+
+
+
+ + ☆ The Significance of Machine Learning in Clinical Disease Diagnosis: A + Review + + +
+ The global need for effective disease diagnosis remains substantial, given +the complexities of various disease mechanisms and diverse patient symptoms. To +tackle these challenges, researchers, physicians, and patients are turning to +machine learning (ML), an artificial intelligence (AI) discipline, to develop +solutions. By leveraging sophisticated ML and AI methods, healthcare +stakeholders gain enhanced diagnostic and treatment capabilities. However, +there is a scarcity of research focused on ML algorithms for enhancing the +accuracy and computational efficiency. This research investigates the capacity +of machine learning algorithms to improve the transmission of heart rate data +in time series healthcare metrics, concentrating particularly on optimizing +accuracy and efficiency. By exploring various ML algorithms used in healthcare +applications, the review presents the latest trends and approaches in ML-based +disease diagnosis (MLBDD). The factors under consideration include the +algorithm utilized, the types of diseases targeted, the data types employed, +the applications, and the evaluation metrics. This review aims to shed light on +the prospects of ML in healthcare, particularly in disease diagnosis. By +analyzing the current literature, the study provides insights into +state-of-the-art methodologies and their performance metrics. + +
+
+ comment: 8 pages +
+
+
+
+
+ + ♻ ☆ Multispectral Imaging for Differential Face Morphing Attack Detection: A + Preliminary Study WACV + + +
+ Face morphing attack detection is emerging as an increasingly challenging +problem owing to advancements in high-quality and realistic morphing attack +generation. Reliable detection of morphing attacks is essential because these +attacks target border control applications. This paper presents a +multispectral framework for differential morphing-attack detection (D-MAD). The +D-MAD methods are based on using two facial images that are captured from the +ePassport (also called the reference image) and the trusted device (for +example, Automatic Border Control (ABC) gates) to detect whether the face image +presented in the ePassport is morphed. The proposed multispectral D-MAD framework +introduces a multispectral image captured as a trusted capture to acquire seven +different spectral bands to detect morphing attacks. Extensive experiments were +conducted on the newly created Multispectral Morphed Datasets (MSMD) with 143 +unique data subjects that were captured using both visible and multispectral +cameras in multiple sessions. The results indicate the superior performance of +the proposed multispectral framework compared to visible images. + +
+
+ comment: Accepted in IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV), 2024 +
+
+
+
+
+ + ♻ ☆ ReDi: Efficient Learning-Free Diffusion Inference via Trajectory + Retrieval ICML 2023 + + +
+ Diffusion models show promising generation capability for a variety of data. +Despite their high generation quality, the inference for diffusion models is +still time-consuming due to the numerous sampling iterations required. To +accelerate the inference, we propose ReDi, a simple yet learning-free +Retrieval-based Diffusion sampling framework. From a precomputed knowledge +base, ReDi retrieves a trajectory similar to the partially generated trajectory +at an early stage of generation, skips a large portion of intermediate steps, +and continues sampling from a later step in the retrieved trajectory. We +theoretically prove that the generation performance of ReDi is guaranteed. Our +experiments demonstrate that ReDi improves the model inference efficiency by 2x +speedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain +image generation such as image stylization. + +
+
+ comment: ICML 2023 +
+
+
+
+
+ + ♻ ☆ Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark NeurIPS 2023 + + +
+ We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering +Benchmark. Recent advances in inverse rendering have enabled a wide range of +real-world applications in 3D content generation, moving rapidly from research +and commercial use cases to consumer devices. While the results continue to +improve, there is no real-world benchmark that can quantitatively assess and +compare the performance of various inverse rendering methods. Existing +real-world datasets typically only consist of the shape and multi-view images +of objects, which are not sufficient for evaluating the quality of material +recovery and object relighting. Methods capable of recovering material and +lighting often resort to synthetic data for quantitative evaluation, which on +the other hand does not guarantee generalization to complex real-world +environments. We introduce a new dataset of real-world objects captured under a +variety of natural scenes with ground-truth 3D scans, multi-view images, and +environment lighting. Using this dataset, we establish the first comprehensive +real-world evaluation benchmark for object inverse rendering tasks from +in-the-wild scenes, and compare the performance of various existing methods. + +
+
+ comment: NeurIPS 2023 Datasets and Benchmarks Track. The first two authors + contributed equally to this work. Project page: + https://stanfordorb.github.io/ +
+
+
+
+
+ + ♻ ☆ Neural Foundations of Mental Simulation: Future Prediction of Latent + Representations on Dynamic Scenes NeurIPS 2023 + + +
+ Humans and animals have a rich and flexible understanding of the physical +world, which enables them to infer the underlying dynamical trajectories of +objects and events, plausible future states, and use that to plan and +anticipate the consequences of actions. However, the neural mechanisms +underlying these computations are unclear. We combine a goal-driven modeling +approach with dense neurophysiological data and high-throughput human +behavioral readouts to directly impinge on this question. Specifically, we +construct and evaluate several classes of sensory-cognitive networks to predict +the future state of rich, ethologically-relevant environments, ranging from +self-supervised end-to-end models with pixel-wise or object-centric objectives, +to models that future predict in the latent space of purely static image-based +or dynamic video-based pretrained foundation models. We find strong +differentiation across these model classes in their ability to predict neural +and behavioral data both within and across diverse environments. In particular, +we find that neural responses are currently best predicted by models trained to +predict the future state of their environment in the latent space of pretrained +foundation models optimized for dynamic scenes in a self-supervised manner. +Notably, models that future predict in the latent space of video foundation +models that are optimized to support a diverse range of sensorimotor tasks, +reasonably match both human behavioral error patterns and neural dynamics +across all environmental scenarios that we were able to test. Overall, these +findings suggest that the neural mechanisms and behaviors of primate mental +simulation are thus far most consistent with being optimized to future predict +on dynamic, reusable visual representations that are useful for Embodied AI +more generally. + +
+
+ comment: 20 pages, 10 figures, NeurIPS 2023 Camera Ready Version (spotlight) +
+
+
+
+
+ + ♻ ☆ Improving Robustness and Reliability in Medical Image Classification + with Latent-Guided Diffusion and Nested-Ensembles + + +
+ While deep learning models have achieved remarkable success across a range of +medical image analysis tasks, deployment of these models in real clinical +contexts requires that they be robust to variability in the acquired images. +While many methods apply predefined transformations to augment the training +data to enhance test-time robustness, these transformations may not ensure the +model's robustness to the diverse variability seen in patient images. In this +paper, we introduce a novel three-stage approach based on transformers coupled +with conditional diffusion models, with the goal of improving model robustness +to the kinds of imaging variability commonly encountered in practice without +the need for pre-determined data augmentation strategies. To this end, multiple +image encoders first learn hierarchical feature representations to build +discriminative latent spaces. Next, a reverse diffusion process, guided by the +latent code, acts on an informative prior and proposes prediction candidates in +a generative manner. Finally, several prediction candidates are aggregated in a +bi-level aggregation protocol to produce the final output. Through extensive +experiments on medical imaging benchmark datasets, we show that our method +improves upon state-of-the-art methods in terms of robustness and confidence +calibration. Additionally, we introduce a strategy to quantify the prediction +uncertainty at the instance level, increasing their trustworthiness to +clinicians using them in clinical practice. + +
+
+ comment: 13 pages, 6 figures, 7 tables +
+
+
+
+
+ + ♻ ☆ S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist + Captions NeurIPS 2023 + + +
+ Vision-language models, such as contrastive language-image pre-training +(CLIP), have demonstrated impressive results in natural image domains. However, +these models often struggle when applied to specialized domains like remote +sensing, and adapting to such domains is challenging due to the limited number +of image-text pairs available for training. To address this, we propose S-CLIP, +a semi-supervised learning method for training CLIP that utilizes additional +unpaired images. S-CLIP employs two pseudo-labeling strategies specifically +designed for contrastive learning and the language modality. The caption-level +pseudo-label is given by a combination of captions of paired images, obtained +by solving an optimal transport problem between unpaired and paired images. The +keyword-level pseudo-label is given by a keyword in the caption of the nearest +paired image, trained through partial label learning that assumes a candidate +set of labels for supervision instead of the exact one. By combining these +objectives, S-CLIP significantly enhances the training of CLIP using only a few +image-text pairs, as demonstrated in various specialist domains, including +remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP +improves CLIP by 10% for zero-shot classification and 4% for image-text +retrieval on the remote sensing benchmark, matching the performance of +supervised CLIP while using three times fewer image-text pairs. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Land-cover change detection using paired OpenStreetMap data and optical + high-resolution imagery via object-guided Transformer + + +
+ Optical high-resolution imagery and OpenStreetMap (OSM) data are two +important data sources for land-cover change detection. Previous studies on +these two data sources focus on utilizing the information in OSM data to aid +the change detection on multi-temporal optical high-resolution images. This +paper pioneers the direct detection of land-cover changes utilizing paired OSM +data and optical imagery, thereby broadening the horizons of change detection +tasks to encompass more dynamic earth observations. To this end, we propose an +object-guided Transformer (ObjFormer) architecture by naturally combining the +prevalent object-based image analysis (OBIA) technique with the advanced vision +Transformer architecture. The introduction of OBIA can significantly reduce the +computational overhead and memory burden in the self-attention module. +Specifically, the proposed ObjFormer has a hierarchical pseudo-siamese encoder +consisting of object-guided self-attention modules that extract representative +features of different levels from OSM data and optical images; a decoder +consisting of object-guided cross-attention modules can progressively recover +the land-cover changes from the extracted heterogeneous features. In addition +to the basic supervised binary change detection task, this paper introduces a new +semi-supervised semantic change detection task that does not require any +manually annotated land-cover labels of optical images to train semantic change +detectors. Two lightweight semantic decoders are added to ObjFormer to +accomplish this task efficiently. A converse cross-entropy loss is designed to +fully utilize the negative samples, thereby contributing to a large +performance improvement in this task. The first large-scale benchmark dataset +containing 1,287 map-image pairs (1024 $\times$ 1024 pixels for each sample) +covering 40 regions on six continents ...(see the manuscript for the full +abstract) + +
+
+
+
+
+ + ♻ ☆ VMAF Re-implementation on PyTorch: Some Experimental Results + + +
+ Based on the standard VMAF implementation, we propose an implementation of +VMAF using the PyTorch framework. Comparisons of this implementation with the +standard one (libvmaf) show a discrepancy of $\lesssim 10^{-2}$ in VMAF units. We +investigate gradient computation when using VMAF as an objective function and +demonstrate that training with this function does not result in ill-behaved +gradients. + +
+
+ comment: 4 pages +
+
+
+
+
+ + ♻ ☆ Interpretable Alzheimer's Disease Classification Via a Contrastive + Diffusion Autoencoder + + +
+ In visual object classification, humans often justify their choices by +comparing objects to prototypical examples within that class. We may therefore +increase the interpretability of deep learning models by imbuing them with a +similar style of reasoning. In this work, we apply this principle by +classifying Alzheimer's Disease based on the similarity of images to training +examples within the latent space. We use a contrastive loss combined with a +diffusion autoencoder backbone, to produce a semantically meaningful latent +space, such that neighbouring latents have similar image-level features. We +achieve a classification accuracy comparable to black box approaches on a +dataset of 2D MRI images, whilst producing human interpretable model +explanations. Therefore, this work stands as a contribution to the pertinent +development of accurate and interpretable deep learning within medical imaging. + +
+
+
+
+
+ + ♻ ☆ Multiscale Superpixel Structured Difference Graph Convolutional Network + for VL Representation + + +
+ Within the multimodal field, the key to integrating vision and language lies +in establishing a good alignment strategy. Recently, benefiting from the +success of self-supervised learning, significant progress has been made in +multimodal semantic representation based on pre-trained models for vision and +language. However, there is still room for improvement in visual semantic +representation. The lack of spatial semantic coherence and vulnerability to +noise makes it challenging for current pixel or patch-based methods to +accurately extract complex scene boundaries. To this end, this paper develops +superpixel as a comprehensive compact representation of learnable image data, +which effectively reduces the number of visual primitives for subsequent +processing by clustering perceptually similar pixels. To mine more precise +topological relations, we propose a Multiscale Difference Graph Convolutional +Network (MDGCN). It parses the entire image as a fine-to-coarse hierarchical +structure of constituent visual patterns, and captures multiscale features by +progressively merging adjacent superpixels as graph nodes. Moreover, we predict +the differences between adjacent nodes through the graph structure, +facilitating key information aggregation of graph nodes to reason actual +semantic relations. Afterward, we design a multi-level fusion rule in a +bottom-up manner to avoid understanding deviation by learning complementary +spatial information at different regional scales. Our proposed method can be +well applied to multiple downstream task learning. Extensive experiments +demonstrate that our method is competitive with other state-of-the-art methods +in visual reasoning. Our code will be released upon publication. + +
+
+
+
+
+ + ♻ ☆ A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect + Dataset + + +
+ In an effort to catalog insect biodiversity, we propose a new large dataset +of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is +taxonomically classified by an expert, and also has associated genetic +information including raw nucleotide barcode sequences and assigned barcode +index numbers, which are genetically-based proxies for species classification. +This paper presents a curated million-image dataset, primarily to train +computer-vision models capable of providing image-based taxonomic assessment; +however, the dataset also presents compelling characteristics, the study of +which would be of interest to the broader machine learning community. Driven by +the biological nature inherent to the dataset, a characteristic long-tailed +class-imbalance distribution is exhibited. Furthermore, taxonomic labelling is +a hierarchical classification scheme, presenting a highly fine-grained +classification problem at lower levels. Beyond spurring interest in +biodiversity research within the machine learning community, progress on +creating an image-based taxonomic classifier will also further the ultimate +goal of all BIOSCAN research: to lay the foundation for a comprehensive survey +of global biodiversity. This paper introduces the dataset and explores the +classification task through the implementation and analysis of a baseline +classifier. + +
+
+
+
+
+ + ♻ ☆ A Survey of Deep Learning for Low-Shot Object Detection + + +
+ Object detection has achieved a huge breakthrough with deep neural networks +and massive annotated data. However, current detection methods cannot be +directly transferred to the scenario where the annotated data is scarce due to +the severe overfitting problem. Although few-shot learning and zero-shot +learning have been extensively explored in the field of image classification, +it is indispensable to design new methods for object detection in the +data-scarce scenario since object detection has an additional challenging +localization task. Low-Shot Object Detection (LSOD) is an emerging research +topic of detecting objects from a few or even no annotated samples, consisting +of One-Shot Object Detection (OSOD), Few-Shot Object Detection (FSOD) and +Zero-Shot Object Detection (ZSD). This survey provides a comprehensive review +of LSOD methods. First, we propose a thorough taxonomy of LSOD methods and +analyze them systematically, comprising some extensional topics of LSOD +(semi-supervised LSOD, weakly-supervised LSOD, and incremental LSOD). Then, we +indicate the pros and cons of current LSOD methods with a comparison of their +performance. Finally, we discuss the challenges and promising directions of +LSOD to provide guidance for future works. + +
+
+
+
+
+ + ♻ ☆ Evaluation and Improvement of Interpretability for Self-Explainable + Part-Prototype Networks + + +
+ Part-prototype networks (e.g., ProtoPNet, ProtoTree, and ProtoPool) have +attracted broad research interest for their intrinsic interpretability and +comparable accuracy to non-interpretable counterparts. However, recent works +find that the interpretability from prototypes is fragile, due to the semantic +gap between the similarities in the feature space and those in the input space. +In this work, we strive to address this challenge by making the first attempt +to quantitatively and objectively evaluate the interpretability of the +part-prototype networks. Specifically, we propose two evaluation metrics, +termed the consistency score and the stability score, to evaluate the explanation +consistency across images and the explanation robustness against perturbations, +respectively, both of which are essential for explanations put into practice. +Furthermore, we propose an elaborated part-prototype network with a +shallow-deep feature alignment (SDFA) module and a score aggregation (SA) +module to improve the interpretability of prototypes. We conduct systematic +evaluation experiments and provide substantial discussions to uncover the +interpretability of existing part-prototype networks. Experiments on three +benchmarks across nine architectures demonstrate that our model achieves +significantly superior performance to the state of the art, in both +accuracy and interpretability. Our code is available at +https://github.com/hqhQAQ/EvalProtoPNet. + +
+
+
+
+
+ + ♻ ☆ DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch + Diffusion in Histopathology + + +
+ We present DiffInfinite, a hierarchical diffusion model that generates +arbitrarily large histological images while preserving long-range correlation +structural information. Our approach first generates synthetic segmentation +masks, subsequently used as conditions for the high-fidelity generative +diffusion process. The proposed sampling method can be scaled up to any desired +image size while only requiring small patches for fast training. Moreover, it +can be parallelized more efficiently than previous large-content generation +methods while avoiding tiling artifacts. The training leverages classifier-free +guidance to augment a small, sparsely annotated dataset with unlabelled data. +Our method alleviates unique challenges in histopathological imaging practice: +large-scale information, costly manual annotation, and protective data +handling. The biological plausibility of DiffInfinite data is evaluated in a +survey by ten experienced pathologists as well as a downstream classification +and segmentation task. Samples from the model score strongly on anti-copying +metrics which is relevant for the protection of patient data. + +
+
+
+
+
+ + ♻ ☆ Edge Aware Learning for 3D Point Cloud + + +
+ This paper proposes an innovative approach to Hierarchical Edge Aware 3D +Point Cloud Learning (HEA-Net) that seeks to address the challenges of noise in +point cloud data, and improve object recognition and segmentation by focusing +on edge features. In this study, we present an innovative edge-aware learning +methodology, specifically designed to enhance point cloud classification and +segmentation. Drawing inspiration from the human visual system, the concept of +edge-awareness has been incorporated into this methodology, contributing to +improved object recognition while simultaneously reducing computational time. +Our research has led to the development of an advanced 3D point cloud learning +framework that effectively manages object classification and segmentation +tasks. A unique fusion of local and global network learning paradigms has been +employed, enriched by edge-focused local and global embeddings, thereby +significantly augmenting the model's interpretative prowess. Further, we have +applied a hierarchical transformer architecture to boost point cloud processing +efficiency, thus providing nuanced insights into structural understanding. Our +approach demonstrates significant promise in managing noisy point cloud data +and highlights the potential of edge-aware strategies in 3D point cloud +learning. The proposed approach is shown to outperform existing techniques in +object classification and segmentation tasks, as demonstrated by experiments on +ModelNet40 and ShapeNet datasets. + +
+
+ comment: CGI 2023 +
+
+
+
+
+ + ♻ ☆ Segment Any Building For Remote Sensing + + +
+ The task of identifying and segmenting buildings within remote sensing +imagery has perennially stood at the forefront of scholarly investigations. +This manuscript accentuates the potency of harnessing diversified datasets in +tandem with cutting-edge representation learning paradigms for building +segmentation in such images. Through the strategic amalgamation of disparate +datasets, we have not only expanded the informational horizon accessible for +model training but also manifested unparalleled performance metrics across +multiple datasets. Our avant-garde joint training regimen underscores the merit +of our approach, bearing significant implications in pivotal domains such as +urban infrastructural development, disaster mitigation strategies, and +ecological surveillance. Our methodology, predicated upon the fusion of +datasets and gleaning insights from pre-trained models, carves a new benchmark +in the annals of building segmentation endeavors. The outcomes of this research +both fortify the foundations for ensuing scholarly pursuits and presage a +horizon replete with innovative applications in the discipline of building +segmentation. + +
+
+ comment: Accepted by CGI 2023 +
+
+
+
+
+ + ♻ ☆ Distance Weighted Trans Network for Image Completion + + +
+ The challenge of image generation has been effectively modeled as a problem +of structure priors or transformation. However, existing models have +unsatisfactory performance in understanding the global input image structures +because of particular inherent features (for example, local inductive prior). +Recent studies have shown that self-attention is an efficient modeling +technique for image completion problems. In this paper, we propose a new +architecture that relies on Distance-based Weighted Transformer (DWT) to better +understand the relationships between an image's components. In our model, we +leverage the strengths of both Convolutional Neural Networks (CNNs) and DWT +blocks to enhance the image completion process. Specifically, CNNs are used to +augment the local texture information of coarse priors and DWT blocks are used +to recover certain coarse textures and coherent visual structures. Unlike +current approaches that generally use CNNs to create feature maps, we use the +DWT to encode global dependencies and compute distance-based weighted feature +maps, which substantially minimizes the problem of visual ambiguities. +Meanwhile, to better produce repeated textures, we introduce Residual Fast +Fourier Convolution (Res-FFC) blocks to combine the encoder's skip features +with the coarse features provided by our generator. Furthermore, a simple yet +effective technique is proposed to normalize the non-zero values of +convolutions, and fine-tune the network layers for regularization of the +gradient norms to provide an efficient training stabiliser. Extensive +quantitative and qualitative experiments on three challenging datasets +demonstrate the superiority of our proposed model compared to existing +approaches. + +
+
+
+
+
+ + ♻ ☆ CL-MAE: Curriculum-Learned Masked Autoencoders WACV 2024 + + +
+ Masked image modeling has been demonstrated as a powerful pretext task for +generating robust representations that can be effectively generalized across +multiple downstream tasks. Typically, this approach involves randomly masking +patches (tokens) in input images, with the masking strategy remaining unchanged +during training. In this paper, we propose a curriculum learning approach that +updates the masking strategy to continually increase the complexity of the +self-supervised reconstruction task. We conjecture that, by gradually +increasing the task complexity, the model can learn more sophisticated and +transferable representations. To facilitate this, we introduce a novel +learnable masking module that possesses the capability to generate masks of +different complexities, and integrate the proposed module into masked +autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting +its behavior during training, transitioning from a partner to the MAE +(optimizing the same reconstruction loss) to an adversary (optimizing the +opposite loss), while passing through a neutral state. The transition between +these behaviors is smooth, being regulated by a factor that is multiplied with +the reconstruction loss of the masking module. The resulting training procedure +generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked +Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior +representation learning capabilities compared to MAE. The empirical results on +five downstream tasks confirm our conjecture, demonstrating that curriculum +learning can be successfully used to self-supervise masked autoencoders. We +release our code at https://github.com/ristea/cl-mae. + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ♻ ☆ Learning Unseen Modality Interaction NeurIPS 2023 + + +
+ Multimodal learning assumes all modality combinations of interest are +available during training to learn cross-modal correspondences. In this paper, +we challenge this modality-complete assumption for multimodal learning and +instead strive for generalization to unseen modality combinations during +inference. We pose the problem of unseen modality interaction and introduce a +first solution. It exploits a module that projects the multidimensional +features of different modalities into a common space with rich information +preserved. This allows the information to be accumulated with a simple +summation operation across available modalities. To reduce overfitting to less +discriminative modality combinations during training, we further improve the +model learning with pseudo-supervision indicating the reliability of a +modality's prediction. We demonstrate that our approach is effective for +diverse tasks and modalities by evaluating it for multimodal video +classification, robot state regression, and multimedia retrieval. Project +website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. + +
+
+ comment: Published at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Guide Your Agent with Adaptive Multimodal Rewards NeurIPS 2023 + + +
+ Developing an agent capable of adapting to unseen environments remains a +difficult challenge in imitation learning. This work presents Adaptive +Return-conditioned Policy (ARP), an efficient framework designed to enhance the +agent's generalization ability using natural language task descriptions and +pre-trained multimodal encoders. Our key idea is to calculate a similarity +between visual observations and natural language instructions in the +pre-trained multimodal embedding space (such as CLIP) and use it as a reward +signal. We then train a return-conditioned policy using expert demonstrations +labeled with multimodal rewards. Because the multimodal rewards provide +adaptive signals at each timestep, our ARP effectively mitigates the goal +misgeneralization. This results in superior generalization performances even +when faced with unseen text instructions, compared to existing text-conditioned +policies. To improve the quality of rewards, we also introduce a fine-tuning +method for pre-trained multimodal encoders, further enhancing the performance. +Video demonstrations and source code are available on the project website: +https://sites.google.com/view/2023arp. + +
+
+ comment: Accepted to NeurIPS 2023. Project webpage: + https://sites.google.com/view/2023arp +
+
+
+
+
+ + ♻ ☆ Chilled Sampling for Uncertainty Quantification: A Motivation From A + Meteorological Inverse Problem + + +
+ Atmospheric motion vectors (AMVs) extracted from satellite imagery are the +only wind observations with good global coverage. They are important features +for feeding numerical weather prediction (NWP) models. Several Bayesian models +have been proposed to estimate AMVs. Although critical for correct assimilation +into NWP models, very few methods provide a thorough characterization of the +estimation errors. The difficulty of estimating errors stems from the +specificity of the posterior distribution, which is both very high-dimensional +and highly ill-conditioned due to a singular likelihood. Motivated by this +difficult inverse problem, this work studies the evaluation of the (expected) +estimation errors using gradient-based Markov Chain Monte Carlo (MCMC) +algorithms. The main contribution is to propose a general strategy, called here +chilling, which amounts to sampling a local approximation of the posterior +distribution in the neighborhood of a point estimate. From a theoretical point +of view, we show that under regularity assumptions, the family of chilled +posterior distributions converges in distribution as temperature decreases to +an optimal Gaussian approximation at a point estimate given by the Maximum A +Posteriori, also known as the Laplace approximation. Chilled sampling therefore +provides access to this approximation, which is generally out of reach in such +high-dimensional nonlinear contexts. From an empirical perspective, we evaluate +the proposed approach based on some quantitative Bayesian criteria. Our +numerical simulations are performed on synthetic and real meteorological data. +They reveal that the proposed chilling yields not only a significant gain in +terms of accuracy of the point estimates and of their associated expected +errors, but also a substantial acceleration in the convergence speed of the +MCMC algorithms. + +
+
+
+
+
+ + ♻ ☆ Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video + Understanding EMNLP 2023 + + +
+ We present Video-LLaMA, a multi-modal framework that empowers Large Language +Models (LLMs) with the capability of understanding both visual and auditory +content in the video. Video-LLaMA bootstraps cross-modal training from the +frozen pre-trained visual and audio encoders and the frozen LLMs. Unlike +previous works that complement LLMs to process the visual or audio signals +only, Video-LLaMA enables video comprehension by tackling two challenges: (1) +capturing the temporal changes in visual scenes, (2) integrating audio-visual +signals. To counter the first challenge, we propose a Video Q-former to +assemble a pre-trained image encoder into our video encoder and introduce a +video-to-text generation task to learn video-language correspondence. For the +second challenge, we leverage ImageBind, a universal embedding model aligning +multiple modalities, as the pre-trained audio encoder and introduce an Audio +Q-former on top of ImageBind to learn reasonable auditory query embeddings for +the LLM module. To align the output of both visual and audio encoders with +the LLM's embedding space, we first train Video-LLaMA on massive +video/image-caption pairs and then tune our model with visual-instruction +datasets of moderate amount but higher quality. We found that Video-LLaMA shows the +ability to perceive and comprehend video content and generate meaningful +responses grounded in the visual and auditory information presented in the +videos. + +
+
+ comment: Accepted by EMNLP 2023's demo track; Code, Pretrained Model, and + Dataset: https://github.com/DAMO-NLP-SG/Video-LLaMA +
+
+
+
+
+ + ♻ ☆ Data Pruning via Moving-one-Sample-out NeurIPS 2023 + + +
+ In this paper, we propose a novel data-pruning approach called +moving-one-sample-out (MoSo), which aims to identify and remove the least +informative samples from the training set. The core insight behind MoSo is to +determine the importance of each sample by assessing its impact on the optimal +empirical risk. This is achieved by measuring the extent to which the empirical +risk changes when a particular sample is excluded from the training set. +Instead of using the computationally expensive leaving-one-out-retraining +procedure, we propose an efficient first-order approximator that only requires +gradient information from different training stages. The key idea behind our +approximation is that samples with gradients that are consistently aligned with +the average gradient of the training set are more informative and should +receive higher scores, which could be intuitively understood as follows: if the +gradient from a specific sample is consistent with the average gradient vector, +it implies that optimizing the network using the sample will yield a similar +effect on all remaining samples. Experimental results demonstrate that MoSo +effectively mitigates severe performance degradation at high pruning ratios and +achieves satisfactory performance across various settings. + +
+
+ comment: Accepted by the Thirty-seventh Conference on Neural Information + Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via + Cross-modal Distillation and Super-Voxel Clustering ICCV + +
+ Semantic segmentation of point clouds usually requires exhaustive human +annotation effort, so the challenging topic of learning from unlabeled data or +weaker forms of annotation has attracted wide attention. In this paper, we make +the first attempt at fully unsupervised semantic segmentation of point clouds, +which aims to delineate semantically meaningful objects without any form of +annotations. Previous unsupervised pipelines for 2D images fail on point +clouds due to: 1) Clustering Ambiguity caused by limited +magnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity +caused by the irregular sparsity of point clouds. Therefore, we propose a novel +framework, PointDC, which comprises two steps that handle the +aforementioned problems respectively: Cross-Modal Distillation (CMD) and +Super-Voxel Clustering (SVC). In the first stage of CMD, multi-view visual +features are back-projected to the 3D space and aggregated to a unified point +feature to distill the training of the point representation. In the second +stage of SVC, the point features are aggregated to super-voxels and then fed to +the iterative clustering process for excavating semantic classes. PointDC +yields a significant improvement over the prior state-of-the-art unsupervised +methods, on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic +segmentation benchmarks. + +
+
+ comment: Accepted by International Conference on Computer Vision (ICCV) 2023 +
+
+
+
+
+ + ♻ ☆ LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning NeurIPS 2023 + + +
+ We present a novel vision-language prompt learning approach for few-shot +out-of-distribution (OOD) detection. Few-shot OOD detection aims to detect OOD +images from classes that are unseen during training using only a few labeled +in-distribution (ID) images. While prompt learning methods such as CoOp have +shown effectiveness and efficiency in few-shot ID classification, they still +face limitations in OOD detection due to the potential presence of +ID-irrelevant information in text embeddings. To address this issue, we +introduce a new approach called Local regularized Context Optimization +(LoCoOp), which performs OOD regularization that utilizes the portions of CLIP +local features as OOD features during training. CLIP's local features have a +lot of ID-irrelevant nuisances (e.g., backgrounds), and by learning to push +them away from the ID class text embeddings, we can remove the nuisances in the +ID class text embeddings and enhance the separation between ID and OOD. +Experiments on the large-scale ImageNet OOD detection benchmarks demonstrate +the superiority of our LoCoOp over zero-shot, fully supervised detection +methods and prompt learning methods. Notably, even in a one-shot setting -- +just one label per class, LoCoOp outperforms existing zero-shot and fully +supervised detection methods. The code will be available via +https://github.com/AtsuMiyai/LoCoOp. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Distribution-Aligned Diffusion for Human Mesh Recovery ICCV 2023 + + +
+ Recovering a 3D human mesh from a single RGB image is a challenging task due +to depth ambiguity and self-occlusion, resulting in a high degree of +uncertainty. Meanwhile, diffusion models have recently seen much success in +generating high-quality outputs by progressively denoising noisy inputs. +Inspired by their capability, we explore a diffusion-based approach for human +mesh recovery, and propose a Human Mesh Diffusion (HMDiff) framework which +frames mesh recovery as a reverse diffusion process. We also propose a +Distribution Alignment Technique (DAT) that infuses prior distribution +information into the mesh distribution diffusion process, and provides useful +prior knowledge to facilitate the mesh recovery task. Our method achieves +state-of-the-art performance on three widely used datasets. Project page: +https://gongjia0208.github.io/HMDiff/. + +
+
+ comment: Accepted to ICCV 2023 +
+
+
+
+
+ + ♻ ☆ Training Priors Predict Text-To-Image Model Performance + + +
+ Text-to-image models can often generate some relations, e.g., "astronaut +riding horse", but fail to generate other relations composed of the same basic +parts, e.g., "horse riding astronaut". These failures are often taken as +evidence that models rely on training priors rather than constructing novel +images compositionally. This paper tests this intuition on the Stable Diffusion +2.1 text-to-image model. By looking at the subject-verb-object (SVO) triads +that underlie these prompts (e.g., "astronaut", "ride", "horse"), we find that +the more often an SVO triad appears in the training data, the better the model +can generate an image aligned with that triad. Here, by aligned we mean that +each of the terms appears in the generated image in the proper relation to each +other. Surprisingly, this increased frequency also diminishes how well the +model can generate an image aligned with the flipped triad. For example, if +"astronaut riding horse" appears frequently in the training data, the image for +"horse riding astronaut" will tend to be poorly aligned. Our results thus show +that current models are biased to generate images with relations seen in +training, and provide new data to the ongoing debate on whether these +text-to-image models employ abstract compositional structure in a traditional +sense, or rather, interpolate between relations explicitly seen in the training +data. + +
+
+
+
+
+ + ♻ ☆ Understanding Optimization of Deep Learning via Jacobian Matrix and + Lipschitz Constant + + +
+ This article provides a comprehensive understanding of optimization in deep +learning, with a primary focus on the challenges of gradient vanishing and +gradient exploding, which normally lead to diminished model representational +ability and training instability, respectively. We analyze these two challenges +through several strategic measures, including the improvement of gradient flow +and the imposition of constraints on a network's Lipschitz constant. To help +understand the current optimization methodologies, we categorize them into two +classes: explicit optimization and implicit optimization. Explicit optimization +methods involve direct manipulation of optimizer parameters, including weight, +gradient, learning rate, and weight decay. Implicit optimization methods, by +contrast, focus on improving the overall landscape of a network by enhancing +its modules, such as residual shortcuts, normalization methods, attention +mechanisms, and activations. In this article, we provide an in-depth analysis +of these two optimization classes and undertake a thorough examination of the +Jacobian matrices and the Lipschitz constants of many widely used deep learning +modules, highlighting existing issues as well as potential improvements. +Moreover, we also conduct a series of analytical experiments to substantiate +our theoretical discussions. This article does not aim to propose a new +optimizer or network. Rather, our intention is to present a comprehensive +understanding of optimization in deep learning. We hope that this article will +assist readers in gaining a deeper insight in this field and encourages the +development of more robust, efficient, and high-performing models. + +
+
+ comment: International Digital Economy Academy (IDEA) +
+
+
+
+
+ + ♻ ☆ R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image + Generation + + +
+ Recent text-to-image (T2I) diffusion models have achieved remarkable progress +in generating high-quality images given text-prompts as input. However, these +models fail to convey appropriate spatial composition specified by a layout +instruction. In this work, we probe into zero-shot grounded T2I generation with +diffusion models, that is, generating images corresponding to the input layout +information without training auxiliary modules or finetuning diffusion models. +We propose a Region and Boundary (R&B) aware cross-attention guidance approach +that gradually modulates the attention maps of diffusion model during +generative process, and assists the model to synthesize images (1) with high +fidelity, (2) highly compatible with textual input, and (3) interpreting layout +instructions accurately. Specifically, we leverage the discrete sampling to +bridge the gap between consecutive attention maps and discrete layout +constraints, and design a region-aware loss to refine the generative layout +during diffusion process. We further propose a boundary-aware loss to +strengthen object discriminability within the corresponding regions. +Experimental results show that our method outperforms existing state-of-the-art +zero-shot grounded T2I generation methods by a large margin both qualitatively +and quantitatively on several benchmarks. + +
+
+ comment: Preprint. Under review. Project page: + https://sagileo.github.io/Region-and-Boundary +
+
+
+
+
+ + ♻ ☆ InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution EMNLP 2023 + + +
+ Over recent decades, significant advancements in cross-modal retrieval have been +mainly driven by breakthroughs in visual and linguistic modeling. However, a +recent study shows that multi-modal data representations tend to cluster within +a limited convex cone (the representation degeneration problem), which hinders +retrieval performance due to the inseparability of these representations. In +our study, we first empirically validate the presence of the representation +degeneration problem across multiple cross-modal benchmarks and methods. Next, +to address it, we introduce a novel method, called InvGC, a post-processing +technique inspired by graph convolution and average pooling. Specifically, +InvGC defines the graph topology within the datasets and then applies graph +convolution in a subtractive manner. This method effectively separates +representations by increasing the distances between data points. To improve the +efficiency and effectiveness of InvGC, we propose an advanced graph topology, +LocalAdj, which only aims to increase the distances between each data point and +its nearest neighbors. To understand why InvGC works, we present a detailed +theoretical analysis, proving that the lower bound of recall will be improved +after deploying InvGC. Extensive empirical results show that InvGC and InvGC +w/LocalAdj significantly mitigate the representation degeneration problem, +thereby enhancing retrieval performance. + Our code is available at +https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ A Recycling Training Strategy for Medical Image Segmentation with + Diffusion Denoising Models + + +
+ Denoising diffusion models have found applications in image segmentation by +generating segmented masks conditioned on images. Existing studies +predominantly focus on adjusting model architecture or improving inference, +such as test-time sampling strategies. In this work, we focus on improving the +training strategy and propose a novel recycling method. During each training +step, a segmentation mask is first predicted given an image and a random noise. +This predicted mask, which replaces the conventional ground truth mask, is used +for denoising task during training. This approach can be interpreted as +aligning the training strategy with inference by eliminating the dependence on +ground truth masks for generating noisy samples. Our proposed method +significantly outperforms standard diffusion training, self-conditioning, and +existing recycling strategies across multiple medical imaging data sets: muscle +ultrasound, abdominal CT, prostate MR, and brain MR. This holds for two widely +adopted sampling strategies: denoising diffusion probabilistic model and +denoising diffusion implicit model. Importantly, existing diffusion models +often display a declining or unstable performance during inference, whereas our +novel recycling consistently enhances or maintains performance. We show that, +under a fair comparison with the same network architectures and computing +budget, the proposed recycling-based diffusion models achieved on-par +performance with non-diffusion-based supervised training. By ensembling the +proposed diffusion and the non-diffusion models, significant improvements to +the non-diffusion models have been observed across all applications, +demonstrating the value of this novel training method. This paper summarizes +these quantitative results and discusses their values, with a fully +reproducible JAX-based implementation, released at +https://github.com/mathpluscode/ImgX-DiffSeg. + +
+
+
+
+
+ + ♻ ☆ Recurrent Transformer Encoders for Vision-based Estimation of Fatigue + and Engagement in Cognitive Training Sessions + + +
+ Computerized cognitive training (CCT) is a scalable, well-tolerated +intervention that has promise for slowing cognitive decline. Outcomes from CCT +are limited by a lack of effective engagement, which is decreased by factors +such as mental fatigue, particularly in older adults at risk for dementia. +There is a need for scalable, automated measures that can monitor mental +fatigue during CCT. Here, we develop and validate a novel Recurrent Video +Transformer (RVT) method for monitoring real-time mental fatigue in older +adults with mild cognitive impairment from video-recorded facial gestures +during CCT. The RVT model achieved the highest balanced accuracy (78%) and +precision (0.82) compared to the prior state-of-the-art models for binary and +multi-class classification of mental fatigue and was additionally validated via +significant association (p=0.023) with CCT reaction time. By leveraging dynamic +temporal information, the RVT model demonstrates the potential to accurately +measure real-time mental fatigue, laying the foundation for future personalized +CCT that increases effective engagement. + +
+
+ comment: 23 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ AutoFocusFormer: Image Segmentation off the Grid CVPR 2023 + + +
+ Real world images often have highly imbalanced content density. Some areas +are very uniform, e.g., large patches of blue sky, while other areas are +scattered with many small objects. Yet, the commonly used successive grid +downsampling strategy in convolutional deep networks treats all areas equally. +Hence, small objects are represented in very few spatial locations, leading to +worse results in tasks such as segmentation. Intuitively, retaining more pixels +representing small objects during downsampling helps to preserve important +information. To achieve this, we propose AutoFocusFormer (AFF), a +local-attention transformer image recognition backbone, which performs adaptive +downsampling by learning to retain the most important pixels for the task. +Since adaptive downsampling generates a set of pixels irregularly distributed +on the image plane, we abandon the classic grid structure. Instead, we develop +a novel point-based local attention block, facilitated by a balanced clustering +module and a learnable neighborhood merging module, which yields +representations for our point-based versions of state-of-the-art segmentation +heads. Experiments show that our AutoFocusFormer (AFF) improves significantly +over baseline models of similar sizes. + +
+
+ comment: CVPR 2023 +
+
+
+
+
+ + ♻ ☆ Tuned Compositional Feature Replays for Efficient Stream Learning + + +
+ Our brains extract durable, generalizable knowledge from transient +experiences of the world. Artificial neural networks come nowhere close: when +tasked with learning to classify objects by training on non-repeating video +frames in temporal order (online stream learning), models that learn well from +shuffled datasets catastrophically forget old knowledge upon learning new +stimuli. We propose a new continual learning algorithm, Compositional Replay +Using Memory Blocks (CRUMB), which mitigates forgetting by replaying feature +maps reconstructed by recombining generic parts. CRUMB concatenates trainable +and re-usable "memory block" vectors to compositionally reconstruct feature map +tensors in convolutional neural networks, like crumbs forming a loaf of bread. +CRUMB stores the indices of memory blocks used to reconstruct new stimuli, +enabling replay of specific memories during later tasks. This reconstruction +mechanism also primes the neural network to minimize catastrophic forgetting by +forcing it to attend to information about object shapes more than information +about image textures, and stabilizes the network during stream learning by +providing a shared feature-level basis for all training examples. These +properties allow CRUMB to outperform an otherwise identical algorithm that +stores and replays raw images while occupying only 3.6% as much memory. We +stress-tested CRUMB alongside 13 competing methods on 7 challenging datasets. +To address the limited number of existing online stream learning datasets, we +introduce 2 new benchmarks by adapting existing datasets for stream learning. +With about 4% as much memory and 30% as much runtime, CRUMB mitigates +catastrophic forgetting more effectively than the prior state-of-the-art. Our +code is available on GitHub at https://github.com/MorganBDT/crumb. + +
+
+ comment: Copyright 2023 IEEE. Personal use of this material is permitted. + Permission from IEEE must be obtained for all other uses, in any current or + future media, including reprinting/republishing this material for advertising + or promotional purposes, creating new collective works, for resale or + redistribution to servers or lists, or reuse of any copyrighted component of + this work in other works +
+
+
+
+
+ + ♻ ☆ Open-World Object Manipulation using Pre-trained Vision-Language Models + + +
+ For robots to follow instructions from people, they must be able to connect +the rich semantic information in human vocabulary, e.g. "can you get me the +pink stuffed whale?" to their sensory observations and actions. This brings up +a notably difficult challenge for robots: while robot learning approaches allow +robots to learn many different behaviors from first-hand experience, it is +impractical for robots to have first-hand experiences that span all of this +semantic information. We would like a robot's policy to be able to perceive and +pick up the pink stuffed whale, even if it has never seen any data interacting +with a stuffed whale before. Fortunately, static data on the internet has vast +semantic information, and this information is captured in pre-trained +vision-language models. In this paper, we study whether we can interface robot +policies with these pre-trained models, with the aim of allowing robots to +complete instructions involving object categories that the robot has never seen +first-hand. We develop a simple approach, which we call Manipulation of +Open-World Objects (MOO), which leverages a pre-trained vision-language model +to extract object-identifying information from the language command and image, +and conditions the robot policy on the current image, the instruction, and the +extracted object information. In a variety of experiments on a real mobile +manipulator, we find that MOO generalizes zero-shot to a wide range of novel +object categories and environments. In addition, we show how MOO generalizes to +other, non-language-based input modalities to specify the object of interest +such as finger pointing, and how it can be further extended to enable +open-world navigation and manipulation. The project's website and evaluation +videos can be found at https://robot-moo.github.io/ + +
+
+ comment: Accepted at the 7th Conference on Robot Learning (CoRL 2023) +
+
+
+
+
+ + ♻ ☆ Probing Conceptual Understanding of Large Visual-Language Models + + +
+ In recent years, large visual-language (V+L) models have achieved great +success in various downstream tasks. However, it is not well studied whether +these models have a conceptual grasp of the visual content. In this work we +focus on conceptual understanding of these large V+L models. To facilitate this +study, we propose novel benchmarking datasets for probing three different +aspects of content understanding, 1) relations, 2) composition and 3) context. +Our probes are grounded in cognitive science and help determine if a V+L model +can, for example, determine if "snow garnished with a man" is implausible, or +if it can identify beach furniture by knowing it is located on a beach. We +experimented with five different state-of-the-art V+L models and observe that +these models mostly fail to demonstrate a conceptual understanding. This study +reveals several interesting insights, such as that cross-attention helps in learning +conceptual understanding, and that CNNs are better with texture and patterns, +while Transformers are better at color and shape. We further utilize some of +these insights and propose a baseline for improving performance by a simple +finetuning technique that rewards the three conceptual understanding measures +with promising initial results. We believe that the proposed benchmarks will +help the community assess and improve the conceptual understanding capabilities +of large V+L models. + +
+
+ comment: All code and dataset is available at: + https://tinyurl.com/vlm-robustness +
+
+
+
+
+ + ♻ ☆ Universal Test-time Adaptation through Weight Ensembling, Diversity + Weighting, and Prior Correction WACV 2024 + + +
+ Since distribution shifts are likely to occur during test-time and can +drastically decrease the model's performance, online test-time adaptation (TTA) +continues to update the model after deployment, leveraging the current test +data. Clearly, a method proposed for online TTA has to perform well for all +kinds of environmental conditions. By introducing the variable factors domain +non-stationarity and temporal correlation, we first unfold all practically +relevant settings and define the entity as universal TTA. We want to highlight +that this is the first work that covers such a broad spectrum, which is +indispensable for the use in practice. To tackle the problem of universal TTA, +we identify and highlight several challenges a self-training based method has +to deal with: 1) model bias and the occurrence of trivial solutions when +performing entropy minimization on varying sequence lengths with and without +multiple domain shifts, 2) loss of generalization which exacerbates the +adaptation to multiple domain shifts and the occurrence of catastrophic +forgetting, and 3) performance degradation due to shifts in class prior. To +prevent the model from becoming biased, we leverage a dataset and +model-agnostic certainty and diversity weighting. In order to maintain +generalization and prevent catastrophic forgetting, we propose to continually +weight-average the source and adapted model. To compensate for disparities in +the class prior during test-time, we propose an adaptive prior correction +scheme that reweights the model's predictions. We evaluate our approach, named +ROID, on a wide range of settings, datasets, and models, setting new standards +in the field of universal TTA. Code is available at: +https://github.com/mariodoebler/test-time-adaptation + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ♻ ☆ SACSoN: Scalable Autonomous Control for Social Navigation + + +
+ Machine learning provides a powerful tool for building socially compliant +robotic systems that go beyond simple predictive models of human behavior. By +observing and understanding human interactions from past experiences, learning +can enable effective social navigation behaviors directly from data. In this +paper, our goal is to develop methods for training policies for socially +unobtrusive navigation, such that robots can navigate among humans in ways that +don't disturb human behavior. We introduce a definition for such behavior based +on the counterfactual perturbation of the human: if the robot had not intruded +into the space, would the human have acted in the same way? By minimizing this +counterfactual perturbation, we can induce robots to behave in ways that do not +alter the natural behavior of humans in the shared space. Instantiating this +principle requires training policies to minimize their effect on human +behavior, and this in turn requires data that allows us to model the behavior +of humans in the presence of robots. Therefore, our approach is based on two +key contributions. First, we collect a large dataset where an indoor mobile +robot interacts with human bystanders. Second, we utilize this dataset to train +policies that minimize counterfactual perturbation. We provide supplementary +videos and make publicly available the largest-of-its-kind visual navigation +dataset on our project page. + +
+
+ comment: 11 pages, 15 figures, 4 tables +
+
+
+
+
+ + ♻ ☆ MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, + Bard, and Other Large Multimodal Models + + +
+ Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit +impressive problem-solving skills in many tasks and domains, but their ability +in mathematical reasoning in visual contexts has not been systematically +studied. To bridge this gap, we present MathVista, a benchmark designed to +combine challenges from diverse mathematical and visual tasks. It consists of +6,141 examples, derived from 28 existing multimodal datasets involving +mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and +PaperQA). Completing these tasks requires fine-grained, deep visual +understanding and compositional reasoning, which all state-of-the-art +foundation models find challenging. With MathVista, we have conducted a +comprehensive, quantitative evaluation of 12 prominent foundation models. The +best-performing GPT-4V model achieves an overall accuracy of 49.9%, +substantially outperforming Bard, the second-best performer, by 15.1%. Our +in-depth analysis reveals that the superiority of GPT-4V is mainly attributed +to its enhanced visual perception and mathematical reasoning. However, GPT-4V +still falls short of human performance by 10.4%, as it often struggles to +understand complex figures and perform rigorous reasoning. This significant gap +underscores the critical role that MathVista will play in the development of +general-purpose AI agents capable of tackling mathematically intensive and +visually rich real-world tasks. We further explore the new ability of +self-verification, the application of self-consistency, and the interactive +chatbot capabilities of GPT-4V, highlighting its promising potential for future +research. The project is available at https://mathvista.github.io/. + +
+
+ comment: 112 pages, 117 figures. Work in progress +
+
+
+
+
+
+
+
+ + Information Retrieval 13 + +
+
+
+ + ☆ Improving Conversational Recommendation Systems via Bias Analysis and + Language-Model-Enhanced Data Augmentation EMNLP 2023 + + +
+ Conversational Recommendation System (CRS) is a rapidly growing research area +that has gained significant attention alongside advancements in language +modelling techniques. However, the current state of conversational +recommendation faces numerous challenges due to its relative novelty and +limited existing contributions. In this study, we delve into benchmark datasets +for developing CRS models and address potential biases arising from the +feedback loop inherent in multi-turn interactions, including selection bias and +multiple popularity bias variants. Drawing inspiration from the success of +generative data via using language models and data augmentation techniques, we +present two novel strategies, 'Once-Aug' and 'PopNudge', to enhance model +performance while mitigating biases. Through extensive experiments on ReDial +and TG-ReDial benchmark datasets, we show a consistent improvement of CRS +techniques with our data augmentation approaches and offer additional insights +on addressing multiple newly formulated biases. + +
+
+ comment: Accepted by EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ Exploring Large Language Models for Code Explanation + + +
+ Automating code documentation through explanatory text can prove highly +beneficial in code understanding. Large Language Models (LLMs) have made +remarkable strides in Natural Language Processing, especially within software +engineering tasks such as code generation and code summarization. This study +specifically delves into the task of generating natural-language summaries for +code snippets, using various LLMs. The findings indicate that Code LLMs +outperform their generic counterparts, and zero-shot methods yield superior +results when dealing with datasets with dissimilar distributions between +training and testing sets. + +
+
+ comment: Accepted at the Forum for Information Retrieval Evaluation 2023 (IRSE + Track) +
+
+
+
+
+ + ☆ Distributionally Robust Unsupervised Dense Retrieval Training on Web + Graphs + + +
+ This paper introduces Web-DRO, an unsupervised dense retrieval model, which +clusters documents based on web structures and reweights the groups during +contrastive training. Specifically, we first leverage web graph links and +contrastively train an embedding model for clustering anchor-document pairs. +Then we use Group Distributional Robust Optimization to reweight different +clusters of anchor-document pairs, which guides the model to assign more +weights to the group with higher contrastive loss and pay more attention to the +worst case during training. Our experiments on MS MARCO and BEIR show that our +model, Web-DRO, significantly improves the retrieval effectiveness in +unsupervised scenarios. A comparison of clustering techniques shows that +training on the web graph combining URL information reaches optimal performance +on clustering. Further analysis confirms that group weights are stable and +valid, indicating consistent model preferences as well as effective +up-weighting of valuable groups and down-weighting of uninformative ones. The +code of this paper can be obtained from https://github.com/OpenMatch/Web-DRO. + +
+
+ comment: 9 pages, 5 figures, 5 tables +
+
+
+
+
+ + ☆ Model-enhanced Contrastive Reinforcement Learning for Sequential + Recommendation + + +
+ Reinforcement learning (RL) has been widely applied in recommendation systems +due to its potential in optimizing the long-term engagement of users. From the +perspective of RL, recommendation can be formulated as a Markov decision +process (MDP), where the recommendation system (agent) can interact with users +(environment) and acquire feedback (reward signals). However, it is impractical +to conduct online interactions given concerns about user experience and +implementation complexity, and we can only train RL recommenders with offline +datasets containing limited reward signals and state transitions. Therefore, +the data sparsity issue of reward signals and state transitions is very severe, +while it has long been overlooked by existing RL recommenders. Worse still, RL +methods learn through the trial-and-error mode, but negative feedback cannot be +obtained in implicit feedback recommendation tasks, which aggravates the +overestimation problem of offline RL recommenders. To address these challenges, +we propose a novel RL recommender named model-enhanced contrastive +reinforcement learning (MCRL). On the one hand, we learn a value function to +estimate the long-term engagement of users, together with a conservative value +learning mechanism to alleviate the overestimation problem. On the other hand, +we construct some positive and negative state-action pairs to model the reward +function and state transition function with contrastive learning to exploit the +internal structure information of the MDP. Experiments demonstrate that the +proposed method significantly outperforms existing offline RL and +self-supervised RL methods with different representative backbone networks on +two real-world datasets. + +
+
+ comment: 11 pages, 7 figures +
+
+
+
+
+ + ☆ Faithful Path Language Modelling for Explainable Recommendation over + Knowledge Graph + + +
+ Path reasoning methods over knowledge graphs have gained popularity for their +potential to improve transparency in recommender systems. However, the +resulting models still rely on pre-trained knowledge graph embeddings, fail to +fully exploit the interdependence between entities and relations in the KG for +recommendation, and may generate inaccurate explanations. In this paper, we +introduce PEARLM, a novel approach that efficiently captures user behaviour and +product-side knowledge through language modelling. With our approach, knowledge +graph embeddings are directly learned from paths over the KG by the language +model, which also unifies entities and relations in the same optimisation +space. Constraints on the sequence decoding additionally guarantee path +faithfulness with respect to the KG. Experiments on two datasets show the +effectiveness of our approach compared to state-of-the-art baselines. Source +code and datasets: AVAILABLE AFTER GETTING ACCEPTED. + +
+
+
+
+
+ + ☆ Multiple Key-value Strategy in Recommendation Systems Incorporating + Large Language Model CIKM2023 + + +
+ Recommendation systems (RS) play a significant role in matching users' +information needs for Internet applications, and they usually utilize a +vanilla neural network as the backbone to handle embedding details. Recently, +large language models (LLMs) have exhibited emergent abilities and achieved +great breakthroughs in both the CV and NLP communities. Thus, it is logical to +integrate RS with LLMs more closely, which has become an emerging research +direction. Although some existing works have made contributions to this +issue, they mainly consider the single-key situation (e.g. historical +interactions), especially in sequential recommendation. The situation of +multiple key-value data is simply neglected. This significant scenario is +mainstream in practical applications, where the information of users (e.g. +age, occupation, etc.) and items (e.g. title, category, etc.) has more than one +key. Therefore, we aim to implement sequential recommendation based on +multiple key-value data by incorporating RS with an LLM. In particular, we +instruction-tune a prevalent open-source LLM (Llama 7B) in order to inject +domain knowledge of RS into the pre-trained LLM. Since we adopt multiple +key-value strategies, it is hard for the LLM to learn well among these keys. Thus +general shuffle and mask strategies, as an innovative manner of +data augmentation, are designed. To demonstrate the effectiveness of our approach, +extensive experiments are conducted on the popular and suitable dataset +MovieLens, which contains multiple key-value data. The experimental results +demonstrate that our approach can effectively address this +challenging issue. + +
+
+ comment: Accepted by CIKM2023 workshop at GenRec'23 +
+
+
+
+
+ + ☆ URL-BERT: Training Webpage Representations via Social Media Engagements + + +
+ Understanding and representing webpages is crucial to online social networks +where users may share and engage with URLs. Common language model (LM) encoders +such as BERT can be used to understand and represent the textual content of +webpages. However, these representations may not model thematic information of +web domains and URLs or accurately capture their appeal to social media users. +In this work, we introduce a new pre-training objective that can be used to +adapt LMs to understand URLs and webpages. Our proposed framework consists of +two steps: (1) scalable graph embeddings to learn shallow representations of +URLs based on user engagement on social media and (2) a contrastive objective +that aligns LM representations with the aforementioned graph-based +representation. We apply our framework to the multilingual version of BERT to +obtain the model URL-BERT. We experimentally demonstrate that our continued +pre-training approach improves webpage understanding on a variety of tasks and +Twitter internal and external benchmarks. + +
+
+
+
+
+ + ☆ On Surgical Fine-tuning for Language Encoders EMNLP 2023 + + +
+ Fine-tuning all the layers of a pre-trained neural language encoder (either +using all the parameters or using parameter-efficient methods) is often the +de-facto way of adapting it to a new task. We show evidence that for different +downstream language tasks, fine-tuning only a subset of layers is sufficient to +obtain performance that is close to and often better than fine-tuning all the +layers in the language encoder. We propose an efficient metric based on the +diagonal of the Fisher information matrix (FIM score), to select the candidate +layers for selective fine-tuning. We show, empirically on GLUE and SuperGLUE +tasks and across distinct language encoders, that this metric can effectively +select layers leading to a strong downstream performance. Our work highlights +that task-specific information corresponding to a given downstream task is +often localized within a few layers, and tuning only those is sufficient for +strong performance. Additionally, we demonstrate the robustness of the FIM +score to rank layers in a manner that remains constant during the optimization +process. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ The Word2vec Graph Model for Author Attribution and Genre Detection in + Literary Analysis + + +
+ Analyzing the writing styles of authors and articles is a key to supporting +various literary analyses such as author attribution and genre detection. Over +the years, rich sets of features that include stylometry, bag-of-words, and n-grams +have been widely used to perform such analysis. However, the effectiveness of +these features largely depends on the linguistic aspects of a particular +language and dataset-specific characteristics. Consequently, techniques based +on these feature sets cannot give desired results across domains. In this +paper, we propose a novel Word2vec graph based modeling of a document that can +rightly capture both context and style of the document. By using these Word2vec +graph based features, we train classifiers for the author attribution +and genre detection tasks. Our detailed experimental study with a comprehensive +set of literary writings shows the effectiveness of this method over +traditional feature based approaches. Our code and data are publicly available +at https://cutt.ly/svLjSgk + +
+
+ comment: 12 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ Sheaf Neural Networks for Graph-based Recommender Systems + + +
+ Recent progress in Graph Neural Networks has resulted in wide adoption by +many applications, including recommendation systems. The reason for Graph +Neural Networks' superiority over other approaches is that many problems in +recommendation systems can be naturally modeled as graphs, where nodes can be +either users or items and edges represent preference relationships. In current +Graph Neural Network approaches, nodes are represented with a static vector +learned at training time. This static vector might only be suitable to capture +some of the nuances of users or items they define. To overcome this limitation, +we propose using a recently proposed model inspired by category theory: Sheaf +Neural Networks. Sheaf Neural Networks, and its connected Laplacian, can +address the previous problem by associating every node (and edge) with a vector +space rather than a single vector. The vector space representation is richer +and allows picking the proper representation at inference time. This approach +can be generalized for different related tasks on graphs and achieves +state-of-the-art performance in terms of F1-Score@N in collaborative filtering +and Hits@20 in link prediction. For collaborative filtering, the approach is +evaluated on the MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a +5.4% improvement and on Book-Crossing with a 2.8% improvement, while for link +prediction on the ogbl-ddi dataset with a 1.6% refinement with respect to the +respective baselines. + +
+
+ comment: 9 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for + Inference Cost Reduction EMNLP 2023 + + +
+ Since ChatGPT released its API for public use, the number of applications +built on top of commercial large language models (LLMs) has increased exponentially. +One popular usage of such models is leveraging their in-context learning ability +and generating responses to user queries using knowledge obtained by +retrieval augmentation. One problem of deploying commercial retrieval-augmented +LLMs is the cost due to the additionally retrieved context that largely +increases the input token size of the LLMs. To mitigate this, we propose a +token compression scheme that includes two methods: summarization compression +and semantic compression. The first method applies a T5-based model that is +fine-tuned on datasets generated using self-instruct containing samples with +varying lengths and reduces token size by doing summarization. The second method +further compresses the token size by removing words with lower impact on the +semantics. In order to adequately evaluate the effectiveness of the proposed +methods, we propose and utilize a dataset called Food-Recommendation DB (FRDB) +focusing on food recommendation for women around pregnancy period or infants. +Our summarization compression can reduce 65% of the retrieval token size with +further 0.3% improvement on the accuracy; semantic compression provides a more +flexible way to trade off the token size with performance, for which we can +reduce the token size by 20% with only 1.6% of accuracy drop. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Retrieve Anything To Augment Large Language Models + + +
+ Large language models (LLMs) face significant challenges stemming from their +inherent limitations in knowledge, memory, alignment, and action. These +challenges cannot be addressed by LLMs alone, but should rely on assistance +from the external world, such as knowledge base, memory store, demonstration +examples, and tools. Retrieval augmentation stands as a vital mechanism for +bridging the gap between LLMs and the external assistance. However, +conventional methods encounter two pressing issues. On the one hand, the +general-purpose retrievers are not properly optimized for the retrieval +augmentation of LLMs. On the other hand, the task-specific retrievers lack the +required versatility, hindering their performance across the diverse retrieval +augmentation scenarios. + In this work, we present a novel approach, the LLM-Embedder, which +comprehensively supports the diverse retrieval augmentation needs of LLMs with +one unified embedding model. Training such a unified model is non-trivial, as +various retrieval tasks aim to capture distinct semantic relationships, often +subject to mutual interference. To address this challenge, we systematically +optimize our training methodology. This includes reward formulation based on +LLMs' feedback, the stabilization of knowledge distillation, multi-task +fine-tuning with explicit instructions, and homogeneous in-batch negative +sampling. These optimization strategies contribute to the outstanding empirical +performance of the LLM-Embedder. Notably, it yields remarkable enhancements in +retrieval augmentation for LLMs, surpassing both general-purpose and +task-specific retrievers in various evaluation scenarios. Our checkpoint and +source code are publicly available at +https://github.com/FlagOpen/FlagEmbedding. + +
+
+
+
+
+ + ♻ ☆ Duplicate Question Retrieval and Confirmation Time Prediction in + Software Communities + + +
+ Community Question Answering (CQA) in different domains is growing at a large +scale because of the availability of several platforms and huge shareable +information among users. With the rapid growth of such online platforms, a +massive amount of archived data makes it difficult for moderators to retrieve +possible duplicates for a new question and identify and confirm existing +question pairs as duplicates at the right time. This problem is even more +critical in CQAs corresponding to large software systems like askubuntu where +moderators need to be experts to comprehend something as a duplicate. Note that +the prime challenge in such CQA platforms is that the moderators are themselves +experts and are therefore usually extremely busy with their time being +extraordinarily expensive. To facilitate the task of the moderators, in this +work, we have tackled two significant issues for the askubuntu CQA platform: +(1) retrieval of duplicate questions given a new question and (2) duplicate +question confirmation time prediction. In the first task, we focus on +retrieving duplicate questions from a question pool for a particular newly +posted question. In the second task, we solve a regression problem to rank a +pair of questions that could potentially take a long time to get confirmed as +duplicates. For duplicate question retrieval, we propose a Siamese neural +network based approach by exploiting both text and network-based features, +which outperforms several state-of-the-art baseline techniques. Our method +outperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate +confirmation time prediction, we have used both the standard machine learning +models and neural network along with the text and graph-based features. We +obtain Spearman's rank correlation of 0.20 and 0.213 (statistically +significant) for text and graph based features respectively. + +
+
+ comment: Full paper accepted at ASONAM 2023: The 2023 IEEE/ACM International + Conference on Advances in Social Networks Analysis and Mining +
+
+
+
+
+
+
+
+ + Machine Learning 150 + +
+
+
+ + ☆ RDBench: ML Benchmark for Relational Databases + + +
+ Benefiting from high-quality datasets and standardized evaluation metrics, +machine learning (ML) has achieved sustained progress and widespread +applications. However, while applying machine learning to relational databases +(RDBs), the absence of a well-established benchmark remains a significant +obstacle to the development of ML. To address this issue, we introduce ML +Benchmark For Relational Databases (RDBench), a standardized benchmark that +aims to promote reproducible ML research on RDBs that include multiple tables. +RDBench offers diverse RDB datasets of varying scales, domains, and relational +structures, organized into 4 levels. Notably, to simplify the adoption of +RDBench for diverse ML domains, for any given database, RDBench exposes three +types of interfaces including tabular data, homogeneous graphs, and +heterogeneous graphs, sharing the same underlying task definition. For the +first time, RDBench enables meaningful comparisons between ML methods from +diverse domains, ranging from XGBoost to Graph Neural Networks, under RDB +prediction tasks. We design multiple classification and regression tasks for +each RDB dataset and report averaged results over the same dataset, further +enhancing the robustness of the experimental findings. RDBench is implemented +with DBGym, a user-friendly platform for ML research and application on +databases, enabling benchmarking new ML methods with RDBench at ease. + +
+
+
+
+
+ + ☆ Proposal-Contrastive Pretraining for Object Detection from Fewer Data ICLR 2023 + + +
+ The use of pretrained deep neural networks represents an attractive way to +achieve strong results with few data available. When specialized in dense +problems such as object detection, learning local rather than global +information in images has proven to be more efficient. However, for +unsupervised pretraining, the popular contrastive learning requires a large +batch size and, therefore, a lot of resources. To address this problem, we are +interested in transformer-based object detectors that have recently gained +traction in the community with good performance and with the particularity of +generating many diverse object proposals. + In this work, we present Proposal Selection Contrast (ProSeCo), a novel +unsupervised overall pretraining approach that leverages this property. ProSeCo +uses the large number of object proposals generated by the detector for +contrastive learning, which allows the use of a smaller batch size, combined +with object-level features to learn local information in the images. To improve +the effectiveness of the contrastive loss, we introduce the object location +information in the selection of positive examples to take into account multiple +overlapping object proposals. When reusing pretrained backbone, we advocate for +consistency in learning local information between the backbone and the +detection head. + We show that our method outperforms state of the art in unsupervised +pretraining for object detection on standard and novel benchmarks in learning +with fewer data. + +
+
+ comment: Published as a conference paper at ICLR 2023 +
+
+
+
+
+ + ☆ Discrete Diffusion Language Modeling by Estimating the Ratios of the + Data Distribution + + +
+ Despite their groundbreaking performance for many generative modeling tasks, +diffusion models have fallen short on discrete data domains such as natural +language. Crucially, standard diffusion models rely on the well-established +theory of score matching, but efforts to generalize this to discrete structures +have not yielded the same empirical gains. In this work, we bridge this gap by +proposing score entropy, a novel discrete score matching loss that is more +stable than existing methods, forms an ELBO for maximum likelihood training, +and can be efficiently optimized with a denoising variant. We scale our Score +Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2, +achieving highly competitive likelihoods while also introducing distinct +algorithmic advantages. In particular, when comparing similarly sized SEDD and +GPT-2 models, SEDD attains comparable perplexities (normally within $+10\%$ of +and sometimes outperforming the baseline). Furthermore, SEDD models learn a +more faithful sequence distribution (around $4\times$ better compared to GPT-2 +models with ancestral sampling as measured by large models), can trade off +compute for generation quality (needing only $16\times$ fewer network +evaluations to match GPT-2), and enables arbitrary infilling beyond the +standard left to right prompting. + +
+
+ comment: 30 pages +
+
+
+
+
+ + ☆ PERF: Panoramic Neural Radiance Field from a Single Panorama + + +
+ Neural Radiance Field (NeRF) has achieved substantial progress in novel view +synthesis given multi-view images. Recently, some works have attempted to train +a NeRF from a single image with 3D priors. They mainly focus on a limited field +of view and there are few invisible occlusions, which greatly limits their +scalability to real-world 360-degree panoramic scenarios with large-size +occlusions. In this paper, we present PERF, a 360-degree novel view synthesis +framework that trains a panoramic neural radiance field from a single panorama. +Notably, PERF allows 3D roaming in a complex scene without expensive and +tedious image collection. To achieve this goal, we propose a novel +collaborative RGBD inpainting method and a progressive inpainting-and-erasing +method to lift up a 360-degree 2D scene to a 3D scene. Specifically, we first +predict a panoramic depth map as initialization given a single panorama, and +reconstruct visible 3D regions with volume rendering. Then we introduce a +collaborative RGBD inpainting approach into a NeRF for completing RGB images +and depth maps from random views, which is derived from an RGB Stable Diffusion +model and a monocular depth estimator. Finally, we introduce an +inpainting-and-erasing strategy to avoid inconsistent geometry between a +newly-sampled view and reference views. The two components are integrated into +the learning of NeRFs in a unified optimization framework and achieve promising +results. Extensive experiments on Replica and a new dataset PERF-in-the-wild +demonstrate the superiority of our PERF over state-of-the-art methods. Our PERF +can be widely used for real-world applications, such as panorama-to-3D, +text-to-3D, and 3D scene stylization applications. Project page and code are +available at https://perf-project.github.io/. + +
+
+ comment: Project page and code: https://perf-project.github.io/ +
+
+
+
+
+ + ☆ TD-MPC2: Scalable, Robust World Models for Continuous Control + + +
+ TD-MPC is a model-based reinforcement learning (RL) algorithm that performs +local trajectory optimization in the latent space of a learned implicit +(decoder-free) world model. In this work, we present TD-MPC2: a series of +improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves +significantly over baselines across 104 online RL tasks spanning 4 diverse task +domains, achieving consistently strong results with a single set of +hyperparameters. We further show that agent capabilities increase with model +and data size, and successfully train a single 317M parameter agent to perform +80 tasks across multiple task domains, embodiments, and action spaces. We +conclude with an account of lessons, opportunities, and risks associated with +large TD-MPC2 agents. Explore videos, models, data, code, and more at +https://nicklashansen.github.io/td-mpc2 + +
+
+ comment: Explore videos, models, data, code, and more at + https://nicklashansen.github.io/td-mpc2 +
+
+
+
+
+ + ☆ Deep machine learning for meteor monitoring: advances with transfer + learning and gradient-weighted class activation mapping + + +
+ In recent decades, the use of optical detection systems for meteor studies +has increased dramatically, resulting in huge amounts of data being analyzed. +Automated meteor detection tools are essential for studying the continuous +meteoroid incoming flux, recovering fresh meteorites, and achieving a better +understanding of our Solar System. Concerning meteor detection, distinguishing +false positives between meteor and non-meteor images has traditionally been +performed by hand, which is significantly time-consuming. To address this +issue, we developed a fully automated pipeline that uses Convolutional Neural +Networks (CNNs) to classify candidate meteor detections. Our new method is able +to detect meteors even in images that contain static elements such as clouds, +the Moon, and buildings. To accurately locate the meteor within each frame, we +employ the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. +This method facilitates the identification of the region of interest by +multiplying the activations from the last convolutional layer with the average +of the gradients across the feature map of that layer. By combining these +findings with the activation map derived from the first convolutional layer, we +effectively pinpoint the most probable pixel location of the meteor. We trained +and evaluated our model on a large dataset collected by the Spanish Meteor +Network (SPMN) and achieved a precision of 98\%. Our new methodology presented +here has the potential to reduce the workload of meteor scientists and station +operators and improve the accuracy of meteor tracking and classification. + +
+
+ comment: Accepted in Planetary and Space Science +
+
+
+
+
+ + ☆ CATE Lasso: Conditional Average Treatment Effect Estimation with + High-Dimensional Linear Regression + + +
+ In causal inference about two treatments, Conditional Average Treatment +Effects (CATEs) play an important role as a quantity representing an +individualized causal effect, defined as a difference between the expected +outcomes of the two treatments conditioned on covariates. This study assumes +two linear regression models between a potential outcome and covariates of the +two treatments and defines CATEs as a difference between the linear regression +models. Then, we propose a method for consistently estimating CATEs even under +high-dimensional and non-sparse parameters. In our study, we demonstrate that +desirable theoretical properties, such as consistency, remain attainable even +without assuming sparsity explicitly if we assume a weaker assumption called +implicit sparsity originating from the definition of CATEs. In this assumption, +we suppose that parameters of linear models in potential outcomes can be +divided into treatment-specific and common parameters, where the +treatment-specific parameters take different values between each linear +regression model, while the common parameters remain identical. Thus, in a +difference between two linear regression models, the common parameters +disappear, leaving only differences in the treatment-specific parameters. +Consequently, the non-zero parameters in CATEs correspond to the differences in +the treatment-specific parameters. Leveraging this assumption, we develop a +Lasso regression method specialized for CATE estimation and show that the +estimator is consistent. Finally, we confirm the soundness of the proposed +method by simulation studies. + +
+
+
+
+
+ + ☆ Learning COVID-19 Regional Transmission Using Universal Differential + Equations in a SIR model + + +
+ Highly interconnected societies make it difficult to model the spread of infectious +diseases such as COVID-19. Single-region SIR models fail to account for +incoming forces of infection and expanding them to a large number of +interacting regions involves many assumptions that do not hold in the real +world. We propose using Universal Differential Equations (UDEs) to capture the +influence of neighboring regions and improve the model's predictions in a +combined SIR+UDE model. UDEs are differential equations totally or partially +defined by a deep neural network (DNN). We add a term to the SIR +equations, given by a DNN that learns the incoming force of infection from +the other regions. The learning is performed using automatic differentiation +and gradient descent to approximate the change in the target system caused by the +state of the neighboring regions. We compared the proposed model using a +simulated COVID-19 outbreak against a single-region SIR and a fully data-driven +model composed only of a DNN. The proposed UDE+SIR model generates predictions +that capture the outbreak dynamics more accurately, but a decay in performance +is observed at the last stages of the outbreak. The single-area SIR and the +fully data-driven approach do not capture the proper dynamics accurately. Once +the predictions were obtained, we employed the SINDy algorithm to substitute +the DNN with a regression, removing the black box element of the model with no +considerable increase in the error levels. + +
+
+ comment: 18 pages +
+
+
+
+
+ + ☆ Language Agnostic Code Embeddings + + +
+ Recently, code language models have achieved notable advancements in +addressing a diverse array of essential code comprehension and generation +tasks. Yet, the field lacks a comprehensive deep dive and understanding of the +code embeddings of multilingual code models. In this paper, we present a +comprehensive study on multilingual code embeddings, focusing on the +cross-lingual capabilities of these embeddings across different programming +languages. Through probing experiments, we demonstrate that code embeddings +comprise two distinct components: one deeply tied to the nuances and syntax of +a specific language, and the other remaining agnostic to these details, +primarily focusing on semantics. Further, we show that when we isolate and +eliminate this language-specific component, we witness significant improvements +in downstream code retrieval tasks, leading to an absolute increase of up to ++17 in the Mean Reciprocal Rank (MRR). + +
+
+
+
+
+ + ☆ From Molecules to Materials: Pre-training Large Generalizable Models for + Atomic Property Prediction + + +
+ Foundation models have been transformational in machine learning fields such +as natural language processing and computer vision. Similar success in atomic +property prediction has been limited due to the challenges of training +effective models across multiple chemical domains. To address this, we +introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training +strategy that simultaneously trains on multiple datasets from different +chemical domains, treating each dataset as a unique pre-training task within a +multi-task framework. Our combined training dataset consists of $\sim$120M +systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and +generalization by fine-tuning over a diverse set of downstream tasks and +datasets including: QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP +demonstrates an average improvement of 59% over training from scratch, and +matches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the +potential of pre-training strategies that utilize diverse data to advance +property prediction across chemical domains, especially for low-data tasks. + +
+
+
+
+
+ + ☆ QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models + + +
+ Mixture-of-Experts (MoE) architectures offer a general solution to the high +inference costs of large language models (LLMs) via sparse routing, bringing +faster and more accurate models, at the cost of massive parameter counts. For +example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, +requiring 3.2TB of accelerator memory to run efficiently, which makes practical +deployment challenging and expensive. In this paper, we present a solution to +this memory problem, in form of a new compression and execution framework +called QMoE. Specifically, QMoE consists of a scalable algorithm which +accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, +in a custom format co-designed with bespoke GPU decoding kernels to facilitate +efficient end-to-end compressed inference, with minor runtime overheads +relative to uncompressed execution. Concretely, QMoE can compress the 1.6 +trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x +compression, 0.8 bits per parameter) at only minor accuracy loss, in less than +a day on a single GPU. This enables, for the first time, the execution of a +trillion-parameter model on affordable commodity hardware, like a single server +with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead +relative to ideal uncompressed inference. The source code and compressed models +are available at github.com/IST-DASLab/qmoe. + +
+
+
+
+
+ + ☆ Learning Independent Program and Architecture Representations for + Generalizable Performance Modeling + + +
+ This paper proposes PerfVec, a novel deep learning-based performance modeling +framework that learns high-dimensional, independent/orthogonal program and +microarchitecture representations. Once learned, a program representation can +be used to predict its performance on any microarchitecture, and likewise, a +microarchitecture representation can be applied in the performance prediction +of any program. Additionally, PerfVec yields a foundation model that captures +the performance essence of instructions, which can be directly used by +developers in numerous performance modeling related tasks without incurring its +training cost. The evaluation demonstrates that PerfVec is more general, +efficient, and accurate than previous approaches. + +
+
+
+
+
+ + ☆ Covert Planning against Imperfect Observers + + +
+ Covert planning refers to a class of constrained planning problems where an +agent aims to accomplish a task with minimal information leaked to a passive +observer to avoid detection. However, existing methods of covert planning often +consider deterministic environments or do not exploit the observer's imperfect +information. This paper studies how covert planning can leverage the coupling +of stochastic dynamics and the observer's imperfect observation to achieve +optimal task performance without being detected. Specifically, we employ a +Markov decision process to model the interaction between the agent and its +stochastic environment, and a partial observation function to capture the +leaked information to a passive observer. Assuming the observer employs +hypothesis testing to detect if the observation deviates from a nominal policy, +the covert planning agent aims to maximize the total discounted reward while +keeping the probability of being detected as an adversary below a given +threshold. We prove that finite-memory policies are more powerful than +Markovian policies in covert planning. Then, we develop a primal-dual proximal +policy gradient method with a two-time-scale update to compute a (locally) +optimal covert policy. We demonstrate the effectiveness of our methods using a +stochastic gridworld example. Our experimental results illustrate that the +proposed method computes a policy that maximizes the adversary's expected +reward without violating the detection constraint, and empirically demonstrates +how the environmental noises can influence the performance of the covert +policies. + +
+
+
+
+
+ + ☆ Improving a Named Entity Recognizer Trained on Noisy Data with a Few + Clean Instances + + +
+ To achieve state-of-the-art performance, one still needs to train NER models +on large-scale, high-quality annotated data, an asset that is both costly and +time-intensive to accumulate. In contrast, real-world applications often resort +to massive low-quality labeled data through non-expert annotators via +crowdsourcing and external knowledge bases via distant supervision as a +cost-effective alternative. However, these annotation methods result in noisy +labels, which in turn lead to a notable decline in performance. Hence, we +propose to denoise the noisy NER data with guidance from a small set of clean +instances. Along with the main NER model we train a discriminator model and use +its outputs to recalibrate the sample weights. The discriminator is capable of +detecting both span and category errors with different discriminative prompts. +Results on public crowdsourcing and distant supervision datasets show that the +proposed method can consistently improve performance with a small guidance set. + +
+
+ comment: 14 pages +
+
+
+
+
+ + ☆ Detecting Pretraining Data from Large Language Models + + +
+ Although large language models (LLMs) are widely deployed, the data used to +train them is rarely disclosed. Given the incredible scale of this data, up to +trillions of tokens, it is all but certain that it includes potentially +problematic text such as copyrighted materials, personally identifiable +information, and test data for widely reported reference benchmarks. However, +we currently have no way to know which data of these types is included or in +what proportions. In this paper, we study the pretraining data detection +problem: given a piece of text and black-box access to an LLM without knowing +the pretraining data, can we determine if the model was trained on the provided +text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that +uses data created before and after model training to support gold truth +detection. We also introduce a new detection method Min-K% Prob based on a +simple hypothesis: an unseen example is likely to contain a few outlier words +with low probabilities under the LLM, while a seen example is less likely to +have words with such low probabilities. Min-K% Prob can be applied without any +knowledge about the pretraining corpus or any additional training, departing +from previous detection methods that require training a reference model on data +that is similar to the pretraining data. Moreover, our experiments demonstrate +that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous +methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book +detection, and contaminated downstream example detection, and find it a +consistently effective solution. + +
+
+
+
+
+ + ☆ The GOOSE Dataset for Perception in Unstructured Environments + + +
+ The potential for deploying autonomous systems can be significantly increased +by improving the perception and interpretation of the environment. However, the +development of deep learning-based techniques for autonomous systems in +unstructured outdoor environments poses challenges due to limited data +availability for training and testing. To address this gap, we present the +German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset +specifically designed for unstructured outdoor environments. The GOOSE dataset +incorporates 10 000 labeled pairs of images and point clouds, which are +utilized to train a range of state-of-the-art segmentation models on both image +and point cloud data. We open source the dataset, along with an ontology for +unstructured terrain, as well as dataset standards and guidelines. This +initiative aims to establish a common framework, enabling the seamless +inclusion of existing datasets and a fast way to enhance the perception +capabilities of various robots operating in unstructured environments. The +dataset, pre-trained models for offroad perception, and additional +documentation can be found at https://goose-dataset.de/. + +
+
+ comment: Preprint; Submitted to IEEE for review +
+
+
+
+
+ + ☆ The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing + & Attribution in AI + + +
+ The race to train language models on vast, diverse, and inconsistently +documented datasets has raised pressing concerns about the legal and ethical +risks for practitioners. To remedy these practices threatening data +transparency and understanding, we convene a multi-disciplinary effort between +legal and machine learning experts to systematically audit and trace 1800+ text +datasets. We develop tools and standards to trace the lineage of these +datasets, from their source, creators, series of license conditions, +properties, and subsequent use. Our landscape analysis highlights the sharp +divides in composition and focus of commercially open vs closed datasets, with +closed datasets monopolizing important categories: lower resource languages, +more creative tasks, richer topic variety, newer and more synthetic training +data. This points to a deepening divide in the types of data that are made +available under different license conditions, and heightened implications for +jurisdictional legal interpretations of copyright and fair use. We also observe +frequent miscategorization of licenses on widely used dataset hosting sites, +with license omission of 72%+ and error rates of 50%+. This points to a crisis +in misattribution and informed use of the most popular datasets driving many +recent breakthroughs. As a contribution to ongoing improvements in dataset +transparency and responsible use, we release our entire audit, with an +interactive UI, the Data Provenance Explorer, which allows practitioners to +trace and filter on data provenance for the most popular open source finetuning +data collections: www.dataprovenance.org. + +
+
+ comment: 30 pages (18 main), 6 figures, 5 tables +
+
+
+
+
+ + ☆ The Simplest Inflationary Potentials + + +
+ Inflation is a highly favoured theory for the early Universe. It is +compatible with current observations of the cosmic microwave background and +large scale structure and is a driver in the quest to detect primordial +gravitational waves. It is also, given the current quality of the data, highly +under-determined with a large number of candidate implementations. We use a new +method in symbolic regression to generate all possible simple scalar field +potentials for one of two possible basis sets of operators. Treating these as +single-field, slow-roll inflationary models we then score them with an +information-theoretic metric ("minimum description length") that quantifies +their efficiency in compressing the information in the Planck data. We explore +two possible priors on the parameter space of potentials, one related to the +functions' structural complexity and one that uses a Katz back-off language +model to prefer functions that may be theoretically motivated. This enables us +to identify the inflaton potentials that optimally balance simplicity with +accuracy at explaining the Planck data, which may subsequently find theoretical +motivation. Our exploratory study opens the door to extraction of fundamental +physics directly from data, and may be augmented with more refined theoretical +priors in the quest for a complete understanding of the early Universe. + +
+
+ comment: 13+4 pages, 4 figures; submitted to Physical Review D +
+
+
+
+
+ + ☆ Kiki or Bouba? Sound Symbolism in Vision-and-Language Models NeurIPS 2023 + + +
+ Although the mapping between sound and meaning in human language is assumed +to be largely arbitrary, research in cognitive science has shown that there are +non-trivial correlations between particular sounds and meanings across +languages and demographic groups, a phenomenon known as sound symbolism. Among +the many dimensions of meaning, sound symbolism is particularly salient and +well-demonstrated with regards to cross-modal associations between language and +the visual domain. In this work, we address the question of whether sound +symbolism is reflected in vision-and-language models such as CLIP and Stable +Diffusion. Using zero-shot knowledge probing to investigate the inherent +knowledge of these models, we find strong evidence that they do show this +pattern, paralleling the well-known kiki-bouba effect in psycholinguistics. Our +work provides a novel method for demonstrating sound symbolism and +understanding its nature using computational tools. Our code will be made +publicly available. + +
+
+ comment: Accepted to NeurIPS 2023 (spotlight). Project webpage: + https://kiki-bouba.github.io/ +
+
+
+
+
+ + ☆ Multi-scale Diffusion Denoised Smoothing NeurIPS 2023 + + +
+ Along with recent diffusion models, randomized smoothing has become one of a +few tangible approaches that offer adversarial robustness to models at scale, +e.g., large pre-trained models. Specifically, one can perform +randomized smoothing on any classifier via a simple "denoise-and-classify" +pipeline, so-called denoised smoothing, given that an accurate denoiser is +available, such as a diffusion model. In this paper, we investigate the +trade-off between accuracy and certified robustness of denoised smoothing: for +example, we ask which representation of a diffusion model would maximize +the certified robustness of denoised smoothing. We consider a new objective +that aims at collective robustness of smoothed classifiers across multiple noise +levels with a shared diffusion model, which also suggests a new way to compensate +for the accuracy cost that randomized smoothing pays for its certified robustness. This +objective motivates us to fine-tune a diffusion model (a) to perform consistent +denoising whenever the original image is recoverable, but (b) to generate +rather diverse outputs otherwise. Our experiments show that this fine-tuning +scheme for diffusion models, combined with multi-scale smoothing, enables +strong certified robustness at the highest noise level while keeping +accuracy closer to that of non-smoothed classifiers. + +
+
+ comment: 24 pages; NeurIPS 2023; Code is available at + https://github.com/jh-jeong/smoothing-multiscale +
+
+
+
+
+ + ☆ MixerFlow for Image Modelling + + +
+ Normalising flows are statistical models that transform a complex density +into a simpler density through the use of bijective transformations enabling +both density estimation and data generation from a single model. In the context +of image modelling, the predominant choice has been the Glow-based +architecture, whereas alternative architectures remain largely unexplored in +the research community. In this work, we propose a novel architecture called +MixerFlow, based on the MLP-Mixer architecture, further unifying the generative +and discriminative modelling architectures. MixerFlow offers an effective +mechanism for weight sharing for flow-based models. Our results demonstrate +better density estimation on image datasets under a fixed computational budget +and good scaling as the image resolution increases, making MixerFlow a powerful +yet simple alternative to the Glow-based architectures. We also show that +MixerFlow provides more informative embeddings than Glow-based architectures. + +
+
+
+
+
+ + ☆ ConvNets Match Vision Transformers at Scale + + +
+ Many researchers believe that ConvNets perform well on small or moderately +sized datasets, but are not competitive with Vision Transformers when given +access to datasets on the web-scale. We challenge this belief by evaluating a +performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset +of images often used for training foundation models. We consider pre-training +compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a +series of networks of increasing depth and width from the NFNet model family. +We observe a log-log scaling law between held out loss and compute budget. +After fine-tuning on ImageNet, NFNets match the reported performance of Vision +Transformers with comparable compute budgets. Our strongest fine-tuned model +achieves a Top-1 accuracy of 90.4%. + +
+
+
+
+
+ + ☆ SuperHF: Supervised Iterative Learning from Human Feedback NeurIPS 2023 + + +
+ While large language models demonstrate remarkable capabilities, they often +present challenges in terms of safety, alignment with human values, and +stability during training. Here, we focus on two prevalent methods used to +align these models, Supervised Fine-Tuning (SFT) and Reinforcement Learning +from Human Feedback (RLHF). SFT is simple and robust, powering a host of +open-source models, while RLHF is a more sophisticated method used in top-tier +models like ChatGPT but also suffers from instability and susceptibility to +reward hacking. We propose a novel approach, Supervised Iterative Learning from +Human Feedback (SuperHF), which seeks to leverage the strengths of both +methods. Our hypothesis is two-fold: that the reward model used in RLHF is +critical for efficient data use and model generalization and that the use of +Proximal Policy Optimization (PPO) in RLHF may not be necessary and could +contribute to instability issues. SuperHF replaces PPO with a simple supervised +loss and a Kullback-Leibler (KL) divergence prior. It creates its own training +data by repeatedly sampling a batch of model outputs and filtering them through +the reward model in an online learning regime. We then break down the reward +optimization problem into three components: robustly optimizing the training +rewards themselves, preventing reward hacking (exploitation of the reward model +that degrades model performance) as measured by a novel METEOR similarity +metric, and maintaining good performance on downstream evaluations. Our +experimental results show SuperHF exceeds PPO-based RLHF on the training +objective, easily and favorably trades off high reward with low reward hacking, +improves downstream calibration, and performs the same on our GPT-4 based +qualitative evaluation scheme all the while being significantly simpler to +implement, highlighting SuperHF's potential as a competitive language model +alignment technique. + +
+
+ comment: Accepted to the Socially Responsible Language Modelling Research + (SoLaR) workshop at NeurIPS 2023 +
+
+
+
+
+ + ☆ PROMINET: Prototype-based Multi-View Network for Interpretable Email + Response Prediction EMNLP 2023 + + +
+ Email is a widely used tool for business communication, and email marketing +has emerged as a cost-effective strategy for enterprises. While previous +studies have examined factors affecting email marketing performance, limited +research has focused on understanding email response behavior by considering +email content and metadata. This study proposes a Prototype-based Multi-view +Network (PROMINET) that incorporates semantic and structural information from +email data. By utilizing prototype learning, the PROMINET model generates +latent exemplars, enabling interpretable email response prediction. The model +maps learned semantic and structural exemplars to observed samples in the +training data at different levels of granularity, such as document, sentence, +or phrase. The approach is evaluated on two real-world email datasets: the +Enron corpus and an in-house Email Marketing corpus. Experimental results +demonstrate that the PROMINET model outperforms baseline models, achieving a +~3% improvement in F1 score on both datasets. Additionally, the model provides +interpretability through prototypes at different granularity levels while +maintaining comparable performance to non-interpretable models. The learned +prototypes also show potential for generating suggestions to enhance email text +editing and improve the likelihood of effective email responses. This research +contributes to enhancing sender-receiver communication and customer engagement +in email interactions. + +
+
+ comment: Accepted at EMNLP 2023 (industry) +
+
+
+
+
+ + ☆ Simple, Scalable and Effective Clustering via One-Dimensional + Projections NeurIPS 2023 + + +
+ Clustering is a fundamental problem in unsupervised machine learning with +many applications in data analysis. Popular clustering algorithms such as +Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering +$n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix +$X$) into $k$ clusters. In applications with moderate to large $k$, the +multiplicative $k$ factor can become very expensive. We introduce a simple +randomized clustering algorithm that provably runs in expected time +$O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the +total number of non-zero entries in the input dataset $X$, which is upper +bounded by $nd$ and can be significantly smaller for sparse datasets. We prove +that our algorithm achieves approximation ratio $\smash{\widetilde{O}(k^4)}$ on +any input dataset for the $k$-means objective. We also believe that our +theoretical analysis is of independent interest, as we show that the +approximation ratio of a $k$-means algorithm is approximately preserved under a +class of projections and that $k$-means++ seeding can be implemented in +expected $O(n \log n)$ time in one dimension. Finally, we show experimentally +that our clustering algorithm gives a new tradeoff between running time and +cluster quality compared to previous state-of-the-art methods for these tasks. + +
+
+ comment: 41 pages, 6 figures, to appear in NeurIPS 2023 +
+
+
+
+
+ + ☆ Interferometric Neural Networks + + +
+ On the one hand, artificial neural networks have many successful applications +in the field of machine learning and optimization. On the other hand, +interferometers are integral parts of any field that deals with waves such as +optics, astronomy, and quantum physics. Here, we introduce neural networks +composed of interferometers and then build generative adversarial networks from +them. Our networks do not have any classical layer and can be realized on +quantum computers or photonic chips. We demonstrate their applicability for +combinatorial optimization, image classification, and image generation. For +combinatorial optimization, our network consistently converges to the global +optimum or remains within a narrow range of it. In multi-class image +classification tasks, our networks achieve accuracies of 93% and 83%. Lastly, +we show their capability to generate images of digits from 0 to 9 as well as +human faces. + +
+
+ comment: 11 pages +
+
+
+
+
+ + ☆ Stochastic Latent Transformer: Efficient Modelling of Stochastically + Forced Zonal Jets + + +
+ We introduce the 'Stochastic Latent Transformer', a probabilistic deep +learning approach for efficient reduced-order modelling of stochastic partial +differential equations (SPDEs). Despite recent advances in deep learning for +fluid mechanics, limited research has explored modelling stochastically driven +flows - which play a crucial role in understanding a broad spectrum of +phenomena, from jets on giant planets to ocean circulation and the variability +of midlatitude weather. The model architecture consists of a +stochastically-forced transformer, paired with a translation-equivariant +autoencoder, that we demonstrate is capable of reproducing system dynamics +across various integration periods. We demonstrate its effectiveness applied to +a well-researched zonal jet system, with the neural network achieving a +five-order-of-magnitude speedup compared to numerical integration. This +facilitates the cost-effective generation of large ensembles, enabling the +exploration of statistical questions concerning probabilities of spontaneous +transition events. + +
+
+ comment: 23 pages, 9 figures +
+
+
+
+
+ + ☆ MultiPrompter: Cooperative Prompt Optimization with Multi-Agent + Reinforcement Learning + + +
+ Recently, there has been an increasing interest in automated prompt +optimization based on reinforcement learning (RL). This approach offers +important advantages, such as generating interpretable prompts and being +compatible with black-box foundation models. However, the substantial prompt +space size poses challenges for RL-based methods, often leading to suboptimal +policy convergence. This paper introduces MultiPrompter, a new framework that +views prompt optimization as a cooperative game between prompters which take +turns composing a prompt together. Our cooperative prompt optimization +effectively reduces the problem size and helps prompters learn optimal prompts. +We test our method on the text-to-image task and show its ability to generate +higher-quality images than baselines. + +
+
+
+
+
+ + ☆ AI Hazard Management: A framework for the systematic management of root + causes for AI risks + + +
+ Recent advancements in the field of Artificial Intelligence (AI) establish +the basis to address challenging tasks. However, with the integration of AI, +new risks arise. Therefore, to benefit from its advantages, it is essential to +adequately handle the risks associated with AI. Existing risk management +processes in related fields, such as software systems, need to sufficiently +consider the specifics of AI. A key challenge is to systematically and +transparently identify and address AI risks' root causes - also called AI +hazards. This paper introduces the AI Hazard Management (AIHM) framework, which +provides a structured process to systematically identify, assess, and treat AI +hazards. The proposed process is conducted in parallel with the development to +ensure that any AI hazard is captured at the earliest possible stage of the AI +system's life cycle. In addition, to ensure the AI system's auditability, the +proposed framework systematically documents evidence that the potential impact +of identified AI hazards could be reduced to a tolerable level. The framework +builds upon an AI hazard list from a comprehensive state-of-the-art analysis. +Also, we provide a taxonomy that supports the optimal treatment of the +identified AI hazards. Additionally, we illustrate how the AIHM framework can +increase the overall quality of a power grid AI use case by systematically +reducing the impact of identified hazards to an acceptable level. + +
+
+
+
+
+ + ☆ Wasserstein Gradient Flow over Variational Parameter Space for + Variational Inference + + +
+ Variational inference (VI) can be cast as an optimization problem in which +the variational parameters are tuned to closely align a variational +distribution with the true posterior. The optimization task can be approached +through vanilla gradient descent in black-box VI or natural-gradient descent in +natural-gradient VI. In this work, we reframe VI as the optimization of an +objective that concerns probability distributions defined over a +\textit{variational parameter space}. Subsequently, we propose Wasserstein +gradient descent for tackling this optimization problem. Notably, the +optimization techniques, namely black-box VI and natural-gradient VI, can be +reinterpreted as specific instances of the proposed Wasserstein gradient +descent. To enhance the efficiency of optimization, we develop practical +methods for numerically solving the discrete gradient flows. We validate the +effectiveness of the proposed methods through empirical experiments on a +synthetic dataset, supplemented by theoretical analyses. + +
+
+
+
+
+ + ☆ Interpretable time series neural representation for classification + purposes + + +
+ Deep learning has made significant advances in creating efficient +representations of time series data by automatically identifying complex +patterns. However, these approaches lack interpretability, as the time series +is transformed into a latent vector that is not easily interpretable. On the +other hand, Symbolic Aggregate approximation (SAX) methods allow the creation +of symbolic representations that can be interpreted but do not capture complex +patterns effectively. In this work, we propose a set of requirements for a +neural representation of univariate time series to be interpretable. We propose +a new unsupervised neural architecture that meets these requirements. The +proposed model produces consistent, discrete, interpretable, and visualizable +representations. The model is learned independently of any downstream tasks in +an unsupervised setting to ensure robustness. As a demonstration of the +effectiveness of the proposed model, we propose experiments on classification +tasks using UCR archive datasets. The obtained results are extensively compared +to other interpretable models and state-of-the-art neural representation +learning models. The experiments show that the proposed model yields, on +average, better results than other interpretable approaches on multiple +datasets. We also present qualitative experiments to assess the interpretability +of the approach. + +
+
+ comment: International Conference on Data Science and Advanced Analytics + (DSAA) 2023 +
+
+
+
+
+ + ☆ From Pointwise to Powerhouse: Initialising Neural Networks with + Generative Models + + +
+ Traditional initialisation methods, e.g. He and Xavier, have been effective +in avoiding the problem of vanishing or exploding gradients in neural networks. +However, they only use simple pointwise distributions, which model +one-dimensional variables. Moreover, they ignore most information about the +architecture and disregard past training experiences. These limitations can be +overcome by employing generative models for initialisation. In this paper, we +introduce two groups of new initialisation methods. First, we locally +initialise weight groups by employing variational autoencoders. Secondly, we +globally initialise full weight sets by employing graph hypernetworks. We +thoroughly evaluate the impact of the employed generative models on +state-of-the-art neural networks in terms of accuracy, convergence speed and +ensembling. Our results show that global initialisations result in higher +accuracy and faster initial convergence speed. However, the implementation +through graph hypernetworks leads to diminished ensemble performance on out of +distribution data. To counteract, we propose a modification called noise graph +hypernetwork, which encourages diversity in the produced ensemble members. +Furthermore, our approach might be able to transfer learned knowledge to +different image distributions. Our work provides insights into the potential, +the trade-offs and possible modifications of these new initialisation methods. + +
+
+
+
+
+ + ☆ Learning-based adaption of robotic friction models + + +
+ In the Fourth Industrial Revolution, wherein artificial intelligence and the +automation of machines occupy a central role, the deployment of robots is +indispensable. However, the manufacturing process using robots, especially in +collaboration with humans, is highly intricate. In particular, modeling the +friction torque in robotic joints is a longstanding problem due to the lack of +a good mathematical description. This motivates the usage of data-driven +methods in recent works. However, model-based and data-driven models often +exhibit limitations in their ability to generalize beyond the specific dynamics +they were trained on, as we demonstrate in this paper. To address this +challenge, we introduce a novel approach based on residual learning, which aims +to adapt an existing friction model to new dynamics using as little data as +possible. We validate our approach by training a base neural network on a +symmetric friction data set to learn an accurate relation between the velocity +and the friction torque. Subsequently, to adapt to more complex asymmetric +settings, we train a second network on a small dataset, focusing on predicting +the residual of the initial network's output. By combining the output of both +networks in a suitable manner, our proposed estimator outperforms the +conventional model-based approach and the base neural network significantly. +Furthermore, we evaluate our method on trajectories involving external loads +and still observe a substantial improvement, approximately 60-70\%, over the +conventional approach. Our method does not rely on data with external load +during training, eliminating the need for external torque sensors. This +demonstrates the generalization capability of our approach, even with a small +amount of data (only 43 seconds of robot movement), enabling adaptation to +diverse scenarios based on prior knowledge about friction in different +settings. + +
+
+
+
+
+ + ☆ Dynamics Generalisation in Reinforcement Learning via Adaptive + Context-Aware Policies NeurIPS 2023 + + +
+ While reinforcement learning has achieved remarkable successes in several +domains, its real-world application is limited due to many methods failing to +generalise to unfamiliar conditions. In this work, we consider the problem of +generalising to new transition dynamics, corresponding to cases in which the +environment's response to the agent's actions differs. For example, the +gravitational force exerted on a robot depends on its mass and changes the +robot's mobility. Consequently, in such cases, it is necessary to condition an +agent's actions on extrinsic state information and pertinent contextual +information reflecting how the environment responds. While the need for +context-sensitive policies has been established, the manner in which context is +incorporated architecturally has received less attention. Thus, in this work, +we present an investigation into how context information should be incorporated +into behaviour learning to improve generalisation. To this end, we introduce a +neural network architecture, the Decision Adapter, which generates the weights +of an adapter module and conditions the behaviour of an agent on the context +information. We show that the Decision Adapter is a useful generalisation of a +previously proposed architecture and empirically demonstrate that it results in +superior generalisation performance compared to previous approaches in several +environments. Beyond this, the Decision Adapter is more robust to irrelevant +distractor variables than several alternative methods. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Robust and Actively Secure Serverless Collaborative Learning NeurIPS 2023 + + +
+ Collaborative machine learning (ML) is widely used to enable institutions to +learn better models from distributed data. While collaborative approaches to +learning intuitively protect user data, they remain vulnerable to either the +server, the clients, or both, deviating from the protocol. Indeed, because the +protocol is asymmetric, a malicious server can abuse its power to reconstruct +client data points. Conversely, malicious clients can corrupt learning with +malicious updates. Thus, both clients and servers require a guarantee when the +other cannot be trusted to fully cooperate. In this work, we propose a +peer-to-peer (P2P) learning scheme that is secure against malicious servers and +robust to malicious clients. Our core contribution is a generic framework that +transforms any (compatible) algorithm for robust aggregation of model updates +to the setting where servers and clients can act maliciously. Finally, we +demonstrate the computational efficiency of our approach even with 1-million +parameter models trained by 100s of peers on standard datasets. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ UAV Pathfinding in Dynamic Obstacle Avoidance with Multi-agent + Reinforcement Learning + + +
+ Multi-agent reinforcement learning based methods are significant for online +planning of feasible and safe paths for agents in dynamic and uncertain +scenarios. Although some methods like fully centralized and fully decentralized +methods achieve a certain measure of success, they also encounter problems such +as dimension explosion and poor convergence, respectively. In this paper, we +propose a novel centralized training with decentralized execution method based +on multi-agent reinforcement learning to solve the dynamic obstacle avoidance +problem online. In this approach, each agent communicates only with the central +planner or only with its neighbors, respectively, to plan feasible and safe +paths online. We improve our methods based on the idea of model predictive +control to increase the training efficiency and sample utilization of agents. +The experimental results in simulation, indoor, and outdoor environments +validate the effectiveness of our method. The video is available at +https://www.bilibili.com/video/BV1gw41197hV/?vd_source=9de61aecdd9fb684e546d032ef7fe7bf + +
+
+
+
+
+ + ☆ A Picture is Worth a Thousand Words: Principled Recaptioning Improves + Image Generation + + +
+ Text-to-image diffusion models achieved a remarkable leap in capabilities +over the last few years, enabling high-quality and diverse synthesis of images +from a textual prompt. However, even the most advanced models often struggle to +precisely follow all of the directions in their prompts. The vast majority of +these models are trained on datasets consisting of (image, caption) pairs where +the images often come from the web, and the captions are their HTML alternate +text. A notable example is the LAION dataset, used by Stable Diffusion and +other models. In this work we observe that these captions are often of low +quality, and argue that this significantly affects the model's capability to +understand nuanced semantics in the textual prompts. We show that by relabeling +the corpus with a specialized automatic captioning model and training a +text-to-image model on the recaptioned dataset, the model benefits +substantially across the board. First, in overall image quality: e.g. FID 14.84 +vs. the baseline of 17.87, and 64.3% improvement in faithful image generation +according to human evaluation. Second, in semantic alignment, e.g. semantic +object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44 and +positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the +corpus and provide evidence that this technique, which we call RECAP, both +reduces the train-inference discrepancy and provides the model with more +information per example, increasing sample efficiency and allowing the model to +better understand the relations between captions and images. + +
+
+
+
+
+ + ☆ Towards Control-Centric Representations in Reinforcement Learning from + Images + + +
+ Image-based Reinforcement Learning is a practical yet challenging task. A +major hurdle lies in extracting control-centric representations while +disregarding irrelevant information. While approaches that follow the +bisimulation principle show potential for learning state representations +that address this issue, they still grapple with the limited expressive capacity +of latent dynamics and poor adaptability to sparse reward environments. To +address these limitations, we introduce ReBis, which aims to capture +control-centric information by integrating reward-free control information +alongside reward-specific knowledge. ReBis utilizes a transformer architecture +to implicitly model the dynamics and incorporates block-wise masking to +eliminate spatiotemporal redundancy. Moreover, ReBis combines +bisimulation-based loss with asymmetric reconstruction loss to prevent feature +collapse in environments with sparse rewards. Empirical studies on two large +benchmarks, including Atari games and DeepMind Control Suite, demonstrate that +ReBis has superior performance compared to existing methods, proving its +effectiveness. + +
+
+
+
+
+ + ☆ How Robust is Federated Learning to Communication Error? A Comparison + Study Between Uplink and Downlink Channels + + +
+ Because of its privacy-preserving capability, federated learning (FL) has +attracted significant attention from both academia and industry. However, when +being implemented over wireless networks, it is not clear how much +communication error can be tolerated by FL. This paper investigates the +robustness of FL to the uplink and downlink communication error. Our +theoretical analysis reveals that the robustness depends on two critical +parameters, namely the number of clients and the numerical range of model +parameters. It is also shown that the uplink communication in FL can tolerate a +higher bit error rate (BER) than downlink communication, and this difference is +quantified by a proposed formula. The findings and theoretical analyses are +further validated by extensive experiments. + +
+
+ comment: Submitted to IEEE for possible publication +
+
+
+
+
+ + ☆ Posterior Consistency for Missing Data in Variational Autoencoders ECML + + +
+ We consider the problem of learning Variational Autoencoders (VAEs), i.e., a +type of deep generative model, from data with missing values. Such data is +omnipresent in real-world applications of machine learning because complete +data is often impossible or too costly to obtain. We particularly focus on +improving a VAE's amortized posterior inference, i.e., the encoder, which in +the case of missing data can be susceptible to learning inconsistent posterior +distributions regarding the missingness. To this end, we provide a formal +definition of posterior consistency and propose an approach for regularizing an +encoder's posterior distribution which promotes this consistency. We observe +that the proposed regularization suggests a different training objective than +that typically considered in the literature when facing missing values. +Furthermore, we empirically demonstrate that our regularization leads to +improved performance in missing value settings in terms of reconstruction +quality and downstream tasks utilizing uncertainty in the latent space. This +improved performance can be observed for many classes of VAEs including VAEs +equipped with normalizing flows. + +
+
+ comment: First published in ECML PKDD 2023, Proceedings, Part II, by Springer + Nature (https://doi.org/10.1007/978-3-031-43415-0_30). This version of the + work has been extended with the addition of an Appendix, which includes + proofs, the derivation of the posterior regularization, additional background + information on technical topics, an extended related work section, and + additional experimental results +
+
+
+
+
+ + ☆ Achieving Constraints in Neural Networks: A Stochastic Augmented + Lagrangian Approach + + +
+ Regularizing Deep Neural Networks (DNNs) is essential for improving +generalizability and preventing overfitting. Fixed penalty methods, though +common, lack adaptability and suffer from hyperparameter sensitivity. In this +paper, we propose a novel approach to DNN regularization by framing the +training process as a constrained optimization problem, where the data fidelity +term is the minimization objective and the regularization terms serve as +constraints. Then, we employ the Stochastic Augmented Lagrangian (SAL) method +to achieve a more flexible and efficient regularization mechanism. Our approach +extends beyond black-box regularization, demonstrating significant improvements +in white-box models, where weights are often subject to hard constraints to +ensure interpretability. Experimental results on image-based classification on +MNIST, CIFAR10, and CIFAR100 datasets validate the effectiveness of our +approach. SAL consistently achieves higher accuracy while also achieving better +constraint satisfaction, thus showcasing its potential for optimizing DNNs +under constrained settings. + +
+
+
+
+
+ + ☆ Model predictive control-based value estimation for efficient + reinforcement learning + + +
+ Reinforcement learning suffers from limitations in real-world practice primarily +due to the number of required interactions with virtual environments. This +results in a challenging problem: for many learning methods, it is implausible +to obtain an optimal strategy with only a few attempts. We therefore design +an improved reinforcement learning method based on model predictive control +that models the environment through a data-driven approach. Based on the learned +environment model, it performs multi-step prediction to estimate the value +function and optimize the policy. The method demonstrates higher learning +efficiency, faster convergence of strategies toward the optimal value, +and a smaller sample capacity required by experience replay buffers. +Experimental results, both in classic databases and in a dynamic obstacle +avoidance scenario for an unmanned aerial vehicle, validate the proposed +approaches. + +
+
+
+
+
+ + ☆ Driving through the Concept Gridlock: Unraveling Explainability + Bottlenecks + + +
+ Concept bottleneck models have been successfully used for explainable machine +learning by encoding information within the model with a set of human-defined +concepts. In the context of human-assisted or autonomous driving, +explainability models can help user acceptance and understanding of decisions +made by the autonomous vehicle, which can be used to rationalize and explain +driver or vehicle behavior. We propose a new approach using concept bottlenecks +as visual features for control command predictions and explanations of user and +vehicle behavior. We learn a human-understandable concept layer that we use to +explain sequential driving scenes while learning vehicle control commands. This +approach can then be used to determine whether a change in a preferred gap or +steering commands from a human (or autonomous vehicle) is led by an external +stimulus or change in preferences. We achieve competitive performance to latent +visual features while gaining interpretability within our model setup. + +
+
+
+
+
+ + ☆ Covariate Shift Adaptation Robust to Density-Ratio Estimation + + +
+ Consider a scenario where we have access to train data with both covariates +and outcomes while test data only contains covariates. In this scenario, our +primary aim is to predict the missing outcomes of the test data. With this +objective in mind, we train parametric regression models under a covariate +shift, where covariate distributions are different between the train and test +data. For this problem, existing studies have proposed covariate shift +adaptation via importance weighting using the density ratio. This approach +averages the train data losses, each weighted by an estimated ratio of the +covariate densities between the train and test data, to approximate the +test-data risk. Although it allows us to obtain a test-data risk minimizer, its +performance heavily relies on the accuracy of the density ratio estimation. +Moreover, even if the density ratio can be consistently estimated, the +estimation errors of the density ratio also yield bias in the estimators of the +regression model's parameters of interest. To mitigate these challenges, we +introduce a doubly robust estimator for covariate shift adaptation via +importance weighting, which incorporates an additional estimator for the +regression function. Leveraging double machine learning techniques, our +estimator reduces the bias arising from the density ratio estimation errors. We +demonstrate the asymptotic distribution of the regression parameter estimator. +Notably, our estimator remains consistent if either the density ratio estimator +or the regression function is consistent, showcasing its robustness against +potential errors in density ratio estimation. Finally, we confirm the soundness +of our proposed method via simulation studies. + +
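The importance-weighting step described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's doubly robust estimator: it only shows the plain weighted average of training losses and a weighted least-squares fit. The function names are hypothetical, and the density ratios are assumed to be supplied by the caller as test density over train density at each training covariate.

```python
def importance_weighted_risk(losses, ratios):
    """Average per-sample training losses, each weighted by a density ratio.

    Assumption: ratios[i] approximates p_test(x_i) / p_train(x_i).
    """
    if len(losses) != len(ratios):
        raise ValueError("losses and ratios must have the same length")
    return sum(w * l for w, l in zip(ratios, losses)) / len(losses)


def weighted_slope(xs, ys, ratios):
    """Importance-weighted least squares for the model y ~ theta * x (no intercept)."""
    numerator = sum(w * x * y for w, x, y in zip(ratios, xs, ys))
    denominator = sum(w * x * x for w, x in zip(ratios, xs))
    return numerator / denominator
```

Because the weights enter every term of the estimator, errors in the estimated ratios propagate directly into the fitted parameter, which is the sensitivity the abstract sets out to reduce.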
+
+
+
+
+ + ☆ Photometric Redshifts with Copula Entropy + + +
+ In this paper we propose to apply copula entropy (CE) to photometric +redshifts. CE is used to measure the correlations between photometric +measurements and redshifts and then the measurements associated with high CEs +are selected for predicting redshifts. We verified the proposed method on the +SDSS quasar data. Experimental results show that the accuracy of photometric +redshifts is improved with the selected measurements compared to the results +with all the measurements used in the experiments, especially for the samples +with high redshifts. The measurements selected with CE include luminosity +magnitude, the brightness in ultraviolet band with standard deviation, and the +brightness of the other four bands. Since CE is a rigorously defined +mathematical concept, the models so derived are interpretable. + +
+
+ comment: 15 pages, 7 figures, 1 table +
+
+
+
+
+ + ☆ Free-form Flows: Make Any Architecture a Normalizing Flow + + +
+ Normalizing Flows are generative models that directly maximize the +likelihood. Previously, the design of normalizing flows was largely constrained +by the need for analytical invertibility. We overcome this constraint by a +training procedure that uses an efficient estimator for the gradient of the +change of variables formula. This enables any dimension-preserving neural +network to serve as a generative model through maximum likelihood training. Our +approach allows placing the emphasis on tailoring inductive biases precisely to +the task at hand. Specifically, we achieve excellent results in molecule +generation benchmarks utilizing $E(n)$-equivariant networks. Moreover, our +method is competitive in an inverse problem benchmark, while employing +off-the-shelf ResNet architectures. + +
+
+
+
+
+ + ☆ SpikingJelly: An open-source machine learning infrastructure platform + for spike-based intelligence + + +
+ Spiking neural networks (SNNs) aim to realize brain-inspired intelligence on +neuromorphic chips with high energy efficiency by introducing neural dynamics +and spike properties. As the emerging spiking deep learning paradigm attracts +increasing interest, traditional programming frameworks cannot meet the demands +of the automatic differentiation, parallel computation acceleration, and high +integration of processing neuromorphic datasets and deployment. In this work, +we present the SpikingJelly framework to address the aforementioned dilemma. We +contribute a full-stack toolkit for pre-processing neuromorphic datasets, +building deep SNNs, optimizing their parameters, and deploying SNNs on +neuromorphic chips. Compared to existing methods, the training of deep SNNs can +be accelerated $11\times$, and the superior extensibility and flexibility of +SpikingJelly enable users to accelerate custom models at low costs through +multilevel inheritance and semiautomatic code generation. SpikingJelly paves +the way for synthesizing truly energy-efficient SNN-based machine intelligence +systems, which will enrich the ecology of neuromorphic computing. + +
+
+ comment: Accepted in Science Advances + (https://www.science.org/doi/10.1126/sciadv.adi1480) +
+
+
+
+
+ + ☆ Performative Prediction: Past and Future + + +
+ Predictions in the social world generally influence the target of prediction, +a phenomenon known as performativity. Self-fulfilling and self-negating +predictions are examples of performativity. Of fundamental importance to +economics, finance, and the social sciences, the notion has been absent from +the development of machine learning. In machine learning applications, +performativity often surfaces as distribution shift. A predictive model +deployed on a digital platform, for example, influences consumption and thereby +changes the data-generating distribution. We survey the recently founded area +of performative prediction that provides a definition and conceptual framework +to study performativity in machine learning. A consequence of performative +prediction is a natural equilibrium notion that gives rise to new optimization +challenges. Another consequence is a distinction between learning and steering, +two mechanisms at play in performative prediction. The notion of steering is in +turn intimately related to questions of power in digital markets. We review the +notion of performative power that gives an answer to the question of how much a +platform can steer participants through its predictions. We end with a discussion +of future directions, such as the role that performativity plays in contesting +algorithmic systems. + +
+
+
+
+
+ + ☆ AirFL-Mem: Improving Communication-Learning Trade-Off by Long-Term + Memory + + +
+ Addressing the communication bottleneck inherent in federated learning (FL), +over-the-air FL (AirFL) has emerged as a promising solution, which is, however, +hampered by deep fading conditions. In this paper, we propose AirFL-Mem, a +novel scheme designed to mitigate the impact of deep fading by implementing a +\emph{long-term} memory mechanism. Convergence bounds are provided that account +for long-term memory, as well as for existing AirFL variants with short-term +memory, for general non-convex objectives. The theory demonstrates that +AirFL-Mem exhibits the same convergence rate as federated averaging (FedAvg) +with ideal communication, while the performance of existing schemes is +generally limited by error floors. The theoretical results are also leveraged +to propose a novel convex optimization strategy for the truncation threshold +used for power control in the presence of Rayleigh fading channels. +Experimental results validate the analysis, confirming the advantages of a +long-term memory mechanism for the mitigation of deep fading. + +
+
+ comment: 8 pages, 3 figures, this is the full version of the conference + version that is submitted to IEEE WCNC2024 for possible publication +
+
+
+
+
+ + ☆ Parcel loss prediction in last-mile delivery: deep and non-deep + approaches with insights from Explainable AI + + +
+ Within the domain of e-commerce retail, an important objective is the +reduction of parcel loss during the last-mile delivery phase. The +ever-increasing availability of data, including product, customer, and order +information, has made it possible for the application of machine learning in +parcel loss prediction. However, a significant challenge arises from the +inherent imbalance in the data, i.e., only a very low percentage of parcels are +lost. In this paper, we propose two machine learning approaches, namely, Data +Balance with Supervised Learning (DBSL) and Deep Hybrid Ensemble Learning +(DHEL), to accurately predict parcel loss. The practical implication of such +predictions is their value in aiding e-commerce retailers in optimizing +insurance-related decision-making policies. We conduct a comprehensive +evaluation of the proposed machine learning models using one year data from +Belgian shipments. The findings show that the DHEL model, which combines a +feed-forward autoencoder with a random forest, achieves the highest +classification performance. Furthermore, we use the techniques from Explainable +AI (XAI) to illustrate how prediction models can be used in enhancing business +processes and augmenting the overall value proposition for e-commerce retailers +in the last mile delivery. + +
+
+
+
+
+ + ☆ Balancing central and marginal rejection when combining independent + significance tests + + +
+ A common approach to evaluating the significance of a collection of +$p$-values combines them with a pooling function, in particular when the +original data are not available. These pooled $p$-values convert a sample of +$p$-values into a single number which behaves like a univariate $p$-value. To +clarify discussion of these functions, a telescoping series of alternative +hypotheses are introduced that communicate the strength and prevalence of +non-null evidence in the $p$-values before general pooling formulae are +discussed. A pattern noticed in the UMP pooled $p$-value for a particular +alternative motivates the definition and discussion of central and marginal +rejection levels at $\alpha$. It is proven that central rejection is always +greater than or equal to marginal rejection, motivating a quotient to measure +the balance between the two for pooled $p$-values. A combining function based +on the $\chi^2_{\kappa}$ quantile transformation is proposed to control this +quotient and shown to be robust to mis-specified parameters relative to the +UMP. Different powers for different parameter settings motivate a map of +plausible alternatives based on where this pooled $p$-value is minimized. + +
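To make the idea of a pooling function concrete, here is a minimal Python sketch of Fisher's classic method for independent $p$-values. It is offered only as a familiar example of pooling, not as the combining function proposed in this abstract; it corresponds to a chi-squared quantile transformation with 2 degrees of freedom. The function name is hypothetical.

```python
import math


def fisher_pooled_p_value(p_values):
    """Pool independent p-values with Fisher's method.

    The statistic -2 * sum(log p_i) follows a chi-squared distribution with
    2n degrees of freedom under the null, whose survival function has the
    closed form exp(-h) * sum_{k=0}^{n-1} h**k / k!, where h is half the statistic.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)

    # Accumulate the truncated exponential series term by term.
    term = 1.0
    total = 1.0
    for k in range(1, n):
        term *= half_stat / k
        total += term
    return math.exp(-half_stat) * total
```

With a single input, the pooled value equals that input, which is a quick sanity check that the result behaves like a univariate $p$-value.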
+
+ comment: 55 pages, 18 figures, public technical report +
+
+
+
+
+ + ☆ Beyond IID weights: sparse and low-rank deep Neural Networks are also + Gaussian Processes + + +
+ The infinitely wide neural network has been proven a useful and manageable +mathematical model that enables the understanding of many phenomena appearing +in deep learning. One example is the convergence of random deep networks to +Gaussian processes that allows a rigorous analysis of the way the choice of +activation function and network weights impacts the training dynamics. In this +paper, we extend the seminal proof of Matthews et al. (2018) to a larger class +of initial weight distributions (which we call PSEUDO-IID), including the +established cases of IID and orthogonal weights, as well as the emerging +low-rank and structured sparse settings celebrated for their computational +speed-up benefits. We show that fully-connected and convolutional networks +initialized with PSEUDO-IID distributions are all effectively equivalent up to +their variance. Using our results, one can identify the Edge-of-Chaos for a +broader class of neural networks and tune them at criticality in order to +enhance their training. + +
+
+
+
+
+ + ☆ Over-the-air Federated Policy Gradient + + +
+ In recent years, over-the-air aggregation has been widely considered in +large-scale distributed learning, optimization, and sensing. In this paper, we +propose the over-the-air federated policy gradient algorithm, where all agents +simultaneously broadcast an analog signal carrying local information to a +common wireless channel, and a central controller uses the received aggregated +waveform to update the policy parameters. We investigate the effect of noise +and channel distortion on the convergence of the proposed algorithm, and +establish the complexities of communication and sampling for finding an +$\epsilon$-approximate stationary point. Finally, we present some simulation +results to show the effectiveness of the algorithm. + +
+
+
+
+
+ + ☆ Multi-parallel-task Time-delay Reservoir Computing combining a Silicon + Microring with WDM + + +
+ We numerically demonstrate a microring-based time-delay reservoir computing +scheme that simultaneously solves three tasks involving time-series prediction, +classification, and wireless channel equalization. Each task performed on a +wavelength-multiplexed channel achieves state-of-the-art performance with +optimized power and frequency detuning. + +
+
+ comment: 3 pages, 2 figures, Submitted to Optical Fiber Communication + Conference (OFC) 2024 +
+
+
+
+
+ + ☆ Adaptive Uncertainty Estimation via High-Dimensional Testing on Latent + Representations NeurIPS 2023 + + +
+ Uncertainty estimation aims to evaluate the confidence of a trained deep +neural network. However, existing uncertainty estimation approaches rely on +low-dimensional distributional assumptions and thus suffer from the high +dimensionality of latent features. Existing approaches tend to focus on +uncertainty on discrete classification probabilities, which leads to poor +generalizability to uncertainty estimation for other tasks. Moreover, most of +the literature requires seeing the out-of-distribution (OOD) data in the +training for better estimation of uncertainty, which limits the uncertainty +estimation performance in practice because the OOD data are typically unseen. +To overcome these limitations, we propose a new framework using data-adaptive +high-dimensional hypothesis testing for uncertainty estimation, which leverages +the statistical properties of the feature representations. Our method directly +operates on latent representations and thus does not require retraining the +feature encoder under a modified objective. The test statistic relaxes the +feature distribution assumptions to high dimensionality, and it is more +discriminative to uncertainties in the latent representations. We demonstrate +that encoding features with Bayesian neural networks can enhance testing +performance and lead to more accurate uncertainty estimation. We further +introduce a family-wise testing procedure to determine the optimal threshold of +OOD detection, which minimizes the false discovery rate (FDR). Extensive +experiments validate the satisfactory performance of our framework on +uncertainty estimation and task-specific prediction over a variety of +competitors. The experiments on the OOD detection task also show satisfactory +performance of our method when the OOD data are unseen in the training. Codes +are available at https://github.com/HKU-MedAI/bnn_uncertainty. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Mapping the magnetic field using a magnetometer array with noisy input + Gaussian process regression + + +
+ Ferromagnetic materials in indoor environments give rise to disturbances in +the ambient magnetic field. Maps of these magnetic disturbances can be used for +indoor localisation. A Gaussian process can be used to learn the spatially +varying magnitude of the magnetic field using magnetometer measurements and +information about the position of the magnetometer. The position of the +magnetometer, however, is frequently only approximately known. This negatively +affects the quality of the magnetic field map. In this paper, we investigate +how an array of magnetometers can be used to improve the quality of the +magnetic field map. The position of the array is approximately known, but the +relative locations of the magnetometers on the array are known. We include this +information in a novel method to make a map of the ambient magnetic field. We +study the properties of our method in simulation and show that our method +improves the map quality. We also demonstrate the efficacy of our method with +experimental data for the mapping of the magnetic field using an array of 30 +magnetometers. + +
+
+
+
+
+ + ☆ Large-scale magnetic field maps using structured kernel interpolation + for Gaussian process regression + + +
+ We present a mapping algorithm to compute large-scale magnetic field maps in +indoor environments with approximate Gaussian process (GP) regression. Mapping +the spatial variations in the ambient magnetic field can be used for +localization algorithms in indoor areas. To compute such a map, GP regression +is a suitable tool because it provides predictions of the magnetic field at new +locations along with uncertainty quantification. Because full GP regression has +a complexity that grows cubically with the number of data points, +approximations for GPs have been extensively studied. In this paper, we build +on the structured kernel interpolation (SKI) framework, speeding up inference +by exploiting efficient Krylov subspace methods. More specifically, we +incorporate SKI with derivatives (D-SKI) into the scalar potential model for +magnetic field modeling and compute both predictive mean and covariance with a +complexity that is linear in the data points. In our simulations, we show that +our method achieves better accuracy than current state-of-the-art methods on +magnetic field maps with a growing mapping area. In our large-scale +experiments, we construct magnetic field maps from up to 40000 +three-dimensional magnetic field measurements in less than two minutes on a +standard laptop. + +
+
+
+
+
+ + ☆ Model-enhanced Contrastive Reinforcement Learning for Sequential + Recommendation + + +
+ Reinforcement learning (RL) has been widely applied in recommendation systems +due to its potential in optimizing the long-term engagement of users. From the +perspective of RL, recommendation can be formulated as a Markov decision +process (MDP), where the recommendation system (agent) can interact with users +(environment) and acquire feedback (reward signals). However, it is impractical +to conduct online interactions given concerns about user experience and +implementation complexity, and we can only train RL recommenders with offline +datasets containing limited reward signals and state transitions. Therefore, +the data sparsity issue of reward signals and state transitions is very severe, +while it has long been overlooked by existing RL recommenders. Worse still, RL +methods learn through the trial-and-error mode, but negative feedback cannot be +obtained in implicit feedback recommendation tasks, which aggravates the +overestimation problem of offline RL recommenders. To address these challenges, +we propose a novel RL recommender named model-enhanced contrastive +reinforcement learning (MCRL). On the one hand, we learn a value function to +estimate the long-term engagement of users, together with a conservative value +learning mechanism to alleviate the overestimation problem. On the other hand, +we construct some positive and negative state-action pairs to model the reward +function and state transition function with contrastive learning to exploit the +internal structure information of MDP. Experiments demonstrate that the +proposed method significantly outperforms existing offline RL and +self-supervised RL methods with different representative backbone networks on +two real-world datasets. + +
+
+ comment: 11 pages, 7 figures +
+
+
+
+
+ + ☆ Label Propagation for Graph Label Noise + + +
+ Label noise is a common challenge in large datasets, as it can significantly +degrade the generalization ability of deep neural networks. Most existing +studies focus on noisy labels in computer vision; however, graph models +encompass both node features and graph topology as input, and become more +susceptible to label noise through message-passing mechanisms. Recently, only a +few works have been proposed to tackle the label noise on graphs. One major +limitation is that they assume the graph is homophilous and the labels are +smoothly distributed. Nevertheless, real-world graphs may contain varying +degrees of heterophily or even be heterophily-dominated, leading to the +inadequacy of current methods. In this paper, we study graph label noise in the +context of arbitrary heterophily, with the aim of rectifying noisy labels and +assigning labels to previously unlabeled nodes. We begin by conducting two +empirical analyses to explore the impact of graph homophily on graph label +noise. Following observations, we propose a simple yet efficient algorithm, +denoted as LP4GLN. Specifically, LP4GLN is an iterative algorithm with three +steps: (1) reconstruct the graph to recover the homophily property, (2) utilize +label propagation to rectify the noisy labels, (3) select high-confidence +labels to retain for the next iteration. By iterating these steps, we obtain a +set of correct labels, ultimately achieving high accuracy in the node +classification task. The theoretical analysis is also provided to demonstrate +its remarkable denoising "effect". Finally, we conduct experiments on 10 +benchmark datasets under varying graph heterophily levels and noise types, +comparing the performance of LP4GLN with 7 typical baselines. Our results +illustrate the superior performance of the proposed LP4GLN. + +
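The label propagation step mentioned in this abstract can be sketched in plain Python. This is a generic propagation routine with clamped seed nodes, not the LP4GLN algorithm itself: the graph reconstruction and high-confidence selection steps are omitted, and the function name and data layout (an adjacency dictionary plus a dictionary of trusted labels) are assumptions made for illustration.

```python
def propagate_labels(adjacency, seed_labels, num_classes, iterations=10):
    """Generic label propagation with clamped seed nodes.

    adjacency: dict mapping each node to a list of neighbor nodes
    seed_labels: dict mapping trusted nodes to a class index
    """
    # Start from one-hot scores for seeds and uniform scores elsewhere.
    scores = {}
    for node in adjacency:
        if node in seed_labels:
            scores[node] = [
                1.0 if c == seed_labels[node] else 0.0 for c in range(num_classes)
            ]
        else:
            scores[node] = [1.0 / num_classes] * num_classes

    for _ in range(iterations):
        updated = {}
        for node, neighbors in adjacency.items():
            # Seeds stay fixed; isolated nodes keep their current scores.
            if node in seed_labels or not neighbors:
                updated[node] = scores[node]
                continue
            # Each remaining node takes the average of its neighbors' scores.
            updated[node] = [
                sum(scores[n][c] for n in neighbors) / len(neighbors)
                for c in range(num_classes)
            ]
        scores = updated

    # Predict the highest-scoring class for every node.
    return {
        node: max(range(num_classes), key=lambda c: s[c])
        for node, s in scores.items()
    }
```

Averaging over neighbors only helps when neighbors tend to share labels, which is why a homophily assumption matters for this kind of propagation.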
+
+
+
+
+ + ☆ DECWA : Density-Based Clustering using Wasserstein Distance CIKM 2020 + + +
+ Clustering is a data analysis method for extracting knowledge by discovering +groups of data called clusters. Among these methods, state-of-the-art +density-based clustering methods have proven to be effective for +arbitrary-shaped clusters. Despite their encouraging results, they struggle to +find low-density clusters, near clusters with similar densities, and +high-dimensional data. Our proposals are a new characterization of clusters and +a new clustering algorithm based on spatial density and probabilistic approach. +First of all, sub-clusters are built using spatial density represented as +probability density function ($p.d.f$) of pairwise distances between points. A +method is then proposed to agglomerate similar sub-clusters by using both their +density ($p.d.f$) and their spatial distance. The key idea we propose is to use +the Wasserstein metric, a powerful tool to measure the distance between $p.d.f$ +of sub-clusters. We show that our approach outperforms other state-of-the-art +density-based clustering methods on a wide variety of datasets. + +
+
+ comment: 6 pages, CIKM 2020 +
+
+
+
+
+ + ☆ Pitfall of Optimism: Distributional Reinforcement Learning by + Randomizing Risk Criterion NeurIPS 2023 + + +
+ Distributional reinforcement learning algorithms have attempted to utilize +estimated uncertainty for exploration, such as optimism in the face of +uncertainty. However, using the estimated variance for optimistic exploration +may cause biased data collection and hinder convergence or performance. In this +paper, we present a novel distributional reinforcement learning algorithm that +selects actions by randomizing risk criterion to avoid one-sided tendency on +risk. We provide a perturbed distributional Bellman optimality operator by +distorting the risk measure and prove the convergence and optimality of the +proposed method with the weaker contraction property. Our theoretical results +support that the proposed method does not fall into biased exploration and is +guaranteed to converge to an optimal return. Finally, we empirically show that +our method outperforms other existing distribution-based algorithms in various +environments including Atari 55 games. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ FedTherapist: Mental Health Monitoring with User-Generated Linguistic + Expressions on Smartphones via Federated Learning EMNLP 2023 + + +
+ Psychiatrists diagnose mental disorders via the linguistic use of patients. +Still, due to data privacy, existing passive mental health monitoring systems +use alternative features such as activity, app usage, and location via mobile +devices. We propose FedTherapist, a mobile mental health monitoring system that +utilizes continuous speech and keyboard input in a privacy-preserving way via +federated learning. We explore multiple model designs by comparing their +performance and overhead for FedTherapist to overcome the complex nature of +on-device language model training on smartphones. We further propose a +Context-Aware Language Learning (CALL) methodology to effectively utilize +smartphones' large and noisy text for mental health signal sensing. Our +IRB-approved evaluation of the prediction of self-reported depression, stress, +anxiety, and mood from 46 participants shows higher accuracy of FedTherapist +compared with the performance with non-language features, achieving 0.15 AUROC +improvement and 8.21% MAE reduction. + +
+
+ comment: Accepted to the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP 2023) +
+
+
+
+
+ + ☆ Enhancing Document Information Analysis with Multi-Task Pre-training: A + Robust Approach for Information Extraction in Visually-Rich Documents + + +
+ This paper introduces a deep learning model tailored for document information +analysis, emphasizing document classification, entity relation extraction, and +document visual question answering. The proposed model leverages +transformer-based models to encode all the information present in a document +image, including textual, visual, and layout information. The model is +pre-trained and subsequently fine-tuned for various document image analysis +tasks. The proposed model incorporates three additional tasks during the +pre-training phase, including reading order identification of different layout +segments in a document image, layout segments categorization as per PubLayNet, +and generation of the text sequence within a given layout segment (text block). +The model also incorporates a collective pre-training scheme where losses of +all the tasks under consideration, including pre-training and fine-tuning tasks +with all datasets, are considered. Additional encoder and decoder blocks are +added to the RoBERTa network to generate results for all tasks. The proposed +model achieved impressive results across all tasks, with an accuracy of 95.87% +on the RVL-CDIP dataset for document classification, F1 scores of 0.9306, +0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets +respectively for entity relation extraction, and an ANLS score of 0.8468 on the +DocVQA dataset for visual question answering. The results highlight the +effectiveness of the proposed model in understanding and interpreting complex +document layouts and content, making it a promising tool for document analysis +tasks. + +
+
+
+
+
+ + ☆ Cyclic Directed Probabilistic Graphical Model: A Proposal Based on + Structured Outcomes + + +
+ In the process of building (structural learning) a probabilistic graphical +model from a set of observed data, the directional, cyclic dependencies between +the random variables of the model are often found. Existing graphical models +such as Bayesian and Markov networks can reflect such dependencies. However, +this requires complicating those models, such as adding additional variables or +dividing the model graph into separate subgraphs. Herein, we describe a +probabilistic graphical model - probabilistic relation network - that allows +the direct capture of directional cyclic dependencies during structural +learning. This model is based on the simple idea that each sample of the +observed data can be represented by an arbitrary graph (structured outcome), +which reflects the structure of the dependencies of the variables included in +the sample. Each of the outcomes contains only a part of the graphical model +structure; however, a complete graph of the probabilistic model is obtained by +combining different outcomes. Such a graph, unlike Bayesian and Markov +networks, can be directed and can have cycles. We explored the full joint +distribution and conditional distribution and conditional independence +properties of variables in the proposed model. We defined the algorithms for +constructing the model from the dataset and for calculating the conditional +and full joint distributions. We also performed a numerical comparison with +Bayesian and Markov networks. This model does not violate the probability +axioms, and it supports learning from observed data. Notably, it supports +probabilistic inference, making it a prospective tool in data analysis and in +expert and design-making applications. + +
+
+ comment: 41 pages, 11 figures, arXiv:2206.06089v1 +
+
+
+
+
+ + ☆ Can You Rely on Your Model Evaluation? Improving Model Evaluation with + Synthetic Test Data NeurIPS 2023 + + +
+ Evaluating the performance of machine learning models on diverse and +underrepresented subgroups is essential for ensuring fairness and reliability +in real-world applications. However, accurately assessing model performance +becomes challenging due to two main issues: (1) a scarcity of test data, +especially for small subgroups, and (2) possible distributional shifts in the +model's deployment setting, which may not align with the available test data. +In this work, we introduce 3S Testing, a deep generative modeling framework to +facilitate model evaluation by generating synthetic test sets for small +subgroups and simulating distributional shifts. Our experiments demonstrate +that 3S Testing outperforms traditional baselines -- including real test data +alone -- in estimating model performance on minority subgroups and under +plausible distributional shifts. In addition, 3S offers intervals around its +performance estimates, exhibiting superior coverage of the ground truth +compared to existing approaches. Overall, these results raise the question of +whether we need a paradigm shift away from limited real test data towards +synthetic test data. + +
+
+ comment: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). + Van Breugel & Seedat contributed equally +
+
+
+
+
+ + ☆ Towards Self-Interpretable Graph-Level Anomaly Detection NeurIPS 2023 + + +
+ Graph-level anomaly detection (GLAD) aims to identify graphs that exhibit +notable dissimilarity compared to the majority in a collection. However, +current works primarily focus on evaluating graph-level abnormality while +failing to provide meaningful explanations for the predictions, which largely +limits their reliability and application scope. In this paper, we investigate a +new challenging problem, explainable GLAD, where the learning objective is to +predict the abnormality of each graph sample with corresponding explanations, +i.e., the vital subgraph that leads to the predictions. To address this +challenging problem, we propose a Self-Interpretable Graph aNomaly dETection +model (SIGNET for short) that detects anomalous graphs as well as generates +informative explanations simultaneously. Specifically, we first introduce the +multi-view subgraph information bottleneck (MSIB) framework, serving as the +design basis of our self-interpretable GLAD approach. This way SIGNET is able +to not only measure the abnormality of each graph based on cross-view mutual +information but also provide informative graph rationales by extracting +bottleneck subgraphs from the input graph and its dual hypergraph in a +self-supervised way. Extensive experiments on 16 datasets demonstrate the +anomaly detection capability and self-interpretability of SIGNET. + +
+
+ comment: 23 pages; accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ Particle-based Variational Inference with Generalized Wasserstein + Gradient Flow + + +
+ Particle-based variational inference methods (ParVIs) such as Stein +variational gradient descent (SVGD) update the particles based on the +kernelized Wasserstein gradient flow for the Kullback-Leibler (KL) divergence. +However, the design of kernels is often non-trivial and can be restrictive for +the flexibility of the method. Recent works show that functional gradient flow +approximations with quadratic form regularization terms can improve +performance. In this paper, we propose a ParVI framework, called generalized +Wasserstein gradient descent (GWG), based on a generalized Wasserstein gradient +flow of the KL divergence, which can be viewed as a functional gradient method +with a broader class of regularizers induced by convex functions. We show that +GWG exhibits strong convergence guarantees. We also provide an adaptive version +that automatically chooses Wasserstein metric to accelerate convergence. In +experiments, we demonstrate the effectiveness and efficiency of the proposed +framework on both simulated and real data problems. + +
+
+
+
+
+ + ☆ Identifying Reasons for Bias: An Argumentation-Based Approach + + +
+ As algorithmic decision-making systems become more prevalent in society, +ensuring the fairness of these systems is becoming increasingly important. +Whilst there has been substantial research in building fair algorithmic +decision-making systems, the majority of these methods require access to the +training data, including personal characteristics, and are not transparent +regarding which individuals are classified unfairly. In this paper, we propose +a novel model-agnostic argumentation-based method to determine why an +individual is classified differently in comparison to similar individuals. Our +method uses a quantitative argumentation framework to represent attribute-value +pairs of an individual and of those similar to them, and uses a well-known +semantics to identify the attribute-value pairs in the individual contributing +most to their different classification. We evaluate our method on two datasets +commonly used in the fairness literature and illustrate its effectiveness in +the identification of bias. + +
+
+ comment: 10 pages +
+
+
+
+
+ + ☆ Data Optimization in Deep Learning: A Survey + + +
+ Large-scale, high-quality data are considered an essential factor for the +successful application of many deep learning techniques. Meanwhile, numerous +real-world deep learning tasks still have to contend with the lack of +sufficient amounts of high-quality data. Additionally, issues such as model +robustness, fairness, and trustworthiness are also closely related to training +data. Consequently, a huge number of studies in the existing literature have +focused on the data aspect in deep learning tasks. Some typical data +optimization techniques include data augmentation, logit perturbation, sample +weighting, and data condensation. These techniques usually come from different +deep learning divisions and their theoretical inspirations or heuristic +motivations may seem unrelated to each other. This study aims to organize a +wide range of existing data optimization methodologies for deep learning from +the previous literature, and makes the effort to construct a comprehensive +taxonomy for them. The constructed taxonomy considers the diversity of split +dimensions, and deep sub-taxonomies are constructed for each dimension. On the +basis of the taxonomy, connections among the extensive data optimization +methods for deep learning are built in terms of four aspects. We probe into +rendering several promising and interesting future directions. The constructed +taxonomy and the revealed connections will enlighten the better understanding +of existing methods and the design of novel data optimization techniques. +Furthermore, our aspiration for this survey is to promote data optimization as +an independent subdivision of deep learning. A curated, up-to-date list of +resources related to data optimization in deep learning is available at +\url{https://github.com/YaoRujing/Data-Optimization}. + +
+
+
+
+
+ + ☆ Citizen participation: crowd-sensed sustainable indoor location services + + +
+ In the present era of sustainable innovation, the circular economy paradigm +dictates the optimal use and exploitation of existing finite resources. At the +same time, the transition to smart infrastructures requires considerable +investment in capital, resources and people. In this work, we present a general +machine learning approach for offering indoor location awareness without the +need to invest in additional and specialised hardware. We explore use cases +where visitors equipped with their smart phone would interact with the +available WiFi infrastructure to estimate their location, since the indoor +requirement poses a limitation to standard GPS solutions. Results have shown +that the proposed approach achieves a less than 2m accuracy and the model is +resilient even in the case where a substantial number of BSSIDs are dropped. + +
+
+ comment: Preprint submitted to Elsevier +
+
+
+
+
+ + ☆ On the Powerfulness of Textual Outlier Exposure for Visual OoD Detection NeurIPS 2023 + + +
+ Successful detection of Out-of-Distribution (OoD) data is becoming +increasingly important to ensure safe deployment of neural networks. One of the +main challenges in OoD detection is that neural networks output overconfident +predictions on OoD data, making it difficult to determine OoD-ness of data solely +based on their predictions. Outlier exposure addresses this issue by +introducing an additional loss that encourages low-confidence predictions on +OoD data during training. While outlier exposure has shown promising potential +in improving OoD detection performance, all previous studies on outlier +exposure have been limited to utilizing visual outliers. Drawing inspiration +from the recent advancements in vision-language pre-training, this paper +ventures out to the uncharted territory of textual outlier exposure. First, we +uncover the benefits of using textual outliers by replacing real or virtual +outliers in the image-domain with textual equivalents. Then, we propose various +ways of generating preferable textual outliers. Our extensive experiments +demonstrate that generated textual outliers achieve competitive performance on +large-scale OoD and hard OoD benchmarks. Furthermore, we conduct empirical +analyses of textual outliers to provide primary criteria for designing +advantageous textual outliers: near-distribution, descriptiveness, and +inclusion of visual semantics. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ TSONN: Time-stepping-oriented neural network for solving partial + differential equations + + +
+ Deep neural networks (DNNs), especially physics-informed neural networks +(PINNs), have recently become a new popular method for solving forward and +inverse problems governed by partial differential equations (PDEs). However, +these methods still face challenges in achieving stable training and obtaining +correct results in many problems, since minimizing PDE residuals with PDE-based +soft constraints makes the problem ill-conditioned. Different from all existing +methods that directly minimize PDE residuals, this work integrates +time-stepping method with deep learning, and transforms the original +ill-conditioned optimization problem into a series of well-conditioned +sub-problems over given pseudo time intervals. The convergence of model +training is significantly improved by following the trajectory of the pseudo +time-stepping process, yielding a robust optimization-based PDE solver. Our +results show that the proposed method achieves stable training and correct +results in many problems that standard PINNs fail to solve, requiring only a +simple modification on the loss function. In addition, we demonstrate several +novel properties and advantages of time-stepping methods within the framework +of neural network-based optimization approach, in comparison to traditional +grid-based numerical method. Specifically, explicit scheme allows significantly +larger time step, while implicit scheme can be implemented as straightforwardly +as explicit scheme. + +
+
+
+
+
+ + ☆ Hyperparameter Optimization for Multi-Objective Reinforcement Learning + + +
+ Reinforcement learning (RL) has emerged as a powerful approach for tackling +complex problems. The recent introduction of multi-objective reinforcement +learning (MORL) has further expanded the scope of RL by enabling agents to make +trade-offs among multiple objectives. This advancement not only has broadened +the range of problems that can be tackled but also created numerous +opportunities for exploration and advancement. Yet, the effectiveness of RL +agents heavily relies on appropriately setting their hyperparameters. In +practice, this task often proves to be challenging, leading to unsuccessful +deployments of these techniques in various instances. Hence, prior research has +explored hyperparameter optimization in RL to address this concern. + This paper presents an initial investigation into the challenge of +hyperparameter optimization specifically for MORL. We formalize the problem, +highlight its distinctive challenges, and propose a systematic methodology to +address it. The proposed methodology is applied to a well-known environment +using a state-of-the-art MORL algorithm, and preliminary results are reported. +Our findings indicate that the proposed methodology can effectively provide +hyperparameter configurations that significantly enhance the performance of +MORL agents. Furthermore, this study identifies various future research +opportunities to further advance the field of hyperparameter optimization for +MORL. + +
+
+ comment: Presented at the MODeM workshop https://modem2023.vub.ac.be/# +
+
+
+
+
+ + ☆ A Comprehensive Python Library for Deep Learning-Based Event Detection + in Multivariate Time Series Data and Information Retrieval in NLP ICML + + +
+ Event detection in time series data is crucial in various domains, including +finance, healthcare, cybersecurity, and science. Accurately identifying events +in time series data is vital for making informed decisions, detecting +anomalies, and predicting future trends. Despite extensive research exploring +diverse methods for event detection in time series, with deep learning +approaches being among the most advanced, there is still room for improvement +and innovation in this field. In this paper, we present a new deep learning +supervised method for detecting events in multivariate time series data. Our +method combines four distinct novelties compared to existing deep-learning +supervised methods. Firstly, it is based on regression instead of binary +classification. Secondly, it does not require labeled datasets where each point +is labeled; instead, it only requires reference events defined as time points +or intervals of time. Thirdly, it is designed to be robust by using a stacked +ensemble learning meta-model that combines deep learning models, ranging from +classic feed-forward neural networks (FFNs) to state-of-the-art architectures +like transformers. This ensemble approach can mitigate individual model +weaknesses and biases, resulting in more robust predictions. Finally, to +facilitate practical implementation, we have developed a Python package to +accompany our proposed method. The package, called eventdetector-ts, can be +installed through the Python Package Index (PyPI). In this paper, we present +our method and provide a comprehensive guide on the usage of the package. We +showcase its versatility and effectiveness through different real-world use +cases from natural language processing (NLP) to financial security domains. + +
+
+ comment: Accepted for the 22nd International Conference on Machine Learning + and Applications (ICMLA) +
+
+
+
+
+ + ☆ Symphony of experts: orchestration with adversarial insights in + reinforcement learning + + +
+ Structured reinforcement learning leverages policies with advantageous +properties to reach better performance, particularly in scenarios where +exploration poses challenges. We explore this field through the concept of +orchestration, where a (small) set of expert policies guides decision-making; +the modeling thereof constitutes our first contribution. We then establish +value-functions regret bounds for orchestration in the tabular setting by +transferring regret-bound results from adversarial settings. We generalize and +extend the analysis of natural policy gradient in Agarwal et al. [2021, Section +5.3] to arbitrary adversarial aggregation strategies. We also extend it to the +case of estimated advantage functions, providing insights into sample +complexity both in expectation and high probability. A key point of our +approach lies in its arguably more transparent proofs compared to existing +methods. Finally, we present simulations for a stochastic matching toy model. + +
+
+
+
+
+ + ☆ Learning Continuous Network Emerging Dynamics from Scarce Observations + via Data-Adaptive Stochastic Processes + + +
+ Learning network dynamics from the empirical structure and spatio-temporal +observation data is crucial to revealing the interaction mechanisms of complex +networks in a wide range of domains. However, most existing methods only aim at +learning network dynamic behaviors generated by a specific ordinary +differential equation instance, resulting in ineffectiveness for new ones, and +generally require dense observations. The observed data, especially from +network emerging dynamics, are usually difficult to obtain, which brings +trouble to model learning. Therefore, how to learn accurate network dynamics +with sparse, irregularly-sampled, partial, and noisy observations remains a +fundamental challenge. We introduce Neural ODE Processes for Network Dynamics +(NDP4ND), a new class of stochastic processes governed by stochastic +data-adaptive network dynamics, to overcome the challenge and learn continuous +network dynamics from scarce observations. Intensive experiments conducted on +various network dynamics in ecological population evolution, phototaxis +movement, brain activity, epidemic spreading, and real-world empirical systems, +demonstrate that the proposed method has excellent data adaptability and +computational efficiency, and can adapt to unseen network emerging dynamics, +producing accurate interpolation and extrapolation with reducing the ratio of +required observation data to only about 6\% and improving the learning speed +for new dynamics by three orders of magnitude. + +
+
+ comment: preprint +
+
+
+
+
+ + ☆ Towards Explainability in Monocular Depth Estimation + + +
+ The estimation of depth in two-dimensional images has long been a challenging +and extensively studied subject in computer vision. Recently, significant +progress has been made with the emergence of Deep Learning-based approaches, +which have proven highly successful. This paper focuses on the explainability +in monocular depth estimation methods, in terms of how humans perceive depth. +This preliminary study focuses on one of the most significant visual cues, +the relative size, which is prominent in almost all viewed images. We designed +a specific experiment to mimic the experiments in humans and have tested +state-of-the-art methods to indirectly assess the explainability in the context +defined. In addition, we observed that measuring the accuracy required further +attention and a particular approach is proposed to this end. The results show +that a mean accuracy of around 77% across methods is achieved, with some of the +methods performing markedly better, thus, indirectly revealing their +corresponding potential to uncover monocular depth cues, like relative size. + +
+
+
+
+
+ + ☆ ClearMark: Intuitive and Robust Model Watermarking via Transposed Model + Training + + +
+ Due to costly efforts during data acquisition and model training, Deep Neural +Networks (DNNs) belong to the intellectual property of the model creator. +Hence, unauthorized use, theft, or modification may lead to legal +repercussions. Existing DNN watermarking methods for ownership proof are often +non-intuitive, embed human-invisible marks, require trust in algorithmic +assessment that lacks human-understandable attributes, and rely on rigid +thresholds, making them susceptible to failure in cases of partial watermark +erasure. + This paper introduces ClearMark, the first DNN watermarking method designed +for intuitive human assessment. ClearMark embeds visible watermarks, enabling +human decision-making without rigid value thresholds while allowing +technology-assisted evaluations. ClearMark defines a transposed model +architecture that allows the model to be used in a backward fashion to interweave +the watermark with the main task within all model parameters. Compared to +existing watermarking methods, ClearMark produces visual watermarks that are +easy for humans to understand without requiring complex verification algorithms +or strict thresholds. The watermark is embedded within all model parameters and +entangled with the main task, exhibiting superior robustness. It shows an +8,544-bit watermark capacity comparable to the strongest existing work. +Crucially, ClearMark's effectiveness is model and dataset-agnostic, and +resilient against adversarial model manipulations, as demonstrated in a +comprehensive study performed with four datasets and seven architectures. + +
+
+ comment: 20 pages, 18 figures, 4 tables +
+
+
+
+
+ + ☆ Faithful Path Language Modelling for Explainable Recommendation over + Knowledge Graph + + +
+ Path reasoning methods over knowledge graphs have gained popularity for their +potential to improve transparency in recommender systems. However, the +resulting models still rely on pre-trained knowledge graph embeddings, fail to +fully exploit the interdependence between entities and relations in the KG for +recommendation, and may generate inaccurate explanations. In this paper, we +introduce PEARLM, a novel approach that efficiently captures user behaviour and +product-side knowledge through language modelling. With our approach, knowledge +graph embeddings are directly learned from paths over the KG by the language +model, which also unifies entities and relations in the same optimisation +space. Constraints on the sequence decoding additionally guarantee path +faithfulness with respect to the KG. Experiments on two datasets show the +effectiveness of our approach compared to state-of-the-art baselines. Source +code and datasets: AVAILABLE AFTER GETTING ACCEPTED. + +
+
+
+
+
+ + ☆ Grokking in Linear Estimators -- A Solvable Model that Groks without + Understanding + + +
+ Grokking is the intriguing phenomenon where a model learns to generalize long +after it has fit the training data. We show both analytically and numerically +that grokking can surprisingly occur in linear networks performing linear tasks +in a simple teacher-student setup with Gaussian inputs. In this setting, the +full training dynamics is derived in terms of the training and generalization +data covariance matrix. We present exact predictions on how the grokking time +depends on input and output dimensionality, train sample size, regularization, +and network initialization. We demonstrate that the sharp increase in +generalization accuracy may not imply a transition from "memorization" to +"understanding", but can simply be an artifact of the accuracy measure. We +provide empirical verification for our calculations, along with preliminary +results indicating that some predictions also hold for deeper networks, with +non-linear activations. + +
+
+ comment: 17 pages, 6 figures +
+
+
+
+
+ + ☆ Non-isotropic Persistent Homology: Leveraging the Metric Dependency of + PH + + +
+ Persistent Homology is a widely used topological data analysis tool that +creates a concise description of the topological properties of a point cloud +based on a specified filtration. Most filtrations used for persistent homology +depend (implicitly) on a chosen metric, which is typically agnostically chosen +as the standard Euclidean metric on $\mathbb{R}^n$. Recent work has tried to +uncover the 'true' metric on the point cloud using distance-to-measure +functions, in order to obtain more meaningful persistent homology results. Here +we propose an alternative look at this problem: we posit that information on +the point cloud is lost when restricting persistent homology to a single +(correct) distance function. Instead, we show how by varying the distance +function on the underlying space and analysing the corresponding shifts in the +persistence diagrams, we can extract additional topological and geometrical +information. Finally, we numerically show that non-isotropic persistent +homology can extract information on orientation, orientational variance, and +scaling of randomly generated point clouds with good accuracy and conduct some +experiments on real-world data. + +
+
+ comment: 30 pages, 17 figures, comments welcome! +
+
+
+
+
+ + ☆ FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness + for Semi-Supervised Learning NeurIPS 2023 + + +
+ Semi-Supervised Learning (SSL) has been an effective way to leverage abundant +unlabeled data with extremely scarce labeled data. However, most SSL methods +are commonly based on instance-wise consistency between different data +transformations. Therefore, the label guidance on labeled data is hard to +propagate to unlabeled data. Consequently, the learning process on labeled +data is much faster than on unlabeled data which is likely to fall into a local +minimum that does not favor unlabeled data, leading to sub-optimal +generalization performance. In this paper, we propose FlatMatch which minimizes +a cross-sharpness measure to ensure consistent learning performance between the +two datasets. Specifically, we increase the empirical risk on labeled data to +obtain a worst-case model which is a failure case that needs to be enhanced. +Then, by leveraging the richness of unlabeled data, we penalize the prediction +difference (i.e., cross-sharpness) between the worst-case model and the +original model so that the learning direction is beneficial to generalization +on unlabeled data. Therefore, we can calibrate the learning process without +being limited to insufficient label information. As a result, the mismatched +learning performance can be mitigated, further enabling the effective +exploitation of unlabeled data and improving SSL performance. Through +comprehensive validation, we show FlatMatch achieves state-of-the-art results +in many SSL settings. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in + AlphaZero + + +
+ Artificial Intelligence (AI) systems have made remarkable progress, attaining +super-human performance across various domains. This presents us with an +opportunity to further human knowledge and improve human expert performance by +leveraging the hidden knowledge encoded within these highly performant AI +systems. Yet, this knowledge is often hard to extract, and may be hard to +understand or learn from. Here, we show that this is possible by proposing a +new method that allows us to extract new chess concepts in AlphaZero, an AI +system that mastered the game of chess via self-play without human supervision. +Our analysis indicates that AlphaZero may encode knowledge that extends beyond +the existing human knowledge, but knowledge that is ultimately not beyond human +grasp, and can be successfully learned from. In a human study, we show that +these concepts are learnable by top human experts, as four top chess +grandmasters show improvements in solving the presented concept prototype +positions. This marks an important first milestone in advancing the frontier of +human knowledge by leveraging AI; a development that could bear profound +implications and help us shape how we interact with AI systems across many AI +applications. + +
+
+ comment: 61 pages, 29 figures +
+
+
+
+
+ + ☆ Multiple Key-value Strategy in Recommendation Systems Incorporating + Large Language Model CIKM2023 + + +
+ Recommendation systems (RS) play a significant role in matching users' +information needs for Internet applications, and they usually utilize a +vanilla neural network as the backbone to handle embedding details. Recently, +the large language model (LLM) has exhibited emergent abilities and achieved +great breakthroughs both in the CV and NLP communities. Thus, it is logical to +integrate RS with LLMs more closely, which has become an emerging research +direction. Although some existing works have made their contributions to this +issue, they mainly consider the single-key situation (e.g. historical +interactions), especially in sequential recommendation. The situation of +multiple key-value data is simply neglected. This significant scenario is +mainstream in real practical applications, where the information of users (e.g. +age, occupation, etc) and items (e.g. title, category, etc) has more than one +key. Therefore, we aim to implement sequential recommendations based on +multiple key-value data by incorporating RS with LLM. In particular, we +instruction-tune a prevalent open-source LLM (Llama 7B) in order to inject +domain knowledge of RS into the pre-trained LLM. Since we adopt multiple +key-value strategies, it is hard for the LLM to learn well across these keys. Thus, +general shuffle and mask strategies, as an innovative form of +data augmentation, are designed. To demonstrate the effectiveness of our approach, +extensive experiments are conducted on the popular and suitable dataset +MovieLens, which contains multiple key-value data. The experimental results +demonstrate that our approach can nicely and effectively address this +challenging issue. + +
+
+ comment: Accepted by CIKM2023 workshop at GenRec'23 +
+
+
+
+
+ + ☆ Information-Theoretic Generalization Analysis for Topology-aware + Heterogeneous Federated Edge Learning over Noisy Channels + + +
+ With the rapid growth of edge intelligence, the deployment of federated +learning (FL) over wireless networks has garnered increasing attention, which +is called Federated Edge Learning (FEEL). In FEEL, both mobile devices +transmitting model parameters over noisy channels and collecting data in +diverse environments pose challenges to the generalization of trained models. +Moreover, devices can engage in decentralized FL via Device-to-Device +communication while the communication topology of connected devices also +impacts the generalization of models. Most recent theoretical studies overlook +the incorporation of all these effects into FEEL when developing generalization +analyses. In contrast, our work presents an information-theoretic +generalization analysis for topology-aware FEEL in the presence of data +heterogeneity and noisy channels. Additionally, we propose a novel +regularization method called Federated Global Mutual Information Reduction +(FedGMIR) to enhance the performance of models based on our analysis. Numerical +results validate our theoretical findings and provide evidence for the +effectiveness of the proposed method. + +
+
+
+
+
+ + ☆ Graph Neural Networks with a Distribution of Parametrized Graphs + + +
+ Traditionally, graph neural networks have been trained using a single +observed graph. However, the observed graph represents only one possible +realization. In many applications, the graph may encounter uncertainties, such +as having erroneous or missing edges, as well as edge weights that provide +little informative value. To address these challenges and capture additional +information previously absent in the observed graph, we introduce latent +variables to parameterize and generate multiple graphs. We obtain the maximum +likelihood estimate of the network parameters in an Expectation-Maximization +(EM) framework based on the multiple graphs. Specifically, we iteratively +determine the distribution of the graphs using a Markov Chain Monte Carlo +(MCMC) method, incorporating the principles of PAC-Bayesian theory. Numerical +experiments demonstrate improvements in performance against baseline models on +node classification for heterogeneous graphs and graph regression on chemistry +datasets. + +
+
+
+
+
+ + ☆ Learning Efficient Surrogate Dynamic Models with Graph Spline Networks NeurIPS 2023 + + +
+ While complex simulations of physical systems have been widely used in +engineering and scientific computing, lowering their often prohibitive +computational requirements has only recently been tackled by deep learning +approaches. In this paper, we present GraphSplineNets, a novel deep-learning +method to speed up the forecasting of physical systems by reducing the grid +size and number of iteration steps of deep surrogate models. Our method uses +two differentiable orthogonal spline collocation methods to efficiently predict +response at any location in time and space. Additionally, we introduce an +adaptive collocation strategy in space to prioritize sampling from the most +important regions. GraphSplineNets improve the accuracy-speedup tradeoff in +forecasting various dynamical systems with increasing complexity, including the +heat equation, damped wave propagation, Navier-Stokes equations, and real-world +ocean currents in both regular and irregular domains. + +
+
+ comment: Published as a conference paper in NeurIPS 2023 +
+
+
+
+
+ + ☆ Winning Prize Comes from Losing Tickets: Improve Invariant Learning by + Exploring Variant Parameters for Out-of-Distribution Generalization + + +
+ Out-of-Distribution (OOD) Generalization aims to learn robust models that +generalize well to various environments without fitting to +distribution-specific features. Recent studies based on Lottery Ticket +Hypothesis (LTH) address this problem by minimizing the learning target to find +some of the parameters that are critical to the task. However, in OOD problems, +such solutions are suboptimal as the learning task contains severe distribution +noise, which can mislead the optimization process. Therefore, apart from +finding the task-related parameters (i.e., invariant parameters), we propose +Exploring Variant parameters for Invariant Learning (EVIL) which also leverages +the distribution knowledge to find the parameters that are sensitive to +distribution shift (i.e., variant parameters). Once the variant parameters are +left out of invariant learning, a robust subnetwork that is resistant to +distribution shift can be found. Additionally, the parameters that are +relatively stable across distributions can be considered invariant ones to +improve invariant learning. By fully exploring both variant and invariant +parameters, our EVIL can effectively identify a robust subnetwork to improve +OOD generalization. In extensive experiments on the integrated testbed DomainBed, +EVIL can effectively and efficiently enhance many popular methods, such as ERM, +IRM, SAM, etc. + +
+
+ comment: 27 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Efficient Estimation of Average-Case Robustness for Multi-Class + Classification + + +
+ Robustness in machine learning is commonly studied in the adversarial +setting, yet real-world noise (such as measurement noise) is random rather than +adversarial. Model behavior under such noise is captured by average-case +robustness, i.e., the probability of obtaining consistent predictions in a +local region around an input. However, the naive approach to computing +average-case robustness based on Monte-Carlo sampling is statistically +inefficient, especially for high-dimensional data, leading to prohibitive +computational costs for large-scale applications. In this work, we develop the +first analytical estimators to efficiently compute average-case robustness of +multi-class discriminative models. These estimators linearize models in the +local region around an input and analytically compute the robustness of the +resulting linear models. We show empirically that these estimators efficiently +compute the robustness of standard deep learning models and demonstrate these +estimators' usefulness for various tasks involving robustness, such as +measuring robustness bias and identifying dataset samples that are vulnerable +to noise perturbation. In doing so, this work not only proposes a new framework +for robustness, but also makes its computation practical, enabling the use of +average-case robustness in downstream applications. + +
+
+
+
+
+ + ♻ ☆ Revisiting Deep Learning Models for Tabular Data NeurIPS 2021 + + +
+ The existing literature on deep learning for tabular data proposes a wide +range of novel architectures and reports competitive results on various +datasets. However, the proposed models are usually not properly compared to +each other and existing works often use different benchmarks and experiment +protocols. As a result, it is unclear for both researchers and practitioners +what models perform best. Additionally, the field still lacks effective +baselines, that is, the easy-to-use models that provide competitive performance +across different problems. + In this work, we perform an overview of the main families of DL architectures +for tabular data and raise the bar of baselines in tabular DL by identifying +two simple and powerful deep architectures. The first one is a ResNet-like +architecture which turns out to be a strong baseline that is often missing in +prior works. The second model is our simple adaptation of the Transformer +architecture for tabular data, which outperforms other solutions on most tasks. +Both models are compared to many existing architectures on a diverse set of +tasks under the same training and tuning protocols. We also compare the best DL +models with Gradient Boosted Decision Trees and conclude that there is still no +universally superior solution. + +
+
+ comment: NeurIPS 2021 camera-ready. Code: + https://github.com/yandex-research/tabular-dl-revisiting-models (v4: minor + update) +
+
+
+
+
+ + ♻ ☆ Inverse Dynamics Pretraining Learns Good Representations for Multitask + Imitation + + +
+ In recent years, domains such as natural language processing and image +recognition have popularized the paradigm of using large datasets to pretrain +representations that can be effectively transferred to downstream tasks. In +this work we evaluate how such a paradigm should be done in imitation learning, +where both pretraining and finetuning data are trajectories collected by +experts interacting with an unknown environment. Namely, we consider a setting +where the pretraining corpus consists of multitask demonstrations and the task +for each demonstration is set by an unobserved latent context variable. The +goal is to use the pretraining corpus to learn a low dimensional representation +of the high dimensional (e.g., visual) observation space which can be +transferred to a novel context for finetuning on a limited dataset of +demonstrations. Among a variety of possible pretraining objectives, we argue +that inverse dynamics modeling -- i.e., predicting an action given the +observations appearing before and after it in the demonstration -- is +well-suited to this setting. We provide empirical evidence of this claim +through evaluations on a variety of simulated visuomotor manipulation problems. +While previous work has attempted various theoretical explanations regarding +the benefit of inverse dynamics modeling, we find that these arguments are +insufficient to explain the empirical advantages often observed in our +settings, and so we derive a novel analysis using a simple but general +environment model. + +
+
+
+
+
+ + ♻ ☆ ReDi: Efficient Learning-Free Diffusion Inference via Trajectory + Retrieval ICML 2023 + + +
+ Diffusion models show promising generation capability for a variety of data. +Despite their high generation quality, the inference for diffusion models is +still time-consuming due to the numerous sampling iterations required. To +accelerate the inference, we propose ReDi, a simple yet learning-free +Retrieval-based Diffusion sampling framework. From a precomputed knowledge +base, ReDi retrieves a trajectory similar to the partially generated trajectory +at an early stage of generation, skips a large portion of intermediate steps, +and continues sampling from a later step in the retrieved trajectory. We +theoretically prove that the generation performance of ReDi is guaranteed. Our +experiments demonstrate that ReDi improves the model inference efficiency by 2x +speedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain +image generation such as image stylization. + +
+
+ comment: ICML 2023 +
+
+
+
+
+ + ♻ ☆ Estimating Class Separability of Datasets Using Persistent Homology with + Application to LLM Fine-Tuning + + +
+ This paper proposes a method to estimate the class separability of an +unlabeled text dataset by inspecting the topological characteristics of +sentence-transformer embeddings of the text. Experiments conducted involve both +binary and multi-class cases, with balanced and imbalanced scenarios. The +results demonstrate a clear correlation and a better consistency between the +proposed method and other separability and classification metrics, such as +Thornton's method and the AUC score of a logistic regression classifier, as +well as unsupervised methods. Finally, we empirically show that the proposed +method can be part of a stopping criterion for fine-tuning language-model +classifiers. By monitoring the class separability of the embedding space after +each training iteration, we can detect when the training process stops +improving the separability of the embeddings without using additional labels. + +
+
+ comment: Rewrite of the manuscript with more baselines, extended related works + section, and discussion +
+
+
+
+
+ + ♻ ☆ Interpretable and Explainable Logical Policies via Neurally Guided + Symbolic Abstraction + + +
+ The limited priors required by neural networks make them the dominating +choice to encode and learn policies using reinforcement learning (RL). However, +they are also black-boxes, making it hard to understand the agent's behaviour, +especially when working on the image level. Therefore, neuro-symbolic RL aims +at creating policies that are interpretable in the first place. Unfortunately, +interpretability is not explainability. To achieve both, we introduce Neurally +gUided Differentiable loGic policiEs (NUDGE). NUDGE exploits trained neural +network-based agents to guide the search of candidate-weighted logic rules, +then uses differentiable logic to train the logic agents. Our experimental +evaluation demonstrates that NUDGE agents can induce interpretable and +explainable policies while outperforming purely neural ones and showing good +flexibility to environments of different initial states and problem sizes. + +
+
+ comment: 9 main pages + appendix (19 in total) +
+
+
+
+
+ + ♻ ☆ A Vulnerability of Attribution Methods Using Pre-Softmax Scores + + +
+ We discuss a vulnerability involving a category of attribution methods used +to provide explanations for the outputs of convolutional neural networks +working as classifiers. It is known that this type of network is vulnerable +to adversarial attacks, in which imperceptible perturbations of the input may +alter the outputs of the model. In contrast, here we focus on effects that +small modifications in the model may cause on the attribution method without +altering the model outputs. + +
+
+ comment: 7 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large + Stepsizes and Edge of Stability + + +
+ In this paper, we investigate the impact of stochasticity and large stepsizes +on the implicit regularisation of gradient descent (GD) and stochastic gradient +descent (SGD) over diagonal linear networks. We prove the convergence of GD and +SGD with macroscopic stepsizes in an overparametrised regression setting and +characterise their solutions through an implicit regularisation problem. Our +crisp characterisation leads to qualitative insights about the impact of +stochasticity and stepsizes on the recovered solution. Specifically, we show +that large stepsizes consistently benefit SGD for sparse regression problems, +while they can hinder the recovery of sparse solutions for GD. These effects +are magnified for stepsizes in a tight window just below the divergence +threshold, in the "edge of stability" regime. Our findings are supported by +experimental results. + +
+
+
+
+
+ + ♻ ☆ Scaling Laws for Hyperparameter Optimization NeurIPS 2023 + + +
+ Hyperparameter optimization is an important subfield of machine learning that +focuses on tuning the hyperparameters of a chosen algorithm to achieve peak +performance. Recently, there has been a stream of methods that tackle the issue +of hyperparameter optimization; however, most of the methods do not exploit the +dominant power-law nature of learning curves for Bayesian optimization. In this +work, we propose Deep Power Laws (DPL), an ensemble of neural network models +conditioned to yield predictions that follow a power-law scaling pattern. Our +method dynamically decides which configurations to pause and train +incrementally by making use of gray-box evaluations. We compare our method +against 7 state-of-the-art competitors on 3 benchmarks related to tabular, +image, and NLP datasets covering 59 diverse tasks. Our method achieves the best +results across all benchmarks by obtaining the best any-time results compared +to all competitors. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Saddle-to-Saddle Dynamics in Diagonal Linear Networks + + +
+ In this paper we fully describe the trajectory of gradient flow over diagonal +linear networks in the limit of vanishing initialisation. We show that the +limiting flow successively jumps from a saddle of the training loss to another +until reaching the minimum $\ell_1$-norm solution. This saddle-to-saddle +dynamics translates to an incremental learning process as each saddle +corresponds to the minimiser of the loss constrained to an active set outside +of which the coordinates must be zero. We explicitly characterise the visited +saddles as well as the jumping times through a recursive algorithm reminiscent +of the LARS algorithm used for computing the Lasso path. Our proof leverages a +convenient arc-length time-reparametrisation which enables us to keep track of the +heteroclinic transitions between the jumps. Our analysis requires negligible +assumptions on the data, applies to both under and overparametrised settings +and covers complex cases where there is no monotonicity of the number of active +coordinates. We provide numerical experiments to support our findings. + +
+
+
+
+
+ + ♻ ☆ Don't be so Monotone: Relaxing Stochastic Line Search in + Over-Parameterized Models + + +
+ Recent works have shown that line search methods can speed up Stochastic +Gradient Descent (SGD) and Adam in modern over-parameterized settings. However, +existing line searches may take steps that are smaller than necessary since +they require a monotone decrease of the (mini-)batch objective function. We +explore nonmonotone line search methods to relax this condition and possibly +accept larger step sizes. Despite the lack of a monotonic decrease, we prove +the same fast rates of convergence as in the monotone case. Our experiments +show that nonmonotone methods improve the speed of convergence and +generalization properties of SGD/Adam even beyond the previous monotone line +searches. We propose a POlyak NOnmonotone Stochastic (PoNoS) method, obtained +by combining a nonmonotone line search with a Polyak initial step size. +Furthermore, we develop a new resetting technique that in the majority of the +iterations reduces the number of backtracks to zero while still maintaining a +large initial step size. To the best of our knowledge, a first runtime +comparison shows that the epoch-wise advantage of line-search-based methods +gets reflected in the overall computational time. + +
+
+
+
+
+ + ♻ ☆ On Single Index Models beyond Gaussian Data + + +
+ Sparse high-dimensional functions have arisen as a rich framework to study +the behavior of gradient-descent methods using shallow neural networks, +showcasing their ability to perform feature learning beyond linear models. +Amongst those functions, the simplest are single-index models $f(x) = \phi( x +\cdot \theta^*)$, where the labels are generated by an arbitrary non-linear +scalar link function $\phi$ applied to an unknown one-dimensional projection +$\theta^*$ of the input data. By focusing on Gaussian data, several recent +works have built a remarkable picture, where the so-called information exponent +(related to the regularity of the link function) controls the required sample +complexity. In essence, these tools exploit the stability and spherical +symmetry of Gaussian distributions. In this work, building from the framework +of \cite{arous2020online}, we explore extensions of this picture beyond the +Gaussian setting, where stability or symmetry might be violated. Focusing +on the planted setting where $\phi$ is known, our main results establish that +Stochastic Gradient Descent can efficiently recover the unknown direction +$\theta^*$ in the high-dimensional regime, under assumptions that extend +previous works \cite{yehudai2020learning,wu2022learning}. + +
+
+
+
+
+ + ♻ ☆ Implicit Two-Tower Policies + + +
+ We present a new class of structured reinforcement learning +policy-architectures, Implicit Two-Tower (ITT) policies, where the actions are +chosen based on the attention scores of their learnable latent representations +with those of the input states. By explicitly disentangling action from state +processing in the policy stack, we achieve two main goals: substantial +computational gains and better performance. Our architectures are compatible +with both discrete and continuous action spaces. By conducting tests on 15 +environments from OpenAI Gym and DeepMind Control Suite, we show that +ITT-architectures are particularly suited for blackbox/evolutionary +optimization and the corresponding policy training algorithms outperform their +vanilla unstructured implicit counterparts as well as commonly used explicit +policies. We complement our analysis by showing how techniques such as hashing +and lazy tower updates, critically relying on the two-tower structure of ITTs, +can be applied to obtain additional computational improvements. + +
+
+
+
+
+ + ♻ ☆ Are GATs Out of Balance? NeurIPS + + +
+ While the expressive power and computational capabilities of graph neural +networks (GNNs) have been theoretically studied, their optimization and +learning dynamics, in general, remain largely unexplored. Our study examines +the Graph Attention Network (GAT), a popular GNN architecture in which a node's +neighborhood aggregation is weighted by parameterized attention coefficients. +We derive a conservation law of GAT gradient flow dynamics, which explains why +a high portion of parameters in GATs with standard initialization struggle to +change during training. This effect is amplified in deeper GATs, which perform +significantly worse than their shallow counterparts. To alleviate this problem, +we devise an initialization scheme that balances the GAT network. Our approach +i) allows more effective propagation of gradients and in turn enables +trainability of deeper networks, and ii) attains a considerable speedup in +training and convergence time in comparison to the standard initialization. Our +main theorem serves as a stepping stone to studying the learning dynamics of +positive homogeneous models with attention mechanisms. + +
+
+ comment: 25 pages. To be published in Advances in Neural Information + Processing Systems (NeurIPS), 2023 +
+
+
+
+
+ + ♻ ☆ Learning to Receive Help: Intervention-Aware Concept Embedding Models NeurIPS 2023 + + +
+ Concept Bottleneck Models (CBMs) tackle the opacity of neural architectures +by constructing and explaining their predictions using a set of high-level +concepts. A special property of these models is that they permit concept +interventions, wherein users can correct mispredicted concepts and thus improve +the model's performance. Recent work, however, has shown that intervention +efficacy can be highly dependent on the order in which concepts are intervened +on and on the model's architecture and training hyperparameters. We argue that +this is rooted in a CBM's lack of train-time incentives for the model to be +appropriately receptive to concept interventions. To address this, we propose +Intervention-aware Concept Embedding models (IntCEMs), a novel CBM-based +architecture and training paradigm that improves a model's receptiveness to +test-time interventions. Our model learns a concept intervention policy in an +end-to-end fashion from where it can sample meaningful intervention +trajectories at train-time. This conditions IntCEMs to effectively select and +receive concept interventions when deployed at test-time. Our experiments show +that IntCEMs significantly outperform state-of-the-art concept-interpretable +models when provided with test-time concept interventions, demonstrating the +effectiveness of our approach. + +
+
+ comment: Accepted as a spotlight at the Thirty-seventh Conference on Neural + Information Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure EMNLP 2023 + + +
+ This work presents StrAE: a Structured Autoencoder framework that through +strict adherence to explicit structure, and use of a novel contrastive +objective over tree-structured representations, enables effective learning of +multi-level representations. Through comparison over different forms of +structure, we verify that our results are directly attributable to the +informativeness of the structure provided as input, and show that this is not +the case for existing tree models. We then further extend StrAE to allow the +model to define its own compositions using a simple localised-merge algorithm. +This variant, called Self-StrAE, outperforms baselines that don't involve +explicit hierarchical compositions, and is comparable to models given +informative structure (e.g. constituency parses). Our experiments are conducted +in a data-constrained (circa 10M tokens) setting to help tease apart the +contribution of the inductive bias to effective learning. However, we find that +this framework can be robust to scale, and when extended to a much larger +dataset (circa 100M tokens), our 430 parameter model performs comparably to a +6-layer RoBERTa many orders of magnitude larger in size. Our findings support +the utility of incorporating explicit composition as an inductive bias for +effective representation learning. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ♻ ☆ Is Attention always needed? A Case Study on Language Identification from + Speech + + +
+ Language Identification (LID) is a crucial preliminary process in the field +of Automatic Speech Recognition (ASR) that involves the identification of a +spoken language from audio samples. Contemporary systems that can process +speech in multiple languages require users to expressly designate one or more +languages prior to utilization. The LID task assumes a significant role in +scenarios where ASR systems are unable to comprehend the spoken language in +multilingual settings, leading to unsuccessful speech recognition outcomes. The +present study introduces convolutional recurrent neural network (CRNN) based +LID, designed to operate on the Mel-frequency Cepstral Coefficient (MFCC) +characteristics of audio samples. Furthermore, we replicate certain +state-of-the-art methodologies, specifically the Convolutional Neural Network +(CNN) and Attention-based Convolutional Recurrent Neural Network (CRNN with +attention), and conduct a comparative analysis with our CRNN-based approach. We +conducted comprehensive evaluations on thirteen distinct Indian languages and +our model resulted in over 98% classification accuracy. The LID model exhibits +high-performance levels ranging from 97% to 100% for languages that are +linguistically similar. The proposed LID model exhibits a high degree of +extensibility to additional languages and demonstrates a strong resistance to +noise, achieving 91.2% accuracy in a noisy setting when applied to a European +Language (EU) dataset. + +
+
+ comment: Accepted for publication in Natural Language Engineering +
+
+
+
+
+ + ♻ ☆ From Tempered to Benign Overfitting in ReLU Neural Networks NeurIPS 2023 + + +
+ Overparameterized neural networks (NNs) are observed to generalize well even +when trained to perfectly fit noisy data. This phenomenon motivated a large +body of work on "benign overfitting", where interpolating predictors achieve +near-optimal performance. Recently, it was conjectured and empirically observed +that the behavior of NNs is often better described as "tempered overfitting", +where the performance is non-optimal yet also non-trivial, and degrades as a +function of the noise level. However, a theoretical justification of this claim +for non-linear NNs has been lacking so far. In this work, we provide several +results that aim at bridging these complementing views. We study a simple +classification setting with 2-layer ReLU NNs, and prove that under various +assumptions, the type of overfitting transitions from tempered in the extreme +case of one-dimensional data, to benign in high dimensions. Thus, we show that +the input dimension has a crucial role on the type of overfitting in this +setting, which we also validate empirically for intermediate dimensions. +Overall, our results shed light on the intricate connections between the +dimension, sample size, architecture and training algorithm on the one hand, +and the type of resulting overfitting on the other hand. + +
+
+ comment: NeurIPS 2023 camera ready version +
+
+
+
+
+ + ♻ ☆ Guarantees for Self-Play in Multiplayer Games via Polymatrix + Decomposability NeurIPS 2023 + + +
+ Self-play is a technique for machine learning in multi-agent systems where a +learning algorithm learns by interacting with copies of itself. Self-play is +useful for generating large quantities of data for learning, but has the +drawback that the agents the learner will face post-training may have +dramatically different behavior than the learner came to expect by interacting +with itself. For the special case of two-player constant-sum games, self-play +that reaches Nash equilibrium is guaranteed to produce strategies that perform +well against any post-training opponent; however, no such guarantee exists for +multiplayer games. We show that in games that approximately decompose into a +set of two-player constant-sum games (called constant-sum polymatrix games) +where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria +in each subgame (called subgame stability), any no-external-regret algorithm +that learns by self-play will produce a strategy with bounded vulnerability. +For the first time, our results identify a structural property of multiplayer +games that enables performance guarantees for the strategies produced by a broad +class of self-play algorithms. We demonstrate our findings through experiments +on Leduc poker. + +
+
+ comment: To appear at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ PeFLL: Personalized Federated Learning by Learning to Learn + + +
+ We present PeFLL, a new personalized federated learning algorithm that +improves over the state-of-the-art in three aspects: 1) it produces more +accurate models, especially in the low-data regime, and not only for clients +present during its training phase, but also for any that may emerge in the +future; 2) it reduces the amount of on-client computation and client-server +communication by providing future clients with ready-to-use personalized models +that require no additional finetuning or optimization; 3) it comes with +theoretical guarantees that establish generalization from the observed clients +to future ones. At the core of PeFLL lies a learning-to-learn approach that +jointly trains an embedding network and a hypernetwork. The embedding network +is used to represent clients in a latent descriptor space in a way that +reflects their similarity to each other. The hypernetwork takes as input such +descriptors and outputs the parameters of fully personalized client models. In +combination, both networks constitute a learning algorithm that achieves +state-of-the-art performance in several personalized federated learning +benchmarks. + +
+
+
+
+
+ + ♻ ☆ Regularized Data Programming with Automated Bayesian Prior Selection + + +
+ The cost of manual data labeling can be a significant obstacle in supervised +learning. Data programming (DP) offers a weakly supervised solution for +training dataset creation, wherein the outputs of user-defined programmatic +labeling functions (LFs) are reconciled through unsupervised learning. However, +DP can fail to outperform an unweighted majority vote in some scenarios, +including low-data contexts. This work introduces a Bayesian extension of +classical DP that mitigates failures of unsupervised learning by augmenting the +DP objective with regularization terms. Regularized learning is achieved +through maximum a posteriori estimation with informative priors. Majority vote +is proposed as a proxy signal for automated prior parameter selection. Results +suggest that regularized DP improves performance relative to maximum likelihood +and majority voting, confers greater interpretability, and bolsters performance +in low-data regimes. + +
+
+
+
+
+ + ♻ ☆ Improving Robustness and Reliability in Medical Image Classification + with Latent-Guided Diffusion and Nested-Ensembles + + +
+ While deep learning models have achieved remarkable success across a range of +medical image analysis tasks, deployment of these models in real clinical +contexts requires that they be robust to variability in the acquired images. +While many methods apply predefined transformations to augment the training +data to enhance test-time robustness, these transformations may not ensure the +model's robustness to the diverse variability seen in patient images. In this +paper, we introduce a novel three-stage approach based on transformers coupled +with conditional diffusion models, with the goal of improving model robustness +to the kinds of imaging variability commonly encountered in practice without +the need for pre-determined data augmentation strategies. To this end, multiple +image encoders first learn hierarchical feature representations to build +discriminative latent spaces. Next, a reverse diffusion process, guided by the +latent code, acts on an informative prior and proposes prediction candidates in +a generative manner. Finally, several prediction candidates are aggregated in a +bi-level aggregation protocol to produce the final output. Through extensive +experiments on medical imaging benchmark datasets, we show that our method +improves upon state-of-the-art methods in terms of robustness and confidence +calibration. Additionally, we introduce a strategy to quantify the prediction +uncertainty at the instance level, increasing their trustworthiness to +clinicians using them in clinical practice. + +
+
+ comment: 13 pages, 6 figures, 7 tables +
+
+
+
+
+ + ♻ ☆ Maximize to Explore: One Objective Function Fusing Estimation, Planning, + and Exploration + + +
+ In online reinforcement learning (online RL), balancing exploration and +exploitation is crucial for finding an optimal policy in a sample-efficient +way. To achieve this, existing sample-efficient online RL algorithms typically +consist of three components: estimation, planning, and exploration. However, in +order to cope with general function approximators, most of them involve +impractical algorithmic components to incentivize exploration, such as +optimization within data-dependent level-sets or complicated sampling +procedures. To address this challenge, we propose an easy-to-implement RL +framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs +to optimize \emph{unconstrainedly} a single objective that integrates the +estimation and planning components while balancing exploration and exploitation +automatically. Theoretically, we prove that \texttt{MEX} achieves a sublinear +regret with general function approximations for Markov decision processes (MDP) +and is further extendable to two-player zero-sum Markov games (MG). Meanwhile, +we adapt deep RL baselines to design practical versions of \texttt{MEX}, in +both model-free and model-based manners, which can outperform baselines by a +stable margin in various MuJoCo environments with sparse rewards. Compared with +existing sample-efficient online RL algorithms with general function +approximations, \texttt{MEX} achieves similar sample efficiency while enjoying +a lower computational cost and is more compatible with modern deep RL methods. + +
+
+
+
+
+ + ♻ ☆ S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist + Captions NeurIPS 2023 + + +
+ Vision-language models, such as contrastive language-image pre-training +(CLIP), have demonstrated impressive results in natural image domains. However, +these models often struggle when applied to specialized domains like remote +sensing, and adapting to such domains is challenging due to the limited number +of image-text pairs available for training. To address this, we propose S-CLIP, +a semi-supervised learning method for training CLIP that utilizes additional +unpaired images. S-CLIP employs two pseudo-labeling strategies specifically +designed for contrastive learning and the language modality. The caption-level +pseudo-label is given by a combination of captions of paired images, obtained +by solving an optimal transport problem between unpaired and paired images. The +keyword-level pseudo-label is given by a keyword in the caption of the nearest +paired image, trained through partial label learning that assumes a candidate +set of labels for supervision instead of the exact one. By combining these +objectives, S-CLIP significantly enhances the training of CLIP using only a few +image-text pairs, as demonstrated in various specialist domains, including +remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP +improves CLIP by 10% for zero-shot classification and 4% for image-text +retrieval on the remote sensing benchmark, matching the performance of +supervised CLIP while using three times fewer image-text pairs. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ VMAF Re-implementation on PyTorch: Some Experimental Results + + +
+ Based on the standard VMAF implementation, we propose an implementation of +VMAF using the PyTorch framework. For this implementation, comparisons with the +standard (libvmaf) show a discrepancy of $\lesssim 10^{-2}$ in VMAF units. We +investigate gradient computation when using VMAF as an objective function and +demonstrate that training using this function does not result in ill-behaved +gradients. + +
+
+ comment: 4 pages +
+
+
+
+
+ + ♻ ☆ Limitations of Deep Learning for Inverse Problems on Digital Hardware + + +
+ Deep neural networks have seen tremendous success over the last years. Since +the training is performed on digital hardware, in this paper, we analyze what +actually can be computed on current hardware platforms modeled as Turing +machines, which would lead to inherent restrictions of deep learning. For this, +we focus on the class of inverse problems, which, in particular, encompasses +any task to reconstruct data from measurements. We prove that +finite-dimensional inverse problems are not Banach-Mazur computable for small +relaxation parameters. Even more, our results introduce a lower bound on the +accuracy that can be obtained algorithmically. + +
+
+ comment: To be published in IEEE Transactions on Information Theory +
+
+
+
+
+ + ♻ ☆ Leveraging the two timescale regime to demonstrate convergence of neural + networks NeurIPS 2023 + + +
+ We study the training dynamics of shallow neural networks, in a two-timescale +regime in which the stepsizes for the inner layer are much smaller than those +for the outer layer. In this regime, we prove convergence of the gradient flow +to a global optimum of the non-convex optimization problem in a simple +univariate setting. The number of neurons need not be asymptotically large for +our result to hold, distinguishing our result from popular recent approaches +such as the neural tangent kernel or mean-field regimes. Experimental +illustration is provided, showing that the stochastic gradient descent behaves +according to our description of the gradient flow and thus converges to a +global optimum in the two-timescale regime, but can fail outside of this +regime. + +
+
+ comment: NeurIPS 2023. 34 pages, 10 figures +
+
+
+
+
+ + ♻ ☆ Sheaf Neural Networks for Graph-based Recommender Systems + + +
+ Recent progress in Graph Neural Networks has resulted in wide adoption by +many applications, including recommendation systems. The reason for Graph +Neural Networks' superiority over other approaches is that many problems in +recommendation systems can be naturally modeled as graphs, where nodes can be +either users or items and edges represent preference relationships. In current +Graph Neural Network approaches, nodes are represented with a static vector +learned at training time. This static vector might only be suitable to capture +some of the nuances of users or items they define. To overcome this limitation, +we propose using a recently proposed model inspired by category theory: Sheaf +Neural Networks. Sheaf Neural Networks, and their connected Laplacian, can +address the previous problem by associating every node (and edge) with a vector +space instead of a single vector. The vector space representation is richer +and allows picking the proper representation at inference time. This approach +can be generalized for different related tasks on graphs and achieves +state-of-the-art performance in terms of F1-Score@N in collaborative filtering +and Hits@20 in link prediction. For collaborative filtering, the approach is +evaluated on the MovieLens 100K with a 5.1% improvement, on MovieLens 1M with a +5.4% improvement and on Book-Crossing with a 2.8% improvement, while for link +prediction on the ogbl-ddi dataset with a 1.6% refinement with respect to the +respective baselines. + +
+
+ comment: 9 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ Bayesian Neural Networks for Geothermal Resource Assessment: Prediction + with Uncertainty + + +
+ We consider the application of machine learning to the evaluation of +geothermal resource potential. A supervised learning problem is defined where +maps of 10 geological and geophysical features within the state of Nevada, USA +are used to define geothermal potential across a broad region. We have +available a relatively small set of positive training sites (known resources or +active power plants) and negative training sites (known drill sites with +unsuitable geothermal conditions) and use these to constrain and optimize +artificial neural networks for this classification task. The main objective is +to predict the geothermal resource potential at unknown sites within a large +geographic area where the defining features are known. These predictions could +be used to target promising areas for further detailed investigations. We +describe the evolution of our work from defining a specific neural network +architecture to training and optimization trials. Upon analysis we expose the +inevitable problems of model variability and resulting prediction uncertainty. +Finally, to address these problems we apply the concept of Bayesian neural +networks, a heuristic approach to regularization in network training, and make +use of the practical interpretation of the formal uncertainty measures they +provide. + +
+
+ comment: 27 pages, 12 figures +
+
+
+
+
+ + ♻ ☆ Interpretable Alzheimer's Disease Classification Via a Contrastive + Diffusion Autoencoder + + +
+ In visual object classification, humans often justify their choices by +comparing objects to prototypical examples within that class. We may therefore +increase the interpretability of deep learning models by imbuing them with a +similar style of reasoning. In this work, we apply this principle by +classifying Alzheimer's Disease based on the similarity of images to training +examples within the latent space. We use a contrastive loss combined with a +diffusion autoencoder backbone, to produce a semantically meaningful latent +space, such that neighbouring latents have similar image-level features. We +achieve a classification accuracy comparable to black box approaches on a +dataset of 2D MRI images, whilst producing human interpretable model +explanations. Therefore, this work stands as a contribution to the pertinent +development of accurate and interpretable deep learning within medical imaging. + +
+
+
+
+
+ + ♻ ☆ How do I update my model? On the resilience of Predictive Process + Monitoring models to change + + +
+ Existing well-investigated Predictive Process Monitoring techniques typically +construct a predictive model based on past process executions, and then use it +to predict the future of new ongoing cases, without the possibility of updating +it with new cases when they complete their execution. This can make Predictive +Process Monitoring too rigid to deal with the variability of processes working +in real environments that continuously evolve and/or exhibit new variant +behaviours over time. As a solution to this problem, we evaluate the use of +three different strategies that allow the periodic rediscovery or incremental +construction of the predictive model so as to exploit new available data. The +evaluation focuses on the performance of the new learned predictive models, in +terms of accuracy and time, against the original one, and uses a number of real +and synthetic datasets with and without explicit Concept Drift. The results +provide evidence of the potential of incremental learning algorithms for +predictive process monitoring in real environments. + +
+
+
+
+
+ + ♻ ☆ A Tale of Pronouns: Interpretability Informs Gender Bias Mitigation for + Fairer Instruction-Tuned Machine Translation EMNLP 2023 + + +
+ Recent instruction fine-tuned models can solve multiple NLP tasks when +prompted to do so, with machine translation (MT) being a prominent use case. +However, current research often focuses on standard performance benchmarks, +leaving compelling fairness and ethical considerations behind. In MT, this +might lead to misgendered translations, resulting, among other harms, in the +perpetuation of stereotypes and prejudices. In this work, we address this gap +by investigating whether and to what extent such models exhibit gender bias in +machine translation and how we can mitigate it. Concretely, we compute +established gender bias metrics on the WinoMT corpus from English to German and +Spanish. We discover that IFT models default to male-inflected translations, +even disregarding female occupational stereotypes. Next, using interpretability +methods, we unveil that models systematically overlook the pronoun indicating +the gender of a target occupation in misgendered translations. Finally, based +on this finding, we propose an easy-to-implement and effective bias mitigation +solution based on few-shot learning that leads to significantly fairer +translations. + +
+
+ comment: Accepted at EMNLP 2023. Code and data at + https://github.com/MilaNLProc/interpretability-mt-gender-bias +
+
+
+
+
+ + ♻ ☆ Necessary and Sufficient Conditions for Optimal Decision Trees using + Dynamic Programming + + +
+ Global optimization of decision trees has been shown to be promising in terms of +accuracy, size, and consequently human comprehensibility. However, many of the +methods used rely on general-purpose solvers for which scalability remains an +issue. Dynamic programming methods have been shown to scale much better because +they exploit the tree structure by solving subtrees as independent subproblems. +However, this only works when an objective can be optimized separately for +subtrees. We explore this relationship in detail and show necessary and +sufficient conditions for such separability and generalize previous dynamic +programming approaches into a framework that can optimize any combination of +separable objectives and constraints. Experiments on five application domains +show the general applicability of this framework, while outperforming the +scalability of general-purpose solvers by a large margin. + +
+
+
+
+
+ + ♻ ☆ Implementation of The Future of Drug Discovery: QuantumBased Machine + Learning Simulation (QMLS) + + +
+ The Research & Development (R&D) phase of drug development is a lengthy and +costly process. To revolutionize this process, we introduce our new concept +QMLS to shorten the whole R&D phase to three to six months and decrease the +cost to merely fifty to eighty thousand USD. For Hit Generation, Machine +Learning Molecule Generation (MLMG) generates possible hits according to the +molecular structure of the target protein while the Quantum Simulation (QS) +filters molecules from the primary assay based on the reaction and binding +effectiveness with the target protein. Then, for Lead Optimization, the +resultant molecules generated and filtered from MLMG and QS are compared, and +molecules that appear as a result of both processes will be made into dozens of +molecular variations through Machine Learning Molecule Variation (MLMV), while +others will only be made into a few variations. Lastly, all optimized molecules +would undergo multiple rounds of QS filtering with a high standard for reaction +effectiveness and safety, creating a few dozen pre-clinical-trial-ready drugs. +This paper is based on our first paper, where we pitched the concept of machine +learning combined with quantum simulations. In this paper we will go over the +detailed design and framework of QMLS, including MLMG, MLMV, and QS. + +
+
+ comment: 13 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect + Dataset + + +
+ In an effort to catalog insect biodiversity, we propose a new large dataset +of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is +taxonomically classified by an expert, and also has associated genetic +information including raw nucleotide barcode sequences and assigned barcode +index numbers, which are genetically-based proxies for species classification. +This paper presents a curated million-image dataset, primarily to train +computer-vision models capable of providing image-based taxonomic assessment; +however, the dataset also presents compelling characteristics, the study of +which would be of interest to the broader machine learning community. Driven by +the biological nature inherent to the dataset, a characteristic long-tailed +class-imbalance distribution is exhibited. Furthermore, taxonomic labelling is +a hierarchical classification scheme, presenting a highly fine-grained +classification problem at lower levels. Beyond spurring interest in +biodiversity research within the machine learning community, progress on +creating an image-based taxonomic classifier will also further the ultimate +goal of all BIOSCAN research: to lay the foundation for a comprehensive survey +of global biodiversity. This paper introduces the dataset and explores the +classification task through the implementation and analysis of a baseline +classifier. + +
+
+
+
+
+ + ♻ ☆ AtMan: Understanding Transformer Predictions Through Memory Efficient + Attention Manipulation + + +
+ Generative transformer models have become increasingly complex, with large +numbers of parameters and the ability to process multiple input modalities. +Current methods for explaining their predictions are resource-intensive. Most +crucially, they require prohibitively large amounts of extra memory, since they +rely on backpropagation which allocates almost twice as much GPU memory as the +forward pass. This makes it difficult, if not impossible, to use them in +production. We present AtMan that provides explanations of generative +transformer models at almost no extra cost. Specifically, AtMan is a +modality-agnostic perturbation method that manipulates the attention mechanisms +of transformers to produce relevance maps for the input with respect to the +output prediction. Instead of using backpropagation, AtMan applies a +parallelizable token-based search method based on cosine similarity +neighborhood in the embedding space. Our exhaustive experiments on text and +image-text benchmarks demonstrate that AtMan outperforms current +state-of-the-art gradient-based methods on several metrics while being +computationally efficient. As such, AtMan is suitable for use in large model +inference deployments. + +
+
+
+
+
+ + ♻ ☆ ACES: Generating Diverse Programming Puzzles with Autotelic Language + Models and Semantic Descriptors + + +
+ Finding and selecting new and interesting problems to solve is at the heart +of curiosity, science and innovation. We here study automated problem +generation in the context of the open-ended space of python programming +puzzles. Existing generative models often aim at modeling a reference +distribution without any explicit diversity optimization. Other methods +explicitly optimizing for diversity do so either in limited hand-coded +representation spaces or in uninterpretable learned embedding spaces that may +not align with human perceptions of interesting variations. With ACES +(Autotelic Code Exploration via Semantic descriptors), we introduce a new +autotelic generation method that leverages semantic descriptors produced by a +large language model (LLM) to directly optimize for interesting diversity, as +well as few-shot-based generation. Each puzzle is labeled along 10 dimensions, +each capturing a programming skill required to solve it. ACES generates and +pursues novel and feasible goals to explore that abstract semantic space, +slowly discovering a diversity of solvable programming puzzles in any given +run. Across a set of experiments, we show that ACES discovers a richer +diversity of puzzles than existing diversity-maximizing algorithms as measured +across a range of diversity metrics. We further study whether and in which +conditions this diversity can translate into the successful training of puzzle +solving models. + +
+
+
+
+
+ + ♻ ☆ DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch + Diffusion in Histopathology + + +
+ We present DiffInfinite, a hierarchical diffusion model that generates +arbitrarily large histological images while preserving long-range correlation +structural information. Our approach first generates synthetic segmentation +masks, subsequently used as conditions for the high-fidelity generative +diffusion process. The proposed sampling method can be scaled up to any desired +image size while only requiring small patches for fast training. Moreover, it +can be parallelized more efficiently than previous large-content generation +methods while avoiding tiling artifacts. The training leverages classifier-free +guidance to augment a small, sparsely annotated dataset with unlabelled data. +Our method alleviates unique challenges in histopathological imaging practice: +large-scale information, costly manual annotation, and protective data +handling. The biological plausibility of DiffInfinite data is evaluated in a +survey by ten experienced pathologists as well as a downstream classification +and segmentation task. Samples from the model score strongly on anti-copying +metrics which is relevant for the protection of patient data. + +
+
+
+
+
+ + ♻ ☆ CL-MAE: Curriculum-Learned Masked Autoencoders WACV 2024 + + +
+ Masked image modeling has been demonstrated as a powerful pretext task for +generating robust representations that can be effectively generalized across +multiple downstream tasks. Typically, this approach involves randomly masking +patches (tokens) in input images, with the masking strategy remaining unchanged +during training. In this paper, we propose a curriculum learning approach that +updates the masking strategy to continually increase the complexity of the +self-supervised reconstruction task. We conjecture that, by gradually +increasing the task complexity, the model can learn more sophisticated and +transferable representations. To facilitate this, we introduce a novel +learnable masking module that possesses the capability to generate masks of +different complexities, and integrate the proposed module into masked +autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting +its behavior during training, transitioning from a partner to the MAE +(optimizing the same reconstruction loss) to an adversary (optimizing the +opposite loss), while passing through a neutral state. The transition between +these behaviors is smooth, being regulated by a factor that is multiplied with +the reconstruction loss of the masking module. The resulting training procedure +generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked +Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior +representation learning capabilities compared to MAE. The empirical results on +five downstream tasks confirm our conjecture, demonstrating that curriculum +learning can be successfully used to self-supervise masked autoencoders. We +release our code at https://github.com/ristea/cl-mae. + +
+
+ comment: Accepted at WACV 2024 +
+
+
+
+
+ + ♻ ☆ Data-Driven Network Neuroscience: On Data Collection and Benchmark + + +
+ This paper presents a comprehensive and quality collection of functional +human brain network data for potential research in the intersection of +neuroscience, machine learning, and graph analytics. Anatomical and functional +MRI images have been used to understand the functional connectivity of the +human brain and are particularly important in identifying underlying +neurodegenerative conditions such as Alzheimer's, Parkinson's, and Autism. +Recently, the study of the brain in the form of brain networks using machine +learning and graph analytics has become increasingly popular, especially to +predict the early onset of these conditions. A brain network, represented as a +graph, retains rich structural and positional information that traditional +examination methods are unable to capture. However, the lack of publicly +accessible brain network data prevents researchers from data-driven +explorations. One of the main difficulties lies in the complicated +domain-specific preprocessing steps and the exhaustive computation required to +convert the data from MRI images into brain networks. We bridge this gap by +collecting a large amount of MRI images from public databases and a private +source, working with domain experts to make sensible design choices, and +preprocessing the MRI images to produce a collection of brain network datasets. +The datasets originate from 6 different sources, cover 4 brain conditions, and +consist of a total of 2,702 subjects. We test our graph datasets on 12 machine +learning models to provide baselines and validate the data quality on a recent +graph analysis model. To lower the barrier to entry and promote the research in +this interdisciplinary field, we release our brain network data and complete +preprocessing details including codes at +https://doi.org/10.17608/k6.auckland.21397377 and +https://github.com/brainnetuoa/data_driven_network_neuroscience. + +
+
+
+
+
+ + ♻ ☆ OceanGPT: A Large Language Model for Ocean Science Tasks + + +
+ Ocean science, which delves into the oceans that are reservoirs of life and +biodiversity, is of great significance given that oceans cover over 70% of our +planet's surface. Recently, advances in Large Language Models (LLMs) have +transformed the paradigm in science. Despite the success in other domains, +current LLMs often fall short in catering to the needs of domain experts like +oceanographers, and the potential of LLMs for ocean science is under-explored. +The intrinsic reason may be the immense and intricate nature of ocean data as +well as the necessity for higher granularity and richness in knowledge. To +alleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean +domain, which is expert in various ocean science tasks. We propose DoInstruct, +a novel framework to automatically obtain a large volume of ocean domain +instruction data, which generates instructions based on multi-agent +collaboration. Additionally, we construct the first oceanography benchmark, +OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through +comprehensive experiments, OceanGPT not only shows a higher level of knowledge +expertise for ocean science tasks but also gains preliminary embodied +intelligence capabilities in ocean technology. Codes, data and checkpoints will +soon be available at https://github.com/zjunlp/KnowLM. + +
+
+ comment: Work in progress. Project Website: + https://zjunlp.github.io/project/OceanGPT/ +
+
+
+
+
+ + ♻ ☆ Multiplication-Free Transformer Training via Piecewise Affine Operations NeurIPS 2023 + + +
+ Multiplications are responsible for most of the computational cost involved +in neural network training and inference. Recent research has thus looked for +ways to reduce the cost associated with them. Inspired by Mogami (2020), we +replace multiplication with a cheap piecewise affine approximation that is +achieved by adding the bit representation of the floating point numbers +together as integers. We show that transformers can be trained with the +resulting modified matrix multiplications on both vision and language tasks +with little to no performance impact, and without changes to the training +hyperparameters. We further replace all non-linearities in the networks making +them fully and jointly piecewise affine in both inputs and weights. Finally, we +show that we can eliminate all multiplications in the entire training process, +including operations in the forward pass, backward pass and optimizer update, +demonstrating the first successful training of modern neural network +architectures in a fully multiplication-free fashion. + +
+
+ comment: Accepted to the 37th Conference on Neural Information Processing + Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Evaluating Robustness and Uncertainty of Graph Models Under Structural + Distributional Shifts + + +
+ In reliable decision-making systems based on machine learning, models have to +be robust to distributional shifts or provide the uncertainty of their +predictions. In node-level problems of graph learning, distributional shifts +can be especially complex since the samples are interdependent. To evaluate the +performance of graph models, it is important to test them on diverse and +meaningful distributional shifts. However, most graph benchmarks considering +distributional shifts for node-level problems focus mainly on node features, +while structural properties are also essential for graph problems. In this +work, we propose a general approach for inducing diverse distributional shifts +based on graph structure. We use this approach to create data splits according +to several structural node properties: popularity, locality, and density. In +our experiments, we thoroughly evaluate the proposed distributional shifts and +show that they can be quite challenging for existing graph models. We also +reveal that simple models often outperform more sophisticated methods on the +considered structural shifts. Finally, our experiments provide evidence that +there is a trade-off between the quality of learned representations for the +base classification task under structural distributional shift and the ability +to separate the nodes from different distributions using these representations. + +
+
+
+
+
+ + ♻ ☆ Unsupervised Episode Generation for Graph Meta-learning + + +
+ We investigate Unsupervised Episode Generation methods to solve Few-Shot +Node-Classification (FSNC) task via Meta-learning without labels. Dominant +meta-learning methodologies for FSNC were developed under the existence of +abundant labeled nodes from diverse base classes for training, which however +may not be possible to obtain in the real-world. Although a few studies tried +to tackle the label-scarcity problem in graph meta-learning, they still rely on +a few labeled nodes, which hinders the full utilization of the information of +all nodes in a graph. Despite the effectiveness of graph contrastive learning +(GCL) methods in the FSNC task without using the label information, they mainly +learn generic node embeddings without consideration of the downstream task to +be solved, which may limit its performance in the FSNC task. To this end, we +propose a simple yet effective unsupervised episode generation method to +benefit from the generalization ability of meta-learning for the FSNC task, +while resolving the label-scarcity problem. Our proposed method, called +Neighbors as Queries (NaQ), generates training episodes based on pre-calculated +node-node similarity. Moreover, NaQ is model-agnostic; hence, it can be used to +train any existing supervised graph meta-learning methods in an unsupervised +manner, while not sacrificing much of their performance or sometimes even +improving them. Extensive experimental results demonstrate the potential of our +unsupervised episode generation methods for graph meta-learning towards the +FSNC task. Our code is available at: https://github.com/JhngJng/NaQ-PyTorch + +
+
+ comment: 12 pages, 12 figures, Preprint version +
+
+
+
+
+ + ♻ ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with the +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong continual learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ♻ ☆ TAPS: Connecting Certified and Adversarial Training + + +
+ Training certifiably robust neural networks remains a notoriously hard +problem. On one side, adversarial training optimizes under-approximations of +the worst-case loss, which leads to insufficient regularization for +certification, while on the other, sound certified training methods optimize +loose over-approximations, leading to over-regularization and poor (standard) +accuracy. In this work we propose TAPS, an (unsound) certified training method +that combines IBP and PGD training to yield precise, although not necessarily +sound, worst-case loss approximations, reducing over-regularization and +increasing certified and standard accuracies. Empirically, TAPS achieves a new +state-of-the-art in many settings, e.g., reaching a certified accuracy of +$22\%$ on TinyImageNet for $\ell_\infty$-perturbations with radius +$\epsilon=1/255$. We make our implementation and networks public at +https://github.com/eth-sri/taps. + +
+
+ comment: NeurIPS'23 +
+
+
+
+
+ + ♻ ☆ Learning Unseen Modality Interaction NeurIPS 2023 + + +
+ Multimodal learning assumes all modality combinations of interest are +available during training to learn cross-modal correspondences. In this paper, +we challenge this modality-complete assumption for multimodal learning and +instead strive for generalization to unseen modality combinations during +inference. We pose the problem of unseen modality interaction and introduce a +first solution. It exploits a module that projects the multidimensional +features of different modalities into a common space with rich information +preserved. This allows the information to be accumulated with a simple +summation operation across available modalities. To reduce overfitting to less +discriminative modality combinations during training, we further improve the +model learning with pseudo-supervision indicating the reliability of a +modality's prediction. We demonstrate that our approach is effective for +diverse tasks and modalities by evaluating it for multimodal video +classification, robot state regression, and multimedia retrieval. Project +website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. + +
+
+ comment: Published at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Universal adversarial perturbations for multiple classification tasks + with quantum classifiers + + +
+ Quantum adversarial machine learning is an emerging field that studies the +vulnerability of quantum learning systems against adversarial perturbations and +develops possible defense strategies. Quantum universal adversarial +perturbations are small perturbations, which can make different input samples +into adversarial examples that may deceive a given quantum classifier. This is +a field that was rarely looked into but worthwhile investigating because +universal perturbations might simplify malicious attacks to a large extent, +causing unexpected devastation to quantum machine learning models. In this +paper, we take a step forward and explore the quantum universal perturbations +in the context of heterogeneous classification tasks. In particular, we find +that quantum classifiers that achieve almost state-of-the-art accuracy on two +different classification tasks can be both conclusively deceived by one +carefully-crafted universal perturbation. This result is explicitly +demonstrated with well-designed quantum continual learning models with elastic +weight consolidation method to avoid catastrophic forgetting, as well as +real-life heterogeneous datasets from hand-written digits and medical MRI +images. Our results provide a simple and efficient way to generate universal +perturbations on heterogeneous classification tasks and thus would provide +valuable guidance for future quantum learning technologies. + +
+
+
+
+
+ + ♻ ☆ Deep Nonparametric Estimation of Intrinsic Data Structures by Chart + Autoencoders: Generalization Error and Robustness + + +
+ Autoencoders have demonstrated remarkable success in learning low-dimensional +latent features of high-dimensional data across various applications. Assuming +that data are sampled near a low-dimensional manifold, we employ chart +autoencoders, which encode data into low-dimensional latent features on a +collection of charts, preserving the topology and geometry of the data +manifold. Our paper establishes statistical guarantees on the generalization +error of chart autoencoders, and we demonstrate their denoising capabilities by +considering $n$ noisy training samples, along with their noise-free +counterparts, on a $d$-dimensional manifold. By training autoencoders, we show +that chart autoencoders can effectively denoise the input data with normal +noise. We prove that, under proper network architectures, chart autoencoders +achieve a squared generalization error in the order of $\displaystyle +n^{-\frac{2}{d+2}}\log^4 n$, which depends on the intrinsic dimension of the +manifold and only weakly depends on the ambient dimension and noise level. We +further extend our theory on data with noise containing both normal and +tangential components, where chart autoencoders still exhibit a denoising +effect for the normal component. As a special case, our theory also applies to +classical autoencoders, as long as the data manifold has a global +parametrization. Our results provide a solid theoretical foundation for the +effectiveness of autoencoders, which is further validated through several +numerical experiments. + +
+
+
+
+
+ + ♻ ☆ Guide Your Agent with Adaptive Multimodal Rewards NeurIPS 2023 + + +
+ Developing an agent capable of adapting to unseen environments remains a +difficult challenge in imitation learning. This work presents Adaptive +Return-conditioned Policy (ARP), an efficient framework designed to enhance the +agent's generalization ability using natural language task descriptions and +pre-trained multimodal encoders. Our key idea is to calculate a similarity +between visual observations and natural language instructions in the +pre-trained multimodal embedding space (such as CLIP) and use it as a reward +signal. We then train a return-conditioned policy using expert demonstrations +labeled with multimodal rewards. Because the multimodal rewards provide +adaptive signals at each timestep, our ARP effectively mitigates the goal +misgeneralization. This results in superior generalization performances even +when faced with unseen text instructions, compared to existing text-conditioned +policies. To improve the quality of rewards, we also introduce a fine-tuning +method for pre-trained multimodal encoders, further enhancing the performance. +Video demonstrations and source code are available on the project website: +https://sites.google.com/view/2023arp. + +
+
+ comment: Accepted to NeurIPS 2023. Project webpage: + https://sites.google.com/view/2023arp +
+
+
+
+
+ + ♻ ☆ Bolstering Stochastic Gradient Descent with Model Building + + +
+ Stochastic gradient descent method and its variants constitute the core +optimization algorithms that achieve good convergence rates for solving machine +learning problems. These rates are obtained especially when these algorithms +are fine-tuned for the application at hand. Although this tuning process can +require large computational costs, recent work has shown that these costs can +be reduced by line search methods that iteratively adjust the step length. We +propose an alternative approach to stochastic line search by using a new +algorithm based on forward step model building. This model building step +incorporates second-order information that allows adjusting not only the step +length but also the search direction. Noting that deep learning model +parameters come in groups (layers of tensors), our method builds its model and +calculates a new step for each parameter group. This novel diagonalization +approach makes the selected step lengths adaptive. We provide convergence rate +analysis, and experimentally show that the proposed algorithm achieves faster +convergence and better generalization in well-known test problems. More +precisely, SMB requires less tuning, and shows comparable performance to other +adaptive methods. + +
+
+
+
+
+ + ♻ ☆ On Momentum-Based Gradient Methods for Bilevel Optimization with + Nonconvex Lower-Level + + +
+ Bilevel optimization is a popular two-level hierarchical optimization, which +has been widely applied to many machine learning tasks such as hyperparameter +learning, meta learning and continual learning. Although many bilevel +optimization methods recently have been developed, the bilevel methods are not +well studied when the lower-level problem is nonconvex. To fill this gap, in +the paper, we study a class of nonconvex bilevel optimization problems, where +both upper-level and lower-level problems are nonconvex, and the lower-level +problem satisfies Polyak-Łojasiewicz (PL) condition. We propose an efficient +momentum-based gradient bilevel method (MGBiO) to solve these deterministic +problems. Meanwhile, we propose a class of efficient momentum-based stochastic +gradient bilevel methods (MSGBiO and VR-MSGBiO) to solve these stochastic +problems. Moreover, we provide a useful convergence analysis framework for our +methods. Specifically, under some mild conditions, we prove that our MGBiO +method has a sample (or gradient) complexity of $O(\epsilon^{-2})$ for finding +an $\epsilon$-stationary solution of the deterministic bilevel problems (i.e., +$\|\nabla F(x)\|\leq \epsilon$), which improves the existing best results by a +factor of $O(\epsilon^{-1})$. Meanwhile, we prove that our MSGBiO and VR-MSGBiO +methods have sample complexities of $\tilde{O}(\epsilon^{-4})$ and +$\tilde{O}(\epsilon^{-3})$, respectively, in finding an $\epsilon$-stationary +solution of the stochastic bilevel problems (i.e., $\mathbb{E}\|\nabla +F(x)\|\leq \epsilon$), which improves the existing best results by a factor of +$\tilde{O}(\epsilon^{-3})$. Extensive experimental results on bilevel PL game +and hyper-representation learning demonstrate the efficiency of our algorithms. +This paper commemorates the mathematician Boris Polyak (1935-2023). + +
+
+ comment: In new version of our paper, we relaxed some assumptions, updated our + algorithms and added some numerical experiments +
+
+
+
+
+ + ♻ ☆ A Neurocomputational Account of Consciousness: The Goal-Aligning + Representation Internal Manipulation Theory (GARIM) + + +
+ Consciousness, a central element of human cognition, has been studied with +multiple scientific approaches spanning neuroscience, psychology, artificial +intelligence and robotics. Unfortunately, poor integration between these fields +limits a full and clear understanding of consciousness. Here we contribute to +improving this integration by proposing, within a neurocomputational framework, +the `Goal-Aligning Representations Internal Manipulation' (GARIM) theory of +consciousness. The central idea of the GARIM theory is that consciousness +supports the active manipulation of goal-relevant internal representations +(e.g., world states, objects, and action sequences), making them more aligned +with the goals pursued. These manipulations allow the conscious agent to +internally produce the knowledge it lacks to cope with novel conditions and +goals, increasing the flexibility of goal-directed behaviour. The manipulation +of representations is supported by four neuro-functional macro-systems +(hierarchical perceptual working memories, abstract working memory, internal +manipulator, motivational systems) that operate through a set of computational +manipulation operations (abstraction, specification, decomposition, +composition). The theory also presents the concept of `GARIM agency', proposing +that subjective conscious experience derives from the ability of agents to +generate and control a vivid internally simulated reality. Furthermore, the +theory highlights the criticalities of the experimental investigation of +consciousness, suggesting a new approach to testing consciousness in biological +and artificial agents. Finally, the GARIM theory can benefit technological +fields such as machine learning and autonomous robotics (e.g., the manipulation +processes proposed by the theory could be linked to the operations performed by +systems based on transformers). + +
+
+
+
+
+ + ♻ ☆ Making AI Less "Thirsty": Uncovering and Addressing the Secret Water + Footprint of AI Models + + +
+ The growing carbon footprint of artificial intelligence (AI) models, +especially large ones such as GPT-3, has been undergoing public scrutiny. +Unfortunately, however, the equally important and enormous water (withdrawal +and consumption) footprint of AI models has remained under the radar. For +example, training GPT-3 in Microsoft's state-of-the-art U.S. data centers can +directly evaporate 700,000 liters of clean freshwater, but such information has +been kept a secret. More critically, the global AI demand might be accountable +for 4.2 -- 6.6 billion cubic meters of water withdrawal in 2027, which is more +than the total annual water withdrawal of 4 -- 6 Denmark or half of the United +Kingdom. This is very concerning, as freshwater scarcity has become one of the +most pressing challenges shared by all of us in the wake of the rapidly growing +population, depleting water resources, and aging water infrastructures. To +respond to the global water challenges, AI models can, and also must, take +social responsibility and lead by example by addressing their own water +footprint. In this paper, we provide a principled methodology to estimate the +water footprint of AI models, and also discuss the unique spatial-temporal +diversities of AI models' runtime water efficiency. Finally, we highlight the +necessity of holistically addressing water footprint along with carbon +footprint to enable truly sustainable AI. + +
+
+ comment: New updates include discussion on water withdrawal and water + consumption, scope definition for water, and new estimates of GPT-3's water + footprint based on Microsoft's new WUE and PUE data. Source codes available + at: https://github.com/Ren-Research/Making-AI-Less-Thirsty +
+
+
+
+
+ + ♻ ☆ RePo: Resilient Model-Based Reinforcement Learning by Regularizing + Posterior Predictability + + +
+ Visual model-based RL methods typically encode image observations into +low-dimensional representations in a manner that does not eliminate redundant +information. This leaves them susceptible to spurious variations -- changes in +task-irrelevant components such as background distractors or lighting +conditions. In this paper, we propose a visual model-based RL method that +learns a latent representation resilient to such spurious variations. Our +training objective encourages the representation to be maximally predictive of +dynamics and reward, while constraining the information flow from the +observation to the latent representation. We demonstrate that this objective +significantly bolsters the resilience of visual model-based RL methods to +visual distractors, allowing them to operate in dynamic environments. We then +show that while the learned encoder is resilient to spurious variations, it is +not invariant under significant distribution shift. To address this, we propose +a simple reward-free alignment procedure that enables test time adaptation of +the encoder. This allows for quick adaptation to widely differing environments +without having to relearn the dynamics and policy. Our effort is a step towards +making model-based RL a practical and useful tool for dynamic, diverse domains. +We show its effectiveness in simulation benchmarks with significant spurious +variations as well as a real-world egocentric navigation task with noisy TVs in +the background. Videos and code at https://zchuning.github.io/repo-website/. + +
+
+
+
+
+ + ♻ ☆ AgentBench: Evaluating LLMs as Agents + + +
+ Large Language Models (LLMs) are becoming increasingly smart and autonomous, +targeting real-world pragmatic missions beyond traditional NLP tasks. As a +result, there has been an urgent need to evaluate LLMs as agents on challenging +tasks in interactive environments. We present AgentBench, a multi-dimensional +evolving benchmark that currently consists of 8 distinct environments to assess +LLM-as-Agent's reasoning and decision-making abilities in a multi-turn +open-ended generation setting. Our extensive test over 27 API-based and +open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong +ability of acting as agents in complex environments, there is a significant +disparity in performance between them and OSS competitors. We identify the +typical reasons of failures in environments and LLMs, showing that poor +long-term reasoning, decision-making, and instruction following abilities are +the main obstacles for developing usable LLM agents. Training on code and high +quality multi-turn alignment data could improve agent performance. Datasets, +environments, and an integrated evaluation package for AgentBench are released +at https://github.com/THUDM/AgentBench. + +
+
+ comment: 55 pages +
+
+
+
+
+ + ♻ ☆ Impartial Games: A Challenge for Reinforcement Learning + + +
+ AlphaZero-style reinforcement learning (RL) algorithms excel in various board +games but face challenges with impartial games, where players share pieces. We +present a concrete example of a game - namely the children's game of nim - and +other impartial games that seem to be a stumbling block for AlphaZero-style and +similar reinforcement learning algorithms. + Our findings are consistent with recent studies showing that AlphaZero-style +algorithms are vulnerable to adversarial attacks and adversarial perturbations, +showing the difficulty of learning to master the games in all legal states. + We show that nim can be learned on small boards, but learning with +AlphaZero-style algorithms slows down dramatically when the board size increases. +Intuitively, the difference between impartial games like nim and partisan games +like Chess and Go can be explained by the fact that if a tiny amount of noise +is added to the system (e.g. if a small part of the board is covered), for +impartial games, it is typically not possible to predict whether the position +is good or bad (won or lost). There is often zero correlation between the +visible part of a partly blanked-out position and its correct evaluation. This +situation contrasts starkly with partisan games where a partly blanked-out +configuration typically provides abundant or at least non-trivial information +about the value of the fully uncovered position. + +
+
+
+
+
+ + ♻ ☆ Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via + Mixed-Effect Models and Hierarchical Clustering + + +
+ Maize is a major crop providing vital calories in sub-Saharan Africa, Asia +and Latin America, with a global cultivation area of 197 million hectares in +2021. Therefore, many statistical models (such as mixed-effect and random +coefficients models) and machine learning models (such as random forests and +deep learning architectures) have been developed to predict maize yield and how +it is affected by genotype, environment and genotype-environment interaction +factors, including field management. However, these models do not fully +leverage the network of causal relationships between these factors and the +hierarchical structure of the agronomic data arising from data collection. + Bayesian networks (BNs) provide a powerful framework for modelling causal and +probabilistic relationships using directed acyclic graphs to illustrate the +connections between variables. This study introduces a novel approach that +integrates random effects into BN learning. Rooted in the linear mixed-effects +models framework, it is particularly well-suited to hierarchical data. Results +from a real-world agronomic trial suggest that the proposed approach enhances +BN learning, leading to a more interpretable model and discovering new causal +connections. At the same time, the error rate of maize yield prediction is +reduced from 28% to 17%. Therefore, we argue that BNs should be the tool of +choice to construct practical decision support tools for hierarchical agronomic +data that allow for causal inference. + +
+
+ comment: 34 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ Dynamic Decision Frequency with Continuous Options IROS + + +
+ In classic reinforcement learning algorithms, agents make decisions at +discrete and fixed time intervals. The duration between decisions becomes a +crucial hyperparameter, as setting it too short may increase the problem's +difficulty by requiring the agent to make numerous decisions to achieve its +goal while setting it too long can result in the agent losing control over the +system. However, physical systems do not necessarily require a constant control +frequency, and for learning agents, it is often preferable to operate with a +low frequency when possible and a high frequency when necessary. We propose a +framework called Continuous-Time Continuous-Options (CTCO), where the agent +chooses options as sub-policies of variable durations. These options are +time-continuous and can interact with the system at any desired frequency +providing a smooth change of actions. We demonstrate the effectiveness of CTCO +by comparing its performance to classical RL and temporal-abstraction RL +methods on simulated continuous control tasks with various action-cycle times. +We show that our algorithm's performance is not affected by the choice of +environment interaction frequency. Furthermore, we demonstrate the efficacy of +CTCO in facilitating exploration in a real-world visual reaching task for a 7 +DOF robotic arm with sparse rewards. + +
+
+ comment: Appears in the Proceedings of the 2023 International Conference on + Intelligent Robots and Systems (IROS). Source code at + https://github.com/amir-karimi96/continuous-time-continuous-option-policy-gradient.git +
+
+
+
+
+ + ♻ ☆ MGAS: Multi-Granularity Architecture Search for Effective and Efficient + Neural Networks + + +
+ Differentiable architecture search (DAS) revolutionizes neural architecture +search (NAS) with time-efficient automation, transitioning from discrete +candidate sampling and evaluation to differentiable super-net optimization and +discretization. However, existing DAS methods either only conduct +coarse-grained operation-level search or manually define the remaining ratios +for fine-grained kernel-level and weight-level units, which fail to +simultaneously optimize model size and model performance. Furthermore, these +methods compromise search quality to reduce memory consumption. To tackle these +issues, we introduce multi-granularity architecture search (MGAS), a unified +framework which aims to comprehensively and memory-efficiently explore the +multi-granularity search space to discover both effective and efficient neural +networks. Specifically, we learn discretization functions specific to each +granularity level to adaptively determine the remaining ratios according to the +evolving architecture. This ensures an optimal balance among units of different +granularity levels for different target model sizes. Considering the memory +demands, we break down the super-net optimization and discretization into +multiple sub-net stages. Nevertheless, the greedy nature of this approach may +introduce bias in the early stages. To compensate for the bias, we propose +progressive re-evaluation to allow for re-pruning and regrowing of previous +units during subsequent stages. Extensive experiments on CIFAR-10, CIFAR-100 +and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods +in achieving a better trade-off between model performance and model size. + +
+
+
+
+
+ + ♻ ☆ Data Pruning via Moving-one-Sample-out NeurIPS 2023 + + +
+ In this paper, we propose a novel data-pruning approach called +moving-one-sample-out (MoSo), which aims to identify and remove the least +informative samples from the training set. The core insight behind MoSo is to +determine the importance of each sample by assessing its impact on the optimal +empirical risk. This is achieved by measuring the extent to which the empirical +risk changes when a particular sample is excluded from the training set. +Instead of using the computationally expensive leaving-one-out-retraining +procedure, we propose an efficient first-order approximator that only requires +gradient information from different training stages. The key idea behind our +approximation is that samples with gradients that are consistently aligned with +the average gradient of the training set are more informative and should +receive higher scores, which could be intuitively understood as follows: if the +gradient from a specific sample is consistent with the average gradient vector, +it implies that optimizing the network using the sample will yield a similar +effect on all remaining samples. Experimental results demonstrate that MoSo +effectively mitigates severe performance degradation at high pruning ratios and +achieves satisfactory performance across various settings. + +
+
+ comment: Accepted by the Thirty-seventh Conference on Neural Information + Processing Systems (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ GPT Understands, Too + + +
+ Prompting a pretrained language model with natural language patterns has been +proved effective for natural language understanding (NLU). However, our +preliminary study reveals that manual discrete prompts often lead to unstable +performance -- e.g., changing a single word in the prompt might result in +substantial performance drop. We propose a novel method P-Tuning that employs +trainable continuous prompt embeddings in concatenation with discrete prompts. +Empirically, P-Tuning not only stabilizes training by minimizing the gap +between various discrete prompts, but also improves performance by a sizeable +margin on a wide range of NLU tasks including LAMA and SuperGLUE. P-Tuning is +generally effective for both frozen and tuned language models, under both the +fully-supervised and few-shot settings. + +
+
+
+
+
+
+
+
+ + Multimedia 7 + +
+
+
+ + ☆ Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity + and Relation Extraction + + +
+ How can we better extract entities and relations from text? Using multimodal +extraction with images and text obtains more signals for entities and +relations, and aligns them through graphs or hierarchical fusion, aiding in +extraction. Despite attempts at various fusions, previous works have overlooked +many unlabeled image-caption pairs, such as NewsCLIPing. This paper proposes +innovative pre-training objectives for entity-object and relation-image +alignment, extracting objects from images and aligning them with entity and +relation prompts for soft pseudo-labels. These labels are used as +self-supervised signals for pre-training, enhancing the ability to extract +entities and relations. Experiments on three datasets show an average 3.41% F1 +improvement over prior SOTA. Additionally, our method is orthogonal to previous +multimodal fusions, and using it on prior SOTA fusions further improves 5.47% +F1. + +
+
+ comment: Accepted to ACM Multimedia 2023 +
+
+
+
+
+ + ☆ Adapt Anything: Tailor Any Image Classifiers across Domains And + Categories Using Text-to-Image Diffusion Models + + +
+ We do not pursue a novel method in this paper, but aim to study if a modern +text-to-image diffusion model can tailor any task-adaptive image classifier +across domains and categories. Existing domain adaptive image classification +works exploit both source and target data for domain alignment so as to +transfer the knowledge learned from the labeled source data to the unlabeled +target data. However, as the development of the text-to-image diffusion model, +we wonder if the high-fidelity synthetic data from the text-to-image generator +can serve as a surrogate of the source data in real world. In this way, we do +not need to collect and annotate the source data for each domain adaptation +task in a one-for-one manner. Instead, we utilize only one off-the-shelf +text-to-image model to synthesize images with category labels derived from the +corresponding text prompts, and then leverage the surrogate data as a bridge to +transfer the knowledge embedded in the task-agnostic text-to-image generator to +the task-oriented image classifier via domain adaptation. Such a one-for-all +adaptation paradigm allows us to adapt anything in the world using only one +text-to-image generator as well as the corresponding unlabeled target data. +Extensive experiments validate the feasibility of the proposed idea, which even +surpasses the state-of-the-art domain adaptation works using the source data +collected and annotated in real world. + +
+
+ comment: 11 pages, 6 figures +
+
+
+
+
+ + ☆ AccoMontage-3: Full-Band Accompaniment Arrangement via Sequential Style + Transfer and Multi-Track Function Prior + + +
+ We propose AccoMontage-3, a symbolic music automation system capable of +generating multi-track, full-band accompaniment based on the input of a lead +melody with chords (i.e., a lead sheet). The system contains three modular +components, each modelling a vital aspect of full-band composition. The first +component is a piano arranger that generates piano accompaniment for the lead +sheet by transferring texture styles to the chords using latent chord-texture +disentanglement and heuristic retrieval of texture donors. The second component +orchestrates the piano accompaniment score into full-band arrangement according +to the orchestration style encoded by individual track functions. The third +component, which connects the previous two, is a prior model characterizing the +global structure of orchestration style over the whole piece of music. From end +to end, the system learns to generate full-band accompaniment in a +self-supervised fashion, applying style transfer at two levels of polyphonic +composition: texture and orchestration. Experiments show that our system +outperforms the baselines significantly, and the modular design offers +effective controls in a musically meaningful way. + +
+
+
+
+
+ + ♻ ☆ Land-cover change detection using paired OpenStreetMap data and optical + high-resolution imagery via object-guided Transformer + + +
+ Optical high-resolution imagery and OpenStreetMap (OSM) data are two +important data sources for land-cover change detection. Previous studies in +these two data sources focus on utilizing the information in OSM data to aid +the change detection on multi-temporal optical high-resolution images. This +paper pioneers the direct detection of land-cover changes utilizing paired OSM +data and optical imagery, thereby broadening the horizons of change detection +tasks to encompass more dynamic earth observations. To this end, we propose an +object-guided Transformer (ObjFormer) architecture by naturally combining the +prevalent object-based image analysis (OBIA) technique with the advanced vision +Transformer architecture. The introduction of OBIA can significantly reduce the +computational overhead and memory burden in the self-attention module. +Specifically, the proposed ObjFormer has a hierarchical pseudo-siamese encoder +consisting of object-guided self-attention modules that extract representative +features of different levels from OSM data and optical images; a decoder +consisting of object-guided cross-attention modules can progressively recover +the land-cover changes from the extracted heterogeneous features. In addition +to the basic supervised binary change detection task, this paper raises a new +semi-supervised semantic change detection task that does not require any +manually annotated land-cover labels of optical images to train semantic change +detectors. Two lightweight semantic decoders are added to ObjFormer to +accomplish this task efficiently. A converse cross-entropy loss is designed to +fully utilize the negative samples, thereby contributing to the great +performance improvement in this task. The first large-scale benchmark dataset +containing 1,287 map-image pairs (1024$\times$ 1024 pixels for each sample) +covering 40 regions on six continents ...(see the manuscript for the full +abstract) + +
+
+
+
+
+ + ♻ ☆ MusicAgent: An AI Agent for Music Understanding and Generation with + Large Language Models + + +
+ AI-empowered music processing is a diverse field that encompasses dozens of +tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension +tasks (e.g., music classification). For developers and amateurs, it is very +difficult to grasp all of these tasks to satisfy their requirements in music +processing, especially considering the huge differences in the representations +of music data and the model applicability across platforms among various tasks. +Consequently, it is necessary to build a system to organize and integrate these +tasks, and thus help practitioners to automatically analyze their demand and +call suitable tools as solutions to fulfill their requirements. Inspired by the +recent success of large language models (LLMs) in task automation, we develop a +system, named MusicAgent, which integrates numerous music-related tools and an +autonomous workflow to address user requirements. More specifically, we build +1) a toolset that collects tools from diverse sources, including Hugging Face, +GitHub, and Web APIs, and 2) an autonomous workflow empowered by LLMs (e.g., +ChatGPT) to organize these tools and automatically decompose user requests into +multiple sub-tasks and invoke corresponding music tools. The primary goal of +this system is to free users from the intricacies of AI-music tools, enabling +them to concentrate on the creative aspect. By granting users the freedom to +effortlessly combine tools, the system offers a seamless and enriching music +experience. + +
+
+
+
+
+ + ♻ ☆ Learning Unseen Modality Interaction NeurIPS 2023 + + +
+ Multimodal learning assumes all modality combinations of interest are +available during training to learn cross-modal correspondences. In this paper, +we challenge this modality-complete assumption for multimodal learning and +instead strive for generalization to unseen modality combinations during +inference. We pose the problem of unseen modality interaction and introduce a +first solution. It exploits a module that projects the multidimensional +features of different modalities into a common space with rich information +preserved. This allows the information to be accumulated with a simple +summation operation across available modalities. To reduce overfitting to less +discriminative modality combinations during training, we further improve the +model learning with pseudo-supervision indicating the reliability of a +modality's prediction. We demonstrate that our approach is effective for +diverse tasks and modalities by evaluating it for multimodal video +classification, robot state regression, and multimedia retrieval. Project +website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. + +
+
+ comment: Published at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution EMNLP 2023 + + +
+ Over recent decades, significant advancements in cross-modal retrieval have been +mainly driven by breakthroughs in visual and linguistic modeling. However, a +recent study shows that multi-modal data representations tend to cluster within +a limited convex cone (the representation degeneration problem), which hinders +retrieval performance due to the inseparability of these representations. In +our study, we first empirically validate the presence of the representation +degeneration problem across multiple cross-modal benchmarks and methods. Next, +to address it, we introduce a novel method, called InvGC, a post-processing +technique inspired by graph convolution and average pooling. Specifically, +InvGC defines the graph topology within the datasets and then applies graph +convolution in a subtractive manner. This method effectively separates +representations by increasing the distances between data points. To improve the +efficiency and effectiveness of InvGC, we propose an advanced graph topology, +LocalAdj, which only aims to increase the distances between each data point and +its nearest neighbors. To understand why InvGC works, we present a detailed +theoretical analysis, proving that the lower bound of recall will be improved +after deploying InvGC. Extensive empirical results show that InvGC and InvGC +w/LocalAdj significantly mitigate the representation degeneration problem, +thereby enhancing retrieval performance. + Our code is available at +https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 184 + +
+
+
+ + ☆ MuSR: Testing the Limits of Chain-of-thought with Multistep Soft + Reasoning + + +
+ While large language models (LLMs) equipped with techniques like +chain-of-thought prompting have demonstrated impressive capabilities, they +still fall short in their ability to reason robustly in complex settings. +However, evaluating LLM reasoning is challenging because system capabilities +continue to grow while benchmark datasets for tasks like logical deduction have +remained static. We introduce MuSR, a dataset for evaluating language models on +multistep soft reasoning tasks specified in a natural language narrative. This +dataset has two crucial features. First, it is created through a novel +neurosymbolic synthetic-to-natural generation algorithm, enabling the +construction of complex reasoning instances that challenge GPT-4 (e.g., murder +mysteries roughly 1000 words in length) and which can be scaled further as more +capable LLMs are released. Second, our dataset instances are free text +narratives corresponding to real-world domains of reasoning; this makes it +simultaneously much more challenging than other synthetically-crafted +benchmarks while remaining realistic and tractable for human annotators to +solve with high accuracy. We evaluate a range of LLMs and prompting techniques +on this dataset and characterize the gaps that remain for techniques like +chain-of-thought to perform robust reasoning. + +
+
+
+
+
+ + ☆ AI Alignment and Social Choice: Fundamental Limitations and Policy + Implications + + +
+ Aligning AI agents to human intentions and values is a key bottleneck in +building safe and deployable AI applications. But whose values should AI agents +be aligned with? Reinforcement learning with human feedback (RLHF) has emerged +as the key framework for AI alignment. RLHF uses feedback from human +reinforcers to fine-tune outputs; all widely deployed large language models +(LLMs) use RLHF to align their outputs to human values. It is critical to +understand the limitations of RLHF and consider policy challenges arising from +these limitations. In this paper, we investigate a specific challenge in +building RLHF systems that respect democratic norms. Building on impossibility +results in social choice theory, we show that, under fairly broad assumptions, +there is no unique voting protocol to universally align AI systems using RLHF +through democratic processes. Further, we show that aligning AI agents with the +values of all individuals will always violate certain private ethical +preferences of an individual user i.e., universal AI alignment using RLHF is +impossible. We discuss policy implications for the governance of AI systems +built using RLHF: first, the need for mandating transparent voting rules to +hold model builders accountable. Second, the need for model builders to focus +on developing AI agents that are narrowly aligned to specific user groups. + +
+
+ comment: 10 pages, no figures +
+
+
+
+
+ + ☆ Woodpecker: Hallucination Correction for Multimodal Large Language + Models + + +
+ Hallucination is a big shadow hanging over the rapidly evolving Multimodal +Large Language Models (MLLMs), referring to the phenomenon that the generated +text is inconsistent with the image content. In order to mitigate +hallucinations, existing studies mainly resort to an instruction-tuning manner +that requires retraining the models with specific data. In this paper, we pave +a different way, introducing a training-free method named Woodpecker. Like a +woodpecker heals trees, it picks out and corrects hallucinations from the +generated text. Concretely, Woodpecker consists of five stages: key concept +extraction, question formulation, visual knowledge validation, visual claim +generation, and hallucination correction. Implemented in a post-remedy manner, +Woodpecker can easily serve different MLLMs, while being interpretable by +accessing intermediate outputs of the five stages. We evaluate Woodpecker both +quantitatively and qualitatively and show the huge potential of this new +paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement +in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released +at https://github.com/BradyFU/Woodpecker. + +
+
+ comment: 16 pages, 7 figures. Code Website: + https://github.com/BradyFU/Woodpecker +
+
+
+
+
+ + ☆ WebWISE: Web Interface Control and Sequential Exploration with Large + Language Models + + +
+ The paper investigates using a Large Language Model (LLM) to automatically +perform web software tasks using click, scroll, and text input operations. +Previous approaches, such as reinforcement learning (RL) or imitation learning, +are inefficient to train and task-specific. Our method uses filtered Document +Object Model (DOM) elements as observations and performs tasks step-by-step, +sequentially generating small programs based on the current observations. We +use in-context learning, either benefiting from a single manually provided +example, or an automatically generated example based on a successful zero-shot +trial. We evaluate the proposed method on the MiniWob++ benchmark. With only +one in-context example, our WebWISE method achieves similar or better +performance than other methods that require many demonstrations or trials. + +
+
+
+
+
+ + ☆ Instruct and Extract: Instruction Tuning for On-Demand Information + Extraction EMNLP 2023 + + +
+ Large language models with instruction-following capabilities open the door +to a wider group of users. However, when it comes to information extraction - a +classic task in natural language processing - most task-specific systems cannot +align well with long-tail ad hoc extraction use cases for non-expert users. To +address this, we propose a novel paradigm, termed On-Demand Information +Extraction, to fulfill the personalized demands of real-world users. Our task +aims to follow the instructions to extract the desired content from the +associated text and present it in a structured tabular format. The table +headers can either be user-specified or inferred contextually by the model. To +facilitate research in this emerging area, we present a benchmark named +InstructIE, inclusive of both automatically generated training data, as well as +the human-annotated test set. Building on InstructIE, we further develop an +On-Demand Information Extractor, ODIE. Comprehensive evaluations on our +benchmark reveal that ODIE substantially outperforms the existing open-source +models of similar size. Our code and dataset are released on +https://github.com/yzjiao/On-Demand-IE. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ What's Left? Concept Grounding with Logic-Enhanced Foundation Models NeurIPS 2023 + + +
+ Recent works such as VisProg and ViperGPT have smartly composed foundation +models for visual reasoning, using large language models (LLMs) to produce +programs that can be executed by pre-trained vision-language models. However, +they operate in limited domains, such as 2D images, not fully exploiting the +generalization of language: abstract concepts like "left" can also be grounded +in 3D, temporal, and action data, as in moving to your left. This limited +generalization stems from these inference-only methods' inability to learn or +adapt pre-trained models to a new domain. We propose the Logic-Enhanced +Foundation Model (LEFT), a unified framework that learns to ground and reason +with concepts across domains with a differentiable, domain-independent, +first-order logic-based program executor. LEFT has an LLM interpreter that +outputs a program represented in a general, logic-based reasoning language, +which is shared across all domains and tasks. LEFT's executor then executes the +program with trainable domain-specific grounding modules. We show that LEFT +flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, +and robotic manipulation. It exhibits strong reasoning ability in a wide +variety of tasks, including those that are complex and not seen during +training, and can be easily applied to new domains. + +
+
+ comment: NeurIPS 2023. First two authors contributed equally. Project page: + https://web.stanford.edu/~joycj/projects/left_neurips_2023 +
+
+
+
+
+ + ☆ Visual Cropping Improves Zero-Shot Question Answering of Multimodal + Large Language Models + + +
+ Multimodal Large Language Models (LLMs) have recently achieved promising +zero-shot accuracy on visual question answering (VQA) -- a fundamental task +affecting various downstream applications and domains. Given the great +potential for the broad use of these models, it is important to investigate +their limitations in dealing with different image and question properties. In +this work, we investigate whether multimodal LLMs can perceive small details as +well as large details in images. In particular, we show that their zero-shot +accuracy in answering visual questions is very sensitive to the size of the +visual subject of the question, declining up to $46\%$ with size. Furthermore, +we show that this effect is causal by observing that human visual cropping can +significantly mitigate their sensitivity to size. Inspired by the usefulness of +human cropping, we then propose three automatic visual cropping methods as +inference time mechanisms to improve the zero-shot performance of multimodal +LLMs. We study their effectiveness on four popular VQA datasets, and a subset +of the VQAv2 dataset tailored towards fine visual details. Our findings suggest +that multimodal LLMs should be used with caution in detail-sensitive VQA +applications, and that visual cropping is a promising direction to improve +their zero-shot performance. Our code and data are publicly available. + +
+
+ comment: 11 pages, 4 figures, 4 tables +
+
+
+
+
+ + ☆ What Algorithms can Transformers Learn? A Study in Length Generalization + + +
+ Large language models exhibit surprising emergent generalization properties, +yet also struggle on many simple reasoning tasks such as arithmetic and parity. +This raises the question of if and when Transformer models can learn the true +algorithm for solving a task. We study the scope of Transformers' abilities in +the specific setting of length generalization on algorithmic tasks. Here, we +propose a unifying framework to understand when and how Transformers can +exhibit strong length generalization on a given task. Specifically, we leverage +RASP (Weiss et al., 2021) -- a programming language designed for the +computational model of a Transformer -- and introduce the RASP-Generalization +Conjecture: Transformers tend to length generalize on a task if the task can be +solved by a short RASP program which works for all input lengths. This simple +conjecture remarkably captures most known instances of length generalization on +algorithmic tasks. Moreover, we leverage our insights to drastically improve +generalization performance on traditionally hard tasks (such as parity and +addition). On the theoretical side, we give a simple example where the +"min-degree-interpolator" model of learning from Abbe et al. (2023) does not +correctly predict Transformers' out-of-distribution behavior, but our +conjecture does. Overall, our work provides a novel perspective on the +mechanisms of compositional generalization and the algorithmic capabilities of +Transformers. + +
+
+ comment: Preprint +
+
+
+
+
+ + ☆ Dissecting In-Context Learning of Translations in GPTs EMNLP + + +
+ Most of the recent work in leveraging Large Language Models (LLMs) such as +GPT-3 for Machine Translation (MT) has focused on selecting the few-shot +samples for prompting. In this work, we try to better understand the role of +demonstration attributes for the in-context learning of translations through +perturbations of high-quality, in-domain demonstrations. We find that +asymmetric perturbation of the source-target mappings yields vastly different +results. We show that the perturbation of the source side has surprisingly +little impact, while target perturbation can drastically reduce translation +quality, suggesting that it is the output text distribution that provides the +most important learning signal during in-context learning of translations. We +propose a method named Zero-Shot-Context to add this signal automatically in +Zero-Shot prompting. We demonstrate that it improves upon the zero-shot +translation performance of GPT-3, even making it competitive with few-shot +prompted translations. + +
+
+ comment: EMNLP Findings (+ Minor Updates over Camera-Ready) +
+
+
+
+
+ + ☆ Accented Speech Recognition With Accent-specific Codebooks EMNLP 2023 + + +
+ Speech accents pose a significant challenge to state-of-the-art automatic +speech recognition (ASR) systems. Degradation in performance across +underrepresented accents is a severe deterrent to the inclusive adoption of +ASR. In this work, we propose a novel accent adaptation approach for end-to-end +ASR systems using cross-attention with a trainable set of codebooks. These +learnable codebooks capture accent-specific information and are integrated +within the ASR encoder layers. The model is trained on accented English speech, +while the test data also contained accents which were not seen during training. +On the Mozilla Common Voice multi-accented dataset, we show that our proposed +approach yields significant performance gains not only on the seen English +accents (up to $37\%$ relative improvement in word error rate) but also on the +unseen accents (up to $5\%$ relative improvement in WER). Further, we +illustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We +also compare the performance with other approaches based on accent adversarial +training. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference (Long Paper) +
+
+
+
+
+ + ☆ Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation + + +
+ Despite the promise of Mixture of Experts (MoE) models in increasing +parameter counts of Transformer models while maintaining training and inference +costs, their application carries notable drawbacks. The key strategy of these +models is to, for each processed token, activate at most a few experts - +subsets of an extensive feed-forward layer. But this approach is not without +its challenges. The operation of matching experts and tokens is discrete, which +makes MoE models prone to issues like training instability and uneven expert +utilization. Existing techniques designed to address these concerns, such as +auxiliary losses or balance-aware matching, result either in lower model +performance or are more difficult to train. In response to these issues, we +propose Mixture of Tokens, a fully-differentiable model that retains the +benefits of MoE architectures while avoiding the aforementioned difficulties. +Rather than routing tokens to experts, this approach mixes tokens from +different examples prior to feeding them to experts, enabling the model to +learn from all token-expert combinations. Importantly, this mixing can be +disabled to avoid mixing of different sequences during inference. Crucially, +this method is fully compatible with both masked and causal Large Language +Model training and inference. + +
+
+
+
+
+ + ☆ NoteChat: A Dataset of Synthetic Doctor-Patient Conversations + Conditioned on Clinical Notes + + +
+ The detailed clinical records drafted by doctors after each patient's visit +are crucial for medical practitioners and researchers. Automating the creation +of these notes with language models can reduce the workload of doctors. +However, training such models can be difficult due to the limited public +availability of conversations between patients and doctors. In this paper, we +introduce NoteChat, a cooperative multi-agent framework leveraging Large +Language Models (LLMs) for generating synthetic doctor-patient conversations +conditioned on clinical notes. NoteChat consists of Planning, Roleplay, and +Polish modules. We provide a comprehensive automatic and human evaluation of +NoteChat, comparing it with state-of-the-art models, including OpenAI's ChatGPT +and GPT-4. Results demonstrate that NoteChat facilitates high-quality synthetic +doctor-patient conversations, underscoring the untapped potential of LLMs in +healthcare. This work represents the first instance of multiple LLMs +cooperating to complete a doctor-patient conversation conditioned on clinical +notes, offering promising avenues for the intersection of AI and healthcare. + +
+
+
+
+
+ + ☆ This is not a Dataset: A Large Negation Benchmark to Challenge Large + Language Models EMNLP 2023 + + +
+ Although large language models (LLMs) have apparently acquired a certain +level of grammatical knowledge and the ability to make generalizations, they +fail to interpret negation, a crucial step in Natural Language Processing. We +try to clarify the reasons for the sub-optimal performance of LLMs in +understanding negation. We introduce a large semi-automatically generated +dataset of circa 400,000 descriptive sentences about commonsense knowledge that +can be true or false, in which negation is present in about 2/3 of the corpus in +different forms. We have used our dataset with the largest available open LLMs +in a zero-shot approach to grasp their generalization and inference capability +and we have also fine-tuned some of the models to assess whether the +understanding of negation can be trained. Our findings show that, while LLMs +are proficient at classifying affirmative sentences, they struggle with +negative sentences and lack a deep understanding of negation, often relying on +superficial cues. Although fine-tuning the models on negative sentences +improves their performance, the lack of generalization in handling negation is +persistent, highlighting the ongoing challenges of LLMs regarding negation +understanding and generalization. The dataset and code are publicly available. + +
+
+ comment: Accepted at the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP 2023) +
+
+
+
+
+ + ☆ E-Sparse: Boosting the Large Language Model Inference through + Entropy-based N:M Sparsity + + +
+ Traditional pruning methods are known to be difficult to apply to Large +Language Models (LLMs) for Generative AI because of their unaffordable training +process and large computational demands. For the first time, we introduce the +information entropy of hidden state features into a pruning metric design, +namely E-Sparse, to improve the accuracy of N:M sparsity on LLMs. E-Sparse +employs the information richness to leverage the channel importance, and +further incorporates several novel techniques to put it into effect: (1) it +introduces information entropy to enhance the significance of parameter weights +and input feature norms as a novel pruning metric, and performs N:M sparsity +without modifying the remaining weights. (2) it designs global naive shuffle +and local block shuffle to quickly optimize the information distribution and +adequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is +implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere +GPUs. Extensive experiments on the LLaMA family and OPT models show that +E-Sparse can significantly speed up the model inference over the dense model +(up to 1.53X) and obtain significant memory saving (up to 43.52%), with +acceptable accuracy loss. + +
+
+
+
+
+ + ☆ Contrastive Learning-based Sentence Encoders Implicitly Weight + Informative Words EMNLP 2023 + + +
+ The performance of sentence encoders can be significantly improved through +the simple practice of fine-tuning using contrastive loss. A natural question +arises: what characteristics do models acquire during contrastive learning? +This paper theoretically and experimentally shows that contrastive-based +sentence encoders implicitly weight words based on information-theoretic +quantities; that is, more informative words receive greater weight, while +others receive less. The theory states that, in the lower bound of the optimal +value of the contrastive learning objective, the norm of word embedding +reflects the information gain associated with the distribution of surrounding +words. We also conduct comprehensive experiments using various models, multiple +datasets, two methods to measure the implicit weighting of models (Integrated +Gradients and SHAP), and two information-theoretic quantities (information gain +and self-information). The results provide empirical evidence that contrastive +fine-tuning emphasizes informative words. + +
+
+ comment: 16 pages, 6 figures, accepted to EMNLP 2023 Findings (short paper) +
+
+
+
+
+ + ☆ In-Context Learning Creates Task Vectors EMNLP 2023 + + +
+ In-context learning (ICL) in Large Language Models (LLMs) has emerged as a +powerful new learning paradigm. However, its underlying mechanism is still not +well understood. In particular, it is challenging to map it to the "standard" +machine learning framework, where one uses a training set $S$ to find a +best-fitting function $f(x)$ in some hypothesis class. Here we make progress on +this problem by showing that the functions learned by ICL often have a very +simple structure: they correspond to the transformer LLM whose only inputs are +the query $x$ and a single "task vector" calculated from the training set. +Thus, ICL can be seen as compressing $S$ into a single task vector +$\boldsymbol{\theta}(S)$ and then using this task vector to modulate the +transformer to produce the output. We support the above claim via comprehensive +experiments across a range of models and tasks. + +
+
+ comment: Accepted at Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Characterizing Mechanisms for Factual Recall in Language Models + + +
+ Language Models (LMs) often must integrate facts they memorized in +pretraining with new information that appears in a given context. These two +sources can disagree, causing competition within the model, and it is unclear +how an LM will resolve the conflict. On a dataset that queries for knowledge of +world capitals, we investigate both distributional and mechanistic determinants +of LM behavior in such situations. Specifically, we measure the proportion of +the time an LM will use a counterfactual prefix (e.g., "The capital of Poland +is London") to overwrite what it learned in pretraining ("Warsaw"). On Pythia +and GPT2, the training frequency of both the query country ("Poland") and the +in-context city ("London") highly affect the models' likelihood of using the +counterfactual. We then use head attribution to identify individual attention +heads that either promote the memorized answer or the in-context answer in the +logits. By scaling up or down the value vector of these heads, we can control +the likelihood of using the in-context answer on new data. This method can +increase the rate of generating the in-context answer to 88\% of the time +simply by scaling a single head at runtime. Our work contributes to a body of +evidence showing that we can often localize model behaviors to specific +components and provides a proof of concept for how future methods might control +model behavior dynamically at runtime. + +
+
+
+
+
+ + ☆ Is Probing All You Need? Indicator Tasks as an Alternative to Probing + Embedding Spaces EMNLP 2023 + + +
+ The ability to identify and control different kinds of linguistic information +encoded in vector representations of words has many use cases, especially for +explainability and bias removal. This is usually done via a set of simple +classification tasks, termed probes, to evaluate the information encoded in the +embedding space. However, the involvement of a trainable classifier leads to +entanglement between the probe's results and the classifier's nature. As a +result, contemporary works on probing include tasks that do not involve +training of auxiliary models. In this work we introduce the term indicator +tasks for non-trainable tasks which are used to query embedding spaces for the +existence of certain properties, and claim that this kind of tasks may point to +a direction opposite to probes, and that this contradiction complicates the +decision on whether a property exists in an embedding space. We demonstrate our +claims with two test cases, one dealing with gender debiasing and another with +the erasure of morphological information from embedding spaces. We show that +the application of a suitable indicator provides a more accurate picture of the +information captured and removed compared to probes. We thus conclude that +indicator tasks should be implemented and taken into consideration when +eliciting information from embedded representations. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Do Stochastic Parrots have Feelings Too? Improving Neural Detection of + Synthetic Text via Emotion Recognition EMNLP 2023 + + +
+ Recent developments in generative AI have shone a spotlight on +high-performance synthetic text generation technologies. The now wide +availability and ease of use of such models highlights the urgent need to +provide equally powerful technologies capable of identifying synthetic text. +With this in mind, we draw inspiration from psychological studies which suggest +that people can be driven by emotion and encode emotion in the text they +compose. We hypothesize that pretrained language models (PLMs) have an +affective deficit because they lack such an emotional driver when generating +text and consequently may generate synthetic text which has affective +incoherence, i.e., it lacks the kind of emotional coherence present in +human-authored text. We subsequently develop an emotionally aware detector by +fine-tuning a PLM on emotion. Experimental results indicate that our +emotionally-aware detector achieves improvements across a range of synthetic +text generators, various sized models, datasets, and domains. Finally, we +compare our emotionally-aware synthetic text detector to ChatGPT in the task of +identification of its own output and show substantial gains, reinforcing the +potential of emotion as a signal to identify synthetic text. Code, models, and +datasets are available at https://github.com/alanagiasi/emoPLMsynth + +
+
+ comment: Accepted to Findings of EMNLP 2023 (long paper). Camera ready version +
+
+
+
+
+ + ☆ BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs + with Multi-turn Health Conversations Polished by ChatGPT + + +
+ Large language models (LLMs) have performed well in providing general and +extensive health suggestions in single-turn conversations, exemplified by +systems such as ChatGPT, ChatGLM, ChatDoctor, and DoctorGLM. However, the +limited information provided by users during a single turn results in inadequate +personalization and targeting of the generated suggestions, which requires +users to independently select the useful part. This is mainly caused by the +missing ability to engage in multi-turn questioning. In real-world medical +consultations, doctors usually employ a series of iterative inquiries to +comprehend the patient's condition thoroughly, enabling them to provide +effective and personalized suggestions subsequently, which can be defined as +chain of questioning (CoQ) for LLMs. To improve the CoQ of LLMs, we propose +BianQue, a ChatGLM-based LLM finetuned with the self-constructed health +conversation dataset BianQueCorpus that consists of multiple turns of +questioning and health suggestions polished by ChatGPT. Experimental results +demonstrate that the proposed BianQue can simultaneously balance the +capabilities of both questioning and health suggestions, which will help +promote the research and application of LLMs in the field of proactive health. + +
+
+
+
+
+ + ☆ Using Artificial French Data to Understand the Emergence of Gender Bias + in Transformer Language Models EMNLP'23 + + +
+ Numerous studies have demonstrated the ability of neural language models to +learn various linguistic properties without direct supervision. This work takes +an initial step towards exploring the less researched topic of how neural +models discover linguistic properties of words, such as gender, as well as the +rules governing their usage. We propose to use an artificial corpus generated +by a PCFG based on French to precisely control the gender distribution in the +training data and determine under which conditions a model correctly captures +gender information or, on the contrary, appears gender-biased. + +
+
+ comment: Accepted at EMNLP'23 +
+
+
+
+
+ + ☆ Self-Guard: Empower the LLM to Safeguard Itself + + +
+ The jailbreak attack can bypass the safety measures of a Large Language Model +(LLM), generating harmful content. This misuse of LLM has led to negative +societal consequences. Currently, there are two main approaches to address +jailbreak attacks: safety training and safeguards. Safety training focuses on +further training LLM to enhance its safety. On the other hand, safeguards +involve implementing external models or filters to prevent harmful outputs. +However, safety training has constraints in its ability to adapt to new attack +types and often leads to a drop in model performance. Safeguards have proven to +be of limited help. To tackle these issues, we propose a novel approach called +Self-Guard, which combines the strengths of both safety methods. Self-Guard +includes two stages. In the first stage, we enhance the model's ability to +assess harmful content, and in the second stage, we instruct the model to +consistently perform harmful content detection on its own responses. The +experiment has demonstrated that Self-Guard is robust against jailbreak +attacks. In the bad case analysis, we find that LLM occasionally provides +harmless responses to harmful queries. Additionally, we evaluated the general +capabilities of the LLM before and after safety training, providing evidence +that Self-Guard does not result in the LLM's performance degradation. In +sensitivity tests, Self-Guard not only avoids inducing over-sensitivity in LLM +but also can even mitigate this issue. + +
+
+
+
+
+ + ☆ A Diffusion Weighted Graph Framework for New Intent Discovery EMNLP 2023 + + +
+ New Intent Discovery (NID) aims to recognize both new and known intents from +unlabeled data with the aid of limited labeled data containing only known +intents. Without considering structure relationships between samples, previous +methods generate noisy supervisory signals which cannot strike a balance +between quantity and quality, hindering the formation of new intent clusters +and effective transfer of the pre-training knowledge. To mitigate this +limitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to +capture both semantic similarities and structure relationships inherent in +data, enabling more sufficient and reliable supervisory signals. Specifically, +for each sample, we diffuse neighborhood relationships along semantic paths +guided by the nearest neighbors for multiple hops to characterize its local +structure discriminately. Then, we sample its positive keys and weigh them +based on semantic similarities and local structures for contrastive learning. +During inference, we further propose Graph Smoothing Filter (GSF) to explicitly +utilize the structure relationships to filter high-frequency noise embodied in +semantically ambiguous samples on the cluster boundary. Extensive experiments +show that our method outperforms state-of-the-art models on all evaluation +metrics across multiple benchmark datasets. Code and data are available at +https://github.com/yibai-shi/DWGF. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ☆ Unnatural language processing: How do language models handle + machine-generated prompts? EMNLP 2023 + + +
+ Language model prompt optimization research has shown that semantically and +grammatically well-formed manually crafted prompts are routinely outperformed +by automatically generated token sequences with no apparent meaning or +syntactic structure, including sequences of vectors from a model's embedding +space. We use machine-generated prompts to probe how models respond to input +that is not composed of natural language expressions. We study the behavior of +models of different sizes in multiple semantic tasks in response to both +continuous and discrete machine-generated prompts, and compare it to the +behavior in response to human-generated natural-language prompts. Even when +producing a similar output, machine-generated and human prompts trigger +different response patterns through the network processing pathways, including +different perplexities, different attention and output entropy distributions, +and different unit activation profiles. We provide preliminary insight into the +nature of the units activated by different prompt types, suggesting that only +natural language prompts recruit a genuinely linguistic circuit. + +
+
+ comment: Findings of EMNLP 2023 Camera-Ready +
+
+
+
+
+ + ☆ Rosetta Stone at KSAA-RD Shared Task: A Hop From Language Modeling To + Word--Definition Alignment + + +
+ A Reverse Dictionary is a tool enabling users to discover a word based on its +provided definition, meaning, or description. Such a technique proves valuable +in various scenarios, aiding language learners who possess a description of a +word without its identity, and benefiting writers seeking precise terminology. +These scenarios often encapsulate what is referred to as the +"Tip-of-the-Tongue" (TOT) phenomenon. In this work, we present our winning +solution for the Arabic Reverse Dictionary shared task. This task focuses on +deriving a vector representation of an Arabic word from its accompanying +description. The shared task encompasses two distinct subtasks: the first +involves an Arabic definition as input, while the second employs an English +definition. For the first subtask, our approach relies on an ensemble of +finetuned Arabic BERT-based models, predicting the word embedding for a given +definition. The final representation is obtained through averaging the output +embeddings from each model within the ensemble. In contrast, the most effective +solution for the second subtask involves translating the English test +definitions into Arabic and applying them to the finetuned models originally +trained for the first subtask. This straightforward method achieves the highest +score across both subtasks. + +
+
+ comment: ArabicNLP 2023 +
+
+
+
+
+ + ☆ Generative Language Models Exhibit Social Identity Biases + + +
+ The surge in popularity of large language models has given rise to concerns +about biases that these models could learn from humans. In this study, we +investigate whether ingroup solidarity and outgroup hostility, fundamental +social biases known from social science, are present in 51 large language +models. We find that almost all foundational language models and some +instruction fine-tuned models exhibit clear ingroup-positive and +outgroup-negative biases when prompted to complete sentences (e.g., "We +are..."). A comparison of LLM-generated sentences with human-written sentences +on the internet reveals that these models exhibit similar, if not +greater, levels of bias compared to human text. To investigate where these biases stem +from, we experimentally varied the amount of ingroup-positive or +outgroup-negative sentences the model was exposed to during fine-tuning in the +context of the United States Democrat-Republican divide. Doing so resulted in +the models exhibiting a marked increase in ingroup solidarity and an even +greater increase in outgroup hostility. Furthermore, removing either +ingroup-positive or outgroup-negative sentences (or both) from the fine-tuning +data leads to a significant reduction in both ingroup solidarity and outgroup +hostility, suggesting that biases can be reduced by removing biased training +data. Our findings suggest that modern language models exhibit fundamental +social identity biases and that such biases can be mitigated by curating +training data. Our results have practical implications for creating less biased +large-language models and further underscore the need for more research into +user interactions with LLMs to prevent potential bias reinforcement in humans. + +
+
+ comment: supplementary material, data, and code see + https://osf.io/9ht32/?view_only=f0ab4b23325f4c31ad3e12a7353b55f5 +
+
+
+
+
+ + ☆ DALE: Generative Data Augmentation for Low-Resource Legal NLP EMNLP 2023 + + +
+ We present DALE, a novel and effective generative Data Augmentation framework +for low-resource LEgal NLP. DALE addresses the challenges existing frameworks +pose in generating effective data augmentations of legal documents - legal +language, with its specialized vocabulary and complex semantics, morphology, +and syntax, does not benefit from data augmentations that merely rephrase the +source sentence. To address this, DALE, built on an Encoder-Decoder Language +Model, is pre-trained on a novel unsupervised text denoising objective based on +selective masking - our masking strategy exploits the domain-specific language +characteristics of templatized legal documents to mask collocated spans of +text. Denoising these spans helps DALE acquire knowledge about legal concepts, +principles, and language usage. Consequently, it develops the ability to +generate coherent and diverse augmentations with novel contexts. Finally, DALE +performs conditional generation to generate synthetic augmentations for +low-resource Legal NLP tasks. We demonstrate the effectiveness of DALE on 13 +datasets spanning 6 tasks and 4 low-resource settings. DALE outperforms all our +baselines, including LLMs, qualitatively and quantitatively, with improvements +of 1%-50%. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference. Code: + https://github.com/Sreyan88/DALE +
+
+
+
+
+ + ☆ Random Entity Quantization for Parameter-Efficient Compositional + Knowledge Graph Representation EMNLP 2023 + + +
+ Representation Learning on Knowledge Graphs (KGs) is essential for downstream +tasks. The dominant approach, KG Embedding (KGE), represents entities with +independent vectors and faces the scalability challenge. Recent studies propose +an alternative way for parameter efficiency, which represents entities by +composing entity-corresponding codewords matched from predefined small-scale +codebooks. We refer to the process of obtaining corresponding codewords of each +entity as entity quantization, for which previous works have designed +complicated strategies. Surprisingly, this paper shows that simple random +entity quantization can achieve similar results to current strategies. We +analyze this phenomenon and reveal that entity codes, the quantization outcomes +for expressing entities, have higher entropy at the code level and Jaccard +distance at the codeword level under random entity quantization. Therefore, +different entities become more easily distinguished, facilitating effective KG +representation. The above results show that current quantization strategies are +not critical for KG representation, and there is still room for improvement in +entity distinguishability beyond current strategies. The code to reproduce our +results is available at https://github.com/JiaangL/RandomQuantization. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Improving generalization in large language models by learning prefix + subspaces + + +
+ This article focuses on fine-tuning large language models (LLMs) in the +scarce data regime (also known as the "few-shot" learning setting). We propose +a method to increase the generalization capabilities of LLMs based on neural +network subspaces. This optimization method, recently introduced in computer +vision, aims to improve model generalization by identifying wider local optima +through the joint optimization of an entire simplex of models in parameter +space. Its adaptation to massive, pretrained transformers, however, poses some +challenges. First, their considerable number of parameters makes it difficult +to train several models jointly, and second, their deterministic parameter +initialization schemes make them unfit for the subspace method as originally +proposed. We show in this paper that "Parameter Efficient Fine-Tuning" (PEFT) +methods, however, are perfectly compatible with this original approach, and +propose to learn an entire simplex of continuous prefixes. We test our method on a +variant of the GLUE benchmark adapted to the few-shot learning setting, and +show that both our contributions jointly lead to a gain in average performance +compared to state-of-the-art methods. The implementation can be found at the following +link: https://github.com/Liloulou/prefix_subspace + +
+
+
+
+
+ + ☆ MindLLM: Pre-training Lightweight Large Language Model from Scratch, + Evaluations and Domain Applications + + +
+ Large Language Models (LLMs) have demonstrated remarkable performance across +various natural language tasks, marking significant strides towards general +artificial intelligence. While general artificial intelligence is leveraged by +developing increasingly large-scale models, there could be another branch to +develop lightweight custom models that better serve certain domains, taking +into account the high cost of training and deploying LLMs and the scarcity of +resources. In this paper, we present MindLLM, a novel series of bilingual +lightweight large language models, trained from scratch, alleviating such +burdens by offering models with 1.3 billion and 3 billion parameters. A +thorough account of experiences accrued during large model development is +given, covering every step of the process, including data construction, model +architecture, evaluation, and applications. Such insights are hopefully +valuable for fellow academics and developers. MindLLM consistently matches or +surpasses the performance of other open-source larger models on some public +benchmarks. We also introduce an innovative instruction tuning framework +tailored for smaller models to enhance their capabilities efficiently. +Moreover, we explore the application of MindLLM in specific vertical domains +such as law and finance, underscoring the agility and adaptability of our +lightweight models. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ☆ BLESS: Benchmarking Large Language Models on Sentence Simplification EMNLP 2023 + + +
+ We present BLESS, a comprehensive performance benchmark of the most recent +state-of-the-art large language models (LLMs) on the task of text +simplification (TS). We examine how well off-the-shelf LLMs can solve this +challenging task, assessing a total of 44 models, differing in size, +architecture, pre-training methods, and accessibility, on three test sets from +different domains (Wikipedia, news, and medical) under a few-shot setting. Our +analysis considers a suite of automatic metrics as well as a large-scale +quantitative investigation into the types of common edit operations performed +by the different models. Furthermore, we perform a manual qualitative analysis +on a subset of model outputs to better gauge the quality of the generated +simplifications. Our evaluation indicates that the best LLMs, despite not being +trained on TS, perform comparably with state-of-the-art TS baselines. +Additionally, we find that certain LLMs demonstrate a greater range and +diversity of edit operations. Our performance benchmark will be available as a +resource for the development of future TS methods and evaluation metrics. + +
+
+ comment: This paper has been accepted to EMNLP 2023 as a main long paper. 9 + pages, 7 figures +
+
+
+
+
+ + ☆ Learning From Free-Text Human Feedback -- Collect New Datasets Or Extend + Existing Ones? EMNLP 2023 + + +
+ Learning from free-text human feedback is essential for dialog systems, but +annotated data is scarce and usually covers only a small fraction of error +types known in conversational AI. Instead of collecting and annotating new +datasets from scratch, recent advances in synthetic dialog generation could be +used to augment existing dialog datasets with the necessary annotations. +However, to assess the feasibility of such an effort, it is important to know +the types and frequency of free-text human feedback included in these datasets. +In this work, we investigate this question for a variety of commonly used +dialog datasets, including MultiWoZ, SGD, BABI, PersonaChat, +Wizards-of-Wikipedia, and the human-bot split of the Self-Feeding Chatbot. +Using our observations, we derive new taxonomies for the annotation of +free-text human feedback in dialogs and investigate the impact of including +such data in response generation for three SOTA language generation models, +including GPT-2, LLAMA, and Flan-T5. Our findings provide new insights into the +composition of the datasets examined, including error types, user response +types, and the relations between them. + +
+
+ comment: Accepted to be presented at EMNLP 2023 +
+
+
+
+
+ + ☆ Do Differences in Values Influence Disagreements in Online Discussions? EMNLP 2023 + + +
+ Disagreements are common in online discussions. Disagreement may foster +collaboration and improve the quality of a discussion under some conditions. +Although there exist methods for recognizing disagreement, a deeper +understanding of factors that influence disagreement is lacking in the +literature. We investigate a hypothesis that differences in personal values are +indicative of disagreement in online discussions. We show how state-of-the-art +models can be used for estimating values in online discussions and how the +estimated values can be aggregated into value profiles. We evaluate the +estimated value profiles based on human-annotated agreement labels. We find +that the dissimilarity of value profiles correlates with disagreement in +specific cases. We also find that including value information in agreement +prediction improves performance. + +
+
+ comment: Accepted as main paper at EMNLP 2023 +
+
+
+
+
+ + ☆ Integrating Language Models into Direct Speech Translation: An + Inference-Time Solution to Control Gender Inflection EMNLP 2023 + + +
+ When translating words referring to the speaker, speech translation (ST) +systems should not resort to default masculine generics nor rely on potentially +misleading vocal traits. Rather, they should assign gender according to the +speakers' preference. The existing solutions to do so, though effective, are +hardly feasible in practice as they involve dedicated model re-training on +gender-labeled ST data. To overcome these limitations, we propose the first +inference-time solution to control speaker-related gender inflections in ST. +Our approach partially replaces the (biased) internal language model (LM) +implicitly learned by the ST decoder with gender-specific external LMs. +Experiments on en->es/fr/it show that our solution outperforms the base models +and the best training-time mitigation strategy by up to 31.0 and 1.6 points in +gender accuracy, respectively, for feminine forms. The gains are even larger +(up to 32.0 and 3.4) in the challenging condition where speakers' vocal traits +conflict with their gender. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ☆ Failures Pave the Way: Enhancing Large Language Models through + Tuning-free Rule Accumulation EMNLP 2023 + + +
+ Large Language Models (LLMs) have showcased impressive performance. However, +due to their inability to capture relationships among samples, these frozen +LLMs inevitably keep repeating similar mistakes. In this work, we propose our +Tuning-free Rule Accumulation (TRAN) framework, which guides LLMs in improving +their performance by learning from previous mistakes. Considering data arrives +sequentially, LLMs gradually accumulate rules from incorrect cases, forming a +rule collection. These rules are then utilized by the LLMs to avoid making +similar mistakes when processing subsequent inputs. Moreover, the rules remain +independent of the primary prompts, seamlessly complementing prompt design +strategies. Experimentally, we show that TRAN improves over recent baselines by +a large margin. + +
+
+ comment: This paper is accepted by the EMNLP 2023 Main Conference +
+
+
+
+
+ + ☆ RAPL: A Relation-Aware Prototype Learning Approach for Few-Shot + Document-Level Relation Extraction EMNLP 2023 + + +
+ How to identify semantic relations among entities in a document when only a +few labeled documents are available? Few-shot document-level relation +extraction (FSDLRE) is crucial for addressing the pervasive data scarcity +problem in real-world scenarios. Metric-based meta-learning is an effective +framework widely adopted for FSDLRE, which constructs class prototypes for +classification. However, existing works often struggle to obtain class +prototypes with accurate relational semantics: 1) To build a prototype for a +target relation type, they aggregate the representations of all entity pairs +holding that relation, while these entity pairs may also hold other relations, +thus disturbing the prototype. 2) They use a set of generic NOTA +(none-of-the-above) prototypes across all tasks, neglecting that the NOTA +semantics differs in tasks with different target relation types. In this paper, +we propose a relation-aware prototype learning method for FSDLRE to strengthen +the relational semantics of prototype representations. By judiciously +leveraging the relation descriptions and realistic NOTA instances as guidance, +our method effectively refines the relation prototypes and generates +task-specific NOTA prototypes. Extensive experiments demonstrate that our +method outperforms state-of-the-art approaches by an average of 2.61% $F_1$ across +various settings of two FSDLRE benchmarks. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Variator: Accelerating Pre-trained Models with Plug-and-Play Compression + Modules EMNLP + + +
+ Pre-trained language models (PLMs) have achieved remarkable results on NLP +tasks but at the expense of huge parameter sizes and the consequent +computational costs. In this paper, we propose Variator, a parameter-efficient +acceleration method that enhances computational efficiency through +plug-and-play compression plugins. Compression plugins are designed to reduce +the sequence length via compressing multiple hidden vectors into one and +trained with original PLMs frozen. Different from traditional model +acceleration methods, which compress PLMs to smaller sizes, Variator offers two +distinct advantages: (1) In real-world applications, the plug-and-play nature +of our compression plugins enables dynamic selection of different compression +plugins with varying acceleration ratios based on the current workload. (2) The +compression plugin comprises a few compact neural network layers with minimal +parameters, significantly saving storage and memory overhead, particularly in +scenarios with a growing number of tasks. We validate the effectiveness of +Variator on seven datasets. Experimental results show that Variator can save +53% computational costs using only 0.9% additional parameters with a +performance drop of less than 2%. Moreover, when the model scales to billions +of parameters, Variator matches the strong performance of uncompressed PLMs. + +
+
+ comment: Accepted by Findings of EMNLP +
+
+
+
+
+ + ☆ Re-Temp: Relation-Aware Temporal Representation Learning for Temporal + Knowledge Graph Completion EMNLP 2023 + + +
+ Temporal Knowledge Graph Completion (TKGC) under the extrapolation setting +aims to predict the missing entity from a fact in the future, posing a +challenge that aligns more closely with real-world prediction problems. +Existing research mostly encodes entities and relations using sequential graph +neural networks applied to recent snapshots. However, these approaches tend to +overlook the ability to skip irrelevant snapshots according to entity-related +relations in the query and disregard the importance of explicit temporal +information. To address this, we propose our model, Re-Temp (Relation-Aware +Temporal Representation Learning), which leverages explicit temporal embedding +as input and incorporates skip information flow after each timestamp to skip +unnecessary information for prediction. Additionally, we introduce a two-phase +forward propagation method to prevent information leakage. Through the +evaluation on six TKGC (extrapolation) datasets, we demonstrate that our model +outperforms all eight recent state-of-the-art models by a significant margin. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Ensemble of Task-Specific Language Models for Brain Encoding + + +
+ Language models have been shown to be rich enough to encode fMRI activations +of certain Regions of Interest in our Brains. Previous works have explored +transfer learning from representations learned for popular natural language +processing tasks for predicting brain responses. In our work, we improve the +performance of such encoders by creating an ensemble model out of 10 popular +Language Models (2 syntactic and 8 semantic). We beat the current baselines by +10% on average across all ROIs through our ensembling methods. + +
+
+
+
+
+ + ☆ Enhancing Biomedical Lay Summarisation with External Knowledge Graphs EMNLP 2023 + + +
+ Previous approaches for automatic lay summarisation are exclusively reliant +on the source article that, given it is written for a technical audience (e.g., +researchers), is unlikely to explicitly define all technical concepts or state +all of the background information that is relevant for a lay audience. We +address this issue by augmenting eLife, an existing biomedical lay +summarisation dataset, with article-specific knowledge graphs, each containing +detailed information on relevant biomedical concepts. Using both automatic and +human evaluations, we systematically investigate the effectiveness of three +different approaches for incorporating knowledge graphs within lay +summarisation models, with each method targeting a distinct area of the +encoder-decoder model architecture. Our results confirm that integrating +graph-based domain knowledge can significantly benefit lay summarisation by +substantially increasing the readability of generated text and improving the +explanation of technical concepts. + +
+
+ comment: Accepted to the EMNLP 2023 main conference +
+
+
+
+
+ + ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with the +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong Continuous learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ☆ Towards Automated Recipe Genre Classification using Semi-Supervised + Learning + + +
+ Sharing cooking recipes is a great way to exchange culinary ideas and provide +instructions for food preparation. However, categorizing raw recipes found +online into appropriate food genres can be challenging due to a lack of +adequate labeled data. In this study, we present a dataset named the +"Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking +Recipe Dataset" that contains two million culinary recipes labeled in +respective categories with extended named entities extracted from recipe +descriptions. This collection of data includes various features such as title, +NER, directions, and extended NER, as well as nine different labels +representing genres including bakery, drinks, non-veg, vegetables, fast food, +cereals, meals, sides, and fusions. The proposed pipeline named 3A2M+ extends +the size of the Named Entity Recognition (NER) list to address missing named +entities like heat, time or process from the recipe directions using two NER +extraction tools. The 3A2M+ dataset provides a comprehensive solution to the +various challenging recipe-related tasks, including classification, named +entity recognition, and recipe generation. Furthermore, we have applied +traditional machine learning, deep learning and pre-trained language models to +classify the recipes into their corresponding genre and achieved an overall +accuracy of 98.6%. Our investigation indicates that the title feature played a +more significant role in classifying the genre. + +
+
+
+
+
+ + ☆ Creating a silver standard for patent simplification SIGIR 2023 + + +
+ Patents are legal documents that aim at protecting inventions on the one hand +and at making technical knowledge circulate on the other. Their complex style +-- a mix of legal, technical, and extremely vague language -- makes their +content hard to access for humans and machines and poses substantial challenges +to the information retrieval community. This paper proposes an approach to +automatically simplify patent text through rephrasing. Since no in-domain +parallel simplification data exist, we propose a method to automatically +generate a large-scale silver standard for patent sentences. To obtain +candidates, we use a general-domain paraphrasing system; however, the process +is error-prone and difficult to control. Thus, we pair it with proper filters +and construct a cleaner corpus that can successfully be used to train a +simplification system. Human evaluation of the synthetic silver corpus shows +that it is considered grammatical, adequate, and contains simple sentences. + +
+
+ comment: This paper has been published at SIGIR 2023 +
+
+
+
+
+ + ☆ Improving Biomedical Abstractive Summarisation with Knowledge + Aggregation from Citation Papers EMNLP 2023 + + +
+ Abstracts derived from biomedical literature possess distinct domain-specific +characteristics, including specialised writing styles and biomedical +terminologies, which necessitate a deep understanding of the related +literature. As a result, existing language models struggle to generate +technical summaries that are on par with those produced by biomedical experts, +given the absence of domain-specific background knowledge. This paper aims to +enhance the performance of language models in biomedical abstractive +summarisation by aggregating knowledge from external papers cited within the +source article. We propose a novel attention-based citation aggregation model +that integrates domain-specific knowledge from citation papers, allowing neural +networks to generate summaries by leveraging both the paper content and +relevant knowledge from citation papers. Furthermore, we construct and release +a large-scale biomedical summarisation dataset that serves as a foundation for +our research. Extensive experiments demonstrate that our model outperforms +state-of-the-art approaches and achieves substantial improvements in +abstractive biomedical text summarisation. + +
+
+ comment: Accepted by EMNLP 2023 +
+
+
+
+
+ + ☆ Prevalence and prevention of large language model use in crowd work + + +
+ We show that the use of large language models (LLMs) is prevalent among crowd +workers, and that targeted mitigation strategies can significantly reduce, but +not eliminate, LLM use. On a text summarization task where workers were not +directed in any way regarding their LLM use, the estimated prevalence of LLM +use was around 30%, but was reduced by about half by asking workers to not use +LLMs and by raising the cost of using them, e.g., by disabling copy-pasting. +Secondary analyses give further insight into LLM use and its prevention: LLM +use yields high-quality but homogeneous responses, which may harm research +concerned with human (rather than model) behavior and degrade future models +trained with crowdsourced data. At the same time, preventing LLM use may be at +odds with obtaining high-quality responses; e.g., when requesting workers not +to use LLMs, summaries contained fewer keywords carrying essential information. +Our estimates will likely change as LLMs increase in popularity or +capabilities, and as norms around their usage change. Yet, understanding the +co-evolution of LLM-based tools and users is key to maintaining the validity of +research done using crowdsourcing, and we provide a critical baseline before +widespread adoption ensues. + +
+
+ comment: VV and MHR equal contribution. 14 pages, 1 figure, 1 table +
+
+
+
+
+ + ☆ How Much Context Does My Attention-Based ASR System Need? + + +
+ For the task of speech recognition, the use of more than 30 seconds of +acoustic context during training is uncommon, and under-investigated in +literature. In this work, we examine the effect of scaling the sequence length +used to train/evaluate (dense-attention based) acoustic and language models on +speech recognition performance. For these experiments a dataset of roughly +100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 +seconds to 1 hour being explored. Zero-shot evaluations on long-format datasets +Earnings-22 and Tedlium demonstrate a benefit from training with around 80 +seconds of acoustic context, showing up to a 14.9% relative improvement from a +limited context baseline. Furthermore, we perform a system combination with +long-context transformer language models via beam search for a fully +long-context ASR system, with results that are competitive with the current +state-of-the-art. + +
+
+
+
+
+ + ☆ Expression Syntax Information Bottleneck for Math Word Problems SIGIR 2022 + + +
+ Math Word Problems (MWP) aims to automatically solve mathematical questions +given in texts. Previous studies tend to design complex models to capture +additional information in the original text so as to enable the model to gain +more comprehensive features. In this paper, we turn our attention in the +opposite direction, and work on how to discard redundant features containing +spurious correlations for MWP. To this end, we design an Expression Syntax +Information Bottleneck method for MWP (called ESIB) based on variational +information bottleneck, which extracts essential features of expression syntax +tree while filtering latent-specific redundancy containing syntax-irrelevant +features. The key idea of ESIB is to encourage multiple models to predict the +same expression syntax tree for different problem representations of the same +problem by mutual learning so as to capture consistent information of +expression syntax tree and discard latent-specific redundancy. To improve the +generalization ability of the model and generate more diverse expressions, we +design a self-distillation loss to encourage the model to rely more on the +expression syntax information in the latent space. Experimental results on two +large-scale benchmarks show that our model not only achieves state-of-the-art +results but also generates more diverse solutions. The code is available. + +
+
+ comment: This paper has been accepted by SIGIR 2022. The code can be found at + https://github.com/menik1126/math_ESIB +
+
+
+
+
+ + ☆ A Survey on Detection of LLMs-Generated Content + + +
+ The burgeoning capabilities of advanced large language models (LLMs) such as +ChatGPT have led to an increase in synthetic content generation with +implications across a variety of sectors, including media, cybersecurity, +public discourse, and education. As such, the ability to detect LLMs-generated +content has become of paramount importance. We aim to provide a detailed +overview of existing detection strategies and benchmarks, scrutinizing their +differences and identifying key challenges and prospects in the field, +advocating for more adaptable and robust models to enhance detection accuracy. +We also posit the necessity for a multi-faceted approach to defend against +various attacks to counter the rapidly advancing capabilities of LLMs. To the +best of our knowledge, this work is the first comprehensive survey on the +detection in the era of LLMs. We hope it will provide a broad understanding of +the current landscape of LLMs-generated content detection, offering a guiding +reference for researchers and practitioners striving to uphold the integrity of +digital information in an era increasingly dominated by synthetic content. The +relevant papers are summarized and will be consistently updated at +https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git. + +
+
+ comment: We will keep updating at + https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git +
+
+
+
+
+ + ☆ CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large + Language Models for Data Annotation + + +
+ Annotated data plays a critical role in Natural Language Processing (NLP) in +training models and evaluating their performance. Given recent developments in +Large Language Models (LLMs), models such as ChatGPT demonstrate zero-shot +capability on many text-annotation tasks, comparable with or even exceeding +human annotators. Such LLMs can serve as alternatives for manual annotation, +due to lower costs and higher scalability. However, limited work has leveraged +LLMs as complementary annotators, nor explored how annotation work is best +allocated among humans and LLMs to achieve both quality and cost objectives. We +propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of +unstructured texts at scale. Under this framework, we utilize uncertainty to +estimate LLMs' annotation capability. Our empirical study shows CoAnnotating to +be an effective means to allocate work from results on different datasets, with +up to 21% performance improvement over random baseline. For code +implementation, see https://github.com/SALT-NLP/CoAnnotating. + +
+
+
+
+
+ + ☆ Career Path Prediction using Resume Representation Learning and + Skill-based Matching RecSys + + +
+ The impact of person-job fit on job satisfaction and performance is widely +acknowledged, which highlights the importance of providing workers with next +steps at the right time in their career. This task of predicting the next step +in a career is known as career path prediction, and has diverse applications +such as turnover prevention and internal job mobility. Existing methods to +career path prediction rely on large amounts of private career history data to +model the interactions between job titles and companies. We propose leveraging +the unexplored textual descriptions that are part of work experience sections +in resumes. We introduce a structured dataset of 2,164 anonymized career +histories, annotated with ESCO occupation labels. Based on this dataset, we +present a novel representation learning approach, CareerBERT, specifically +designed for work history data. We develop a skill-based model and a text-based +model for career path prediction, which achieve 35.24% and 39.61% recall@10 +respectively on our dataset. Finally, we show that both approaches are +complementary as a hybrid approach achieves the strongest result with 43.01% +recall@10. + +
+
+ comment: Accepted to the 3rd Workshop on Recommender Systems for Human + Resources (RecSys in HR 2023) as part of RecSys 2023 +
+
+
+
+
+ + ☆ Tips for making the most of 64-bit architectures in langage design, + libraries or garbage collection + + +
+ The 64-bit architectures that have become standard today offer unprecedented +low-level programming possibilities. For the first time in the history of +computing, the size of address registers far exceeded the physical capacity of +their bus. After a brief reminder of the possibilities offered by the small size +of addresses compared to the available 64 bits, we develop three concrete +examples of how the vacant bits of these registers can be used. Among these +examples, two of them concern the implementation of a library for a new +statically typed programming language. Firstly, the implementation of +multi-precision integers, with the aim of improving performance in terms of +both calculation speed and RAM savings. The second example focuses on the +library's handling of UTF-8 character strings. Here, the idea is to make +indexing easier by ignoring the physical size of each UTF-8 character. Finally, +the third example is a possible enhancement of garbage collectors, in +particular the mark & sweep for the object marking phase. + +
+
+
+
+
+ + ☆ Machine Translation for Nko: Tools, Corpora and Baseline Results + + +
+ Currently, there is no usable machine translation system for Nko, a language +spoken by tens of millions of people across multiple West African countries, +which holds significant cultural and educational value. To address this issue, +we present a set of tools, resources, and baseline results aimed towards the +development of usable machine translation systems for Nko and other languages +that do not currently have sufficiently large parallel text corpora available. +(1) Friallel: A novel collaborative parallel text curation software that +incorporates quality control through copyedit-based workflows. (2) Expansion of +the FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko +translations in parallel with 204 and 40 other languages. (3) nicolingua-0005: +A collection of trilingual and bilingual corpora with 130,850 parallel segments +and monolingual corpora containing over 3 million Nko words. (4) Baseline +bilingual and multilingual neural machine translation results with the best +model scoring 30.83 English-Nko chrF++ on FLoRes-devtest. + +
+
+
+
+
+ + ☆ MUSER: A Multi-View Similar Case Retrieval Dataset CIKM 2023 + + +
+ Similar case retrieval (SCR) is a representative legal AI application that +plays a pivotal role in promoting judicial fairness. However, existing SCR +datasets only focus on the fact description section when judging the similarity +between cases, ignoring other valuable sections (e.g., the court's opinion) +that can provide insightful reasoning process behind. Furthermore, the case +similarities are typically measured solely by the textual semantics of the fact +descriptions, which may fail to capture the full complexity of legal cases from +the perspective of legal knowledge. In this work, we present MUSER, a similar +case retrieval dataset based on multi-view similarity measurement and +comprehensive legal element with sentence-level legal element annotations. +Specifically, we select three perspectives (legal fact, dispute focus, and law +statutory) and build a comprehensive and structured label schema of legal +elements for each of them, to enable accurate and knowledgeable evaluation of +case similarities. The constructed dataset originates from Chinese civil cases +and contains 100 query cases and 4,024 candidate cases. We implement several +text classification algorithms for legal element prediction and various +retrieval methods for retrieving similar cases on MUSER. The experimental +results indicate that incorporating legal elements can benefit the performance +of SCR models, but further efforts are still required to address the remaining +challenges posed by MUSER. The source code and dataset are released at +https://github.com/THUlawtech/MUSER. + +
+
+ comment: Accepted by CIKM 2023 Resource Track +
+
+
+
+
+ + ☆ Retrieval-based Knowledge Transfer: An Effective Approach for Extreme + Large Language Model Compression EMNLP 2023 + + +
+ Large-scale pre-trained language models (LLMs) have demonstrated exceptional +performance in various natural language processing (NLP) tasks. However, the +massive size of these models poses huge challenges for their deployment in +real-world applications. While numerous model compression techniques have been +proposed, most of them are not well-suited for achieving extreme model +compression when there is a significant gap in model scale. In this paper, we +introduce a novel compression paradigm called Retrieval-based Knowledge +Transfer (RetriKT), which effectively transfers the knowledge of LLMs to +extremely small-scale models (e.g., 1%). In particular, our approach extracts +knowledge from LLMs to construct a knowledge store, from which the small-scale +model can retrieve relevant information and leverage it for effective +inference. To improve the quality of the model, soft prompt tuning and Proximal +Policy Optimization (PPO) reinforcement learning techniques are employed. +Extensive experiments are conducted on low-resource tasks from SuperGLUE and +GLUE benchmarks. The results demonstrate that the proposed approach +significantly enhances the performance of small-scale models by leveraging the +knowledge from LLMs. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ ScanDL: A Diffusion Model for Generating Synthetic Scanpaths on Texts EMNLP 2023 + + +
+ Eye movements in reading play a crucial role in psycholinguistic research +studying the cognitive mechanisms underlying human language processing. More +recently, the tight coupling between eye movements and cognition has also been +leveraged for language-related machine learning tasks such as the +interpretability, enhancement, and pre-training of language models, as well as +the inference of reader- and text-specific properties. However, scarcity of eye +movement data and its unavailability at application time poses a major +challenge for this line of research. Initially, this problem was tackled by +resorting to cognitive models for synthesizing eye movement data. However, for +the sole purpose of generating human-like scanpaths, purely data-driven +machine-learning-based methods have proven to be more suitable. Following +recent advances in adapting diffusion processes to discrete data, we propose +ScanDL, a novel discrete sequence-to-sequence diffusion model that generates +synthetic scanpaths on texts. By leveraging pre-trained word representations +and jointly embedding both the stimulus text and the fixation sequence, our +model captures multi-modal interactions between the two inputs. We evaluate +ScanDL within- and across-dataset and demonstrate that it significantly +outperforms state-of-the-art scanpath generation methods. Finally, we provide +an extensive psycholinguistic analysis that underlines the model's ability to +exhibit human-like reading behavior. Our implementation is made available at +https://github.com/DiLi-Lab/ScanDL. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Multimodal Representations for Teacher-Guided Compositional Visual + Reasoning + + +
+ Neural Module Networks (NMN) are a compelling method for visual question +answering, enabling the translation of a question into a program consisting of +a series of reasoning sub-tasks that are sequentially executed on the image to +produce an answer. NMNs provide enhanced explainability compared to integrated +models, allowing for a better understanding of the underlying reasoning +process. To improve the effectiveness of NMNs we propose to exploit features +obtained by a large-scale cross-modal encoder. Also, the current training +approach of NMNs relies on the propagation of module outputs to subsequent +modules, leading to the accumulation of prediction errors and the generation of +false answers. To mitigate this, we introduce an NMN learning strategy +involving scheduled teacher guidance. Initially, the model is fully guided by +the ground-truth intermediate outputs, but gradually transitions to an +autonomous behavior as training progresses. This reduces error accumulation, +thus improving training efficiency and final performance. We demonstrate that by +incorporating cross-modal features and employing more effective training +techniques for NMN, we achieve a favorable balance between performance and +transparency in the reasoning process. + +
+
+
+
+
+ + ☆ CONTRASTE: Supervised Contrastive Pre-training With Aspect-based Prompts + For Aspect Sentiment Triplet Extraction EMNLP 2023 + + +
+ Existing works on Aspect Sentiment Triplet Extraction (ASTE) explicitly focus +on developing more efficient fine-tuning techniques for the task. Instead, our +motivation is to come up with a generic approach that can improve the +downstream performances of multiple ABSA tasks simultaneously. Towards this, we +present CONTRASTE, a novel pre-training strategy using CONTRastive learning to +enhance the ASTE performance. While we primarily focus on ASTE, we also +demonstrate the advantage of our proposed technique on other ABSA tasks such as +ACOS, TASD, and AESC. Given a sentence and its associated (aspect, opinion, +sentiment) triplets, first, we design aspect-based prompts with corresponding +sentiments masked. We then (pre)train an encoder-decoder model by applying +contrastive learning on the decoder-generated aspect-aware sentiment +representations of the masked terms. For fine-tuning the model weights thus +obtained, we then propose a novel multi-task approach where the base +encoder-decoder model is combined with two complementary modules, a +tagging-based Opinion Term Detector, and a regression-based Triplet Count +Estimator. Exhaustive experiments on four benchmark datasets and a detailed +ablation study establish the importance of each of our proposed components as +we achieve new state-of-the-art ASTE results. + +
+
+ comment: Accepted as a Long Paper at EMNLP 2023 (Findings); 16 pages; Codes: + https://github.com/nitkannen/CONTRASTE/ +
+
+
+
+
+ + ☆ POE: Process of Elimination for Multiple Choice Reasoning EMNLP 2023 + + +
+ Language models (LMs) are capable of conducting in-context learning for +multiple choice reasoning tasks, but the options in these tasks are treated +equally. As humans often first eliminate wrong options before picking the final +correct answer, we argue a similar two-step strategy can make LMs better at +these tasks. To this end, we present the Process of Elimination (POE), a +two-step scoring method. In the first step, POE scores each option, and +eliminates seemingly wrong options. In the second step, POE masks these wrong +options, and makes the final prediction from the remaining options. Zero-shot +experiments on 8 reasoning tasks illustrate the effectiveness of POE, and a +following analysis finds our method to be especially performant on logical +reasoning tasks. We further analyze the effect of masks, and show that POE +applies to few-shot settings and large language models (LLMs) like ChatGPT. + +
+
+ comment: Accepted as a short paper at EMNLP 2023 +
+
+
+
+
+ + ☆ Natural Language Processing for Drug Discovery Knowledge Graphs: + promises and pitfalls + + +
+ Building and analysing knowledge graphs (KGs) to aid drug discovery is a +topical area of research. A salient feature of KGs is their ability to combine +many heterogeneous data sources in a format that facilitates discovering +connections. The utility of KGs has been exemplified in areas such as drug +repurposing, with insights made through manual exploration and modelling of the +data. In this article, we discuss promises and pitfalls of using natural +language processing (NLP) to mine unstructured text, typically from scientific +literature, as a data source for KGs. This draws on our experience of initially +parsing structured data sources such as ChEMBL as the basis for data within a +KG, and then enriching or expanding upon them using NLP. The fundamental +promise of NLP for KGs is the automated extraction of data from millions of +documents, a task practically impossible to do via human curation alone. +However, there are many potential pitfalls in NLP-KG pipelines, such as +incorrect named entity recognition and ontology linking, all of which could +ultimately lead to erroneous inferences and conclusions. + +
+
+ comment: 17 pages, 7 figures +
+
+
+
+
+ + ☆ Visually Grounded Continual Language Learning with Selective + Specialization EMNLP 2023 + + +
+ A desirable trait of an artificial agent acting in the visual world is to +continually learn a sequence of language-informed tasks while striking a +balance between sufficiently specializing in each task and building a +generalized knowledge for transfer. Selective specialization, i.e., a careful +selection of model components to specialize in each task, is a strategy to +provide control over this trade-off. However, the design of selection +strategies requires insights on the role of each model component in learning +rather specialized or generalizable representations, which poses a gap in +current research. Thus, our aim with this work is to provide an extensive +analysis of selection strategies for visually grounded continual language +learning. Due to the lack of suitable benchmarks for this purpose, we introduce +two novel diagnostic datasets that provide enough control and flexibility for a +thorough model analysis. We assess various heuristics for module specialization +strategies as well as quantifiable measures for two different types of model +architectures. Finally, we design conceptually simple approaches based on our +analysis that outperform common continual learning baselines. Our results +demonstrate the need for further efforts towards better aligning continual +learning algorithms with the learning behaviors of individual model parts. + +
+
+ comment: Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ☆ MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in + the Materials Science Domain + + +
+ Keeping track of all relevant recent publications and experimental results +for a research area is a challenging task. Prior work has demonstrated the +efficacy of information extraction models in various scientific areas. +Recently, several datasets have been released for the yet understudied +materials science domain. However, these datasets focus on sub-problems such as +parsing synthesis procedures or on sub-domains, e.g., solid oxide fuel cells. +In this resource paper, we present MuLMS, a new dataset of 50 open-access +articles, spanning seven sub-domains of materials science. The corpus has been +annotated by domain experts with several layers ranging from named entities +over relations to frame structures. We present competitive neural models for +all tasks and demonstrate that multi-task training with existing related +resources leads to benefits. + +
+
+ comment: 17 pages, 2 figures, 28 tables, to be published in "Proceedings of + the second Workshop on Information Extraction from Scientific Publications" +
+
+
+
+
+ + ☆ TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for + Inference Cost Reduction EMNLP 2023 + + +
+ Since ChatGPT released its API for public use, the number of applications +built on top of commercial large language models (LLMs) has increased exponentially. +One popular usage of such models is leveraging their in-context learning ability +and generating responses given user queries leveraging knowledge obtained by +retrieval augmentation. One problem of deploying commercial retrieval-augmented +LLMs is the cost due to the additionally retrieved context that largely +increases the input token size of the LLMs. To mitigate this, we propose a +token compression scheme that includes two methods: summarization compression +and semantic compression. The first method applies a T5-based model that is +fine-tuned by datasets generated using self-instruct containing samples with +varying lengths and reduces token size by doing summarization. The second method +further compresses the token size by removing words with lower impact on the +semantics. In order to adequately evaluate the effectiveness of the proposed +methods, we propose and utilize a dataset called Food-Recommendation DB (FRDB) +focusing on food recommendation for women around pregnancy period or infants. +Our summarization compression can reduce 65% of the retrieval token size with +further 0.3% improvement on the accuracy; semantic compression provides a more +flexible way to trade-off the token size with performance, for which we can +reduce the token size by 20% with only 1.6% of accuracy drop. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Unveiling Multilinguality in Transformer Models: Exploring Language + Specificity in Feed-Forward Networks + + +
+ Recent research suggests that the feed-forward module within Transformers can +be viewed as a collection of key-value memories, where the keys learn to +capture specific patterns from the input based on the training examples. The +values then combine the output from the 'memories' of the keys to generate +predictions about the next token. This leads to an incremental process of +prediction that gradually converges towards the final token choice near the +output layers. This interesting perspective raises questions about how +multilingual models might leverage this mechanism. Specifically, for +autoregressive models trained on two or more languages, do all neurons (across +layers) respond equally to all languages? No! Our hypothesis centers around the +notion that during pretraining, certain model parameters learn strong +language-specific features, while others learn more language-agnostic (shared +across languages) features. To validate this, we conduct experiments utilizing +parallel corpora of two languages that the model was initially pretrained on. +Our findings reveal that the layers closest to the network's input or output +tend to exhibit more language-specific behaviour compared to the layers in the +middle. + +
+
+
+
+
+ + ☆ Improving Language Models Meaning Understanding and Consistency by + Learning Conceptual Roles from Dictionary + + +
+ The non-humanlike behaviour of contemporary pre-trained language models +(PLMs) is a leading cause undermining their trustworthiness. A striking +phenomenon of such faulty behaviours is the generation of inconsistent +predictions, which produces logically contradictory results, such as generating +different predictions for texts delivering the same meaning or violating +logical properties. Previous studies exploited data augmentation or implemented +specialised loss functions to alleviate the issue. However, their usage is +limited, because they consume expensive training resources for large-sized PLMs +and can only handle a certain consistency type. To this end, we propose a +practical approach that alleviates the inconsistent behaviour issue by +fundamentally improving PLMs' meaning awareness. Based on the conceptual role +theory, our method allows PLMs to capture accurate meaning by learning precise +interrelationships between concepts from word-definition pairs in a dictionary. +Next, we propose an efficient parameter integration technique that updates only +a few additional parameters to combine the learned interrelationship with PLMs' +pre-trained knowledge. Our experimental results reveal that the approach can +concurrently improve multiple types of consistency, enables efficient knowledge +integration, and easily applies to other languages. + +
+
+ comment: 15 pages +
+
+
+
+
+ + ☆ SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code + Translation + + +
+ With the recent focus on Large Language Models (LLMs), both StarCoder (Li et +al., 2023) and Code Llama (Rozière et al., 2023) have demonstrated remarkable +performance in code generation. However, there is still a need for improvement +in code translation functionality with efficient training techniques. In +response to this, we introduce SteloCoder, a decoder-only StarCoder-based LLM +designed specifically for multi-programming language-to-Python code +translation. In particular, SteloCoder achieves C++, C#, JavaScript, Java, or +PHP-to-Python code translation without specifying the input programming +language. We modified the StarCoder model architecture by incorporating a +Mixture-of-Experts (MoE) technique featuring five experts and a gating network +for multi-task handling. Experts are obtained by StarCoder fine-tuning. +Specifically, we use a Low-Rank Adaptive Method (LoRA) technique, limiting each +expert's size to only 0.06% of the number of StarCoder's parameters. At the same +time, to enhance training efficiency in terms of time, we adopt a curriculum +learning strategy and use self-instruct data for efficient fine-tuning. As a +result, each expert takes only 6 hours to train on one single 80GB A100 HBM. +With experiments on XLCoST datasets, SteloCoder achieves an average of 73.76 +CodeBLEU score in multi-programming language-to-Python translation, surpassing +the top performance from the leaderboard by at least 3.5. This accomplishment +is attributed to only 45M extra parameters with StarCoder as the backbone and +32 hours of valid training on one 80GB A100 HBM. The source code is released +here: https://github.com/sade-adrien/SteloCoder. + +
+
+
+
+
+ + ☆ MarkQA: A large scale KBQA dataset with numerical reasoning EMNLP 2023 + + +
+ While question answering over knowledge bases (KBQA) has shown progress in +addressing factoid questions, KBQA with numerical reasoning remains relatively +unexplored. In this paper, we focus on the complex numerical reasoning in KBQA +and propose a new task, NR-KBQA, which necessitates the ability to perform both +multi-hop reasoning and numerical reasoning. We design a logic form in Python +format called PyQL to represent the reasoning process of numerical reasoning +questions. To facilitate the development of NR-KBQA, we present a large dataset +called MarkQA, which is automatically constructed from a small set of seeds. +Each question in MarkQA is equipped with its corresponding SPARQL query, +alongside the step-by-step reasoning process in the QDMR format and PyQL +program. Experimental results of some state-of-the-art QA methods on the MarkQA +show that complex numerical reasoning in KBQA faces great challenges. + +
+
+ comment: camera ready for EMNLP 2023 +
+
+
+
+
+ + ☆ Fighting Fire with Fire: The Dual Role of LLMs in Crafting and Detecting + Elusive Disinformation EMNLP 2023 + + +
+ Recent ubiquity and disruptive impacts of large language models (LLMs) have +raised concerns about their potential to be misused (i.e., generating +large-scale harmful and misleading content). To combat this emerging risk of +LLMs, we propose a novel "Fighting Fire with Fire" (F3) strategy that harnesses +modern LLMs' generative and emergent reasoning capabilities to counter +human-written and LLM-generated disinformation. First, we leverage +GPT-3.5-turbo to synthesize authentic and deceptive LLM-generated content +through paraphrase-based and perturbation-based prefix-style prompts, +respectively. Second, we apply zero-shot in-context semantic reasoning +techniques with cloze-style prompts to discern genuine from deceptive posts and +news articles. In our extensive experiments, we observe GPT-3.5-turbo's +zero-shot superiority for both in-distribution and out-of-distribution +datasets, where GPT-3.5-turbo consistently achieved accuracy at 68-72%, unlike +the decline observed in previous customized and fine-tuned disinformation +detectors. Our codebase and dataset are available at +https://github.com/mickeymst/F3. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ☆ A Joint Matrix Factorization Analysis of Multilingual Representations EMNLP 2023 + + +
+ We present an analysis tool based on joint matrix factorization for comparing +latent representations of multilingual and monolingual models. An alternative +to probing, this tool allows us to analyze multiple sets of representations in +a joint manner. Using this tool, we study to what extent and how +morphosyntactic features are reflected in the representations learned by +multilingual pre-trained models. We conduct a large-scale empirical study of +over 33 languages and 17 morphosyntactic categories. Our findings demonstrate +variations in the encoding of morphosyntactic information across upper and +lower layers, with category-specific differences influenced by language +properties. Hierarchical clustering of the factorization outputs yields a tree +structure that is related to phylogenetic trees manually crafted by linguists. +Moreover, we find the factorization outputs exhibit strong associations with +performance observed across different cross-lingual tasks. We release our code +to facilitate future research. + +
+
+ comment: Accepted to Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ KITAB: Evaluating LLMs on Constraint Satisfaction for Information + Retrieval + + +
+ We study the ability of state-of-the-art models to answer constraint +satisfaction queries for information retrieval (e.g., 'a list of ice cream +shops in San Diego'). In the past, such queries were considered to be tasks +that could only be solved via web-search or knowledge bases. More recently, +large language models (LLMs) have demonstrated initial emergent abilities in +this task. However, many current retrieval benchmarks are either saturated or +do not measure constraint satisfaction. Motivated by rising concerns around +factual incorrectness and hallucinations of LLMs, we present KITAB, a new +dataset for measuring constraint satisfaction abilities of language models. +KITAB consists of book-related data across more than 600 authors and 13,000 +queries, and also offers an associated dynamic data collection and constraint +verification approach for acquiring similar test data for other authors. Our +extended experiments on GPT4 and GPT3.5 characterize and decouple common +failure modes across dimensions such as information popularity, constraint +types, and context availability. Results show that in the absence of context, +models exhibit severe limitations as measured by irrelevant information, +factual errors, and incompleteness, many of which are exacerbated as information +popularity decreases. While context availability mitigates irrelevant +information, it is not helpful for satisfying constraints, identifying +fundamental barriers to constraint satisfaction. We open source our +contributions to foster further research on improving constraint satisfaction +abilities of future models. + +
+
+ comment: 23 pages +
+
+
+
+
+ + ☆ TRAMS: Training-free Memory Selection for Long-range Language Modeling + + +
+ The Transformer architecture is crucial for numerous AI models, but it still +faces challenges in long-range language modeling. Though several specific +transformer architectures have been designed to tackle issues of long-range +dependencies, existing methods like Transformer-XL are plagued by a high +percentage of ineffective memories. In this study, we present a plug-and-play +strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens +participating in attention calculation based on one simple metric. This +strategy allows us to keep tokens that are likely to have a high attention +score with the current queries and ignore the other ones. We have tested our +approach on the word-level benchmark (WikiText-103) and the character-level +benchmark (enwik8), and the results indicate an improvement without having +additional training or adding additional parameters. + +
+
+
+
+
+ + ☆ NuTrea: Neural Tree Search for Context-guided Multi-hop KGQA NeurIPS + + +
+ Multi-hop Knowledge Graph Question Answering (KGQA) is a task that involves +retrieving nodes from a knowledge graph (KG) to answer natural language +questions. Recent GNN-based approaches formulate this task as a KG path +searching problem, where messages are sequentially propagated from the seed +node towards the answer nodes. However, these messages are past-oriented, and +they do not consider the full KG context. To make matters worse, KG nodes often +represent proper noun entities and are sometimes encrypted, being uninformative +in selecting between paths. To address these problems, we propose Neural Tree +Search (NuTrea), a tree search-based GNN model that incorporates the broader KG +context. Our model adopts a message-passing scheme that probes the unreached +subtree regions to boost the past-oriented embeddings. In addition, we +introduce the Relation Frequency-Inverse Entity Frequency (RF-IEF) node +embedding that considers the global KG context to better characterize ambiguous +KG nodes. The general effectiveness of our approach is demonstrated through +experiments on three major multi-hop KGQA benchmark datasets, and our extensive +analyses further validate its expressiveness and robustness. Overall, NuTrea +provides a powerful means to query the KG with complex natural language +questions. Code is available at https://github.com/mlvlab/NuTrea. + +
+
+ comment: Neural Information Processing Systems (NeurIPS) 2023 +
+
+
+
+
+ + ☆ CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without + Full Large Language Model EMNLP 2023 + + +
+ Instruction tuning has recently been recognized as an effective way of +aligning Large Language Models (LLMs) to enhance their generalization ability +across various tasks. However, when tuning publicly accessible, centralized +LLMs with private instruction data, privacy concerns are inevitable. While +direct transfer of parameterized modules between models is a plausible approach +to address this, its implications and effectiveness need further exploration. +This paper focuses on Offsite-Tuning (OFT), a representative technique that +transfers transformer blocks between centralized LLMs and downstream emulators. +Given the limited understanding of the underlying mechanism of OFT, we perform +an empirical analysis on LLMs from the perspectives of representation and +functional similarity. Interestingly, our findings reveal a unique modular +structure within the layers of LLMs that appears to emerge as the model size +expands. Simultaneously, we note subtle but potentially significant changes in +representation and intermediate predictions across the layers. Inspired by +these observations, we propose CRaSh, involving Clustering, Removing, and +Sharing, a training-free strategy to derive improved emulators from LLMs. CRaSh +significantly boosts performance of OFT with billions of parameters. +Furthermore, we investigate the optimal solutions yielded by fine-tuning with +and without full model through the lens of loss landscape. Our findings +demonstrate a linear connectivity among these optima falling over the same +basin, thereby highlighting the effectiveness of CRaSh and OFT. The source code +is publicly available at https://github.com/TsinghuaC3I/CRaSh. + +
+
+ comment: Accepted to EMNLP 2023 (Main Conference) +
+
+
+
+
+ + ☆ Continual Event Extraction with Semantic Confusion Rectification EMNLP 2023 + + +
+ We study continual event extraction, which aims to extract incessantly +emerging event information while avoiding forgetting. We observe that the +semantic confusion on event types stems from the annotations of the same text +being updated over time. The imbalance between event types even aggravates this +issue. This paper proposes a novel continual event extraction model with +semantic confusion rectification. We mark pseudo labels for each sentence to +alleviate semantic confusion. We transfer pivotal knowledge between current and +previous models to enhance the understanding of event types. Moreover, we +encourage the model to focus on the semantics of long-tailed event types by +leveraging other associated types. Experimental results show that our model +outperforms state-of-the-art baselines and is proficient in imbalanced +datasets. + +
+
+ comment: Accepted in the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP 2023) +
+
+
+
+
+ + ☆ The Janus Interface: How Fine-Tuning in Large Language Models Amplifies + the Privacy Risks + + +
+ The era post-2018 marked the advent of Large Language Models (LLMs), with +innovations such as OpenAI's ChatGPT showcasing prodigious linguistic prowess. +As the industry galloped toward augmenting model parameters and capitalizing on +vast swaths of human language data, security and privacy challenges also +emerged. Foremost among these is the potential inadvertent accrual of Personal +Identifiable Information (PII) during web-based data acquisition, posing risks +of unintended PII disclosure. While strategies like RLHF during training and +Catastrophic Forgetting have been marshaled to control the risk of privacy +infringements, recent advancements in LLMs, epitomized by OpenAI's fine-tuning +interface for GPT-3.5, have reignited concerns. One may ask: can the +fine-tuning of LLMs precipitate the leakage of personal information embedded +within training datasets? This paper reports the first endeavor to seek the +answer to the question, particularly our discovery of a new LLM exploitation +avenue, called the Janus attack. In the attack, one can construct a PII +association task, whereby an LLM is fine-tuned using a minuscule PII dataset, +to potentially reinstate and reveal concealed PIIs. Our findings indicate that, +with a trivial fine-tuning outlay, LLMs such as GPT-3.5 can transition from +being impermeable to PII extraction to a state where they divulge a substantial +proportion of concealed PII. This research, through its deep dive into the +Janus attack vector, underscores the imperative of navigating the intricate +interplay between LLM utility and privacy preservation. + +
+
+
+
+
+ + ☆ Interpreting Answers to Yes-No Questions in User-Generated Content EMNLP 2023 + + +
+ Interpreting answers to yes-no questions in social media is difficult. Yes +and no keywords are uncommon, and the few answers that include them are rarely +to be interpreted as the keywords suggest. In this paper, we present a new +corpus of 4,442 yes-no question-answer pairs from Twitter. We discuss +linguistic characteristics of answers whose interpretation is yes or no, as +well as answers whose interpretation is unknown. We show that large language +models are far from solving this problem, even after fine-tuning and blending +other corpora for the same problem but outside social media. + +
+
+ comment: Accepted at the Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Facilitating Self-Guided Mental Health Interventions Through + Human-Language Model Interaction: A Case Study of Cognitive Restructuring + + +
+ Self-guided mental health interventions, such as "do-it-yourself" tools to +learn and practice coping strategies, show great promise to improve access to +mental health care. However, these interventions are often cognitively +demanding and emotionally triggering, creating accessibility barriers that +limit their wide-scale implementation and adoption. In this paper, we study how +human-language model interaction can support self-guided mental health +interventions. We take cognitive restructuring, an evidence-based therapeutic +technique to overcome negative thinking, as a case study. In an IRB-approved +randomized field study on a large mental health website with 15,531 +participants, we design and evaluate a system that uses language models to +support people through various steps of cognitive restructuring. Our findings +reveal that our system positively impacts emotional intensity for 67% of +participants and helps 65% overcome negative thoughts. Although adolescents +report relatively worse outcomes, we find that tailored interventions that +simplify language model generations improve overall effectiveness and equity. + +
+
+
+
+
+ + ☆ K-HATERS: A Hate Speech Detection Corpus in Korean with Target-Specific + Ratings EMNLP 2023 + + +
+ Numerous datasets have been proposed to combat the spread of online hate. +Despite these efforts, a majority of these resources are English-centric, +primarily focusing on overt forms of hate. This research gap calls for +developing high-quality corpora in diverse languages that also encapsulate more +subtle hate expressions. This study introduces K-HATERS, a new corpus for hate +speech detection in Korean, comprising approximately 192K news comments with +target-specific offensiveness ratings. This resource is the largest offensive +language corpus in Korean and is the first to offer target-specific ratings on +a three-point Likert scale, enabling the detection of hate expressions in +Korean across varying degrees of offensiveness. We conduct experiments showing +the effectiveness of the proposed corpus, including a comparison with existing +datasets. Additionally, to address potential noise and bias in human +annotations, we explore a novel idea of adopting the Cognitive Reflection Test, +which is widely used in social science for assessing an individual's cognitive +ability, as a proxy of labeling quality. Findings indicate that annotations +from individuals with the lowest test scores tend to yield detection models +that make biased predictions toward specific target groups and are less +accurate. This study contributes to the NLP research on hate speech detection +and resource construction. The code and dataset can be accessed at +https://github.com/ssu-humane/K-HATERS. + +
+
+ comment: 15 pages, EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts + and Rationales for Disambiguating Defeasible Social and Moral Situations EMNLP + + +
+ Moral or ethical judgments rely heavily on the specific contexts in which +they occur. Understanding varying shades of defeasible contextualizations +(i.e., additional information that strengthens or attenuates the moral +acceptability of an action) is critical to accurately represent the subtlety +and intricacy of grounded human moral judgment in real-life scenarios. + We introduce defeasible moral reasoning: a task to provide grounded contexts +that make an action more or less morally acceptable, along with commonsense +rationales that justify the reasoning. To elicit high-quality task data, we +take an iterative self-distillation approach that starts from a small amount of +unstructured seed knowledge from GPT-3 and then alternates between (1) +self-distillation from student models; (2) targeted filtering with a critic +model trained by human judgment (to boost validity) and NLI (to boost +diversity); (3) self-imitation learning (to amplify the desired data quality). +This process yields a student model that produces defeasible contexts with +improved validity, diversity, and defeasibility. From this model we distill a +high-quality dataset, \delta-Rules-of-Thumb, of 1.2M entries of +contextualizations and rationales for 115K defeasible moral actions rated +highly by human annotators 85.9% to 99.8% of the time. Using \delta-RoT we +obtain a final student model that wins over all intermediate student models by +a notable margin. + +
+
+ comment: Camera Ready EMNLP Findings 2023. First two authors contributed + equally +
+
+
+
+
+ + ☆ Beyond Sentiment: Leveraging Topic Metrics for Political Stance + Classification + + +
+ Sentiment analysis, widely critiqued for capturing merely the overall tone of +a corpus, falls short in accurately reflecting the latent structures and +political stances within texts. This study introduces topic metrics, dummy +variables converted from extracted topics, as both an alternative and +complement to sentiment metrics in stance classification. By employing three +datasets identified by Bestvater and Monroe (2023), this study demonstrates +BERTopic's proficiency in extracting coherent topics and the effectiveness of +topic metrics in stance classification. The experiment results show that +BERTopic improves coherence scores by 17.07% to 54.20% when compared to +traditional approaches such as Latent Dirichlet Allocation (LDA) and Non-negative +Matrix Factorization (NMF), prevalent in earlier political science research. +Additionally, our results indicate topic metrics outperform sentiment metrics +in stance classification, increasing performance by as much as 18.95%. Our +findings suggest topic metrics are especially effective for context-rich texts +and corpora where stance and sentiment correlations are weak. The combination of +sentiment and topic metrics achieves optimal performance in most of the +scenarios and can further address the limitations of relying solely on +sentiment as well as the low coherence score of topic metrics. + +
+
+
+
+
+ + ☆ The Mason-Alberta Phonetic Segmenter: A forced alignment system based on + deep neural networks and interpolation + + +
+ Forced alignment systems automatically determine boundaries between segments +in speech data, given an orthographic transcription. These tools are +commonplace in phonetics to facilitate the use of speech data that would be +infeasible to manually transcribe and segment. In the present paper, we +describe a new neural network-based forced alignment system, the Mason-Alberta +Phonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two +possible improvements we pursue for forced alignment systems. The first is +treating the acoustic model in a forced aligner as a tagging task, rather than +a classification task, motivated by the common understanding that segments in +speech are not truly discrete and commonly overlap. The second is an +interpolation technique to allow boundaries more precise than the common 10 ms +limit in modern forced alignment systems. We compare configurations of our +system to a state-of-the-art system, the Montreal Forced Aligner. The tagging +approach did not generally yield improved results over the Montreal Forced +Aligner. However, a system with the interpolation technique had a 27.92% +increase relative to the Montreal Forced Aligner in the amount of boundaries +within 10 ms of the target on the test set. We also reflect on the task and +training process for acoustic modeling in forced alignment, highlighting how +the output targets for these models do not match phoneticians' conception of +similarity between phones and that reconciliation of this tension may require +rethinking the task and output targets or how speech itself should be +segmented. + +
+
+ comment: submitted for publication +
+
+
+
+
+ + ☆ FANToM: A Benchmark for Stress-testing Machine Theory of Mind in + Interactions EMNLP 2023 + + +
+ Theory of mind (ToM) evaluations currently focus on testing models using +passive narratives that inherently lack interactivity. We introduce FANToM, a +new benchmark designed to stress-test ToM within information-asymmetric +conversational contexts via question answering. Our benchmark draws upon +important theoretical requisites from psychology and necessary empirical +considerations when evaluating large language models (LLMs). In particular, we +formulate multiple types of questions that demand the same underlying reasoning +to identify illusory or false sense of ToM capabilities in LLMs. We show that +FANToM is challenging for state-of-the-art LLMs, which perform significantly +worse than humans even with chain-of-thought reasoning or fine-tuning. + +
+
+ comment: EMNLP 2023. Code and dataset can be found here: + https://hyunw.kim/fantom +
+
+
+
+
+ + ☆ Let the Pretrained Language Models "Imagine" for Short Texts Topic + Modeling + + +
+ Topic models are one of the compelling methods for discovering latent +semantics in a document collection. However, they assume that a document has +sufficient co-occurrence information to be effective. In short texts, +co-occurrence information is minimal, which results in feature sparsity in +document representation. Therefore, existing topic models (probabilistic or +neural) mostly fail to mine patterns from them to generate coherent topics. In +this paper, we take a new approach to short-text topic modeling to address the +data-sparsity issue by extending short text into longer sequences using +existing pre-trained language models (PLMs). Besides, we provide a simple +solution extending a neural topic model to reduce the effect of noisy +out-of-topics text generation from PLMs. We observe that our model can +substantially improve the performance of short-text topic modeling. Extensive +experiments on multiple real-world datasets under extreme data sparsity +scenarios show that our models can generate high-quality topics outperforming +state-of-the-art models. + +
+
+
+
+
+ + ☆ Speakerly: A Voice-based Writing Assistant for Text Composition EMNLP 2023 + + +
+ We present Speakerly, a new real-time voice-based writing assistance system +that helps users with text composition across various use cases such as emails, +instant messages, and notes. The user can interact with the system through +instructions or dictation, and the system generates a well-formatted and +coherent document. We describe the system architecture and detail how we +address the various challenges while building and deploying such a system at +scale. More specifically, our system uses a combination of small, task-specific +models as well as pre-trained language models for fast and effective text +composition while supporting a variety of input modes for better usability. + +
+
+ comment: Accepted at EMNLP 2023 Industry Track +
+
+
+
+
+ + ☆ GlotLID: Language Identification for Low-Resource Languages EMNLP 2023 + + +
+ Several recent papers have published good solutions for language +identification (LID) for about 300 high-resource and medium-resource languages. +However, there is no LID available that (i) covers a wide range of low-resource +languages, (ii) is rigorously evaluated and reliable and (iii) efficient and +easy to use. Here, we publish GlotLID-M, an LID model that satisfies the +desiderata of wide coverage, reliability and efficiency. It identifies 1665 +languages, a large increase in coverage compared to prior work. In our +experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and +NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique +challenges that low-resource LID poses: incorrect corpus metadata, leakage from +high-resource languages, difficulty separating closely related languages, +handling of macrolanguage vs varieties and in general noisy data. We hope that +integrating GlotLID-M into dataset creation pipelines will improve quality and +enhance accessibility of NLP technology for low-resource languages and +cultures. GlotLID-M model, code, and list of data sources are available: +https://github.com/cisnlp/GlotLID. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ ZzzGPT: An Interactive GPT Approach to Enhance Sleep Quality + + +
+ In today's world, sleep quality is pivotal for overall well-being. While +wearable sensors offer real-time monitoring, they often lack actionable +insights, leading to user abandonment. This paper delves into the role of +technology in understanding sleep patterns. We introduce a two-stage framework, +utilizing Large Language Models (LLMs), aiming to provide accurate sleep +predictions with actionable feedback. Leveraging the GLOBEM dataset and +synthetic data from LLMs, we highlight enhanced results with models like +XGBoost. Our approach merges advanced machine learning with user-centric +design, blending scientific accuracy with practicality. + +
+
+
+
+
+ + ☆ Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting + Pre-trained Language Models EMNLP 2023 + + +
+ In this work, we propose a method that combines two popular research areas by +injecting linguistic structures into pre-trained language models in the +parameter-efficient fine-tuning (PEFT) setting. In our approach, parallel +adapter modules encoding different linguistic structures are combined using a +novel Mixture-of-Linguistic-Experts architecture, where Gumbel-Softmax gates +are used to determine the importance of these modules at each layer of the +model. To reduce the number of parameters, we first train the model for a fixed +small number of steps before pruning the experts based on their importance +scores. Our experiment results with three different pre-trained models show +that our approach can outperform state-of-the-art PEFT methods with a +comparable number of parameters. In addition, we provide additional analysis to +examine the experts selected by each model at each layer to provide insights +for future studies. + +
+
+ comment: 14 pages, 3 figures, Camera-Ready for EMNLP 2023 Findings (Long + Paper) +
+
+
+
+
+ + ☆ TiC-CLIP: Continual Training of CLIP Models + + +
+ Keeping large foundation models up to date on latest data is inherently +expensive. To avoid the prohibitive costs of constantly retraining, it is +imperative to continually train these models. This problem is exacerbated by +the lack of any large scale continual learning benchmarks or baselines. We +introduce the first set of web-scale Time-Continual (TiC) benchmarks for +training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps with +over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first +use our benchmarks to curate various dynamic evaluations to measure temporal +robustness of existing models. We show OpenAI's CLIP (trained on data up to +2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from +2021--2022 compared with more recently trained models in OpenCLIP repository. +We then study how to efficiently train models on time-continuous data. We +demonstrate that a simple rehearsal-based approach that continues training from +the last checkpoint and replays old data reduces compute by $2.5\times$ when +compared to the standard practice of retraining from scratch. + +
+
+
+
+
+ + ☆ CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset EMNLP 2023 + + +
+ The CoNLL-03 corpus is arguably the most well-known and utilized benchmark +dataset for named entity recognition (NER). However, prior works found +significant numbers of annotation errors, incompleteness, and inconsistencies +in the data. This poses challenges to objectively comparing NER approaches and +analyzing their errors, as current state-of-the-art models achieve F1-scores +that are comparable to or even exceed the estimated noise level in CoNLL-03. To +address this issue, we present a comprehensive relabeling effort assisted by +automatic consistency checking that corrects 7.0% of all labels in the English +CoNLL-03. Our effort adds a layer of entity linking annotation both for better +explainability of NER labels and as additional safeguard of annotation quality. +Our experimental evaluation finds not only that state-of-the-art approaches +reach significantly higher F1-scores (97.1%) on our data, but crucially that +the share of correct predictions falsely counted as errors due to annotation +noise drops from 47% to 6%. This indicates that our resource is well suited to +analyze the remaining errors made by state-of-the-art models, and that the +theoretical upper bound even on high resource, coarse-grained NER is not yet +reached. To facilitate such analysis, we make CleanCoNLL publicly available to +the research community. + +
+
+ comment: EMNLP 2023 camera-ready version +
+
+
+
+
+ + ☆ Knowledge Editing for Large Language Models: A Survey + + +
+ Large language models (LLMs) have recently transformed both the academic and +industrial landscapes due to their remarkable capacity to understand, analyze, +and generate texts based on their vast knowledge and reasoning ability. +Nevertheless, one major drawback of LLMs is their substantial computational +cost for pre-training due to their unprecedented amounts of parameters. The +disadvantage is exacerbated when new knowledge frequently needs to be +introduced into the pre-trained model. Therefore, it is imperative to develop +effective and efficient techniques to update pre-trained LLMs. Traditional +methods encode new knowledge in pre-trained LLMs through direct fine-tuning. +However, naively re-training LLMs can be computationally intensive and risks +degenerating valuable pre-trained knowledge irrelevant to the update in the +model. Recently, Knowledge-based Model Editing (KME) has attracted increasing +attention, which aims to precisely modify the LLMs to incorporate specific +knowledge, without negatively influencing other irrelevant knowledge. In this +survey, we aim to provide a comprehensive and in-depth overview of recent +advances in the field of KME. We first introduce a general formulation of KME +to encompass different KME strategies. Afterward, we provide an innovative +taxonomy of KME techniques based on how the new knowledge is introduced into +pre-trained LLMs, and investigate existing KME strategies while analyzing key +insights, advantages, and limitations of methods from each category. Moreover, +representative metrics, datasets, and applications of KME are introduced +accordingly. Finally, we provide an in-depth analysis regarding the +practicality and remaining challenges of KME and suggest promising research +directions for further advancement in this field. + +
+
+ comment: 31 pages +
+
+
+
+
+ + ☆ Background Summarization of Event Timelines EMNLP 2023 + + +
+ Generating concise summaries of news events is a challenging natural language +processing task. While journalists often curate timelines to highlight key +sub-events, newcomers to a news event face challenges in catching up on its +historical context. In this paper, we address this need by introducing the task +of background news summarization, which complements each timeline update with a +background summary of relevant preceding events. We construct a dataset by +merging existing timeline datasets and asking human annotators to write a +background summary for each timestep of each news event. We establish strong +baseline performance using state-of-the-art summarization systems and propose a +query-focused variant to generate background summaries. To evaluate background +summary quality, we present a question-answering-based evaluation metric, +Background Utility Score (BUS), which measures the percentage of questions +about a current event timestep that a background summary answers. Our +experiments show the effectiveness of instruction fine-tuned systems such as +Flan-T5, in addition to strong zero-shot performance using GPT-3.5. + +
+
+ comment: EMNLP 2023 camera-ready +
+
+
+
+
+ + ☆ Length is a Curse and a Blessing for Document-level Semantics EMNLP 2023 + + +
+ In recent years, contrastive learning (CL) has been extensively utilized to +recover sentence and document-level encoding capability from pre-trained +language models. In this work, we question the length generalizability of +CL-based models, i.e., their vulnerability towards length-induced semantic +shift. We verify not only that length vulnerability is a significant yet +overlooked research gap, but also that we can devise unsupervised CL methods solely +depending on the semantic signal provided by document length. We first derive +the theoretical foundations underlying length attacks, showing that elongating +a document would intensify the high intra-document similarity that is already +brought by CL. Moreover, we found that isotropy promised by CL is highly +dependent on the length range of text exposed in training. Inspired by these +findings, we introduce a simple yet universal document representation learning +framework, LA(SER)$^{3}$: length-agnostic self-reference for semantically +robust sentence representation learning, achieving state-of-the-art +unsupervised performance on the standard information retrieval benchmark. + +
+
+ comment: Accepted at EMNLP 2023. Our code is publicly available at + https://github.com/gowitheflow-1998/LA-SER-cubed +
+
+
+
+
+ + ☆ BLP 2023 Task 2: Sentiment Analysis EMNLP-23 + + +
+ We present an overview of the BLP Sentiment Shared Task, organized as part of +the inaugural BLP 2023 workshop, co-located with EMNLP 2023. The task is +defined as the detection of sentiment in a given piece of social media text. +This task attracted interest from 71 participants, among whom 29 and 30 teams +submitted systems during the development and evaluation phases, respectively. +In total, participants submitted 597 runs. However, only 15 teams +submitted system description papers. The approaches in the submitted +systems range from classical machine learning models and fine-tuning of pre-trained +models to leveraging Large Language Models (LLMs) in zero- and few-shot +settings. In this paper, we provide a detailed account of the task setup, +including dataset development and evaluation setup. Additionally, we provide a +brief overview of the systems submitted by the participants. All datasets and +evaluation scripts from the shared task have been made publicly available for +the research community, to foster further research in this domain. + +
+
+ comment: Accepted in BLP Workshop at EMNLP-23 +
+
+
+
+
+ + ☆ Hidden Citations Obscure True Impact in Science + + +
+ References, the mechanism scientists rely on to signal previous knowledge, +lately have turned into widely used and misused measures of scientific impact. +Yet, when a discovery becomes common knowledge, citations suffer from +obliteration by incorporation. This leads to the concept of hidden citation, +representing a clear textual credit to a discovery without a reference to the +publication embodying it. Here, we rely on unsupervised interpretable machine +learning applied to the full text of each paper to systematically identify +hidden citations. We find that for influential discoveries hidden citations +outnumber citation counts, emerging regardless of publishing venue and +discipline. We show that the prevalence of hidden citations is not driven by +citation counts, but rather by the degree of the discourse on the topic within +the text of the manuscripts, indicating that the more discussed is a discovery, +the less visible it is to standard bibliometric analysis. Hidden citations +indicate that bibliometric measures offer a limited perspective on quantifying +the true impact of a discovery, raising the need to extract knowledge from the +full text of the scientific corpus. + +
+
+
+
+
+ + ☆ Correction with Backtracking Reduces Hallucination in Summarization + + +
+ Abstractive summarization aims at generating natural language summaries of a +source document that are succinct while preserving the important elements. +Despite recent advances, neural text summarization models are known to be +susceptible to hallucinating (or, more correctly, confabulating), that is, to +produce summaries with details that are not grounded in the source document. In +this paper, we introduce a simple yet efficient technique, CoBa, to reduce +hallucination in abstractive summarization. The approach is based on two steps: +hallucination detection and mitigation. We show that the former can be achieved +through measuring simple statistics about conditional word probabilities and +distance to context words. Further, we demonstrate that straightforward +backtracking is surprisingly effective at mitigation. We thoroughly evaluate +the proposed method against prior art on three benchmark datasets for text +summarization. The results show that CoBa is effective and efficient in +reducing hallucination, and offers great adaptability and flexibility. + +
+
+
+
+
+ + ☆ WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task + + +
+ We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) +Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering +novel NER datasets (i.e., Wojood) and the definition of subtasks designed to +facilitate meaningful comparisons between different NER approaches. +WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45 +unique teams registered for this shared task, with 11 of them actively +participating in the test phase. Specifically, 11 teams participated in +FlatNER, while 8 teams tackled NestedNER. The winning teams achieved F1 +scores of 91.96 and 93.73 in FlatNER and NestedNER, respectively. + +
+
+
+
+
+ + ☆ PreWoMe: Exploiting Presuppositions as Working Memory for Long Form + Question Answering EMNLP 2023 + + +
+ Information-seeking questions in long-form question answering (LFQA) often +prove misleading due to ambiguity or false presupposition in the question. +While many existing approaches handle misleading questions, they are tailored +to limited questions, which are insufficient in a real-world setting with +unpredictable input characteristics. In this work, we propose PreWoMe, a +unified approach capable of handling any type of information-seeking question. +The key idea of PreWoMe involves extracting presuppositions in the question and +exploiting them as working memory to generate feedback and action about the +question. Our experiment shows that PreWoMe is effective not only in tackling +misleading questions but also in handling normal ones, thereby demonstrating +the effectiveness of leveraging presuppositions, feedback, and action for +real-world QA settings. + +
+
+ comment: 11 pages 3 figures, Accepted to EMNLP 2023 (short) +
+
+
+
+
+ + ☆ Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model + System for Answering Medical Questions using Scientific Literature + + +
+ The quickly-expanding nature of published medical literature makes it +challenging for clinicians and researchers to keep up with and summarize +recent, relevant findings in a timely manner. While several closed-source +summarization tools based on large language models (LLMs) now exist, rigorous +and systematic evaluations of their outputs are lacking. Furthermore, there is +a paucity of high-quality datasets and appropriate benchmark tasks with which +to evaluate these tools. We address these issues with four contributions: we +release Clinfo.ai, an open-source WebApp that answers clinical questions based +on dynamically retrieved scientific literature; we specify an information +retrieval and abstractive summarization task to evaluate the performance of +such retrieval-augmented LLM systems; we release a dataset of 200 questions and +corresponding answers derived from published systematic reviews, which we name +PubMed Retrieval and Synthesis (PubMedRS-200); and report benchmark results for +Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200. + +
+
+ comment: Preprint of an article published in Pacific Symposium on Biocomputing + copyright 2024 World Scientific Publishing Co., Singapore, + http://psb.stanford.edu/ +
+
+
+
+
+ + ☆ A Language Model with Limited Memory Capacity Captures Interference in + Human Sentence Processing EMNLP 2023 + + +
+ Two of the central factors believed to underpin human sentence processing +difficulty are expectations and retrieval from working memory. A recent attempt +to create a unified cognitive model integrating these two factors relied on the +parallels between the self-attention mechanism of transformer language models +and cue-based retrieval theories of working memory in human sentence processing +(Ryu and Lewis 2021). While Ryu and Lewis show that attention patterns in +specialized attention heads of GPT-2 are consistent with similarity-based +interference, a key prediction of cue-based retrieval models, their method +requires identifying syntactically specialized attention heads, and makes the +cognitively implausible assumption that hundreds of memory retrieval operations +take place in parallel. In the present work, we develop a recurrent neural +language model with a single self-attention head, which more closely parallels +the memory system assumed by cognitive theories. We show that our model's +single attention head captures semantic and syntactic interference effects +observed in human experiments. + +
+
+ comment: To appear in Findings of the Association for Computational + Linguistics: EMNLP 2023 +
+
+
+
+
+ + ☆ Can You Follow Me? Testing Situational Understanding in ChatGPT EMNLP 2023 + + +
+ Understanding sentence meanings and updating information states appropriately +across time -- what we call "situational understanding" (SU) -- is a critical +ability for human-like AI agents. SU is essential in particular for chat +models, such as ChatGPT, to enable consistent, coherent, and effective dialogue +between humans and AI. Previous works have identified certain SU limitations in +non-chatbot Large Language models (LLMs), but the extent and causes of these +limitations are not well understood, and capabilities of current chat-based +models in this domain have not been explored. In this work we tackle these +questions, proposing a novel synthetic environment for SU testing which allows +us to do controlled and systematic testing of SU in chat-oriented models, +through assessment of models' ability to track and enumerate environment +states. Our environment also allows for close analysis of dynamics of model +performance, to better understand underlying causes for performance patterns. +We apply our test to ChatGPT, the state-of-the-art chatbot, and find that +despite the fundamental simplicity of the task, the model's performance +reflects an inability to retain correct environment states across time. Our +follow-up analyses suggest that performance degradation is largely because +ChatGPT has non-persistent in-context memory (although it can access the full +dialogue history) and it is susceptible to hallucinated updates -- including +updates that artificially inflate accuracies. Our findings suggest overall that +ChatGPT is not currently equipped for robust tracking of situation states, and +that trust in the impressive dialogue performance of ChatGPT comes with risks. +We release the codebase for reproducing our test environment, as well as all +prompts and API responses from ChatGPT, at +https://github.com/yangalan123/SituationalTesting. + +
+
+ comment: EMNLP 2023 Main Paper (Camera Ready) +
+
+
+
+
+ + ☆ GenKIE: Robust Generative Multimodal Document Key Information Extraction EMNLP 2023 + + +
+ Key information extraction (KIE) from scanned documents has gained increasing +attention because of its applications in various domains. Although promising +results have been achieved by some recent KIE approaches, they are usually +built based on discriminative models, which lack the ability to handle optical +character recognition (OCR) errors and require laborious token-level labelling. +In this paper, we propose a novel generative end-to-end model, named GenKIE, to +address the KIE task. GenKIE is a sequence-to-sequence multimodal generative +model that utilizes multimodal encoders to embed visual, layout and textual +features and a decoder to generate the desired output. Well-designed prompts +are leveraged to incorporate the label semantics as the weakly supervised +signals and entice the generation of the key information. One notable advantage +of the generative model is that it enables automatic correction of OCR errors. +Besides, token-level granular annotation is not required. Extensive experiments +on multiple public real-world datasets show that GenKIE effectively generalizes +over different types of documents and achieves state-of-the-art results. Our +experiments also validate the model's robustness against OCR errors, making +GenKIE highly applicable in real-world scenarios. + +
+
+ comment: Accepted by EMNLP 2023, Findings paper +
+
+
+
+
+ + ☆ Octopus: A Multitask Model and Toolkit for Arabic Natural Language + Generation + + +
+ Understanding Arabic text and generating human-like responses is a +challenging endeavor. While many researchers have proposed models and solutions +for individual problems, there is an acute shortage of a comprehensive Arabic +natural language generation toolkit that is capable of handling a wide range of +tasks. In this work, we present a novel Arabic text-to-text Transformer model, +namely AraT5v2. Our new model is methodically trained on extensive and diverse +data, utilizing an extended sequence length of 2,048 tokens. We explore various +pretraining strategies including unsupervised, supervised, and joint +pretraining, under both single and multitask settings. Our models outperform +competitive baselines with large margins. We take our work one step further by +developing and publicly releasing Octopus, a Python-based package and +command-line toolkit tailored for eight Arabic generation tasks all exploiting +a single model. We release the models and the toolkit on our public repository. + +
+
+
+
+
+ + ☆ NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task + + +
+ We describe the findings of the fourth Nuanced Arabic Dialect Identification +Shared Task (NADI 2023). The objective of NADI is to help advance +state-of-the-art Arabic NLP by creating opportunities for teams of researchers +to collaboratively compete under standardized conditions. It does so with a +focus on Arabic dialects, offering novel datasets and defining subtasks that +allow for meaningful comparisons between different approaches. NADI 2023 +targeted both dialect identification (Subtask 1) and dialect-to-MSA machine +translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered +for the shared task, of whom 18 teams have participated (with 76 valid +submissions during test phase). Among these, 16 teams participated in Subtask +1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning +teams achieved 87.27 + F1 on Subtask 1, 14.76 Bleu in Subtask 2, and 21.10 Bleu in Subtask 3, +respectively. Results show that all three subtasks remain challenging, thereby +motivating future work in this area. We describe the methods employed by the +participating teams and briefly offer an outlook for NADI. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2210.09582 +
+
+
+
+
+ + ☆ Locally Differentially Private Document Generation Using Zero Shot + Prompting EMNLP 2023 + + +
+ Numerous studies have highlighted the privacy risks associated with +pretrained large language models. In contrast, our research offers a unique +perspective by demonstrating that pretrained large language models can +effectively contribute to privacy preservation. We propose a locally +differentially private mechanism called DP-Prompt, which leverages the power of +pretrained large language models and zero-shot prompting to counter author +de-anonymization attacks while minimizing the impact on downstream utility. +When DP-Prompt is used with a powerful language model like ChatGPT (gpt-3.5), +we observe a notable reduction in the success rate of de-anonymization attacks, +showing that it surpasses existing approaches by a considerable margin despite +its simpler design. For instance, in the case of the IMDB dataset, DP-Prompt +(with ChatGPT) perfectly recovers the clean sentiment F1 score while achieving +a 46% reduction in author identification F1 score against static attackers and +a 26% reduction against adaptive attackers. We conduct extensive experiments +across six open-source large language models, ranging up to 7 billion +parameters, to analyze various effects of the privacy-utility tradeoff. + +
+
+ comment: Accepted at EMNLP 2023 (Findings) +
+
+
+
+
+ + ☆ CR-COPEC: Causal Rationale of Corporate Performance Changes to Learn + from Financial Reports EMNLP 2023 + + +
+ In this paper, we introduce CR-COPEC (Causal Rationale of Corporate +Performance Changes), built from financial reports. This is a comprehensive large-scale +domain-adaptation causal sentence dataset for detecting changes in corporate +financial performance. CR-COPEC makes two major contributions. First, it +identifies causal rationale in 10-K annual reports of U.S. companies, which +contain experts' causal analysis following accounting standards in a formal +manner. This dataset can be widely used by both individual investors and +analysts as a material information resource for investing and decision making +without the tremendous effort of reading through all the documents. Second, it +carefully considers different characteristics which affect the financial +performance of companies in twelve industries. As a result, CR-COPEC can +distinguish causal sentences in various industries by taking unique narratives +in each industry into consideration. We also provide an extensive analysis of +how well the CR-COPEC dataset is constructed and suited for classifying target +sentences as causal ones with respect to industry characteristics. Our dataset +and experimental code are publicly available. + +
+
+ comment: Accepted in Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Baby Llama: knowledge distillation from an ensemble of teachers trained + on a small dataset with no performance penalty CoNLL + + +
+ We present our submission to the BabyLM challenge, whose goal was to improve +the sample efficiency of language models. We trained an ensemble consisting of +a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word +BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, +which exceeds in performance both of its teachers as well as a similar model +trained without distillation. This suggests that distillation can not only +retain the full performance of the teacher model when the latter is trained on +a sufficiently small dataset; it can exceed it, and lead to significantly +better performance than direct training. + +
+
+ comment: 11 pages, 4 figures, 4 tables, submitted to the BabyLM Challenge and + accepted as archival full paper (CoNLL--CMCL 2023 Shared Task), checkpoint + available at https://huggingface.co/timinar/baby-llama-58m, training code + available at https://github.com/timinar/BabyLlama +
+
+
+
+
+ + ♻ ☆ Dolphin: A Challenging and Diverse Benchmark for Arabic NLG + + +
+ We present Dolphin, a novel benchmark that addresses the need for a natural +language generation (NLG) evaluation framework dedicated to the wide collection +of Arabic languages and varieties. The proposed benchmark encompasses a broad +range of 13 different NLG tasks, including dialogue generation, question +answering, machine translation, summarization, among others. Dolphin comprises +a substantial corpus of 40 diverse and representative public datasets across 50 +test splits, carefully curated to reflect real-world scenarios and the +linguistic richness of Arabic. It sets a new standard for evaluating the +performance and generalization capabilities of Arabic and multilingual models, +promising to enable researchers to push the boundaries of current +methodologies. We provide an extensive analysis of Dolphin, highlighting its +diversity and identifying gaps in current Arabic NLG research. We also offer a +public leaderboard that is both interactive and modular and evaluate several +models on our benchmark, allowing us to set strong baselines against which +researchers can compare. + +
+
+
+
+
+ + ♻ ☆ Data Selection for Language Models via Importance Resampling NeurIPS 2023 + + +
+ Selecting a suitable pretraining dataset is crucial for both general-domain +(e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We +formalize this problem as selecting a subset of a large raw unlabeled dataset +to match a desired target distribution given some unlabeled target samples. Due +to the large scale and dimensionality of the raw text data, existing methods +use simple heuristics or use experts to manually curate data. Instead, we +extend the classic importance resampling approach used in low-dimensions for LM +data selection. We propose Data Selection with Importance Resampling (DSIR), an +efficient and scalable framework that estimates importance weights in a reduced +feature space for tractability and selects data with importance resampling +according to these weights. To determine an appropriate feature space, we show +that KL reduction, a data metric that measures the proximity between selected +pretraining data and the target in a feature space, has high correlation with +average downstream accuracy (r=0.89) when computed with simple n-gram features. +This motivates our instantiation of DSIR using n-gram features. When performing +continued pretraining towards a specific domain, DSIR performs comparably to +expert curation across 8 target distributions. When pretraining general-domain +models (target is Wikipedia + books), DSIR improves over random selection and +heuristic filtering baselines by 2-2.5% on the GLUE benchmark. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Neural Attention: Enhancing QKV Calculation in Self-Attention Mechanism + with Neural Networks + + +
+ In the realm of deep learning, the self-attention mechanism has substantiated +its pivotal role across a myriad of tasks, encompassing natural language +processing and computer vision. Despite achieving success across diverse +applications, the traditional self-attention mechanism primarily leverages +linear transformations for the computation of query, key, and value (QKV), +which may not invariably be the optimal choice under specific circumstances. +This paper explores a novel methodology for QKV computation: implementing a +specially designed neural network structure for the calculation. Utilizing a +modified Marian model, we conducted experiments on the IWSLT 2017 +German-English translation task dataset and compared our method with the +conventional approach. The experimental results show a significant +improvement in BLEU scores with our method. Furthermore, our approach also +proved superior when training the RoBERTa model with the Wikitext-103 +dataset, reflecting a notable reduction in model perplexity compared to its +original counterpart. These experimental outcomes not only validate the +efficacy of our method but also reveal the immense potential in optimizing the +self-attention mechanism through neural network-based QKV computation, paving +the way for future research and practical applications. The source code and +implementation details for our proposed method can be accessed at +https://github.com/ocislyjrti/NeuralAttention. + +
+
+ comment: Updated the formulas in Section 3.2 "Detailed Methodology" and + revised Section 2 "Background" for clarity and accuracy +
+
+
+
+
+ + ♻ ☆ Towards Understanding Sycophancy in Language Models + + +
+ Reinforcement learning from human feedback (RLHF) is a popular technique for +training high-quality AI assistants. However, RLHF may also encourage model +responses that match user beliefs over truthful responses, a behavior known as +sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models +and whether human preference judgements are responsible. We first demonstrate +that five state-of-the-art AI assistants consistently exhibit sycophantic +behavior across four varied free-form text-generation tasks. To understand if +human preferences drive this broadly observed behavior of RLHF models, we +analyze existing human preference data. We find that when a response matches a +user's views, it is more likely to be preferred. Moreover, both humans and +preference models (PMs) prefer convincingly-written sycophantic responses over +correct ones a non-negligible fraction of the time. Optimizing model outputs +against PMs also sometimes sacrifices truthfulness in favor of sycophancy. +Overall, our results indicate that sycophancy is a general behavior of RLHF +models, likely driven in part by human preference judgements favoring +sycophantic responses. + +
+
+ comment: 32 pages, 20 figures +
+
+
+
+
+ + ♻ ☆ Learning from Mistakes via Cooperative Study Assistant for Large + Language Models EMNLP 2023 + + +
+ Large language models (LLMs) have demonstrated their potential to refine +their generation based on their own feedback. However, the feedback from the LLM +itself is often inaccurate, thereby limiting its benefits. In this paper, we +propose Study Assistant for Large LAnguage Model (SALAM), a novel framework +with an auxiliary agent to assist the main LLM in learning from mistakes +through interactive cooperation. In the gathering phase, the study assistant +agent probes the main LLM, analyzes its errors, and collects the interaction in +a mistake memory. During the examination phase, the study assistant provides +guidelines by retrieving relevant cases to help the main LLM anticipate and +avoid similar errors. We first investigate the effectiveness of a general study +assistant and then customize it to provide LLM-specific guidance through +imitation learning from successful guidance experiences. Our experiments on +three LLMs using two challenging frameworks demonstrate that SALAM can +significantly boost LLMs by an accuracy margin of up to 6.6 on BBH and 12.6 on +BBQ. + +
+
+ comment: Accepted by EMNLP 2023 main conference +
+
+
+
+
+ + ♻ ☆ RADAR: Robust AI-Text Detection via Adversarial Learning NeurIPS 2023 + + +
+ Recent advances in large language models (LLMs) and the intensifying +popularity of ChatGPT-like applications have blurred the boundary of +high-quality text generation between humans and machines. However, in addition +to the anticipated revolutionary changes to our technology and society, the +difficulty of distinguishing LLM-generated texts (AI-text) from human-generated +texts poses new challenges of misuse and fairness, such as fake content +generation, plagiarism, and false accusations of innocent writers. While +existing works show that current AI-text detectors are not robust to LLM-based +paraphrasing, this paper aims to bridge this gap by proposing a new framework +called RADAR, which jointly trains a robust AI-text detector via adversarial +learning. RADAR is based on adversarial training of a paraphraser and a +detector. The paraphraser's goal is to generate realistic content to evade +AI-text detection. RADAR uses the feedback from the detector to update the +paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly +2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, +experimental results show that RADAR significantly outperforms existing AI-text +detection methods, especially when paraphrasing is in place. We also identify +the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, +and evaluate the improved capability of RADAR via GPT-3.5-Turbo. + +
+
+ comment: Accepted by NeurIPS 2023. Project page and demos: + https://radar.vizhub.ai +
+
+
+
+
+ + ♻ ☆ Ask Language Model to Clean Your Noisy Translation Data EMNLP 2023 + + +
+ Transformer models have demonstrated remarkable performance in neural machine +translation (NMT). However, their vulnerability to noisy input poses a +significant challenge in practical implementation, where generating clean +output from noisy input is crucial. The MTNT dataset is widely used as a +benchmark for evaluating the robustness of NMT models against noisy input. +Nevertheless, its utility is limited due to the presence of noise in both the +source and target sentences. To address this limitation, we focus on cleaning +the noise from the target sentences in MTNT, making it more suitable as a +benchmark for noise evaluation. Leveraging the capabilities of large language +models (LLMs), we observe their impressive abilities in noise removal. For +example, they can remove emojis while considering their semantic meaning. +Additionally, we show that LLM can effectively rephrase slang, jargon, and +profanities. The resulting datasets, called C-MTNT, exhibit significantly less +noise in the target sentences while preserving the semantic integrity of the +original sentences. Our human and GPT-4 evaluations also lead to a consistent +conclusion that LLM performs well on this task. Lastly, experiments on C-MTNT +showcased its effectiveness in evaluating the robustness of NMT models, +highlighting the potential of advanced language models for data cleaning and +emphasizing C-MTNT as a valuable resource. + +
+
+ comment: EMNLP 2023, Findings +
+
+
+
+
+ + ♻ ☆ PuoBERTa: Training and evaluation of a curated language model for + Setswana + + +
+ Natural language processing (NLP) has made significant progress for +well-resourced languages such as English but lagged behind for low-resource +languages like Setswana. This paper addresses this gap by presenting PuoBERTa, +a customised masked language model trained specifically for Setswana. We cover +how we collected, curated, and prepared diverse monolingual texts to generate a +high-quality corpus for PuoBERTa's training. Building upon previous efforts in +creating monolingual resources for Setswana, we evaluated PuoBERTa across +several NLP tasks, including part-of-speech (POS) tagging, named entity +recognition (NER), and news categorisation. Additionally, we introduced a new +Setswana news categorisation dataset and provided the initial benchmarks using +PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP +capabilities for understudied languages like Setswana and paves the way for +future research directions. + +
+
+ comment: Accepted for SACAIR 2023 +
+
+
+
+
+ + ♻ ☆ Air-Decoding: Attribute Distribution Reconstruction for Decoding-Time + Controllable Text Generation EMNLP 2023 + + +
+ Controllable text generation (CTG) aims to generate text with desired +attributes, and decoding-time-based methods have shown promising performance on +this task. However, in this paper, we identify the phenomenon of Attribute +Collapse for the first time. It causes the fluency of generated text to rapidly +decrease when the control strength exceeds a critical value, rendering the text +completely unusable. This limitation hinders the effectiveness of decoding +methods in achieving high levels of controllability. To address this problem, +we propose a novel lightweight decoding framework named Air-Decoding. Its main +idea is reconstructing the attribute distributions to balance the weights +between attribute words and non-attribute words to generate more fluent text. +Specifically, we train prefixes by prefix-tuning to obtain attribute +distributions. Then we design a novel attribute distribution reconstruction +method to balance the obtained distributions and use the reconstructed +distributions to guide language models for generation, effectively avoiding the +issue of Attribute Collapse. Experiments on multiple CTG tasks prove that our +method achieves a new state-of-the-art control performance. + +
+
+ comment: Accepted as an EMNLP 2023 main paper +
+
+
+
+
+ + ♻ ☆ AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised + Features for Audio-Visual Speech Enhancement ICASSP 2024 + + +
+ Speech enhancement systems are typically trained using pairs of clean and +noisy speech. In audio-visual speech enhancement (AVSE), there is not as much +ground-truth clean data available; most audio-visual datasets are collected in +real-world environments with background noise and reverberation, hampering the +development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based +audio-visual speech enhancement approach that can generate clean speech despite +the challenges of real-world training data. We obtain a subset of nearly clean +speech from an audio-visual corpus using a neural quality estimator, and then +train a diffusion model on this subset to generate waveforms conditioned on +continuous speech representations from AV-HuBERT with noise-robust training. We +use continuous rather than discrete representations to retain prosody and +speaker information. With this vocoding task alone, the model can perform +speech enhancement better than a masking-based baseline. We further fine-tune +the diffusion model on clean/noisy utterance pairs to improve the performance. +Our approach outperforms a masking-based baseline in terms of both automatic +metrics and a human listening test and is close in quality to the target speech +in the listening test. Audio samples can be found at +https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html. + +
+
+ comment: Submitted to ICASSP 2024 +
+
+
+
+
+ + ♻ ☆ Zero-Shot Cross-Lingual Summarization via Large Language Models EMNLP + 2023 + + +
+ Given a document in a source language, cross-lingual summarization (CLS) aims +to generate a summary in a different target language. Recently, the emergence +of Large Language Models (LLMs), such as GPT-3.5, ChatGPT and GPT-4, has +attracted wide attention from the computational linguistics community. However, +the performance of LLMs on CLS is not yet known. In this report, we +empirically use various prompts to guide LLMs to perform zero-shot CLS from +different paradigms (i.e., end-to-end and pipeline), and provide a preliminary +evaluation on the generated summaries. We find that ChatGPT and GPT-4 +originally prefer to produce lengthy summaries with detailed information. These +two LLMs can further balance informativeness and conciseness with the help of +an interactive prompt, significantly improving their CLS performance. +Experimental results on three widely-used CLS datasets show that GPT-4 achieves +state-of-the-art zero-shot CLS performance, and performs competitively compared +with the fine-tuned mBART-50. Moreover, we also find some multi-lingual and +bilingual LLMs (i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan) have limited +zero-shot CLS ability. Due to the composite nature of CLS, which requires +models to perform summarization and translation simultaneously, accomplishing +this task in a zero-shot manner is a challenge even for LLMs. Therefore, we +sincerely hope and recommend future LLM research could use CLS as a testbed. + +
+
+ comment: Both first authors contributed equally. Technical Report, 12 pages. + Accepted to the 4th New Frontiers in Summarization Workshop (NewSumm@EMNLP + 2023) +
+
+
+
+
+ + ♻ ☆ SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented + Dialogue Agents NeurIPS 2023 + + +
+ Task-oriented dialogue (TOD) models have made significant progress in recent +years. However, previous studies primarily focus on datasets written by +annotators, which has resulted in a gap between academic research and +real-world spoken conversation scenarios. While several small-scale spoken TOD +datasets are proposed to address robustness issues such as ASR errors, they +ignore the unique challenges in spoken conversation. To tackle the limitations, +we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD, +containing 8 domains, 203k turns, 5.7k dialogues and 249 hours of audios from +human-to-human spoken conversations. SpokenWOZ further incorporates common +spoken characteristics such as word-by-word processing and reasoning in spoken +language. Based on these characteristics, we present cross-turn slot and +reasoning slot detection as new challenges. We conduct experiments on various +baselines, including text-modal models, newly proposed dual-modal models, and +LLMs, e.g., ChatGPT. The results show that the current models still have +substantial room for improvement in spoken conversation, where the most +advanced dialogue state tracker only achieves 25.65% in joint goal accuracy and +the SOTA end-to-end model only correctly completes the user request in 52.1% of +dialogues. The dataset, code, and leaderboard are available: +https://spokenwoz.github.io/SpokenWOZ-github.io/. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative + Language Models EMNLP 2023 + + +
+ Automated theorem proving (ATP) has become an appealing domain for exploring +the reasoning ability of the recent successful generative language models. +However, current ATP benchmarks mainly focus on symbolic inference, but rarely +involve the understanding of complex number combination reasoning. In this +work, we propose TRIGO, an ATP benchmark that not only requires a model to +reduce a trigonometric expression with step-by-step proofs but also evaluates a +generative LM's reasoning ability on formulas and its capability to manipulate, +group, and factor number terms. We gather trigonometric expressions and their +reduced forms from the web, annotate the simplification process manually, and +translate it into the Lean formal language system. We then automatically +generate additional examples from the annotated samples to expand the dataset. +Furthermore, we develop an automatic generator based on Lean-Gym to create +dataset splits of varying difficulties and distributions in order to thoroughly +analyze the model's generalization ability. Our extensive experiments show our +proposed TRIGO poses a new challenge for advanced generative LMs, including +GPT-4, which is pre-trained on a considerable amount of open-source formal +theorem-proving language data, and provides a new tool to study the generative +LM's ability on both formal and mathematical reasoning. + +
+
+ comment: Accepted by EMNLP 2023. Code is available at + https://github.com/menik1126/TRIGO +
+
+
+
+
+ + ♻ ☆ Meta learning with language models: Challenges and opportunities in the + classification of imbalanced text + + +
+ Detecting out of policy speech (OOPS) content is important but difficult. +While machine learning is a powerful tool to tackle this challenging task, it +is hard to break the performance ceiling due to factors like quantity and +quality limitations on training data and inconsistencies in OOPS definition and +data labeling. To realize the full potential of available limited resources, we +propose a meta learning technique (MLT) that combines individual models built +with different text representations. We analytically show that the resulting +technique is numerically stable and produces reasonable combining weights. We +combine the MLT with a threshold-moving (TM) technique to further improve the +performance of the combined predictor on highly-imbalanced in-distribution and +out-of-distribution datasets. We also provide computational results to show the +statistically significant advantages of the proposed MLT approach. + All authors contributed equally to this work. + +
+
+ comment: 22 pages, including 5 figures, 12 tables, 1 appendix +
+
+
+
+
+ + ♻ ☆ SCL-RAI: Span-based Contrastive Learning with Retrieval Augmented + Inference for Unlabeled Entity Problem in NER COLING 2022 + + +
+ Named Entity Recognition is the task to locate and classify the entities in +the text. However, Unlabeled Entity Problem in NER datasets seriously hinders +the improvement of NER performance. This paper proposes SCL-RAI to cope with +this problem. Firstly, we decrease the distance of span representations with +the same label while increasing it for different ones via span-based +contrastive learning, which relieves the ambiguity among entities and improves +the robustness of the model over unlabeled entities. Then we propose retrieval +augmented inference to mitigate the decision boundary shifting problem. Our +method significantly outperforms the previous SOTA method by 4.21% and 8.64% +F1-score on two real-world datasets. + +
+
+ comment: COLING 2022 +
+
+
+
+
+ + ♻ ☆ AnglE-optimized Text Embeddings + + +
+ High-quality text embedding is pivotal in improving semantic textual +similarity (STS) tasks, which are crucial components in Large Language Model +(LLM) applications. However, a common challenge existing text embedding models +face is the problem of vanishing gradients, primarily due to their reliance on +the cosine function in the optimization objective, which has saturation zones. +To address this issue, this paper proposes a novel angle-optimized text +embedding model called AnglE. The core idea of AnglE is to introduce angle +optimization in a complex space. This novel approach effectively mitigates the +adverse effects of the saturation zone in the cosine function, which can impede +gradient and hinder optimization processes. To set up a comprehensive STS +evaluation, we experimented on existing short-text STS datasets and a newly +collected long-text STS dataset from GitHub Issues. Furthermore, we examine +domain-specific STS scenarios with limited labeled data and explore how AnglE +works with LLM-annotated data. Extensive experiments were conducted on various +tasks including short-text STS, long-text STS, and domain-specific STS tasks. +The results show that AnglE outperforms the state-of-the-art (SOTA) STS models +that ignore the cosine saturation zone. These findings demonstrate the ability +of AnglE to generate high-quality text embeddings and the usefulness of angle +optimization in STS. + +
+
+ comment: update results and add non-STS transfer tasks +
+
+
+
+
+ + ♻ ☆ Overview of ImageArg-2023: The First Shared Task in Multimodal Argument + Mining EMNLP + + +
+ This paper presents an overview of the ImageArg shared task, the first +multimodal Argument Mining shared task co-located with the 10th Workshop on +Argument Mining at EMNLP 2023. The shared task comprises two classification +subtasks - (1) Subtask-A: Argument Stance Classification; (2) Subtask-B: Image +Persuasiveness Classification. The former determines the stance of a tweet +containing an image and a piece of text toward a controversial topic (e.g., gun +control and abortion). The latter determines whether the image makes the tweet +text more persuasive. The shared task received 31 submissions for Subtask-A and +21 submissions for Subtask-B from 9 different teams across 6 countries. The top +submission in Subtask-A achieved an F1-score of 0.8647 while the best +submission in Subtask-B achieved an F1-score of 0.5561. + +
+
+ comment: In The 10th Argument Mining Workshop, held in conjunction with The + Conference on Empirical Methods in Natural Language Processing (EMNLP), + December 2023 +
+
+
+
+
+ + ♻ ☆ Is ChatGPT a Good NLG Evaluator? A Preliminary Study EMNLP + 2023 + + +
+ Recently, the emergence of ChatGPT has attracted wide attention from the +computational linguistics community. Many prior studies have shown that ChatGPT +achieves remarkable performance on various NLP tasks in terms of automatic +evaluation metrics. However, the ability of ChatGPT to serve as an evaluation +metric is still underexplored. Considering assessing the quality of natural +language generation (NLG) models is an arduous task and NLG metrics notoriously +show their poor correlation with human judgments, we wonder whether ChatGPT is +a good NLG evaluation metric. In this report, we provide a preliminary +meta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail, +we regard ChatGPT as a human evaluator and give task-specific (e.g., +summarization) and aspect-specific (e.g., relevance) instruction to prompt +ChatGPT to evaluate the generated results of NLG models. We conduct experiments +on five NLG meta-evaluation datasets (including summarization, story generation +and data-to-text tasks). Experimental results show that compared with previous +automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation +with human judgments in most cases. In addition, we find that the effectiveness +of the ChatGPT evaluator might be influenced by the creation method of the +meta-evaluation datasets. For the meta-evaluation datasets which are created +greatly depending on the reference and thus are biased, the ChatGPT evaluator +might lose its effectiveness. We hope our preliminary study could prompt the +emergence of a general-purposed reliable NLG metric. + +
+
+ comment: Both first authors contributed equally. Technical Report, 11 pages. + Accepted to the 4th New Frontiers in Summarization Workshop (NewSumm@EMNLP + 2023) +
+
+
+
+
+ + ♻ ☆ nanoT5: A PyTorch Framework for Pre-training and Fine-tuning T5-style + Models with Limited Resources + + +
+ State-of-the-art language models like T5 have revolutionized the NLP +landscape, but their computational demands hinder a large portion of the +research community. To address this challenge, we present nanoT5, a +specially-optimized PyTorch framework for efficient pre-training and +fine-tuning of T5 models. Drawing on insights from optimizer differences and +prioritizing efficiency, nanoT5 allows a T5-Base model to be pre-trained on a +single GPU in just 16 hours, without any loss in performance. With the +introduction of this open-source framework, we hope to widen the accessibility +to language modelling research and cater to the community's demand for more +user-friendly T5 (Encoder-Decoder) implementations. We make our contributions, +including configurations, codebase, pre-training insights, and pre-trained +models, available to the public. + +
+
+ comment: To appear at 3rd Workshop for Natural Language Processing Open Source + Software +
+
+
+
+
+ + ♻ ☆ Elaborative Simplification as Implicit Questions Under Discussion EMNLP + 2023 + + +
+ Automated text simplification, a technique useful for making text more +accessible to people such as children and emergent bilinguals, is often thought +of as a monolingual translation task from complex sentences to simplified +sentences using encoder-decoder models. This view fails to account for +elaborative simplification, where new information is added into the simplified +text. This paper proposes to view elaborative simplification through the lens +of the Question Under Discussion (QUD) framework, providing a robust way to +investigate what writers elaborate upon, how they elaborate, and how +elaborations fit into the discourse context by viewing elaborations as explicit +answers to implicit questions. We introduce ElabQUD, consisting of 1.3K +elaborations accompanied with implicit QUDs, to study these phenomena. We show +that explicitly modeling QUD (via question generation) not only provides +essential understanding of elaborative simplification and how the elaborations +connect with the rest of the discourse, but also substantially improves the +quality of elaboration generation. + +
+
+ comment: Equal contribution by Yating Wu and William Sheffield. This is the EMNLP + 2023 Main camera-ready version +
+
+
+
+
+ + ♻ ☆ Document-Level Machine Translation with Large Language Models + + +
+ Large language models (LLMs) such as ChatGPT can produce coherent, cohesive, +relevant, and fluent answers for various natural language processing (NLP) +tasks. Taking document-level machine translation (MT) as a testbed, this paper +provides an in-depth evaluation of LLMs' ability on discourse modeling. The +study focuses on three aspects: 1) Effects of Context-Aware Prompts, where we +investigate the impact of different prompts on document-level translation +quality and discourse phenomena; 2) Comparison of Translation Models, where we +compare the translation performance of ChatGPT with commercial MT systems and +advanced document-level MT methods; 3) Analysis of Discourse Modelling +Abilities, where we further probe discourse knowledge encoded in LLMs and shed +light on impacts of training techniques on discourse modeling. By evaluating on +a number of benchmarks, we surprisingly find that LLMs have demonstrated +superior performance and show potential to become a new paradigm for +document-level translation: 1) leveraging their powerful long-text modeling +capabilities, GPT-3.5 and GPT-4 outperform commercial MT systems in terms of +human evaluation; 2) GPT-4 demonstrates a stronger ability for probing +linguistic knowledge than GPT-3.5. This work highlights the challenges and +opportunities of LLMs for MT, which we hope can inspire the future design and +evaluation of LLMs. We release our data and annotations at +https://github.com/longyuewangdcu/Document-MT-LLM. + +
+
+ comment: Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang are equal + contributors +
+
+
+
+
+ + ♻ ☆ OPT-R: Exploring the Role of Explanations in Finetuning and Prompting + for Reasoning Skills of Large Language Models ACL 2023 + + +
+ In this paper, we conduct a thorough investigation into the reasoning +capabilities of Large Language Models (LLMs), focusing specifically on the Open +Pretrained Transformers (OPT) models as a representative of such models. Our +study entails finetuning three different sizes of OPT on a carefully curated +reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned +without explanations, and OPT-RE, finetuned with explanations. We then evaluate +all models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS +benchmark, covering 26 distinct reasoning skills, utilizing three prompting +techniques. Through a comprehensive grid of 27 configurations and 6,156 test +evaluations, we investigate the dimensions of finetuning, prompting, and scale +to understand the role of explanations on different reasoning skills. Our +findings reveal that having explanations in the fewshot exemplar has no +significant impact on the model's performance when the model is finetuned, +while positively affecting the non-finetuned counterpart. Moreover, we observe +a slight yet consistent increase in classification accuracy as we incorporate +explanations during prompting and finetuning, respectively. Finally, we offer +insights on which skills benefit the most from incorporating explanations +during finetuning and prompting, such as Numerical (+20.4%) and Analogical +(+13.9%) reasoning, as well as skills that exhibit negligible or negative +effects. + +
+
+ comment: Proceedings of the 1st Workshop on Natural Language Reasoning and + Structured Explanations (NLRSE) at ACL 2023 +
+
+
+
+
+ + ♻ ☆ Multilingual Pixel Representations for Translation and Effective + Cross-lingual Transfer EMNLP 2023 + + +
+ We introduce and demonstrate how to effectively train multilingual machine +translation models with pixel representations. We experiment with two different +data settings with a variety of language and script coverage, demonstrating +improved performance compared to subword embeddings. We explore various +properties of pixel representations such as parameter sharing within and across +scripts to better understand where they lead to positive transfer. We observe +that these properties not only enable seamless cross-lingual transfer to unseen +scripts, but make pixel representations more data-efficient than alternatives +such as vocabulary expansion. We hope this work contributes to more extensible +multilingual models for all languages and scripts. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Evaluating Hallucinations in Chinese Large Language Models + + +
+ In this paper, we establish a benchmark named HalluQA (Chinese Hallucination +Question-Answering) to measure the hallucination phenomenon in Chinese large +language models. HalluQA contains 450 meticulously designed adversarial +questions, spanning multiple domains, and takes into account Chinese historical +culture, customs, and social phenomena. During the construction of HalluQA, we +consider two types of hallucinations: imitative falsehoods and factual errors, +and we construct adversarial samples based on GLM-130B and ChatGPT. For +evaluation, we design an automated evaluation method using GPT-4 to judge +whether a model output is hallucinated. We conduct extensive experiments on 24 +large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk, +etc. Out of the 24 models, 18 achieved non-hallucination rates lower than +50%. This indicates that HalluQA is highly challenging. We analyze the primary +types of hallucinations in different types of models and their causes. +Additionally, we discuss which types of hallucinations should be prioritized +for different types of models. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ♻ ☆ Meta-learning For Vision-and-language Cross-lingual Transfer EMNLP2023 + + +
+ Current pre-trained vision-language models (PVLMs) achieve excellent +performance on a range of multi-modal datasets. Recent work has aimed at +building multilingual models, and a range of novel multilingual multi-modal +datasets have been proposed. Current PVLMs typically perform poorly on these +datasets when used for multi-modal zero-shot or few-shot cross-lingual +transfer, especially for low-resource languages. To alleviate this problem, we +propose a novel meta-learning fine-tuning framework. Our framework makes +current PVLMs rapidly adaptive to new languages in vision-language scenarios by +designing MAML in a cross-lingual multi-modal manner. Experiments show that our +method boosts the performance of current state-of-the-art PVLMs in both +zero-shot and few-shot cross-lingual transfer on a range of vision-language +understanding tasks and datasets (XVNLI, xGQA, MaRVL, xFlickr&CO). + +
+
+ comment: MRL2023 (co-located with EMNLP2023) +
+
+
+
+
+ + ♻ ☆ CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a + Context Synergized Hyperbolic Network EMNLP 2023 + + +
+ The tremendous growth of social media users interacting in online +conversations has led to significant growth in hate speech, affecting people +from various demographics. Most of the prior works focus on detecting explicit +hate speech, which is overt and leverages hateful phrases, with very little +work focusing on detecting hate speech that is implicit or denotes hatred +through indirect or coded language. In this paper, we present CoSyn, a +context-synergized neural network that explicitly incorporates user- and +conversational context for detecting implicit hate speech in online +conversations. CoSyn introduces novel ways to encode these external contexts +and employs a novel context interaction mechanism that clearly captures the +interplay between them, making independent assessments of the amounts of +information to be retrieved from these noisy contexts. Additionally, it carries +out all these operations in the hyperbolic space to account for the scale-free +dynamics of social media. We demonstrate the effectiveness of CoSyn on 6 hate +speech datasets and show that CoSyn outperforms all our baselines in detecting +implicit hate speech with absolute improvements in the range of 1.24% - 57.8%. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference. Code: + https://github.com/Sreyan88/CoSyn +
+
+
+
+
+ + ♻ ☆ Avalon's Game of Thoughts: Battle Against Deception through Recursive + Contemplation + + +
+ Recent breakthroughs in large language models (LLMs) have brought remarkable +success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is +that the information processed by LLMs is consistently honest, neglecting the +pervasive deceptive or misleading information in human society and AI-generated +content. This oversight makes LLMs susceptible to malicious manipulations, +potentially resulting in detrimental outcomes. This study utilizes the +intricate Avalon game as a testbed to explore LLMs' potential in deceptive +environments. Avalon, full of misinformation and requiring sophisticated logic, +manifests as a "Game-of-Thoughts". Inspired by the efficacy of humans' +recursive thinking and perspective-taking in the Avalon game, we introduce a +novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to +identify and counteract deceptive information. ReCon combines formulation and +refinement contemplation processes; formulation contemplation produces initial +thoughts and speech, while refinement contemplation further polishes them. +Additionally, we incorporate first-order and second-order perspective +transitions into these processes respectively. Specifically, the first-order +allows an LLM agent to infer others' mental states, and the second-order +involves understanding how others perceive the agent's mental state. After +integrating ReCon with different LLMs, extensive experiment results from the +Avalon game indicate its efficacy in aiding LLMs to discern and maneuver around +deceptive information without extra fine-tuning and data. Finally, we offer a +possible explanation for the efficacy of ReCon and explore the current +limitations of LLMs in terms of safety, reasoning, speaking style, and format, +potentially furnishing insights for subsequent research. + +
+
+ comment: 40 pages +
+
+
+
+
+ + ♻ ☆ VECHR: A Dataset for Explainable and Robust Classification of + Vulnerability Type in the European Court of Human Rights EMNLP 2023 + + +
+ Recognizing vulnerability is crucial for understanding and implementing +targeted support to empower individuals in need. This is especially important +at the European Court of Human Rights (ECtHR), where the court adapts +Convention standards to meet actual individual needs and thus ensures effective +human rights protection. However, the concept of vulnerability remains elusive +at the ECtHR and no prior NLP research has dealt with it. To enable future +research in this area, we present VECHR, a novel expert-annotated multi-label +dataset comprising vulnerability type classification and explanation +rationale. We benchmark the performance of state-of-the-art models on VECHR +from both prediction and explainability perspectives. Our results demonstrate +the challenging nature of the task with lower prediction performance and +limited agreement between models and experts. Further, we analyze the +robustness of these models in dealing with out-of-domain (OOD) data and observe +overall limited performance. Our dataset poses unique challenges offering +significant room for improvement regarding performance, explainability, and +robustness. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ From Dissonance to Insights: Dissecting Disagreements in Rationale + Construction for Case Outcome Classification EMNLP 2023 + + +
+ In legal NLP, Case Outcome Classification (COC) must not only be accurate but +also trustworthy and explainable. Existing work in explainable COC has been +limited to annotations by a single expert. However, it is well-known that +lawyers may disagree in their assessment of case facts. We hence collect a +novel dataset RAVE: Rationale Variation in ECHR, which is obtained from two +experts in the domain of international human rights law, for whom we observe +weak agreement. We study their disagreements and build a two-level +task-independent taxonomy, supplemented with COC-specific subcategories. To our +knowledge, this is the first work in legal NLP that focuses on human label +variation. We quantitatively assess different taxonomy categories and find that +disagreements mainly stem from underspecification of the legal context, which +poses challenges given the typically limited granularity and noise in COC +metadata. We further assess the explainability of SOTA COC models on RAVE and +observe limited agreement between models and experts. Overall, our case study +reveals hitherto underappreciated complexities in creating benchmark datasets +in legal NLP that revolve around identifying aspects of a case's facts +supposedly relevant to its outcome. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Utilizing Weak Supervision To Generate Indonesian Conservation Dataset + + +
+ Weak supervision has emerged as a promising approach for rapid and +large-scale dataset creation in response to the increasing demand for +accelerated NLP development. By leveraging labeling functions, weak supervision +allows practitioners to generate datasets quickly by creating learned label +models that produce soft-labeled datasets. This paper aims to show how such an +approach can be utilized to build an Indonesian NLP dataset from conservation +news text. We construct two types of datasets: multi-class classification and +sentiment classification. We then provide baseline experiments using various +pretrained language models. These baseline results demonstrate test +performances of 59.79% accuracy and 55.72% F1-score for sentiment +classification, 66.87% F1-score-macro, 71.5% F1-score-micro, and 83.67% ROC-AUC +for multi-class classification. Additionally, we release the datasets and +labeling functions used in this work for further research and exploration. + +
+
+
+
+
+ + ♻ ☆ A Survey on LLM-generated Text Detection: Necessity, Methods, and Future + Directions + + +
+ The powerful ability to understand, follow, and generate complex language +emerging from large language models (LLMs) makes LLM-generated text flood many +areas of our daily lives at an incredible speed and is widely accepted by +humans. As LLMs continue to expand, there is an imperative need to develop +detectors that can detect LLM-generated text. This is crucial to mitigate +potential misuse of LLMs and safeguard realms like artistic expression and +social networks from the harmful influence of LLM-generated content. +LLM-generated text detection aims to discern if a piece of text was produced by +an LLM, which is essentially a binary classification task. The detector +techniques have witnessed notable advancements recently, propelled by +innovations in watermarking techniques, zero-shot methods, fine-tuning LM +methods, adversarial learning methods, LLMs as detectors, and human-assisted +methods. In this survey, we collate recent research breakthroughs in this area +and underscore the pressing need to bolster detector research. We also delve +into prevalent datasets, elucidating their limitations and developmental +requirements. Furthermore, we analyze various LLM-generated text detection +paradigms, shedding light on challenges like out-of-distribution problems, +potential attacks, and data ambiguity. Conclusively, we highlight interesting +directions for future research in LLM-generated text detection to advance the +implementation of responsible artificial intelligence (AI). Our aim with this +survey is to provide a clear and comprehensive introduction for newcomers while +also offering seasoned researchers a valuable update in the field of +LLM-generated text detection. The useful resources are publicly available at: +https://github.com/NLP2CT/LLM-generated-Text-Detection. + +
+
+
+
+
+ + ♻ ☆ Exploring Affordance and Situated Meaning in Image Captions: A + Multimodal Analysis + + +
+ This paper explores the grounding issue regarding multimodal semantic +representation from a computational cognitive-linguistic view. We annotate +images from the Flickr30k dataset with five perceptual properties: Affordance, +Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche +Association (ENA), and examine their association with textual elements in the +image captions. Our findings reveal that images with Gibsonian affordance show +a higher frequency of captions containing 'holding-verbs' and 'container-nouns' +compared to images displaying telic affordance. Perceptual Salience, Object +Number, and ENA are also associated with the choice of linguistic expressions. +Our study demonstrates that comprehensive understanding of objects or events +requires cognitive attention, semantic nuances in language, and integration +across multiple modalities. We highlight the vital importance of situated +meaning and affordance grounding in natural language understanding, with the +potential to advance human-like interpretation in various scenarios. + +
+
+ comment: 10 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ How does GPT-2 compute greater-than?: Interpreting mathematical + abilities in a pre-trained language model NeurIPS 2023 + + +
+ Pre-trained language models can be surprisingly adept at tasks they were not +explicitly trained on, but how they implement these capabilities is poorly +understood. In this paper, we investigate the basic mathematical abilities +often acquired by pre-trained language models. Concretely, we use mechanistic +interpretability techniques to explain the (limited) mathematical abilities of +GPT-2 small. As a case study, we examine its ability to take in sentences such +as "The war lasted from the year 1732 to the year 17", and predict valid +two-digit end years (years > 32). We first identify a circuit, a small subset +of GPT-2 small's computational graph that computes this task's output. Then, we +explain the role of each circuit component, showing that GPT-2 small's final +multi-layer perceptrons boost the probability of end years greater than the +start year. Finally, we find related tasks that activate our circuit. Our +results suggest that GPT-2 small computes greater-than using a complex but +general mechanism that activates across diverse contexts. + +
+
+ comment: NeurIPS 2023 Camera Ready Version +
+
+
+
+
+ + ♻ ☆ Distilling ChatGPT for Explainable Automated Student Answer Assessment EMNLP 2023 + + +
+ Providing explainable and faithful feedback is crucial for automated student +answer assessment. In this paper, we introduce a novel framework that explores +using ChatGPT, a cutting-edge large language model, for the concurrent tasks of +student answer scoring and rationale generation. We identify the appropriate +instructions by prompting ChatGPT with different templates to collect the +rationales, where inconsistent rationales are refined to align with marking +standards. The refined ChatGPT outputs enable us to fine-tune a smaller +language model that simultaneously assesses student answers and provides +rationales. Extensive experiments on the benchmark dataset show that the +proposed method improves the overall QWK score by 11% compared to ChatGPT. +Furthermore, our thorough analysis and human evaluation demonstrate that the +rationales generated by our proposed method are comparable to those of ChatGPT. +Our approach provides a viable solution to achieve explainable automated +assessment in education. Code available at +https://github.com/lijiazheng99/aera. + +
+
+ comment: Accepted EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ T-Projection: High Quality Annotation Projection for Sequence Labeling + Tasks EMNLP 2023 + + +
+ In the absence of readily available labeled data for a given sequence +labeling task and language, annotation projection has been proposed as one of +the possible strategies to automatically generate annotated data. Annotation +projection has often been formulated as the task of transporting, on parallel +corpora, the labels pertaining to a given span in the source language into its +corresponding span in the target language. In this paper we present +T-Projection, a novel approach for annotation projection that leverages large +pretrained text-to-text language models and state-of-the-art machine +translation technology. T-Projection decomposes the label projection task into +two subtasks: (i) A candidate generation step, in which a set of projection +candidates using a multilingual T5 model is generated and, (ii) a candidate +selection step, in which the generated candidates are ranked based on +translation probabilities. We conducted experiments on intrinsic and extrinsic +tasks in 5 Indo-European and 8 low-resource African languages. We demonstrate +that T-Projection outperforms previous annotation projection methods by a wide +margin. We believe that T-Projection can help to automatically alleviate the +lack of high-quality training data for sequence labeling tasks. Code and data +are publicly available. + +
+
+ comment: Findings of the EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Nearest Neighbor Machine Translation is Meta-Optimizer on Output + Projection Layer EMNLP2023 + + +
+ Nearest Neighbor Machine Translation ($k$NN-MT) has achieved great success in +domain adaptation tasks by integrating pre-trained Neural Machine Translation +(NMT) models with domain-specific token-level retrieval. However, the reasons +underlying its success have not been thoroughly investigated. In this paper, we +comprehensively analyze $k$NN-MT through theoretical and empirical studies. +Initially, we provide new insights into the working mechanism of $k$NN-MT as an +efficient technique to implicitly execute gradient descent on the output +projection layer of NMT, indicating that it is a specific case of model +fine-tuning. Subsequently, we conduct multi-domain experiments and word-level +analysis to examine the differences in performance between $k$NN-MT and +entire-model fine-tuning. Our findings suggest that: (1) Incorporating $k$NN-MT +with adapters yields comparable translation performance to fine-tuning on +in-domain test sets, while achieving better performance on out-of-domain test +sets; (2) Fine-tuning significantly outperforms $k$NN-MT on the recall of +in-domain low-frequency words, but this gap could be bridged by optimizing the +context representations with additional adapter layers. + +
+
+ comment: Accepted by EMNLP2023 +
+
+
+
+
+ + ♻ ☆ A Diachronic Analysis of Paradigm Shifts in NLP Research: When, How, and + Why? + + +
+ Understanding the fundamental concepts and trends in a scientific field is +crucial for keeping abreast of its continuous advancement. In this study, we +propose a systematic framework for analyzing the evolution of research topics +in a scientific field using causal discovery and inference techniques. We +define three variables to encompass diverse facets of the evolution of research +topics within NLP and utilize a causal discovery algorithm to unveil the causal +connections among these variables using observational data. Subsequently, we +leverage this structure to measure the intensity of these relationships. By +conducting extensive experiments on the ACL Anthology corpus, we demonstrate +that our framework effectively uncovers evolutionary trends and the underlying +causes for a wide range of NLP research topics. Specifically, we show that +tasks and methods are primary drivers of research in NLP, with datasets +following, while metrics have minimal impact. + +
+
+
+
+
+ + ♻ ☆ Contrastive Learning of Sentence Embeddings from Scratch + + +
+ Contrastive learning has been the dominant approach to train state-of-the-art +sentence embeddings. Previous studies have typically learned sentence +embeddings either through the use of human-annotated natural language inference +(NLI) data or via large-scale unlabeled sentences in an unsupervised manner. +However, even in the case of unlabeled data, their acquisition presents +challenges in certain domains due to various reasons. To address these issues, +we present SynCSE, a contrastive learning framework that trains sentence +embeddings with synthesized data. Specifically, we explore utilizing large +language models to synthesize the required data samples for contrastive +learning, including (1) producing positive and negative annotations given +unlabeled sentences (SynCSE-partial), and (2) generating sentences along with +their corresponding annotations from scratch (SynCSE-scratch). Experimental +results on sentence similarity and reranking tasks indicate that both +SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines, +and SynCSE-partial even achieves comparable performance to the supervised +models in most settings. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ InstructAlign: High-and-Low Resource Language Alignment via Continual + Crosslingual Instruction Tuning + + +
+ Large language models (LLMs) that are tuned with instructions have +demonstrated remarkable capabilities in various tasks and languages. However, +their ability to generalize to underrepresented languages is limited due to the +scarcity of available data. Additionally, directly adapting new languages to +instruction-tuned LLMs can result in catastrophic forgetting, which leads to +the loss of multitasking ability. To address this issue, we propose +InstructAlign which uses continual crosslingual instruction tuning to enable +LLMs to align new unseen languages with previously learned high-resource +languages. Our results demonstrate the effectiveness of InstructAlign in +enabling the model to understand low-resource languages with limited parallel +data while preventing catastrophic forgetting. Our work contributes to the +advancement of language adaptation methods, particularly for adapting +instruction-tuned LLMs to underrepresented languages. Our code is released on +https://github.com/HLTCHKUST/InstructAlign + +
+
+
+
+
+ + ♻ ☆ MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities + + +
+ We propose MM-Vet, an evaluation benchmark that examines large multimodal +models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various +intriguing abilities, such as solving math problems written on the blackboard, +reasoning about events and celebrities in news images, and explaining visual +jokes. Rapid model advancements pose challenges to evaluation benchmark +development. Problems include: (1) How to systematically structure and evaluate +the complicated multimodal tasks; (2) How to design evaluation metrics that +work well across question and answer types; and (3) How to give model insights +beyond a simple performance ranking. To this end, we present MM-Vet, designed +based on the insight that the intriguing ability to solve complicated tasks is +often achieved by a generalist model being able to integrate different core +vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and +examines the 16 integrations of interest derived from the capability +combination. For evaluation metrics, we propose an LLM-based evaluator for +open-ended outputs. The evaluator enables the evaluation across different +question types and answer styles, resulting in a unified scoring metric. We +evaluate representative LMMs on MM-Vet, providing insights into the +capabilities of different LMM system paradigms and models. Code and data are +available at https://github.com/yuweihao/MM-Vet. + +
+
+ comment: Add results of GPT-4V. Code, data and leaderboard: + https://github.com/yuweihao/MM-Vet +
+
+
+
+
+ + ♻ ☆ Batch Prompting: Efficient Inference with Large Language Model APIs EMNLP 2023 + + +
+ Performing inference on large volumes of samples with large language models +(LLMs) can be computationally and financially costly in industry and real-world +use. We propose batch prompting, a simple yet effective prompting approach that +enables the LLM to run inference in batches, instead of one sample at a time. +Our method reduces both token and time costs while retaining downstream +performance. We theoretically demonstrate that under a few-shot in-context +learning setting, the inference costs decrease almost inverse linearly with the +number of samples in each batch. We extensively validate the effectiveness of +batch prompting on ten datasets across commonsense QA, arithmetic reasoning, +and NLI/NLU: batch prompting significantly (up to 5x with six samples in batch) +reduces the LLM (Codex) inference token and time costs while achieving better +or comparable performance. For state-of-the-art Chat-based LLMs, e.g., GPT-3.5 +and GPT-4, we show the benefits of batch prompting also hold. Further analysis +shows that the number of samples in each batch and the complexity of tasks +affect its performance. Moreover, batch prompting can be applied across +different reasoning methods using LLMs. Our code can be found at the site +https://github.com/xlang-ai/batch-prompting. + +
+
+ comment: EMNLP 2023 Industry Track +
+
+
+
+
+ + ♻ ☆ Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through + Active Exploration EMNLP 2023 + + +
+ Instruction-tuning can be substantially optimized through enhanced diversity, +resulting in models capable of handling a broader spectrum of tasks. However, +existing data employed for such tuning often exhibit an inadequate coverage of +individual domains, limiting the scope for nuanced comprehension and +interactions within these areas. To address this deficiency, we propose +Explore-Instruct, a novel approach to enhance the data coverage to be used in +domain-specific instruction-tuning through active exploration via Large +Language Models (LLMs). Built upon representative domain use cases, +Explore-Instruct explores a multitude of variations or possibilities by +implementing a search algorithm to obtain diversified and domain-focused +instruction-tuning data. Our data-centric analysis validates the effectiveness +of this proposed approach in improving domain-specific instruction coverage. +Moreover, our model's performance demonstrates considerable advancements over +multiple baselines, including those utilizing domain-specific data enhancement. +Our findings offer a promising opportunity to improve instruction coverage, +especially in domain-specific contexts, thereby advancing the development of +adaptable language models. Our code, model weights, and data are public at +https://github.com/fanqiwan/Explore-Instruct. + +
+
+ comment: Accepted to EMNLP 2023 (Main Conference) +
+
+
+
+
+ + ♻ ☆ Improving End-to-End Speech Processing by Efficient Text Data + Utilization with Latent Synthesis EMNLP 2023 + + +
+ Training a high performance end-to-end speech (E2E) processing model requires +an enormous amount of labeled speech data, especially in the era of +data-centric artificial intelligence. However, labeled speech data are usually +scarcer and more expensive for collection, compared to textual data. We propose +Latent Synthesis (LaSyn), an efficient textual data utilization framework for +E2E speech processing models. We train a latent synthesizer to convert textual +data into an intermediate latent representation of a pre-trained speech model. +These pseudo acoustic representations of textual data augment acoustic data for +model training. We evaluate LaSyn on low-resource automatic speech recognition +(ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an +E2E baseline trained on LibriSpeech train-clean-100, with relative word error +rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our +E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for +slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM) +and EM-Tree accuracies on STOP respectively. With fewer parameters, the results +of LaSyn are competitive to published state-of-the-art works. The results +demonstrate the quality of the augmented training data. + +
+
+ comment: 15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Rethinking Word-Level Auto-Completion in Computer-Aided Translation EMNLP2023 + + +
+ Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted +Translation. It aims at providing word-level auto-completion suggestions for +human translators. While previous studies have primarily focused on designing +complex model architectures, this paper takes a different perspective by +rethinking the fundamental question: what kind of words are good +auto-completions? We introduce a measurable criterion to answer this question +and discover that existing WLAC models often fail to meet this criterion. +Building upon this observation, we propose an effective approach to enhance +WLAC performance by promoting adherence to the criterion. Notably, the proposed +approach is general and can be applied to various encoder-based architectures. +Through extensive experiments, we demonstrate that our approach outperforms the +top-performing system submitted to the WLAC shared tasks in WMT2022, while +utilizing significantly smaller model sizes. + +
+
+ comment: EMNLP2023 +
+
+
+
+
+ + ♻ ☆ Preserving Knowledge Invariance: Rethinking Robustness Evaluation of + Open Information Extraction EMNLP 2023 + + +
+ The robustness to distribution changes ensures that NLP models can be +successfully applied in the realistic world, especially for information +extraction tasks. However, most prior evaluation benchmarks have been devoted +to validating pairwise matching correctness, ignoring the crucial measurement +of robustness. In this paper, we present the first benchmark that simulates the +evaluation of open information extraction models in the real world, where the +syntactic and expressive distributions under the same knowledge meaning may +drift variously. We design and annotate a large-scale testbed in which each +example is a knowledge-invariant clique that consists of sentences with +structured knowledge of the same meaning but with different syntactic and +expressive forms. By further elaborating the robustness metric, a model is +judged to be robust if its performance is consistently accurate on the overall +cliques. We perform experiments on typical models published in the last decade +as well as a popular large language model; the results show that the existing +successful models exhibit a frustrating degradation, with a maximum drop of +23.43 F1 score. Our resources and code are available at +https://github.com/qijimrc/ROBUST. + +
+
+ comment: Accepted by EMNLP 2023 Main Conference +
+
+
+
+
+ + ♻ ☆ SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, + IIT Madras SP + + +
+ India is home to a multitude of languages of which 22 languages are +recognised by the Indian Constitution as official. Building speech based +applications for the Indian population is a difficult problem owing to limited +data and the number of languages and accents to accommodate. To encourage the +language technology community to build speech based applications in Indian +languages, we are open sourcing SPRING-INX data which has about 2000 hours of +legally sourced and manually transcribed speech data for ASR system building in +Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi +and Tamil. This endeavor is by SPRING Lab, Indian Institute of Technology +Madras and is a part of the National Language Translation Mission (NLTM), funded by +the Indian Ministry of Electronics and Information Technology (MeitY), +Government of India. We describe the data collection and data cleaning process +along with the data statistics in this paper. + +
+
+ comment: 3 pages, About SPRING-INX Data +
+
+
+
+
+ + ♻ ☆ MaXM: Towards Multilingual Visual Question Answering EMNLP 2023 + + +
+ Visual Question Answering (VQA) has been primarily studied through the lens +of the English language. Yet, tackling VQA in other languages in the same +manner would require a considerable amount of resources. In this paper, we +propose scalable solutions to multilingual visual question answering (mVQA), on +both data and modeling fronts. We first propose a translation-based framework +to mVQA data generation that requires much less human annotation effort than +the conventional approach of directly collecting questions and answers. Then, +we apply our framework to the multilingual captions in the Crossmodal-3600 +dataset and develop an efficient annotation protocol to create MaXM, a +test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple, +lightweight, and effective approach as well as benchmark state-of-the-art +English and multilingual VQA models. We hope that our benchmark encourages +further research on mVQA. + +
+
+ comment: EMNLP 2023 (Findings). + https://github.com/google-research-datasets/maxm +
+
+
+
+
+ + ♻ ☆ The ACL OCL Corpus: Advancing Open Science in Computational Linguistics EMNLP2023 + + +
+ We present ACL OCL, a scholarly corpus derived from the ACL Anthology to +assist Open scientific research in the Computational Linguistics domain. +Integrating and enhancing the previous versions of the ACL Anthology, the ACL +OCL contributes metadata, PDF files, citation graphs and additional structured +full texts with sections, figures, and links to a large knowledge resource +(Semantic Scholar). The ACL OCL spans seven decades, containing 73K papers, +alongside 210K figures. + We spotlight how ACL OCL applies to observe trends in computational +linguistics. By detecting paper topics with a supervised neural model, we note +that interest in "Syntax: Tagging, Chunking and Parsing" is waning and "Natural +Language Generation" is resurging. Our dataset is available from HuggingFace +(https://huggingface.co/datasets/WINGNUS/ACL-OCL). + +
+
+ comment: To appear in EMNLP2023 +
+
+
+
+
+ + ♻ ☆ Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech + Model + + +
+ Prompting and adapter tuning have emerged as efficient alternatives to +fine-tuning (FT) methods. However, existing studies on speech prompting focused +on classification tasks and failed on more complex sequence generation tasks. +Besides, adapter tuning is primarily applied with a focus on encoder-only +self-supervised models. Our experiments show that prompting on Wav2Seq, a +self-supervised encoder-decoder model, surpasses previous works in sequence +generation tasks. It achieves a remarkable 53% relative improvement in word +error rate for ASR and a 27% in F1 score for slot filling. Additionally, +prompting competes with the FT method in the low-resource scenario. Moreover, +we show the transferability of prompting and adapter tuning on Wav2Seq in +cross-lingual ASR. When limited trainable parameters are involved, prompting +and adapter tuning consistently outperform conventional FT across 7 languages. +Notably, in the low-resource scenario, prompting consistently outperforms +adapter tuning. + +
+
+ comment: Accepted to IEEE ASRU 2023 +
+
+
+
+
+ + ♻ ☆ Don't Trust ChatGPT when Your Question is not in English: A Study of + Multilingual Abilities and Types of LLMs EMNLP 2023 + + +
+ Large Language Models (LLMs) have demonstrated exceptional natural language +understanding abilities and have excelled in a variety of natural language +processing (NLP) tasks in recent years. Despite the fact that most LLMs are +trained predominantly in English, multiple studies have demonstrated their +comparative performance in many other languages. However, fundamental questions +persist regarding how LLMs acquire their multi-lingual abilities and how +performance varies across different languages. These inquiries are crucial for +the study of LLMs since users and researchers often come from diverse language +backgrounds, potentially influencing their utilization and interpretation of +LLMs' results. In this work, we propose a systematic way of qualifying the +performance disparities of LLMs under multilingual settings. We investigate the +phenomenon of across-language generalizations in LLMs, wherein insufficient +multi-lingual training data leads to advanced multi-lingual capabilities. To +accomplish this, we employ a novel back-translation-based prompting method. The +results show that GPT exhibits highly translating-like behaviour in +multilingual settings. + +
+
+ comment: Paper accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence + Scores from Language Models Fine-Tuned with Human Feedback EMNLP 2023 + + +
+ A trustworthy real-world prediction system should produce well-calibrated +confidence scores; that is, its confidence in an answer should be indicative of +the likelihood that the answer is correct, enabling deferral to an expert in +cases of low-confidence predictions. Recent studies have shown that +unsupervised pre-training produces large language models (LMs) whose +conditional probabilities are remarkably well-calibrated. However, the most +widely-used LMs are fine-tuned with reinforcement learning from human feedback +(RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional +probabilities that are very poorly calibrated. In light of this perceived +weakness, we conduct a broad evaluation of methods for extracting confidence +scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find +that verbalized confidences emitted as output tokens are typically +better-calibrated than the model's conditional probabilities on the TriviaQA, +SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error +by a relative 50%. + +
+
+ comment: EMNLP 2023 Camera Ready +
+
+
+
+
+ + ♻ ☆ Towards A Unified View of Sparse Feed-Forward Network in Pretraining + Large Language Model EMNLP 2023 + + +
+ Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE) +have proven effective in scaling up Transformers model size for +pretraining large language models. By only activating part of the FFN +parameters conditioning on input, S-FFN improves generalization performance +while keeping training and inference costs (in FLOPs) fixed. In this work, we +analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) +size and the memory block selection method under a general conceptual framework +of sparse neural memory. Using this unified framework, we compare several S-FFN +architectures for language modeling and provide insights into their relative +efficacy and efficiency. We found a simpler selection method -- +Avg-K that selects blocks through their mean aggregated +hidden states, achieving lower perplexity in language model pretraining +compared to existing MoE architectures including Switch Transformer (Fedus et +al., 2021) and HashLayer (Roller et al., 2021). + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked + + +
+ Large language models (LLMs) are popular for high-quality text generation but +can produce harmful content, even when aligned with human values through +reinforcement learning. Adversarial prompts can bypass their safety measures. +We propose LLM Self Defense, a simple approach to defend against these attacks +by having an LLM screen the induced responses. Our method does not require any +fine-tuning, input preprocessing, or iterative output generation. Instead, we +incorporate the generated content into a pre-defined prompt and employ another +instance of an LLM to analyze the text and predict whether it is harmful. We +test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent +LLMs against various types of attacks, such as forcefully inducing affirmative +responses to prompts and prompt engineering attacks. Notably, LLM Self Defense +succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 +and Llama 2. + +
+
+
+
+
+ + ♻ ☆ WebArena: A Realistic Web Environment for Building Autonomous Agents + + +
+ With advances in generative AI, there is now potential for autonomous agents +to manage daily tasks via natural language commands. However, current agents +are primarily created and tested in simplified synthetic environments, leading +to a disconnect with real-world scenarios. In this paper, we build an +environment for language-guided agents that is highly realistic and +reproducible. Specifically, we focus on agents that perform tasks on the web, +and create an environment with fully functional websites from four common +domains: e-commerce, social forum discussions, collaborative software +development, and content management. Our environment is enriched with tools +(e.g., a map) and external knowledge bases (e.g., user manuals) to encourage +human-like task-solving. Building upon our environment, we release a set of +benchmark tasks focusing on evaluating the functional correctness of task +completions. The tasks in our benchmark are diverse, long-horizon, and designed +to emulate tasks that humans routinely perform on the internet. We experiment +with several baseline agents, integrating recent techniques such as reasoning +before acting. The results demonstrate that solving complex tasks is +challenging: our best GPT-4-based agent only achieves an end-to-end task +success rate of 14.41%, significantly lower than the human performance of +78.24%. These results highlight the need for further development of robust +agents, that current state-of-the-art large language models are far from +perfect performance in these real-life tasks, and that WebArena can be used to +measure such progress. + +
+
+ comment: Our code, data, environment reproduction resources, and video + demonstrations are publicly available at https://webarena.dev/ +
+
+
+
+
+ + ♻ ☆ VPGTrans: Transfer Visual Prompt Generator across LLMs NeurIPS 2023 + + +
+ While developing a new multimodal LLM (MLLM) by pre-training on tremendous +image-text pairs from scratch can be exceedingly resource-consuming, connecting +an existing LLM with a comparatively lightweight visual prompt generator (VPG) +becomes a feasible paradigm. However, further tuning the VPG part of the MLLM +still suffers from indispensable computational costs, i.e., requiring thousands +of GPU hours and millions of training data. One alternative solution is to +transfer an existing VPG from any existing MLLMs for the target MLLM. + In this work, we for the first time investigate the VPG transferability +across LLMs, and explore a solution to reduce the cost of VPG transfer. We +first study the VPG transfer across different LLM sizes (e.g., small-to-large), +and across different LLM types, through which we diagnose the key factors to +maximize the transfer efficiency. Based on our observation, we design a +two-stage transfer framework named VPGTrans, which is simple yet highly +effective. Through extensive experiments, we demonstrate that VPGTrans helps +significantly speed up the transfer learning process without compromising +performance. Remarkably, it helps achieve the VPG transfer from BLIP-2 +OPT$_\text{2.7B}$ to BLIP-2 OPT$_\text{6.7B}$ with over 10 times speed-up and +10.7% training data compared with connecting a VPG to OPT$_\text{6.7B}$ from +scratch. Further, a series of intriguing findings and potential rationales +behind them are provided and discussed. Finally, we showcase the practical +value of our VPGTrans approach, by customizing two novel MLLMs, including +VL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs. + +
+
+ comment: Project Website: https://vpgtrans.github.io Code: + https://github.com/VPGTrans/VPGTrans NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ CorefPrompt: Prompt-based Event Coreference Resolution by Measuring + Event Type and Argument Compatibilities EMNLP2023 + + +
+ Event coreference resolution (ECR) aims to group event mentions referring to +the same real-world event into clusters. Most previous studies adopt the +"encoding first, then scoring" framework, making the coreference judgment rely +on event encoding. Furthermore, current methods struggle to leverage +human-summarized ECR rules, e.g., coreferential events should have the same +event type, to guide the model. To address these two issues, we propose a +prompt-based approach, CorefPrompt, to transform ECR into a cloze-style MLM +(masked language model) task. This allows for simultaneous event modeling and +coreference discrimination within a single template, with a fully shared +context. In addition, we introduce two auxiliary prompt tasks, event-type +compatibility and argument compatibility, to explicitly demonstrate the +reasoning process of ECR, which helps the model make final predictions. +Experimental results show that our method CorefPrompt performs well in a +state-of-the-art (SOTA) benchmark. + +
+
+ comment: Accepted by EMNLP2023 +
+
+
+
+
+ + ♻ ☆ Unify word-level and span-level tasks: NJUNLP's Participation for the + WMT2023 Quality Estimation Shared Task + + +
+ We introduce the submissions of the NJUNLP team to the WMT 2023 Quality +Estimation (QE) shared task. Our team submitted predictions for the +English-German language pair on both sub-tasks: (i) sentence- and word-level +quality prediction; and (ii) fine-grained error span detection. This year, we +further explore pseudo data methods for QE based on the NJUQE framework +(https://github.com/NJUNLP/njuqe). We generate pseudo MQM data using parallel +data from the WMT translation task. We pre-train the XLMR large model on pseudo +QE data, then fine-tune it on real QE data. At both stages, we jointly learn +sentence-level scores and word-level tags. Empirically, we conduct experiments +to find the key hyper-parameters that improve the performance. Technically, we +propose a simple method that converts the word-level outputs to fine-grained +error span results. Overall, our models achieved the best results in +English-German for both word-level and fine-grained error span detection +sub-tasks by a considerable margin. + +
+
+
+
+
+ + ♻ ☆ Clinical Text Summarization: Adapting Large Language Models Can + Outperform Human Experts + + +
+ Sifting through vast textual data and summarizing key information from +electronic health records (EHR) imposes a substantial burden on how clinicians +allocate their time. Although large language models (LLMs) have shown immense +promise in natural language processing (NLP) tasks, their efficacy on a diverse +range of clinical summarization tasks has not yet been rigorously demonstrated. +In this work, we apply domain adaptation methods to eight LLMs, spanning six +datasets and four distinct clinical summarization tasks: radiology reports, +patient questions, progress notes, and doctor-patient dialogue. Our thorough +quantitative assessment reveals trade-offs between models and adaptation +methods in addition to instances where recent advances in LLMs may not improve +results. Further, in a clinical reader study with ten physicians, we show that +summaries from our best-adapted LLMs are preferable to human summaries in terms +of completeness and correctness. Our ensuing qualitative analysis highlights +challenges faced by both LLMs and human experts. Lastly, we correlate +traditional quantitative NLP metrics with reader study scores to enhance our +understanding of how these metrics align with physician preferences. Our +research marks the first evidence of LLMs outperforming human experts in +clinical text summarization across multiple tasks. This implies that +integrating LLMs into clinical workflows could alleviate documentation burden, +empowering clinicians to focus more on personalized patient care and the +inherently human aspects of medicine. + +
+
+ comment: 24 pages, 24 figures. Compared to the original, newer versions + include minor edits and supplementary additional experiments that reinforce + the initial findings +
+
+
+
+
+ + ♻ ☆ Harnessing ChatGPT for thematic analysis: Are we ready? + + +
+ ChatGPT is an advanced natural language processing tool with growing +applications across various disciplines in medical research. Thematic analysis, +a qualitative research method to identify and interpret patterns in data, is +one application that stands to benefit from this technology. This viewpoint +explores the utilization of ChatGPT in three core phases of thematic analysis +within a medical context: 1) direct coding of transcripts, 2) generating themes +from a predefined list of codes, and 3) preprocessing quotes for manuscript +inclusion. Additionally, we explore the potential of ChatGPT to generate +interview transcripts, which may be used for training purposes. We assess the +strengths and limitations of using ChatGPT in these roles, highlighting areas +where human intervention remains necessary. Overall, we argue that ChatGPT can +function as a valuable tool during analysis, enhancing the efficiency of the +thematic analysis and offering additional insights into the qualitative data. + +
+
+ comment: 23 pages, 7 figures, 3 tables, 1 textbox +
+
+
+
+
+ + ♻ ☆ MCC-KD: Multi-CoT Consistent Knowledge Distillation + + +
+ Large language models (LLMs) have showcased remarkable capabilities in +complex reasoning through chain of thought (CoT) prompting. Recently, there has +been a growing interest in transferring these reasoning abilities from LLMs to +smaller models. However, achieving both the diversity and consistency in +rationales presents a challenge. In this paper, we focus on enhancing these two +aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to +efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple +rationales for each question and enforce consistency among the corresponding +predictions by minimizing the bidirectional KL-divergence between the answer +distributions. We investigate the effectiveness of MCC-KD with different model +architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both +mathematical reasoning and commonsense reasoning benchmarks. The empirical +results not only confirm MCC-KD's superior performance on in-distribution +datasets but also highlight its robust generalization ability on +out-of-distribution datasets. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ GROVE: A Retrieval-augmented Complex Story Generation Framework with A + Forest of Evidence EMNLP 2023 + + +
+ Conditional story generation is significant in human-machine interaction, +particularly in producing stories with complex plots. While Large language +models (LLMs) perform well on multiple NLP tasks, including story generation, +it is challenging to generate stories with both complex and creative plots. +Existing methods often rely on detailed prompts to guide LLMs to meet target +conditions, which inadvertently restrict the creative potential of the +generated stories. We argue that leveraging information from exemplary +human-written stories facilitates generating more diverse plotlines. Delving +deeper into story details helps build complex and credible plots. In this +paper, we propose a retrieval-auGmented stoRy generation +framework with a fOrest of eVidEnce (GROVE) to +enhance stories' complexity. We build a retrieval repository for target +conditions to produce few-shot examples to prompt LLMs. Additionally, we design +an "asking-why" prompting scheme that extracts a forest of evidence, +providing compensation for the ambiguities that may occur in the generated +story. This iterative process uncovers underlying story backgrounds. Finally, +we select the most fitting chains of evidence from the evidence forest and +integrate them into the generated story, thereby enhancing the narrative's +complexity and credibility. Experimental results and numerous examples verify +the effectiveness of our method. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ A New Benchmark and Reverse Validation Method for Passage-level + Hallucination Detection EMNLP2023 + + +
+ Large Language Models (LLMs) have shown their ability to collaborate +effectively with humans in real-world scenarios. However, LLMs are apt to +generate hallucinations, i.e., make up incorrect text and unverified +information, which can cause significant damage when deployed for +mission-critical tasks. In this paper, we propose a self-check approach based +on reverse validation to detect factual errors automatically in a zero-resource +fashion. To facilitate future studies and assess different methods, we +construct a hallucination detection benchmark named PHD, which is generated by +ChatGPT and annotated by human annotators. Contrasting previous studies of +zero-resource hallucination detection, our method and benchmark concentrate on +passage-level detection instead of sentence-level. We empirically evaluate our +method and existing zero-resource detection methods on two datasets. The +experimental results demonstrate that the proposed method considerably +outperforms the baselines while costing fewer tokens and less time. +Furthermore, we manually analyze some hallucination cases that LLM failed to +capture, revealing the shared limitation of zero-resource methods. + +
+
+ comment: EMNLP2023 Findings +
+
+
+
+
+ + ♻ ☆ Interpretable Text Classification Via Prototype Trajectories + + +
+ We propose a novel interpretable deep neural network for text classification, +called ProtoryNet, based on a new concept of prototype trajectories. Motivated +by the prototype theory in modern linguistics, ProtoryNet makes a prediction by +finding the most similar prototype for each sentence in a text sequence and +feeding an RNN backbone with the proximity of each sentence to the +corresponding active prototype. The RNN backbone then captures the temporal +pattern of the prototypes, which we refer to as prototype trajectories. +Prototype trajectories enable intuitive and fine-grained interpretation of the +reasoning process of the RNN model, in resemblance to how humans analyze texts. +We also design a prototype pruning procedure to reduce the total number of +prototypes used by the model for better interpretability. Experiments on +multiple public data sets show that ProtoryNet is more accurate than the +baseline prototype-based deep neural net and reduces the performance gap +compared to state-of-the-art black-box models. In addition, after prototype +pruning, the resulting ProtoryNet models only need less than or around 20 +prototypes for all datasets, which significantly benefits interpretability. +Furthermore, we report a survey result indicating that human users find +ProtoryNet more intuitive and easier to understand than other prototype-based +methods. + +
+
+
+
+
+ + ♻ ☆ GRACE: Discriminator-Guided Chain-of-Thought Reasoning EMNLP 2023 + + +
+ In the context of multi-step reasoning, e.g., with chain-of-thought, language +models (LMs) can easily assign a high likelihood to incorrect steps. As a +result, decoding strategies that optimize for solution likelihood often yield +incorrect solutions. To address this issue, we propose Guiding chain-of-thought +ReAsoning with a CorrectnEss Discriminator (GRACE), a stepwise decoding +approach that steers the decoding process towards producing correct reasoning +steps. GRACE employs a discriminator trained with a contrastive loss over +correct and incorrect steps, which is used during decoding to score next-step +candidates based on their correctness. Importantly, GRACE only requires +sampling from the LM, without the need for LM training or fine-tuning. Using +models from FLAN-T5 and LLaMA families, we evaluate GRACE over four math and +two symbolic reasoning tasks, where it exhibits substantial performance gains +compared to greedy decoding, verifiers, and self-consistency in most settings. +When further combined with self-consistency, GRACE outperforms all the +baselines by sizeable margins. Human and LLM evaluations over GSM8K show that +GRACE not only improves the final answer accuracy but also the correctness of +the intermediate reasoning. Our implementation can be accessed at +https://github.com/mukhal/grace. + +
+
+ comment: To appear at Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ GQA: Training Generalized Multi-Query Transformer Models from Multi-Head + Checkpoints EMNLP 2023 + + +
+ Multi-query attention (MQA), which only uses a single key-value head, +drastically speeds up decoder inference. However, MQA can lead to quality +degradation, and moreover it may not be desirable to train a separate model +just for faster inference. We (1) propose a recipe for uptraining existing +multi-head language model checkpoints into models with MQA using 5% of original +pre-training compute, and (2) introduce grouped-query attention (GQA), a +generalization of multi-query attention which uses an intermediate (more than +one, less than number of query heads) number of key-value heads. We show that +uptrained GQA achieves quality close to multi-head attention with comparable +speed to MQA. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ CoLT5: Faster Long-Range Transformers with Conditional Computation EMNLP 2023 + + +
+ Many natural language processing tasks benefit from long inputs, but +processing long documents with Transformers is expensive -- not only due to +quadratic attention complexity but also from applying feedforward and +projection layers to every token. However, not all tokens are equally +important, especially for longer documents. We propose CoLT5, a long-input +Transformer model that builds on this intuition by employing conditional +computation, devoting more resources to important tokens in both feedforward +and attention layers. We show that CoLT5 achieves stronger performance than +LongT5 with much faster training and inference, achieving SOTA on the +long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably +make use of extremely long inputs, showing strong gains up to 64k input length. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Counter Turing Test CT^2: AI-Generated Text Detection is Not as Easy as + You May Think -- Introducing AI Detectability Index EMNLP 2023 + + +
+ With the rise of prolific ChatGPT, the risks and consequences of AI-generated +text have increased alarmingly. To address the inevitable question of ownership +attribution for AI-generated artifacts, the US Copyright Office released a +statement stating that 'If a work's traditional elements of authorship were +produced by a machine, the work lacks human authorship and the Office will not +register it'. Furthermore, both the US and the EU governments have recently +drafted their initial proposals regarding the regulatory framework for AI. +Given this cynosural spotlight on generative AI, AI-generated text detection +(AGTD) has emerged as a topic that has already received immediate attention in +research, with some initial methods having been proposed, soon followed by +the emergence of techniques to bypass detection. This paper introduces the Counter +Turing Test (CT^2), a benchmark consisting of techniques aiming to offer a +comprehensive evaluation of the robustness of existing AGTD techniques. Our +empirical findings unequivocally highlight the fragility of the proposed AGTD +methods under scrutiny. Amidst the extensive deliberations on policy-making for +regulating AI development, it is of utmost importance to assess the +detectability of content generated by LLMs. Thus, to establish a quantifiable +spectrum facilitating the evaluation and ranking of LLMs according to their +detectability levels, we propose the AI Detectability Index (ADI). We conduct a +thorough examination of 15 contemporary LLMs, empirically demonstrating that +larger LLMs tend to have a higher ADI, indicating they are less detectable +compared to smaller LLMs. We firmly believe that ADI holds significant value as +a tool for the wider NLP community, with the potential to serve as a rubric in +AI-related policy-making. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ♻ ☆ DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining NeurIPS 2023 + + +
+ The mixture proportions of pretraining data domains (e.g., Wikipedia, books, +web text) greatly affect language model (LM) performance. In this paper, we +propose Domain Reweighting with Minimax Optimization (DoReMi), which first +trains a small proxy model using group distributionally robust optimization +(Group DRO) over domains to produce domain weights (mixture proportions) +without knowledge of downstream tasks. We then resample a dataset with these +domain weights and train a larger, full-sized model. In our experiments, we use +DoReMi on a 280M-parameter proxy model to find domain weights for training an +8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves +perplexity across all domains, even when it downweights a domain. DoReMi +improves average few-shot downstream accuracy by 6.5% points over a baseline +model trained using The Pile's default domain weights and reaches the baseline +accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has +no knowledge of downstream tasks, even matches the performance of using domain +weights tuned on downstream tasks. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Task-Based MoE for Multitask Multilingual Machine Translation + + +
+ Mixture-of-experts (MoE) architecture has been proven a powerful method for +diverse tasks in training deep models in many applications. However, current +MoE implementations are task agnostic, treating all tokens from different tasks +in the same manner. In this work, we instead design a novel method that +incorporates task information into MoE models at different granular levels with +shared dynamic task-based adapters. Our experiments and analysis show the +advantages of our approaches over the dense and canonical MoE models on +multi-task multilingual machine translations. With task-specific adapters, our +models can additionally generalize to new tasks efficiently. + +
+
+
+
+
+ + ♻ ☆ TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks EMNLP 2023 + + +
+ While LLMs have shown great success in understanding and generating text in +traditional conversational settings, their potential for performing ill-defined +complex tasks is largely under-studied. Indeed, we are yet to conduct +comprehensive benchmarking studies with multiple LLMs that are exclusively +focused on a complex task. However, conducting such benchmarking studies is +challenging because of the large variations in LLMs' performance when different +prompt types/styles are used and different degrees of detail are provided in +the prompts. To address this issue, the paper proposes a general taxonomy that +can be used to design prompts with specific properties in order to perform a +wide range of complex tasks. This taxonomy will allow future benchmarking +studies to report the specific categories of prompts used as part of the study, +enabling meaningful comparisons across different studies. Also, by establishing +a common standard through this taxonomy, researchers will be able to draw more +accurate conclusions about LLMs' performance on a specific complex task. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for + Improved Vision-Language Compositionality EMNLP 2023 + + +
+ Contrastively trained vision-language models have achieved remarkable +progress in vision and language representation learning, leading to +state-of-the-art models for various downstream multimodal tasks. However, +recent research has highlighted severe limitations of these models in their +ability to perform compositional reasoning over objects, attributes, and +relations. Scene graphs have emerged as an effective way to understand images +compositionally. These are graph-structured semantic representations of images +that contain objects, their attributes, and relations with other objects in a +scene. In this work, we consider the scene graph parsed from text as a proxy +for the image scene graph and propose a graph decomposition and augmentation +framework along with a coarse-to-fine contrastive learning objective between +images and text that aligns sentences of various complexities to the same +image. Along with this, we propose novel negative mining techniques in the +scene graph space for improving attribute binding and relation understanding. +Through extensive experiments, we demonstrate the effectiveness of our approach +that significantly improves attribute binding, relation understanding, +systematic generalization, and productivity on multiple recently proposed +benchmarks (for example, improvements of up to $18\%$ for systematic +generalization and $16.5\%$ for relation understanding over a strong baseline), +while achieving similar or better performance than CLIP on various general +multimodal tasks. + +
+
+ comment: EMNLP 2023 (long paper, main conference) +
+
+
+
+
+ + ♻ ☆ JASMINE: Arabic GPT Models for Few-Shot Learning + + +
+ Scholarship on generative pretraining (GPT) remains acutely Anglocentric, +leaving serious gaps in our understanding of the whole class of autoregressive +models. For example, we have little knowledge about the potential of these +models and their societal impacts in diverse linguistic and cultural settings. +We alleviate this issue for Arabic, a wide collection of languages and +dialectal varieties with a population of more than 400 million, by introducing +JASMINE. JASMINE is a suite of powerful Arabic autoregressive Transformer +language models ranging in size from 300 million to 6.7 billion parameters, +pretrained on a large and diverse dataset (~235 GB of text). We also carefully +design and release a comprehensive benchmark for both automated and human +evaluation of Arabic autoregressive models, with coverage of potential social +biases, harms, and toxicity. Using our novel benchmark, we evaluate JASMINE +extensively, showing powerful performance intrinsically as well as in few-shot +learning on a wide range of NLP tasks. We aim to responsibly release our models +and evaluation benchmark to interested researchers, along with code for +experimenting with them. + +
+
+
+
+
+ + ♻ ☆ Tokenization Consistency Matters for Generative Models on Extractive NLP + Tasks EMNLP2023 + + +
+ Generative models have been widely applied to solve extractive tasks, where +parts of the input are extracted to form the desired output, and achieved +significant success. For example, in extractive question answering (QA), +generative models have constantly yielded state-of-the-art results. In this +work, we identify the issue of tokenization inconsistency that is commonly +neglected in training these models. This issue damages the extractive nature of +these tasks after the input and output are tokenized inconsistently by the +tokenizer, and thus leads to a performance drop as well as hallucination. We +propose a simple yet effective fix to this issue and conduct a case study on +extractive QA. We show that, with consistent tokenization, the model performs +better in both in-domain and out-of-domain datasets, with a notable average of ++1.7 F2 gain when a BART model is trained on SQuAD and evaluated on 8 QA +datasets. Further, the model converges faster, and becomes less likely to +generate out-of-context answers. With these findings, we would like to call for +more attention to how tokenization should be done when solving extractive tasks +and recommend applying consistent tokenization during training. + +
+
+ comment: Findings of EMNLP2023 +
+
+
+
+
+ + ♻ ☆ Image Manipulation via Multi-Hop Instructions -- A New Dataset and + Weakly-Supervised Neuro-Symbolic Approach EMNLP 2023 + + +
+ We are interested in image manipulation via natural language text -- a task +that is useful for multiple AI applications but requires complex reasoning over +multi-modal spaces. We extend recently proposed Neuro Symbolic Concept Learning +(NSCL), which has been quite effective for the task of Visual Question +Answering (VQA), for the task of image manipulation. Our system, referred to as +NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and +only requires weak supervision in the form of annotated data for VQA. NeuroSIM +parses an instruction into a symbolic program, based on a Domain Specific +Language (DSL) comprising object attributes and manipulation operations, +that guides its execution. We create a new dataset for the task, and extensive +experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA +baselines that make use of supervised data for manipulation. + +
+
+ comment: EMNLP 2023 (long paper, main conference) +
+
+
+
+
+ + ♻ ☆ What to Read in a Contract? Party-Specific Summarization of Legal + Obligations, Entitlements, and Prohibitions EMNLP 2023 + + +
+ Reviewing and comprehending key obligations, entitlements, and prohibitions +in legal contracts can be a tedious task due to their length and +domain-specificity. Furthermore, the key rights and duties requiring review +vary for each contracting party. In this work, we propose a new task of +party-specific extractive summarization for legal contracts to facilitate +faster reviewing and improved comprehension of rights and duties. To facilitate +this, we curate a dataset comprising party-specific pairwise importance +comparisons annotated by legal experts, covering ~293K sentence pairs that +include obligations, entitlements, and prohibitions extracted from lease +agreements. Using this dataset, we train a pairwise importance ranker and +propose a pipeline-based extractive summarization system that generates a +party-specific contract summary. We establish the need for incorporating +a domain-specific notion of importance during summarization by comparing our +system against various baselines using both automatic and human evaluation +methods. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Mitigating Data Imbalance and Representation Degeneration in + Multilingual Machine Translation EMNLP 2023 + + +
+ Despite advances in multilingual neural machine translation (MNMT), we argue +that there are still two major challenges in this area: data imbalance and +representation degeneration. The data imbalance problem refers to the imbalance +in the amount of parallel corpora for all language pairs, especially for +long-tail languages (i.e., very low-resource languages). The representation +degeneration problem refers to the problem of encoded tokens tending to appear +only in a small subspace of the full space available to the MNMT model. To +solve these two issues, we propose Bi-ACL, a framework that uses only +target-side monolingual data and a bilingual dictionary to improve the +performance of the MNMT model. We define two modules, named bidirectional +autoencoder and bidirectional contrastive learning, which we combine with an +online constrained beam search and a curriculum learning sampling strategy. +Extensive experiments show that our proposed method is more effective both in +long-tail languages and in high-resource languages. We also demonstrate that +our approach is capable of transferring knowledge between domains and languages +in zero-shot scenarios. + +
+
+ comment: Accepted to Findings of EMNLP 2023, add statistical significance + tests. code available at https://github.com/lavine-lmu/Bi-ACL +
+
+
+
+
+ + ♻ ☆ Improving Summarization with Human Edits EMNLP + + +
+ Recent work has shown the promise of learning with human feedback paradigms +to produce human-determined high-quality text. Existing works use human +feedback to train large language models (LLMs) in general domain abstractive +summarization and have obtained summary quality exceeding traditional +likelihood training. In this paper, we focus on a less explored form of human +feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training +(SALT), a novel technique to use both the human-edited and model-generated data +together in the training loop. In addition, we demonstrate simulating Human +Edits with ground truth summaries coming from existing training data -- +Imitation edits, along with the model-generated summaries obtained after the +training, to reduce the need for expensive human-edit data. In our experiments, +we extend human feedback exploration from general domain summarization to +medical domain summarization. Our results demonstrate the effectiveness of SALT +in improving the summary quality with Human and Imitation Edits. Through +additional experiments, we show that SALT outperforms the conventional RLHF +method (designed for human preferences) -- DPO, when applied to human-edit +data. We hope the evidence in our paper prompts researchers to explore, +collect, and better use different human feedback approaches scalably. + +
+
+ comment: To appear in proceedings of the Main Conference on Empirical Methods + in Natural Language Processing (EMNLP) 2023 +
+
+
+
+
+ + ♻ ☆ KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large + Language Models + + +
+ Large language models (LLMs) demonstrate remarkable performance on +knowledge-intensive tasks, suggesting that real-world knowledge is encoded in +their model parameters. However, besides explorations on a few probing tasks in +limited knowledge domains, it is not well understood how to evaluate LLMs' +knowledge systematically and how well their knowledge abilities generalize, +across a spectrum of knowledge domains and progressively complex task formats. +To this end, we propose KGQuiz, a knowledge-intensive benchmark to +comprehensively investigate the knowledge generalization abilities of LLMs. +KGQuiz is a scalable framework constructed from triplet-based knowledge, which +covers three knowledge domains and consists of five tasks with increasing +complexity: true-or-false, multiple-choice QA, blank filling, factual editing, +and open-ended knowledge generation. To gain a better understanding of LLMs' +knowledge abilities and their generalization, we evaluate 10 open-source and +black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive +tasks and knowledge domains. Extensive experiments demonstrate that LLMs +achieve impressive performance in straightforward knowledge QA tasks, while +settings and contexts requiring more complex reasoning or employing +domain-specific facts still present significant challenges. We envision KGQuiz +as a testbed to analyze such nuanced variations in performance across domains +and task formats, and ultimately to understand, evaluate, and improve LLMs' +knowledge abilities across a wide spectrum of knowledge domains and tasks. + +
+
+
+
+
+ + ♻ ☆ Reproducing Whisper-Style Training Using an Open-Source Toolkit and + Publicly Available Data + + +
+ Pre-training speech models on large volumes of data has achieved remarkable +success. OpenAI Whisper is a multilingual multitask model trained on 680k hours +of supervised speech data. It generalizes well to various speech recognition +and translation benchmarks even in a zero-shot setup. However, the full +pipeline for developing such models (from data collection to training) is not +publicly accessible, which makes it difficult for researchers to further +improve its performance and address training-related issues such as efficiency, +robustness, fairness, and bias. This work presents an Open Whisper-style Speech +Model (OWSM), which reproduces Whisper-style training using an open-source +toolkit and publicly available data. OWSM even supports more translation +directions and can be more efficient to train. We will publicly release all +scripts used for data preparation, training, inference, and scoring as well as +pre-trained models and training logs to promote open science. + +
+
+ comment: Accepted at ASRU 2023 +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 132 + +
+
+
+ + ☆ Synthetic Data as Validation + + +
+ This study leverages synthetic data as a validation set to reduce overfitting +and ease the selection of the best model in AI development. While synthetic +data have been used for augmenting the training set, we find that synthetic +data can also significantly diversify the validation set, offering marked +advantages in domains like healthcare, where data are typically limited, +sensitive, and from out-domain sources (i.e., hospitals). In this study, we +illustrate the effectiveness of synthetic data for early cancer detection in +computed tomography (CT) volumes, where synthetic tumors are generated and +superimposed onto healthy organs, thereby creating an extensive dataset for +rigorous validation. Using synthetic data as validation can improve AI +robustness in both in-domain and out-domain test sets. Furthermore, we +establish a new continual learning framework that continuously trains AI models +on a stream of out-domain data with synthetic tumors. The AI model trained and +validated in dynamically expanding synthetic data can consistently outperform +models trained and validated exclusively on real-world data. Specifically, the +DSC score for liver tumor segmentation improves from 26.7% (95% CI: +22.6%-30.9%) to 34.5% (30.8%-38.2%) when evaluated on an in-domain dataset and +from 31.1% (26.0%-36.2%) to 35.4% (32.1%-38.7%) on an out-domain dataset. +Importantly, the performance gain is particularly significant in identifying +very tiny liver tumors (radius < 5mm) in CT volumes, with Sensitivity improving +from 33.1% to 55.4% on an in-domain dataset and 33.9% to 52.3% on an out-domain +dataset, justifying the efficacy in early detection of cancer. The application +of synthetic data, from both training and validation perspectives, underlines a +promising avenue to enhance AI robustness when dealing with data from varying +domains. + +
+
+
+
+
+ + ☆ From Posterior Sampling to Meaningful Diversity in Image Restoration + + +
+ Image restoration problems are typically ill-posed in the sense that each +degraded image can be restored in infinitely many valid ways. To accommodate +this, many works generate a diverse set of outputs by attempting to randomly +sample from the posterior distribution of natural images given the degraded +input. Here we argue that this strategy is commonly of limited practical value +because of the heavy tail of the posterior distribution. Consider for example +inpainting a missing region of the sky in an image. Since there is a high +probability that the missing region contains no object but clouds, any set of +samples from the posterior would be entirely dominated by (practically +identical) completions of sky. However, arguably, presenting users with only +one clear sky completion, along with several alternative solutions such as +airships, birds, and balloons, would better outline the set of possibilities. +In this paper, we initiate the study of meaningfully diverse image restoration. +We explore several post-processing approaches that can be combined with any +diverse image restoration method to yield semantically meaningful diversity. +Moreover, we propose a practical approach for allowing diffusion-based image +restoration methods to generate meaningfully diverse outputs, while incurring +only negligible computational overhead. We conduct extensive user studies to +analyze the proposed techniques, and find the strategy of reducing similarity +between outputs to be significantly favorable over posterior sampling. Code and +examples are available at https://noa-cohen.github.io/MeaningfulDiversityInIR + +
+
+ comment: Code and examples are available in + https://noa-cohen.github.io/MeaningfulDiversityInIR +
+
+
+
+
+ + ☆ Woodpecker: Hallucination Correction for Multimodal Large Language + Models + + +
+ Hallucination is a big shadow hanging over the rapidly evolving Multimodal +Large Language Models (MLLMs), referring to the phenomenon that the generated +text is inconsistent with the image content. In order to mitigate +hallucinations, existing studies mainly resort to an instruction-tuning manner +that requires retraining the models with specific data. In this paper, we pave +a different way, introducing a training-free method named Woodpecker. Like a +woodpecker heals trees, it picks out and corrects hallucinations from the +generated text. Concretely, Woodpecker consists of five stages: key concept +extraction, question formulation, visual knowledge validation, visual claim +generation, and hallucination correction. Implemented in a post-remedy manner, +Woodpecker can easily serve different MLLMs, while being interpretable by +accessing intermediate outputs of the five stages. We evaluate Woodpecker both +quantitatively and qualitatively and show the huge potential of this new +paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement +in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released +at https://github.com/BradyFU/Woodpecker. + +
+
+ comment: 16 pages, 7 figures. Code Website: + https://github.com/BradyFU/Woodpecker +
+
+
+
+
+ + ☆ Stanford-ORB: A Real-World 3D Object Inverse Rendering Benchmark NeurIPS 2023 + + +
+ We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering +Benchmark. Recent advances in inverse rendering have enabled a wide range of +real-world applications in 3D content generation, moving rapidly from research +and commercial use cases to consumer devices. While the results continue to +improve, there is no real-world benchmark that can quantitatively assess and +compare the performance of various inverse rendering methods. Existing +real-world datasets typically only consist of the shape and multi-view images +of objects, which are not sufficient for evaluating the quality of material +recovery and object relighting. Methods capable of recovering material and +lighting often resort to synthetic data for quantitative evaluation, which on +the other hand does not guarantee generalization to complex real-world +environments. We introduce a new dataset of real-world objects captured under a +variety of natural scenes with ground-truth 3D scans, multi-view images, and +environment lighting. Using this dataset, we establish the first comprehensive +real-world evaluation benchmark for object inverse rendering tasks from +in-the-wild scenes, and compare the performance of various existing methods. +All data, code, and models can be accessed at https://stanfordorb.github.io/. + +
+
+ comment: The first two authors contributed equally to this work. NeurIPS 2023 + Datasets and Benchmarks +
+
+
+
+
+ + ☆ What's Left? Concept Grounding with Logic-Enhanced Foundation Models NeurIPS 2023 + + +
+ Recent works such as VisProg and ViperGPT have smartly composed foundation +models for visual reasoning, using large language models (LLMs) to produce +programs that can be executed by pre-trained vision-language models. However, +they operate in limited domains, such as 2D images, not fully exploiting the +generalization of language: abstract concepts like "left" can also be grounded +in 3D, temporal, and action data, as in moving to your left. This limited +generalization stems from these inference-only methods' inability to learn or +adapt pre-trained models to a new domain. We propose the Logic-Enhanced +Foundation Model (LEFT), a unified framework that learns to ground and reason +with concepts across domains with a differentiable, domain-independent, +first-order logic-based program executor. LEFT has an LLM interpreter that +outputs a program represented in a general, logic-based reasoning language, +which is shared across all domains and tasks. LEFT's executor then executes the +program with trainable domain-specific grounding modules. We show that LEFT +flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, +and robotic manipulation. It exhibits strong reasoning ability in a wide +variety of tasks, including those that are complex and not seen during +training, and can be easily applied to new domains. + +
+
+ comment: NeurIPS 2023. First two authors contributed equally. Project page: + https://web.stanford.edu/~joycj/projects/left_neurips_2023 +
+
+
+
+
+ + ☆ Visual Cropping Improves Zero-Shot Question Answering of Multimodal + Large Language Models + + +
+ Multimodal Large Language Models (LLMs) have recently achieved promising +zero-shot accuracy on visual question answering (VQA) -- a fundamental task +affecting various downstream applications and domains. Given the great +potential for the broad use of these models, it is important to investigate +their limitations in dealing with different image and question properties. In +this work, we investigate whether multimodal LLMs can perceive small details as +well as large details in images. In particular, we show that their zero-shot +accuracy in answering visual questions is very sensitive to the size of the +visual subject of the question, declining up to $46\%$ with size. Furthermore, +we show that this effect is causal by observing that human visual cropping can +significantly mitigate their sensitivity to size. Inspired by the usefulness of +human cropping, we then propose three automatic visual cropping methods as +inference time mechanisms to improve the zero-shot performance of multimodal +LLMs. We study their effectiveness on four popular VQA datasets, and a subset +of the VQAv2 dataset tailored towards fine visual details. Our findings suggest +that multimodal LLMs should be used with caution in detail-sensitive VQA +applications, and that visual cropping is a promising direction to improve +their zero-shot performance. Our code and data are publicly available. + +
+
+ comment: 11 pages, 4 figures, 4 tables +
+
+
+
+
+ + ☆ Finetuning Offline World Models in the Real World + + +
+ Reinforcement Learning (RL) is notoriously data-inefficient, which makes +training on a real robot difficult. While model-based RL algorithms (world +models) improve data-efficiency to some extent, they still require hours or +days of interaction to learn skills. Recently, offline RL has been proposed as +a framework for training RL policies on pre-existing datasets without any +online interaction. However, constraining an algorithm to a fixed dataset +induces a state-action distribution shift between training and inference, and +limits its applicability to new tasks. In this work, we seek to get the best of +both worlds: we consider the problem of pretraining a world model with offline +data collected on a real robot, and then finetuning the model on online data +collected by planning with the learned model. To mitigate extrapolation errors +during online interaction, we propose to regularize the planner at test-time by +balancing estimated returns and (epistemic) model uncertainty. We evaluate our +method on a variety of visuo-motor control tasks in simulation and on a real +robot, and find that our method enables few-shot finetuning to seen and unseen +tasks even when offline data is limited. Videos, code, and data are available +at https://yunhaifeng.com/FOWM . + +
+
+ comment: CoRL 2023 Oral; Project website: https://yunhaifeng.com/FOWM +
+
+
+
+
+ + ☆ ConvBKI: Real-Time Probabilistic Semantic Mapping Network with + Quantifiable Uncertainty + + +
+ In this paper, we develop a modular neural network for real-time semantic +mapping in uncertain environments, which explicitly updates per-voxel +probabilistic distributions within a neural network layer. Our approach +combines the reliability of classical probabilistic algorithms with the +performance and efficiency of modern neural networks. Although robotic +perception is often divided between modern differentiable methods and classical +explicit methods, a union of both is necessary for real-time and trustworthy +performance. We introduce a novel Convolutional Bayesian Kernel Inference +(ConvBKI) layer which incorporates semantic segmentation predictions online +into a 3D map through a depthwise convolution layer by leveraging conjugate +priors. We compare ConvBKI against state-of-the-art deep learning approaches +and probabilistic algorithms for mapping to evaluate reliability and +performance. We also create a Robot Operating System (ROS) package of ConvBKI +and test it on real-world perceptually challenging off-road driving data. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2209.10663 +
+
+
+
+
+ + ☆ Human-in-the-Loop Task and Motion Planning for Imitation Learning + + +
+ Imitation learning from human demonstrations can teach robots complex +manipulation skills, but is time-consuming and labor intensive. In contrast, +Task and Motion Planning (TAMP) systems are automated and excel at solving +long-horizon tasks, but they are difficult to apply to contact-rich tasks. In +this paper, we present Human-in-the-Loop Task and Motion Planning (HITL-TAMP), +a novel system that leverages the benefits of both approaches. The system +employs a TAMP-gated control mechanism, which selectively gives and takes +control to and from a human teleoperator. This enables the human teleoperator +to manage a fleet of robots, maximizing data collection efficiency. The +collected human data is then combined with an imitation learning framework to +train a TAMP-gated policy, leading to superior performance compared to training +on full task demonstrations. We compared HITL-TAMP to a conventional +teleoperation system -- users gathered more than 3x the number of demos given +the same time budget. Furthermore, proficient agents (75\%+ success) could be +trained from just 10 minutes of non-expert teleoperation data. Finally, we +collected 2.1K demos with HITL-TAMP across 12 contact-rich, long-horizon tasks +and show that the system often produces near-perfect agents. Videos and +additional results at https://hitltamp.github.io . + +
+
+ comment: Conference on Robot Learning (CoRL) 2023 +
+
+
+
+
+ + ☆ CVPR 2023 Text Guided Video Editing Competition + + +
+ Humans watch more than a billion hours of video per day. Most of this video +was edited manually, which is a tedious process. However, AI-enabled +video-generation and video-editing is on the rise. Building on text-to-image +models like Stable Diffusion and Imagen, generative AI has improved +dramatically on video tasks. But it's hard to evaluate progress in these video +tasks because there is no standard benchmark. So, we propose a new dataset for +text-guided video editing (TGVE), and we run a competition at CVPR to evaluate +models on our TGVE dataset. In this paper we present a retrospective on the +competition and describe the winning method. The competition dataset is +available at https://sites.google.com/view/loveucvpr23/track4. + +
+
+ comment: Project page: https://sites.google.com/view/loveucvpr23/track4 +
+
+
+
+
+ + ☆ Integrating View Conditions for Image Synthesis + + +
+ In the field of image processing, applying intricate semantic modifications +within existing images remains an enduring challenge. This paper introduces a +pioneering framework that integrates viewpoint information to enhance the +control of image editing tasks. By surveying existing object editing +methodologies, we distill three essential criteria, consistency, +controllability, and harmony, that should be met for an image editing method. +In contrast to previous approaches, our method takes the lead in satisfying all +three requirements for addressing the challenge of image synthesis. Through +comprehensive experiments, encompassing both quantitative assessments and +qualitative comparisons with contemporary state-of-the-art methods, we present +compelling evidence of our framework's superior performance across multiple +dimensions. This work establishes a promising avenue for advancing image +synthesis techniques and empowering precise object modifications while +preserving the visual coherence of the entire composition. + +
+
+
+
+
+ + ☆ Transitivity Recovering Decompositions: Interpretable and Robust + Fine-Grained Relationships NeurIPS + + +
+ Recent advances in fine-grained representation learning leverage +local-to-global (emergent) relationships for achieving state-of-the-art +results. The relational representations relied upon by such methods, however, +are abstract. We aim to deconstruct this abstraction by expressing them as +interpretable graphs over image views. We begin by theoretically showing that +abstract relational representations are nothing but a way of recovering +transitive relationships among local views. Based on this, we design +Transitivity Recovering Decompositions (TRD), a graph-space search algorithm +that identifies interpretable equivalents of abstract emergent relationships at +both instance and class levels, and with no post-hoc computations. We +additionally show that TRD is provably robust to noisy views, with empirical +evidence also supporting this finding. The latter allows TRD to perform at par +or even better than the state-of-the-art, while being fully interpretable. +Implementation is available at https://github.com/abhrac/trd. + +
+
+ comment: Neural Information Processing Systems (NeurIPS) 2023 +
+
+
+
+
+ + ☆ Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning + + +
+ This paper presents a novel approach to Single-Positive Multi-label Learning. +In general multi-label learning, a model learns to predict multiple labels or +categories for a single input image. This is in contrast with standard +multi-class image classification, where the task is predicting a single label +from many possible labels for an image. Single-Positive Multi-label Learning +(SPML) specifically considers learning to predict multiple labels when there is +only a single annotation per image in the training data. Multi-label learning +is in many ways a more realistic task than single-label learning as real-world +data often involves instances belonging to multiple categories simultaneously; +however, most common computer vision datasets predominantly contain single +labels due to the inherent complexity and cost of collecting multiple high +quality annotations for each instance. We propose a novel approach called +Vision-Language Pseudo-Labeling (VLPL), which uses a vision-language model to +suggest strong positive and negative pseudo-labels, and outperforms the current +SOTA methods by 5.5% on Pascal VOC, 18.4% on MS-COCO, 15.2% on NUS-WIDE, and +8.4% on CUB-Birds. Our code and data are available at +https://github.com/mvrl/VLPL. + +
+
+
+
+
+ + ☆ Geometry-Aware Video Quality Assessment for Dynamic Digital Human + + +
+ Dynamic Digital Humans (DDHs) are 3D digital models that are animated using +predefined motions and are inevitably bothered by noise/shift during the +generation process and compression distortion during the transmission process, +which needs to be perceptually evaluated. Usually, DDHs are displayed as 2D +rendered animation videos and it is natural to adapt video quality assessment +(VQA) methods to DDH quality assessment (DDH-QA) tasks. However, the VQA +methods are highly dependent on viewpoints and less sensitive to geometry-based +distortions. Therefore, in this paper, we propose a novel no-reference (NR) +geometry-aware video quality assessment method for DDH-QA challenge. Geometry +characteristics are described by the statistical parameters estimated from the +DDHs' geometry attribute distributions. Spatial and temporal features are +acquired from the rendered videos. Finally, all kinds of features are +integrated and regressed into quality values. Experimental results show that +the proposed method achieves state-of-the-art performance on the DDH-QA +database. + +
+
+
+
+
+ + ☆ Decoupled DETR: Spatially Disentangling Localization and Classification + for Improved End-to-End Object Detection ICCV2023 + + +
+ The introduction of DETR represents a new paradigm for object detection. +However, its decoder conducts classification and box localization using shared +queries and cross-attention layers, leading to suboptimal results. We observe +that different regions of interest in the visual feature map are suitable for +performing query classification and box localization tasks, even for the same +object. Salient regions provide vital information for classification, while the +boundaries around them are more favorable for box regression. Unfortunately, +such spatial misalignment between these two tasks greatly hinders DETR's +training. Therefore, in this work, we focus on decoupling localization and +classification tasks in DETR. To achieve this, we introduce a new design scheme +called spatially decoupled DETR (SD-DETR), which includes a task-aware query +generation module and a disentangled feature learning process. We elaborately +design the task-aware query initialization process and divide the +cross-attention block in the decoder to allow the task-aware queries to match +different visual regions. Meanwhile, we also observe that the prediction +misalignment problem for high classification confidence and precise +localization exists, so we propose an alignment loss to further guide the +spatially decoupled DETR training. Through extensive experiments, we +demonstrate that our approach achieves a significant improvement in MSCOCO +datasets compared to previous work. For instance, we improve the performance of +Conditional DETR by 4.5 AP. By spatially disentangling the two tasks, our +method overcomes the misalignment problem and greatly improves the performance +of DETR for object detection. + +
+
+ comment: accepted by ICCV2023 +
+
+
+
+
+ + ☆ Improving Robustness and Reliability in Medical Image Classification + with Latent-Guided Diffusion and Nested-Ensembles + + +
+ While deep learning models have achieved remarkable success across a range of +medical image analysis tasks, deployment of these models in real clinical +contexts requires that they be robust to variability in the acquired images. +While many methods apply predefined transformations to augment the training +data to enhance test-time robustness, these transformations may not ensure the +model's robustness to the diverse variability seen in patient images. In this +paper, we introduce a novel three-stage approach based on transformers coupled +with conditional diffusion models, with the goal of improving model robustness +to the kinds of imaging variability commonly encountered in practice without +the need for pre-determined data augmentation strategies. To this end, multiple +image encoders first learn hierarchical feature representations to build +discriminative latent spaces. Next, a reverse diffusion process, guided by the +latent code, acts on an informative prior and proposes prediction candidates in +a generative manner. Finally, several prediction candidates are aggregated in a +bi-level aggregation protocol to produce the final output. Through extensive +experiments on medical imaging benchmark datasets, we show that our method +improves upon state-of-the-art methods in terms of robustness and confidence +calibration. Additionally, we introduce a strategy to quantify the prediction +uncertainty at the instance level, increasing their trustworthiness to +clinicians using them in clinical practice. + +
+
+ comment: 13 pages, 6 figures +
+
+
+
+
+ + ☆ Language-driven Scene Synthesis using Multi-conditional Diffusion Model NeurIPS 2023 + + +
+ Scene synthesis is a challenging problem with several industrial +applications. Recently, substantial efforts have been directed to synthesize +the scene using human motions, room layouts, or spatial graphs as the input. +However, few studies have addressed this problem from multiple modalities, +especially combining text prompts. In this paper, we propose a language-driven +scene synthesis task, which is a new task that integrates text prompts, human +motion, and existing objects for scene synthesis. Unlike other single-condition +synthesis tasks, our problem involves multiple conditions and requires a +strategy for processing and encoding them into a unified space. To address the +challenge, we present a multi-conditional diffusion model, which differs from +the implicit unification approach of other diffusion literature by explicitly +predicting the guiding points for the original data distribution. We +demonstrate that our approach is theoretically supportive. The intensive +experiment results illustrate that our method outperforms state-of-the-art +benchmarks and enables natural scene editing applications. The source code and +dataset can be accessed at https://lang-scene-synth.github.io/. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ☆ ShARc: Shape and Appearance Recognition for Person Identification + In-the-wild WACV 2024 + + +
+ Identifying individuals in unconstrained video settings is a valuable yet +challenging task in biometric analysis due to variations in appearances, +environments, degradations, and occlusions. In this paper, we present ShARc, a +multimodal approach for video-based person identification in uncontrolled +environments that emphasizes 3-D body shape, pose, and appearance. We introduce +two encoders: a Pose and Shape Encoder (PSE) and an Aggregated Appearance +Encoder (AAE). PSE encodes the body shape via binarized silhouettes, skeleton +motions, and 3-D body shape, while AAE provides two levels of temporal +appearance feature aggregation: attention-based feature aggregation and +averaging aggregation. For attention-based feature aggregation, we employ +spatial and temporal attention to focus on key areas for person distinction. +For averaging aggregation, we introduce a novel flattening layer after +averaging to extract more distinguishable information and reduce overfitting of +attention. We utilize centroid feature averaging for gallery registration. We +demonstrate significant improvements over existing state-of-the-art methods on +public datasets, including CCVID, MEVID, and BRIAR. + +
+
+ comment: WACV 2024 +
+
+
+
+
+ + ☆ Mitigate Domain Shift by Primary-Auxiliary Objectives Association for + Generalizing Person ReID WACV2024 + + +
+ While deep learning has significantly improved ReID model accuracy under the +independent and identical distribution (IID) assumption, it has also become +clear that such models degrade notably when applied to an unseen novel domain +due to unpredictable/unknown domain shift. Contemporary domain generalization +(DG) ReID models struggle in learning domain-invariant representation solely +through training on an instance classification objective. We consider that a +deep learning model is heavily influenced and therefore biased towards +domain-specific characteristics, e.g., background clutter, scale and viewpoint +variations, limiting the generalizability of the learned model, and hypothesize +that the pedestrians are domain invariant owing to the fact that they share the same structural +characteristics. To enable the ReID model to be less domain-specific from these +pure pedestrians, we introduce a method that guides model learning of the +primary ReID instance classification objective by a concurrent auxiliary +learning objective on weakly labeled pedestrian saliency detection. To solve +the problem of conflicting optimization criteria in the model parameter space +between the two learning objectives, we introduce a Primary-Auxiliary +Objectives Association (PAOA) mechanism to calibrate the loss gradients of the +auxiliary task towards the primary learning task gradients. Benefiting from the +harmonious multitask learning design, our model can be extended with the recent +test-time diagram to form the PAOA+, which performs on-the-fly optimization +against the auxiliary objective in order to maximize the model's generative +capacity in the test target domain. Experiments demonstrate the superiority of +the proposed PAOA model. + +
+
+ comment: Accepted to WACV2024 +
+
+
+
+
+ + ☆ YOLO-Angio: An Algorithm for Coronary Anatomy Segmentation MICCAI + + +
+ Coronary angiography remains the gold standard for diagnosis of coronary +artery disease, the most common cause of death worldwide. While this procedure +is performed more than 2 million times annually, there remain few methods for +fast and accurate automated measurement of disease and localization of coronary +anatomy. Here, we present our solution to the Automatic Region-based Coronary +Artery Disease diagnostics using X-ray angiography images (ARCADE) challenge +held at MICCAI 2023. For the artery segmentation task, our three-stage approach +combines preprocessing and feature selection by classical computer vision to +enhance vessel contrast, followed by an ensemble model based on YOLOv8 to +propose possible vessel candidates by generating a vessel map. A final +segmentation is based on a logic-based approach to reconstruct the coronary +tree in a graph-based sorting method. Our entry to the ARCADE challenge placed +3rd overall. Using the official metric for evaluation, we achieved an F1 score +of 0.422 and 0.4289 on the validation and hold-out sets respectively. + +
+
+ comment: MICCAI Conference ARCADE Grand Challenge, YOLO, Computer Vision +
+
+
+
+
+ + ☆ On Responsible Machine Learning Datasets with Fairness, Privacy, and + Regulatory Norms + + +
+ Artificial Intelligence (AI) has made its way into various scientific fields, +providing astonishing improvements over existing algorithms for a wide variety +of tasks. In recent years, there have been severe concerns over the +trustworthiness of AI technologies. The scientific community has focused on the +development of trustworthy AI algorithms. However, machine and deep learning +algorithms, popular in the AI community today, depend heavily on the data used +during their development. These learning algorithms identify patterns in the +data, learning the behavioral objective. Any flaws in the data have the +potential to translate directly into algorithms. In this study, we discuss the +importance of Responsible Machine Learning Datasets and propose a framework to +evaluate the datasets through a responsible rubric. While existing work focuses +on the post-hoc evaluation of algorithms for their trustworthiness, we provide +a framework that considers the data component separately to understand its role +in the algorithm. We discuss responsible datasets through the lens of fairness, +privacy, and regulatory compliance and provide recommendations for constructing +future datasets. After surveying over 100 datasets, we use 60 datasets for +analysis and demonstrate that none of these datasets is immune to issues of +fairness, privacy preservation, and regulatory compliance. We provide +modifications to the ``datasheets for datasets" with important additions for +improved dataset documentation. With governments around the world regularizing +data protection laws, the method for the creation of datasets in the scientific +community requires revision. We believe this study is timely and relevant in +today's era of AI. + +
+
+
+
+
+ + ☆ Automatic Aorta Segmentation with Heavily Augmented, High-Resolution 3-D + ResUNet: Contribution to the SEG.A Challenge MICCAI 2023 + + +
+ Automatic aorta segmentation from 3-D medical volumes is an important yet +difficult task. Several factors make the problem challenging, e.g. the +possibility of aortic dissection or the difficulty with segmenting and +annotating the small branches. This work presents a contribution by the MedGIFT +team to the SEG.A challenge organized during the MICCAI 2023 conference. We +propose a fully automated algorithm based on deep encoder-decoder architecture. +The main assumption behind our work is that data preprocessing and augmentation +are much more important than the deep architecture, especially in low data +regimes. Therefore, the solution is based on a variant of traditional +convolutional U-Net. The proposed solution achieved a Dice score above 0.9 for +all testing cases with the highest stability among all participants. The method +scored 1st, 4th, and 3rd in terms of the clinical evaluation, quantitative +results, and volumetric meshing quality, respectively. We freely release the +source code, pretrained model, and provide access to the algorithm on the +Grand-Challenge platform. + +
+
+ comment: MICCAI 2023 - SEG.A Challenge Contribution +
+
+
+
+
+ + ☆ SequenceMatch: Revisiting the design of weak-strong augmentations for + Semi-supervised learning WACV 2024 + + +
+ Semi-supervised learning (SSL) has become popular in recent years because it +allows the training of a model using a large amount of unlabeled data. However, +one issue that many SSL methods face is the confirmation bias, which occurs +when the model is overfitted to the small labeled training dataset and produces +overconfident, incorrect predictions. To address this issue, we propose +SequenceMatch, an efficient SSL method that utilizes multiple data +augmentations. The key element of SequenceMatch is the inclusion of a medium +augmentation for unlabeled data. By taking advantage of different augmentations +and the consistency constraints between each pair of augmented examples, +SequenceMatch helps reduce the divergence between the prediction distribution +of the model for weakly and strongly augmented examples. In addition, +SequenceMatch defines two different consistency constraints for high and +low-confidence predictions. As a result, SequenceMatch is more data-efficient +than ReMixMatch, and more time-efficient than both ReMixMatch ($\times4$) and +CoMatch ($\times2$) while having higher accuracy. Despite its simplicity, +SequenceMatch consistently outperforms prior methods on standard benchmarks, +such as CIFAR-10/100, SVHN, and STL-10. It also surpasses prior +state-of-the-art methods by a large margin on large-scale datasets such as +ImageNet, with a 38.46\% error rate. Code is available at +https://github.com/beandkay/SequenceMatch. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ☆ 3D Masked Autoencoders for Enhanced Privacy in MRI Scans + + +
+ MRI scans provide valuable medical information, however they also contain +sensitive and personally identifiable information (PII) that needs to be +protected. Whereas MRI metadata is easily sanitized, MRI image data is a +privacy risk because it contains information to render highly-realistic 3D +visualizations of a patient's head, enabling malicious actors to possibly +identify the subject by cross-referencing a database. Data anonymization and +de-identification is concerned with ensuring the privacy and confidentiality of +individuals' personal information. Traditional MRI de-identification methods +remove privacy-sensitive parts (e.g. eyes, nose etc.) from a given scan. This +comes at the expense of introducing a domain shift that can throw off +downstream analyses. Recently, a GAN-based approach was proposed to de-identify +a patient's scan by remodeling it (e.g. changing the face) rather than by +removing parts. In this work, we propose CP-MAE, a model that de-identifies the +face using masked autoencoders and that outperforms all previous approaches in +terms of downstream task performance as well as de-identification. With our +method we are able to synthesize scans of resolution up to $256^3$ (previously +128 cubic) which constitutes an eight-fold increase in the number of voxels. +Using our construction we were able to design a system that exhibits a highly +robust training stage, making it easy to fit the network on novel data. + +
+
+
+
+
+ + ☆ Unpaired MRI Super Resolution with Self-Supervised Contrastive Learning + + +
+ High-resolution (HR) magnetic resonance imaging (MRI) is crucial for +enhancing diagnostic accuracy in clinical settings. Nonetheless, the inherent +limitation of MRI resolution restricts its widespread applicability. Deep +learning-based image super-resolution (SR) methods exhibit promise in improving +MRI resolution without additional cost. However, these methods frequently +require a substantial number of HR MRI images for training, which can be +challenging to acquire. In this paper, we propose an unpaired MRI SR approach +that employs self-supervised contrastive learning to enhance SR performance +with limited training data. Our approach leverages both authentic HR images and +synthetically generated SR images to construct positive and negative sample +pairs, thus facilitating the learning of discriminative features. Empirical +results presented in this study underscore significant enhancements in the peak +signal-to-noise ratio and structural similarity index, even when a paucity of +HR images is available. These findings accentuate the potential of our approach +in addressing the challenge of limited training data, thereby contributing to +the advancement of high-resolution MRI in clinical applications. + +
+
+
+
+
+ + ☆ Debiasing, calibrating, and improving Semi-supervised Learning + performance via simple Ensemble Projector WACV 2024 + + +
+ Recent studies on semi-supervised learning (SSL) have achieved great success. +Despite their promising performance, current state-of-the-art methods tend +toward increasingly complex designs at the cost of introducing more network +components and additional training procedures. In this paper, we propose a +simple method named Ensemble Projectors Aided for Semi-supervised Learning +(EPASS), which focuses mainly on improving the learned embeddings to boost the +performance of the existing contrastive joint-training semi-supervised learning +frameworks. Unlike standard methods, where the learned embeddings from one +projector are stored in memory banks to be used with contrastive learning, +EPASS stores the ensemble embeddings from multiple projectors in memory banks. +As a result, EPASS improves generalization, strengthens feature representation, +and boosts performance. For instance, EPASS improves strong baselines for +semi-supervised learning by 39.47\%/31.39\%/24.70\% top-1 error rate, while +using only 100k/1\%/10\% of labeled data for SimMatch, and achieves +40.24\%/32.64\%/25.90\% top-1 error rate for CoMatch on the ImageNet dataset. +These improvements are consistent across methods, network architectures, and +datasets, proving the general effectiveness of the proposed methods. Code is +available at https://github.com/beandkay/EPASS. + +
+
+ comment: Accepted to WACV 2024 +
+
+
+
+
+ + ☆ Large Language Models are Temporal and Causal Reasoners for Video + Question Answering EMNLP 2023 + + +
+ Large Language Models (LLMs) have shown remarkable performances on a wide +range of natural language understanding and generation tasks. We observe that +the LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ +for temporal and causal reasoning in Video Question Answering (VideoQA). +However, such priors often cause suboptimal results on VideoQA by leading the +model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$, +while ignoring visual content. This is also known as `ungrounded guesses' or +`hallucinations'. To address this problem while leveraging LLMs' prior on +VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to +predict all the combinations of $\langle$V, Q, A$\rangle$ triplet by flipping +the source pair and the target label to understand their complex relationships, +$\textit{i.e.}$, predict A, Q, and V given a VQ, VA, and QA pairs, +respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to +LLaMA, and it outperforms both LLMs-based and non-LLMs-based models on five +challenging VideoQA benchmarks. Furthermore, our Flipped-VQA is a general +framework that is applicable to various LLMs (OPT and GPT-J) and consistently +improves their performances. We empirically demonstrate that Flipped-VQA not +only enhances the exploitation of linguistic shortcuts but also mitigates the +linguistic bias, which causes incorrect answers over-relying on the question. +Code is available at https://github.com/mlvlab/Flipped-VQA. + +
+
+ comment: Accepted paper at EMNLP 2023 Main +
+
+
+
+
+ + ☆ Interpretable Medical Image Classification using Prototype Learning and + Privileged Information MICCAI 2023 + + +
+ Interpretability is often an essential requirement in medical imaging. +Advanced deep learning methods are required to address this need for +explainability and high performance. In this work, we investigate whether +additional information available during the training process can be used to +create an understandable and powerful model. We propose an innovative solution +called Proto-Caps that leverages the benefits of capsule networks, prototype +learning and the use of privileged information. Evaluating the proposed +solution on the LIDC-IDRI dataset shows that it combines increased +interpretability with above state-of-the-art prediction performance. Compared +to the explainable baseline model, our method achieves more than 6 % higher +accuracy in predicting both malignancy (93.0 %) and mean characteristic +features of lung nodules. Simultaneously, the model provides case-based +reasoning with prototype representations that allow visual validation of +radiologist-defined attributes. + +
+
+ comment: MICCAI 2023 Medical Image Computing and Computer Assisted + Intervention +
+
+
+
+
+ + ☆ Query-adaptive DETR for Crowded Pedestrian Detection + + +
+ DEtection TRansformer (DETR) and its variants (DETRs) have been successfully +applied to crowded pedestrian detection, which achieved promising performance. +However, we find that, in different degrees of crowded scenes, the number of +DETRs' queries must be adjusted manually, otherwise, the performance would +degrade to varying degrees. In this paper, we first analyze the two current +query generation methods and summarize four guidelines for designing the +adaptive query generation method. Then, we propose Rank-based Adaptive Query +Generation (RAQG) to alleviate the problem. Specifically, we design a rank +prediction head that can predict the rank of the lowest confidence positive +training sample produced by the encoder. Based on the predicted rank, we design +an adaptive selection method that can adaptively select coarse detection +results produced by the encoder to generate queries. Moreover, to train the +rank prediction head better, we propose Soft Gradient L1 Loss. The gradient of +Soft Gradient L1 Loss is continuous, which can describe the relationship +between the loss value and the updated value of model parameters granularly. +Our method is simple and effective, which can be plugged into any DETRs to make +it query-adaptive in theory. The experimental results on Crowdhuman dataset and +Citypersons dataset show that our method can adaptively generate queries for +DETRs and achieve competitive results. Especially, our method achieves +state-of-the-art 39.4% MR on Crowdhuman dataset. + +
+
+ comment: 10 pages, 6 figures +
+
+
+
+
+ + ☆ GNeSF: Generalizable Neural Semantic Fields + + +
+ 3D scene segmentation based on neural implicit representation has emerged +recently with the advantage of training only on 2D supervision. However, +existing approaches still require expensive per-scene optimization that +prohibits generalization to novel scenes during inference. To circumvent this +problem, we introduce a generalizable 3D segmentation framework based on +implicit representation. Specifically, our framework takes in multi-view image +features and semantic maps as the inputs instead of only spatial information to +avoid overfitting to scene-specific geometric and semantic information. We +propose a novel soft voting mechanism to aggregate the 2D semantic information +from different views for each 3D point. In addition to the image features, view +difference information is also encoded in our framework to predict the voting +scores. Intuitively, this allows the semantic information from nearby views to +contribute more compared to distant ones. Furthermore, a visibility module is +also designed to detect and filter out detrimental information from occluded +views. Due to the generalizability of our proposed method, we can synthesize +semantic maps or conduct 3D semantic segmentation for novel scenes with solely +2D semantic supervision. Experimental results show that our approach achieves +comparable performance with scene-specific approaches. More importantly, our +approach can even outperform existing strong supervision-based approaches with +only 2D annotations. Our source code is available at: +https://github.com/HLinChen/GNeSF. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Physics-Informed with Power-Enhanced Residual Network for Interpolation + and Inverse Problems + + +
+ This paper introduces a novel neural network structure called the +Power-Enhancing residual network, designed to improve interpolation +capabilities for both smooth and non-smooth functions in 2D and 3D settings. By +adding power terms to residual elements, the architecture boosts the network's +expressive power. The study explores network depth, width, and optimization +methods, showing the architecture's adaptability and performance advantages. +Consistently, the results emphasize the exceptional accuracy of the proposed +Power-Enhancing residual network, particularly for non-smooth functions. +Real-world examples also confirm its superiority over plain neural network in +terms of accuracy, convergence, and efficiency. The study also looks at the +impact of deeper network. Moreover, the proposed architecture is also applied +to solving the inverse Burgers' equation, demonstrating superior performance. +In conclusion, the Power-Enhancing residual network offers a versatile solution +that significantly enhances neural network capabilities. The codes implemented +are available at: \url{https://github.com/CMMAi/ResNet_for_PINN}. + +
+
+
+
+
+ + ☆ Nighttime Thermal Infrared Image Colorization with Feedback-based Object + Appearance Learning + + +
+ Stable imaging in adverse environments (e.g., total darkness) makes thermal +infrared (TIR) cameras a prevalent option for night scene perception. However, +the low contrast and lack of chromaticity of TIR images are detrimental to +human interpretation and subsequent deployment of RGB-based vision algorithms. +Therefore, it makes sense to colorize the nighttime TIR images by translating +them into the corresponding daytime color images (NTIR2DC). Despite the +impressive progress made in the NTIR2DC task, how to improve the translation +performance of small object classes is under-explored. To address this problem, +we propose a generative adversarial network incorporating feedback-based object +appearance learning (FoalGAN). Specifically, an occlusion-aware mixup module +and corresponding appearance consistency loss are proposed to reduce the +context dependence of object translation. As a representative example of small +objects in nighttime street scenes, we illustrate how to enhance the realism of +traffic light by designing a traffic light appearance loss. To further improve +the appearance learning of small objects, we devise a dual feedback learning +strategy to selectively adjust the learning frequency of different samples. In +addition, we provide pixel-level annotation for a subset of the Brno dataset, +which can facilitate the research of NTIR image understanding under multiple +weather conditions. Extensive experiments illustrate that the proposed FoalGAN +is not only effective for appearance learning of small objects, but also +outperforms other image translation methods in terms of semantic preservation +and edge consistency for the NTIR2DC task. + +
+
+ comment: 14 pages, 14 figures. arXiv admin note: text overlap with + arXiv:2208.02960 +
+
+
+
+
+ + ☆ Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive + Survey and Evaluation + + +
+ Multi-modal 3D scene understanding has gained considerable attention due to +its wide applications in many areas, such as autonomous driving and +human-computer interaction. Compared to conventional single-modal 3D +understanding, introducing an additional modality not only elevates the +richness and precision of scene interpretation but also ensures a more robust +and resilient understanding. This becomes especially crucial in varied and +challenging environments where solely relying on 3D data might be inadequate. +While there has been a surge in the development of multi-modal 3D methods over +past three years, especially those integrating multi-camera images (3D+2D) and +textual descriptions (3D+language), a comprehensive and in-depth review is +notably absent. In this article, we present a systematic survey of recent +progress to bridge this gap. We begin by briefly introducing a background that +formally defines various 3D multi-modal tasks and summarizes their inherent +challenges. After that, we present a novel taxonomy that delivers a thorough +categorization of existing methods according to modalities and tasks, exploring +their respective strengths and limitations. Furthermore, comparative results of +recent approaches on several benchmark datasets, together with insightful +analysis, are offered. Finally, we discuss the unresolved issues and provide +several potential avenues for future research. + +
+
+
+
+
+ + ☆ Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection NeurIPS 2023 + + +
+ Current research is primarily dedicated to advancing the accuracy of +camera-only 3D object detectors (apprentice) through the knowledge transferred +from LiDAR- or multi-modal-based counterparts (expert). However, the presence +of the domain gap between LiDAR and camera features, coupled with the inherent +incompatibility in temporal fusion, significantly hinders the effectiveness of +distillation-based enhancements for apprentices. Motivated by the success of +uni-modal distillation, an apprentice-friendly expert model would predominantly +rely on camera features, while still achieving comparable performance to +multi-modal models. To this end, we introduce VCD, a framework to improve the +camera-only apprentice model, including an apprentice-friendly multi-modal +expert and temporal-fusion-friendly distillation supervision. The multi-modal +expert VCD-E adopts an identical structure as that of the camera-only +apprentice in order to alleviate the feature disparity, and leverages LiDAR +input as a depth prior to reconstruct the 3D scene, achieving the performance +on par with other heterogeneous multi-modal experts. Additionally, a +fine-grained trajectory-based distillation module is introduced with the +purpose of individually rectifying the motion misalignment for each object in +the scene. With those improvements, our camera-only apprentice VCD-A sets new +state-of-the-art on nuScenes with a score of 63.1% NDS. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Region-controlled Style Transfer + + +
+ Image style transfer is a challenging task in computational vision. Existing +algorithms transfer the color and texture of style images by controlling the +neural network's feature layers. However, they fail to control the strength of +textures in different regions of the content image. To address this issue, we +propose a training method that uses a loss function to constrain the style +intensity in different regions. This method guides the transfer strength of +style features in different regions based on the gradient relationship between +style and content images. Additionally, we introduce a novel feature fusion +method that linearly transforms content features to resemble style features +while preserving their semantic relationships. Extensive experiments have +demonstrated the effectiveness of our proposed approach. + +
+
+
+
+
+ + ☆ Breaking of brightness consistency in optical flow with a lightweight + CNN network + + +
+ Sparse optical flow is widely used in various computer vision tasks; however, +assuming brightness consistency limits its performance in High Dynamic Range +(HDR) environments. In this work, a lightweight network is used to extract +illumination-robust convolutional features and corners with strong invariance. +Modifying the typical brightness consistency of the optical flow method to the +convolutional feature consistency yields the light-robust hybrid optical flow +method. The proposed network runs at 190 FPS on a commercial CPU because it +uses only four convolutional layers to extract feature maps and score maps +simultaneously. Since the shallow network is difficult to train directly, a +deep network is designed to compute the reliability map that helps it. An +end-to-end unsupervised training mode is used for both networks. To validate +the proposed method, we compare corner repeatability and matching performance +with the original optical flow under dynamic illumination. In addition, a more +accurate visual inertial system is constructed by replacing the optical flow +method in VINS-Mono. In a public HDR dataset, it reduces translation errors by +93\%. The code is publicly available at https://github.com/linyicheng1/LET-NET. + +
+
+ comment: 7 pages,7 figures +
+
+
+
+
+ + ☆ Mean Teacher DETR with Masked Feature Alignment: A Robust Domain + Adaptive Detection Transformer Framework + + +
+ Unsupervised domain adaptation object detection (UDAOD) research on Detection +Transformer (DETR) mainly focuses on feature alignment, and existing methods can +be divided into two kinds, each of which has its unresolved issues. One-stage +feature alignment methods can easily lead to performance fluctuation and +training stagnation. Two-stage feature alignment methods based on mean teacher +comprise a pretraining stage followed by a self-training stage, each facing +problems in obtaining a reliable pretrained model and achieving consistent +performance gains. The methods mentioned above have not yet explored how to utilize +a third related domain, such as a target-like domain, to assist adaptation. To +address these issues, we propose a two-stage framework named MTM, i.e. Mean +Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we +utilize labeled target-like images produced by image style transfer to avoid +performance fluctuation. In the self-training stage, we leverage unlabeled +target images by pseudo labels based on mean teacher and propose a module +called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance +gains of the student model. Most importantly, we propose masked feature +alignment methods including Masked Domain Query-based Feature Alignment (MDQFA) +and Masked Token-wise Feature Alignment (MTWFA) to alleviate domain shift in a +more robust way, which not only prevent training stagnation and lead to a +robust pretrained model in the pretraining stage, but also enhance the model's +target performance in the self-training stage. Experiments on three challenging +scenarios and a theoretical analysis verify the effectiveness of MTM. + +
+
+
+
+
+ + ☆ GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D + Object Detection + + +
+ Geometry plays a significant role in monocular 3D object detection. It can be +used to estimate object depth by using the perspective projection between +an object's physical size and its 2D projection in the image plane, which can +introduce mathematical priors into deep models. However, this projection +process also introduces error amplification, where the error of the estimated +height is amplified and reflected into the projected depth. It leads to +unreliable depth inferences and also impairs training stability. To tackle this +problem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++) +by modeling geometry projection in a probabilistic manner. This ensures depth +predictions are well-bounded and associated with a reasonable uncertainty. The +significance of introducing such geometric uncertainty is two-fold: (1) It +models the uncertainty propagation relationship of the geometry projection +during training, improving the stability and efficiency of the end-to-end model +learning. (2) It can be derived to a highly reliable confidence to indicate +the quality of the 3D detection result, enabling more reliable detection +inference. Experiments show that the proposed approach not only obtains +state-of-the-art (SOTA) performance in image-based monocular 3D detection but +also demonstrates superiority in efficacy with a simplified framework. + +
+
+ comment: 18 pages, 9 figures +
+
+
+
+
+ + ☆ Grasp Multiple Objects with One Hand + + +
+ The human hand's complex kinematics allow for simultaneous grasping and +manipulation of multiple objects, essential for tasks like object transfer and +in-hand manipulation. Despite its importance, robotic multi-object grasping +remains underexplored and presents challenges in kinematics, dynamics, and +object configurations. This paper introduces MultiGrasp, a two-stage method for +multi-object grasping on a tabletop with a multi-finger dexterous hand. It +involves (i) generating pre-grasp proposals and (ii) executing the grasp and +lifting the objects. Experimental results primarily focus on dual-object +grasping and report a 44.13% success rate, showcasing adaptability to unseen +object configurations and imprecise grasps. The framework also demonstrates the +capability to grasp more than two objects, albeit at a reduced inference speed. + +
+
+
+
+
+ + ☆ Emergent Communication in Interactive Sketch Question Answering NeurIPS 2023 + + +
+ Vision-based emergent communication (EC) aims to learn to communicate through +sketches and demystify the evolution of human communication. Ironically, +previous works neglect multi-round interaction, which is indispensable in human +communication. To fill this gap, we first introduce a novel Interactive Sketch +Question Answering (ISQA) task, where two collaborative players are interacting +through sketches to answer a question about an image in a multi-round manner. +To accomplish this task, we design a new and efficient interactive EC system, +which can achieve an effective balance among three evaluation factors, +including the question answering accuracy, drawing complexity and human +interpretability. Our experimental results including human evaluation +demonstrate that multi-round interactive mechanism facilitates targeted and +efficient communication between intelligent agents with decent human +interpretability. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Facial Data Minimization: Shallow Model as Your Privacy Filter + + +
+ Face recognition service has been used in many fields and brings much +convenience to people. However, once the user's facial data is transmitted to a +service provider, the user will lose control of his/her private data. In recent +years, there exist various security and privacy issues due to the leakage of +facial data. Although many privacy-preserving methods have been proposed, they +usually fail when they are not accessible to adversaries' strategies or +auxiliary data. Hence, in this paper, by fully considering two cases of +uploading facial images and facial features, which are very typical in face +recognition service systems, we proposed a data privacy minimization +transformation (PMT) method. This method can process the original facial data +based on the shallow model of authorized services to obtain the obfuscated +data. The obfuscated data can not only maintain satisfactory performance on +authorized models and restrict the performance on other unauthorized models but +also prevent original privacy data from leaking by AI methods and human visual +theft. Additionally, since a service provider may execute preprocessing +operations on the received data, we also propose an enhanced perturbation +method to improve the robustness of PMT. Besides, to authorize one facial image +to multiple service models simultaneously, a multiple restriction mechanism is +proposed to improve the scalability of PMT. Finally, we conduct extensive +experiments and evaluate the effectiveness of the proposed PMT in defending +against face reconstruction, data abuse, and face attribute estimation attacks. +These experimental results demonstrate that PMT performs well in preventing +facial data abuse and privacy leakage while maintaining face recognition +accuracy. + +
+
+ comment: 14 pages, 11 figures +
+
+
+
+
+ + ☆ Multimodal Representations for Teacher-Guided Compositional Visual + Reasoning + + +
+ Neural Module Networks (NMN) are a compelling method for visual question +answering, enabling the translation of a question into a program consisting of +a series of reasoning sub-tasks that are sequentially executed on the image to +produce an answer. NMNs provide enhanced explainability compared to integrated +models, allowing for a better understanding of the underlying reasoning +process. To improve the effectiveness of NMNs we propose to exploit features +obtained by a large-scale cross-modal encoder. Also, the current training +approach of NMNs relies on the propagation of module outputs to subsequent +modules, leading to the accumulation of prediction errors and the generation of +false answers. To mitigate this, we introduce an NMN learning strategy +involving scheduled teacher guidance. Initially, the model is fully guided by +the ground-truth intermediate outputs, but gradually transitions to an +autonomous behavior as training progresses. This reduces error accumulation, +thus improving training efficiency and final performance. We demonstrate that by +incorporating cross-modal features and employing more effective training +techniques for NMN, we achieve a favorable balance between performance and +transparency in the reasoning process. + +
+
+
+
+
+ + ☆ VMAF Re-implementation on PyTorch: Some Experimental Results + + +
+ Based on the standard VMAF implementation we propose an implementation of +VMAF using PyTorch framework. For this implementation comparisons with the +standard (libvmaf) show the discrepancy $\lesssim 10^{-2}$ in VMAF units. We +investigate gradients computation when using VMAF as an objective function and +demonstrate that training using this function does not result in ill-behaving +gradients. + +
+
+ comment: 4 pages +
+
+
+
+
+ + ☆ I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal + Mutual Distillation + + +
+ Recent progress on self-supervised 3D human action representation learning +is largely attributed to contrastive learning. However, in conventional +contrastive frameworks, the rich complementarity between different skeleton +modalities remains under-explored. Moreover, optimized with distinguishing +self-augmented samples, models struggle with numerous similar positive +instances in the case of limited action categories. In this work, we tackle the +aforementioned problems by introducing a general Inter- and Intra-modal Mutual +Distillation (I$^2$MD) framework. In I$^2$MD, we first re-formulate the +cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process. +Different from existing distillation solutions that transfer the knowledge of a +pre-trained and fixed teacher to the student, in CMD, the knowledge is +continuously updated and bidirectionally distilled between modalities during +pre-training. To alleviate the interference of similar samples and exploit +their underlying contexts, we further design the Intra-modal Mutual +Distillation (IMD) strategy. In IMD, the Dynamic Neighbors Aggregation (DNA) +mechanism is first introduced, where an additional cluster-level discrimination +branch is instantiated in each modality. It adaptively aggregates +highly-correlated neighboring features, forming local cluster-level +contrasting. Mutual distillation is then performed between the two branches for +cross-level knowledge exchange. Extensive experiments on three datasets show +that our approach sets a series of new records. + +
+
+ comment: submitted to IJCV. arXiv admin note: substantial text overlap with + arXiv:2208.12448 +
+
+
+
+
+ + ☆ PET Synthesis via Self-supervised Adaptive Residual Estimation + Generative Adversarial Network + + +
+ Positron emission tomography (PET) is a widely used, highly sensitive +molecular imaging in clinical diagnosis. There is interest in reducing the +radiation exposure from PET but also maintaining adequate image quality. Recent +methods using convolutional neural networks (CNNs) to generate synthesized +high-quality PET images from low-dose counterparts have been reported to be +state-of-the-art for low-to-high image recovery methods. However, these methods +are prone to exhibiting discrepancies in texture and structure between +synthesized and real images. Furthermore, the distribution shift between +low-dose PET and standard PET has not been fully investigated. To address these +issues, we developed a self-supervised adaptive residual estimation generative +adversarial network (SS-AEGAN). We introduce (1) An adaptive residual +estimation mapping mechanism, AE-Net, designed to dynamically rectify the +preliminary synthesized PET images by taking the residual map between the +low-dose PET and synthesized output as the input, and (2) A self-supervised +pre-training strategy to enhance the feature representation of the coarse +generator. Our experiments with a public benchmark dataset of total-body PET +images show that SS-AEGAN consistently outperformed the state-of-the-art +synthesis methods with various dose reduction factors. + +
+
+ comment: This work has been submitted to the IEEE for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ☆ Learning with Noisy Labels Using Collaborative Sample Selection and + Contrastive Semi-Supervised Learning + + +
+ Learning with noisy labels (LNL) has been extensively studied, with existing +approaches typically following a framework that alternates between clean sample +selection and semi-supervised learning (SSL). However, this approach has a +limitation: the clean set selected by the Deep Neural Network (DNN) classifier, +trained through self-training, inevitably contains noisy samples. This mixture +of clean and noisy samples leads to misguidance in DNN training during SSL, +resulting in impaired generalization performance due to confirmation bias +caused by error accumulation in sample selection. To address this issue, we +propose a method called Collaborative Sample Selection (CSS), which leverages +the large-scale pre-trained model CLIP. CSS aims to remove the mixed noisy +samples from the identified clean set. We achieve this by training a +2-Dimensional Gaussian Mixture Model (2D-GMM) that combines the probabilities +from CLIP with the predictions from the DNN classifier. To further enhance the +adaptation of CLIP to LNL, we introduce a co-training mechanism with a +contrastive loss in semi-supervised learning. This allows us to jointly train +the prompt of CLIP and the DNN classifier, resulting in improved feature +representation, boosted classification performance of DNNs, and reciprocal +benefits to our Collaborative Sample Selection. By incorporating auxiliary +information from CLIP and utilizing prompt fine-tuning, we effectively +eliminate noisy samples from the clean set and mitigate confirmation bias +during training. Experimental results on multiple benchmark datasets +demonstrate the effectiveness of our proposed method in comparison with the +state-of-the-art approaches. + +
+
+
+
+
+ + ☆ Cross-view Self-localization from Synthesized Scene-graphs + + +
+ Cross-view self-localization is a challenging scenario of visual place +recognition in which database images are provided from sparse viewpoints. +Recently, an approach for synthesizing database images from unseen viewpoints +using NeRF (Neural Radiance Fields) technology has emerged with impressive +performance. However, synthesized images provided by these techniques are often +of lower quality than the original images, and furthermore they significantly +increase the storage cost of the database. In this study, we explore a new +hybrid scene model that combines the advantages of view-invariant appearance +features computed from raw images and view-dependent spatial-semantic features +computed from synthesized images. These two types of features are then fused +into scene graphs, and compressively learned and recognized by a graph neural +network. The effectiveness of the proposed method was verified using a novel +cross-view self-localization dataset with many unseen views generated using a +photorealistic Habitat simulator. + +
+
+ comment: 5 pages, 5 figures, technical report +
+
+
+
+
+ + ☆ Salient Object Detection in RGB-D Videos + + +
+ Given the widespread adoption of depth-sensing acquisition devices, RGB-D +videos and related data/media have gained considerable traction in various +aspects of daily life. Consequently, conducting salient object detection (SOD) +in RGB-D videos presents a highly promising and evolving avenue. Despite the +potential of this area, SOD in RGB-D videos remains somewhat under-explored, +with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To +explore this emerging field, this paper makes two primary contributions: the +dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D +VSOD dataset with realistic depth and characterized by its diversity of scenes +and rigorous frame-by-frame annotations. We validate the dataset through +comprehensive attribute and object-oriented analyses, and provide training and +testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored +for RGB-D VSOD that emphasizes the RGB modality and treats depth and optical +flow as auxiliary modalities. In pursuit of effective feature enhancement, +refinement, and fusion for precise final prediction, we propose two modules: +the multi-modal attention module (MAM) and the refinement fusion module (RFM). +To enhance interaction and fusion within RFM, we design a universal interaction +module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) +for refining multi-modal low-level features before reaching RFMs. Comprehensive +experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, +highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD +models. Ablation experiments were performed on both pseudo and realistic RGB-D +video datasets to demonstrate the advantages of individual modules as well as +the necessity of introducing realistic depth. Our code, together with the RDVS +dataset, will be available at https://github.com/kerenfu/RDVS/. + +
+
+
+
+
+ + ☆ DeepIron: Predicting Unwarped Garment Texture from a Single Image + + +
+ Realistic reconstruction of 3D clothing from an image has wide applications, +such as avatar creation and virtual try-on. This paper presents a novel +framework that reconstructs the texture map for 3D garments from a single image +with pose. Assuming that 3D garments are modeled by stitching 2D garment sewing +patterns, our specific goal is to generate a texture image for the sewing +patterns. A key component of our framework, the Texture Unwarper, infers the +original texture image from the input clothing image, which exhibits warping +and occlusion of texture due to the user's body shape and pose. The Texture +Unwarper effectively transforms between the input and output images by mapping +the latent spaces of the two images. By inferring the unwarped original texture +of the input garment, our method helps reconstruct 3D garment models that can +show high-quality texture images realistically deformed for new poses. We +validate the effectiveness of our approach through a comparison with other +methods and ablation studies. Additionally, we release a large dataset of +garment sewing patterns with textures and images of avatars wearing the +garments, which will be useful for future research on garment texture +reconstruction and synthesis. + +
+
+
+
+
+ + ☆ Fast Propagation is Better: Accelerating Single-Step Adversarial + Training via Sampling Subnetworks + + +
+ Adversarial training has shown promise in building robust models against +adversarial examples. A major drawback of adversarial training is the +computational overhead introduced by the generation of adversarial examples. To +overcome this limitation, adversarial training based on single-step attacks has +been explored. Previous work improves the single-step adversarial training from +different perspectives, e.g., sample initialization, loss regularization, and +training strategy. Almost all of them treat the underlying model as a black +box. In this work, we propose to exploit the interior building blocks of the +model to improve efficiency. Specifically, we propose to dynamically sample +lightweight subnetworks as a surrogate model during training. By doing this, +both the forward and backward passes can be accelerated for efficient +adversarial training. Besides, we provide theoretical analysis to show the +model robustness can be improved by the single-step adversarial training with +sampled subnetworks. Furthermore, we propose a novel sampling strategy where +the sampling varies from layer to layer and from iteration to iteration. +Compared with previous methods, our method not only reduces the training cost +but also achieves better model robustness. Evaluations on a series of popular +datasets demonstrate the effectiveness of the proposed FP-Better. Our code has +been released at https://github.com/jiaxiaojunQAQ/FP-Better. + +
+
+
+
+
+ + ☆ G2-MonoDepth: A General Framework of Generalized Depth Inference from + Monocular RGB+X Data + + +
+ Monocular depth inference is a fundamental problem for scene perception of +robots. Specific robots may be equipped with a camera plus an optional depth +sensor of any type and located in various scenes of different scales, whereas +recent advances derived multiple individual sub-tasks. It leads to additional +burdens to fine-tune models for specific robots and thereby high-cost +customization in large-scale industrialization. This paper investigates a +unified task of monocular depth inference, which infers high-quality depth maps +from all kinds of input raw data from various robots in unseen scenes. A basic +benchmark G2-MonoDepth is developed for this task, which comprises four +components: (a) a unified data representation RGB+X to accommodate RGB plus raw +depth with diverse scene scale/semantics, depth sparsity ([0%, 100%]) and +errors (holes/noises/blurs), (b) a novel unified loss to adapt to diverse depth +sparsity/errors of input raw data and diverse scales of output scenes, (c) an +improved network to well propagate diverse scene scales from input to output, +and (d) a data augmentation pipeline to simulate all types of real artifacts in +raw depth maps for training. G2-MonoDepth is applied in three sub-tasks +including depth estimation, depth completion with different sparsity, and depth +enhancement in unseen scenes, and it always outperforms SOTA baselines on both +real-world data and synthetic data. + +
+
+ comment: 18 pages, 16 figures +
+
+
+
+
+ + ☆ Pixel-Level Clustering Network for Unsupervised Image Segmentation + + +
+ While image segmentation is crucial in various computer vision applications, +such as autonomous driving, grasping, and robot navigation, annotating all +objects at the pixel-level for training is nearly impossible. Therefore, the +study of unsupervised image segmentation methods is essential. In this paper, +we present a pixel-level clustering framework for segmenting images into +regions without using ground truth annotations. The proposed framework includes +feature embedding modules with an attention mechanism, a feature statistics +computing module, image reconstruction, and superpixel segmentation to achieve +accurate unsupervised segmentation. Additionally, we propose a training +strategy that utilizes intra-consistency within each superpixel, +inter-similarity/dissimilarity between neighboring superpixels, and structural +similarity between images. To avoid potential over-segmentation caused by +superpixel-based losses, we also propose a post-processing method. Furthermore, +we present an extension of the proposed method for unsupervised semantic +segmentation. We conducted experiments on three publicly available datasets +(Berkeley segmentation dataset, PASCAL VOC 2012 dataset, and COCO-Stuff +dataset) to demonstrate the effectiveness of the proposed framework. The +experimental results show that the proposed framework outperforms previous +state-of-the-art methods. + +
+
+ comment: 13 pages +
+
+
+
+
+ + ☆ On the Foundations of Shortcut Learning + + +
+ Deep-learning models can extract a rich assortment of features from data. +Which features a model uses depends not only on predictivity-how reliably a +feature indicates train-set labels-but also on availability-how easily the +feature can be extracted, or leveraged, from inputs. The literature on shortcut +learning has noted examples in which models privilege one feature over another, +for example texture over shape and image backgrounds over foreground objects. +Here, we test hypotheses about which input properties are more available to a +model, and systematically study how predictivity and availability interact to +shape models' feature use. We construct a minimal, explicit generative +framework for synthesizing classification datasets with two latent features +that vary in predictivity and in factors we hypothesize to relate to +availability, and quantify a model's shortcut bias-its over-reliance on the +shortcut (more available, less predictive) feature at the expense of the core +(less available, more predictive) feature. We find that linear models are +relatively unbiased, but introducing a single hidden layer with ReLU or Tanh +units yields a bias. Our empirical findings are consistent with a theoretical +account based on Neural Tangent Kernels. Finally, we study how models used in +practice trade off predictivity and availability in naturalistic datasets, +discovering availability manipulations which increase models' degree of +shortcut bias. Taken together, these findings suggest that the propensity to +learn shortcut features is a fundamental characteristic of deep nonlinear +architectures warranting systematic study given its role in shaping how models +solve tasks. + +
+
+
+
+
+ + ☆ TiC-CLIP: Continual Training of CLIP Models + + +
+ Keeping large foundation models up to date on the latest data is inherently +expensive. To avoid the prohibitive costs of constantly retraining, it is +imperative to continually train these models. This problem is exacerbated by +the lack of any large scale continual learning benchmarks or baselines. We +introduce the first set of web-scale Time-Continual (TiC) benchmarks for +training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps with +over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first +use our benchmarks to curate various dynamic evaluations to measure temporal +robustness of existing models. We show OpenAI's CLIP (trained on data up to +2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from +2021--2022 compared with more recently trained models in the OpenCLIP repository. +We then study how to efficiently train models on time-continuous data. We +demonstrate that a simple rehearsal-based approach that continues training from +the last checkpoint and replays old data reduces compute by $2.5\times$ when +compared to the standard practice of retraining from scratch. + +
+
+
+
+
+ + ☆ Hierarchical Randomized Smoothing + + +
+ Real-world data is complex and often consists of objects that can be +decomposed into multiple entities (e.g. images into pixels, graphs into +interconnected nodes). Randomized smoothing is a powerful framework for making +models provably robust against small changes to their inputs - by guaranteeing +robustness of the majority vote when randomly adding noise before +classification. Yet, certifying robustness on such complex data via randomized +smoothing is challenging when adversaries do not arbitrarily perturb entire +objects (e.g. images) but only a subset of their entities (e.g. pixels). As a +solution, we introduce hierarchical randomized smoothing: We partially smooth +objects by adding random noise only on a randomly selected subset of their +entities. By adding noise in a more targeted manner than existing methods we +obtain stronger robustness guarantees while maintaining high accuracy. We +initialize hierarchical smoothing using different noising distributions, +yielding novel robustness certificates for discrete and continuous domains. We +experimentally demonstrate the importance of hierarchical smoothing in image +and node classification, where it yields superior robustness-accuracy +trade-offs. Overall, hierarchical smoothing is an important contribution +towards models that are both - certifiably robust to perturbations and +accurate. + +
+
+
+
+
+ + ☆ ShadowSense: Unsupervised Domain Adaptation and Feature Fusion for + Shadow-Agnostic Tree Crown Detection from RGB-Thermal Drone Imagery WACV + + +
+ Accurate detection of individual tree crowns from remote sensing data poses a +significant challenge due to the dense nature of forest canopy and the presence +of diverse environmental variations, e.g., overlapping canopies, occlusions, +and varying lighting conditions. Additionally, the lack of data for training +robust models adds another limitation in effectively studying complex forest +conditions. This paper presents a novel method for detecting shadowed tree +crowns and provides a challenging dataset comprising roughly 50k paired +RGB-thermal images to facilitate future research for illumination-invariant +detection. The proposed method (ShadowSense) is entirely self-supervised, +leveraging domain adversarial training without source domain annotations for +feature extraction and foreground feature alignment for feature pyramid +networks to adapt domain-invariant representations by focusing on visible +foreground regions, respectively. It then fuses complementary information of +both modalities to effectively improve upon the predictions of an RGB-trained +detector and boost the overall accuracy. Extensive experiments demonstrate the +superiority of the proposed method over both the baseline RGB-trained detector +and state-of-the-art techniques that rely on unsupervised domain adaptation or +early image fusion. Our code and data are available: +https://github.com/rudrakshkapil/ShadowSense + +
+
+ comment: Accepted in IEEE/CVF Winter Applications of Computer Vision (WACV) + 2024 main conference! 8 pages (11 with bibliography), 5 figures, 3 tables +
+
+
+
+
+ + ☆ Sea-Land-Cloud Segmentation in Satellite Hyperspectral Imagery by Deep + Learning + + +
+ Satellites are increasingly adopting on-board Artificial Intelligence (AI) +techniques to enhance platforms' autonomy through edge inference. In this +context, the utilization of deep learning (DL) techniques for segmentation in +HS satellite imagery offers advantages for remote sensing applications, and +therefore, we train 16 different models, whose codes are made available through +our study, which we consider to be relevant for on-board multi-class +segmentation of HS imagery, focusing on classifying oceanic (sea), terrestrial +(land), and cloud formations. We employ the HYPSO-1 mission as an illustrative +case for sea-land-cloud segmentation, and to demonstrate the utility of the +segments, we introduce a novel sea-land-cloud ranking application scenario. Our +system prioritizes HS image downlink based on sea, land, and cloud coverage +levels from the segmented images. We comparatively evaluate the models for +in-orbit deployment, considering performance, parameter count, and inference +time. The models include both shallow and deep models, and after we propose +four new DL models, we demonstrate that segmenting single spectral signatures +(1D) outperforms 3D data processing comprising both spectral (1D) and spatial +(2D) contexts. We conclude that our lightweight DL model, called +1D-Justo-LiuNet, consistently surpasses state-of-the-art models for +sea-land-cloud segmentation, such as U-Net and its variations, in terms of +performance (0.93 accuracy) and parameter count (4,563). However, the 1D models +present longer inference time (15s) in the tested processing architecture, +which is clearly suboptimal. Finally, after demonstrating that in-orbit image +segmentation should occur post L1b radiance calibration rather than on raw +data, we additionally show that reducing spectral channels down to 3 lowers +models' parameters and inference time, at the cost of weaker segmentation +performance. + +
+
+ comment: Remote Sensing, Satellite Imagery, Hyperspectral Imaging, Deep + Learning, Segmentation +
+
+
+
+
+ + ☆ Learning Low-Rank Latent Spaces with Simple Deterministic Autoencoder: + Theoretical and Empirical Insights WACV 2024 + + +
+ The autoencoder is an unsupervised learning paradigm that aims to create a +compact latent representation of data by minimizing the reconstruction loss. +However, it tends to overlook the fact that most data (images) are embedded in +a lower-dimensional space, which is crucial for effective data representation. +To address this limitation, we propose a novel approach called Low-Rank +Autoencoder (LoRAE). In LoRAE, we incorporated a low-rank regularizer to +adaptively reconstruct a low-dimensional latent space while preserving the +basic objective of an autoencoder. This helps embed the data in a +lower-dimensional space while preserving important information. It is a simple +autoencoder extension that learns low-rank latent space. Theoretically, we +establish a tighter error bound for our model. Empirically, our model's +superiority shines through various tasks such as image generation and +downstream classification. Both theoretical and practical outcomes highlight +the importance of acquiring low-dimensional embeddings. + +
+
+ comment: Accepted @ IEEE/CVF WACV 2024 +
+
+
+
+
+ + ☆ G-CASCADE: Efficient Cascaded Graph Convolutional Decoding for 2D + Medical Image Segmentation WACV 2024 + + +
+ In recent years, medical image segmentation has become an important +application in the field of computer-aided diagnosis. In this paper, we are the +first to propose a new graph convolution-based decoder namely, Cascaded Graph +Convolutional Attention Decoder (G-CASCADE), for 2D medical image segmentation. +G-CASCADE progressively refines multi-stage feature maps generated by +hierarchical transformer encoders with an efficient graph convolution block. +The encoder utilizes the self-attention mechanism to capture long-range +dependencies, while the decoder refines the feature maps preserving long-range +information due to the global receptive fields of the graph convolution block. +Rigorous evaluations of our decoder with multiple transformer encoders on five +medical image segmentation tasks (i.e., Abdomen organs, Cardiac organs, Polyp +lesions, Skin lesions, and Retinal vessels) show that our model outperforms +other state-of-the-art (SOTA) methods. We also demonstrate that our decoder +achieves better DICE scores than the SOTA CASCADE decoder with 80.8% fewer +parameters and 82.3% fewer FLOPs. Our decoder can easily be used with other +hierarchical encoders for general-purpose semantic and medical image +segmentation tasks. + +
+
+ comment: 13 pages, IEEE/CVF Winter Conference on Applications of Computer + Vision (WACV 2024) +
+
+
+
+
+ + ☆ iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis SIGGRAPH + + +
+ We present a method for generating consistent novel views from a single +source image. Our approach focuses on maximizing the reuse of visible pixels +from the source image. To achieve this, we use a monocular depth estimator that +transfers visible pixels from the source view to the target view. Starting from +a pre-trained 2D inpainting diffusion model, we train our method on the +large-scale Objaverse dataset to learn 3D object priors. While training we use +a novel masking mechanism based on epipolar lines to further improve the +quality of our approach. This allows our framework to perform zero-shot novel +view synthesis on a variety of objects. We evaluate the zero-shot abilities of +our framework on three challenging datasets: Google Scanned Objects, Ray Traced +Multiview, and Common Objects in 3D. See our webpage for more details: +https://yashkant.github.io/invs/ + +
+
+ comment: Accepted to SIGGRAPH Asia, 2023 (Conference Papers) +
+
+
+
+
+ + ☆ MyriadAL: Active Few Shot Learning for Histopathology + + +
+ Active Learning (AL) and Few Shot Learning (FSL) are two label-efficient +methods which have achieved excellent results recently. However, most prior +arts in both learning paradigms fail to explore the wealth of the vast +unlabelled data. In this study, we address this issue in the scenario where the +annotation budget is very limited, yet a large amount of unlabelled data for +the target task is available. We frame this work in the context of +histopathology where labelling is prohibitively expensive. To this end, we +introduce an active few shot learning framework, Myriad Active Learning (MAL), +including a contrastive-learning encoder, pseudo-label generation, and novel +query sample selection in the loop. Specifically, we propose to massage +unlabelled data in a self-supervised manner, where the obtained data +representations and clustering knowledge form the basis to activate the AL +loop. With feedback from the oracle in each AL cycle, the pseudo-labels of the +unlabelled data are refined by optimizing a shallow task-specific net on top of +the encoder. These updated pseudo-labels serve to inform and improve the active +learning query selection process. Furthermore, we introduce a novel recipe to +combine existing uncertainty measures and utilize the entire uncertainty list +to reduce sample redundancy in AL. Extensive experiments on two public +histopathology datasets show that MAL has superior test accuracy, macro +F1-score, and label efficiency compared to prior works, and can achieve a +comparable test accuracy to a fully supervised algorithm while labelling only +5% of the dataset. + +
+
+ comment: 9 pages, 2 figures, 6 tables +
+
+
+
+
+ + ☆ Yin Yang Convolutional Nets: Image Manifold Extraction by the Analysis + of Opposites + + +
+ Computer vision in general has seen several advances, such as training +optimizations and new architectures (pure attention, efficient block, vision +language models, generative models, among others). These have improved +performance in several tasks, such as classification. However, the +majority of these models focus on modifications that move away from +realistic neuroscientific approaches related to the brain. In this work, we +adopt a more bio-inspired approach and present the Yin Yang Convolutional +Network, an architecture that extracts a visual manifold; its blocks are intended +to separate the analysis of colors and forms in its initial layers, simulating +the occipital lobe's operations. Our results show that our architecture provides +state-of-the-art efficiency among low-parameter architectures on the CIFAR-10 +dataset. Our first model reached 93.32\% test accuracy, 0.8\% more than the +older SOTA in this category, while having 150k fewer parameters (726k in total). +Our second model uses 52k parameters, losing only 3.86\% test accuracy. We also +performed an analysis on ImageNet, where we reached 66.49\% validation accuracy +with 1.6M parameters. We make the code publicly available at: +https://github.com/NoSavedDATA/YinYang_CNN. + +
+
+ comment: 12 pages, 5 tables and 6 figures +
+
+
+
+
+ + ☆ Pix2HDR -- A pixel-wise acquisition and deep learning-based synthesis + approach for high-speed HDR videos + + +
+ Accurately capturing dynamic scenes with wide-ranging motion and light +intensity is crucial for many vision applications. However, acquiring +high-speed high dynamic range (HDR) video is challenging because the camera's +frame rate restricts its dynamic range. Existing methods sacrifice speed to +acquire multi-exposure frames. Yet, misaligned motion in these frames can still +pose complications for HDR fusion algorithms, resulting in artifacts. Instead +of frame-based exposures, we sample the videos using individual pixels at +varying exposures and phase offsets. Implemented on a pixel-wise programmable +image sensor, our sampling pattern simultaneously captures fast motion at a +high dynamic range. We then transform pixel-wise outputs into an HDR video +using end-to-end learned weights from deep neural networks, achieving high +spatiotemporal resolution with minimized motion blurring. We demonstrate +aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under +low-light conditions and against bright backgrounds - both challenging +conditions for conventional cameras. By combining the versatility of pixel-wise +sampling patterns with the strength of deep neural networks at decoding complex +scenes, our method greatly enhances the vision system's adaptability and +performance in dynamic conditions. + +
+
+ comment: 14 pages, 14 figures +
+
+
+
+
+ + ☆ Subtle Signals: Video-based Detection of Infant Non-nutritive Sucking as + a Neurodevelopmental Cue + + +
+ Non-nutritive sucking (NNS), which refers to the act of sucking on a +pacifier, finger, or similar object without nutrient intake, plays a crucial +role in assessing healthy early development. In the case of preterm infants, +NNS behavior is a key component in determining their readiness for feeding. In +older infants, the characteristics of NNS behavior offer valuable insights into +neural and motor development. Additionally, NNS activity has been proposed as a +potential safeguard against sudden infant death syndrome (SIDS). However, the +clinical application of NNS assessment is currently hindered by labor-intensive +and subjective finger-in-mouth evaluations. Consequently, researchers often +resort to expensive pressure transducers for objective NNS signal measurement. +To enhance the accessibility and reliability of NNS signal monitoring for both +clinicians and researchers, we introduce a vision-based algorithm designed for +non-contact detection of NNS activity using baby monitor footage in natural +settings. Our approach involves a comprehensive exploration of optical flow and +temporal convolutional networks, enabling the detection and amplification of +subtle infant-sucking signals. We successfully classify short video clips of +uniform length into NNS and non-NNS periods. Furthermore, we investigate manual +and learning-based techniques to piece together local classification results, +facilitating the segmentation of longer mixed-activity videos into NNS and +non-NNS segments of varying duration. Our research introduces two novel +datasets of annotated infant videos, including one sourced from our clinical +study featuring 19 infant subjects and 183 hours of overnight baby monitor +footage. + +
+
+
+
+
+ + ☆ Stereoscopic Depth Perception Through Foliage + + +
+ Both humans and computational methods struggle to discriminate the depths of +objects hidden beneath foliage. However, such discrimination becomes feasible +when we combine computational optical synthetic aperture sensing with the human +ability to fuse stereoscopic images. For object identification tasks, as +required in search and rescue, wildlife observation, surveillance, and early +wildfire detection, depth assists in differentiating true from false findings, +such as people, animals, or vehicles vs. sun-heated patches at the ground level +or in the tree crowns, or ground fires vs. tree trunks. We used video captured +by a drone above dense woodland to test users' ability to discriminate depth. +We found that this is impossible when viewing monoscopic video and relying on +motion parallax. The same was true with stereoscopic video because of the +occlusions caused by foliage. However, when synthetic aperture sensing was used +to reduce occlusions and disparity-scaled stereoscopic video was presented, +whereas computational (stereoscopic matching) methods were unsuccessful, human +observers successfully discriminated depth. This shows the potential of systems +which exploit the synergy between computational methods and human vision to +perform tasks that neither can perform alone. + +
+
+
+
+
+ + ☆ Wakening Past Concepts without Past Data: Class-Incremental Learning + from Online Placebos WACV 2024 + + +
+ Not forgetting old class knowledge is a key challenge for class-incremental +learning (CIL) when the model continuously adapts to new classes. A common +technique to address this is knowledge distillation (KD), which penalizes +prediction inconsistencies between old and new models. Such predictions are made +mostly with new class data, as old class data is extremely scarce due to the +strict memory limitation in CIL. In this paper, we take a deep dive into KD +losses and find that "using new class data for KD" not only hinders the model +adaptation (for learning new classes) but also results in low efficiency for +preserving old class knowledge. We address this by "using the placebos of old +classes for KD", where the placebos are chosen from a free image stream, such +as Google Images, in an automatic and economical fashion. To this end, we +train an online placebo selection policy to quickly evaluate the quality of +streaming images (good or bad placebos) and use only good ones for one-time +feed-forward computation of KD. We formulate the policy training process as an +online Markov Decision Process (MDP), and introduce an online learning +algorithm to solve this MDP problem without incurring much computational cost. In +experiments, we show that our method 1) is surprisingly effective even when +there is no class overlap between placebos and original old class data, 2) does +not require any additional supervision or memory budget, and 3) significantly +outperforms a number of top-performing CIL methods, in particular when using +lower memory budgets for old class exemplars, e.g., five exemplars per class. + +
+
+ comment: Accepted to WACV 2024. Code: + https://github.com/yaoyao-liu/online-placebos +
+
+
+
+
+ + ☆ Towards long-tailed, multi-label disease classification from chest + X-ray: Overview of the CXR-LT challenge + + +
+ Many real-world image recognition problems, such as diagnostic medical +imaging exams, are "long-tailed" – there are a few common +findings followed by many more relatively rare conditions. In chest +radiography, diagnosis is both a long-tailed and multi-label problem, as +patients often present with multiple findings simultaneously. While researchers +have begun to study the problem of long-tailed learning in medical image +recognition, few have studied the interaction of label imbalance and label +co-occurrence posed by long-tailed, multi-label disease classification. To +engage with the research community on this emerging topic, we conducted an open +challenge, CXR-LT, on long-tailed, multi-label thorax disease classification +from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset +of over 350,000 CXRs, each labeled with at least one of 26 clinical findings +following a long-tailed distribution. We synthesize common themes of +top-performing solutions, providing practical recommendations for long-tailed, +multi-label medical image classification. Finally, we use these insights to +propose a path forward involving vision-language foundation models for few- and +zero-shot disease classification. + +
+
+
+
+
+ + ☆ Complex Image Generation SwinTransformer Network for Audio Denoising + + +
+ Achieving high-performance audio denoising is still a challenging task in +real-world applications. Existing time-frequency methods often ignore the +quality of generated frequency domain images. This paper converts the audio +denoising problem into an image generation task. We first develop a complex +image generation SwinTransformer network to capture more information from the +complex Fourier domain. We then impose structure similarity and detailed loss +functions to generate high-quality images and develop an SDR loss to minimize +the difference between denoised and clean audios. Extensive experiments on two +benchmark datasets demonstrate that our proposed model is better than +state-of-the-art methods. + +
+
+
+
+
+ + ☆ LaksNet: an end-to-end deep learning model for self-driving cars in + Udacity simulator + + +
+ The majority of road accidents occur because of human errors, including +distraction, recklessness, and drunken driving. One of the effective ways to +overcome this dangerous situation is by implementing self-driving technologies +in vehicles. In this paper, we focus on building an efficient deep-learning +model for self-driving cars. We propose a new and effective convolutional +neural network model called `LaksNet' consisting of four convolutional layers +and two fully connected layers. We conduct extensive experiments using our +LaksNet model with the training data generated from the Udacity simulator. Our +model outperforms many existing pre-trained ImageNet and NVIDIA models in terms +of the duration of the car for which it drives without going off the track on +the simulator. + +
+
+
+
+
+ + ☆ Learned, Uncertainty-driven Adaptive Acquisition for Photon-Efficient + Multiphoton Microscopy + + +
+ Multiphoton microscopy (MPM) is a powerful imaging tool that has been a +critical enabler for live tissue imaging. However, since most multiphoton +microscopy platforms rely on point scanning, there is an inherent trade-off +between acquisition time, field of view (FOV), phototoxicity, and image +quality, often resulting in noisy measurements when fast, large FOV, and/or +gentle imaging is needed. Deep learning could be used to denoise multiphoton +microscopy measurements, but these algorithms can be prone to hallucination, +which can be disastrous for medical and scientific applications. We propose a +method to simultaneously denoise and predict pixel-wise uncertainty for +multiphoton imaging measurements, improving algorithm trustworthiness and +providing statistical guarantees for the deep learning predictions. +Furthermore, we propose to leverage this learned, pixel-wise uncertainty to +drive an adaptive acquisition technique that rescans only the most uncertain +regions of a sample. We demonstrate our method on experimental noisy MPM +measurements of human endometrium tissues, showing that we can maintain fine +features and outperform other denoising methods while predicting uncertainty at +each pixel. Finally, with our adaptive acquisition technique, we demonstrate a +120X reduction in acquisition time and total light dose while successfully +recovering fine features in the sample. We are the first to demonstrate +distribution-free uncertainty quantification for a denoising task with real +experimental data and the first to propose adaptive acquisition based on +reconstruction uncertainty. + +
+
+
+
+
+ + ☆ Deep Feature Registration for Unsupervised Domain Adaptation + + +
+ While unsupervised domain adaptation has been explored to leverage the +knowledge from a labeled source domain to an unlabeled target domain, existing +methods focus on the distribution alignment between two domains. However, how +to better align source and target features is not well addressed. In this +paper, we propose a deep feature registration (DFR) model to generate +registered features that maintain domain invariant features and simultaneously +minimize the domain-dissimilarity of registered features and target features +via histogram matching. We further employ a pseudo label refinement process, +which considers both probabilistic soft selection and center-based hard +selection to improve the quality of pseudo labels in the target domain. +Extensive experiments on multiple UDA benchmarks demonstrate the effectiveness +of our DFR model, resulting in new state-of-the-art performance. + +
+
+
+
+
+ + ☆ Anatomically-aware Uncertainty for Semi-supervised Image Segmentation + + +
+ Semi-supervised learning relaxes the need of large pixel-wise labeled +datasets for image segmentation by leveraging unlabeled data. A prominent way +to exploit unlabeled data is to regularize model predictions. Since the +predictions of unlabeled data can be unreliable, uncertainty-aware schemes are +typically employed to gradually learn from meaningful and reliable predictions. +Uncertainty estimation methods, however, rely on multiple inferences from the +model predictions that must be computed for each training step, which is +computationally expensive. Moreover, these uncertainty maps capture pixel-wise +disparities and do not consider global information. This work proposes a novel +method to estimate segmentation uncertainty by leveraging global information +from the segmentation masks. More precisely, an anatomically-aware +representation is first learnt to model the available segmentation masks. The +learnt representation thereupon maps the prediction of a new segmentation into +an anatomically-plausible segmentation. The deviation from the plausible +segmentation aids in estimating the underlying pixel-level uncertainty in order +to further guide the segmentation network. The proposed method consequently +estimates the uncertainty using a single inference from our representation, +thereby reducing the total computation. We evaluate our method on two publicly +available segmentation datasets of left atria in cardiac MRIs and of multiple +organs in abdominal CTs. Our anatomically-aware method improves the +segmentation accuracy over the state-of-the-art semi-supervised methods in +terms of two commonly used evaluation metrics. + +
+
+ comment: Accepted at Medical Image Analysis. Code is available at: + https://github.com/adigasu/Anatomically-aware_Uncertainty_for_Semi-supervised_Segmentation +
+
+
+
+
+ + ♻ ☆ Intelligent Debris Mass Estimation Model for Autonomous Underwater + Vehicle + + +
+ Marine debris poses a significant threat to the survival of marine wildlife, +often leading to entanglement and starvation, ultimately resulting in death. +Therefore, removing debris from the ocean is crucial to restore the natural +balance and allow marine life to thrive. Instance segmentation is an advanced +form of object detection that identifies objects and precisely locates and +separates them, making it an essential tool for autonomous underwater vehicles +(AUVs) to navigate and interact with their underwater environment effectively. +AUVs use image segmentation to analyze images captured by their cameras to +navigate underwater environments. In this paper, we use instance segmentation +to calculate the area of individual objects within an image. We use YOLOV7 in +Roboflow to generate a set of bounding boxes for each object in the image with +a class label and a confidence score for every detection. A segmentation mask +is then created for each object by applying a binary mask to the object's +bounding box. The masks are generated by applying a binary threshold to the +output of a convolutional neural network trained to segment objects from the +background. Finally, the segmentation mask for each object is refined by +applying post-processing techniques, such as morphological operations and +contour detection, to improve the accuracy and quality of the mask. The process +of estimating the area of instance segmentation involves calculating the area +of each segmented instance separately and then summing up the areas of all +instances to obtain the total area. The calculation is carried out using +standard formulas based on the shape of the object, such as rectangles and +circles. In cases where the object is complex, the Monte Carlo method is used +to estimate the area. This method provides a higher degree of accuracy than +traditional methods, especially when using a large number of samples. + +
+
+
+
+
+ + ♻ ☆ DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two + Quantization + + +
+ Efficiently deploying deep neural networks on low-resource edge devices is +challenging due to their ever-increasing resource requirements. To address this +issue, researchers have proposed multiplication-free neural networks, such as +Power-of-Two quantization, also known as Shift networks, which aim to reduce +memory usage and simplify computation. However, existing low-bit Shift networks +are not as accurate as their full-precision counterparts, typically suffering +from limited weight range encoding schemes and quantization loss. In this +paper, we propose the DenseShift network, which significantly improves the +accuracy of Shift networks, achieving competitive performance to full-precision +networks for vision and speech applications. In addition, we introduce a method +to deploy an efficient DenseShift network using non-quantized floating-point +activations, while obtaining 1.6X speed-up over existing methods. To achieve +this, we demonstrate that zero-weight values in low-bit Shift networks do not +contribute to model capacity and negatively impact inference computation. To +address this issue, we propose a zero-free shifting mechanism that simplifies +inference and increases model capacity. We further propose a sign-scale +decomposition design to enhance training efficiency and a low-variance random +initialization strategy to improve the model's transfer learning performance. +Our extensive experiments on various computer vision and speech tasks +demonstrate that DenseShift outperforms existing low-bit multiplication-free +networks and achieves competitive performance compared to full-precision +networks. Furthermore, our proposed approach exhibits strong transfer learning +performance without a drop in accuracy. Our code was released on GitHub. + +
+
+
+
+
+ + ♻ ☆ Towards Visual Saliency Explanations of Face Verification + + +
+ In the past years, deep convolutional neural networks have been pushing the +frontier of face recognition (FR) techniques in both verification and +identification scenarios. Despite the high accuracy, they are often criticized +for lacking explainability. There has been an increasing demand for +understanding the decision-making process of deep face recognition systems. +Recent studies have investigated the usage of visual saliency maps as an +explanation, but they often lack a discussion and analysis in the context of +face recognition. This paper concentrates on explainable face verification +tasks and conceives a new explanation framework. Firstly, a definition of the +saliency-based explanation method is provided, which focuses on the decisions +made by the deep FR model. Secondly, a new model-agnostic explanation method +named CorrRISE is proposed to produce saliency maps, which reveal both the +similar and dissimilar regions of any given pair of face images. Then, an +evaluation methodology is designed to measure the performance of general visual +saliency explanation methods in face verification. Finally, substantial visual +and quantitative results have shown that the proposed CorrRISE method +demonstrates promising results in comparison with other state-of-the-art +explainable face verification approaches. + +
+
+
+
+
+ + ♻ ☆ Learning to Generate Parameters of ConvNets for Unseen Image Data + + +
+ Typical Convolutional Neural Networks (ConvNets) depend heavily on large +amounts of image data and resort to an iterative optimization algorithm (e.g., +SGD or Adam) to learn network parameters, which makes training very time- and +resource-intensive. In this paper, we propose a new training paradigm and +formulate the parameter learning of ConvNets into a prediction task: given a +ConvNet architecture, we observe there exist correlations between image +datasets and their corresponding optimal network parameters, and explore if we +can learn a hyper-mapping between them to capture the relations, such that we +can directly predict the parameters of the network for an image dataset never +seen during the training phase. To do this, we put forward a new hypernetwork +based model, called PudNet, which intends to learn a mapping between datasets +and their corresponding network parameters, and then predicts parameters for +unseen data with only a single forward propagation. Moreover, our model +benefits from a series of adaptive hyper recurrent units sharing weights to +capture the dependencies of parameters among different network layers. +Extensive experiments demonstrate that our proposed method achieves good +efficacy for unseen image datasets on two kinds of settings: Intra-dataset +prediction and Inter-dataset prediction. Our PudNet also scales well to +large-scale datasets, e.g., ImageNet-1K. It takes 8967 GPU seconds to train +ResNet-18 on the ImageNet-1K using GC from scratch and obtain a top-5 accuracy +of 44.65 %. However, our PudNet costs only 3.89 GPU seconds to predict the +network parameters of ResNet-18 achieving comparable performance (44.92 %), +more than 2,300 times faster than the traditional training paradigm. + +
+
+
+
+
+ + ♻ ☆ Improving Fairness in Deepfake Detection + + +
+ Despite the development of effective deepfake detection models in recent +years, several recent studies have demonstrated that biases in the training +data utilized to develop deepfake detection models can lead to unfair +performance for demographic groups of different races and/or genders. Such biases can +result in these groups being unfairly targeted or excluded from detection, +allowing misclassified deepfakes to manipulate public opinion and erode trust +in the model. While these studies have focused on identifying and evaluating +the unfairness in deepfake detection, no methods have been developed to address +the fairness issue of deepfake detection at the algorithm level. In this work, +we make the first attempt to improve deepfake detection fairness by proposing +novel loss functions to train fair deepfake detection models in ways that are +agnostic or aware of demographic factors. Extensive experiments on four +deepfake datasets and five deepfake detectors demonstrate the effectiveness and +flexibility of our approach in improving the deepfake detection fairness. + +
+
+
+
+
+ + ♻ ☆ Label-Efficient Deep Learning in Medical Image Analysis: Challenges and + Future Directions + + +
+ Deep learning has seen rapid growth in recent years and achieved +state-of-the-art performance in a wide range of applications. However, training +models typically requires expensive and time-consuming collection of large +quantities of labeled data. This is particularly true within the scope of +medical imaging analysis (MIA), where data are limited and labels are expensive +to acquire. Thus, label-efficient deep learning methods are developed to +make comprehensive use of the labeled data as well as the abundance of +unlabeled and weak-labeled data. In this survey, we extensively investigated +over 300 recent papers to provide a comprehensive overview of recent progress +on label-efficient learning strategies in MIA. We first present the background +of label-efficient learning and categorize the approaches into different +schemes. Next, we examine the current state-of-the-art methods in detail +through each scheme. Specifically, we provide an in-depth investigation, +covering not only canonical semi-supervised, self-supervised, and +multi-instance learning schemes, but also recently emerged active and +annotation-efficient learning strategies. Moreover, as a comprehensive +contribution to the field, this survey not only elucidates the commonalities +and unique features of the surveyed methods but also presents a detailed +analysis of the current challenges in the field and suggests potential avenues +for future research. + +
+
+
+
+
+ + ♻ ☆ Perceptual Quality Assessment of NeRF and Neural View Synthesis Methods + for Front-Facing Views + + +
+ Neural view synthesis (NVS) is one of the most successful techniques for +synthesizing free viewpoint videos, capable of achieving high fidelity from +only a sparse set of captured images. This success has led to many variants of +the techniques, each evaluated on a set of test views typically using image +quality metrics such as PSNR, SSIM, or LPIPS. There has been a lack of research +on how NVS methods perform with respect to perceived video quality. We present +the first study on perceptual evaluation of NVS and NeRF variants. For this +study, we collected two datasets of scenes captured in a controlled lab +environment as well as in-the-wild. In contrast to existing datasets, these +scenes come with reference video sequences, allowing us to test for temporal +artifacts and subtle distortions that are easily overlooked when viewing only +static images. We measured the quality of videos synthesized by several NVS +methods in a well-controlled perceptual quality assessment experiment as well +as with many existing state-of-the-art image/video quality metrics. We present +a detailed analysis of the results and recommendations for dataset and metric +selection for NVS evaluation. + +
+
+
+
+
+ + ♻ ☆ Resolution learning in deep convolutional networks using scale-space + theory + + +
+ Resolution in deep convolutional neural networks (CNNs) is typically bounded +by the receptive field size through filter sizes, and subsampling layers or +strided convolutions on feature maps. The optimal resolution may vary +significantly depending on the dataset. Modern CNNs hard-code their resolution +hyper-parameters in the network architecture which makes tuning such +hyper-parameters cumbersome. We propose to do away with hard-coded resolution +hyper-parameters and aim to learn the appropriate resolution from data. We use +scale-space theory to obtain a self-similar parametrization of filters and make +use of the N-Jet: a truncated Taylor series to approximate a filter by a +learned combination of Gaussian derivative filters. The parameter sigma of the +Gaussian basis controls both the amount of detail the filter encodes and the +spatial extent of the filter. Since sigma is a continuous parameter, we can +optimize it with respect to the loss. The proposed N-Jet layer achieves +comparable performance when used in state-of-the art architectures, while +learning the correct resolution in each layer automatically. We evaluate our +N-Jet layer on both classification and segmentation, and we show that learning +sigma is especially beneficial for inputs at multiple sizes. + +
+
+ comment: Preprint accepted by IEEE Transactions on Image Processing, 2021 + (TIP). Link to final published article: + https://ieeexplore.ieee.org/abstract/document/9552550 +
+
+
+
+
+ + ♻ ☆ Shatter and Gather: Learning Referring Image Segmentation with Text + Supervision ICCV 2023 + + +
+ Referring image segmentation, the task of segmenting any arbitrary entities +described in free-form texts, opens up a variety of vision applications. +However, manual labeling of training data for this task is prohibitively +costly, leading to a lack of labeled data for training. We address this issue by +a weakly supervised learning approach using text descriptions of training +images as the only source of supervision. To this end, we first present a new +model that discovers semantic entities in the input image and then combines such +entities relevant to the text query to predict the mask of the referent. We also +present a new loss function that allows the model to be trained without any +further supervision. Our method was evaluated on four public benchmarks for +referring image segmentation, where it clearly outperformed the existing method +for the same task and recent open-vocabulary segmentation models on all the +benchmarks. + +
+
+ comment: Accepted to ICCV 2023, Project page: + https://southflame.github.io/sag/ +
+
+
+
+
+ + ♻ ☆ A Simple Baseline for Knowledge-Based Visual Question Answering EMNLP 2023 + + +
+ This paper addresses the problem of Knowledge-Based Visual Question Answering +(KB-VQA). Recent works have emphasized the significance of incorporating both +explicit (through external databases) and implicit (through LLMs) knowledge to +answer questions requiring external knowledge effectively. A common limitation +of such approaches is that they consist of relatively complicated pipelines and +often rely heavily on access to the GPT-3 API. Our main contribution in this paper +is to propose a much simpler and readily reproducible pipeline which, in a +nutshell, is based on efficient in-context learning by prompting LLaMA (1 and +2) using question-informative captions as contextual information. Contrary to +recent approaches, our method is training-free, does not require access to +external databases or APIs, and yet achieves state-of-the-art accuracy on the +OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to +understand important aspects of our method. Our code is publicly available at +https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA + +
+
+ comment: Accepted at EMNLP 2023 (camera-ready version) +
+
+
+
+
+ + ♻ ☆ On the Shift Invariance of Max Pooling Feature Maps in Convolutional + Neural Networks + + +
+ This paper focuses on improving the mathematical interpretability of +convolutional neural networks (CNNs) in the context of image classification. +Specifically, we tackle the instability issue arising in their first layer, +which tends to learn parameters that closely resemble oriented band-pass +filters when trained on datasets like ImageNet. Subsampled convolutions with +such Gabor-like filters are prone to aliasing, causing sensitivity to small +input shifts. In this context, we establish conditions under which the max +pooling operator approximates a complex modulus, which is nearly shift +invariant. We then derive a measure of shift invariance for subsampled +convolutions followed by max pooling. In particular, we highlight the crucial +role played by the filter's frequency and orientation in achieving stability. +We experimentally validate our theory by considering a deterministic feature +extractor based on the dual-tree complex wavelet packet transform, a particular +case of discrete Gabor-like decomposition. + +
+
+
+
+
+ + ♻ ☆ Exploring Affordance and Situated Meaning in Image Captions: A + Multimodal Analysis + + +
+ This paper explores the grounding issue regarding multimodal semantic +representation from a computational cognitive-linguistic view. We annotate +images from the Flickr30k dataset with five perceptual properties: Affordance, +Perceptual Salience, Object Number, Gaze Cueing, and Ecological Niche +Association (ENA), and examine their association with textual elements in the +image captions. Our findings reveal that images with Gibsonian affordance show +a higher frequency of captions containing 'holding-verbs' and 'container-nouns' +compared to images displaying telic affordance. Perceptual Salience, Object +Number, and ENA are also associated with the choice of linguistic expressions. +Our study demonstrates that comprehensive understanding of objects or events +requires cognitive attention, semantic nuances in language, and integration +across multiple modalities. We highlight the vital importance of situated +meaning and affordance grounding in natural language understanding, with the +potential to advance human-like interpretation in various scenarios. + +
+
+ comment: 10 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Hybrid Gromov-Wasserstein Embedding for Capsule Learning + + +
+ Capsule networks (CapsNets) aim to parse images into a hierarchy of objects, +parts, and their relations using a two-step process involving part-whole +transformation and hierarchical component routing. However, this hierarchical +relationship modeling is computationally expensive, which has limited the wider +use of CapsNet despite its potential advantages. The current state of CapsNet +models primarily focuses on comparing their performance with capsule baselines, +falling short of achieving the same level of proficiency as deep CNN variants +in intricate tasks. To address this limitation, we present an efficient +approach for learning capsules that surpasses canonical baseline models and +even demonstrates superior performance compared to high-performing convolution +models. Our contribution can be outlined in two aspects: firstly, we introduce +a group of subcapsules onto which an input vector is projected. Subsequently, +we present the Hybrid Gromov-Wasserstein framework, which initially quantifies +the dissimilarity between the input and the components modeled by the +subcapsules, followed by determining their alignment degree through optimal +transport. This innovative mechanism capitalizes on new insights into defining +alignment between the input and subcapsules, based on the similarity of their +respective component distributions. This approach enhances CapsNets' capacity +to learn from intricate, high-dimensional data while retaining their +interpretability and hierarchical structure. Our proposed model offers two +distinct advantages: (i) its lightweight nature facilitates the application of +capsules to more intricate vision tasks, including object detection; (ii) it +outperforms baseline approaches in these demanding tasks. + +
+
+
+
+
+ + ♻ ☆ Exploiting the Signal-Leak Bias in Diffusion Models + + +
+ There is a bias in the inference pipeline of most diffusion models. This bias +arises from a signal leak whose distribution deviates from the noise +distribution, creating a discrepancy between training and inference processes. +We demonstrate that this signal-leak bias is particularly significant when +models are tuned to a specific style, causing sub-optimal style matching. +Recent research tries to avoid the signal leakage during training. We instead +show how we can exploit this signal-leak bias in existing diffusion models to +allow more control over the generated images. This enables us to generate +images with more varied brightness, and images that better match a desired +style or color. By modeling the distribution of the signal leak in the spatial +frequency and pixel domains, and including a signal leak in the initial latent, +we generate images that better match expected results without any additional +training. + +
+
+ comment: corrected the author names in reference [24] +
+
+
+
+
+ + ♻ ☆ Point-DynRF: Point-based Dynamic Radiance Fields from a Monocular Video WACV2024 + + +
+ Dynamic radiance fields have emerged as a promising approach for generating +novel views from a monocular video. However, previous methods enforce +geometric consistency on dynamic radiance fields only between adjacent input +frames, which makes it difficult to represent the global scene geometry and +causes degeneration at viewpoints that are spatio-temporally distant from the input +camera trajectory. To solve this problem, we introduce point-based dynamic +radiance fields (Point-DynRF), a novel framework where the global +geometric information and the volume rendering process are trained by neural +point clouds and dynamic radiance fields, respectively. Specifically, we +reconstruct neural point clouds directly from geometric proxies and optimize +both radiance fields and the geometric proxies using our proposed losses, +allowing them to complement each other. We validate the effectiveness of our +method with experiments on the NVIDIA Dynamic Scenes Dataset and several +casually captured monocular video clips. + +
+
+ comment: WACV2024 +
+
+
+
+
+ + ♻ ☆ Rewrite Caption Semantics: Bridging Semantic Gaps for + Language-Supervised Semantic Segmentation NeurIPS 2023 + + +
+ Vision-Language Pre-training has demonstrated its remarkable zero-shot +recognition ability and potential to learn generalizable visual representations +from language supervision. Taking a step ahead, language-supervised semantic +segmentation enables spatial localization of textual inputs by learning pixel +grouping solely from image-text pairs. Nevertheless, the state-of-the-art +suffers from clear semantic gaps between the visual and textual modalities: many +visual concepts that appear in images are missing in their paired captions. Such +semantic misalignment circulates in pre-training, leading to inferior zero-shot +performance in dense predictions due to insufficient visual concepts captured +in textual representations. To close this semantic gap, we propose Concept +Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing +semantics. For each image-text pair, we establish a concept archive that +maintains potential visually-matched concepts with our proposed vision-driven +expansion and text-to-vision-guided ranking. Relevant concepts can thus be +identified via cluster-guided sampling and fed into pre-training, thereby +bridging the gap between visual and textual semantics. Extensive experiments +over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb +zero-shot transfer performance and boosts the language-supervised +segmentation baseline by a large margin, suggesting the value of bridging the +semantic gap in pre-training data. + +
+
+ comment: NeurIPS 2023. Code is available at + https://github.com/xing0047/rewrite +
+
+
+
+
+ + ♻ ☆ Beware of diffusion models for synthesizing medical images -- A + comparison with GANs in terms of memorizing brain MRI and chest x-ray images + + +
+ Diffusion models were initially developed for text-to-image generation and +are now being utilized to generate high-quality synthetic images. Preceded by +GANs, diffusion models have shown impressive results using various evaluation +metrics. However, commonly used metrics such as FID and IS are not suitable for +determining whether diffusion models are simply reproducing the training +images. Here we train StyleGAN and diffusion models, using BRATS20, BRATS21 and +a chest x-ray pneumonia dataset, to synthesize brain MRI and chest x-ray +images, and measure the correlation between the synthetic images and all +training images. Our results show that diffusion models are more likely to +memorize the training images, compared to StyleGAN, especially for small +datasets and when using 2D slices from 3D volumes. Researchers should be +careful when using diffusion models for medical imaging, if the final goal is +to share the synthetic images. + +
+
+ comment: 12 Pages, 6 Figures +
+
+
+
+
+ + ♻ ☆ Unsupervised Video Domain Adaptation for Action Recognition: A + Disentanglement Perspective NeurIPS 2023 + + +
+ Unsupervised video domain adaptation is a practical yet challenging task. In +this work, for the first time, we tackle it from a disentanglement view. Our +key idea is to handle the spatial and temporal domain divergence separately +through disentanglement. Specifically, we consider the generation of +cross-domain videos from two sets of latent factors, one encoding the static +information and another encoding the dynamic information. A Transfer Sequential +VAE (TranSVAE) framework is then developed to model such generation. To better +serve for adaptation, we propose several objectives to constrain the latent +factors. With these constraints, the spatial divergence can be readily removed +by disentangling the static domain-specific information out, and the temporal +divergence is further reduced from both frame- and video-levels through +adversarial learning. Extensive experiments on the UCF-HMDB, Jester, and +Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE +compared with several state-of-the-art approaches. Code is publicly available. + +
+
+ comment: NeurIPS 2023; 20 pages, 9 figures, 10 tables; Code at + https://github.com/ldkong1205/TranSVAE +
+
+
+
+
+ + ♻ ☆ Conformal prediction under ambiguous ground truth + + +
+ Conformal Prediction (CP) allows to perform rigorous uncertainty +quantification by constructing a prediction set $C(X)$ satisfying $\mathbb{P}(Y +\in C(X))\geq 1-\alpha$ for a user-chosen $\alpha \in [0,1]$ by relying on +calibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\mathbb{P}=\mathbb{P}^{X} +\otimes \mathbb{P}^{Y|X}$. It is typically implicitly assumed that +$\mathbb{P}^{Y|X}$ is the "true" posterior label distribution. However, in many +real-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating +expert opinions using a voting procedure, resulting in a one-hot distribution +$\mathbb{P}_{vote}^{Y|X}$. For such ``voted'' labels, CP guarantees are thus +w.r.t. $\mathbb{P}_{vote}=\mathbb{P}^X \otimes \mathbb{P}_{vote}^{Y|X}$ rather +than the true distribution $\mathbb{P}$. In cases with unambiguous ground truth +labels, the distinction between $\mathbb{P}_{vote}$ and $\mathbb{P}$ is +irrelevant. However, when experts do not agree because of ambiguous labels, +approximating $\mathbb{P}^{Y|X}$ with a one-hot distribution +$\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose +to leverage expert opinions to approximate $\mathbb{P}^{Y|X}$ using a +non-degenerate distribution $\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP +procedures which provide guarantees w.r.t. $\mathbb{P}_{agg}=\mathbb{P}^X +\otimes \mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels +from $\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a +case study of skin condition classification with significant disagreement among +expert annotators, we show that applying CP w.r.t. $\mathbb{P}_{vote}$ +under-covers expert annotations: calibrated for $72\%$ coverage, it falls short +by on average $10\%$; our Monte Carlo CP closes this gap both empirically and +theoretically. + +
+
+
+
+
+ + ♻ ☆ Physics-Based Object 6D-Pose Estimation during Non-Prehensile + Manipulation + + +
+ We propose a method to track the 6D pose of an object over time, while the +object is under non-prehensile manipulation by a robot. At any given time +during the manipulation of the object, we assume access to the robot joint +controls and an image from a camera. We use the robot joint controls to perform +a physics-based prediction of how the object might be moving. We then combine +this prediction with the observation coming from the camera, to estimate the +object pose as accurately as possible. We use a particle filtering approach to +combine the control information with the visual information. We compare the +proposed method with two baselines: (i) using only an image-based pose +estimation system at each time-step, and (ii) a particle filter which does not +perform the computationally expensive physics predictions, but assumes the +object moves with constant velocity. Our results show that making physics-based +predictions is worth the computational cost, resulting in more accurate +tracking, and estimating object pose even when the object is not clearly +visible to the camera. + +
+
+
+
+
+ + ♻ ☆ Dynamic Scene Graph Representation for Surgical Video + + +
+ Surgical videos captured from microscopic or endoscopic imaging devices are +rich but complex sources of information, depicting different tools and +anatomical structures utilized during an extended amount of time. Despite +containing crucial workflow information and being commonly recorded in many +procedures, usage of surgical videos for automated surgical workflow +understanding is still limited. + In this work, we exploit scene graphs as a more holistic, semantically +meaningful and human-readable way to represent surgical videos while encoding +all anatomical structures, tools, and their interactions. To properly evaluate +the impact of our solutions, we create a scene graph dataset from semantic +segmentations from the CaDIS and CATARACTS datasets. We demonstrate that scene +graphs can be leveraged through the use of graph convolutional networks (GCNs) +to tackle surgical downstream tasks such as surgical workflow recognition with +competitive performance. Moreover, we demonstrate the benefits of surgical +scene graphs regarding the explainability and robustness of model decisions, +which are crucial in the clinical setting. + +
+
+
+
+
+ + ♻ ☆ Harmonizing output imbalance for defect segmentation on + extremely-imbalanced photovoltaic module cells images + + +
+ The continuous development of the photovoltaic (PV) industry has raised high +requirements for the quality of monocrystalline of PV module cells. When +learning to segment defect regions in PV module cell images, Tiny Hidden Cracks +(THC) lead to extremely-imbalanced samples. The ratio of defect pixels to +normal pixels can be as low as 1:2000. This extreme imbalance makes it +difficult to segment the THC of PV module cells, which is also a challenge for +semantic segmentation. To address the problem of segmenting defects on +extremely-imbalanced THC data, the paper makes contributions from three +aspects: (1) it proposes an explicit measure for output imbalance; (2) it +generalizes a distribution-based loss that can handle different types of output +imbalances; and (3) it introduces a compound loss with our adaptive +hyperparameter selection algorithm that can keep the consistency of training +and inference for harmonizing the output imbalance on extremely-imbalanced input +data. The proposed method is evaluated on four widely-used deep learning +architectures and four datasets with varying degrees of input imbalance. The +experimental results show that the proposed method outperforms existing +methods. + +
+
+ comment: 19 pages, 16 figures, 3 appendixes +
+
+
+
+
+ + ♻ ☆ Segment Any Point Cloud Sequences by Distilling Vision Foundation Models NeurIPS 2023 + + +
+ Recent advancements in vision foundation models (VFMs) have opened up new +possibilities for versatile and efficient visual perception. In this work, we +introduce Seal, a novel framework that harnesses VFMs for segmenting diverse +automotive point cloud sequences. Seal exhibits three appealing properties: i) +Scalability: VFMs are directly distilled into point clouds, obviating the need +for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial +and temporal relationships are enforced at both the camera-to-LiDAR and +point-to-segment regularization stages, facilitating cross-modal representation +learning. iii) Generalizability: Seal enables knowledge transfer in an +off-the-shelf manner to downstream tasks involving diverse point clouds, +including those from real/synthetic, low/high-resolution, large/small-scale, +and clean/corrupted datasets. Extensive experiments conducted on eleven +different point cloud datasets showcase the effectiveness and superiority of +Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear +probing, surpassing random initialization by 36.9% mIoU and outperforming prior +arts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains +over existing methods across 20 different few-shot fine-tuning tasks on all +eleven tested point cloud datasets. + +
+
+ comment: NeurIPS 2023 (Spotlight); 37 pages, 16 figures, 15 tables; Code at + https://github.com/youquanl/Segment-Any-Point-Cloud +
+
+
+
+
+ + ♻ ☆ LAP: An Attention-Based Module for Concept Based Self-Interpretation and + Knowledge Injection in Convolutional Neural Networks + + +
+ Despite the state-of-the-art performance of deep convolutional neural +networks, they are susceptible to bias and malfunction in unseen situations. +Moreover, the complex computation behind their reasoning is not +human-understandable to develop trust. External explainer methods have tried to +interpret network decisions in a human-understandable way, but they are accused +of fallacies due to their assumptions and simplifications. On the other side, +the inherent self-interpretability of models, while being more robust to the +mentioned fallacies, cannot be applied to the already trained models. In this +work, we propose a new attention-based pooling layer, called Local Attention +Pooling (LAP), that accomplishes self-interpretability and the possibility for +knowledge injection without performance loss. The module is easily pluggable +into any convolutional neural network, even the already trained ones. We have +defined a weakly supervised training scheme to learn the distinguishing +features in decision-making without depending on experts' annotations. We +verified our claims by evaluating several LAP-extended models on two datasets, +including ImageNet. The proposed framework offers more valid +human-understandable and faithful-to-the-model interpretations than the +commonly used white-box explainer methods. + +
+
+
+
+
+ + ♻ ☆ Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large + Language Models + + +
+ Recently, growing interest has been aroused in extending the multimodal +capability of large language models (LLMs), e.g., vision-language (VL) +learning, which is regarded as the next milestone of artificial general +intelligence. However, existing solutions are prohibitively expensive, which +not only need to optimize excessive parameters, but also require another +large-scale pre-training before VL instruction tuning. In this paper, we +propose a novel and affordable solution for the effective VL adaption of LLMs, +called Mixture-of-Modality Adaptation (MMA). Instead of using large neural +networks to connect the image encoder and LLM, MMA adopts lightweight modules, +i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables +the joint optimization of the image and language models. Meanwhile, MMA is also +equipped with a routing algorithm to help LLMs achieve an automatic shift +between single- and multi-modal instructions without compromising their ability +of natural language understanding. To validate MMA, we apply it to a recent LLM +called LLaMA and term this formed large vision-language instructed model as +LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two +setups, namely multimodal science question answering and multimodal dialogue. +The experimental results not only demonstrate the competitive performance and +the superior training efficiency of LaVIN than existing multimodal LLMs, but +also confirm its great potential as a general-purpose chatbot. More +importantly, the actual expenditure of LaVIN is extremely cheap, e.g., only 1.4 +training hours with 3.8M trainable parameters, greatly confirming the +effectiveness of MMA. Our project is released at +https://luogen1996.github.io/lavin. + +
+
+
+
+
+ + ♻ ☆ Spectral2Spectral: Image-spectral Similarity Assisted Spectral CT Deep + Reconstruction without Reference + + +
+ Spectral computed tomography based on a photon-counting detector (PCD) +attracts more and more attention since it has the capability to provide more +accurate identification and quantitative analysis for biomedical materials. The +limited number of photons within narrow energy bins leads to imaging results of +low signal-noise ratio. The existing supervised deep reconstruction networks +for CT reconstruction are difficult to address these challenges because it is +usually impossible to acquire noise-free clinical images with clear structures +as references. In this paper, we propose an iterative deep reconstruction +network to synergize unsupervised method and data priors into a unified +framework, named as Spectral2Spectral. Our Spectral2Spectral employs an +unsupervised deep training strategy to obtain high-quality images from noisy +data in an end-to-end fashion. The structural similarity prior within +image-spectral domain is refined as a regularization term to further constrain +the network training. The weights of neural network are automatically updated +to capture image features and structures within the iterative process. Three +large-scale preclinical datasets experiments demonstrate that the +Spectral2Spectral reconstructs better image quality than other +state-of-the-art methods. + +
+
+ comment: Accepted by IEEE TCI +
+
+
+
+
+ + ♻ ☆ Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design + + +
+ Scaling laws have been recently employed to derive compute-optimal model size +(number of parameters) for a given compute duration. We advance and refine such +methods to infer compute-optimal model shapes, such as width and depth, and +successfully implement this in vision transformers. Our shape-optimized vision +transformer, SoViT, achieves results competitive with models that exceed twice +its size, despite being pre-trained with an equivalent amount of compute. For +example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, +surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical +settings, with also less than half the inference cost. We conduct a thorough +evaluation across multiple tasks, such as image classification, captioning, VQA +and zero-shot transfer, demonstrating the effectiveness of our model across a +broad range of domains and identifying limitations. Overall, our findings +challenge the prevailing approach of blindly scaling up vision models and pave +a path for a more informed scaling. + +
+
+ comment: 10 pages, 7 figures, 9 tables. Version 2: Layout fixes +
+
+
+
+
+ + ♻ ☆ Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory + Models + + +
+ We propose a novel anomaly detection method for echocardiogram videos. The +introduced method takes advantage of the periodic nature of the heart cycle to +learn three variants of a variational latent trajectory model (TVAE). While the +first two variants (TVAE-C and TVAE-R) model strict periodic movements of the +heart, the third (TVAE-S) is more general and allows shifts in the spatial +representation throughout the video. All models are trained on the healthy +samples of a novel in-house dataset of infant echocardiogram videos consisting +of multiple chamber views to learn a normative prior of the healthy population. +During inference, maximum a posteriori (MAP) based anomaly detection is +performed to detect out-of-distribution samples in our dataset. The proposed +method reliably identifies severe congenital heart defects, such as Ebstein's +Anomaly or Shone-complex. Moreover, it achieves superior performance over +MAP-based anomaly detection with standard variational autoencoders when +detecting pulmonary hypertension and right ventricular dilation. Finally, we +demonstrate that the proposed method enables interpretable explanations of its +output through heatmaps highlighting the regions corresponding to anomalous +heart structures. + +
+
+
+
+
+ + ♻ ☆ A Simple Framework for 3D Occupancy Estimation in Autonomous Driving + + +
+ The task of estimating 3D occupancy from surrounding-view images is an +exciting development in the field of autonomous driving, following the success +of Bird's Eye View (BEV) perception. This task provides crucial 3D attributes +of the driving environment, enhancing the overall understanding and perception +of the surrounding space. In this work, we present a simple framework for 3D +occupancy estimation, which is a CNN-based framework designed to reveal several +key factors for 3D occupancy estimation, such as network design, optimization, +and evaluation. In addition, we explore the relationship between 3D occupancy +estimation and other related tasks, such as monocular depth estimation and 3D +reconstruction, which could advance the study of 3D perception in autonomous +driving. For evaluation, we propose a simple sampling strategy to define the +metric for occupancy evaluation, which is flexible for current public datasets. +Moreover, we establish the benchmark in terms of the depth estimation metric, +where we compare our proposed method with monocular depth estimation methods on +the DDAD and Nuscenes datasets and achieve competitive performance. The +relevant code will be updated in https://github.com/GANWANSHUI/SimpleOccupancy. + +
+
+ comment: 15 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities + + +
+ We propose MM-Vet, an evaluation benchmark that examines large multimodal +models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various +intriguing abilities, such as solving math problems written on the blackboard, +reasoning about events and celebrities in news images, and explaining visual +jokes. Rapid model advancements pose challenges to evaluation benchmark +development. Problems include: (1) How to systematically structure and evaluate +the complicated multimodal tasks; (2) How to design evaluation metrics that +work well across question and answer types; and (3) How to give model insights +beyond a simple performance ranking. To this end, we present MM-Vet, designed +based on the insight that the intriguing ability to solve complicated tasks is +often achieved by a generalist model being able to integrate different core +vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and +examines the 16 integrations of interest derived from the capability +combination. For evaluation metrics, we propose an LLM-based evaluator for +open-ended outputs. The evaluator enables the evaluation across different +question types and answer styles, resulting in a unified scoring metric. We +evaluate representative LMMs on MM-Vet, providing insights into the +capabilities of different LMM system paradigms and models. Code and data are +available at https://github.com/yuweihao/MM-Vet. + +
+
+ comment: Add results of GPT-4V. Code, data and leaderboard: + https://github.com/yuweihao/MM-Vet +
+
+
+
+
+ + ♻ ☆ PhoMoH: Implicit Photorealistic 3D Models of Human Heads + + +
+ We present PhoMoH, a neural network methodology to construct generative +models of photo-realistic 3D geometry and appearance of human heads including +hair, beards, an oral cavity, and clothing. In contrast to prior work, PhoMoH +models the human head using neural fields, thus supporting complex topology. +Instead of learning a head model from scratch, we propose to augment an +existing expressive head model with new features. Concretely, we learn a highly +detailed geometry network layered on top of a mid-resolution head model +together with a detailed, local geometry-aware, and disentangled color field. +Our proposed architecture allows us to learn photo-realistic human head models +from relatively little data. The learned generative geometry and appearance +networks can be sampled individually and enable the creation of diverse and +realistic human heads. Extensive experiments validate our method qualitatively +and across different metrics. + +
+
+ comment: To be published at the International Conference on 3D Vision 2024 +
+
+
+
+
+ + ♻ ☆ A Systematic Performance Analysis of Deep Perceptual Loss Networks: + Breaking Transfer Learning Conventions + + +
+ Deep perceptual loss is a type of loss function in computer vision that aims +to mimic human perception by using the deep features extracted from neural +networks. In recent years, the method has been applied to great effect on a +host of interesting computer vision tasks, especially for tasks with image or +image-like outputs, such as image synthesis, segmentation, depth prediction, +and more. Many applications of the method use pretrained networks, often +convolutional networks, for loss calculation. Despite the increased interest +and broader use, more effort is needed toward exploring which networks to use +for calculating deep perceptual loss and from which layers to extract the +features. + This work aims to rectify this by systematically evaluating a host of +commonly used and readily available, pretrained networks for a number of +different feature extraction points on four existing use cases of deep +perceptual loss. The use cases of perceptual similarity, super-resolution, +image segmentation, and dimensionality reduction, are evaluated through +benchmarks. The benchmarks are implementations of previous works where the +selected networks and extraction points are evaluated. The performance on the +benchmarks, and attributes of the networks and extraction points are then used +as a basis for an in-depth analysis. This analysis uncovers insight regarding +which architectures provide superior performance for deep perceptual loss and +how to choose an appropriate extraction point for a particular task and +dataset. Furthermore, the work discusses the implications of the results for +deep perceptual loss and the broader field of transfer learning. The results +show that deep perceptual loss deviates from two commonly held conventions in +transfer learning, which suggests that those conventions are in need of deeper +analysis. + +
+
+
+
+
+ + ♻ ☆ DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable + Kendall's Rank Correlation NeurIPS 2023 + + +
+ Few-shot learning aims to adapt models trained on the base dataset to novel +tasks where the categories were not seen by the model before. This often leads +to a relatively uniform distribution of feature values across channels on novel +classes, posing challenges in determining channel importance for novel tasks. +Standard few-shot learning methods employ geometric similarity metrics such as +cosine similarity and negative Euclidean distance to gauge the semantic +relatedness between two features. However, features with high geometric +similarities may carry distinct semantics, especially in the context of +few-shot learning. In this paper, we demonstrate that the importance ranking of +feature channels is a more reliable indicator for few-shot learning than +geometric similarity metrics. We observe that replacing the geometric +similarity metric with Kendall's rank correlation only during inference is able +to improve the performance of few-shot learning across a wide range of methods +and datasets with different domains. Furthermore, we propose a carefully +designed differentiable loss for meta-training to address the +non-differentiability issue of Kendall's rank correlation. By replacing +geometric similarity with differentiable Kendall's rank correlation, our method +can integrate with numerous existing few-shot approaches and is ready for +integrating with future state-of-the-art methods that rely on geometric +similarity metrics. Extensive experiments validate the efficacy of the +rank-correlation-based approach, showcasing a significant improvement in +few-shot learning. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ ViNT: A Foundation Model for Visual Navigation + + +
+ General-purpose pre-trained models ("foundation models") have enabled +practitioners to produce generalizable solutions for individual machine +learning problems with datasets that are significantly smaller than those +required for learning from scratch. Such models are typically trained on large +and diverse datasets with weak supervision, consuming much more training data +than is available for any individual downstream application. In this paper, we +describe the Visual Navigation Transformer (ViNT), a foundation model that aims +to bring the success of general-purpose pre-trained models to vision-based +robotic navigation. ViNT is trained with a general goal-reaching objective that +can be used with any navigation dataset, and employs a flexible +Transformer-based architecture to learn navigational affordances and enable +efficient adaptation to a variety of downstream navigational tasks. ViNT is +trained on a number of existing navigation datasets, comprising hundreds of +hours of robotic navigation from a variety of different robotic platforms, and +exhibits positive transfer, outperforming specialist models trained on singular +datasets. ViNT can be augmented with diffusion-based subgoal proposals to +explore novel environments, and can solve kilometer-scale navigation problems +when equipped with long-range heuristics. ViNT can also be adapted to novel +task specifications with a technique inspired by prompt-tuning, where the goal +encoder is replaced by an encoding of another task modality (e.g., GPS +waypoints or routing commands) embedded into the same space of goal tokens. +This flexibility and ability to accommodate a variety of downstream problem +domains establishes ViNT as an effective foundation model for mobile robotics. +For videos, code, and model checkpoints, see our project page at +https://visualnav-transformer.github.io. + +
+
+ comment: Accepted for oral presentation at CoRL 2023 +
+
+
+
+
+ + ♻ ☆ MaXM: Towards Multilingual Visual Question Answering EMNLP 2023 + + +
+ Visual Question Answering (VQA) has been primarily studied through the lens +of the English language. Yet, tackling VQA in other languages in the same +manner would require a considerable amount of resources. In this paper, we +propose scalable solutions to multilingual visual question answering (mVQA), on +both data and modeling fronts. We first propose a translation-based framework +to mVQA data generation that requires much less human annotation efforts than +the conventional approach of directly collecting questions and answers. Then, +we apply our framework to the multilingual captions in the Crossmodal-3600 +dataset and develop an efficient annotation protocol to create MaXM, a +test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple, +lightweight, and effective approach as well as benchmark state-of-the-art +English and multilingual VQA models. We hope that our benchmark encourages +further research on mVQA. + +
+
+ comment: EMNLP 2023 (Findings). + https://github.com/google-research-datasets/maxm +
+
+
+
+
+ + ♻ ☆ ReContrast: Domain-Specific Anomaly Detection via Contrastive + Reconstruction NeurIPS 2023 + + +
+ Most advanced unsupervised anomaly detection (UAD) methods rely on modeling +feature representations of frozen encoder networks pre-trained on large-scale +datasets, e.g. ImageNet. However, the features extracted from the encoders that +are borrowed from natural image domains coincide little with the features +required in the target UAD domain, such as industrial inspection and medical +imaging. In this paper, we propose a novel epistemic UAD method, namely +ReContrast, which optimizes the entire network to reduce biases towards the +pre-trained image domain and orients the network in the target domain. We start +with a feature reconstruction approach that detects anomalies from errors. +Essentially, the elements of contrastive learning are elegantly embedded in +feature reconstruction to prevent the network from training instability, +pattern collapse, and identical shortcut, while simultaneously optimizing both +the encoder and decoder on the target domain. To demonstrate our transfer +ability on various image domains, we conduct extensive experiments across two +popular industrial defect detection benchmarks and three medical image UAD +tasks, which shows our superiority over current state-of-the-art methods. + +
+
+ comment: NeurIPS 2023 Poster +
+
+
+
+
+ + ♻ ☆ Wonder3D: Single Image to 3D using Cross-Domain Diffusion + + +
+ In this work, we introduce Wonder3D, a novel method for efficiently +generating high-fidelity textured meshes from single-view images. Recent methods +based on Score Distillation Sampling (SDS) have shown the potential to recover +3D geometry from 2D diffusion priors, but they typically suffer from +time-consuming per-shape optimization and inconsistent geometry. In contrast, +certain works directly produce 3D information via fast network inferences, but +their results are often of low quality and lack geometric details. To +holistically improve the quality, consistency, and efficiency of image-to-3D +tasks, we propose a cross-domain diffusion model that generates multi-view +normal maps and the corresponding color images. To ensure consistency, we +employ a multi-view cross-domain attention mechanism that facilitates +information exchange across views and modalities. Lastly, we introduce a +geometry-aware normal fusion algorithm that extracts high-quality surfaces from +the multi-view 2D representations. Our extensive evaluations demonstrate that +our method achieves high-quality reconstruction results, robust generalization, +and reasonably good efficiency compared to prior works. + +
+
+ comment: Project page: https://www.xxlong.site/Wonder3D/ +
+
+
+
+
+ + ♻ ☆ Boosting Generalization with Adaptive Style Techniques for Fingerprint + Liveness Detection + + +
+ We introduce a high-performance fingerprint liveness feature extraction +technique that secured first place in LivDet 2023 Fingerprint Representation +Challenge. Additionally, we developed a practical fingerprint recognition +system with 94.68% accuracy, earning second place in LivDet 2023 Liveness +Detection in Action. By investigating various methods, particularly style +transfer, we demonstrate improvements in accuracy and generalization when faced +with limited training data. As a result, our approach achieved state-of-the-art +performance in LivDet 2023 Challenges. + +
+
+ comment: 1st Place in LivDet2023 Fingerprint Representation Challenge +
+
+
+
+
+ + ♻ ☆ Content-Based Search for Deep Generative Models + + +
+ The growing proliferation of customized and pretrained generative models has +made it infeasible for a user to be fully cognizant of every model in +existence. To address this need, we introduce the task of content-based model +search: given a query and a large set of generative models, finding the models +that best match the query. As each generative model produces a distribution of +images, we formulate the search task as an optimization problem to select the +model with the highest probability of generating similar content as the query. +We introduce a formulation to approximate this probability given the query from +different modalities, e.g., image, sketch, and text. Furthermore, we propose a +contrastive learning framework for model retrieval, which learns to adapt +features for various query modalities. We demonstrate that our method +outperforms several baselines on Generative Model Zoo, a new benchmark we +create for the model retrieval task. + +
+
+ comment: Our project page is hosted at + https://generative-intelligence-lab.github.io/modelverse/ +
+
+
+
+
+ + ♻ ☆ Weakly Supervised Semantic Segmentation by Knowledge Graph Inference + + +
+ Currently, existing efforts in Weakly Supervised Semantic Segmentation (WSSS) +based on Convolutional Neural Networks (CNNs) have predominantly focused on +enhancing the multi-label classification network stage, with limited attention +given to the equally important downstream segmentation network. Furthermore, +CNN-based local convolutions lack the ability to model the extensive +inter-category dependencies. Therefore, this paper introduces a graph +reasoning-based approach to enhance WSSS. The aim is to improve WSSS +holistically by simultaneously enhancing both the multi-label classification +and segmentation network stages. In the multi-label classification network +segment, external knowledge is integrated, coupled with GCNs, to globally +reason about inter-class dependencies. This encourages the network to uncover +features in non-salient regions of images, thereby refining the completeness of +generated pseudo-labels. In the segmentation network segment, the proposed +Graph Reasoning Mapping (GRM) module is employed to leverage knowledge obtained +from textual databases, facilitating contextual reasoning for class +representation within image regions. This GRM module enhances feature +representation in high-level semantics of the segmentation network's local +convolutions, while dynamically learning semantic coherence for individual +samples. Using solely image-level supervision, we have achieved +state-of-the-art performance in WSSS on the PASCAL VOC 2012 and MS-COCO +datasets. Extensive experimentation on both the multi-label classification and +segmentation network stages underscores the effectiveness of the proposed graph +reasoning approach for advancing WSSS. + +
+
+ comment: Our description in Chapter 3, Section 3.2 of the paper is too + repetitive with the paper "Object detection meets knowledge graphs". There is + an error in the description of formula (5) in Section 3.3. And a detailed + reasoning process is required for formula (5). Therefore, we wish to request + a retraction of the paper +
+
+
+
+
+ + ♻ ☆ Ghost on the Shell: An Expressive Representation of General 3D Shapes + + +
+ The creation of photorealistic virtual worlds requires the accurate modeling +of 3D surface geometry for a wide range of objects. For this, meshes are +appealing since they 1) enable fast physics-based rendering with realistic +material and lighting, 2) support physical simulation, and 3) are +memory-efficient for modern graphics pipelines. Recent work on reconstructing +and statistically modeling 3D shape, however, has critiqued meshes as being +topologically inflexible. To capture a wide range of object shapes, any 3D +representation must be able to model solid, watertight, shapes as well as thin, +open, surfaces. Recent work has focused on the former, and methods for +reconstructing open surfaces do not support fast reconstruction with material +and lighting or unconditional generative modelling. Inspired by the observation +that open surfaces can be seen as islands floating on watertight surfaces, we +parameterize open surfaces by defining a manifold signed distance field on +watertight templates. With this parameterization, we further develop a +grid-based and differentiable representation that parameterizes both watertight +and non-watertight meshes of arbitrary topology. Our new representation, called +Ghost-on-the-Shell (G-Shell), enables two important applications: +differentiable rasterization-based reconstruction from multiview images and +generative modelling of non-watertight meshes. We empirically demonstrate that +G-Shell achieves state-of-the-art performance on non-watertight mesh +reconstruction and generation tasks, while also performing effectively for +watertight meshes. + +
+
+ comment: Technical Report (26 pages, 16 figures, Project Page: + https://gshell3d.github.io/) +
+
+
+
+
+ + ♻ ☆ A Survey on Few-Shot Class-Incremental Learning + + +
+ Large deep learning models are impressive, but they struggle when real-time +data is not available. Few-shot class-incremental learning (FSCIL) poses a +significant challenge for deep neural networks to learn new tasks from just a +few labeled samples without forgetting the previously learned ones. This setup +easily leads to catastrophic forgetting and overfitting problems, severely +affecting model performance. Studying FSCIL helps overcome deep learning model +limitations on data volume and acquisition time, while improving practicality +and adaptability of machine learning models. This paper provides a +comprehensive survey on FSCIL. Unlike previous surveys, we aim to synthesize +few-shot learning and incremental learning, focusing on introducing FSCIL from +two perspectives, while reviewing over 30 theoretical research studies and more +than 20 applied research studies. From the theoretical perspective, we provide +a novel categorization approach that divides the field into five subcategories, +including traditional machine learning methods, meta-learning based methods, +feature and feature space-based methods, replay-based methods, and dynamic +network structure-based methods. We also evaluate the performance of recent +theoretical research on benchmark datasets of FSCIL. From the application +perspective, FSCIL has achieved impressive results in various fields of +computer vision such as image classification, object detection, and image +segmentation, as well as in natural language processing and graph. We summarize +the important applications. Finally, we point out potential future research +directions, including applications, problem setups, and theory development. +Overall, this paper offers a comprehensive analysis of the latest advances in +FSCIL from a methodological, performance, and application perspective. + +
+
+
+
+
+ + ♻ ☆ VPGTrans: Transfer Visual Prompt Generator across LLMs NeurIPS 2023 + + +
+ While developing a new multimodal LLM (MLLM) by pre-training on tremendous +image-text pairs from scratch can be exceedingly resource-consuming, connecting +an existing LLM with a comparatively lightweight visual prompt generator (VPG) +becomes a feasible paradigm. However, further tuning the VPG part of the MLLM +still suffers from indispensable computational costs, i.e., requiring thousands +of GPU hours and millions of training data. One alternative solution is to +transfer an existing VPG from any existing MLLMs for the target MLLM. + In this work, we for the first time investigate the VPG transferability +across LLMs, and explore a solution to reduce the cost of VPG transfer. We +first study the VPG transfer across different LLM sizes (e.g., small-to-large), +and across different LLM types, through which we diagnose the key factors to +maximize the transfer efficiency. Based on our observation, we design a +two-stage transfer framework named VPGTrans, which is simple yet highly +effective. Through extensive experiments, we demonstrate that VPGTrans helps +significantly speed up the transfer learning process without compromising +performance. Remarkably, it helps achieve the VPG transfer from BLIP-2 +OPT$_\text{2.7B}$ to BLIP-2 OPT$_\text{6.7B}$ with over 10 times speed-up and +10.7% training data compared with connecting a VPG to OPT$_\text{6.7B}$ from +scratch. Further, a series of intriguing findings and potential rationales +behind them are provided and discussed. Finally, we showcase the practical +value of our VPGTrans approach, by customizing two novel MLLMs, including +VL-LLaMA and VL-Vicuna, with recently released LLaMA and Vicuna LLMs. + +
+
+ comment: Project Website: https://vpgtrans.github.io Code: + https://github.com/VPGTrans/VPGTrans NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Supervised Domain Adaptation for Recognizing Retinal Diseases from + Wide-Field Fundus Images + + +
+ This paper addresses the emerging task of recognizing multiple retinal +diseases from wide-field (WF) and ultra-wide-field (UWF) fundus images. To make +effective use of the existing large amount of labeled color fundus photo (CFP) data +and the relatively small amount of WF and UWF data, we propose a supervised +domain adaptation method named Cross-domain Collaborative Learning (CdCL). +Inspired by the success of fixed-ratio based mixup in unsupervised domain +adaptation, we re-purpose this strategy for the current task. Due to the +intrinsic disparity between the field-of-view of CFP and WF/UWF images, a scale +bias naturally exists in a mixup sample, in that the anatomic structure from a CFP +image will be considerably larger than its WF/UWF counterpart. The CdCL method +resolves the issue by Scale-bias Correction, which employs Transformers for +producing scale-invariant features. As demonstrated by extensive experiments on +multiple datasets covering both WF and UWF images, the proposed method compares +favorably against a number of competitive baselines. + +
+
+ comment: Accepted by BIBM2023 +
+
+
+
+
+ + ♻ ☆ Batch Implicit Neural Representation for MRI Parallel Reconstruction + + +
+ Magnetic resonance imaging (MRI) has always suffered from the problem of long +acquisition time. MRI reconstruction is one solution to reduce scan time by +skipping certain phase-encoding lines and then restoring high-quality images +from undersampled measurements. Recently, implicit neural representation (INR) +has emerged as a new deep learning method that represents an object as a +continuous function of spatial coordinates, and this function is normally +parameterized by a multilayer perceptron (MLP). In this paper, we propose a +novel MRI parallel reconstruction method based on INR, which represents the +fully-sampled images as the function of voxel coordinates and prior feature +vectors of undersampled images for overcoming the generalization problem of +INR. Specifically, we introduce a scale-embedded encoder to produce +scale-independent voxel-specific features from MR images with different +undersampled scales and then concatenate them with coordinate vectors to recover +fully-sampled MR images via an MLP, thus achieving arbitrary scale +reconstruction. The performance of the proposed method was assessed by +experimenting on publicly available MRI datasets and compared with other +reconstruction methods. Our quantitative evaluation demonstrates the +superiority of the proposed method over alternative reconstruction methods. + +
+
+
+
+
+ + ♻ ☆ A comprehensive survey on deep active learning and its applications in + medical image analysis + + +
+ Deep learning has achieved widespread success in medical image analysis, +leading to an increasing demand for large-scale expert-annotated medical image +datasets. Yet, the high cost of annotating medical images severely hampers the +development of deep learning in this field. To reduce annotation costs, active +learning aims to select the most informative samples for annotation and train +high-performance models with as few labeled samples as possible. In this +survey, we review the core methods of active learning, including the evaluation +of informativeness and sampling strategy. For the first time, we provide a +detailed summary of the integration of active learning with other +label-efficient techniques, such as semi-supervised, self-supervised learning, +and so on. Additionally, we also highlight active learning works that are +specifically tailored to medical image analysis. In the end, we offer our +perspectives on the future trends and challenges of active learning and its +applications in medical image analysis. + +
+
+ comment: Paper List on Github: + https://github.com/LightersWang/Awesome-Active-Learning-for-Medical-Image-Analysis +
+
+
+
+
+ + ♻ ☆ AdaFuse: Adaptive Medical Image Fusion Based on Spatial-Frequential + Cross Attention + + +
+ Multi-modal medical image fusion is essential for precise clinical +diagnosis and surgical navigation since it can merge the complementary +information in multi-modalities into a single image. The quality of the fused +image depends on the extracted single modality features as well as the fusion +rules for multi-modal information. Although existing deep learning-based fusion methods +can fully exploit the semantic features of each modality, they cannot +distinguish the effective low and high frequency information of each modality +and fuse them adaptively. To address this issue, we propose AdaFuse, in which +multimodal image information is fused adaptively through a frequency-guided +attention mechanism based on Fourier transform. Specifically, we propose the +cross-attention fusion (CAF) block, which adaptively fuses features of two +modalities in the spatial and frequency domains by exchanging key and query +values, and then calculates the cross-attention scores between the spatial and +frequency features to further guide the spatial-frequential information fusion. +The CAF block enhances the high-frequency features of the different modalities +so that the details in the fused images can be retained. Moreover, we design a +novel loss function composed of structure loss and content loss to preserve +both low and high frequency information. Extensive comparison experiments on +several datasets demonstrate that the proposed method outperforms +state-of-the-art methods in terms of both visual quality and quantitative +metrics. The ablation experiments also validate the effectiveness of the +proposed loss and fusion strategy. + +
+
+
+
+
+ + ♻ ☆ Rethinking Semi-Supervised Medical Image Segmentation: A + Variance-Reduction Perspective NeurIPS 2023 + + +
+ For medical image segmentation, contrastive learning is the dominant practice +to improve the quality of visual representations by contrasting semantically +similar and dissimilar pairs of samples. This is enabled by the observation +that without accessing ground truth labels, negative examples with truly +dissimilar anatomical features, if sampled, can significantly improve the +performance. In reality, however, these samples may come from similar +anatomical regions and the models may struggle to distinguish the minority +tail-class samples, making the tail classes more prone to misclassification, +both of which typically lead to model collapse. In this paper, we propose ARCO, +a semi-supervised contrastive learning (CL) framework with stratified group +theory for medical image segmentation. In particular, we first propose building +ARCO through the concept of variance-reduced estimation and show that certain +variance-reduction techniques are particularly beneficial in pixel/voxel-level +segmentation tasks with extremely limited labels. Furthermore, we theoretically +prove these sampling techniques are universal in variance reduction. Finally, +we experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D +medical and three semantic segmentation datasets, with different label +settings, and our methods consistently outperform state-of-the-art +semi-supervised methods. Additionally, we augment the CL frameworks with these +sampling techniques and demonstrate significant gains over previous methods. We +believe our work is an important step towards semi-supervised medical image +segmentation by quantifying the limitation of current self-supervision +objectives for accomplishing such challenging safety-critical tasks. + +
+
+ comment: Accepted by Advances in Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Direct Diffusion Bridge using Data Consistency for Inverse Problems NeurIPS 2023 + + +
+ Diffusion model-based inverse problem solvers have shown impressive +performance, but are limited in speed, mostly as they require reverse diffusion +sampling starting from noise. Several recent works have tried to alleviate this +problem by building a diffusion process, directly bridging the clean and the +corrupted for specific inverse problems. In this paper, we first unify these +existing works under the name Direct Diffusion Bridges (DDB), showing that +while motivated by different theories, the resulting algorithms only differ in +the choice of parameters. Then, we highlight a critical limitation of the +current DDB framework, namely that it does not ensure data consistency. To +address this problem, we propose a modified inference procedure that imposes +data consistency without the need for fine-tuning. We term the resulting method +data Consistent DDB (CDDB), which outperforms its inconsistent counterpart in +terms of both perception and distortion metrics, thereby effectively pushing +the Pareto-frontier toward the optimum. Our proposed method achieves +state-of-the-art results on both evaluation criteria, showcasing its +superiority over existing methods. Code is available at +https://github.com/HJ-harry/CDDB + +
+
+ comment: NeurIPS 2023 camera-ready. 16 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ Feature Extractor Stacking for Cross-domain Few-shot Learning + + +
+ Cross-domain few-shot learning (CDFSL) addresses learning problems where +knowledge needs to be transferred from one or more source domains into an +instance-scarce target domain with an explicitly different distribution. +Recently published CDFSL methods generally construct a universal model that +combines knowledge of multiple source domains into one feature extractor. This +enables efficient inference but necessitates re-computation of the extractor +whenever a new source domain is added. Some of these methods are also +incompatible with heterogeneous source domain extractor architectures. We +propose feature extractor stacking (FES), a new CDFSL method for combining +information from a collection of extractors, that can utilise heterogeneous +pretrained extractors out of the box and does not maintain a universal model +that needs to be re-computed when its extractor collection is updated. We +present the basic FES algorithm, which is inspired by the classic stacked +generalisation approach, and also introduce two variants: convolutional FES +(ConFES) and regularised FES (ReFES). Given a target-domain task, these +algorithms fine-tune each extractor independently, use cross-validation to +extract training data for stacked generalisation from the support set, and +learn a simple linear stacking classifier from this data. We evaluate our FES +methods on the well-known Meta-Dataset benchmark, targeting image +classification with convolutional neural networks, and show that they can +achieve state-of-the-art performance. + +
+
+
+
+
+ + ♻ ☆ Incorporating Structured Representations into Pretrained Vision & + Language Models Using Scene Graphs EMNLP 2023 + + +
+ Vision and language models (VLMs) have demonstrated remarkable zero-shot (ZS) +performance in a variety of tasks. However, recent works have shown that even +the best VLMs struggle to capture aspects of compositional scene understanding, +such as object attributes, relations, and action states. In contrast, obtaining +structured annotations, such as scene graphs (SGs), that could improve these +models is time-consuming and costly, and thus cannot be used on a large scale. +Here we ask whether small SG datasets can provide sufficient information for +enhancing structured understanding of pretrained VLMs. We show that it is +indeed possible to improve VLMs when learning from SGs by integrating +components that incorporate structured information into both visual and textual +representations. For the visual side, we incorporate a special "SG Component" +in the image transformer trained to predict SG information, while for the +textual side, we utilize SGs to generate fine-grained captions that highlight +different compositional aspects of the scene. Our method improves the +performance of several popular VLMs on multiple VL datasets with only a mild +degradation in ZS capabilities. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ NerfAcc: Efficient Sampling Accelerates NeRFs + + +
+ Optimizing and rendering Neural Radiance Fields is computationally expensive +due to the vast number of samples required by volume rendering. Recent works +have included alternative sampling approaches to help accelerate their methods; +however, these approaches are often not the focus of the work. In this paper, we +investigate and compare multiple sampling approaches and demonstrate that +improved sampling is generally applicable across NeRF variants under a unified +concept of transmittance estimator. To facilitate future experiments, we +develop NerfAcc, a Python toolbox that provides flexible APIs for incorporating +advanced sampling methods into NeRF related methods. We demonstrate its +flexibility by showing that it can reduce the training time of several recent +NeRF methods by 1.5x to 20x with minimal modifications to the existing +codebase. Additionally, highly customized NeRFs, such as Instant-NGP, can be +implemented in native PyTorch using NerfAcc. + +
+
+ comment: Website: https://www.nerfacc.com +
+
+
+
+
+ + ♻ ☆ Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for + Improved Vision-Language Compositionality EMNLP 2023 + + +
+ Contrastively trained vision-language models have achieved remarkable +progress in vision and language representation learning, leading to +state-of-the-art models for various downstream multimodal tasks. However, +recent research has highlighted severe limitations of these models in their +ability to perform compositional reasoning over objects, attributes, and +relations. Scene graphs have emerged as an effective way to understand images +compositionally. These are graph-structured semantic representations of images +that contain objects, their attributes, and relations with other objects in a +scene. In this work, we consider the scene graph parsed from text as a proxy +for the image scene graph and propose a graph decomposition and augmentation +framework along with a coarse-to-fine contrastive learning objective between +images and text that aligns sentences of various complexities to the same +image. Along with this, we propose novel negative mining techniques in the +scene graph space for improving attribute binding and relation understanding. +Through extensive experiments, we demonstrate the effectiveness of our approach +that significantly improves attribute binding, relation understanding, +systematic generalization, and productivity on multiple recently proposed +benchmarks (for example, improvements of up to $18\%$ for systematic +generalization and $16.5\%$ for relation understanding over a strong baseline), +while achieving similar or better performance than CLIP on various general +multimodal tasks. + +
+
+ comment: EMNLP 2023 (long paper, main conference) +
+
+
+
+
+ + ♻ ☆ Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation WACV 2024 + + +
+ Learning-based methods have dominated the 3D human pose estimation (HPE) +tasks with significantly better performance in most benchmarks than traditional +optimization-based methods. Nonetheless, 3D HPE in the wild is still the +biggest challenge for learning-based models, whether with 2D-3D lifting, +image-to-3D, or diffusion-based methods, since the trained networks implicitly +learn camera intrinsic parameters and domain-based 3D human pose distributions +and estimate poses by statistical average. On the other hand, the +optimization-based methods estimate results case-by-case, which can predict +more diverse and sophisticated human poses in the wild. By combining the +advantages of optimization-based and learning-based methods, we propose the +\textbf{Ze}ro-shot \textbf{D}iffusion-based \textbf{O}ptimization +(\textbf{ZeDO}) pipeline for 3D HPE to solve the problem of cross-domain and +in-the-wild 3D HPE. Our multi-hypothesis \textit{\textbf{ZeDO}} achieves +state-of-the-art (SOTA) performance on Human3.6M, with minMPJPE $51.4$mm, +without training with any 2D-3D or image-3D pairs. Moreover, our +single-hypothesis \textit{\textbf{ZeDO}} achieves SOTA performance on 3DPW +dataset with PA-MPJPE $40.3$mm on cross-dataset evaluation, which even +outperforms learning-based methods trained on 3DPW. + +
+
+ comment: WACV 2024 +
+
+
+
+
+ + ♻ ☆ Image Manipulation via Multi-Hop Instructions -- A New Dataset and + Weakly-Supervised Neuro-Symbolic Approach EMNLP 2023 + + +
+ We are interested in image manipulation via natural language text -- a task +that is useful for multiple AI applications but requires complex reasoning over +multi-modal spaces. We extend the recently proposed Neuro Symbolic Concept Learning +(NSCL), which has been quite effective for the task of Visual Question +Answering (VQA), to the task of image manipulation. Our system, referred to as +NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and +only requires weak supervision in the form of annotated data for VQA. NeuroSIM +parses an instruction into a symbolic program, based on a Domain Specific +Language (DSL) comprising object attributes and manipulation operations, +that guides its execution. We create a new dataset for the task, and extensive +experiments demonstrate that NeuroSIM is highly competitive with or beats SOTA +baselines that make use of supervised data for manipulation. + +
+
+ comment: EMNLP 2023 (long paper, main conference) +
+
+
+
+
+ + ♻ ☆ Novel-View Acoustic Synthesis CVPR 2023 + + +
+ We introduce the novel-view acoustic synthesis (NVAS) task: given the sight +and sound observed at a source viewpoint, can we synthesize the sound of that +scene from an unseen target viewpoint? We propose a neural rendering approach: +Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize +the sound of an arbitrary point in space by analyzing the input audio-visual +cues. To benchmark this task, we collect two first-of-their-kind large-scale +multi-view audio-visual datasets, one synthetic and one real. We show that our +model successfully reasons about the spatial cues and synthesizes faithful +audio on both datasets. To our knowledge, this work represents the very first +formulation, dataset, and approach to solve the novel-view acoustic synthesis +task, which has exciting potential applications ranging from AR/VR to art and +design. Unlocked by this work, we believe that the future of novel-view +synthesis is in multi-modal learning from videos. + +
+
+ comment: Accepted at CVPR 2023. Project page: + https://vision.cs.utexas.edu/projects/nvas +
+
+
+
+
+ + ♻ ☆ Label-free segmentation from cardiac ultrasound using self-supervised + learning + + +
+ Segmentation and measurement of cardiac chambers is critical in cardiac +ultrasound but is laborious and poorly reproducible. Neural networks can +assist, but supervised approaches require the same laborious manual +annotations. We built a pipeline for self-supervised (no manual labels) +segmentation combining computer vision, clinical domain knowledge, and deep +learning. We trained on 450 echocardiograms (93,000 images) and tested on 8,393 +echocardiograms (4,476,266 images; mean 61 years, 51% female), using the +resulting segmentations to calculate biometrics. We also tested against +external images from an additional 10,030 patients with available manual +tracings of the left ventricle. r2 between clinically measured and +pipeline-predicted measurements were similar to reported inter-clinician +variation and comparable to supervised learning across several different +measurements (r2 0.56-0.84). Average accuracy for detecting abnormal chamber +size and function was 0.85 (range 0.71-0.97) compared to clinical measurements. +A subset of test echocardiograms (n=553) had corresponding cardiac MRIs, where +MRI is the gold standard. Correlation between pipeline and MRI measurements was +similar to that between clinical echocardiogram and MRI. Finally, the pipeline +accurately segments the left ventricle with an average Dice score of 0.89 (95% +CI [0.89]) in the external, manually labeled dataset. Our results demonstrate a +manual-label free, clinically valid, and highly scalable method for +segmentation from ultrasound, a noisy but globally important imaging modality. + +
+
+ comment: 37 pages, 3 Tables, 7 Figures +
+
+
+
+
+ + ♻ ☆ EGOFALLS: A visual-audio dataset and benchmark for fall detection using + egocentric cameras + + +
+ Falls are significant and often fatal for vulnerable populations such as the +elderly. Previous works have addressed the detection of falls by relying on +data captured by a single sensor, either images or accelerometers. In this work, we +rely on multimodal descriptors extracted from videos captured by egocentric +cameras. Our proposed method includes a late decision fusion layer that builds +on top of the extracted descriptors. Furthermore, we collect a new dataset on +which we assess our proposed approach. We believe this is the first public +dataset of its kind. The dataset comprises 10,948 video samples by 14 subjects. +We conducted ablation experiments to assess the performance of individual +feature extractors, fusion of visual information, and fusion of both visual and +audio information. Moreover, we experimented with internal and external +cross-validation. Our results demonstrate that the fusion of audio and visual +information through late decision fusion improves detection performance, making +it a promising tool for fall prevention and mitigation. + +
+
+
+
+
+ + ♻ ☆ Improving Deep Learning Models for Pediatric Low-Grade Glioma Tumors + Molecular Subtype Identification Using 3D Probability Distributions of Tumor + Location + + +
+ Background and Purpose: Pediatric low-grade glioma (pLGG) is the most common +type of brain tumor in children, and identification of molecular markers for +pLGG is crucial for successful treatment planning. Convolutional Neural Network +(CNN) models for pLGG subtype identification rely on tumor segmentation. We +hypothesize that tumor segmentations are suboptimal, and thus we propose to augment +the CNN models using tumor location probability in MRI data. + Materials and Methods: Our REB-approved retrospective study included MRI +Fluid-Attenuated Inversion Recovery (FLAIR) sequences of 143 BRAF fused and 71 +BRAF V600E mutated tumors. Tumor segmentations (regions of interest (ROIs)) +were provided by a pediatric neuroradiology fellow and verified by a senior +pediatric neuroradiologist. In each experiment, we randomly split the data into +development and test with an 80/20 ratio. We combined the 3D binary ROI masks +for each class in the development dataset to derive the probability density +functions (PDF) of tumor location, and developed three pipelines: +location-based, CNN-based, and hybrid. + Results: We repeated the experiment with different model initializations and +data splits 100 times and calculated the Area Under Receiver Operating +Characteristic Curve (AUC). The location-based classifier achieved an AUC of +77.90, 95% confidence interval (CI) (76.76, 79.03). CNN-based classifiers +achieved an AUC of 86.11, CI (84.96, 87.25), while the tumor-location-guided CNNs +outperformed the former with an average AUC of 88.64, CI (87.57, 89.72), which +was statistically significant (Student's t-test p-value 0.0018). + Conclusion: We achieved statistically significant improvements by +incorporating tumor location into the CNN models. Our results suggest that +manually segmented ROIs may not be optimal. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2207.14776 +
+
+
+
+
+ + ♻ ☆ Open-radiomics: A Collection of Standardized Datasets and a Technical + Protocol for Reproducible Radiomics Machine Learning Pipelines + + +
+ Purpose: As an important branch of machine learning pipelines in medical +imaging, radiomics faces two major challenges, namely reproducibility and +accessibility. In this work, we introduce open-radiomics, a set of radiomics +datasets along with a comprehensive radiomics pipeline based on our proposed +technical protocol to investigate the effects of radiomics feature extraction +on the reproducibility of the results. + Materials and Methods: Experiments are conducted on the BraTS 2020 open-source +Magnetic Resonance Imaging (MRI) dataset that includes 369 adult patients with +brain tumors (76 low-grade glioma (LGG), and 293 high-grade glioma (HGG)). +Using the PyRadiomics library for LGG vs. HGG classification, 288 radiomics +datasets are formed from the combinations of 4 MRI sequences, 3 binWidths, 6 image +normalization methods, and 4 tumor subregions. + Random Forest classifiers were used, and for each radiomics dataset the +training-validation-test (60%/20%/20%) experiment with different data splits +and model random states was repeated 100 times (28,800 test results) and the Area +Under Receiver Operating Characteristic Curve (AUC) was calculated. + Results: Unlike binWidth and image normalization, tumor subregion and imaging +sequence significantly affected performance of the models. T1 contrast-enhanced +sequence and the union of necrotic and the non-enhancing tumor core subregions +resulted in the highest AUCs (average test AUC 0.951, 95% confidence interval +of (0.949, 0.952)). Although 28 settings and data splits yielded test AUC of 1, +they were irreproducible. + Conclusion: Our experiments demonstrate that the sources of variability in +radiomics pipelines (e.g., tumor subregion) can have a significant impact on +the results, which may lead to superficially perfect performance that is +irreproducible. + +
+
+
+
+
+
+
+
+ + Information Retrieval 16 + +
+
+
+ + ☆ Representation Learning with Large Language Models for Recommendation + + +
+ Recommender systems have seen significant advancements with the influence of +deep learning and graph neural networks, particularly in capturing complex +user-item relationships. However, these graph-based recommenders heavily depend +on ID-based data, potentially disregarding valuable textual information +associated with users and items, resulting in less informative learned +representations. Moreover, the utilization of implicit feedback data introduces +potential noise and bias, posing challenges for the effectiveness of user +preference learning. While the integration of large language models (LLMs) into +traditional ID-based recommenders has gained attention, challenges such as +scalability issues, limitations in text-only reliance, and prompt input +constraints need to be addressed for effective implementation in practical +recommender systems. To address these challenges, we propose a model-agnostic +framework RLMRec that aims to enhance existing recommenders with LLM-empowered +representation learning. It proposes a recommendation paradigm that integrates +representation learning with LLMs to capture intricate semantic aspects of user +behaviors and preferences. RLMRec incorporates auxiliary textual signals, +develops a user/item profiling paradigm empowered by LLMs, and aligns the +semantic space of LLMs with the representation space of collaborative +relational signals through a cross-view alignment framework. This work further +establishes a theoretical foundation demonstrating that incorporating textual +signals through mutual information maximization enhances the quality of +representations. In our evaluation, we integrate RLMRec with state-of-the-art +recommender models, while also analyzing its efficiency and robustness to noisy +data. Our implementation code is available at +https://github.com/HKUDS/RLMRec. + +
+
+
+
+
+ + ☆ Topology-aware Debiased Self-supervised Graph Learning for + Recommendation + + +
+ In recommendation, graph-based Collaborative Filtering (CF) methods mitigate +the data sparsity by introducing Graph Contrastive Learning (GCL). However, the +random negative sampling strategy in these GCL-based CF models neglects the +semantic structure of users (items), which not only introduces false negatives +(negatives that are similar to anchor user (item)) but also ignores the +potential positive samples. To tackle the above issues, we propose +Topology-aware Debiased Self-supervised Graph Learning (TDSGL) for +recommendation, which constructs contrastive pairs according to the semantic +similarity between users (items). Specifically, since the original user-item +interaction data commendably reflects the purchasing intent of users and +certain characteristics of items, we calculate the semantic similarity between +users (items) on interaction data. Then, given a user (item), we construct its +negative pairs by selecting users (items) which embed different semantic +structures to ensure the semantic difference between the given user (item) and +its negatives. Moreover, for a user (item), we design a feature extraction +module that converts other semantically similar users (items) into an auxiliary +positive sample to acquire a more informative representation. Experimental +results show that the proposed model outperforms the state-of-the-art models +significantly on three public datasets. Our model implementation codes are +available at https://github.com/malajikuai/TDSGL. + +
+
+ comment: 6 pages,8 figures +
+
+
+
+
+ + ☆ A statistical significance testing approach for measuring term + burstiness with applications to domain-specific terminology extraction + + +
+ Domain-specific terminology extraction is an important task in text analysis. +A term in a corpus is said to be "bursty" when its occurrences are concentrated +in few out of many documents. Being content rich, bursty terms are highly +suited for subject matter characterization, and serve as natural candidates for +identifying with technical terminology. Multiple measures of term burstiness +have been proposed in the literature. However, the statistical significance +testing paradigm has remained underexplored in text analysis, including in +relation to term burstiness. To test these waters, we propose as our main +contribution a multinomial language model-based exact test of statistical +significance for term burstiness. Due to its prohibitive computational cost, we +advance a heuristic formula designed to serve as a proxy for test P-values. As +a complementary theoretical contribution, we derive a previously unreported +relationship connecting the inverse document frequency and inverse collection +frequency (two foundational quantities in text analysis) under the multinomial +language model. The relation is used in the evaluation of our heuristic. Using +the GENIA Term corpus benchmark, we compare our approach against established +methods, demonstrating our heuristic's potential in identifying domain-specific +technical terms. We hope this demonstration of statistical significance testing +in text analysis serves as a springboard for future research. + +
+
+ comment: 23 pages, 1 figure, 6 tables +
+
+
+
+
+ + ☆ TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for + Inference Cost Reduction EMNLP 2023 + + +
+ Since ChatGPT released its API for public use, the number of applications +built on top of commercial large language models (LLMs) has increased exponentially. +One popular usage of such models is leveraging their in-context learning ability +and generating responses to user queries using knowledge obtained by +retrieval augmentation. One problem of deploying commercial retrieval-augmented +LLMs is the cost due to the additionally retrieved context that largely +increases the input token size of the LLMs. To mitigate this, we propose a +token compression scheme that includes two methods: summarization compression +and semantic compression. The first method applies a T5-based model that is +fine-tuned on datasets generated using self-instruct, containing samples with +varying lengths, and reduces token size by summarization. The second method +further compresses the token size by removing words with lower impact on the +semantics. In order to adequately evaluate the effectiveness of the proposed +methods, we propose and utilize a dataset called Food-Recommendation DB (FRDB) +focusing on food recommendation for women around the pregnancy period or infants. +Our summarization compression can reduce the retrieval token size by 65% with +a further 0.3% improvement in accuracy; semantic compression provides a more +flexible way to trade off the token size against performance, for which we can +reduce the token size by 20% with only a 1.6% accuracy drop. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ KITAB: Evaluating LLMs on Constraint Satisfaction for Information + Retrieval + + +
+ We study the ability of state-of-the-art models to answer constraint +satisfaction queries for information retrieval (e.g., 'a list of ice cream +shops in San Diego'). In the past, such queries were considered to be tasks +that could only be solved via web-search or knowledge bases. More recently, +large language models (LLMs) have demonstrated initial emergent abilities in +this task. However, many current retrieval benchmarks are either saturated or +do not measure constraint satisfaction. Motivated by rising concerns around +factual incorrectness and hallucinations of LLMs, we present KITAB, a new +dataset for measuring constraint satisfaction abilities of language models. +KITAB consists of book-related data across more than 600 authors and 13,000 +queries, and also offers an associated dynamic data collection and constraint +verification approach for acquiring similar test data for other authors. Our +extended experiments on GPT4 and GPT3.5 characterize and decouple common +failure modes across dimensions such as information popularity, constraint +types, and context availability. Results show that in the absence of context, +models exhibit severe limitations as measured by irrelevant information, +factual errors, and incompleteness, many of which worsen as information +popularity decreases. While context availability mitigates irrelevant +information, it is not helpful for satisfying constraints, identifying +fundamental barriers to constraint satisfaction. We open source our +contributions to foster further research on improving constraint satisfaction +abilities of future models. + +
+
+ comment: 23 pages +
+
+
+
+
+ + ☆ Robust Representation Learning for Unified Online Top-K Recommendation ICDE + + +
+ In large-scale industrial e-commerce, the efficiency of an online +recommendation system is crucial in delivering highly relevant item/content +advertising that caters to diverse business scenarios. However, most existing +studies focus solely on item advertising, neglecting the significance of +content advertising. This oversight results in inconsistencies within the +multi-entity structure and unfair retrieval. Furthermore, the challenge of +retrieving top-k advertisements from multi-entity advertisements across +different domains adds to the complexity. Recent research proves that +user-entity behaviors within different domains exhibit characteristics of +differentiation and homogeneity. Therefore, the multi-domain matching models +typically rely on the hybrid-experts framework with domain-invariant and +domain-specific representations. Unfortunately, most approaches primarily focus +on optimizing the combination mode of different experts, failing to address the +inherent difficulty in optimizing the expert modules themselves. The existence +of redundant information across different domains introduces interference and +competition among experts, while the distinct learning objectives of each +domain lead to varying optimization challenges among experts. To tackle these +issues, we propose robust representation learning for the unified online top-k +recommendation. Our approach constructs unified modeling in entity space to +ensure data fairness. The robust representation learning employs domain +adversarial learning and multi-view Wasserstein distribution learning to learn +robust representations. Moreover, the proposed method balances conflicting +objectives through the homoscedastic uncertainty weights and orthogonality +constraints. Various experiments validate the effectiveness and rationality of +our proposed method, which has been successfully deployed online to serve real +business scenarios. + +
+
+ comment: 14 pages, 6 figures, submitted to ICDE +
+
+
+
+
+ + ☆ Off-Policy Evaluation for Large Action Spaces via Policy Convolution + + +
+ Developing accurate off-policy estimators is crucial for both evaluating and +optimizing for new policies. The main challenge in off-policy estimation is the +distribution shift between the logging policy that generates data and the +target policy that we aim to evaluate. Typically, techniques for correcting +distribution shift involve some form of importance sampling. This approach +results in unbiased value estimation but often comes with the trade-off of high +variance, even in the simpler case of one-step contextual bandits. Furthermore, +importance sampling relies on the common support assumption, which becomes +impractical when the action space is large. To address these challenges, we +introduce the Policy Convolution (PC) family of estimators. These methods +leverage latent structure within actions -- made available through action +embeddings -- to strategically convolve the logging and target policies. This +convolution introduces a unique bias-variance trade-off, which can be +controlled by adjusting the amount of convolution. Our experiments on synthetic +and benchmark datasets demonstrate remarkable mean squared error (MSE) +improvements when using PC, especially when either the action space or policy +mismatch becomes large, with gains of up to 5 - 6 orders of magnitude over +existing estimators. + +
+
+ comment: Under review. 36 pages, 31 figures +
+
+
+
+
+ + ☆ Context-aware feature attribution through argumentation + + +
+ Feature attribution is a fundamental task in both machine learning and data +analysis, which involves determining the contribution of individual features or +variables to a model's output. This process helps identify the most important +features for predicting an outcome. The history of feature attribution methods +can be traced back to Generalized Additive Models (GAMs), which extend linear +regression models by incorporating non-linear relationships between dependent +and independent variables. In recent years, gradient-based methods and +surrogate models have been applied to unravel complex Artificial Intelligence +(AI) systems, but these methods have limitations. GAMs tend to achieve lower +accuracy, gradient-based methods can be difficult to interpret, and surrogate +models often suffer from stability and fidelity issues. Furthermore, most +existing methods do not consider users' contexts, which can significantly +influence their preferences. To address these limitations and advance the +current state-of-the-art, we define a novel feature attribution framework +called Context-Aware Feature Attribution Through Argumentation (CA-FATA). Our +framework harnesses the power of argumentation by treating each feature as an +argument that can either support, attack or neutralize a prediction. +Additionally, CA-FATA formulates feature attribution as an argumentation +procedure, and each computation has explicit semantics, which makes it +inherently interpretable. CA-FATA also easily integrates side information, such +as users' contexts, resulting in more accurate predictions. + +
+
+
+
+
+ + ☆ Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model + System for Answering Medical Questions using Scientific Literature + + +
+ The quickly-expanding nature of published medical literature makes it +challenging for clinicians and researchers to keep up with and summarize +recent, relevant findings in a timely manner. While several closed-source +summarization tools based on large language models (LLMs) now exist, rigorous +and systematic evaluations of their outputs are lacking. Furthermore, there is +a paucity of high-quality datasets and appropriate benchmark tasks with which +to evaluate these tools. We address these issues with four contributions: we +release Clinfo.ai, an open-source WebApp that answers clinical questions based +on dynamically retrieved scientific literature; we specify an information +retrieval and abstractive summarization task to evaluate the performance of +such retrieval-augmented LLM systems; we release a dataset of 200 questions and +corresponding answers derived from published systematic reviews, which we name +PubMed Retrieval and Synthesis (PubMedRS-200); and report benchmark results for +Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200. + +
+
+ comment: Preprint of an article published in Pacific Symposium on Biocomputing + copyright 2024 World Scientific Publishing Co., Singapore, + http://psb.stanford.edu/ +
+
+
+
+
+ + ☆ Context-aware explainable recommendations over knowledge graphs + + +
+ Knowledge graphs contain rich semantic relationships related to items and +incorporating such semantic relationships into recommender systems helps to +explore the latent connections of items, thus improving the accuracy of +prediction and enhancing the explainability of recommendations. However, such +explainability is not adapted to users' contexts, which can significantly +influence their preferences. In this work, we propose CA-KGCN (Context-Aware +Knowledge Graph Convolutional Network), an end-to-end framework that can model +users' preferences adapted to their contexts and can incorporate rich semantic +relationships in the knowledge graph related to items. This framework captures +users' attention to different factors: contexts and features of items. More +specifically, the framework can model users' preferences adapted to their +contexts and provide explanations adapted to the given context. Experiments on +three real-world datasets show the effectiveness of our framework: modeling +users' preferences adapted to their contexts and explaining the recommendations +generated. + +
+
+
+
+
+ + ♻ ☆ Beyond Semantics: Learning a Behavior Augmented Relevance Model with + Self-supervised Learning CIKM2023 + + +
+ Relevance modeling aims to locate desirable items for corresponding queries, +which is crucial for search engines to ensure user experience. Although most +conventional approaches address this problem by assessing the semantic +similarity between the query and item, pure semantic matching is not +everything. In reality, auxiliary query-item interactions extracted from user +historical behavior data of the search log could provide hints to reveal users' +search intents further. Drawing inspiration from this, we devise a novel +Behavior Augmented Relevance Learning model for Alipay Search (BARL-ASe) that +leverages neighbor queries of target item and neighbor items of target query to +complement target query-item semantic matching. Specifically, our model builds +multi-level co-attention for distilling coarse-grained and fine-grained +semantic representations from both neighbor and target views. The model +subsequently employs neighbor-target self-supervised learning to improve the +accuracy and robustness of BARL-ASe by strengthening representation and logit +learning. Furthermore, we discuss how to deal with the long-tail query-item +matching of the mini apps search scenario of Alipay practically. Experiments on +real-world industry data and online A/B testing demonstrate our proposal +achieves promising performance with low latency. + +
+
+ comment: Accepted by CIKM2023 +
+
+
+
+
+ + ♻ ☆ AdaptSSR: Pre-training User Model with Augmentation-Adaptive + Self-Supervised Ranking NeurIPS 2023 + + +
+ User modeling, which aims to capture users' characteristics or interests, +heavily relies on task-specific labeled data and suffers from the data sparsity +issue. Several recent studies tackled this problem by pre-training the user +model on massive user behavior sequences with a contrastive learning task. +Generally, these methods assume different views of the same behavior sequence +constructed via data augmentation are semantically consistent, i.e., reflecting +similar characteristics or interests of the user, and thus maximize their +agreement in the feature space. However, due to the diverse interests and heavy +noise in user behaviors, existing augmentation methods tend to lose certain +characteristics of the user or introduce noisy behaviors. Thus, forcing the +user model to directly maximize the similarity between the augmented views may +result in a negative transfer. To this end, we propose to replace the +contrastive learning task with a new pretext task: Augmentation-Adaptive +Self-Supervised Ranking (AdaptSSR), which alleviates the requirement of semantic +consistency between the augmented views while pre-training a discriminative +user model. Specifically, we adopt a multiple pairwise ranking loss which +trains the user model to capture the similarity orders between the implicitly +augmented view, the explicitly augmented view, and views from other users. We +further employ an in-batch hard negative sampling strategy to facilitate model +training. Moreover, considering the distinct impacts of data augmentation on +different behavior sequences, we design an augmentation-adaptive fusion +mechanism to automatically adjust the similarity order constraint applied to +each sample based on the estimated similarity between the augmented views. +Extensive experiments on both public and industrial datasets with six +downstream tasks verify the effectiveness of AdaptSSR. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Towards Open-World Recommendation with Knowledge Augmentation from Large + Language Models + + +
+ Recommender systems play a vital role in various online services. However, +the insulated nature of training and deploying separately within a specific +domain limits their access to open-world knowledge. Recently, the emergence of +large language models (LLMs) has shown promise in bridging this gap by encoding +extensive world knowledge and demonstrating reasoning capability. Nevertheless, +previous attempts to directly use LLMs as recommenders have not achieved +satisfactory results. In this work, we propose an Open-World Knowledge +Augmented Recommendation Framework with Large Language Models, dubbed KAR, to +acquire two types of external knowledge from LLMs -- the reasoning knowledge on +user preferences and the factual knowledge on items. We introduce factorization +prompting to elicit accurate reasoning on user preferences. The generated +reasoning and factual knowledge are effectively transformed and condensed into +augmented vectors by a hybrid-expert adaptor in order to be compatible with the +recommendation task. The obtained vectors can then be directly used to enhance +the performance of any recommendation model. We also ensure efficient inference +by preprocessing and prestoring the knowledge from the LLM. Extensive +experiments show that KAR significantly outperforms the state-of-the-art +baselines and is compatible with a wide range of recommendation algorithms. + +
+
+
+
+
+ + ♻ ☆ Content-Based Search for Deep Generative Models + + +
+ The growing proliferation of customized and pretrained generative models has +made it infeasible for a user to be fully cognizant of every model in +existence. To address this need, we introduce the task of content-based model +search: given a query and a large set of generative models, finding the models +that best match the query. As each generative model produces a distribution of +images, we formulate the search task as an optimization problem to select the +model with the highest probability of generating similar content as the query. +We introduce a formulation to approximate this probability given the query from +different modalities, e.g., image, sketch, and text. Furthermore, we propose a +contrastive learning framework for model retrieval, which learns to adapt +features for various query modalities. We demonstrate that our method +outperforms several baselines on Generative Model Zoo, a new benchmark we +create for the model retrieval task. + +
+
+ comment: Our project page is hosted at + https://generative-intelligence-lab.github.io/modelverse/ +
+
+
+
+
+ + ♻ ☆ CorefPrompt: Prompt-based Event Coreference Resolution by Measuring + Event Type and Argument Compatibilities EMNLP2023 + + +
+ Event coreference resolution (ECR) aims to group event mentions referring to +the same real-world event into clusters. Most previous studies adopt the +"encoding first, then scoring" framework, making the coreference judgment rely +on event encoding. Furthermore, current methods struggle to leverage +human-summarized ECR rules, e.g., coreferential events should have the same +event type, to guide the model. To address these two issues, we propose a +prompt-based approach, CorefPrompt, to transform ECR into a cloze-style MLM +(masked language model) task. This allows for simultaneous event modeling and +coreference discrimination within a single template, with a fully shared +context. In addition, we introduce two auxiliary prompt tasks, event-type +compatibility and argument compatibility, to explicitly demonstrate the +reasoning process of ECR, which helps the model make final predictions. +Experimental results show that our method CorefPrompt performs well in a +state-of-the-art (SOTA) benchmark. + +
+
+ comment: Accepted by EMNLP2023 +
+
+
+
+
+ + ♻ ☆ TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks EMNLP 2023 + + +
+ While LLMs have shown great success in understanding and generating text in +traditional conversational settings, their potential for performing ill-defined +complex tasks is largely under-studied. Indeed, we are yet to conduct +comprehensive benchmarking studies with multiple LLMs that are exclusively +focused on a complex task. However, conducting such benchmarking studies is +challenging because of the large variations in LLMs' performance when different +prompt types/styles are used and different degrees of detail are provided in +the prompts. To address this issue, the paper proposes a general taxonomy that +can be used to design prompts with specific properties in order to perform a +wide range of complex tasks. This taxonomy will allow future benchmarking +studies to report the specific categories of prompts used as part of the study, +enabling meaningful comparisons across different studies. Also, by establishing +a common standard through this taxonomy, researchers will be able to draw more +accurate conclusions about LLMs' performance on a specific complex task. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+
+
+
+ + Machine Learning 150 + +
+
+
+ + ☆ AI Alignment and Social Choice: Fundamental Limitations and Policy + Implications + + +
+ Aligning AI agents to human intentions and values is a key bottleneck in +building safe and deployable AI applications. But whose values should AI agents +be aligned with? Reinforcement learning with human feedback (RLHF) has emerged +as the key framework for AI alignment. RLHF uses feedback from human +reinforcers to fine-tune outputs; all widely deployed large language models +(LLMs) use RLHF to align their outputs to human values. It is critical to +understand the limitations of RLHF and consider policy challenges arising from +these limitations. In this paper, we investigate a specific challenge in +building RLHF systems that respect democratic norms. Building on impossibility +results in social choice theory, we show that, under fairly broad assumptions, +there is no unique voting protocol to universally align AI systems using RLHF +through democratic processes. Further, we show that aligning AI agents with the +values of all individuals will always violate certain private ethical +preferences of an individual user, i.e., universal AI alignment using RLHF is +impossible. We discuss policy implications for the governance of AI systems +built using RLHF: first, the need for mandating transparent voting rules to +hold model builders accountable; second, the need for model builders to focus +on developing AI agents that are narrowly aligned to specific user groups. + +
+
+ comment: 10 pages, no figures +
+
+
+
+
+ + ☆ From Posterior Sampling to Meaningful Diversity in Image Restoration + + +
+ Image restoration problems are typically ill-posed in the sense that each +degraded image can be restored in infinitely many valid ways. To accommodate +this, many works generate a diverse set of outputs by attempting to randomly +sample from the posterior distribution of natural images given the degraded +input. Here we argue that this strategy is commonly of limited practical value +because of the heavy tail of the posterior distribution. Consider for example +inpainting a missing region of the sky in an image. Since there is a high +probability that the missing region contains no object but clouds, any set of +samples from the posterior would be entirely dominated by (practically +identical) completions of sky. However, arguably, presenting users with only +one clear sky completion, along with several alternative solutions such as +airships, birds, and balloons, would better outline the set of possibilities. +In this paper, we initiate the study of meaningfully diverse image restoration. +We explore several post-processing approaches that can be combined with any +diverse image restoration method to yield semantically meaningful diversity. +Moreover, we propose a practical approach for allowing diffusion based image +restoration methods to generate meaningfully diverse outputs, while incurring +only negligible computational overhead. We conduct extensive user studies to +analyze the proposed techniques, and find the strategy of reducing similarity +between outputs to be significantly favorable over posterior sampling. Code and +examples are available at https://noa-cohen.github.io/MeaningfulDiversityInIR + +
+
+ comment: Code and examples are available in + https://noa-cohen.github.io/MeaningfulDiversityInIR +
+
+
+
+
+ + ☆ A Unified, Scalable Framework for Neural Population Decoding NeurIPS 2023 + + +
+ Our ability to use deep learning approaches to decipher neural activity would +likely benefit from greater scale, in terms of both model size and datasets. +However, the integration of many neural recordings into one unified model is +challenging, as each recording contains the activity of different neurons from +different individual animals. In this paper, we introduce a training framework +and architecture designed to model the population dynamics of neural activity +across diverse, large-scale neural recordings. Our method first tokenizes +individual spikes within the dataset to build an efficient representation of +neural events that captures the fine temporal structure of neural activity. We +then employ cross-attention and a PerceiverIO backbone to further construct a +latent tokenization of neural population activities. Utilizing this +architecture and training framework, we construct a large-scale multi-session +model trained on large datasets from seven nonhuman primates, spanning over 158 +different sessions of recording from over 27,373 neural units and over 100 +hours of recordings. In a number of different tasks, we demonstrate that our +pretrained model can be rapidly adapted to new, unseen sessions with +unspecified neuron correspondence, enabling few-shot performance with minimal +labels. This work presents a powerful new approach for building deep learning +tools to analyze neural data and stakes out a clear path to training at scale. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ Woodpecker: Hallucination Correction for Multimodal Large Language + Models + + +
+ Hallucination is a big shadow hanging over the rapidly evolving Multimodal +Large Language Models (MLLMs), referring to the phenomenon that the generated +text is inconsistent with the image content. In order to mitigate +hallucinations, existing studies mainly resort to an instruction-tuning manner +that requires retraining the models with specific data. In this paper, we pave +a different way, introducing a training-free method named Woodpecker. Like a +woodpecker heals trees, it picks out and corrects hallucinations from the +generated text. Concretely, Woodpecker consists of five stages: key concept +extraction, question formulation, visual knowledge validation, visual claim +generation, and hallucination correction. Implemented in a post-remedy manner, +Woodpecker can easily serve different MLLMs, while being interpretable by +accessing intermediate outputs of the five stages. We evaluate Woodpecker both +quantitatively and qualitatively and show the huge potential of this new +paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement +in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released +at https://github.com/BradyFU/Woodpecker. + +
+
+ comment: 16 pages, 7 figures. Code Website: + https://github.com/BradyFU/Woodpecker +
+
+
+
+
+ + ☆ What's Left? Concept Grounding with Logic-Enhanced Foundation Models NeurIPS 2023 + + +
+ Recent works such as VisProg and ViperGPT have smartly composed foundation +models for visual reasoning, using large language models (LLMs) to produce +programs that can be executed by pre-trained vision-language models. However, +they operate in limited domains, such as 2D images, not fully exploiting the +generalization of language: abstract concepts like "left" can also be grounded +in 3D, temporal, and action data, as in moving to your left. This limited +generalization stems from these inference-only methods' inability to learn or +adapt pre-trained models to a new domain. We propose the Logic-Enhanced +Foundation Model (LEFT), a unified framework that learns to ground and reason +with concepts across domains with a differentiable, domain-independent, +first-order logic-based program executor. LEFT has an LLM interpreter that +outputs a program represented in a general, logic-based reasoning language, +which is shared across all domains and tasks. LEFT's executor then executes the +program with trainable domain-specific grounding modules. We show that LEFT +flexibly learns concepts in four domains: 2D images, 3D scenes, human motions, +and robotic manipulation. It exhibits strong reasoning ability in a wide +variety of tasks, including those that are complex and not seen during +training, and can be easily applied to new domains. + +
+
+ comment: NeurIPS 2023. First two authors contributed equally. Project page: + https://web.stanford.edu/~joycj/projects/left_neurips_2023 +
+
+
+
+
+ + ☆ Finetuning Offline World Models in the Real World + + +
+ Reinforcement Learning (RL) is notoriously data-inefficient, which makes +training on a real robot difficult. While model-based RL algorithms (world +models) improve data-efficiency to some extent, they still require hours or +days of interaction to learn skills. Recently, offline RL has been proposed as +a framework for training RL policies on pre-existing datasets without any +online interaction. However, constraining an algorithm to a fixed dataset +induces a state-action distribution shift between training and inference, and +limits its applicability to new tasks. In this work, we seek to get the best of +both worlds: we consider the problem of pretraining a world model with offline +data collected on a real robot, and then finetuning the model on online data +collected by planning with the learned model. To mitigate extrapolation errors +during online interaction, we propose to regularize the planner at test-time by +balancing estimated returns and (epistemic) model uncertainty. We evaluate our +method on a variety of visuo-motor control tasks in simulation and on a real +robot, and find that our method enables few-shot finetuning to seen and unseen +tasks even when offline data is limited. Videos, code, and data are available +at https://yunhaifeng.com/FOWM . + +
+
+ comment: CoRL 2023 Oral; Project website: https://yunhaifeng.com/FOWM +
+
+
+
+
+ + ☆ What Algorithms can Transformers Learn? A Study in Length Generalization + + +
+ Large language models exhibit surprising emergent generalization properties, +yet also struggle on many simple reasoning tasks such as arithmetic and parity. +This raises the question of if and when Transformer models can learn the true +algorithm for solving a task. We study the scope of Transformers' abilities in +the specific setting of length generalization on algorithmic tasks. Here, we +propose a unifying framework to understand when and how Transformers can +exhibit strong length generalization on a given task. Specifically, we leverage +RASP (Weiss et al., 2021) -- a programming language designed for the +computational model of a Transformer -- and introduce the RASP-Generalization +Conjecture: Transformers tend to length generalize on a task if the task can be +solved by a short RASP program which works for all input lengths. This simple +conjecture remarkably captures most known instances of length generalization on +algorithmic tasks. Moreover, we leverage our insights to drastically improve +generalization performance on traditionally hard tasks (such as parity and +addition). On the theoretical side, we give a simple example where the +"min-degree-interpolator" model of learning from Abbe et al. (2023) does not +correctly predict Transformers' out-of-distribution behavior, but our +conjecture does. Overall, our work provides a novel perspective on the +mechanisms of compositional generalization and the algorithmic capabilities of +Transformers. + +
+
+ comment: Preprint +
+
+
+
+
+ + ☆ TimewarpVAE: Simultaneous Time-Warping and Representation Learning of + Trajectories + + +
+ Human demonstrations of trajectories are an important source of training data +for many machine learning problems. However, the difficulty of collecting human +demonstration data for complex tasks makes learning efficient representations +of those trajectories challenging. For many problems, such as for handwriting +or for quasistatic dexterous manipulation, the exact timings of the +trajectories should be factored from their spatial path characteristics. In +this work, we propose TimewarpVAE, a fully differentiable manifold-learning +algorithm that incorporates Dynamic Time Warping (DTW) to simultaneously learn +both timing variations and latent factors of spatial variation. We show how the +TimewarpVAE algorithm learns appropriate time alignments and meaningful +representations of spatial variations in small handwriting and fork +manipulation datasets. Our results have lower spatial reconstruction test error +than baseline approaches and the learned low-dimensional representations can be +used to efficiently generate semantically meaningful novel trajectories. + +
+
+ comment: 17 pages, 12 figures +
+
+
+
+
+ + ☆ Human-in-the-Loop Task and Motion Planning for Imitation Learning + + +
+ Imitation learning from human demonstrations can teach robots complex +manipulation skills, but is time-consuming and labor intensive. In contrast, +Task and Motion Planning (TAMP) systems are automated and excel at solving +long-horizon tasks, but they are difficult to apply to contact-rich tasks. In +this paper, we present Human-in-the-Loop Task and Motion Planning (HITL-TAMP), +a novel system that leverages the benefits of both approaches. The system +employs a TAMP-gated control mechanism, which selectively gives and takes +control to and from a human teleoperator. This enables the human teleoperator +to manage a fleet of robots, maximizing data collection efficiency. The +collected human data is then combined with an imitation learning framework to +train a TAMP-gated policy, leading to superior performance compared to training +on full task demonstrations. We compared HITL-TAMP to a conventional +teleoperation system -- users gathered more than 3x the number of demos given +the same time budget. Furthermore, proficient agents (75%+ success) could be +trained from just 10 minutes of non-expert teleoperation data. Finally, we +collected 2.1K demos with HITL-TAMP across 12 contact-rich, long-horizon tasks +and show that the system often produces near-perfect agents. Videos and +additional results at https://hitltamp.github.io . + +
+
+ comment: Conference on Robot Learning (CoRL) 2023 +
+
+
+
+
+ + ☆ MLFMF: Data Sets for Machine Learning for Mathematical Formalization NeurIPS 2023 + + +
+ We introduce MLFMF, a collection of data sets for benchmarking recommendation +systems used to support formalization of mathematics with proof assistants. +These systems help humans identify which previous entries (theorems, +constructions, datatypes, and postulates) are relevant in proving a new theorem +or carrying out a new construction. Each data set is derived from a library of +formalized mathematics written in proof assistants Agda or Lean. The collection +includes the largest Lean 4 library Mathlib, and some of the largest Agda +libraries: the standard library, the library of univalent mathematics +Agda-unimath, and the TypeTopology library. Each data set represents the +corresponding library in two ways: as a heterogeneous network, and as a list of +s-expressions representing the syntax trees of all the entries in the library. +The network contains the (modular) structure of the library and the references +between entries, while the s-expressions give complete and easily parsed +information about every entry. We report baseline results using standard graph +and word embeddings, tree ensembles, and instance-based learning algorithms. +The MLFMF data sets provide solid benchmarking support for further +investigation of the numerous machine learning approaches to formalized +mathematics. The methodology used to extract the networks and the s-expressions +readily applies to other libraries, and is applicable to other proof +assistants. With more than $250\,000$ entries in total, this is currently the +largest collection of formalized mathematical knowledge in machine learnable +format. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ Transitivity Recovering Decompositions: Interpretable and Robust + Fine-Grained Relationships NeurIPS + + +
+ Recent advances in fine-grained representation learning leverage +local-to-global (emergent) relationships for achieving state-of-the-art +results. The relational representations relied upon by such methods, however, +are abstract. We aim to deconstruct this abstraction by expressing them as +interpretable graphs over image views. We begin by theoretically showing that +abstract relational representations are nothing but a way of recovering +transitive relationships among local views. Based on this, we design +Transitivity Recovering Decompositions (TRD), a graph-space search algorithm +that identifies interpretable equivalents of abstract emergent relationships at +both instance and class levels, and with no post-hoc computations. We +additionally show that TRD is provably robust to noisy views, with empirical +evidence also supporting this finding. The latter allows TRD to perform at par +or even better than the state-of-the-art, while being fully interpretable. +Implementation is available at https://github.com/abhrac/trd. + +
+
+ comment: Neural Information Processing Systems (NeurIPS) 2023 +
+
+
+
+
+ + ☆ White-box Compiler Fuzzing Empowered by Large Language Models + + +
+ Compiler correctness is crucial, as miscompilation that alters program +behavior can lead to serious consequences. In the literature, fuzzing has been +extensively studied to uncover compiler defects. However, compiler fuzzing +remains challenging: existing work focuses on black- and grey-box fuzzing, which +generates tests without sufficient understanding of internal compiler +behaviors. As such, these fuzzers often fail to construct programs that exercise +the conditions of intricate optimizations. Meanwhile, traditional white-box +techniques are computationally inapplicable to the giant codebase of compilers. +Recent advances demonstrate that Large Language Models (LLMs) excel in code +generation/understanding tasks and have achieved state-of-the-art performance +in black-box fuzzing. Nonetheless, prompting LLMs with compiler source-code +information remains a missing piece of research in compiler testing. + To this end, we propose WhiteFox, the first white-box compiler fuzzer using +LLMs with source-code information to test compiler optimization. WhiteFox +adopts a dual-model framework: (i) an analysis LLM examines the low-level +optimization source code and produces requirements on the high-level test +programs that can trigger the optimization; (ii) a generation LLM produces test +programs based on the summarized requirements. Additionally, +optimization-triggering tests are used as feedback to further enhance the test +generation on the fly. Our evaluation on four popular compilers shows that +WhiteFox can generate high-quality tests to exercise deep optimizations +requiring intricate conditions, exercising up to 80 more optimizations than +state-of-the-art fuzzers. To date, WhiteFox has found in total 96 bugs, with 80 +confirmed as previously unknown and 51 already fixed. Beyond compiler testing, +WhiteFox can also be adapted for white-box fuzzing of other complex, real-world +software systems in general. + +
+
+
+
+
+ + ☆ Graph Deep Learning for Time Series Forecasting + + +
+ Graph-based deep learning methods have become popular tools to process +collections of correlated time series. Differently from traditional +multivariate forecasting methods, neural graph-based predictors take advantage +of pairwise relationships by conditioning forecasts on a (possibly dynamic) +graph spanning the time series collection. The conditioning can take the form +of an architectural inductive bias on the neural forecasting architecture, +resulting in a family of deep learning models called spatiotemporal graph +neural networks. Such relational inductive biases enable the training of global +forecasting models on large time-series collections, while at the same time +localizing predictions w.r.t. each element in the set (i.e., graph nodes) by +accounting for local correlations among them (i.e., graph edges). Indeed, +recent theoretical and practical advances in graph neural networks and deep +learning for time series forecasting make the adoption of such processing +frameworks appealing and timely. However, most of the studies in the literature +focus on proposing variations of existing neural architectures by taking +advantage of modern deep learning practices, while foundational and +methodological aspects have not been subject to systematic investigation. To +fill the gap, this paper aims to introduce a comprehensive methodological +framework that formalizes the forecasting problem and provides design +principles for graph-based predictive models and methods to assess their +performance. At the same time, together with an overview of the field, we +provide design guidelines, recommendations, and best practices, as well as an +in-depth discussion of open challenges and future research directions. + +
+
+
+
+
+ + ☆ Data-driven Traffic Simulation: A Comprehensive Review + + +
+ Autonomous vehicles (AVs) have the potential to significantly revolutionize +society by providing a secure and efficient mode of transportation. Recent +years have witnessed notable advancements in autonomous driving perception and +prediction, but the challenge of validating the performance of AVs remains +largely unresolved. Data-driven microscopic traffic simulation has become an +important tool for autonomous driving testing due to 1) availability of +high-fidelity traffic data; 2) its advantages of enabling large-scale testing +and scenario reproducibility; and 3) its potential in reactive and realistic +traffic simulation. However, a comprehensive review of this topic is currently +lacking. This paper aims to fill this gap by summarizing relevant studies. The +primary objective of this paper is to review current research efforts and +provide a futuristic perspective that will benefit future developments in the +field. It introduces the general issues of data-driven traffic simulation and +outlines key concepts and terms. After overviewing traffic simulation, various +datasets and evaluation metrics commonly used are reviewed. The paper then +offers a comprehensive evaluation of imitation learning, reinforcement +learning, generative and deep learning methods, summarizing each and analyzing +their advantages and disadvantages in detail. Moreover, it evaluates the +state-of-the-art, existing challenges, and future research directions. + +
+
+ comment: 18 pages, 4 figures, 4 tables +
+
+
+
+
+ + ☆ Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex + Optimization + + +
+ signSGD is popular in nonconvex optimization due to its communication +efficiency. Yet, existing analyses of signSGD rely on assuming that data are +sampled with replacement in each iteration, contradicting the practical +implementation where data are randomly reshuffled and sequentially fed into the +algorithm. We bridge this gap by proving the first convergence result of +signSGD with random reshuffling (SignRR) for nonconvex optimization. Given the +dataset size $n$, the number of epochs of data passes $T$, and the variance +bound of a stochastic gradient $\sigma^2$, we show that SignRR has the same +convergence rate $O(\log(nT)/\sqrt{nT} + \|\sigma\|_1)$ as signSGD +(Bernstein et al., 2018). We then present SignRVR and SignRVM, which +leverage variance-reduced gradients and momentum updates respectively, both +converging at $O(\log(nT)/\sqrt{nT})$. In contrast with the analysis of +signSGD, our results do not require an extremely large batch size in each +iteration, of the same order as the total number of iterations +(Bernstein et al., 2018), nor that the signs of stochastic and true gradients +match element-wise with a minimum probability of 1/2 +(Safaryan and Richtárik, 2021). We also extend our algorithms to cases where +data are distributed across different machines, yielding dist-SignRVR and +dist-SignRVM, both converging at $O(\log(n_0T)/\sqrt{n_0T})$, where $n_0$ is +the dataset size of a single machine. We back up our theoretical findings +through experiments on simulated and real-world problems, verifying that +randomly reshuffled sign methods match or surpass existing baselines. + +
+
+ comment: 45 pages, 4 figures +
+
+
+
+
+ + ☆ Minimax Forward and Backward Learning of Evolving Tasks with Performance + Guarantees + + +
+ For a sequence of classification tasks that arrive over time, it is common +that tasks are evolving in the sense that consecutive tasks often have a higher +similarity. The incremental learning of a growing sequence of tasks holds +promise to enable accurate classification even with few samples per task by +leveraging information from all the tasks in the sequence (forward and backward +learning). However, existing techniques developed for continual learning and +concept drift adaptation are either designed for tasks with time-independent +similarities or only aim to learn the last task in the sequence. This paper +presents incremental minimax risk classifiers (IMRCs) that effectively exploit +forward and backward learning and account for evolving tasks. In addition, we +analytically characterize the performance improvement provided by forward and +backward learning in terms of the tasks' expected quadratic change and the +number of tasks. The experimental evaluation shows that IMRCs can result in a +significant performance improvement, especially for reduced sample sizes. + +
+
+
+
+
+ + ☆ Accented Speech Recognition With Accent-specific Codebooks EMNLP 2023 + + +
+ Speech accents pose a significant challenge to state-of-the-art automatic +speech recognition (ASR) systems. Degradation in performance across +underrepresented accents is a severe deterrent to the inclusive adoption of +ASR. In this work, we propose a novel accent adaptation approach for end-to-end +ASR systems using cross-attention with a trainable set of codebooks. These +learnable codebooks capture accent-specific information and are integrated +within the ASR encoder layers. The model is trained on accented English speech, +while the test data also contains accents which were not seen during training. +On the Mozilla Common Voice multi-accented dataset, we show that our proposed +approach yields significant performance gains not only on the seen English +accents (up to $37\%$ relative improvement in word error rate) but also on the +unseen accents (up to $5\%$ relative improvement in WER). Further, we +illustrate benefits for a zero-shot transfer setup on the L2Artic dataset. We +also compare the performance with other approaches based on accent adversarial +training. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference (Long Paper) +
+
+
+
+
+ + ☆ Constructing and Machine Learning Calabi-Yau Five-folds + + +
+ We construct all possible complete intersection Calabi-Yau five-folds in a +product of four or fewer complex projective spaces, with up to four constraints. +We obtain $27068$ spaces, which are not related by permutations of rows and +columns of the configuration matrix, and determine the Euler number for all of +them. Excluding the $3909$ product manifolds among those, we calculate the +cohomological data for $12433$ cases, i.e. $53.7 \%$ of the non-product spaces, +obtaining $2375$ different Hodge diamonds. The dataset containing all the above +information is available at +https://www.dropbox.com/scl/fo/z7ii5idt6qxu36e0b8azq/h?rlkey=0qfhx3tykytduobpld510gsfy&dl=0 +. The distributions of the invariants are presented, and a comparison with the +lower-dimensional analogues is discussed. Supervised machine learning is +performed on the cohomological data, via classifier and regressor (both fully +connected and convolutional) neural networks. We find that $h^{1,1}$ can be +learnt very efficiently, with very high $R^2$ score and an accuracy of $96\%$, +i.e. $96 \%$ of the predictions exactly match the correct values. For +$h^{1,4},h^{2,3}, \eta$, we also find very high $R^2$ scores, but the accuracy +is lower, due to the large ranges of possible values. + +
+
+ comment: 40 pages, 8 tables, 2 figures +
+
+
+
+
+ + ☆ Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation + + +
+ Despite the promise of Mixture of Experts (MoE) models in increasing +parameter counts of Transformer models while maintaining training and inference +costs, their application carries notable drawbacks. The key strategy of these +models is to, for each processed token, activate at most a few experts - +subsets of an extensive feed-forward layer. But this approach is not without +its challenges. The operation of matching experts and tokens is discrete, which +makes MoE models prone to issues like training instability and uneven expert +utilization. Existing techniques designed to address these concerns, such as +auxiliary losses or balance-aware matching, result either in lower model +performance or are more difficult to train. In response to these issues, we +propose Mixture of Tokens, a fully-differentiable model that retains the +benefits of MoE architectures while avoiding the aforementioned difficulties. +Rather than routing tokens to experts, this approach mixes tokens from +different examples prior to feeding them to experts, enabling the model to +learn from all token-expert combinations. Importantly, this mixing can be +disabled to avoid mixing of different sequences during inference. Crucially, +this method is fully compatible with both masked and causal Large Language +Model training and inference. + +
+
+
+
+
+ + ☆ Improving Robustness and Reliability in Medical Image Classification + with Latent-Guided Diffusion and Nested-Ensembles + + +
+ While deep learning models have achieved remarkable success across a range of +medical image analysis tasks, deployment of these models in real clinical +contexts requires that they be robust to variability in the acquired images. +While many methods apply predefined transformations to augment the training +data to enhance test-time robustness, these transformations may not ensure the +model's robustness to the diverse variability seen in patient images. In this +paper, we introduce a novel three-stage approach based on transformers coupled +with conditional diffusion models, with the goal of improving model robustness +to the kinds of imaging variability commonly encountered in practice without +the need for pre-determined data augmentation strategies. To this end, multiple +image encoders first learn hierarchical feature representations to build +discriminative latent spaces. Next, a reverse diffusion process, guided by the +latent code, acts on an informative prior and proposes prediction candidates in +a generative manner. Finally, several prediction candidates are aggregated in a +bi-level aggregation protocol to produce the final output. Through extensive +experiments on medical imaging benchmark datasets, we show that our method +improves upon state-of-the-art methods in terms of robustness and confidence +calibration. Additionally, we introduce a strategy to quantify the prediction +uncertainty at the instance level, increasing their trustworthiness to +clinicians using them in clinical practice. + +
+
+ comment: 13 pages, 6 figures +
+
+
+
+
+ + ☆ Weighted Distance Nearest Neighbor Condensing + + +
+ The problem of nearest neighbor condensing has enjoyed a long history of +study, both in its theoretical and practical aspects. In this paper, we +introduce the problem of weighted distance nearest neighbor condensing, where +one assigns weights to each point of the condensed set, and then new points are +labeled based on their weighted distance nearest neighbor in the condensed set. + We study the theoretical properties of this new model, and show that it can +produce dramatically better condensing than the standard nearest neighbor rule, +yet is characterized by generalization bounds almost identical to the latter. +We then suggest a condensing heuristic for our new problem. We demonstrate +Bayes consistency for this heuristic, and also show promising empirical +results. + +
+
+
+
+
+ + ☆ Combining Behaviors with the Successor Features Keyboard NeurIPS 2023 + + +
+ The Option Keyboard (OK) was recently proposed as a method for transferring +behavioral knowledge across tasks. OK transfers knowledge by adaptively +combining subsets of known behaviors using Successor Features (SFs) and +Generalized Policy Improvement (GPI). However, it relies on hand-designed +state-features and task encodings which are cumbersome to design for every new +environment. In this work, we propose the "Successor Features Keyboard" (SFK), +which enables transfer with discovered state-features and task encodings. To +enable discovery, we propose the "Categorical Successor Feature Approximator" +(CSFA), a novel learning algorithm for estimating SFs while jointly discovering +state-features and task encodings. With SFK and CSFA, we achieve the first +demonstration of transfer with SFs in a challenging 3D environment where all +the necessary representations are discovered. We first compare CSFA against +other methods for approximating SFs and show that only CSFA discovers +representations compatible with SF&GPI at this scale. We then compare SFK +against transfer learning baselines and show that it transfers most quickly to +long-horizon tasks. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ☆ ABKD: Graph Neural Network Compression with Attention-Based Knowledge + Distillation + + +
+ Graph Neural Networks (GNNs) have proven to be quite versatile for a variety +of applications, including recommendation systems, fake news detection, drug +discovery, and even computer vision. Due to the expanding size of +graph-structured data, GNN models have also increased in complexity, leading to +substantial latency issues. This is primarily attributed to the irregular +structure of graph data and its access pattern into memory. The natural +solution to reduce latency is to compress large GNNs into small GNNs. One way +to do this is via knowledge distillation (KD). However, most KD approaches for +GNNs only consider the outputs of the last layers and do not consider the +outputs of the intermediate layers of the GNNs; these layers may contain +important inductive biases indicated by the graph structure. To address this +shortcoming, we propose a novel KD approach to GNN compression that we call +Attention-Based Knowledge Distillation (ABKD). ABKD is a KD approach that uses +attention to identify important intermediate teacher-student layer pairs and +focuses on aligning their outputs. ABKD enables higher compression of GNNs with +a smaller accuracy dropoff compared to existing KD approaches. On average, we +achieve a 1.79% increase in accuracy with a 32.3x compression ratio on +OGBN-Mag, a large graph dataset, compared to state-of-the-art approaches. + +
+
+
+
+
+ + ☆ Online Robust Mean Estimation + + +
+ We study the problem of high-dimensional robust mean estimation in an online +setting. Specifically, we consider a scenario where $n$ sensors are measuring +some common, ongoing phenomenon. At each time step $t=1,2,\ldots,T$, the +$i^{th}$ sensor reports its readings $x^{(i)}_t$ for that time step. The +algorithm must then commit to its estimate $\mu_t$ for the true mean value of +the process at time $t$. We assume that most of the sensors observe independent +samples from some common distribution $X$, but an $\epsilon$-fraction of them +may instead behave maliciously. The algorithm wishes to compute a good +approximation $\mu$ to the true mean $\mu^\ast := \mathbf{E}[X]$. We note that +if the algorithm is allowed to wait until time $T$ to report its estimate, this +reduces to the well-studied problem of robust mean estimation. However, the +requirement that our algorithm produces partial estimates as the data is coming +in substantially complicates the situation. + We prove two main results about online robust mean estimation in this model. +First, if the uncorrupted samples satisfy the standard condition of +$(\epsilon,\delta)$-stability, we give an efficient online algorithm that +outputs estimates $\mu_t$, $t \in [T],$ such that with high probability it +holds that $\|\mu-\mu^\ast\|_2 = O(\delta \log(T))$, where $\mu = (\mu_t)_{t +\in [T]}$. We note that this error bound is nearly competitive with the best +offline algorithms, which would achieve $\ell_2$-error of $O(\delta)$. Our +second main result shows that with additional assumptions on the input (most +notably that $X$ is a product distribution) there are inefficient algorithms +whose error does not depend on $T$ at all. + +
+
+ comment: To appear in SODA2024 +
+
+
+
+
+ + ☆ E-Sparse: Boosting the Large Language Model Inference through + Entropy-based N:M Sparsity + + +
+ Traditional pruning methods are known to be difficult to apply to Large +Language Models (LLMs) for Generative AI because of their unaffordable training +process and large computational demands. For the first time, we introduce the +information entropy of hidden state features into a pruning metric design, +namely E-Sparse, to improve the accuracy of N:M sparsity on LLMs. E-Sparse +employs the information richness to leverage the channel importance, and +further incorporates several novel techniques to put it into effect: (1) it +introduces information entropy to enhance the significance of parameter weights +and input feature norms as a novel pruning metric, and performs N:M sparsity +without modifying the remaining weights; (2) it designs global naive shuffle +and local block shuffle to quickly optimize the information distribution and +adequately cope with the impact of N:M sparsity on LLMs' accuracy. E-Sparse is +implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere +GPUs. Extensive experiments on the LLaMA family and OPT models show that +E-Sparse can significantly speed up the model inference over the dense model +(up to 1.53X) and obtain significant memory saving (up to 43.52%), with +acceptable accuracy loss. + +
+
+
+
+
+ + ☆ Climate Change Impact on Agricultural Land Suitability: An Interpretable + Machine Learning-Based Eurasia Case Study + + +
+ The United Nations has identified improving food security and reducing hunger +as essential components of its sustainable development goals. As of 2021, +approximately 828 million people worldwide are experiencing hunger and +malnutrition, with numerous fatalities reported. Climate change significantly +impacts agricultural land suitability, potentially leading to severe food +shortages and subsequent social and political conflicts. To address this +pressing issue, we have developed a machine learning-based approach to predict +the risk of substantial land suitability degradation and changes in irrigation +patterns. Our study focuses on Central Eurasia, a region burdened with economic +and social challenges. + This study represents a pioneering effort in utilizing machine learning +methods to assess the impact of climate change on agricultural land suitability +under various carbon emissions scenarios. Through comprehensive feature +importance analysis, we unveil specific climate and terrain characteristics +that exert influence on land suitability. Our approach achieves remarkable +accuracy, offering policymakers invaluable insights to facilitate informed +decisions aimed at averting a humanitarian crisis, including strategies such as +the provision of additional water and fertilizers. This research underscores +the tremendous potential of machine learning in addressing global challenges, +with a particular emphasis on mitigating hunger and malnutrition. + +
+
+
+
+
+ + ☆ Is Probing All You Need? Indicator Tasks as an Alternative to Probing + Embedding Spaces EMNLP 2023 + + +
+ The ability to identify and control different kinds of linguistic information +encoded in vector representations of words has many use cases, especially for +explainability and bias removal. This is usually done via a set of simple +classification tasks, termed probes, to evaluate the information encoded in the +embedding space. However, the involvement of a trainable classifier leads to +entanglement between the probe's results and the classifier's nature. As a +result, contemporary works on probing include tasks that do not involve +training of auxiliary models. In this work we introduce the term indicator +tasks for non-trainable tasks which are used to query embedding spaces for the +existence of certain properties, and claim that this kind of tasks may point to +a direction opposite to probes, and that this contradiction complicates the +decision on whether a property exists in an embedding space. We demonstrate our +claims with two test cases, one dealing with gender debiasing and another with +the erasure of morphological information from embedding spaces. We show that +the application of a suitable indicator provides a more accurate picture of the +information captured and removed compared to probes. We thus conclude that +indicator tasks should be implemented and taken into consideration when +eliciting information from embedded representations. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Do Stochastic Parrots have Feelings Too? Improving Neural Detection of + Synthetic Text via Emotion Recognition EMNLP 2023 + + +
+ Recent developments in generative AI have shone a spotlight on +high-performance synthetic text generation technologies. The now wide +availability and ease of use of such models highlights the urgent need to +provide equally powerful technologies capable of identifying synthetic text. +With this in mind, we draw inspiration from psychological studies which suggest +that people can be driven by emotion and encode emotion in the text they +compose. We hypothesize that pretrained language models (PLMs) have an +affective deficit because they lack such an emotional driver when generating +text and consequently may generate synthetic text which has affective +incoherence, i.e., lacking the kind of emotional coherence present in +human-authored text. We subsequently develop an emotionally aware detector by +fine-tuning a PLM on emotion. Experiment results indicate that our +emotionally-aware detector achieves improvements across a range of synthetic +text generators, various sized models, datasets, and domains. Finally, we +compare our emotionally-aware synthetic text detector to ChatGPT in the task of +identification of its own output and show substantial gains, reinforcing the +potential of emotion as a signal to identify synthetic text. Code, models, and +datasets are available at https://github.com/alanagiasi/emoPLMsynth + +
+
+ comment: Accepted to Findings of EMNLP 2023 (long paper). Camera ready version +
+
+
+
+
+ + ☆ Neural Collapse in Multi-label Learning with Pick-all-label Loss + + +
+ We study deep neural networks for the multi-label classification (MLab) task +through the lens of neural collapse (NC). Previous works have been restricted +to the multi-class classification setting and discovered a prevalent NC +phenomenon comprising the following properties for the last-layer features: +(i) the variability of features within every class collapses to zero, (ii) the +set of feature means forms an equi-angular tight frame (ETF), and (iii) the last +layer classifiers collapse to the feature mean upon some scaling. We generalize +the study to multi-label learning, and prove for the first time that a +generalized NC phenomenon holds with the ``pick-all-label'' formulation. Under +the natural analog of the unconstrained feature model (UFM), we establish that +the only global classifier of the pick-all-label cross entropy loss displays the +same ETF geometry, which further collapses to multiplicity-1 feature class means. +Besides, we discover a combinatorial property in generalized NC which is unique +to multi-label learning that we call the ``tag-wise average'' property, where the +feature class-means of samples with multiple labels are scaled averages of the +feature class-means of single label tags. Theoretically, we establish a global +optimality result for the pick-all-label cross-entropy risk for the UFM. +Additionally, we provide empirical evidence to support our investigation +into training deep neural networks on multi-label datasets, resulting in +improved training efficiency. + +
+
+
+
+
+ + ☆ Cross-feature Contrastive Loss for Decentralized Deep Learning on + Heterogeneous Data + + +
+ The current state-of-the-art decentralized learning algorithms mostly assume +the data distribution to be Independent and Identically Distributed (IID). +However, in practical scenarios, the distributed datasets can have +significantly heterogeneous data distributions across the agents. In this work, +we present a novel approach for decentralized learning on heterogeneous data, +where data-free knowledge distillation through contrastive loss on +cross-features is utilized to improve performance. Cross-features for a pair of +neighboring agents are the features (i.e., last hidden layer activations) +obtained from the data of an agent with respect to the model parameters of the +other agent. We demonstrate the effectiveness of the proposed technique through +an exhaustive set of experiments on various Computer Vision datasets (CIFAR-10, +CIFAR-100, Fashion MNIST, and ImageNet), model architectures, and network +topologies. Our experiments show that the proposed method achieves superior +performance (0.2-4% improvement in test accuracy) compared to other existing +techniques for decentralized learning on heterogeneous data. + +
+
+ comment: 12 pages, 7 figures, 11 tables. arXiv admin note: text overlap with + arXiv:2305.04792 +
+
+
+
+
+ + ☆ State Sequences Prediction via Fourier Transform for Representation + Learning + + +
+ While deep reinforcement learning (RL) has been demonstrated effective in +solving complex control tasks, sample efficiency remains a key challenge due to +the large amounts of data required for remarkable performance. Existing +research explores the application of representation learning for data-efficient +RL, e.g., learning predictive representations by predicting long-term future +states. However, many existing methods do not fully exploit the structural +information inherent in sequential state signals, which can potentially improve +the quality of long-term decision-making but is difficult to discern in the +time domain. To tackle this problem, we propose State Sequences Prediction via +Fourier Transform (SPF), a novel method that exploits the frequency domain of +state sequences to extract the underlying patterns in time series data for +learning expressive representations efficiently. Specifically, we theoretically +analyze the existence of structural information in state sequences, which is +closely related to policy performance and signal regularity, and then propose +to predict the Fourier transform of infinite-step future state sequences to +extract such information. One of the appealing features of SPF is that it is +simple to implement while not requiring storage of infinite-step future states +as prediction targets. Experiments demonstrate that the proposed method +outperforms several state-of-the-art algorithms in terms of both sample +efficiency and performance. + +
+
+
+
+
+ + ☆ KirchhoffNet: A Circuit Bridging Message Passing and Continuous-Depth + Models + + +
+ In this paper, we exploit a fundamental principle of analog electronic +circuitry, Kirchhoff's current law, to introduce a unique class of neural +network models that we refer to as KirchhoffNet. KirchhoffNet establishes close +connections with message passing neural networks and continuous-depth networks. +We demonstrate that even in the absence of any traditional layers (such as +convolution, pooling, or linear layers), KirchhoffNet attains 98.86% test +accuracy on the MNIST dataset, comparable with state of the art (SOTA) results. +What makes KirchhoffNet more intriguing is its potential in the realm of +hardware. Contemporary deep neural networks are conventionally deployed on +GPUs. In contrast, KirchhoffNet can be physically realized by an analog +electronic circuit. Moreover, we justify that irrespective of the number of +parameters within a KirchhoffNet, its forward calculation can always be +completed within 1/f seconds, with f representing the hardware's clock +frequency. This characteristic introduces a promising technology for +implementing ultra-large-scale neural networks. + +
+
+ comment: 4 pages, 3 figures +
+
+
+
+
+ + ☆ Using Causality-Aware Graph Neural Networks to Predict Temporal + Centralities in Dynamic Graphs + + +
+ Node centralities play a pivotal role in network science, social network +analysis, and recommender systems. In temporal data, static path-based +centralities like closeness or betweenness can give misleading results about +the true importance of nodes in a temporal graph. To address this issue, +temporal generalizations of betweenness and closeness have been defined that +are based on the shortest time-respecting paths between pairs of nodes. +However, a major issue of those generalizations is that the calculation of such +paths is computationally expensive. Addressing this issue, we study the +application of De Bruijn Graph Neural Networks (DBGNN), a causality-aware graph +neural network architecture, to predict temporal path-based centralities in +time series data. We experimentally evaluate our approach in 13 temporal graphs +from biological and social systems and show that it considerably improves the +prediction of both betweenness and closeness centrality compared to a static +Graph Convolutional Neural Network. + +
+
+
+
+
+ + ☆ Improving Event Time Prediction by Learning to Partition the Event Time + Space + + +
+ Recently developed survival analysis methods improve upon existing approaches +by predicting the probability of event occurrence in each of a number of +pre-specified (discrete) time intervals. By avoiding placing strong parametric +assumptions on the event density, this approach tends to improve prediction +performance, particularly when data are plentiful. However, in clinical +settings with limited available data, it is often preferable to judiciously +partition the event time space into a limited number of intervals well suited +to the prediction task at hand. In this work, we develop a method to learn from +data a set of cut points defining such a partition. We show that in two +simulated datasets, we are able to recover intervals that match the underlying +generative model. We then demonstrate improved prediction performance on three +real-world observational datasets, including a large, newly harmonized stroke +risk prediction dataset. Finally, we argue that our approach facilitates +clinical decision-making by suggesting time intervals that are most appropriate +for each task, in the sense that they facilitate more accurate risk prediction. + +
+
+ comment: 16 pages, 5 figures, 2 tables +
+
+
+
+
+ + ☆ On Responsible Machine Learning Datasets with Fairness, Privacy, and + Regulatory Norms + + +
+ Artificial Intelligence (AI) has made its way into various scientific fields, +providing astonishing improvements over existing algorithms for a wide variety +of tasks. In recent years, there have been severe concerns over the +trustworthiness of AI technologies. The scientific community has focused on the +development of trustworthy AI algorithms. However, machine and deep learning +algorithms, popular in the AI community today, depend heavily on the data used +during their development. These learning algorithms identify patterns in the +data, learning the behavioral objective. Any flaws in the data have the +potential to translate directly into algorithms. In this study, we discuss the +importance of Responsible Machine Learning Datasets and propose a framework to +evaluate the datasets through a responsible rubric. While existing work focuses +on the post-hoc evaluation of algorithms for their trustworthiness, we provide +a framework that considers the data component separately to understand its role +in the algorithm. We discuss responsible datasets through the lens of fairness, +privacy, and regulatory compliance and provide recommendations for constructing +future datasets. After surveying over 100 datasets, we use 60 datasets for +analysis and demonstrate that none of these datasets is immune to issues of +fairness, privacy preservation, and regulatory compliance. We provide +modifications to the ``datasheets for datasets'' with important additions for +improved dataset documentation. With governments around the world regularizing +data protection laws, the method for the creation of datasets in the scientific +community requires revision. We believe this study is timely and relevant in +today's era of AI. + +
+
+
+
+
+ + ☆ A Diffusion Weighted Graph Framework for New Intent Discovery EMNLP 2023 + + +
+ New Intent Discovery (NID) aims to recognize both new and known intents from +unlabeled data with the aid of limited labeled data containing only known +intents. Without considering structure relationships between samples, previous +methods generate noisy supervisory signals which cannot strike a balance +between quantity and quality, hindering the formation of new intent clusters +and effective transfer of the pre-training knowledge. To mitigate this +limitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to +capture both semantic similarities and structure relationships inherent in +data, enabling more sufficient and reliable supervisory signals. Specifically, +for each sample, we diffuse neighborhood relationships along semantic paths +guided by the nearest neighbors for multiple hops to characterize its local +structure discriminately. Then, we sample its positive keys and weigh them +based on semantic similarities and local structures for contrastive learning. +During inference, we further propose Graph Smoothing Filter (GSF) to explicitly +utilize the structure relationships to filter high-frequency noise embodied in +semantically ambiguous samples on the cluster boundary. Extensive experiments +show that our method outperforms state-of-the-art models on all evaluation +metrics across multiple benchmark datasets. Code and data are available at +https://github.com/yibai-shi/DWGF. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ☆ Localization of Small Leakages in Water Distribution Networks using + Concept Drift Explanation Methods + + +
+ Facing climate change the already limited availability of drinking water will +decrease in the future rendering drinking water an increasingly scarce +resource. Considerable amounts of it are lost through leakages in water +transportation and distribution networks. Leakage detection and localization +are challenging problems due to the complex interactions and changing demands +in water distribution networks. Especially small leakages are hard to pinpoint +yet their localization is vital to avoid water loss over long periods of time. +While there exist different approaches to solving the tasks of leakage +detection and localization, they are relying on various information about the +system, e.g. real-time demand measurements and the precise network topology, +which is an unrealistic assumption in many real-world scenarios. In contrast, +this work attempts leakage localization using pressure measurements only. For +this purpose, first, leakages in the water distribution network are modeled +employing Bayesian networks, and the system dynamics are analyzed. We then show +how the problem is connected to and can be considered through the lens of +concept drift. In particular, we argue that model-based explanations of concept +drift are a promising tool for localizing leakages given limited information +about the network. The methodology is experimentally evaluated using realistic +benchmark scenarios. + +
+
+
+
+
+ + ☆ Automatic Aorta Segmentation with Heavily Augmented, High-Resolution 3-D + ResUNet: Contribution to the SEG.A Challenge MICCAI 2023 + + +
+ Automatic aorta segmentation from 3-D medical volumes is an important yet +difficult task. Several factors make the problem challenging, e.g. the +possibility of aortic dissection or the difficulty with segmenting and +annotating the small branches. This work presents a contribution by the MedGIFT +team to the SEG.A challenge organized during the MICCAI 2023 conference. We +propose a fully automated algorithm based on deep encoder-decoder architecture. +The main assumption behind our work is that data preprocessing and augmentation +are much more important than the deep architecture, especially in low data +regimes. Therefore, the solution is based on a variant of traditional +convolutional U-Net. The proposed solution achieved a Dice score above 0.9 for +all testing cases with the highest stability among all participants. The method +scored 1st, 4th, and 3rd in terms of the clinical evaluation, quantitative +results, and volumetric meshing quality, respectively. We freely release the +source code, pretrained model, and provide access to the algorithm on the +Grand-Challenge platform. + +
+
+ comment: MICCAI 2023 - SEG.A Challenge Contribution +
+
+
+
+
+ + ☆ One or Two Things We know about Concept Drift -- A Survey on Monitoring + Evolving Environments + + +
+ The world surrounding us is subject to constant change. These changes, +frequently described as concept drift, influence many industrial and technical +processes. As they can lead to malfunctions and other anomalous behavior, which +may be safety-critical in many scenarios, detecting and analyzing concept drift +is crucial. In this paper, we provide a literature review focusing on concept +drift in unsupervised data streams. While many surveys focus on supervised data +streams, so far, there is no work reviewing the unsupervised setting. However, +this setting is of particular relevance for monitoring and anomaly detection +which are directly applicable to many tasks and challenges in engineering. This +survey provides a taxonomy of existing work on drift detection. Besides, it +covers the current state of research on drift localization in a systematic way. +In addition to providing a systematic literature review, this work provides +precise mathematical definitions of the considered problems and contains +standardized experiments on parametric artificial datasets allowing for a +direct comparison of different strategies for detection and localization. +Thereby, the suitability of different schemes can be analyzed systematically +and guidelines for their usage in real-world scenarios can be provided. +Finally, there is a section on the emerging topic of explaining concept drift. + +
+
+
+
+
+ + ☆ Discriminator Guidance for Autoregressive Diffusion Models + + +
+ We introduce discriminator guidance in the setting of Autoregressive +Diffusion Models. The use of a discriminator to guide a diffusion process has +previously been used for continuous diffusion models, and in this work we +derive ways of using a discriminator together with a pretrained generative +model in the discrete case. First, we show that using an optimal discriminator +will correct the pretrained model and enable exact sampling from the underlying +data distribution. Second, to account for the realistic scenario of using a +sub-optimal discriminator, we derive a sequential Monte Carlo algorithm which +iteratively takes the predictions from the discriminator into account during the +generation process. We test these approaches on the task of generating +molecular graphs and show how the discriminator improves the generative +performance over using only the pretrained model. + +
+
+
+
+
+ + ☆ Nonlinear dimensionality reduction then and now: AIMs for dissipative + PDEs in the ML era + + +
+ This study presents a collection of purely data-driven workflows for +constructing reduced-order models (ROMs) for distributed dynamical systems. The +ROMs we focus on are data-assisted models inspired by, and templated upon, the +theory of Approximate Inertial Manifolds (AIMs); the particular motivation is +the so-called post-processing Galerkin method of Garcia-Archilla, Novo and +Titi. Its applicability can be extended: the need for accurate truncated +Galerkin projections and for deriving closed-form corrections can be +circumvented using machine learning tools. When the right latent variables are +not a priori known, we illustrate how autoencoders as well as Diffusion Maps (a +manifold learning scheme) can be used to discover good sets of latent variables +and test their explainability. The proposed methodology can express the ROMs in +terms of (a) theoretical (Fourier coefficients), (b) linear data-driven (POD +modes) and/or (c) nonlinear data-driven (Diffusion Maps) coordinates. Both +Black-Box and (theoretically-informed and data-corrected) Gray-Box models are +described; the necessity for the latter arises when truncated Galerkin +projections are so inaccurate as to not be amenable to post-processing. We use +the Chafee-Infante reaction-diffusion and the Kuramoto-Sivashinsky dissipative +partial differential equations to illustrate and successfully test the overall +framework. + +
+
+ comment: 27 pages, 22 figures +
+
+
+
+
+ + ☆ Good Better Best: Self-Motivated Imitation Learning for noisy + Demonstrations + + +
+ Imitation Learning (IL) aims to discover a policy by minimizing the +discrepancy between the agent's behavior and expert demonstrations. However, IL +is susceptible to limitations imposed by noisy demonstrations from non-expert +behaviors, presenting a significant challenge due to the lack of supplementary +information to assess their expertise. In this paper, we introduce +Self-Motivated Imitation LEarning (SMILE), a method capable of progressively +filtering out demonstrations collected by policies deemed inferior to the +current policy, eliminating the need for additional information. We utilize the +forward and reverse processes of Diffusion Models to emulate the shift in +demonstration expertise from low to high and vice versa, thereby extracting the +noise information that diffuses expertise. Then, the noise information is +leveraged to predict the diffusion steps between the current policy and +demonstrators, which we theoretically show to be equivalent to their +expertise gap. We further explain in detail how the predicted diffusion steps +are applied to filter out noisy demonstrations in a self-motivated manner and +provide theoretical grounds for this. Through empirical evaluations on MuJoCo tasks, +we demonstrate that our method is proficient in learning the expert policy +amidst noisy demonstrations, and effectively filters out demonstrations with +expertise inferior to the current policy. + +
+
+
+
+
+ + ☆ Random Entity Quantization for Parameter-Efficient Compositional + Knowledge Graph Representation EMNLP 2023 + + +
+ Representation Learning on Knowledge Graphs (KGs) is essential for downstream +tasks. The dominant approach, KG Embedding (KGE), represents entities with +independent vectors and faces the scalability challenge. Recent studies propose +an alternative way for parameter efficiency, which represents entities by +composing entity-corresponding codewords matched from predefined small-scale +codebooks. We refer to the process of obtaining corresponding codewords of each +entity as entity quantization, for which previous works have designed +complicated strategies. Surprisingly, this paper shows that simple random +entity quantization can achieve similar results to current strategies. We +analyze this phenomenon and reveal that entity codes, the quantization outcomes +for expressing entities, have higher entropy at the code level and Jaccard +distance at the codeword level under random entity quantization. Therefore, +different entities become more easily distinguished, facilitating effective KG +representation. The above results show that current quantization strategies are +not critical for KG representation, and there is still room for improvement in +entity distinguishability beyond current strategies. The code to reproduce our +results is available at https://github.com/JiaangL/RandomQuantization. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Improving generalization in large language models by learning prefix + subspaces + + +
+ This article focuses on fine-tuning large language models (LLMs) in the +scarce data regime (also known as the "few-shot" learning setting). We propose +a method to increase the generalization capabilities of LLMs based on neural +network subspaces. This optimization method, recently introduced in computer +vision, aims to improve model generalization by identifying wider local optima +through the joint optimization of an entire simplex of models in parameter +space. Its adaptation to massive, pretrained transformers, however, poses some +challenges. First, their considerable number of parameters makes it difficult +to train several models jointly, and second, their deterministic parameter +initialization schemes make them unfit for the subspace method as originally +proposed. We show in this paper that "Parameter Efficient Fine-Tuning" (PEFT) +methods, however, are perfectly compatible with this original approach, and +propose to learn an entire simplex of continuous prefixes. We test our method on a +variant of the GLUE benchmark adapted to the few-shot learning setting, and +show that both our contributions jointly lead to a gain in average performance +compared to state-of-the-art (SOTA) methods. The implementation can be found at the following +link: https://github.com/Liloulou/prefix_subspace + +
+
+
+
+
+ + ☆ Amortised Inference in Neural Networks for Small-Scale Probabilistic + Meta-Learning + + +
+ The global inducing point variational approximation for BNNs is based on +using a set of inducing inputs to construct a series of conditional +distributions that accurately approximate the conditionals of the true +posterior distribution. Our key insight is that these inducing inputs can be +replaced by the actual data, such that the variational distribution consists of +a set of approximate likelihoods for each datapoint. This structure lends +itself to amortised inference, in which the parameters of each approximate +likelihood are obtained by passing each datapoint through a meta-model known as +the inference network. By training this inference network across related +datasets, we can meta-learn Bayesian inference over task-specific BNNs. + +
+
+
+
+
+ + ☆ Unpaired MRI Super Resolution with Self-Supervised Contrastive Learning + + +
+ High-resolution (HR) magnetic resonance imaging (MRI) is crucial for +enhancing diagnostic accuracy in clinical settings. Nonetheless, the inherent +limitation of MRI resolution restricts its widespread applicability. Deep +learning-based image super-resolution (SR) methods exhibit promise in improving +MRI resolution without additional cost. However, these methods frequently +require a substantial number of HR MRI images for training, which can be +challenging to acquire. In this paper, we propose an unpaired MRI SR approach +that employs self-supervised contrastive learning to enhance SR performance +with limited training data. Our approach leverages both authentic HR images and +synthetically generated SR images to construct positive and negative sample +pairs, thus facilitating the learning of discriminative features. Empirical +results presented in this study underscore significant enhancements in the peak +signal-to-noise ratio and structural similarity index, even when a paucity of +HR images is available. These findings accentuate the potential of our approach +in addressing the challenge of limited training data, thereby contributing to +the advancement of high-resolution MRI in clinical applications. + +
+
+
+
+
+ + ☆ Robust Learning via Conditional Prevalence Adjustment WACV + + +
+ Healthcare data often come from multiple sites in which the correlations +between confounding variables can vary widely. If deep learning models exploit +these unstable correlations, they might fail catastrophically in unseen sites. +Although many methods have been proposed to tackle unstable correlations, each +has its limitations. For example, adversarial training forces models to +completely ignore unstable correlations, but doing so may lead to poor +predictive performance. Other methods (e.g. Invariant risk minimization [4]) +try to learn domain-invariant representations that rely only on stable +associations by assuming a causal data-generating process (input X causes class +label Y). Thus, they may be ineffective for anti-causal tasks (Y causes X), +which are common in computer vision. We propose a method called CoPA +(Conditional Prevalence-Adjustment) for anti-causal tasks. CoPA assumes that +(1) the generation mechanism is stable, i.e. label Y and confounding variable(s) Z +generate X, and (2) the unstable conditional prevalence in each site E fully +accounts for the unstable correlations between X and Y. Our crucial +observation is that confounding variables are routinely recorded in healthcare +settings and the prevalence can be readily estimated, for example, from a set +of (Y, Z) samples (no need for corresponding samples of X). CoPA can work even +if there is a single training site, a scenario which is often overlooked by +existing methods. Our experiments on synthetic and real data show CoPA beating +competitive baselines. + +
+
+ comment: Accepted at WACV +
+
+
+
+
+ + ☆ Analyzing Single Cell RNA Sequencing with Topological Nonnegative Matrix + Factorization + + +
+ Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that +has stimulated enormous interest in statistics, data science, and computational +biology due to the high dimensionality, complexity, and large scale associated +with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique +approach due to its meta-gene interpretation of resulting low-dimensional +components. However, NMF approaches suffer from the lack of multiscale +analysis. This work introduces two persistent Laplacian regularized NMF +methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By +employing a total of 12 datasets, we demonstrate that the proposed TNMF and +rTNMF significantly outperform all other NMF-based methods. We have also +utilized TNMF and rTNMF for the visualization of popular Uniform Manifold +Approximation and Projection (UMAP) and t-distributed stochastic neighbor +embedding (t-SNE). + +
+
+
+
+
+ + ☆ Improving Diffusion Models for ECG Imputation with an Augmented Template + Prior + + +
+ Pulsative signals such as the electrocardiogram (ECG) are extensively +collected as part of routine clinical care. However, noisy and poor-quality +recordings, leading to missing values, are a major issue for signals collected +using mobile health systems, decreasing the signal quality and affecting the +automated downstream tasks. Recent studies have explored imputation of missing +values for ECG with probabilistic time-series models. Nevertheless, in +comparison with the deterministic models, their performance is still limited, +as the variations across subjects and heart-beat relationships are not +explicitly considered in the training objective. In this work, to improve the +ECG imputation and forecasting accuracy with probabilistic models, we present +a template-guided denoising diffusion probabilistic model, PulseDiff, which is +conditioned on an informative prior for a range of health conditions. +Specifically, 1) we first extract a subject-level pulsative template from the +observation as an informative prior of missing values, which captures the +personal characteristics; 2) we then add beat-level stochastic shift terms on +the template for prior augmentation, which considers the beat-level variance of +positioning and amplitude; 3) we finally design a confidence score to consider +the health condition of the subject, which ensures our prior is provided in a safe +way. Experiments with the PTBXL dataset reveal PulseDiff improves the +performance of two strong DDPM baseline models, CSDI and SSSD$^{S4}$, +verifying our method guides the generation of DDPMs while managing the +uncertainty. When combined with SSSD$^{S4}$, our PulseDiff method outperforms +the leading deterministic model for short-interval missing data and is +comparable for long-interval data loss. + +
+
+
+
+
+ + ☆ Recurrent Linear Transformers + + +
+ The self-attention mechanism in the transformer architecture is capable of +capturing long-range dependencies and it is the main reason behind its +effectiveness in processing sequential data. Nevertheless, despite their +success, transformers have two significant drawbacks that still limit their +broader applicability: (1) In order to remember past information, the +self-attention mechanism requires access to the whole history to be provided as +context. (2) The inference cost in transformers is expensive. In this paper we +introduce recurrent alternatives to the transformer self-attention mechanism +that offer a context-independent inference cost, leverage long-range +dependencies effectively, and perform well in practice. We evaluate our +approaches in reinforcement learning problems where the aforementioned +computational limitations make the application of transformers nearly +infeasible. We quantify the impact of the different components of our +architecture in a diagnostic environment and assess performance gains in 2D and +3D pixel-based partially-observable environments. When compared to a +state-of-the-art architecture, GTrXL, inference in our approach is at least 40% +cheaper while reducing memory use by more than 50%. Our approach performs +similarly to or better than GTrXL, improving more than 37% upon GTrXL +performance on harder tasks. + +
+
+ comment: transformers, reinforcement learning, partial observability +
+
+
+
+
+ + ☆ Causal Representation Learning Made Identifiable by Grouping of + Observational Variables + + +
+ A topic of great current interest is Causal Representation Learning (CRL), +whose goal is to learn a causal model for hidden features in a data-driven +manner. Unfortunately, CRL is severely ill-posed since it is a combination of +the two notoriously ill-posed problems of representation learning and causal +discovery. Yet, finding practical identifiability conditions that guarantee a +unique solution is crucial for its practical applicability. Most approaches so +far have been based on assumptions on the latent causal mechanisms, such as +temporal causality, or existence of supervision or interventions; these can be +too restrictive in actual applications. Here, we show identifiability based on +novel, weak constraints, which require no temporal structure, intervention, +or weak supervision. The approach is based on assuming the observational mixing +exhibits a suitable grouping of the observational variables. We also propose a +novel self-supervised estimation framework consistent with the model, prove its +statistical consistency, and experimentally show its superior CRL performance +compared to the state-of-the-art baselines. We further demonstrate its +robustness against latent confounders and causal cycles. + +
+
+
+
+
+ + ☆ Solving large flexible job shop scheduling instances by generating a + diverse set of scheduling policies with deep reinforcement learning + + +
+ The Flexible Job Shop Scheduling Problem (FJSSP) has been extensively studied +in the literature, and multiple approaches have been proposed within the +heuristic, exact, and metaheuristic methods. However, the industry's demand to +be able to respond in real-time to disruptive events has generated the +necessity to be able to generate new schedules within a few seconds. Among +these methods, under this constraint, only dispatching rules (DRs) are capable +of generating schedules, even though their quality can be improved. To improve +the results, recent methods have been proposed for modeling the FJSSP as a +Markov Decision Process (MDP) and employing reinforcement learning to create a +policy that generates an optimal solution assigning operations to machines. +Nonetheless, there is still room for improvement, particularly in the larger +FJSSP instances which are common in real-world scenarios. Therefore, the +objective of this paper is to propose a method capable of robustly solving +large instances of the FJSSP. To achieve this, we propose a novel way of +modeling the FJSSP as an MDP using graph neural networks. We also present two +methods to make inference more robust: generating a diverse set of scheduling +policies that can be parallelized and limiting them using DRs. We have tested +our approach on synthetically generated instances and various public benchmarks +and found that our approach outperforms dispatching rules and achieves better +results than three other recent deep reinforcement learning methods on larger +FJSSP instances. + +
+
+
+
+
+ + ☆ COPF: Continual Learning Human Preference through Optimal Policy Fitting + + +
+ The technique of Reinforcement Learning from Human Feedback (RLHF) is a +commonly employed method to improve pre-trained Language Models (LM), enhancing +their ability to conform to human preferences. Nevertheless, the current +RLHF-based LMs necessitate full retraining each time novel queries or feedback +are introduced, which becomes a challenging task because human preferences can +vary between different domains or tasks. Retraining LMs poses practical +difficulties in many real-world situations due to the significant time and +computational resources required, along with concerns related to data privacy. +To address this limitation, we propose a new method called Continual Optimal +Policy Fitting (COPF), in which we estimate a series of optimal policies using +the Monte Carlo method, and then continually fit the policy sequence with the +function regularization. COPF involves a single learning phase and doesn't +necessitate complex reinforcement learning. Importantly, it shares the +capability with RLHF to learn from unlabeled data, making it flexible for +continual preference learning. Our experimental results show that COPF +outperforms strong continual learning (CL) baselines when it comes to +consistently aligning with human preferences on different tasks and domains. + +
+
+
+
+
+ + ☆ Physics-Informed with Power-Enhanced Residual Network for Interpolation + and Inverse Problems + + +
+ This paper introduces a novel neural network structure called the +Power-Enhancing residual network, designed to improve interpolation +capabilities for both smooth and non-smooth functions in 2D and 3D settings. By +adding power terms to residual elements, the architecture boosts the network's +expressive power. The study explores network depth, width, and optimization +methods, showing the architecture's adaptability and performance advantages. +Consistently, the results emphasize the exceptional accuracy of the proposed +Power-Enhancing residual network, particularly for non-smooth functions. +Real-world examples also confirm its superiority over plain neural networks in +terms of accuracy, convergence, and efficiency. The study also looks at the +impact of deeper networks. Moreover, the proposed architecture is also applied +to solving the inverse Burgers' equation, demonstrating superior performance. +In conclusion, the Power-Enhancing residual network offers a versatile solution +that significantly enhances neural network capabilities. The codes implemented +are available at: \url{https://github.com/CMMAi/ResNet_for_PINN}. + +
+
+
+
+
+ + ☆ Fixed-Budget Real-Valued Combinatorial Pure Exploration of Multi-Armed + Bandit + + +
+ We study the real-valued combinatorial pure exploration of the multi-armed +bandit in the fixed-budget setting. We first introduce the Combinatorial +Successive Asign (CSA) algorithm, which is the first algorithm that can +identify the best action even when the size of the action class is +exponentially large with respect to the number of arms. We show that the upper +bound of the probability of error of the CSA algorithm matches a lower bound up +to a logarithmic factor in the exponent. Then, we introduce another algorithm +named the Minimax Combinatorial Successive Accepts and Rejects +(Minimax-CombSAR) algorithm for the case where the size of the action class is +polynomial, and show that it is optimal, which matches a lower bound. Finally, +we experimentally compare the algorithms with previous methods and show that +our algorithm performs better. + +
+
+
+
+
+ + ☆ Interactive Generalized Additive Model and Its Applications in Electric + Load Forecasting + + +
+ Electric load forecasting is an indispensable component of electric power +system planning and management. Inaccurate load forecasting may lead to the +threat of outages or a waste of energy. Accurate electric load forecasting is +challenging when there is limited data or even no data, such as load +forecasting during holidays or under extreme weather conditions. As high-stakes +decision-making usually follows after load forecasting, model interpretability +is crucial for the adoption of forecasting models. In this paper, we propose an +interactive GAM which is not only interpretable but also can incorporate +specific domain knowledge from the electric power industry for improved performance. +This boosting-based GAM leverages piecewise linear functions and can be learned +through our efficient algorithm. In both public benchmark and electricity +datasets, our interactive GAM outperforms current state-of-the-art methods and +demonstrates good generalization ability in the cases of extreme weather +events. We launched a user-friendly web-based tool based on interactive GAM and +already incorporated it into our eForecaster product, a unified AI platform for +electricity forecasting. + +
+
+
+
+
+ + ☆ Momentum Gradient-based Untargeted Attack on Hypergraph Neural Networks + + +
+ Hypergraph Neural Networks (HGNNs) have been successfully applied in various +hypergraph-related tasks due to their excellent higher-order representation +capabilities. Recent works have shown that deep learning models are vulnerable +to adversarial attacks. Most studies on graph adversarial attacks have focused +on Graph Neural Networks (GNNs), and the study of adversarial attacks on HGNNs +remains largely unexplored. In this paper, we try to reduce this gap. We design +a new HGNN attack model for the untargeted attack, namely MGHGA, which focuses +on modifying node features. We consider the process of HGNN training and use a +surrogate model to implement the attack before hypergraph modeling. +Specifically, MGHGA consists of two parts: feature selection and feature +modification. We use a momentum gradient mechanism to choose the attack node +features in the feature selection module. In the feature modification module, +we use two feature generation approaches (direct modification and sign +gradient) to enable MGHGA to be employed on discrete and continuous datasets. +We conduct extensive experiments on five benchmark datasets to validate the +attack performance of MGHGA in the node and the visual object classification +tasks. The results show that MGHGA improves performance by an average of 2% +compared to the baselines. + +
+
+
+
+
+ + ☆ A Survey on Detection of LLMs-Generated Content + + +
+ The burgeoning capabilities of advanced large language models (LLMs) such as +ChatGPT have led to an increase in synthetic content generation with +implications across a variety of sectors, including media, cybersecurity, +public discourse, and education. As such, the ability to detect LLMs-generated +content has become of paramount importance. We aim to provide a detailed +overview of existing detection strategies and benchmarks, scrutinizing their +differences and identifying key challenges and prospects in the field, +advocating for more adaptable and robust models to enhance detection accuracy. +We also posit the necessity for a multi-faceted approach to defend against +various attacks to counter the rapidly advancing capabilities of LLMs. To the +best of our knowledge, this work is the first comprehensive survey on the +detection in the era of LLMs. We hope it will provide a broad understanding of +the current landscape of LLMs-generated content detection, offering a guiding +reference for researchers and practitioners striving to uphold the integrity of +digital information in an era increasingly dominated by synthetic content. The +relevant papers are summarized and will be consistently updated at +https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git. + +
+
+ comment: We will keep updating at + https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git +
+
+
+
+
+ + ☆ Deceptive Fairness Attacks on Graphs via Meta Learning + + +
+ We study deceptive fairness attacks on graphs to answer the following +question: How can we achieve poisoning attacks on a graph learning model to +exacerbate the bias deceptively? We answer this question via a bi-level +optimization problem and propose a meta learning-based framework named FATE. +FATE is broadly applicable with respect to various fairness definitions and +graph learning models, as well as arbitrary choices of manipulation operations. +We further instantiate FATE to attack statistical parity and individual +fairness on graph neural networks. We conduct extensive experimental +evaluations on real-world datasets in the task of semi-supervised node +classification. The experimental results demonstrate that FATE could amplify +the bias of graph neural networks with or without fairness consideration while +maintaining the utility on the downstream task. We hope this paper provides +insights into the adversarial robustness of fair graph learning and can shed +light on designing robust and fair graph learning in future studies. + +
+
+ comment: 23 pages, 11 tables +
+
+
+
+
+ + ☆ Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio + Models + + +
+ The introduction of large-scale audio datasets, such as AudioSet, paved the +way for Transformers to conquer the audio domain and replace CNNs as the +state-of-the-art neural network architecture for many tasks. Audio Spectrogram +Transformers are excellent at exploiting large datasets, creating powerful +pre-trained models that surpass CNNs when fine-tuned on downstream tasks. +However, current popular Audio Spectrogram Transformers are demanding in terms +of computational complexity compared to CNNs. Recently, we have shown that, by +employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch +up with and even outperform Transformers on large datasets. In this work, we +extend this line of research and increase the capacity of efficient CNNs by +introducing dynamic CNN blocks, constructed of dynamic non-linearities, dynamic +convolutions and attention mechanisms. We show that these dynamic CNNs +outperform traditional efficient CNNs, in terms of the performance-complexity +trade-off and parameter efficiency, at the task of audio tagging on the +large-scale AudioSet. Our experiments further indicate that the introduced +dynamic CNNs achieve better performance on downstream tasks and scale up well, +attaining Transformer performance and even outperforming them on AudioSet and +several downstream tasks. + +
+
+ comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language + Processing. Source Code available at: + https://github.com/fschmid56/EfficientAT +
+
+
+
+
+ + ☆ Light up that Droid! On the Effectiveness of Static Analysis Features + against App Obfuscation for Android Malware Detection + + +
+ Malware authors have seen obfuscation as a means to bypass malware detectors +based on static analysis features. For Android, several studies have confirmed +that many anti-malware products are easily evaded with simple program +transformations. As opposed to these works, ML detection proposals for Android +leveraging static analysis features have also been proposed as +obfuscation-resilient. Therefore, it needs to be determined to what extent the +use of a specific obfuscation strategy or tool poses a risk for the validity of +ML malware detectors for Android based on static analysis features. To shed +some light in this regard, in this article we assess the impact of specific +obfuscation techniques on common features extracted using static analysis and +determine whether the changes are significant enough to undermine the +effectiveness of ML malware detectors that rely on these features. The +experimental results suggest that obfuscation techniques affect all static +analysis features to varying degrees across different tools. However, certain +features retain their validity for ML malware detection even in the presence of +obfuscation. Based on these findings, we propose an ML malware detector for +Android that is robust against obfuscation and outperforms current +state-of-the-art detectors. + +
+
+
+
+
+ + ☆ Guaranteed Coverage Prediction Intervals with Gaussian Process + Regression + + +
+ Gaussian Process Regression (GPR) is a popular regression method, which, +unlike most Machine Learning techniques, provides estimates of uncertainty for +its predictions. These uncertainty estimates, however, are based on the +assumption that the model is well-specified, an assumption that is violated in +most practical applications, since the required knowledge is rarely available. +As a result, the produced uncertainty estimates can become very misleading; for +example, the prediction intervals (PIs) produced for the 95\% confidence level +may cover much less than 95\% of the true labels. To address this issue, this +paper introduces an extension of GPR based on a Machine Learning framework +called Conformal Prediction (CP). This extension guarantees the production of +PIs with the required coverage even when the model is completely misspecified. +The proposed approach combines the advantages of GPR with the valid coverage +guarantee of CP, while the performed experimental results demonstrate its +superiority over existing methods. + +
+
+ comment: 12 pages. This work has been submitted to IEEE Transactions on + Pattern Analysis and Machine Intelligence for possible publication. Copyright + may be transferred without notice, after which this version may no longer be + accessible +
+
+
+
+
+ + ☆ Contextual directed acyclic graphs + + +
+ Estimating the structure of directed acyclic graphs (DAGs) from observational +data remains a significant challenge in machine learning. Most research in this +area concentrates on learning a single DAG for the entire population. This +paper considers an alternative setting where the graph structure varies across +individuals based on available "contextual" features. We tackle this contextual +DAG problem via a neural network that maps the contextual features to a DAG, +represented as a weighted adjacency matrix. The neural network is equipped with +a novel projection layer that ensures the output matrices are sparse and +satisfy a recently developed characterization of acyclicity. We devise a +scalable computational framework for learning contextual DAGs and provide a +convergence guarantee and an analytical gradient for backpropagating through +the projection layer. Our experiments suggest that the new approach can recover +the true context-specific graph where existing approaches fail. + +
+
+
+
+
+ + ☆ GUPNet++: Geometry Uncertainty Propagation Network for Monocular 3D + Object Detection + + +
+ Geometry plays a significant role in monocular 3D object detection. It can be +used to estimate object depth by using the perspective projection between an +object's physical size and its 2D projection in the image plane, which can +introduce mathematical priors into deep models. However, this projection +process also introduces error amplification, where the error of the estimated +height is amplified and reflected into the projected depth. It leads to +unreliable depth inferences and also impairs training stability. To tackle this +problem, we propose a novel Geometry Uncertainty Propagation Network (GUPNet++) +by modeling geometry projection in a probabilistic manner. This ensures depth +predictions are well-bounded and associated with a reasonable uncertainty. The +significance of introducing such geometric uncertainty is two-fold: (1) It +models the uncertainty propagation relationship of the geometry projection +during training, improving the stability and efficiency of the end-to-end model +learning. (2) It can be derived to a highly reliable confidence to indicate +the quality of the 3D detection result, enabling more reliable detection +inference. Experiments show that the proposed approach not only obtains +state-of-the-art (SOTA) performance in image-based monocular 3D detection but +also demonstrates superiority in efficacy with a simplified framework. + +
+
+ comment: 18 pages, 9 figures +
+
+
+
+
+ + ☆ Machine Translation for Nko: Tools, Corpora and Baseline Results + + +
+ Currently, there is no usable machine translation system for Nko, a language +spoken by tens of millions of people across multiple West African countries, +which holds significant cultural and educational value. To address this issue, +we present a set of tools, resources, and baseline results aimed towards the +development of usable machine translation systems for Nko and other languages +that do not currently have sufficiently large parallel text corpora available. +(1) Friallel: A novel collaborative parallel text curation software that +incorporates quality control through copyedit-based workflows. (2) Expansion of +the FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko +translations in parallel with 204 and 40 other languages. (3) nicolingua-0005: +A collection of trilingual and bilingual corpora with 130,850 parallel segments +and monolingual corpora containing over 3 million Nko words. (4) Baseline +bilingual and multilingual neural machine translation results with the best +model scoring 30.83 English-Nko chrF++ on FLoRes-devtest. + +
+
+
+
+
+ + ☆ Using Slisemap to interpret physical data + + +
+ Manifold visualisation techniques are commonly used to visualise +high-dimensional datasets in physical sciences. In this paper we apply a +recently introduced manifold visualisation method, called Slisemap, on datasets +from physics and chemistry. Slisemap combines manifold visualisation with +explainable artificial intelligence. Explainable artificial intelligence is +used to investigate the decision processes of black box machine learning models +and complex simulators. With Slisemap we find an embedding such that data items +with similar local explanations are grouped together. Hence, Slisemap gives us +an overview of the different behaviours of a black box model. This makes +Slisemap into a supervised manifold visualisation method, where the patterns in +the embedding reflect a target property. In this paper we show how Slisemap can +be used and evaluated on physical data and that Slisemap is helpful in finding +meaningful information on classification and regression models trained on these +datasets. + +
+
+ comment: 17 pages, 5 + 1 figures, 1 table. The datasets and source code used + in the paper are available at https://www.edahelsinki.fi/papers/slisemap_phys +
+
+
+
+
+ + ☆ tagE: Enabling an Embodied Agent to Understand Human Instructions EMNLP + + +
+ Natural language serves as the primary mode of communication when an +intelligent agent with a physical presence engages with human beings. While a +plethora of research focuses on natural language understanding (NLU), +encompassing endeavors such as sentiment analysis, intent prediction, question +answering, and summarization, the scope of NLU directed at situations +necessitating tangible actions by an embodied agent remains limited. The +ambiguity and incompleteness inherent in natural language present +challenges for intelligent agents striving to decipher human intention. To +tackle this predicament head-on, we introduce a novel system known as task and +argument grounding for Embodied agents (tagE). At its core, our system employs +an inventive neural network model designed to extract a series of tasks from +complex task instructions expressed in natural language. Our proposed model +adopts an encoder-decoder framework enriched with nested decoding to +effectively extract tasks and their corresponding arguments from these +intricate instructions. These extracted tasks are then mapped (or grounded) to +the robot's established collection of skills, while the arguments find +grounding in objects present within the environment. To facilitate the training +and evaluation of our system, we have curated a dataset featuring complex +instructions. The results of our experiments underscore the prowess of our +approach, as it outperforms robust baseline models. + +
+
+ comment: Accepted in EMNLP Findings 2023 +
+
+
+
+
+ + ☆ Detecting Intentional AIS Shutdown in Open Sea Maritime Surveillance + Using Self-Supervised Deep Learning + + +
+ In maritime traffic surveillance, detecting illegal activities, such as +illegal fishing or transshipment of illicit products, is a crucial task of the +coastal administration. In the open sea, one has to rely on Automatic +Identification System (AIS) messages transmitted by on-board transponders, which +are captured by surveillance satellites. However, insincere vessels often +intentionally shut down their AIS transponders to hide illegal activities. In +the open sea, it is very challenging to differentiate intentional AIS shutdowns +from missing reception due to protocol limitations, bad weather conditions or +restricting satellite positions. This paper presents a novel approach for the +detection of abnormal AIS missing reception based on self-supervised deep +learning techniques and transformer models. Using historical data, the trained +model predicts if a message should be received in the upcoming minute or not. +Afterwards, the model reports on detected anomalies by comparing the prediction +with what actually happens. Our method can process AIS messages in real-time, +in particular, more than 500 million AIS messages per month, corresponding to +the trajectories of more than 60 000 ships. The method is evaluated on one year +of real-world data coming from four Norwegian surveillance satellites. Using +related research results, we validated our method by rediscovering already +detected intentional AIS shutdowns. + +
+
+ comment: IEEE Transactions on Intelligent Transportation Systems +
+
+
+
+
+ + ☆ Multimodal Representations for Teacher-Guided Compositional Visual + Reasoning + + +
+ Neural Module Networks (NMN) are a compelling method for visual question +answering, enabling the translation of a question into a program consisting of +a series of reasoning sub-tasks that are sequentially executed on the image to +produce an answer. NMNs provide enhanced explainability compared to integrated +models, allowing for a better understanding of the underlying reasoning +process. To improve the effectiveness of NMNs we propose to exploit features +obtained by a large-scale cross-modal encoder. Also, the current training +approach of NMNs relies on the propagation of module outputs to subsequent +modules, leading to the accumulation of prediction errors and the generation of +false answers. To mitigate this, we introduce an NMN learning strategy +involving scheduled teacher guidance. Initially, the model is fully guided by +the ground-truth intermediate outputs, but gradually transitions to an +autonomous behavior as training progresses. This reduces error accumulation, +thus improving training efficiency and final performance. We demonstrate that by +incorporating cross-modal features and employing more effective training +techniques for NMN, we achieve a favorable balance between performance and +transparency in the reasoning process. + +
+
+
+
+
+ + ☆ Accelerating Split Federated Learning over Wireless Communication + Networks + + +
+ The development of artificial intelligence (AI) provides opportunities for +the promotion of deep neural network (DNN)-based applications. However, the +large number of parameters and computational complexity of DNNs make them +difficult to deploy on edge devices, which are resource-constrained. An +efficient method to address this challenge is model partition/splitting, in +which a DNN is divided into two parts that are deployed on the device and the server, +respectively, for co-training or co-inference. In this paper, we consider a +split federated learning (SFL) framework that combines the parallel model +training mechanism of federated learning (FL) and the model splitting structure +of split learning (SL). We consider a practical scenario of heterogeneous +devices with individual split points of DNN. We formulate a joint problem of +split point selection and bandwidth allocation to minimize the system latency. +By using alternating optimization, we decompose the problem into two +sub-problems and solve them optimally. Experiment results demonstrate the +superiority of our work in latency reduction and accuracy improvement. + +
+
+
+
+
+ + ☆ Identifiable Latent Polynomial Causal Models Through the Lens of Change + + +
+ Causal representation learning aims to unveil latent high-level causal +representations from observed low-level data. One of its primary tasks is to +provide reliable assurance of identifying these latent causal models, known as +identifiability. A recent breakthrough explores identifiability by leveraging +the change of causal influences among latent causal variables across multiple +environments \citep{liu2022identifying}. However, this progress rests on the +assumption that the causal relationships among latent causal variables adhere +strictly to linear Gaussian models. In this paper, we extend the scope of +latent causal models to involve nonlinear causal relationships, represented by +polynomial models, and general noise distributions conforming to the +exponential family. Additionally, we investigate the necessity of imposing +changes on all causal parameters and present partial identifiability results +when part of them remains unchanged. Further, we propose a novel empirical +estimation method, grounded in our theoretical finding, that enables learning +consistent latent causal representations. Our experimental results, obtained +from both synthetic and real-world data, validate our theoretical contributions +concerning identifiability and consistency. + +
+
+
+
+
+ + ☆ VMAF Re-implementation on PyTorch: Some Experimental Results + + +
+ Based on the standard VMAF implementation we propose an implementation of +VMAF using PyTorch framework. For this implementation comparisons with the +standard (libvmaf) show the discrepancy $\lesssim 10^{-2}$ in VMAF units. We +investigate gradients computation when using VMAF as an objective function and +demonstrate that training using this function does not result in ill-behaving +gradients. + +
+
+ comment: 4 pages +
+
+
+
+
+ + ☆ From Oja's Algorithm to the Multiplicative Weights Update Method with + Applications + + +
+ Oja's algorithm is a well known online algorithm studied mainly in the +context of stochastic principal component analysis. We make a simple +observation, yet to the best of our knowledge a novel one, that when applied to +any (not necessarily stochastic) sequence of symmetric matrices which share +common eigenvectors, the regret of Oja's algorithm could be directly bounded in +terms of the regret of the well known multiplicative weights update method for +the problem of prediction with expert advice. Several applications to +optimization with quadratic forms over the unit sphere in $\reals^n$ are +discussed. + +
+
+
+
+
+ + ☆ Transfer learning for day-ahead load forecasting: a case study on + European national electricity demand time series + + +
+ Short-term load forecasting (STLF) is crucial for the daily operation of +power grids. However, the non-linearity, non-stationarity, and randomness +characterizing electricity demand time series render STLF a challenging task. +Various forecasting approaches have been proposed for improving STLF, including +neural network (NN) models which are trained using data from multiple +electricity demand series that may not necessarily include the target series. In +the present study, we investigate the performance of this special case of STLF, +called transfer learning (TL), by considering a set of 27 time series that +represent the national day-ahead electricity demand of indicative European +countries. We employ a popular and easy-to-implement NN model and perform a +clustering analysis to identify similar patterns among the series and assist +TL. In this context, two different TL approaches, with and without the +clustering step, are compiled and compared against each other as well as a +typical NN training setup. Our results demonstrate that TL can outperform the +conventional approach, especially when clustering techniques are considered. + +
+
+
+
+
+ + ☆ PET Synthesis via Self-supervised Adaptive Residual Estimation + Generative Adversarial Network + + +
+ Positron emission tomography (PET) is a widely used, highly sensitive +molecular imaging in clinical diagnosis. There is interest in reducing the +radiation exposure from PET but also maintaining adequate image quality. Recent +methods using convolutional neural networks (CNNs) to generate synthesized +high-quality PET images from low-dose counterparts have been reported to be +state-of-the-art for low-to-high image recovery methods. However, these methods +are prone to exhibiting discrepancies in texture and structure between +synthesized and real images. Furthermore, the distribution shift between +low-dose PET and standard PET has not been fully investigated. To address these +issues, we developed a self-supervised adaptive residual estimation generative +adversarial network (SS-AEGAN). We introduce (1) An adaptive residual +estimation mapping mechanism, AE-Net, designed to dynamically rectify the +preliminary synthesized PET images by taking the residual map between the +low-dose PET and synthesized output as the input, and (2) A self-supervised +pre-training strategy to enhance the feature representation of the coarse +generator. Our experiments with a public benchmark dataset of total-body PET +images show that SS-AEGAN consistently outperformed the state-of-the-art +synthesis methods with various dose reduction factors. + +
+
+ comment: This work has been submitted to the IEEE for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ☆ Algorithmic Regularization in Tensor Optimization: Towards a Lifted + Approach in Matrix Sensing NeurIPS23 + + +
+ Gradient descent (GD) is crucial for generalization in machine learning +models, as it induces implicit regularization, promoting compact +representations. In this work, we examine the role of GD in inducing implicit +regularization for tensor optimization, particularly within the context of the +lifted matrix sensing framework. This framework has been recently proposed to +address the non-convex matrix sensing problem by transforming spurious +solutions into strict saddles when optimizing over symmetric, rank-1 tensors. +We show that, with sufficiently small initialization scale, GD applied to this +lifted problem results in approximate rank-1 tensors and critical points with +escape directions. Our findings underscore the significance of the tensor +parametrization of matrix sensing, in combination with first-order methods, in +achieving global optimality in such problems. + +
+
+ comment: NeurIPS23 Poster +
+
+
+
+
+ + ☆ Symmetry-preserving graph attention network to solve routing problems at + multiple resolutions + + +
+ Travelling Salesperson Problems (TSPs) and Vehicle Routing Problems (VRPs) +have achieved reasonable improvement in accuracy and computation time with the +adaptation of Machine Learning (ML) methods. However, none of the previous +works completely respects the symmetries arising from TSPs and VRPs including +rotation, translation, permutation, and scaling. In this work, we introduce the +first-ever completely equivariant model and training to solve combinatorial +problems. Furthermore, it is essential to capture the multiscale structure +(i.e. from local to global information) of the input graph, especially for the +cases of large and long-range graphs, while previous methods are limited to +extracting only local information that can lead to a local or sub-optimal +solution. To tackle the above limitation, we propose a Multiresolution scheme +in combination with Equivariant Graph Attention network (mEGAT) architecture, +which can learn the optimal route based on low-level and high-level graph +resolutions in an efficient way. In particular, our approach constructs a +hierarchy of coarse-graining graphs from the input graph, in which we try to +solve the routing problems on simple low-level graphs first, then utilize that +knowledge for the more complex high-level graphs. Experimentally, we have shown +that our model outperforms existing baselines and proved that symmetry +preservation and multiresolution are important recipes for solving +combinatorial problems in a data-driven manner. Our source code is publicly +available at https://github.com/HySonLab/Multires-NP-hard + +
+
+
+
+
+ + ☆ Privacy Amplification for Matrix Mechanisms + + +
+ Privacy amplification exploits randomness in data selection to provide +tighter differential privacy (DP) guarantees. This analysis is key to DP-SGD's +success in machine learning, but is not readily applicable to the newer +state-of-the-art algorithms. This is because these algorithms, known as +DP-FTRL, use the matrix mechanism to add correlated noise instead of +independent noise as in DP-SGD. + In this paper, we propose "MMCC", the first algorithm to analyze privacy +amplification via sampling for any generic matrix mechanism. MMCC is nearly +tight in that it approaches a lower bound as $\epsilon\to0$. To analyze +correlated outputs in MMCC, we prove that they can be analyzed as if they were +independent, by conditioning them on prior outputs. Our "conditional +composition theorem" has broad utility: we use it to show that the noise added +to binary-tree-DP-FTRL can asymptotically match the noise added to DP-SGD with +amplification. Our amplification algorithm also has practical empirical +utility: we show it leads to significant improvement in the privacy-utility +trade-offs for DP-FTRL algorithms on standard benchmarks. + +
+
+
+
+
+ + ☆ On the Inherent Privacy Properties of Discrete Denoising Diffusion + Models + + +
+ Privacy concerns have led to a surge in the creation of synthetic datasets, +with diffusion models emerging as a promising avenue. Although prior studies +have performed empirical evaluations on these models, there has been a gap in +providing a mathematical characterization of their privacy-preserving +capabilities. To address this, we present the pioneering theoretical +exploration of the privacy preservation inherent in discrete diffusion models +(DDMs) for discrete dataset generation. Focusing on per-instance differential +privacy (pDP), our framework elucidates the potential privacy leakage for each +data point in a given training dataset, offering insights into data +preprocessing to reduce privacy risks of the synthetic dataset generation via +DDMs. Our bounds also show that training with $s$-sized data points leads to a +surge in privacy leakage from $(\epsilon, +\mathcal{O}(\frac{1}{s^2\epsilon}))$-pDP to $(\epsilon, +\mathcal{O}(\frac{1}{s\epsilon}))$-pDP during the transition from the pure +noise to the synthetic clean data phase, and a faster decay in diffusion +coefficients amplifies the privacy guarantee. Finally, we empirically verify +our theoretical findings on both synthetic and real-world datasets. + +
+
+
+
+
+ + ☆ Generative and Contrastive Paradigms Are Complementary for Graph + Self-Supervised Learning + + +
+ For graph self-supervised learning (GSSL), masked autoencoder (MAE) follows +the generative paradigm and learns to reconstruct masked graph edges or node +features. Contrastive Learning (CL) maximizes the similarity between augmented +views of the same graph and is widely used for GSSL. However, MAE and CL are +considered separately in existing works for GSSL. We observe that the MAE and +CL paradigms are complementary and propose the graph contrastive masked +autoencoder (GCMAE) framework to unify them. Specifically, by focusing on local +edges or node features, MAE cannot capture global information of the graph and +is sensitive to particular edges and features. On the contrary, CL excels in +extracting global information because it considers the relation between graphs. +As such, we equip GCMAE with an MAE branch and a CL branch, and the two +branches share a common encoder, which allows the MAE branch to exploit the +global information extracted by the CL branch. To force GCMAE to capture global +graph structures, we train it to reconstruct the entire adjacency matrix +instead of only the masked edges as in existing works. Moreover, a +discrimination loss is proposed for feature reconstruction, which improves the +disparity between node embeddings rather than reducing the reconstruction error +to tackle the feature smoothing problem of MAE. We evaluate GCMAE on four +popular graph tasks (i.e., node classification, node clustering, link +prediction, and graph classification) and compare with 14 state-of-the-art +baselines. The results show that GCMAE consistently provides good accuracy +across these tasks, and the maximum accuracy improvement is up to 3.2% compared +with the best-performing baseline. + +
+
+
+
+
+ + ☆ Graph Attention-based Deep Reinforcement Learning for solving the + Chinese Postman Problem with Load-dependent costs + + +
+ Recently, Deep reinforcement learning (DRL) models have shown promising +results in solving routing problems. However, most DRL solvers are commonly +proposed to solve node routing problems, such as the Traveling Salesman Problem +(TSP). Meanwhile, there has been limited research on applying neural methods to +arc routing problems, such as the Chinese Postman Problem (CPP), since they +often feature irregular and complex solution spaces compared to TSP. To fill +these gaps, this paper proposes a novel DRL framework to address the CPP with +load-dependent costs (CPP-LC) (Corberan et al., 2018), which is a complex arc +routing problem with load constraints. The novelty of our method is two-fold. +First, we formulate the CPP-LC as a Markov Decision Process (MDP) sequential +model. Subsequently, we introduce an autoregressive model based on DRL, namely +Arc-DRL, consisting of an encoder and decoder to address the CPP-LC challenge +effectively. Such a framework allows the DRL model to work efficiently and +scalably to arc routing problems. Furthermore, we propose a new bio-inspired +meta-heuristic solution based on Evolutionary Algorithm (EA) for CPP-LC. +Extensive experiments show that Arc-DRL outperforms existing meta-heuristic +methods such as Iterative Local Search (ILS) and Variable Neighborhood Search +(VNS) proposed by (Corberan et al., 2018) on large benchmark datasets for +CPP-LC regarding both solution quality and running time; while the EA gives the +best solution quality with much more running time. We release our C++ +implementations for metaheuristics such as EA, ILS and VNS along with the code +for data generation and our generated data at +https://github.com/HySonLab/Chinese_Postman_Problem + +
+
+
+
+
+ + ☆ KITAB: Evaluating LLMs on Constraint Satisfaction for Information + Retrieval + + +
+ We study the ability of state-of-the-art models to answer constraint +satisfaction queries for information retrieval (e.g., 'a list of ice cream +shops in San Diego'). In the past, such queries were considered to be tasks +that could only be solved via web-search or knowledge bases. More recently, +large language models (LLMs) have demonstrated initial emergent abilities in +this task. However, many current retrieval benchmarks are either saturated or +do not measure constraint satisfaction. Motivated by rising concerns around +factual incorrectness and hallucinations of LLMs, we present KITAB, a new +dataset for measuring constraint satisfaction abilities of language models. +KITAB consists of book-related data across more than 600 authors and 13,000 +queries, and also offers an associated dynamic data collection and constraint +verification approach for acquiring similar test data for other authors. Our +extended experiments on GPT4 and GPT3.5 characterize and decouple common +failure modes across dimensions such as information popularity, constraint +types, and context availability. Results show that in the absence of context, +models exhibit severe limitations as measured by irrelevant information, +factual errors, and incompleteness, many of which are exacerbated as information +popularity decreases. While context availability mitigates irrelevant +information, it is not helpful for satisfying constraints, identifying +fundamental barriers to constraint satisfaction. We open source our +contributions to foster further research on improving constraint satisfaction +abilities of future models. + +
+
+ comment: 23 pages +
+
+
+
+
+ + ♻ ☆ Phase diagram of early training dynamics in deep neural networks: effect + of the learning rate, depth, and width NeurIPS 2023 + + +
+ We systematically analyze optimization dynamics in deep neural networks +(DNNs) trained with stochastic gradient descent (SGD) and study the effect of +learning rate $\eta$, depth $d$, and width $w$ of the neural network. By +analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, +which is a measure of sharpness of the loss landscape, we find that the +dynamics can show four distinct regimes: (i) an early time transient regime, +(ii) an intermediate saturation regime, (iii) a progressive sharpening regime, +and (iv) a late time ``edge of stability'' regime. The early and intermediate +regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c / +\lambda_0^H $, $d$, and $w$. We identify several critical values of $c$, which +separate qualitatively distinct phenomena in the early time dynamics of +training loss and sharpness. Notably, we discover the opening up of a +``sharpness reduction'' phase, where sharpness decreases at early times, as $d$ +and $1/w$ are increased. + +
+
+ comment: Accepted at NeurIPS 2023 (camera-ready version): Additional results + added for cross-entropy loss and effect on network output at initialization; + 10+32 pages, 8+35 figures +
+
+
+
+
+ + ♻ ☆ Consistent Optimal Transport with Empirical Conditional Measures + + +
+ Given samples from two joint distributions, we consider the problem of +Optimal Transportation (OT) between them when conditioned on a common variable. +We focus on the general setting where the conditioned variable may be +continuous, and the marginals of this variable in the two joint distributions +may not be the same. In such settings, standard OT variants cannot be employed, +and novel estimation techniques are necessary. Since the main challenge is that +the conditional distributions are not explicitly available, the key idea in our +OT formulation is to employ kernelized-least-squares terms computed over the +joint samples, which implicitly match the transport plan's marginals with the +empirical conditionals. Under mild conditions, we prove that our estimated +transport plans, as a function of the conditioned variable, are asymptotically +optimal. For finite samples, we show that the deviation in terms of our +regularized objective is bounded by $O(1/m^{1/4})$, where $m$ is the number of +samples. We also discuss how the conditional transport plan could be modelled +using explicit probabilistic models as well as using implicit generative ones. +We empirically verify the consistency of our estimator on synthetic datasets, +where the optimal plan is analytically known. When employed in applications +like prompt learning for few-shot classification and conditional-generation in +the context of predicting cell responses to treatment, our methodology improves +upon state-of-the-art methods. + +
+
+
+
+
+ + ♻ ☆ Data Selection for Language Models via Importance Resampling NeurIPS 2023 + + +
+ Selecting a suitable pretraining dataset is crucial for both general-domain +(e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We +formalize this problem as selecting a subset of a large raw unlabeled dataset +to match a desired target distribution given some unlabeled target samples. Due +to the large scale and dimensionality of the raw text data, existing methods +use simple heuristics or use experts to manually curate data. Instead, we +extend the classic importance resampling approach used in low-dimensions for LM +data selection. We propose Data Selection with Importance Resampling (DSIR), an +efficient and scalable framework that estimates importance weights in a reduced +feature space for tractability and selects data with importance resampling +according to these weights. To determine an appropriate feature space, we show +that KL reduction, a data metric that measures the proximity between selected +pretraining data and the target in a feature space, has high correlation with +average downstream accuracy (r=0.89) when computed with simple n-gram features. +This motivates our instantiation of DSIR using n-gram features. When performing +continued pretraining towards a specific domain, DSIR performs comparably to +expert curation across 8 target distributions. When pretraining general-domain +models (target is Wikipedia + books), DSIR improves over random selection and +heuristic filtering baselines by 2-2.5% on the GLUE benchmark. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Differentiable Earth Mover's Distance for Data Compression at the + High-Luminosity LHC + + +
+ The Earth mover's distance (EMD) is a useful metric for image recognition and +classification, but its usual implementations are not differentiable or too +slow to be used as a loss function for training other algorithms via gradient +descent. In this paper, we train a convolutional neural network (CNN) to learn +a differentiable, fast approximation of the EMD and demonstrate that it can be +used as a substitute for computing-intensive EMD implementations. We apply this +differentiable approximation in the training of an autoencoder-inspired neural +network (encoder NN) for data compression at the high-luminosity LHC at CERN. +The goal of this encoder NN is to compress the data while preserving the +information related to the distribution of energy deposits in particle +detectors. We demonstrate that the performance of our encoder NN trained using +the differentiable EMD CNN surpasses that of training with loss functions based +on mean squared error. + +
+
+ comment: 16 pages, 7 figures, submitted to Machine Learning: Science and + Technology +
+
+
+
+
+ + ♻ ☆ Provably Valid and Diverse Mutations of Real-World Media Data for DNN + Testing + + +
+ Deep neural networks (DNNs) often accept high-dimensional media data (e.g., +photos, text, and audio) and understand their perceptual content (e.g., a cat). +To test DNNs, diverse inputs are needed to trigger mis-predictions. Some +preliminary works use byte-level mutations or domain-specific filters (e.g., +foggy), whose enabled mutations may be limited and likely error-prone. SOTA +works employ deep generative models to generate (infinite) inputs. Also, to +keep the mutated inputs perceptually valid (e.g., a cat remains a "cat" after +mutation), existing efforts rely on imprecise and less generalizable +heuristics. + This study revisits two key objectives in media input mutation - perception +diversity (DIV) and validity (VAL) - in a rigorous manner based on manifold, a +well-developed theory capturing perceptions of high-dimensional media data in a +low-dimensional space. We show important results that DIV and VAL inextricably +bound each other, and prove that SOTA generative model-based methods +fundamentally fail to mutate real-world media data (either sacrificing DIV or +VAL). In contrast, we discuss the feasibility of mutating real-world media data +with provably high DIV and VAL based on manifold. + We concretize the technical solution of mutating media data of various +formats (images, audios, text) via a unified manner based on manifold. +Specifically, when media data are projected into a low-dimensional manifold, +the data can be mutated by walking on the manifold with certain directions and +step sizes. When contrasted with the input data, the mutated data exhibit +encouraging DIV in the perceptual traits (e.g., lying vs. standing dog) while +retaining reasonably high VAL (i.e., a dog remains a dog). We implement our +techniques in DEEPWALK for testing DNNs. DEEPWALK outperforms prior methods in +testing comprehensiveness and can find more error-triggering inputs with higher +quality. + +
+
+
+
+
+ + ♻ ☆ A mean-field games laboratory for generative modeling + + +
+ We demonstrate the versatility of mean-field games (MFGs) as a mathematical +framework for explaining, enhancing, and designing generative models. In +generative flows, a Lagrangian formulation is used where each particle +(generated sample) aims to minimize a loss function over its simulated path. +The loss, however, is dependent on the paths of other particles, which leads to +a competition among the population of particles. The asymptotic behavior of +this competition yields a mean-field game. We establish connections between +MFGs and major classes of generative flows and diffusions including +continuous-time normalizing flows, score-based generative models (SGM), and +Wasserstein gradient flows. Furthermore, we study the mathematical properties +of each generative model by studying their associated MFG's optimality +condition, which is a set of coupled forward-backward nonlinear partial +differential equations. The mathematical structure described by the MFG +optimality conditions identifies the inductive biases of generative flows. We +investigate the well-posedness and structure of normalizing flows, unravel the +mathematical structure of SGMs, and derive an MFG formulation of Wasserstein +gradient flows. From an algorithmic perspective, the optimality conditions +yield Hamilton-Jacobi-Bellman (HJB) regularizers for enhanced training of +generative models. In particular, we propose and demonstrate an HJB-regularized +SGM with improved performance over standard SGMs. We present this framework as +an MFG laboratory which serves as a platform for revealing new avenues of +experimentation and invention of generative models. + +
+
+ comment: 56 pages, 10 figures. Version 5 has a slightly modified version of + the normalizing flow and improved introduction and conclusions +
+
+
+
+
+ + ♻ ☆ Rule Enforcing Through Ordering + + +
+ In many real-world situations, like minor traffic offenses in big cities, a +central authority is tasked with periodically administering punishments to a large +number of individuals. Common practice is to give each individual a chance to +suffer a smaller fine and be guaranteed to avoid the legal process with a +probably considerably larger punishment. However, thanks to the large number of +offenders and a limited capacity of the central authority, the individual risk +is typically small and a rational individual will not choose to pay the fine. +Here we show that if the central authority processes the offenders in a +publicly known order, it properly incentivizes the offenders to pay the fine. We +show analytically and on realistic experiments that our mechanism promotes +non-cooperation and incentivizes individuals to pay. Moreover, the same holds for +an arbitrary coalition. We quantify the expected total payment the central +authority receives, and show it increases considerably. + +
+
+ comment: Accepted at the 14th Conference on Decision and Game Theory for + Security (GameSec-23) +
+
+
+
+
+ + ♻ ☆ Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level + Stability and High-Level Behavior + + +
+ We propose a theoretical framework for studying behavior cloning of complex +expert demonstrations using generative modeling. Our framework invokes +low-level controllers - either learned or implicit in position-command control +- to stabilize imitation around expert demonstrations. We show that with (a) a +suitable low-level stability guarantee and (b) a powerful enough generative +model as our imitation learner, pure supervised behavior cloning can generate +trajectories matching the per-time step distribution of essentially arbitrary +expert trajectories in an optimal transport cost. Our analysis relies on a +stochastic continuity property of the learned policy we call "total variation +continuity" (TVC). We then show that TVC can be ensured with minimal +degradation of accuracy by combining a popular data-augmentation regimen with a +novel algorithmic trick: adding augmentation noise at execution time. We +instantiate our guarantees for policies parameterized by diffusion models and +prove that if the learner accurately estimates the score of the +(noise-augmented) expert policy, then the distribution of imitator trajectories +is close to the demonstrator distribution in a natural optimal transport +distance. Our analysis constructs intricate couplings between noise-augmented +trajectories, a technique that may be of independent interest. We conclude by +empirically validating our algorithmic recommendations, and discussing +implications for future research directions for better behavior cloning with +generative modeling. + +
+
+ comment: updated figures, minor notational change for readability +
+
+
+
+
+ + ♻ ☆ Contrast Everything: A Hierarchical Contrastive Framework for Medical + Time-Series + + +
+ Contrastive representation learning is crucial in medical time series +analysis as it alleviates dependency on labor-intensive, domain-specific, and +scarce expert annotations. However, existing contrastive learning methods +primarily focus on one single data level, which fails to fully exploit the +intricate nature of medical time series. To address this issue, we present +COMET, an innovative hierarchical framework that leverages data consistencies +at all inherent levels in medical time series. Our meticulously designed model +systematically captures data consistency from four potential levels: +observation, sample, trial, and patient levels. By developing contrastive loss +at multiple levels, we can learn effective representations that preserve +comprehensive data consistency, maximizing information utilization in a +self-supervised manner. We conduct experiments in the challenging +patient-independent setting. We compare COMET against six baselines using three +diverse datasets, which include ECG signals for myocardial infarction and EEG +signals for Alzheimer's and Parkinson's diseases. The results demonstrate that +COMET consistently outperforms all baselines, particularly in setup with 10% +and 1% labeled data fractions across all datasets. These results underscore the +significant impact of our framework in advancing contrastive representation +learning techniques for medical time series. The source code is available at +https://github.com/DL4mHealth/COMET. + +
+
+ comment: Accepted by NeurIPS 2023; 24 pages (13 pages main paper + 11 pages + supplementary materials) +
+
+
+
+
+ + ♻ ☆ Towards Understanding Sycophancy in Language Models + + +
+ Reinforcement learning from human feedback (RLHF) is a popular technique for +training high-quality AI assistants. However, RLHF may also encourage model +responses that match user beliefs over truthful responses, a behavior known as +sycophancy. We investigate the prevalence of sycophancy in RLHF-trained models +and whether human preference judgements are responsible. We first demonstrate +that five state-of-the-art AI assistants consistently exhibit sycophantic +behavior across four varied free-form text-generation tasks. To understand if +human preferences drive this broadly observed behavior of RLHF models, we +analyze existing human preference data. We find that when a response matches a +user's views, it is more likely to be preferred. Moreover, both humans and +preference models (PMs) prefer convincingly-written sycophantic responses over +correct ones a non-negligible fraction of the time. Optimizing model outputs +against PMs also sometimes sacrifices truthfulness in favor of sycophancy. +Overall, our results indicate that sycophancy is a general behavior of RLHF +models, likely driven in part by human preference judgements favoring +sycophantic responses. + +
+
+ comment: 32 pages, 20 figures +
+
+
+
+
+ + ♻ ☆ Quantification of Uncertainty with Adversarial Models NeurIPS 2023 + + +
+ Quantifying uncertainty is important for actionable predictions in real-world +applications. A crucial part of predictive uncertainty quantification is the +estimation of epistemic uncertainty, which is defined as an integral of the +product between a divergence function and the posterior. Current methods such +as Deep Ensembles or MC dropout underperform at estimating the epistemic +uncertainty, since they primarily consider the posterior when sampling models. +We suggest Quantification of Uncertainty with Adversarial Models (QUAM) to +better estimate the epistemic uncertainty. QUAM identifies regions where the +whole product under the integral is large, not just the posterior. +Consequently, QUAM has lower approximation error of the epistemic uncertainty +compared to previous methods. Models for which the product is large correspond +to adversarial models (not adversarial examples!). Adversarial models have both +a high posterior as well as a high divergence between their predictions and +that of a reference model. Our experiments show that QUAM excels in capturing +epistemic uncertainty for deep learning models and outperforms previous methods +on challenging tasks in the vision domain. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ NA-SODINN: a deep learning algorithm for exoplanet image detection based + on residual noise regimes + + +
+ Supervised deep learning was recently introduced in high-contrast imaging +(HCI) through the SODINN algorithm, a convolutional neural network designed for +exoplanet detection in angular differential imaging (ADI) datasets. The +benchmarking of HCI algorithms within the Exoplanet Imaging Data Challenge +(EIDC) showed that (i) SODINN can produce a high number of false positives in +the final detection maps, and (ii) algorithms processing images in a more local +manner perform better. This work aims to improve the SODINN detection +performance by introducing new local processing approaches and adapting its +learning process accordingly. We propose NA-SODINN, a new deep learning binary +classifier based on a convolutional neural network (CNN) that better captures +image noise correlations in ADI-processed frames by identifying noise regimes. +Our new approach was tested against its predecessor, as well as two +SODINN-based hybrid models and a more standard annular-PCA approach, through +local receiver operating characteristic (ROC) analysis of ADI sequences from +the VLT/SPHERE and Keck/NIRC-2 instruments. Results show that NA-SODINN +enhances SODINN in both sensitivity and specificity, especially in the +speckle-dominated noise regime. NA-SODINN is also benchmarked against the +complete set of submitted detection algorithms in EIDC, in which we show that +its final detection score matches or outperforms the most powerful detection +algorithms. Throughout the supervised machine learning case, this study +illustrates and reinforces the importance of adapting the task of detection to +the local content of processed images. + +
+
+ comment: A&A in press +
+
+
+
+
+ + ♻ ☆ RADAR: Robust AI-Text Detection via Adversarial Learning NeurIPS 2023 + + +
+ Recent advances in large language models (LLMs) and the intensifying +popularity of ChatGPT-like applications have blurred the boundary of +high-quality text generation between humans and machines. However, in addition +to the anticipated revolutionary changes to our technology and society, the +difficulty of distinguishing LLM-generated texts (AI-text) from human-generated +texts poses new challenges of misuse and fairness, such as fake content +generation, plagiarism, and false accusations of innocent writers. While +existing works show that current AI-text detectors are not robust to LLM-based +paraphrasing, this paper aims to bridge this gap by proposing a new framework +called RADAR, which jointly trains a robust AI-text detector via adversarial +learning. RADAR is based on adversarial training of a paraphraser and a +detector. The paraphraser's goal is to generate realistic content to evade +AI-text detection. RADAR uses the feedback from the detector to update the +paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly +2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, +experimental results show that RADAR significantly outperforms existing AI-text +detection methods, especially when paraphrasing is in place. We also identify +the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, +and evaluate the improved capability of RADAR via GPT-3.5-Turbo. + +
+
+ comment: Accepted by NeurIPS 2023. Project page and demos: + https://radar.vizhub.ai +
+
+
+
+
+ + ♻ ☆ Nonlinear model reduction for slow-fast stochastic systems near unknown + invariant manifolds + + +
+ We introduce a nonlinear stochastic model reduction technique for +high-dimensional stochastic dynamical systems that have a low-dimensional +invariant effective manifold with slow dynamics, and high-dimensional, large +fast modes. Given only access to a black box simulator from which short bursts +of simulation can be obtained, we design an algorithm that outputs an estimate +of the invariant manifold, a process of the effective stochastic dynamics on +it, which has averaged out the fast modes, and a simulator thereof. This +simulator is efficient in that it exploits the low dimension of the +invariant manifold, and takes time steps of size dependent on the regularity of +the effective process, and therefore typically much larger than that of the +original simulator, which had to resolve the fast modes. The algorithm and the +estimation can be performed on-the-fly, leading to efficient exploration of the +effective state space, without losing consistency with the underlying dynamics. +This construction enables fast and efficient simulation of paths of the +effective dynamics, together with estimation of crucial features and +observables of such dynamics, including the stationary distribution, +identification of metastable states, and residence times and transition rates +between them. + +
+
+
+
+
+ + ♻ ☆ DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two + Quantization + + +
+ Efficiently deploying deep neural networks on low-resource edge devices is +challenging due to their ever-increasing resource requirements. To address this +issue, researchers have proposed multiplication-free neural networks, such as +Power-of-Two quantization, also known as Shift networks, which aim to reduce +memory usage and simplify computation. However, existing low-bit Shift networks +are not as accurate as their full-precision counterparts, typically suffering +from limited weight range encoding schemes and quantization loss. In this +paper, we propose the DenseShift network, which significantly improves the +accuracy of Shift networks, achieving competitive performance to full-precision +networks for vision and speech applications. In addition, we introduce a method +to deploy an efficient DenseShift network using non-quantized floating-point +activations, while obtaining 1.6X speed-up over existing methods. To achieve +this, we demonstrate that zero-weight values in low-bit Shift networks do not +contribute to model capacity and negatively impact inference computation. To +address this issue, we propose a zero-free shifting mechanism that simplifies +inference and increases model capacity. We further propose a sign-scale +decomposition design to enhance training efficiency and a low-variance random +initialization strategy to improve the model's transfer learning performance. +Our extensive experiments on various computer vision and speech tasks +demonstrate that DenseShift outperforms existing low-bit multiplication-free +networks and achieves competitive performance compared to full-precision +networks. Furthermore, our proposed approach exhibits strong transfer learning +performance without a drop in accuracy. Our code was released on GitHub. + +
+
+
+
+
+ + ♻ ☆ Backorder Prediction in Inventory Management: Classification Techniques + and Cost Considerations + + +
+ This article introduces an advanced analytical approach for predicting +backorders in inventory management. Backorder refers to an order that cannot be +immediately fulfilled due to stock depletion. Multiple classification +techniques, including Balanced Bagging Classifiers, Fuzzy Logic, Variational +Autoencoder - Generative Adversarial Networks, and Multi-layer Perceptron +classifiers, are assessed in this work using performance evaluation metrics +such as ROC-AUC and PR-AUC. Moreover, this work incorporates a profit function +and misclassification costs, considering the financial implications and costs +associated with inventory management and backorder handling. The study suggests +that a combination of modeling approaches, including ensemble techniques and +VAE, can effectively address imbalanced datasets in inventory management, +emphasizing interpretability and reducing false positives and false negatives. +This research contributes to the advancement of predictive analytics and offers +valuable insights for future investigations in backorder forecasting and +inventory control optimization for decision-making. + +
+
+
+
+
+ + ♻ ☆ Lifelong Robot Learning with Human Assisted Language Planners + + +
+ Large Language Models (LLMs) have been shown to act like planners that can +decompose high-level instructions into a sequence of executable instructions. +However, current LLM-based planners are only able to operate with a fixed set +of skills. We overcome this critical limitation and present a method for using +LLM-based planners to query new skills and teach robots these skills in a data +and time-efficient manner for rigid object manipulation. Our system can re-use +newly acquired skills for future tasks, demonstrating the potential of open +world and lifelong learning. We evaluate the proposed framework on multiple +tasks in simulation and the real world. Videos are available at: +https://sites.google.com/mit.edu/halp-robot-learning. + +
+
+
+
+
+ + ♻ ☆ Improving Fairness in Deepfake Detection + + +
+ Despite the development of effective deepfake detection models in recent +years, several recent studies have demonstrated that biases in the training +data utilized to develop deepfake detection models can lead to unfair +performance for demographic groups of different races and/or genders. Such biases can +result in these groups being unfairly targeted or excluded from detection, +allowing misclassified deepfakes to manipulate public opinion and erode trust +in the model. While these studies have focused on identifying and evaluating +the unfairness in deepfake detection, no methods have been developed to address +the fairness issue of deepfake detection at the algorithm level. In this work, +we make the first attempt to improve deepfake detection fairness by proposing +novel loss functions to train fair deepfake detection models in ways that are +agnostic or aware of demographic factors. Extensive experiments on four +deepfake datasets and five deepfake detectors demonstrate the effectiveness and +flexibility of our approach in improving the deepfake detection fairness. + +
+
+
+
+
+ + ♻ ☆ Meta learning with language models: Challenges and opportunities in the + classification of imbalanced text + + +
+ Detecting out of policy speech (OOPS) content is important but difficult. +While machine learning is a powerful tool to tackle this challenging task, it +is hard to break the performance ceiling due to factors like quantity and +quality limitations on training data and inconsistencies in OOPS definition and +data labeling. To realize the full potential of available limited resources, we +propose a meta learning technique (MLT) that combines individual models built +with different text representations. We analytically show that the resulting +technique is numerically stable and produces reasonable combining weights. We +combine the MLT with a threshold-moving (TM) technique to further improve the +performance of the combined predictor on highly-imbalanced in-distribution and +out-of-distribution datasets. We also provide computational results to show the +statistically significant advantages of the proposed MLT approach. + All authors contributed equally to this work. + +
+
+ comment: 22 pages, including 5 figures, 12 tables, 1 appendix +
+
+
+
+
+ + ♻ ☆ AnglE-optimized Text Embeddings + + +
+ High-quality text embedding is pivotal in improving semantic textual +similarity (STS) tasks, which are crucial components in Large Language Model +(LLM) applications. However, a common challenge existing text embedding models +face is the problem of vanishing gradients, primarily due to their reliance on +the cosine function in the optimization objective, which has saturation zones. +To address this issue, this paper proposes a novel angle-optimized text +embedding model called AnglE. The core idea of AnglE is to introduce angle +optimization in a complex space. This novel approach effectively mitigates the +adverse effects of the saturation zone in the cosine function, which can impede +gradient and hinder optimization processes. To set up a comprehensive STS +evaluation, we experimented on existing short-text STS datasets and a newly +collected long-text STS dataset from GitHub Issues. Furthermore, we examine +domain-specific STS scenarios with limited labeled data and explore how AnglE +works with LLM-annotated data. Extensive experiments were conducted on various +tasks including short-text STS, long-text STS, and domain-specific STS tasks. +The results show that AnglE outperforms the state-of-the-art (SOTA) STS models +that ignore the cosine saturation zone. These findings demonstrate the ability +of AnglE to generate high-quality text embeddings and the usefulness of angle +optimization in STS. + +
+
+ comment: update results and add non-STS transfer tasks +
+
+
+
+
+ + ♻ ☆ Explainable machine learning-based prediction model for diabetic + nephropathy + + +
+ The aim of this study is to analyze the effect of serum metabolites on +diabetic nephropathy (DN) and predict the prevalence of DN through a machine +learning approach. The dataset consists of 548 patients from April 2018 to +April 2019 at the Second Affiliated Hospital of Dalian Medical University (SAHDMU). +We select the optimal 38 features through a least absolute shrinkage and +selection operator (LASSO) regression model and 10-fold cross-validation. We +compare four machine learning algorithms, including eXtreme Gradient Boosting +(XGB), random forest, decision tree and logistic regression, by AUC-ROC curves, +decision curves, and calibration curves. We quantify feature importance and +interaction effects in the optimal predictive model by the Shapley Additive +exPlanations (SHAP) method. The XGB model performs best at screening +for DN, with the highest AUC value of 0.966. The XGB model also gains more +clinical net benefit than the others, and its fit is better. In +addition, there are significant interactions between serum metabolites and +duration of diabetes. We develop a predictive model using the XGB algorithm to screen +for DN. C2, C5DC, Tyr, Ser, Met, C24, C4DC, and Cys contribute strongly to +the model, and can possibly be biomarkers for DN. + +
+
+
+
+
+ + ♻ ☆ Fully Adaptive Composition in Differential Privacy + + +
+ Composition is a key feature of differential privacy. Well-known advanced +composition theorems allow one to query a private database quadratically more +times than basic privacy composition would permit. However, these results +require that the privacy parameters of all algorithms be fixed before +interacting with the data. To address this, Rogers et al. introduced fully +adaptive composition, wherein both algorithms and their privacy parameters can +be selected adaptively. They defined two probabilistic objects to measure +privacy in adaptive composition: privacy filters, which provide differential +privacy guarantees for composed interactions, and privacy odometers, +time-uniform bounds on privacy loss. There are substantial gaps between +advanced composition and existing filters and odometers. First, existing +filters place stronger assumptions on the algorithms being composed. Second, +these odometers and filters suffer from large constants, making them +impractical. We construct filters that match the rates of advanced composition, +including constants, despite allowing for adaptively chosen privacy parameters. +En route we also derive a privacy filter for approximate zCDP. We also +construct several general families of odometers. These odometers match the +tightness of advanced composition at an arbitrary, preselected point in time, +or at all points in time simultaneously, up to a doubly-logarithmic factor. We +obtain our results by leveraging advances in martingale concentration. In sum, +we show that fully adaptive privacy is obtainable at almost no loss. + +
+
+ comment: 23 pages, 3 figures +
+
+
+
+
+ + ♻ ☆ A Systematic Survey in Geometric Deep Learning for Structure-based Drug + Design + + +
+ Structure-based drug design (SBDD) utilizes the three-dimensional geometry of +proteins to identify potential drug candidates. Traditional methods, grounded +in physicochemical modeling and informed by domain expertise, are +resource-intensive. Recent developments in geometric deep learning, focusing on +the integration and processing of 3D geometric data, coupled with the +availability of accurate protein 3D structure predictions from tools like +AlphaFold, have greatly advanced the field of structure-based drug design. This +paper systematically reviews the current state of geometric deep learning in +SBDD. We first outline foundational tasks in SBDD, detail prevalent 3D protein +representations, and highlight representative predictive and generative models. +We then offer in-depth reviews of each key task, including binding site +prediction, binding pose generation, \emph{de novo} molecule generation, linker +design, and binding affinity prediction. We provide formal problem definitions +and outline each task's representative methods, datasets, evaluation metrics, +and performance benchmarks. Finally, we summarize the current challenges and +future opportunities: current challenges in SBDD include oversimplified problem +formulations, inadequate out-of-distribution generalization, a lack of reliable +evaluation metrics and large-scale benchmarks, and the need for experimental +verification and enhanced model understanding; opportunities include leveraging +multimodal datasets, integrating domain knowledge, building comprehensive +benchmarks, designing criteria based on clinical endpoints, and developing +foundation models that broaden the range of design tasks. We also curate +\url{https://github.com/zaixizhang/Awesome-SBDD}, reflecting ongoing +contributions and new datasets in SBDD. + +
+
+ comment: 20 pages, under review +
+
+
+
+
+ + ♻ ☆ Meta- (out-of-context) learning in neural networks + + +
+ Brown et al. (2020) famously introduced the phenomenon of in-context learning +in large language models (LLMs). We establish the existence of a phenomenon we +call meta-out-of-context learning (meta-OCL) via carefully designed synthetic +experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more +readily "internalize" the semantic content of text that is, or appears to be, +broadly useful (such as true statements, or text from authoritative sources) +and use it in appropriate circumstances. We further demonstrate meta-OCL in a +synthetic computer vision setting, and propose two hypotheses for the emergence +of meta-OCL: one relying on the way models store knowledge in their parameters, +and another suggesting that the implicit gradient alignment bias of +gradient-descent-based optimizers may be responsible. Finally, we reflect on +what our results might imply about capabilities of future AI systems, and +discuss potential risks. Our code can be found at +https://github.com/krasheninnikov/internalization. + +
+
+
+
+
+ + ♻ ☆ Differentially Private Conditional Independence Testing + + +
+ Conditional independence (CI) tests are widely used in statistical data +analysis, e.g., they are the building block of many algorithms for causal graph +discovery. The goal of a CI test is to accept or reject the null hypothesis +that $X \perp \!\!\! \perp Y \mid Z$, where $X \in \mathbb{R}, Y \in +\mathbb{R}, Z \in \mathbb{R}^d$. In this work, we investigate conditional +independence testing under the constraint of differential privacy. We design +two private CI testing procedures: one based on the generalized covariance +measure of Shah and Peters (2020) and another based on the conditional +randomization test of Cand\`es et al. (2016) (under the model-X assumption). We +provide theoretical guarantees on the performance of our tests and validate +them empirically. These are the first private CI tests with rigorous +theoretical guarantees that work for the general case when $Z$ is continuous. + +
+
+
+
+
+ + ♻ ☆ KineticNet: Deep learning a transferable kinetic energy functional for + orbital-free density functional theory + + +
+ Orbital-free density functional theory (OF-DFT) holds the promise to compute +ground state molecular properties at minimal cost. However, it has been held +back by our inability to compute the kinetic energy as a functional of the +electron density only. We here set out to learn the kinetic energy functional +from ground truth provided by the more expensive Kohn-Sham density functional +theory. Such learning is confronted with two key challenges: giving the model +sufficient expressivity and spatial context while limiting the memory footprint +to afford computations on a GPU; and creating a sufficiently broad distribution +of training data to enable iterative density optimization even when starting +from a poor initial guess. In response, we introduce KineticNet, an equivariant +deep neural network architecture based on point convolutions adapted to the +prediction of quantities on molecular quadrature grids. Important contributions +include convolution filters with sufficient spatial resolution in the vicinity +of the nuclear cusp; an atom-centric sparse but expressive architecture that +relays information across multiple bond lengths; and a new strategy to generate +varied training data by finding ground state densities in the face of +perturbations by a random external potential. KineticNet achieves, for the +first time, chemical accuracy of the learned functionals across input densities +and geometries of tiny molecules. For two-electron systems, we additionally +demonstrate OF-DFT density optimization with chemical accuracy. + +
+
+ comment: This article may be downloaded for personal use only. Any other use + requires prior permission of the author and AIP Publishing. This article + appeared in The Journal of Chemical Physics 159, 144113 (2023) and may be + found at https://doi.org/10.1063/5.0158275 +
+
+
+
+
+ + ♻ ☆ Simplest Streaming Trees + + +
+ Decision forests, including random forests and gradient boosting trees, +remain the leading machine learning methods for many real-world data problems, +especially on tabular data. However, most of the current implementations only +operate in batch mode, and therefore cannot incrementally update when more data +arrive. Several previous works developed streaming trees and ensembles to +overcome this limitation. Nonetheless, we found that those state-of-the-art +algorithms suffer from a number of drawbacks, including low accuracy on some +problems and high memory usage on others. We therefore developed the simplest +possible extension of decision trees: given new data, simply update existing +trees by continuing to grow them, and replace some old trees with new ones to +control the total number of trees. In a benchmark suite containing 72 +classification problems (the OpenML-CC18 data suite), we illustrate that our +approach, Stream Decision Forest (SDF), does not suffer from either of the +aforementioned limitations. On those datasets, we also demonstrate that our +approach often performs as well as, and sometimes even better than, conventional +batch decision forest algorithms. Thus, SDFs establish a simple standard for +streaming trees and forests that could readily be applied to many real-world +problems. + +
+
+
+
+
+ + ♻ ☆ Denoising Low-Rank Data Under Distribution Shift: Double Descent and + Data Augmentation + + +
+ Despite the importance of denoising in modern machine learning and ample +empirical work on supervised denoising, its theoretical understanding is still +relatively scarce. One concern about studying supervised denoising is that one +might not always have noiseless training data from the test distribution. It is +more reasonable to have access to noiseless training data from a different +dataset than the test dataset. Motivated by this, we study supervised denoising +and noisy-input regression under distribution shift. We add three +considerations to increase the applicability of our theoretical insights to +real-life data and modern machine learning. First, while most past theoretical +work assumes that the data covariance matrix is full-rank and well-conditioned, +empirical studies have shown that real-life data is approximately low-rank. +Thus, we assume that our data matrices are low-rank. Second, we drop +independence assumptions on our data. Third, the rise in computational power +and dimensionality of data have made it important to study non-classical +regimes of learning. Thus, we work in the non-classical proportional regime, +where data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$. + For this setting, we derive general test error expressions for both denoising +and noisy-input regression, and study when overfitting the noise is benign, +tempered or catastrophic. We show that the test error exhibits double descent +under general distribution shift, providing insights for data augmentation and +the role of noise as an implicit regularizer. We also perform experiments using +real-life data, where we match the theoretical predictions with under 1% MSE +error for low-rank data. + +
+
+ comment: Complete overhaul of presentation, many new results +
+
+
+
+
+ + ♻ ☆ Can bin-wise scaling improve consistency and adaptivity of prediction + uncertainty for machine learning regression ? + + +
+ Binwise Variance Scaling (BVS) has recently been proposed as a post hoc +recalibration method for prediction uncertainties of machine learning +regression problems that is capable of more efficient corrections than uniform +variance (or temperature) scaling. The original version of BVS uses +uncertainty-based binning, which aims to improve calibration conditionally +on uncertainty, i.e. consistency. I explore here several adaptations of BVS, in +particular with alternative loss functions and a binning scheme based on an +input-feature (X) in order to improve adaptivity, i.e. calibration conditional +on X. The performances of BVS and its proposed variants are tested on a +benchmark dataset for the prediction of atomization energies and compared to +the results of isotonic regression. + +
+
+ comment: This version corrects an error in the estimation of the Sx scores for + the test set, affecting Fig. 2 and Tables I-III of the initial version. The + main points of the discussion and the conclusions are unchanged +
+
+
+
+
+ + ♻ ☆ Conformal Prediction for Federated Uncertainty Quantification Under + Label Shift ICML 2023 + + +
+ Federated Learning (FL) is a machine learning framework where many clients +collaboratively train models while keeping the training data decentralized. +Despite recent advances in FL, the uncertainty quantification topic (UQ) +remains partially addressed. Among UQ methods, conformal prediction (CP) +approaches provide distribution-free guarantees under minimal assumptions. We +develop a new federated conformal prediction method based on quantile +regression and take into account privacy constraints. This method takes +advantage of importance weighting to effectively address the label shift +between agents and provides theoretical guarantees for both valid coverage of +the prediction sets and differential privacy. Extensive experimental studies +demonstrate that this method outperforms current competitors. + +
+
+ comment: ICML 2023 +
+
+
+
+
+ + ♻ ☆ Revisiting Implicit Differentiation for Learning Problems in Optimal + Control NeurIPS 2023 + + +
+ This paper proposes a new method for differentiating through optimal +trajectories arising from non-convex, constrained discrete-time optimal control +(COC) problems using the implicit function theorem (IFT). Previous works solve +a differential Karush-Kuhn-Tucker (KKT) system for the trajectory derivative, +and achieve this efficiently by solving an auxiliary Linear Quadratic Regulator +(LQR) problem. In contrast, we directly evaluate the matrix equations which +arise from applying variable elimination on the Lagrange multiplier terms in +the (differential) KKT system. By appropriately accounting for the structure of +the terms within the resulting equations, we show that the trajectory +derivatives scale linearly with the number of timesteps. Furthermore, our +approach allows for easy parallelization, significantly improved scalability +with model size, direct computation of vector-Jacobian products and improved +numerical stability compared to prior works. As an additional contribution, we +unify prior works, addressing claims that computing trajectory derivatives +using IFT scales quadratically with the number of timesteps. We evaluate our +method on both a synthetic benchmark and four challenging learning-from- +demonstration benchmarks, including a 6-DoF maneuvering quadrotor and 6-DoF +rocket powered landing. + +
+
+ comment: Accepted to NeurIPS 2023 (poster) +
+
+
+
+
+ + ♻ ☆ ULF: Unsupervised Labeling Function Correction using Cross-Validation + for Weak Supervision + + +
+ A cost-effective alternative to manual data labeling is weak supervision +(WS), where data samples are automatically annotated using a predefined set of +labeling functions (LFs), rule-based mechanisms that generate artificial labels +for the associated classes. In this work, we investigate noise reduction +techniques for WS based on the principle of k-fold cross-validation. We +introduce a new algorithm ULF for Unsupervised Labeling Function correction, +which denoises WS data by leveraging models trained on all but some LFs to +identify and correct biases specific to the held-out LFs. Specifically, ULF +refines the allocation of LFs to classes by re-estimating this assignment on +highly reliable cross-validated samples. Evaluation on multiple datasets +confirms ULF's effectiveness in enhancing WS learning without the need for +manual labeling. + +
+
+
+
+
+ + ♻ ☆ CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a + Context Synergized Hyperbolic Network EMNLP 2023 + + +
+ The tremendous growth of social media users interacting in online +conversations has led to significant growth in hate speech, affecting people +from various demographics. Most of the prior works focus on detecting explicit +hate speech, which is overt and leverages hateful phrases, with very little +work focusing on detecting hate speech that is implicit or denotes hatred +through indirect or coded language. In this paper, we present CoSyn, a +context-synergized neural network that explicitly incorporates user- and +conversational context for detecting implicit hate speech in online +conversations. CoSyn introduces novel ways to encode these external contexts +and employs a novel context interaction mechanism that clearly captures the +interplay between them, making independent assessments of the amounts of +information to be retrieved from these noisy contexts. Additionally, it carries +out all these operations in the hyperbolic space to account for the scale-free +dynamics of social media. We demonstrate the effectiveness of CoSyn on 6 hate +speech datasets and show that CoSyn outperforms all our baselines in detecting +implicit hate speech with absolute improvements in the range of 1.24% - 57.8%. + +
+
+ comment: Accepted to EMNLP 2023 Main Conference. Code: + https://github.com/Sreyan88/CoSyn +
+
+
+
+
+ + ♻ ☆ Efficient Graph Laplacian Estimation by Proximal Newton + + +
+ The Laplacian-constrained Gaussian Markov Random Field (LGMRF) is a common +multivariate statistical model for learning a weighted sparse dependency graph +from given data. This graph learning problem can be formulated as a maximum +likelihood estimation (MLE) of the precision matrix, subject to Laplacian +structural constraints, with a sparsity-inducing penalty term. This paper aims +to solve this learning problem accurately and efficiently. First, since the +commonly used $\ell_1$-norm penalty is inappropriate in this setting and may +lead to a complete graph, we employ the nonconvex minimax concave penalty +(MCP), which promotes sparse solutions with lower estimation bias. Second, as +opposed to existing first-order methods for this problem, we develop a +second-order proximal Newton approach to obtain an efficient solver, utilizing +several algorithmic features, such as using Conjugate Gradients, +preconditioning, and splitting to active/free sets. Numerical experiments +demonstrate the advantages of the proposed method in terms of both +computational complexity and graph learning accuracy compared to existing +methods. + +
+
+
+
+
+ + ♻ ☆ Avalon's Game of Thoughts: Battle Against Deception through Recursive + Contemplation + + +
+ Recent breakthroughs in large language models (LLMs) have brought remarkable +success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is +that the information processed by LLMs is consistently honest, neglecting the +pervasive deceptive or misleading information in human society and AI-generated +content. This oversight makes LLMs susceptible to malicious manipulations, +potentially resulting in detrimental outcomes. This study utilizes the +intricate Avalon game as a testbed to explore LLMs' potential in deceptive +environments. Avalon, full of misinformation and requiring sophisticated logic, +manifests as a "Game-of-Thoughts". Inspired by the efficacy of humans' +recursive thinking and perspective-taking in the Avalon game, we introduce a +novel framework, Recursive Contemplation (ReCon), to enhance LLMs' ability to +identify and counteract deceptive information. ReCon combines formulation and +refinement contemplation processes; formulation contemplation produces initial +thoughts and speech, while refinement contemplation further polishes them. +Additionally, we incorporate first-order and second-order perspective +transitions into these processes respectively. Specifically, the first-order +allows an LLM agent to infer others' mental states, and the second-order +involves understanding how others perceive the agent's mental state. After +integrating ReCon with different LLMs, extensive experiment results from the +Avalon game indicate its efficacy in aiding LLMs to discern and maneuver around +deceptive information without extra fine-tuning and data. Finally, we offer a +possible explanation for the efficacy of ReCon and explore the current +limitations of LLMs in terms of safety, reasoning, speaking style, and format, +potentially furnishing insights for subsequent research. + +
+
+ comment: 40 pages +
+
+
+
+
+ + ♻ ☆ DiffTraj: Generating GPS Trajectory with Diffusion Probabilistic Model + + +
+ Pervasive integration of GPS-enabled devices and data acquisition +technologies has led to an exponential increase in GPS trajectory data, +fostering advancements in spatial-temporal data mining research. Nonetheless, +GPS trajectories contain personal geolocation information, raising serious +privacy concerns when working with raw data. A promising approach to address +this issue is trajectory generation, which involves replacing original data +with generated, privacy-free alternatives. Despite the potential of trajectory +generation, the complex nature of human behavior and its inherent stochastic +characteristics pose challenges in generating high-quality trajectories. In +this work, we propose a spatial-temporal diffusion probabilistic model for +trajectory generation (DiffTraj). This model effectively combines the +generative abilities of diffusion models with the spatial-temporal features +derived from real trajectories. The core idea is to reconstruct and synthesize +geographic trajectories from white noise through a reverse trajectory denoising +process. Furthermore, we propose a Trajectory UNet (Traj-UNet) deep neural +network to embed conditional information and accurately estimate noise levels +during the reverse process. Experiments on two real-world datasets show that +DiffTraj can be intuitively applied to generate high-fidelity trajectories +while retaining the original distributions. Moreover, the generated results can +support downstream trajectory analysis tasks and significantly outperform other +methods in terms of geo-distribution evaluations. + +
+
+
+
+
+ + ♻ ☆ Learning Large-scale Neural Fields via Context Pruned Meta-Learning NeurIPS 2023 + + +
+ We introduce an efficient optimization-based meta-learning technique for +large-scale neural field training by realizing significant memory savings +through automated online context point selection. This is achieved by focusing +each learning step on the subset of data with the highest expected immediate +improvement in model quality, resulting in the almost instantaneous modeling of +global structure and subsequent refinement of high-frequency details. We +further improve the quality of our meta-learned initialization by introducing a +bootstrap correction resulting in the minimization of any error introduced by +reduced context sets while simultaneously mitigating the well-known myopia of +optimization-based meta-learning. Finally, we show how gradient re-scaling at +meta-test time allows the learning of extremely high-quality neural fields in +significantly shortened optimization procedures. Our framework is +model-agnostic, intuitive, straightforward to implement, and shows significant +reconstruction improvements for a wide range of signals. We provide an +extensive empirical evaluation on nine datasets across multiple +modalities, demonstrating state-of-the-art results while providing additional +insight through careful analysis of the algorithmic components constituting our +method. Code is available at https://github.com/jihoontack/GradNCP + +
+
+ comment: Published as a conference proceeding for NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Exploiting the Signal-Leak Bias in Diffusion Models + + +
+ There is a bias in the inference pipeline of most diffusion models. This bias +arises from a signal leak whose distribution deviates from the noise +distribution, creating a discrepancy between training and inference processes. +We demonstrate that this signal-leak bias is particularly significant when +models are tuned to a specific style, causing sub-optimal style matching. +Recent research tries to avoid the signal leakage during training. We instead +show how we can exploit this signal-leak bias in existing diffusion models to +allow more control over the generated images. This enables us to generate +images with more varied brightness, and images that better match a desired +style or color. By modeling the distribution of the signal leak in the spatial +frequency and pixel domains, and including a signal leak in the initial latent, +we generate images that better match expected results without any additional +training. + +
+
+ comment: corrected the author names in reference [24] +
+
+
+
+
+ + ♻ ☆ How does GPT-2 compute greater-than?: Interpreting mathematical + abilities in a pre-trained language model NeurIPS 2023 + + +
+ Pre-trained language models can be surprisingly adept at tasks they were not +explicitly trained on, but how they implement these capabilities is poorly +understood. In this paper, we investigate the basic mathematical abilities +often acquired by pre-trained language models. Concretely, we use mechanistic +interpretability techniques to explain the (limited) mathematical abilities of +GPT-2 small. As a case study, we examine its ability to take in sentences such +as "The war lasted from the year 1732 to the year 17", and predict valid +two-digit end years (years > 32). We first identify a circuit, a small subset +of GPT-2 small's computational graph that computes this task's output. Then, we +explain the role of each circuit component, showing that GPT-2 small's final +multi-layer perceptrons boost the probability of end years greater than the +start year. Finally, we find related tasks that activate our circuit. Our +results suggest that GPT-2 small computes greater-than using a complex but +general mechanism that activates across diverse contexts. + +
+
+ comment: NeurIPS 2023 Camera Ready Version +
+
+
+
+
+ + ♻ ☆ Differentiable Sparsification for Deep Neural Networks + + +
+ Deep neural networks have significantly alleviated the burden of feature +engineering, but comparable efforts are now required to determine effective +architectures for these networks. Furthermore, as network sizes have become +excessively large, a substantial amount of resources is invested in reducing +their sizes. These challenges can be effectively addressed through the +sparsification of over-complete models. In this study, we propose a fully +differentiable sparsification method for deep neural networks, which can zero +out unimportant parameters by directly optimizing a regularized objective +function with stochastic gradient descent. Consequently, the proposed method +can learn both the sparsified structure and weights of a network in an +end-to-end manner. It can be directly applied to various modern deep neural +networks and requires minimal modification to the training process. To the best +of our knowledge, this is the first fully differentiable sparsification method. + +
+
+
+
+
+ + ♻ ☆ Beware of diffusion models for synthesizing medical images -- A + comparison with GANs in terms of memorizing brain MRI and chest x-ray images + + +
+ Diffusion models were initially developed for text-to-image generation and +are now being utilized to generate high-quality synthetic images. Preceded by +GANs, diffusion models have shown impressive results using various evaluation +metrics. However, commonly used metrics such as FID and IS are not suitable for +determining whether diffusion models are simply reproducing the training +images. Here we train StyleGAN and diffusion models, using BRATS20, BRATS21 and +a chest x-ray pneumonia dataset, to synthesize brain MRI and chest x-ray +images, and measure the correlation between the synthetic images and all +training images. Our results show that diffusion models are more likely to +memorize the training images, compared to StyleGAN, especially for small +datasets and when using 2D slices from 3D volumes. Researchers should be +careful when using diffusion models for medical imaging, if the final goal is +to share the synthetic images. + +
+
+ comment: 12 Pages, 6 Figures +
+
+
+
+
+ + ♻ ☆ Unsupervised Video Domain Adaptation for Action Recognition: A + Disentanglement Perspective NeurIPS 2023 + + +
+ Unsupervised video domain adaptation is a practical yet challenging task. In +this work, for the first time, we tackle it from a disentanglement view. Our +key idea is to handle the spatial and temporal domain divergence separately +through disentanglement. Specifically, we consider the generation of +cross-domain videos from two sets of latent factors, one encoding the static +information and another encoding the dynamic information. A Transfer Sequential +VAE (TranSVAE) framework is then developed to model such generation. To better +serve for adaptation, we propose several objectives to constrain the latent +factors. With these constraints, the spatial divergence can be readily removed +by disentangling the static domain-specific information out, and the temporal +divergence is further reduced from both frame- and video-levels through +adversarial learning. Extensive experiments on the UCF-HMDB, Jester, and +Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE +compared with several state-of-the-art approaches. Code is publicly available. + +
+
+ comment: NeurIPS 2023; 20 pages, 9 figures, 10 tables; Code at + https://github.com/ldkong1205/TranSVAE +
+
+
+
+
+ + ♻ ☆ Conformal prediction under ambiguous ground truth + + +
+ Conformal Prediction (CP) allows one to perform rigorous uncertainty +quantification by constructing a prediction set $C(X)$ satisfying $\mathbb{P}(Y +\in C(X))\geq 1-\alpha$ for a user-chosen $\alpha \in [0,1]$ by relying on +calibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\mathbb{P}=\mathbb{P}^{X} +\otimes \mathbb{P}^{Y|X}$. It is typically implicitly assumed that +$\mathbb{P}^{Y|X}$ is the "true" posterior label distribution. However, in many +real-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating +expert opinions using a voting procedure, resulting in a one-hot distribution +$\mathbb{P}_{vote}^{Y|X}$. For such "voted" labels, CP guarantees are thus +w.r.t. $\mathbb{P}_{vote}=\mathbb{P}^X \otimes \mathbb{P}_{vote}^{Y|X}$ rather +than the true distribution $\mathbb{P}$. In cases with unambiguous ground truth +labels, the distinction between $\mathbb{P}_{vote}$ and $\mathbb{P}$ is +irrelevant. However, when experts do not agree because of ambiguous labels, +approximating $\mathbb{P}^{Y|X}$ with a one-hot distribution +$\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose +to leverage expert opinions to approximate $\mathbb{P}^{Y|X}$ using a +non-degenerate distribution $\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP +procedures which provide guarantees w.r.t. $\mathbb{P}_{agg}=\mathbb{P}^X +\otimes \mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels +from $\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a +case study of skin condition classification with significant disagreement among +expert annotators, we show that applying CP w.r.t. $\mathbb{P}_{vote}$ +under-covers expert annotations: calibrated for $72\%$ coverage, it falls short +by on average $10\%$; our Monte Carlo CP closes this gap both empirically and +theoretically. + +
+
+
+
+
+ + ♻ ☆ Harmonizing output imbalance for defect segmentation on + extremely-imbalanced photovoltaic module cells images + + +
+ The continuous development of the photovoltaic (PV) industry has raised high +requirements for the quality of monocrystalline PV module cells. When +learning to segment defect regions in PV module cell images, Tiny Hidden Cracks +(THC) lead to extremely-imbalanced samples. The ratio of defect pixels to +normal pixels can be as low as 1:2000. This extreme imbalance makes it +difficult to segment the THC of PV module cells, which is also a challenge for +semantic segmentation. To address the problem of segmenting defects on +extremely-imbalanced THC data, the paper makes contributions from three +aspects: (1) it proposes an explicit measure for output imbalance; (2) it +generalizes a distribution-based loss that can handle different types of output +imbalances; and (3) it introduces a compound loss with our adaptive +hyperparameter selection algorithm that can keep the consistency of training +and inference for harmonizing the output imbalance on extremely-imbalanced input +data. The proposed method is evaluated on four widely-used deep learning +architectures and four datasets with varying degrees of input imbalance. The +experimental results show that the proposed method outperforms existing +methods. + +
+
+ comment: 19 pages, 16 figures, 3 appendixes +
+
+
+
+
+ + ♻ ☆ Thompson Sampling for Real-Valued Combinatorial Pure Exploration of + Multi-Armed Bandit + + +
+ We study the real-valued combinatorial pure exploration of the multi-armed +bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic +arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown +distribution with mean $\mu_s$. In each time step, a player pulls a single arm +and observes its reward. The player's goal is to identify the optimal +\emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in +\mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized +real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few +arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size +of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm +named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, +which is the first algorithm that can work even when the size of the action set +is exponentially large in $d$. We also introduce a novel problem-dependent +sample complexity lower bound of the R-CPE-MAB problem, and show that the +GenTS-Explore algorithm achieves the optimal sample complexity up to a +problem-dependent constant factor. + +
+
+
+
+
+ + ♻ ☆ Segment Any Point Cloud Sequences by Distilling Vision Foundation Models NeurIPS 2023 + + +
+ Recent advancements in vision foundation models (VFMs) have opened up new +possibilities for versatile and efficient visual perception. In this work, we +introduce Seal, a novel framework that harnesses VFMs for segmenting diverse +automotive point cloud sequences. Seal exhibits three appealing properties: i) +Scalability: VFMs are directly distilled into point clouds, obviating the need +for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial +and temporal relationships are enforced at both the camera-to-LiDAR and +point-to-segment regularization stages, facilitating cross-modal representation +learning. iii) Generalizability: Seal enables knowledge transfer in an +off-the-shelf manner to downstream tasks involving diverse point clouds, +including those from real/synthetic, low/high-resolution, large/small-scale, +and clean/corrupted datasets. Extensive experiments conducted on eleven +different point cloud datasets showcase the effectiveness and superiority of +Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear +probing, surpassing random initialization by 36.9% mIoU and outperforming prior +arts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains +over existing methods across 20 different few-shot fine-tuning tasks on all +eleven tested point cloud datasets. + +
+
+ comment: NeurIPS 2023 (Spotlight); 37 pages, 16 figures, 15 tables; Code at + https://github.com/youquanl/Segment-Any-Point-Cloud +
+
+
+
+
+ + ♻ ☆ LAP: An Attention-Based Module for Concept Based Self-Interpretation and + Knowledge Injection in Convolutional Neural Networks + + +
+ Despite the state-of-the-art performance of deep convolutional neural +networks, they are susceptible to bias and malfunction in unseen situations. +Moreover, the complex computation behind their reasoning is not +human-understandable to develop trust. External explainer methods have tried to +interpret network decisions in a human-understandable way, but they are accused +of fallacies due to their assumptions and simplifications. On the other side, +the inherent self-interpretability of models, while being more robust to the +mentioned fallacies, cannot be applied to the already trained models. In this +work, we propose a new attention-based pooling layer, called Local Attention +Pooling (LAP), that accomplishes self-interpretability and the possibility for +knowledge injection without performance loss. The module is easily pluggable +into any convolutional neural network, even the already trained ones. We have +defined a weakly supervised training scheme to learn the distinguishing +features in decision-making without depending on experts' annotations. We +verified our claims by evaluating several LAP-extended models on two datasets, +including ImageNet. The proposed framework offers more valid +human-understandable and faithful-to-the-model interpretations than the +commonly used white-box explainer methods. + +
+
+
+
+
+ + ♻ ☆ Zero-One Laws of Graph Neural Networks NeurIPS '23 + + +
+ Graph neural networks (GNNs) are the de facto standard deep learning +architectures for machine learning on graphs. This has led to a large body of +work analyzing the capabilities and limitations of these models, particularly +pertaining to their representation and extrapolation capacity. We offer a novel +theoretical perspective on the representation and extrapolation capacity of +GNNs, by answering the question: how do GNNs behave as the number of graph +nodes becomes very large? Under mild assumptions, we show that when we draw +graphs of increasing size from the Erd\H{o}s-R\'enyi model, the probability +that such graphs are mapped to a particular output by a class of GNN +classifiers tends to either zero or to one. This class includes the popular +graph convolutional network architecture. The result establishes 'zero-one +laws' for these GNNs, and analogously to other convergence laws, entails +theoretical limitations on their capacity. We empirically verify our results, +observing that the theoretical asymptotic limits are evident already on +relatively small graphs. + +
+
+ comment: NeurIPS '23 camera-ready version; 10 pages + references + 10 pages + appendices, 7 figures +
+
+
+
+
+ + ♻ ☆ Revisiting the Minimalist Approach to Offline Reinforcement Learning + + +
+ Recent years have witnessed significant advancements in offline reinforcement +learning (RL), resulting in the development of numerous algorithms with varying +degrees of complexity. While these algorithms have led to noteworthy +improvements, many incorporate seemingly minor design choices that impact their +effectiveness beyond core algorithmic advances. However, the effect of these +design choices on established baselines remains understudied. In this work, we +aim to bridge this gap by conducting a retrospective analysis of recent works +in offline RL and propose ReBRAC, a minimalistic algorithm that integrates such +design elements built on top of the TD3+BC method. We evaluate ReBRAC on 51 +datasets with both proprioceptive and visual state spaces using D4RL and V-D4RL +benchmarks, demonstrating its state-of-the-art performance among ensemble-free +methods in both offline and offline-to-online settings. To further illustrate +the efficacy of these design choices, we perform a large-scale ablation study +and hyperparameter sensitivity analysis on the scale of thousands of +experiments. + +
+
+ comment: Source code: https://github.com/DT6A/ReBRAC +
+
+
+
+
+ + ♻ ☆ Spectral2Spectral: Image-spectral Similarity Assisted Spectral CT Deep + Reconstruction without Reference + + +
+ Spectral computed tomography based on a photon-counting detector (PCD) +is attracting increasing attention because it can provide more +accurate identification and quantitative analysis of biomedical materials. The +limited number of photons within narrow energy bins leads to imaging results with a +low signal-to-noise ratio. Existing supervised deep reconstruction networks +for CT struggle to address these challenges because it is +usually impossible to acquire noise-free clinical images with clear structures +as references. In this paper, we propose an iterative deep reconstruction +network, named Spectral2Spectral, that combines an unsupervised method and data priors in a unified +framework. Spectral2Spectral employs an +unsupervised deep training strategy to obtain high-quality images from noisy +data in an end-to-end fashion. The structural similarity prior within the +image-spectral domain is refined as a regularization term to further constrain +network training. The weights of the neural network are automatically updated +to capture image features and structures within the iterative process. Experiments on three +large-scale preclinical datasets demonstrate that +Spectral2Spectral achieves better image quality than other +state-of-the-art methods. + +
+
+ comment: Accepted by IEEE TCI +
+
+
+
+
+ + ♻ ☆ Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design + + +
+ Scaling laws have been recently employed to derive compute-optimal model size +(number of parameters) for a given compute duration. We advance and refine such +methods to infer compute-optimal model shapes, such as width and depth, and +successfully implement this in vision transformers. Our shape-optimized vision +transformer, SoViT, achieves results competitive with models that exceed twice +its size, despite being pre-trained with an equivalent amount of compute. For +example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, +surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical +settings, also with less than half the inference cost. We conduct a thorough +evaluation across multiple tasks, such as image classification, captioning, VQA +and zero-shot transfer, demonstrating the effectiveness of our model across a +broad range of domains and identifying limitations. Overall, our findings +challenge the prevailing approach of blindly scaling up vision models and pave +a path for a more informed scaling. + +
+
+ comment: 10 pages, 7 figures, 9 tables. Version 2: Layout fixes +
+
+
+
+
+ + ♻ ☆ Leveraging Deep Learning and Online Source Sentiment for Financial + Portfolio Management + + +
+ Financial portfolio management describes the task of distributing funds and +conducting trading operations on a set of financial assets, such as stocks, +index funds, foreign exchange or cryptocurrencies, aiming to maximize the +profit while minimizing the loss incurred by said operations. Deep Learning +(DL) methods have been consistently excelling at various tasks, and automated +financial trading is one of the most complex of those. This paper aims to +provide insight into various DL methods for financial trading, under both the +supervised and reinforcement learning schemes. At the same time, taking into +consideration sentiment information regarding the traded assets, we discuss and +demonstrate their usefulness through corresponding research studies. Finally, +we discuss commonly found problems in training such financial agents and equip +the reader with the necessary knowledge to avoid these problems and apply the +discussed methods in practice. + +
+
+
+
+
+ + ♻ ☆ Graph Neural Networks can Recover the Hidden Features Solely from the + Graph Structure ICML 2023 + + +
+ Graph Neural Networks (GNNs) are popular models for graph learning problems. +GNNs show strong empirical performance in many practical tasks. However, the +theoretical properties have not been completely elucidated. In this paper, we +investigate whether GNNs can exploit the graph structure from the perspective +of the expressive power of GNNs. In our analysis, we consider graph generation +processes that are controlled by hidden (or latent) node features, which +contain all information about the graph structure. A typical example of this +framework is kNN graphs constructed from the hidden features. In our main +results, we show that GNNs can recover the hidden node features from the input +graph alone, even when all node features, including the hidden features +themselves and any indirect hints, are unavailable. GNNs can further use the +recovered node features for downstream tasks. These results show that GNNs can +fully exploit the graph structure by themselves, and in effect, GNNs can use +both the hidden and explicit node features for downstream tasks. In the +experiments, we confirm the validity of our results by showing that GNNs can +accurately recover the hidden features using a GNN architecture built based on +our theoretical analysis. + +
+
+ comment: ICML 2023 +
+
+
+
+
+ + ♻ ☆ Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory + Models + + +
+ We propose a novel anomaly detection method for echocardiogram videos. The +introduced method takes advantage of the periodic nature of the heart cycle to +learn three variants of a variational latent trajectory model (TVAE). While the +first two variants (TVAE-C and TVAE-R) model strict periodic movements of the +heart, the third (TVAE-S) is more general and allows shifts in the spatial +representation throughout the video. All models are trained on the healthy +samples of a novel in-house dataset of infant echocardiogram videos consisting +of multiple chamber views to learn a normative prior of the healthy population. +During inference, maximum a posteriori (MAP) based anomaly detection is +performed to detect out-of-distribution samples in our dataset. The proposed +method reliably identifies severe congenital heart defects, such as Ebstein's +Anomaly or Shone-complex. Moreover, it achieves superior performance over +MAP-based anomaly detection with standard variational autoencoders when +detecting pulmonary hypertension and right ventricular dilation. Finally, we +demonstrate that the proposed method enables interpretable explanations of its +output through heatmaps highlighting the regions corresponding to anomalous +heart structures. + +
+
+
+
+
+ + ♻ ☆ MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities + + +
+ We propose MM-Vet, an evaluation benchmark that examines large multimodal +models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various +intriguing abilities, such as solving math problems written on the blackboard, +reasoning about events and celebrities in news images, and explaining visual +jokes. Rapid model advancements pose challenges to evaluation benchmark +development. Problems include: (1) How to systematically structure and evaluate +the complicated multimodal tasks; (2) How to design evaluation metrics that +work well across question and answer types; and (3) How to give model insights +beyond a simple performance ranking. To this end, we present MM-Vet, designed +based on the insight that the intriguing ability to solve complicated tasks is +often achieved by a generalist model being able to integrate different core +vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and +examines the 16 integrations of interest derived from the capability +combination. For evaluation metrics, we propose an LLM-based evaluator for +open-ended outputs. The evaluator enables the evaluation across different +question types and answer styles, resulting in a unified scoring metric. We +evaluate representative LMMs on MM-Vet, providing insights into the +capabilities of different LMM system paradigms and models. Code and data are +available at https://github.com/yuweihao/MM-Vet. + +
+
+ comment: Add results of GPT-4V. Code, data and leaderboard: + https://github.com/yuweihao/MM-Vet +
+
+
+
+
+ + ♻ ☆ Adaptive Federated Minimax Optimization with Lower complexities AISTATS-2024 + + +
+ Federated learning is a popular distributed and privacy-preserving machine +learning paradigm. Meanwhile, minimax optimization, as an effective +hierarchical optimization, is widely applied in machine learning. Recently, +some federated optimization methods have been proposed to solve the distributed +minimax problems. However, these federated minimax methods still suffer from +high gradient and communication complexities. Meanwhile, few algorithms focus +on using adaptive learning rates to accelerate training. To fill this gap, in +this paper, we study a class of nonconvex minimax optimization problems and propose an +efficient adaptive federated minimax optimization algorithm (i.e., AdaFGDA) to +solve these distributed minimax problems. Specifically, our AdaFGDA builds on +the momentum-based variance-reduced and local-SGD techniques, and it can +flexibly incorporate various adaptive learning rates by using a unified +adaptive matrix. Theoretically, we provide a solid convergence analysis +framework for our AdaFGDA algorithm under the non-i.i.d. setting. Moreover, we +prove that our algorithms obtain a lower gradient (i.e., stochastic first-order +oracle, SFO) complexity of $\tilde{O}(\epsilon^{-3})$ with a lower communication +complexity of $\tilde{O}(\epsilon^{-2})$ in finding an $\epsilon$-stationary point +of the nonconvex minimax problems. Experimentally, we conduct experiments +on the deep AUC maximization and robust neural network training tasks to verify +the efficiency of our algorithms. + +
+
+ comment: Submitted to AISTATS-2024 +
+
+
+
+
+ + ♻ ☆ A Systematic Performance Analysis of Deep Perceptual Loss Networks: + Breaking Transfer Learning Conventions + + +
+ Deep perceptual loss is a type of loss function in computer vision that aims +to mimic human perception by using the deep features extracted from neural +networks. In recent years, the method has been applied to great effect on a +host of interesting computer vision tasks, especially for tasks with image or +image-like outputs, such as image synthesis, segmentation, depth prediction, +and more. Many applications of the method use pretrained networks, often +convolutional networks, for loss calculation. Despite the increased interest +and broader use, more effort is needed toward exploring which networks to use +for calculating deep perceptual loss and from which layers to extract the +features. + This work aims to rectify this by systematically evaluating a host of +commonly used and readily available, pretrained networks for a number of +different feature extraction points on four existing use cases of deep +perceptual loss. The use cases of perceptual similarity, super-resolution, +image segmentation, and dimensionality reduction, are evaluated through +benchmarks. The benchmarks are implementations of previous works where the +selected networks and extraction points are evaluated. The performance on the +benchmarks, and attributes of the networks and extraction points are then used +as a basis for an in-depth analysis. This analysis uncovers insight regarding +which architectures provide superior performance for deep perceptual loss and +how to choose an appropriate extraction point for a particular task and +dataset. Furthermore, the work discusses the implications of the results for +deep perceptual loss and the broader field of transfer learning. The results +show that deep perceptual loss deviates from two commonly held conventions in +transfer learning, which suggests that those conventions are in need of deeper +analysis. + +
+
+
+
+
+ + ♻ ☆ Physics-Informed Graph Convolutional Networks: Towards a generalized + framework for complex geometries + + +
+ Since the seminal work of [9] and their Physics-Informed neural networks +(PINNs), many efforts have been made towards solving partial differential +equations (PDEs) with Deep Learning models. However, some challenges remain, +for instance the extension of such models to complex three-dimensional +geometries, and a study on how such approaches could be combined with classical +numerical solvers. In this work, we justify the use of graph neural networks +for these problems, based on the similarity between these architectures and the +meshes used in traditional numerical techniques for solving partial +differential equations. After proving an issue with the Physics-Informed +framework for complex geometries, during the computation of PDE residuals, an +alternative procedure is proposed, by combining classical numerical solvers and +the Physics-Informed framework. Finally, we propose an implementation of this +approach, that we test on a three-dimensional problem on an irregular geometry. + +
+
+
+
+
+ + ♻ ☆ Amortized Variational Inference: A Systematic Review + + +
+ The core principle of Variational Inference (VI) is to convert the +statistical inference problem of computing complex posterior probability +densities into a tractable optimization problem. This property enables VI to be +faster than several sampling-based techniques. However, the traditional VI +algorithm is not scalable to large data sets and is unable to readily infer +out-of-bounds data points without re-running the optimization process. Recent +developments in the field, like stochastic-, black box-, and amortized-VI, have +helped address these issues. Generative modeling tasks nowadays widely make use +of amortized VI for its efficiency and scalability, as it utilizes a +parameterized function to learn the approximate posterior density parameters. +In this paper, we review the mathematical foundations of various VI techniques +to form the basis for understanding amortized VI. Additionally, we provide an +overview of the recent trends that address several issues of amortized VI, such +as the amortization gap, generalization issues, inconsistent representation +learning, and posterior collapse. Finally, we analyze alternate divergence +measures that improve VI optimization. + +
+
+ comment: Accepted for publication at the Journal of Artificial Intelligence + Research (JAIR) +
+
+
+
+
+ + ♻ ☆ DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable + Kendall's Rank Correlation NeurIPS 2023 + + +
+ Few-shot learning aims to adapt models trained on the base dataset to novel +tasks where the categories were not seen by the model before. This often leads +to a relatively uniform distribution of feature values across channels on novel +classes, posing challenges in determining channel importance for novel tasks. +Standard few-shot learning methods employ geometric similarity metrics such as +cosine similarity and negative Euclidean distance to gauge the semantic +relatedness between two features. However, features with high geometric +similarities may carry distinct semantics, especially in the context of +few-shot learning. In this paper, we demonstrate that the importance ranking of +feature channels is a more reliable indicator for few-shot learning than +geometric similarity metrics. We observe that replacing the geometric +similarity metric with Kendall's rank correlation only during inference is able +to improve the performance of few-shot learning across a wide range of methods +and datasets with different domains. Furthermore, we propose a carefully +designed differentiable loss for meta-training to address the +non-differentiability issue of Kendall's rank correlation. By replacing +geometric similarity with differentiable Kendall's rank correlation, our method +can integrate with numerous existing few-shot approaches and is ready for +integrating with future state-of-the-art methods that rely on geometric +similarity metrics. Extensive experiments validate the efficacy of the +rank-correlation-based approach, showcasing a significant improvement in +few-shot learning. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Improving End-to-End Speech Processing by Efficient Text Data + Utilization with Latent Synthesis EMNLP 2023 + + +
+ Training a high-performance end-to-end (E2E) speech processing model requires +an enormous amount of labeled speech data, especially in the era of +data-centric artificial intelligence. However, labeled speech data are usually +scarcer and more expensive to collect than textual data. We propose +Latent Synthesis (LaSyn), an efficient textual data utilization framework for +E2E speech processing models. We train a latent synthesizer to convert textual +data into an intermediate latent representation of a pre-trained speech model. +These pseudo acoustic representations of textual data augment acoustic data for +model training. We evaluate LaSyn on low-resource automatic speech recognition +(ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an +E2E baseline trained on LibriSpeech train-clean-100, with relative word error +rate reductions over 22.3% on different test sets. For SLU, LaSyn improves our +E2E baseline by absolute 4.1% for intent classification accuracy and 3.8% for +slot filling SLU-F1 on SLURP, and absolute 4.49% and 2.25% for exact match (EM) +and EM-Tree accuracies on STOP respectively. With fewer parameters, the results +of LaSyn are competitive with published state-of-the-art works. The results +demonstrate the quality of the augmented training data. + +
+
+ comment: 15 pages, 8 figures, 8 tables, Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ ViNT: A Foundation Model for Visual Navigation + + +
+ General-purpose pre-trained models ("foundation models") have enabled +practitioners to produce generalizable solutions for individual machine +learning problems with datasets that are significantly smaller than those +required for learning from scratch. Such models are typically trained on large +and diverse datasets with weak supervision, consuming much more training data +than is available for any individual downstream application. In this paper, we +describe the Visual Navigation Transformer (ViNT), a foundation model that aims +to bring the success of general-purpose pre-trained models to vision-based +robotic navigation. ViNT is trained with a general goal-reaching objective that +can be used with any navigation dataset, and employs a flexible +Transformer-based architecture to learn navigational affordances and enable +efficient adaptation to a variety of downstream navigational tasks. ViNT is +trained on a number of existing navigation datasets, comprising hundreds of +hours of robotic navigation from a variety of different robotic platforms, and +exhibits positive transfer, outperforming specialist models trained on singular +datasets. ViNT can be augmented with diffusion-based subgoal proposals to +explore novel environments, and can solve kilometer-scale navigation problems +when equipped with long-range heuristics. ViNT can also be adapted to novel +task specifications with a technique inspired by prompt-tuning, where the goal +encoder is replaced by an encoding of another task modality (e.g., GPS +waypoints or routing commands) embedded into the same space of goal tokens. +This flexibility and ability to accommodate a variety of downstream problem +domains establishes ViNT as an effective foundation model for mobile robotics. +For videos, code, and model checkpoints, see our project page at +https://visualnav-transformer.github.io. + +
+
+ comment: Accepted for oral presentation at CoRL 2023 +
+
+
+
+
+ + ♻ ☆ State Regularized Policy Optimization on Data with Dynamics Shift NeurIPS 2023 + + +
+ In many real-world scenarios, Reinforcement Learning (RL) algorithms are +trained on data with dynamics shift, i.e., with different underlying +environment dynamics. A majority of current methods address this issue by +training context encoders to identify environment parameters. Data with +dynamics shift are separated according to their environment parameters to train +the corresponding policy. However, these methods can be sample inefficient as +data are used \textit{ad hoc}, and policies trained for one dynamics cannot +benefit from data collected in all other environments with different dynamics. +In this paper, we find that in many environments with similar structures and +different dynamics, optimal policies have similar stationary state +distributions. We exploit this property and learn the stationary state +distribution from data with dynamics shift for efficient data reuse. This +distribution is used to regularize the policy trained in a new environment, +leading to the SRPO (\textbf{S}tate \textbf{R}egularized \textbf{P}olicy +\textbf{O}ptimization) algorithm. To conduct theoretical analyses, the +intuition of similar environment structures is characterized by the notion of +homomorphous MDPs. We then demonstrate a lower-bound performance guarantee on +policies regularized by the stationary state distribution. In practice, SRPO +can be an add-on module to context-based algorithms in both online and offline +RL settings. Experimental results show that SRPO can make several context-based +algorithms far more data efficient and significantly improve their overall +performance. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ How to Fine-tune the Model: Unified Model Shift and Model Bias Policy + Optimization + + +
+ Designing and deriving effective model-based reinforcement learning (MBRL) +algorithms with a performance improvement guarantee is challenging, mainly +because of the tight coupling between model learning and policy optimization. +Many prior methods that rely on return discrepancy to guide model learning +ignore the impacts of model shift, which can lead to performance deterioration +due to excessive model updates. Other methods use a performance difference bound +to explicitly consider model shift. However, these methods rely on a fixed +threshold to constrain model shift, resulting in a heavy dependence on the +threshold and a lack of adaptability during the training process. In this +paper, we theoretically derive an optimization objective that can unify model +shift and model bias and then formulate a fine-tuning process. This process +adaptively adjusts the model updates to get a performance improvement guarantee +while avoiding model overfitting. Based on these, we develop a straightforward +algorithm USB-PO (Unified model Shift and model Bias Policy Optimization). +Empirical results show that USB-PO achieves state-of-the-art performance on +several challenging benchmark tasks. + +
+
+
+
+
+ + ♻ ☆ Content-Based Search for Deep Generative Models + + +
+ The growing proliferation of customized and pretrained generative models has +made it infeasible for a user to be fully cognizant of every model in +existence. To address this need, we introduce the task of content-based model +search: given a query and a large set of generative models, finding the models +that best match the query. As each generative model produces a distribution of +images, we formulate the search task as an optimization problem to select the +model with the highest probability of generating similar content as the query. +We introduce a formulation to approximate this probability given the query from +different modalities, e.g., image, sketch, and text. Furthermore, we propose a +contrastive learning framework for model retrieval, which learns to adapt +features for various query modalities. We demonstrate that our method +outperforms several baselines on Generative Model Zoo, a new benchmark we +create for the model retrieval task. + +
+
+ comment: Our project page is hosted at + https://generative-intelligence-lab.github.io/modelverse/ +
+
+
+
+
+ + ♻ ☆ Randomized Forward Mode of Automatic Differentiation for Optimization + Algorithms + + +
+ Backpropagation within neural networks leverages a fundamental element of +automatic differentiation, which is referred to as reverse-mode +differentiation, or the vector-Jacobian product (VJP), or, in the context of +differential geometry, the pull-back process. Computing the +gradient is important because neural network parameters are updated using the +gradient descent method. In this study, we present a generic randomized method, +which updates the parameters of neural networks by using directional +derivatives of loss functions computed efficiently by using forward-mode AD or +the Jacobian-vector product (JVP). These JVPs are computed along random +directions sampled from different probability distributions, e.g., Bernoulli, +Normal, Wigner, Laplace and Uniform distributions. The computation of the gradient +is performed during the forward pass of the neural network. We also present a +rigorous analysis of the presented methods, providing the rate of convergence +along with computational experiments in scientific machine +learning, in particular physics-informed neural networks and Deep Operator +Networks. + +
+
+ comment: 23 Pages, 8 Figures +
+
+
+
+
+ + ♻ ☆ Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and + Exp-Concave Games with Gradient Feedback + + +
+ Online gradient descent (OGD) is well known to be doubly optimal under strong +convexity or monotonicity assumptions: (1) in the single-agent setting, it +achieves an optimal regret of $\Theta(\log T)$ for strongly convex cost +functions; and (2) in the multi-agent setting of strongly monotone games, with +each agent employing OGD, we obtain last-iterate convergence of the joint +action to a unique Nash equilibrium at an optimal rate of +$\Theta(\frac{1}{T})$. While these finite-time guarantees highlight its merits, +OGD has the drawback that it requires knowing the strong convexity/monotonicity +parameters. In this paper, we design a fully adaptive OGD algorithm, +\textsf{AdaOGD}, that does not require a priori knowledge of these parameters. +In the single-agent setting, our algorithm achieves $O(\log^2(T))$ regret under +strong convexity, which is optimal up to a log factor. Further, if each agent +employs \textsf{AdaOGD} in strongly monotone games, the joint action converges +in a last-iterate sense to a unique Nash equilibrium at a rate of +$O(\frac{\log^3 T}{T})$, again optimal up to log factors. We illustrate our +algorithms in a learning version of the classical newsvendor problem, where due +to lost sales, only (noisy) gradient feedback can be observed. Our results +immediately yield the first feasible and near-optimal algorithm for both the +single-retailer and multi-retailer settings. We also extend our results to the +more general setting of exp-concave cost functions and games, using the online +Newton step (ONS) algorithm. + +
+
+ comment: Accepted by Operations Research; 47 pages +
+
+
+
+
+ + ♻ ☆ Towards A Unified View of Sparse Feed-Forward Network in Pretraining + Large Language Model EMNLP 2023 + + +
+ Large and sparse feed-forward layers (S-FFN) such as Mixture-of-Experts (MoE) +have proven effective in scaling up Transformers model size for +\textit{pretraining} large language models. By only activating part of the FFN +parameters conditioning on input, S-FFN improves generalization performance +while keeping training and inference costs (in FLOPs) fixed. In this work, we +analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) +size and the memory block selection method under a general conceptual framework +of sparse neural memory. Using this unified framework, we compare several S-FFN +architectures for language modeling and provide insights into their relative +efficacy and efficiency. We found a simpler selection method -- +\textbf{\texttt{Avg-K}} that selects blocks through their mean aggregated +hidden states, achieving lower perplexity in language model pretraining +compared to existing MoE architectures including Switch Transformer (Fedus et +al., 2021) and HashLayer (Roller et al., 2021). + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+
+
+
+ + Multimedia 4 + +
+
+
+ + ☆ RecipeMeta: Metapath-enhanced Recipe Recommendation on Heterogeneous + Recipe Network + + +
+ A recipe is a set of instructions that describes how to make food. It guides +people through the preparation of ingredients, the cooking process, and so on, +and recipes are increasingly in demand on the Web. To help users navigate the +vast number of recipes on the Web, we address the task of recipe +recommendation. Because a recipe involves multiple data types and relationships, we +can treat it as a heterogeneous network to describe its information more +accurately. To utilize the heterogeneous network effectively, metapaths were +proposed to describe the higher-level semantic information between two entities +by defining a compound path from peer entities. Therefore, we propose a +metapath-enhanced recipe recommendation framework, RecipeMeta, that combines +GNN (Graph Neural Network)-based representation learning and specific +metapath-based information in a recipe to predict User-Recipe pairs for +recommendation. Through extensive experiments, we demonstrate that the proposed +model, RecipeMeta, outperforms state-of-the-art methods for recipe +recommendation. + +
+
+
+
+
+ + ♻ ☆ A Unified Framework for Modality-Agnostic Deepfakes Detection + + +
+ As AI-generated content (AIGC) thrives, deepfakes have expanded from +single-modality falsification to cross-modal fake content creation, where +either audio or visual components can be manipulated. While using two unimodal +detectors can detect audio-visual deepfakes, cross-modal forgery clues could be +overlooked. Existing multimodal deepfake detection methods typically establish +correspondence between the audio and visual modalities for binary real/fake +classification, and require the co-occurrence of both modalities. However, in +real-world multimodal applications, missing-modality scenarios may occur where +either modality is unavailable. In such cases, audio-visual detection methods +are less practical than two independent unimodal methods. Moreover, the +detector cannot always obtain the number or type of manipulated modalities +beforehand, necessitating a fake-modality-agnostic audio-visual detector. In +this work, we introduce a comprehensive framework that is agnostic to fake +modalities, which facilitates the identification of multimodal deepfakes and +handles situations with missing modalities, regardless of the manipulations +embedded in audio, video, or even cross-modal forms. To enhance the modeling of +cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as +a preliminary task. This efficiently extracts speech correlations across +modalities, a feature challenging for deepfakes to replicate. Additionally, we +propose a dual-label detection approach that follows the structure of AVSR to +support the independent detection of each modality. Extensive experiments on +three audio-visual datasets show that our scheme outperforms state-of-the-art +detection methods with promising performance on modality-agnostic audio/video +deepfakes. + +
+
+ comment: This work has been submitted to the IEEE for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ♻ ☆ DiffKendall: A Novel Approach for Few-Shot Learning with Differentiable + Kendall's Rank Correlation NeurIPS 2023 + + +
+ Few-shot learning aims to adapt models trained on the base dataset to novel +tasks where the categories were not seen by the model before. This often leads +to a relatively uniform distribution of feature values across channels on novel +classes, posing challenges in determining channel importance for novel tasks. +Standard few-shot learning methods employ geometric similarity metrics such as +cosine similarity and negative Euclidean distance to gauge the semantic +relatedness between two features. However, features with high geometric +similarities may carry distinct semantics, especially in the context of +few-shot learning. In this paper, we demonstrate that the importance ranking of +feature channels is a more reliable indicator for few-shot learning than +geometric similarity metrics. We observe that replacing the geometric +similarity metric with Kendall's rank correlation only during inference is able +to improve the performance of few-shot learning across a wide range of methods +and datasets with different domains. Furthermore, we propose a carefully +designed differentiable loss for meta-training to address the +non-differentiability issue of Kendall's rank correlation. By replacing +geometric similarity with differentiable Kendall's rank correlation, our method +can integrate with numerous existing few-shot approaches and is ready for +integrating with future state-of-the-art methods that rely on geometric +similarity metrics. Extensive experiments validate the efficacy of the +rank-correlation-based approach, showcasing a significant improvement in +few-shot learning. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Semantic Change Driven Generative Semantic Communication Framework + + +
+ The burgeoning generative artificial intelligence technology offers novel +insights into the development of semantic communication (SemCom) frameworks. +These frameworks hold the potential to address the challenges associated with +the black-box nature inherent in the end-to-end training of +existing SemCom frameworks, as well as the deterioration of the user experience +caused by the inevitable error floor in deep learning-based SemCom. In this +paper, we focus on the widespread remote monitoring scenario, and propose a +semantic change driven generative SemCom framework. Therein, the semantic +encoder and semantic decoder can be optimized independently. Specifically, we +develop a modular semantic encoder with a value-of-information-based semantic +sampling function. In addition, we propose a conditional denoising diffusion +probabilistic model-assisted semantic decoder that relies on received semantic +information from the source, namely, the semantic map, and the local static +scene information to remotely regenerate scenes. Moreover, we demonstrate the +effectiveness of the proposed semantic encoder and decoder as well as the +considerable potential in reducing energy consumption through simulation based +on the realistic $\mathcal{F}$ composite channel fading model. The code is +available at https://github.com/wty2011jl/SCDGSC.git. + +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 150 + +
+
+
+ + ☆ Large Language Models are Visual Reasoning Coordinators NeurIPS 2023 + + +
+ Visual reasoning requires multimodal perception and commonsense cognition of +the world. Recently, multiple vision-language models (VLMs) have been proposed +with excellent commonsense reasoning ability in various domains. However, how +to harness the collective power of these complementary VLMs is rarely explored. +Existing methods like ensemble still struggle to aggregate these models with +the desired higher-order communications. In this work, we propose Cola, a novel +paradigm that coordinates multiple VLMs for visual reasoning. Our key insight +is that a large language model (LLM) can efficiently coordinate multiple VLMs +by facilitating natural language communication that leverages their distinct +and complementary capabilities. Extensive experiments demonstrate that our +instruction tuning variant, Cola-FT, achieves state-of-the-art performance on +visual question answering (VQA), outside knowledge VQA, visual entailment, and +visual spatial reasoning tasks. Moreover, we show that our in-context learning +variant, Cola-Zero, exhibits competitive performance in zero and few-shot +settings, without finetuning. Through systematic ablation studies and +visualizations, we validate that a coordinator LLM indeed comprehends the +instruction prompts as well as the separate functionalities of VLMs; it then +coordinates them to enable impressive visual reasoning capabilities. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ LINC: A Neurosymbolic Approach for Logical Reasoning by Combining + Language Models with First-Order Logic Provers + + +
+ Logical reasoning, i.e., deductively inferring the truth value of a +conclusion from a set of premises, is an important task for artificial +intelligence with wide potential impacts on science, mathematics, and society. +While many prompting-based strategies have been proposed to enable Large +Language Models (LLMs) to do such reasoning more effectively, they still appear +unsatisfactory, often failing in subtle and unpredictable ways. In this work, +we investigate the validity of instead reformulating such tasks as modular +neurosymbolic programming, which we call LINC: Logical Inference via +Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, +translating premises and conclusions from natural language to expressions in +first-order logic. These expressions are then offloaded to an external theorem +prover, which symbolically performs deductive inference. Leveraging this +approach, we observe significant performance gains on FOLIO and a balanced +subset of ProofWriter for three different models in nearly all experimental +conditions we evaluate. On ProofWriter, augmenting the comparatively small +open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 +and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, +respectively. When used with GPT-4, LINC scores 26% higher than CoT on +ProofWriter while performing comparatively on FOLIO. Further analysis reveals +that although both methods on average succeed roughly equally often on this +dataset, they exhibit distinct and complementary failure modes. We thus provide +promising evidence for how logical reasoning over natural language can be +tackled through jointly leveraging LLMs alongside symbolic provers. All +corresponding code is publicly available at https://github.com/benlipkin/linc + +
+
+
+
+
+ + ☆ Linear Representations of Sentiment in Large Language Models + + +
+ Sentiment is a pervasive feature in natural language text, yet it is an open +question how sentiment is represented within Large Language Models (LLMs). In +this study, we reveal that across a range of models, sentiment is represented +linearly: a single direction in activation space mostly captures the feature +across a range of tasks with one extreme for positive and the other for +negative. Through causal interventions, we isolate this direction and show it +is causally relevant in both toy tasks and real world datasets such as Stanford +Sentiment Treebank. Through this case study we model a thorough investigation +of what a single direction means on a broad data distribution. + We further uncover the mechanisms that involve this direction, highlighting +the roles of a small subset of attention heads and neurons. Finally, we +discover a phenomenon which we term the summarization motif: sentiment is not +solely represented on emotionally charged words, but is additionally summarized +at intermediate positions without inherent sentiment, such as punctuation and +names. We show that in Stanford Sentiment Treebank zero-shot classification, +76% of above-chance classification accuracy is lost when ablating the sentiment +direction, nearly half of which (36%) is due to ablating the summarized +sentiment direction exclusively at comma positions. + +
+
+
+
+
+ + ☆ Verb Conjugation in Transformers Is Determined by Linear Encodings of + Subject Number EMNLP 2023 + + +
+ Deep architectures such as Transformers are sometimes criticized for having +uninterpretable "black-box" representations. We use causal intervention +analysis to show that, in fact, some linguistic features are represented in a +linear, interpretable format. Specifically, we show that BERT's ability to +conjugate verbs relies on a linear encoding of subject number that can be +manipulated with predictable effects on conjugation accuracy. This encoding is +found in the subject position at the first layer and the verb position at the +last layer, but distributed across positions at middle layers, particularly +when there are multiple cues to subject number. + +
+
+ comment: To appear in Findings of the Association for Computational + Linguistics: EMNLP 2023 +
+
+
+
+
+ + ☆ S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large + Language Models + + +
+ The rapid development of Large Language Models (LLMs) has led to great +strides in model capabilities like reasoning and long-context understanding. +However, as LLMs are able to process longer contexts, it becomes more +challenging to evaluate whether they have acquired certain capabilities, since +the length of text (e.g., 100K tokens) they can process far exceeds what humans +can reliably assess in a reasonable duration. In this paper, we propose using +complex synthetic tasks as a proxy evaluation method, and present S3Eval, a +Synthetic, Scalable, Systematic evaluation suite for LLMs. As a +synthetic benchmark, S3Eval enables the creation of any number of evaluation +examples that are theoretically invisible to LLMs, mitigating the test set +contamination issue. The synthetic nature of S3Eval provides users full control +over the dataset, allowing them to systematically probe LLM capabilities by +scaling text length and varying task difficulty across diverse scenarios. The +strong correlation between S3Eval performance and scores of real-world +benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval +for evaluation of LLMs. The in-depth analysis also uncovers additional insights, +including a performance drop when the answer is sparsely distributed or located +in the middle of the context, as well as some counter-intuitive trends of model +performance. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ☆ SpecTr: Fast Speculative Decoding via Optimal Transport + + +
+ Autoregressive sampling from large language models has led to +state-of-the-art results in several natural language tasks. However, +autoregressive sampling generates tokens one at a time, making it slow, and even +prohibitive in certain tasks. One way to speed up sampling is +$\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ +(block or sequence of tokens), and then score all tokens in the draft by the +large language model in parallel. A subset of the tokens in the draft are +accepted (and the rest rejected) based on a statistical method to guarantee +that the final output follows the distribution of the large model. In this +work, we provide a principled understanding of speculative decoding through the +lens of optimal transport (OT) with $\textit{membership cost}$. This framework +can be viewed as an extension of the well-known $\textit{maximal-coupling}$ +problem. This new formulation enables us to generalize the speculative decoding +method to allow for a set of $k$ candidates at the token-level, which leads to +an improved optimal membership cost. We show that the optimal draft selection +algorithm (transport plan) can be computed via linear programming, whose +best-known runtime is exponential in $k$. We then propose a valid draft +selection algorithm whose acceptance probability is $(1-1/e)$-optimal +multiplicatively. Moreover, it can be computed in time almost linear in the size +of the domain of a single token. Using this new draft selection algorithm, we +develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which +provides speedup in decoding while ensuring that there is no quality +degradation in the decoded output. We experimentally demonstrate that for +state-of-the-art large language models, the proposed approach achieves a wall +clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on +standard benchmarks. + +
+
+
+
+
+ + ☆ AutoDAN: Automatic and Interpretable Adversarial Attacks on Large + Language Models + + +
+ Safety alignment of Large Language Models (LLMs) can be compromised with +manual jailbreak attacks and (automatic) adversarial attacks. Recent work +suggests that patching LLMs against these attacks is possible: manual jailbreak +attacks are human-readable but often limited and public, making them easy to +block; adversarial attacks generate gibberish prompts that can be detected +using perplexity-based filters. In this paper, we show that these solutions may +be too optimistic. We propose an interpretable adversarial attack, +\texttt{AutoDAN}, that combines the strengths of both types of attacks. It +automatically generates attack prompts that bypass perplexity-based filters +while maintaining a high attack success rate like manual jailbreak attacks. +These prompts are interpretable and diverse, exhibiting strategies commonly +used in manual jailbreak attacks, and transfer better than their non-readable +counterparts when using limited training data or a single proxy model. We also +customize \texttt{AutoDAN}'s objective to leak system prompts, another +jailbreak application not addressed in the adversarial attack literature, +demonstrating the versatility of the approach. Our work provides a new way to +red-team LLMs and to understand the mechanism of jailbreak attacks. + +
+
+
+
+
+ + ☆ Quantifying the Dialect Gap and its Correlates Across Languages EMNLP + + +
+ Historically, researchers and consumers have noticed a decrease in quality +when applying NLP tools to minority variants of languages (e.g., Puerto Rican +Spanish or Swiss German), but studies exploring this have been limited to a +select few languages. Additionally, past studies have mainly been conducted in +a monolingual context, so cross-linguistic trends have not been identified and +tied to external factors. In this work, we conduct a comprehensive evaluation +of the most influential, state-of-the-art large language models (LLMs) across +two high-use applications, machine translation and automatic speech +recognition, to assess their functionality on the regional dialects of several +high- and low-resource languages. Additionally, we analyze how the regional +dialect gap is correlated with economic, social, and linguistic factors. The +impact of training data, including related factors like dataset size and its +construction procedure, is shown to be significant but not consistent across +models or languages, meaning a one-size-fits-all approach cannot be taken in +solving the dialect gap. This work will lay the foundation for furthering the +field of dialectal NLP by laying out evident disparities and identifying +possible pathways for addressing them through mindful data collection. + +
+
+ comment: Accepted to EMNLP Findings 2023 +
+
+
+
+
+ + ☆ Location-Aware Visual Question Generation with Lightweight Models EMNLP 2023 + + +
+ This work introduces a novel task, location-aware visual question generation +(LocaVQG), which aims to generate engaging questions from data relevant to a +particular geographical location. Specifically, we represent such +location-aware information with surrounding images and a GPS coordinate. To +tackle this task, we present a dataset generation pipeline that leverages GPT-4 +to produce diverse and sophisticated questions. Then, we aim to learn a +lightweight model that can address the LocaVQG task and fit on an edge device, +such as a mobile phone. To this end, we propose a method which can reliably +generate engaging questions from location-aware information. Our proposed +method outperforms baselines regarding human evaluation (e.g., engagement, +grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, +ROUGE-2). Moreover, we conduct extensive ablation studies to justify our +proposed techniques for both generating the dataset and solving the task. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Open-Ended Instructable Embodied Agents with Memory-Augmented Large + Language Models + + +
+ Pre-trained and frozen LLMs can effectively map simple scene re-arrangement +instructions to programs over a robot's visuomotor functions through +appropriate few-shot example prompting. To parse open-domain natural language +and adapt to a user's idiosyncratic procedures, not known during prompt +engineering time, fixed prompts fall short. In this paper, we introduce HELPER, +an embodied agent equipped with an external memory of language-program pairs +that parses free-form human-robot dialogue into action programs through +retrieval-augmented LLM prompting: relevant memories are retrieved based on the +current dialogue, instruction, correction or VLM description, and used as +in-context prompt examples for LLM querying. The memory is expanded during +deployment to include pairs of user's language and action plans, to assist +future inferences and personalize them to the user's language and routines. +HELPER sets a new state-of-the-art in the TEACh benchmark in both Execution +from Dialog History (EDH) and Trajectory from Dialogue (TfD), with 1.7x +improvement over the previous SOTA for TfD. Our models, code and video results +can be found in our project's website: https://helper-agent-llm.github.io. + +
+
+ comment: https://helper-agent-llm.github.io +
+
+
+
+
+ + ☆ Branch-Solve-Merge Improves Large Language Model Evaluation and + Generation + + +
+ Large Language Models (LLMs) are frequently used for multi-faceted language +generation and evaluation tasks that involve satisfying intricate user +constraints or taking into account multiple aspects and criteria. However, +their performance can fall short, due to the model's lack of coherence and +inability to plan and decompose the problem. We propose Branch-Solve-Merge +(BSM), a Large Language Model program (Schlag et al., 2023) for tackling such +challenging natural language tasks. It consists of branch, solve, and merge +modules that are parameterized with specific prompts to the base LLM. These +three modules plan a decomposition of the task into multiple parallel +sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. +We apply our method to the tasks of LLM response evaluation and constrained +text generation and evaluate its effectiveness with multiple LLMs, including +Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and +consistency for each LLM by enhancing human-LLM agreement by up to 26%, +reducing length and pairwise position biases by up to 50%, and allowing +LLaMA-2-chat to match or outperform GPT-4 on most domains. On the constraint +story generation task, BSM improves the coherence of the stories while also +improving constraint satisfaction by 12%. + +
+
+ comment: 22 pages, 7 figures, 10 tables +
+
+
+
+
+ + ☆ Causal Inference Using LLM-Guided Discovery + + +
+ At the core of causal inference lies the challenge of determining reliable +causal graphs solely based on observational data. Since the well-known backdoor +criterion depends on the graph, any errors in the graph can propagate +downstream to effect inference. In this work, we initially show that complete +graph information is not necessary for causal effect inference; the topological +order over graph variables (causal order) alone suffices. Further, given a node +pair, causal order is easier to elicit from domain experts compared to graph +edges since determining the existence of an edge can depend extensively on +other variables. Interestingly, we find that the same principle holds for Large +Language Models (LLMs) such as GPT-3.5-turbo and GPT-4, motivating an automated +method to obtain causal order (and hence causal effect) with LLMs acting as +virtual domain experts. To this end, we employ different prompting strategies +and contextual cues to propose a robust technique of obtaining causal order +from LLMs. Acknowledging LLMs' limitations, we also study possible techniques +to integrate LLMs with established causal discovery algorithms, including +constraint-based and score-based methods, to enhance their performance. +Extensive experiments demonstrate that our approach significantly improves +causal ordering accuracy as compared to discovery algorithms, highlighting the +potential of LLMs to enhance causal inference across diverse fields. + +
+
+
+
+
+ + ☆ How To Build Competitive Multi-gender Speech Translation Models For + Controlling Speaker Gender Translation + + +
+ When translating from notional gender languages (e.g., English) into +grammatical gender languages (e.g., Italian), the generated translation +requires explicit gender assignments for various words, including those +referring to the speaker. When the source sentence does not convey the +speaker's gender, speech translation (ST) models either rely on the +possibly-misleading vocal traits of the speaker or default to the masculine +gender, the most frequent in existing training corpora. To avoid such biased +and not inclusive behaviors, the gender assignment of speaker-related +expressions should be guided by externally-provided metadata about the +speaker's gender. While previous work has shown that the most effective +solution is represented by separate, dedicated gender-specific models, the goal +of this paper is to achieve the same results by integrating the speaker's +gender metadata into a single "multi-gender" neural ST model, easier to +maintain. Our experiments demonstrate that a single multi-gender model +outperforms gender-specialized ones when trained from scratch (with gender +accuracy gains up to 12.9 for feminine forms), while fine-tuning from existing +ST models does not lead to competitive results. + +
+
+ comment: To appear in CLiC-it 2023 +
+
+
+
+
+ + ☆ Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into + the Morphological Capabilities of a Large Language Model EMNLP 2023 + + +
+ Large language models (LLMs) have recently reached an impressive level of +linguistic capability, prompting comparisons with human language skills. +However, there have been relatively few systematic inquiries into the +linguistic capabilities of the latest generation of LLMs, and those studies +that do exist (i) ignore the remarkable ability of humans to generalize, (ii) +focus only on English, and (iii) investigate syntax or semantics and overlook +other capabilities that lie at the heart of human language, like morphology. +Here, we close these gaps by conducting the first rigorous analysis of the +morphological capabilities of ChatGPT in four typologically varied languages +(specifically, English, German, Tamil, and Turkish). We apply a version of +Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for +the four examined languages. We find that ChatGPT massively underperforms +purpose-built systems, particularly in English. Overall, our results -- through +the lens of morphology -- cast a new light on the linguistic capabilities of +ChatGPT, suggesting that claims of human-like language skills are premature and +misleading. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ GRENADE: Graph-Centric Language Model for Self-Supervised Representation + Learning on Text-Attributed Graphs EMNLP 2023 + + +
+ Self-supervised representation learning on text-attributed graphs, which aims +to create expressive and generalizable representations for various downstream +tasks, has received increasing research attention lately. However, existing +methods either struggle to capture the full extent of structural context +information or rely on task-specific training labels, which largely hampers +their effectiveness and generalizability in practice. To solve the problem of +self-supervised representation learning on text-attributed graphs, we develop a +novel Graph-Centric Language model -- GRENADE. Specifically, GRENADE exploits +the synergistic effect of both pre-trained language model and graph neural +network by optimizing with two specialized self-supervised learning algorithms: +graph-centric contrastive learning and graph-centric knowledge alignment. The +proposed graph-centric self-supervised learning algorithms effectively help +GRENADE to capture informative textual semantics as well as structural context +information on text-attributed graphs. Through extensive experiments, GRENADE +shows its superiority over state-of-the-art methods. Implementation is +available at https://github.com/bigheiniu/GRENADE. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis EMNLP 2023 + + +
+ Thematic analysis (TA) has been widely used for analyzing qualitative data in +many disciplines and fields. To ensure reliable analysis, the same piece of +data is typically assigned to at least two human coders. Moreover, to produce +meaningful and useful analysis, human coders develop and deepen their data +interpretation and coding over multiple iterations, making TA labor-intensive +and time-consuming. Recently, the emerging field of large language model (LLM) +research has shown that LLMs have the potential to replicate human-like behavior +in various tasks: in particular, LLMs outperform crowd workers on +text-annotation tasks, suggesting an opportunity to leverage LLMs on TA. We +propose a human-LLM collaboration framework (i.e., LLM-in-the-loop) to conduct +TA with in-context learning (ICL). This framework provides the prompt to frame +discussions with an LLM (e.g., GPT-3.5) to generate the final codebook for TA. +We demonstrate the utility of this framework using survey datasets on the +aspects of the music listening experience and the usage of a password manager. +Results of the two case studies show that the proposed framework yields similar +coding quality to that of human coders but reduces TA's labor and time demands. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Federated Learning of Large Language Models with Parameter-Efficient + Prompt Tuning and Adaptive Optimization + + +
+ Federated learning (FL) is a promising paradigm to enable collaborative model +training with decentralized data. However, the training process of Large +Language Models (LLMs) generally incurs the update of significant parameters, +which limits the applicability of FL techniques to tackle the LLMs in real +scenarios. Prompt tuning can significantly reduce the number of parameters to +update, but it either incurs performance degradation or low training +efficiency. The straightforward utilization of prompt tuning in the FL often +raises non-trivial communication costs and dramatically degrades performance. +In addition, the decentralized data is generally non-Independent and +Identically Distributed (non-IID), which brings client drift problems and thus +poor performance. This paper proposes a Parameter-efficient prompt Tuning +approach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and +effective FL of LLMs. First, an efficient partial prompt tuning approach is +proposed to improve performance and efficiency simultaneously. Second, a novel +adaptive optimization method is developed to address the client drift problems +on both the device and server sides to enhance performance further. Extensive +experiments based on 10 datasets demonstrate the superb performance (up to +60.8% in terms of accuracy) and efficiency (up to 97.59% in terms of training +time) of FedPepTAO compared with 9 baseline approaches. Our code is available +at https://github.com/llm-eff/FedPepTAO. + +
+
+
+
+
+ + ☆ Affective and Dynamic Beam Search for Story Generation EMNLP + + +
+ Storytelling's captivating potential makes it a fascinating research area, +with implications for entertainment, education, therapy, and cognitive studies. +In this paper, we propose Affective Story Generator (AffGen) for generating +interesting narratives. AffGen introduces "intriguing twists" in narratives by +employing two novel techniques: Dynamic Beam Sizing and Affective Reranking. +Dynamic Beam Sizing encourages less predictable, more captivating word choices +using a contextual multi-arm bandit model. Affective Reranking prioritizes +sentence candidates based on affect intensity. Our empirical evaluations, both +automatic and human, demonstrate AffGen's superior performance over existing +baselines in generating affectively charged and interesting narratives. Our +ablation study and analysis provide insights into the strengths and weaknesses +of AffGen. + +
+
+ comment: Accepted at EMNLP-findings 2023 +
+
+
+
+
+ + ☆ 'Don't Get Too Technical with Me': A Discourse Structure-Based Framework + for Science Journalism EMNLP 2023 + + +
+ Science journalism refers to the task of reporting technical findings of a +scientific paper as a less technical news article to the general public +audience. We aim to design an automated system to support this real-world task +(i.e., automatic science journalism) by 1) introducing a newly-constructed and +real-world dataset (SciTechNews), with tuples of a publicly-available +scientific paper, its corresponding news article, and an expert-written short +summary snippet; 2) proposing a novel technical framework that integrates a +paper's discourse structure with its metadata to guide generation; and, 3) +demonstrating with extensive automatic and human experiments that our framework +outperforms other baseline methods (e.g. Alpaca and ChatGPT) in elaborating a +content plan meaningful for the target audience, simplifying the information +selected, and producing a coherent final report in a layman's style. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ TableQAKit: A Comprehensive and Practical Toolkit for Table-based + Question Answering + + +
+ Table-based question answering (TableQA) is an important task in natural +language processing, which requires comprehending tables and employing various +reasoning methods to answer the questions. This paper introduces TableQAKit, the +first comprehensive toolkit designed specifically for TableQA. The toolkit +provides a unified platform that includes plentiful TableQA datasets and +integrates popular methods of this task as well as large language models +(LLMs). Users can add their datasets and methods through the friendly +interface. We were also pleasantly surprised to find that using the modules in this toolkit +achieves new SOTA on some datasets. Finally, TableQAKit also provides an +LLM-based TableQA Benchmark for evaluating the role of LLMs in TableQA. +TableQAKit is open-source with an interactive interface that includes visual +operations, and comprehensive data for ease of use. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ☆ Localizing Active Objects from Egocentric Vision with Symbolic World + Knowledge EMNLP + + +
+ The ability to actively ground task instructions from an egocentric view is +crucial for AI agents to accomplish tasks or assist humans virtually. One +important step towards this goal is to localize and track key active objects +that undergo major state change as a consequence of human actions/interactions +with the environment without being told exactly what/where to ground (e.g., +localizing and tracking the `sponge` in video from the instruction "Dip the +`sponge` into the bucket."). While existing works approach this problem from a +pure vision perspective, we investigate to what extent the textual modality +(i.e., task instructions) and their interaction with visual modality can be +beneficial. Specifically, we propose to improve phrase grounding models' +ability to localize the active objects by: (1) learning the role of `objects +undergoing change` and extracting them accurately from the instructions, (2) +leveraging pre- and post-conditions of the objects during actions, and (3) +recognizing the objects more robustly with descriptional knowledge. We leverage +large language models (LLMs) to extract the aforementioned action-object +knowledge, and design a per-object aggregation masking technique to effectively +perform joint inference on object phrases and symbolic knowledge. We evaluate +our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments +demonstrate the effectiveness of our proposed framework, which leads to >54% +improvements in all standard metrics on the TREK-150-OPE-Det localization + +tracking task, >7% improvements in all standard metrics on the TREK-150-OPE +tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD +task. + +
+
+ comment: In Proceedings of the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP) +
+
+
+
+
+ + ☆ The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained + Multimodal Models EMNLP 2023 + + +
+ Despite the impressive performance achieved by pre-trained +language-and-vision models in downstream tasks, it remains an open question +whether this reflects a proper understanding of image-text interaction. In this +work, we explore to what extent they handle basic linguistic constructions -- +active-passive voice, coordination, and relative clauses -- that even preschool +children can typically master. We present BLA, a novel, automatically +constructed benchmark to evaluate multimodal models on these Basic Language +Abilities. We show that different types of Transformer-based systems, such as +CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting, +in line with previous findings. Our experiments, in particular, show that most +of the tested models only marginally benefit when fine-tuned or prompted with +construction-specific samples. Yet, the generative BLIP2 shows promising +trends, especially in an in-context learning setting. This opens the door to +using BLA not only as an evaluation benchmark but also to improve models' basic +language abilities. + +
+
+ comment: This is the camera-ready version of the paper that will be published + in the Proceedings of EMNLP 2023 (Singapore, 6-10 December 2023) +
+
+
+
+
+ + ☆ Towards Conceptualization of "Fair Explanation": Disparate Impacts of + anti-Asian Hate Speech Explanations on Content Moderators EMNLP 2023 + + +
+ Recent research at the intersection of AI explainability and fairness has +focused on how explanations can improve human-plus-AI task performance as +assessed by fairness measures. We propose to characterize what constitutes an +explanation that is itself "fair" -- an explanation that does not adversely +impact specific populations. We formulate a novel evaluation method of "fair +explanations" using not just accuracy and label time, but also psychological +impact of explanations on different user groups across many metrics (mental +discomfort, stereotype activation, and perceived workload). We apply this +method in the context of content moderation of potential hate speech, and its +differential impact on Asian vs. non-Asian proxy moderators, across explanation +approaches (saliency map and counterfactual explanation). We find that saliency +maps generally perform better and show less evidence of disparate impact +(group) and individual unfairness than counterfactual explanations. + Content warning: This paper contains examples of hate speech and racially +discriminatory language. The authors do not support such content. Please +consider your risk of discomfort carefully before continuing reading! + +
+
+ comment: EMNLP 2023 Main Conference (Long Paper) +
+
+
+
+
+ + ☆ SLOG: A Structural Generalization Benchmark for Semantic Parsing EMNLP 2023 + + +
+ The goal of compositional generalization benchmarks is to evaluate how well +models generalize to new complex linguistic expressions. Existing benchmarks +often focus on lexical generalization, the interpretation of novel lexical +items in syntactic structures familiar from training; structural generalization +tasks, where a model needs to interpret syntactic structures that are +themselves unfamiliar from training, are often underrepresented, resulting in +overly optimistic perceptions of how well models can generalize. We introduce +SLOG, a semantic parsing dataset that extends COGS (Kim and Linzen, 2020) with +17 structural generalization cases. In our experiments, the generalization +accuracy of Transformer models, including pretrained ones, only reaches 40.6%, +while a structure-aware parser only achieves 70.8%. These results are far from +the near-perfect accuracy existing models achieve on COGS, demonstrating the +role of SLOG in foregrounding the large discrepancy between models' lexical and +structural generalization capacities. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Efficient Data Learning for Open Information Extraction with Pre-trained + Language Models + + +
+ Open Information Extraction (OpenIE) is a fundamental yet challenging task in +Natural Language Processing, which involves extracting all triples (subject, +predicate, object) from a given sentence. While labeling-based methods have +their merits, generation-based techniques offer unique advantages, such as the +ability to generate tokens not present in the original sentence. However, these +generation-based methods often require a significant amount of training data to +learn the task form of OpenIE and substantial training time to overcome slow +model convergence due to the order penalty. In this paper, we introduce a novel +framework, OK-IE, that ingeniously transforms the task form of OpenIE into the +pre-training task form of the T5 model, thereby reducing the need for extensive +training data. Furthermore, we introduce an innovative concept of Anchor to +control the sequence of model outputs, effectively eliminating the impact of +order penalty on model convergence and significantly reducing training time. +Experimental results indicate that, compared to previous SOTA methods, OK-IE +requires only 1/100 of the training data (900 instances) and 1/120 of the +training time (3 minutes) to achieve comparable results. + +
+
+
+
+
+ + ☆ Meta learning with language models: Challenges and opportunities in the + classification of imbalanced text + + +
+ Detecting out of policy speech (OOPS) content is important but difficult. +While machine learning is a powerful tool to tackle this challenging task, it +is hard to break the performance ceiling due to factors like quantity and +quality limitations on training data and inconsistencies in OOPS definition and +data labeling. To realize the full potential of available limited resources, we +propose a meta learning technique (MLT) that combines individual models built +with different text representations. We analytically show that the resulting +technique is numerically stable and produces reasonable combining weights. We +combine the MLT with a threshold-moving (TM) technique to further improve the +performance of the combined predictor on highly-imbalanced in-distribution and +out-of-distribution datasets. We also provide computational results to show the +statistically significant advantages of the proposed MLT approach. + All authors contributed equally to this work. + +
+
+ comment: 22 pages, including 5 figures, 12 tables, 1 appendix +
+
+
+
+
+ + ☆ Statistical Depth for Ranking and Characterizing Transformer-Based Text + Embeddings + + +
+ The popularity of transformer-based text embeddings calls for better +statistical tools for measuring distributions of such embeddings. One such tool +would be a method for ranking texts within a corpus by centrality, i.e. +assigning each text a number signifying how representative that text is of the +corpus as a whole. However, an intrinsic center-outward ordering of +high-dimensional text representations is not trivial. A statistical depth is a +function for ranking k-dimensional objects by measuring centrality with respect +to some observed k-dimensional distribution. We adopt a statistical depth to +measure distributions of transformer-based text embeddings, transformer-based +text embedding (TTE) depth, and introduce the practical use of this depth for +both modeling and distributional inference in NLP pipelines. We first define +TTE depth and an associated rank sum test for determining whether two corpora +differ significantly in embedding space. We then use TTE depth for the task of +in-context learning prompt selection, showing that this approach reliably +improves performance over statistical baseline approaches across six text +classification tasks. Finally, we use TTE depth and the associated rank sum +test to characterize the distributions of synthesized and human-generated +corpora, showing that five recent synthetic data augmentation processes cause a +measurable distributional shift away from associated human-generated text. + +
+
+
+
+
+ + ☆ Did the Neurons Read your Book? Document-level Membership Inference for + Large Language Models + + +
+ With large language models (LLMs) poised to become embedded in our daily +lives, questions are starting to be raised about the dataset(s) they learned +from. These questions range from potential bias or misinformation LLMs could +retain from their training data to questions of copyright and fair use of +human-generated text. However, while these questions emerge, developers of the +recent state-of-the-art LLMs become increasingly reluctant to disclose details +on their training corpus. We here introduce the task of document-level +membership inference for real-world LLMs, i.e. inferring whether the LLM has +seen a given document during training or not. First, we propose a procedure for +the development and evaluation of document-level membership inference for LLMs +by leveraging commonly used data sources for training and the model release +date. We then propose a practical, black-box method to predict document-level +membership and instantiate it on OpenLLaMA-7B with both books and academic +papers. We show our methodology to perform very well, reaching an impressive +AUC of 0.856 for books and 0.678 for papers. We then show our approach to +outperform the sentence-level membership inference attacks used in the privacy +literature for the document-level membership task. We finally evaluate whether +smaller models might be less sensitive to document-level inference and show +OpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach. +Taken together, our results show that accurate document-level membership can be +inferred for LLMs, increasing the transparency of technology poised to change +our lives. + +
+
+
+
+
+ + ☆ When Language Models Fall in Love: Animacy Processing in Transformer + Language Models EMNLP 2023 + + +
+ Animacy - whether an entity is alive and sentient - is fundamental to +cognitive processing, impacting areas such as memory, vision, and language. +However, animacy is not always expressed directly in language: in English it +often manifests indirectly, in the form of selectional constraints on verbs and +adjectives. This poses a potential issue for transformer language models (LMs): +they often train only on text, and thus lack access to extralinguistic +information from which humans learn about animacy. We ask: how does this impact +LMs' animacy processing - do they still behave as humans do? We answer this +question using open-source LMs. Like previous studies, we find that LMs behave +much like humans when presented with entities whose animacy is typical. +However, we also show that even when presented with stories about atypically +animate entities, such as a peanut in love, LMs adapt: they treat these +entities as animate, though they do not adapt as well as humans. Even when the +context indicating atypical animacy is very short, LMs pick up on subtle clues +and change their behavior. We conclude that despite the limited signal through +which LMs can learn about animacy, they are indeed sensitive to the relevant +lexical semantic nuances available in English. + +
+
+ comment: To appear at EMNLP 2023 +
+
+
+
+
+ + ☆ Simple Hardware-Efficient PCFGs with Independent Left and Right + Productions EMNLP + + +
+ Scaling dense PCFGs to thousands of nonterminals via a low-rank +parameterization of the rule probability tensor has been shown to be beneficial +for unsupervised parsing. However, PCFGs scaled this way still perform poorly +as a language model, and even underperform similarly-sized HMMs. This work +introduces SimplePCFG, a simple PCFG formalism with independent left and +right productions. Despite imposing a stronger independence assumption than the +low-rank approach, we find that this formalism scales more effectively both as +a language model and as an unsupervised parser. As an unsupervised parser, our +simple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language +model, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank +PCFGs. We further introduce FlashInside, a hardware IO-aware +implementation of the inside algorithm for efficiently scaling simple PCFGs. + +
+
+ comment: Accepted to Findings of EMNLP, 2023 +
+
+
+
+
+ + ☆ Understanding the Inner Workings of Language Models Through + Representation Dissimilarity EMNLP 2023 + + +
+ As language models are applied to an increasing number of real-world +applications, understanding their inner workings has become an important issue +in model trust, interpretability, and transparency. In this work we show that +representation dissimilarity measures, which are functions that measure the +extent to which two models' internal representations differ, can be a valuable +tool for gaining insight into the mechanics of language models. Among our +insights are: (i) an apparent asymmetry in the internal representations of +models using SoLU and GeLU activation functions, (ii) evidence that +dissimilarity measures can identify and locate generalization properties of +models that are invisible via in-distribution test set performance, and (iii) +new evaluations of how language model features vary as width and depth are +increased. Our results suggest that dissimilarity measures are a promising set +of tools for shedding light on the inner workings of language models. + +
+
+ comment: EMNLP 2023 (main) +
+
+
+
+
+ + ☆ LLM-Based Agent Society Investigation: Collaboration and Confrontation + in Avalon Gameplay + + +
+ This paper aims to investigate the open research problem of uncovering the +social behaviors of LLM-based agents. To achieve this goal, we adopt Avalon, a +representative communication game, as the environment and use system prompts to +guide LLM agents to play the game. While previous studies have conducted +preliminary investigations into gameplay with LLM agents, research +on their social behaviors is lacking. In this paper, we present a novel framework designed +to seamlessly adapt to Avalon gameplay. The core of our proposed framework is a +multi-agent system that enables efficient communication and interaction among +agents. We evaluate the performance of our framework based on metrics from two +perspectives: winning the game and analyzing the social behaviors of LLM +agents. Our results demonstrate the effectiveness of our framework in +generating adaptive and intelligent agents and highlight the potential of +LLM-based agents in addressing the challenges associated with dynamic social +environment interaction. By analyzing the social behaviors of LLM agents from +the aspects of both collaboration and confrontation, we provide insights into +the research and applications of this domain. + +
+
+
+
+
+ + ☆ Fidelity-Enriched Contrastive Search: Reconciling the + Faithfulness-Diversity Trade-Off in Text Generation EMNLP 2023 + + +
+ In this paper, we address the hallucination problem commonly found in natural +language generation tasks. Language models often generate fluent and convincing +content but can lack consistency with the provided source, resulting in +potential inaccuracies. We propose a new decoding method called +Fidelity-Enriched Contrastive Search (FECS), which augments the contrastive +search framework with context-aware regularization terms. FECS promotes tokens +that are semantically similar to the provided source while penalizing +repetitiveness in the generated text. We demonstrate its effectiveness across +two tasks prone to hallucination: abstractive summarization and dialogue +generation. Results show that FECS consistently enhances faithfulness across +various language model sizes while maintaining output diversity comparable to +well-performing decoding algorithms. + +
+
+ comment: Accepted as a short paper at EMNLP 2023 +
+
+
+
+
+ + ☆ ACTOR: Active Learning with Annotator-specific Classification Heads to + Embrace Human Label Variation EMNLP 2023 + + +
+ Label aggregation such as majority voting is commonly used to resolve +annotator disagreement in dataset creation. However, this may disregard +minority values and opinions. Recent studies indicate that learning from +individual annotations outperforms learning from aggregated labels, though they +require a considerable amount of annotation. Active learning, as an annotation +cost-saving strategy, has not been fully explored in the context of learning +from disagreement. We show that in the active learning setting, a multi-head +model performs significantly better than a single-head model in terms of +uncertainty estimation. By designing and evaluating acquisition functions with +annotator-specific heads on two datasets, we show that group-level entropy +works generally well on both datasets. Importantly, it achieves performance in +terms of both prediction and uncertainty estimation comparable to full-scale +training from disagreement, while saving up to 70% of the annotation budget. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ☆ Penalty Decoding: Well Suppress the Self-Reinforcement Effect in + Open-Ended Text Generation EMNLP2023 + + +
+ The decoding algorithm is critical for open-ended text generation, +transforming latent representations into coherent and meaningful outputs. This +paper investigates the self-reinforcement effect in text generation and the +effectiveness of a repetition penalty to mitigate it. However, determining the +optimal repetition penalty value is challenging. To tackle this, we propose a +forgetting mechanism that disregards distant tokens, reducing the burden of +penalty selection. In addition, we introduce a length penalty to address overly +short sentences caused by excessive penalties. Our penalty decoding approach +incorporating three strategies helps resolve issues with sampling methods +deviating from factual information. Experimental results demonstrate the +efficacy of our approach in generating high-quality sentences resembling human +output. + +
+
+ comment: Accepted by EMNLP2023 +
+
+
+
+
+ + ☆ Towards LLM-driven Dialogue State Tracking EMNLP 2023 + + +
+ Dialogue State Tracking (DST) is of paramount importance in ensuring accurate +tracking of user goals and system actions within task-oriented dialogue +systems. The emergence of large language models (LLMs) such as GPT3 and ChatGPT +has sparked considerable interest in assessing their efficacy across diverse +applications. In this study, we conduct an initial examination of ChatGPT's +capabilities in DST. Our evaluation uncovers the exceptional performance of +ChatGPT in this task, offering valuable insights to researchers regarding its +capabilities and providing useful directions for designing and enhancing +dialogue systems. Despite its impressive performance, ChatGPT has significant +limitations, including its closed-source nature, request restrictions, +data privacy concerns, and lack of local deployment capabilities. To address +these concerns, we present LDST, an LLM-driven DST framework based on smaller, +open-source foundation models. By utilizing a novel domain-slot instruction +tuning method, LDST achieves performance on par with ChatGPT. Through comprehensive +evaluations across three distinct experimental settings, we find that LDST +exhibits remarkable performance improvements in both zero-shot and few-shot +settings compared to previous SOTA methods. The source code is provided for +reproducibility. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ☆ Key Frame Mechanism For Efficient Conformer Based End-to-end Speech + Recognition + + +
+ Recently, Conformer as a backbone network for end-to-end automatic speech +recognition achieved state-of-the-art performance. The Conformer block +leverages a self-attention mechanism to capture global information, along with +a convolutional neural network to capture local information, resulting in +improved performance. However, the Conformer-based model encounters an issue +with the self-attention mechanism, as computational complexity grows +quadratically with the length of the input sequence. Inspired by previous +Connectionist Temporal Classification (CTC) guided blank skipping during +decoding, we introduce intermediate CTC outputs as guidance into the +downsampling procedure of the Conformer encoder. We define a frame with +non-blank output as a key frame. Specifically, we introduce the key frame-based +self-attention (KFSA) mechanism, a novel method to reduce the computation of +the self-attention mechanism using key frames. The structure of our proposed +approach comprises two encoders. Following the initial encoder, we introduce an +intermediate CTC loss function to compute the label frame, enabling us to +extract the key frames and blank frames for KFSA. Furthermore, we introduce the +key frame-based downsampling (KFDS) mechanism to operate on high-dimensional +acoustic features directly and drop the frames corresponding to blank labels, +which results in new acoustic feature sequences as input to the second encoder. +The proposed method achieves comparable or higher performance +than vanilla Conformer and other similar work such as Efficient Conformer. +Meanwhile, our proposed method can discard more than 60% of useless frames during +model training and inference, which will accelerate inference +significantly. The code for this work is available at +https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer + +
+
+ comment: This manuscript has been accepted by IEEE Signal Processing Letters + for publication +
+
+
+
+
+ + ☆ System Combination via Quality Estimation for Grammatical Error + Correction EMNLP 2023 + + +
+ Quality estimation models have been developed to assess the corrections made +by grammatical error correction (GEC) models when the reference or +gold-standard corrections are not available. An ideal quality estimator can be +utilized to combine the outputs of multiple GEC systems by choosing the best +subset of edits from the union of all edits proposed by the GEC base systems. +However, we found that existing GEC quality estimation models are not good +enough in differentiating good corrections from bad ones, resulting in a low +F0.5 score when used for system combination. In this paper, we propose GRECO, a +new state-of-the-art quality estimation model that gives a better estimate of +the quality of a corrected sentence, as indicated by having a higher +correlation to the F0.5 score of a corrected sentence. It results in a combined +GEC system with a higher F0.5 score. We also propose three methods for +utilizing GEC quality estimation models for system combination with varying +generality: model-agnostic, model-agnostic with voting bias, and +model-dependent method. The combined GEC system outperforms the state of the +art on the CoNLL-2014 test set and the BEA-2019 test set, achieving the highest +F0.5 scores published to date. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Unveiling A Core Linguistic Region in Large Language Models + + +
+ Brain localization, which describes the association between specific regions +of the brain and their corresponding functions, is widely accepted in the field +of cognitive science as an objective fact. Today's large language models (LLMs) +possess human-level linguistic competence and can execute complex tasks +requiring abstract knowledge and reasoning. To deeply understand the inherent +mechanisms of intelligence emergence in LLMs, this paper conducts analogical +research using brain localization as a prototype. We have discovered a core +region in LLMs that corresponds to linguistic competence, accounting for +approximately 1% of the total model parameters. This core region exhibits +significant dimension dependency, and perturbations to even a single parameter +on specific dimensions can lead to a loss of linguistic competence. +Furthermore, we observe that an improvement in linguistic competence does not +necessarily accompany an elevation in the model's knowledge level, which might +imply the existence of regions of domain knowledge that are dissociated from +the linguistic region. Overall, exploring the LLMs' functional regions provides +insights into the foundation of their intelligence. In the future, we will +continue to investigate knowledge regions within LLMs and the interactions +between them. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ☆ PartialFormer: Modeling Part Instead of Whole + + +
+ The design choices in Transformer feed-forward neural networks have resulted +in significant computational and parameter overhead. In this work, we emphasize +the importance of hidden dimension in designing lightweight FFNs, a factor +often overlooked in previous architectures. Guided by this principle, we +introduce PartialFormer, a parameter-efficient Transformer architecture +utilizing multiple smaller FFNs to reduce parameters and computation while +maintaining essential hidden dimensions. These smaller FFNs are integrated into +a multi-head attention system to enable effective collaboration. We also +propose a tailored head scaling strategy to enhance PartialFormer's +capabilities. Furthermore, we present a residual-like attention calculation to +improve depth scaling within PartialFormer. Extensive experiments on 9 +translation tasks and 1 abstractive summarization task validate the +effectiveness of our PartialFormer approach. Our code will be available at: +https://github.com/zhengkid/PartialFormer. + +
+
+ comment: 11 pages, 5 figures +
+
+
+
+
+ + ☆ Linking Surface Facts to Large-Scale Knowledge Graphs + + +
+ Open Information Extraction (OIE) methods extract facts from natural language +text in the form of ("subject"; "relation"; "object") triples. These facts are, +however, merely surface forms, the ambiguity of which impedes their downstream +usage; e.g., the surface phrase "Michael Jordan" may refer to either the former +basketball player or the university professor. Knowledge Graphs (KGs), on the +other hand, contain facts in a canonical (i.e., unambiguous) form, but their +coverage is limited by a static schema (i.e., a fixed set of entities and +predicates). To bridge this gap, we need the best of both worlds: (i) high +coverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of +KGs. In order to achieve this goal, we propose a new benchmark with novel +evaluation protocols that can, for example, measure fact linking performance on +a granular triple slot level, while also measuring if a system has the ability +to recognize that a surface form has no match in the existing KG. Our extensive +evaluation of several baselines shows that detection of out-of-KG entities and +predicates is more difficult than accurate linking to existing ones, thus +calling for more research efforts on this difficult task. We publicly release +all resources (data, benchmark and code) on +https://github.com/nec-research/fact-linking. + +
+
+
+
+
+ + ☆ Air-Decoding: Attribute Distribution Reconstruction for Decoding-Time + Controllable Text Generation + + +
+ Controllable text generation (CTG) aims to generate text with desired +attributes, and decoding-time-based methods have shown promising performance on +this task. However, in this paper, we identify the phenomenon of Attribute +Collapse for the first time. It causes the fluency of generated text to rapidly +decrease when the control strength exceeds a critical value, rendering the text +completely unusable. This limitation hinders the effectiveness of decoding +methods in achieving high levels of controllability. To address this problem, +we propose a novel lightweight decoding framework named Air-Decoding. Its main +idea is reconstructing the attribute distributions to balance the weights +between attribute words and non-attribute words to generate more fluent text. +Specifically, we train prefixes by prefix-tuning to obtain attribute +distributions. Then we design a novel attribute distribution reconstruction +method to balance the obtained distributions and use the reconstructed +distributions to guide language models for generation, effectively avoiding the +issue of Attribute Collapse. Experiments on multiple CTG tasks prove that our +method achieves a new state-of-the-art control performance. + +
+
+
+
+
+ + ☆ Non-autoregressive Streaming Transformer for Simultaneous Translation EMNLP 2023 + + +
+ Simultaneous machine translation (SiMT) models are trained to strike a +balance between latency and translation quality. However, training these models +to achieve high quality while maintaining low latency often leads to a tendency +for aggressive anticipation. We argue that this issue stems from the +autoregressive architecture upon which most existing SiMT models are built. To +address this issue, we propose the non-autoregressive streaming Transformer +(NAST), which comprises a unidirectional encoder and a non-autoregressive +decoder with intra-chunk parallelism. We enable NAST to generate the blank +token or repetitive tokens to adjust its READ/WRITE strategy flexibly, and +train it to maximize the non-monotonic latent alignment with an alignment-based +latency loss. Experiments on various SiMT benchmarks demonstrate that NAST +outperforms previous strong autoregressive SiMT baselines. + +
+
+ comment: EMNLP 2023 main conference; Source code is available at + https://github.com/ictnlp/NAST +
+
+
+
+
+ + ☆ Can ChatGPT Perform Reasoning Using the IRAC Method in Analyzing Legal + Scenarios Like a Lawyer? EMNLP 2023 + + +
+ Large Language Models (LLMs), such as ChatGPT, have drawn a lot of attention +recently in the legal domain due to their emergent ability to tackle a variety of +legal tasks. However, it is still unknown if LLMs are able to analyze a legal +case and perform reasoning in the same manner as lawyers. Therefore, we +constructed a novel corpus consisting of scenarios pertaining to Contract Acts +Malaysia and Australian Social Act for Dependent Child. ChatGPT is applied to +perform analysis on the corpus using the IRAC method, which is a framework +widely used by legal professionals for organizing legal analysis. Each scenario +in the corpus is annotated with a complete IRAC analysis in a semi-structured +format so that both machines and legal professionals are able to interpret and +understand the annotations. In addition, we conducted the first empirical +assessment of ChatGPT for IRAC analysis in order to understand how well it +aligns with the analysis of legal professionals. Our experimental results shed +light on possible future research directions to improve alignment between +LLMs and legal experts in terms of legal reasoning. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ We are Who We Cite: Bridges of Influence Between Natural Language + Processing and Other Academic Fields EMNLP 2023 + + +
+ Natural Language Processing (NLP) is poised to substantially influence the +world. However, significant progress comes hand-in-hand with substantial risks. +Addressing them requires broad engagement with various fields of study. Yet, +little empirical work examines the state of such engagement (past or current). +In this paper, we quantify the degree of influence between 23 fields of study +and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP +papers to other papers, and ~1.8m citations from other papers to NLP papers. We +show that, unlike most fields, the cross-field engagement of NLP, measured by +our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in +1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown +more insular -- citing increasingly more NLP papers and having fewer papers +that act as bridges between fields. NLP citations are dominated by computer +science; less than 8% of NLP citations are to linguistics, and less than 3% are +to math and psychology. These findings underscore NLP's urgent need to reflect +on its engagement with various fields. + +
+
+ comment: Published at EMNLP 2023 +
+
+
+
+
+ + ☆ Assessing Step-by-Step Reasoning against Lexical Negation: A Case Study + on Syllogism + + +
+ Large language models (LLMs) take advantage of step-by-step reasoning +instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their +ability to perform CoT-style reasoning robustly is of interest from a probing +perspective. In this study, we inspect the step-by-step reasoning ability of +LLMs with a focus on negation, which is a core linguistic phenomenon that is +difficult to process. In particular, we introduce several controlled settings +(e.g., reasoning in case of fictional entities) to evaluate the logical +reasoning abilities of the models. We observed that dozens of modern LLMs were +not robust against lexical negation (e.g., plausible -> implausible) when +performing CoT-style reasoning, and the results highlight unique limitations in +each LLM family. + +
+
+
+
+
+ + ☆ Paraphrase Types for Generation and Detection EMNLP 2023 + + +
+ Current approaches in paraphrase generation and detection heavily rely on a +single general similarity score, ignoring the intricate linguistic properties +of language. This paper introduces two new tasks to address this shortcoming by +considering paraphrase types - specific linguistic perturbations at particular +text positions. We name these tasks Paraphrase Type Generation and Paraphrase +Type Detection. Our results suggest that while current techniques perform well +in a binary classification scenario, i.e., paraphrased or not, the inclusion of +fine-grained paraphrase types poses a significant challenge. While most +approaches are good at generating and detecting general semantically similar +content, they fail to understand the intrinsic linguistic variables they +manipulate. Models trained in generating and identifying paraphrase types also +show improvements in tasks without them. In addition, scaling these models +further improves their ability to understand paraphrase types. We believe +paraphrase types can unlock a new paradigm for developing paraphrase models and +solving tasks in the future. + +
+
+ comment: Published at EMNLP 2023 +
+
+
+
+
+ + ☆ 3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for + Embodied Turn-Taking Prediction + + +
+ Predicting turn-taking in multiparty conversations has many practical +applications in human-computer/robot interaction. However, the complexity of +human communication makes it a challenging task. Recent advances have shown +that synchronous multi-perspective egocentric data can significantly improve +turn-taking prediction compared to asynchronous, single-perspective +transcriptions. Building on this research, we propose a new multimodal +transformer-based architecture for predicting turn-taking in embodied, +synchronized multi-perspective data. Our experimental results on the recently +introduced EgoCom dataset show a substantial performance improvement of up to +14.01% on average compared to existing baselines and alternative +transformer-based approaches. The source code, and the pre-trained models of +our 3T-Transformer will be available upon acceptance. + +
+
+
+
+
+ + ☆ Contextual Refinement of Translations: Large Language Models for + Sentence and Document-Level Post-Editing + + +
+ Large Language Models (LLMs) have demonstrated considerable success in +various Natural Language Processing tasks, but they have yet to attain +state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, +their significant performance in tasks demanding a broad understanding and +contextual processing shows their potential for translation. To exploit these +abilities, we investigate using LLMs for MT and explore recent +parameter-efficient fine-tuning techniques. Surprisingly, our initial +experiments find that fine-tuning for translation purposes even led to +performance degradation. To overcome this, we propose an alternative approach: +adapting LLMs as Automatic Post-Editors (APE) rather than direct translators. +Building on LLMs' exceptional ability to process and generate lengthy +sequences, we also propose extending our approach to document-level +translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can +yield significant improvements across both sentence and document-level metrics +while generalizing to out-of-domain data. Most notably, we achieve a +state-of-the-art accuracy rate of 89% on the ContraPro test set, which +specifically assesses the model's ability to resolve pronoun ambiguities when +translating from English to German. Lastly, we investigate a practical scenario +involving manual post-editing for document-level translation, where reference +context is made available. Here, we demonstrate that leveraging human +corrections can significantly reduce the number of edits required for +subsequent translations. An interactive demo for integrating manual +feedback can be found at +https://huggingface.co/spaces/skoneru/contextual_refinement_ende. + +
+
+
+
+
+ + ☆ Adaptive Policy with Wait-$k$ Model for Simultaneous Translation EMNLP 2023 + + +
+ Simultaneous machine translation (SiMT) requires a robust read/write policy +in conjunction with a high-quality translation model. Traditional methods rely +on either a fixed wait-$k$ policy coupled with a standalone wait-$k$ +translation model, or an adaptive policy jointly trained with the translation +model. In this study, we propose a more flexible approach by decoupling the +adaptive policy model from the translation model. Our motivation stems from the +observation that a standalone multi-path wait-$k$ model performs competitively +with adaptive policies utilized in state-of-the-art SiMT approaches. +Specifically, we introduce DaP, a divergence-based adaptive policy, that makes +read/write decisions for any translation model based on the potential +divergence in translation distributions resulting from future information. DaP +extends a frozen wait-$k$ model with lightweight parameters, and is both memory +and computation efficient. Experimental results across various benchmarks +demonstrate that our approach offers an improved trade-off between translation +accuracy and latency, outperforming strong baselines. + +
+
+ comment: Accepted to EMNLP 2023 main conference. 17 pages, 12 figures, 5 tables +
+
+
+
+
+ + ☆ Universal Domain Adaptation for Robust Handling of Distributional Shifts + in NLP EMNLP 2023 + + +
+ When deploying machine learning systems in the wild, it is highly desirable +for them to effectively apply prior knowledge to the unfamiliar domain while +also raising alarms on anomalous inputs. In order to address these requirements, +Universal Domain Adaptation (UniDA) has emerged as a novel research area in +computer vision, focusing on achieving both adaptation ability and robustness +(i.e., the ability to detect out-of-distribution samples). While UniDA has led to +significant progress in computer vision, its application to language input +still needs to be explored despite its feasibility. In this paper, we propose a +comprehensive benchmark for natural language that offers thorough viewpoints of +the model's generalizability and robustness. Our benchmark encompasses multiple +datasets with varying difficulty levels and characteristics, including temporal +shifts and diverse domains. On top of our testbed, we validate existing UniDA +methods from computer vision and state-of-the-art domain adaptation techniques +from NLP literature, yielding valuable findings: We observe that UniDA methods +originally designed for image input can be effectively transferred to the +natural language domain while also underscoring the effect of adaptation +difficulty in determining the model's performance. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ☆ Transparency at the Source: Evaluating and Interpreting Language Models + With Access to the True Distribution EMNLP + + +
+ We present a setup for training, evaluating and interpreting neural language +models, that uses artificial, language-like data. The data is generated using a +massive probabilistic grammar (based on state-split PCFGs), that is itself +derived from a large natural language corpus, but also provides us complete +control over the generative process. We describe and release both grammar and +corpus, and test for the naturalness of our generated data. This approach +allows us to define closed-form expressions to efficiently compute exact lower +bounds on obtainable perplexity using both causal and masked language +modelling. Our results show striking differences between neural language +modelling architectures and training objectives in how closely they allow +approximating the lower bound on perplexity. Our approach also allows us to +directly compare learned representations to symbolic rules in the underlying +source. We experiment with various techniques for interpreting model behaviour +and learning dynamics. With access to the underlying true source, our results +show striking differences and outcomes in learning dynamics between different +classes of words. + +
+
+ comment: EMNLP Findings 2023 +
+
+
+
+
+ + ☆ Characterizing how 'distributional' NLP corpora distance metrics are + + +
+ A corpus of vector-embedded text documents has some empirical distribution. +Given two corpora, we want to calculate a single metric of distance (e.g., +Mauve, Frechet Inception) between them. We describe an abstract quality, called +`distributionality', of such metrics. A non-distributional metric tends to use +very local measurements, or uses global measurements in a way that does not +fully reflect the distributions' true distance. For example, if individual +pairwise nearest-neighbor distances are low, it may judge the two corpora to +have low distance, even if their two distributions are in fact far from each +other. A more distributional metric will, in contrast, better capture the +distributions' overall distance. We quantify this quality by constructing a +Known-Similarity Corpora set from two paraphrase corpora and calculating the +distance between paired corpora from it. The distances' trend shape as set +element separation increases should quantify the distributionality of the +metric. We propose that Average Hausdorff Distance and energy distance between +corpora are representative examples of non-distributional and distributional +distance metrics, to which other metrics can be compared, to evaluate how +distributional they are. + +
+
+ comment: Published in the August 2023 Joint Statistical Meetings proceedings +
+
+
+
+
+ + ☆ ALCUNA: Large Language Models Meet New Knowledge EMNLP 2023 + + +
+ With the rapid development of NLP, large-scale language models (LLMs) now excel +in various tasks across multiple domains. However, existing benchmarks may +not adequately measure these models' capabilities, especially when faced with +new knowledge. In this paper, we address the lack of benchmarks to evaluate +LLMs' ability to handle new knowledge, an important and challenging aspect in +the rapidly evolving world. We propose an approach called KnowGen that +generates new knowledge by altering existing entity attributes and +relationships, resulting in artificial entities that are distinct from +real-world entities. With KnowGen, we introduce a benchmark named ALCUNA to +assess LLMs' abilities in knowledge understanding, differentiation, and +association. Benchmarking several LLMs reveals that their performance in the face +of new knowledge is not satisfactory, particularly in reasoning between new and +internal knowledge. We also explore the impact of entity similarity on the +model's understanding of entity knowledge and the influence of contextual +entities. We appeal to the need for caution when using LLMs in new scenarios or +with new knowledge, and hope that our benchmarks can help drive the development +of LLMs in the face of new knowledge. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction + Following: A Case Study of Arabic + + +
+ While significant progress has been made in benchmarking Large Language +Models (LLMs) across various tasks, there is a lack of comprehensive evaluation +of their abilities in responding to multi-turn instructions in less-commonly +tested languages like Arabic. Our paper offers a detailed examination of the +proficiency of open LLMs in such scenarios in Arabic. Utilizing a customized +Arabic translation of the MT-Bench benchmark suite, we employ GPT-4 as a +uniform evaluator for both English and Arabic queries to assess and compare the +performance of the LLMs on various open-ended tasks. Our findings reveal +variations in model responses on different task categories, e.g., logic vs. +literacy, when instructed in English or Arabic. We find that base +models fine-tuned using multilingual and multi-turn datasets could be competitive with +models trained from scratch on multilingual data. Finally, we hypothesize that +an ensemble of small, open LLMs could perform competitively with proprietary LLMs +on the benchmark. + +
+
+ comment: Accepted at SIGARAB ArabicNLP 2023 +
+
+
+
+
+ + ☆ Leveraging Timestamp Information for Serialized Joint Streaming + Recognition and Translation + + +
+ The growing need for instant spoken language transcription and translation is +driven by increased global communication and cross-lingual interactions. This +has made offering translations in multiple languages essential for user +applications. Traditional approaches to automatic speech recognition (ASR) and +speech translation (ST) have often relied on separate systems, leading to +inefficiencies in computational resources, and increased synchronization +complexity in real time. In this paper, we propose a streaming +Transformer-Transducer (T-T) model able to jointly produce many-to-one and +one-to-many transcription and translation using a single decoder. We introduce +a novel method for joint token-level serialized output training based on +timestamp information to effectively produce ASR and ST outputs in the +streaming setting. Experiments on {it,es,de}->en prove the effectiveness of our +approach, enabling the generation of one-to-many joint outputs with a single +decoder for the first time. + +
+
+ comment: © 2024 IEEE. Personal use of this material is permitted. + Permission from IEEE must be obtained for all other uses, in any current or + future media, including reprinting/republishing this material for advertising + or promotional purposes, creating new collective works, for resale or + redistribution to servers or lists, or reuse of any copyrighted component of + this work in other works +
+
+
+
+
+ + ☆ Cross-Modal Conceptualization in Bottleneck Models + + +
+ Concept Bottleneck Models (CBMs) assume that training examples (e.g., x-ray +images) are annotated with high-level concepts (e.g., types of abnormalities), +and perform classification by first predicting the concepts, followed by +predicting the label relying on these concepts. The main difficulty in using +CBMs comes from having to choose concepts that are predictive of the label and +then having to label training examples with these concepts. In our approach, we +adopt a more moderate assumption and instead use text descriptions (e.g., +radiology reports), accompanying the images in training, to guide the induction +of concepts. Our cross-modal approach treats concepts as discrete latent +variables and promotes concepts that (1) are predictive of the label, and (2) +can be predicted reliably from both the image and text. Through experiments +conducted on datasets ranging from synthetic datasets (e.g., synthetic images +with generated descriptions) to realistic medical imaging datasets, we +demonstrate that cross-modal learning encourages the induction of interpretable +concepts while also facilitating disentanglement. Our results also suggest that +this guidance leads to increased robustness by suppressing the reliance on +shortcut features. + +
+
+
+
+
+ + ☆ Large Language Models can Share Images, Too! + + +
+ This paper explores the image-sharing capability of Large Language Models +(LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting, +without the help of visual foundation models. Inspired by the two-stage process +of image-sharing in human dialogues, we propose a two-stage framework that +allows LLMs to predict potential image-sharing turns and generate related image +descriptions using our effective restriction-based prompt template. With +extensive experiments, we unlock the image-sharing capability of LLMs +in zero-shot prompting, with GPT-4 achieving the best performance. +Additionally, we uncover the emergent image-sharing ability in +zero-shot prompting, demonstrating the effectiveness of restriction-based +prompts in both stages of our framework. Based on this framework, we augment +the PhotoChat dataset with images generated by Stable Diffusion at predicted +turns, namely PhotoChat++. To our knowledge, this is the first study to assess +the image-sharing ability of LLMs in a zero-shot setting without visual +foundation models. The source code and the dataset will be released after +publication. + +
+
+
+
+
+ + ☆ Cross-lingual Prompting: Improving Zero-shot Chain-of-Thought Reasoning + across Languages EMNLP2023 + + +
+ Chain-of-thought (CoT) is capable of eliciting models to explicitly generate +reasoning paths, thus promoting reasoning accuracy and attracting increasing +attention. Specifically, zero-shot CoT achieves remarkable improvements in a +wide range of reasoning tasks by simply instructing the LLM with the prompt +"Let's think step by step!". Despite the success of zero-shot CoT, the existing +zero-shot prompting techniques remain limited to a single language, making it +challenging to generalize to other languages and hindering global development. +In this work, we introduce cross-lingual prompting (CLP), aiming to improve +zero-shot CoT reasoning across languages. Specifically, CLP consists of two +main components: (1) cross-lingual alignment prompting and (2) task-specific +solver prompting. The cross-lingual alignment prompting is responsible for +aligning representations across different languages, whereas the task-specific +solver prompting is used to generate the final chain of thoughts and results +for the reasoning task. In addition, we further introduce cross-lingual +self-consistent prompting (CLSP) to ensemble different reasoning paths across +languages. Our experimental evaluations on several benchmarks demonstrate that +CLP and CLSP significantly outperform the existing prompting methods and +achieve state-of-the-art performance. We hope this work will inspire further +breakthroughs in cross-lingual CoT. + +
+
+ comment: Accepted at EMNLP2023 Main Conference +
+
+
+
+
+ + ☆ What do Deck Chairs and Sun Hats Have in Common? Uncovering Shared + Properties in Large Concept Vocabularies EMNLP 2023 + + +
+ Concepts play a central role in many applications. This includes settings +where concepts have to be modelled in the absence of sentence context. Previous +work has therefore focused on distilling decontextualised concept embeddings +from language models. But concepts can be modelled from different perspectives, +whereas concept embeddings typically mostly capture taxonomic structure. To +address this issue, we propose a strategy for identifying what different +concepts, from a potentially large concept vocabulary, have in common with +others. We then represent concepts in terms of the properties they share with +the other concepts. To demonstrate the practical usefulness of this way of +modelling concepts, we consider the task of ultra-fine entity typing, which is +a challenging multi-label classification problem. We show that by augmenting +the label set with shared properties, we can improve the performance of the +state-of-the-art models for this task. + +
+
+ comment: Accepted for EMNLP 2023 +
+
+
+
+
+ + ☆ Geographical Erasure in Language Generation EMNLP 2023 + + +
+ Large language models (LLMs) encode vast amounts of world knowledge. However, +since these models are trained on large swaths of internet data, they are at +risk of inordinately capturing information about dominant groups. This +imbalance can propagate into generated language. In this work, we study and +operationalise a form of geographical erasure, wherein language models +underpredict certain countries. We demonstrate consistent instances of erasure +across a range of LLMs. We discover that erasure strongly correlates with low +frequencies of country mentions in the training corpus. Lastly, we mitigate +erasure by finetuning using a custom objective. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Evaluating the Knowledge Base Completion Potential of GPT + + +
+ Structured knowledge bases (KBs) are an asset for search engines and other +applications, but are inevitably incomplete. Language models (LMs) have been +proposed for unsupervised knowledge base completion (KBC), yet their ability +to do this at scale and with high accuracy remains an open question. Prior +experimental studies mostly fall short because they only evaluate on popular +subjects, or sample already existing facts from KBs. In this work, we perform a +careful evaluation of GPT's potential to complete the largest public KB: +Wikidata. We find that, despite their size and capabilities, models like GPT-3, +ChatGPT and GPT-4 do not achieve fully convincing results on this task. +Nonetheless, they provide solid improvements over earlier approaches with +smaller LMs. In particular, we show that, with proper thresholding, GPT-3 +makes it possible to extend Wikidata by 27M facts at 90% precision. + +
+
+ comment: 12 pages 4 tables +
+
+
+
+
+ + ☆ SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for + Social Media NLP Research EMNLP 2023 + + +
+ Despite its relevance, the maturity of NLP for social media pales in +comparison with general-purpose models, metrics and benchmarks. This fragmented +landscape makes it hard for the community to know, for instance, given a task, +which is the best performing model and how it compares with others. To +alleviate this issue, we introduce a unified benchmark for NLP evaluation in +social media, SuperTweetEval, which includes a heterogeneous set of tasks and +datasets combined, adapted and constructed from scratch. We benchmarked the +performance of a wide range of models on SuperTweetEval and our results suggest +that, despite the recent advances in language modelling, social media remains +challenging. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ MCC-KD: Multi-CoT Consistent Knowledge Distillation + + +
+ Large language models (LLMs) have showcased remarkable capabilities in +complex reasoning through chain of thought (CoT) prompting. Recently, there has +been a growing interest in transferring these reasoning abilities from LLMs to +smaller models. However, achieving both the diversity and consistency in +rationales presents a challenge. In this paper, we focus on enhancing these two +aspects and propose Multi-CoT Consistent Knowledge Distillation (MCC-KD) to +efficiently distill the reasoning capabilities. In MCC-KD, we generate multiple +rationales for each question and enforce consistency among the corresponding +predictions by minimizing the bidirectional KL-divergence between the answer +distributions. We investigate the effectiveness of MCC-KD with different model +architectures (LLaMA/FlanT5) and various model scales (3B/7B/11B/13B) on both +mathematical reasoning and commonsense reasoning benchmarks. The empirical +results not only confirm MCC-KD's superior performance on in-distribution +datasets but also highlight its robust generalization ability on +out-of-distribution datasets. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ Unleashing the potential of prompt engineering in Large Language Models: + a comprehensive review + + +
+ This paper delves into the pivotal role of prompt engineering in unleashing +the capabilities of Large Language Models (LLMs). Prompt engineering is the +process of structuring input text for LLMs and is a technique integral to +optimizing the efficacy of LLMs. This survey elucidates foundational principles +of prompt engineering, such as role-prompting, one-shot, and few-shot +prompting, as well as more advanced methodologies such as the chain-of-thought +and tree-of-thoughts prompting. The paper sheds light on how external +assistance in the form of plugins can assist in this task, and reduce machine +hallucination by retrieving external knowledge. We subsequently delineate +prospective directions in prompt engineering research, emphasizing the need for +a deeper understanding of structures and the role of agents in Artificial +Intelligence-Generated Content (AIGC) tools. We discuss how to assess the +efficacy of prompt methods from different perspectives and using different +methods. Finally, we gather information about the application of prompt +engineering in such fields as education and programming, showing its +transformative potential. This comprehensive survey aims to serve as a friendly +guide for anyone venturing through the big world of LLMs and prompt +engineering. + +
+
+
+
+
+ + ☆ Generating Prototypes for Contradiction Detection Using Large Language + Models and Linguistic Rules + + +
+ We introduce a novel data generation method for contradiction detection, +which leverages the generative power of large language models as well as +linguistic rules. Our vision is to provide a condensed corpus of prototypical +contradictions, allowing for in-depth linguistic analysis as well as efficient +language model fine-tuning. To this end, we instruct the generative models to +create contradicting statements with respect to descriptions of specific +contradiction types. In addition, the model is also instructed to come up with +completely new contradiction typologies. As an auxiliary approach, we use +linguistic rules to construct simple contradictions such as those arising from +negation, antonymy and numeric mismatch. We find that our methods yield +promising results in terms of coherence and variety of the data. Further +studies, as well as manual refinement are necessary to make use of this data in +a machine learning setup. + +
+
+
+
+
+ + ☆ A Survey on LLM-generated Text Detection: Necessity, Methods, and + Future Directions + + +
+ The powerful ability to understand, follow, and generate complex language +emerging from large language models (LLMs) makes LLM-generated text flood many +areas of our daily lives at an incredible speed and is widely accepted by +humans. As LLMs continue to expand, there is an imperative need to develop +detectors that can detect LLM-generated text. This is crucial to mitigate +potential misuse of LLMs and safeguard realms like artistic expression and +social networks from the harmful influence of LLM-generated content. +LLM-generated text detection aims to discern if a piece of text was produced by +an LLM, which is essentially a binary classification task. The detector +techniques have witnessed notable advancements recently, propelled by +innovations in watermarking techniques, zero-shot methods, fine-tuning LM +methods, adversarial learning methods, LLMs as detectors, and human-assisted +methods. In this survey, we collate recent research breakthroughs in this area +and underscore the pressing need to bolster detector research. We also delve +into prevalent datasets, elucidating their limitations and developmental +requirements. Furthermore, we analyze various LLM-generated text detection +paradigms, shedding light on challenges like out-of-distribution problems, +potential attacks, and data ambiguity. Finally, we highlight interesting +directions for future research in LLM-generated text detection to advance the +implementation of responsible artificial intelligence (AI). Our aim with this +survey is to provide a clear and comprehensive introduction for newcomers while +also offering seasoned researchers a valuable update in the field of +LLM-generated text detection. + +
+
+
+
+
+ + ☆ Once Upon a Time in Graph: Relative-Time + Pretraining for Complex Temporal Reasoning EMNLP 2023 + + +
+ Our physical world is constantly evolving over time, rendering challenges for +pre-trained language models to understand and reason over the temporal contexts +of texts. Existing work focuses on strengthening the direct association between +a piece of text and its time-stamp. However, the knowledge-time association is +usually insufficient for the downstream tasks that require reasoning over +temporal dependencies between knowledge. In this work, we make use of the +underlying nature of time: all temporally-scoped sentences are strung together +through a one-dimensional time axis. We suggest creating a graph structure +based on the relative placements of events along the time axis. Inspired by the +graph view, we propose RemeMo (Relative Time +Modeling), which explicitly connects all temporally-scoped facts +by modeling the time relations between any two sentences. Experimental results +show that RemeMo outperforms the baseline T5 on multiple temporal question +answering datasets under various settings. Further analysis suggests that +RemeMo is especially good at modeling long-range complex temporal dependencies. +We release our code and pre-trained checkpoints at +https://github.com/DAMO-NLP-SG/RemeMo. + +
+
+ comment: EMNLP 2023 main +
+
+
+
+
+ + ☆ Strong and Efficient Baselines for Open Domain Conversational Question + Answering EMNLP 2023 + + +
+ Unlike the Open Domain Question Answering (ODQA) setting, the conversational +(ODConvQA) domain has received limited attention when it comes to reevaluating +baselines for both efficiency and effectiveness. In this paper, we study the +State-of-the-Art (SotA) Dense Passage Retrieval (DPR) retriever and +Fusion-in-Decoder (FiD) reader pipeline, and show that it significantly +underperforms when applied to ODConvQA tasks due to various limitations. We +then propose and evaluate strong yet simple and efficient baselines, by +introducing a fast reranking component between the retriever and the reader, +and by performing targeted finetuning steps. Experiments on two ODConvQA tasks, +namely TopiOCQA and OR-QuAC, show that our method improves the SotA results, +while reducing reader's latency by 60%. Finally, we provide new and valuable +insights into the development of challenging baselines that serve as a +reference for future, more intricate approaches, including those that leverage +Large Language Models (LLMs). + +
+
+ comment: Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ☆ The continued usefulness of vocabulary tests for evaluating large + language models + + +
+ In their seminal article on semantic vectors, Landauer and Dumais (1997) +proposed testing the quality of AI language models with a challenging +vocabulary test. We show that their Test of English as a Foreign Language +(TOEFL) test remains informative for contemporary major language models, since +none of the models was perfect and they made errors on divergent items. The TOEFL +test consists of target words with four alternatives to choose from. We further +tested the models on a Yes/No test that requires distinguishing between +existing words and made-up nonwords. The models performed significantly worse +on the nonword items, in line with other observations that current major +language models provide non-existent information. The situation was worse when +we generalized the tests to Spanish. Here, most models gave +meanings/translations for the majority of random letter sequences. On the plus +side, the best models began to perform quite well, and they also pointed to +nonwords that were unknown to the test participants but can be found in +dictionaries. + +
+
+
+
+
+ + ☆ Tree of Clarifications: Answering Ambiguous Questions with + Retrieval-Augmented Large Language Models EMNLP 2023 + + +
+ Questions in open-domain question answering are often ambiguous, allowing +multiple interpretations. One approach to handling them is to identify all +possible interpretations of the ambiguous question (AQ) and to generate a +long-form answer addressing them all, as suggested by Stelmakh et al. (2022). +While it provides a comprehensive response without bothering the user for +clarification, considering multiple dimensions of ambiguity and gathering +corresponding knowledge remains a challenge. To cope with the challenge, we +propose a novel framework, Tree of Clarifications (ToC): It recursively +constructs a tree of disambiguations for the AQ -- via few-shot prompting +leveraging external knowledge -- and uses it to generate a long-form answer. +ToC outperforms existing baselines on ASQA in a few-shot setup across the +metrics, while surpassing fully-supervised baselines trained on the whole +training set in terms of Disambig-F1 and Disambig-ROUGE. Code is available at +https://github.com/gankim/tree-of-clarifications. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ☆ API-Assisted Code Generation for Question Answering on Varied Table + Structures EMNLP 2023 + + +
+ A persistent challenge to table question answering (TableQA) by generating +executable programs has been adapting to varied table structures, typically +requiring domain-specific logical forms. In response, this paper introduces a +unified TableQA framework that: (1) provides a unified representation for +structured tables as multi-index Pandas data frames, (2) uses Python as a +powerful querying language, and (3) uses few-shot prompting to translate NL +questions into Python programs, which are executable on Pandas data frames. +Furthermore, to answer complex relational questions with extended program +functionality and external knowledge, our framework allows customized APIs that +Python programs can call. We experiment with four TableQA datasets that involve +tables of different structures -- relational, multi-table, and hierarchical +matrix shapes -- and achieve prominent improvements over past state-of-the-art +systems. In ablation studies, we (1) show benefits from our multi-index +representation and APIs over baselines that use only an LLM, and (2) +demonstrate that our approach is modular and can incorporate additional APIs. + +
+
+ comment: EMNLP 2023 camera ready, 13 pages, 11 figures +
+
+
+
+
+ + ☆ SpEL: Structured Prediction for Entity Linking + + +
+ Entity linking is a prominent thread of research focused on structured data +creation by linking spans of text to an ontology or knowledge source. We +revisit the use of structured prediction for entity linking which classifies +each individual input token as an entity, and aggregates the token predictions. +Our system, called SpEL (Structured prediction for Entity Linking) is a +state-of-the-art entity linking system that uses some new ideas to apply +structured prediction to the task of entity linking including: two refined +fine-tuning steps; a context sensitive prediction aggregation strategy; +reduction of the size of the model's output vocabulary, and; we address a +common problem in entity-linking systems where there is a training vs. +inference tokenization mismatch. Our experiments show that we can outperform +the state-of-the-art on the commonly used AIDA benchmark dataset for entity +linking to Wikipedia. Our method is also very compute efficient in terms of +number of parameters and speed of inference. + +
+
+
+
+
+ + ☆ Pre-Trained Language Models Augmented with Synthetic Scanpaths for + Natural Language Understanding EMNLP 2023 + + +
+ Human gaze data offer cognitive information that reflects natural language +comprehension. Indeed, augmenting language models with human scanpaths has +proven beneficial for a range of NLP tasks, including language understanding. +However, the applicability of this approach is hampered because the abundance +of text corpora is contrasted by a scarcity of gaze data. Although models for +the generation of human-like scanpaths during reading have been developed, the +potential of synthetic gaze data across NLP tasks remains largely unexplored. +We develop a model that integrates synthetic scanpath generation with a +scanpath-augmented language model, eliminating the need for human gaze data. +Since the model's error gradient can be propagated throughout all parts of the +model, the scanpath generator can be fine-tuned to downstream tasks. We find +that the proposed model not only outperforms the underlying language model, but +achieves a performance that is comparable to a language model augmented with +real human gaze data. Our code is publicly available. + +
+
+ comment: Pre-print for EMNLP 2023 +
+
+
+
+
+ + ☆ Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and + Beyond EMNLP 2023 + + +
+ Vision-language (VL) understanding tasks evaluate models' comprehension of +complex visual scenes through multiple-choice questions. However, we have +identified two dataset biases that models can exploit as shortcuts to resolve +various VL tasks correctly without proper understanding. The first type of +dataset bias is Unbalanced Matching bias, where the correct answer +overlaps the question and image more than the incorrect answers. The second +type of dataset bias is Distractor Similarity bias, where incorrect +answers are overly dissimilar to the correct answer but significantly similar +to other incorrect answers within the same sample. To address these dataset +biases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic +training and debiased evaluation data. We then introduce Intra-sample +Counterfactual Training (ICT) to assist models in utilizing the synthesized +training data, particularly the counterfactual data, via focusing on +intra-sample differentiation. Extensive experiments demonstrate the +effectiveness of ADS and ICT in consistently improving model performance across +different benchmarks, even in domain-shifted scenarios. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ DPP-TTS: Diversifying prosodic features of speech via determinantal + point processes EMNLP 2023 + + +
+ With the rapid advancement in deep generative models, recent neural +Text-To-Speech (TTS) models have succeeded in synthesizing human-like speech. +There have been some efforts to generate speech with various prosody beyond +monotonous prosody patterns. However, previous works have several limitations. +First, typical TTS models depend on the scaled sampling temperature for +boosting the diversity of prosody. Speech samples generated at high sampling +temperatures often lack perceptual prosodic diversity, which can adversely +affect the naturalness of the speech. Second, the diversity among samples is +neglected since the sampling procedure often focuses on a single speech sample +rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech +model based on Determinantal Point Processes (DPPs) with a prosody diversifying +module. Our TTS model is capable of generating speech samples that +simultaneously consider perceptual diversity in each sample and among multiple +samples. We demonstrate that DPP-TTS generates speech samples with more +diversified prosody than baselines in the side-by-side comparison test, +while also taking the naturalness of speech into account. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Reasoning about Ambiguous Definite Descriptions EMNLP 2023 + + +
+ Natural language reasoning plays an increasingly important role in improving +language models' ability to solve complex language understanding tasks. An +interesting use case for reasoning is the resolution of context-dependent +ambiguity. But no resources exist to evaluate how well Large Language Models +can use explicit reasoning to resolve ambiguity in language. We propose to use +ambiguous definite descriptions for this purpose and create and publish the +first benchmark dataset consisting of such phrases. Our method includes all +information required to resolve the ambiguity in the prompt, which means a +model does not require anything but reasoning to do well. We find this to be a +challenging task for recent LLMs. Code and data available at: +https://github.com/sfschouten/exploiting-ambiguity + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ SPRING-INX: A Multilingual Indian Language Speech Corpus by SPRING Lab, + IIT Madras SP + + +
+ India is home to a multitude of languages of which 22 languages are +recognised by the Indian Constitution as official. Building speech based +applications for the Indian population is a difficult problem owing to limited +data and the number of languages and accents to accommodate. To encourage the +language technology community to build speech based applications in Indian +languages, we are open sourcing SPRING-INX data which has about 2000 hours of +legally sourced and manually transcribed speech data for ASR system building in +Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi +and Tamil. This endeavor is by SPRING Lab, Indian Institute of Technology +Madras and is part of the National Language Translation Mission (NLTM), funded by +the Indian Ministry of Electronics and Information Technology (MeitY), +Government of India. We describe the data collection and data cleaning process +along with the data statistics in this paper. + +
+
+ comment: 3 pages, About SPRING-INX Data +
+
+
+
+
+ + ☆ Multilingual k-Nearest-Neighbor Machine Translation EMNLP + + +
+ k-nearest-neighbor machine translation has demonstrated remarkable +improvements in machine translation quality by creating a datastore of cached +examples. However, these improvements have been limited to high-resource +language pairs, with large datastores, and remain a challenge for low-resource +languages. In this paper, we address this issue by combining representations +from multiple languages into a single datastore. Our results consistently +demonstrate substantial improvements not only in low-resource translation +quality (up to +3.6 BLEU), but also for high-resource translation quality (up +to +0.5 BLEU). Our experiments show that it is possible to create multilingual +datastores that are a quarter of the size, achieving a 5.3x speed improvement, +by using linguistic similarities for datastore creation. + +
+
+ comment: Accepted to EMNLP +
+
+
+
+
+ + ☆ Extending Input Contexts of Language Models through Training on + Segmented Sequences + + +
+ Effectively training language models on long inputs poses many technical +challenges. As a cost consideration, language models are pretrained on a fixed +sequence length before being adapted to longer sequences. We explore various +methods for adapting models to longer inputs by training on segmented sequences +and an interpolation-based method for extending absolute positional embeddings. +We develop a training procedure to extend the input context size of pretrained +models with no architectural changes and no more memory cost than +training on the original input lengths. By sub-sampling segments from long +inputs while maintaining their original position, the model is able to learn new +positional interactions. Our method benefits both models trained with absolute +positional embeddings, by extending their input contexts, as well as popular +relative positional embedding methods showing a reduced perplexity on sequences +longer than they were trained on. We demonstrate our method can extend input +contexts by a factor of 4x while improving perplexity. + +
+
+ comment: 11 pages, 3 figures +
+
+
+
+
+ + ☆ Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts EMNLP 2023 + + +
+ As large language models (LLMs) have shown effectiveness with different +prompting methods, such as Chain of Thought and Program of Thought, we find that +these methods have formed a great complementarity to each other on math +reasoning tasks. In this work, we propose XoT, an integrated problem solving +framework by prompting LLMs with diverse reasoning thoughts. For each question, +XoT always begins with selecting the most suitable method then executes each +method iteratively. Within each iteration, XoT actively checks the validity of +the generated answer and incorporates the feedback from external executors, +allowing it to dynamically switch among different prompting methods. Through +extensive experiments on 10 popular math reasoning datasets, we demonstrate the +effectiveness of our proposed approach and thoroughly analyze the strengths of +each module. Moreover, empirical results suggest that our framework is +orthogonal to recent work that makes improvements on single reasoning methods +and can further generalise to the logical reasoning domain. By allowing method +switching, XoT provides a fresh perspective on the collaborative integration of +diverse reasoning thoughts in a unified framework. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ CrisisMatch: Semi-Supervised Few-Shot Learning for Fine-Grained Disaster + Tweet Classification SC + + +
+ The shared real-time information about natural disasters on social media +platforms like Twitter and Facebook plays a critical role in informing +volunteers, emergency managers, and response organizations. However, supervised +learning models for monitoring disaster events require large amounts of +annotated data, making them unrealistic for real-time use in disaster events. +To address this challenge, we present a fine-grained disaster tweet +classification model under the semi-supervised, few-shot learning setting where +only a small number of annotated data is required. Our model, CrisisMatch, +effectively classifies tweets into fine-grained classes of interest using few +labeled data and large amounts of unlabeled data, mimicking the early stage of +a disaster. Through integrating effective semi-supervised learning ideas and +incorporating TextMixUp, CrisisMatch achieves performance improvement on two +disaster datasets of 11.2% on average. Further analyses are also provided for +the influence of the number of labeled data and out-of-domain results. + +
+
+ comment: Accepted by ISCRAM 2023 +
+
+
+
+
+ + ☆ Conversational Recommender System and Large Language Model Are Made for + Each Other in E-commerce Pre-sales Dialogue EMNLP 2023 + + +
+ E-commerce pre-sales dialogue aims to understand and elicit user needs and +preferences for the items they are seeking so as to provide appropriate +recommendations. Conversational recommender systems (CRSs) learn user +representation and provide accurate recommendations based on dialogue context, +but rely on external knowledge. Large language models (LLMs) generate responses +that mimic pre-sales dialogues after fine-tuning, but lack domain-specific +knowledge for accurate recommendations. Intuitively, the strengths of LLM and +CRS in E-commerce pre-sales dialogues are complementary, yet no previous work +has explored this. This paper investigates the effectiveness of combining LLM +and CRS in E-commerce pre-sales dialogues, proposing two collaboration methods: +CRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a +real-world dataset of E-commerce pre-sales dialogues. We analyze the impact of +two collaborative approaches with two CRSs and two LLMs on four tasks of +E-commerce pre-sales dialogue. We find that collaborations between CRS and LLM +can be very effective in some cases. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine + Chain-of-Thought Prompting for Multi-domain NLU Tasks EMNLP 2023 + + +
+ While Chain-of-Thought prompting is popular in reasoning tasks, its +application to Large Language Models (LLMs) in Natural Language Understanding +(NLU) is under-explored. Motivated by multi-step reasoning of LLMs, we propose +the Coarse-to-Fine Chain-of-Thought (CoF-CoT) approach that breaks down NLU tasks +into multiple reasoning steps where LLMs can learn to acquire and leverage +essential concepts to solve tasks from different granularities. Moreover, we +propose leveraging semantic-based Abstract Meaning Representation (AMR) +structured knowledge as an intermediate step to capture the nuances and diverse +structures of utterances, and to understand connections between their varying +levels of granularity. Our proposed approach proves effective in +helping LLMs adapt to multi-grained NLU tasks under both zero-shot +and few-shot multi-domain settings. + +
+
+ comment: Accepted at EMNLP 2023 (Main Conference) +
+
+
+
+
+ + ♻ ☆ Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR + Decomposition EMNLP 2023 + + +
+ Cross-encoder models, which jointly encode and score a query-item pair, are +prohibitively expensive for direct k-nearest neighbor (k-NN) search. +Consequently, k-NN search typically employs a fast approximate retrieval (e.g. +using BM25 or dual-encoder vectors), followed by reranking with a +cross-encoder; however, the retrieval approximation often has detrimental +recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent +work that employs a cross-encoder only, making search efficient using a +relatively small number of anchor items, and a CUR matrix factorization. While +ANNCUR's one-time selection of anchors tends to approximate the cross-encoder +distances on average, doing so forfeits the capacity to accurately estimate +distances to items near the query, leading to regret in the crucial end-task: +recall of top-k items. In this paper, we propose ADACUR, a method that +adaptively, iteratively, and efficiently minimizes the approximation error for +the practically important top-k neighbors. It does so by iteratively performing +k-NN search using the anchors available so far, then adding these retrieved +nearest neighbors to the anchor set for the next round. Empirically, on +multiple datasets, in comparison to previous traditional and state-of-the-art +methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed +approach ADACUR consistently reduces recall error, by up to 70% on the important +k = 1 setting, while using no more compute than its competitors. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Better to Ask in English: Cross-Lingual Evaluation of Large Language + Models for Healthcare Queries + + +
+ Large language models (LLMs) are transforming the ways the general public +accesses and consumes information. Their influence is particularly pronounced +in pivotal sectors like healthcare, where lay individuals are increasingly +appropriating LLMs as conversational agents for everyday queries. While LLMs +demonstrate impressive language understanding and generation proficiencies, +concerns regarding their safety remain paramount in these high-stake domains. +Moreover, the development of LLMs is disproportionately focused on English. It +remains unclear how these LLMs perform in the context of non-English languages, +a gap that is critical for ensuring equity in the real-world use of these +systems. This paper provides a framework to investigate the effectiveness of +LLMs as multi-lingual dialogue systems for healthcare queries. Our +empirically-derived framework XlingEval focuses on three fundamental criteria +for evaluating LLM responses to naturalistic human-authored health-related +questions: correctness, consistency, and verifiability. Through extensive +experiments on four major global languages, including English, Spanish, +Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, +and through an amalgamation of algorithmic and human-evaluation strategies, we +found a pronounced disparity in LLM responses across these languages, +indicating a need for enhanced cross-lingual capabilities. We further propose +XlingHealth, a cross-lingual benchmark for examining the multilingual +capabilities of LLMs in the healthcare context. Our findings underscore the +pressing need to bolster the cross-lingual capacities of these models, and to +provide an equitable information ecosystem accessible to all. + +
+
+ comment: 18 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ A Zero-Shot Language Agent for Computer Control with Structured + Reflection EMNLP 2023 + + +
+ Large language models (LLMs) have shown increasing capacity at planning and +executing a high-level goal in a live computer environment (e.g. MiniWoB++). To +perform a task, recent works often require a model to learn from trace examples +of the task via either supervised learning or few/many-shot prompting. Without +these trace examples, it remains a challenge how an agent can autonomously +learn and improve its control on a computer, which limits the ability of an +agent to perform a new task. We approach this problem with a zero-shot agent +that requires no given expert traces. Our agent plans for executable actions on +a partially observed environment, and iteratively progresses a task by +identifying and learning from its mistakes via self-reflection and structured +thought management. On the easy tasks of MiniWoB++, we show that our zero-shot +agent often outperforms recent SoTAs, with more efficient reasoning. For tasks +with more complexity, our reflective agent performs on par with prior best +models, even though previous works had the advantages of accessing expert +traces or additional screen information. + +
+
+ comment: Accepted at Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Emergent AI-Assisted Discourse: Case Study of a Second Language Writer + Authoring with ChatGPT + + +
+ The rapid proliferation of ChatGPT has incited debates regarding its impact +on human writing. Amid concerns about declining writing standards, this study +investigates the role of ChatGPT in facilitating academic writing, especially +among language learners. Using a case study approach, this study examines the +experiences of Kailing, a doctoral student, who integrates ChatGPT throughout +their academic writing process. The study employs activity theory as a lens for +understanding writing with generative AI tools and data analyzed includes +semi-structured interviews, writing samples, and GPT logs. Results indicate +that Kailing effectively collaborates with ChatGPT across various writing +stages while preserving her distinct authorial voice and agency. This +underscores the potential of AI tools such as ChatGPT to enhance academic +writing for language learners without overshadowing individual authenticity. +This case study offers a critical exploration of how ChatGPT is utilized in the +academic writing process and the preservation of a student's authentic voice +when engaging with the tool. + +
+
+ comment: 24 pages +
+
+
+
+
+ + ♻ ☆ Zero-shot Query Reformulation for Conversational Search SIGIR + + +
+ As the popularity of voice assistants continues to surge, conversational +search has gained increased attention in Information Retrieval. However, data +sparsity issues in conversational search significantly hinder the progress of +supervised conversational search methods. Consequently, researchers are +focusing more on zero-shot conversational search approaches. Nevertheless, +existing zero-shot methods face three primary limitations: they are not +universally applicable to all retrievers, their effectiveness lacks sufficient +explainability, and they struggle to resolve common conversational ambiguities +caused by omission. To address these limitations, we introduce a novel +Zero-shot Query Reformulation (ZeQR) framework that reformulates queries based +on previous dialogue contexts without requiring supervision from conversational +search data. Specifically, our framework utilizes language models designed for +machine reading comprehension tasks to explicitly resolve two common +ambiguities: coreference and omission, in raw queries. In comparison to +existing zero-shot methods, our approach is universally applicable to any +retriever without additional adaptation or indexing. It also provides greater +explainability and effectively enhances query intent understanding because +ambiguities are explicitly and proactively resolved. Through extensive +experiments on four TREC conversational datasets, we demonstrate the +effectiveness of our method, which consistently outperforms state-of-the-art +baselines. + +
+
+ comment: Accepted by the 9th ACM SIGIR International Conference on the Theory + of Information Retrieval +
+
+
+
+
+ + ♻ ☆ Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona + Biases in Dialogue Systems + + +
+ Recent advancements in Large Language Models empower them to follow freeform +instructions, including imitating generic or specific demographic personas in +conversations. We define generic personas to represent demographic groups, such +as "an Asian person", whereas specific personas may take the form of specific +popular Asian names like "Yumi". While the adoption of personas enriches user +experiences by making dialogue systems more engaging and approachable, it also +casts a shadow of potential risk by exacerbating social biases within model +responses, thereby causing societal harm through interactions with users. In +this paper, we systematically study "persona biases", which we define to be the +sensitivity of dialogue models' harmful behaviors contingent upon the personas +they adopt. We categorize persona biases into biases in harmful expression and +harmful agreement, and establish a comprehensive evaluation framework to +measure persona biases in five aspects: Offensiveness, Toxic Continuation, +Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to +investigate persona biases by experimenting with UNIVERSALPERSONA, a +systematically constructed persona dataset encompassing various types of both +generic and specific model personas. Through benchmarking on four different +models -- including Blender, ChatGPT, Alpaca, and Vicuna -- our study uncovers +significant persona biases in dialogue systems. Our findings also underscore +the pressing need to revisit the use of personas in dialogue agents to ensure +safe application. + +
+
+
+
+
+ + ♻ ☆ Improving Dialogue Management: Quality Datasets vs Models + + +
+ Task-oriented dialogue systems (TODS) have become crucial for users to +interact with machines and computers using natural language. One of their key +components is the dialogue manager, which guides the conversation towards a +good goal for the user by providing the best possible response. Previous works +have proposed rule-based systems (RBS), reinforcement learning (RL), and +supervised learning (SL) as solutions for the correct dialogue management; in +other words, selecting the best response given input by the user. However, this +work argues that the leading cause of DMs not achieving maximum performance +resides in the quality of the datasets rather than the models employed thus +far; this means that dataset errors, like mislabeling, cause a large +percentage of failures in dialogue management. We studied the main errors in +the most widely used datasets, Multiwoz 2.1 and SGD, to demonstrate this +hypothesis. To do this, we have designed a synthetic dialogue generator to +fully control the amount and type of errors introduced in the dataset. Using +this generator, we demonstrated that errors in the datasets contribute +proportionally to the performance of the models. + +
+
+
+
+
+ + ♻ ☆ Don't Take This Out of Context! On the Need for Contextual Models and + Evaluations for Stylistic Rewriting + + +
+ Most existing stylistic text rewriting methods and evaluation metrics operate +on a sentence level, but ignoring the broader context of the text can lead to +preferring generic, ambiguous, and incoherent rewrites. In this paper, we +investigate integrating the preceding textual context into both the +$\textit{rewriting}$ and $\textit{evaluation}$ stages of stylistic text +rewriting, and introduce a new composite contextual evaluation metric +$\texttt{CtxSimFit}$ that combines similarity to the original sentence with +contextual cohesiveness. We comparatively evaluate non-contextual and +contextual rewrites in formality, toxicity, and sentiment transfer tasks. Our +experiments show that humans significantly prefer contextual rewrites as more +fitting and natural over non-contextual ones, yet existing sentence-level +automatic metrics (e.g., ROUGE, SBERT) correlate poorly with human preferences +($\rho$=0--0.3). In contrast, human preferences are much better reflected by +both our novel $\texttt{CtxSimFit}$ ($\rho$=0.7--0.9) as well as proposed +context-infused versions of common metrics ($\rho$=0.4--0.7). Overall, our +findings highlight the importance of integrating context into the generation +and especially the evaluation stages of stylistic text rewriting. + +
+
+ comment: EMNLP 2023 main, camera-ready +
+
+
+
+
+ + ♻ ☆ Topics, Authors, and Networks in Large Language Model Research: Trends + from a Survey of 17K arXiv Papers + + +
+ Large language model (LLM) research is dramatically impacting society, making +it essential to understand the topics and values it prioritizes, the authors +and institutions driving it, and its networks of collaboration. Due to the +recent growth of the field, many of these fundamental attributes lack +systematic description. We gather, annotate, and analyze a new dataset of +16,979 LLM-related arXiv papers, focusing on changes in 2023 vs. 2018-2022. We +show that LLM research increasingly focuses on societal impacts: the Computers +and Society sub-arXiv has seen 20x growth in its proportion of LLM-related +papers in 2023. This change is driven in part by an influx of new authors: a +majority of 2023 papers are first-authored by researchers who have not +previously written an LLM-related paper, and these papers focus particularly on +applications and societal considerations. While a handful of companies hold +outsize influence, academia publishes a much larger fraction of papers than +industry overall, and this gap widens in 2023. LLM research is also being +shaped by social dynamics: there are gender and academic/industry differences +in the topics authors prioritize, and a stark U.S./China schism in the +collaboration network. Overall, our analysis documents how LLM research both +shapes and is shaped by society, attesting to the necessity of sociotechnical +lenses; we discuss implications for researchers and policymakers. + +
+
+ comment: Working paper. Data/code available at + https://github.com/rmovva/LLM-publication-patterns-public +
+
+
+
+
+ + ♻ ☆ Beyond Labels: Empowering Human Annotators with Natural Language + Explanations through a Novel Active-Learning Architecture EMNLP 2023 + + +
+ Real-world domain experts (e.g., doctors) rarely annotate only a decision +label in their day-to-day workflow without providing explanations. Yet, +existing low-resource learning techniques, such as Active Learning (AL), that +aim to support human annotators mostly focus on the label while neglecting the +natural language explanation of a data point. This work proposes a novel AL +architecture to support experts' real-world need for label and explanation +annotations in low-resource scenarios. Our AL architecture leverages an +explanation-generation model to produce explanations guided by human +explanations, a prediction model that utilizes generated explanations toward +prediction faithfully, and a novel data diversity-based AL sampling strategy +that benefits from the explanation annotations. Automated and human evaluations +demonstrate the effectiveness of incorporating explanations into AL sampling +and the improved human annotation efficiency and trustworthiness with our AL +architecture. Additional ablation studies illustrate the potential of our AL +architecture for transfer learning, generalizability, and integration with +large language models (LLMs). While LLMs exhibit exceptional +explanation-generation capabilities for relatively simple tasks, their +effectiveness in complex real-world tasks warrants further in-depth study. + +
+
+ comment: Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ An Attribution Method for Siamese Encoders EMNLP'23 + + +
+ Despite the success of Siamese encoder models such as sentence transformers +(ST), little is known about the aspects of inputs they pay attention to. A +barrier is that their predictions cannot be attributed to individual features, +as they compare two inputs rather than processing a single one. This paper +derives a local attribution method for Siamese encoders by generalizing the +principle of integrated gradients to models with multiple inputs. The solution +takes the form of feature-pair attributions, and can be reduced to a +token-token matrix for STs. Our method involves the introduction of integrated +Jacobians and inherits the advantageous formal properties of integrated +gradients: it accounts for the model's full computation graph and is guaranteed +to converge to the actual prediction. A pilot study shows that in an ST few +token-pairs can often explain large fractions of predictions, and it focuses on +nouns and verbs. For accurate predictions, it however needs to attend to the +majority of tokens and parts of speech. + +
+
+ comment: Accepted to EMNLP'23 +
+
+
+
+
+ + ♻ ☆ Long-Form Speech Translation through Segmentation with Finite-State + Decoding Constraints on Large Language Models EMNLP 2023 + + +
+ One challenge in speech translation is that plenty of spoken content is +long-form, but short units are necessary for obtaining high-quality +translations. To address this mismatch, we adapt large language models (LLMs) +to split long ASR transcripts into segments that can be independently +translated so as to maximize the overall translation quality. We overcome the +tendency of hallucination in LLMs by incorporating finite-state constraints +during decoding; these eliminate invalid outputs without requiring additional +training. We discover that LLMs are adaptable to transcripts containing ASR +errors through prompt-tuning or fine-tuning. Relative to a state-of-the-art +automatic punctuation baseline, our best LLM improves the average BLEU by 2.9 +points for English-German, English-Spanish, and English-Arabic TED talk +translation in 9 test sets, just by improving segmentation. + +
+
+ comment: accepted to the Findings of EMNLP 2023. arXiv admin note: text + overlap with arXiv:2212.09895 +
+
+
+
+
+ + ♻ ☆ The Benefits of Label-Description Training for Zero-Shot Text + Classification EMNLP 2023 + + +
+ Pretrained language models have improved zero-shot text classification by +allowing the transfer of semantic knowledge from the training data in order to +classify among specific label sets in downstream tasks. We propose a simple way +to further improve zero-shot accuracies with minimal effort. We curate small +finetuning datasets intended to describe the labels for a task. Unlike typical +finetuning data, which has texts annotated with labels, our data simply +describes the labels in language, e.g., using a few related terms, +dictionary/encyclopedia entries, and short templates. Across a range of topic +and sentiment datasets, our method is more accurate than zero-shot by 17-19% +absolute. It is also more robust to choices required for zero-shot +classification, such as patterns for prompting the model to classify and +mappings from labels to tokens in the model's vocabulary. Furthermore, since +our data merely describes the labels but does not use input texts, finetuning +on it yields a model that performs strongly on multiple text domains for a +given label set, even improving over few-shot out-of-domain classification in +multiple settings. + +
+
+ comment: Accepted at the EMNLP 2023 main conference (long paper) +
+
+
+
+
+ + ♻ ☆ Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through + Interaction with Symbolic Systems EMNLP 2023 + + +
+ Despite outstanding performance in many tasks, language models are +notoriously inclined to make factual errors in tasks requiring arithmetic +computation. We address this deficiency by creating Calc-X, a collection of +datasets that demonstrates the appropriate use of a calculator in reasoning +chains. Calc-X is suitable for teaching language models to offload computations +to a symbolic system. We survey and unify several existing chain-of-thought +datasets into a proposed format, resulting in a standard collection of over +300,000 samples requiring arithmetic reasoning. Finally, we use the new Calc-X +collection to train open-source calculator-using models we call Calcformers and +show that these models approximately double the accuracy of generating correct +results compared to vanilla language model baselines. We make all Calc-X +datasets, source code and Calcformers models publicly available. + +
+
+ comment: Published in EMNLP 2023: Main track +
+
+
+
+
+ + ♻ ☆ Multilingual Large Language Models Are Not (Yet) Code-Switchers EMNLP 2023 + + +
+ Multilingual Large Language Models (LLMs) have recently shown great +capabilities in a wide range of tasks, exhibiting state-of-the-art performance +through zero-shot or few-shot prompting methods. While there have been +extensive studies on their abilities in monolingual tasks, the investigation of +their potential in the context of code-switching (CSW), the practice of +alternating languages within an utterance, remains relatively uncharted. In +this paper, we provide a comprehensive empirical analysis of various +multilingual LLMs, benchmarking their performance across four tasks: sentiment +analysis, machine translation, summarization and word-level language +identification. Our results indicate that despite multilingual LLMs exhibiting +promising outcomes in certain tasks using zero or few-shot prompting, they +still underperform in comparison to fine-tuned models of much smaller scales. +We argue that current "multilingualism" in LLMs does not inherently imply +proficiency with code-switching texts, calling for future research to bridge +this discrepancy. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ ZARA: Improving Few-Shot Self-Rationalization for Small Language Models EMNLP + + +
+ Language models (LMs) that jointly generate end-task answers as well as +free-text rationales are known as self-rationalization models. Recent works +demonstrate great performance gain for self-rationalization by few-shot +prompting LMs with rationale-augmented exemplars. However, the ability to +benefit from explanations only emerges with large-scale LMs, which have poor +accessibility. In this work, we explore the less-studied setting of leveraging +explanations for small LMs to improve few-shot self-rationalization. We first +revisit the relationship between rationales and answers. Inspired by the +implicit mental process of how human beings assess explanations, we present a +novel approach, Zero-shot Augmentation of Rationale-Answer pairs (ZARA), to +automatically construct pseudo-parallel data for self-training by reducing the +problem of plausibility judgement to natural language inference. Experimental +results show ZARA achieves SOTA performance on the FEB benchmark, for both the +task accuracy and the explanation metric. In addition, we conduct human and +quantitative evaluation validating ZARA's ability to automatically identify +plausible and accurate rationale-answer pairs. + +
+
+ comment: Accepted as a long paper at EMNLP Findings 2023 +
+
+
+
+
+ + ♻ ☆ Self-ICL: Zero-Shot In-Context Learning with Self-Generated + Demonstrations EMNLP 2023 + + +
+ Large language models (LLMs) have exhibited striking in-context learning +(ICL) ability to adapt to target tasks with a few input-output demonstrations. +For better ICL, different methods are proposed to select representative +demonstrations from existing training corpora. However, such settings are not +aligned with real-world practices, as end-users usually query LMs without +access to demonstration pools. In this work, we introduce Self-ICL -- a simple +framework which bootstraps LMs' intrinsic capabilities to perform zero-shot +ICL. Given a test input, Self-ICL first prompts the model to generate +pseudo-inputs. Next, the model predicts pseudo-labels for the pseudo-inputs via +zero-shot prompting. Finally, we perform ICL for the test input with the +pseudo-input-label pairs as demonstrations. Evaluation on 23 BIG-Bench Hard +tasks shows Self-ICL outperforms zero-shot baselines on both average accuracy +and head-to-head comparison. Moreover, with zero-shot chain-of-thought, +Self-ICL achieves results comparable to using real demonstrations. +Additionally, we conduct a range of analyses to validate Self-ICL's +effectiveness and provide insights for its behaviors under different settings. + +
+
+ comment: Accepted as a long paper at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive + Decoders EMNLP 2023 + + +
+ Neural document rerankers are extremely effective in terms of accuracy. +However, the best models require dedicated hardware for serving, which is +costly and often not feasible. To avoid this serving-time requirement, we +present a method of capturing up to 86% of the gains of a Transformer +cross-attention model with a lexicalized scoring function that only requires +$10^{-6}$% of the Transformer's FLOPs per document and can be served using commodity +CPUs. When combined with a BM25 retriever, this approach matches the quality of +a state-of-the-art dual encoder retriever that still requires an accelerator +for query encoding. We introduce NAIL (Non-Autoregressive Indexing with +Language models) as a model architecture that is compatible with recent +encoder-decoder and decoder-only large language models, such as T5, GPT-3 and +PaLM. This model architecture can leverage existing pre-trained checkpoints and +can be fine-tuned for efficiently constructing document representations that do +not require neural processing of queries. + +
+
+ comment: To appear at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Automatic Model Selection with Large Language Models for Reasoning EMNLP 2023 + + +
+ Chain-of-Thought (CoT) and Program-Aided Language Models (PAL) represent two +distinct reasoning methods, each with its own strengths. CoT employs natural +language, offering flexibility and interpretability, while PAL utilizes +programming language, yielding more structured and rigorous logic. We introduce +a model selection method to combine the best of both worlds by employing a +large language model (LLM) to dynamically select between them. Our theoretical +analysis underscores the feasibility of this method, which is further +corroborated by empirical results. Our proposed method demonstrates significant +performance improvements across eight reasoning datasets with Codex, ChatGPT, +and GPT-4. Additionally, our method is complementary to self-consistency; when +integrated, it can further enhance performance while significantly reducing +computation costs. Moreover, we achieve new state-of-the-art results on GSM8K +and SVAMP, with respective accuracies of 96.8% and 93.7%. Our code, data and +prompts are available at https://github.com/XuZhao0/Model-Selection-Reasoning + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Establishing Trustworthiness: Rethinking Tasks and Model Evaluation EMNLP 2023 + + +
+ Language understanding is a multi-faceted cognitive capability, which the +Natural Language Processing (NLP) community has striven to model +computationally for decades. Traditionally, facets of linguistic intelligence +have been compartmentalized into tasks with specialized model architectures and +corresponding evaluation protocols. With the advent of large language models +(LLMs) the community has witnessed a dramatic shift towards general purpose, +task-agnostic approaches powered by generative models. As a consequence, the +traditional compartmentalized notion of language tasks is breaking down, +followed by an increasing challenge for evaluation and analysis. At the same +time, LLMs are being deployed in more real-world scenarios, including +previously unforeseen zero-shot setups, increasing the need for trustworthy and +reliable systems. Therefore, we argue that it is time to rethink what +constitutes tasks and model evaluation in NLP, and pursue a more holistic view +on language, placing trustworthiness at the center. Towards this goal, we +review existing compartmentalized approaches for understanding the origins of a +model's functional capacity, and provide recommendations for more multi-faceted +evaluation protocols. + +
+
+ comment: Accepted at EMNLP 2023 (Main Conference), camera-ready +
+
+
+
+
+ + ♻ ☆ Evaluating Open-QA Evaluation + + +
+ This study focuses on the evaluation of the Open Question Answering (Open-QA) +task, which can directly estimate the factuality of large language models +(LLMs). Current automatic evaluation methods have shown limitations, indicating +that human evaluation still remains the most reliable approach. We introduce a +new task, Evaluating QA Evaluation (QA-Eval) and the corresponding dataset +EVOUNA, designed to assess the accuracy of AI-generated answers in relation to +standard answers within Open-QA. Our evaluation of these methods utilizes +human-annotated results to measure their performance. Specifically, the work +investigates methods that show high correlation with human evaluations, deeming +them more reliable. We also discuss the pitfalls of current methods and methods +to improve LLM-based evaluators. We believe this new QA-Eval task and +corresponding dataset EVOUNA will facilitate the development of more effective +automatic evaluation tools and prove valuable for future research in this area. +All resources are available at https://github.com/wangcunxiang/QA-Eval +under the Apache-2.0 License. + +
+
+ comment: Accepted by Neurips-2023 Datasets and Benchmarks track; 28 pages +
+
+
+
+
+ + ♻ ☆ Prompting is not a substitute for probability measurements in large + language models EMNLP 2023 + + +
+ Prompting is now a dominant method for evaluating the linguistic knowledge of +large language models (LLMs). While other methods directly read out models' +probability distributions over strings, prompting requires models to access +this internal information by processing linguistic input, thereby implicitly +testing a new type of emergent ability: metalinguistic judgment. In this study, +we compare metalinguistic prompting and direct probability measurements as ways +of measuring models' linguistic knowledge. Broadly, we find that LLMs' +metalinguistic judgments are inferior to quantities directly derived from +representations. Furthermore, consistency gets worse as the prompt query +diverges from direct measurements of next-word probabilities. Our findings +suggest that negative results relying on metalinguistic prompts cannot be taken +as conclusive evidence that an LLM lacks a particular linguistic +generalization. Our results also highlight the value that is lost with the move +to closed APIs where access to probability distributions is limited. + +
+
+ comment: Camera-ready version for EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ CodeLMSec Benchmark: Systematically Evaluating and Finding Security + Vulnerabilities in Black-Box Code Language Models + + +
+ Large language models (LLMs) for automatic code generation have achieved +breakthroughs in several programming tasks. Their advances in competition-level +programming problems have made them an essential pillar of AI-assisted pair +programming, and tools such as GitHub Copilot have emerged as part of the daily +programming workflow used by millions of developers. The training data for +these models is usually collected from the Internet (e.g., from open-source +repositories) and is likely to contain faults and security vulnerabilities. +This unsanitized training data can cause the language models to learn these +vulnerabilities and propagate them during the code generation procedure. While +these models have been extensively assessed for their ability to produce +functionally correct programs, there remains a lack of comprehensive +investigations and benchmarks addressing the security aspects of these models. + In this work, we propose a method to systematically study the security issues +of code language models to assess their susceptibility to generating vulnerable +code. To this end, we introduce the first approach to automatically find +generated code that contains vulnerabilities in black-box code generation +models. To achieve this, we present an approach to approximate inversion of the +black-box code generation models based on few-shot prompting. We evaluate the +effectiveness of our approach by examining code language models in generating +high-risk security weaknesses. Furthermore, we establish a collection of +diverse non-secure prompts for various vulnerability scenarios using our +method. This dataset forms a benchmark for evaluating and comparing the +security weaknesses in code language models. + +
+
+ comment: 23 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Smooth Sailing: Improving Active Learning for Pre-trained Language + Models with Representation Smoothness Analysis + + +
+ Developed to alleviate prohibitive labeling costs, active learning (AL) +methods aim to reduce label complexity in supervised learning. While recent +work has demonstrated the benefit of using AL in combination with large +pre-trained language models (PLMs), it has often overlooked the practical +challenges that hinder the effectiveness of AL. We address these challenges by +leveraging representation smoothness analysis to ensure AL is feasible, that +is, both effective and practicable. Firstly, we propose an early stopping +technique that does not require a validation set -- often unavailable in +realistic AL conditions -- and observe significant improvements over random +sampling across multiple datasets and AL methods. Further, we find that task +adaptation improves AL, whereas standard short fine-tuning in AL does not +provide improvements over random sampling. Our work demonstrates the usefulness +of representation smoothness analysis for AL and introduces an AL stopping +criterion that reduces label complexity. + +
+
+ comment: Accepted at Learning with Small Data 2023, Association for + Computational Linguistics +
+
+
+
+
+ + ♻ ☆ InterroLang: Exploring NLP Models and Datasets through Dialogue-based + Explanations EMNLP 2023 + + +
+ While recently developed NLP explainability methods let us open the black box +in various ways (Madsen et al., 2022), a missing ingredient in this endeavor is +an interactive tool offering a conversational interface. Such a dialogue system +can help users explore datasets and models with explanations in a +contextualized manner, e.g. via clarification or follow-up questions, and +through a natural language interface. We adapt the conversational explanation +framework TalkToModel (Slack et al., 2022) to the NLP domain, add new +NLP-specific operations such as free-text rationalization, and illustrate its +generalizability on three NLP tasks (dialogue act classification, question +answering, hate speech detection). To recognize user queries for explanations, +we evaluate fine-tuned and few-shot prompting models and implement a novel +Adapter-based approach. We then conduct two user studies on (1) the perceived +correctness and helpfulness of the dialogues, and (2) the simulatability, i.e. +how objectively helpful dialogical explanations are for humans in figuring out +the model's predicted label when it's not shown. We found rationalization and +feature attribution were helpful in explaining the model behavior. Moreover, +users could more reliably predict the model outcome based on an explanation +dialogue rather than one-off explanations. + +
+
+ comment: EMNLP 2023 Findings. Camera-ready version +
+
+
+
+
+ + ♻ ☆ Thorny Roses: Investigating the Dual Use Dilemma in Natural Language + Processing + + +
+ Dual use, the intentional, harmful reuse of technology and scientific +artefacts, is a problem yet to be well-defined within the context of Natural +Language Processing (NLP). However, as NLP technologies continue to advance and +become increasingly widespread in society, their inner workings have become +increasingly opaque. Therefore, understanding dual use concerns and potential +ways of limiting them is critical to minimising the potential harms of research +and development. In this paper, we conduct a survey of NLP researchers and +practitioners to understand the depth and their perspective of the problem as +well as to assess existing available support. Based on the results of our +survey, we offer a definition of dual use that is tailored to the needs of the +NLP community. The survey revealed that a majority of researchers are concerned +about the potential dual use of their research but only take limited action +toward it. In light of the survey results, we discuss the current state and +potential means for mitigating dual use in NLP and propose a checklist that can +be integrated into existing conference ethics-frameworks, e.g., the ACL ethics +checklist. + +
+
+
+
+
+ + ♻ ☆ Parameter-Efficient Language Model Tuning with Active Learning in + Low-Resource Settings EMNLP 2023 + + +
+ Pre-trained language models (PLMs) have ignited a surge in demand for +effective fine-tuning techniques, particularly in low-resource domains and +languages. Active learning (AL), a set of algorithms designed to decrease +labeling costs by minimizing label complexity, has shown promise in confronting +the labeling bottleneck. In parallel, adapter modules designed for +parameter-efficient fine-tuning (PEFT) have demonstrated notable potential in +low-resource settings. However, the interplay between AL and adapter-based PEFT +remains unexplored. We present an empirical study of PEFT behavior with AL in +low-resource settings for text classification tasks. Our findings affirm the +superiority of PEFT over full-fine tuning (FFT) in low-resource settings and +demonstrate that this advantage persists in AL setups. We further examine the +properties of PEFT and FFT through the lens of forgetting dynamics and +instance-level representations, where we find that PEFT yields more stable +representations of early and middle layers compared to FFT. Our research +underscores the synergistic potential of AL and PEFT in low-resource settings, +paving the way for advancements in efficient and effective fine-tuning. + +
+
+ comment: Accepted at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Differentially Private Natural Language Models: Recent Advances and + Future Directions + + +
+ Recent developments in deep learning have led to great success in various +natural language processing (NLP) tasks. However, these applications may +involve data that contain sensitive information. Therefore, how to achieve good +performance while also protecting the privacy of sensitive data is a crucial +challenge in NLP. To preserve privacy, Differential Privacy (DP), which can +prevent reconstruction attacks and protect against potential side knowledge, is +becoming a de facto technique for private data analysis. In recent years, NLP +in DP models (DP-NLP) has been studied from different perspectives, which +deserves a comprehensive review. In this paper, we provide the first systematic +review of recent advances in DP deep learning models in NLP. In particular, we +first discuss some differences and additional challenges of DP-NLP compared +with the standard DP deep learning. Then, we investigate some existing work on +DP-NLP and present its recent developments from three aspects: gradient +perturbation based methods, embedding vector perturbation based methods, and +ensemble model based methods. We also discuss some challenges and future +directions. + +
+
+
+
+
+ + ♻ ☆ Enhancing Long-form Text Generation Efficacy with Task-adaptive + Tokenization + + +
+ We propose task-adaptive tokenization as a way to adapt the generation +pipeline to the specifics of a downstream task and enhance long-form generation +in mental health. Inspired by insights from cognitive science, our +task-adaptive tokenizer samples variable segmentations from multiple outcomes, +with sampling probabilities optimized based on task-specific data. We introduce +a strategy for building a specialized vocabulary and introduce a vocabulary +merging protocol that allows for the integration of task-specific tokens into +the pre-trained model's tokenization step. Through extensive experiments on +psychological question-answering tasks in both Chinese and English, we find +that our task-adaptive tokenization approach brings a significant improvement +in generation performance while using up to 60% fewer tokens. Preliminary +experiments point to promising results when using our tokenization approach +with very large language models. + +
+
+ comment: Accepted at the main conference of The 2023 Conference on Empirical + Methods in Natural Language Processing; 8 pages +
+
+
+
+
+ + ♻ ☆ Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization + for Few-shot Generalization EMNLP 2023 + + +
+ Prompt tuning is a parameter-efficient method, which learns soft prompts and +conditions frozen language models to perform specific downstream tasks. Though +effective, prompt tuning under few-shot settings on the one hand heavily relies +on a good initialization of soft prompts. On the other hand, it can easily +overfit to few-shot training samples, thereby undermining generalizability. +Existing works leverage pre-training or supervised meta-learning to initialize +soft prompts but they fail to data-efficiently generalize to unseen downstream +tasks. To address the above problems, this paper proposes a novel +Self-sUpervised meta-Prompt learning framework with MEta-gradient +Regularization for few-shot generalization (SUPMER). SUPMER leverages +self-supervised meta-learning with a diverse set of well-designed meta-training +tasks to learn a universal prompt initialization for efficient adaptation using +only unlabeled data. Additionally, it jointly meta-learns a gradient +regularization function to transform raw gradients into a domain-generalizable +direction, thus alleviating the problem of overfitting. Extensive experiments +show that SUPMER achieves better performance for different few-shot downstream +tasks, and also exhibits a stronger domain generalization ability. The code for +SUPMER will be available at https://github.com/beepkh/SUPMER. + +
+
+ comment: Accepted by EMNLP 2023 (Findings) +
+
+
+
+
+ + ♻ ☆ LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and + Unlabeled Image Collections NeurIPS 2023 + + +
+ Recently, large-scale pre-trained Vision and Language (VL) models have set a +new state-of-the-art (SOTA) in zero-shot visual classification enabling +open-vocabulary recognition of a potentially unlimited set of categories defined +as simple language prompts. However, despite these great advances, the +performance of these zero-shot classifiers still falls short of the results of +dedicated (closed category set) classifiers trained with supervised fine +tuning. In this paper we show, for the first time, how to reduce this gap +without any labels and without any paired VL data, using an unlabeled image +collection and a set of texts auto-generated using a Large Language Model (LLM) +describing the categories of interest and effectively substituting labeled +visual instances of those categories. Using our label-free approach, we are +able to attain significant performance improvements over the zero-shot +performance of the base VL model and other contemporary methods and baselines +on a wide variety of datasets, demonstrating absolute improvement of up to +11.7% (3.8% on average) in the label-free setting. Moreover, despite our +approach being label-free, we observe 1.3% average gains over leading few-shot +prompting baselines that do use 5-shot supervision. + +
+
+ comment: NeurIPS 2023 (Camera Ready) - Project Page: + https://jmiemirza.github.io/LaFTer/ +
+
+
+
+
+ + ♻ ☆ LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic + Tabletop Manipulation + + +
+ The convergence of embodied agents and large language models (LLMs) has +brought significant advancements to embodied instruction following. +Particularly, the strong reasoning capabilities of LLMs make it possible for +robots to perform long-horizon tasks without expensive annotated +demonstrations. However, public benchmarks for testing the long-horizon +reasoning capabilities of language-conditioned robots in various scenarios are +still missing. To fill this gap, this work focuses on the tabletop manipulation +task and releases a simulation benchmark, LoHoRavens, which covers +various long-horizon reasoning aspects spanning color, size, space, arithmetic +and reference. Furthermore, there is a key modality bridging problem for +long-horizon manipulation tasks with LLMs: how to incorporate the observation +feedback during robot execution for the LLM's closed-loop planning, which is +however less studied by prior work. We investigate two methods of bridging the +modality gap: caption generation and learnable interface for incorporating +explicit and implicit observation feedback to the LLM, respectively. These +methods serve as the two baselines for our proposed benchmark. Experiments show +that both methods struggle to solve some tasks, indicating long-horizon +manipulation tasks are still challenging for current popular models. We expect +the proposed public benchmark and baselines can help the community develop +better models for long-horizon tabletop manipulation tasks. + +
+
+ comment: 6 pages, 4 figures. The video and code of LoHoRavens are available at + https://cisnlp.github.io/lohoravens-webpage/ +
+
+
+
+
+ + ♻ ☆ Prompt Tuned Embedding Classification for Multi-Label Industry Sector + Allocation + + +
+ Prompt Tuning is emerging as a scalable and cost-effective method to +fine-tune Pretrained Language Models (PLMs), which are often referred to as +Large Language Models (LLMs). This study benchmarks the performance and +computational efficiency of Prompt Tuning and baselines for multi-label text +classification. This is applied to the challenging task of classifying +companies into an investment firm's proprietary industry taxonomy, supporting +their thematic investment strategy. Text-to-text classification is frequently +reported to outperform task-specific classification heads, but has several +limitations when applied to a multi-label classification problem where each +label consists of multiple tokens: (a) Generated labels may not match any label +in the label taxonomy; (b) The fine-tuning process lacks permutation invariance +and is sensitive to the order of the provided labels; (c) The model provides +binary decisions rather than appropriate confidence scores. Limitation (a) is +addressed by applying constrained decoding using Trie Search, which slightly +improves classification performance. All limitations (a), (b), and (c) are +addressed by replacing the PLM's language head with a classification head, +which is referred to as Prompt Tuned Embedding Classification (PTEC). This +improves performance significantly, while also reducing computational costs +during inference. In our industrial application, the training data is skewed +towards well-known companies. We confirm that the model's performance is +consistent across both well-known and less-known companies. Our overall results +indicate the continuing need to adapt state-of-the-art methods to +domain-specific tasks, even in the era of PLMs with strong generalization +abilities. We release our codebase and a benchmarking dataset at +https://github.com/EQTPartners/PTEC. + +
+
+
+
+
+ + ♻ ☆ IRRGN: An Implicit Relational Reasoning Graph Network for Multi-turn + Response Selection EMNLP 2022 + + +
+ The task of response selection in multi-turn dialogue is to find the best +option from all candidates. In order to improve the reasoning ability of the +model, previous studies pay more attention to using explicit algorithms to +model the dependencies between utterances, which are deterministic, limited and +inflexible. In addition, few studies consider differences between the options +before and after reasoning. In this paper, we propose an Implicit Relational +Reasoning Graph Network to address these issues, which consists of the +Utterance Relational Reasoner (URR) and the Option Dual Comparator (ODC). URR +aims to implicitly extract dependencies between utterances, as well as +utterances and options, and make reasoning with relational graph convolutional +networks. ODC focuses on perceiving the difference between the options through +dual comparison, which can eliminate the interference of the noise options. +Experimental results on two multi-turn dialogue reasoning benchmark datasets +MuTual and MuTual+ show that our method significantly improves the baseline of +four pretrained language models and achieves state-of-the-art performance. The +model surpasses human performance for the first time on the MuTual dataset. + +
+
+ comment: Accepted by EMNLP 2022 +
+
+
+
+
+ + ♻ ☆ RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder + for Language Modeling EMNLP 2023 + + +
+ Retrieval-augmented language models show promise in addressing issues like +outdated information and hallucinations in language models (LMs). However, +current research faces two main problems: 1) determining what information to +retrieve, and 2) effectively combining retrieved information during generation. +We argue that valuable retrieved information should not only be related to the +current source text but also consider the future target text, given the nature +of LMs that model future tokens. Moreover, we propose that aggregation using +latent variables derived from a compact latent space is more efficient than +utilizing explicit raw text, which is limited by context length and susceptible +to noise. Therefore, we introduce RegaVAE, a retrieval-augmented language model +built upon the variational auto-encoder (VAE). It encodes the text corpus into +a latent space, capturing current and future information from both source and +target text. Additionally, we leverage the VAE to initialize the latent space +and adopt the probabilistic form of the retrieval generation paradigm by +expanding the Gaussian prior distribution into a Gaussian mixture distribution. +Theoretical analysis provides an optimizable upper bound for RegaVAE. +Experimental results on various datasets demonstrate significant improvements +in text generation quality and hallucination removal. + +
+
+ comment: Accepted to the Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Baby's CoThought: Leveraging Large Language Models for Enhanced + Reasoning in Compact Models CoNLL 2023 + + +
+ Large Language Models (LLMs) demonstrate remarkable performance on a variety +of natural language understanding (NLU) tasks, primarily due to their +in-context learning ability. This ability could be applied to building babylike +models, i.e. models at small scales, improving training efficiency. In this +paper, we propose a "CoThought" pipeline, which efficiently trains smaller +"baby" language models (BabyLMs) by leveraging the Chain of Thought prompting +of LLMs. Our pipeline restructures a dataset of less than 100M in size using +GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that +are comparable to the school texts for language learners. The BabyLM is then +pretrained on this restructured dataset in a RoBERTa fashion. In evaluations +across 4 benchmarks, our BabyLM outperforms the vanilla RoBERTa in 10 +linguistic, NLU, and question-answering tasks by more than 3 points, showing a +superior ability to extract contextual information. These results suggest that +compact LMs pretrained on small, LLM-restructured data can better understand +tasks and achieve improved performance. + +
+
+ comment: CoNLL 2023 BabyLM Challenge +
+
+
+
+
+ + ♻ ☆ MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties + Grounded in Math Reasoning Problems EMNLP2023 + + +
+ While automatic dialogue tutors hold great potential in making education +personalized and more accessible, research on such systems has been hampered by +a lack of sufficiently large and high-quality datasets. Collecting such +datasets remains challenging, as recording tutoring sessions raises privacy +concerns and crowdsourcing leads to insufficient data quality. To address this, +we propose a framework to generate such dialogues by pairing human teachers +with a Large Language Model (LLM) prompted to represent common student errors. +We describe how we use this framework to collect MathDial, a dataset of 3k +one-to-one teacher-student tutoring dialogues grounded in multi-step math +reasoning problems. While models like GPT-3 are good problem solvers, they fail +at tutoring because they generate factually incorrect feedback or are prone to +revealing solutions to students too early. To overcome this, we let teachers +provide learning opportunities to students by guiding them using various +scaffolding questions according to a taxonomy of teacher moves. We demonstrate +MathDial and its extensive annotations can be used to finetune models to be +more effective tutors (and not just solvers). We confirm this by automatic and +human evaluation, notably in an interactive setting that measures the trade-off +between student solving success and telling solutions. The dataset is released +publicly. + +
+
+ comment: Jakub Macina, Nico Daheim, and Sankalan Pal Chowdhury contributed + equally to this work. Accepted at EMNLP2023 Findings. Code and dataset + available: https://github.com/eth-nlped/mathdial +
+
+
+
+
+ + ♻ ☆ StoryAnalogy: Deriving Story-level Analogies from Large Language Models + to Unlock Analogical Understanding EMNLP 2023 + + +
+ Analogy-making between narratives is crucial for human reasoning. In this +paper, we evaluate the ability to identify and generate analogies by +constructing a first-of-its-kind large-scale story-level analogy corpus, +StoryAnalogy, which contains 24K story pairs from diverse domains with +human annotations on two similarities from the extended Structure-Mapping +Theory. We design a set of tests on StoryAnalogy, presenting the first +evaluation of story-level analogy identification and generation. Interestingly, +we find that the analogy identification tasks are incredibly difficult not only +for sentence embedding models but also for the recent large language models +(LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around +30% accuracy in multiple-choice questions (compared to over 85% accuracy for +humans). Furthermore, we observe that the data in StoryAnalogy can +improve the quality of analogy generation in LLMs, where a fine-tuned +FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT. + +
+
+ comment: Accepted by EMNLP 2023 main conference +
+
+
+
+
+ + ♻ ☆ Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For + Large Language Models EMNLP 2023 + + +
+ The performance of large language models (LLMs) on existing reasoning +benchmarks has significantly improved over the past years. In response, we +present JEEBench, a considerably more challenging benchmark dataset for +evaluating the problem solving abilities of LLMs. We curate 515 challenging +pre-engineering mathematics, physics and chemistry problems from the highly +competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep +in-domain knowledge is essential for solving problems in this benchmark. Our +evaluation on various open-source and proprietary models reveals that the +highest performance, even after using techniques like self-consistency, +self-refinement and chain-of-thought prompting, is less than 40%. The typical +failure modes of GPT-4, the best model, are errors in algebraic manipulation, +difficulty in grounding abstract concepts into mathematical equations +accurately and failure in retrieving relevant domain-specific concepts. We also +observe that by mere prompting, GPT-4 is unable to assess risk introduced by +negative marking for incorrect answers. For this, we develop a post-hoc +confidence-thresholding method over self-consistency, which enables effective +response selection. We hope that our challenging benchmark will guide future +research in problem-solving using LLMs. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Evaluating Hallucinations in Chinese Large Language Models + + +
+ In this paper, we establish a benchmark named HalluQA (Chinese Hallucination +Question-Answering) to measure the hallucination phenomenon in Chinese large +language models. HalluQA contains 450 meticulously designed adversarial +questions, spanning multiple domains, and takes into account Chinese historical +culture, customs, and social phenomena. During the construction of HalluQA, we +consider two types of hallucinations: imitative falsehoods and factual errors, +and we construct adversarial samples based on GLM-130B and ChatGPT. For +evaluation, we design an automated evaluation method using GPT-4 to judge +whether a model output is hallucinated. We conduct extensive experiments on 24 +large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk, +etc. Out of the 24 models, 18 achieved non-hallucination rates lower than +50%. This indicates that HalluQA is highly challenging. We analyze the primary +types of hallucinations in different types of models and their causes. +Additionally, we discuss which types of hallucinations should be prioritized +for different types of models. + +
+
+ comment: Work in progress +
+
+
+
+
+ + ♻ ☆ Towards Real-World Streaming Speech Translation for Code-Switched Speech + + +
+ Code-switching (CS), i.e. mixing different languages in a single sentence, is +a common phenomenon in communication and can be challenging in many Natural +Language Processing (NLP) settings. Previous studies on CS speech have shown +promising results for end-to-end speech translation (ST), but have been limited +to offline scenarios and to translation to one of the languages present in the +source (monolingual transcription). + In this paper, we focus on two essential yet unexplored areas for real-world +CS speech translation: streaming settings, and translation to a third language +(i.e., a language not included in the source). To this end, we extend the +Fisher and Miami test and validation datasets to include new targets in Spanish +and German. Using this data, we train a model for both offline and streaming ST +and we establish baseline results for the two settings mentioned earlier. + +
+
+
+
+
+ + ♻ ☆ End-to-End Evaluation for Low-Latency Simultaneous Speech Translation + + +
+ The challenge of low-latency speech translation has recently drawn significant +interest in the research community as shown by several publications and shared +tasks. Therefore, it is essential to evaluate these different approaches in +realistic scenarios. However, currently only specific aspects of the systems +are evaluated and often it is not possible to compare different approaches. + In this work, we propose the first framework to perform and evaluate the +various aspects of low-latency speech translation under realistic conditions. +The evaluation is carried out in an end-to-end fashion. This includes the +segmentation of the audio as well as the run-time of the different components. + Secondly, we compare different approaches to low-latency speech translation +using this framework. We evaluate models with the option to revise the output +as well as methods with fixed output. Furthermore, we directly compare +state-of-the-art cascaded as well as end-to-end systems. Finally, the framework +allows automatic evaluation of translation quality as well as latency and +also provides a web interface to show the low-latency model outputs to the +user. + +
+
+
+
+
+ + ♻ ☆ Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak + Supervision for Text Classification EMNLP 2023 + + +
+ Recent advances in weakly supervised text classification mostly focus on +designing sophisticated methods to turn high-level human heuristics into +quality pseudo-labels. In this paper, we revisit the seed matching-based +method, which is arguably the simplest way to generate pseudo-labels, and show +that its power was greatly underestimated. We show that the limited performance +of seed matching is largely due to the label bias injected by the simple +seed-match rule, which prevents the classifier from learning reliable +confidence for selecting high-quality pseudo-labels. Interestingly, simply +deleting the seed words present in the matched input texts can mitigate the +label bias and help learn better confidence. Subsequently, the performance +achieved by seed matching can be improved significantly, making it on par with +or even better than the state-of-the-art. Furthermore, to handle the case when +the seed words are not made known, we propose to simply delete the word tokens +in the input text randomly with a high deletion ratio. Remarkably, seed +matching equipped with this random deletion method can often achieve even +better performance than that with seed deletion. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ CompoundPiece: Evaluating and Improving Decompounding Performance of + Language Models EMNLP 2023 + + +
+ While many languages possess processes of joining two or more words to create +compound words, previous studies have been typically limited only to languages +with excessively productive compound formation (e.g., German, Dutch) and there +is no public dataset containing compound and non-compound words across a large +number of languages. In this work, we systematically study decompounding, the +task of splitting compound words into their constituents, at a wide scale. We +first address the data gap by introducing a dataset of 255k compound and +non-compound words across 56 diverse languages obtained from Wiktionary. We +then use this dataset to evaluate an array of Large Language Models (LLMs) on +the decompounding task. We find that LLMs perform poorly, especially on words +which are tokenized unfavorably by subword tokenization. We thus introduce a +novel methodology to train dedicated models for decompounding. The proposed +two-stage procedure relies on a fully self-supervised objective in the first +stage, while the second, supervised learning stage optionally fine-tunes the +model on the annotated Wiktionary data. Our self-supervised models outperform +the prior best unsupervised decompounding models by 13.9% accuracy on average. +Our fine-tuned models outperform all prior (language-specific) decompounding +tools. Furthermore, we use our models to leverage decompounding during the +creation of a subword tokenizer, which we refer to as CompoundPiece. +CompoundPiece tokenizes compound words more favorably on average, leading to +improved performance on decompounding over an otherwise equivalent model using +SentencePiece tokenization. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Non-Programmers Can Label Programs Indirectly via Active Examples: A + Case Study with Text-to-SQL + + +
+ Can non-programmers annotate natural language utterances with complex +programs that represent their meaning? We introduce APEL, a framework in which +non-programmers select among candidate programs generated by a seed semantic +parser (e.g., Codex). Since they cannot understand the candidate programs, we +ask them to select indirectly by examining the programs' input-output examples. +For each utterance, APEL actively searches for a simple input on which the +candidate programs tend to produce different outputs. It then asks the +non-programmers only to choose the appropriate output, thus allowing us to +infer which program is correct and could be used to fine-tune the parser. As a +first case study, we recruited human non-programmers to use APEL to re-annotate +SPIDER, a text-to-SQL dataset. Our approach achieved the same annotation +accuracy as the original expert annotators (75%) and exposed many subtle errors +in the original annotations. + +
+
+
+
+
+ + ♻ ☆ ECHo: A Visio-Linguistic Dataset for Event Causality Inference via + Human-Centric Reasoning EMNLP 2023 + + +
+ We introduce ECHo (Event Causality Inference via Human-Centric Reasoning), a +diagnostic dataset of event causality inference grounded in visio-linguistic +social scenarios. ECHo employs real-world human-centric deductive information +building on a television crime drama. ECHo requires the Theory-of-Mind (ToM) +ability to understand and reason about social interactions based on multimodal +information. Using ECHo, we propose a unified Chain-of-Thought (CoT) framework +to assess the reasoning capability of current AI systems. Our ToM-enhanced CoT +pipeline accommodates various large foundation models in both zero-shot and +few-shot visio-linguistic reasoning. We use this framework to scrutinize recent +large foundation models such as InstructGPT and MiniGPT-4 on three diagnostic +human-centric tasks. Further analysis demonstrates ECHo as a challenging +dataset to expose imperfections and inconsistencies in reasoning. Our data and +code are publicly available at https://github.com/YuxiXie/ECHo. + +
+
+ comment: Findings of EMNLP 2023. 10 pages, 6 figures, 5 tables (22 pages, 8 + figures, 15 tables including references and appendices) +
+
+
+
+
+ + ♻ ☆ InterFair: Debiasing with Natural Language Feedback for Fair + Interpretable Predictions EMNLP 2023 + + +
+ Debiasing methods in NLP models traditionally focus on isolating information +related to a sensitive attribute (e.g., gender or race). We instead argue that +a favorable debiasing method should use sensitive information 'fairly,' with +explanations, rather than blindly eliminating it. This fair balance is often +subjective and can be challenging to achieve algorithmically. We explore two +interactive setups with a frozen predictive model and show that users able to +provide feedback can achieve a better and fairer balance between task +performance and bias mitigation. In one setup, users, by interacting with test +examples, further decreased bias in the explanations (5-8%) while maintaining +the same prediction accuracy. In the other setup, human feedback was able to +disentangle associated bias and predictive information from the input leading +to superior bias mitigation and improved task performance (4-5%) +simultaneously. + +
+
+ comment: Accepted in EMNLP 2023 (Main) +
+
+
+
+
+ + ♻ ☆ Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning + + +
+ Despite their impressive performance, large language models (LMs) still +struggle with reliably generating complex output structures when not finetuned +to follow the required output format exactly. To address this issue, +grammar-constrained decoding (GCD) can be used to control the generation of +LMs, guaranteeing that the output follows a given structure. Most existing GCD +methods are, however, limited to specific tasks, such as parsing or code +generation. In this work, we demonstrate that formal grammars can describe the +output space for a much wider range of tasks and argue that GCD can serve as a +unified framework for structured NLP tasks in general. For increased +flexibility, we introduce input-dependent grammars, which allow the grammar to +depend on the input and thus enable the generation of different output +structures for different inputs. We then empirically demonstrate the power and +flexibility of GCD-enhanced LMs on (1) information extraction, (2) entity +disambiguation, and (3) constituency parsing. Our results indicate that +grammar-constrained LMs substantially outperform unconstrained LMs or even beat +task-specific finetuned models. Grammar constraints thus hold great promise for +harnessing off-the-shelf LMs for a wide range of structured NLP tasks, +especially where training data is scarce or finetuning is expensive. Code and +data: https://github.com/epfl-dlab/GCD. + +
+
+
+
+
+ + ♻ ☆ H2O Open Ecosystem for State-of-the-art Large Language Models EMNLP 2023 + + +
+ Large Language Models (LLMs) represent a revolution in AI. However, they also +pose many significant risks, such as the presence of biased, private, +copyrighted or harmful text. For this reason we need open, transparent and safe +solutions. We introduce a complete open-source ecosystem for developing and +testing LLMs. The goal of this project is to boost open alternatives to +closed-source approaches. We release h2oGPT, a family of fine-tuned LLMs of +diverse sizes. We also introduce H2O LLM Studio, a framework and no-code GUI +designed for efficient fine-tuning, evaluation, and deployment of LLMs using +the most recent state-of-the-art techniques. Our code and models are fully +open-source. We believe this work helps to boost AI development and make it +more accessible, efficient and trustworthy. The demo is available at: +https://gpt.h2o.ai/ + +
+
+ comment: EMNLP 2023 Demo - ACL Empirical Methods in Natural Language + Processing +
+
+
+
+
+ + ♻ ☆ On the Representational Capacity of Recurrent Neural Language Models EMNLP 2023 + + +
+ This work investigates the computational expressivity of language models +(LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) +famously showed that RNNs with rational weights and hidden states and unbounded +computation time are Turing complete. However, LMs define weightings over +strings in addition to just (unweighted) language membership and the analysis +of the computational power of RNN LMs (RLMs) should reflect this. We extend the +Turing completeness result to the probabilistic case, showing how a rationally +weighted RLM with unbounded computation time can simulate any probabilistic +Turing machine (PTM). Since, in practice, RLMs work in real-time, processing a +symbol at every time step, we treat the above result as an upper bound on the +expressivity of RLMs. We also provide a lower bound by showing that under the +restriction to real-time computation, such models can simulate deterministic +real-time rational PTMs. + +
+
+ comment: To be published at EMNLP 2023; +
+
+
+
+
+ + ♻ ☆ LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly + Transformers + + +
+ The community has explored building private inference frameworks for +transformer-based large language models (LLMs) in a server-client setting, +where the server holds the model parameters and the client inputs its private +data (or prompt) for inference. However, these frameworks impose significant +overhead when the private inputs are forward propagated through the original +LLMs. In this paper, we show that substituting the computation- and +communication-heavy operators in the transformer architecture with +privacy-computing friendly approximations can greatly reduce the private +inference costs with only a minor impact on model performance. +Compared to state-of-the-art Iron (NeurIPS 2022), our privacy-computing +friendly model inference pipeline achieves a $5\times$ acceleration in +computation and an 80% reduction in communication overhead, while retaining +nearly identical accuracy. + +
+
+
+
+
+ + ♻ ☆ Enhancing Retrieval-Augmented Large Language Models with Iterative + Retrieval-Generation Synergy + + +
+ Large language models are powerful text processors and reasoners, but are +still subject to limitations including outdated knowledge and hallucinations, +which necessitates connecting them to the world. Retrieval-augmented large +language models have raised extensive attention for grounding model generation +on external knowledge. However, retrievers struggle to capture relevance, +especially for queries with complex information needs. Recent work has proposed +to improve relevance modeling by having large language models actively involved +in retrieval, i.e., to improve retrieval with generation. In this paper, we +show that strong performance can be achieved by a method we call Iter-RetGen, +which synergizes retrieval and generation in an iterative manner. A model +output shows what might be needed to finish a task, and thus provides an +informative context for retrieving more relevant knowledge which in turn helps +generate a better output in the next iteration. Compared with recent work which +interleaves retrieval with generation when producing an output, Iter-RetGen +processes all retrieved knowledge as a whole and largely preserves the +flexibility in generation without structural constraints. We evaluate +Iter-RetGen on multi-hop question answering, fact verification, and commonsense +reasoning, and show that it can flexibly leverage parametric knowledge and +non-parametric knowledge, and is superior to or competitive with +state-of-the-art retrieval-augmented baselines while causing fewer overheads of +retrieval and generation. We can further improve performance via +generation-augmented retrieval adaptation. + +
+
+ comment: Preprint +
+
+
+
+
+ + ♻ ☆ DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller + Language Models EMNLP 2023 + + +
+ Chain-of-Thought (CoT) prompting has proven to be effective in enhancing the +reasoning capabilities of Large Language Models (LLMs) with at least 100 +billion parameters. However, it is ineffective or even detrimental when applied +to reasoning tasks in Smaller Language Models (SLMs) with less than 10 billion +parameters. To address this limitation, we introduce Dialogue-guided +Chain-of-Thought (DialCoT) which employs a dialogue format to generate +intermediate reasoning steps, guiding the model toward the final answer. +Additionally, we optimize the model's reasoning path selection using the +Proximal Policy Optimization (PPO) algorithm, further enhancing its reasoning +capabilities. Our method offers several advantages compared to previous +approaches. Firstly, we transform the process of solving complex reasoning +questions by breaking them down into a series of simpler sub-questions, +significantly reducing the task difficulty and making it more suitable for +SLMs. Secondly, we optimize the model's reasoning path selection through the +PPO algorithm. We conduct comprehensive experiments on four arithmetic +reasoning datasets, demonstrating that our method achieves significant +performance improvements compared to state-of-the-art competitors. + +
+
+ comment: Accepted to EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Mirages: On Anthropomorphism in Dialogue Systems EMNLP + + +
+ Automated dialogue or conversational systems are anthropomorphised by +developers and personified by users. While a degree of anthropomorphism may be +inevitable due to the choice of medium, conscious and unconscious design +choices can guide users to personify such systems to varying degrees. +Encouraging users to relate to automated systems as if they were human can lead +to high-risk scenarios caused by over-reliance on their outputs. As a result, +natural language processing researchers have investigated the factors that +induce personification and developed resources to mitigate such effects. However, +these efforts are fragmented, and many aspects of anthropomorphism have yet to +be explored. In this paper, we discuss the linguistic factors that contribute +to the anthropomorphism of dialogue systems and the harms that can arise, +including reinforcing gender stereotypes and notions of acceptable language. We +recommend that future efforts towards developing dialogue systems take +particular care in their design, development, release, and description; and +attend to the many linguistic cues that can elicit personification by users. + +
+
+ comment: Accepted for publication at EMNLP. See ACL Anthology for published + version +
+
+
+
+
+ + ♻ ☆ GPT-Fathom: Benchmarking Large Language Models to Decipher the + Evolutionary Path towards GPT-4 and Beyond + + +
+ With the rapid advancement of large language models (LLMs), there is a +pressing need for a comprehensive evaluation suite to assess their capabilities +and limitations. Existing LLM leaderboards often reference scores reported in +other papers without consistent settings and prompts, which may inadvertently +encourage cherry-picking favored settings and prompts for better results. In +this work, we introduce GPT-Fathom, an open-source and reproducible LLM +evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ +leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across +7 capability categories, all under aligned settings. Our retrospective study on +OpenAI's earlier models offers valuable insights into the evolutionary path +from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 +progressively improves to GPT-4, including technical details like whether +adding code data improves LLM's reasoning capability, which aspects of LLM +capability can be improved by SFT and RLHF, how much is the alignment tax, etc. +Our analysis sheds light on many of these questions, aiming to improve the +transparency of advanced LLMs. + +
+
+
+
+
+ + ♻ ☆ Outlier Suppression+: Accurate quantization of large language models by + equivalent and optimal shifting and scaling EMNLP23 + + +
+ Post-training quantization (PTQ) of transformer language models faces +significant challenges due to the existence of detrimental outliers in +activations. We observe that these outliers are concentrated in specific +channels and are asymmetric across channels. To address this issue, we propose +the Outlier Suppression+ (OS+) framework, which contains channel-wise +shifting for asymmetry and channel-wise scaling for concentration. We show that +these operations can be seamlessly migrated into subsequent modules while +maintaining equivalence. We further propose a fast and stable scheme to +calculate effective shifting and scaling values. The channel-wise shifting +aligns the center of each channel for removal of outlier asymmetry. The +channel-wise scaling quantitatively evaluates changes brought by migration and +quantization for better quantization burden balance. We validate our OS+ under +both standard and fine-grained quantization settings with models including +BERT, OPT, BLOOM, BLOOMZ, and LLaMA. Comprehensive results across various tasks +demonstrate the superiority of our approach. In particular, with standard +quantization, OS+ can achieve near-floating-point performance on both small +models and large language models at 8-bit and 6-bit. Besides, we establish a +new state-of-the-art for 4-bit BERT with a 15.5% improvement. Our code is +available at https://github.com/ModelTC/Outlier_Suppression_Plus. + +
+
+ comment: Accepted to EMNLP23 (main) +
+
+
+
+
+ + ♻ ☆ Explainable Depression Symptom Detection in Social Media + + +
+ Users of social platforms often perceive these sites as supportive spaces to +post about their mental health issues. Those conversations contain important +traces about individuals' health risks. Recently, researchers have exploited +this online information to construct mental health detection models, which aim +to identify users at risk on platforms like Twitter, Reddit or Facebook. Most +of these models are centred on achieving good classification results, ignoring +the explainability and interpretability of the decisions. Recent research has +pointed out the importance of using clinical markers, such as the use of +symptoms, to improve trust in the computational models by health professionals. +In this paper, we propose using transformer-based architectures to detect and +explain the appearance of depressive symptom markers in the users' writings. We +present two approaches: i) train a model to classify, and another one to +explain the classifier's decision separately and ii) unify the two tasks +simultaneously using a single model. Additionally, for this latter manner, we +also investigated the performance of recent conversational LLMs when using +in-context learning. Our natural language explanations enable clinicians to +interpret the models' decisions based on validated symptoms, enhancing trust in +the automated process. We evaluate our approach using recent symptom-based +datasets, employing both offline and expert-in-the-loop metrics to assess the +quality of the explanations generated by our models. The experimental results +show that it is possible to achieve good classification results while +generating interpretable symptom-based explanations. + +
+
+
+
+
+ + ♻ ☆ Ask Language Model to Clean Your Noisy Translation Data EMNLP 2023 + + +
+ Transformer models have demonstrated remarkable performance in neural machine +translation (NMT). However, their vulnerability to noisy input poses a +significant challenge in practical implementation, where generating clean +output from noisy input is crucial. The MTNT dataset is widely used as a +benchmark for evaluating the robustness of NMT models against noisy input. +Nevertheless, its utility is limited due to the presence of noise in both the +source and target sentences. To address this limitation, we focus on cleaning +the noise from the target sentences in MTNT, making it more suitable as a +benchmark for noise evaluation. Leveraging the capabilities of large language +models (LLMs), we observe their impressive abilities in noise removal. For +example, they can remove emojis while considering their semantic meaning. +Additionally, we show that LLMs can effectively rephrase slang, jargon, and +profanity. The resulting datasets, called C-MTNT, exhibit significantly less +noise in the target sentences while preserving the semantic integrity of the +original sentences. Our human and GPT-4 evaluations also lead to a consistent +conclusion that LLMs perform well on this task. Lastly, experiments on C-MTNT +showcased its effectiveness in evaluating the robustness of NMT models, +highlighting the potential of advanced language models for data cleaning and +emphasizing C-MTNT as a valuable resource. + +
+
+ comment: EMNLP 2023, Findings +
+
+
+
+
+ + ♻ ☆ Understanding ME? Multimodal Evaluation for Fine-grained Visual + Commonsense EMNLP 2022 + + +
+ Visual commonsense understanding requires Vision Language (VL) models to not +only understand image and text but also cross-reference between them to fully +integrate and achieve comprehension of the visual scene described. Recently, +various approaches have been developed and have achieved high performance on +visual commonsense benchmarks. However, it is unclear whether the models really +understand the visual scene and underlying commonsense knowledge due to limited +evaluation data resources. To provide an in-depth analysis, we present a +Multimodal Evaluation (ME) pipeline to automatically generate question-answer +pairs to test models' understanding of the visual scene, text, and related +knowledge. We then take a step further to show that training with the ME data +boosts the model's performance in standard VCR evaluation. Lastly, our in-depth +analysis and comparison reveal interesting findings: (1) semantically low-level +information can assist the learning of high-level information but not the +opposite; (2) visual information is generally underutilized compared with +text. + +
+
+ comment: Accepted to EMNLP 2022 Long Paper +
+
+
+
+
+ + ♻ ☆ Towards Safer Operations: An Expert-involved Dataset of High-Pressure + Gas Incidents for Preventing Future Failures EMNLP 2023 + + +
+ This paper introduces a new IncidentAI dataset for safety prevention. +Different from prior corpora that usually contain a single task, our dataset +comprises three tasks: named entity recognition, cause-effect extraction, and +information retrieval. The dataset is annotated by domain experts who have at +least six years of practical experience as high-pressure gas conservation +managers. We validate the contribution of the dataset in the scenario of safety +prevention. Preliminary results on the three tasks show that NLP techniques are +beneficial for analyzing incident reports to prevent future failures. The +dataset facilitates future research in NLP and incident management communities. +The IncidentAI dataset is available +at https://github.com/Cinnamon/incident-ai-dataset. + +
+
+ comment: Accepted by EMNLP 2023 (The Industry Track) +
+
+
+
+
+ + ♻ ☆ StructGPT: A General Framework for Large Language Model to Reason over + Structured Data EMNLP-23 + + +
+ In this paper, we study how to improve the zero-shot reasoning ability of +large language models (LLMs) over structured data in a unified way. Inspired by +studies on tool augmentation for LLMs, we develop an Iterative +Reading-then-Reasoning (IRR) approach for solving question answering tasks +based on structured data, called StructGPT. In our approach, we +construct a specialized function to collect relevant evidence from structured +data (i.e., reading), and let LLMs concentrate on the reasoning task based on +the collected information (i.e., reasoning). Specifically, we propose an +invoking-linearization-generation procedure to support LLMs in reasoning +on the structured data with the help of the external interfaces. By iterating +this procedure with the provided interfaces, our method can gradually approach +the target answer to a given query. Extensive experiments conducted on three +types of structured data demonstrate the effectiveness of our approach, which +can significantly boost the performance of ChatGPT and achieve performance +comparable to the full-data supervised-tuning baselines. Our code and +data are publicly available at https://github.com/RUCAIBox/StructGPT. + +
+
+ comment: LLM+Structured Data(KG, Table, DB); EMNLP-23 Camera-ready +
+
+
+
+
+ + ♻ ☆ Beyond Hard Samples: Robust and Effective Grammatical Error Correction + with Cycle Self-Augmenting + + +
+ Recent studies have revealed that grammatical error correction methods in the +sequence-to-sequence paradigm are vulnerable to adversarial attacks, and simply +utilizing adversarial examples in the pre-training or post-training process can +significantly enhance the robustness of GEC models to certain types of attack +without suffering too much performance loss on clean data. In this paper, we +further conduct a thorough robustness evaluation of cutting-edge GEC methods +for four different types of adversarial attacks and propose a simple yet very +effective Cycle Self-Augmenting (CSA) method accordingly. By leveraging +augmented data from the GEC models themselves in the post-training process and +introducing regularization data for cycle training, our proposed method can +effectively improve the model robustness of well-trained GEC models with only a +few more training epochs as an extra cost. More concretely, further training on +the regularization data can prevent the GEC models from over-fitting on +easy-to-learn samples and thus can improve the generalization capability and +robustness towards unseen data (adversarial noise/samples). Meanwhile, the +self-augmented data can provide more high-quality pseudo pairs to improve model +performance on the original testing data. Experiments on four benchmark +datasets and seven strong models indicate that our proposed training method can +significantly enhance robustness against four types of attacks without using +purposely built adversarial examples in training. Evaluation results on clean +data further confirm that our proposed CSA method significantly improves the +performance of four baselines and yields nearly comparable results with other +state-of-the-art models. Our code is available at +https://github.com/ZetangForward/CSA-GEC. + +
+
+
+
+
+ + ♻ ☆ Prompting Large Language Models with Chain-of-Thought for Few-Shot + Knowledge Base Question Generation EMNLP 2023 + + +
+ The task of Question Generation over Knowledge Bases (KBQG) aims to convert a +logical form into a natural language question. Because large-scale question +annotation is expensive, KBQG methods for low-resource +scenarios urgently need to be developed. However, current methods heavily rely +on annotated data for fine-tuning, which is not well-suited for few-shot +question generation. Large Language Models (LLMs) have shown +impressive generalization ability in few-shot tasks. Inspired by +Chain-of-Thought (CoT) prompting, which is an in-context learning strategy for +reasoning, we formulate the KBQG task as a reasoning problem, where the generation +of a complete question is split into a series of sub-question generation steps. +Our proposed prompting method KQG-CoT first retrieves supportive logical forms +from the unlabeled data pool, taking into account the characteristics of the +logical form. Then, we write a prompt that makes explicit the reasoning chain of +generating complicated questions based on the selected demonstrations. To +further ensure prompt quality, we extend KQG-CoT into KQG-CoT+ by sorting the +logical forms by their complexity. We conduct extensive experiments over three +public KBQG datasets. The results demonstrate that our prompting method +consistently outperforms other prompting baselines on the evaluated datasets. +Remarkably, our KQG-CoT+ method surpasses existing few-shot SoTA results on +the PathQuestions dataset by 18.25, 10.72, and 10.18 absolute points on BLEU-4, +METEOR, and ROUGE-L, respectively. + +
+
+ comment: Accepted by EMNLP 2023 main conference +
+
+
+
+
+ + ♻ ☆ Viewing Knowledge Transfer in Multilingual Machine Translation Through a + Representational Lens EMNLP 2023 + + +
+ We argue that translation quality alone is not a sufficient metric for +measuring knowledge transfer in multilingual neural machine translation. To +support this claim, we introduce Representational Transfer Potential (RTP), +which measures representational similarities between languages. We show that +RTP can measure both positive and negative transfer (interference), and find +that RTP is strongly correlated with changes in translation quality, indicating +that transfer does occur. Furthermore, we investigate data and language +characteristics that are relevant for transfer, and find that multi-parallel +overlap is an important yet under-explored feature. Based on this, we develop a +novel training scheme, which uses an auxiliary similarity loss that encourages +representations to be more invariant across languages by taking advantage of +multi-parallel data. We show that our method yields increased translation +quality for low- and mid-resource languages across multiple data and model +setups. + +
+
+ comment: Accepted to EMNLP 2023 Findings +
+
+
+
+
+ + ♻ ☆ Reasoning with Language Model is Planning with World Model EMNLP 2023 + + +
+ Large language models (LLMs) have shown remarkable reasoning capabilities, +especially when prompted to generate intermediate reasoning steps (e.g., +Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are +easy for humans, such as generating action plans for executing tasks in a given +environment, or performing complex math, logical, and commonsense reasoning. +The deficiency stems from the key fact that LLMs lack an internal +$\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment +status, intermediate variable values) and simulate long-term outcomes of +actions. This prevents LLMs from performing deliberate planning akin to human +brains, which involves exploring alternative reasoning paths, anticipating +future states and rewards, and iteratively refining existing reasoning steps. +To overcome the limitations, we propose a new LLM reasoning framework, +$\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning +$\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning +agent, and incorporates a principled planning algorithm (based on Monte Carlo +Tree Search) for strategic exploration in the vast reasoning space. During +reasoning, the LLM (as agent) incrementally builds a reasoning tree under the +guidance of the LLM (as world model) and task-specific rewards, and obtains a +high-reward reasoning path efficiently with a proper balance between +exploration $\textit{vs.}$ exploitation. We apply RAP to a variety of +challenging reasoning problems including plan generation, math reasoning, and +logical inference. Empirical results on these tasks demonstrate the superiority +of RAP over various strong baselines, including CoT and least-to-most prompting +with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% +relative improvement in a plan generation setting. + +
+
+ comment: EMNLP 2023. Code is available at + https://github.com/Ber666/llm-reasoners +
+
+
+
+
+ + ♻ ☆ SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim + Verification on Scientific Tables EMNLP 2023 + + +
+ Current scientific fact-checking benchmarks exhibit several shortcomings, +such as biases arising from crowd-sourced claims and an over-reliance on +text-based evidence. We present SCITAB, a challenging evaluation dataset +consisting of 1.2K expert-verified scientific claims that 1) originate from +authentic scientific publications and 2) require compositional reasoning for +verification. The claims are paired with evidence-containing scientific tables +annotated with labels. Through extensive evaluations, we demonstrate that +SCITAB poses a significant challenge to state-of-the-art models, including +table-based pretraining models and large language models. All models except +GPT-4 achieved performance barely above random guessing. Popular prompting +techniques, such as Chain-of-Thought, do not yield substantial performance gains on +SCITAB. Our analysis uncovers several unique challenges posed by SCITAB, +including table grounding, claim ambiguity, and compositional reasoning. Our +code and data are publicly available at https://github.com/XinyuanLu00/SciTab. + +
+
+ comment: Accepted at EMNLP 2023 (main conference, long paper) +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 126 + +
+
+
+ + ☆ RoboDepth: Robust Out-of-Distribution Depth Estimation under Corruptions NeurIPS 2023 + + +
+ Depth estimation from monocular images is pivotal for real-world visual +perception systems. While current learning-based depth estimation models train +and test on meticulously curated data, they often overlook out-of-distribution +(OoD) situations. Yet, in practical settings -- especially safety-critical ones +like autonomous driving -- common corruptions can arise. Addressing this +oversight, we introduce a comprehensive robustness test suite, RoboDepth, +encompassing 18 corruptions spanning three categories: i) weather and lighting +conditions; ii) sensor failures and movement; and iii) data processing +anomalies. We subsequently benchmark 42 depth estimation models across indoor +and outdoor scenes to assess their resilience to these corruptions. Our +findings underscore that, in the absence of a dedicated robustness evaluation +framework, many leading depth estimation models may be susceptible to typical +corruptions. We delve into design considerations for crafting more robust depth +estimation models, touching upon pre-training, augmentation, modality, model +capacity, and learning paradigms. We anticipate our benchmark will establish a +foundational platform for advancing robust OoD depth estimation. + +
+
+ comment: NeurIPS 2023; 45 pages, 25 figures, 13 tables; Code at + https://github.com/ldkong1205/RoboDepth +
+
+
+
+
+ + ☆ FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling + + +
+ With the availability of large-scale video datasets and the advances of +diffusion models, text-driven video generation has achieved substantial +progress. However, existing video generation models are typically trained on a +limited number of frames, resulting in the inability to generate high-fidelity +long videos during inference. Furthermore, these models only support +single-text conditions, whereas real-life scenarios often require multi-text +conditions as the video content changes over time. To tackle these challenges, +this study explores the potential of extending the text-driven capability to +generate longer videos conditioned on multiple texts. 1) We first analyze the +impact of initial noise in video diffusion models. Then building upon the +observation of noise, we propose FreeNoise, a tuning-free and time-efficient +paradigm to enhance the generative capabilities of pretrained video diffusion +models while preserving content consistency. Specifically, instead of +initializing noises for all frames, we reschedule a sequence of noises for +long-range correlation and perform temporal attention over them by window-based +function. 2) Additionally, we design a novel motion injection method to support +the generation of videos conditioned on multiple text prompts. Extensive +experiments validate the superiority of our paradigm in extending the +generative capabilities of video diffusion models. It is noteworthy that +compared with the previous best-performing method which brought about 255% +extra time cost, our method incurs only negligible time cost of approximately +17%. Generated video samples are available at our website: +http://haonanqiu.com/projects/FreeNoise.html. + +
+
+ comment: Project Page: http://haonanqiu.com/projects/FreeNoise.html Code Repo: + https://github.com/arthur-qiu/LongerCrafter +
+
+
+
+
+ + ☆ Ghost on the Shell: An Expressive Representation of General 3D Shapes + + +
+ The creation of photorealistic virtual worlds requires the accurate modeling +of 3D surface geometry for a wide range of objects. For this, meshes are +appealing since they 1) enable fast physics-based rendering with realistic +material and lighting, 2) support physical simulation, and 3) are +memory-efficient for modern graphics pipelines. Recent work on reconstructing +and statistically modeling 3D shape, however, has critiqued meshes as being +topologically inflexible. To capture a wide range of object shapes, any 3D +representation must be able to model solid, watertight, shapes as well as thin, +open, surfaces. Recent work has focused on the former, and methods for +reconstructing open surfaces do not support fast reconstruction with material +and lighting or unconditional generative modelling. Inspired by the observation +that open surfaces can be seen as islands floating on watertight surfaces, we +parameterize open surfaces by defining a manifold signed distance field on +watertight templates. With this parameterization, we further develop a +grid-based and differentiable representation that parameterizes both watertight +and non-watertight meshes of arbitrary topology. Our new representation, called +Ghost-on-the-Shell (G-Shell), enables two important applications: +differentiable rasterization-based reconstruction from multiview images and +generative modelling of non-watertight meshes. We empirically demonstrate that +G-Shell achieves state-of-the-art performance on non-watertight mesh +reconstruction and generation tasks, while also performing effectively for +watertight meshes. + +
+
+ comment: Technical Report (26 pages, 16 figures) +
+
+
+
+
+ + ☆ Large Language Models are Visual Reasoning Coordinators NeurIPS 2023 + + +
+ Visual reasoning requires multimodal perception and commonsense cognition of +the world. Recently, multiple vision-language models (VLMs) have been proposed +with excellent commonsense reasoning ability in various domains. However, how +to harness the collective power of these complementary VLMs is rarely explored. +Existing methods like ensemble still struggle to aggregate these models with +the desired higher-order communications. In this work, we propose Cola, a novel +paradigm that coordinates multiple VLMs for visual reasoning. Our key insight +is that a large language model (LLM) can efficiently coordinate multiple VLMs +by facilitating natural language communication that leverages their distinct +and complementary capabilities. Extensive experiments demonstrate that our +instruction tuning variant, Cola-FT, achieves state-of-the-art performance on +visual question answering (VQA), outside knowledge VQA, visual entailment, and +visual spatial reasoning tasks. Moreover, we show that our in-context learning +variant, Cola-Zero, exhibits competitive performance in zero and few-shot +settings, without finetuning. Through systematic ablation studies and +visualizations, we validate that a coordinator LLM indeed comprehends the +instruction prompts as well as the separate functionalities of VLMs; it then +coordinates them to enable impressive visual reasoning capabilities. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ☆ Handling Data Heterogeneity via Architectural Design for Federated + Visual Recognition NeurIPS 2023 + + +
+ Federated Learning (FL) is a promising research paradigm that enables the +collaborative training of machine learning models among various parties without +the need for sensitive information exchange. Nonetheless, retaining data in +individual clients introduces fundamental challenges to achieving performance +on par with centrally trained models. Our study provides an extensive review of +federated learning applied to visual recognition. It underscores the critical +role of thoughtful architectural design choices in achieving optimal +performance, a factor often neglected in the FL literature. Many existing FL +solutions are tested on shallow or simple networks, which may not accurately +reflect real-world applications. This practice restricts the transferability of +research findings to large-scale visual recognition models. Through an in-depth +analysis of diverse cutting-edge architectures such as convolutional neural +networks, transformers, and MLP-mixers, we experimentally demonstrate that +architectural choices can substantially enhance FL systems' performance, +particularly when handling heterogeneous data. We study 19 visual recognition +models from five different architectural families on four challenging FL +datasets. We also re-investigate the inferior performance of convolution-based +architectures in the FL setting and analyze the influence of normalization +layers on the FL performance. Our findings emphasize the importance of +architectural design for computer vision tasks in practical scenarios, +effectively narrowing the performance gap between federated and centralized +learning. Our source code is available at +https://github.com/sarapieri/fed_het.git. + +
+
+ comment: to be published in NeurIPS 2023 +
+
+
+
+
+ + ☆ SAM-Med3D + + +
+ Although the Segment Anything Model (SAM) has demonstrated impressive +performance in 2D natural image segmentation, its application to 3D volumetric +medical images reveals significant shortcomings, namely suboptimal performance +and unstable prediction, necessitating an excessive number of prompt points to +attain the desired outcomes. These issues can hardly be addressed by +fine-tuning SAM on medical data because the original 2D structure of SAM +neglects 3D spatial information. In this paper, we introduce SAM-Med3D, the +most comprehensive study to modify SAM for 3D medical images. Our approach is +characterized by its comprehensiveness in two primary aspects: firstly, by +comprehensively reformulating SAM to a thorough 3D architecture trained on a +comprehensively processed large-scale volumetric medical dataset; and secondly, +by providing a comprehensive evaluation of its performance. Specifically, we +train SAM-Med3D with over 131K 3D masks and 247 categories. Our SAM-Med3D +excels at capturing 3D spatial information, exhibiting competitive performance +with significantly fewer prompt points than the top-performing fine-tuned SAM +in the medical domain. We then evaluate its capabilities across 15 datasets and +analyze it from multiple perspectives, including anatomical structures, +modalities, targets, and generalization abilities. Our approach, compared with +SAM, showcases pronouncedly enhanced efficiency and broad segmentation +capabilities for 3D volumetric medical images. Our code is released at +https://github.com/uni-medical/SAM-Med3D. + +
+
+
+
+
+ + ☆ FreeMask: Synthetic Images with Dense Annotations Make Stronger + Segmentation Models NeurIPS 2023 + + +
+ Semantic segmentation has witnessed tremendous progress due to the proposal +of various advanced network architectures. However, they are extremely hungry +for delicate annotations to train, and the acquisition is laborious and +unaffordable. Therefore, we present FreeMask in this work, which resorts to +synthetic images from generative models to ease the burden of both data +collection and annotation procedures. Concretely, we first synthesize abundant +training images conditioned on the semantic masks provided by realistic +datasets. This yields extra well-aligned image-mask training pairs for semantic +segmentation models. We surprisingly observe that, solely trained with +synthetic images, we already achieve comparable performance with real ones +(e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we +investigate the role of synthetic images by joint training with real images, or +pre-training for real images. Meanwhile, we design a robust filtering principle +to suppress incorrectly synthesized regions. In addition, we propose to +treat different semantic masks unequally to prioritize the harder ones and +sample more corresponding synthetic images for them. As a result, either +jointly trained or pre-trained with our filtered and re-sampled synthesized +images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on +ADE20K. Code is available at https://github.com/LiheYoung/FreeMask. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ☆ Online Detection of AI-Generated Images ICCV + + +
+ With advancements in AI-generated images coming on a continuous basis, it is +increasingly difficult to distinguish traditionally-sourced images (e.g., +photos, artwork) from AI-generated ones. Previous detection methods study the +generalization from a single generator to another in isolation. However, in +reality, new generators are released on a streaming basis. We study +generalization in this setting, training on N models and testing on the next +(N+k), following the historical release dates of well-known generation methods. +Furthermore, images increasingly consist of both real and generated components, +for example through image inpainting. Thus, we extend this approach to pixel +prediction, demonstrating strong performance using automatically-generated +inpainted data. In addition, for settings where commercial models are not +publicly available for automatic data generation, we evaluate if pixel +detectors can be trained solely on whole synthetic images. + +
+
+ comment: ICCV DeepFake Analysis and Detection Workshop, 2023 +
+
+
+
+
+ + ☆ DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual + Design + + +
+ We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored +for visual design scenarios. Recent T2I models like DALL-E 3 and others, have +demonstrated remarkable capabilities in generating photorealistic images that +align closely with textual inputs. While the allure of creating visually +captivating images is undeniable, our emphasis extends beyond mere aesthetic +pleasure. We aim to investigate the potential of using these powerful models in +authentic design contexts. In pursuit of this goal, we develop DEsignBench, +which incorporates test samples designed to assess T2I models on both "design +technical capability" and "design application scenario." Each of these two +dimensions is supported by a diverse set of specific design categories. We +explore DALL-E 3 together with other leading T2I models on DEsignBench, +resulting in a comprehensive visual gallery for side-by-side comparisons. For +DEsignBench benchmarking, we perform human evaluations on generated images in +DEsignBench gallery, against the criteria of image-text alignment, visual +aesthetic, and design creativity. Our evaluation also considers other +specialized design capabilities, including text rendering, layout composition, +color harmony, 3D design, and medium style. In addition to human evaluations, +we introduce the first automatic image generation evaluator powered by GPT-4V. +This evaluator provides ratings that align well with human judgments, while +being easily replicable and cost-efficient. A high-resolution version is +available at +https://github.com/design-bench/design-bench.github.io/raw/main/designbench.pdf?download= + +
+
+ comment: Project page at https://design-bench.github.io/ +
+
+
+
+
+ + ☆ Fusion-Driven Tree Reconstruction and Fruit Localization: Advancing + Precision in Agriculture IROS + + +
+ Fruit distribution is pivotal in shaping the future of both agriculture and +agricultural robotics, paving the way for a streamlined supply chain. This +study introduces an innovative methodology that harnesses the synergy of RGB +imagery, LiDAR, and IMU data, to achieve intricate tree reconstructions and the +pinpoint localization of fruits. Such integration not only offers insights into +the fruit distribution, which enhances the precision of guidance for +agricultural robotics and automation systems, but also sets the stage for +simulating synthetic fruit patterns across varied tree architectures. To +validate this approach, experiments have been carried out in both a controlled +environment and an actual peach orchard. The results underscore the robustness +and efficacy of this fusion-driven methodology, highlighting its potential as a +transformative tool for future agricultural robotics and precision farming. + +
+
+ comment: This work was presented at IEEE/RSI International Conference on + Intelligent Robots and Systems (IROS) Workshop +
+
+
+
+
+ + ☆ Novel-View Acoustic Synthesis from 3D Reconstructed Rooms + + +
+ We investigate the benefit of combining blind audio recordings with 3D scene +information for novel-view acoustic synthesis. Given audio recordings from 2-4 +microphones and the 3D geometry and material of a scene containing multiple +unknown sound sources, we estimate the sound anywhere in the scene. We identify +the main challenges of novel-view acoustic synthesis as sound source +localization, separation, and dereverberation. While naively training an +end-to-end network fails to produce high-quality results, we show that +incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms +enables the same network to jointly tackle these tasks. Our method outperforms +existing methods designed for the individual tasks, demonstrating its +effectiveness at utilizing 3D visual information. In a simulated study on the +Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source +localization, a PSNR of 26.44 dB and a SDR of 14.23 dB for source separation +and dereverberation, resulting in a PSNR of 25.55 dB and a SDR of 14.20 dB on +novel-view acoustic synthesis. Code, pretrained model, and video results are +available on the project webpage (https://github.com/apple/ml-nvas3d). + +
+
+
+
+
+ + ☆ Projected Stochastic Gradient Descent with Quantum Annealed Binary + Gradients + + +
+ We present QP-SBGD, a novel layer-wise stochastic optimiser tailored towards +training neural networks with binary weights, known as binary neural networks +(BNNs), on quantum hardware. BNNs reduce the computational requirements and +energy consumption of deep learning models with minimal loss in accuracy. +However, training them in practice remains an open challenge. Most known +BNN-optimisers either rely on projected updates or binarise weights +post-training. Instead, QP-SBGD approximately maps the gradient onto binary +variables, by solving a quadratic constrained binary optimisation. Under +practically reasonable assumptions, we show that this update rule converges +with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the +$\mathcal{NP}$-hard projection can be effectively executed on an adiabatic +quantum annealer, harnessing recent advancements in quantum computation. We +also introduce a projected version of this update rule and prove that if a +fixed point exists in the binary variable space, the modified updates will +converge to it. Last but not least, our algorithm is implemented layer-wise, +making it suitable to train larger networks on resource-limited quantum +hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is +on par with competitive and well-established baselines such as BinaryConnect, +signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as +well as binary graph neural networks. + +
+
+
+
+
+ + ☆ SpVOS: Efficient Video Object Segmentation with Triple Sparse + Convolution + + +
+ Semi-supervised video object segmentation (Semi-VOS), which requires only +annotating the first frame of a video to segment future frames, has received +increased attention recently. Among existing pipelines, the +memory-matching-based one is becoming the main research stream, as it can fully +utilize the temporal sequence information to obtain high-quality segmentation +results. Even though this type of method has achieved promising performance, +the overall framework still suffers from heavy computation overhead, mainly +caused by the per-frame dense convolution operations between high-resolution +feature maps and each kernel filter. Therefore, we propose a sparse baseline of +VOS named SpVOS in this work, which develops a novel triple sparse convolution +to reduce the computation costs of the overall VOS framework. The designed +triple gate, taking full consideration of both spatial and temporal redundancy +between adjacent video frames, adaptively makes a triple decision to decide how +to apply the sparse convolution on each pixel to control the computation +overhead of each layer, while maintaining sufficient discrimination capability +to distinguish similar objects and avoid error accumulation. A mixed sparse +training strategy, coupled with a designed objective considering the sparsity +constraint, is also developed to balance the VOS segmentation performance and +computation costs. Experiments are conducted on two mainstream VOS datasets, +DAVIS and Youtube-VOS. Results show that the proposed SpVOS achieves +superior performance over other state-of-the-art sparse methods, and even +maintains comparable performance, e.g., an 83.04% (79.29%) overall score on the +DAVIS-2017 (Youtube-VOS) validation set, with the typical non-sparse VOS +baseline (82.88% for DAVIS-2017 and 80.36% for Youtube-VOS) while saving up to +42% FLOPs, showing its application potential for resource-constrained +scenarios. + +
+
+ comment: 15 pages, 6 figures +
+
+
+
+
+ + ☆ Matryoshka Diffusion Models + + +
+ Diffusion models are the de facto approach for generating high-quality images +and videos, but learning high-dimensional models remains a formidable task due +to computational and optimization challenges. Existing methods often resort to +training cascaded models in pixel space or using a downsampled latent space of +a separately trained auto-encoder. In this paper, we introduce Matryoshka +Diffusion Models (MDM), an end-to-end framework for high-resolution image and +video synthesis. We propose a diffusion process that denoises inputs at +multiple resolutions jointly and uses a NestedUNet architecture where features +and parameters for small-scale inputs are nested within those of large scales. +In addition, MDM enables a progressive training schedule from lower to higher +resolutions, which leads to significant improvements in optimization for +high-resolution generation. We demonstrate the effectiveness of our approach on +various benchmarks, including class-conditioned image generation, +high-resolution text-to-image, and text-to-video applications. Remarkably, we +can train a single pixel-space model at resolutions of up to 1024x1024 pixels, +demonstrating strong zero-shot generalization using the CC12M dataset, which +contains only 12 million images. + +
+
+ comment: 28 pages, 18 figures +
+
+
+
+
+ + ☆ Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model + + +
+ We report Zero123++, an image-conditioned diffusion model for generating +3D-consistent multi-view images from a single input view. To take full +advantage of pretrained 2D generative priors, we develop various conditioning +and training schemes to minimize the effort of finetuning from off-the-shelf +image diffusion models such as Stable Diffusion. Zero123++ excels in producing +high-quality, consistent multi-view images from a single image, overcoming +common issues like texture degradation and geometric misalignment. Furthermore, +we showcase the feasibility of training a ControlNet on Zero123++ for enhanced +control over the generation process. The code is available at +https://github.com/SUDO-AI-3D/zero123plus. + +
+
+
+
+
+ + ☆ FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained + Models in Few-Shot Learning NeurIPS2023 + + +
+ Due to the limited availability of data, existing few-shot learning methods +trained from scratch fail to achieve satisfactory performance. In contrast, +large-scale pre-trained models such as CLIP demonstrate remarkable few-shot and +zero-shot capabilities. To enhance the performance of pre-trained models for +downstream tasks, fine-tuning the model on downstream data is frequently +necessary. However, fine-tuning the pre-trained model leads to a decrease in +its generalizability in the presence of distribution shift, while the limited +number of samples in few-shot learning makes the model highly susceptible to +overfitting. Consequently, existing methods for fine-tuning few-shot learning +primarily focus on fine-tuning the model's classification head or introducing +additional structure. In this paper, we introduce a fine-tuning approach termed +Feature Discrimination Alignment (FD-Align). Our method aims to bolster the +model's generalizability by preserving the consistency of spurious features +across the fine-tuning process. Extensive experimental results validate the +efficacy of our approach for both ID and OOD tasks. Once fine-tuned, the model +can seamlessly integrate with existing methods, leading to performance +improvements. Our code can be found in https://github.com/skingorz/FD-Align. + +
+
+ comment: Accepted by NeurIPS2023 +
+
+
+
+
+ + ☆ Dual-path convolutional neural network using micro-FTIR imaging to + predict breast cancer subtypes and biomarkers levels: estrogen receptor, + progesterone receptor, HER2 and Ki67 + + +
+ Breast cancer molecular subtypes classification plays an important role in sorting +patients with divergent prognosis. The biomarkers used are Estrogen Receptor +(ER), Progesterone Receptor (PR), HER2, and Ki67. Based on these biomarkers' +expression levels, subtypes are classified as Luminal A (LA), Luminal B (LB), +HER2 subtype, and Triple-Negative Breast Cancer (TNBC). Immunohistochemistry is +used to classify subtypes, although interlaboratory and interobserver +variations can affect its accuracy, besides being a time-consuming technique. +The Fourier transform infrared micro-spectroscopy may be coupled with deep +learning for cancer evaluation, where there is still a lack of studies for +subtypes and biomarker levels prediction. This study presents a novel 2D deep +learning approach to achieve these predictions. Sixty micro-FTIR images of +320x320 pixels were collected from a human breast biopsies microarray. Data +were clustered by K-means, preprocessed and 32x32 patches were generated using +a fully automated approach. CaReNet-V2, a novel convolutional neural network, +was developed to classify breast cancer (CA) vs adjacent tissue (AT) and +molecular subtypes, and to predict biomarkers level. The clustering method +made it possible to remove non-tissue pixels. Test accuracies for CA vs AT and subtype +were above 0.84. The model enabled the prediction of ER, PR, and HER2 levels, +where borderline values showed lower performance (minimum accuracy of 0.54). +Ki67 percentage regression demonstrated a mean error of 3.6%. Thus, CaReNet-V2 +is a potential technique for breast cancer biopsies evaluation, standing out as +a screening analysis technique and helping to prioritize patients. + +
+
+ comment: 32 pages, 3 figures, 6 tables +
+
+
+
+
+ + ☆ Acquiring Weak Annotations for Tumor Localization in Temporal and + Volumetric Data + + +
+ Creating large-scale and well-annotated datasets to train AI algorithms is +crucial for automated tumor detection and localization. However, with limited +resources, it is challenging to determine the best type of annotations when +annotating massive amounts of unlabeled data. To address this issue, we focus +on polyps in colonoscopy videos and pancreatic tumors in abdominal CT scans; +both applications require significant effort and time for pixel-wise annotation +due to the high dimensional nature of the data, involving either temporal or +spatial dimensions. In this paper, we develop a new annotation strategy, termed +Drag&Drop, which simplifies the annotation process to drag and drop. This +annotation strategy is more efficient, particularly for temporal and volumetric +imaging, than other types of weak annotations, such as per-pixel, bounding +boxes, scribbles, ellipses, and points. Furthermore, to exploit our Drag&Drop +annotations, we develop a novel weakly supervised learning method based on the +watershed algorithm. Experimental results show that our method achieves better +detection and localization performance than alternative weak annotations and, +more importantly, achieves similar performance to that trained on detailed +per-pixel annotations. Interestingly, we find that, with limited resources, +allocating weak annotations from a diverse patient population can foster models +more robust to unseen images than allocating per-pixel annotations for a small +set of images. In summary, this research proposes an efficient annotation +strategy for tumor detection and localization that is less accurate than +per-pixel annotations but useful for creating large-scale datasets for +screening tumors in various medical modalities. + +
+
+ comment: Published in Machine Intelligence Research +
+
+
+
+
+ + ☆ On the Detection of Image-Scaling Attacks in Machine Learning ACSA + + +
+ Image scaling is an integral part of machine learning and computer vision +systems. Unfortunately, this preprocessing step is vulnerable to so-called +image-scaling attacks where an attacker makes unnoticeable changes to an image +so that it becomes a new image after scaling. This opens up new ways for +attackers to control the prediction or to improve poisoning and backdoor +attacks. While effective techniques exist to prevent scaling attacks, their +detection has not been rigorously studied yet. Consequently, it is currently +not possible to reliably spot these attacks in practice. + This paper presents the first in-depth systematization and analysis of +detection methods for image-scaling attacks. We identify two general detection +paradigms and derive novel methods from them that are simple in design yet +significantly outperform previous work. We demonstrate the efficacy of these +methods in a comprehensive evaluation with all major learning platforms and +scaling algorithms. First, we show that image-scaling attacks modifying the +entire scaled image can be reliably detected even under an adaptive adversary. +Second, we find that our methods provide strong detection performance even if +only minor parts of the image are manipulated. As a result, we can introduce a +novel protection layer against image-scaling attacks. + +
+
+ comment: Accepted at ACSAC'23 +
+
+
+
+
+ + ☆ E4S: Fine-grained Face Swapping via Editing With Regional GAN Inversion + + +
+ This paper proposes a novel approach to face swapping from the perspective of +fine-grained facial editing, dubbed "editing for swapping" (E4S). The +traditional face swapping methods rely on global feature extraction and often +fail to preserve the source identity. In contrast, our framework proposes a +Regional GAN Inversion (RGI) method, which allows the explicit disentanglement +of shape and texture. Specifically, our E4S performs face swapping in the +latent space of a pretrained StyleGAN, where a multi-scale mask-guided encoder +is applied to project the texture of each facial component into regional style +codes and a mask-guided injection module then manipulates feature maps with the +style codes. Based on this disentanglement, face swapping can be simplified as +style and mask swapping. Besides, since reconstructing the source face in the +target image may lead to disharmonious lighting, we propose to train a re-coloring +network to make the swapped face maintain the lighting condition on the target +face. Further, to deal with the potential mismatch area during mask exchange, +we design a face inpainting network as post-processing. The extensive +comparisons with state-of-the-art methods demonstrate that our E4S outperforms +existing methods in preserving texture, shape, and lighting. Our implementation +is available at https://github.com/e4s2023/E4S2023. + +
+
+ comment: Project Page: https://e4s2023.github.io/ ; +
+
+
+
+
+ + ☆ RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented Reality in + Dynamic Environments + + +
+ It is typically challenging for visual or visual-inertial odometry systems to +handle the problems of dynamic scenes and pure rotation. In this work, we +design a novel visual-inertial odometry (VIO) system called RD-VIO to handle +both of these problems. Firstly, we propose an IMU-PARSAC algorithm which +can robustly detect and match keypoints in a two-stage process. In the first +stage, landmarks are matched with new keypoints using visual and IMU +measurements. We collect statistical information from the matching and then +guide the intra-keypoint matching in the second stage. Secondly, to handle the +problem of pure rotation, we detect the motion type and adapt the +deferred-triangulation technique during the data-association process. We make +the pure-rotational frames into the special subframes. When solving the +visual-inertial bundle adjustment, they provide additional constraints to the +pure-rotational motion. We evaluate the proposed VIO system on public datasets. +Experiments show the proposed RD-VIO has obvious advantages over other methods +in dynamic environments. + +
+
+
+
+
+ + ☆ Localizing Active Objects from Egocentric Vision with Symbolic World + Knowledge EMNLP + + +
+ The ability to actively ground task instructions from an egocentric view is +crucial for AI agents to accomplish tasks or assist humans virtually. One +important step towards this goal is to localize and track key active objects +that undergo major state change as a consequence of human actions/interactions +with the environment without being told exactly what/where to ground (e.g., +localizing and tracking the `sponge` in video from the instruction "Dip the +`sponge` into the bucket."). While existing works approach this problem from a +pure vision perspective, we investigate to what extent the textual modality +(i.e., task instructions) and their interaction with visual modality can be +beneficial. Specifically, we propose to improve phrase grounding models' +ability on localizing the active objects by: (1) learning the role of `objects +undergoing change` and extracting them accurately from the instructions, (2) +leveraging pre- and post-conditions of the objects during actions, and (3) +recognizing the objects more robustly with descriptional knowledge. We leverage +large language models (LLMs) to extract the aforementioned action-object +knowledge, and design a per-object aggregation masking technique to effectively +perform joint inference on object phrases and symbolic knowledge. We evaluate +our framework on Ego4D and Epic-Kitchens datasets. Extensive experiments +demonstrate the effectiveness of our proposed framework, which leads to >54% +improvements in all standard metrics on the TREK-150-OPE-Det localization + +tracking task, >7% improvements in all standard metrics on the TREK-150-OPE +tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD +task. + +
+
+ comment: In Proceedings of the 2023 Conference on Empirical Methods in Natural + Language Processing (EMNLP) +
+
+
+
+
+ + ☆ The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained + Multimodal Models EMNLP 2023 + + +
+ Despite the impressive performance achieved by pre-trained +language-and-vision models in downstream tasks, it remains an open question +whether this reflects a proper understanding of image-text interaction. In this +work, we explore to what extent they handle basic linguistic constructions -- +active-passive voice, coordination, and relative clauses -- that even preschool +children can typically master. We present BLA, a novel, automatically +constructed benchmark to evaluate multimodal models on these Basic Language +Abilities. We show that different types of Transformer-based systems, such as +CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting, +in line with previous findings. Our experiments, in particular, show that most +of the tested models only marginally benefit when fine-tuned or prompted with +construction-specific samples. Yet, the generative BLIP2 shows promising +trends, especially in an in-context learning setting. This opens the door to +using BLA not only as an evaluation benchmark but also to improve models' basic +language abilities. + +
+
+ comment: This is the camera-ready version of the paper that will be published + in the Proceedings of EMNLP 2023 (Singapore, 6-10 December 2023) +
+
+
+
+
+ + ☆ Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic + Gaussian Mixture Models + + +
+ A long-standing challenge for a robotic manipulation system operating in +real-world scenarios is adapting and generalizing its acquired motor skills to +unseen environments. We tackle this challenge employing hybrid skill models +that integrate imitation and reinforcement paradigms, to explore how the +learning and adaptation of a skill, along with its core grounding in the scene +through a learned keypoint, can facilitate such generalization. To that end, we +develop the Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models (KIS-GMM) +approach, which learns to predict the reference of a dynamical system within the +scene as a 3D keypoint, leveraging visual observations obtained by the robot's +physical interactions during skill learning. Through comprehensive +evaluations in both simulated and real-world environments, we show that our +method enables a robot to achieve significant zero-shot generalization to novel +environments and to refine skills in the target environments faster than +learning from scratch. Importantly, this is achieved without the need for new +ground truth data. Moreover, our method effectively copes with scene +displacements. + +
+
+ comment: Accepted at the International Symposium on Experimental Robotics + (ISER) 2023. Videos at http://kis-gmm.cs.uni-freiburg.de/ +
+
+
+
+
+ + ☆ DREAM+: Efficient Dataset Distillation by Bidirectional Representative + Matching ICCV + + +
+ Dataset distillation plays a crucial role in creating compact datasets with +similar training performance compared with original large-scale ones. This is +essential for addressing the challenges of data storage and training costs. +Prevalent methods facilitate knowledge transfer by matching the gradients, +embedding distributions, or training trajectories of synthetic images with +those of the sampled original images. Although there are various matching +objectives, currently the strategy for selecting original images is limited to +naive random sampling. We argue that random sampling overlooks the evenness of +the selected sample distribution, which may result in noisy or biased matching +targets. Besides, the sample diversity is also not constrained by random +sampling. Additionally, current methods predominantly focus on +single-dimensional matching, where information is not fully utilized. To +address these challenges, we propose a novel matching strategy called Dataset +Distillation by Bidirectional REpresentAtive Matching (DREAM+), which selects +representative original images for bidirectional matching. DREAM+ is applicable +to a variety of mainstream dataset distillation frameworks and significantly +reduces the number of distillation iterations by more than 15 times without +affecting performance. Given sufficient training time, DREAM+ can further +improve the performance and achieve state-of-the-art results. We have released +the code at github.com/NUS-HPC-AI-Lab/DREAM+. + +
+
+ comment: This is an extension of the ICCV conference version +
+
+
+
+
+ + ☆ A Universal Anti-Spoofing Approach for Contactless Fingerprint Biometric + Systems + + +
+ With the increasing integration of smartphones into our daily lives, +fingerphotos are becoming a potential contactless authentication method. While +they offer convenience, they are also more vulnerable to spoofing using various +presentation attack instruments (PAI). The contactless fingerprint is an +emerging biometric authentication method but has not yet been heavily investigated for +anti-spoofing. While existing anti-spoofing approaches demonstrated fair +results, they have encountered challenges in terms of universality and +scalability to detect any unseen/unknown spoofed samples. To address this +issue, we propose a universal presentation attack detection method for +contactless fingerprints, despite having limited knowledge of presentation +attack samples. We generated synthetic contactless fingerprints using StyleGAN +from live finger photos and integrated them to train a semi-supervised +ResNet-18 model. A novel joint loss function, combining the Arcface and Center +loss, is introduced with a regularization to balance between the two loss +functions and minimize the variations within the live samples while enhancing +the inter-class variations between the deepfake and live samples. We also +conducted a comprehensive comparison of different regularizations' impact on +the joint loss function for presentation attack detection (PAD) and explored +the performance of a modified ResNet-18 architecture with different activation +functions (i.e., leaky ReLU and ReLU) in conjunction with Arcface and center +loss. Finally, we evaluate the performance of the model using unseen types of +spoof attacks and live data. Our proposed method achieves a Bona Fide +Presentation Classification Error Rate (BPCER) of 0.12%, an Attack Presentation +Classification Error Rate (APCER) of 0.63%, and an Average Classification +Error Rate (ACER) of 0.37%. + +
+
+
+
+
+ + ☆ CalibrationPhys: Self-supervised Video-based Heart and Respiratory Rate + Measurements by Calibrating Between Multiple Cameras + + +
+ Video-based heart and respiratory rate measurements using facial videos are +more useful and user-friendly than traditional contact-based sensors. However, +most of the current deep learning approaches require ground-truth pulse and +respiratory waves for model training, which are expensive to collect. In this +paper, we propose CalibrationPhys, a self-supervised video-based heart and +respiratory rate measurement method that calibrates between multiple cameras. +CalibrationPhys trains deep learning models without supervised labels by using +facial videos captured simultaneously by multiple cameras. Contrastive learning +is performed so that the pulse and respiratory waves predicted from the +synchronized videos using multiple cameras are positive and those from +different videos are negative. CalibrationPhys also improves the robustness of +the models by means of a data augmentation technique and successfully leverages +a pre-trained model for a particular camera. Experimental results utilizing two +datasets demonstrate that CalibrationPhys outperforms state-of-the-art heart +and respiratory rate measurement methods. Since we optimize camera-specific +models using only videos from multiple cameras, our approach makes it easy to +use arbitrary cameras for heart and respiratory rate measurements. + +
+
+ comment: This work has been submitted to the IEEE for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ☆ Manipulation Mask Generator: High-Quality Image Manipulation Mask + Generation Method Based on Modified Total Variation Noise Reduction + + +
+ In artificial intelligence, any model that is to achieve good results +depends on a large amount of high-quality data. This is especially true in +the field of tamper detection. This paper proposes a modified total variation +noise reduction method to acquire high-quality tampered images. We +automatically crawl original and tampered images from the Baidu PS Bar. Baidu +PS Bar is a website where users post countless tampered images. +Subtracting the tampered image from the original image highlights the +tampered area. However, there is also substantial noise in the resulting image, so +these images can't be directly used in the deep learning model. Our modified +total variation noise reduction method is aimed at solving this problem. +Because a lot of text is slender, it is easy to lose text information after the +opening and closing operation. We use MSER (Maximally Stable Extremal Regions) +and NMS (Non-maximum Suppression) technology to extract text information. We +then use the modified total variation noise reduction technology to process the +subtracted image. Finally, we obtain an image with little noise by adding +the image and text information, an approach that also largely retains the text +information. Datasets generated in this way can be used in deep learning +models, and they will help the model achieve better results. + +
+
+
+
+
+ + ☆ UWB Based Static Gesture Classification + + +
+ Our paper presents a robust framework for UWB-based static gesture +recognition, leveraging proprietary UWB radar sensor technology. Extensive data +collection efforts were undertaken to compile datasets containing five commonly +used gestures. Our approach involves a comprehensive data pre-processing +pipeline that encompasses outlier handling, aspect ratio-preserving resizing, +and false-color image transformation. Both CNN and MobileNet models were +trained on the processed images. Remarkably, our best-performing model achieved +an accuracy of 96.78%. Additionally, we developed a user-friendly GUI framework +to assess the model's system resource usage and processing times, which +revealed low memory utilization and real-time task completion in under one +second. This research marks a significant step towards enhancing static gesture +recognition using UWB technology, promising practical applications in various +domains. + +
+
+
+
+
+ + ☆ P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic + Segmentation + + +
+ Recently, Transformer-based models have achieved promising results in various +vision tasks, due to their ability to model long-range dependencies. However, +transformers are computationally expensive, which limits their applications in +real-time tasks such as autonomous driving. In addition, efficient local and +global feature selection and fusion are vital for accurate dense prediction, +especially in driving scene understanding tasks. In this paper, we propose a +real-time semantic segmentation architecture named Pyramid Pooling Axial +Transformer (P2AT). The proposed P2AT takes a coarse feature from the CNN +encoder to produce scale-aware contextual features, which are then combined +with the multi-level feature aggregation scheme to produce enhanced contextual +features. Specifically, we introduce a pyramid pooling axial transformer to +capture intricate spatial and channel dependencies, leading to improved +performance on semantic segmentation. Then, we design a Bidirectional Fusion +module (BiF) to combine semantic information at different levels. Meanwhile, a +Global Context Enhancer is introduced to compensate for the inadequacy of +concatenating different semantic levels. Finally, a decoder block is proposed +to help maintain a larger receptive field. We evaluate P2AT variants on three +challenging scene-understanding datasets. In particular, our P2AT variants +achieve state-of-the-art results on the CamVid dataset: 80.5%, 81.0%, and 81.1% for +P2AT-S, P2AT-M, and P2AT-L, respectively. Furthermore, our experiments on +Cityscapes and Pascal VOC 2012 have demonstrated the efficiency of the proposed +architecture, with results showing that P2AT-M achieves 78.7% on Cityscapes. +The source code will be available at + +
+
+
+
+
+ + ☆ SONIC: Sonar Image Correspondence using Pose Supervised Learning for + Imaging Sonars + + +
+ In this paper, we address the challenging problem of data association for +underwater SLAM through a novel method for sonar image correspondence using +learned features. We introduce SONIC (SONar Image Correspondence), a +pose-supervised network designed to yield robust feature correspondence capable +of withstanding viewpoint variations. The inherent complexity of the underwater +environment stems from the dynamic and frequently limited visibility +conditions, restricting vision to a few meters of often featureless expanses. +This makes camera-based systems suboptimal in most open water application +scenarios. Consequently, multibeam imaging sonars emerge as the preferred +choice for perception sensors. However, they too are not without their +limitations. While imaging sonars offer superior long-range visibility compared +to cameras, their measurements can appear different from varying viewpoints. +This inherent variability presents formidable challenges in data association, +particularly for feature-based methods. Our method demonstrates significantly +better performance in generating correspondences for sonar images which will +pave the way for more accurate loop closure constraints and sonar-based place +recognition. Code as well as simulated and real-world datasets will be made +public to facilitate further development in the field. + +
+
+
+
+
+ + ☆ Invariance is Key to Generalization: Examining the Role of + Representation in Sim-to-Real Transfer for Visual Navigation + + +
+ The data-driven approach to robot control has been gathering pace rapidly, +yet generalization to unseen task domains remains a critical challenge. We +argue that the key to generalization is representations that are (i) rich +enough to capture all task-relevant information and (ii) invariant to +superfluous variability between the training and the test domains. We +experimentally study such a representation -- containing both depth and +semantic information -- for visual navigation and show that it enables a +control policy trained entirely in simulated indoor scenes to generalize to +diverse real-world environments, both indoors and outdoors. Further, we show +that our representation reduces the A-distance between the training and test +domains, improving the generalization error bound as a result. Our proposed +approach is scalable: the learned policy improves continuously, as the +foundation models that it exploits absorb more diverse data during +pre-training. + +
+
+ comment: 11 pages, accepted by the 18th International Symposium on + Experimental Robotics (ISER 2023) +
+
+
+
+
+ + ☆ Wonder3D: Single Image to 3D using Cross-Domain Diffusion + + +
+ In this work, we introduce Wonder3D, a novel method for efficiently +generating high-fidelity textured meshes from single-view images. Recent methods +based on Score Distillation Sampling (SDS) have shown the potential to recover +3D geometry from 2D diffusion priors, but they typically suffer from +time-consuming per-shape optimization and inconsistent geometry. In contrast, +certain works directly produce 3D information via fast network inferences, but +their results are often of low quality and lack geometric details. To +holistically improve the quality, consistency, and efficiency of image-to-3D +tasks, we propose a cross-domain diffusion model that generates multi-view +normal maps and the corresponding color images. To ensure consistency, we +employ a multi-view cross-domain attention mechanism that facilitates +information exchange across views and modalities. Lastly, we introduce a +geometry-aware normal fusion algorithm that extracts high-quality surfaces from +the multi-view 2D representations. Our extensive evaluations demonstrate that +our method achieves high-quality reconstruction results, robust generalization, +and reasonably good efficiency compared to prior works. + +
+
+ comment: Project page: https://www.xxlong.site/Wonder3D/ +
+
+
+
+
+ + ☆ StenUNet: Automatic Stenosis Detection from X-ray Coronary Angiography + + +
+ Coronary angiography continues to serve as the primary method for diagnosing +coronary artery disease (CAD), which is the leading global cause of mortality. +The severity of CAD is quantified by the location, degree of narrowing +(stenosis), and number of arteries involved. In current practice, this +quantification is performed manually using visual inspection and thus suffers +from poor inter- and intra-rater reliability. The MICCAI grand challenge: +Automatic Region-based Coronary Artery Disease diagnostics using the X-ray +angiography imagEs (ARCADE) curated a dataset with stenosis annotations, with +the goal of creating an automated stenosis detection algorithm. Using a +combination of machine learning and other computer vision techniques, we +propose the architecture and algorithm StenUNet to accurately detect stenosis +from X-ray Coronary Angiography. Our submission to the ARCADE challenge placed +3rd among all teams. We achieved an F1 score of 0.5348 on the test set, 0.0005 +lower than the 2nd place. + +
+
+ comment: 12 pages, 5 figures, 1 table +
+
+
+
+
+ + ☆ Learning Real-World Image De-Weathering with Imperfect Supervision + + +
+ Real-world image de-weathering aims at removing various undesirable +weather-related artifacts. Owing to the impossibility of capturing image pairs +concurrently, existing real-world de-weathering datasets often exhibit +inconsistent illumination, position, and textures between the ground-truth +images and the input degraded images, resulting in imperfect supervision. Such +non-ideal supervision negatively affects the training process of learning-based +de-weathering methods. In this work, we attempt to address the problem with a +unified solution for various inconsistencies. Specifically, inspired by +information bottleneck theory, we first develop a Consistent Label Constructor +(CLC) to generate a pseudo-label as consistent as possible with the input +degraded image while removing most weather-related degradations. In particular, +multiple adjacent frames of the current input are also fed into CLC to enhance +the pseudo-label. Then we combine the original imperfect labels and +pseudo-labels to jointly supervise the de-weathering model by the proposed +Information Allocation Strategy (IAS). During testing, only the de-weathering +model is used for inference. Experiments on two real-world de-weathering +datasets show that our method helps existing de-weathering models achieve +better performance. Codes are available at +https://github.com/1180300419/imperfect-deweathering. + +
+
+ comment: 16 pages, 13 figures +
+
+
+
+
+ + ☆ Robust Depth Linear Error Decomposition with Double Total Variation and + Nuclear Norm for Dynamic MRI Reconstruction + + +
+ Compressed Sensing (CS) significantly speeds up Magnetic Resonance Image +(MRI) processing and achieves accurate MRI reconstruction from under-sampled +k-space data. According to current research, there are still several +problems with dynamic MRI k-space reconstruction based on CS. 1) There are +differences between the Fourier domain and the Image domain, and the +differences between MRI processing of different domains need to be considered. +2) As three-dimensional data, dynamic MRI has its spatial-temporal +characteristics, which require calculating the difference and consistency of +surface textures while preserving structural integrity and uniqueness. 3) +Dynamic MRI reconstruction is time-consuming and computationally +resource-dependent. In this paper, we propose a novel robust low-rank dynamic +MRI reconstruction optimization model via high under-sampling and the Discrete +Fourier Transform (DFT), called the Robust Depth Linear Error Decomposition +Model (RDLEDM). Our method mainly includes linear decomposition, double Total +Variation (TV), and double Nuclear Norm (NN) regularizations. By adding linear +image domain error analysis, the noise is reduced after under-sampling and DFT +processing, and the anti-interference ability of the algorithm is enhanced. +Double TV and NN regularizations can utilize both spatial-temporal +characteristics and explore the complementary relationship between different +dimensions in dynamic MRI sequences. In addition, due to the non-smoothness and +non-convexity of TV and NN terms, it is difficult to optimize the unified +objective model. To address this issue, we utilize a fast algorithm by solving +a primal-dual form of the original problem. Compared with five state-of-the-art +methods, extensive experiments on dynamic MRI data demonstrate the superior +performance of the proposed method in terms of both reconstruction accuracy and +time complexity. + +
+
+
+
+
+ + ☆ Converting Depth Images and Point Clouds for Feature-based Pose + Estimation IROS 2023 + + +
+ In recent years, depth sensors have become more and more affordable and have +found their way into a growing number of robotic systems. However, mono- or +multi-modal sensor registration, often a necessary step for further processing, +faces many challenges on raw depth images or point clouds. This paper presents +a method of converting depth data into images capable of visualizing spatial +details that are largely hidden in traditional depth images. After noise +removal, a neighborhood of points forms two normal vectors whose difference is +encoded into this new conversion. Compared to Bearing Angle images, our method +yields brighter, higher-contrast images with more visible contours and more +details. We tested feature-based pose estimation of both conversions in a +visual odometry task and RGB-D SLAM. For all tested features, AKAZE, ORB, SIFT, +and SURF, our new Flexion images yield better results than Bearing Angle images +and show great potential to bridge the gap between depth data and classical +computer vision. Source code is available here: +https://rlsch.github.io/depth-flexion-conversion. + +
+
+ comment: to be published in IROS 2023 conference proceedings +
+
+
+
+
+ + ☆ GRLib: An Open-Source Hand Gesture Detection and Recognition Python + Library + + +
+ Hand gesture recognition systems provide a natural way for humans to interact +with computer systems. Although various algorithms have been designed for this +task, a host of external conditions, such as poor lighting or distance from the +camera, make it difficult to create an algorithm that performs well across a +range of environments. In this work, we present GRLib: an open-source Python +library able to detect and classify static and dynamic hand gestures. Moreover, +the library can be trained on existing data for improved classification +robustness. The proposed solution utilizes a feed from an RGB camera. The +retrieved frames are then subjected to data augmentation and passed on to +MediaPipe Hands to perform hand landmark detection. The landmarks are then +classified into their respective gesture class. The library supports dynamic +hand gestures through trajectories and keyframe extraction. It was found that +the library outperforms another publicly available HGR system - MediaPipe +Solutions, on three diverse, real-world datasets. The library is available at +https://github.com/mikhail-vlasenko/grlib and can be installed with pip. + +
+
+
+
+
+ + ☆ Object Pose Estimation Annotation Pipeline for Multi-view Monocular + Camera Systems in Industrial Settings + + +
+ Object localization, and more specifically object pose estimation, in large +industrial spaces such as warehouses and production facilities, is essential +for material flow operations. Traditional approaches rely on artificial +artifacts installed in the environment or excessively expensive equipment that +is not suitable at scale. A more practical approach is to utilize existing +cameras in such spaces in order to address the underlying pose estimation +problem and to localize objects of interest. In order to leverage +state-of-the-art methods in deep learning for object pose estimation, large +amounts of data need to be collected and annotated. In this work, we provide an +approach to the annotation of large datasets of monocular images without the +need for manual labor. Our approach localizes cameras in space, unifies their +location with a motion capture system, and uses a set of linear mappings to +project 3D models of objects of interest at their ground truth 6D pose +locations. We test our pipeline on a custom dataset collected from a system of +eight cameras in an industrial setting that mimics the intended area of +operation. Our approach was able to provide consistent quality annotations for +our dataset with 26,482 object instances at a fraction of the time required by +human annotators. + +
+
+
+
+
+ + ☆ Orientation-Aware Leg Movement Learning for Action-Driven Human Motion + Prediction + + +
+ The task of action-driven human motion prediction aims to forecast future +human motion from the observed sequence while respecting the given action +label. It requires modeling not only the stochasticity within human motion but +also the smooth yet realistic transition between multiple action labels. However, +the fact that most of the datasets do not contain such transition data +complicates this task. Existing work tackles this issue by learning a +smoothness prior to simply promote smooth transitions, yet doing so can result +in unnatural transitions, especially when the history and predicted motions +differ significantly in orientations. In this paper, we argue that valid human +motion transitions should incorporate realistic leg movements to handle +orientation changes, and cast it as an action-conditioned in-betweening (ACB) +learning task to encourage transition naturalness. Because modeling all +possible transitions is virtually unreasonable, our ACB is only performed on +very few selected action classes with active gait motions, such as Walk or Run. +Specifically, we follow a two-stage forecasting strategy by first employing the +motion diffusion model to generate the target motion with a specified future +action, and then producing the in-betweening to smoothly connect the +observation and prediction to eventually address motion prediction. Our method +is completely free from the labeled motion transition data during training. To +show the robustness of our approach, we generalize our trained in-betweening +learning model on one dataset to two unseen large-scale motion datasets to +produce natural transitions. Extensive experiments on three benchmark datasets +demonstrate that our method yields state-of-the-art performance in terms of +visual quality, prediction accuracy, and action faithfulness. + +
+
+
+
+
+ + ☆ 3M-TRANSFORMER: A Multi-Stage Multi-Stream Multimodal Transformer for + Embodied Turn-Taking Prediction + + +
+ Predicting turn-taking in multiparty conversations has many practical +applications in human-computer/robot interaction. However, the complexity of +human communication makes it a challenging task. Recent advances have shown +that synchronous multi-perspective egocentric data can significantly improve +turn-taking prediction compared to asynchronous, single-perspective +transcriptions. Building on this research, we propose a new multimodal +transformer-based architecture for predicting turn-taking in embodied, +synchronized multi-perspective data. Our experimental results on the recently +introduced EgoCom dataset show a substantial performance improvement of up to +14.01% on average compared to existing baselines and alternative +transformer-based approaches. The source code and the pre-trained models of +our 3M-Transformer will be available upon acceptance. + +
+
+
+
+
+ + ☆ ESVAE: An Efficient Spiking Variational Autoencoder with + Reparameterizable Poisson Spiking Sampling + + +
+ In recent years, studies on image generation models of spiking neural +networks (SNNs) have gained the attention of many researchers. Variational +autoencoders (VAEs), as one of the most popular image generation models, have +attracted a lot of work exploring their SNN implementation. Due to the +constrained binary representation in SNNs, existing SNN VAE methods implicitly +construct the latent space by an elaborated autoregressive network and use the +network outputs as the sampling variables. However, this unspecified implicit +representation of the latent space increases the difficulty of generating +high-quality images and introduces additional network parameters. In this +paper, we propose an efficient spiking variational autoencoder (ESVAE) that +constructs an interpretable latent space distribution and designs a +reparameterizable spiking sampling method. Specifically, we construct the prior +and posterior of the latent space as a Poisson distribution using the firing +rate of the spiking neurons. Subsequently, we propose a reparameterizable +Poisson spiking sampling method, which is free from the additional network. +Comprehensive experiments have been conducted, and the experimental results +show that the proposed ESVAE outperforms previous SNN VAE methods in +the quality of reconstructed and generated images. In addition, experiments demonstrate +that ESVAE's encoder is able to retain the original image information more +efficiently, and the decoder is more robust. The source code is available at +https://github.com/QgZhan/ESVAE. + +
+
+ comment: 11 pages, 13 figures +
+
+
+
+
+ + ☆ Deep learning denoiser assisted roughness measurements extraction from + thin resists with low Signal-to-Noise Ratio (SNR) SEM images: analysis with + SMILE + + +
+ The technological advance of High Numerical Aperture Extreme Ultraviolet +Lithography (High NA EUVL) has opened the gates to extensive research on +thinner photoresists (below 30nm), necessary for the industrial implementation +of High NA EUVL. Consequently, images from Scanning Electron Microscopy (SEM) +suffer from reduced imaging contrast and low Signal-to-Noise Ratio (SNR), +impacting the measurement of unbiased Line Edge Roughness (uLER) and Line Width +Roughness (uLWR). Thus, the aim of this work is to enhance the SNR of SEM +images by using a Deep Learning denoiser and enable robust roughness extraction +of the thin resist. For this study, we acquired SEM images of Line-Space (L/S) +patterns with a Chemically Amplified Resist (CAR) with different thicknesses +(15nm, 20nm, 25nm, 30nm), underlayers (Spin-On-Glass-SOG, Organic +Underlayer-OUL) and frames of averaging (4, 8, 16, 32, and 64 Fr). After +denoising, a systematic analysis has been carried out on both noisy and +denoised images using an open-source metrology software, SMILE 2.3.2, for +investigating mean CD, SNR improvement factor, biased and unbiased LWR/LER +Power Spectral Density (PSD). Denoised images with a lower number of frames +present unaltered Critical Dimensions (CDs), enhanced SNR (especially for a low +number of integration frames), and accurate measurements of uLER and uLWR, with +the same accuracy as for noisy images with a consistently higher number of +frames. Therefore, images with a small number of integration frames and with +SNR < 2 can be successfully denoised, and advantageously used in improving +metrology throughput while maintaining reliable roughness measurements for the +thin resist. + +
+
+
+
+
+ + ☆ Large Language Models can Share Images, Too! + + +
+ This paper explores the image-sharing capability of Large Language Models +(LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting, +without the help of visual foundation models. Inspired by the two-stage process +of image-sharing in human dialogues, we propose a two-stage framework that +allows LLMs to predict potential image-sharing turns and generate related image +descriptions using our effective restriction-based prompt template. With +extensive experiments, we unlock the image-sharing capability of LLMs +in zero-shot prompting, with GPT-4 achieving the best performance. +Additionally, we uncover the emergent image-sharing ability in +zero-shot prompting, demonstrating the effectiveness of restriction-based +prompts in both stages of our framework. Based on this framework, we augment +the PhotoChat dataset with images generated by Stable Diffusion at predicted +turns, namely PhotoChat++. To our knowledge, this is the first study to assess +the image-sharing ability of LLMs in a zero-shot setting without visual +foundation models. The source code and the dataset will be released after +publication. + +
+
+
+
+
+ + ☆ DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye + Movement for Machine Reading EMNLP2023 + + +
+ The use of visually-rich documents (VRDs) in various fields has created a +demand for Document AI models that can read and comprehend documents like +humans, which requires the overcoming of technical, linguistic, and cognitive +barriers. Unfortunately, the lack of appropriate datasets has significantly +hindered advancements in the field. To address this issue, we introduce +\textsc{DocTrack}, a VRD dataset really aligned with human eye-movement +information using eye-tracking technology. This dataset can be used to +investigate the challenges mentioned above. Additionally, we explore the impact +of human reading order on document understanding tasks and examine what would +happen if a machine reads in the same order as a human. Our results suggest +that although Document AI models have made significant progress, they still +have a long way to go before they can read VRDs as accurately, continuously, +and flexibly as humans do. These findings have potential implications for +future research and development of Document AI models. The data is available at +\url{https://github.com/hint-lab/doctrack}. + +
+
+ comment: 14 pages, 8 figures, Accepted by Findings of EMNLP2023 +
+
+
+
+
+ + ☆ Vision-Enhanced Semantic Entity Recognition in Document Images via + Visually-Asymmetric Consistency Learning EMNLP2023 + + +
+ Extracting meaningful entities belonging to predefined categories from +Visually-rich Form-like Documents (VFDs) is a challenging task. Visual and +layout features such as font, background, color, and bounding box location and +size provide important cues for identifying entities of the same type. However, +existing models commonly train a visual encoder with weak cross-modal +supervision signals, resulting in a limited capacity to capture these +non-textual features and suboptimal performance. In this paper, we propose a +novel \textbf{V}isually-\textbf{A}symmetric co\textbf{N}sisten\textbf{C}y +\textbf{L}earning (\textsc{Vancl}) approach that addresses the above limitation +by enhancing the model's ability to capture fine-grained visual and layout +features through the incorporation of color priors. Experimental results on +benchmark datasets show that our approach substantially outperforms the strong +LayoutLM series baseline, demonstrating the effectiveness of our approach. +Additionally, we investigate the effects of different color schemes on our +approach, providing insights for optimizing model performance. We believe our +work will inspire future research on multimodal information extraction. + +
+
+ comment: 14 pages, 6 figures, Accepted by EMNLP2023 +
+
+
+
+
+ + ☆ SAMCLR: Contrastive pre-training on complex scenes using SAM for view + sampling + + +
+ In Computer Vision, self-supervised contrastive learning enforces similar +representations between different views of the same image. The pre-training is +most often performed on image classification datasets, like ImageNet, where +images mainly contain a single class of objects. However, when dealing with +complex scenes with multiple items, it becomes very unlikely for several views +of the same image to represent the same object category. In this setting, we +propose SAMCLR, an add-on to SimCLR which uses SAM to segment the image into +semantic regions, then sample the two views from the same region. Preliminary +results show empirically that when pre-training on Cityscapes and ADE20K, then +evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs +at least on par with, and most often significantly outperforms not only SimCLR, +but also DINO and MoCo. + +
+
+ comment: Preprint, under review +
+
+
+
+
+ + ☆ MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D + diffusion + + +
+ We introduce Multi-view Ancestral Sampling (MAS), a method for generating +consistent multi-view 2D samples of a motion sequence, enabling the creation of +its 3D counterpart. MAS leverages a diffusion model trained solely on 2D data, +opening opportunities to exciting and diverse fields of motion previously +under-explored as 3D data is scarce and hard to collect. MAS works by +simultaneously denoising multiple 2D motion sequences representing the same +motion from different angles. Our consistency block ensures consistency across +all views at each diffusion step by combining the individual generations into a +unified 3D sequence, and projecting it back to the original views for the next +iteration. We demonstrate MAS on 2D pose data acquired from videos depicting +professional basketball maneuvers, rhythmic gymnastic performances featuring a +ball apparatus, and horse obstacle course races. In each of these domains, 3D +motion capture is arduous, and yet, MAS generates diverse and realistic 3D +sequences without textual conditioning. As we demonstrate, our ancestral +sampling-based approach offers a more natural integration with the diffusion +framework compared to popular denoising optimization-based approaches, and +avoids common issues such as out-of-domain sampling, lack of details and +mode-collapse. https://guytevet.github.io/mas-page/ + +
+
+
+
+
+ + ☆ Rethinking Scale Imbalance in Semi-supervised Object Detection for + Aerial Images + + +
+ This paper focuses on the scale imbalance problem of semi-supervised object +detection (SSOD) in aerial images. Compared to natural images, objects in aerial +images show smaller sizes and larger quantities per image, increasing the +difficulty of manual annotation. Meanwhile, the advanced SSOD technique can +train superior detectors by leveraging limited labeled data and massive +unlabeled data, saving annotation costs. However, as an understudied task in +aerial images, SSOD suffers from a drastic performance drop when facing a large +proportion of small objects. By analyzing the predictions between small and +large objects, we identify three imbalance issues caused by the scale bias, +i.e., pseudo-label imbalance, label assignment imbalance, and negative learning +imbalance. To tackle these issues, we propose a novel Scale-discriminative +Semi-Supervised Object Detection (S^3OD) learning pipeline for aerial images. +In our S^3OD, three key components, Size-aware Adaptive Thresholding (SAT), +Size-rebalanced Label Assignment (SLA), and Teacher-guided Negative Learning +(TNL), are proposed to warrant scale-unbiased learning. Specifically, SAT +adaptively selects appropriate thresholds to filter pseudo-labels for objects +at different scales. SLA balances positive samples of objects at different +scales through resampling and reweighting. TNL alleviates the imbalance in +negative samples by leveraging information generated by a teacher model. +Extensive experiments conducted on the DOTA-v1.5 benchmark demonstrate the +superiority of our proposed methods over state-of-the-art competitors. Codes +will be released soon. + +
+
+
+
+
+ + ☆ BM2CP: Efficient Collaborative Perception with LiDAR-Camera Modalities + + +
+ Collaborative perception enables agents to share complementary perceptual +information with nearby agents. This would improve the perception performance +and alleviate the issues of single-view perception, such as occlusion and +sparsity. Most existing approaches mainly focus on a single modality (especially +LiDAR), and do not fully exploit the superiority of multi-modal perception. We +propose a collaborative perception paradigm, BM2CP, which employs LiDAR and +camera to achieve efficient multi-modal perception. It utilizes LiDAR-guided +modal fusion, cooperative depth generation and modality-guided intermediate +fusion to acquire deep interactions among modalities of different agents. +Moreover, it is capable of coping with the special case where one of the sensors, +same or different type, of any agent is missing. Extensive experiments validate +that our approach outperforms the state-of-the-art methods with 50X lower +communication volumes in both simulated and real-world autonomous driving +scenarios. Our code is available at https://github.com/byzhaoAI/BM2CP. + +
+
+ comment: 14 pages, 8 figures. Accepted by CoRL 2023 +
+
+
+
+
+ + ☆ Interaction-Driven Active 3D Reconstruction with Object Interiors SIGGRAPH + + +
+ We introduce an active 3D reconstruction method which integrates visual +perception, robot-object interaction, and 3D scanning to recover both the +exterior and interior, i.e., unexposed, geometries of a target 3D object. +Unlike other works in active vision which focus on optimizing camera viewpoints +to better investigate the environment, the primary feature of our +reconstruction is an analysis of the interactability of various parts of the +target object and the ensuing part manipulation by a robot to enable scanning +of occluded regions. As a result, an understanding of part articulations of the +target object is obtained on top of complete geometry acquisition. Our method +operates fully automatically by a Fetch robot with built-in RGBD sensors. It +iterates between interaction analysis and interaction-driven reconstruction, +scanning and reconstructing detected moveable parts one at a time, where both +the articulated part detection and mesh reconstruction are carried out by +neural networks. In the final step, all the remaining, non-articulated parts, +including all the interior structures that had been exposed by prior part +manipulations and subsequently scanned, are reconstructed to complete the +acquisition. We demonstrate the performance of our method via qualitative and +quantitative evaluation, ablation studies, comparisons to alternatives, as well +as experiments in a real environment. + +
+
+ comment: Accepted to SIGGRAPH Asia 2023, project page at + https://vcc.tech/research/2023/InterRecon +
+
+
+
+
+ + ☆ CAwa-NeRF: Instant Learning of Compression-Aware NeRF Features + + +
+ Modeling 3D scenes by volumetric feature grids is one of the promising +directions of neural approximations to improve Neural Radiance Fields (NeRF). +Instant-NGP (INGP) introduced multi-resolution hash encoding from a lookup +table of trainable feature grids which enabled learning high-quality neural +graphics primitives in a matter of seconds. However, this improvement came at +the cost of higher storage size. In this paper, we address this challenge by +introducing instant learning of compression-aware NeRF features (CAwa-NeRF), +which allows exporting the zip compressed feature grids at the end of the model +training with a negligible extra time overhead without changing either the +storage architecture or the parameters used in the original INGP paper. +Nonetheless, the proposed method is not limited to INGP but could also be +adapted to any model. By means of extensive simulations, our proposed instant +learning pipeline can achieve impressive results on different kinds of static +scenes such as single object masked background scenes and real-life scenes +captured in our studio. In particular, for single object masked background +scenes CAwa-NeRF compresses the feature grids down to 6% (1.2 MB) of the +original size without any loss in the PSNR (33 dB) or down to 2.4% (0.53 MB) +with a slight virtual loss (32.31 dB). + +
+
+ comment: 10 pages, 9 figures +
+
+
+
+
+ + ☆ On Partial Shape Correspondence and Functional Maps + + +
+ While dealing with matching shapes to their parts, we often utilize an +instrument known as functional maps. The idea is to translate the shape +matching problem into ``convenient'' spaces by which matching is performed +algebraically by solving a least squares problem. Here, we argue that such +formulations, though popular in this field, introduce errors in the estimated +match when partiality is invoked. Such errors are unavoidable even when +considering advanced feature extraction networks, and they can be shown to +escalate with increasing degrees of shape partiality, adversely affecting the +learning capability of such systems. To circumvent these limitations, we +propose a novel approach for partial shape matching. + Our study of functional maps led us to a novel method that establishes direct +correspondence between partial and full shapes through feature matching +bypassing the need for functional map intermediate spaces. The Gromov distance +between metric spaces leads to the construction of the first part of our loss +functions. For regularization we use two options: a term based on the area +preserving property of the mapping, and a relaxed version of it without the +need to compute a functional map. + The proposed approach shows superior performance on the SHREC'16 dataset, +outperforming existing unsupervised methods for partial shape matching. In +particular, it achieves state-of-the-art result on the SHREC'16 HOLES +benchmark, superior also compared to supervised methods. + +
+
+
+
+
+ + ☆ Online Out-of-Domain Detection for Automated Driving + + +
+ Ensuring safety in automated driving is a major challenge for the automotive +industry. Special attention is paid to artificial intelligence, in particular +to Deep Neural Networks (DNNs), which is considered a key technology in the +realization of highly automated driving. DNNs learn from training data, which +means that they only achieve good accuracy within the underlying data +distribution of the training data. When leaving the training domain, a +distributional shift is caused, which can lead to a drastic reduction of +accuracy. In this work, we present a proof of concept for a safety mechanism +that can detect the leaving of the domain online, i.e. at runtime. In our +experiments with the Synthia data set we can show that a 100 % correct +detection of whether the input data is inside or outside the domain is +achieved. The ability to detect when the vehicle leaves the domain can be an +important requirement for certification. + +
+
+ comment: Machine Learning in Certified Systems (MLCS) Workshop, 14.-15.01.2021 +
+
+
+
+
+ + ☆ Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and + Beyond EMNLP 2023 + + +
+ Vision-language (VL) understanding tasks evaluate models' comprehension of +complex visual scenes through multiple-choice questions. However, we have +identified two dataset biases that models can exploit as shortcuts to resolve +various VL tasks correctly without proper understanding. The first type of +dataset bias is \emph{Unbalanced Matching} bias, where the correct answer +overlaps the question and image more than the incorrect answers. The second +type of dataset bias is \emph{Distractor Similarity} bias, where incorrect +answers are overly dissimilar to the correct answer but significantly similar +to other incorrect answers within the same sample. To address these dataset +biases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic +training and debiased evaluation data. We then introduce Intra-sample +Counterfactual Training (ICT) to assist models in utilizing the synthesized +training data, particularly the counterfactual data, via focusing on +intra-sample differentiation. Extensive experiments demonstrate the +effectiveness of ADS and ICT in consistently improving model performance across +different benchmarks, even in domain-shifted scenarios. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Data Pruning via Moving-one-Sample-out + + +
+ In this paper, we propose a novel data-pruning approach called +moving-one-sample-out (MoSo), which aims to identify and remove the least +informative samples from the training set. The core insight behind MoSo is to +determine the importance of each sample by assessing its impact on the optimal +empirical risk. This is achieved by measuring the extent to which the empirical +risk changes when a particular sample is excluded from the training set. +Instead of using the computationally expensive leaving-one-out-retraining +procedure, we propose an efficient first-order approximator that only requires +gradient information from different training stages. The key idea behind our +approximation is that samples with gradients that are consistently aligned with +the average gradient of the training set are more informative and should +receive higher scores, which could be intuitively understood as follows: if the +gradient from a specific sample is consistent with the average gradient vector, +it implies that optimizing the network using the sample will yield a similar +effect on all remaining samples. Experimental results demonstrate that MoSo +effectively mitigates severe performance degradation at high pruning ratios and +achieves satisfactory performance across various settings. + +
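The gradient-alignment intuition described in the abstract above can be illustrated with a small sketch. This is not the MoSo approximator itself: the exact scoring formula, the function names `alignment_scores` and `prune`, and the choice of a plain dot product averaged over stages are all assumptions made here for illustration. It only shows the stated idea that samples whose gradients agree with the average gradient across training stages receive higher scores.

```python
def alignment_scores(stage_gradients):
    """Score each sample by how well its gradient aligns with the mean gradient.

    stage_gradients[t][i] is the gradient vector (a list of floats) of
    sample i recorded at training stage t. The score is the dot product with
    the stage's mean gradient, averaged over stages (an illustrative choice,
    not the paper's formula).
    """
    n_samples = len(stage_gradients[0])
    scores = [0.0] * n_samples
    for grads in stage_gradients:
        dim = len(grads[0])
        # Average gradient of the whole training set at this stage.
        mean_grad = [sum(g[d] for g in grads) / n_samples for d in range(dim)]
        for i, g in enumerate(grads):
            scores[i] += sum(a * b for a, b in zip(g, mean_grad))
    return [s / len(stage_gradients) for s in scores]


def prune(stage_gradients, keep):
    """Return the sorted indices of the `keep` highest-scoring samples."""
    scores = alignment_scores(stage_gradients)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

With three samples where the third one always points against the other two, `prune(..., keep=2)` retains the first two samples and drops the third.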
+
+
+
+
+ + ☆ Invariant Feature Regularization for Fair Face Recognition ICCV + + +
+ Fair face recognition is all about learning invariant features that +generalize to unseen faces in any demographic group. Unfortunately, face +datasets inevitably capture the imbalanced demographic attributes that are +ubiquitous in real-world observations, and the model learns biased features that +generalize poorly in the minority group. We point out that the bias arises due +to the confounding demographic attributes, which mislead the model to capture +the spurious demographic-specific features. The confounding effect can only be +removed by causal intervention, which requires the confounder annotations. +However, such annotations can be prohibitively expensive due to the diversity +of the demographic attributes. To tackle this, we propose to generate diverse +data partitions iteratively in an unsupervised fashion. Each data partition +acts as a self-annotated confounder, enabling our Invariant Feature +Regularization (INV-REG) to deconfound. INV-REG is orthogonal to existing +methods, and combining INV-REG with two strong baselines (Arcface and CIFP) +leads to new state-of-the-art that improves face recognition on a variety of +demographic groups. Code is available at +https://github.com/PanasonicConnect/InvReg. + +
+
+ comment: Accepted by International Conference on Computer Vision (ICCV) 2023 +
+
+
+
+
+ + ☆ Relit-NeuLF: Efficient Relighting and Novel View Synthesis via Neural 4D + Light Field + + +
+ In this paper, we address the problem of simultaneous relighting and novel +view synthesis of a complex scene from multi-view images with a limited number +of light sources. We propose an analysis-synthesis approach called Relit-NeuLF. +Following the recent neural 4D light field network (NeuLF), Relit-NeuLF first +leverages a two-plane light field representation to parameterize each ray in a +4D coordinate system, enabling efficient learning and inference. Then, we +recover the spatially-varying bidirectional reflectance distribution function +(SVBRDF) of a 3D scene in a self-supervised manner. A DecomposeNet learns to +map each ray to its SVBRDF components: albedo, normal, and roughness. Based on +the decomposed BRDF components and conditioning light directions, a RenderNet +learns to synthesize the color of the ray. To self-supervise the SVBRDF +decomposition, we encourage the predicted ray color to be close to the +physically-based rendering result using the microfacet model. Comprehensive +experiments demonstrate that the proposed method is efficient and effective on +both synthetic data and real-world human face data, and outperforms the +state-of-the-art results. We publicly released our code on GitHub. You can find +it here: https://github.com/oppo-us-research/RelitNeuLF + +
+
+ comment: 10 pages +
+
+
+
+
+ + ☆ Semantic-Aware Adversarial Training for Reliable Deep Hashing Retrieval + + +
+ Deep hashing has been intensively studied and successfully applied in +large-scale image retrieval systems due to its efficiency and effectiveness. +Recent studies have recognized that the existence of adversarial examples poses +a security threat to deep hashing models, that is, adversarial vulnerability. +Notably, it is challenging to efficiently distill reliable semantic +representatives for deep hashing to guide adversarial learning, which +hinders the enhancement of adversarial robustness of deep hashing-based +retrieval models. Moreover, current research on adversarial training for deep +hashing is hard to formalize into a unified minimax structure. In this +paper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the +adversarial robustness of deep hashing models. Specifically, we conceive a +discriminative mainstay features learning (DMFL) scheme to construct semantic +representatives for guiding adversarial learning in deep hashing. Particularly, +our DMFL with the strict theoretical guarantee is adaptively optimized in a +discriminative learning manner, where both discriminative and semantic +properties are jointly considered. Moreover, adversarial examples are +fabricated by maximizing the Hamming distance between the hash codes of +adversarial samples and mainstay features, the efficacy of which is validated +in the adversarial attack trials. Further, we, for the first time, formulate +the formalized adversarial training of deep hashing into a unified minimax +optimization under the guidance of the generated mainstay codes. Extensive +experiments on benchmark datasets show superb attack performance against the +state-of-the-art algorithms; meanwhile, the proposed adversarial training can +effectively eliminate adversarial perturbations for trustworthy deep +hashing-based retrieval. Our code is available at +https://github.com/xandery-geek/SAAT. + +
+
+
+
+
+ + ☆ Multilevel Perception Boundary-guided Network for Breast Lesion + Segmentation in Ultrasound Images + + +
+ Automatic segmentation of breast tumors from the ultrasound images is +essential for the subsequent clinical diagnosis and treatment plan. Although +the existing deep learning-based methods have achieved significant progress in +automatic segmentation of breast tumors, their performance on tumors with +similar intensity to the normal tissues is still unsatisfactory, especially for +the tumor boundaries. To address this issue, we propose a PBNet composed of a +multilevel global perception module (MGPM) and a boundary guided module (BGM) +to segment breast tumors from ultrasound images. Specifically, in MGPM, the +long-range spatial dependencies between the voxels in single-level feature maps +are modeled, and then the multilevel semantic information is fused to promote +the recognition ability of the model for non-enhanced tumors. In BGM, the tumor +boundaries are extracted from the high-level semantic maps using the dilation +and erosion effects of max pooling; these boundaries are then used to guide the +fusion of low and high-level features. Moreover, to improve the segmentation +performance for tumor boundaries, a multi-level boundary-enhanced segmentation +(BS) loss is proposed. The extensive comparison experiments on both a publicly +available dataset and an in-house dataset demonstrate that the proposed PBNet +outperforms the state-of-the-art methods in terms of both qualitative +visualization results and quantitative evaluation metrics, with the Dice score, +Jaccard coefficient, Specificity and HD95 improved by 0.70%, 1.1%, 0.1% and +2.5% respectively. In addition, the ablation experiments validate that the +proposed MGPM is indeed beneficial for distinguishing the non-enhanced tumors +and the BGM as well as the BS loss are also helpful for refining the +segmentation contours of the tumor. + +
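The abstract above mentions extracting boundaries through the dilation and erosion effects of max pooling. A minimal sketch of that general trick follows, assuming a stride-1 window that ignores out-of-bounds cells; the function names `max_pool` and `boundary`, the 3x3 kernel, and the use of plain Python lists are illustrative choices here and are not taken from PBNet. Max pooling a mask dilates it, max pooling the negated mask and negating again erodes it, and the difference is a band around the contour.

```python
def max_pool(grid, k=3):
    """Stride-1 max pooling over a 2D list; out-of-bounds cells are ignored."""
    r = k // 2
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                grid[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


def boundary(mask, k=3):
    """Boundary band = dilation(mask) - erosion(mask), both via max pooling."""
    dilated = max_pool(mask, k)
    negated = [[-v for v in row] for row in mask]
    # Erosion is a min filter, written here as -max_pool(-mask).
    eroded = [[-v for v in row] for row in max_pool(negated, k)]
    return [[d - e for d, e in zip(d_row, e_row)]
            for d_row, e_row in zip(dilated, eroded)]
```

For a 5x5 mask containing a centered 3x3 block of ones, the result is zero only at the central cell, since that is the only position where dilation and erosion agree.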
+
+ comment: 12 pages, 5 figures +
+
+
+
+
+ + ☆ Pre-Training LiDAR-Based 3D Object Detectors Through Colorization + + +
+ Accurate 3D object detection and understanding for self-driving cars heavily +relies on LiDAR point clouds, necessitating large amounts of labeled data to +train. In this work, we introduce an innovative pre-training approach, Grounded +Point Colorization (GPC), to bridge the gap between data and labels by teaching +the model to colorize LiDAR point clouds, equipping it with valuable semantic +cues. To tackle challenges arising from color variations and selection bias, we +incorporate color as "context" by providing ground-truth colors as hints during +colorization. Experimental results on the KITTI and Waymo datasets demonstrate +GPC's remarkable effectiveness. Even with limited labeled data, GPC +significantly improves fine-tuning performance; notably, on just 20% of the +KITTI dataset, GPC outperforms training from scratch with the entire dataset. +In sum, we introduce a fresh perspective on pre-training for 3D object +detection, aligning the objective with the model's intended role and ultimately +advancing the accuracy and efficiency of 3D object detection for autonomous +vehicles. + +
+
+
+
+
+ + ☆ Leveraging Image-Text Similarity and Caption Modification for the + DataComp Challenge: Filtering Track and BYOD Track ICCV 2023 + + +
+ Large web crawl datasets have already played an important role in learning +multimodal features with high generalization capabilities. However, there are +still very limited studies investigating the details or improvements of data +design. Recently, a DataComp challenge has been designed to propose the best +training data with the fixed models. This paper presents our solution to both +filtering track and BYOD track of the DataComp challenge. Our solution adopts +large multimodal models CLIP and BLIP-2 to filter and modify web crawl data, +and utilize external datasets along with a bag of tricks to improve the data +quality. Experiments show our solution significantly outperforms DataComp +baselines (filtering track: 6.6% improvement, BYOD track: 48.5% improvement). + +
+
+ comment: Accepted at the ICCV 2023 Workshop on Towards the Next Generation of + Computer Vision Datasets: DataComp Track +
+
+
+
+
+ + ☆ Tensor Decomposition Based Attention Module for Spiking Neural Networks + + +
+ The attention mechanism has been proven to be an effective way to improve +spiking neural networks (SNNs). However, based on the fact that the current SNN +input data flow is split into tensors to process on GPUs, none of the previous +works consider the properties of tensors to implement an attention module. This +inspires us to rethink current SNNs from the perspective of tensor-relevant +theories. Using tensor decomposition, we design the \textit{projected full +attention} (PFA) module, which demonstrates excellent results with linearly +growing parameters. Specifically, PFA is composed of the \textit{linear +projection of spike tensor} (LPST) module and \textit{attention map composing} +(AMC) module. In LPST, we start by compressing the original spike tensor into +three projected tensors using a single property-preserving strategy with +learnable parameters for each dimension. Then, in AMC, we exploit the inverse +procedure of the tensor decomposition process to combine the three tensors into +the attention map using a so-called connecting factor. To validate the +effectiveness of the proposed PFA module, we integrate it into the widely used +VGG and ResNet architectures for classification tasks. Our method achieves +state-of-the-art performance on both static and dynamic benchmark datasets, +surpassing the existing SNN models with Transformer-based and CNN-based +backbones. + +
+
+ comment: 11 pages +
+
+
+
+
+ + ☆ DICE: Diverse Diffusion Model with Scoring for Trajectory Prediction + + +
+ Road user trajectory prediction in dynamic environments is a challenging but +crucial task for various applications, such as autonomous driving. One of the +main challenges in this domain is the multimodal nature of future trajectories +stemming from the unknown yet diverse intentions of the agents. Diffusion +models have shown to be very effective in capturing such stochasticity in +prediction tasks. However, these models involve many computationally expensive +denoising steps and sampling operations that make them a less desirable option +for real-time safety-critical applications. To this end, we present a novel +framework that leverages diffusion models for predicting future trajectories in +a computationally efficient manner. To minimize the computational bottlenecks +in iterative sampling, we employ an efficient sampling mechanism that allows us +to maximize the number of sampled trajectories for improved accuracy while +maintaining inference time in real time. Moreover, we propose a scoring +mechanism to select the most plausible trajectories by assigning relative +ranks. We show the effectiveness of our approach by conducting empirical +evaluations on common pedestrian (UCY/ETH) and autonomous driving (nuScenes) +benchmark datasets on which our model achieves state-of-the-art performance on +several subsets and metrics. + +
+
+
+
+
+ + ☆ HallusionBench: You See What You Think? Or You Think What You See? An + Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, + and Other Multi-modality Models + + +
+ Large language models (LLMs), after being aligned with vision models and +integrated into vision-language models (VLMs), can bring impressive improvement +in image reasoning tasks. This was shown by the recently released GPT-4V(ision), +LLaVA-1.5, etc. However, the strong language prior in these SOTA LVLMs can be a +double-edged sword: they may ignore the image context and solely rely on the +(even contradictory) language prior for reasoning. In contrast, the vision +modules in VLMs are weaker than LLMs and may result in misleading visual +representations, which are then translated to confident mistakes by LLMs. To +study these two types of VLM mistakes, i.e., language hallucination and visual +illusion, we curated HallusionBench, an image-context reasoning benchmark that +is still challenging to even GPT-4V and LLaVA-1.5. We provide a detailed +analysis of examples in HallusionBench, which sheds novel insights on the +illusion or hallucination of VLMs and how to improve them in the future. The +benchmark and codebase will be released at +https://github.com/tianyi-lab/HallusionBench. + +
+
+
+
+
+ + ☆ F$^2$AT: Feature-Focusing Adversarial Training via Disentanglement of + Natural and Perturbed Patterns + + +
+ Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by +well-designed perturbations. This could lead to disastrous results on critical +applications such as self-driving cars, surveillance security, and medical +diagnosis. At present, adversarial training is one of the most effective +defenses against adversarial examples. However, traditional adversarial +training makes it difficult to achieve a good trade-off between clean accuracy +and robustness since spurious features are still learned by DNNs. The intrinsic +reason is that traditional adversarial training makes it difficult to fully +learn core features from adversarial examples when adversarial noise and clean +examples cannot be disentangled. In this paper, we disentangle the adversarial +examples into natural and perturbed patterns by bit-plane slicing. We assume +the higher bit-planes represent natural patterns and the lower bit-planes +represent perturbed patterns, respectively. We propose a Feature-Focusing +Adversarial Training (F$^2$AT), which differs from previous work in that it +enforces the model to focus on the core features from natural patterns and +reduce the impact of spurious features from perturbed patterns. The +experimental results demonstrated that F$^2$AT outperforms state-of-the-art +methods in clean accuracy and adversarial robustness. + +
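Bit-plane slicing, as referred to in the abstract above, can be sketched in a few lines. The sketch assumes 8-bit pixel values and an even split into four higher and four lower planes; the abstract does not say how many planes count as "higher", so the `high_bits=4` default and the function name `split_bit_planes` are assumptions for illustration only.

```python
def split_bit_planes(pixels, high_bits=4):
    """Split 8-bit pixels into higher-bit-plane and lower-bit-plane images.

    pixels is a 2D list of integers in [0, 255]. The two returned images add
    back up to the original, so the split loses no information.
    """
    if not 0 <= high_bits <= 8:
        raise ValueError("high_bits must be between 0 and 8")
    low_mask = (1 << (8 - high_bits)) - 1   # e.g. 0b00001111
    high_mask = 0xFF ^ low_mask             # e.g. 0b11110000
    high = [[p & high_mask for p in row] for row in pixels]
    low = [[p & low_mask for p in row] for row in pixels]
    return high, low
```

In the abstract's terms, the `high` image would play the role of the natural pattern and the `low` image the perturbed pattern; small adversarial perturbations mostly change the low-order bits.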
+
+
+
+
+ + ☆ Polyhedral Surface: Self-supervised Point Cloud Reconstruction Based on + Polyhedral Surface + + +
+ Point cloud reconstruction from raw point clouds has been an important topic +in computer graphics for decades, especially due to its high demand in modeling +and rendering applications. An important way to solve this problem is +establishing a local geometry to fit the local curve. However, previous methods +build either a local plane or a polynomial curve. A local plane loses +sharp features and introduces boundary artefacts on open surfaces. A polynomial curve is +hard to combine with neural networks due to the local coordinate consistency +problem. To address this, we propose a novel polyhedral surface to represent +the local surface. This method provides more flexibility to represent sharp features +and surface boundaries on open surfaces. It does not require any local coordinate +system, which is important when introducing neural networks. Specifically, we +use normals to construct the polyhedral surface, including both dihedral and +trihedral surfaces using 2 and 3 normals, respectively. Our method achieves +state-of-the-art results on three commonly used datasets (ShapeNetCore, ABC, +and ScanNet). Code will be released upon acceptance. + +
+
+
+
+
+ + ☆ S3Aug: Segmentation, Sampling, and Shift for Action Recognition + + +
+ Action recognition is a well-established area of research in computer vision. +In this paper, we propose S3Aug, a video data augmentation for action +recognition. Unlike conventional video data augmentation methods that involve +cutting and pasting regions from two videos, the proposed method generates new +videos from a single training video through segmentation and label-to-image +transformation. Furthermore, the proposed method modifies certain categories of +label images by sampling to generate a variety of videos, and shifts +intermediate features to enhance the temporal coherency between frames of the +generated videos. Experimental results on the UCF101, HMDB51, and Mimetics +datasets demonstrate the effectiveness of the proposed method, particularly for +out-of-context videos of the Mimetics dataset. + +
+
+
+
+
+ + ☆ Practical Deep Dispersed Watermarking with Synchronization and Fusion ACM MM 2023 + + +
+ Deep learning based blind watermarking works have gradually emerged and +achieved impressive performance. However, previous deep watermarking studies +mainly focus on fixed low-resolution images while paying less attention to +arbitrary resolution images, especially widespread high-resolution images +nowadays. Moreover, most works usually demonstrate robustness against typical +non-geometric attacks (\textit{e.g.}, JPEG compression) but ignore common +geometric attacks (\textit{e.g.}, Rotate) and more challenging combined +attacks. To overcome the above limitations, we propose a practical deep +\textbf{D}ispersed \textbf{W}atermarking with \textbf{S}ynchronization and +\textbf{F}usion, called \textbf{DWSF}. Specifically, given an +arbitrary-resolution cover image, we adopt a dispersed embedding scheme which +sparsely and randomly selects several fixed small-size cover blocks to embed a +consistent watermark message by a well-trained encoder. In the extraction +stage, we first design a watermark synchronization module to locate and rectify +the encoded blocks in the noised watermarked image. We then utilize a decoder +to obtain messages embedded in these blocks, and propose a message fusion +strategy based on similarity to make full use of the consistency among +messages, thus determining a reliable message. Extensive experiments conducted +on different datasets convincingly demonstrate the effectiveness of our +proposed DWSF. Compared with state-of-the-art approaches, our blind +watermarking achieves better performance: on average, it improves the bit accuracy +by 5.28\% and 5.93\% against single and combined attacks, respectively, and +shows a smaller file size increment and better visual quality. Our code is available +at https://github.com/bytedance/DWSF. + +
+
+ comment: Accepted by ACM MM 2023 +
+
+
+
+
+ + ☆ Poster: Real-Time Object Substitution for Mobile Diminished Reality with + Edge Computing + + +
+ Diminished Reality (DR) is considered the conceptual counterpart to +Augmented Reality (AR), and has recently gained increasing attention from both +industry and academia. Unlike AR which adds virtual objects to the real world, +DR allows users to remove physical content from the real world. When combined +with object replacement technology, it presents a further exciting avenue for +exploration within the metaverse. Although a few studies have been conducted +on the intersection of object substitution and DR, there is no high-quality, real-time object +substitution architecture for mobile diminished reality. In +this paper, we propose an end-to-end architecture to facilitate immersive and +real-time scene construction for mobile devices with edge computing. + +
+
+
+
+
+ + ☆ ADoPT: LiDAR Spoofing Attack Detection Based on Point-Level Temporal + Consistency BMVC 2023 + + +
+ Deep neural networks (DNNs) are increasingly integrated into LiDAR (Light +Detection and Ranging)-based perception systems for autonomous vehicles (AVs), +requiring robust performance under adversarial conditions. We aim to address +the challenge of LiDAR spoofing attacks, where attackers inject fake objects +into LiDAR data and fool AVs to misinterpret their environment and make +erroneous decisions. However, current defense algorithms predominantly depend +on perception outputs (i.e., bounding boxes) thus face limitations in detecting +attackers given the bounding boxes are generated by imperfect perception models +processing limited points, acquired based on the ego vehicle's viewpoint. To +overcome these limitations, we propose a novel framework, named ADoPT (Anomaly +Detection based on Point-level Temporal consistency), which quantitatively +measures temporal consistency across consecutive frames and identifies abnormal +objects based on the coherency of point clusters. In our evaluation using the +nuScenes dataset, our algorithm effectively counters various LiDAR spoofing +attacks, achieving a low (< 10%) false positive ratio (FPR) and high (> 85%) +true positive ratio (TPR), outperforming existing state-of-the-art defense +methods, CARLO and 3D-TC2. Furthermore, our evaluation demonstrates the +promising potential for accurate attack detection across various road +environments. + +
+
+ comment: BMVC 2023 (17 pages, 13 figures, and 1 table) +
+
+
+
+
+ + ☆ MSFormer: A Skeleton-multiview Fusion Method For Tooth Instance + Segmentation + + +
+ Recently, deep learning-based tooth segmentation methods have been limited by +the expensive and time-consuming processes of data collection and labeling. +Achieving high-precision segmentation with limited datasets is critical. A +viable solution to this entails fine-tuning pre-trained multiview-based models, +thereby enhancing performance with limited data. However, relying solely on +two-dimensional (2D) images for three-dimensional (3D) tooth segmentation can +produce suboptimal outcomes because of occlusion and deformation, i.e., +incomplete and distorted shape perception. To improve this fine-tuning-based +solution, this paper advocates 2D-3D joint perception. The fundamental +challenge in employing 2D-3D joint perception with limited data is that the +3D-related inputs and modules must follow a lightweight policy instead of using +huge 3D data and parameter-rich modules that require extensive training data. +Following this lightweight policy, this paper selects skeletons as the 3D +inputs and introduces MSFormer, a novel method for tooth segmentation. MSFormer +incorporates two lightweight modules into existing multiview-based models: a +3D-skeleton perception module to extract 3D perception from skeletons and a +skeleton-image contrastive learning module to obtain the 2D-3D joint perception +by fusing both multiview and skeleton perceptions. The experimental results +reveal that MSFormer paired with large pre-trained multiview models achieves +state-of-the-art performance, requiring only 100 training meshes. Furthermore, +the segmentation accuracy is improved by 2.4%-5.5% with the increasing volume +of training data. + +
+
+ comment: Under review +
+
+
+
+
+ + ☆ VQ-NeRF: Vector Quantization Enhances Implicit Neural Representations AAAI + + +
+ Recent advancements in implicit neural representations have contributed to +high-fidelity surface reconstruction and photorealistic novel view synthesis. +However, the computational complexity inherent in these methodologies presents +a substantial impediment, constraining the attainable frame rates and +resolutions in practical applications. In response to this predicament, we +propose VQ-NeRF, an effective and efficient pipeline for enhancing implicit +neural representations via vector quantization. The essence of our method +involves reducing the sampling space of NeRF to a lower resolution and +subsequently reinstating it to the original size utilizing a pre-trained VAE +decoder, thereby effectively mitigating the sampling time bottleneck +encountered during rendering. Although the codebook furnishes representative +features, reconstructing fine texture details of the scene remains challenging +due to high compression rates. To overcome this constraint, we design an +innovative multi-scale NeRF sampling scheme that concurrently optimizes the +NeRF model at both compressed and original scales to enhance the network's +ability to preserve fine details. Furthermore, we incorporate a semantic loss +function to improve the geometric fidelity and semantic coherence of our 3D +reconstructions. Extensive experiments demonstrate the effectiveness of our +model in achieving the optimal trade-off between rendering quality and +efficiency. Evaluation on the DTU, BlendMVS, and H3DS datasets confirms the +superior performance of our approach. + +
+
+ comment: Submitted to the 38th Annual AAAI Conference on Artificial + Intelligence +
+
+
+
+
+ + ☆ Player Re-Identification Using Body Part Appearences + + +
+ We propose a neural network architecture that learns body part appearances +for soccer player re-identification. Our model consists of a two-stream network +(one stream for appearance map extraction and the other for body part map +extraction) and a bilinear-pooling layer that generates and spatially pools the +body part map. Each local feature of the body part map is obtained by a +bilinear mapping of the corresponding local appearance and body part +descriptors. Our novel representation yields a robust image-matching feature +map, which results from combining the local similarities of the relevant body +parts with the weighted appearance similarity. Our model does not require any +part annotation on the SoccerNet-V3 re-identification dataset to train the +network. Instead, we use a sub-network of an existing pose estimation network +(OpenPose) to initialize the part substream and then train the entire network +to minimize the triplet loss. The appearance stream is pre-trained on the +ImageNet dataset, and the part stream is trained from scratch for the +SoccerNet-V3 dataset. We demonstrate the validity of our model by showing that +it outperforms state-of-the-art models such as OsNet and InceptionNet. + +
+
+
+
+
+ + ☆ Towards contrast-agnostic soft segmentation of the spinal cord + + +
+ Spinal cord segmentation is clinically relevant and is notably used to +compute spinal cord cross-sectional area (CSA) for the diagnosis and monitoring +of cord compression or neurodegenerative diseases such as multiple sclerosis. +While several semi and automatic methods exist, one key limitation remains: the +segmentation depends on the MRI contrast, resulting in different CSA across +contrasts. This is partly due to the varying appearance of the boundary between +the spinal cord and the cerebrospinal fluid that depends on the sequence and +acquisition parameters. This contrast-sensitive CSA adds variability in +multi-center studies where protocols can vary, reducing the sensitivity to +detect subtle atrophies. Moreover, existing methods enhance the CSA variability +by training one model per contrast, while also producing binary masks that do +not account for partial volume effects. In this work, we present a deep +learning-based method that produces soft segmentations of the spinal cord. +Using the Spine Generic Public Database of healthy participants +($\text{n}=267$; $\text{contrasts}=6$), we first generated participant-wise +soft ground truth (GT) by averaging the binary segmentations across all 6 +contrasts. These soft GT, along with a regression-based loss function, were +then used to train a UNet model for spinal cord segmentation. We evaluated our +model against state-of-the-art methods and performed ablation studies involving +different GT mask types, loss functions, and contrast-specific models. Our +results show that using the soft average segmentations along with a regression +loss function reduces CSA variability ($p < 0.05$, Wilcoxon signed-rank test). +The proposed spinal cord segmentation model generalizes better than the +state-of-the-art contrast-specific methods amongst unseen datasets, vendors, +contrasts, and pathologies (compression, lesions), while accounting for partial +volume effects. + +
+
+ comment: Submitted to Medical Image Analysis +
+
+
+
+
+ + ☆ Remote Heart Rate Monitoring in Smart Environments from Videos with + Self-supervised Pre-training + + +
+ Recent advances in deep learning have made it increasingly feasible to +estimate heart rate remotely in smart environments by analyzing videos. +However, a notable limitation of deep learning methods is their heavy reliance +on extensive sets of labeled data for effective training. To address this +issue, self-supervised learning has emerged as a promising avenue. Building on +this, we introduce a solution that utilizes self-supervised contrastive +learning for the estimation of remote photoplethysmography (PPG) and heart rate +monitoring, thereby reducing the dependence on labeled data and enhancing +performance. We propose the use of 3 spatial and 3 temporal augmentations for +training an encoder through a contrastive framework, followed by utilizing the +late-intermediate embeddings of the encoder for remote PPG and heart rate +estimation. Our experiments on two publicly available datasets showcase the +improvement of our proposed approach over several related works as well as +supervised learning baselines, as our results approach the state-of-the-art. We +also perform thorough experiments to showcase the effects of using different +design choices such as the video representation learning method, the +augmentations used in the pre-training stage, and others. We also demonstrate +the robustness of our proposed method over the supervised learning approaches +on reduced amounts of labeled data. + +
+
+ comment: Accepted in IEEE Internet of Things Journal 2023 +
+
+
+
+
+ + ☆ Vicinal Feature Statistics Augmentation for Federated 3D Medical Volume + Segmentation + + +
+ Federated learning (FL) enables multiple client medical institutes +to collaboratively train a deep learning (DL) model with privacy protection. +However, the performance of FL can be constrained by the limited availability +of labeled data in small institutes and the heterogeneous (i.e., non-i.i.d.) +data distribution across institutes. Though data augmentation has been a proven +technique to boost the generalization capabilities of conventional centralized +DL as a "free lunch", its application in FL is largely underexplored. Notably, +constrained by costly labeling, 3D medical segmentation generally relies on +data augmentation. In this work, we aim to develop a vicinal feature-level data +augmentation (VFDA) scheme to efficiently alleviate the local feature shift and +facilitate collaborative training for privacy-aware FL segmentation. We take +both the inner- and inter-institute divergence into consideration, without the +need for cross-institute transfer of raw data or their mixup. Specifically, we +exploit the batch-wise feature statistics (e.g., mean and standard deviation) +in each institute to abstractly represent the discrepancy of data, and model +each feature statistic probabilistically via a Gaussian prototype, with the +mean corresponding to the original statistic and the variance quantifying the +augmentation scope. From the vicinal risk minimization perspective, novel +feature statistics can be drawn from the Gaussian distribution to fulfill +augmentation. The variance is explicitly derived from the data bias in each +individual institute and the underlying feature statistics characterized by all +participating institutes. The added-on VFDA consistently yielded marked +improvements over six advanced FL methods on both 3D brain tumor and cardiac +segmentation. + +
+
+ comment: 28th biennial international conference on Information Processing in + Medical Imaging (IPMI 2023): Oral Paper +
+
+
+
+
+ + ☆ Deep Integrated Explanations CIKM 2023 + + +
+ This paper presents Deep Integrated Explanations (DIX) - a universal method +for explaining vision models. DIX generates explanation maps by integrating +information from the intermediate representations of the model, coupled with +their corresponding gradients. Through an extensive array of both objective and +subjective evaluations spanning diverse tasks, datasets, and model +configurations, we showcase the efficacy of DIX in generating faithful and +accurate explanation maps, while surpassing current state-of-the-art methods. + +
+
+ comment: CIKM 2023 +
+
+
+
+
+ + ☆ DeepVox and SAVE-CT: a contrast- and dose-independent 3D deep learning + approach for thoracic aorta segmentation and aneurysm prediction using + computed tomography scans + + +
+ Thoracic aortic aneurysm (TAA) is a fatal disease which potentially leads to +dissection or rupture through progressive enlargement of the aorta. It is +usually asymptomatic and screening recommendations are limited. The +gold-standard evaluation is performed by computed tomography angiography (CTA) +and radiologists' time-consuming assessment. Scans for other indications could +help with this screening; however, if acquired without contrast enhancement or +with a low-dose protocol, they can make the clinical evaluation difficult, besides +increasing the number of scans for the radiologists. In this study, 587 unique +CT scans were selected, including control and TAA patients, acquired with +low and standard dose protocols, with or without contrast enhancement. A novel +segmentation model, DeepVox, exhibited dice score coefficients of 0.932 and +0.897 for development and test sets, respectively, with faster training speed +in comparison to models reported in the literature. The novel TAA +classification model, SAVE-CT, presented accuracies of 0.930 and 0.922 for +development and test sets, respectively, using only the binary segmentation +mask from DeepVox as input, without hand-engineered features. These two models +together are a potential approach for TAA screening, as they can handle +a variable number of slices as input, handling thoracic and thoracoabdominal +sequences, in a fully automated contrast- and dose-independent evaluation. This +may help to decrease TAA mortality and prioritize the evaluation queue of +patients for radiologists. + +
+
+ comment: 23 pages, 4 figures, 7 tables +
+
+
+
+
+ + ☆ LXMERT Model Compression for Visual Question Answering + + +
+ Large-scale pretrained models such as LXMERT are becoming popular for +learning cross-modal representations on text-image pairs for vision-language +tasks. According to the lottery ticket hypothesis, NLP and computer vision +models contain smaller subnetworks capable of being trained in isolation to +full performance. In this paper, we combine these observations to evaluate +whether such trainable subnetworks exist in LXMERT when fine-tuned on the VQA +task. In addition, we perform a model size cost-benefit analysis by +investigating how much pruning can be done without significant loss in +accuracy. Our experiment results demonstrate that LXMERT can be effectively +pruned by 40%-60% in size with 3% loss in accuracy. + +
+
+ comment: To appear in The Fourth Annual West Coast NLP (WeCNLP) Summit +
+
+
+
+
+ + ☆ Videoprompter: an ensemble of foundational models for zero-shot video + understanding + + +
+ Vision-language models (VLMs) classify the query video by calculating a +similarity score between the visual features and text-based class label +representations. Recently, large language models (LLMs) have been used to +enrich the text-based class labels by enhancing the descriptiveness of the +class names. However, these improvements are restricted to the text-based +classifier only, and the query visual features are not considered. In this +paper, we propose a framework which combines pre-trained discriminative VLMs +with pre-trained generative video-to-text and text-to-text models. We introduce +two key modifications to the standard zero-shot setting. First, we propose +language-guided visual feature enhancement and employ a video-to-text model to +convert the query video to its descriptive form. The resulting descriptions +contain vital visual cues of the query video, such as what objects are present +and their spatio-temporal interactions. These descriptive cues provide +additional semantic knowledge to VLMs to enhance their zero-shot performance. +Second, we propose video-specific prompts to LLMs to generate more meaningful +descriptions to enrich class label representations. Specifically, we introduce +prompt techniques to create a Tree Hierarchy of Categories for class names, +offering a higher-level action context for additional visual cues. We +demonstrate the effectiveness of our approach in video understanding across +three different zero-shot settings: 1) video action recognition, 2) +video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. +Consistent improvements across multiple benchmarks and with various VLMs +demonstrate the effectiveness of our proposed framework. Our code will be made +publicly available. + +
+
+
+
+
+ + ☆ SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial + Understanding + + +
+ The landscape of publicly available vision foundation models (VFMs), such as +CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed +with distinct capabilities stemming from their pre-training objectives. For +instance, CLIP excels in semantic understanding, while SAM specializes in +spatial understanding for segmentation. In this work, we introduce a simple +recipe to efficiently merge VFMs into a unified model that assimilates their +expertise. Our proposed method integrates multi-task learning, continual +learning techniques, and teacher-student distillation. This strategy entails +significantly less computational cost compared to traditional multi-task +training from scratch. Additionally, it only demands a small fraction of the +pre-training datasets that were initially used to train individual models. By +applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that +amalgamates the strengths of SAM and CLIP into a single backbone, making it apt +for edge device applications. We show that SAM-CLIP learns richer visual +representations, equipped with both localization and semantic features, +suitable for a broad range of vision tasks. SAM-CLIP obtains improved +performance on several head probing tasks when compared with SAM and CLIP. We +further show that SAM-CLIP not only retains the foundational strengths of its +precursor models but also introduces synergistic functionalities, most notably +in zero-shot semantic segmentation, where SAM-CLIP establishes new +state-of-the-art results on 5 benchmarks. It outperforms previous models that +are specifically designed for this task by a large margin, including +6.8% and ++5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively. + +
+
+
+
+
+ + ☆ SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis + + +
+ Sound design involves creatively selecting, recording, and editing sound +effects for various media like cinema, video games, and virtual/augmented +reality. One of the most time-consuming steps when designing sound is +synchronizing audio with video. In some cases, environmental recordings from +video shoots are available, which can aid in the process. However, in video +games and animations, no reference audio exists, requiring manual annotation of +event timings from the video. We propose a system to extract repetitive action +onsets from a video, which are then used - in conjunction with audio or textual +embeddings - to condition a diffusion model trained to generate a new +synchronized sound effects audio track. In this way, we leave complete creative +control to the sound designer while removing the burden of synchronization with +video. Furthermore, editing the onset track or changing the conditioning +embedding requires much less effort than editing the audio track itself, +simplifying the sonification process. We provide sound examples, source code, +and pretrained models to facilitate reproducibility. + +
+
+
+
+
+ + ♻ ☆ Self-Supervised One-Shot Learning for Automatic Segmentation of StyleGAN + Images + + +
+ We propose a framework for the automatic one-shot segmentation of synthetic +images generated by a StyleGAN. Our framework is based on the observation that +the multi-scale hidden features in the GAN generator hold useful semantic +information that can be utilized for automatic on-the-fly segmentation of the +generated images. Using these features, our framework learns to segment +synthetic images using a self-supervised contrastive clustering algorithm that +projects the hidden features into a compact space for per-pixel classification. +This contrastive learner is based on using a novel data augmentation strategy +and a pixel-wise swapped prediction loss that leads to faster learning of the +feature vectors for one-shot segmentation. We have tested our implementation on +five standard benchmarks to yield a segmentation performance that not only +outperforms the semi-supervised baselines by an average wIoU margin of 1.02 % +but also improves the inference speeds by a factor of 4.5. Finally, we also +show the results of using the proposed one-shot learner in implementing BagGAN, +a framework for producing annotated synthetic baggage X-ray scans for threat +detection. This framework was trained and tested on the PIDRay baggage +benchmark to yield a performance comparable to its baseline segmenter based on +manual annotations. + +
+
+
+
+
+ + ♻ ☆ Content-Based Search for Deep Generative Models + + +
+ The growing proliferation of customized and pretrained generative models has +made it infeasible for a user to be fully cognizant of every model in +existence. To address this need, we introduce the task of content-based model +search: given a query and a large set of generative models, finding the models +that best match the query. As each generative model produces a distribution of +images, we formulate the search task as an optimization problem to select the +model with the highest probability of generating similar content as the query. +We introduce a formulation to approximate this probability given the query from +different modalities, e.g., image, sketch, and text. Furthermore, we propose a +contrastive learning framework for model retrieval, which learns to adapt +features for various query modalities. We demonstrate that our method +outperforms several baselines on Generative Model Zoo, a new benchmark we +create for the model retrieval task. + +
+
+ comment: Our project page is hosted at + https://generative-intelligence-lab.github.io/modelverse/ +
+
+
+
+
+ + ♻ ☆ Edge-aware Hard Clustering Graph Pooling for Brain Imaging + + +
+ Graph Convolutional Networks (GCNs) can capture non-Euclidean spatial +dependence between different brain regions. The graph pooling operator, a +crucial element of GCNs, enhances the representation learning capability and +facilitates the acquisition of abnormal brain maps. However, most existing +research designs graph pooling operators solely from the perspective of nodes +while disregarding the original edge features, in a way that not only confines +graph pooling application scenarios, but also diminishes its ability to capture +critical substructures. To design a graph clustering pooling operator that is +tailored to dominant edge features, we proposed the edge-aware hard clustering +graph pool (EHCPool) and redefined the graph clustering process. Specifically, +the 'Edge-to-node' criterion was proposed to evaluate the significance of both +edge and node features. Guided by edge scores, we designed a revolutionary +Iteration n-top strategy, aimed at adaptively learning sparse hard clustering +assignments for graphs. Subsequently, a novel N-E Aggregation strategy is +introduced to aggregate node and edge information in each independent subgraph. +Extensive experiments on the multi-site public datasets demonstrate the +superiority and robustness of the proposed model. More notably, EHCPool has the +potential to probe different types of dysfunctional brain networks from a +data-driven perspective. Core code is at: https://github.com/swfen/EHCPool. + +
+
+
+
+
+ + ♻ ☆ Variational Imbalanced Regression: Fair Uncertainty Quantification via + Probabilistic Smoothing NeurIPS 2023 + + +
+ Existing regression models tend to fall short in both accuracy and +uncertainty estimation when the label distribution is imbalanced. In this +paper, we propose a probabilistic deep learning model, dubbed variational +imbalanced regression (VIR), which not only performs well in imbalanced +regression but naturally produces reasonable uncertainty estimation as a +byproduct. Different from typical variational autoencoders assuming I.I.D. +representations (a data point's representation is not directly affected by +other data points), our VIR borrows data with similar regression labels to +compute the latent representation's variational distribution; furthermore, +different from deterministic regression models producing point estimates, VIR +predicts the entire normal-inverse-gamma distributions and modulates the +associated conjugate distributions to impose probabilistic reweighting on the +imbalanced data, thereby providing better uncertainty estimation. Experiments +in several real-world datasets show that our VIR can outperform +state-of-the-art imbalanced regression models in terms of both accuracy and +uncertainty estimation. Code will soon be available at +https://github.com/Wang-ML-Lab/variational-imbalanced-regression. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Understanding and Modeling the Effects of Task and Context on Drivers' + Gaze Allocation + + +
+ Understanding what drivers look at is important for many applications, +including driver training, monitoring, and assistance, as well as self-driving. +Traditionally, factors affecting human visual attention have been divided into +bottom-up (involuntary attraction to salient regions) and top-down (task- and +context-driven). Although both play a role in drivers' gaze allocation, most of +the existing modeling approaches apply techniques developed for bottom-up +saliency and do not consider task and context influences explicitly. Likewise, +common driving attention benchmarks lack relevant task and context annotations. +Therefore, to enable analysis and modeling of these factors for drivers' gaze +prediction, we propose the following: 1) address some shortcomings of the +popular DR(eye)VE dataset and extend it with per-frame annotations for driving +task and context; 2) benchmark a number of baseline and SOTA models for +saliency and driver gaze prediction and analyze them w.r.t. the new +annotations; and finally, 3) a novel model that modulates drivers' gaze +prediction with explicit action and context information, and as a result +significantly improves SOTA performance on DR(eye)VE overall (by 24\% KLD and +89\% NSS) and on a subset of action and safety-critical intersection scenarios +(by 10--30\% KLD). Extended annotations, code for model and evaluation will be +made publicly available. + +
+
+ comment: 12 pages, 8 figures, 8 tables +
+
+
+
+
+ + ♻ ☆ ConceptFusion: Open-set Multimodal 3D Mapping + + +
+ Building 3D maps of the environment is central to robot navigation, planning, +and interaction with objects in a scene. Most existing approaches that +integrate semantic concepts with 3D maps largely remain confined to the +closed-set setting: they can only reason about a finite set of concepts, +pre-defined at training time. Further, these maps can only be queried using +class labels, or in recent work, using text prompts. + We address both these issues with ConceptFusion, a scene representation that +is (1) fundamentally open-set, enabling reasoning beyond a closed set of +concepts and (2) inherently multimodal, enabling a diverse range of possible +queries to the 3D map, from language, to images, to audio, to 3D geometry, all +working in concert. ConceptFusion leverages the open-set capabilities of +today's foundation models pre-trained on internet-scale data to reason about +concepts across modalities such as natural language, images, and audio. We +demonstrate that pixel-aligned open-set features can be fused into 3D maps via +traditional SLAM and multi-view fusion approaches. This enables effective +zero-shot spatial reasoning, not needing any additional training or finetuning, +and retains long-tailed concepts better than supervised approaches, +outperforming them by more than 40% margin on 3D IoU. We extensively evaluate +ConceptFusion on a number of real-world datasets, simulated home environments, +a real-world tabletop manipulation task, and an autonomous driving platform. We +showcase new avenues for blending foundation models with 3D open-set multimodal +mapping. + For more information, visit our project page https://concept-fusion.github.io +or watch our 5-minute explainer video +https://www.youtube.com/watch?v=rkXgws8fiDs + +
+
+ comment: RSS 2023. Project page: https://concept-fusion.github.io Explainer + video: https://www.youtube.com/watch?v=rkXgws8fiDs Code: + https://github.com/concept-fusion/concept-fusion +
+
+
+
+
+ + ♻ ☆ Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating + Holistic Understanding of Crowd Scenes + + +
+ To alleviate the heavy annotation burden for training a reliable crowd +counting model and thus make the model more practicable and accurate by being +able to benefit from more data, this paper presents a new semi-supervised +method based on the mean teacher framework. When there is a scarcity of labeled +data available, the model is prone to overfit local patches. Within such +contexts, the conventional approach of solely improving the accuracy of local +patch predictions through unlabeled data proves inadequate. Consequently, we +propose a more nuanced approach: fostering the model's intrinsic 'subitizing' +capability. This ability allows the model to accurately estimate the count in +regions by leveraging its understanding of the crowd scenes, mirroring the +human cognitive process. To achieve this goal, we apply masking on unlabeled +data, guiding the model to make predictions for these masked patches based on +the holistic cues. Furthermore, to help with feature learning, herein we +incorporate a fine-grained density classification task. Our method is general +and applicable to most existing crowd counting methods as it doesn't have +strict structural or loss constraints. In addition, we observe that the model +trained with our framework exhibits a 'subitizing'-like behavior. It accurately +predicts low-density regions with only a 'glance', while incorporating local +details to predict high-density regions. Our method achieves the +state-of-the-art performance, surpassing previous approaches by a large margin +on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is +available at: https://github.com/cha15yq/MRC-Crowd. + +
+
+
+
+
+ + ♻ ☆ Regularizing Neural Networks with Meta-Learning Generative Models NeurIPS 2023 + + +
+ This paper investigates methods for improving generative data augmentation +for deep learning. Generative data augmentation leverages the synthetic samples +produced by generative models as an additional dataset for classification with +small dataset settings. A key challenge of generative data augmentation is that +the synthetic data contain uninformative samples that degrade accuracy. This is +because the synthetic samples do not perfectly represent class categories in +real data and uniform sampling does not necessarily provide useful samples for +tasks. In this paper, we present a novel strategy for generative data +augmentation called meta generative regularization (MGR). To avoid the +degradation of generative data augmentation, MGR utilizes synthetic samples in +the regularization term for feature extractors instead of in the loss function, +e.g., cross-entropy. These synthetic samples are dynamically determined to +minimize the validation losses through meta-learning. We observed that MGR can +avoid the performance degradation of naïve generative data augmentation and +boost the baselines. Experiments on six datasets showed that MGR is effective +particularly when datasets are smaller and stably outperforms baselines. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for + Medical Image Segmentation + + +
+ The hybrid architecture of convolution neural networks (CNN) and Transformer +has been the most popular method for medical image segmentation. However, the +existing networks based on the hybrid architecture suffer from two problems. +First, although the CNN branch can capture image local features by using +convolution operation, the vanilla convolution is unable to achieve adaptive +extraction of image features. Second, although the Transformer branch can model +the global information of images, the conventional self-attention only focuses +on the spatial self-attention of images and ignores the channel and +cross-dimensional self-attention leading to low segmentation accuracy for +medical images with complex backgrounds. To solve these problems, we propose +vision Transformer embrace convolutional neural networks for medical image +segmentation (TEC-Net). Our network has two advantages. First, dynamic +deformable convolution (DDConv) is designed in the CNN branch, which not only +overcomes the difficulty of adaptive feature extraction using fixed-size +convolution kernels, but also solves the defect that different inputs share the +same convolution kernel parameters, effectively improving the feature +expression ability of CNN branch. Second, in the Transformer branch, a +(shifted)-window adaptive complementary attention module ((S)W-ACAM) and +compact convolutional projection are designed to enable the network to fully +learn the cross-dimensional long-range dependency of medical images with few +parameters and calculations. Experimental results show that the proposed +TEC-Net provides better medical image segmentation results than SOTA methods +including CNN and Transformer networks. In addition, our TEC-Net requires fewer +parameters and computational costs and does not rely on pre-training. The code +is publicly available at https://github.com/SR0920/TEC-Net. + +
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2306.03373 +
+
+
+
+
+ + ♻ ☆ Segment Anything in High Quality NeurIPS 2023 + + +
+ The recent Segment Anything Model (SAM) represents a big leap in scaling up +segmentation models, allowing for powerful zero-shot capabilities and flexible +prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction +quality falls short in many cases, particularly when dealing with objects that +have intricate structures. We propose HQ-SAM, equipping SAM with the ability to +accurately segment any object, while maintaining SAM's original promptable +design, efficiency, and zero-shot generalizability. Our careful design reuses +and preserves the pre-trained model weights of SAM, while only introducing +minimal additional parameters and computation. We design a learnable +High-Quality Output Token, which is injected into SAM's mask decoder and is +responsible for predicting the high-quality mask. Instead of only applying it +on mask-decoder features, we first fuse them with early and final ViT features +for improved mask details. To train our introduced learnable parameters, we +compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is +only trained on the introduced dataset of 44K masks, which takes only 4 hours +on 8 GPUs. We show the efficacy of HQ-SAM in a suite of 10 diverse segmentation +datasets across different downstream tasks, 8 of which are evaluated +in a zero-shot transfer protocol. Our code and pretrained models are at +https://github.com/SysCV/SAM-HQ. + +
+
+ comment: NeurIPS 2023. We propose HQ-SAM to upgrade SAM for high-quality + zero-shot segmentation. Github: https://github.com/SysCV/SAM-HQ +
+
+
+
+
+ + ♻ ☆ LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and + Unlabeled Image Collections NeurIPS 2023 + + +
+ Recently, large-scale pre-trained Vision and Language (VL) models have set a +new state-of-the-art (SOTA) in zero-shot visual classification enabling +open-vocabulary recognition of a potentially unlimited set of categories defined +as simple language prompts. However, despite these great advances, the +performance of these zero-shot classifiers still falls short of the results of +dedicated (closed category set) classifiers trained with supervised fine +tuning. In this paper we show, for the first time, how to reduce this gap +without any labels and without any paired VL data, using an unlabeled image +collection and a set of texts auto-generated using a Large Language Model (LLM) +describing the categories of interest and effectively substituting labeled +visual instances of those categories. Using our label-free approach, we are +able to attain significant performance improvements over the zero-shot +performance of the base VL model and other contemporary methods and baselines +on a wide variety of datasets, demonstrating absolute improvement of up to +11.7% (3.8% on average) in the label-free setting. Moreover, despite our +approach being label-free, we observe 1.3% average gains over leading few-shot +prompting baselines that do use 5-shot supervision. + +
+
+ comment: NeurIPS 2023 (Camera Ready) - Project Page: + https://jmiemirza.github.io/LaFTer/ +
+
+
+
+
+ + ♻ ☆ LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic + Tabletop Manipulation + + +
+ The convergence of embodied agents and large language models (LLMs) has +brought significant advancements to embodied instruction following. +Particularly, the strong reasoning capabilities of LLMs make it possible for +robots to perform long-horizon tasks without expensive annotated +demonstrations. However, public benchmarks for testing the long-horizon +reasoning capabilities of language-conditioned robots in various scenarios are +still missing. To fill this gap, this work focuses on the tabletop manipulation +task and releases a simulation benchmark, LoHoRavens, which covers +various long-horizon reasoning aspects spanning color, size, space, arithmetics +and reference. Furthermore, there is a key modality bridging problem for +long-horizon manipulation tasks with LLMs: how to incorporate the observation +feedback during robot execution for the LLM's closed-loop planning, which is +however less studied by prior work. We investigate two methods of bridging the +modality gap: caption generation and learnable interface for incorporating +explicit and implicit observation feedback to the LLM, respectively. These +methods serve as the two baselines for our proposed benchmark. Experiments show +that both methods struggle to solve some tasks, indicating long-horizon +manipulation tasks are still challenging for current popular models. We expect +the proposed public benchmark and baselines can help the community develop +better models for long-horizon tabletop manipulation tasks. + +
+
+ comment: 6 pages, 4 figures. The video and code of LoHoRavens are available at + https://cisnlp.github.io/lohoravens-webpage/ +
+
+
+
+
+ + ♻ ☆ Learning Unseen Modality Interaction NeurIPS 2023 + + +
+ Multimodal learning assumes all modality combinations of interest are +available during training to learn cross-modal correspondences. In this paper, +we challenge this modality-complete assumption for multimodal learning and +instead strive for generalization to unseen modality combinations during +inference. We pose the problem of unseen modality interaction and introduce a +first solution. It exploits a module that projects the multidimensional +features of different modalities into a common space with rich information +preserved. This allows the information to be accumulated with a simple +summation operation across available modalities. To reduce overfitting to less +discriminative modality combinations during training, we further improve the +model learning with pseudo-supervision indicating the reliability of a +modality's prediction. We demonstrate that our approach is effective for +diverse tasks and modalities by evaluating it for multimodal video +classification, robot state regression, and multimedia retrieval. Project +website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. + +
+
+ comment: Published at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Learning-to-Rank Meets Language: Boosting Language-Driven Ordering + Alignment for Ordinal Classification NeurIPS 2023 + + +
+ We present a novel language-driven ordering alignment method for ordinal +classification. The labels in ordinal classification contain additional +ordering relations, making them prone to overfitting when relying solely on +training data. Recent developments in pre-trained vision-language models +inspire us to leverage the rich ordinal priors in human language by converting +the original task into a vision-language alignment task. Consequently, we +propose L2RCLIP, which fully utilizes the language priors from two +perspectives. First, we introduce a complementary prompt tuning technique +called RankFormer, designed to enhance the ordering relation of original rank +prompts. It employs token-level attention with residual-style prompt blending +in the word embedding space. Second, to further incorporate language priors, we +revisit the approximate bound optimization of vanilla cross-entropy loss and +restructure it within the cross-modal embedding space. Consequently, we propose +a cross-modal ordinal pairwise loss to refine the CLIP feature space, where +texts and images maintain both semantic alignment and ordering alignment. +Extensive experiments on three ordinal classification tasks, including facial +age estimation, historical color image (HCI) classification, and aesthetic +assessment demonstrate its promising performance. The code is available at +https://github.com/raywang335/L2RCLIP. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Global-correlated 3D-decoupling Transformer for Clothed Avatar + Reconstruction NeurIPS 2023 + + +
+ Reconstructing 3D clothed human avatars from single images is a challenging +task, especially when encountering complex poses and loose clothing. Current +methods exhibit limitations in performance, largely attributable to their +dependence on insufficient 2D image features and inconsistent query methods. +Owing to this, we present the Global-correlated 3D-decoupling Transformer for +clothed Avatar reconstruction (GTA), a novel transformer-based architecture +that reconstructs clothed human avatars from monocular images. Our approach +leverages transformer architectures by utilizing a Vision Transformer model as +an encoder for capturing global-correlated image features. Subsequently, our +innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane +features, using learnable embeddings as queries for cross-plane generation. To +effectively enhance feature fusion with the tri-plane 3D feature and human body +prior, we propose a hybrid prior fusion strategy combining spatial and +prior-enhanced queries, leveraging the benefits of spatial localization and +human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0 +datasets illustrate that our method outperforms state-of-the-art approaches in +both geometry and texture reconstruction, exhibiting high robustness to +challenging poses and loose clothing, and producing higher-resolution textures. +Codes will be available at https://github.com/River-Zhang/GTA. + +
+
+ comment: Accepted by NeurIPS 2023. Update appendix. Project page: + https://river-zhang.github.io/GTA-projectpage/ +
+
+
+
+
+ + ♻ ☆ ECHo: A Visio-Linguistic Dataset for Event Causality Inference via + Human-Centric Reasoning EMNLP 2023 + + +
+ We introduce ECHo (Event Causality Inference via Human-Centric Reasoning), a +diagnostic dataset of event causality inference grounded in visio-linguistic +social scenarios. ECHo employs real-world human-centric deductive information +building on a television crime drama. ECHo requires the Theory-of-Mind (ToM) +ability to understand and reason about social interactions based on multimodal +information. Using ECHo, we propose a unified Chain-of-Thought (CoT) framework +to assess the reasoning capability of current AI systems. Our ToM-enhanced CoT +pipeline accommodates various large foundation models in both zero-shot and +few-shot visio-linguistic reasoning. We use this framework to scrutinize recent +large foundation models such as InstructGPT and MiniGPT-4 on three diagnostic +human-centric tasks. Further analysis demonstrates ECHo as a challenging +dataset to expose imperfections and inconsistencies in reasoning. Our data and +code are publicly available at https://github.com/YuxiXie/ECHo. + +
+
+ comment: Findings of EMNLP 2023. 10 pages, 6 figures, 5 tables (22 pages, 8 + figures, 15 tables including references and appendices) +
+
+
+
+
+ + ♻ ☆ NICE: Improving Panoptic Narrative Detection and Segmentation with + Cascading Collaborative Learning + + +
+ Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging +tasks that involve identifying and locating multiple targets in an image +according to a long narrative description. In this paper, we propose a unified +and effective framework called NICE that can jointly learn these two panoptic +narrative recognition tasks. Existing visual grounding tasks use a two-branch +paradigm, but applying this directly to PND and PNS can result in prediction +conflict due to their intrinsic many-to-many alignment property. To address +this, we introduce two cascading modules based on the barycenter of the mask, +which are Coordinate Guided Aggregation (CGA) and Barycenter Driven +Localization (BDL), responsible for segmentation and detection, respectively. +By linking PNS and PND in series with the barycenter of segmentation as the +anchor, our approach naturally aligns the two tasks and allows them to +complement each other for improved performance. Specifically, CGA provides the +barycenter as a reference for detection, reducing BDL's reliance on a large +number of candidate boxes. BDL leverages its excellent properties to +distinguish different instances, which improves the performance of CGA for +segmentation. Extensive experiments demonstrate that NICE surpasses all +existing methods by a large margin, achieving 4.1% for PND and 2.9% for PNS +over the state-of-the-art. These results validate the effectiveness of our +proposed collaborative learning strategy. The project of this work is made +publicly available at https://github.com/Mr-Neko/NICE. + +
+
+ comment: 18 pages. 9 figures, 9 tables +
+
+
+
+
+ + ♻ ☆ Temporal Conditioning Spiking Latent Variable Models of the Neural + Response to Natural Visual Scenes NeurIPS 2023 + + +
+ Developing computational models of neural response is crucial for +understanding sensory processing and neural computations. Current +state-of-the-art neural network methods use temporal filters to handle temporal +dependencies, resulting in an unrealistic and inflexible processing paradigm. +Meanwhile, these methods target trial-averaged firing rates and fail to capture +important features in spike trains. This work presents the temporal +conditioning spiking latent variable models (TeCoS-LVM) to simulate the neural +response to natural visual stimuli. We use spiking neurons to produce spike +outputs that directly match the recorded trains. This approach helps to avoid +losing information embedded in the original spike trains. We exclude the +temporal dimension from the model parameter space and introduce a temporal +conditioning operation to allow the model to adaptively explore and exploit +temporal dependencies in stimuli sequences in a natural paradigm. We show +that TeCoS-LVM models can produce more realistic spike activities and +fit spike statistics more accurately than powerful alternatives. Additionally, +learned TeCoS-LVM models can generalize well to longer time scales. Overall, +while remaining computationally tractable, our model effectively captures key +features of neural coding systems. It thus provides a useful tool for building +accurate predictive computational accounts for various sensory perception +circuits. + +
+
+ comment: Accepted at NeurIPS 2023. 22 pages, 7 figures, 3 tables +
+
+
+
+
+ + ♻ ☆ Segment, Select, Correct: A Framework for Weakly-Supervised Referring + Segmentation + + +
+ Referring Image Segmentation (RIS) - the problem of identifying objects in +images through natural language sentences - is a challenging task currently +mostly solved through supervised learning. However, while collecting referred +annotation masks is a time-consuming process, the few existing +weakly-supervised and zero-shot approaches fall significantly short in +performance compared to fully-supervised learning ones. To bridge the +performance gap without mask annotations, we propose a novel weakly-supervised +framework that tackles RIS by decomposing it into three steps: obtaining +instance masks for the object mentioned in the referencing instruction +(segment), using zero-shot learning to select a potentially correct mask for +the given instruction (select), and bootstrapping a model which allows for +fixing the mistakes of zero-shot selection (correct). In our experiments, using +only the first two steps (zero-shot segment and select) outperforms other +zero-shot baselines by as much as 19%, while our full method improves upon this +much stronger baseline and sets the new state-of-the-art for weakly-supervised +RIS, reducing the gap between the weakly-supervised and fully-supervised +methods in some cases from around 33% to as little as 14%. Code is available at +https://github.com/fgirbal/segment-select-correct. + +
+
+
+
+
+ + ♻ ☆ AI-Generated Images as Data Source: The Dawn of Synthetic Era + + +
+ The advancement of visual intelligence is intrinsically tethered to the +availability of large-scale data. In parallel, generative Artificial +Intelligence (AI) has unlocked the potential to create synthetic images that +closely resemble real-world photographs. This prompts a compelling inquiry: how +much visual intelligence could benefit from the advance of generative AI? This +paper explores the innovative concept of harnessing these AI-generated images +as new data sources, reshaping traditional modeling paradigms in visual +intelligence. In contrast to real data, AI-generated data exhibit remarkable +advantages, including unmatched abundance and scalability, the rapid generation +of vast datasets, and the effortless simulation of edge cases. Built on the +success of generative AI models, we examine the potential of their generated +data in a range of applications, from training machine learning models to +simulating scenarios for computational modeling, testing, and validation. We +probe the technological foundations that support this groundbreaking use of +generative AI, engaging in an in-depth discussion on the ethical, legal, and +practical considerations that accompany this transformative paradigm shift. +Through an exhaustive survey of current technologies and applications, this +paper presents a comprehensive view of the synthetic era in visual +intelligence. A project associated with this paper can be found at +https://github.com/mwxely/AIGS . + +
+
+ comment: 20 pages, 11 figures +
+
+
+
+
+ + ♻ ☆ Towards Robust Cardiac Segmentation using Graph Convolutional Networks + + +
+ Fully automatic cardiac segmentation can be a fast and reproducible method to +extract clinical measurements from an echocardiography examination. The U-Net +architecture is the current state-of-the-art deep learning architecture for +medical segmentation and can segment cardiac structures in real-time with +average errors comparable to inter-observer variability. However, this +architecture still generates large outliers that are often anatomically +incorrect. This work uses the concept of graph convolutional neural networks +that predict the contour points of the structures of interest instead of +labeling each pixel. We propose a graph architecture that uses two +convolutional rings based on cardiac anatomy and show that this eliminates +anatomically incorrect multi-structure segmentations on the publicly available +CAMUS dataset. Additionally, this work contributes an ablation study on +the graph convolutional architecture and an evaluation of clinical measurements +on the clinical HUNT4 dataset. Finally, we propose to use the inter-model +agreement of the U-Net and the graph network as a predictor of both the input +and segmentation quality. We show this predictor can detect out-of-distribution +and unsuitable input images in real-time. Source code is available online: +https://github.com/gillesvntnu/GCN_multistructure + +
+
+
+
+
+ + ♻ ☆ Perceptual Assessment and Optimization of High Dynamic Range Image + Rendering + + +
+ High dynamic range (HDR) imaging has gained increasing popularity for its +ability to faithfully reproduce the luminance levels in natural scenes. +Accordingly, HDR image quality assessment (IQA) is crucial but has been +superficially treated. The majority of existing IQA models are developed for +and calibrated against low dynamic range (LDR) images, which have been shown to +be poorly correlated with human perception of HDR image quality. In this work, +we propose a family of HDR IQA models by transferring the recent advances in +LDR IQA. The key step in our approach is to specify a simple inverse display +model that decomposes an HDR image to a set of LDR images with different +exposures, which will be assessed by existing LDR quality models. The local +quality scores of each exposure are then aggregated with the help of a simple +well-exposedness measure into a global quality score for each exposure, which +will be further weighted across exposures to obtain the overall quality score. +When assessing LDR images, the proposed HDR quality models reduce gracefully to +the original LDR ones with the same performance. Experiments on four +human-rated HDR image datasets demonstrate that our HDR quality models are +consistently better than existing IQA methods, including the HDR-VDP family. +Moreover, we demonstrate their strengths in perceptual optimization of HDR +novel view synthesis. + +
+
+ comment: need more changes +
+
+
+
+
+ + ♻ ☆ Understanding ME? Multimodal Evaluation for Fine-grained Visual + Commonsense EMNLP 2022 + + +
+ Visual commonsense understanding requires Vision Language (VL) models to not +only understand image and text but also cross-reference in-between to fully +integrate and achieve comprehension of the visual scene described. Recently, +various approaches have been developed and have achieved high performance on +visual commonsense benchmarks. However, it is unclear whether the models really +understand the visual scene and underlying commonsense knowledge due to limited +evaluation data resources. To provide an in-depth analysis, we present a +Multimodal Evaluation (ME) pipeline to automatically generate question-answer +pairs to test models' understanding of the visual scene, text, and related +knowledge. We then take a step further to show that training with the ME data +boosts the model's performance in standard VCR evaluation. Lastly, our in-depth +analysis and comparison reveal interesting findings: (1) semantically low-level +information can assist the learning of high-level information but not the +opposite; (2) visual information is generally underutilized compared with +text. + +
+
+ comment: Accepted to EMNLP 2022 Long Paper +
+
+
+
+
+ + ♻ ☆ A Car Model Identification System for Streamlining the Automobile Sales + Process + + +
+ This project presents an automated solution for the efficient identification +of car models and makes from images, aimed at streamlining the vehicle listing +process on online car-selling platforms. Through a thorough exploration +encompassing various efficient network architectures including Convolutional +Neural Networks (CNNs), Vision Transformers (ViTs), and hybrid models, we +achieved a notable accuracy of 81.97% employing the EfficientNet (V2 b2) +architecture. To refine performance, a combination of strategies, including +data augmentation, fine-tuning pretrained models, and extensive hyperparameter +tuning, were applied. The trained model offers the potential for automating +information extraction, promising enhanced user experiences across car-selling +websites. + +
+
+
+
+
+ + ♻ ☆ Revisiting the Evaluation of Image Synthesis with GANs NeurIPS 2023 + + +
+ A good metric, which promises a reliable comparison between solutions, is +essential for any well-defined task. Unlike most vision tasks that have +per-sample ground-truth, image synthesis tasks target generating unseen data +and hence are usually evaluated through a distributional distance between one +set of real samples and another set of generated samples. This study presents +an empirical investigation into the evaluation of synthesis performance, with +generative adversarial networks (GANs) as a representative of generative +models. In particular, we make in-depth analyses of various factors, including +how to represent a data point in the representation space, how to calculate a +fair distance using selected samples, and how many instances to use from each +set. Extensive experiments conducted on multiple datasets and settings reveal +several important findings. Firstly, a group of models that include both +CNN-based and ViT-based architectures serve as reliable and robust feature +extractors for measurement evaluation. Secondly, Centered Kernel Alignment +(CKA) provides a better comparison across various extractors and hierarchical +layers in one model. Finally, CKA is more sample-efficient and enjoys better +agreement with human judgment in characterizing the similarity between two +internal data correlations. These findings contribute to the development of a +new measurement system, which enables a consistent and reliable re-evaluation +of current state-of-the-art generative models. + +
+
+ comment: NeurIPS 2023 datasets and benchmarks track +
+
+
+
+
+ + ♻ ☆ Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection + + +
+ Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly +detection area - aims at utilizing a few samples of anomaly classes seen during +training to detect unseen anomalies (i.e., samples from open-set anomaly +classes), while effectively identifying the seen anomalies. Benefiting from the +prior knowledge illustrated by the seen anomalies, current OSAD methods can +often largely reduce false positive errors. However, these methods treat the +anomaly examples as from a homogeneous distribution, rendering them less +effective in generalizing to unseen anomalies that can be drawn from any +distribution. In this paper, we propose to learn heterogeneous anomaly +distributions using the limited anomaly examples to address this issue. To this +end, we introduce a novel approach, namely Anomaly Heterogeneity Learning +(AHL), that simulates a diverse set of heterogeneous (seen and unseen) anomaly +distributions and then utilizes them to learn a unified heterogeneous +abnormality model. Further, AHL is a generic framework that existing OSAD +models can plug and play for enhancing their abnormality modeling. Extensive +experiments on nine real-world anomaly detection datasets show that AHL can 1) +substantially enhance different state-of-the-art (SOTA) OSAD models in +detecting both seen and unseen anomalies, achieving new SOTA performance on a +large set of datasets, and 2) effectively generalize to unseen anomalies in new +target domains. + +
+
+ comment: 18 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ Dual Radar: A Multi-modal Dataset with Dual 4D Radar for Autonomous + Driving + + +
+ Radar has stronger adaptability in adverse scenarios for autonomous driving +environmental perception compared to widely adopted cameras and LiDARs. +Compared with commonly used 3D radars, the latest 4D radars have precise +vertical resolution and higher point cloud density, making them highly +promising sensors for autonomous driving in complex environmental perception. +However, due to the much higher noise than LiDAR, manufacturers choose +different filtering strategies, resulting in an inverse ratio between noise +level and point cloud density. There is still a lack of comparative analysis on +which method is beneficial for deep learning-based perception algorithms in +autonomous driving. One of the main reasons is that current datasets only adopt +one type of 4D radar, making it difficult to compare different 4D radars in the +same scene. Therefore, in this paper, we introduce a novel large-scale +multi-modal dataset featuring, for the first time, two types of 4D radars +captured simultaneously. This dataset enables further research into effective +4D radar perception algorithms. Our dataset consists of 151 consecutive series, +most of which last 20 seconds and contain 10,007 meticulously synchronized and +annotated frames. Moreover, our dataset captures a variety of challenging +driving scenarios, including many road conditions, weather conditions, +nighttime and daytime with different lighting intensities and periods. Our +dataset annotates consecutive frames, which can be applied to 3D object +detection and tracking, and also supports the study of multi-modal tasks. We +experimentally validate our dataset, providing valuable results for studying +different types of 4D radars. This dataset is released on +https://github.com/adept-thu/Dual-Radar. + +
+
+
+
+
+ + ♻ ☆ LanguageBind: Extending Video-Language Pretraining to N-modality by + Language-based Semantic Alignment ICLR 2024 + + +
+ The video-language (VL) pretraining has achieved remarkable improvement in +multiple downstream tasks. However, the current VL pretraining framework is +hard to extend to multiple modalities (N modalities, N>=3) beyond vision and +language. We thus propose LanguageBind, taking the language as the bind across +different modalities because the language modality is well-explored and +contains rich semantics. Specifically, we freeze the language encoder acquired +by VL pretraining, then train encoders for other modalities with contrastive +learning. As a result, all modalities are mapped to a shared feature space, +implementing multi-modal semantic alignment. While LanguageBind ensures that we +can extend VL modalities to N modalities, we also need a high-quality dataset +with alignment data pairs centered on language. We thus propose VIDAL-10M, a dataset with +Video, Infrared, Depth, Audio and their corresponding Language +data. In VIDAL-10M, all videos are from short video platforms with +complete semantics rather than truncated segments from long videos, and all the +video, depth, infrared, and audio modalities are aligned to their textual +descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 5.8% +R@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot +video-text retrieval task. Beyond this, our LanguageBind has greatly improved +in the zero-shot video, audio, depth, and infrared understanding tasks. For +instance, LanguageBind surpasses InterVideo by 1.9% on MSR-VTT, 8.8% on MSVD, +6.3% on DiDeMo, and 4.4% on ActivityNet. On the LLVIP and NYU-D datasets, +LanguageBind outperforms ImageBind with 23.8% and 11.1% top-1 accuracy. Code +address: https://github.com/PKU-YuanGroup/LanguageBind. + +
+
+ comment: Under review as a conference paper at ICLR 2024 +
+
+
+
+
+ + ♻ ☆ Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism NeurIPS 2023 + + +
+ In the past years, YOLO-series models have emerged as the leading approaches +in the area of real-time object detection. Many studies pushed up the baseline +to a higher level by modifying the architecture, augmenting data and designing +new losses. However, we find previous models still suffer from an information +fusion problem, although Feature Pyramid Network (FPN) and Path Aggregation +Network (PANet) have alleviated this. Therefore, this study provides an +advanced Gather-and-Distribute (GD) mechanism, which is realized with +convolution and self-attention operations. The newly designed model, named +Gold-YOLO, boosts the multi-scale feature fusion capabilities and +achieves an ideal balance between latency and accuracy across all model scales. +Additionally, we implement MAE-style pretraining in the YOLO-series for the +first time, allowing YOLO-series models to benefit from unsupervised +pretraining. Gold-YOLO-N attains an outstanding 39.9% AP on the COCO val2017 +datasets and 1030 FPS on a T4 GPU, which outperforms the previous SOTA model +YOLOv6-3.0-N with similar FPS by +2.4%. The PyTorch code is available at +https://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO, +and the MindSpore code is available at +https://gitee.com/mindspore/models/tree/master/research/cv/Gold_YOLO. + +
+
+ comment: Accepted by NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ KinD-LCE Curve Estimation And Retinex Fusion On Low-Light Image + + +
+ Low-light images often suffer from noise and color distortion. Object +detection, semantic segmentation, instance segmentation, and other tasks are +challenging when working with low-light images because of image noise and +chromatic aberration. We also found that the conventional Retinex theory loses +information in adjusting the image for low-light tasks. In response to the +aforementioned problem, this paper proposes an algorithm for low illumination +enhancement. The proposed method, KinD-LCE, uses a light curve estimation +module to enhance the illumination map in the Retinex decomposed image, +improving the overall image brightness. An illumination map and reflection map +fusion module were also proposed to restore the image details and reduce detail +loss. Additionally, a TV (total variation) loss function was applied to +eliminate noise. Our method was trained on the GladNet dataset, known for its +diverse collection of low-light images, tested against the Low-Light dataset, +and evaluated using the ExDark dataset for downstream tasks, demonstrating +competitive performance with a PSNR of 19.7216 and SSIM of 0.8213. + +
+
+ comment: Accepted by Signal, Image and Video Processing +
+
+
+
+
+ + ♻ ☆ Real3D-AD: A Dataset of Point Cloud Anomaly Detection + + +
+ High-precision point cloud anomaly detection is the gold standard for +identifying the defects of advancing machining and precision manufacturing. +Despite some methodological advances in this area, the scarcity of datasets and +the lack of a systematic benchmark hinder its development. We introduce +Real3D-AD, a challenging high-precision point cloud anomaly detection dataset, +addressing the limitations in the field. With 1,254 high-resolution 3D items +from forty thousand to millions of points for each item, Real3D-AD is the +largest dataset for high-precision 3D industrial anomaly detection to date. +Real3D-AD surpasses existing 3D anomaly detection datasets available regarding +point cloud resolution (0.0010mm-0.0015mm), 360 degree coverage and perfect +prototype. Additionally, we present a comprehensive benchmark for Real3D-AD, +revealing the absence of baseline methods for high-precision point cloud +anomaly detection. To address this, we propose Reg3D-AD, a registration-based +3D anomaly detection method incorporating a novel feature memory bank that +preserves local and global representations. Extensive experiments on the +Real3D-AD dataset highlight the effectiveness of Reg3D-AD. For reproducibility +and accessibility, we provide the Real3D-AD dataset, benchmark source code, and +Reg3D-AD on our website: https://github.com/M-3LAB/Real3D-AD. + +
+
+
+
+
+ + ♻ ☆ EDIS: Entity-Driven Image Search over Multimodal Web Content EMNLP 2023 + + +
+ Making image retrieval methods practical for real-world search applications +requires significant progress in dataset scales, entity comprehension, and +multimodal information fusion. In this work, we introduce +Entity-Driven Image Search (EDIS), a +challenging dataset for cross-modal image search in the news domain. EDIS +consists of 1 million web images from actual search engine results and curated +datasets, with each image paired with a textual description. Unlike datasets +that assume a small set of single-modality candidates, EDIS reflects real-world +web image search scenarios by including a million multimodal image-text pairs +as candidates. EDIS encourages the development of retrieval models that +simultaneously address cross-modal information fusion and matching. To achieve +accurate ranking results, a model must: 1) understand named entities and events +from text queries, 2) ground entities onto images or text descriptions, and 3) +effectively fuse textual and visual representations. Our experimental results +show that EDIS challenges state-of-the-art methods with dense entities and a +large-scale candidate set. The ablation study also proves that fusing textual +features with visual features is critical in improving retrieval results. + +
+
+ comment: EMNLP 2023 camera ready version +
+
+
+
+
+ + ♻ ☆ L-CAD: Language-based Colorization with Any-level Descriptions using + Diffusion Priors + + +
+ Language-based colorization produces plausible and visually pleasing colors +under the guidance of user-friendly natural language descriptions. Previous +methods implicitly assume that users provide comprehensive color descriptions +for most of the objects in the image, which leads to suboptimal performance. In +this paper, we propose a unified model to perform language-based colorization +with any-level descriptions. We leverage the pretrained cross-modality +generative model for its robust language understanding and rich color priors to +handle the inherent ambiguity of any-level descriptions. We further design +modules to align with input conditions to preserve local spatial structures and +prevent the ghosting effect. With the proposed novel sampling strategy, our +model achieves instance-aware colorization in diverse and complex scenarios. +Extensive experimental results demonstrate our advantages of effectively +handling any-level descriptions and outperforming both language-based and +automatic colorization methods. The code and pretrained models are available +at: https://github.com/changzheng123/L-CAD. + +
+
+
+
+
+ + ♻ ☆ Temporal Knowledge Sharing enable Spiking Neural Network Learning from + Past and Future + + +
+ Spiking Neural Networks (SNNs) have attracted significant attention from +researchers across various domains due to their brain-like information +processing mechanism. However, SNNs typically grapple with challenges such as +extended time steps, low temporal information utilization, and the requirement +for consistent time steps between testing and training. These challenges render +SNNs with high latency. Moreover, the constraint on time steps necessitates the +retraining of the model for new deployments, reducing adaptability. To address +these issues, this paper proposes a novel perspective, viewing the SNN as a +temporal aggregation model. We introduce the Temporal Knowledge Sharing (TKS) +method, facilitating information interaction between different time points. TKS +can be perceived as a form of temporal self-distillation. To validate the +efficacy of TKS in information processing, we tested it on static datasets like +CIFAR10, CIFAR100, ImageNet-1k, and neuromorphic datasets such as DVS-CIFAR10 +and NCALTECH101. Experimental results demonstrate that our method achieves +state-of-the-art performance compared to other algorithms. Furthermore, TKS +addresses the temporal consistency challenge, endowing the model with superior +temporal generalization capabilities. This allows the network to train with +longer time steps and maintain high performance during testing with shorter +time steps. Such an approach considerably accelerates the deployment of SNNs on +edge devices. Finally, we conducted ablation experiments and tested TKS on +fine-grained tasks, with results showcasing TKS's enhanced capability to +process information efficiently. + +
+
+
+
+
+ + ♻ ☆ Bullying10K: A Large-Scale Neuromorphic Dataset towards + Privacy-Preserving Bullying Recognition NeurIPS 2023 + + +
+ The prevalence of violence in daily life poses significant threats to +individuals' physical and mental well-being. Using surveillance cameras in +public spaces has proven effective in proactively deterring and preventing such +incidents. However, concerns regarding privacy invasion have emerged due to +their widespread deployment. To address the problem, we leverage Dynamic Vision +Sensors (DVS) cameras to detect violent incidents and preserve privacy since they +capture pixel brightness variations instead of static imagery. We introduce +the Bullying10K dataset, encompassing various actions, complex movements, and +occlusions from real-life scenarios. It provides three benchmarks for +evaluating different tasks: action recognition, temporal action localization, +and pose estimation. With 10,000 event segments, totaling 12 billion events and +255 GB of data, Bullying10K contributes significantly by balancing violence +detection and personal privacy preservation. It also poses a challenge to +the neuromorphic dataset. It will serve as a valuable resource for training and +developing privacy-protecting video systems. The Bullying10K opens new +possibilities for innovative approaches in these domains. + +
+
+ comment: Accepted at the 37th Conference on Neural Information Processing + Systems (NeurIPS 2023) Track on Datasets and Benchmarks +
+
+
+
+
+ + ♻ ☆ Boosting Generalization with Adaptive Style Techniques for Fingerprint + Liveness Detection + + +
+ We introduce a high-performance fingerprint liveness feature extraction +technique that secured first place in LivDet 2023 Fingerprint Representation +Challenge. Additionally, we developed a practical fingerprint recognition +system with 94.68% accuracy, earning second place in LivDet 2023 Liveness +Detection in Action. By investigating various methods, particularly style +transfer, we demonstrate improvements in accuracy and generalization when faced +with limited training data. As a result, our approach achieved state-of-the-art +performance in LivDet 2023 Challenges. + +
+
+ comment: 1st Place in LivDet2023 Fingerprint Representation Challenge +
+
+
+
+
+ + ♻ ☆ GreatSplicing: A Semantically Rich Splicing Dataset + + +
+ In existing splicing forgery datasets, the insufficient semantic varieties of +spliced regions cause a problem that trained detection models overfit semantic +features rather than splicing traces. Meanwhile, because of the absence of a +reasonable dataset, different detection methods proposed cannot reach a +consensus on experimental settings. To address these urgent issues, +GreatSplicing, a manually created splicing dataset with a considerable amount +and high quality, is proposed in this paper. GreatSplicing comprises 5,000 +spliced images and covers spliced regions with 335 distinct semantic +categories, allowing neural networks to grasp splicing traces better. Extensive +experiments demonstrate that models trained on GreatSplicing exhibit minimal +misidentification rates and superior cross-dataset detection capabilities +compared to existing datasets. Furthermore, GreatSplicing is available for all +research purposes and can be downloaded from www.greatsplicing.net. + +
+
+
+
+
+ + ♻ ☆ Enabling Real-time Neural Recovery for Cloud Gaming on Mobile Devices + + +
+ Cloud gaming is a multi-billion dollar industry. A client in cloud gaming +sends its movement to the game server on the Internet, which renders and +transmits the resulting video back. In order to provide a good gaming +experience, a latency below 80 ms is required. This means that video rendering, +encoding, transmission, decoding, and display have to finish within that time +frame, which is especially challenging to achieve due to server overload, +network congestion, and losses. In this paper, we propose a new method for +recovering lost or corrupted video frames in cloud gaming. Unlike traditional +video frame recovery, our approach uses game states to significantly enhance +recovery accuracy and utilizes partially decoded frames to recover lost +portions. We develop a holistic system that consists of (i) efficiently +extracting game states, (ii) modifying H.264 video decoder to generate a mask +to indicate which portions of video frames need recovery, and (iii) designing a +novel neural network to recover either complete or partial video frames. Our +approach is extensively evaluated using iPhone 12 and laptop implementations, +and we demonstrate the utility of game states in the game video recovery and +the effectiveness of our overall design. + +
+
+
+
+
+ + ♻ ☆ Im-Promptu: In-Context Composition from Image Prompts + + +
+ Large language models are few-shot learners that can solve diverse tasks from +a handful of demonstrations. This implicit understanding of tasks suggests that +the attention mechanisms over word tokens may play a role in analogical +reasoning. In this work, we investigate whether analogical reasoning can enable +in-context composition over composable elements of visual stimuli. First, we +introduce a suite of three benchmarks to test the generalization properties of +a visual in-context learner. We formalize the notion of an analogy-based +in-context learner and use it to design a meta-learning framework called +Im-Promptu. Whereas the requisite token granularity for language is well +established, the appropriate compositional granularity for enabling in-context +generalization in visual stimuli is usually unspecified. To this end, we use +Im-Promptu to train multiple agents with different levels of compositionality, +including vector representations, patch representations, and object slots. Our +experiments reveal tradeoffs between extrapolation abilities and the degree of +compositionality, with non-compositional representations extending learned +composition rules to unseen domains but performing poorly on combinatorial +tasks. Patch-based representations require patches to contain entire objects +for robust extrapolation. At the same time, object-centric tokenizers coupled +with a cross-attention module generate consistent and high-fidelity solutions, +with these inductive biases being particularly crucial for compositional +generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive +programming interface for image generation. + +
+
+
+
+
+ + ♻ ☆ Dual skip connections in U-Net, ResUnet and U-Net3+ for remote + extraction of buildings + + +
+ Urban buildings are extracted from high-resolution Earth observation (EO) +images using semantic segmentation networks like U-Net and its successors. Each +re-iteration aims to improve performance by employing a denser skip connection +mechanism that harnesses multi-scale features for accurate object mapping. +However, denser connections increase network parameters and do not necessarily +contribute to precise segmentation. In this paper, we develop three dual skip +connection mechanisms for three networks (U-Net, ResUnet, and U-Net3+) to +selectively deepen the essential feature maps for improved performance. The +three mechanisms are evaluated on feature maps of different scales, producing +nine new network configurations. They are evaluated against their original +vanilla configurations on four building footprint datasets of different spatial +resolutions, including a multi-resolution (0.3+0.6+1.2m) dataset that we +develop for complex urban environments. The evaluation revealed that densifying +the large- and small-scale features in U-Net and U-Net3+ produces up to 0.905 +F1, more than TransUnet (0.903) and Swin-Unet (0.882) in our new dataset with +up to 19x fewer parameters. The results show that selectively densifying +feature maps and skip connections enhances network performance without a +substantial increase in parameters. The findings and the new dataset will +contribute to the computer vision domain and urban planning decision processes. + +
+
+ comment: This work has been submitted to Springer for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ♻ ☆ OpenPatch: a 3D patchwork for Out-Of-Distribution detection + + +
+ Moving deep learning models from the laboratory setting to the open world +entails preparing them to handle unforeseen conditions. In several applications +the occurrence of novel classes during deployment poses a significant threat, +thus it is crucial to effectively detect them. Ideally, this skill should be +used when needed without requiring any further computational training effort at +every new task. Out-of-distribution detection has attracted significant +attention in recent years; however, the majority of the studies deal with 2D +images, ignoring the inherent 3D nature of the real world and often confusing +domain and semantic novelty. In this work, we focus on the latter, +considering the objects' geometric structure captured by 3D point clouds +regardless of the specific domain. We advance the field by introducing +OpenPatch that builds on a large pre-trained model and simply extracts from its +intermediate features a set of patch representations that describe each known +class. For any new sample, we obtain a novelty score by evaluating whether it +can be recomposed mainly by patches of a single known class or rather via the +contribution of multiple classes. We present an extensive experimental +evaluation of our approach for the task of semantic novelty detection on +real-world point cloud samples when the reference known data are synthetic. We +demonstrate that OpenPatch excels in both the full and few-shot known sample +scenarios, showcasing its robustness across varying pre-training objectives and +network backbones. The inherent training-free nature of our method allows for +its immediate application to a wide array of real-world tasks, offering a +compelling advantage over approaches that need expensive retraining efforts. + +
+
+
+
+
+ + ♻ ☆ Harnessing Hard Mixed Samples with Decoupled Regularizer NeurIPS'2023 + + +
+ Mixup is an efficient data augmentation approach that improves the +generalization of neural networks by smoothing the decision boundary with mixed +data. Recently, dynamic mixup methods have improved previous static policies +effectively (e.g., linear interpolation) by maximizing target-related salient +regions in mixed samples, but excessive additional time costs are not +acceptable. These additional computational overheads mainly come from +optimizing the mixed samples according to the mixed labels. However, we found +that the extra optimizing step may be redundant because label-mismatched mixed +samples are informative hard mixed samples for deep models to localize +discriminative features. In this paper, we thus are not trying to propose a +more complicated dynamic mixup policy but rather an efficient mixup objective +function with a decoupled regularizer named Decoupled Mixup (DM). The primary +effect is that DM can adaptively utilize those hard mixed samples to mine +discriminative features without losing the original smoothness of mixup. As a +result, DM enables static mixup methods to match or even exceed +the performance of dynamic methods without any extra computation. This also +leads to an interesting objective design problem for mixup training that we +need to focus on both smoothing the decision boundaries and identifying +discriminative features. Extensive experiments on supervised and +semi-supervised learning benchmarks across seven datasets validate the +effectiveness of DM as a plug-and-play module. Source code and models are +available at https://github.com/Westlake-AI/openmixup + +
+
+ comment: NeurIPS'2023 Camera Ready. The source code is available at + https://github.com/Westlake-AI/openmixup +
+
+
+
+
+ + ♻ ☆ Negligible effect of brain MRI data preprocessing for tumor segmentation + + +
+ Magnetic resonance imaging (MRI) data is heterogeneous due to differences in +device manufacturers, scanning protocols, and inter-subject variability. A +conventional way to mitigate MR image heterogeneity is to apply preprocessing +transformations such as anatomy alignment, voxel resampling, signal intensity +equalization, image denoising, and localization of regions of interest. +Although a preprocessing pipeline standardizes image appearance, its influence +on the quality of image segmentation and on other downstream tasks in deep +neural networks has never been rigorously studied. + We conduct experiments on three publicly available datasets and evaluate the +effect of different preprocessing steps in intra- and inter-dataset training +scenarios. Our results demonstrate that most popular standardization steps add +no value to the network performance; moreover, preprocessing can hamper model +performance. We suggest that image intensity normalization approaches do not +contribute to model accuracy because of the reduction of signal variance with +image standardization. Finally, we show that the contribution of +skull-stripping in data preprocessing is almost negligible if measured in terms +of estimated tumor volume. + We show that the only essential transformation for accurate deep learning +analysis is the unification of voxel spacing across the dataset. In contrast, +inter-subject anatomy alignment in the form of non-rigid atlas registration is +not necessary and intensity equalization steps (denoising, bias-field +correction and histogram matching) do not improve models' performance. The +study code is accessible online at +https://github.com/MedImAIR/brain-mri-processing-pipeline + +
+
+
+
+
+
+
+
+ + Information Retrieval 17 + +
+
+
+ + ☆ Budgeted Embedding Table For Recommender Systems WSDM 2024 + + +
+ At the heart of contemporary recommender systems (RSs) are latent factor +models that provide quality recommendation experience to users. These models +use embedding vectors, which are typically of a uniform and fixed size, to +represent users and items. As the number of users and items continues to grow, +this design becomes inefficient and hard to scale. Recent lightweight embedding +methods have enabled different users and items to have diverse embedding sizes, +but are commonly subject to two major drawbacks. Firstly, they limit the +embedding size search to optimizing a heuristic balancing the recommendation +quality and the memory complexity, where the trade-off coefficient needs to be +manually tuned for every memory budget requested. The implicitly enforced +memory complexity term can even fail to cap the parameter usage, making the +resultant embedding table fail to meet the memory budget strictly. Secondly, +most solutions, especially reinforcement learning based ones, derive and +optimize the embedding size for each user/item on an instance-by-instance +basis, which impedes the search efficiency. In this paper, we propose Budgeted +Embedding Table (BET), a novel method that generates table-level actions (i.e., +embedding sizes for all users and items) that are guaranteed to meet +pre-specified memory budgets. Furthermore, by leveraging a set-based action +formulation and engaging set representation learning, we present an innovative +action search strategy powered by an action fitness predictor that efficiently +evaluates each table-level action. Experiments have shown state-of-the-art +performance on two real-world datasets when BET is paired with three popular +recommender models under different memory budgets. + +
+
+ comment: Accepted by WSDM 2024 +
+
+
+
+
+ + ☆ DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye + Movement for Machine Reading EMNLP2023 + + +
+ The use of visually-rich documents (VRDs) in various fields has created a +demand for Document AI models that can read and comprehend documents like +humans, which requires the overcoming of technical, linguistic, and cognitive +barriers. Unfortunately, the lack of appropriate datasets has significantly +hindered advancements in the field. To address this issue, we introduce +DocTrack, a VRD dataset really aligned with human eye-movement +information using eye-tracking technology. This dataset can be used to +investigate the challenges mentioned above. Additionally, we explore the impact +of human reading order on document understanding tasks and examine what would +happen if a machine reads in the same order as a human. Our results suggest +that although Document AI models have made significant progress, they still +have a long way to go before they can read VRDs as accurately, continuously, +and flexibly as humans do. These findings have potential implications for +future research and development of Document AI models. The data is available at +https://github.com/hint-lab/doctrack. + +
+
+ comment: 14 pages, 8 figures, Accepted by Findings of EMNLP2023 +
+
+
+
+
+ + ☆ Conversational Recommender System and Large Language Model Are Made for + Each Other in E-commerce Pre-sales Dialogue EMNLP 2023 + + +
+ E-commerce pre-sales dialogue aims to understand and elicit user needs and +preferences for the items they are seeking so as to provide appropriate +recommendations. Conversational recommender systems (CRSs) learn user +representation and provide accurate recommendations based on dialogue context, +but rely on external knowledge. Large language models (LLMs) generate responses +that mimic pre-sales dialogues after fine-tuning, but lack domain-specific +knowledge for accurate recommendations. Intuitively, the strengths of LLM and +CRS in E-commerce pre-sales dialogues are complementary, yet no previous work +has explored this. This paper investigates the effectiveness of combining LLM +and CRS in E-commerce pre-sales dialogues, proposing two collaboration methods: +CRS assisting LLM and LLM assisting CRS. We conduct extensive experiments on a +real-world dataset of E-commerce pre-sales dialogues. We analyze the impact of +two collaborative approaches with two CRSs and two LLMs on four tasks of +E-commerce pre-sales dialogue. We find that collaborations between CRS and LLM +can be very effective in some cases. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Large Search Model: Redefining Search Stack in the Era of LLMs + + +
+ Modern search engines are built on a stack of different components, including +query understanding, retrieval, multi-stage ranking, and question answering, +among others. These components are often optimized and deployed independently. +In this paper, we introduce a novel conceptual framework called large search +model, which redefines the conventional search stack by unifying search tasks +with one large language model (LLM). All tasks are formulated as autoregressive +text generation problems, allowing for the customization of tasks through the +use of natural language prompts. This proposed framework capitalizes on the +strong language understanding and reasoning capabilities of LLMs, offering the +potential to enhance search result quality while simultaneously simplifying the +existing cumbersome search stack. To substantiate the feasibility of this +framework, we present a series of proof-of-concept experiments and discuss the +potential challenges associated with implementing this approach within +real-world search systems. + +
+
+ comment: 16 pages +
+
+
+
+
+ + ☆ CorefPrompt: Prompt-based Event Coreference Resolution by Measuring + Event Type and Argument Compatibilities EMNLP2023 + + +
+ Event coreference resolution (ECR) aims to group event mentions referring to +the same real-world event into clusters. Most previous studies adopt the +"encoding first, then scoring" framework, making the coreference judgment rely +on event encoding. Furthermore, current methods struggle to leverage +human-summarized ECR rules, e.g., coreferential events should have the same +event type, to guide the model. To address these two issues, we propose a +prompt-based approach, CorefPrompt, to transform ECR into a cloze-style MLM +(masked language model) task. This allows for simultaneous event modeling and +coreference discrimination within a single template, with a fully shared +context. In addition, we introduce two auxiliary prompt tasks, event-type +compatibility and argument compatibility, to explicitly demonstrate the +reasoning process of ECR, which helps the model make final predictions. +Experimental results show that our method CorefPrompt performs well in a +state-of-the-art (SOTA) benchmark. + +
+
+ comment: Accepted by EMNLP2023 +
+
+
+
+
+ + ☆ "Why Should I Review This Paper?" Unifying Semantic, Topic, and Citation + Factors for Paper-Reviewer Matching + + +
+ As many academic conferences are overwhelmed by a rapidly increasing number +of paper submissions, automatically finding appropriate reviewers for each +submission becomes a more urgent need than ever. Various factors have been +considered by previous attempts on this task to measure the expertise relevance +between a paper and a reviewer, including whether the paper is semantically +close to, shares topics with, and cites previous papers of the reviewer. +However, the majority of previous studies take only one of these factors into +account, leading to an incomplete evaluation of paper-reviewer relevance. +To bridge this gap, in this paper, we propose a unified model for +paper-reviewer matching that jointly captures semantic, topic, and citation +factors. In the unified model, a contextualized language model backbone is +shared by all factors to learn common knowledge, while instruction tuning is +introduced to characterize the uniqueness of each factor by producing +factor-aware paper embeddings. Experiments on four datasets (one of which is +newly contributed by us) across different fields, including machine learning, +computer vision, information retrieval, and data mining, consistently validate +the effectiveness of our proposed UniPR model in comparison with +state-of-the-art paper-reviewer matching methods and scientific pre-trained +language models. + +
+
+
+
+
+ + ☆ Towards Hybrid-grained Feature Interaction Selection for Deep Sparse + Network NeurIPS 2023 + + +
+ Deep sparse networks are widely investigated as a neural network architecture +for prediction tasks with high-dimensional sparse features, with which feature +interaction selection is a critical component. While previous methods primarily +focus on how to search feature interaction in a coarse-grained space, less +attention has been given to a finer granularity. In this work, we introduce a +hybrid-grained feature interaction selection approach that targets both feature +field and feature value for deep sparse networks. To explore such expansive +space, we propose a decomposed space which is calculated on the fly. We then +develop a selection algorithm called OptFeature, which efficiently selects the +feature interaction from both the feature field and the feature value +simultaneously. Results from experiments on three large real-world benchmark +datasets demonstrate that OptFeature performs well in terms of accuracy and +efficiency. Additional studies support the feasibility of our method. + +
+
+ comment: NeurIPS 2023 poster +
+
+
+
+
+ + ☆ Triple Simplex Matrix Completion for Expense Forecasting + + +
+ Forecasting project expenses is a crucial step for businesses to avoid budget +overruns and project failures. Traditionally, this has been done by financial +analysts or data science techniques such as time-series analysis. However, +these approaches can be uncertain and produce results that differ from the +planned budget, especially at the start of a project with limited data points. +This paper proposes a constrained non-negative matrix completion model that +predicts expenses by learning the likelihood of the project correlating with +certain expense patterns in the latent space. The model is constrained on three +probability simplexes, two of which are on the factor matrices and the third on +the missing entries. Additionally, the predicted expense values are guaranteed +to meet the budget constraint without the need of post-processing. An inexact +alternating optimization algorithm is developed to solve the associated +optimization problem and is proven to converge to a stationary point. Results +from two real datasets demonstrate the effectiveness of the proposed method in +comparison to state-of-the-art algorithms. + +
+
+ comment: 5 pages 2 figures +
+
+
+
+
+ + ♻ ☆ Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR + Decomposition EMNLP 2023 + + +
+ Cross-encoder models, which jointly encode and score a query-item pair, are +prohibitively expensive for direct k-nearest neighbor (k-NN) search. +Consequently, k-NN search typically employs a fast approximate retrieval (e.g. +using BM25 or dual-encoder vectors), followed by reranking with a +cross-encoder; however, the retrieval approximation often has detrimental +recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent +work that employs a cross-encoder only, making search efficient using a +relatively small number of anchor items, and a CUR matrix factorization. While +ANNCUR's one-time selection of anchors tends to approximate the cross-encoder +distances on average, doing so forfeits the capacity to accurately estimate +distances to items near the query, leading to regret in the crucial end-task: +recall of top-k items. In this paper, we propose ADACUR, a method that +adaptively, iteratively, and efficiently minimizes the approximation error for +the practically important top-k neighbors. It does so by iteratively performing +k-NN search using the anchors available so far, then adding these retrieved +nearest neighbors to the anchor set for the next round. Empirically, on +multiple datasets, in comparison to previous traditional and state-of-the-art +methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed +approach ADACUR consistently reduces recall error, by up to 70% on the important +k = 1 setting, while using no more compute than its competitors. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Zero-shot Query Reformulation for Conversational Search SIGIR + + +
+ As the popularity of voice assistants continues to surge, conversational +search has gained increased attention in Information Retrieval. However, data +sparsity issues in conversational search significantly hinder the progress of +supervised conversational search methods. Consequently, researchers are +focusing more on zero-shot conversational search approaches. Nevertheless, +existing zero-shot methods face three primary limitations: they are not +universally applicable to all retrievers, their effectiveness lacks sufficient +explainability, and they struggle to resolve common conversational ambiguities +caused by omission. To address these limitations, we introduce a novel +Zero-shot Query Reformulation (ZeQR) framework that reformulates queries based +on previous dialogue contexts without requiring supervision from conversational +search data. Specifically, our framework utilizes language models designed for +machine reading comprehension tasks to explicitly resolve two common +ambiguities: coreference and omission, in raw queries. In comparison to +existing zero-shot methods, our approach is universally applicable to any +retriever without additional adaptation or indexing. It also provides greater +explainability and effectively enhances query intent understanding because +ambiguities are explicitly and proactively resolved. Through extensive +experiments on four TREC conversational datasets, we demonstrate the +effectiveness of our method, which consistently outperforms state-of-the-art +baselines. + +
+
+ comment: Accepted by the 9th ACM SIGIR International Conference on the Theory + of Information Retrieval +
+
+
+
+
+ + ♻ ☆ Content-Based Search for Deep Generative Models + + +
+ The growing proliferation of customized and pretrained generative models has +made it infeasible for a user to be fully cognizant of every model in +existence. To address this need, we introduce the task of content-based model +search: given a query and a large set of generative models, finding the models +that best match the query. As each generative model produces a distribution of +images, we formulate the search task as an optimization problem to select the +model with the highest probability of generating similar content as the query. +We introduce a formulation to approximate this probability given the query from +different modalities, e.g., image, sketch, and text. Furthermore, we propose a +contrastive learning framework for model retrieval, which learns to adapt +features for various query modalities. We demonstrate that our method +outperforms several baselines on Generative Model Zoo, a new benchmark we +create for the model retrieval task. + +
+
+ comment: Our project page is hosted at + https://generative-intelligence-lab.github.io/modelverse/ +
+
+
+
+
+ + ♻ ☆ NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive + Decoders EMNLP 2023 + + +
+ Neural document rerankers are extremely effective in terms of accuracy. +However, the best models require dedicated hardware for serving, which is +costly and often not feasible. To avoid this serving-time requirement, we +present a method of capturing up to 86% of the gains of a Transformer +cross-attention model with a lexicalized scoring function that only requires +$10^{-6}$% of the Transformer's FLOPs per document and can be served using commodity +CPUs. When combined with a BM25 retriever, this approach matches the quality of +a state-of-the-art dual encoder retriever that still requires an accelerator +for query encoding. We introduce NAIL (Non-Autoregressive Indexing with +Language models) as a model architecture that is compatible with recent +encoder-decoder and decoder-only large language models, such as T5, GPT-3 and +PaLM. This model architecture can leverage existing pre-trained checkpoints and +can be fine-tuned for efficiently constructing document representations that do +not require neural processing of queries. + +
+
+ comment: To appear at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ MISSRec: Pre-training and Transferring Multi-modal Interest-aware + Sequence Representation for Recommendation ACM MM 2023 + + +
+ The goal of sequential recommendation (SR) is to predict the items a user is +likely to be interested in based on her/his historical interaction sequences. Most +existing sequential recommenders are developed based on ID features, which, +despite their widespread use, often underperform with sparse IDs and struggle +with the cold-start problem. Besides, inconsistent ID mappings hinder the +model's transferability, isolating similar recommendation domains that could +have been co-optimized. This paper aims to address these issues by exploring +the potential of multi-modal information in learning robust and generalizable +sequence representations. We propose MISSRec, a multi-modal pre-training and +transfer learning framework for SR. On the user side, we design a +Transformer-based encoder-decoder model, where the contextual encoder learns to +capture the sequence-level multi-modal user interests while a novel +interest-aware decoder is developed to grasp item-modality-interest relations +for better sequence representation. On the candidate item side, we adopt a +dynamic fusion module to produce user-adaptive item representation, providing +more precise matching between users and items. We pre-train the model with +contrastive learning objectives and fine-tune it in an efficient manner. +Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, +promising a practical solution for real-world recommendation scenarios. Data +and code are available at https://github.com/gimpong/MM23-MISSRec. + +
+
+ comment: Accepted to ACM MM 2023. Data and code are available +
+
+
+
+
+ + ♻ ☆ Personalized Elastic Embedding Learning for On-Device Recommendation + + +
+ To address privacy concerns and reduce network latency, there has been a +recent trend of compressing cumbersome recommendation models trained on the +cloud and deploying compact recommender models to resource-limited devices for +real-time recommendation. Existing solutions generally overlook device +heterogeneity and user heterogeneity. They either require all devices to share +the same compressed model or the devices with the same resource budget to share +the same model. However, even users with the same devices may have different +preferences. In addition, they assume the available resources (e.g., memory) +for the recommender on a device are constant, which is not reflective of +reality. In light of device and user heterogeneities as well as dynamic +resource constraints, this paper proposes a Personalized Elastic Embedding +Learning framework (PEEL) for on-device recommendation, which generates +personalized embeddings for devices with various memory budgets in once-for-all +manner, efficiently adapting to new or dynamic budgets, and effectively +addressing user preference diversity by assigning personalized embeddings for +different groups of users. Specifically, it pretrains using user-item +interaction instances to generate the global embedding table and cluster users +into groups. Then, it refines the embedding tables with local interaction +instances within each group. Personalized elastic embedding is generated from +the group-wise embedding blocks and their weights that indicate the +contribution of each embedding block to the local recommendation performance. +PEEL efficiently generates personalized elastic embeddings by selecting +embedding blocks with the largest weights, making it adaptable to dynamic +memory budgets. Extensive experiments are conducted on two public datasets, and +the results show that PEEL yields superior performance on devices with +heterogeneous and dynamic memory budgets. + +
+
+
+
+
+ + ♻ ☆ EDIS: Entity-Driven Image Search over Multimodal Web Content EMNLP 2023 + + +
+ Making image retrieval methods practical for real-world search applications +requires significant progress in dataset scales, entity comprehension, and +multimodal information fusion. In this work, we introduce +Entity-Driven Image Search (EDIS), a +challenging dataset for cross-modal image search in the news domain. EDIS +consists of 1 million web images from actual search engine results and curated +datasets, with each image paired with a textual description. Unlike datasets +that assume a small set of single-modality candidates, EDIS reflects real-world +web image search scenarios by including a million multimodal image-text pairs +as candidates. EDIS encourages the development of retrieval models that +simultaneously address cross-modal information fusion and matching. To achieve +accurate ranking results, a model must: 1) understand named entities and events +from text queries, 2) ground entities onto images or text descriptions, and 3) +effectively fuse textual and visual representations. Our experimental results +show that EDIS challenges state-of-the-art methods with dense entities and a +large-scale candidate set. The ablation study also proves that fusing textual +features with visual features is critical in improving retrieval results. + +
+
+ comment: EMNLP 2023 camera ready version +
+
+
+
+
+ + ♻ ☆ I^3 Retriever: Incorporating Implicit Interaction in Pre-trained + Language Models for Passage Retrieval + + +
+ Passage retrieval is a fundamental task in many information systems, such as +web search and question answering, where both efficiency and effectiveness are +critical concerns. In recent years, neural retrievers based on pre-trained +language models (PLM), such as dual-encoders, have achieved huge success. Yet, +studies have found that the performance of dual-encoders is often limited because +they neglect the interaction information between queries and candidate +passages. Therefore, various interaction paradigms have been proposed to +improve the performance of vanilla dual-encoders. Particularly, recent +state-of-the-art methods often introduce late-interaction during the model +inference process. However, such late-interaction based methods usually incur +substantial computation and storage costs on large corpora. Despite their +effectiveness, the concern of efficiency and space footprint is still an +important factor that limits the application of interaction-based neural +retrieval models. To tackle this issue, we incorporate implicit interaction +into dual-encoders, and propose I^3 retriever. In particular, our implicit +interaction paradigm leverages generated pseudo-queries to simulate +query-passage interaction, which is jointly optimized with the query and passage +encoders in an end-to-end manner. It can be fully pre-computed and cached, and +its inference process only involves a simple dot product of the query +vector and passage vector, which makes it as efficient as the vanilla dual +encoders. We conduct comprehensive experiments on MSMARCO and TREC2019 Deep +Learning Datasets, demonstrating the I^3 retriever's superiority in terms of +both effectiveness and efficiency. Moreover, the proposed implicit interaction +is compatible with special pre-training and knowledge distillation for passage +retrieval, which brings a new state-of-the-art performance. + +
+
+ comment: 10 pages +
+
+
+
+
+ + ♻ ☆ Evaluating Verifiability in Generative Search Engines EMNLP 2023 + + +
+ Generative search engines directly generate responses to user queries, along +with in-line citations. A prerequisite trait of a trustworthy generative search +engine is verifiability, i.e., systems should cite comprehensively (high +citation recall; all statements are fully supported by citations) and +accurately (high citation precision; every cite supports its associated +statement). We conduct human evaluation to audit four popular generative search +engines -- Bing Chat, NeevaAI, perplexity.ai, and YouChat -- across a diverse +set of queries from a variety of sources (e.g., historical Google user queries, +dynamically-collected open-ended questions on Reddit, etc.). We find that +responses from existing generative search engines are fluent and appear +informative, but frequently contain unsupported statements and inaccurate +citations: on average, a mere 51.5% of generated sentences are fully supported +by citations and only 74.5% of citations support their associated sentence. We +believe that these results are concerningly low for systems that may serve as a +primary tool for information-seeking users, especially given their facade of +trustworthiness. We hope that our results further motivate the development of +trustworthy generative search engines and help researchers and users better +understand the shortcomings of existing commercial systems. + +
+
+ comment: 25 pages, 12 figures; to appear in Findings of EMNLP 2023 +
+
+
+
+
+
+
+
+ + Machine Learning 150 + +
+
+
+ + ☆ Ghost on the Shell: An Expressive Representation of General 3D Shapes + + +
+ The creation of photorealistic virtual worlds requires the accurate modeling +of 3D surface geometry for a wide range of objects. For this, meshes are +appealing since they 1) enable fast physics-based rendering with realistic +material and lighting, 2) support physical simulation, and 3) are +memory-efficient for modern graphics pipelines. Recent work on reconstructing +and statistically modeling 3D shape, however, has critiqued meshes as being +topologically inflexible. To capture a wide range of object shapes, any 3D +representation must be able to model solid, watertight, shapes as well as thin, +open, surfaces. Recent work has focused on the former, and methods for +reconstructing open surfaces do not support fast reconstruction with material +and lighting or unconditional generative modelling. Inspired by the observation +that open surfaces can be seen as islands floating on watertight surfaces, we +parameterize open surfaces by defining a manifold signed distance field on +watertight templates. With this parameterization, we further develop a +grid-based and differentiable representation that parameterizes both watertight +and non-watertight meshes of arbitrary topology. Our new representation, called +Ghost-on-the-Shell (G-Shell), enables two important applications: +differentiable rasterization-based reconstruction from multiview images and +generative modelling of non-watertight meshes. We empirically demonstrate that +G-Shell achieves state-of-the-art performance on non-watertight mesh +reconstruction and generation tasks, while also performing effectively for +watertight meshes. + +
+
+ comment: Technical Report (26 pages, 16 figures) +
+
+
+
+
+ + ☆ Handling Data Heterogeneity via Architectural Design for Federated + Visual Recognition NeurIPS 2023 + + +
+ Federated Learning (FL) is a promising research paradigm that enables the +collaborative training of machine learning models among various parties without +the need for sensitive information exchange. Nonetheless, retaining data in +individual clients introduces fundamental challenges to achieving performance +on par with centrally trained models. Our study provides an extensive review of +federated learning applied to visual recognition. It underscores the critical +role of thoughtful architectural design choices in achieving optimal +performance, a factor often neglected in the FL literature. Many existing FL +solutions are tested on shallow or simple networks, which may not accurately +reflect real-world applications. This practice restricts the transferability of +research findings to large-scale visual recognition models. Through an in-depth +analysis of diverse cutting-edge architectures such as convolutional neural +networks, transformers, and MLP-mixers, we experimentally demonstrate that +architectural choices can substantially enhance FL systems' performance, +particularly when handling heterogeneous data. We study 19 visual recognition +models from five different architectural families on four challenging FL +datasets. We also re-investigate the inferior performance of convolution-based +architectures in the FL setting and analyze the influence of normalization +layers on the FL performance. Our findings emphasize the importance of +architectural design for computer vision tasks in practical scenarios, +effectively narrowing the performance gap between federated and centralized +learning. Our source code is available at +https://github.com/sarapieri/fed_het.git. + +
+
+ comment: to be published in NeurIPS 2023 +
+
+
+
+
+ + ☆ Linear Representations of Sentiment in Large Language Models + + +
+ Sentiment is a pervasive feature in natural language text, yet it is an open +question how sentiment is represented within Large Language Models (LLMs). In +this study, we reveal that across a range of models, sentiment is represented +linearly: a single direction in activation space mostly captures the feature +across a range of tasks with one extreme for positive and the other for +negative. Through causal interventions, we isolate this direction and show it +is causally relevant in both toy tasks and real world datasets such as Stanford +Sentiment Treebank. Through this case study we model a thorough investigation +of what a single direction means on a broad data distribution. + We further uncover the mechanisms that involve this direction, highlighting +the roles of a small subset of attention heads and neurons. Finally, we +discover a phenomenon which we term the summarization motif: sentiment is not +solely represented on emotionally charged words, but is additionally summarized +at intermediate positions without inherent sentiment, such as punctuation and +names. We show that in Stanford Sentiment Treebank zero-shot classification, +76% of above-chance classification accuracy is lost when ablating the sentiment +direction, nearly half of which (36%) is due to ablating the summarized +sentiment direction exclusively at comma positions. + +
+
+
+
+
+ + ☆ Verb Conjugation in Transformers Is Determined by Linear Encodings of + Subject Number EMNLP 2023 + + +
+ Deep architectures such as Transformers are sometimes criticized for having +uninterpretable "black-box" representations. We use causal intervention +analysis to show that, in fact, some linguistic features are represented in a +linear, interpretable format. Specifically, we show that BERT's ability to +conjugate verbs relies on a linear encoding of subject number that can be +manipulated with predictable effects on conjugation accuracy. This encoding is +found in the subject position at the first layer and the verb position at the +last layer, but distributed across positions at middle layers, particularly +when there are multiple cues to subject number. + +
+
+ comment: To appear in Findings of the Association for Computational + Linguistics: EMNLP 2023 +
+
+
+
+
+ + ☆ Online Detection of AI-Generated Images ICCV + + +
+ With advancements in AI-generated images coming on a continuous basis, it is +increasingly difficult to distinguish traditionally-sourced images (e.g., +photos, artwork) from AI-generated ones. Previous detection methods study the +generalization from a single generator to another in isolation. However, in +reality, new generators are released on a streaming basis. We study +generalization in this setting, training on N models and testing on the next +(N+k), following the historical release dates of well-known generation methods. +Furthermore, images increasingly consist of both real and generated components, +for example through image inpainting. Thus, we extend this approach to pixel +prediction, demonstrating strong performance using automatically-generated +inpainted data. In addition, for settings where commercial models are not +publicly available for automatic data generation, we evaluate if pixel +detectors can be trained solely on whole synthetic images. + +
+
+ comment: ICCV DeepFake Analysis and Detection Workshop, 2023 +
+
+
+
+
+ + ☆ Unlocking the Transferability of Tokens in Deep Models for Tabular Data + + +
+ Fine-tuning a pre-trained deep neural network has become a successful +paradigm in various machine learning tasks. However, such a paradigm becomes +particularly challenging with tabular data when there are discrepancies between +the feature sets of pre-trained models and the target tasks. In this paper, we +propose TabToken, a method that aims at enhancing the quality of feature tokens +(i.e., embeddings of tabular features). TabToken allows for the utilization of +pre-trained models when the upstream and downstream tasks share overlapping +features, facilitating model fine-tuning even with limited training examples. +Specifically, we introduce a contrastive objective that regularizes the tokens, +capturing the semantics within and across features. During the pre-training +stage, the tokens are learned jointly with top-layer deep models such as +transformers. In the downstream task, tokens of the shared features are kept +fixed while TabToken efficiently fine-tunes the remaining parts of the model. +TabToken not only enables knowledge transfer from a pre-trained model to tasks +with heterogeneous features, but also enhances the discriminative ability of +deep tabular models in standard classification and regression tasks. + +
+
+
+
+
+ + ☆ Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for + Autonomous Real-World Reinforcement Learning + + +
+ The pre-train and fine-tune paradigm in machine learning has had dramatic +success in a wide range of domains because the use of existing data or +pre-trained models on the internet enables quick and easy learning of new +tasks. We aim to enable this paradigm in robotic reinforcement learning, +allowing a robot to learn a new task with little human effort by leveraging +data and models from the Internet. However, reinforcement learning often +requires significant human effort in the form of manual reward specification or +environment resets, even if the policy is pre-trained. We introduce RoboFuME, a +reset-free fine-tuning system that pre-trains a multi-task manipulation policy +from diverse datasets of prior experiences and self-improves online to learn a +target task with minimal human intervention. Our insights are to utilize +calibrated offline reinforcement learning techniques to ensure efficient online +fine-tuning of a pre-trained policy in the presence of distribution shifts and +leverage pre-trained vision language models (VLMs) to build a robust reward +classifier for autonomously providing reward signals during the online +fine-tuning process. In a diverse set of five real robot manipulation tasks, we +show that our method can incorporate data from an existing robot dataset +collected at a different institution and improve on a target task within as +little as 3 hours of autonomous real-world experience. We also demonstrate in +simulation experiments that our method outperforms prior works that use +different RL algorithms or different approaches for predicting rewards. Project +website: https://robofume.github.io + +
+
+
+
+
+ + ☆ Hyperparameter optimization of hp-greedy reduced basis for gravitational + wave surrogates + + +
+ In a previous work we introduced, in the context of gravitational wave +science, an initial study on an automated domain-decomposition approach for +reduced basis through hp-greedy refinement. The approach constructs local +reduced bases of lower dimensionality than global ones, with the same or higher +accuracy. These "light" local bases should imply both faster evaluations when +predicting new waveforms and faster data analysis, in particular faster +statistical inference (the forward and inverse problems, respectively). In this +approach, however, we have previously found an important dependence on several +hyperparameters, which do not appear in global reduced bases. This naturally +leads to the problem of hyperparameter optimization (HPO), which is the subject +of this paper. We tackle the problem through Bayesian optimization, and show +its superiority when compared to grid or random searches. We find that for +gravitational waves from the collision of two spinning but non-precessing black +holes, for the same accuracy, local hp-greedy reduced bases with HPO have a +dimensionality up to $4 \times$ lower for the cases here studied, depending +on the desired accuracy. This factor should directly translate into a parameter +estimation speedup, for instance. Such acceleration might help in the near +real-time requirements for electromagnetic counterparts of gravitational waves +from compact binary coalescences. In addition, we find that the Bayesian +approach used in this paper for HPO is two orders of magnitude faster than, for +example, a grid search, with about a $100 \times$ acceleration. The code +developed for this project is available as open source from public +repositories. + +
+
+ comment: This paper is an invited contribution to the Special Issue "Recent + Advances in Gravity: A Themed Issue in Honor of Prof. Jorge Pullin on his + 60th Anniversary" +
+
+
+
+
+ + ☆ SpecTr: Fast Speculative Decoding via Optimal Transport + + +
+ Autoregressive sampling from large language models has led to +state-of-the-art results in several natural language tasks. However, +autoregressive sampling generates tokens one at a time making it slow, and even +prohibitive in certain tasks. One way to speed up sampling is +$\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ +(block or sequence of tokens), and then score all tokens in the draft by the +large language model in parallel. A subset of the tokens in the draft are +accepted (and the rest rejected) based on a statistical method to guarantee +that the final output follows the distribution of the large model. In this +work, we provide a principled understanding of speculative decoding through the +lens of optimal transport (OT) with $\textit{membership cost}$. This framework +can be viewed as an extension of the well-known $\textit{maximal-coupling}$ +problem. This new formulation enables us to generalize the speculative decoding +method to allow for a set of $k$ candidates at the token-level, which leads to +an improved optimal membership cost. We show that the optimal draft selection +algorithm (transport plan) can be computed via linear programming, whose +best-known runtime is exponential in $k$. We then propose a valid draft +selection algorithm whose acceptance probability is $(1-1/e)$-optimal +multiplicatively. Moreover, it can be computed in time almost linear in the size +of the domain of a single token. Using this new draft selection algorithm, we +develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which +provides speedup in decoding while ensuring that there is no quality +degradation in the decoded output. We experimentally demonstrate that for +state-of-the-art large language models, the proposed approach achieves a wall +clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on +standard benchmarks. + +
+
+
+
+
+ + ☆ AutoDAN: Automatic and Interpretable Adversarial Attacks on Large + Language Models + + +
+ Safety alignment of Large Language Models (LLMs) can be compromised with +manual jailbreak attacks and (automatic) adversarial attacks. Recent work +suggests that patching LLMs against these attacks is possible: manual jailbreak +attacks are human-readable but often limited and public, making them easy to +block; adversarial attacks generate gibberish prompts that can be detected +using perplexity-based filters. In this paper, we show that these solutions may +be too optimistic. We propose an interpretable adversarial attack, +AutoDAN, that combines the strengths of both types of attacks. It +automatically generates attack prompts that bypass perplexity-based filters +while maintaining a high attack success rate like manual jailbreak attacks. +These prompts are interpretable and diverse, exhibiting strategies commonly +used in manual jailbreak attacks, and transfer better than their non-readable +counterparts when using limited training data or a single proxy model. We also +customize AutoDAN's objective to leak system prompts, another +jailbreak application not addressed in the adversarial attack literature, +demonstrating the versatility of the +approach. Our work provides a new way to red-team LLMs and to understand the +mechanism of jailbreak attacks. + +
+
+
+
+
+ + ☆ Quantifying the Dialect Gap and its Correlates Across Languages EMNLP + + +
+ Historically, researchers and consumers have noticed a decrease in quality +when applying NLP tools to minority variants of languages (e.g., Puerto Rican +Spanish or Swiss German), but studies exploring this have been limited to a +select few languages. Additionally, past studies have mainly been conducted in +a monolingual context, so cross-linguistic trends have not been identified and +tied to external factors. In this work, we conduct a comprehensive evaluation +of the most influential, state-of-the-art large language models (LLMs) across +two high-use applications, machine translation and automatic speech +recognition, to assess their functionality on the regional dialects of several +high- and low-resource languages. Additionally, we analyze how the regional +dialect gap is correlated with economic, social, and linguistic factors. The +impact of training data, including related factors like dataset size and its +construction procedure, is shown to be significant but not consistent across +models or languages, meaning a one-size-fits-all approach cannot be taken in +solving the dialect gap. This work will lay the foundation for furthering the +field of dialectal NLP by documenting evident disparities and identifying +possible pathways for addressing them through mindful data collection. + +
+
+ comment: Accepted to EMNLP Findings 2023 +
+
+
+
+
+ + ☆ Location-Aware Visual Question Generation with Lightweight Models EMNLP 2023 + + +
+ This work introduces a novel task, location-aware visual question generation +(LocaVQG), which aims to generate engaging questions from data relevant to a +particular geographical location. Specifically, we represent such +location-aware information with surrounding images and a GPS coordinate. To +tackle this task, we present a dataset generation pipeline that leverages GPT-4 +to produce diverse and sophisticated questions. Then, we aim to learn a +lightweight model that can address the LocaVQG task and fit on an edge device, +such as a mobile phone. To this end, we propose a method which can reliably +generate engaging questions from location-aware information. Our proposed +method outperforms baselines regarding human evaluation (e.g., engagement, +grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, +ROUGE-2). Moreover, we conduct extensive ablation studies to justify our +proposed techniques for both generating the dataset and solving the task. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Projected Stochastic Gradient Descent with Quantum Annealed Binary + Gradients + + +
+ We present QP-SBGD, a novel layer-wise stochastic optimiser tailored towards +training neural networks with binary weights, known as binary neural networks +(BNNs), on quantum hardware. BNNs reduce the computational requirements and +energy consumption of deep learning models with minimal loss in accuracy. +However, training them in practice remains an open challenge. Most known +BNN-optimisers either rely on projected updates or binarise weights +post-training. Instead, QP-SBGD approximately maps the gradient onto binary +variables by solving a quadratic constrained binary optimisation. Under +practically reasonable assumptions, we show that this update rule converges +with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the +$\mathcal{NP}$-hard projection can be effectively executed on an adiabatic +quantum annealer, harnessing recent advancements in quantum computation. We +also introduce a projected version of this update rule and prove that if a +fixed point exists in the binary variable space, the modified updates will +converge to it. Last but not least, our algorithm is implemented layer-wise, +making it suitable to train larger networks on resource-limited quantum +hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is +on par with competitive and well-established baselines such as BinaryConnect, +signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as +well as binary graph neural networks. + +
+
+
+
+
+ + ☆ Open-Ended Instructable Embodied Agents with Memory-Augmented Large + Language Models + + +
+ Pre-trained and frozen LLMs can effectively map simple scene re-arrangement +instructions to programs over a robot's visuomotor functions through +appropriate few-shot example prompting. However, fixed prompts fall short when +parsing open-domain natural language and adapting to a user's idiosyncratic +procedures that are not known at prompt engineering time. In this paper, we introduce HELPER, +an embodied agent equipped with an external memory of language-program pairs +that parses free-form human-robot dialogue into action programs through +retrieval-augmented LLM prompting: relevant memories are retrieved based on the +current dialogue, instruction, correction or VLM description, and used as +in-context prompt examples for LLM querying. The memory is expanded during +deployment to include pairs of the user's language and action plans, to assist +future inferences and personalize them to the user's language and routines. +HELPER sets a new state-of-the-art on the TEACh benchmark in both Execution +from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x +improvement over the previous SOTA for TfD. Our models, code and video results +can be found on our project website: https://helper-agent-llm.github.io. + +
+
+ comment: https://helper-agent-llm.github.io +
+
+
+
+
+ + ☆ Mixed-Variable Global Sensitivity Analysis For Knowledge Discovery And + Efficient Combinatorial Materials Design + + +
+ Global Sensitivity Analysis (GSA) is the study of the influence of any given +inputs on the outputs of a model. In the context of engineering design, GSA has +been widely used to understand both individual and collective contributions of +design variables on the design objectives. So far, global sensitivity studies +have often been limited to design spaces with only quantitative (numerical) +design variables. However, many engineering systems contain qualitative +(categorical) design variables, either in addition to or instead of quantitative design +variables. In this paper, we integrate Latent Variable Gaussian Process (LVGP) +with Sobol' analysis to develop the first metamodel-based mixed-variable GSA +method. Through numerical case studies, we validate and demonstrate the +effectiveness of our proposed method for mixed-variable problems. Furthermore, +while the proposed GSA method is general enough to benefit various engineering +design applications, we integrate it with multi-objective Bayesian optimization +(BO) to create a sensitivity-aware design framework that accelerates the Pareto +front design exploration for metal-organic framework (MOF) materials with +many-level combinatorial design spaces. Although MOFs are constructed only from +qualitative variables that are notoriously difficult to design, our method can +utilize sensitivity analysis to navigate the optimization in the many-level +large combinatorial design space, greatly expediting the exploration of novel +MOF candidates. + +
+
+ comment: 35 Pages, 10 Figures, 2 Tables +
+
+
+
+
+ + ☆ Branch-Solve-Merge Improves Large Language Model Evaluation and + Generation + + +
+ Large Language Models (LLMs) are frequently used for multi-faceted language +generation and evaluation tasks that involve satisfying intricate user +constraints or taking into account multiple aspects and criteria. However, +their performance can fall short, due to the model's lack of coherence and +inability to plan and decompose the problem. We propose Branch-Solve-Merge +(BSM), a Large Language Model program (Schlag et al., 2023) for tackling such +challenging natural language tasks. It consists of branch, solve, and merge +modules that are parameterized with specific prompts to the base LLM. These +three modules plan a decomposition of the task into multiple parallel +sub-tasks, independently solve them, and fuse the solutions to the sub-tasks. +We apply our method to the tasks of LLM response evaluation and constrained +text generation and evaluate its effectiveness with multiple LLMs, including +Vicuna, LLaMA-2-chat, and GPT-4. BSM improves the evaluation correctness and +consistency for each LLM by enhancing human-LLM agreement by up to 26%, +reducing length and pairwise position biases by up to 50%, and allowing +LLaMA-2-chat to match or outperform GPT-4 on most domains. On the constraint +story generation task, BSM improves the coherence of the stories while also +improving constraint satisfaction by 12%. + +
+
+ comment: 22 pages, 7 figures, 10 tables +
+
+
+
+
+ + ☆ Matryoshka Diffusion Models + + +
+ Diffusion models are the de facto approach for generating high-quality images +and videos, but learning high-dimensional models remains a formidable task due +to computational and optimization challenges. Existing methods often resort to +training cascaded models in pixel space or using a downsampled latent space of +a separately trained auto-encoder. In this paper, we introduce Matryoshka +Diffusion Models (MDM), an end-to-end framework for high-resolution image and +video synthesis. We propose a diffusion process that denoises inputs at +multiple resolutions jointly and uses a NestedUNet architecture where features +and parameters for small-scale inputs are nested within those of large scales. +In addition, MDM enables a progressive training schedule from lower to higher +resolutions, which leads to significant improvements in optimization for +high-resolution generation. We demonstrate the effectiveness of our approach on +various benchmarks, including class-conditioned image generation, +high-resolution text-to-image, and text-to-video applications. Remarkably, we +can train a single pixel-space model at resolutions of up to 1024x1024 pixels, +demonstrating strong zero-shot generalization using the CC12M dataset, which +contains only 12 million images. + +
+
+ comment: 28 pages, 18 figures +
+
+
+
+
+ + ☆ Evaluating machine learning models in non-standard settings: An overview + and new findings + + +
+ Estimating the generalization error (GE) of machine learning models is +fundamental, with resampling methods being the most common approach. However, +in non-standard settings, particularly those where observations are not +independently and identically distributed, resampling using simple random data +divisions may lead to biased GE estimates. This paper strives to present +well-grounded guidelines for GE estimation in various such non-standard +settings: clustered data, spatial data, unequal sampling probabilities, concept +drift, and hierarchically structured outcomes. Our overview combines +well-established methodologies with other existing methods that, to our +knowledge, have not been frequently considered in these particular settings. A +unifying principle among these techniques is that the test data used in each +iteration of the resampling procedure should reflect the new observations to +which the model will be applied, while the training data should be +representative of the entire data set used to obtain the final model. Beyond +providing an overview, we address literature gaps by conducting simulation +studies. These studies assess the necessity of using GE-estimation methods +tailored to the respective setting. Our findings corroborate the concern that +standard resampling methods often yield biased GE estimates in non-standard +settings, underscoring the importance of tailored GE estimation. + +
+
+
+
+
+ + ☆ Dual-path convolutional neural network using micro-FTIR imaging to + predict breast cancer subtypes and biomarkers levels: estrogen receptor, + progesterone receptor, HER2 and Ki67 + + +
+ Breast cancer molecular subtype classification plays an important role in sorting +patients with divergent prognoses. The biomarkers used are Estrogen Receptor +(ER), Progesterone Receptor (PR), HER2, and Ki67. Based on the expression levels +of these biomarkers, subtypes are classified as Luminal A (LA), Luminal B (LB), +HER2 subtype, and Triple-Negative Breast Cancer (TNBC). Immunohistochemistry is +used to classify subtypes, although interlaboratory and interobserver +variations can affect its accuracy, and it is a time-consuming technique. +Fourier transform infrared micro-spectroscopy may be coupled with deep +learning for cancer evaluation, but there is still a lack of studies on +subtype and biomarker level prediction. This study presents a novel 2D deep +learning approach to achieve these predictions. Sixty micro-FTIR images of +320x320 pixels were collected from a human breast biopsies microarray. Data +were clustered by K-means and preprocessed, and 32x32 patches were generated using +a fully automated approach. CaReNet-V2, a novel convolutional neural network, +was developed to classify breast cancer (CA) vs adjacent tissue (AT) and +molecular subtypes, and to predict biomarker levels. The clustering method +enabled the removal of non-tissue pixels. Test accuracies for CA vs AT and subtype +were above 0.84. The model enabled the prediction of ER, PR, and HER2 levels, +where borderline values showed lower performance (minimum accuracy of 0.54). +Ki67 percentage regression demonstrated a mean error of 3.6%. Thus, CaReNet-V2 +is a potential technique for breast cancer biopsy evaluation, standing out as +a screening analysis technique and helping to prioritize patients. + +
+
+ comment: 32 pages, 3 figures, 6 tables +
+
+
+
+
+ + ☆ A Canonical Data Transformation for Achieving Inter- and Within-group + Fairness + + +
+ Increases in the deployment of machine learning algorithms for applications +that deal with sensitive data have brought attention to the issue of fairness +in machine learning. Many works have been devoted to applications that require +different demographic groups to be treated fairly. However, algorithms that aim +to satisfy inter-group fairness (also called group fairness) may inadvertently +treat individuals within the same demographic group unfairly. To address this +issue, we introduce a formal definition of within-group fairness that maintains +fairness among individuals from within the same group. We propose a +pre-processing framework to meet both inter- and within-group fairness criteria +with little compromise in accuracy. The framework maps the feature vectors of +members from different groups to an inter-group-fair canonical domain before +feeding them into a scoring function. The mapping is constructed to preserve +the relative relationship between the scores obtained from the unprocessed +feature vectors of individuals from the same demographic group, guaranteeing +within-group fairness. We apply this framework to the COMPAS risk assessment +and Law School datasets and compare its performance in achieving inter-group +and within-group fairness to two regularization-based methods. + +
+
+
+
+
+ + ☆ One-dimensional convolutional neural network model for breast cancer + subtypes classification and biochemical content evaluation using micro-FTIR + hyperspectral images + + +
+ Breast cancer treatment remains a challenge, where molecular subtype +classification plays a crucial role in selecting appropriate and specific +therapy. The four subtypes are Luminal A (LA), Luminal B (LB), HER2 subtype, +and Triple-Negative Breast Cancer (TNBC). Immunohistochemistry is the +gold-standard evaluation, although interobserver variations are reported and +molecular signature identification is time-consuming. Fourier transform +infrared micro-spectroscopy with machine learning approaches has been used to +evaluate cancer samples, presenting biochemical-related explainability. +However, this explainability is harder to obtain when using deep learning. This study +created a 1D deep learning tool for breast cancer subtype evaluation and +biochemical contribution. Sixty hyperspectral images were acquired from a human +breast cancer microarray. K-Means clustering was applied to select tissue and +paraffin spectra. CaReNet-V1, a novel 1D convolutional neural network, was +developed to classify breast cancer (CA) and adjacent tissue (AT), and +molecular subtypes. A 1D adaptation of Grad-CAM was applied to assess the +biochemical impact on the classifications. CaReNet-V1 effectively classified CA +and AT (test accuracy of 0.89), as well as HER2 and TNBC subtypes (0.83 and +0.86), with greater difficulty for LA and LB (0.74 and 0.68). The model enabled +the evaluation of the most contributing wavenumbers to the predictions, +providing a direct relationship with the biochemical content. Therefore, +CaReNet-V1 combined with hyperspectral images is a potential approach for breast cancer +biopsy assessment, providing additional information to the pathology report. +The biochemical content impact feature may be used for other studies, such as +treatment efficacy evaluation and the development of new diagnostic and therapeutic +methods. + +
+
+ comment: 23 pages, 5 figures, 2 tables +
+
+
+
+
+ + ☆ On the Detection of Image-Scaling Attacks in Machine Learning ACSA + + +
+ Image scaling is an integral part of machine learning and computer vision +systems. Unfortunately, this preprocessing step is vulnerable to so-called +image-scaling attacks where an attacker makes unnoticeable changes to an image +so that it becomes a new image after scaling. This opens up new ways for +attackers to control the prediction or to improve poisoning and backdoor +attacks. While effective techniques exist to prevent scaling attacks, their +detection has not been rigorously studied yet. Consequently, it is currently +not possible to reliably spot these attacks in practice. + This paper presents the first in-depth systematization and analysis of +detection methods for image-scaling attacks. We identify two general detection +paradigms and derive novel methods from them that are simple in design yet +significantly outperform previous work. We demonstrate the efficacy of these +methods in a comprehensive evaluation with all major learning platforms and +scaling algorithms. First, we show that image-scaling attacks modifying the +entire scaled image can be reliably detected even under an adaptive adversary. +Second, we find that our methods provide strong detection performance even if +only minor parts of the image are manipulated. As a result, we can introduce a +novel protection layer against image-scaling attacks. + +
+
+ comment: Accepted at ACSAC'23 +
+
+
+
+
+ + ☆ Quantum Federated Learning With Quantum Networks + + +
+ A major concern of deep learning models is the large amount of data that is +required to build and train them, much of which is reliant on sensitive and +personally identifiable information that is vulnerable to access by third +parties. Ideas of using the quantum internet to address this issue have been +previously proposed, which would enable fast and completely secure online +communications. Previous work has yielded a hybrid quantum-classical transfer +learning scheme for classical data and communication with a hub-spoke topology. +While quantum communication is secure from eavesdropping attacks and requires no +measurements for quantum-to-classical translation, owing to the no-cloning theorem, +a hub-spoke topology is not ideal for quantum communication without quantum +memory. Here we seek to improve this model by implementing a decentralized ring +topology for the federated learning scheme, where each client is given a +portion of the entire dataset and only performs training on that set. We also +demonstrate the first successful use of quantum weights for quantum federated +learning, which allows us to perform our training entirely in quantum. + +
+
+
+
+
+ + ☆ Federated Learning of Large Language Models with Parameter-Efficient + Prompt Tuning and Adaptive Optimization + + +
+ Federated learning (FL) is a promising paradigm to enable collaborative model +training with decentralized data. However, the training process of Large +Language Models (LLMs) generally requires updating a large number of parameters, +which limits the applicability of FL techniques to LLMs in real +scenarios. Prompt tuning can significantly reduce the number of parameters to +update, but it incurs either performance degradation or low training +efficiency. The straightforward utilization of prompt tuning in FL often +raises non-trivial communication costs and dramatically degrades performance. +In addition, the decentralized data is generally non-Independent and +Identically Distributed (non-IID), which brings client drift problems and thus +poor performance. This paper proposes a Parameter-efficient prompt Tuning +approach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and +effective FL of LLMs. First, an efficient partial prompt tuning approach is +proposed to improve performance and efficiency simultaneously. Second, a novel +adaptive optimization method is developed to address the client drift problems +on both the device and server sides to enhance performance further. Extensive +experiments based on 10 datasets demonstrate the superb performance (up to +60.8\% in terms of accuracy) and efficiency (up to 97.59\% in terms of training +time) of FedPepTAO compared with 9 baseline approaches. Our code is available +at https://github.com/llm-eff/FedPepTAO. + +
+
+
+
+
+ + ☆ MGAS: Multi-Granularity Architecture Search for Effective and Efficient + Neural Networks + + +
+ Differentiable architecture search (DAS) has become the prominent approach in +the field of neural architecture search (NAS) due to its time-efficient +automation of neural network design. It shifts the traditional paradigm of +discrete architecture sampling and evaluation to differentiable super-net +optimization and discretization. However, existing DAS methods either only +conduct coarse-grained operation-level search, or restrictively explore +fine-grained filter-level and weight-level units using manually-defined +remaining ratios, which fail to simultaneously achieve small model size and +satisfactory model performance. Additionally, they address the high memory +consumption of the search process at the expense of search quality. To tackle +these issues, we introduce multi-granularity architecture search (MGAS), a +unified framework which aims to comprehensively and memory-efficiently explore +the multi-granularity search space to discover both effective and efficient +neural networks. Specifically, we learn discretization functions specific to +each granularity level to adaptively determine the remaining ratios according +to the evolving architecture. This ensures an optimal balance among units of +different granularity levels for different target model sizes. Considering the +memory demands, we break down the super-net optimization and discretization +into multiple sub-net stages. By allowing re-pruning and regrowing of units in +previous sub-nets during subsequent stages, we compensate for potential bias in +earlier stages. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet +demonstrate that MGAS outperforms other state-of-the-art methods in achieving a +better trade-off between model performance and model size. + +
+
+
+
+
+ + ☆ Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic + Gaussian Mixture Models + + +
+ A long-standing challenge for a robotic manipulation system operating in +real-world scenarios is adapting and generalizing its acquired motor skills to +unseen environments. We tackle this challenge by employing hybrid skill models +that integrate imitation and reinforcement paradigms, to explore how the +learning and adaptation of a skill, along with its core grounding in the scene +through a learned keypoint, can facilitate such generalization. To that end, we +develop the Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models (KIS-GMM) +approach, which learns to predict the reference of a dynamical system within the +scene as a 3D keypoint, leveraging visual observations obtained by the robot's +physical interactions during skill learning. Through comprehensive +evaluations in both simulated and real-world environments, we show that our +method enables a robot to achieve significant zero-shot generalization to novel +environments and to refine skills in the target environments faster than +learning from scratch. Importantly, this is achieved without the need for new +ground truth data. Moreover, our method effectively copes with scene +displacements. + +
+
+ comment: Accepted at the International Symposium on Experimental Robotics + (ISER) 2023. Videos at http://kis-gmm.cs.uni-freiburg.de/ +
+
+
+
+
+ + ☆ Coordinated Replay Sample Selection for Continual Federated Learning EMNLP + + +
+ Continual Federated Learning (CFL) combines Federated Learning (FL), the +decentralized learning of a central model on a number of client devices that +may not communicate their data, and Continual Learning (CL), the learning of a +model from a continual stream of data without keeping the entire history. In +CL, the main challenge is \textit{forgetting} what was learned from past data. +While replay-based algorithms that keep a small pool of past training data are +effective to reduce forgetting, only simple replay sample selection strategies +have been applied to CFL in prior work, and no previous work has explored +coordination among clients for better sample selection. To bridge this gap, we +adapt a replay sample selection objective based on loss gradient diversity to +CFL and propose a new relaxation-based selection of samples to optimize the +objective. Next, we propose a practical algorithm to coordinate gradient-based +replay sample selection across clients without communicating private data. We +benchmark our coordinated and uncoordinated replay sample selection algorithms +against random sampling-based baselines with language models trained on a large +scale de-identified real-world text dataset. We show that gradient-based sample +selection methods both boost performance and reduce forgetting compared to +random sampling methods, with our coordination method showing gains early in +the low replay size regime (when the budget for storing past data is small). + +
+
+ comment: 7 pages, 6 figures, accepted to EMNLP (industry track) +
+
+
+
+
+ + ☆ TeleQnA: A Benchmark Dataset to Assess Large Language Models + Telecommunications Knowledge + + +
+ We introduce TeleQnA, the first benchmark dataset designed to evaluate the +knowledge of Large Language Models (LLMs) in telecommunications. Comprising +10,000 questions and answers, this dataset draws from diverse sources, +including standards and research articles. This paper outlines the automated +question generation framework responsible for creating this dataset, along with +how human input was integrated at various stages to ensure the quality of the +questions. Afterwards, using the provided dataset, an evaluation is conducted +to assess the capabilities of LLMs, including GPT-3.5 and GPT-4. The results +highlight that these models struggle with complex standards-related questions +but exhibit proficiency in addressing general telecom-related inquiries. +Additionally, our results showcase how incorporating telecom knowledge context +significantly enhances their performance, thus shedding light on the need for a +specialized telecom foundation model. Finally, the dataset is shared with +active telecom professionals, whose performance is subsequently benchmarked +against that of the LLMs. The findings illustrate that LLMs can rival the +performance of active professionals in telecom knowledge, thanks to their +capacity to process vast amounts of information, underscoring the potential of +LLMs within this domain. The dataset has been made publicly accessible on +GitHub. + +
+
+
+
+
+ + ☆ Meta- (out-of-context) learning in neural networks + + +
+ Brown et al. (2020) famously introduced the phenomenon of in-context learning +in large language models (LLMs). We establish the existence of a phenomenon we +call $\textbf{meta-out-of-context learning (meta-OCL)}$ via carefully designed +synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs +to more readily "internalize" the semantic content of text that is, or appears +to be, broadly useful (such as true statements, or text from authoritative +sources) and use it in appropriate circumstances. We further demonstrate +meta-OCL in a synthetic computer vision setting, and propose two hypotheses for +the emergence of meta-OCL: one relying on the way models store knowledge in +their parameters, and another suggesting that the implicit gradient alignment +bias of gradient-descent-based optimizers may be responsible. Finally, we +reflect on what our results might imply about capabilities of future AI +systems, and discuss potential risks. Our code can be found at +https://github.com/krasheninnikov/internalization . + +
+
+
+
+
+ + ☆ Deep Autoencoder-based Z-Interference Channels with Perfect and + Imperfect CSI + + +
+ A deep autoencoder (DAE)-based structure for end-to-end communication over the +two-user Z-interference channel (ZIC) with finite-alphabet inputs is designed +in this paper. The proposed structure jointly optimizes the two encoder/decoder +pairs and generates interference-aware constellations that dynamically adapt +their shape based on interference intensity to minimize the bit error rate +(BER). An in-phase/quadrature-phase (I/Q) power allocation layer is introduced +in the DAE to guarantee an average power constraint and enable the architecture +to generate constellations with nonuniform shapes. This brings further gain +compared to standard uniform constellations such as quadrature amplitude +modulation. The proposed structure is then extended to work with imperfect +channel state information (CSI). The CSI imperfection due to both +estimation and quantization errors is examined. The performance of the DAE-ZIC +is compared with two baseline methods, i.e., standard and rotated +constellations. The proposed structure significantly enhances the performance +of the ZIC for both perfect and imperfect CSI. Simulation results show that +the improvement is achieved in all interference regimes (weak, moderate, and +strong) and consistently increases with the signal-to-noise ratio (SNR). For +example, more than an order of magnitude BER reduction is obtained with respect +to the most competitive conventional method at weak interference when SNR>15dB +and two bits per symbol are transmitted. The improvements reach about two +orders of magnitude when quantization error exists, indicating that the DAE-ZIC +is more robust to the interference compared to the conventional methods. + +
+
+ comment: 13 pages, 13 figures, 2 tables. Accepted for publication in the IEEE + Transactions on Communications. arXiv admin note: text overlap with + arXiv:2303.08312 +
+
+
+
+
+ + ☆ Fast 2D Bicephalous Convolutional Autoencoder for Compressing 3D Time + Projection Chamber Data + + +
+ High-energy large-scale particle colliders produce data at high speed, on the +order of 1 terabyte per second in nuclear physics and petabytes per second in +high-energy physics. Developing real-time data compression algorithms to reduce +such data at high throughput to fit permanent storage has drawn increasing +attention. Specifically, at the newly constructed sPHENIX experiment at the +Relativistic Heavy Ion Collider (RHIC), a time projection chamber is used as +the main tracking detector, which records particle trajectories in a volume of +a three-dimensional (3D) cylinder. The resulting data are usually very sparse +with occupancy around 10.8%. Such sparsity presents a challenge to conventional +learning-free lossy compression algorithms, such as SZ, ZFP, and MGARD. The 3D +convolutional neural network (CNN)-based approach, Bicephalous Convolutional +Autoencoder (BCAE), outperforms traditional methods both in compression rate +and reconstruction accuracy. BCAE can also utilize the computation power of +graphical processing units suitable for deployment in a modern heterogeneous +high-performance computing environment. This work introduces two BCAE variants: +BCAE++ and BCAE-2D. BCAE++ achieves a 15% better compression ratio and a 77% +better reconstruction accuracy measured in mean absolute error compared with +BCAE. BCAE-2D treats the radial direction as the channel dimension of an image, +resulting in a 3x speedup in compression throughput. In addition, we +demonstrate that an unbalanced autoencoder with a larger decoder can improve +reconstruction accuracy without significantly sacrificing throughput. Lastly, +we observe that both BCAE++ and BCAE-2D can benefit more from using +half-precision mode in throughput (76-79% increase) without loss in +reconstruction accuracy. The source code and links to data and pretrained +models can be found at https://github.com/BNL-DAQ-LDRD/NeuralCompression_v2. + +
+
+
+
+
+ + ☆ Invariance is Key to Generalization: Examining the Role of + Representation in Sim-to-Real Transfer for Visual Navigation + + +
+ The data-driven approach to robot control has been gathering pace rapidly, +yet generalization to unseen task domains remains a critical challenge. We +argue that the key to generalization is representations that are (i) rich +enough to capture all task-relevant information and (ii) invariant to +superfluous variability between the training and the test domains. We +experimentally study such a representation -- containing both depth and +semantic information -- for visual navigation and show that it enables a +control policy trained entirely in simulated indoor scenes to generalize to +diverse real-world environments, both indoors and outdoors. Further, we show +that our representation reduces the A-distance between the training and test +domains, improving the generalization error bound as a result. Our proposed +approach is scalable: the learned policy improves continuously, as the +foundation models that it exploits absorb more diverse data during +pre-training. + +
+
+ comment: 11 pages, accepted by the 18th International Symposium on + Experimental Robotics (ISER 2023) +
+
+
+
+
+ + ☆ Meta learning with language models: Challenges and opportunities in the + classification of imbalanced text + + +
+ Detecting out of policy speech (OOPS) content is important but difficult. +While machine learning is a powerful tool to tackle this challenging task, it +is hard to break the performance ceiling due to factors like quantity and +quality limitations on training data and inconsistencies in OOPS definition and +data labeling. To realize the full potential of available limited resources, we +propose a meta learning technique (MLT) that combines individual models built +with different text representations. We analytically show that the resulting +technique is numerically stable and produces reasonable combining weights. We +combine the MLT with a threshold-moving (TM) technique to further improve the +performance of the combined predictor on highly-imbalanced in-distribution and +out-of-distribution datasets. We also provide computational results to show the +statistically significant advantages of the proposed MLT approach. + All authors contributed equally to this work. + +
+
+ comment: 22 pages, including 5 figures, 12 tables, 1 appendix +
+
+
+
+
+ + ☆ The primacy bias in Model-based RL + + +
+ The primacy bias in deep reinforcement learning (DRL), which refers to the +agent's tendency to overfit early data and lose the ability to learn from new +data, can significantly decrease the performance of DRL algorithms. Previous +studies have shown that employing simple techniques, such as resetting the +agent's parameters, can substantially alleviate the primacy bias. However, we +observe that resetting the agent's parameters harms its performance in the +context of model-based reinforcement learning (MBRL). In fact, on further +investigation, we find that the primacy bias in MBRL differs from that in +model-free RL. In this work, we focus on investigating the primacy bias in MBRL +and propose world model resetting, which works in MBRL. We apply our method to +two different MBRL algorithms, MBPO and DreamerV2. We validate the +effectiveness of our method on multiple continuous control tasks on MuJoCo and +DeepMind Control Suite, as well as discrete control tasks on the Atari 100k +benchmark. The results show that world model resetting can significantly +alleviate the primacy bias in the model-based setting and improve the algorithm's +performance. We also give a guide on how to perform world model resetting +effectively. + +
+
+
+
+
+ + ☆ Leveraging Deep Learning for Abstractive Code Summarization of + Unofficial Documentation + + +
+ Usually, programming languages have official documentation to guide +developers with APIs, methods, and classes. However, researchers identified +insufficient or inadequate documentation examples and flaws with the API's +complex structure as barriers to learning an API. As a result, developers may +consult other sources (StackOverflow, GitHub, etc.) to learn more about an API. +Recent research studies have shown that unofficial documentation is a valuable +source of information for generating code summaries. We, therefore, have been +motivated to leverage such a type of documentation along with deep learning +techniques towards generating high-quality summaries for APIs discussed in +informal documentation. + This paper proposes an automatic approach using the BART algorithm, a +state-of-the-art transformer model, to generate summaries for APIs discussed in +StackOverflow. We built an oracle of human-generated summaries to evaluate our +approach against it using ROUGE and BLEU metrics which are the most widely used +evaluation metrics in text summarization. Furthermore, we evaluated our +summaries empirically against a previous work in terms of quality. Our findings +demonstrate that using deep learning algorithms can improve summaries' quality +and outperform the previous work by an average of 57% for Precision, 66% for +Recall, and 61% for F-measure, and it runs 4.4 times faster. + +
+
+
+
+
+ + ☆ Did the Neurons Read your Book? Document-level Membership Inference for + Large Language Models + + +
+ With large language models (LLMs) poised to become embedded in our daily +lives, questions are starting to be raised about the dataset(s) they learned +from. These questions range from potential bias or misinformation LLMs could +retain from their training data to questions of copyright and fair use of +human-generated text. However, while these questions emerge, developers of the +recent state-of-the-art LLMs become increasingly reluctant to disclose details +on their training corpus. We here introduce the task of document-level +membership inference for real-world LLMs, i.e. inferring whether the LLM has +seen a given document during training or not. First, we propose a procedure for +the development and evaluation of document-level membership inference for LLMs +by leveraging commonly used data sources for training and the model release +date. We then propose a practical, black-box method to predict document-level +membership and instantiate it on OpenLLaMA-7B with both books and academic +papers. We show our methodology to perform very well, reaching an impressive +AUC of 0.856 for books and 0.678 for papers. We then show our approach to +outperform the sentence-level membership inference attacks used in the privacy +literature for the document-level membership task. We finally evaluate whether +smaller models might be less sensitive to document-level inference and show +OpenLLaMA-3B to be approximately as sensitive as OpenLLaMA-7B to our approach. +Taken together, our results show that accurate document-level membership can be +inferred for LLMs, increasing the transparency of technology poised to change +our lives. + +
+
+
+
+
+ + ☆ Neural Snowflakes: Universal Latent Graph Inference via Trainable Latent + Geometries + + +
+ The inductive bias of a graph neural network (GNN) is largely encoded in its +specified graph. Latent graph inference relies on latent geometric +representations to dynamically rewire or infer a GNN's graph to maximize the +GNN's predictive downstream performance, but it lacks solid theoretical +foundations in terms of embedding-based representation guarantees. This paper +addresses this issue by introducing a trainable deep learning architecture, +coined neural snowflake, that can adaptively implement fractal-like metrics on +$\mathbb{R}^d$. We prove that any given finite weights graph can be +isometrically embedded by a standard MLP encoder. Furthermore, when the latent +graph can be represented in the feature space of a sufficiently regular kernel, +we show that the combined neural snowflake and MLP encoder do not succumb to +the curse of dimensionality by using only a low-degree polynomial number of +parameters in the number of nodes. This implementation enables a +low-dimensional isometric embedding of the latent graph. We conduct synthetic +experiments to demonstrate the superior metric learning capabilities of neural +snowflakes when compared to more familiar spaces like Euclidean space. +Additionally, we carry out latent graph inference experiments on graph +benchmarks. Consistently, the neural snowflake model achieves predictive +performance that either matches or surpasses that of the state-of-the-art +latent graph inference models. Importantly, this performance improvement is +achieved without requiring random search for optimal latent geometry. Instead, +the neural snowflake model achieves this enhancement in a differentiable +manner. + +
+
+ comment: 9 Pages + Appendix, 2 Figures, 9 Tables +
+
+
+
+
+ + ☆ Simple Hardware-Efficient PCFGs with Independent Left and Right + Productions EMNLP + + +
+ Scaling dense PCFGs to thousands of nonterminals via a low-rank +parameterization of the rule probability tensor has been shown to be beneficial +for unsupervised parsing. However, PCFGs scaled this way still perform poorly +as a language model, and even underperform similarly-sized HMMs. This work +introduces \emph{SimplePCFG}, a simple PCFG formalism with independent left and +right productions. Despite imposing a stronger independence assumption than the +low-rank approach, we find that this formalism scales more effectively both as +a language model and as an unsupervised parser. As an unsupervised parser, our +simple PCFG obtains an average F1 of 65.1 on the English PTB, and as a language +model, it obtains a perplexity of 119.0, outperforming similarly-sized low-rank +PCFGs. We further introduce \emph{FlashInside}, a hardware IO-aware +implementation of the inside algorithm for efficiently scaling simple PCFGs. + +
+
+ comment: Accepted to Findings of EMNLP, 2023 +
+
+
+
+
+ + ☆ Understanding the Inner Workings of Language Models Through + Representation Dissimilarity EMNLP 2023 + + +
+ As language models are applied to an increasing number of real-world +applications, understanding their inner workings has become an important issue +in model trust, interpretability, and transparency. In this work we show that +representation dissimilarity measures, which are functions that measure the +extent to which two models' internal representations differ, can be a valuable +tool for gaining insight into the mechanics of language models. Among our +insights are: (i) an apparent asymmetry in the internal representations of +models using SoLU and GeLU activation functions, (ii) evidence that +dissimilarity measures can identify and locate generalization properties of +models that are invisible via in-distribution test set performance, and (iii) +new evaluations of how language model features vary as width and depth are +increased. Our results suggest that dissimilarity measures are a promising set +of tools for shedding light on the inner workings of language models. + +
+
+ comment: EMNLP 2023 (main) +
+
+
+
+
+ + ☆ Bayesian Regression Markets + + +
+ Machine learning tasks are vulnerable to the quality of data used as input. +Yet, it is often challenging for firms to obtain adequate datasets, with them +being naturally distributed amongst owners, that in practice, may be +competitors in a downstream market and reluctant to share information. Focusing +on supervised learning for regression tasks, we develop a \textit{regression +market} to provide a monetary incentive for data sharing. Our proposed +mechanism adopts a Bayesian framework, allowing us to consider a more general +class of regression tasks. We present a thorough exploration of the market +properties, and show that similar proposals in current literature expose the +market agents to sizeable financial risks, which can be mitigated in our +probabilistic setting. + +
+
+ comment: 46 pages, 11 figures, 2 tables +
+
+
+
+
+ + ☆ Delayed Memory Unit: Modelling Temporal Dependency Through Delay Gate + + +
+ Recurrent Neural Networks (RNNs) are renowned for their adeptness in modeling +temporal dependencies, a trait that has driven their widespread adoption for +sequential data processing. Nevertheless, vanilla RNNs are confronted with the +well-known issue of gradient vanishing and exploding, posing a significant +challenge for learning and establishing long-range dependencies. Additionally, +gated RNNs tend to be over-parameterized, resulting in poor network +generalization. To address these challenges, we propose a novel Delayed Memory +Unit (DMU) in this paper, wherein a delay line structure, coupled with delay +gates, is introduced to facilitate temporal interaction and temporal credit +assignment, so as to enhance the temporal modeling capabilities of vanilla +RNNs. Particularly, the DMU is designed to directly distribute the input +information to the optimal time instant in the future, rather than aggregating +and redistributing it over time through intricate network dynamics. Our +proposed DMU demonstrates superior temporal modeling capabilities across a +broad range of sequential modeling tasks, utilizing considerably fewer +parameters than other state-of-the-art gated RNN models in applications such as +speech recognition, radar gesture recognition, ECG waveform segmentation, and +permuted sequential image classification. + +
+
+
+
+
+ + ☆ ACTOR: Active Learning with Annotator-specific Classification Heads to + Embrace Human Label Variation EMNLP 2023 + + +
+ Label aggregation such as majority voting is commonly used to resolve +annotator disagreement in dataset creation. However, this may disregard +minority values and opinions. Recent studies indicate that learning from +individual annotations outperforms learning from aggregated labels, though they +require a considerable amount of annotation. Active learning, as an annotation +cost-saving strategy, has not been fully explored in the context of learning +from disagreement. We show that in the active learning setting, a multi-head +model performs significantly better than a single-head model in terms of +uncertainty estimation. By designing and evaluating acquisition functions with +annotator-specific heads on two datasets, we show that group-level entropy +works generally well on both datasets. Importantly, it achieves performance in +terms of both prediction and uncertainty estimation comparable to full-scale +training from disagreement, while saving up to 70% of the annotation budget. + +
+
+ comment: EMNLP 2023 Main +
+
+
+
+
+ + ☆ Reinforcement learning in large, structured action spaces: A simulation + study of decision support for spinal cord injury rehabilitation + + +
+ Reinforcement learning (RL) has helped improve decision-making in several +applications. However, applying traditional RL is challenging in some +applications, such as rehabilitation of people with a spinal cord injury (SCI). +Among other factors, using RL in this domain is difficult because there are +many possible treatments (i.e., large action space) and few patients (i.e., +limited training data). Treatments for SCIs have natural groupings, so we +propose two approaches to grouping treatments so that an RL agent can learn +effectively from limited data. One relies on domain knowledge of SCI +rehabilitation and the other learns similarities among treatments using an +embedding technique. We then use Fitted Q Iteration to train an agent that +learns optimal treatments. Through a simulation study designed to reflect the +properties of SCI rehabilitation, we find that both methods can help improve +the treatment decisions of physiotherapists, but the approach based on domain +knowledge offers better performance. Our findings provide a "proof of concept" +that RL can be used to help improve the treatment of those with an SCI and +indicate that continued efforts to gather data and apply RL to this domain are +worthwhile. + +
+
+ comment: 31 pages, 7 figures +
+
+
+
+
+ + ☆ The Fundamental Dilemma of Bayesian Active Meta-learning + + +
+ Many applications involve estimation of parameters that generalize across +multiple diverse, but related, data-scarce task environments. Bayesian active +meta-learning, a form of sequential optimal experimental design, provides a +framework for solving such problems. The active meta-learner's goal is to gain +transferable knowledge (estimate the transferable parameters) in the presence +of idiosyncratic characteristics of the current task (task-specific +parameters). We show that in such a setting, greedy pursuit of this goal can +actually hurt estimation of the transferable parameters (induce so-called +negative transfer). The learner faces a dilemma akin to but distinct from the +exploration--exploitation dilemma: should they spend their acquisition budget +pursuing transferable knowledge, or identifying the current task-specific +parameters? We show theoretically that some tasks pose an inevitable and +arbitrarily large threat of negative transfer, and that task identification is +critical to reducing this threat. Our results generalize to analysis of prior +misspecification over nuisance parameters. Finally, we empirically illustrate +circumstances that lead to negative transfer. + +
+
+
+
+
+ + ☆ Adam through a Second-Order Lens ICLR 2024 + + +
+ Research into optimisation for deep learning is characterised by a tension +between the computational efficiency of first-order, gradient-based methods +(such as SGD and Adam) and the theoretical efficiency of second-order, +curvature-based methods (such as quasi-Newton methods and K-FAC). We seek to +combine the benefits of both approaches into a single computationally-efficient +algorithm. Noting that second-order methods often depend on stabilising +heuristics (such as Levenberg-Marquardt damping), we propose AdamQLR: an +optimiser combining damping and learning rate selection techniques from K-FAC +(Martens and Grosse, 2015) with the update directions proposed by Adam, +inspired by considering Adam through a second-order lens. We evaluate AdamQLR +on a range of regression and classification tasks at various scales, achieving +competitive generalisation performance vs runtime. + +
+
+ comment: 28 pages, 15 figures, 4 tables. Submitted to ICLR 2024 +
+
+
+
+
+ + ☆ StenUNet: Automatic Stenosis Detection from X-ray Coronary Angiography + + +
+ Coronary angiography continues to serve as the primary method for diagnosing +coronary artery disease (CAD), which is the leading global cause of mortality. +The severity of CAD is quantified by the location, degree of narrowing +(stenosis), and number of arteries involved. In current practice, this +quantification is performed manually using visual inspection and thus suffers +from poor inter- and intra-rater reliability. The MICCAI grand challenge: +Automatic Region-based Coronary Artery Disease diagnostics using the X-ray +angiography imagEs (ARCADE) curated a dataset with stenosis annotations, with +the goal of creating an automated stenosis detection algorithm. Using a +combination of machine learning and other computer vision techniques, we +propose the architecture and algorithm StenUNet to accurately detect stenosis +from X-ray Coronary Angiography. Our submission to the ARCADE challenge placed +3rd among all teams. We achieved an F1 score of 0.5348 on the test set, 0.0005 +lower than the 2nd place. + +
+
+ comment: 12 pages, 5 figures, 1 table +
+
+
+
+
+ + ☆ XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series + Classification ICMLA 2023 + + +
+ Despite the growing body of work on explainable machine learning in time +series classification (TSC), it remains unclear how to evaluate different +explainability methods. Resorting to qualitative assessment and user studies to +evaluate explainers for TSC is difficult since humans have difficulties +understanding the underlying information contained in time series data. +Therefore, a systematic review and quantitative comparison of explanation +methods to confirm their correctness becomes crucial. While steps to +standardized evaluations were taken for tabular, image, and textual data, +benchmarking explainability methods on time series is challenging due to a) +traditional metrics not being directly applicable, b) implementation and +adaption of traditional metrics for time series in the literature vary, and c) +varying baseline implementations. This paper proposes XTSC-Bench, a +benchmarking tool providing standardized datasets, models, and metrics for +evaluating explanation methods on TSC. We analyze 3 perturbation-, 6 gradient- +and 2 example-based explanation methods to TSC showing that improvements in the +explainers' robustness and reliability are necessary, especially for +multivariate data. + +
+
+ comment: Accepted at ICMLA 2023 +
+
+
+
+
+ + Causal machine learning for single-cell genomics + + +
+ Advances in single-cell omics allow for unprecedented insights into the +transcription profiles of individual cells. When combined with large-scale +perturbation screens, through which specific biological mechanisms can be +targeted, these technologies allow for measuring the effect of targeted +perturbations on the whole transcriptome. These advances provide an opportunity +to better understand the causative role of genes in complex biological +processes such as gene regulation, disease progression or cellular development. +However, the high-dimensional nature of the data, coupled with the intricate +complexity of biological systems renders this task nontrivial. Within the +machine learning community, there has been a recent increase of interest in +causality, with a focus on adapting established causal techniques and +algorithms to handle high-dimensional data. In this perspective, we delineate +the application of these methodologies within the realm of single-cell genomics +and their challenges. We first present the model that underlies most of current +causal approaches to single-cell biology and discuss and challenge the +assumptions it entails from the biological point of view. We then identify open +problems in the application of causal approaches to single-cell data: +generalising to unseen environments, learning interpretable models, and +learning causal models of dynamics. For each problem, we discuss how various +research directions - including the development of computational approaches and +the adaptation of experimental protocols - may offer ways forward, or on the +contrary pose some difficulties. With the advent of single cell atlases and +increasing perturbation data, we expect causal models to become a crucial tool +for informed experimental design. + +
+
+ comment: 35 pages, 7 figures, 3 tables, 1 box +
+
+
+
+
+ + ☆ Robust Depth Linear Error Decomposition with Double Total Variation and + Nuclear Norm for Dynamic MRI Reconstruction + + +
+ Compressed Sensing (CS) significantly speeds up Magnetic Resonance Image +(MRI) processing and achieves accurate MRI reconstruction from under-sampled +k-space data. According to the current research, there are still several +problems with dynamic MRI k-space reconstruction based on CS. 1) There are +differences between the Fourier domain and the Image domain, and the +differences between MRI processing of different domains need to be considered. +2) As three-dimensional data, dynamic MRI has its spatial-temporal +characteristics, which need to calculate the difference and consistency of +surface textures while preserving structural integrity and uniqueness. 3) +Dynamic MRI reconstruction is time-consuming and computationally +resource-dependent. In this paper, we propose a novel robust low-rank dynamic +MRI reconstruction optimization model via highly under-sampled and Discrete +Fourier Transform (DFT) called the Robust Depth Linear Error Decomposition +Model (RDLEDM). Our method mainly includes linear decomposition, double Total +Variation (TV), and double Nuclear Norm (NN) regularizations. By adding linear +image domain error analysis, the noise is reduced after under-sampling and DFT +processing, and the anti-interference ability of the algorithm is enhanced. +Double TV and NN regularizations can utilize both spatial-temporal +characteristics and explore the complementary relationship between different +dimensions in dynamic MRI sequences. In addition, due to the non-smoothness and +non-convexity of TV and NN terms, it is difficult to optimize the unified +objective model. To address this issue, we utilize a fast algorithm by solving +a primal-dual form of the original problem. Compared with five state-of-the-art +methods, extensive experiments on dynamic MRI data demonstrate the superior +performance of the proposed method in terms of both reconstruction accuracy and +time complexity. + +
+
+
+
+
+ + ☆ Linking Surface Facts to Large-Scale Knowledge Graphs + + +
+ Open Information Extraction (OIE) methods extract facts from natural language +text in the form of ("subject"; "relation"; "object") triples. These facts are, +however, merely surface forms, the ambiguity of which impedes their downstream +usage; e.g., the surface phrase "Michael Jordan" may refer to either the former +basketball player or the university professor. Knowledge Graphs (KGs), on the +other hand, contain facts in a canonical (i.e., unambiguous) form, but their +coverage is limited by a static schema (i.e., a fixed set of entities and +predicates). To bridge this gap, we need the best of both worlds: (i) high +coverage of free-text OIEs, and (ii) semantic precision (i.e., monosemy) of +KGs. In order to achieve this goal, we propose a new benchmark with novel +evaluation protocols that can, for example, measure fact linking performance on +a granular triple slot level, while also measuring if a system has the ability +to recognize that a surface form has no match in the existing KG. Our extensive +evaluation of several baselines shows that detection of out-of-KG entities and +predicates is more difficult than accurate linking to existing ones, thus +calling for more research efforts on this difficult task. We publicly release +all resources (data, benchmark and code) on +https://github.com/nec-research/fact-linking. + +
+
+
+
+
+ + ☆ Series of Hessian-Vector Products for Tractable Saddle-Free Newton + Optimisation of Neural Networks + + +
+ Despite their popularity in the field of continuous optimisation, +second-order quasi-Newton methods are challenging to apply in machine learning, +as the Hessian matrix is intractably large. This computational burden is +exacerbated by the need to address non-convexity, for instance by modifying the +Hessian's eigenvalues as in Saddle-Free Newton methods. We propose an +optimisation algorithm which addresses both of these concerns - to our +knowledge, the first efficiently-scalable optimisation algorithm to +asymptotically use the exact (eigenvalue-modified) inverse Hessian. Our method +frames the problem as a series which principally square-roots and inverts the +squared Hessian, then uses it to precondition a gradient vector, all without +explicitly computing or eigendecomposing the Hessian. A truncation of this +infinite series provides a new optimisation algorithm which is scalable and +comparable to other first- and second-order optimisation methods in both +runtime and optimisation performance. We demonstrate this in a variety of +settings, including a ResNet-18 trained on CIFAR-10. + +
+
+ comment: 36 pages, 10 figures, 5 tables. Submitted to TMLR. First two authors' + order randomised +
+
+
+
+
+ + ☆ Local Universal Rule-based Explanations + + +
+ Explainable artificial intelligence (XAI) is one of the most intensively +developed areas of AI in recent years. It is also one of the most fragmented ones +with multiple methods that focus on different aspects of explanations. This +makes it difficult to obtain the full spectrum of explanation at once in a compact +and consistent way. To address this issue, we present Local Universal Explainer +(LUX) that is a rule-based explainer which can generate factual, counterfactual +and visual explanations. It is based on a modified version of decision tree +algorithms that allows for oblique splits and integration with feature +importance XAI methods such as SHAP or LIME. It does not use data generation in +contrast to other algorithms, but is focused on selecting local concepts in the +form of high-density clusters of real data that have the highest impact on +forming the decision boundary of the explained model. We tested our method on +real and synthetic datasets and compared it with state-of-the-art rule-based +explainers such as LORE, EXPLAN and Anchor. Our method outperforms currently +existing approaches in terms of simplicity, global fidelity and +representativeness. + +
+
+
+
+
+ + ☆ Beyond Bayesian Model Averaging over Paths in Probabilistic Programs + with Stochastic Support + + +
+ The posterior in probabilistic programs with stochastic support decomposes as +a weighted sum of the local posterior distributions associated with each +possible program path. We show that making predictions with this full posterior +implicitly performs a Bayesian model averaging (BMA) over paths. This is +potentially problematic, as model misspecification can cause the BMA weights to +prematurely collapse onto a single path, leading to sub-optimal predictions in +turn. To remedy this issue, we propose alternative mechanisms for path +weighting: one based on stacking and one based on ideas from PAC-Bayes. We show +how both can be implemented as a cheap post-processing step on top of existing +inference engines. In our experiments, we find them to be more robust and lead +to better predictions compared to the default BMA weights. + +
+
+
+
+
+ + ☆ A Study on Knowledge Graph Embeddings and Graph Neural Networks for Web + Of Things + + +
+ Graph data structures are widely used to store relational information between +several entities. With data being generated worldwide on a large scale, we see +a significant growth in the generation of knowledge graphs. Thing in the future +is Orange's take on a knowledge graph in the domain of the Web Of Things (WoT), +where the main objective of the platform is to provide a digital representation +of the physical world and enable cross-domain applications to be built upon +this massive and highly connected graph of things. In this context, as the +knowledge graph grows in size, it is prone to have noisy and messy data. In +this paper, we explore state-of-the-art knowledge graph embedding (KGE) methods +to learn numerical representations of the graph entities and, subsequently, +explore downstream tasks like link prediction, node classification, and triple +classification. We also investigate Graph neural networks (GNN) alongside KGEs +and compare their performance on the same downstream tasks. Our evaluation +highlights the encouraging performance of both KGE and GNN-based methods on +node classification, and the superiority of GNN approaches in the link +prediction task. Overall, we show that state-of-the-art approaches are relevant +in a WoT context, and this preliminary work provides insights to implement and +evaluate them in this context. + +
+
+
+
+
+ + ☆ Diverse Priors for Deep Reinforcement Learning + + +
+ In Reinforcement Learning (RL), agents aim at maximizing cumulative rewards +in a given environment. During the learning process, RL agents face the dilemma +of exploitation and exploration: leveraging existing knowledge to acquire +rewards or seeking potentially higher ones. Using uncertainty as a guiding +principle provides an active and effective approach to solving this dilemma and +ensemble-based methods are one of the prominent avenues for quantifying +uncertainty. Nevertheless, conventional ensemble-based uncertainty estimation +lacks an explicit prior, deviating from Bayesian principles. Besides, this +method requires diversity among members to generate less biased uncertainty +estimation results. To address the above problems, previous research has +incorporated random functions as priors. Building upon these foundational +efforts, our work introduces an innovative approach with delicately designed +prior NNs, which can incorporate maximal diversity in the initial value +functions of RL. Our method has demonstrated superior performance compared with +the random prior approaches in solving classic control problems and general +exploration tasks, significantly improving sample efficiency. + +
+
+ comment: 8 pages, 4 figures +
+
+
+
+
+ + ☆ Dynamically Weighted Federated k-Means + + +
+ Federated clustering is an important part of the field of federated machine +learning, that allows multiple data sources to collaboratively cluster their +data while keeping it decentralized and preserving privacy. In this paper, we +introduce a novel federated clustering algorithm, named Dynamically Weighted +Federated k-means (DWF k-means), to address the challenges posed by distributed +data sources and heterogeneous data. Our proposed algorithm combines the +benefits of traditional clustering techniques with the privacy and scalability +advantages of federated learning. It enables multiple data owners to +collaboratively cluster their local data while exchanging minimal information +with a central coordinator. The algorithm optimizes the clustering process by +adaptively aggregating cluster assignments and centroids from each data source, +thereby learning a global clustering solution that reflects the collective +knowledge of the entire federated network. We conduct experiments on multiple +datasets and data distribution settings to evaluate the performance of our +algorithm in terms of clustering score, accuracy, and v-measure. The results +demonstrate that our approach can match the performance of the centralized +classical k-means baseline, and outperform existing federated clustering +methods in realistic scenarios. + +
+
+
+
+
+ + ☆ Zero-knowledge Proof Meets Machine Learning in Verifiability: A Survey + + +
+ With the rapid advancement of artificial intelligence technology, the usage +of machine learning models is gradually becoming part of our daily lives. +High-quality models rely not only on efficient optimization algorithms but also +on the training and learning processes built upon vast amounts of data and +computational power. However, in practice, due to various challenges such as +limited computational resources and data privacy concerns, users in need of +models often cannot train machine learning models locally. This has led them to +explore alternative approaches such as outsourced learning and federated +learning. While these methods address the feasibility of model training +effectively, they introduce concerns about the trustworthiness of the training +process since computations are not performed locally. Similarly, there are +trustworthiness issues associated with outsourced model inference. These two +problems can be summarized as the trustworthiness problem of model +computations: How can one verify that the results computed by other +participants are derived according to the specified algorithm, model, and input +data? To address this challenge, verifiable machine learning (VML) has emerged. +This paper presents a comprehensive survey of zero-knowledge proof-based +verifiable machine learning (ZKP-VML) technology. We first analyze the +potential verifiability issues that may exist in different machine learning +scenarios. Subsequently, we provide a formal definition of ZKP-VML. We then +conduct a detailed analysis and classification of existing works based on their +technical approaches. Finally, we discuss the key challenges and future +directions in the field of ZKP-based VML. + +
+
+
+
+
+ + ☆ ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt + + +
+ Recent research has demonstrated the efficacy of pre-training graph neural +networks (GNNs) to capture the transferable graph semantics and enhance the +performance of various downstream tasks. However, the semantic knowledge +learned from pretext tasks might be unrelated to the downstream task, leading +to a semantic gap that limits the application of graph pre-training. To reduce +this gap, traditional approaches propose hybrid pre-training to combine various +pretext tasks together in a multi-task learning fashion and learn multi-grained +knowledge, which, however, cannot distinguish tasks and results in some +transferable task-specific knowledge distortion by each other. Moreover, most +GNNs cannot distinguish nodes located in different parts of the graph, making +them fail to learn position-specific knowledge and lead to suboptimal +performance. In this work, inspired by the prompt-based tuning in natural +language processing, we propose a unified framework for graph hybrid +pre-training which injects the task identification and position identification +into GNNs through a prompt mechanism, namely multi-task graph dual prompt +(ULTRA-DP). Based on this framework, we propose a prompt-based transferability +test to find the most relevant pretext task in order to reduce the semantic +gap. To implement the hybrid pre-training tasks, beyond the classical edge +prediction task (node-node level), we further propose a novel pre-training +paradigm based on a group of $k$-nearest neighbors (node-group level). The +combination of them across different scales is able to comprehensively express +more structural semantics and derive richer multi-grained knowledge. Extensive +experiments show that our proposed ULTRA-DP can significantly enhance the +performance of hybrid pre-training methods and show the generalizability to +other pre-training tasks and backbone architectures. + +
+
+
+
+
+ + ☆ Calibration of Time-Series Forecasting Transformers: Detecting and + Adapting Context-Driven Distribution Shift + + +
+ Recent years have witnessed the success of introducing Transformers to time +series forecasting. From a data generation perspective, we illustrate that +existing Transformers are susceptible to distribution shifts driven by temporal +contexts, whether observed or unobserved. Such context-driven distribution +shift (CDS) introduces biases in predictions within specific contexts and poses +challenges for the conventional training paradigm. In this paper, we introduce a +universal calibration methodology for the detection and adaptation of CDS with +a trained Transformer model. To this end, we propose a novel CDS detector, +termed the "residual-based CDS detector" or "Reconditionor", which quantifies +the model's vulnerability to CDS by evaluating the mutual information between +prediction residuals and their corresponding contexts. A high Reconditionor +score indicates a severe susceptibility, thereby necessitating model +adaptation. In this circumstance, we put forth a straightforward yet potent +adapter framework for model calibration, termed the "sample-level +contextualized adapter" or "SOLID". This framework involves the curation of a +contextually similar dataset to the provided test sample and the subsequent +fine-tuning of the model's prediction layer with a limited number of steps. Our +theoretical analysis demonstrates that this adaptation strategy is able to +achieve an optimal equilibrium between bias and variance. Notably, our proposed +Reconditionor and SOLID are model-agnostic and readily adaptable to a wide +range of Transformers. Extensive experiments show that SOLID consistently +enhances the performance of current SOTA Transformers on real-world datasets, +especially on cases with substantial CDS detected by the proposed +Reconditionor, thus validating the effectiveness of the calibration approach. + +
+
+
+
+
+ + ☆ Harnessing Attention Mechanisms: Efficient Sequence Reduction using + Attention-based Autoencoders + + +
+ Many machine learning models use the manipulation of dimensions as a driving +force to enable models to identify and learn important features in data. In the +case of sequential data this manipulation usually happens on the token +dimension level. Despite the fact that many tasks require a change in sequence +length itself, the step of sequence length reduction usually happens out of +necessity and in a single step. As far as we are aware, no model uses the +sequence length reduction step as an additional opportunity to tune the model's +performance. In fact, sequence length manipulation as a whole seems to be an +overlooked direction. In this study we introduce a novel attention-based method +that allows for the direct manipulation of sequence lengths. To explore the +method's capabilities, we employ it in an autoencoder model. The autoencoder +reduces the input sequence to a smaller sequence in latent space. It then aims +to reproduce the original sequence from this reduced form. In this setting, we +explore the method's reduction performance for different input and latent +sequence lengths. We are able to show that the autoencoder retains all the +significant information when reducing the original sequence to half its +original size. When reducing down to as low as a quarter of its original size, +the autoencoder is still able to reproduce the original sequence with an +accuracy of around 90%. + +
+
+ comment: 8 pages, 5 images, 1 table +
+
+
+
+
+ + ☆ Sharp error bounds for imbalanced classification: how many examples in + the minority class? + + +
+ When dealing with imbalanced classification data, reweighting the loss +function is a standard procedure for balancing the true +positive and true negative rates within the risk measure. Despite significant +theoretical work in this area, existing results do not adequately address a +main challenge within the imbalanced classification framework, which is the +negligible size of one class in relation to the full sample size and the need +to rescale the risk function by a probability tending to zero. To address this +gap, we present two novel contributions in the setting where the rare class +probability approaches zero: (1) a non-asymptotic fast rate probability bound +for constrained balanced empirical risk minimization, and (2) a consistent +upper bound for balanced nearest neighbors estimates. Our findings provide a +clearer understanding of the benefits of class-weighting in realistic settings, +opening new avenues for further research in this field. + +
+
+
+
+
+ + ☆ Text2Topic: Multi-Label Text Classification System for Efficient Topic + Detection in User Generated Content with Zero-Shot Capabilities + + +
+ Multi-label text classification is a critical task in the industry. It helps +to extract structured information from large amounts of textual data. We propose +Text to Topic (Text2Topic), which achieves high multi-label classification +performance by employing a Bi-Encoder Transformer architecture that utilizes +concatenation, subtraction, and multiplication of embeddings on both text and +topic. Text2Topic also supports zero-shot predictions, produces domain-specific +text embeddings, and enables production-scale batch-inference with high +throughput. The final model achieves accurate and comprehensive results +compared to state-of-the-art baselines, including large language models (LLMs). + In this study, a total of 239 topics are defined, and around 1.6 million +text-topic pair annotations (of which 200K are positive) are collected on +approximately 120K texts from 3 main data sources on Booking.com. The data is +collected with optimized smart sampling and partial labeling. The final +Text2Topic model is deployed on a real-world stream processing platform, and it +outperforms other models with 92.9% micro mAP, as well as a 75.8% macro mAP +score. We summarize the modeling choices which are extensively tested through +ablation studies, and share detailed in-production decision-making steps. + +
+
+
+
+
+ + ☆ Leveraging Ensemble Diversity for Robust Self-Training in the Presence + of Sample Selection Bias + + +
+ Self-training is a well-known approach for semi-supervised learning. It +consists of iteratively assigning pseudo-labels to unlabeled data for which the +model is confident and treating them as labeled examples. For neural networks, +softmax prediction probabilities are often used as a confidence measure, +despite the fact that they are known to be overconfident, even for wrong +predictions. This phenomenon is particularly intensified in the presence of +sample selection bias, i.e., when data labeling is subject to some constraint. +To address this issue, we propose a novel confidence measure, called +$\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of +linear classifiers. We provide the theoretical analysis of our approach by +studying stationary points and describing the relationship between the +diversity of the individual members and their performance. We empirically +demonstrate the benefit of our confidence measure for three different +pseudo-labeling policies on classification datasets of various data modalities. + +
+
+
+
+
+ + ☆ Learning spatio-temporal patterns with Neural Cellular Automata + + +
+ Neural Cellular Automata (NCA) are a powerful combination of machine learning +and mechanistic modelling. We train NCA to learn complex dynamics from time +series of images and PDE trajectories. Our method is designed to identify +underlying local rules that govern large scale dynamic emergent behaviours. +Previous work on NCA focuses on learning rules that give stationary emergent +structures. We extend NCA to capture both transient and stable structures +within the same system, as well as learning rules that capture the dynamics of +Turing pattern formation in nonlinear Partial Differential Equations (PDEs). We +demonstrate that NCA can generalise very well beyond their PDE training data, +we show how to constrain NCA to respect given symmetries, and we explore the +effects of associated hyperparameters on model performance and stability. Being +able to learn arbitrary dynamics gives NCA great potential as a data driven +modelling framework, especially for modelling biological pattern formation. + +
+
+ comment: For videos referenced in appendix, see: + https://github.com/AlexDR1998/NCA/tree/main/Videos +
+
+
+
+
+ + ☆ What do Deck Chairs and Sun Hats Have in Common? Uncovering Shared + Properties in Large Concept Vocabularies EMNLP 2023 + + +
+ Concepts play a central role in many applications. This includes settings +where concepts have to be modelled in the absence of sentence context. Previous +work has therefore focused on distilling decontextualised concept embeddings +from language models. But concepts can be modelled from different perspectives, +whereas concept embeddings typically mostly capture taxonomic structure. To +address this issue, we propose a strategy for identifying what different +concepts, from a potentially large concept vocabulary, have in common with +others. We then represent concepts in terms of the properties they share with +the other concepts. To demonstrate the practical usefulness of this way of +modelling concepts, we consider the task of ultra-fine entity typing, which is +a challenging multi-label classification problem. We show that by augmenting +the label set with shared properties, we can improve the performance of the +state-of-the-art models for this task. + +
+
+ comment: Accepted for EMNLP 2023 +
+
+
+
+
+ + ☆ An Efficient Imbalance-Aware Federated Learning Approach for Wearable + Healthcare with Autoregressive Ratio Observation + + +
+ Widely available healthcare services are now getting popular because of +advancements in wearable sensing techniques and mobile edge computing. People's +health information is collected by edge devices such as smartphones and +wearable bands for further analysis on servers, which then send back suggestions and +alerts for abnormal conditions. The recent emergence of federated learning +allows users to train private data on local devices while updating models +collaboratively. However, the heterogeneous distribution of the health +condition data may lead to significant risks to model performance due to class +imbalance. Meanwhile, as FL training is powered by sharing gradients only with +the server, training data is almost inaccessible. The conventional solutions to +class imbalance do not work for federated learning. In this work, we propose a +new federated learning framework FedImT, dedicated to addressing the challenges +of class imbalance in federated learning scenarios. FedImT contains an online +scheme that can estimate the data composition during each round of aggregation, +then introduces a self-attenuating iterative equivalent to track variations of +multiple estimations and promptly tweak the balance of the loss computing for +minority classes. Experiments demonstrate the effectiveness of FedImT in +solving the imbalance problem without extra energy consumption and avoiding +privacy risks. + +
+
+ comment: submitted to IEEE OJCS in Oct. 2023 (under review) +
+
+
+
+
+ + ☆ Geographical Erasure in Language Generation EMNLP 2023 + + +
+ Large language models (LLMs) encode vast amounts of world knowledge. However, +since these models are trained on large swaths of internet data, they are at +risk of inordinately capturing information about dominant groups. This +imbalance can propagate into generated language. In this work, we study and +operationalise a form of geographical erasure, wherein language models +underpredict certain countries. We demonstrate consistent instances of erasure +across a range of LLMs. We discover that erasure strongly correlates with low +frequencies of country mentions in the training corpus. Lastly, we mitigate +erasure by finetuning using a custom objective. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Principled Approaches for Learning to Defer with Multiple Experts + + +
+ We present a study of surrogate losses and algorithms for the general problem +of learning to defer with multiple experts. We first introduce a new family of +surrogate losses specifically tailored for the multiple-expert setting, where +the prediction and deferral functions are learned simultaneously. We then prove +that these surrogate losses benefit from strong $H$-consistency bounds. We +illustrate the application of our analysis through several examples of +practical surrogate losses, for which we give explicit guarantees. These loss +functions readily lead to the design of new learning to defer algorithms based +on their minimization. While the main focus of this work is a theoretical +analysis, we also report the results of several experiments on SVHN and +CIFAR-10 datasets. + +
+
+
+
+
+ + ☆ Predictor-Rejector Multi-Class Abstention: Theoretical Analysis and + Algorithms + + +
+ We study the key framework of learning with abstention in the multi-class +classification setting. In this setting, the learner can choose to abstain from +making a prediction with some pre-defined cost. We present a series of new +theoretical and algorithmic results for this learning problem in the +predictor-rejector framework. We introduce several new families of surrogate +losses for which we prove strong non-asymptotic and hypothesis set-specific +consistency guarantees, thereby resolving positively two existing open +questions. These guarantees provide upper bounds on the estimation error of the +abstention loss function in terms of that of the surrogate loss. We analyze +both a single-stage setting where the predictor and rejector are learned +simultaneously and a two-stage setting crucial in applications, where the +predictor is learned in a first stage using a standard surrogate loss such as +cross-entropy. These guarantees suggest new multi-class abstention algorithms +based on minimizing these surrogate losses. We also report the results of +extensive experiments comparing these algorithms to the current +state-of-the-art algorithms on CIFAR-10, CIFAR-100 and SVHN datasets. Our +results demonstrate empirically the benefit of our new surrogate losses and +show the remarkable performance of our broadly applicable two-stage abstention +algorithm. + +
+
+
+
+
+ + ☆ Theoretically Grounded Loss Functions and Algorithms for Score-Based + Multi-Class Abstention + + +
+ Learning with abstention is a key scenario where the learner can abstain from +making a prediction at some cost. In this paper, we analyze the score-based +formulation of learning with abstention in the multi-class classification +setting. We introduce new families of surrogate losses for the abstention loss +function, which include the state-of-the-art surrogate losses in the +single-stage setting and a novel family of loss functions in the two-stage +setting. We prove strong non-asymptotic and hypothesis set-specific consistency +guarantees for these surrogate losses, which upper-bound the estimation error +of the abstention loss function in terms of the estimation error of the +surrogate loss. Our bounds can help compare different score-based surrogates +and guide the design of novel abstention algorithms by minimizing the proposed +surrogate losses. We experimentally evaluate our new algorithms on CIFAR-10, +CIFAR-100, and SVHN datasets and the practical significance of our new +surrogate losses and two-stage abstention algorithms. Our results also show +that the relative performance of the state-of-the-art score-based surrogate +losses can vary across datasets. + +
+
+
+
+
+ + ☆ Policy Gradient with Kernel Quadrature + + +
+ Reward evaluation of episodes becomes a bottleneck in a broad range of +reinforcement learning tasks. Our aim in this paper is to select a small but +representative subset of a large batch of episodes, only on which we actually +compute rewards for more efficient policy gradient iterations. We build a +Gaussian process modeling of discounted returns or rewards to derive a positive +definite kernel on the space of episodes, run an "episodic" kernel quadrature +method to compress the information of sample episodes, and pass the reduced +episodes to the policy network for gradient updates. We present the theoretical +background of this procedure as well as its numerical illustrations in MuJoCo +and causal discovery tasks. + +
+
+ comment: 16 pages, 4 figures +
+
+
+
+
+ + ☆ Improved K-mer Based Prediction of Protein-Protein Interactions With + Chaos Game Representation, Deep Learning and Reduced Representation Bias + + +
+ Protein-protein interactions drive many biological processes, including the +detection of phytopathogens by plants' R-Proteins and cell surface receptors. +Many machine learning studies have attempted to predict protein-protein +interactions but performance is highly dependent on training data; models have +been shown to accurately predict interactions when the proteins involved are +included in the training data, but achieve consistently poorer results when +applied to previously unseen proteins. In addition, models that are trained +using proteins that take part in multiple interactions can suffer from +representation bias, where predictions are driven not by learned biological +features but by learning of the structure of the interaction dataset. + We present a method for extracting unique pairs from an interaction dataset, +generating non-redundant paired data for unbiased machine learning. After +applying the method to datasets containing _Arabidopsis thaliana_ and pathogen +effector interactions, we developed a convolutional neural network model capable +of learning and predicting interactions from Chaos Game Representations of +proteins' coding genes. + +
+
+
+
+
+ + ☆ Externally Valid Policy Evaluation Combining Trial and Observational + Data + + +
+ Randomized trials are widely considered the gold standard for evaluating +the effects of decision policies. Trial data is, however, drawn from a +population which may differ from the intended target population and this raises +a problem of external validity (a.k.a. generalizability). In this paper we seek +to use trial data to draw valid inferences about the outcome of a policy on the +target population. Additional covariate data from the target population is used +to model the sampling of individuals in the trial study. We develop a method +that yields certifiably valid trial-based policy evaluations under any +specified range of model miscalibrations. The method is nonparametric and the +validity is assured even with finite samples. The certified policy evaluations +are illustrated using both simulated and real data. + +
+
+
+
+
+ + ☆ Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules NeurIPS 2023 + + +
+ Masked graph modeling excels in the self-supervised representation learning +of molecular graphs. Scrutinizing previous studies, we can reveal a common +scheme consisting of three key components: (1) graph tokenizer, which breaks a +molecular graph into smaller fragments (i.e., subgraphs) and converts them into +tokens; (2) graph masking, which corrupts the graph with masks; (3) graph +autoencoder, which first applies an encoder on the masked graph to generate the +representations, and then employs a decoder on the representations to recover +the tokens of the original graph. However, the previous MGM studies focus +extensively on graph masking and encoder, while there is limited understanding +of tokenizer and decoder. To bridge the gap, we first summarize popular +molecule tokenizers at the granularity of node, edge, motif, and Graph Neural +Networks (GNNs), and then examine their roles as the MGM's reconstruction +targets. Further, we explore the potential of adopting an expressive decoder in +MGM. Our results show that a subgraph-level tokenizer and a sufficiently +expressive decoder with remask decoding have a large impact on the encoder's +representation learning. Finally, we propose a novel MGM method SimSGT, +featuring a Simple GNN-based Tokenizer (SGT) and an effective decoding +strategy. We empirically validate that our method outperforms the existing +molecule self-supervised learning methods. Our codes and checkpoints are +available at https://github.com/syr-cn/SimSGT. + +
+
+ comment: NeurIPS 2023. 10 pages +
+
+
+
+
+ + ☆ Efficient and Interpretable Bandit Algorithms + + +
+ Motivated by the importance of explainability in modern machine learning, we +design bandit algorithms that are \emph{efficient} and \emph{interpretable}. A +bandit algorithm is interpretable if it explores with the objective of reducing +uncertainty in the unknown model parameter. To quantify the interpretability, +we introduce a novel metric of \textit{uncertainty loss}, which compares the +rate of the uncertainty reduction to the theoretical optimum. We propose CODE, +a bandit algorithm based on a \textbf{C}onstrained \textbf{O}ptimal +\textbf{DE}sign, that is interpretable and maximally reduces the uncertainty. +The key idea in CODE is to explore among all plausible actions, determined by +a statistical constraint, to achieve interpretability. We implement CODE +efficiently in both multi-armed and linear bandits and derive near-optimal +regret bounds by leveraging the optimality criteria of the approximate optimal +design. CODE can also be viewed as removing phases in conventional phased +elimination, which makes it more practical and general. We demonstrate the +advantage of CODE by numerical experiments on both synthetic and real-world +problems. CODE outperforms other state-of-the-art interpretable designs while +matching the performance of popular but uninterpretable designs, such as upper +confidence bound algorithms. + +
+
+
+
+
+ + ☆ The Safety Challenges of Deep Learning in Real-World Type 1 Diabetes + Management + + +
+ Blood glucose simulation allows the effectiveness of type 1 diabetes (T1D) +management strategies to be evaluated without patient harm. Deep learning +algorithms provide a promising avenue for extending simulator capabilities; +however, these algorithms are limited in that they do not necessarily learn +physiologically correct glucose dynamics and can learn incorrect and +potentially dangerous relationships from confounders in training data. This is +likely to be more important in real-world scenarios, as data is not collected +under strict research protocol. This work explores the implications of using +deep learning algorithms trained on real-world data to model glucose dynamics. +Free-living data was processed from the OpenAPS Data Commons and supplemented +with patient-reported tags of challenging diabetes events, constituting one of +the most detailed real-world T1D datasets. This dataset was used to train and +evaluate state-of-the-art glucose simulators, comparing their prediction error +across safety critical scenarios and assessing the physiological +appropriateness of the learned dynamics using Shapley Additive Explanations +(SHAP). While deep learning prediction accuracy surpassed the widely-used +mathematical simulator approach, the model deteriorated in safety critical +scenarios and struggled to leverage self-reported meal and exercise +information. SHAP value analysis also indicated the model had fundamentally +confused the roles of insulin and carbohydrates, which is one of the most basic +T1D management principles. This work highlights the importance of considering +physiological appropriateness when using deep learning to model real-world +systems in T1D and healthcare more broadly, and provides recommendations for +building models that are robust to real-world data constraints. + +
+
+ comment: 15 pages, 3 figures +
+
+
+
+
+ + ☆ Extended Deep Adaptive Input Normalization for Preprocessing Time Series + Data for Neural Networks + + +
+ Data preprocessing is a crucial part of any machine learning pipeline, and it +can have a significant impact on both performance and training efficiency. This +is especially evident when using deep neural networks for time series +prediction and classification: real-world time series data often exhibit +irregularities such as multi-modality, skewness and outliers, and the model +performance can degrade rapidly if these characteristics are not adequately +addressed. In this work, we propose the EDAIN (Extended Deep Adaptive Input +Normalization) layer, a novel adaptive neural layer that learns how to +appropriately normalize irregular time series data for a given task in an +end-to-end fashion, instead of using a fixed normalization scheme. This is +achieved by optimizing its unknown parameters simultaneously with the deep +neural network using back-propagation. Our experiments, conducted using +synthetic data, a credit default prediction dataset, and a large-scale limit +order book benchmark dataset, demonstrate the superior performance of the EDAIN +layer when compared to conventional normalization methods and existing adaptive +time series preprocessing layers. + +
+
+
+
+
+ + ☆ BatteryML:An Open-source platform for Machine Learning on Battery + Degradation + + +
+ Battery degradation remains a pivotal concern in the energy storage domain, +with machine learning emerging as a potent tool to drive forward insights and +solutions. However, this intersection of electrochemical science and machine +learning poses complex challenges. Machine learning experts often grapple with +the intricacies of battery science, while battery researchers face hurdles in +adapting intricate models tailored to specific datasets. Beyond this, a +cohesive standard for battery degradation modeling, inclusive of data formats +and evaluative benchmarks, is conspicuously absent. Recognizing these +impediments, we present BatteryML - a one-step, all-encompassing, and open-source +platform designed to unify data preprocessing, feature extraction, and the +implementation of both traditional and state-of-the-art models. This +streamlined approach promises to enhance the practicality and efficiency of +research applications. BatteryML seeks to fill this void, fostering an +environment where experts from diverse specializations can collaboratively +contribute, thus elevating the collective understanding and advancement of +battery research. The code for our project is publicly available on GitHub at +https://github.com/microsoft/BatteryML. + +
+
+
+
+
+ + ☆ Random Forest Dissimilarity for High-Dimension Low Sample Size + Classification + + +
+ High dimension, low sample size (HDLSS) problems are numerous among +real-world applications of machine learning. From medical images to text +processing, traditional machine learning algorithms are usually unsuccessful in +learning the best possible concept from such data. In a previous work, we +proposed a dissimilarity-based approach for multi-view classification, the +Random Forest Dissimilarity (RFD), that achieves state-of-the-art results for +such problems. In this work, we transpose the core principle of this approach +to solving HDLSS classification problems, by using the RF similarity measure as +a learned precomputed SVM kernel (RFSVM). We show that such a learned +similarity measure is particularly suited and accurate for this classification +context. Experiments conducted on 40 public HDLSS classification datasets, +supported by rigorous statistical analyses, show that the RFSVM method +outperforms existing methods for the majority of HDLSS problems and remains at +the same time very competitive for low or non-HDLSS problems. + +
+
+ comment: 23 pages. To be published in statistics and computing (accepted + September 26, 2023) +
+
+
+
+
+ + ☆ A Hybrid GNN approach for predicting node data for 3D meshes + + +
+ Metal forging is used to manufacture dies. We require the best set of input +parameters for the process to be efficient. Currently, we predict the best +parameters using the finite element method by generating simulations for the +different initial conditions, which is a time-consuming process. In this paper, +we introduce a hybrid approach that helps in processing and generating new data +simulations using a surrogate graph neural network model based on graph +convolutions, having a cheaper time cost. We also introduce a hybrid approach +that helps in processing and generating new data simulations using the model. +Given a dataset representing meshes, our focus is on the conversion of the +available information into a graph or point cloud structure. This new +representation enables deep learning. The predicted result is similar, with a +low error when compared to that produced using the finite element method. The +new models have outperformed existing PointNet and simple graph neural network +models when applied to produce the simulations. + +
+
+
+
+
+ + ☆ Federated learning compression designed for lightweight communications + + +
+ Federated Learning (FL) is a promising distributed method for edge-level +machine learning, particularly for privacy-sensitive applications such as those +in military and medical domains, where client data cannot be shared or +transferred to a cloud computing server. In many use-cases, communication cost +is a major challenge in FL due to its natural intensive network usage. Client +devices, such as smartphones or Internet of Things (IoT) nodes, have limited +resources in terms of energy, computation, and memory. To address these +hardware constraints, lightweight models and compression techniques such as +pruning and quantization are commonly adopted in centralised paradigms. In this +paper, we investigate the impact of compression techniques on FL for a typical +image classification task. Going further, we demonstrate that a straightforward +method can compress messages up to 50% while having less than 1% of accuracy +loss, competing with state-of-the-art techniques. + +
+
+
+
+
+ + ☆ Population Descent: A Natural-Selection Based Hyper-Parameter Tuning + Framework + + +
+ First-order gradient descent has been the base of the most successful +optimization algorithms ever implemented. On supervised learning problems with +very high dimensionality, such as neural network optimization, it is almost +always the algorithm of choice, mainly due to its memory and computational +efficiency. However, it is a classical result in optimization that gradient +descent converges to local minima on non-convex functions. Even more +importantly, in certain high-dimensional cases, escaping the plateaus of large +saddle points becomes intractable. On the other hand, black-box optimization +methods are not sensitive to the local structure of a loss function's landscape +but suffer the curse of dimensionality. Instead, memetic algorithms aim to +combine the benefits of both. Inspired by this, we present Population Descent, +a memetic algorithm focused on hyperparameter optimization. We show that an +adaptive m-elitist selection approach combined with a normalized-fitness-based +randomization scheme outperforms more complex state-of-the-art algorithms by up +to 13% on common benchmark tasks. + +
+
+
+
+
+ + ☆ Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and + Beyond EMNLP 2023 + + +
+ Vision-language (VL) understanding tasks evaluate models' comprehension of +complex visual scenes through multiple-choice questions. However, we have +identified two dataset biases that models can exploit as shortcuts to resolve +various VL tasks correctly without proper understanding. The first type of +dataset bias is \emph{Unbalanced Matching} bias, where the correct answer +overlaps the question and image more than the incorrect answers. The second +type of dataset bias is \emph{Distractor Similarity} bias, where incorrect +answers are overly dissimilar to the correct answer but significantly similar +to other incorrect answers within the same sample. To address these dataset +biases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic +training and debiased evaluation data. We then introduce Intra-sample +Counterfactual Training (ICT) to assist models in utilizing the synthesized +training data, particularly the counterfactual data, via focusing on +intra-sample differentiation. Extensive experiments demonstrate the +effectiveness of ADS and ICT in consistently improving model performance across +different benchmarks, even in domain-shifted scenarios. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Data Pruning via Moving-one-Sample-out + + +
+ In this paper, we propose a novel data-pruning approach called +moving-one-sample-out (MoSo), which aims to identify and remove the least +informative samples from the training set. The core insight behind MoSo is to +determine the importance of each sample by assessing its impact on the optimal +empirical risk. This is achieved by measuring the extent to which the empirical +risk changes when a particular sample is excluded from the training set. +Instead of using the computationally expensive leaving-one-out-retraining +procedure, we propose an efficient first-order approximator that only requires +gradient information from different training stages. The key idea behind our +approximation is that samples with gradients that are consistently aligned with +the average gradient of the training set are more informative and should +receive higher scores, which could be intuitively understood as follows: if the +gradient from a specific sample is consistent with the average gradient vector, +it implies that optimizing the network using the sample will yield a similar +effect on all remaining samples. Experimental results demonstrate that MoSo +effectively mitigates severe performance degradation at high pruning ratios and +achieves satisfactory performance across various settings. + +
+
+
+
+
+ + ☆ Tractable MCMC for Private Learning with Pure and Gaussian Differential + Privacy + + +
+ Posterior sampling, i.e., exponential mechanism to sample from the posterior +distribution, provides $\varepsilon$-pure differential privacy (DP) guarantees +and does not suffer from potentially unbounded privacy breach introduced by +$(\varepsilon,\delta)$-approximate DP. In practice, however, one needs to apply +approximate sampling methods such as Markov chain Monte Carlo (MCMC), thus +re-introducing the unappealing $\delta$-approximation error into the privacy +guarantees. To bridge this gap, we propose the Approximate SAample Perturbation +(abbr. ASAP) algorithm which perturbs an MCMC sample with noise proportional to +its Wasserstein-infinity ($W_\infty$) distance from a reference distribution +that satisfies pure DP or pure Gaussian DP (i.e., $\delta=0$). We then leverage +a Metropolis-Hastings algorithm to generate the sample and prove that the +algorithm converges in W$_\infty$ distance. We show that by combining our new +techniques with a careful localization step, we obtain the first nearly +linear-time algorithm that achieves the optimal rates in the DP-ERM problem +with strongly convex and smooth losses. + +
+
+
+
+
+ + ☆ Predicting Accurate Lagrangian Multipliers for Mixed Integer Linear + Programs + + +
+ Lagrangian relaxation stands among the most efficient approaches for solving +a Mixed Integer Linear Program (MILP) with difficult constraints. Given any +duals for these constraints, called Lagrangian Multipliers (LMs), it returns a +bound on the optimal value of the MILP, and Lagrangian methods seek the LMs +giving the best such bound. But these methods generally rely on iterative +algorithms resembling gradient descent to maximize the concave piecewise linear +dual function: the computational burden grows quickly with the number of +relaxed constraints. We introduce a deep learning approach that bypasses the +descent, effectively amortizing the local, per instance, optimization. A +probabilistic encoder based on a graph convolutional network computes +high-dimensional representations of relaxed constraints in MILP instances. A +decoder then turns these representations into LMs. We train the encoder and +decoder jointly by directly optimizing the bound obtained from the predicted +multipliers. Numerical experiments show that our approach closes up to 85% of +the gap between the continuous relaxation and the best Lagrangian bound, and +provides a high quality warm-start for descent based Lagrangian methods. + +
+
+
+
+
+ + ☆ $Λ$-Split: A Privacy-Preserving Split Computing Framework for + Cloud-Powered Generative AI + + +
+ In the wake of the burgeoning expansion of generative artificial intelligence +(AI) services, the computational demands inherent to these technologies +frequently necessitate cloud-powered computational offloading, particularly for +resource-constrained mobile devices. These services commonly employ prompts to +steer the generative process, and both the prompts and the resultant content, +such as text and images, may harbor privacy-sensitive or confidential +information, thereby elevating security and privacy risks. To mitigate these +concerns, we introduce $\Lambda$-Split, a split computing framework to +facilitate computational offloading while simultaneously fortifying data +privacy against risks such as eavesdropping and unauthorized access. In +$\Lambda$-Split, a generative model, usually a deep neural network (DNN), is +partitioned into three sub-models and distributed across the user's local +device and a cloud server: the input-side and output-side sub-models are +allocated to the local, while the intermediate, computationally-intensive +sub-model resides on the cloud server. This architecture ensures that only the +hidden layer outputs are transmitted, thereby preventing the external +transmission of privacy-sensitive raw input and output data. Given the +black-box nature of DNNs, estimating the original input or output from +intercepted hidden layer outputs poses a significant challenge for malicious +eavesdroppers. Moreover, $\Lambda$-Split is orthogonal to traditional +encryption-based security mechanisms, offering enhanced security when deployed +in conjunction. We empirically validate the efficacy of the $\Lambda$-Split +framework using Llama 2 and Stable Diffusion XL, representative large language +and diffusion models developed by Meta and Stability AI, respectively. Our +$\Lambda$-Split implementation is publicly accessible at +https://github.com/nishio-laboratory/lambda_split. + +
+
+ comment: This work has been submitted to the IEEE for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ♻ ☆ GRASP: Accelerating Shortest Path Attacks via Graph Attention + + +
+ Recent advances in machine learning (ML) have shown promise in aiding and +accelerating classical combinatorial optimization algorithms. ML-based speed +ups that aim to learn in an end to end manner (i.e., directly output the +solution) tend to trade off run time with solution quality. Therefore, +solutions that are able to accelerate existing solvers while maintaining their +performance guarantees, are of great interest. We consider an APX-hard problem, +where an adversary aims to attack shortest paths in a graph by removing the +minimum number of edges. We propose the GRASP algorithm: Graph Attention +Accelerated Shortest Path Attack, an ML aided optimization algorithm that +achieves run times up to 10x faster, while maintaining the quality of solution +generated. GRASP uses a graph attention network to identify a smaller subgraph +containing the combinatorial solution, thus effectively reducing the input +problem size. Additionally, we demonstrate how careful representation of the +input graph, including node features that correlate well with the optimization +task, can highlight important structure in the optimization solution. + +
+
+
+
+
+ + ♻ ☆ Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR + Decomposition EMNLP 2023 + + +
+ Cross-encoder models, which jointly encode and score a query-item pair, are +prohibitively expensive for direct k-nearest neighbor (k-NN) search. +Consequently, k-NN search typically employs a fast approximate retrieval (e.g. +using BM25 or dual-encoder vectors), followed by reranking with a +cross-encoder; however, the retrieval approximation often has detrimental +recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent +work that employs a cross-encoder only, making search efficient using a +relatively small number of anchor items, and a CUR matrix factorization. While +ANNCUR's one-time selection of anchors tends to approximate the cross-encoder +distances on average, doing so forfeits the capacity to accurately estimate +distances to items near the query, leading to regret in the crucial end-task: +recall of top-k items. In this paper, we propose ADACUR, a method that +adaptively, iteratively, and efficiently minimizes the approximation error for +the practically important top-k neighbors. It does so by iteratively performing +k-NN search using the anchors available so far, then adding these retrieved +nearest neighbors to the anchor set for the next round. Empirically, on +multiple datasets, in comparison to previous traditional and state-of-the-art +methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed +approach ADACUR consistently reduces recall error (by up to 70% on the important +k = 1 setting) while using no more compute than its competitors. + +
+
+ comment: Findings of EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Diverse Offline Imitation Learning + + +
+ There has been significant recent progress in the area of unsupervised skill +discovery, utilizing various information-theoretic objectives as measures of +diversity. Despite these advances, challenges remain: current methods require +significant online interaction, fail to leverage vast amounts of available +task-agnostic data and typically lack a quantitative measure of skill utility. +We address these challenges by proposing a principled offline algorithm for +unsupervised skill discovery that, in addition to maximizing diversity, ensures +that each learned skill imitates state-only expert demonstrations to a certain +degree. Our main analytical contribution is to connect Fenchel duality, +reinforcement learning, and unsupervised skill discovery to maximize a mutual +information objective subject to KL-divergence state occupancy constraints. +Furthermore, we demonstrate the effectiveness of our method on the standard +offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF +quadruped robot for which the policies trained in simulation transfer well to +the real robotic system. + +
+
+
+
+
+ + ♻ ☆ Self-Supervised One-Shot Learning for Automatic Segmentation of StyleGAN + Images + + +
+ We propose a framework for the automatic one-shot segmentation of synthetic +images generated by a StyleGAN. Our framework is based on the observation that +the multi-scale hidden features in the GAN generator hold useful semantic +information that can be utilized for automatic on-the-fly segmentation of the +generated images. Using these features, our framework learns to segment +synthetic images using a self-supervised contrastive clustering algorithm that +projects the hidden features into a compact space for per-pixel classification. +This contrastive learner is based on using a novel data augmentation strategy +and a pixel-wise swapped prediction loss that leads to faster learning of the +feature vectors for one-shot segmentation. We have tested our implementation on +five standard benchmarks to yield a segmentation performance that not only +outperforms the semi-supervised baselines by an average wIoU margin of 1.02 % +but also improves the inference speeds by a factor of 4.5. Finally, we also +show the results of using the proposed one-shot learner in implementing BagGAN, +a framework for producing annotated synthetic baggage X-ray scans for threat +detection. This framework was trained and tested on the PIDRay baggage +benchmark to yield a performance comparable to its baseline segmenter based on +manual annotations. + +
+
+
+
+
+ + ♻ ☆ Simplifying Momentum-based Positive-definite Submanifold Optimization + with Applications to Deep Learning ICML 2023 + + +
+ Riemannian submanifold optimization with momentum is computationally +challenging because, to ensure that the iterates remain on the submanifold, we +often need to solve difficult differential equations. Here, we simplify such +difficulties for a class of sparse or structured symmetric positive-definite +matrices with the affine-invariant metric. We do so by proposing a generalized +version of the Riemannian normal coordinates that dynamically orthonormalizes +the metric and locally converts the problem into an unconstrained problem in +the Euclidean space. We use our approach to simplify existing approaches for +structured covariances and develop matrix-inverse-free $2^\text{nd}$-order +optimizers for deep learning with low precision by using only matrix +multiplications. Code: https://github.com/yorkerlin/StructuredNGD-DL + +
+
+ comment: An updated version of the ICML 2023 paper. Updated the main text to + emphasize challenges of using existing Riemannian methods to estimate sparse + and structured SPD matrices +
+
+
+
+
+ + ♻ Improving day-ahead Solar Irradiance Time Series Forecasting by + Leveraging Spatio-Temporal Context + + +
+ Solar power harbors immense potential in mitigating climate change by +substantially reducing CO$_{2}$ emissions. Nonetheless, the inherent +variability of solar irradiance poses a significant challenge for seamlessly +integrating solar power into the electrical grid. While the majority of prior +research has centered on employing purely time series-based methodologies for +solar forecasting, only a limited number of studies have taken into account +factors such as cloud cover or the surrounding physical context. In this paper, +we put forth a deep learning architecture designed to harness spatio-temporal +context using satellite data, to attain highly accurate \textit{day-ahead} +time-series forecasting for any given station, with a particular emphasis on +forecasting Global Horizontal Irradiance (GHI). We also suggest a methodology +to extract a distribution for each time step prediction, which can serve as a +very valuable measure of uncertainty attached to the forecast. When evaluating +models, we propose a testing scheme in which we separate particularly difficult +examples from easy ones, in order to capture the model performances in crucial +situations, which in the case of this study are the days suffering from varying +cloudy conditions. Furthermore, we present a new multi-modal dataset gathering +satellite imagery over a large zone and time series for solar irradiance and +other related physical variables from multiple geographically diverse solar +stations. Our approach exhibits robust performance in solar irradiance +forecasting, including zero-shot generalization tests at unobserved solar +stations, and holds great promise in promoting the effective integration of +solar power into the grid. + +
+
+
+
+
+ + ♻ ☆ CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP + Performance on Low-Resource Languages + + +
+ This work introduces CAPIVARA, a cost-efficient framework designed to enhance +the performance of multilingual CLIP models in low-resource languages. While +CLIP has excelled in zero-shot vision-language tasks, the resource-intensive +nature of model training remains challenging. Many datasets lack linguistic +diversity, featuring solely English descriptions for images. CAPIVARA addresses +this by augmenting text data using image captioning and machine translation to +generate multiple synthetic captions in low-resource languages. We optimize the +training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the +computational cost. Through extensive experiments, CAPIVARA emerges as state of +the art in zero-shot tasks involving images and Portuguese texts. We show the +potential for significant improvements in other low-resource languages, +achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a +single GPU for 2 hours. Our model and code are available at +https://github.com/hiaac-nlp/CAPIVARA. + +
+
+
+
+
+ + ♻ ☆ The Geometry of Neural Nets' Parameter Spaces Under Reparametrization NeurIPS 2023 + + +
+ Model reparametrization, which follows the change-of-variable rule of +calculus, is a popular way to improve the training of neural nets. But it can +also be problematic since it can induce inconsistencies in, e.g., Hessian-based +flatness measures, optimization trajectories, and modes of probability +densities. This complicates downstream analyses: e.g. one cannot definitively +relate flatness with generalization since arbitrary reparametrization changes +their relationship. In this work, we study the invariance of neural nets under +reparametrization from the perspective of Riemannian geometry. From this point +of view, invariance is an inherent property of any neural net if one explicitly +represents the metric and uses the correct associated transformation rules. +This is important since although the metric is always present, it is often +implicitly assumed as identity, and thus dropped from the notation, then lost +under reparametrization. We discuss implications for measuring the flatness of +minima, optimization, and for probability-density maximization. Finally, we +explore some interesting directions where invariance is useful. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Content-Based Search for Deep Generative Models + + +
+ The growing proliferation of customized and pretrained generative models has +made it infeasible for a user to be fully cognizant of every model in +existence. To address this need, we introduce the task of content-based model +search: given a query and a large set of generative models, finding the models +that best match the query. As each generative model produces a distribution of +images, we formulate the search task as an optimization problem to select the +model with the highest probability of generating similar content as the query. +We introduce a formulation to approximate this probability given the query from +different modalities, e.g., image, sketch, and text. Furthermore, we propose a +contrastive learning framework for model retrieval, which learns to adapt +features for various query modalities. We demonstrate that our method +outperforms several baselines on Generative Model Zoo, a new benchmark we +create for the model retrieval task. + +
+
+ comment: Our project page is hosted at + https://generative-intelligence-lab.github.io/modelverse/ +
+
+
+
+
+ + ♻ ☆ The Crucial Role of Normalization in Sharpness-Aware Minimization NeurIPS 2023 + + +
+ Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based +optimizer (Foret et al., ICLR 2021) that greatly improves the prediction +performance of deep neural networks. Consequently, there has been a surge of +interest in explaining its empirical success. We focus, in particular, on +understanding the role played by normalization, a key component of the SAM +updates. We theoretically and empirically study the effect of normalization in +SAM for both convex and non-convex functions, revealing two key roles played by +normalization: i) it helps in stabilizing the algorithm; and ii) it enables the +algorithm to drift along a continuum (manifold) of minima -- a property +identified by recent theoretical works that is the key to better performance. +We further argue that these two properties of normalization make SAM robust +against the choice of hyper-parameters, supporting the practicality of SAM. Our +conclusions are backed by various experiments. + +
+
+ comment: 30 pages, Published in 37th Neural Information Processing Systems + (NeurIPS 2023) +
+
+
+
+
+ + ♻ ☆ Generative Flow Networks as Entropy-Regularized RL + + +
+ The recently proposed generative flow networks (GFlowNets) are a method of +training a policy to sample compositional discrete objects with probabilities +proportional to a given reward via a sequence of actions. GFlowNets exploit the +sequential nature of the problem, drawing parallels with reinforcement learning +(RL). Our work extends the connection between RL and GFlowNets to a general +case. We demonstrate how the task of learning a generative flow network can be +efficiently redefined as an entropy-regularized RL problem with a specific +reward and regularizer structure. Furthermore, we illustrate the practical +efficiency of this reformulation by applying standard soft RL algorithms to +GFlowNet training across several probabilistic modeling tasks. Contrary to +previously reported results, we show that entropic RL approaches can be +competitive against established GFlowNet training methods. This perspective +opens a direct path for integrating reinforcement learning principles into the +realm of generative flow networks. + +
+
+
+
+
+ + ♻ ☆ Quantum Advantage Seeker with Kernels (QuASK): a software framework to + speed up the research in quantum machine learning + + +
+ Exploiting the properties of quantum information to the benefit of machine +learning models is perhaps the most active field of research in quantum +computation. This interest has supported the development of a multitude of +software frameworks (e.g. Qiskit, Pennylane, Braket) to implement, simulate, +and execute quantum algorithms. Most of them allow us to define quantum +circuits, run basic quantum algorithms, and access low-level primitives +depending on the hardware such software is supposed to run. For most +experiments, these frameworks have to be manually integrated within a larger +machine learning software pipeline. The researcher is in charge of knowing +different software packages, integrating them through the development of long +code scripts, analyzing the results, and generating the plots. Long code often +leads to erroneous applications, due to the average number of bugs growing +proportionally to the program length. Moreover, other researchers +will struggle to understand and reproduce the experiment, due to the need to be +familiar with all the different software frameworks involved in the code +script. We propose QuASK, an open-source quantum machine learning framework +written in Python that aids the researcher in performing their experiments, +with particular attention to quantum kernel techniques. QuASK can be used as a +command-line tool to download datasets, pre-process them, run quantum machine +learning routines, analyze and visualize the results. QuASK implements most +state-of-the-art algorithms to analyze the data through quantum kernels, with +the possibility to use projected kernels, (gradient-descent) trainable quantum +kernels, and structure-optimized quantum kernels. Our framework can also be +used as a library and integrated into pre-existing software, maximizing code +reuse. + +
+
+ comment: Close to the published version +
+
+
+
+
+ + ♻ ☆ Variational Imbalanced Regression: Fair Uncertainty Quantification via + Probabilistic Smoothing NeurIPS 2023 + + +
+ Existing regression models tend to fall short in both accuracy and +uncertainty estimation when the label distribution is imbalanced. In this +paper, we propose a probabilistic deep learning model, dubbed variational +imbalanced regression (VIR), which not only performs well in imbalanced +regression but naturally produces reasonable uncertainty estimation as a +byproduct. Different from typical variational autoencoders assuming I.I.D. +representations (a data point's representation is not directly affected by +other data points), our VIR borrows data with similar regression labels to +compute the latent representation's variational distribution; furthermore, +different from deterministic regression models producing point estimates, VIR +predicts the entire normal-inverse-gamma distributions and modulates the +associated conjugate distributions to impose probabilistic reweighting on the +imbalanced data, thereby providing better uncertainty estimation. Experiments +in several real-world datasets show that our VIR can outperform +state-of-the-art imbalanced regression models in terms of both accuracy and +uncertainty estimation. Code will soon be available at +https://github.com/Wang-ML-Lab/variational-imbalanced-regression. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ An Attribution Method for Siamese Encoders EMNLP'23 + + +
+ Despite the success of Siamese encoder models such as sentence transformers +(ST), little is known about the aspects of inputs they pay attention to. A +barrier is that their predictions cannot be attributed to individual features, +as they compare two inputs rather than processing a single one. This paper +derives a local attribution method for Siamese encoders by generalizing the +principle of integrated gradients to models with multiple inputs. The solution +takes the form of feature-pair attributions, and can be reduced to a +token-token matrix for STs. Our method involves the introduction of integrated +Jacobians and inherits the advantageous formal properties of integrated +gradients: it accounts for the model's full computation graph and is guaranteed +to converge to the actual prediction. A pilot study shows that in an ST few +token-pairs can often explain large fractions of predictions, and it focuses on +nouns and verbs. For accurate predictions, it however needs to attend to the +majority of tokens and parts of speech. + +
+
+ comment: Accepted to EMNLP'23 +
+
+
+
+
+ + ♻ ☆ Long-Form Speech Translation through Segmentation with Finite-State + Decoding Constraints on Large Language Models EMNLP 2023 + + +
+ One challenge in speech translation is that plenty of spoken content is +long-form, but short units are necessary for obtaining high-quality +translations. To address this mismatch, we adapt large language models (LLMs) +to split long ASR transcripts into segments that can be independently +translated so as to maximize the overall translation quality. We overcome the +tendency of hallucination in LLMs by incorporating finite-state constraints +during decoding; these eliminate invalid outputs without requiring additional +training. We discover that LLMs are adaptable to transcripts containing ASR +errors through prompt-tuning or fine-tuning. Relative to a state-of-the-art +automatic punctuation baseline, our best LLM improves the average BLEU by 2.9 +points for English-German, English-Spanish, and English-Arabic TED talk +translation in 9 test sets, just by improving segmentation. + +
+
+ comment: accepted to the Findings of EMNLP 2023. arXiv admin note: text + overlap with arXiv:2212.09895 +
+
+
+
+
+ + ♻ ☆ Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through + Interaction with Symbolic Systems EMNLP 2023 + + +
+ Despite outstanding performance in many tasks, language models are +notoriously inclined to make factual errors in tasks requiring arithmetic +computation. We address this deficiency by creating Calc-X, a collection of +datasets that demonstrates the appropriate use of a calculator in reasoning +chains. Calc-X is suitable for teaching language models to offload computations +to a symbolic system. We survey and unify several existing chain-of-thought +datasets into a proposed format, resulting in a standard collection of over +300,000 samples requiring arithmetic reasoning. Finally, we use the new Calc-X +collection to train open-source calculator-using models we call Calcformers and +show that these models approximately double the accuracy of generating correct +results compared to vanilla language model baselines. We make all Calc-X +datasets, source code and Calcformers models publicly available. + +
+
+ comment: Published in EMNLP 2023: Main track +
+
+
+
+
+ + ♻ ☆ Sensitivity-Aware Amortized Bayesian Inference + + +
+ Bayesian inference is a powerful framework for making probabilistic +inferences and decisions under uncertainty. Fundamental choices in modern +Bayesian workflows concern the specification of the likelihood function and +prior distributions, the posterior approximator, and the data. Each choice can +significantly influence model-based inference and subsequent decisions, thereby +necessitating sensitivity analysis. In this work, we propose a multifaceted +approach to integrate sensitivity analyses into amortized Bayesian inference +(ABI, i.e., simulation-based inference with neural networks). First, we utilize +weight sharing to encode the structural similarities between alternative +likelihood and prior specifications in the training process with minimal +computational overhead. Second, we leverage the rapid inference of neural +networks to assess sensitivity to various data perturbations or pre-processing +procedures. In contrast to most other Bayesian approaches, both steps +circumvent the costly bottleneck of refitting the model(s) for each choice of +likelihood, prior, or dataset. Finally, we propose to use neural network +ensembles to evaluate variation in results induced by unreliable approximation +on unseen data. We demonstrate the effectiveness of our method in applied +modeling problems, ranging from the estimation of disease outbreak dynamics and +global warming thresholds to the comparison of human decision-making models. +Our experiments showcase how our approach enables practitioners to effectively +unveil hidden relationships between modeling choices and inferential +conclusions. + +
+
+
+
+
+ + ♻ ☆ SRAI: Towards Standardization of Geospatial AI SP + + +
+ Spatial Representations for Artificial Intelligence (srai) is a Python +library for working with geospatial data. The library can download geospatial +data, split a given area into micro-regions using multiple algorithms and train +an embedding model using various architectures. It includes baseline models as +well as more complex methods from published works. Those capabilities make it +possible to use srai in a complete pipeline for geospatial task solving. The +proposed library is the first step to standardize the geospatial AI domain +toolset. It is fully open-source and published under Apache 2.0 licence. + +
+
+ comment: Accepted for the 6th ACM SIGSPATIAL International Workshop on AI for + Geographic Knowledge Discovery (GeoAI 2023) +
+
+
+
+
+ + ♻ ☆ Learning curves for deep structured Gaussian feature models NeurIPS 2023 + + +
+ In recent years, significant attention in deep learning theory has been +devoted to analyzing when models that interpolate their training data can still +generalize well to unseen examples. Many insights have been gained from +studying models with multiple layers of Gaussian random features, for which one +can compute precise generalization asymptotics. However, few works have +considered the effect of weight anisotropy; most assume that the random +features are generated using independent and identically distributed Gaussian +weights, and allow only for structure in the input data. Here, we use the +replica trick from statistical physics to derive learning curves for models +with many layers of structured Gaussian features. We show that allowing +correlations between the rows of the first layer of features can aid +generalization, while structure in later layers is generally detrimental. Our +results shed light on how weight structure affects generalization in a simple +class of solvable models. + +
+
+ comment: 14+18 pages, 2+1 figures. NeurIPS 2023 Camera Ready +
+
+
+
+
+ + ♻ ☆ Select without Fear: Almost All Mini-Batch Schedules Generalize + Optimally + + +
+ We establish matching upper and lower generalization error bounds for +mini-batch Gradient Descent (GD) training with either deterministic or +stochastic, data-independent, but otherwise arbitrary batch selection rules. We +consider smooth Lipschitz-convex/nonconvex/strongly-convex loss functions, and +show that classical upper bounds for Stochastic GD (SGD) also hold verbatim for +such arbitrary nonadaptive batch schedules, including all deterministic ones. +Further, for convex and strongly-convex losses we prove matching lower bounds +directly on the generalization error uniform over the aforementioned class of +batch schedules, showing that all such batch schedules generalize optimally. +Lastly, for smooth (non-Lipschitz) nonconvex losses, we show that full-batch +(deterministic) GD is essentially optimal, among all possible batch schedules +within the considered class, including all stochastic ones. + +
+
+ comment: 37 pages, 2 tables +
+
+
+
+
+ + ♻ ☆ Making Scalable Meta Learning Practical + + +
+ Despite its flexibility to learn diverse inductive biases in machine learning +programs, meta learning (i.e., learning to learn) has long been recognized to +suffer from poor scalability due to its tremendous compute/memory costs, +training instability, and a lack of efficient distributed training support. In +this work, we focus on making scalable meta learning practical by introducing +SAMA, which combines advances in both implicit differentiation algorithms and +systems. Specifically, SAMA is designed to flexibly support a broad range of +adaptive optimizers in the base level of meta learning programs, while reducing +computational burden by avoiding explicit computation of second-order gradient +information, and exploiting efficient distributed training techniques +implemented for first-order gradients. Evaluated on multiple large-scale meta +learning benchmarks, SAMA showcases up to 1.7/4.8x increase in throughput and +2.0/3.8x decrease in memory consumption respectively on single-/multi-GPU +setups compared to other baseline meta learning algorithms. Furthermore, we +show that SAMA-based data optimization leads to consistent improvements in text +classification accuracy with BERT and RoBERTa large language models, and +achieves state-of-the-art results in both small- and large-scale data pruning +on image classification tasks, demonstrating the practical applicability of +scalable meta learning across language and vision domains. + +
+
+
+
+
+ + ♻ ☆ CodeLMSec Benchmark: Systematically Evaluating and Finding Security + Vulnerabilities in Black-Box Code Language Models + + +
+ Large language models (LLMs) for automatic code generation have achieved +breakthroughs in several programming tasks. Their advances in competition-level +programming problems have made them an essential pillar of AI-assisted pair +programming, and tools such as GitHub Copilot have emerged as part of the daily +programming workflow used by millions of developers. The training data for +these models is usually collected from the Internet (e.g., from open-source +repositories) and is likely to contain faults and security vulnerabilities. +This unsanitized training data can cause the language models to learn these +vulnerabilities and propagate them during the code generation procedure. While +these models have been extensively assessed for their ability to produce +functionally correct programs, there remains a lack of comprehensive +investigations and benchmarks addressing the security aspects of these models. + In this work, we propose a method to systematically study the security issues +of code language models to assess their susceptibility to generating vulnerable +code. To this end, we introduce the first approach to automatically find +generated code that contains vulnerabilities in black-box code generation +models. To achieve this, we present an approach to approximate inversion of the +black-box code generation models based on few-shot prompting. We evaluate the +effectiveness of our approach by examining code language models in generating +high-risk security weaknesses. Furthermore, we establish a collection of +diverse non-secure prompts for various vulnerability scenarios using our +method. This dataset forms a benchmark for evaluating and comparing the +security weaknesses in code language models. + +
+
+ comment: 23 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Hyperbolic Graph Neural Networks: A Review of Methods and Applications + + +
+ Graph neural networks generalize conventional neural networks to +graph-structured data and have received widespread attention due to their +impressive representation ability. In spite of the remarkable achievements, the +performance of Euclidean models in graph-related learning is still bounded and +limited by the representation ability of Euclidean geometry, especially for +datasets with highly non-Euclidean latent anatomy. Recently, hyperbolic space +has gained increasing popularity in processing graph data with tree-like +structure and power-law distribution, owing to its exponential growth property. +In this survey, we comprehensively revisit the technical details of the current +hyperbolic graph neural networks, unifying them into a general framework and +summarizing the variants of each component. More importantly, we present +various HGNN-related applications. Last, we also identify several challenges, +which potentially serve as guidelines for further flourishing the achievements +of graph learning in hyperbolic spaces. + +
+
+
+
+
+ + ♻ ☆ Smooth Sailing: Improving Active Learning for Pre-trained Language + Models with Representation Smoothness Analysis + + +
+ Developed to alleviate prohibitive labeling costs, active learning (AL) +methods aim to reduce label complexity in supervised learning. While recent +work has demonstrated the benefit of using AL in combination with large +pre-trained language models (PLMs), it has often overlooked the practical +challenges that hinder the effectiveness of AL. We address these challenges by +leveraging representation smoothness analysis to ensure AL is feasible, that +is, both effective and practicable. Firstly, we propose an early stopping +technique that does not require a validation set -- often unavailable in +realistic AL conditions -- and observe significant improvements over random +sampling across multiple datasets and AL methods. Further, we find that task +adaptation improves AL, whereas standard short fine-tuning in AL does not +provide improvements over random sampling. Our work demonstrates the usefulness +of representation smoothness analysis for AL and introduces an AL stopping +criterion that reduces label complexity. + +
+
+ comment: Accepted at Learning with Small Data 2023, Association for + Computational Linguistics +
+
+
+
+
+ + ♻ ☆ Regularizing Neural Networks with Meta-Learning Generative Models NeurIPS 2023 + + +
+ This paper investigates methods for improving generative data augmentation +for deep learning. Generative data augmentation leverages the synthetic samples +produced by generative models as an additional dataset for classification with +small dataset settings. A key challenge of generative data augmentation is that +the synthetic data contain uninformative samples that degrade accuracy. This is +because the synthetic samples do not perfectly represent class categories in +real data and uniform sampling does not necessarily provide useful samples for +tasks. In this paper, we present a novel strategy for generative data +augmentation called meta generative regularization (MGR). To avoid the +degradation of generative data augmentation, MGR utilizes synthetic samples in +the regularization term for feature extractors instead of in the loss function, +e.g., cross-entropy. These synthetic samples are dynamically determined to +minimize the validation losses through meta-learning. We observed that MGR can +avoid the performance degradation of naïve generative data augmentation and +boost the baselines. Experiments on six datasets showed that MGR is effective +particularly when datasets are smaller and stably outperforms baselines. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation + and Elaboration + + +
+ In recent years, AI-assisted drug design methods have been proposed to +generate molecules given the pocket structures of target proteins. Most of +them are atom-level-based methods, which consider atoms as basic components and +generate atom positions and types. In this way, however, it is hard to generate +realistic fragments with complicated structures. To solve this, we propose +D3FG, a functional-group-based diffusion model for pocket-specific molecule +generation and elaboration. D3FG decomposes molecules into two categories of +components: functional groups defined as rigid bodies and linkers as mass +points. Together, the two kinds of components can form complicated fragments +that enhance ligand-protein interactions. + Specifically, in the diffusion process, D3FG diffuses the data distribution +of the positions, orientations, and types of the components into a prior +distribution; in the generative process, the noise is gradually removed from +the three variables by denoisers parameterized with designed equivariant graph +neural networks. In the experiments, our method can generate molecules with +more realistic 3D structures, competitive affinities toward the protein +targets, and better drug properties. In addition, as a solution to the new task +of molecule elaboration, D3FG can generate molecules with high affinities based on +existing ligands and the hotspots of target proteins. + +
+
+ comment: 9 pages +
+
+
+
+
+ + ♻ ☆ Differentially Private Natural Language Models: Recent Advances and + Future Directions + + +
+ Recent developments in deep learning have led to great success in various +natural language processing (NLP) tasks. However, these applications may +involve data that contain sensitive information. Therefore, how to achieve good +performance while also protecting the privacy of sensitive data is a crucial +challenge in NLP. To preserve privacy, Differential Privacy (DP), which can +prevent reconstruction attacks and protect against potential side knowledge, is +becoming a de facto technique for private data analysis. In recent years, NLP +in DP models (DP-NLP) has been studied from different perspectives, which +deserves a comprehensive review. In this paper, we provide the first systematic +review of recent advances in DP deep learning models in NLP. In particular, we +first discuss some differences and additional challenges of DP-NLP compared +with the standard DP deep learning. Then, we investigate some existing work on +DP-NLP and present its recent developments from three aspects: gradient +perturbation based methods, embedding vector perturbation based methods, and +ensemble model based methods. We also discuss some challenges and future +directions. + +
+
+
+
+
+ + ♻ ☆ Comparing Apples to Oranges: Learning Similarity Functions for Data + Produced by Different Distributions NeurIPS 2023 + + +
+ Similarity functions measure how comparable pairs of elements are, and play a +key role in a wide variety of applications, e.g., notions of Individual +Fairness abiding by the seminal paradigm of Dwork et al., as well as Clustering +problems. However, access to an accurate similarity function should not always +be considered guaranteed, and this point was even raised by Dwork et al. For +instance, it is reasonable to assume that when the elements to be compared are +produced by different distributions, or in other words belong to different +``demographic'' groups, knowledge of their true similarity might be very +difficult to obtain. In this work, we present an efficient sampling framework +that learns these across-groups similarity functions, using only a limited +amount of experts' feedback. We show analytical results with rigorous +theoretical bounds, and empirically validate our algorithms via a large suite +of experiments. + +
+
+ comment: Accepted at NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Why Should This Article Be Deleted? Transparent Stance Detection in + Multilingual Wikipedia Editor Discussions EMNLP 2023 + + +
+ The moderation of content on online platforms is usually non-transparent. On +Wikipedia, however, this discussion is carried out publicly and the editors are +encouraged to use the content moderation policies as explanations for making +moderation decisions. Currently, only a few comments explicitly mention those +policies -- 20% of the English ones, but as few as 2% of the German and Turkish +comments. To aid in this process of understanding how content is moderated, we +construct a novel multilingual dataset of Wikipedia editor discussions along +with their reasoning in three languages. The dataset contains the stances of +the editors (keep, delete, merge, comment), along with the stated reason, and a +content moderation policy, for each edit decision. We demonstrate that stance +and corresponding reason (policy) can be predicted jointly with a high degree +of accuracy, adding transparency to the decision-making process. We release +both our joint prediction models and the multilingual content moderation +dataset for further research on automated transparent content moderation. + +
+
+ comment: This submission has been accepted to 2023 Conference on Empirical + Methods in Natural Language Processing (EMNLP 2023) +
+
+
+
+
+ + ♻ ☆ Hindsight Learning for MDPs with Exogenous Inputs + + +
+ Many resource management problems require sequential decision-making under +uncertainty, where the only uncertainty affecting the decision outcomes are +exogenous variables outside the control of the decision-maker. We model these +problems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and +design a class of data-efficient algorithms for them termed Hindsight Learning +(HL). Our HL algorithms achieve data efficiency by leveraging a key insight: +having samples of the exogenous variables, past decisions can be revisited in +hindsight to infer counterfactual consequences that can accelerate policy +improvements. We compare HL against classic baselines in the multi-secretary +and airline revenue management problems. We also scale our algorithms to a +business-critical cloud resource management problem -- allocating Virtual +Machines (VMs) to physical machines, and simulate their performance with real +datasets from a large public cloud provider. We find that HL algorithms +outperform domain-specific heuristics, as well as state-of-the-art +reinforcement learning methods. + +
+
+ comment: 52 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ Going Beyond Familiar Features for Deep Anomaly Detection + + +
+ Anomaly Detection (AD) is a critical task that involves identifying +observations that do not conform to a learned model of normality. Prior work in +deep AD is predominantly based on a familiarity hypothesis, where familiar +features serve as the reference in a pre-trained embedding space. While this +strategy has proven highly successful, it turns out that it causes consistent +false negatives when anomalies consist of truly novel features that are not +well captured by the pre-trained encoding. We propose a novel approach to AD +using explainability to capture novel features as unexplained observations in +the input space. We achieve strong performance across a wide range of anomaly +benchmarks by combining similarity and novelty in a hybrid approach. Our +approach establishes a new state-of-the-art across multiple benchmarks, +handling diverse anomaly types while eliminating the need for expensive +background models and dense matching. In particular, we show that by taking +account of novel features, we reduce false negative anomalies by up to 40% on +challenging benchmarks compared to the state-of-the-art. Our method gives +visually inspectable explanations for pixel-level anomalies. + +
+
+
+
+
+ + ♻ ☆ SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling + + +
+ Time series analysis is widely used across many areas. Recently, to reduce +labeling expenses and benefit various tasks, self-supervised pre-training has +attracted immense interest. One mainstream paradigm is masked modeling, which +successfully pre-trains deep models by learning to reconstruct the masked +content based on the unmasked part. However, since the semantic information of +time series is mainly contained in temporal variations, the standard way of +randomly masking a portion of time points will seriously ruin vital temporal +variations of time series, making the reconstruction task too difficult to +guide representation learning. We thus present SimMTM, a Simple pre-training +framework for Masked Time-series Modeling. By relating masked modeling to +manifold learning, SimMTM proposes to recover masked time points by the +weighted aggregation of multiple neighbors outside the manifold, which eases +the reconstruction task by assembling ruined but complementary temporal +variations from multiple masked series. SimMTM further learns to uncover the +local structure of the manifold, which is helpful for masked modeling. +Experimentally, SimMTM achieves state-of-the-art fine-tuning performance +compared to the most advanced time series pre-training methods in two canonical +time series analysis tasks: forecasting and classification, covering both in- +and cross-domain settings. + +
+
+
+
+
+ + ♻ ☆ Multi-objective optimization via equivariant deep hypervolume + approximation ICLR 2023 + + +
+ Optimizing multiple competing objectives is a common problem across science +and industry. The inherent inextricable trade-off between those objectives +leads one to the task of exploring their Pareto front. A meaningful quantity +for the purpose of the latter is the hypervolume indicator, which is used in +Bayesian Optimization (BO) and Evolutionary Algorithms (EAs). However, the +computational complexity for the calculation of the hypervolume scales +unfavorably with increasing number of objectives and data points, which +restricts its use in those common multi-objective optimization frameworks. To +overcome these restrictions we propose to approximate the hypervolume function +with a deep neural network, which we call DeepHV. For better sample efficiency +and generalization, we exploit the fact that the hypervolume is +scale-equivariant in each of the objectives as well as permutation invariant +w.r.t. both the objectives and the samples, by using a deep neural network that +is equivariant w.r.t. the combined group of scalings and permutations. We +evaluate our method against exact, and approximate hypervolume methods in terms +of accuracy, computation time, and generalization. We also apply and compare +our methods to state-of-the-art multi-objective BO methods and EAs on a range +of synthetic benchmark test cases. The results show that our methods are +promising for such multi-objective optimization tasks. + +
+
+ comment: Updated with camera-ready version. Accepted at ICLR 2023 +
+
+
+
+
+ + ♻ ☆ Enhancing Adversarial Contrastive Learning via Adversarial Invariant + Regularization NeurIPS 2023 + + +
+ Adversarial contrastive learning (ACL) is a technique that enhances standard +contrastive learning (SCL) by incorporating adversarial data to learn a robust +representation that can withstand adversarial attacks and common corruptions +without requiring costly annotations. To improve transferability, the existing +work introduced the standard invariant regularization (SIR) to impose +style-independence property to SCL, which can exempt the impact of nuisance +style factors in the standard representation. However, it is unclear how the +style-independence property benefits ACL-learned robust representations. In +this paper, we leverage the technique of causal reasoning to interpret the ACL +and propose adversarial invariant regularization (AIR) to enforce independence +from style factors. We regulate the ACL using both SIR and AIR to output the +robust representation. Theoretically, we show that AIR implicitly encourages +the representational distance between different views of natural data and their +adversarial variants to be independent of style factors. Empirically, our +experimental results show that invariant regularization significantly improves +the performance of state-of-the-art ACL methods in terms of both standard +generalization and robustness on downstream tasks. To the best of our +knowledge, we are the first to apply causal reasoning to interpret ACL and +develop AIR for enhancing ACL-learned robust representations. Our source code +is at https://github.com/GodXuxilie/Enhancing_ACL_via_AIR. + +
+
+ comment: NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization + for Few-shot Generalization EMNLP 2023 + + +
+ Prompt tuning is a parameter-efficient method, which learns soft prompts and +conditions frozen language models to perform specific downstream tasks. Though +effective, prompt tuning under few-shot settings on the one hand heavily relies +on a good initialization of soft prompts. On the other hand, it can easily +overfit to few-shot training samples, thereby undermining generalizability. +Existing works leverage pre-training or supervised meta-learning to initialize +soft prompts but they fail to data-efficiently generalize to unseen downstream +tasks. To address the above problems, this paper proposes a novel +Self-sUpervised meta-Prompt learning framework with MEta-gradient +Regularization for few-shot generalization (SUPMER). SUPMER leverages +self-supervised meta-learning with a diverse set of well-designed meta-training +tasks to learn a universal prompt initialization for efficient adaptation using +only unlabeled data. Additionally, it jointly meta-learns a gradient +regularization function to transform raw gradients into a domain-generalizable +direction, thus alleviating the problem of overfitting. Extensive experiments +show that SUPMER achieves better performance for different few-shot downstream +tasks, and also exhibits a stronger domain generalization ability. The code for +SUPMER will be available at https://github.com/beepkh/SUPMER. + +
+
+ comment: Accepted by EMNLP 2023 (Findings) +
+
+
+
+
+ + ♻ ☆ Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset + Selection NeurIPS 2023 + + +
+ Adversarial contrastive learning (ACL) does not require expensive data +annotations but outputs a robust representation that withstands adversarial +attacks and also generalizes to a wide range of downstream tasks. However, ACL +needs tremendous running time to generate the adversarial variants of all +training data, which limits its scalability to large datasets. To speed up ACL, +this paper proposes a robustness-aware coreset selection (RCS) method. RCS does +not require label information and searches for an informative subset that +minimizes a representational divergence, which is the distance of the +representation between natural data and their virtual adversarial variants. The +vanilla solution of RCS via traversing all possible subsets is computationally +prohibitive. Therefore, we theoretically transform RCS into a surrogate problem +of submodular maximization, of which the greedy search is an efficient solution +with an optimality guarantee for the original problem. Empirically, our +comprehensive results corroborate that RCS can speed up ACL by a large margin +without significantly hurting the robustness transferability. Notably, to the +best of our knowledge, we are the first to conduct ACL efficiently on the +large-scale ImageNet-1K dataset to obtain an effective robust representation +via RCS. Our source code is at +https://github.com/GodXuxilie/Efficient_ACL_via_RCS. + +
+
+ comment: NeurIPS 2023 Spotlight +
+
+
+
+
+ + ♻ ☆ Learning Informative Representation for Fairness-aware Multivariate + Time-series Forecasting: A Group-based Perspective + + +
+ Performance unfairness among variables widely exists in multivariate time +series (MTS) forecasting models since such models may attend or be biased toward certain +(advantaged) variables. Addressing this unfairness problem is important for +equally attending to all variables and avoiding vulnerable model biases/risks. +However, fair MTS forecasting is challenging and has been less studied in the +literature. To bridge this significant gap, we formulate the fairness modeling +problem as learning informative representations attending to both advantaged +and disadvantaged variables. Accordingly, we propose a novel framework, named +FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial +learning to generate both group-independent and group-relevant representations +for the downstream forecasting. The framework first leverages a spectral +relaxation of the K-means objective to infer variable correlations and thus to +group variables. Then, it utilizes a filtering&fusion component to filter the +group-relevant information and generate group-independent representations via +orthogonality regularization. The group-independent and group-relevant +representations form highly informative representations, facilitating the +sharing of knowledge from advantaged variables to disadvantaged variables to +guarantee fairness. Extensive experiments on four public datasets demonstrate +the effectiveness of our proposed FairFor for fair forecasting and significant +performance improvement. + +
+
+ comment: 13 pages, 5 figures, accepted by IEEE Transactions on Knowledge and + Data Engineering (TKDE) +
+
+
+
+
+ + ♻ ☆ On the Ability of Graph Neural Networks to Model Interactions Between + Vertices NeurIPS 2023 + + +
+ Graph neural networks (GNNs) are widely used for modeling complex +interactions between entities represented as vertices of a graph. Despite +recent efforts to theoretically analyze the expressive power of GNNs, a formal +characterization of their ability to model interactions is lacking. The current +paper aims to address this gap. Formalizing strength of interactions through an +established measure known as separation rank, we quantify the ability of +certain GNNs to model interaction between a given subset of vertices and its +complement, i.e. between the sides of a given partition of input vertices. Our +results reveal that the ability to model interaction is primarily determined by +the partition's walk index -- a graph-theoretical characteristic defined by the +number of walks originating from the boundary of the partition. Experiments +with common GNN architectures corroborate this finding. As a practical +application of our theory, we design an edge sparsification algorithm named +Walk Index Sparsification (WIS), which preserves the ability of a GNN to model +interactions when input edges are removed. WIS is simple, computationally +efficient, and in our experiments has markedly outperformed alternative methods +in terms of induced prediction accuracy. More broadly, it showcases the +potential of improving GNNs by theoretically analyzing the interactions they +can model. + +
+
+ comment: Accepted to NeurIPS 2023 +
+
+
+
+
+ + ♻ ☆ Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak + Supervision for Text Classification EMNLP 2023 + + +
+ Recent advances in weakly supervised text classification mostly focus on +designing sophisticated methods to turn high-level human heuristics into +quality pseudo-labels. In this paper, we revisit the seed matching-based +method, which is arguably the simplest way to generate pseudo-labels, and show +that its power was greatly underestimated. We show that the limited performance +of seed matching is largely due to the label bias injected by the simple +seed-match rule, which prevents the classifier from learning reliable +confidence for selecting high-quality pseudo-labels. Interestingly, simply +deleting the seed words present in the matched input texts can mitigate the +label bias and help learn better confidence. Subsequently, the performance +achieved by seed matching can be improved significantly, making it on par with +or even better than the state-of-the-art. Furthermore, to handle the case when +the seed words are not made known, we propose to simply delete the word tokens +in the input text randomly with a high deletion ratio. Remarkably, seed +matching equipped with this random deletion method can often achieve even +better performance than that with seed deletion. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Privacy-Preserving Federated Deep Clustering based on GAN + + +
+ Federated clustering (FC) is an essential extension of centralized clustering +designed for the federated setting, wherein the challenge lies in constructing +a global similarity measure without the need to share private data. +Conventional approaches to FC typically adopt extensions of centralized +methods, like K-means and fuzzy c-means. However, these methods are susceptible +to non-independent-and-identically-distributed (non-IID) data among clients, +leading to suboptimal performance, particularly with high-dimensional data. In +this paper, we present a novel approach to address these limitations by +proposing a Privacy-Preserving Federated Deep Clustering based on Generative +Adversarial Networks (GANs). Each client trains a generative adversarial +network (GAN) locally and uploads the synthetic data to the server. The server +applies a deep clustering network on the synthetic data to establish $k$ +cluster centroids, which are then downloaded to the clients for cluster +assignment. Theoretical analysis demonstrates that the GAN-generated samples, +shared among clients, inherently uphold certain privacy guarantees, +safeguarding the confidentiality of individual data. Furthermore, extensive +experimental evaluations showcase the effectiveness and utility of our proposed +method in achieving accurate and privacy-preserving federated clustering. + +
+
+
+
+
+ + ♻ ☆ Federated clustering with GAN-based data synthesis + + +
+ Federated clustering (FC) is an extension of centralized clustering in +federated settings. The key here is how to construct a global similarity +measure without sharing private data, since the local similarity may be +insufficient to group local data correctly and the similarity of samples across +clients cannot be directly measured due to privacy constraints. Obviously, the +most straightforward way to analyze FC is to employ the methods extended from +centralized ones, such as K-means (KM) and fuzzy c-means (FCM). However, they +are vulnerable to non-independent-and-identically-distributed (non-IID) data +among clients. To handle this, we propose a new federated clustering framework, +named synthetic data aided federated clustering (SDA-FC). It trains a generative +adversarial network locally in each client and uploads the generated synthetic +data to the server, where KM or FCM is performed on the synthetic data. The +synthetic data can make the model immune to the non-IID problem and enable us +to capture the global similarity characteristics more effectively without +sharing private data. Comprehensive experiments reveal the advantages of +SDA-FC, including superior performance in addressing the non-IID problem and +device failures. + +
+
+
+
+
+ + ♻ ☆ Learning Representations of Bi-level Knowledge Graphs for Reasoning + beyond Link Prediction AAAI + + +
+ Knowledge graphs represent known facts using triplets. While existing +knowledge graph embedding methods only consider the connections between +entities, we propose considering the relationships between triplets. For +example, let us consider two triplets $T_1$ and $T_2$ where $T_1$ is +(Academy_Awards, Nominates, Avatar) and $T_2$ is (Avatar, Wins, +Academy_Awards). Given these two base-level triplets, we see that $T_1$ is a +prerequisite for $T_2$. In this paper, we define a higher-level triplet to +represent a relationship between triplets, e.g., $\langle T_1$, +PrerequisiteFor, $T_2\rangle$ where PrerequisiteFor is a higher-level relation. +We define a bi-level knowledge graph that consists of the base-level and the +higher-level triplets. We also propose a data augmentation strategy based on +the random walks on the bi-level knowledge graph to augment plausible triplets. +Our model called BiVE learns embeddings by taking into account the structures +of the base-level and the higher-level triplets, with additional consideration +of the augmented triplets. We propose two new tasks: triplet prediction and +conditional link prediction. Given a triplet $T_1$ and a higher-level relation, +the triplet prediction predicts a triplet that is likely to be connected to +$T_1$ by the higher-level relation, e.g., $\langle T_1$, PrerequisiteFor, +?$\rangle$. The conditional link prediction predicts a missing entity in a +triplet conditioned on another triplet, e.g., $\langle T_1$, PrerequisiteFor, +(Avatar, Wins, ?)$\rangle$. Experimental results show that BiVE significantly +outperforms all other methods in the two new tasks and the typical base-level +link prediction in real-world bi-level knowledge graphs. + +
+
+ comment: 14 pages, 3 figures, 15 tables. 37th AAAI Conference on Artificial + Intelligence (AAAI 2023) +
+
+
+
+
+ + ♻ ☆ How to Select Which Active Learning Strategy is Best Suited for Your + Specific Problem and Budget + + +
+ In the domain of Active Learning (AL), a learner actively selects the +unlabeled examples for which to seek labels from an oracle, while operating within +predefined budget constraints. Importantly, it has been recently shown that +distinct query strategies are better suited for different conditions and +budgetary constraints. In practice, the determination of the most appropriate +AL strategy for a given situation remains an open problem. To tackle this +challenge, we propose a practical derivative-based method that dynamically +identifies the best strategy for a given budget. Intuitive motivation for our +approach is provided by the theoretical analysis of a simplified scenario. We +then introduce a method to dynamically select an AL strategy, which takes into +account the unique characteristics of the problem and the available budget. +Empirical results showcase the effectiveness of our approach across diverse +budgets and computer vision tasks. + +
+
+
+
+
+ + ♻ ☆ Bayesian Flow Networks + + +
+ This paper introduces Bayesian Flow Networks (BFNs), a new class of +generative model in which the parameters of a set of independent distributions +are modified with Bayesian inference in the light of noisy data samples, then +passed as input to a neural network that outputs a second, interdependent +distribution. Starting from a simple prior and iteratively updating the two +distributions yields a generative procedure similar to the reverse process of +diffusion models; however it is conceptually simpler in that no forward process +is required. Discrete and continuous-time loss functions are derived for +continuous, discretised and discrete data, along with sample generation +procedures. Notably, the network inputs for discrete data lie on the +probability simplex, and are therefore natively differentiable, paving the way +for gradient-based sample guidance and few-step generation in discrete domains +such as language modelling. The loss function directly optimises data +compression and places no restrictions on the network architecture. In our +experiments BFNs achieve competitive log-likelihoods for image modelling on +dynamically binarized MNIST and CIFAR-10, and outperform all known discrete +diffusion models on the text8 character-level language modelling task. + +
+
+
+
+
+ + ♻ ☆ Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning + + +
+ Despite their impressive performance, large language models (LMs) still +struggle with reliably generating complex output structures when not finetuned +to follow the required output format exactly. To address this issue, +grammar-constrained decoding (GCD) can be used to control the generation of +LMs, guaranteeing that the output follows a given structure. Most existing GCD +methods are, however, limited to specific tasks, such as parsing or code +generation. In this work, we demonstrate that formal grammars can describe the +output space for a much wider range of tasks and argue that GCD can serve as a +unified framework for structured NLP tasks in general. For increased +flexibility, we introduce input-dependent grammars, which allow the grammar to +depend on the input and thus enable the generation of different output +structures for different inputs. We then empirically demonstrate the power and +flexibility of GCD-enhanced LMs on (1) information extraction, (2) entity +disambiguation, and (3) constituency parsing. Our results indicate that +grammar-constrained LMs substantially outperform unconstrained LMs or even beat +task-specific finetuned models. Grammar constraints thus hold great promise for +harnessing off-the-shelf LMs for a wide range of structured NLP tasks, +especially where training data is scarce or finetuning is expensive. Code and +data: https://github.com/epfl-dlab/GCD. + +
+
+
+
+
+ + ♻ ☆ Temporal Conditioning Spiking Latent Variable Models of the Neural + Response to Natural Visual Scenes NeurIPS 2023 + + +
+ Developing computational models of neural response is crucial for +understanding sensory processing and neural computations. Current +state-of-the-art neural network methods use temporal filters to handle temporal +dependencies, resulting in an unrealistic and inflexible processing paradigm. +Meanwhile, these methods target trial-averaged firing rates and fail to capture +important features in spike trains. This work presents the temporal +conditioning spiking latent variable models (TeCoS-LVM) to simulate the neural +response to natural visual stimuli. We use spiking neurons to produce spike +outputs that directly match the recorded trains. This approach helps to avoid +losing information embedded in the original spike trains. We exclude the +temporal dimension from the model parameter space and introduce a temporal +conditioning operation to allow the model to adaptively explore and exploit +temporal dependencies in stimuli sequences in a natural paradigm. We show +that TeCoS-LVM models can produce more realistic spike activities and +fit spike statistics more accurately than powerful alternatives. Additionally, +learned TeCoS-LVM models can generalize well to longer time scales. Overall, +while remaining computationally tractable, our model effectively captures key +features of neural coding systems. It thus provides a useful tool for building +accurate predictive computational accounts for various sensory perception +circuits. + +
+
+ comment: Accepted at NeurIPS 2023. 22 pages, 7 figures, 3 tables +
+
+
+
+
+ + ♻ ☆ On the Representational Capacity of Recurrent Neural Language Models EMNLP 2023 + + +
+ This work investigates the computational expressivity of language models +(LMs) based on recurrent neural networks (RNNs). Siegelmann and Sontag (1992) +famously showed that RNNs with rational weights and hidden states and unbounded +computation time are Turing complete. However, LMs define weightings over +strings in addition to just (unweighted) language membership and the analysis +of the computational power of RNN LMs (RLMs) should reflect this. We extend the +Turing completeness result to the probabilistic case, showing how a rationally +weighted RLM with unbounded computation time can simulate any probabilistic +Turing machine (PTM). Since, in practice, RLMs work in real-time, processing a +symbol at every time step, we treat the above result as an upper bound on the +expressivity of RLMs. We also provide a lower bound by showing that under the +restriction to real-time computation, such models can simulate deterministic +real-time rational PTMs. + +
+
+ comment: To be published at EMNLP 2023 +
+
+
+
+
+ + ♻ ☆ Demystify Problem-Dependent Power of Quantum Neural Networks on + Multi-Class Classification + + +
+ Quantum neural networks (QNNs) have become an important tool for +understanding the physical world, but their advantages and limitations are not +fully understood. Some QNNs with specific encoding methods can be efficiently +simulated by classical surrogates, while others with quantum memory may perform +better than classical classifiers. Here we systematically investigate the +problem-dependent power of quantum neural classifiers (QCs) on multi-class +classification tasks. Through the analysis of expected risk, a measure that +weighs the training loss and the generalization error of a classifier jointly, +we identify two key findings: first, the training loss dominates the power +rather than the generalization ability; second, QCs undergo a U-shaped risk +curve, in contrast to the double-descent risk curve of deep neural classifiers. +We also reveal the intrinsic connection between optimal QCs and the Helstrom +bound and the equiangular tight frame. Using these findings, we propose a +method that uses loss dynamics to probe whether a QC may be more effective than +a classical classifier on a particular learning task. Numerical results +demonstrate the effectiveness of our approach in explaining the superiority of QCs +over multilayer perceptrons on parity datasets and their limitations relative to +convolutional neural networks on image datasets. Our work sheds light on the +problem-dependent power of QNNs and offers a practical tool for evaluating +their potential merit. + +
+
+ comment: Updated version. Published on PRL +
+
+
+
+
+ + ♻ ☆ LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly + Transformers + + +
+ The community has explored building private inference frameworks for +transformer-based large language models (LLMs) in a server-client setting, +where the server holds the model parameters and the client inputs its private +data (or prompt) for inference. However, these frameworks impose significant +overhead when the private inputs are forward propagated through the original +LLMs. In this paper, we show that substituting the computation- and +communication-heavy operators in the transformer architecture with +privacy-computing friendly approximations can greatly reduce the private +inference costs while incurring only a minor impact on model performance. +Compared to state-of-the-art Iron (NeurIPS 2022), our privacy-computing +friendly model inference pipeline achieves a $5\times$ acceleration in +computation and an 80% reduction in communication overhead, while retaining +nearly identical accuracy. + +
+
+
+
+
+ + ♻ ☆ Adaptive Asynchronous Control Using Meta-learned Neural Ordinary + Differential Equations + + +
+ Model-based Reinforcement Learning and Control have demonstrated great +potential in various sequential decision making problem domains, including in +robotics settings. However, real-world robotics systems often present +challenges that limit the applicability of those methods. In particular, we +note two problems that jointly happen in many industrial systems: 1) +Irregular/asynchronous observations and actions and 2) Dramatic changes in +environment dynamics from one episode to another (e.g. varying payload inertial +properties). We propose a general framework that overcomes those difficulties +by meta-learning adaptive dynamics models for continuous-time prediction and +control. The proposed approach is task-agnostic and can be adapted to new tasks +in a straightforward manner. We present evaluations in two different robot +simulations and on a real industrial robot. + +
+
+ comment: 16 double column pages, 14 figures, 3 tables +
+
+
+
+
+ + ♻ Contrastive Retrospection: honing in on critical steps for rapid + learning and generalization in RL + + +
+ In real life, success is often contingent upon multiple critical steps that +are distant in time from each other and from the final reward. These critical +steps are challenging to identify with traditional reinforcement learning (RL) +methods that rely on the Bellman equation for credit assignment. Here, we +present a new RL algorithm that uses offline contrastive learning to hone in on +these critical steps. This algorithm, which we call Contrastive Retrospection +(ConSpec), can be added to any existing RL algorithm. ConSpec learns a set of +prototypes for the critical steps in a task by a novel contrastive loss and +delivers an intrinsic reward when the current state matches one of the +prototypes. The prototypes in ConSpec provide two key benefits for credit +assignment: (i) They enable rapid identification of all the critical steps. +(ii) They do so in a readily interpretable manner, enabling out-of-distribution +generalization when sensory features are altered. Distinct from other +contemporary RL approaches to credit assignment, ConSpec takes advantage of the +fact that it is easier to retrospectively identify the small set of steps that +success is contingent upon (and ignoring other states) than it is to +prospectively predict reward at every taken step. ConSpec greatly improves +learning in a diverse set of RL tasks. + +
+
+
+
+
+ + ♻ ☆ Segment, Select, Correct: A Framework for Weakly-Supervised Referring + Segmentation + + +
+ Referring Image Segmentation (RIS) - the problem of identifying objects in +images through natural language sentences - is a challenging task currently +mostly solved through supervised learning. However, while collecting referred +annotation masks is a time-consuming process, the few existing +weakly-supervised and zero-shot approaches fall significantly short in +performance compared to fully-supervised learning ones. To bridge the +performance gap without mask annotations, we propose a novel weakly-supervised +framework that tackles RIS by decomposing it into three steps: obtaining +instance masks for the object mentioned in the referencing instruction +(segment), using zero-shot learning to select a potentially correct mask for +the given instruction (select), and bootstrapping a model which allows for +fixing the mistakes of zero-shot selection (correct). In our experiments, using +only the first two steps (zero-shot segment and select) outperforms other +zero-shot baselines by as much as 19%, while our full method improves upon this +much stronger baseline and sets the new state-of-the-art for weakly-supervised +RIS, reducing the gap between the weakly-supervised and fully-supervised +methods in some cases from around 33% to as little as 14%. Code is available at +https://github.com/fgirbal/segment-select-correct. + +
+
+
+
+
+ + ♻ ☆ Learning Variational Neighbor Labels for Test-Time Domain Generalization + + +
+ This paper strives for domain generalization, where models are trained +exclusively on source domains before being deployed at unseen target domains. +We follow the strict separation of source training and target testing but +exploit the value of the unlabeled target data itself during inference. We make +three contributions. First, we propose probabilistic pseudo-labeling of target +samples to generalize the source-trained model to the target domain at test +time. We formulate the generalization at test time as a variational inference +problem by modeling pseudo labels as distributions to consider the uncertainty +during generalization and alleviate the misleading signal of inaccurate pseudo +labels. Second, we learn variational neighbor labels that incorporate the +information of neighboring target samples to generate more robust pseudo +labels. Third, to learn the ability to incorporate more representative target +information and generate more precise and robust variational neighbor labels, +we introduce a meta-generalization stage during training to simulate the +generalization procedure. Experiments on six widely-used datasets demonstrate +the benefits, abilities, and effectiveness of our proposal. + +
+
+ comment: Under review +
+
+
+
+
+ + ♻ ☆ Beyond Multilayer Perceptrons: Investigating Complex Topologies in + Neural Networks + + +
+ In this study, we explore the impact of network topology on the approximation +capabilities of artificial neural networks (ANNs), with a particular focus on +complex topologies. We propose a novel methodology for constructing complex +ANNs based on various topologies, including Barabási-Albert, +Erdős-Rényi, Watts-Strogatz, and multilayer perceptrons (MLPs). The +constructed networks are evaluated on synthetic datasets generated from +manifold learning generators, with varying levels of task difficulty and noise, +and on real-world datasets from the UCI suite. Our findings reveal that complex +topologies lead to superior performance in high-difficulty regimes compared to +traditional MLPs. This performance advantage is attributed to the ability of +complex networks to exploit the compositionality of the underlying target +function. However, this benefit comes at the cost of increased forward-pass +computation time and reduced robustness to graph damage. Additionally, we +investigate the relationship between various topological attributes and model +performance. Our analysis shows that no single attribute can account for the +observed performance differences, suggesting that the influence of network +topology on approximation capabilities may be more intricate than a simple +correlation with individual topological attributes. Our study sheds light on +the potential of complex topologies for enhancing the performance of ANNs and +provides a foundation for future research exploring the interplay between +multiple topological attributes and their impact on model performance. + +
+
+
+
+
+ + ♻ ☆ Conditional Generative Models are Provably Robust: Pointwise Guarantees + for Bayesian Inverse Problems + + +
+ Conditional generative models have become a very powerful tool to sample from +Bayesian inverse problem posteriors. It is well-known in classical Bayesian +literature that posterior measures are quite robust with respect to +perturbations of both the prior measure and the negative log-likelihood, which +includes perturbations of the observations. However, to the best of our +knowledge, the robustness of conditional generative models with respect to +perturbations of the observations has not been investigated yet. In this paper, +we prove for the first time that appropriately learned conditional generative +models provide robust results for single observations. + +
+
+ comment: Accepted and published in Transactions on Machine Learning Research + (07/2023) +
+
+
+
+
+ + ♻ ☆ Transfer-Learning Across Datasets with Different Input Dimensions: An + Algorithm and Analysis for the Linear Regression Case + + +
+ With the development of new sensors and monitoring devices, more sources of +data become available to be used as inputs for machine learning models. These +can on the one hand help to improve the accuracy of a model. On the other hand, +combining these new inputs with historical data remains a challenge that has +not yet been studied in enough detail. In this work, we propose a transfer +learning algorithm that combines new and historical data with different input +dimensions. This approach is easy to implement, efficient, with computational +complexity equivalent to the ordinary least-squares method, and requires no +hyperparameter tuning, making it straightforward to apply when the new data is +limited. Different from other approaches, we provide a rigorous theoretical +study of its robustness, showing that it cannot be outperformed by a baseline +that utilizes only the new data. Our approach achieves state-of-the-art +performance on 9 real-life datasets, outperforming the linear DSFT, another +linear transfer learning algorithm, and performing comparably to non-linear +DSFT. + +
+
+ comment: Manuscript accepted for publication at the Journal of Computational + Mathematics and Data Science. Code available at + https://github.com/lpsilvestrin/incremental_input_tl +
+
+
+
+
+ + ♻ ☆ Towards Robust Cardiac Segmentation using Graph Convolutional Networks + + +
+ Fully automatic cardiac segmentation can be a fast and reproducible method to +extract clinical measurements from an echocardiography examination. The U-Net +architecture is the current state-of-the-art deep learning architecture for +medical segmentation and can segment cardiac structures in real-time with +average errors comparable to inter-observer variability. However, this +architecture still generates large outliers that are often anatomically +incorrect. This work uses the concept of graph convolutional neural networks +that predict the contour points of the structures of interest instead of +labeling each pixel. We propose a graph architecture that uses two +convolutional rings based on cardiac anatomy and show that this eliminates +anatomically incorrect multi-structure segmentations on the publicly available +CAMUS dataset. Additionally, this work contributes an ablation study on +the graph convolutional architecture and an evaluation of clinical measurements +on the clinical HUNT4 dataset. Finally, we propose to use the inter-model +agreement of the U-Net and the graph network as a predictor of both the input +and segmentation quality. We show this predictor can detect out-of-distribution +and unsuitable input images in real-time. Source code is available online: +https://github.com/gillesvntnu/GCN_multistructure + +
+
+
+
+
+ + ♻ ☆ PINNSim: A Simulator for Power System Dynamics based on Physics-Informed + Neural Networks SC + + +
+ The dynamic behaviour of a power system can be described by a system of +differential-algebraic equations. Time-domain simulations are used to simulate +the evolution of these dynamics. They often require the use of small time step +sizes and therefore become computationally expensive. To accelerate these +simulations, we propose a simulator -- PINNSim -- that allows taking +significantly larger time steps. It is based on Physics-Informed Neural +Networks (PINNs) for the solution of the dynamics of single components in the +power system. To resolve their interaction we employ a scalable root-finding +algorithm. We demonstrate PINNSim on a 9-bus system and show the increased time +step size compared to a trapezoidal integration rule. We discuss key +characteristics of PINNSim and important steps for developing PINNSim into a +fully fledged simulator. As such, it could offer the opportunity for +significantly increasing time step sizes and thereby accelerating time-domain +simulations. + +
+
+ comment: submitted to the 23rd Power Systems Computation Conference (PSCC + 2024) +
+
+
+
+
+ + ♻ ☆ Learning Informative Health Indicators Through Unsupervised Contrastive + Learning + + +
+ Condition monitoring is essential to operate industrial assets safely and +efficiently. To achieve this goal, the development of robust health indicators +has recently attracted significant attention. These indicators, which provide +quantitative real-time insights into the health status of industrial assets +over time, serve as valuable tools for fault detection and prognostics. In this +study, we propose a novel and universal approach to learn health indicators +based on unsupervised contrastive learning. Operational time acts as a proxy +for the asset's degradation state, enabling the learning of a contrastive +feature space that facilitates the construction of a health indicator by +measuring the distance to the healthy condition. To highlight the universality +of the proposed approach, we assess the proposed contrastive learning framework +in two distinct tasks - wear assessment and fault detection - across two +different case studies: a milling machines case study and a real condition +monitoring case study of railway wheels from operating trains. First, we +evaluate if the health indicator is able to learn the real health condition on +a milling machine case study where the ground truth wear condition is +continuously measured. Second, we apply the proposed method on a real case +study of railway wheels where the ground truth health condition is not known. +Here, we evaluate the suitability of the learned health indicator for fault +detection of railway wheel defects. Our results demonstrate that the proposed +approach is able to learn the ground truth health evolution of milling machines +and the learned health indicator is suited for fault detection of railway +wheels operated under various operating conditions by outperforming +state-of-the-art methods. Further, we demonstrate that our proposed approach is +universally applicable to different systems and different health conditions. + +
+
+
+
+
+ + ♻ ☆ GloptiNets: Scalable Non-Convex Optimization with Certificates + + +
+ We present a novel approach to non-convex optimization with certificates, +which handles smooth functions on the hypercube or on the torus. Unlike +traditional methods that rely on algebraic properties, our algorithm exploits +the regularity of the target function intrinsic in the decay of its Fourier +spectrum. By defining a tractable family of models, we can at the same time +obtain precise certificates and leverage the advanced and powerful +computational techniques developed to optimize neural networks. In this way the +scalability of our approach is naturally enhanced by parallel computing with +GPUs. Our approach, when applied to the case of polynomials of moderate +dimensions but with thousands of coefficients, outperforms the state-of-the-art +optimization methods with certificates, such as the ones based on Lasserre's +hierarchy, addressing problems intractable for the competitors. + +
+
+
+
+
+ + ♻ ☆ Exploring the Landscape of Machine Unlearning: A Comprehensive Survey + and Taxonomy + + +
+ Machine unlearning (MU) is gaining increasing attention due to the need to +remove or modify predictions made by machine learning (ML) models. While +model training has become more efficient and accurate, the importance of +unlearning previously learned information has become increasingly significant +in fields such as privacy, security, and fairness. This paper presents a +comprehensive survey of MU, covering current state-of-the-art techniques and +approaches, including data deletion, perturbation, and model updates. In +addition, commonly used metrics and datasets are presented. The paper also +highlights the challenges that need to be addressed, including attack +sophistication, standardization, transferability, interpretability, training +data, and resource constraints. The contributions of this paper include +discussions about the potential benefits of MU and its future directions. +Additionally, the paper emphasizes the need for researchers and practitioners +to continue exploring and refining unlearning techniques to ensure that ML +models can adapt to changing circumstances while maintaining user trust. The +importance of unlearning is further highlighted in making Artificial +Intelligence (AI) more trustworthy and transparent, especially with the +increasing importance of AI in various domains that involve large amounts of +personal user data. + +
+
+ comment: This work has been submitted to the IEEE for possible publication. + Copyright may be transferred without notice, after which this version may no + longer be accessible +
+
+
+
+
+ + ♻ ☆ Improved Operator Learning by Orthogonal Attention + + +
+ Neural operators, as an efficient surrogate model for learning the solutions +of PDEs, have received extensive attention in the field of scientific machine +learning. Among them, attention-based neural operators have become one of the +mainstreams in related research. However, existing approaches overfit the +limited training data due to the considerable number of parameters in the +attention mechanism. To address this, we develop an orthogonal attention based +on the eigendecomposition of the kernel integral operator and the neural +approximation of eigenfunctions. The orthogonalization naturally poses a proper +regularization effect on the resulting neural operator, which aids in resisting +overfitting and boosting generalization. Experiments on six standard neural +operator benchmark datasets comprising both regular and irregular geometries +show that our method can outperform competing baselines with decent margins. + +
+
+ comment: 14 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ Fast Adaptive Non-Monotone Submodular Maximization Subject to a Knapsack + Constraint + + +
+ Constrained submodular maximization problems encompass a wide variety of +applications, including personalized recommendation, team formation, and +revenue maximization via viral marketing. The massive instances occurring in +modern day applications can render existing algorithms prohibitively slow, +while frequently, those instances are also inherently stochastic. Focusing on +these challenges, we revisit the classic problem of maximizing a (possibly +non-monotone) submodular function subject to a knapsack constraint. We present +a simple randomized greedy algorithm that achieves a $5.83$ approximation and +runs in $O(n \log n)$ time, i.e., at least a factor $n$ faster than other +state-of-the-art algorithms. The robustness of our approach allows us to +further transfer it to a stochastic version of the problem. There, we obtain a +9-approximation to the best adaptive policy, which is the first constant +approximation for non-monotone objectives. Experimental evaluation of our +algorithms showcases their improved performance on real and synthetic data. + +
+
+ comment: Same as v1. Version 2 was a replacement intended for arXiv:2102.08327 + and erroneously updated here +
+
+
+
+
+
+
+
+ + Multimedia 13 + +
+
+
+ + ☆ Intuitive Multilingual Audio-Visual Speech Recognition with a + Single-Trained Model EMNLP 2023 + + +
+ We present a novel approach to multilingual audio-visual speech recognition +tasks by introducing a single model on a multilingual dataset. Motivated by a +human cognitive system where humans can intuitively distinguish different +languages without any conscious effort or guidance, we propose a model that can +capture which language is given as an input speech by distinguishing the +inherent similarities and differences between languages. To do so, we design a +prompt fine-tuning technique into the largely pre-trained audio-visual +representation model so that the network can recognize the language class as +well as the speech with the corresponding language. Our work contributes to +developing robust and efficient multilingual audio-visual speech recognition +systems, reducing the need for language-specific models. + +
+
+ comment: EMNLP 2023 Findings +
+
+
+
+
+ + ☆ Audio-Visual Speaker Tracking: Progress, Challenges, and Future + Directions + + +
+ Audio-visual speaker tracking has drawn increasing attention over the past +few years due to its academic values and wide application. Audio and visual +modalities can provide complementary information for localization and tracking. +With audio and visual information, the Bayesian-based filter can solve the +problem of data association, audio-visual fusion and track management. In this +paper, we conduct a comprehensive overview of audio-visual speaker tracking. To +our knowledge, this is the first extensive survey over the past five years. We +introduce the family of Bayesian filters and summarize the methods for +obtaining audio-visual measurements. In addition, the existing trackers and +their performance on AV16.3 dataset are summarized. In the past few years, deep +learning techniques have thrived, which also boosts the development of audio +visual speaker tracking. The influence of deep learning techniques in terms of +measurement extraction and state estimation is also discussed. At last, we +discuss the connections between audio-visual speaker tracking and other areas +such as speech separation and distributed speaker tracking. + +
+
+
+
+
+ + ☆ Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and + Beyond EMNLP 2023 + + +
+ Vision-language (VL) understanding tasks evaluate models' comprehension of +complex visual scenes through multiple-choice questions. However, we have +identified two dataset biases that models can exploit as shortcuts to resolve +various VL tasks correctly without proper understanding. The first type of +dataset bias is \emph{Unbalanced Matching} bias, where the correct answer +overlaps the question and image more than the incorrect answers. The second +type of dataset bias is \emph{Distractor Similarity} bias, where incorrect +answers are overly dissimilar to the correct answer but significantly similar +to other incorrect answers within the same sample. To address these dataset +biases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic +training and debiased evaluation data. We then introduce Intra-sample +Counterfactual Training (ICT) to assist models in utilizing the synthesized +training data, particularly the counterfactual data, via focusing on +intra-sample differentiation. Extensive experiments demonstrate the +effectiveness of ADS and ICT in consistently improving model performance across +different benchmarks, even in domain-shifted scenarios. + +
+
+ comment: EMNLP 2023 +
+
+
+
+
+ + ☆ Semantic-Aware Adversarial Training for Reliable Deep Hashing Retrieval + + +
+ Deep hashing has been intensively studied and successfully applied in +large-scale image retrieval systems due to its efficiency and effectiveness. +Recent studies have recognized that the existence of adversarial examples poses +a security threat to deep hashing models, that is, adversarial vulnerability. +Notably, it is challenging to efficiently distill reliable semantic +representatives for deep hashing to guide adversarial learning, and thereby it +hinders the enhancement of adversarial robustness of deep hashing-based +retrieval models. Moreover, current research on adversarial training for deep +hashing is hard to formalize into a unified minimax structure. In this +paper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the +adversarial robustness of deep hashing models. Specifically, we conceive a +discriminative mainstay features learning (DMFL) scheme to construct semantic +representatives for guiding adversarial learning in deep hashing. Particularly, +our DMFL with the strict theoretical guarantee is adaptively optimized in a +discriminative learning manner, where both discriminative and semantic +properties are jointly considered. Moreover, adversarial examples are +fabricated by maximizing the Hamming distance between the hash codes of +adversarial samples and mainstay features, the efficacy of which is validated +in the adversarial attack trials. Further, we, for the first time, formulate +the formalized adversarial training of deep hashing into a unified minimax +optimization under the guidance of the generated mainstay codes. Extensive +experiments on benchmark datasets show superb attack performance against the +state-of-the-art algorithms, meanwhile, the proposed adversarial training can +effectively eliminate adversarial perturbations for trustworthy deep +hashing-based retrieval. Our code is available at +https://github.com/xandery-geek/SAAT. + +
+
+
+
+
+ + ☆ M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal + Aspect-based Sentiment Analysis EMNLP 2023 + + +
+ Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained +Sentiment Analysis task, which has attracted growing research interests +recently. Existing work mainly utilizes image information to improve the +performance of MABSA task. However, most of the studies overestimate the +importance of images since there are many noise images unrelated to the text in +the dataset, which will have a negative impact on model learning. Although some +work attempts to filter low-quality noise images by setting thresholds, relying +on thresholds will inevitably filter out a lot of useful image information. +Therefore, in this work, we focus on whether the negative impact of noisy +images can be reduced without modifying the data. To achieve this goal, we +borrow the idea of Curriculum Learning and propose a Multi-grained +Multi-curriculum Denoising Framework (M2DF), which can achieve denoising by +adjusting the order of training data. Extensive experimental results show that +our framework consistently outperforms state-of-the-art work on three sub-tasks +of MABSA. + +
+
+ comment: Accepted by EMNLP 2023 +
+
+
+
+
+ + ☆ Redundancy-Adaptive Multimodal Learning for Imperfect Data + + +
+ Multimodal models trained on complete modality data often exhibit a +substantial decrease in performance when faced with imperfect data containing +corruptions or missing modalities. To address this robustness challenge, prior +methods have explored various approaches from aspects of augmentation, +consistency or uncertainty, but these approaches come with associated drawbacks +related to data complexity, representation, and learning, potentially +diminishing their overall effectiveness. In response to these challenges, this +study introduces a novel approach known as the Redundancy-Adaptive Multimodal +Learning (RAML). RAML efficiently harnesses information redundancy across +multiple modalities to combat the issues posed by imperfect data while +remaining compatible with the complete modality. Specifically, RAML achieves +redundancy-lossless information extraction through separate unimodal +discriminative tasks and enforces a proper norm constraint on each unimodal +feature representation. Furthermore, RAML explicitly enhances multimodal fusion +by leveraging fine-grained redundancy among unimodal features to learn +correspondences between corrupted and untainted information. Extensive +experiments on various benchmark datasets under diverse conditions have +consistently demonstrated that RAML outperforms state-of-the-art methods by a +significant margin. + +
+
+
+
+
+ + ☆ Visual Elements and Cognitive Biases Influence Interpretations of Trends + in Scatter Plots + + +
+ Visualizations are common methods to convey information but also increasingly +used to spread misinformation. It is therefore important to understand the +factors people use to interpret visualizations. In this paper, we focus on +factors that influence interpretations of scatter plots, investigating the +extent to which common visual aspects of scatter plots (outliers and trend +lines) and cognitive biases (people's beliefs) influence perception of +correlation trends. We highlight three main findings: outliers skew trend +perception but exert less influence than other points; trend lines make trends +seem stronger but also mitigate the influence of some outliers; and people's +beliefs have a small influence on perceptions of weak, but not strong +correlations. From these results we derive guidelines for adjusting visual +elements to mitigate the influence of factors that distort interpretations of +scatter plots. We explore how these guidelines may generalize to other +visualization types and make recommendations for future studies. + +
+
+ comment: 18 pages, 6 figures, 2 tables +
+
+
+
+
+ + ☆ SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis + + +
+ Sound design involves creatively selecting, recording, and editing sound +effects for various media like cinema, video games, and virtual/augmented +reality. One of the most time-consuming steps when designing sound is +synchronizing audio with video. In some cases, environmental recordings from +video shoots are available, which can aid in the process. However, in video +games and animations, no reference audio exists, requiring manual annotation of +event timings from the video. We propose a system to extract repetitive action +onsets from a video, which are then used - in conjunction with audio or textual +embeddings - to condition a diffusion model trained to generate a new +synchronized sound effects audio track. In this way, we leave complete creative +control to the sound designer while removing the burden of synchronization with +video. Furthermore, editing the onset track or changing the conditioning +embedding requires much less effort than editing the audio track itself, +simplifying the sonification process. We provide sound examples, source code, +and pretrained models to facilitate reproducibility. + +
+
+
+
+
+ + ♻ ☆ An Accessible Toolkit for 360 VR Studies + + +
+ Virtual reality is expected to play a significant role in the transformation +of education and psychological studies. The possibilities for its application +as a visual research method can be enhanced as established frameworks and +toolkits are made more available to users, not just developers, advocates, and +technical academics, enhancing its controlled study impact. With an accessible +first design approach, we can overcome accessibility constraints and tap into +new research potential. The open-sourced toolkit demonstrates how game engine +technologies can be utilized to immerse participants in a 360-video environment +with curated text displayed at pre-set intervals. This allows researchers to +guide participants through virtual experiences intuitively through a desktop +application while the study unfolds in the user's VR headset. + +
+
+ comment: for associated github repo, + https://github.com/corriedotdev/vr-360-player +
+
+
+
+
+ + ♻ ☆ MISSRec: Pre-training and Transferring Multi-modal Interest-aware + Sequence Representation for Recommendation ACM MM 2023 + + +
+ The goal of sequential recommendation (SR) is to predict a user's potential +interested items based on her/his historical interaction sequences. Most +existing sequential recommenders are developed based on ID features, which, +despite their widespread use, often underperform with sparse IDs and struggle +with the cold-start problem. Besides, inconsistent ID mappings hinder the +model's transferability, isolating similar recommendation domains that could +have been co-optimized. This paper aims to address these issues by exploring +the potential of multi-modal information in learning robust and generalizable +sequence representations. We propose MISSRec, a multi-modal pre-training and +transfer learning framework for SR. On the user side, we design a +Transformer-based encoder-decoder model, where the contextual encoder learns to +capture the sequence-level multi-modal user interests while a novel +interest-aware decoder is developed to grasp item-modality-interest relations +for better sequence representation. On the candidate item side, we adopt a +dynamic fusion module to produce user-adaptive item representation, providing +more precise matching between users and items. We pre-train the model with +contrastive learning objectives and fine-tune it in an efficient manner. +Extensive experiments demonstrate the effectiveness and flexibility of MISSRec, +promising a practical solution for real-world recommendation scenarios. Data +and code are available on https://github.com/gimpong/MM23-MISSRec. + +
+
+ comment: Accepted to ACM MM 2023. Data and code are available +
+
+
+
+
+ + ♻ ☆ Understanding ME? Multimodal Evaluation for Fine-grained Visual + Commonsense EMNLP 2022 + + +
+ Visual commonsense understanding requires Vision Language (VL) models to not +only understand image and text but also cross-reference in-between to fully +integrate and achieve comprehension of the visual scene described. Recently, +various approaches have been developed and have achieved high performance on +visual commonsense benchmarks. However, it is unclear whether the models really +understand the visual scene and underlying commonsense knowledge due to limited +evaluation data resources. To provide an in-depth analysis, we present a +Multimodal Evaluation (ME) pipeline to automatically generate question-answer +pairs to test models' understanding of the visual scene, text, and related +knowledge. We then take a step further to show that training with the ME data +boosts the model's performance in standard VCR evaluation. Lastly, our in-depth +analysis and comparison reveal interesting findings: (1) semantically low-level +information can assist the learning of high-level information but not the +opposite; (2) visual information is generally underutilized compared with +text. + +
+
+ comment: Accepted to EMNLP 2022 Long Paper +
+
+
+
+
+ + ♻ ☆ Enabling Real-time Neural Recovery for Cloud Gaming on Mobile Devices + + +
+ Cloud gaming is a multi-billion dollar industry. A client in cloud gaming +sends its movement to the game server on the Internet, which renders and +transmits the resulting video back. In order to provide a good gaming +experience, a latency below 80 ms is required. This means that video rendering, +encoding, transmission, decoding, and display have to finish within that time +frame, which is especially challenging to achieve due to server overload, +network congestion, and losses. In this paper, we propose a new method for +recovering lost or corrupted video frames in cloud gaming. Unlike traditional +video frame recovery, our approach uses game states to significantly enhance +recovery accuracy and utilizes partially decoded frames to recover lost +portions. We develop a holistic system that consists of (i) efficiently +extracting game states, (ii) modifying H.264 video decoder to generate a mask +to indicate which portions of video frames need recovery, and (iii) designing a +novel neural network to recover either complete or partial video frames. Our +approach is extensively evaluated using iPhone 12 and laptop implementations, +and we demonstrate the utility of game states in the game video recovery and +the effectiveness of our overall design. + +
+
+
+
+
+ + ♻ ☆ Learning Unseen Modality Interaction NeurIPS 2023 + + +
+ Multimodal learning assumes all modality combinations of interest are +available during training to learn cross-modal correspondences. In this paper, +we challenge this modality-complete assumption for multimodal learning and +instead strive for generalization to unseen modality combinations during +inference. We pose the problem of unseen modality interaction and introduce a +first solution. It exploits a module that projects the multidimensional +features of different modalities into a common space with rich information +preserved. This allows the information to be accumulated with a simple +summation operation across available modalities. To reduce overfitting to less +discriminative modality combinations during training, we further improve the +model learning with pseudo-supervision indicating the reliability of a +modality's prediction. We demonstrate that our approach is effective for +diverse tasks and modalities by evaluating it for multimodal video +classification, robot state regression, and multimedia retrieval. Project +website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. + +
+
+ comment: Published at NeurIPS 2023 +
+
+
+
+
+
+
+ + + + + + diff --git a/index.js b/index.js new file mode 100644 index 00000000..69f5da7b --- /dev/null +++ b/index.js @@ -0,0 +1,39 @@ +/* Expand/Collapse with TAB key */ +var expanded = false; +document.onkeydown = function (e) { + if (e.keyCode === 9) { + expanded = !expanded; + document.querySelectorAll("details").forEach(detail => detail.open = expanded); + return false; + } +}; + +/* Switch Theme */ +const toggleSwitch = document.querySelector('.theme-switch input[type="checkbox"]'); + +function switchTheme(e) { + if (e.target.checked) { + document.documentElement.setAttribute('data-theme', 'light'); + document.getElementById("theme-icon").className = "ri-sun-line"; + localStorage.setItem('theme', 'light'); //add this + } else { + document.documentElement.setAttribute('data-theme', 'dark'); + document.getElementById("theme-icon").className = "ri-moon-line"; + localStorage.setItem('theme', 'dark'); //add this + } +} + +toggleSwitch.addEventListener('change', switchTheme, false); +const currentTheme = localStorage.getItem('theme') ? localStorage.getItem('theme') : null; +if (currentTheme) { + document.documentElement.setAttribute('data-theme', currentTheme); + if (currentTheme === 'light') { + toggleSwitch.checked = true; + } +} + +const timestamp = document.getElementById("build-timestamp"); +const timestamp_local = new Date(timestamp.getAttribute("datetime")).toLocaleString(); + +const badge = document.getElementById("build-timestamp-badge"); +// badge.src = `https://img.shields.io/github/workflow/status/mlnlp-world/myarxiv/Update?=${timestamp_local}&style=for-the-badge`